UCAS: Data Mining (Course 0812) Past Exam Questions

devtools/2025/3/9 20:48:33/

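Preface: This article records past exam questions from the UCAS Data Mining (0812) course.
Note: Calculators are allowed in the exam, since some questions (decision trees, for example) require computing logarithms.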

2016

1. Suppose a hospital tested the age and body fat for 18 randomly selected adults with the following result:

[Image: table of age and body fat for the 18 adults (not preserved)]
1)Use smoothing by bin means to smooth the age data, using a bin depth of 6. Illustrate your steps.(5 points)

2)Partition the age into 3 bins by equal-width partitioning, and use bin boundary to smooth each bin. (5 points)

3)Use min-max normalization to transform the value 49 for age onto the range [ 0.0, 1.0 ]. (5 points)

4)Use z-score normalization to transform the value 41.2 for body fat, where the standard deviation of body fat is 9.25. (5 points)
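For parts 3) and 4), the standard definitions are sketched below for reference. The symbols min_A, max_A, \bar{A} and \sigma_A are notation introduced here, not taken from the exam; the minimum, maximum, and mean have to be read from the data table, which is not reproduced in this post.

```latex
% Min-max normalization of a value v of attribute A onto [new_min, new_max]
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max} - \mathit{new\_min}) + \mathit{new\_min}

% Special case for the range [0.0, 1.0] and v = 49 (age)
v' = \frac{49 - \min_A}{\max_A - \min_A}

% Z-score normalization of v = 41.2 (body fat) with standard deviation 9.25
v' = \frac{v - \bar{A}}{\sigma_A} = \frac{41.2 - \bar{A}}{9.25}
```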

2. Given a transaction database below, let min_support = 50% and min_confidence = 75%.

[Image: transaction database (not preserved)]

[Image not preserved]

3. Given a data set below with three attributes {A, B, C} and two classes {C1, C2}. Build a decision tree, using information gain to select and split attribute. (15 points)

[Image: data set table (not preserved)]
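As a rough illustration of the attribute selection step, the sketch below computes entropy and information gain in Python. The four rows are hypothetical toy data, not the exam table, and the function names are made up for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, label="class"):
    """Entropy of the whole set minus the weighted entropy after splitting on attr."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data (NOT the exam table): A separates the classes perfectly.
rows = [
    {"A": 0, "B": 0, "C": 0, "class": "C1"},
    {"A": 0, "B": 1, "C": 1, "class": "C1"},
    {"A": 1, "B": 0, "C": 1, "class": "C2"},
    {"A": 1, "B": 1, "C": 0, "class": "C2"},
]
gains = {a: information_gain(rows, a) for a in ("A", "B", "C")}
best = max(gains, key=gains.get)  # attribute chosen for the root split
```

On this toy data the gain of A is 1 bit and the gains of B and C are 0, so A would be chosen for the root; the same procedure is then repeated on each branch.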

4. Consider the following data set. Use Naïve Bayesian Classifier to predict the class label for a test sample (A=0, B=1, C=0). (10 points)

[Image: data set table (not preserved)]
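A minimal sketch of the naive Bayes computation follows, assuming categorical attributes and no Laplace smoothing. The training rows are hypothetical (the exam table is not reproduced here), and `train_naive_bayes` and `predict` are names invented for this example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, label="class"):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[label] for r in rows)
    value_counts = defaultdict(Counter)  # (class, attribute) -> Counter of values
    for r in rows:
        for attr, value in r.items():
            if attr != label:
                value_counts[(r[label], attr)][value] += 1
    return class_counts, value_counts

def predict(sample, class_counts, value_counts):
    """Return the best class and the unnormalized scores P(c) * prod P(x_i | c)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for attr, value in sample.items():
            score *= value_counts[(c, attr)][value] / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical training data, NOT the exam table.
rows = [
    {"A": 0, "B": 1, "C": 0, "class": "+"},
    {"A": 0, "B": 1, "C": 1, "class": "+"},
    {"A": 0, "B": 0, "C": 0, "class": "+"},
    {"A": 0, "B": 1, "C": 0, "class": "-"},
    {"A": 1, "B": 0, "C": 1, "class": "-"},
    {"A": 1, "B": 0, "C": 0, "class": "-"},
]
model = train_naive_bayes(rows)
label, scores = predict({"A": 0, "B": 1, "C": 0}, *model)
```

For this toy data the score of "+" is 0.5 · 1 · 2/3 · 2/3 = 2/9 and the score of "-" is 0.5 · 1/3 · 1/3 · 2/3 = 1/27, so the sample is labeled "+".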

5. Given a data set of 8 sample points. Perform K-means to generate 3 clusters. Suppose initially we assign point 1,2,3 as the center of each cluster. Note: list the clusters at each iteration. (15 points)

[Image: the 8 sample points (not preserved)]
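The iteration the question asks for can be sketched as follows. This is plain K-means with Euclidean distance; the eight 2D points are hypothetical stand-ins for the exam data, and points 1, 2, 3 seed the three centers as the question specifies.

```python
from math import dist

def kmeans(points, init_indices, max_iter=100):
    """Plain K-means; returns the cluster memberships after each iteration."""
    centers = [tuple(points[i]) for i in init_indices]
    history = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for idx, p in enumerate(points):
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(idx + 1)  # 1-based point numbers, as in the question
        history.append(clusters)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if not members:
                new_centers.append(centers[c])  # an empty cluster keeps its center
                continue
            coords = [points[m - 1] for m in members]
            new_centers.append(tuple(sum(v) / len(coords) for v in zip(*coords)))
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return history

# Hypothetical sample points, NOT the exam data.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
history = kmeans(points, init_indices=[0, 1, 2])
```

With these points the clusters are {1, 4, 8}, {2, 7}, {3, 5, 6} after the first iteration and stay the same in the second, at which point the centers no longer move.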

6. Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely Tj: {i1, …, im}, where Tj is a transaction identifier, and ik (1 <= k <= m) is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules (without considering multilevel associations). You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead. (15 points)
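One possible outline, simulated in a single Python process, relies on the fact that an itemset that is frequent globally must be frequent at one or more of the sites. Each site therefore mines locally and ships only its locally frequent itemsets; the union becomes the global candidate set, and each site then ships one count per candidate. The brute-force local miner, the function names, and the four tiny site databases are all assumptions made for illustration; a real answer would use Apriori or FP-growth locally.

```python
from collections import Counter
from itertools import combinations

def local_frequent_itemsets(transactions, min_support):
    """Brute-force local mining: itemsets whose local support >= min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    threshold = min_support * len(transactions)
    return {s for s, n in counts.items() if n >= threshold}

def count_candidates(transactions, candidates):
    """Second local pass: count only the globally agreed candidate itemsets."""
    return Counter({c: sum(1 for t in transactions if set(c) <= t) for c in candidates})

def global_frequent_itemsets(sites, min_support):
    # Round 1: each site ships only its locally frequent itemsets.
    candidates = set().union(*(local_frequent_itemsets(s, min_support) for s in sites))
    # Round 2: each site ships one count per candidate; a coordinator sums them.
    totals = Counter()
    for s in sites:
        totals.update(count_candidates(s, candidates))
    n = sum(len(s) for s in sites)
    return {c: cnt / n for c, cnt in totals.items() if cnt >= min_support * n}

# Hypothetical data: four sites with two transactions each.
sites = [
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b"}, {"b"}],
    [{"a", "b", "c"}, {"c"}],
    [{"a"}, {"a", "b"}],
]
result = global_frequent_itemsets(sites, min_support=0.5)
```

Only itemsets and counts cross the network, never raw transactions. Rules are then derived from `result` in the usual way, since the confidence of X → Y only needs the global supports of X and X ∪ Y.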

2017

1. Please briefly describe the major types of data mining techniques and their corresponding applications. (10 points)
2. What is normalization? Please describe the major normalization methods and their corresponding pros and cons. (6 points)
3. How to overcome overfitting in a decision tree? (5 points)
4. An e-mail database is a database that stores a large number of electronic mail messages. It can be viewed as a semi-structured database consisting mainly of text data.

a. (8 points) How can such an e-mail database be structured so as to facilitate multidimensional search, such as by sender, by receiver, by subject, and by time?

b. (10 points) Suppose you have roughly classified a set of your previous e-mail messages as junk, unimportant, normal, or important. Describe how a data mining system may take this as the training set to automatically classify new e-mail messages or unclassified ones.

5. Given a transaction database below, let min_support = 30% and min_confidence = 70%:

[Image: transaction database (not preserved)]

Find all frequent itemsets using FP-growth method. Write up the conditional pattern base for each item, and the conditional FP-tree for each item.(15 points)

6. Figure 1 is a BP (Backpropagation) neural network. The learning rate l = 0.9, the bias at every unit is initialized to 0, and the activation function at every unit is

[Image: activation function (not preserved)]

[Image not preserved]
a. Given a training record (x1, x2, z) where the input x1 = 1, x2 = 0, and the class label z = 1, and the weights of the connections in the kth iteration are w11(k) = 0, w12(k) = 2, w21(k) = 2, w22(k) = 1, T1(k) = 1, T2(k) = 1, please give z(k) (please give the calculation formulas and the relevant values). (10 points)

b. Please give the updated weights, w11, w12, w21, w22, following the weight updating formulas. (10 points)
[Image not preserved]
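Because Figure 1 and the activation function are not reproduced in this post, the sketch below rests on several assumptions that should be checked against the original exam: a 2-2-1 network, a sigmoid activation, w_ij connecting input i to hidden unit j, and T_j connecting hidden unit j to the output. Under those assumptions, one forward pass and one backpropagation step for the input-to-hidden weights look like this.

```python
from math import exp

def sigmoid(x):
    # Assumed activation; the exam's actual function is in a missing image.
    return 1.0 / (1.0 + exp(-x))

def forward(x, w, T):
    """x: inputs [x1, x2]; w[i][j]: input i -> hidden j; T[j]: hidden j -> output."""
    hidden = [sigmoid(sum(x[i] * w[i][j] for i in range(2))) for j in range(2)]
    z = sigmoid(sum(hidden[j] * T[j] for j in range(2)))
    return hidden, z

def updated_input_weights(x, target, w, T, lr):
    """One backpropagation step for the input-to-hidden weights only."""
    hidden, z = forward(x, w, T)
    err_out = z * (1 - z) * (target - z)  # error term of the output unit
    err_hidden = [hidden[j] * (1 - hidden[j]) * err_out * T[j] for j in range(2)]
    return [[w[i][j] + lr * err_hidden[j] * x[i] for j in range(2)] for i in range(2)]

x, target, lr = [1, 0], 1, 0.9
w = [[0, 2], [2, 1]]  # rows: w11, w12 and w21, w22 (assumed index convention)
T = [1, 1]
hidden, z = forward(x, w, T)
new_w = updated_input_weights(x, target, w, T, lr)
```

Biases are zero and therefore omitted from the net inputs. Since x2 = 0, the weights leaving x2 (w21 and w22 under the assumed convention) do not change in this step.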

7. Table 1 gives a User-Product rating matrix.

[Image: Table 1, User-Product rating matrix (not preserved)]

(1) List the top 2 most similar users of user 2 based on Euclidean distance. (5 points)

(2) Predict User 2’s rating for Product 2. (5 points)

8. (16 points) Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely Tj: {i1, …, im}, where Tj is a transaction identifier, and ik (1 <= k <= m) is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules (without considering multilevel associations). You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead.

2020

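1. An airline wants to analyze the travel trends of the frequent flyers it serves, so that it can position itself correctly in the frequent-flyer segment of the airline market. It wants to track seasonal variation and growth in passengers on different routes, and to track the food and beverages consumed on different flights, which helps the airline schedule flights and food supply on different routes. Design a data warehouse model for this task.

Reference:
[Image not preserved]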

2. Suppose that the data for analysis includes the attribute age: 20, 13, 15, 16, 25, 35, 36, 40, 45, 46, 52, 25, 25, 30, 21, 22, 22, 33, 33, 35, 35, 70, 19, 20.

(1) Use smoothing by bin means to smooth the above data, using a bin depth of 6. (6 points)

(2) Determine outliers in the data by five-number summary. (4 points)

(3) Use min-max normalization to normalize 33. (4 points)
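The three parts can be checked with a short Python sketch on the age values given in the question. Two conventions below are assumptions rather than part of the question: the outlier test uses the common 1.5 × IQR boxplot rule, and quartiles come from the default method of `statistics.quantiles` (other quartile conventions give slightly different Q1 and Q3). Min-max is assumed to target [0, 1].

```python
from statistics import mean, quantiles

ages = [20, 13, 15, 16, 25, 35, 36, 40, 45, 46, 52, 25,
        25, 30, 21, 22, 22, 33, 33, 35, 35, 70, 19, 20]

def smooth_by_bin_means(values, depth):
    """Equal-depth binning on sorted values; each value becomes its bin mean."""
    ordered = sorted(values)
    bins = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [[round(mean(b), 2)] * len(b) for b in bins]

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed boxplot rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

def min_max(v, values, new_min=0.0, new_max=1.0):
    """Min-max normalization of v onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return (v - lo) / (hi - lo) * (new_max - new_min) + new_min

smoothed = smooth_by_bin_means(ages, depth=6)
outliers = iqr_outliers(ages)
normalized_33 = min_max(33, ages)
```

The four bin means come out as about 17.17, 23.33, 33.5 and 48.17; the only value flagged as an outlier is 70; and 33 maps to (33 - 13) / (70 - 13) = 20/57 ≈ 0.351.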

3. Short answer

(1) What is overfitting? (4 points)

(2) How to overcome overfitting in decision tree? (4 points)

(3) Please present an attribute selection method in decision tree. (4 points)

(4) A neural network classifier may consist of multiple hidden layers. How to overcome overfitting in a neural network classifier? (6 points)

4. Given a transaction database below, let min_support = 50% and min_confidence = 70%:

[Image: transaction database (not preserved)]

(1) Find all the frequent itemsets using FP-growth method. Write up the conditional pattern base for each item, and the conditional FP-tree for each item. (10 points)

(2) Using the resulting frequent itemsets, find all the strong associations in the following rule form:
For any transaction x, buys(x, item1) ^ buys(x, item2) → buys(x, item3) [s=?%, c=?%]. (5 points)
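Part (2) can be illustrated with a small sketch that enumerates rules of the form {item1, item2} → item3 and keeps those meeting both thresholds. It uses brute-force enumeration of 3-itemsets instead of FP-growth, purely to show how support and confidence are computed; the four transactions are hypothetical, not the exam database.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def strong_rules_two_to_one(transactions, min_support, min_confidence):
    """Rules {item1, item2} -> item3 meeting both thresholds, as (lhs, rhs, s, c)."""
    items = sorted(set().union(*transactions))
    rules = []
    for triple in combinations(items, 3):
        s = support(set(triple), transactions)
        if s < min_support:
            continue
        for rhs in triple:
            lhs = tuple(i for i in triple if i != rhs)
            # confidence = support(lhs and rhs together) / support(lhs)
            c = s / support(set(lhs), transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules

# Hypothetical transactions, NOT the exam database.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
rules = strong_rules_two_to_one(transactions, min_support=0.5, min_confidence=0.7)
```

Here {a, b, c} has support 50%. The rules {b, c} → a and {a, c} → b both have confidence 100% and are kept, while {a, b} → c has confidence 2/3 ≈ 67% and is rejected.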

5. Suppose you are given the following ratings by students on four different items, where ? indicates that no rating was given:

[Image: student-item rating table (not preserved)]
(1) List the top 2 most similar students of student 1 based on Euclidean distance. (6 points)

(2) Assuming we only recommend the top 1 item to student 1, which item would you recommend, item 2 or item 5? (8 points) (Assume similarity = 1/distance)
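A sketch of the neighbour search and rating prediction follows. The rating matrix is hypothetical (the exam table is not reproduced), distances are taken over co-rated items only, and the prediction is a similarity-weighted average over the top-k neighbours with similarity = 1/distance; the weighted-average formula is an assumption, since the question does not state it.

```python
from math import sqrt

def distance(u, v):
    """Euclidean distance over the items both users have rated (None = missing)."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return sqrt(sum((a - b) ** 2 for a, b in common))

def top_k_similar(ratings, target, k):
    """Names of the k users closest to target."""
    others = [(distance(ratings[target], r), name)
              for name, r in ratings.items() if name != target]
    return [name for _, name in sorted(others)[:k]]

def predict(ratings, target, item, k):
    """Similarity-weighted average of the neighbours' ratings for item."""
    num = den = 0.0
    for name in top_k_similar(ratings, target, k):
        r = ratings[name][item]
        if r is None:
            continue  # this neighbour has not rated the item
        sim = 1.0 / distance(ratings[target], ratings[name])
        num += sim * r
        den += sim
    return num / den

# Hypothetical ratings, NOT the exam table; s1 has not rated the item at index 1.
ratings = {
    "s1": [5, None, 3, 4],
    "s2": [5, 4, 3, 1],
    "s3": [4, 2, 3, 4],
    "s4": [1, 5, 1, 1],
}
neighbours = top_k_similar(ratings, "s1", k=2)
predicted = predict(ratings, "s1", item=1, k=2)
```

With this data the nearest neighbours of s1 are s3 (distance 1) and s2 (distance 3), and the predicted rating is (1 · 2 + 1/3 · 4) / (1 + 1/3) = 2.5. The sketch does not guard against a distance of zero, which would need special handling. To choose between two candidate items, predict both and recommend the one with the higher value.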

6. Use the following similarity matrix to perform AGNES clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged. (12 points) (The number indicates the similarity between the two points)

[Image: similarity matrix (not preserved)]
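The merge order that the dendrogram must show can be derived mechanically. The sketch below assumes single linkage (the similarity of two clusters is that of their most similar pair), which the question does not state explicitly, and uses a hypothetical 4 × 4 similarity matrix in place of the exam's.

```python
def agnes_single_link(sim, names):
    """Agglomerative clustering on a similarity matrix; returns the merge order."""
    clusters = [(n,) for n in names]
    index = {n: i for i, n in enumerate(names)}
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the most similar cross-cluster pair.
                s = max(sim[index[p]][index[q]]
                        for p in clusters[a] for q in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

# Hypothetical similarity matrix, NOT the exam's.
names = ["p1", "p2", "p3", "p4"]
sim = [
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.8],
    [0.3, 0.1, 0.8, 1.0],
]
merges = agnes_single_link(sim, names)
```

For this matrix p1 and p2 merge first (0.9), then p3 and p4 (0.8), and the two pairs join last at 0.4; each tuple in `merges` corresponds to one junction of the dendrogram. Because the input is a similarity, the largest value is merged first, the opposite of a distance matrix.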

7. A data stream is continuous, ordered, fast-changing data, possibly infinite. Since the volume is so huge and the pattern changes fast, it requires a "single scan" algorithm (each item can be looked at only once). Please present a K-means based efficient data stream clustering algorithm, which can not only discover the clusters, but also provide the evolution of the clusters. Please present your algorithm in pseudocode. (15 points)
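One possible direction, written as runnable Python instead of pseudocode, is sequential K-means with periodic snapshots: each point is assigned to its nearest center, the center is nudged with an incremental mean update, and the point is discarded. Saved snapshots of centers and counts can be compared over time to describe how the clusters evolve. The seeding rule (first k points) and the `snapshot_every` parameter are assumptions of this sketch, not part of the question.

```python
from math import dist

def stream_kmeans(stream, k, snapshot_every):
    """Single-pass (sequential) K-means that records periodic center snapshots."""
    centers, counts, snapshots = [], [], []
    for t, x in enumerate(stream, start=1):
        if len(centers) < k:
            # Assumption: the first k points seed the centers.
            centers.append(list(x))
            counts.append(1)
        else:
            j = min(range(k), key=lambda c: dist(x, centers[c]))
            counts[j] += 1
            # Incremental mean update: the point is seen once, then discarded.
            centers[j] = [c + (v - c) / counts[j] for c, v in zip(centers[j], x)]
        if t % snapshot_every == 0:
            snapshots.append((t, [tuple(c) for c in centers], list(counts)))
    return snapshots

# Hypothetical stream of 2D points.
stream = [(0, 0), (10, 10), (0, 2), (10, 12), (2, 0), (12, 10)]
snapshots = stream_kmeans(stream, k=2, snapshot_every=3)
```

Memory use is O(k) plus the stored snapshots, and every point is touched exactly once. To follow fast-changing patterns more closely, the 1/count step size could be replaced by a fixed or decaying weight so that old points fade; that variation is not shown here.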

2024 (recalled from memory)

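1. An airline wants to analyze the travel trends of the frequent flyers it serves, so that it can position itself correctly in the frequent-flyer segment of the airline market. It wants to track seasonal variation and growth in passengers on different routes, and to track the food and beverages consumed on different flights, which helps the airline schedule flights and food supply on different routes. Design a data warehouse model for this task.

Note: The exam asked a variant of this question, but it was largely the same.

Reference:
[Image not preserved]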

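2. This question tested data preprocessing. As I recall, it gave five or six values and asked for smoothing with equal-depth and equal-width binning, plus normalization of a given value, including normalization to [0, 1] and min-max normalization.
3. This question tested association rules, i.e. the Apriori algorithm or the FP-growth algorithm; a variant of a homework problem.
4. This question tested decision tree construction: compute the information gain and select the best attribute (only one level had to be built), then classify a given sample with naive Bayes; a variant of a homework problem.
5. This question tested recommender systems: find the 2 users most similar to user 1, then compute user 1's missing rating; also a variant of a homework problem.
6. This question tested the AGNES clustering algorithm, but the samples were given as coordinates in a 2D plane, so the similarity matrix had to be computed before clustering.
7. Algorithm design: use a neural network model to design a paper classification model.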

http://www.ppmy.cn/devtools/165844.html
