Data Mining Review Outline of Hunan University of Science and Technology

Update, January 8, 2021: according to sources, the comprehensive questions are very likely to cover the Apriori and k-means algorithms.

1. Multiple-choice questions. (2 points for each, 20 points in total for this question)

2. Write down the main ideas of the following algorithms (5 points for each question, 20 points in total for this question)
     1. The main idea of the ARMA model.

The ARMA(p, q) model can be understood as the combination of an AR(p) autoregressive model and an MA(q) moving average model. As the combined time series model of the two, ARMA can capture momentum, mean-reversion effects, and shock effects, and is widely applied to financial time series.
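
For reference, the ARMA(p, q) model is commonly written as follows (standard textbook form; the notation is not taken from these notes):

X_t = c + φ_1 X_{t-1} + ... + φ_p X_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q}

where ε_t is white noise, the φ_i are the autoregressive (AR) coefficients, and the θ_j are the moving-average (MA) coefficients.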

 


     2. The main idea of the PageRank algorithm.

PageRank models a random surfer on the web link graph: a page's importance is the long-run probability that the surfer visits it. Each page distributes its rank evenly over its outgoing links, and with a damping factor (typically 0.85) the surfer occasionally jumps to a random page; the ranks are computed iteratively until they converge.
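
A minimal power-iteration sketch in Python; the example graph and the damping factor 0.85 are illustrative assumptions, not taken from the notes:

# Minimal PageRank by power iteration on a small directed graph.
graph = {            # adjacency list: page -> pages it links to (example data)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, d=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}               # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}  # random-jump term
        for p, links in graph.items():
            if links:
                share = rank[p] / len(links)
                for q in links:
                    new_rank[q] += d * share          # p passes its rank along its links
            else:                                     # dangling page: spread rank evenly
                for q in pages:
                    new_rank[q] += d * rank[p] / n
        rank = new_rank
    return rank

print(pagerank(graph))
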
     3. The main idea of the EM (expectation-maximization) algorithm

The EM algorithm is an iterative optimization strategy. Each iteration consists of two steps: the expectation step (E-step) and the maximization step (M-step).

The basic idea is: first, estimate initial values of the model parameters from the observed data; then, using the parameter values estimated in the previous step, estimate the values of the missing (latent) data; next, re-estimate the parameters from the estimated missing data together with the previously observed data. This process is iterated repeatedly until the estimates converge, at which point the iteration ends.
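
A minimal sketch of the E-step/M-step iteration for a two-component 1-D Gaussian mixture in Python; the data and the initial parameter guesses are invented for illustration:

import numpy as np

# EM for a 1-D mixture of two Gaussians (toy data, made up for illustration).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# Initial guesses for the means, standard deviations, and mixing weights.
mu, sigma, pi = np.array([1.0, 4.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: posterior responsibility of each component for each point.
    weighted = np.stack([pi[k] * gaussian(data, mu[k], sigma[k]) for k in range(2)])
    resp = weighted / weighted.sum(axis=0)
    # M-step: re-estimate the parameters from the responsibilities.
    nk = resp.sum(axis=1)
    mu = (resp * data).sum(axis=1) / nk
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk)
    pi = nk / len(data)

print(mu, sigma, pi)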


3. Short answer questions (10 points for each question, 30 points in total for this question)
1. The steps of KDD (Knowledge Discovery in Databases) and their functions

1. Problem definition
2. Data collection
3. Data preprocessing (including five steps): data cleaning, data transformation, data description, feature selection, and feature extraction

4. Data mining

5. Model evaluation


2. The steps of data classification and its basic tasks

Two steps of data classification:

1. Build a model

Data tuples are also called samples, instances, or objects.
The data tuples analyzed to build the model form the training data set.
A single tuple in the training data set is called a training sample. Since the class label of each training sample is provided, this is also called supervised learning.
The classification model is constructed by analyzing the training data set, and the model can be expressed in the form of classification rules, a decision tree, or mathematical formulas.
 
2. Use the model to classify
First, assess the predictive accuracy of the model (classifier).
If the accuracy of the model is considered acceptable, the model can be used to classify data tuples or objects whose class labels are unknown.
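
An illustrative sketch of the two steps with scikit-learn (the iris data set, the decision tree classifier, and the 0.9 accuracy threshold are assumptions for illustration, not part of the notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Step 1: build the model by analyzing a training data set (supervised learning).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: assess predictive accuracy, then classify tuples with unknown class labels.
accuracy = accuracy_score(y_test, model.predict(X_test))
if accuracy >= 0.9:                       # accuracy considered acceptable
    unknown = [[5.1, 3.5, 1.4, 0.2]]      # a tuple whose class label is unknown
    print(model.predict(unknown))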



3. Reasons and strategies of decision tree pruning

Reason: to avoid overfitting the training samples. The decision tree produced by the basic algorithm is very detailed and very large: every attribute is considered, and the training samples covered by each leaf node are all "pure". If this tree is used to classify the training samples, it performs very well on them, with an extremely low error rate, and correctly classifies every sample in the training set. However, erroneous data in the training samples are also learned by the tree and become part of it, so the performance on test data is worse than expected, or even very poor. This is the so-called overfitting problem.

Strategy:

         Pre-pruning: prune the decision tree by stopping the construction of the tree early.

         Post-pruning: prune the fully grown, overfitted decision tree after it has been generated, yielding a simplified version of the tree.
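
An illustrative sketch of the two strategies using scikit-learn (the data set and parameter values are assumptions chosen for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree early (limit depth, require more samples per leaf).
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back via cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
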
4. What are the methods for measuring the distance between samples and between classes?

Shortest distance (single linkage) method: define the distance between the two closest elements in the two classes as the inter-class distance.

Longest distance (complete linkage) method: define the distance between the two farthest elements in the two classes as the inter-class distance.

Center (centroid) method: define the distance between the centers of the two classes as the inter-class distance.

Class average method: compute the distance between every pair of elements, one from each class, and take their average as the inter-class distance.

Sum of squared deviations method: the diameter of a class reflects the differences among its elements and can be defined as the sum of squared Euclidean distances from each element in the class to the class center; the increase in this quantity when two classes are merged is used as the inter-class distance.
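
These measures correspond to the linkage criteria used in hierarchical clustering. A small sketch with SciPy, using made-up sample points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])  # toy points

# 'single'   -> shortest distance method
# 'complete' -> longest distance method
# 'centroid' -> center (centroid) method
# 'average'  -> class average method
# 'ward'     -> sum of squared deviations criterion
for method in ["single", "complete", "centroid", "average", "ward"]:
    Z = linkage(X, method=method)                    # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)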


5. The concept of time series mining

A time series is a series of values of the same statistical indicator arranged in the order in which they occur (a sequence of observations taken at uniform time intervals).

The main purpose of time series analysis is to predict the future based on existing historical data.

Time series mining extracts previously unknown but potentially useful information and knowledge related to time attributes from large amounts of time series data, and uses it for short-term, medium-term, or long-term forecasting to guide social, economic, military, and everyday activities. By objectively recording and analyzing past behavior, it reveals the underlying laws and then completes decision-making tasks such as predicting future behavior.

6. The significance of web data mining

1. Discover information that users are interested in from a large amount of information

2. Turn the rich information on the web into useful knowledge

3. Personalize user information

7. The main difference between AprioriAll and AprioriSome sequence mining algorithms

See the textbook, p. 220.
4. Comprehensive questions (15 points for each question, 30 points in total for this question)

1. The main idea of the Apriori algorithm, and the process of generating frequent itemsets and strong association rules (required)

The core of the algorithm is the join step and the prune step. The join step is a self-join of the frequent (k-1)-itemsets: two itemsets are joined only if their first k-2 items are the same, with items kept in lexicographic order. The prune step uses the Apriori property that every non-empty subset of a frequent itemset must also be frequent; conversely, if any non-empty subset of a candidate is not frequent, the candidate cannot be frequent and can be deleted from Ck.

Simply put, to find the frequent itemsets the process is: (1) scan the transaction database, (2) count the support of each candidate, (3) compare the counts with the minimum support to obtain the frequent itemsets, (4) join and prune the frequent itemsets to generate the next set of candidates, and (5) repeat steps (1)-(4) until no larger frequent itemset can be found. Strong association rules are then generated from the frequent itemsets by keeping only the rules whose confidence meets the minimum confidence threshold (see the sketch below).
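
A simplified, runnable sketch of the two phases in Python (the toy transaction database and the thresholds min_support = 0.4 and min_confidence = 0.7 are assumptions for illustration):

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]
min_support, min_confidence = 0.4, 0.7

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# 1. Find all frequent itemsets level by level (join + prune + support counting).
items = sorted({i for t in transactions for i in t})
frequent = {}                                     # frozenset -> support
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
k = 1
while level:
    frequent.update({s: support(s) for s in level})
    # Join step: merge frequent k-itemsets that differ in exactly one item, then
    # prune candidates that have an infrequent k-subset (Apriori property).
    candidates = set()
    for a, b in combinations(level, 2):
        union = a | b
        if len(union) == k + 1 and all(frozenset(sub) in frequent
                                       for sub in combinations(union, k)):
            candidates.add(union)
    level = [c for c in candidates if support(c) >= min_support]
    k += 1

# 2. Generate strong association rules from the frequent itemsets.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / frequent[antecedent]
            if confidence >= min_confidence:
                print(set(antecedent), "=>", set(itemset - antecedent),
                      f"support={sup:.2f} confidence={confidence:.2f}")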

Reference source: https://blog.csdn.net/lizhengnanhua/article/details/9061755?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1.control

Recommended video tutorial: https://www.bilibili.com/video/BV1AJ411x7sf?from=search&seid=9854295243924040456

The Apriori algorithm has two fatal performance bottlenecks:

1. It scans the transaction database many times, which incurs a heavy I/O load.

2. It may produce a huge number of candidate itemsets.
    

2. The main idea, classification process and results of the KNN (k-nearest neighbors) classification algorithm

        K-Nearest Neighbors (KNN) computes the distance from each training data point to the tuple to be classified, takes the K training data points closest to that tuple, and assigns the tuple to the class that holds the majority among those K points.

Algorithm 4-2  K-nearest-neighbor classification algorithm
Input:  training data T; number of nearest neighbors K; tuple t to be classified.
Output:  class label c.
(1) N = Ø;
(2) FOR each d ∈ T DO BEGIN
(3)     IF |N| < K THEN
(4)         N = N ∪ {d};
(5)     ELSE
(6)         IF ∃ u ∈ N such that sim(t,u) < sim(t,d) THEN BEGIN
(7)             N = N - {u};
(8)             N = N ∪ {d};
(9)         END
(10) END
(11) c = the class to which the most elements u ∈ N belong.
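
A small runnable equivalent in Python (the toy training data are made up for illustration):

import math
from collections import Counter

def knn_classify(train, t, k=3):
    """train: list of (feature_vector, label); t: feature vector to classify."""
    # Distance from the tuple to be classified to every training data point.
    distances = [(math.dist(x, t), label) for x, label in train]
    # Keep the K nearest training points and take a majority vote of their labels.
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((4.0, 4.2), "B"), ((4.1, 3.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> "A"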



3. The main idea, classification process and results of ID3 decision tree classification algorithm

ID3 (Iterative Dichotomiser 3) is a greedy algorithm.
The ID3 algorithm is a classification and prediction algorithm originally proposed by J. Ross Quinlan at the University of Sydney in 1975. The core of the algorithm is information entropy.
The ID3 algorithm calculates the information gain of each attribute and regards attributes with high information gain as good attributes. Each time, the attribute with the highest information gain is chosen as the splitting criterion, and this process is repeated until a decision tree that perfectly classifies the training examples is generated.
 
The core idea of the ID3 algorithm:
     Before splitting at each non-leaf node of the decision tree, first compute the information gain of every attribute and choose the attribute with the largest information gain to split on, because a larger information gain means a stronger ability to separate the samples and better representativeness. This is clearly a top-down greedy strategy.
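
A minimal sketch in Python of the information gain computation at the heart of ID3 (the toy rows and labels are invented):

import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    base = entropy(labels)
    total = len(labels)
    split = Counter(row[attribute] for row in rows)
    remainder = 0.0
    for value, count in split.items():
        subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
        remainder += (count / total) * entropy(subset)
    return base - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0: this attribute separates the classes perfectly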
 


4. The main idea, classification process and results of the C4.5 decision tree classification algorithm

C4.5 improves ID3 by using the gain ratio instead of the information gain as the splitting criterion, and it adds handling of continuous attributes, missing values, and post-pruning of the generated tree.

5. The main idea, classification process and results of the Bayes classification algorithm

The Bayesian method uses probability and statistics to classify the sample data set. It is characterized by combining the prior probability and the posterior probability, which both avoids the subjective bias of relying only on the prior probability and avoids the overfitting that comes from using the sample information alone.
 
The naive Bayes method applies Bayes' theorem under the conditional independence assumption: given the target value, the attributes are assumed to be mutually conditionally independent. In other words, no attribute variable carries a larger or smaller weight in the decision than the others. Although this simplification reduces the classification accuracy of the Bayesian classifier to some extent, in practical applications it greatly reduces the complexity of the Bayesian method.
 
Bayesian classification
Let X and Y be two random variables. Bayes' formula gives:

P(Y|X) = P(X|Y) P(Y) / P(X)

P(Y) is called the prior probability, P(Y|X) is called the posterior probability, and P(Y, X) is the joint probability.
The prior probability is the probability obtained from past experience and analysis.
The posterior probability: given that something has happened and that it may have many causes, it is the probability that a particular cause produced it.
The joint probability is the probability that the two events occur together.
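
A tiny numerical sketch of the formula in Python (all probabilities are invented for illustration):

# Worked example of Bayes' formula with made-up numbers.
p_y = 0.01              # prior P(Y): probability of the class before seeing evidence
p_x_given_y = 0.90      # likelihood P(X|Y)
p_x_given_not_y = 0.05  # likelihood P(X|not Y)

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)  # total probability P(X)
p_y_given_x = p_x_given_y * p_y / p_x                  # posterior P(Y|X)
p_joint = p_x_given_y * p_y                            # joint P(Y, X)

print(p_y_given_x, p_joint)   # posterior ≈ 0.154, joint = 0.009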
 
 

6. The main idea, clustering process and results of the k-means clustering algorithm

      The algorithm first randomly selects k objects, each of which initially represents the mean or center of a cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center. The mean of each cluster is then recalculated. This process is repeated until the criterion function converges.

example:
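
A minimal k-means sketch in Python, with made-up 2-D points (for illustration only):

import random

def kmeans(points, k=2, iterations=20):
    """Plain k-means on 2-D points (toy sketch; fixed number of iterations)."""
    centers = random.sample(points, k)            # step 1: pick k objects as initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: assign each object to the nearest center
            idx = min(range(k), key=lambda i: (p[0]-centers[i][0])**2 + (p[1]-centers[i][1])**2)
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):    # step 3: recompute each cluster's mean
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
print(kmeans(points, k=2))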

7. The main idea, clustering process and results of the PAM (Partitioning Around Medoids) clustering algorithm (likely not to be tested)


Frequent patterns and association rules

The problem of mining association rules can be divided into two sub-problems:
1. Find all frequent itemsets: using the minimum support (Minsupport) given by the user, find all frequent itemsets or the maximal frequent itemsets.

2. Generate association rules from the frequent itemsets: using the minimum confidence (Minconfidence) given by the user, find the association rules among the frequent itemsets.

 

Review the outline PPT:

1. What is data mining? 

      Data mining is the process of analyzing large amounts of collected data with appropriate data mining and knowledge discovery techniques, extracting useful information, forming conclusions, and then studying and summarizing the data in detail.

        2. What is machine learning?

              Machine learning studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance.

        3. What is big data?

              Big data refers to data collections so large that they greatly exceed the acquisition, storage, management, and analysis capabilities of traditional database software tools. It has four characteristics: massive data scale, fast data flow, diverse data types, and low value density.

        4. What is Web content mining?

              Integrating, generalizing, and classifying the various kinds of information on a site's web pages, and mining the knowledge patterns contained in certain types of information.

        5. What are the types of fitting?

             Overfitting, appropriate fitting, and underfitting

        6. The basic concept of clustering

             Cluster analysis: given a group of objects, use their descriptive information to discover objects with common characteristics and group them into clusters.

        7. Common distance functions

             1. Euclidean distance 2. Manhattan distance 3. Minkowski distance 4. Cosine distance 5. Jaccard distance
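
A quick sketch computing each of these with SciPy (the vectors are arbitrary examples):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

print(distance.euclidean(x, y))        # Euclidean distance
print(distance.cityblock(x, y))        # Manhattan distance
print(distance.minkowski(x, y, p=3))   # Minkowski distance with p = 3
print(distance.cosine(x, y))           # cosine distance (1 - cosine similarity)
print(distance.jaccard(x > 0, y > 0))  # Jaccard distance on boolean vectors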
