[Data mining] final review (sample paper questions + a few knowledge points)

Table of contents

Chapter One Introduction

1. Fill in the blanks

(1) From a technical perspective, data mining is ( ). From a business perspective, data mining is ( ).

Answer: From a technical perspective, it is the process of extracting hidden, previously unknown but potentially useful information from large amounts of incomplete, noisy, fuzzy, and random real-world application data.
From a business perspective, it is a business information processing technology whose main feature is extracting, transforming, analyzing, and modeling large amounts of business data, distilling the key data that supports business decision-making.

(2) The information obtained by data mining has three characteristics: ( ), effective and practical.

Answer: Previously unknown.

2. Application of data mining in daily life scenarios

insert image description here

3. Distinguish between data mining and query

Data mining differs essentially from traditional data analysis methods (such as query, reporting, and online analytical processing): data mining extracts information and discovers knowledge without a clearly formulated premise.
Example:
Finding people's names within a sentence is data mining; looking people up in a table is querying.

Chapter 2 Data Processing Basics

1. Fill in the blanks

(1) A data set is ( ), and attributes are divided into ( ).

A: A collection of data objects and their attributes; nominal, ordinal, interval, and ratio attributes.

2. Calculation questions

(1) Calculate the similarity measure

The range given by the teacher:
distance measure (Manhattan, Euclidean):
insert image description here

Similarity coefficient (cosine similarity):
insert image description here

Similarity of binary attributes (simple matching coefficient; d = dissimilarity, s = similarity)
Jaccard coefficient:
insert image description here
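The formula images above did not survive the scrape; the measures named can be reconstructed in standard notation (a sketch of the usual definitions, where f11 counts attributes equal to 1 in both objects, f00 equal to 0 in both, and so on):

```latex
\text{Manhattan: } d(x,y) = \sum_{i=1}^{n} |x_i - y_i|
\qquad
\text{Euclidean: } d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

\text{Cosine similarity: } \cos(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}

\text{Simple matching: } \mathrm{SMC} = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}}
\qquad
\text{Jaccard: } J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}
```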

Example 1:

answer:

Example 2:

answer:

(2) Calculation of data statistical features

Formulas to memorize:
arithmetic mean,
weighted arithmetic mean,
truncated (trimmed) mean: discard the top and bottom (p/2)% of the data, then compute the mean,
median,
quartiles,
midrange: (max + min) / 2,
mode
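These statistics can be checked with a short script (the numbers below are a made-up sample, not from the paper):

```python
import statistics

data = sorted([13, 15, 16, 16, 19, 20, 21, 22, 25, 30])

mean = sum(data) / len(data)              # arithmetic mean
median = statistics.median(data)
midrange = (max(data) + min(data)) / 2    # (max + min) / 2
mode = statistics.mode(data)

# truncated (trimmed) mean: drop (p/2)% from each end; here p = 20,
# so one value is discarded at each end of the sorted data
p = 20
k = int(len(data) * (p / 2) / 100)
trimmed = data[k:len(data) - k]
trimmed_mean = sum(trimmed) / len(trimmed)
```

For this sample the trimmed mean (19.25) sits slightly below the arithmetic mean (19.7) because the discarded high value (30) was further from the center than the discarded low value (13).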

answer:

3. Quiz

(1) Why is data preprocessing needed? List three commonly used preprocessing techniques.

Answer: The purpose of data preprocessing: to provide clean, concise, and accurate data, and to improve mining efficiency and accuracy.

Preprocessing techniques: data cleaning, data integration, data transformation, data reduction, data discretization.
① Data cleaning: handles incomplete, noisy, inconsistent data (fill in missing values, smooth out noise and identify outliers, correct inconsistent values).
② Data integration (aggregation): combine data from two or more data sources into a consistent data store.
③ Data transformation: transform data into a form suitable for mining (smoothing, aggregation, data generalization, normalization, data discretization).
④ Data reduction: includes sampling and feature selection.

4. Smoothing method for noisy data

(1) Binning:
Step 1: divide the sorted data into n equal-depth bins
Step 2: smooth using bin means or bin boundaries

The deeper and wider the bins, the stronger the smoothing effect.

(2) Clustering: detect and delete outliers
(3) Regression: fit a suitable function
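A minimal sketch of equal-depth binning with mean smoothing (the helper name is my own; the price list is a commonly used sample):

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth (equal-frequency) binning, then replace each value by its
    bin mean. Assumes len(values) is divisible by n_bins for simplicity."""
    data = sorted(values)
    depth = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        bin_vals = data[i * depth:(i + 1) * depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Boundary smoothing would instead replace each value in a bin by the nearer of the bin's minimum and maximum.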

5. Data transformation

A. Normalization

Normalization transforms the original metric values into dimensionless values (scaled and mapped into a new value range).
(1) Min-max normalization (maps to the [0, 1] range)
(2) z-score normalization (the standardization familiar from probability theory)
(3) Decimal scaling normalization (transforms values into the form 0.xxx by dividing by a power of 10)
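The three normalizations follow directly from the definitions (function names and the sample numbers below are illustrative only):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # map v from [vmin, vmax] onto [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # how many standard deviations v lies from the mean
    return (v - mean) / std

def decimal_scaling(v, j):
    # j = smallest integer such that max(|v'|) < 1
    return v / (10 ** j)

# e.g. incomes in [12000, 98000]:
print(min_max(73600, 12000, 98000))   # ≈ 0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(986, 3))        # 0.986
```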

B. Feature construction

Construct a new feature set from the original features.

C. Data discretization

Replace the numeric values of a continuous attribute with categorical labels. Divided into supervised and unsupervised discretization.
Unsupervised discretization methods: (1) equal-width (2) equal-frequency (3) based on cluster analysis.
Supervised discretization methods: (1) entropy-based: top-down

6. Data reduction

A. Sampling

Reduces the number of rows.

There are three sampling methods: with replacement, without replacement, and stratified sampling (p36).

B. Feature selection

Reduces the number of columns.

Ideal feature subset: every valuable non-target feature should be strongly correlated with the target feature, while the non-target features should be uncorrelated or only weakly correlated with each other.

Chapter 3 Classification and Regression

1. Fill in the blanks

(1) Methods for evaluating the accuracy of classification models include: ( ), ( ) and random sub-sampling methods.

Answer: Holdout method, k-fold cross-validation.

2. True or False

(1) The regression prediction output is a continuous value ( )

Answer: √
Classification prediction outputs discrete class labels (predicts a class); regression prediction outputs continuous values.

(2) The KNN classification method requires prior modeling. ( )

Answer: ×
KNN is a lazy (passive) learning method that requires no prior modeling. Basic steps:
1. Compute distances: given a test object, compute its distance to every object in the training set;
2. Find neighbours: select the k nearest training objects as the test object's neighbours;
3. Classify: assign the test object to the majority class among its k neighbours.
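The three steps can be sketched as follows (the data and function name are made up for illustration):

```python
from collections import Counter

def knn_classify(test_point, train_data, k=3):
    """train_data: list of (features, label). Lazy learner: no model is
    built in advance; all work happens at prediction time."""
    # Step 1: compute the Euclidean distance to every training object
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(feat, test_point)) ** 0.5, label)
        for feat, label in train_data
    )
    # Step 2: take the k nearest neighbours
    neighbours = [label for _, label in dists[:k]]
    # Step 3: majority vote among the k neighbours
    return Counter(neighbours).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"),
         ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify((2, 1), train, k=3))  # -> "A"
```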

(3) The AdaBoost algorithm combines multiple classifiers to improve classification accuracy. ( )

Answer: √

3. Calculation questions

Formula:
Information Entropy:
insert image description here

Information Gain:
insert image description here
Split Information:
insert image description here
Information Gain Rate:
insert image description here
Gini Coefficient:
insert image description here
Gini Coefficient Gain:
insert image description here
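The six formula images are missing; reconstructed in standard notation (a sketch of the usual textbook definitions, with p_i the fraction of class i in S and S_v the subset where attribute A takes value v):

```latex
E(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

\mathrm{Gain}(S, A) = E(S) - \sum_{v} \frac{|S_v|}{|S|} \, E(S_v)

\mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}

\mathrm{Gini}(S) = 1 - \sum_{i=1}^{m} p_i^2
\qquad
\Delta\mathrm{Gini}(S, A) = \mathrm{Gini}(S) - \sum_{v} \frac{|S_v|}{|S|} \, \mathrm{Gini}(S_v)
```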

(1) Use the ID3 algorithm to describe the process of building a decision tree

insert image description here
insert image description here
insert image description here

(2) Given a certain weather data set, find information gain, information gain rate, and Gini coefficient gain.

insert image description here
(1) Steps:
Compute the entropy E(S) of the whole data set.
Compute the entropy E(Si) of each subset produced by splitting on temperature.
Compute the weighted sum E_temperature(S) = Σ (|Si|/|S|) · E(Si).
Compute the information gain: Gain(S, temperature) = E(S) − E_temperature(S).


insert image description here
insert image description here
insert image description here
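The same steps can be worked through in code on a tiny made-up data set (not the weather table from the image):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_idx, labels):
    """Gain(S, A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
    total = entropy(labels)
    n = len(rows)
    weighted = 0.0
    for value in set(r[attr_idx] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_idx] == value]
        weighted += len(subset) / n * entropy(subset)
    return total - weighted

# hypothetical mini data: single attribute (temperature), label = play?
rows = [("hot",), ("hot",), ("mild",), ("cool",), ("cool",), ("mild",)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
print(round(info_gain(rows, 0, labels), 3))  # 0.667
```

Here E(S) = 1 (three "yes", three "no"); the "hot" and "cool" subsets are pure (entropy 0) and "mild" is evenly split (entropy 1), so the gain is 1 − 1/3 = 2/3.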

(3) KNN book example questions

insert image description here
insert image description here

4. Quiz

(1) Write out the Bayesian formula, please give the steps of the Naive Bayesian method.

Answer: Formula: P(A|B) = P(B|A)*P(A) / P(B)
Steps:
(The official answer is as follows, I know every single word, but I can’t understand it together …)

  1. For the given sample with an unknown class label, compute the posterior probability of each class.
  2. Using the Bayes formula, turn the posterior probability computation into the product of the sample's per-attribute conditional probabilities and the class prior probability, all of which are easy to compute from the given data.
  3. Assign the sample to the class with the highest computed probability.

(simplified version)

  1. First compute the probability of each class;
  2. Then compute, for each class, the probability of each feature of the sample to be predicted;
  3. For each class, compute: class probability × product of the feature probabilities;
  4. Choose the class with the largest result in step 3.
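A minimal sketch of the simplified steps (toy data; Laplace smoothing is omitted for brevity, so unseen feature values zero out a class):

```python
from collections import Counter

def naive_bayes(train, sample):
    """train: list of (features_tuple, label)."""
    labels = [lab for _, lab in train]
    priors = {c: n / len(train) for c, n in Counter(labels).items()}  # step 1
    scores = {}
    for c in priors:
        rows = [f for f, lab in train if lab == c]
        score = priors[c]
        for i, value in enumerate(sample):                            # step 2
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = score                                             # step 3
    return max(scores, key=scores.get)                                # step 4

train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
         (("overcast", "hot"), "yes")]
print(naive_bayes(train, ("rain", "mild")))  # -> "yes"
```

The "naive" assumption shows up in step 2: features are treated as conditionally independent given the class, so their probabilities simply multiply.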

(2) What does "naive" mean in Naive Bayes? Briefly describe the main idea of ​​Naive Bayes.

insert image description here

Chapter 4 Cluster Analysis

1. Fill in the blanks

(1) Clustering algorithms are divided into division methods, hierarchical methods, density-based methods, graph-based methods, and model-based methods, among which k-means belongs to the ( ) method, and DBSCAN belongs to the ( ) method.

A: Division; density-based.

2. True or False

(1) One-pass clustering algorithms can identify clusters of arbitrary shape. ( )

Answer: ×
The one-pass algorithm partitions the data into hyperspheres of roughly equal size and cannot find clusters of non-convex shape.

(2) DBSCAN is relatively noise-resistant and capable of identifying clusters of arbitrary shapes and sizes. ( )

Answer: √
DBSCAN algorithm is based on density

(3) In cluster analysis, the greater the similarity within a cluster, the greater the difference between clusters, and the worse the clustering effect. ( )

Answer: ×
A good clustering method produces high-quality clusters: high intra-cluster similarity, low inter-cluster similarity.

3. Calculation questions

(1) k-means algorithm

algorithm:

insert image description here

topic:

insert image description here
answer:
insert image description here
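A minimal k-means sketch following the algorithm above (toy points; initial centres are drawn randomly from the data, which is one common choice):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        # update step: recompute each centre as the mean of its cluster
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, 2)
print(sorted(centers))
```

On this well-separated toy set the centres converge to roughly (1.33, 1.33) and (8.33, 8.33); with a poor initialization on harder data the algorithm can instead settle in a local optimum, which is one of the shortcomings listed below.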

4. Typical clustering methods

(1) Division method: k-means, one-pass algorithm
(2) Hierarchical methods: agglomerative (bottom-up) and divisive (top-down) hierarchical clustering; CURE, BIRCH
(3) Density-based method: DBSCAN
(4) Graph-based clustering algorithm: Chameleon, SNN
(5) Model-based method

Shortcomings of k-means

(1) The number of clusters k must be given in advance
(2) Heavily dependent on the choice of initial values; the algorithm often falls into a local optimum
(3) The algorithm must repeatedly reassign and adjust the samples
(4) Sensitive to noise points and outliers
(5) Cannot find clusters of non-convex shape, or clusters of widely differing sizes or densities
(6) Can only be used on data sets with numerical attributes

Hierarchical clustering algorithms

There are top-down (divisive) and bottom-up (agglomerative) types.
Three improved agglomerative (bottom-up) hierarchical clustering methods: BIRCH, ROCK, CURE.

Density-Based Clustering Algorithm DBSCAN

According to the density of points, there are three types of points:
(1) core points: points inside the dense area
(2) boundary points: points on the edge of the dense area
(3) noise or background points: points in the sparse area

Directly density-reachable: p is within the Eps-neighbourhood of the core point q.
Density-reachable: p can be reached from q through a chain of directly density-reachable points; note that this relation is directional!
Density-connected: both p and q are density-reachable from some point o with respect to Eps and MinPts.
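The three point types can be identified directly from the definitions (toy data and a naive O(n²) neighbourhood search, for illustration only):

```python
def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' per the DBSCAN
    definitions. Eps-neighbourhood includes the point itself."""
    def neighbours(p):
        return [q for q in points
                if sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 <= eps]
    cores = [p for p in points if len(neighbours(p)) >= min_pts]
    labels = {}
    for p in points:
        if p in cores:
            labels[p] = "core"                       # inside a dense area
        elif any(q in cores for q in neighbours(p)):
            labels[p] = "border"                     # edge of a dense area
        else:
            labels[p] = "noise"                      # sparse area
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (9, 9)]
labels = classify_points(pts, eps=1.5, min_pts=4)
```

Here the four clustered points are core points, (3, 2) is a border point (within Eps of a core point, hence directly density-reachable from it), and (9, 9) is noise.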

DBSCAN textbook sample questions:

algorithm:
insert image description here

topic:

insert image description here
insert image description here
insert image description here

Graph-Based Clustering Algorithm Chameleon

Absolute interconnectivity EC (the larger EC is, the stronger the connection between two clusters and the more they should be merged).
Relative interconnectivity RI (the larger RI is, the closer the connection between the two clusters is to the internal connectivity within each cluster, so they can be merged more safely).

insert image description here

Absolute closeness S
Relative closeness RC

insert image description here

5. Clustering Algorithm Evaluation

(1) Internal Quality Evaluation Standards

The internal quality evaluation standard evaluates the clustering effect by calculating the average similarity within a cluster, the average similarity between clusters, and the overall similarity.

For example:
CH indicator:

insert image description here

The larger CH is (i.e., the larger traceB and the smaller traceW), the greater the separation between cluster means and the better the clustering effect.
The minimum of traceW is 0, reached when all points within each cluster coincide.

(2) External Quality Evaluation Standards

The external quality evaluation criteria are evaluated based on an existing manual classification dataset (the category of each object is already known).

Chapter V Correlation Analysis

1. Fill in the blanks

(1) Association rule mining algorithm can be divided into two steps: ① ( ), ② ( ).

Answer: ① Generate frequent itemsets: find all itemsets that meet the minimum support threshold, that is, frequent itemsets.
②Generate rules: extract rules greater than the confidence threshold from the frequent itemsets found in the previous step, that is, strong rules.

2. True or False

(1) If itemset X is a frequent itemset, then every subset of X must also be a frequent itemset ( )

Answer: √

(2) Itemsets with higher support must have higher confidence ( )

Answer: ×
insert image description here

3. Calculation questions

Apriori algorithm:

insert image description here

(1) Given the shopping basket data shown in the table below, complete the following tasks.

insert image description here
Answer: (After checking, the 2-itemsets {bread, eggs}: 1, {beer, eggs}: 1, and {diapers, eggs}: 1 were missed, but this has little effect on the final result.)
insert image description here
(2)
support({bread} -> {diaper}) = 3/5
confidence({bread} -> {diaper}) = 3/4 < 80%
so it is not a strong association rule.

4. Application Scenarios of Association Analysis

(1) Mining shopping mall sales data to discover relationships between products, helping malls run promotions and arrange shelf placement.
(2) Mining medical diagnosis data to find relationships between certain symptoms and a certain disease, assisting doctors in diagnosis.
(3) Web page mining: revealing interesting links between different browsed web pages.

5. The concept of association analysis

(1) Itemset: a collection of items; an itemset containing k items is called a k-itemset.
(2) Frequent itemsets: If the support of an itemset is greater than or equal to a certain threshold , it is called a frequent itemset.
(3) Support count: the number of occurrences of an item set, that is, the number of transactions that include the item set in the entire transaction data set.
(4) Association rules: Implications of the form X->Y
(5) Support:
insert image description here

(6) Confidence:
insert image description here
(7) Strong association rules: Association rules greater than the minimum support threshold and minimum confidence threshold.

6. Apriori algorithm

Apriori property: any subset of a frequent itemset must also be frequent.
Corollary: if an itemset is infrequent, its supersets are also infrequent.
The algorithm consists of two steps: join and prune.
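A compact sketch of the join-and-prune loop (the transaction data is illustrative; support is computed as a fraction of transactions):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Frequent itemsets via the Apriori property: every subset of a
    frequent itemset must itself be frequent."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    while current:
        # prune step: keep only candidates meeting minimum support
        survivors = [c for c in current if support(c) >= min_sup]
        frequent.update({c: support(c) for c in survivors})
        # join step: build (k+1)-candidates whose k-subsets are all frequent
        k = len(current[0]) + 1
        current = [u for u in
                   {a | b for a in survivors for b in survivors if len(a | b) == k}
                   if all(frozenset(s) in frequent for s in combinations(u, k - 1))]
    return frequent

tx = [{"bread", "milk"}, {"bread", "diaper", "beer"},
      {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
freq = apriori(tx, min_sup=0.5)
```

With min_sup = 0.5 the pair {bread, beer} (support 1/4) is pruned, and by the corollary no candidate containing it is ever generated.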

7. Correlation analysis

(1) Lift. A value greater than 1 indicates positive correlation between the two sides of the rule; less than 1, negative correlation; equal to 1, no correlation (independence).
insert image description here
(2) Interest factor
(3) Correlation coefficient
(4) Cosine measure

8. Calculation of the number of itemsets

1. Given k items, there are 2^k − 1 possible itemsets in total.
2. A frequent k-itemset yields 2^k − 2 candidate association rules (excluding L -> ∅ and ∅ -> L).
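Both counts can be verified by brute force:

```python
from itertools import combinations

def count_itemsets(k):
    # all non-empty subsets of k items: 2^k - 1
    return sum(1 for r in range(1, k + 1) for _ in combinations(range(k), r))

def count_rules(k):
    # rules X -> Y from one k-itemset: each proper non-empty subset X gives
    # a rule X -> (itemset - X), i.e. 2^k - 2 candidates
    return sum(1 for r in range(1, k) for _ in combinations(range(k), r))

print(count_itemsets(4), count_rules(4))  # 15 14
```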

Chapter 6 Outlier Mining

1. True or False

(1) If an object does not strongly belong to any cluster, then the object is an outlier based on clustering ( ).

Answer: √

2. Calculation questions

(1) Given a two-dimensional data set, the coordinates of the points are as follows, take k=3, use the k-means algorithm to find the outlier factor OF1 of points p14 and p16, which point is more likely to be an abnormal point?

insert image description here
answer:
insert image description here
insert image description here

3. Quiz

(1) What is an outlier? Does the outlier detected by the outlier mining algorithm necessarily correspond to the actual abnormal behavior? If yes, please explain; if no, please give a counterexample.

Answer: Outliers are data that deviate from most of the data in the data set, raising the suspicion that these deviations are produced by a different mechanism rather than by random factors.
In general, outliers may correspond to actual abnormal behavior, but not necessarily. Since the mechanism generating an outlier is uncertain, whether an "outlier" detected by a mining algorithm corresponds to actual abnormal behavior cannot be explained by the algorithm itself; it can only be interpreted by domain experts.
Outliers may be caused by measurement or input errors or system failures, may be determined by the inherent characteristics of the data, or may result from the object's abnormal behavior.
For example: an age of −999 may come from a program filling in a default value for missing data. The salary of a company's top managers is significantly higher than that of ordinary employees and may show up as an outlier, yet it is reasonable data. A residential phone bill jumping from under 200 yuan per month to several thousand yuan may be due to stolen calls or other special reasons. Unusually high charges on a credit card may mean the card was stolen.

4. The causes of outliers

(1) Caused by measurement, input error or system operation error
(2) Determined by the inherent characteristics of the data
(3) Caused by the abnormal behavior of the object

5. Three issues to be dealt with in outlier mining

insert image description here

6. Statistical methods

Data that do not fit the model are identified as outliers: an object is considered an outlier if it has a low probability under the probability distribution model of the data.
The probability distribution model is built from the data by estimating the parameters of a user-specified distribution.

Quality Control Diagram

insert image description here

insert image description here

7. Distance-based methods

(1) Outlier factor OF1 of point x: the larger OF1 is, the more outlying point x is.
insert image description here

insert image description here
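One common definition of OF1 is the average distance from x to its k nearest neighbours; the formula image above may differ in detail, so treat this as a sketch of that definition on made-up points:

```python
def of1(x, points, k):
    """Distance-based outlier factor: average distance from x to its
    k nearest neighbours (x itself is excluded)."""
    dists = sorted(sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5
                   for p in points if p != x)
    return sum(dists[:k]) / k

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
# the isolated point gets a much larger factor than the clustered ones
print(of1((8, 8), pts, 3), of1((1, 1), pts, 3))
```

Ranking the points by OF1 and thresholding (or taking the top n) then yields the outlier candidates, as in the exam question about p14 and p16.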

Textbook example: Calculate OF1

insert image description here
insert image description here

8. Method based on relative density

(1) Local neighborhood density:
insert image description here

(2) Relative density: outliers are detected by comparing an object's density with the average density of the objects in its neighbourhood.

insert image description here

Textbook Example: Calculate OF3

insert image description here

insert image description here

9. Clustering-based methods

Detection methods for outliers in dynamic and static data:
insert image description here

10. Evaluation of Outlier Mining Methods

Confusion matrix:
insert image description here

Two indicators of the accuracy of the outlier mining method:
(1) detection rate
(2) false alarm rate
insert image description here
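The image is unavailable; these two indicators have standard confusion-matrix definitions (a reconstruction, taking outliers as the positive class):

```latex
\text{detection rate} = \frac{TP}{TP + FN}
\qquad
\text{false alarm rate} = \frac{FP}{FP + TN}
```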

That's all there is to it.
END


Origin blog.csdn.net/qq_51669241/article/details/125154143