Summary of data mining knowledge points

1. What is the background of data mining? What is the driving force?
Four main technologies have stimulated interest in the development, application, and research of data mining techniques:
  1. the emergence of very-large-scale databases, such as commercial data warehouses, and the popularization of automatic, computer-based data collection and recording;
  2. advanced computer technology, such as faster and more powerful computing and parallel architectures;
  3. fast access to massive data, such as distributed data storage systems;
  4. the ever-deepening application of statistical methods to data processing.
A large amount of information brings convenience, but it also brings many problems:
information redundancy, difficulty in telling true information from false, difficulty in guaranteeing information security, information in many different forms that is hard to process uniformly, and so on. Phenomena such as "data surplus", "information explosion", and "knowledge poverty" have appeared one after another.
Data mining first appeared at the Eleventh International Joint Conference on Artificial Intelligence, held in 1989. Its ideas come from machine learning, database systems, pattern recognition, and statistics. Necessity is the mother of invention: in recent years data mining has attracted great attention in the information industry, mainly because huge amounts of data are available, they can be used widely, and there is an urgent need to turn them into useful information and knowledge. The information and knowledge obtained can be applied broadly, including in business management, production control, market analysis, engineering design, and scientific exploration.
Driving force: DRIP (Data Rich Information Poor)

2. What are the characteristics of big data?
High volume, high velocity, high variety (the "3 Vs"): large data scale, fast data generation and processing, and diverse data types.

3. What is data mining?
Data -> Knowledge (patterns / regularities)
Data mining is the discovery of knowledge from data: mining interesting, implicit, previously unknown, and potentially useful patterns or knowledge from large amounts of data. Data mining is not a fully automatic process; human participation may be required at every step.
Data mining can be defined at both the technical and the business level. From a technical perspective, it is the process of extracting potentially useful information and knowledge from large amounts of data. From a business perspective, it is a kind of business information processing technology whose main feature is to extract, transform, analyze, and model large amounts of business data and to distill the key data that support business decisions.

4. What is the general process of data mining? And what is the process of industry data integration & analysis? Give examples of data mining applications in various fields.
General process: typically data selection → data preprocessing → data transformation → data mining → pattern evaluation and knowledge presentation.
The process of industry data integration & analysis: (figure omitted)
Applications span many fields, such as medical care, transportation, public safety, personalized medicine, social networking, and precision consumption.

5. What are the four main tasks of data mining? What's the difference?
1. Classification / prediction: based on a set of objects and their class labels, build a classification model and use it to predict the class labels of another set of objects (supervised).
2. Cluster analysis: clustering assigns a set of samples to subsets (clusters) so that samples in the same cluster are similar in some sense (unsupervised). Unlike classification, clustering does not depend on predefined class labels; it is an unsupervised data mining task.
3. Association analysis: given a set of records, each containing several items from a given item set, derive dependency rules and use them to predict whether a certain item will occur.
4. Anomaly detection: find significant deviations from normal behavior, often making use of the results of clustering and classification analysis.

6. Combined with classification, introduce common concepts in data mining
Classification boundary: a hypersurface that divides the problem space into regions.
Overfitting: the model performs well on the training set but poorly on the test set. It "rote-memorizes" the training set (remembers properties or characteristics of the training set that do not carry over to the test set), does not learn the regularities behind the data, and generalizes poorly.
Confusion matrix (taking the positive class as 1 and the negative class as 0):

|                 | Predicted positive | Predicted negative |
| --------------- | ------------------ | ------------------ |
| Actual positive | TP                 | FN                 |
| Actual negative | FP                 | TN                 |

  1. TP (True Positive): positive samples predicted as positive (actual 1, predicted 1)
  2. FN (False Negative): positive samples predicted as negative (actual 1, predicted 0)
  3. FP (False Positive): negative samples predicted as positive (actual 0, predicted 1)
  4. TN (True Negative): negative samples predicted as negative (actual 0, predicted 0)
    Cost-sensitive learning: a machine learning approach that, in classification, considers how to train a classifier when different kinds of misclassification incur different penalties. For example, in medicine, the cost of misdiagnosing a sick person as healthy is different from the cost of misdiagnosing a healthy person as sick.
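
A minimal sketch in plain Python (function and variable names are illustrative) of how the usual metrics follow from the four confusion-matrix counts, and of how a cost matrix makes the evaluation cost-sensitive:

```python
# Minimal sketch: metrics from confusion-matrix counts (positive class = 1).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    accuracy  = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Cost-sensitive view: e.g. missing a sick patient (FN) may cost far more than a false alarm (FP).
def total_cost(tp, fn, fp, tn, cost_fn=10.0, cost_fp=1.0):
    return fn * cost_fn + fp * cost_fp

y_true = [1, 1, 0, 0, 1, 0]   # toy labels, for illustration only
y_pred = [1, 0, 0, 1, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(metrics(tp, fn, fp, tn), total_cost(tp, fn, fp, tn))
```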

7. Introduce data objects and data attributes
Attribute types: discrete and continuous. Discrete attributes take symbols or integers as values (note: binary attributes are discrete attributes); continuous attributes take real numbers as values and are usually represented as floating-point variables.
Asymmetric attribute: an attribute for which only the small number of non-zero values is of interest.

8. What is the curse of dimensionality? How can the phenomenon be explained? How can it be avoided?
Dimensionality is the number of attributes in a data set; analyzing high-dimensional data easily runs into the curse of dimensionality, the phenomenon that a model's performance degrades as more feature dimensions are added.
Explanation: as the number of dimensions increases, the data become sparser in the feature space. In a high-dimensional feature space it is easy to learn a linear classifier that, viewed in the original low-dimensional space, corresponds to a complicated nonlinear classifier; such a classifier fits noise and outliers, generalizes poorly, and overfits.
How to avoid it: (1) amount of training data: in theory, with infinitely many training samples the curse would not occur, but the number of training samples required grows exponentially with dimensionality; (2) type of model: classifiers with nonlinear decision boundaries, such as neural networks, KNN, and decision trees, can fit the data well but generalize less robustly, so with these classifiers the dimensionality should not be too high, or the amount of data must be increased; with classifiers that generalize well, such as naive Bayes and linear classifiers, more features can be used.
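
A small numpy sketch (purely illustrative) of one symptom of the curse: as dimensionality grows, the nearest and farthest neighbors of a random query point become almost equally distant, so distance-based methods lose their discriminative power.

```python
import numpy as np

# Illustrative sketch: distance concentration as dimensionality grows.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))          # 500 points uniform in the d-dimensional unit cube
    q = rng.random(d)                 # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # The relative gap between the farthest and nearest neighbor shrinks as d increases.
    print(d, (dist.max() - dist.min()) / dist.min())
```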

9. General characteristics of data sets
Dimensionality: the number of attributes in a data set. Analyzing high-dimensional data easily runs into the curse of dimensionality, so an important motivation for data preprocessing is to reduce dimensionality in time.
Sparsity: in some data sets, such as those with asymmetric attributes, fewer than 1% of the entries are non-zero, so only the non-zero values need to be stored, which greatly reduces computation time and storage space. There are algorithms designed specifically for sparse data (sparse matrices).
Resolution: different acquisition frequencies yield data at different resolutions. For example, the earth's surface is very uneven in data with a resolution of a few meters but relatively flat at a resolution of tens of kilometers. Patterns depend on resolution: if the resolution is too fine a pattern may be buried in noise and not show up, and if it is too coarse the pattern may disappear.

10. Types of data sets
Record data (data matrix, transaction data, text data)
  Encoding of text data with the bag-of-words model: each document is represented as a word vector; each word is one component of the vector, and the value of each component is the number of times that word occurs in the document (see the sketch after this list).
Graph data (the World Wide Web, molecular structures)
Sequence data (spatial data, time series, image data, video data)
The most commonly used standard form of a data set is the data matrix.
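
A minimal bag-of-words sketch in plain Python (the documents and vocabulary are made up for illustration):

```python
from collections import Counter

# Toy documents (illustrative).
docs = ["data mining finds patterns in data",
        "text mining is data mining on text"]

# Build the vocabulary: one vector component per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a count vector over the vocabulary.
def bag_of_words(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bag_of_words(d))
```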

11. Data Quality
Poor data quality can negatively impact many data processing jobs

12. Common data quality issues
Noise: Data objects that are irrelevant
Outliers: Data objects whose characteristics are significantly different from most objects in the dataset.
Duplicate values: Data duplication due to different data sources.
Inconsistent data: Inconsistent format encoding of the same attribute.
Imbalanced data: in a classification task, the numbers of training examples of the different classes differ greatly.

13. Measurement of data similarity and dissimilarity
Similarity measures typically take values in [0, 1].
Dissimilarity measures typically take values in [0, +∞).
A binary attribute is a kind of nominal attribute with only two categories or states, 0 or 1, where 0 usually means the attribute is absent and 1 means it is present.
Binary vector similarity (SMC, Jaccard coefficient): with f_pq denoting the number of attributes that take value p in x and value q in y,
  SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
  Jaccard coefficient J = f11 / (f01 + f10 + f11) (0-0 matches are ignored, so it suits asymmetric binary attributes)

Similarity between multivariate vectors: cosine similarity
  cos(x, y) = (x · y) / (||x|| · ||y||)
Correlation: Pearson correlation coefficient
  corr(x, y) = cov(x, y) / (std(x) · std(y)) = Σ_k (x_k - mean(x))(y_k - mean(y)) / sqrt( Σ_k (x_k - mean(x))^2 · Σ_k (y_k - mean(y))^2 ), with values in [-1, 1]
Mahalanobis distance
  d_M(x, y) = sqrt( (x - y)^T S^{-1} (x - y) ), where S is the covariance matrix of the data; it accounts for correlations between attributes and for their different scales.
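
A short numpy sketch of the measures above, written from the standard formulas (the vectors and sample data are illustrative):

```python
import numpy as np

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 4.0])

# Cosine similarity: dot product divided by the product of the norms.
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson correlation: covariance divided by the product of standard deviations.
pearson = np.corrcoef(x, y)[0, 1]

# Mahalanobis distance between two points, with the covariance estimated from a data sample.
data = np.random.default_rng(1).normal(size=(100, 4))
S_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = x - y
mahalanobis = np.sqrt(diff @ S_inv @ diff)

# SMC and Jaccard coefficient for binary vectors.
a = np.array([1, 0, 0, 1, 1])
b = np.array([1, 0, 1, 1, 0])
f11 = np.sum((a == 1) & (b == 1)); f00 = np.sum((a == 0) & (b == 0))
f10 = np.sum((a == 1) & (b == 0)); f01 = np.sum((a == 0) & (b == 1))
smc = (f11 + f00) / (f11 + f00 + f10 + f01)
jaccard = f11 / (f11 + f10 + f01)

print(cosine, pearson, mahalanobis, smc, jaccard)
```
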
14. Why data preprocessing and the main task of data preprocessing?
Data preprocessing is the most difficult task in data mining. The main tasks are: data cleaning, data integration, data reduction, and data transformation and discretization.

15. Data cleaning
Data cleaning includes processing irrelevant data, redundant attributes, missing data, and abnormal data.
Methods for missing data: ignore the record, fill in manually, or fill in automatically (mean or median, model-based prediction or estimation, e.g. a Bayesian approach or a decision tree).
Noisy / outlier data, and how to smooth it: binning, regression, clustering.
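
A small pandas sketch of automatic filling of missing values with the mean or median (the column names and values are made up):

```python
import pandas as pd
import numpy as np

# Toy data with missing values (illustrative column names).
df = pd.DataFrame({"age": [23, np.nan, 35, 41, np.nan],
                   "income": [3000, 4200, np.nan, 5100, 4800]})

# Fill missing values with the column mean or the column median.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```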

16. Data transformation
Attribute type: continuous type, discrete type, ordinal type, nominal type, string type, etc.
Discretization: continuous type -> discrete type
Unsupervised discretization: equal-width discretization, equal-frequency discretization, k-means discretization (see the sketch below)
Supervised discretization
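
A brief pandas sketch of the two simplest unsupervised schemes, assuming four bins (the data are synthetic):

```python
import pandas as pd
import numpy as np

values = pd.Series(np.random.default_rng(0).normal(50, 10, size=200))

# Equal-width discretization: bins covering equal value ranges.
equal_width = pd.cut(values, bins=4)

# Equal-frequency discretization: bins containing (roughly) equal numbers of samples.
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts(), equal_freq.value_counts(), sep="\n")
```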

17. Sampling
Sampling selects sample data from a data set according to certain rules. In typical applications the full data set is too large; taking a smaller sample for training or validation not only saves computing resources but can, in some circumstances, even improve the results.
Downsampling (undersampling), upsampling (oversampling), and edge sampling.
Undersampling directly removes some samples from the majority class(es) so that the numbers of samples in the different classes become close.
If undersampling discards samples at random, some important information may be lost.
Oversampling adds minority-class samples to the training set so that the numbers of samples in the different classes become close.
Oversampling cannot simply duplicate the original minority-class samples, otherwise it leads to severe overfitting.

18. What is an unbalanced dataset? What are the disadvantages? How to avoid it?
An unbalanced dataset refers to a dataset where the number of samples in each category varies greatly. Taking the binary classification problem as an example, assuming that the number of samples of the positive class is much larger than that of the negative class, the data in this case is called unbalanced data.
If 90% of the samples in the training set belong to one class and the classifier assigns every sample to that class, the classifier is useless even though its accuracy is 90%. Therefore, when the data are imbalanced, accuracy has little value as an evaluation metric. In fact, once the imbalance ratio exceeds 4:1, classifiers tend to be biased towards the majority class.
For imbalanced data, the simplest remedy is to generate samples of the minority class; the most basic way is to add new samples by randomly resampling the minority class (oversampling). Conversely, undersampling randomly selects a subset of the majority-class samples and merges it with the original minority-class samples to form a new training set.
Random undersampling can be done with or without replacement: without replacement, a majority-class sample is not drawn again once it has been sampled; with replacement, it may be drawn more than once.
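
A minimal numpy sketch of random undersampling and random oversampling on a synthetic imbalanced data set (class 1 is the minority here; labels and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)          # imbalanced labels: 90 vs 10

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Random undersampling: keep only as many majority samples as there are minority samples.
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
X_under = np.vstack([X[keep], X[min_idx]])
y_under = np.concatenate([y[keep], y[min_idx]])

# Random oversampling: resample the minority class (with replacement) up to the majority size.
extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
X_over = np.vstack([X[maj_idx], X[extra]])
y_over = np.concatenate([y[maj_idx], y[extra]])

print(X_under.shape, X_over.shape)
```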

19. How to judge whether the attributes are good or bad?
Qualitative: category histogram (discrete attributes), category distribution plot (continuous attributes)
Quantitative: entropy, information gain (see the sketch below)
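
A quantitative sketch in plain Python: the entropy of a label column and the information gain of splitting on one discrete attribute (the toy "outlook"/"play" data are made up):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    n = len(labels)
    gain = entropy(labels)
    # Subtract the weighted entropy of the subset belonging to each attribute value.
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy example: does "outlook" help predict "play"?
outlook = ["sunny", "sunny", "rain", "overcast", "rain", "overcast"]
play    = ["no",    "no",    "yes",  "yes",      "yes",  "yes"]
print(entropy(play), information_gain(outlook, play))
```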

20. What are the methods of feature subset selection?
Exhaustive search
Branch and bound
Greedy algorithms: best K individual attributes, sequential forward selection, sequential backward selection
Optimization algorithms

21. Two representative feature extraction methods (dimension reduction methods):
PCA (Principal Component Analysis)
LDA (Linear Discriminant Analysis)
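
A minimal PCA sketch with numpy, implemented as eigen-decomposition of the covariance matrix (the data and the choice of two components are illustrative):

```python
import numpy as np

def pca(X, k):
    # Center the data, estimate the covariance matrix, and keep the top-k eigenvectors.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # columns for the k largest eigenvalues
    return Xc @ top                                   # projected data (n x k)

X = np.random.default_rng(0).normal(size=(50, 5))
print(pca(X, 2).shape)                                # (50, 2)
```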

22. Classification overview
Classification derives a function from data that maps inputs to class labels; it is a supervised method.
Basic methods include: nearest neighbor, decision tree, Bayesian classifiers, support vector machine, neural network
Ensemble methods: Boosting, random forest

23. k-nearest neighbor (kNN)
Principle: a new sample is assigned to the class that is most common among its k nearest neighbors in the training set (learning by analogy; kNN is a lazy learner).
What are the hyperparameters? How are hyperparameters tuned? What is cross-validation?
Hyperparameters include k and the distance function. Tuning: divide the data into a training set and a test set, and split off a small part of the training set as a validation set; the parameters that perform best on the validation set are chosen.
Cross-validation: when training data are limited, divide the training set into several folds; each time use one fold as the validation set and the rest as the training set, iterate over the folds to obtain several accuracies, and take their average; choose the k with the best average performance.
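
A minimal kNN sketch with numpy using Euclidean distance and majority voting (the two-blob data set is synthetic; k is the hyperparameter to tune):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the query point to every training sample.
    dist = np.linalg.norm(X_train - x, axis=1)
    # Majority vote among the k nearest neighbors.
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))   # expected: 1
```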

24. Decision tree
Characteristics and advantages of decision trees: a top-down tree structure; rules can easily be extracted from the constructed tree; a data set may give rise to many different trees, and ID3 tries to build the shortest tree.
What is the basic process by which ID3 grows a tree?
How does pruning prevent overfitting?
What are the criteria for attribute selection (e.g. information gain in ID3, gain ratio in C4.5, the Gini index in CART)?

25. What is the premise of the Bayesian classifier? Calculation formula of Bayesian classifier?
Premise assumption: conditional independence of the attributes given the class.
Bayes' theorem: P(C | X) = P(X | C) · P(C) / P(X).
Under the conditional-independence assumption (naive Bayes): P(C | x_1, ..., x_n) ∝ P(C) · Π_i P(x_i | C); the class with the largest posterior probability is predicted.
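
A minimal categorical naive Bayes sketch in plain Python, showing how the conditional-independence assumption lets the likelihood factorize (toy weather data, no smoothing):

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    # Estimate P(C) and P(x_i = v | C) by counting (no smoothing, for brevity).
    prior = Counter(y)
    cond = defaultdict(Counter)                     # (feature index, class) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return prior, cond

def predict(prior, cond, row):
    n = sum(prior.values())
    best, best_score = None, -1.0
    for c, cnt in prior.items():
        score = cnt / n                             # P(C = c)
        for i, v in enumerate(row):
            score *= cond[(i, c)][v] / cnt          # P(x_i = v | C = c)
        if score > best_score:
            best, best_score = c, score
    return best

X = [["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "cool"]]
y = ["no", "no", "yes", "yes"]
prior, cond = train_naive_bayes(X, y)
print(predict(prior, cond, ["rain", "mild"]))       # expected: "yes"
```
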
26. SVM
Hard margin: for a completely linearly separable data set, every sample is classified correctly with no errors; the core idea of the linear classifier is then to find the maximum classification margin.
Soft margin: real-world data are not that clean, so a certain amount of classification error is allowed when separating the data; the margin in this case is a soft margin.
For data sets that are not linearly separable, a kernel function is introduced that implicitly maps the data into a higher-dimensional space in which they become linearly separable.
The "three treasures" of SVM: margin, duality, and the kernel trick.
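
A brief sketch of the kernel idea: the Gaussian (RBF) kernel computes inner products as if the data had been mapped into a much higher-dimensional space, without ever constructing that mapping explicitly (gamma is a free parameter chosen here arbitrarily):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2), the Gaussian (RBF) kernel.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).normal(size=(5, 2))
K = rbf_kernel(X, X)
print(K.shape, np.allclose(K, K.T))    # (5, 5) symmetric kernel matrix
```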

27. Neural network
28. Summary of classification methods
29. What is clustering? The difference with classification?
Clustering: find groups of objects such that the objects within a group are similar to one another and dissimilar to the objects in other groups; the intra-cluster distance is minimized and the inter-cluster distance is maximized.
Unsupervised learning: there are no labels, and the different clusters are generated in a data-driven way.
The problem clustering solves is to group a number of given, unlabeled patterns into meaningful clusters. Without knowing in advance how many classes exist in the database, it groups all records into different classes (clusters), using similarity under some measure (e.g. distance) as the criterion, so that dissimilarity is minimized within a cluster and maximized between different clusters.

30. Clustering evaluation criteria
Sum of squared errors (SSE): SSE = Σ_{i=1..K} Σ_{x in C_i} dist(x, m_i)^2, where m_i is the centroid of cluster C_i; a smaller SSE means tighter clusters.
Silhouette coefficient: for a sample i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other samples in its own cluster and b(i) is the smallest average distance from i to the samples of any other cluster; values close to 1 indicate good clustering.
31. K-means clustering
Algorithm: (1) choose k initial centroids; (2) assign each sample to its nearest centroid, forming k clusters; (3) recompute the centroid of each cluster; (4) repeat steps 2-3 until the centroids no longer change.
Bisecting K-means, which repeatedly splits one cluster in two with basic K-means, is less affected by initialization problems.

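A compact K-means sketch with numpy: random initialization, alternating assignment and centroid update, and the resulting SSE (a real implementation would add smarter seeding such as k-means++ and handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        # (empty clusters are not handled in this sketch).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sse = ((X - centroids[labels]) ** 2).sum()                 # sum of squared errors
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids, sse = kmeans(X, k=2)
print(centroids, sse)
```
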
32. Hierarchical clustering and DBSCAN
Agglomerative (bottom-up) and divisive (top-down) hierarchical clustering.

DBSCAN distinguishes core points, border points, and noise points.
It is robust to noise but sensitive to its hyperparameters MinPts and Eps.

33. Association rules
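The original figures are not reproduced here; assuming the standard definitions, the support of a rule X→Y is the fraction of transactions containing X∪Y, its confidence is support(X∪Y) / support(X), and Apriori prunes the search using the fact that every subset of a frequent itemset must itself be frequent. A minimal sketch over made-up transactions:

```python
# Toy transactions (illustrative); standard support/confidence definitions assumed.
transactions = [{"bread", "milk"},
                {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"},
                {"bread", "milk", "diaper", "beer"},
                {"bread", "milk", "diaper"}]

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Confidence of the rule lhs -> rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"diaper", "beer"}))        # 0.6: three of five transactions contain both
print(confidence({"diaper"}, {"beer"}))   # 0.75: of the transactions with diaper, 3/4 also have beer
```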


Origin blog.csdn.net/weixin_55085530/article/details/125491726