C1
What is data mining? What is the concept?
Which operations count as data mining and which do not? Data: massive, multi-source, heterogeneous. Operation: extract interesting (important, implicit, previously unknown, potentially useful) patterns or knowledge from large amounts of data. Note the difference between data analysis and data mining. Data mining is also known as knowledge discovery in databases (KDD).
Data mining process
From the perspective of data management, what is the data mining process? What are its steps? Be sure to note that it is an iterative feedback process.
Data integration
Unify descriptions of the same data object from different data sources into relatively consistent data information.
Data cleaning
Handle errors, anomalies, redundancy, and missing values.
Load into the data warehouse
Store the data organized by subject.
Select and transform
Turn the warehouse data into a data set relevant to the mining task. Selection: choose the relevant data and attribute features. Transformation: the format may not meet the algorithm's requirements; adjust data dimensions; apply feature transformations (e.g., combining features by multiplication or division).
The result is a task-relevant data set ready for the mining algorithms.
Data mining
Design or select an appropriate model and apply it to the task-relevant data to obtain patterns.
Knowledge evaluation
If the patterns are unsatisfactory, reconsider all the previous steps and ask which step or steps were inappropriate.
A process of trial and error.
Data mining tasks
Classification and regression
Use historical records to predict future values: forecasting problems.
Clustering
Correlation analysis and association analysis: association rule mining
Anomaly detection
Predictive tasks
Descriptive tasks
Association rule mining looks for co-occurrence relationships between items.
C2
Main features of a data set
Dimensionality, resolution, sparsity
Methods for identifying anomalies in attribute values
Draw a box plot; apply the 3σ rule from statistics (a sketch follows).
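A minimal sketch of both rules on synthetic data (the sample values are made up; the 1.5×IQR and 3σ cutoffs are the standard ones):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 normal values around 10 plus one planted outlier.
x = np.append(rng.normal(loc=10, scale=0.2, size=100), 25.0)

# Box-plot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# 3-sigma rule: flag values more than 3 standard deviations from the mean.
mask_3sigma = np.abs(x - x.mean()) > 3 * x.std()

print("box-plot outliers:", x[box_mask])
print("3-sigma outliers: ", x[mask_3sigma])
```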
Nominal (among nominal attributes, binary attributes split into symmetric binary and asymmetric binary), ordinal, and numeric: how is similarity computed for each of these data types? And how is it computed when the data's attributes are of mixed types? [core]
Similarity measurement between data objects (similarity between two rows; similarity between attributes compares two columns)
Similarity and dissimilarity move in opposite directions: as one grows, the other shrinks.
Nominal
d(i, j) = (p − m) / p, where p is the total number of attributes, m is the number of attributes on which the two objects take equal values, and p − m is the number of attributes on which they differ.
Binary attributes require the four counts of a 2×2 contingency table. Asymmetric case: when the probability of taking 0 is much higher, matches on 0 carry little information despite their large count, so they are excluded, as in the Jaccard coefficient.
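A small sketch of both measures; the function names and example vectors are invented for illustration:

```python
import numpy as np

def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for nominal attribute vectors a and b."""
    a, b = np.asarray(a), np.asarray(b)
    p = len(a)               # total number of attributes
    m = np.sum(a == b)       # number of matching attributes
    return (p - m) / p

def asymmetric_binary_dissimilarity(a, b):
    """Jaccard-style distance: 0/0 matches carry no information and are dropped."""
    a, b = np.asarray(a), np.asarray(b)
    q = np.sum((a == 1) & (b == 1))  # 1/1 matches
    r = np.sum((a == 1) & (b == 0))  # mismatches
    s = np.sum((a == 0) & (b == 1))  # mismatches
    return (r + s) / (q + r + s)

print(nominal_dissimilarity(["red", "S", "round"], ["red", "M", "round"]))  # 1/3
print(asymmetric_binary_dissimilarity([1, 0, 0, 1, 0], [1, 0, 1, 0, 0]))    # 2/3
```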
Ordinal
Convert the values to numeric type: sort the levels from low to high to obtain ranks r ∈ {1, …, M}, then rescale with z = (r − 1) / (M − 1).
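The same conversion in code, with hypothetical level names:

```python
# Map ordinal levels to ranks, then rescale to [0, 1] via z = (r - 1) / (M - 1).
levels = ["low", "medium", "high"]           # sorted from low to high
rank = {v: i + 1 for i, v in enumerate(levels)}
M = len(levels)

values = ["low", "high", "medium", "high"]
z = [(rank[v] - 1) / (M - 1) for v in values]
print(z)  # [0.0, 1.0, 0.5, 1.0]
```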
Numeric
Measured by distance. Common distance: the Minkowski distance d(i, j) = (Σ_k |x_ik − x_jk|^p)^(1/p)
Manhattan distance (p = 1): taxi distance, walking the zigzag line along the streets; Euclidean distance (p = 2); supremum distance (p → ∞), often used in high dimensions.
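A sketch of the whole family on two toy vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    """d(x, y) = (sum_k |x_k - y_k|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(x, y, 1))            # Manhattan: 5.0
print(minkowski(x, y, 2))            # Euclidean: ~3.606
print(np.max(np.abs(x - y)))         # supremum (p -> infinity): 3.0
```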
Documents
Cosine similarity: cos(x, y) = (x · y) / (‖x‖ ‖y‖)
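For instance, on two hypothetical term-frequency vectors:

```python
import numpy as np

# Term-frequency vectors for two documents (made-up counts).
d1 = np.array([3, 0, 1, 2])
d2 = np.array([1, 1, 0, 2])

# Dot product divided by the product of the vector lengths.
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)  # ~0.764
```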
Mixed types
d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f), where f ranges over the attributes, d_ij^(f) is the dissimilarity on attribute f (computed by that attribute's own rule), and the δ_ij^(f) in front is the indicator weight.
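A minimal sketch of the combined measure, assuming only nominal and numeric attribute types; the function, its arguments, and the example values are invented:

```python
def mixed_dissimilarity(x, y, types, ranges):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over attributes f.

    types[f] is 'nominal' or 'numeric'; ranges[f] is max - min for numeric f
    (used to scale numeric differences into [0, 1])."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        delta = 1.0                      # indicator weight (0 if a value is missing)
        if x[f] is None or y[f] is None:
            delta, d = 0.0, 0.0
        elif t == "nominal":
            d = 0.0 if x[f] == y[f] else 1.0
        else:                            # numeric
            d = abs(x[f] - y[f]) / ranges[f]
        num += delta * d
        den += delta
    return num / den

x = ["red", 3.0]
y = ["blue", 5.0]
print(mixed_dissimilarity(x, y, ["nominal", "numeric"], {1: 10.0}))  # (1 + 0.2)/2 = 0.6
```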
Correlation between attributes
Simple correlation and multiple correlation
Positive correlation and negative correlation
Linear correlation and non-linear correlation
Uncorrelated, completely correlated, incompletely correlated
Draw a scatter plot; compute a correlation coefficient. Linear:
Covariance
Pearson correlation coefficient r = cov(x, y) / (σ_x σ_y)
Rank correlation (e.g., Spearman)
Maximal information coefficient (MIC): used to measure strong correlations between attribute variables in high-dimensional data
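A sketch of the linear and rank measures on synthetic data; note that MIC is not in SciPy and would need an extra package such as minepy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)   # linearly related with noise

print("covariance:", np.cov(x, y)[0, 1])
print("Pearson r: ", stats.pearsonr(x, y)[0])     # linear correlation
print("Spearman:  ", stats.spearmanr(x, y)[0])    # rank correlation
# MIC would require an extra library (e.g. minepy); it is not part of SciPy.
```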
Computations of correlation between attributes belong to correlation analysis (a family of methods).
C3
What are the main steps of data preprocessing?
Data cleaning, data integration, data transformation, data reduction
Briefly describe the main tasks, common methods, and process of data cleaning
Handle missing data, smooth noise, identify or remove outliers (anomalies in attribute values), resolve data inconsistencies...
Common methods
Missing values
Delete the record, or interpolate/impute
Outliers
Identify them, e.g., with box plots or the 3σ rule
Noise
Smooth it, e.g., by binning
Inconsistencies
Entity-recognition techniques
Process (the notes show it as a flowchart): import the data, consolidate the relevant data, handle missing values, standardize (max-min scaling, to put all feature dimensions on one scale), normalize (z-score, so the transformed data fits a standard distribution), detect duplicates, correct errors and enrich, export.
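A minimal pandas sketch of this flow on a made-up frame (the column names and output path are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [23, None, 31, 31],
                   "income": [3500.0, 4200.0, None, 4200.0]})

df = df.drop_duplicates()                     # duplicate detection
df = df.fillna(df.mean(numeric_only=True))    # handle missing values by mean imputation

# Standardize: max-min scaling to [0, 1] so all features share one scale.
df_minmax = (df - df.min()) / (df.max() - df.min())

# Normalize: z-score so each feature fits a standard distribution.
df_zscore = (df - df.mean()) / df.std()

df_zscore.to_csv("clean.csv", index=False)    # export
```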
What are the commonly used discretization methods? (Choose with the downstream task in mind.)
Unsupervised (a sketch follows this list)
Binning
Histogram
Clustering (k-means)
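All three unsupervised variants can be approximated with scikit-learn's KBinsDiscretizer; a sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 1))

# strategy='uniform'  -> equal-width bins   (histogram-style)
# strategy='quantile' -> equal-depth bins   (binning by frequency)
# strategy='kmeans'   -> bin edges from 1-D k-means clustering
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    codes = disc.fit_transform(X)
    print(strategy, np.bincount(codes.ravel().astype(int)))
```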
Supervised: carried out under the guidance of class labels
Entropy-based methods
Discretization of continuous attributes
How can redundant attributes be identified?
Discover redundant attributes through correlation analysis. Numeric attributes: correlation coefficient, covariance. Nominal attributes: chi-square (χ²) test.
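For the nominal case, a sketch with SciPy; the contingency counts are hypothetical, textbook-style numbers:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of two nominal attributes (rows: one attribute's
# values, columns: the other's).
table = np.array([[250,  200],
                  [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value suggests the attributes are correlated (redundant)
```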
Commonly used reduction methods: the first three compress the amount of data, while PCA is unsupervised dimensionality reduction.
Regression
Clustering
Sampling
PCA

Compression of data volume
Parametric
Regression: keep only the fitted parameters w and b; when a data set is needed, randomly sample x and generate the y values from the model (a sketch follows this list).
Non-parametric
Clustering: sample from each cluster
Sampling: with replacement, without replacement, stratified
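A sketch of the parametric idea: a thousand synthetic points collapse into the two numbers w and b:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=1000)  # underlying pattern y = wx + b

# Fit the linear model and keep ONLY the parameters w, b instead of the 1000 points.
w, b = np.polyfit(x, y, deg=1)

# When a data set is needed again, sample x and regenerate the y values.
x_new = rng.uniform(0, 10, size=5)
y_new = w * x_new + b
print(w, b, y_new)
```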
Dimensionality compression
Unsupervised: PCA
Map the feature space described by the original attributes into an orthogonal space, retaining as much of the original information as possible while eliminating redundancy (the new dimensions are mutually independent). PCA obtains the principal components by orthogonal matrix decomposition and selects the top k most important components as the features of the new space; every data object is then represented by a linear combination of those first k components.
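A minimal scikit-learn sketch, with one deliberately redundant column planted in synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # a redundant attribute

pca = PCA(n_components=2)          # keep the top k = 2 principal components
Z = pca.fit_transform(X)           # each object as a combination of the components

print(Z.shape)                             # (200, 2)
print(pca.explained_variance_ratio_)       # information retained per component
```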
Attribute subset selection
Method 1 (backward elimination): repeatedly delete redundant or unimportant attributes to obtain the subset. Method 2 (forward selection): add the most important attribute, then the second most important, and so on, to obtain the subset.
Vs.
The features obtained by attribute selection have specific meanings. PCA's do not (it is a black box): it can produce very good extracted features, but with poor interpretability.
OLAP
Basic architecture of a data warehouse
Briefly describe the data models of a data warehouse and the characteristics of each model
The difference between a data warehouse and a database
Association rule mining
Methods and evaluation metrics

Two stages
Generation of frequent itemsets, then generation of association rules
Generation of frequent itemsets
Use the Apriori property (every subset of a frequent itemset must itself be frequent) to shrink the candidate space (sketched below).
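A sketch of that pruning step; the itemsets are invented:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori property: if any (k-1)-subset is not frequent, prune the candidate."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# Suppose these 2-itemsets were found frequent in the previous pass.
frequent_2 = {frozenset(s) for s in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}

for cand in [("a", "b", "c"), ("a", "b", "d")]:
    print(cand, "pruned" if has_infrequent_subset(cand, frequent_2) else "kept")
# ('a','b','c') is kept; ('a','b','d') is pruned because ('a','d') is infrequent
```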
Contents of association rule mining
Evaluation metrics: the commonly used support and confidence alone do not guarantee a meaningful association rule.
Lift: lift(A→B) = conf(A→B) / sup(B); lift > 1 indicates a positive correlation between A and B (computed in the sketch below).
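A small end-to-end computation of all three metrics on invented transactions:

```python
# Support, confidence, and lift for a rule A -> B from raw transactions.
transactions = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}, {"beer"}]
A, B = {"milk"}, {"bread"}

n = len(transactions)
sup_A  = sum(A <= t for t in transactions) / n          # subset test per transaction
sup_B  = sum(B <= t for t in transactions) / n
sup_AB = sum(A | B <= t for t in transactions) / n

confidence = sup_AB / sup_A
lift = confidence / sup_B          # lift > 1: A and B are positively correlated
print(sup_AB, confidence, lift)    # 0.4, ~0.667, ~1.11
```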
Clustering
The difference between clustering and classification
The principles, processes, advantages and disadvantages of k-means and DBSCAN, and what methods can remedy the shortcomings of k-means
The value of k must be chosen in advance
Set different values of k, compute the SSE for each, and consider the k near the inflection point (the elbow; sketched below).
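A sketch of the elbow scan on synthetic blobs; note that init='k-means++' also covers the center-initialization fix mentioned below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compute the SSE (inertia) for a range of k and look for the elbow.
for k in range(1, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # SSE drops sharply until k reaches ~4
```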
Selection of the initial cluster centers
Choose the first center randomly; choose each subsequent one as far as possible from those already selected (the k-means++ idea).
Sensitive to noise points and outliers (because the mean is sensitive to them)
Use k-medoids, which takes real data objects as centers (higher complexity, since a center is replaced by other data objects in its cluster); or use k-medians.
Finds spherical clusters (because it is distance-based)
Empty clusters
Choose the point that contributes most to the SSE as the new cluster center, or pick a point from the cluster that contributes most to the SSE.
Differing sizes and densities: k-means handles clusters of very different sizes or densities poorly.
Non-convex clusters: solution: take a larger value of k, split the data into many small clusters, and then merge them.
DBSCAN's Eps can be read off the k-dist plot (sketched below). Vertical axis: the k-th nearest-neighbor distance. Horizontal axis: the data objects sorted by that distance. For most data objects the k-th nearest-neighbor distance varies little; where the curve soars is the inflection point, beyond which lie the outliers (their k-th nearest-neighbor distances are large), so the distance at the knee is a good Eps.
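A sketch that builds the k-dist values with scikit-learn (the plot itself is omitted; k = 4 is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)           # dist[:, 0] is each point's distance to itself
k_dist = np.sort(dist[:, k])         # sorted k-th nearest-neighbor distances

# Plotting index vs. k_dist gives the k-dist graph; the value at the knee,
# where the curve soars, is a reasonable Eps for DBSCAN.
print(k_dist[:5], k_dist[-5:])
```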
Evaluation metrics for clustering: supervised (the same as for classification) and unsupervised (normalized mutual information and the silhouette coefficient)
Normalized mutual information: Y is the cluster label, C is the true label; the mutual information I(Y; C) = H(C) − H(C | Y); the higher the dependence between Y and C, the better. NMI normalizes this, e.g. NMI = 2 I(Y; C) / (H(Y) + H(C)).
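Both metrics are single calls in scikit-learn; a sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score, silhouette_score

X, c_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("NMI:       ", normalized_mutual_info_score(c_true, y_pred))  # needs true labels
print("silhouette:", silhouette_score(X, y_pred))                   # fully unsupervised
```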
Classification
How to draw a ROC curve: sweep the decision threshold over the ranked scores and plot TPR against FPR.
TPR is the recall rate.
Evaluation metrics: precision, recall, F-score (sketched below)
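A sketch with made-up labels and scores:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # made-up labels
y_score = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]   # classifier scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))     # recall = TPR
print("F-score:  ", f1_score(y_true, y_pred))

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
print(list(zip(fpr, tpr)))                            # plot TPR against FPR
```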
Decision trees, Bayesian classifiers, ensembles
Bayesian: easy to implement, relatively good results, robust; but attributes may have dependencies, violating the conditional-independence assumption.
Ensembles: improve performance only for unstable classifiers.
Evaluation frameworks: bootstrap, cosostation (?), the bootstrap of cross-validation.
Binary classification problems; positive examples.
Anomaly detection