Summary of data warehouse and data mining

More detailed information can be found in the PDF version.

10 points for filling in the blanks,
10 points for judging and correcting errors,
8 points for calculation,
20 points for comprehensive analysis

Objective questions

10 points for filling in the blanks,
10 points for true/false questions - identify and correct the mistakes

MOOC--especially exam questions

 

Glossary 12 points

4 terms, 3 points each

Commonly encountered professional terms

Short answer questions 40 points

5 questions, 8 points each

comprehensive

Draw the ROC curve

Questions similar to the calculation-related ones

C1

What is data mining? What is the concept?

Which operations count as data mining and which do not?
Data: massive, multi-source and heterogeneous

Operation: Extract interesting (important, implicit, previously unknown, potentially useful) patterns or knowledge from large amounts of data.

There is a difference between data analysis and data mining.
Data mining is also known as knowledge discovery in databases (KDD)

Data mining process

From the perspective of data management, what is the process of data mining? What are its stages? Be sure to note that it is an iterative feedback process
 

data integration

Unify descriptions of the same data object from different data sources into a relatively consistent representation.

Data cleaning

Errors, anomalies, redundancy, missing values

Load into the data warehouse

Store data organized by subject

select, transform

Turn the data in the data warehouse into a data set related to the data mining task.
Selection: select relevant data and attribute features.
Transformation: the format may not meet the algorithm's requirements (e.g. data dimensions); feature transformations - multiplication, division, etc.

Obtain the task-related data set on which we run the algorithms

data mining

Design or select an appropriate model to use on task-related data to obtain patterns

knowledge assessment

If the results are unsatisfactory, reconsider all previous steps to find which step or steps were inappropriate


process of trial and error

Data mining tasks

classification, regression

Using historical records to predict future values--forecasting problem

clustering

Correlation analysis and association analysis-association rule mining

anomaly detection

predictive tasks

descriptive task

Association rule mining - co-occurrence relationships between items

C2

Main features of the dataset

Dimensionality, resolution, sparsity

Methods for identifying anomalies in data attribute values

Drawing a [box plot], and the 3σ principle from statistics
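A minimal sketch of both rules on made-up values (1.5·IQR and 3σ are the conventional cutoffs):

```python
import numpy as np

# Hypothetical 1-D attribute values with two planted outliers.
x = np.array([4.2, 4.8, 5.1, 5.0, 4.9, 5.3, 4.7, 5.2, 12.0, -3.0])

# Box-plot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# 3-sigma rule: flag values more than 3 standard deviations from the mean.
mu, sigma = x.mean(), x.std()
sigma_outliers = x[np.abs(x - mu) > 3 * sigma]

print(box_outliers, sigma_outliers)
```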

Nominal [binary attributes within nominal attributes -> symmetric binary and asymmetric binary], ordinal, and numeric: how is similarity calculated for each of these data types? How is similarity calculated when the data's attributes are of mixed types? Core topic.

Similarity measurement between data objects [similarity between two rows]; similarity between attributes [between two columns]

Similarity and dissimilarity vary inversely - as one grows, the other shrinks

nominal

d(i, j) = (p - m) / p, where p is the number of attributes, m is the number of attributes on which the two objects take equal values, and p - m is the number of attributes on which they differ.
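A tiny illustration of this ratio on hypothetical attribute vectors:

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects with nominal attributes."""
    p = len(a)                                  # total number of attributes
    m = sum(1 for u, v in zip(a, b) if u == v)  # number of matching values
    return (p - m) / p

# Two objects described by three made-up nominal attributes.
print(nominal_dissimilarity(["red", "round", "small"],
                            ["red", "square", "small"]))  # -> 1/3
```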

Binary

Requires four counts (the 2×2 contingency table)

Asymmetric:
one value (usually 0) occurs with much higher probability, so 0-0 matches carry little information and are excluded from the calculation
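A sketch of the standard contingency-table formulation, where q, r, s, t count the (1,1), (1,0), (0,1), (0,0) attribute pairs:

```python
def binary_dissimilarity(a, b, symmetric=True):
    """a, b are 0/1 attribute vectors."""
    q = sum(1 for u, v in zip(a, b) if (u, v) == (1, 1))
    r = sum(1 for u, v in zip(a, b) if (u, v) == (1, 0))
    s = sum(1 for u, v in zip(a, b) if (u, v) == (0, 1))
    t = sum(1 for u, v in zip(a, b) if (u, v) == (0, 0))
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)  # asymmetric: 0-0 matches (t) are ignored

a, b = [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]
print(binary_dissimilarity(a, b),                   # 2/6
      binary_dissimilarity(a, b, symmetric=False))  # 2/3
```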
 

Ordinal

Convert the values to numeric - sort the levels from low to high, then map each rank r via the formula z = (r - 1) / (M - 1), where M is the number of levels
 

numeric

measured by distance
 

Common distance

Minkowski distance

Manhattan distance - taxicab distance - walking a zigzag line along the street grid - used in high dimensions
 

supremum distance
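A quick numeric check of the Minkowski family (h = 1 Manhattan, h = 2 Euclidean, h → ∞ supremum) on two made-up vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.abs(x - y).sum()             # h = 1
euclidean = np.sqrt(((x - y) ** 2).sum())   # h = 2
supremum  = np.abs(x - y).max()             # h -> infinity

print(manhattan, euclidean, supremum)       # 5.0, ~3.61, 3.0
```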

document

cosine similarity
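A minimal sketch on hypothetical term-frequency vectors:

```python
import numpy as np

# Made-up term frequencies for two documents over a 4-word vocabulary.
d1 = np.array([3.0, 0.0, 2.0, 1.0])
d2 = np.array([1.0, 1.0, 2.0, 0.0])

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos_sim)  # 1.0 means identical direction, 0.0 means orthogonal
```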
 

mixed type

d(i, j) = Σ_f δ_ij(f) · d_ij(f) / Σ_f δ_ij(f)
f: each attribute;
d_ij(f): the dissimilarity on attribute f;
the leading δ_ij(f) is the weight (an indicator coefficient).
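A sketch of the weighted combination; the per-attribute dissimilarities and indicator weights below are made up:

```python
def mixed_dissimilarity(d_f, delta_f):
    """d(i, j) = sum_f delta(f) * d(f) / sum_f delta(f)."""
    return sum(w * d for w, d in zip(delta_f, d_f)) / sum(delta_f)

# Hypothetical dissimilarities on a nominal, an ordinal, and a numeric
# attribute; a weight of 0 drops an attribute (e.g. a missing value).
print(mixed_dissimilarity([0.0, 0.5, 0.8], [1, 1, 1]))   # ~0.433
print(mixed_dissimilarity([0.0, 0.5, 0.8], [1, 0, 1]))   # 0.4
```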

Correlation between attributes

Simple correlation and multiple correlation

Positive correlation and negative correlation

Linear correlation and non-linear correlation

Uncorrelated, completely correlated, incompletely correlated

Draw a scatter plot
correlation coefficient

Linear:

Covariance

Pearson correlation coefficient
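A small check that Pearson's coefficient is the covariance divided by the product of the standard deviations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])     # roughly linear in x

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance
pearson = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(cov_xy, pearson, np.corrcoef(x, y)[0, 1])  # last two agree
```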

Rank correlation

Maximal information coefficient (MIC): used to measure the strength of correlation between attribute variables in high-dimensional data
 

Calculations between attributes belong to correlation analysis methods

C3

What are the main steps of data preprocessing?

Data cleaning, data integration, data transformation, data reduction
 

Briefly describe the main tasks, common methods, and processes of data cleaning

Handling missing data, smoothing noise, identifying or removing outliers (abnormal attribute values), resolving data inconsistencies...
 

Common methods
 

Missing values

delete;
interpolate
 

Outliers

noise

inconsistent

Entity resolution techniques

process
 

process:

The data cleaning process: first import the data, consolidate the relevant data, handle missing values, standardize [max-min scaling; the goal is to unify the scales of the feature dimensions], normalize [fit a distribution, e.g. z-score after transformation], detect duplicates, correct errors and enrich, then export
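A minimal sketch of the two scaling steps named above (following the note's usage: "standardize" = max-min, "normalize" = z-score):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])

# Max-min scaling to [0, 1]: unifies the scales of different features.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score transformation: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax, x_zscore)
```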

What are the commonly used discretization methods? [Choose based on the downstream task] (see the sketch after this list)

unsupervised

binning

Histogram

Clustering (k-means)

Supervised - guided by class labels

Entropy-based methods

Discretization of continuous values
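A sketch of unsupervised binning with pandas on made-up ages (pd.cut gives equal-width bins, pd.qcut equal-frequency ones):

```python
import pandas as pd

ages = pd.Series([3, 7, 15, 22, 31, 44, 58, 63, 70])

equal_width = pd.cut(ages, bins=3)   # equal-width (histogram-style) bins
equal_freq  = pd.qcut(ages, q=3)     # equal-frequency bins

print(equal_width.value_counts(), equal_freq.value_counts(), sep="\n")
```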

How to identify redundant attributes?

Discover redundant attributes through correlation analysis
 

Numeric attributes: correlation coefficient, covariance
Nominal attributes: chi-square test
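A sketch of both checks on fabricated data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Numeric attributes: a |correlation coefficient| near 1 suggests redundancy.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + np.array([0.1, -0.1, 0.0, 0.1, -0.1])  # near-linear copy of a
print(np.corrcoef(a, b)[0, 1])

# Nominal attributes: chi-square test on the contingency table.
table = np.array([[250,  200],    # made-up counts for two nominal attributes
                  [ 50, 1000]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value -> the two attributes are correlated
```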
 

Commonly used reduction methods - the first three compress the number of data objects, while PCA is unsupervised dimensionality reduction.

regression

clustering

sampling

PCA

Compression of data volume

Parametric

regression

Keep only the parameters w and b. When you want to regenerate a data set, randomly sample x values and generate the corresponding y values.
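A sketch of the parametric idea, assuming a simple linear model y ≈ w·x + b:

```python
import numpy as np

# Fit the model on the original data, then keep only (w, b).
x = np.random.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + np.random.randn(200)   # synthetic "original" data
w, b = np.polyfit(x, y, deg=1)             # least-squares fit

# The 200 points can now be discarded; to regenerate a data set,
# randomly sample x and produce y from the stored parameters.
x_new = np.random.uniform(0, 10, 50)
y_new = w * x_new + b
```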

Non-parametric

clustering

sample from each cluster

sampling

With replacement, without replacement, stratified
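A sketch covering with/without-replacement sampling and stratified (here per-cluster) sampling, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"value": range(10),
                   "cluster": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]})

with_replacement    = df.sample(n=5, replace=True)
without_replacement = df.sample(n=5, replace=False)

# Stratified / per-cluster sampling: draw from each group separately.
stratified = df.groupby("cluster", group_keys=False).sample(frac=0.5)
```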

Dimension compression

unsupervised PCA

Map the feature space described by the original attributes into an orthogonal space, retaining as much of the original data's information as possible while eliminating redundancy - the new dimensions are mutually independent. PCA obtains the principal components by orthogonal matrix decomposition, selects the top k most important components as the features of the new space, and represents every data object as a linear combination of those first k components.

Attribute subset selection

Method 1 (backward): delete redundant attributes, delete unimportant ones... to obtain the subset.
Method 2 (forward): add the most important attribute, then the second most important... to obtain the subset.
 

Vs

The features obtained by attribute subset selection have concrete meanings. PCA does not [a black box] - it may achieve very good feature extraction, but with poor interpretability.


 

OLAP

Basic architecture of data warehouse

Briefly describe the data model of the data warehouse and the characteristics of each model

The difference between data warehouse and database

Association rule mining

Methods and evaluation indicators
 

Two stages

Generation of frequent itemsets - generation of association rules

How frequent itemsets are computed

Use the (anti-monotone / Apriori) property to prune the search space of frequent itemsets

Contents of association rule mining

Evaluation metrics - the commonly used support and confidence alone do not guarantee a meaningful association rule.

Lift
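A toy computation of support, confidence, and lift on five made-up transactions:

```python
transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"bread", "diaper", "beer"}, {"milk", "bread", "diaper"},
                {"milk", "bread", "beer"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"milk"}, {"bread"}
conf = support(A | B) / support(A)   # confidence of A -> B
lift = conf / support(B)             # lift > 1: positive correlation,
                                     # lift < 1: negative, lift = 1: independent
print(support(A | B), conf, lift)    # 0.6, 0.75, 0.9375
```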

clustering

The difference between clustering and classification

The principles, processes, advantages, and disadvantages of k-means and DBSCAN, and what methods can be used to address the shortcomings of k-means

The k value needs to be determined

Set different k values, compute the SSE for each, and consider the k value near the inflection point (elbow)
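A sketch of the elbow procedure (scikit-learn's inertia_ attribute is exactly the SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)   # toy data
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)     # plot SSE vs. k and look for the inflection point
```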

Selection of initial clustering centers

The first center is chosen at random, and each subsequent center is chosen as far as possible from the centers already selected (the k-means++ idea).

Sensitive to noise points and outliers [because the mean is sensitive to them]

Use k-medoids, which takes actual data objects as cluster centers - high complexity - the center is replaced by another data object in the cluster; or use k-medians

Only finds spherical clusters [because it is distance-based]

empty clusters

Choose the point that contributes the most to the SSE as the new cluster center, or choose a point from the cluster that contributes the most to the SSE.
 

Differing cluster sizes:

Differing densities:

Non-convex shapes:

Solution:
 

Take a larger value of k to split the data into many small clusters, then merge them.

Vertical axis: the k-th nearest-neighbor distance.
Horizontal axis: data objects sorted by nearest-neighbor distance.
For most data objects the k-th nearest-neighbor distance varies little; at the inflection point it soars - those are anomalous points. When k is large the distances are large, so the plot can be used to judge the parameters.
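A sketch of computing the sorted k-distance curve (random data; plotting it gives the picture described above):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.randn(200, 2)
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because the nearest
dist, _ = nn.kneighbors(X)                       # neighbor is the point itself
k_dist = np.sort(dist[:, k])                     # sorted k-th NN distances

# Plotting index vs. k_dist: flat for most objects, soaring past the
# inflection point (anomalies); the knee also suggests DBSCAN's Eps.
```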

Evaluation indicators for clustering - supervised/external [using the real class labels, e.g. normalized mutual information, analogous to classification metrics] and unsupervised/internal [e.g. silhouette coefficient]

Normalized mutual information - Y is the cluster label, C is the real label - mutual information I(Y, C) = H(C) - H(C | Y); the stronger the dependence between Y and C, the better
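A minimal call with made-up label vectors:

```python
from sklearn.metrics import normalized_mutual_info_score

y_cluster = [0, 0, 1, 1, 2, 2]   # cluster labels Y
c_true    = [0, 0, 1, 1, 1, 1]   # real class labels C
print(normalized_mutual_info_score(c_true, y_cluster))  # value in [0, 1]
```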

Classification

How to draw the ROC curve

TPR is the recall rate
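A sketch of the drawing procedure on a made-up score/label table (assumes distinct scores):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])  # classifier scores
labels = np.array([1,   1,   0,   1,   0,    0])    # true classes

order = np.argsort(-scores)        # sort objects by decreasing score
P, N = labels.sum(), len(labels) - labels.sum()

tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
for i in order:                    # lower the threshold one object at a time
    if labels[i] == 1:
        tp += 1
    else:
        fp += 1
    tpr.append(tp / P)             # TPR = recall = TP / (TP + FN)
    fpr.append(fp / N)             # FPR = FP / (FP + TN)

print(list(zip(fpr, tpr)))         # connect these points to get the ROC curve
```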
 

Evaluation metrics - precision, recall, F-score

Decision trees, Bayesian, ensemble

Bayesian: easy to implement, relatively good results, robust;
but attributes may have dependencies (violating the independence assumption)


Ensemble methods
 

Improvement is achieved only for unstable classifiers

Evaluation frameworks - bootstrap, cross-validation?? (the bootstrap of cross-validation)
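A sketch of one plain bootstrap round, showing the well-known ~63.2% / ~36.8% split between the training sample and the out-of-bag test objects:

```python
import numpy as np

n = 1000
idx = np.random.randint(0, n, size=n)     # sample n objects with replacement
train = np.unique(idx)                    # objects drawn at least once
oob = np.setdiff1d(np.arange(n), train)   # never drawn -> out-of-bag test set
print(len(train) / n, len(oob) / n)       # ~0.632 vs. ~0.368
```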

Binary classification problems

Positive examples
 

Anomalies

Types of anomalies

Anomaly detection methods

Based on statistics, distance, density, ...
