Data warehouse and data analysis study notes

Characteristics of data mining:
First, the data source must be real: data mining works on data that already exists, not on data collected specifically for the analysis.
Second, the volume of data processed must be massive.
Third, queries are generally ad-hoc questions raised by decision-makers (the users).
Fourth, the knowledge mined is generally not predictable in advance; data mining discovers potential, new knowledge.

KDD is short for Knowledge Discovery in Databases; data mining is only one step of KDD.
The main steps of KDD are as follows:
1. Data integration: combining data from a variety of sources;
2. Data cleaning: eliminating noise and inconsistent data;
3. Data selection: extracting the data relevant to the analysis task from the database;
4. Data transformation: consolidating the data into a form suitable for mining through aggregation, summarization, dimensionality reduction, and similar methods (data reduction lowers the complexity of the data);
5. Data mining: first determine the mining task, then choose the appropriate tool and extract the knowledge;
6. Pattern evaluation: assessing the mined patterns against measures supplied by the user;
7. Knowledge presentation: using visualization and knowledge-representation techniques to present the mined knowledge to the user in an easily understandable form.

Traditional database tools are operational tools; data mining tools are analytical tools.

Classification algorithms include decision trees, Bayesian methods, and artificial neural networks.
A data mining algorithm standardly consists of five components: the model or pattern structure, the data mining task, the score function, the search and optimization method, and the data management strategy.
A model is a high-level, global summary description of the entire data set.
A local pattern describes only a small portion of the data.

Mining tasks are divided into pattern mining, predictive modeling, and descriptive modeling.
Data mining discovers potential knowledge from large amounts of data, and the user does not need to pose a precise question. OLAP (online analytical processing), by contrast, takes the question raised by the user, extracts detailed information about that question, and presents it to the user.

Classification and regression are collectively known as predictive modeling; the purpose is to build a model that predicts unknown attribute values from known attribute values. When the predicted attribute is categorical, the task is called classification; when it is numeric, it is called regression.

Decision tree classification is a discriminative model that partitions the space into decision regions for each class using a multi-level tree structure.
Decision tree classification has two steps: the first step builds the decision tree from the training set, yielding a decision tree classification model; the second step uses the generated tree to classify input data.

Without considering any input variable, the information needed to determine a sample's class is:
Info(T) = −Σᵢ pᵢ log₂(pᵢ), where the summation index i runs over the classes and pᵢ is the proportion of samples in class i.
Given an input variable X that partitions T into subsets Tᵢ, the information needed to determine the class becomes:
Info(X, T) = Σᵢ (|Tᵢ| / |T|) · Info(Tᵢ), where Tᵢ is the subset of samples taking the i-th value of X and |T| is the total number of samples.
The information gain is Gain(X, T) = Info(T) − Info(X, T). Input variables are ranked by information gain: a variable with larger gain, i.e., smaller remaining entropy, gets higher priority.
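As a minimal illustration, the following Python sketch computes Info(T) and Gain(X, T); the toy attribute/label data is invented for this example:

```python
import math
from collections import Counter

def info(labels):
    """Info(T) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels):
    """Gain(X, T) = Info(T) - sum_i |T_i|/|T| * Info(T_i)."""
    total = len(labels)
    split_info = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        split_info += len(subset) / total * info(subset)
    return info(labels) - split_info

# Toy example: attribute "windy" vs. class "play"
windy = [True, True, False, False, False]
play  = ["no", "no", "yes", "yes", "no"]
print(info(play))              # entropy of the whole set, ~0.971
print(info_gain(windy, play))  # gain from splitting on "windy", ~0.420
```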

Decision tree classification steps:
1. Attribute selection: choose the attribute with the greatest information gain as the root, and split recursively;
2. Pruning: divided into pre-pruning and post-pruning.
• Pre-pruning: reduces training time and test time overhead; reduces the risk of overfitting but increases the risk of underfitting.
• Post-pruning: increases training time overhead but reduces test time overhead; reduces the risk of overfitting while leaving the risk of underfitting essentially unchanged.
Post-pruning is usually better than pre-pruning.
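As a hedged illustration of the two pruning styles, scikit-learn's decision tree exposes both: `max_depth` acts as a pre-pruning constraint that stops growth early, while `ccp_alpha` performs cost-complexity post-pruning on the grown tree. The dataset and parameter values below are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early by limiting depth.
pre_pruned = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: prune the grown tree back via cost-complexity alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```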

Bayesian classification is a classical statistical classification method used to predict the probability that a sample belongs to a particular class.
Bayes' formula: P(B | A) = P(A | B) · P(B) / P(A), or equivalently P(B | A) = P(AB) / P(A). It is widely used in machine learning.
Typical application scenarios: spelling correction, spam filtering, and so on.
If some attribute value never co-occurs with a given class in the training set, direct estimation runs into trouble: the product of probabilities becomes zero, which "erases" the information provided by all the other attributes.

In Bayes' formula, P(B) is called the prior probability.
The naive Bayes classifier assumes that attribute values are mutually independent given the class, so the class-conditional probability factorizes as P(x | c) = Πᵢ P(xᵢ | c).
Using a naive Bayes classifier in practice:
If prediction speed matters most, precompute all the probability estimates in advance and answer prediction requests by table lookup;
if the data changes frequently, do no training in advance and estimate the probabilities lazily when a prediction request arrives;
if the data keeps growing, incrementally correct the probability estimates based on the attribute counts of the new samples.
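A minimal sketch of a categorical naive Bayes classifier (the helper names and toy data are invented for illustration); Laplace (+1) smoothing addresses the zero-probability problem mentioned above:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # key: (attr_index, class)
    domains = defaultdict(set)            # attr_index -> set of seen values
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            value_counts[(i, c)][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains

def predict_nb(x, class_counts, value_counts, domains):
    """Return argmax_c P(c) * prod_i P(x_i | c), computed in log space."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)     # log prior
        for i, v in enumerate(x):
            n_iv = value_counts[(i, c)][v]
            # Laplace smoothing: an unseen value no longer zeroes out
            # the whole product.
            score += math.log((n_iv + 1) / (n_c + len(domains[i])))
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage
samples = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
labels  = ["no", "yes", "yes"]
model = train_nb(samples, labels)
print(predict_nb(("sunny", "hot"), *model))
```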

Bayesian belief networks allow dependencies between attributes. A network is represented by a directed acyclic graph: each node represents a random variable (a state), and each arc represents a probabilistic dependency (a causal relationship between states). Each node is related only to the nodes directly connected to it, not to indirectly connected ones (although indirect correlations can still exist).
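As a small illustration of the factorization such a network encodes, consider the classic textbook Burglary/Earthquake/Alarm example: with arcs Burglary → Alarm ← Earthquake, the joint probability factorizes as P(B, E, A) = P(B) · P(E) · P(A | B, E), so each node needs only a conditional probability table over its direct parents. The probability values below are the usual illustrative ones, not data:

```python
P_B = {True: 0.001, False: 0.999}   # P(Burglary)
P_E = {True: 0.002, False: 0.998}   # P(Earthquake)
P_A = {                              # P(Alarm=True | Burglary, Earthquake)
    (True,  True):  0.95,
    (True,  False): 0.94,
    (False, True):  0.29,
    (False, False): 0.001,
}

def joint(b, e, a):
    """P(B=b, E=e, A=a) = P(b) * P(e) * P(a | b, e)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return P_B[b] * P_E[e] * p_a

print(joint(True, False, True))   # burglary, no earthquake, alarm rings
```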

SVM is short for support vector machine. Its advantages are high accuracy and a concise model description; its drawback is long training time.
SVM is mainly used to solve binary classification problems (the data belong to at most two different classes).
Positive plane: w·x₁ + b = 1;
negative plane: w·x₂ + b = −1;
the distance between the positive plane and the negative plane is the margin M;
x₁ = x₂ + λw (the two points differ by a multiple of the normal vector w);
|x₁ − x₂| = M.
Combining these equations gives M = 2 / ‖w‖, so maximizing the margin amounts to minimizing ‖w‖.

When no optimal linear classification function can be found in the original sample space, consider a nonlinear transformation that turns the nonlinear problem in the original sample space into a linear problem in another, higher-dimensional space.

SVM is better suited to small data sets. For a large data set, the general approach is to decompose the problem into multiple smaller subproblems and then apply SVM to solve each subproblem.
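A quick sketch of the nonlinear-transformation idea, assuming scikit-learn is available (the circular toy data is invented): a linear kernel cannot separate the classes, while an RBF kernel implicitly maps them to a higher-dimensional space where a linear separator exists.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1).astype(int)  # circular boundary

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)
print(linear_svm.score(X, y), rbf_svm.score(X, y))  # rbf should be higher
```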

A neural network has three elements: topology, connection weights, and learning rules.
An artificial neural network consists of a group of interconnected neurons. A neuron can be viewed as a multi-input, single-output information processing unit. Each connection between neurons has an associated weight. A neural network learning algorithm is an iterative optimization process that gradually modifies these weights.

The most popular neural-network-based classification algorithm is the back-propagation algorithm, proposed in the 1980s.
The back-propagation algorithm is divided into the following steps:
1. Initialize the weights;
2. Propagate the input forward;
3. Propagate the error backward.
A neural network model comprises an input layer of units, intermediate layers (also called hidden layers, possibly more than one), and an output layer of units.

Back-propagation learning algorithm: iteratively process a set of training samples, comparing the network's prediction for each sample with the sample's actual class label.
After each iteration, the weights are modified so as to minimize the mean squared error between the network's predictions and the actual values.

Weight initialization:
the network weights are usually initialized to small random numbers (e.g., ranging from −1.0 to 1.0, or from −0.5 to 0.5);
each unit also has a bias, which is likewise initialized to a small random number.
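Putting the three steps together, here is a minimal sketch of back-propagation for a one-hidden-layer network with sigmoid units and squared error (an illustrative formulation with invented toy data, not necessarily the exact one the course figures showed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0], [1.0, 0.0]])   # two toy samples
T = np.array([[1.0], [0.0]])             # target class labels

# Step 1: initialize weights and biases to small random numbers in [-0.5, 0.5]
W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = rng.uniform(-0.5, 0.5, 3)
W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = rng.uniform(-0.5, 0.5, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                  # learning rate

for _ in range(1000):
    # Step 2: forward propagation
    H = sigmoid(X @ W1 + b1)              # hidden layer output
    O = sigmoid(H @ W2 + b2)              # network prediction
    # Step 3: backward error propagation (gradient of squared error)
    err_out = (O - T) * O * (1 - O)
    err_hid = (err_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ err_out; b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid; b1 -= lr * err_hid.sum(axis=0)

print(O.round(2))  # predictions approach the targets [1, 0]
```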

A convolutional neural network (CNN) is a multilayer neural network specialized for machine learning problems related to image processing, especially large images.
Through a series of operations, a convolutional network successively reduces the dimensionality of the enormous data volume in an image recognition problem until the problem becomes trainable.
The layers are: input layer, convolutional layer with activation function, pooling layer, and fully connected layers. The pooling layers are paired with the convolutional layers; several such convolution-pooling groups extract features layer by layer, and the final classification is completed by several fully connected layers.
Convolution: feature extraction on the original input.
Calculation (worked example, figures omitted): the filter slides over the input as a window, and at each position the element-wise products are summed to give one output value. With a sliding step of 2, this computation is repeated until the whole input has been covered. With 2 filters, the generated convolution output has a depth of 2.
① Why does the window slide two cells at a time?
The sliding step is called the stride, denoted S. The smaller S is, the more features are extracted, but S is generally not set to 1, mainly for reasons of time efficiency. S cannot be too large either, or information in the image will be missed.
② When the filter's side length is greater than S, the windows overlap. Overlapping means features in those regions are extracted several times as the window slides; in particular, the central area of the image is sampled more often than the edge portions. What can be done?
The usual approach is to pad the border of the image with zeros. Attentive readers may have noticed that the example above already has one ring of zeros added, i.e., pad = 1; pad = n means n rings of zeros are added.

③ How large is the output feature map of a convolution?
Output side length = (W − F + 2 · pad) / S + 1, where W is the input side length and F is the filter side length.
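A sketch of the size formula together with a naive strided, zero-padded 2D convolution in NumPy (illustrative; the 5×5 input is invented):

```python
import numpy as np

def conv_output_size(W, F, pad, S):
    """(W - F + 2*pad) / S + 1 -- must divide evenly for a valid setup."""
    return (W - F + 2 * pad) // S + 1

def conv2d(image, kernel, pad=1, stride=2):
    W, F = image.shape[0], kernel.shape[0]
    out = conv_output_size(W, F, pad, stride)
    padded = np.pad(image, pad)            # add `pad` rings of zeros
    result = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            window = padded[r*stride:r*stride+F, c*stride:c*stride+F]
            result[r, c] = (window * kernel).sum()
    return result

img = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input
k = np.ones((3, 3))                              # 3x3 filter
print(conv_output_size(5, 3, pad=1, stride=2))   # -> 3
print(conv2d(img, k))                            # 3x3 feature map
```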

Cluster analysis: identify similarities between data items based on their features, and partition the data into classes of similar items.
A data matrix is often called a two-mode matrix, while a dissimilarity matrix is called a one-mode matrix, because the rows and columns of the former represent different entities, whereas the rows and columns of the latter represent the same entities.
Many clustering algorithms operate on a dissimilarity matrix. If the data are given as a data matrix, they must first be converted into a dissimilarity matrix before such algorithms can be applied.
To standardize measurements, one method is to convert the original measurements to unitless values.
The dissimilarity (or similarity) between objects is computed from the distance between them, generally using the Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional objects and q is a positive integer.
When q = 1, d is called the Manhattan distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
When q = 2, d is the Euclidean distance:
d(i, j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
The distance function has the following properties:
1. d(i, j) ≥ 0: distance is a non-negative value.
2. d(i, i) = 0: the distance between an object and itself is 0.
3. d(i, j) = d(j, i): the distance function is symmetric.
4. d(i, j) ≤ d(i, h) + d(h, j): the direct distance from object i to object j is no greater than a route through any other object h (the triangle inequality).
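A one-function sketch of the Minkowski distance and its q = 1 and q = 2 special cases (the sample points are invented):

```python
def minkowski(x, y, q):
    """d(i, j) = (sum_k |x_k - y_k|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))  # Manhattan distance: 7.0
print(minkowski(i, j, 2))  # Euclidean distance: 5.0
```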

A binary variable has only two states: 0 or 1;
0 indicates that the attribute is absent,
1 indicates that the attribute is present.
Comparing two objects i and j over all their binary variables yields four counts:
a = the number of variables that equal 1 for both i and j;
b = the number of variables that equal 1 for i and 0 for j;
c = the number of variables that equal 0 for i and 1 for j;
d = the number of variables that equal 0 for both i and j.
If the two states of a binary variable carry the same weight, the variable is symmetric, i.e., neither value 0 nor 1 takes priority.
The best-known dissimilarity coefficient between two objects i and j is the simple matching coefficient, defined as:
d(i, j) = (b + c) / (a + b + c + d)
If the two states are not equally important, the binary variable is asymmetric. The more important outcome, usually the one that appears with smaller probability, is coded as 1 and the other as 0, and matches where both objects take the 0 state are ignored, giving:
d(i, j) = (b + c) / (a + b + c)

Example: with X = {1,0,0,0,1,0,1,1} and Y = {0,0,0,1,1,1,1,1}, the two objects have the same attribute values at positions 2, 3, 5, 7, and 8, and different values at positions 1, 4, and 6, so the dissimilarity can be taken as 3/8 = 0.375. In general, for symmetric binary variables, the dissimilarity is "the number of positions with different attribute values / the total number of attribute positions".
If we only care about positions where both objects take the value 1, and both taking 0 does not make the two objects more similar, then the dissimilarity is "the number of positions with different values / (the total number of positions − the number of positions where both take 0)", here 3 / (8 − 2) = 0.5; this is called the asymmetric binary dissimilarity.
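Both dissimilarities, computed on the X, Y example above (the helper name is invented):

```python
def binary_dissimilarity(x, y, symmetric=True):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)   # simple matching
    return (b + c) / (a + b + c)           # both-0 matches ignored

X = [1, 0, 0, 0, 1, 0, 1, 1]
Y = [0, 0, 0, 1, 1, 1, 1, 1]
print(binary_dissimilarity(X, Y))                   # 3/8 = 0.375
print(binary_dissimilarity(X, Y, symmetric=False))  # 3/6 = 0.5
```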

A nominal variable is a generalization of a binary variable: it may take more than two state values.
Suppose a nominal variable has M states. The states can be represented by letters, symbols, or a set of integers (e.g., 1, 2, …, M).
The dissimilarity between two objects i and j can be computed by simple matching:
d(i, j) = (p − m) / p
where m is the number of matches, i.e., the number of variables on which i and j take the same value, and p is the total number of variables.
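A matching one-function sketch for nominal variables (example values invented):

```python
def nominal_dissimilarity(x, y):
    """d(i, j) = (p - m) / p, with m the number of matching variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

print(nominal_dissimilarity(["red", "A", 3], ["red", "B", 3]))  # 1/3
```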

Desirable properties of a clustering algorithm: 1. able to handle various types of data; 2. able to handle data distributions of various shapes; 3. requires few input parameters; 4. able to handle noisy data; 5. independent of the order of the input data; 6. produces interpretable and usable results; 7. scalable.
K-means algorithm
Given a data set X of n objects and an integer K (K ≤ n), the K-means method partitions X into K clusters: distances between samples are computed to determine how close they are, and close samples are placed into the same cluster.
K-means clustering is divided into the following steps:
[1] Select K initial cluster center points, referred to as the K means.
[2] Compute the distance between every object and every center point.
[3] Assign each object to the cluster whose center point is nearest to it.
[4] Recompute the center point of each cluster.
[5] Repeat steps 2-4 until the algorithm converges.
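A minimal NumPy sketch following the five steps above (illustrative, with invented toy data):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # [1] pick K initial centers at random from the data
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # [2] distance from every object to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # [3] assign each object to its nearest center
        labels = dists.argmin(axis=1)
        # [4] recompute each cluster's center
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(K)])
        # [5] repeat until convergence
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, K=2)
print(centers)  # roughly (0, 0) and (5, 5)
```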
