Sichuan University Software College|Data Mining Course|Final Review

basic concept

data mining

The process of exploring useful patterns or knowledge from data sources.

machine learning

Machine learning is the study of computer algorithms that can automatically improve through experience, using data or past experience to optimize the performance standards of computer programs.

Supervised learning & unsupervised learning

Supervised learning derives predictive functions from labeled training data. Labeled training data means that each training instance includes input and desired output.

Given data, predict labels

Unsupervised learning infers conclusions from unlabeled training data. The most typical unsupervised learning is cluster analysis, which can be used to discover hidden patterns or group data during the exploratory data analysis stage.

Given data, find hidden structures/features

Classification & Regression

Regression problem: predict a value

Such as: predicting house prices, future weather conditions, etc.

Classification problem: assign a label to something; the result is usually a discrete value

For example: Determine whether the animal in a picture is a cat or a dog

Classification is usually built on top of regression, and the last layer of a classifier usually uses the softmax function to determine the category an input belongs to.

Association rules, sequence patterns

Association rule: an important type of rule contained in data

Sequence pattern: given a set of sequences, where each sequence is an ordered list of elements and each element consists of items

Support of association rules and support of sequence patterns

The support of a rule $X \rightarrow Y$ is the percentage of transactions in $T$ that contain $X \cup Y$:

$$support = \frac{(X \cup Y).count}{n}$$

Sequence pattern support: the proportion of sequences in the database that contain the given sequence

Confidence of association rules

The confidence of a rule $X \rightarrow Y$ is the percentage of transactions containing $X$ that also contain $Y$:

$$Confidence = \frac{(X \cup Y).count}{X.count}$$
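A minimal Python sketch of these two measures; the transaction data, item names, and helper functions below are invented for illustration and are not part of the course material:

```python
# Support and confidence of an association rule over a small transaction list.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset.issubset(t))
    return count / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

transactions = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}, {"bread"}]
print(support({"milk", "bread"}, transactions))        # 0.5
print(confidence({"milk"}, {"bread"}, transactions))   # 0.666...
```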

Classifier accuracy

$$Accuracy = \frac{correct}{total}$$

Classifier precision & recall

Confusion matrix

|  | Classified as positive | Classified as negative |
| --- | --- | --- |
| Actually positive | TP | FN |
| Actually negative | FP | TN |

Precision

Precision = number of correctly classified positive examples / number of examples classified as positive

$$p = \frac{TP}{TP+FP}$$

Recall

Recall = number of correctly classified positive examples / actual number of positive examples

$$r = \frac{TP}{TP+FN}$$

F-score

The F-score is the harmonic mean of precision ($p$) and recall ($r$)

Improvements in recall are often achieved at the expense of precision, and vice versa; the F-score is therefore used to balance the two

To make the F-score high, both $p$ and $r$ need to be high

$$F = \frac{2}{\frac{1}{p} + \frac{1}{r}} = \frac{2pr}{p+r}$$
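A small Python sketch (with assumed example counts for TP, FP, FN, TN) showing how accuracy, precision, recall, and F-score follow from the confusion-matrix entries above:

```python
# Assumed example counts, not real data.
TP, FP, FN, TN = 40, 10, 20, 30

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean of p and r

print(accuracy, precision, recall, f_score)  # 0.7 0.8 0.666... 0.727...
```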

Overfitting & Underfitting

overfitting

  • The model is more complex than the actual problem requires; it performs well on the training set but poorly on the test set.
  • The model fits noise instead of the patterns behind the data and generalizes poorly.
Prevent overfitting
  • Parametric regularization methods that impose explicit constraints, such as L1/L2 regularization
  • Methods that lower the generalization error through engineering techniques, such as early stopping and Dropout
  • Implicit regularization methods that do not impose constraints directly, such as data augmentation, controlling model complexity, reducing the number of features, etc.

Underfitting

  • The model cannot achieve a low enough error on the training set
  • The model's complexity is too low; it performs poorly even on the training set and cannot learn the patterns behind the data.

K-fold cross validation

Data preparation: split the data set into $K$ disjoint, equally sized subsets

$K$ iterations (a minimal split sketch follows this list):

  1. Test set: the $k$-th subset
  2. Training set: the remaining $K-1$ subsets
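A minimal sketch of such a split, assuming the data set is represented by its sample indices and the sample count is divisible by $K$ (pure Python, no libraries):

```python
# Yield (train_indices, test_indices) for each of the k folds.
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size = n_samples // k           # remainder samples are ignored in this simple sketch
    for i in range(k):
        test = indices[i * fold_size : (i + 1) * fold_size]
        train = indices[: i * fold_size] + indices[(i + 1) * fold_size :]
        yield train, test

for train_idx, test_idx in k_fold_splits(n_samples=10, k=5):
    print(test_idx)   # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```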

Ensemble learning

Supervised learning algorithms are often described as performing the task of searching a hypothesis space to find a suitable hypothesis that will make good predictions for a specific problem. Even if the hypothesis space contains hypotheses that are well suited to a particular problem, it can be difficult to find a good hypothesis. Ensemble learning combines multiple hypotheses to form a (hopefully) better hypothesis.

clustering

Clustering divides a data set into different classes or clusters according to some criterion (such as distance), so that the similarity of data objects within the same cluster is as large as possible, while the dissimilarity of data objects in different clusters is also as large as possible.

Basic principles issues

distance measure

Minkowski distance

$$dist(\vec{x_i}, \vec{x_j}) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ir} - x_{jr}|^h\right)^{\frac{1}{h}}$$

Euclidean distance

$$dist(\vec{x_i}, \vec{x_j}) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2}$$

Manhattan distance

$$dist(\vec{x_i}, \vec{x_j}) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ir} - x_{jr}|$$

Weighted Euclidean distance

Each attribute has a weight that represents its importance relative to the other attributes:

$$dist(\vec{x_i}, \vec{x_j}) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_r (x_{ir} - x_{jr})^2}$$

Squared Euclidean distance

Places progressively greater weight on data points that are farther apart:

$$dist(\vec{x_i}, \vec{x_j}) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2$$

Chebyshev distance

The Chebyshev distance is appropriate when two data points should be considered "different" as soon as they differ on any one attribute:

$$dist(\vec{x_i}, \vec{x_j}) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \cdots, |x_{ir} - x_{jr}|)$$
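A NumPy sketch of the distance measures above; `x` and `y` are two $r$-dimensional points and all names are chosen for this example:

```python
import numpy as np

def minkowski(x, y, h):
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))       # Minkowski with h = 2

def manhattan(x, y):
    return np.sum(np.abs(x - y))               # Minkowski with h = 1

def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

def squared_euclidean(x, y):
    return np.sum((x - y) ** 2)

def chebyshev(x, y):
    return np.max(np.abs(x - y))

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))   # 3.605..., 5.0, 3.0
```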

similarity measure

cosine similarity

By calculating the cosine of the angle between two vectors:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \times ||\vec{y}||}$$

The value range of cosine similarity is $[-1, 1]$; the larger the value, the more similar the two vectors are.
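A short NumPy sketch of this formula (the example vectors are made up):

```python
import numpy as np

def cosine_similarity(x, y):
    # dot product divided by the product of the Euclidean norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.7071
```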

Jaccard

The Jaccard index is used to measure the similarity in a sample set.

Suppose two data points $\vec{x}$ and $\vec{y}$ each consist of Boolean attributes; then:

  • Let $a$ be the number of attributes on which both data points have the value 1
  • Let $b$ be the number of attributes on which $x_i = 1$ and $y_i = 0$
  • Let $c$ be the number of attributes on which $x_i = 0$ and $y_i = 1$
  • Let $d$ be the number of attributes on which both data points have the value 0

Then the Jaccard distance is:

$$dist(\vec{x}, \vec{y}) = \frac{b + c}{a + b + c}$$
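A minimal Python sketch of the Jaccard distance using the $a/b/c/d$ counts defined above (the Boolean vectors are invented):

```python
def jaccard_distance(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    # d (both attributes 0) is not used by the Jaccard measure
    return (b + c) / (a + b + c)

print(jaccard_distance([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # 2 / 4 = 0.5
```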

Association rule mining - Apriori

Downward closure property

If an itemset satisfies a minimum support requirement, then every non-empty subset of that itemset also satisfies the minimum support.
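As an illustration of how Apriori exploits this property, here is a hedged sketch of the candidate-pruning step: a candidate $k$-itemset is kept only if every one of its $(k-1)$-subsets was found frequent in the previous pass. The itemsets and helper name are made up for the example:

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep a candidate k-itemset only if every (k-1)-subset is in frequent_prev."""
    pruned = []
    for c in candidates:
        if all(frozenset(s) in frequent_prev for s in combinations(c, len(c) - 1)):
            pruned.append(c)
    return pruned

frequent_2 = {frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})}
candidates_3 = [frozenset({"A", "B", "C"}), frozenset({"A", "B", "D"})]
print(prune_candidates(candidates_3, frequent_2))   # only {A, B, C} survives
```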

Sequential covering of rules; using association rules for classification

decision tree

Basic idea

Each internal node of the decision tree represents a test of a certain attribute, each edge represents a test result, and the leaf nodes represent a certain class or the distribution of the class.

The decision process starts from the root node of the tree: the data to be classified is compared with the attribute tested at each internal node, and the branch matching the comparison result is followed, until a leaf node is reached, which gives the final decision.

Termination condition
  1. The samples contained in the current node all belong to the same category and do not need to be divided

  2. The current attribute set is empty, or all samples have the same value on all attributes, so no further split is possible.

    Mark the current node as a leaf node; its category is the class with the most samples at that node.

  3. The sample set contained in the current node is empty, so no split is possible.

    Mark the current node as a leaf node; its category is the class with the most samples in its parent node.

Reasons for pruning

The purpose of pruning is to avoid overfitting of the decision tree model .

Because the decision tree algorithm keeps splitting nodes during learning in order to classify the training samples as accurately as possible, the tree ends up with too many branches (i.e. the model becomes too complex), which leads to overfitting.

pruning method
pre-pruning

Pre-pruning: while the decision tree is being constructed, each node is evaluated before it is split. If splitting the current node does not improve the generalization performance of the decision tree, the node is not split and is marked as a leaf node.

post-pruning

Post-pruning: first construct the entire decision tree, then examine the non-leaf nodes from the bottom up. If replacing the subtree rooted at a node with a leaf node improves generalization performance, replace that subtree with a leaf node.

How to deal with continuous attributes

Bi-partition (dichotomy): a candidate threshold is chosen and the continuous attribute is split into two intervals

information gain

Entropy: used to describe "the degree of chaos in a system"

The higher a system's information entropy, the more disordered it is; the lower the entropy, the more ordered it is; and the higher the entropy, the more information is needed to make it ordered.

Suppose a variable $x$ takes the values $\{x_1, x_2, \cdots, x_i, \cdots, x_n\}$ with corresponding probabilities $\{p_1, p_2, \cdots, p_i, \cdots, p_n\}$; then the entropy is:

$$entropy(x) = - \sum_{i=1}^{n} p_i \log_2(p_i)$$

Information gain (Info-Gain) is the reduction in entropy (it measures how much the disorder of the information decreases):

$$Gain(S, A) = entropy(S) - entropy(A)$$

where $S$ and $A$ denote the data division before and after the split, respectively.
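A small NumPy sketch of entropy and information gain on a made-up attribute/label pair; here the "entropy after the split" is taken as the size-weighted entropy of the subsets induced by the attribute:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """Entropy before the split minus the weighted entropy after splitting on `attribute`."""
    before = entropy(labels)
    after = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

attribute = np.array(["sunny", "sunny", "rainy", "rainy"])
labels    = np.array(["yes", "no", "no", "no"])
print(entropy(labels), information_gain(attribute, labels))  # ≈ 0.811, ≈ 0.311
```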

advantage
  • Performs better on discrete data with few distinct values
  • The generated classification rules are easy to understand and the accuracy is high
shortcoming
  • Prefers attributes with more values
  • While building the tree, the data set has to be scanned and sorted many times, which makes the algorithm inefficient
  • Only suitable for data sets that fit in memory; when the training set is too large for memory, the program cannot run

Naive Bayes

Basic idea

For a given item to be classified, compute the probability of each category given that this item appears; the category with the largest probability is taken as the category of the item.

$${\rm Pr}(C=c_j \mid A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|}) = \frac{{\rm Pr}(C=c_j)\,{\rm Pr}(A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|} \mid C=c_j)}{{\rm Pr}(A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|})} = \frac{{\rm Pr}(C=c_j)\,{\rm Pr}(A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|} \mid C=c_j)}{\displaystyle\sum_{k=1}^{|C|} {\rm Pr}(C=c_k)\,{\rm Pr}(A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|} \mid C=c_k)}$$

According to the conditional independence assumption , we get:

$${\rm Pr}(A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|} \mid C=c_j) = \prod_{i=1}^{|A|} {\rm Pr}(A_i=a_i \mid C=c_j)$$

Furthermore, the core formula of the Naive Bayes classification algorithm is obtained :

$${\rm Pr}(C=c_j \mid A_1=a_1, A_2=a_2, \cdots, A_{|A|}=a_{|A|}) = \frac{{\rm Pr}(C=c_j)\,\displaystyle\prod_{i=1}^{|A|} {\rm Pr}(A_i=a_i \mid C=c_j)}{\displaystyle\sum_{k=1}^{|C|} {\rm Pr}(C=c_k)\,\prod_{i=1}^{|A|} {\rm Pr}(A_i=a_i \mid C=c_k)}$$

Then, the prior probability ${\rm Pr}(C=c_j)$ and the conditional probability ${\rm Pr}(A_i=a_i \mid C=c_j)$ need to be estimated from the training data:

$${\rm Pr}(C=c_j) = \frac{\text{number of examples belonging to class } c_j}{\text{total number of examples in the data set}}$$

$${\rm Pr}(A_i=a_i \mid C=c_j) = \frac{\text{number of examples with } A_i = a_i \text{ that belong to class } c_j}{\text{number of examples belonging to class } c_j}$$

Finally, given a test example, the most likely class is determined by computing:

$$c = \arg\max_{c_j} \ {\rm Pr}(C=c_j) \prod_{i=1}^{|A|} {\rm Pr}(A_i=a_i \mid C=c_j)$$

independence assumption

" Data samples are independent and identically distributed " refers to the independence between samples , and the " feature conditional independence assumption " refers to the independence between the internal features of each sample . The two independence assumptions make the model simple and the calculation simple. , although it sacrifices some accuracy, it is worth it in certain usage scenarios.

Data samples are independently and identically distributed

Independent and identical distribution between data samples means that there is no dependency between individual sample points and no temporal relationship (or the temporal relationship is unimportant), and that the samples are obtained by repeatedly sampling from the same distribution.

If the data do not come from a single distribution but are generated by several distributions, the result is a mixture model, the typical example being the Gaussian mixture model; if the samples are not independent and there is some relationship between them, then this dependence has to be built into the model, for example, several consecutive days of weather can be modeled with a Markov network.

characteristic conditional independence assumption

Assume a sample $x$ has $n$ features; the feature conditional independence assumption means that these features are independent of each other given a specific category.

Zero-probability estimates

Add a small correction to all probability estimates: let $n_{ij}$ be the number of training examples with both $A_i = a_i$ and $C = c_j$, and let $n_j$ be the number of training examples with $C = c_j$; the corrected estimate is:

$${\rm Pr}(A_i=a_i \mid C=c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda m_i}$$

where $m_i$ is the number of possible values of $A_i$ and $\lambda$ is a smoothing factor, generally set to $\lambda = 1/n$ ($n$ is the total number of training examples); when $\lambda = 1$, the Laplace (add-one) estimate is obtained.
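A minimal sketch of a categorical naive Bayes classifier that puts the pieces above together: priors and conditional probabilities estimated from counts, the $\lambda$-smoothed conditional estimate, and the arg-max prediction. The toy data, attribute values, and function names are invented:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y, lam=1.0):
    n = len(y)
    class_counts = Counter(y)              # n_j for each class
    value_counts = defaultdict(Counter)    # n_ij for (attribute index, class) -> value counts
    attr_values = defaultdict(set)         # possible values of each attribute (gives m_i)
    for xi, ci in zip(X, y):
        for i, a in enumerate(xi):
            value_counts[(i, ci)][a] += 1
            attr_values[i].add(a)
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, value_counts, class_counts, attr_values, lam

def predict(x, model):
    priors, value_counts, class_counts, attr_values, lam = model
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, a in enumerate(x):
            m_i = len(attr_values[i])
            # smoothed estimate (n_ij + lambda) / (n_j + lambda * m_i)
            score *= (value_counts[(i, c)][a] + lam) / (class_counts[c] + lam * m_i)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "yes", "yes", "no"]
model = train_naive_bayes(X, y)           # lam=1 corresponds to Laplace smoothing
print(predict(("sunny", "mild"), model))  # -> "yes"
```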

Multinomial Naive Bayes - text classification

Support vector machine (SVM)

Suppose there is a hyperplane $\vec{W} \cdot \vec{x} + b = 0$ in the space, where $\vec{W}$ is the normal vector; then the perpendicular Euclidean distance from a point $\vec{x_i}$ to the hyperplane $\vec{W} \cdot \vec{x} + b = 0$ is:

$$\frac{|\vec{W} \cdot \vec{x_i} + b|}{||\vec{W}||}$$

where $||\vec{W}||$ is the Euclidean norm of $\vec{W}$.

SVM maximizes the margin between positive and negative examples $\rightarrow$ improves model robustness

For positive examples, the hyperplane $H_+$ on which the support vectors lie is $\vec{W} \cdot \vec{x} + b = 1$; for negative examples, the hyperplane $H_-$ is $\vec{W} \cdot \vec{x} + b = -1$

Support vectors: the points satisfying $y_i(\vec{W} \cdot \vec{x_i} + b) - 1 = 0$

Define $d_+$ and $d_-$ as the shortest distances from the hyperplane $\vec{W} \cdot \vec{x} + b = 0$ to the closest positive and negative examples; then $d_+ = d_- = \frac{1}{||\vec{W}||}$, so $margin = d_+ + d_- = \frac{2}{||\vec{W}||}$

Maximizing the margin $\iff$ minimizing $||\vec{W}||^2 / 2$ $\iff$ minimizing $\vec{W}^T \vec{W} / 2$

subject to $y_i(\vec{W} \cdot \vec{x_i} + b) \ge 1, \ i = 1, 2, \cdots, n$, that is:

$$\vec{W} \cdot \vec{x_i} + b \ge 1 \ \text{when} \ y_i = 1, \qquad \vec{W} \cdot \vec{x_i} + b \le -1 \ \text{when} \ y_i = -1$$

so the problem can be solved with the standard Lagrange multiplier method:

$$L_p = \frac{1}{2} ||\vec{W}||^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\vec{W} \cdot \vec{x_i} + b) - 1 \right]$$

where $\alpha_i \ge 0$ are the Lagrange multipliers.
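A NumPy sketch of the geometry above: the distance of a point to the hyperplane $\vec{W} \cdot \vec{x} + b = 0$ and the margin $2/||\vec{W}||$. The values of `W` and `b` are assumed for illustration, not obtained by actually solving the optimization problem:

```python
import numpy as np

W = np.array([2.0, 0.0])   # assumed normal vector
b = -2.0                   # assumed bias

def distance_to_hyperplane(x, W, b):
    # perpendicular Euclidean distance |W·x + b| / ||W||
    return abs(np.dot(W, x) + b) / np.linalg.norm(W)

margin = 2.0 / np.linalg.norm(W)
print(distance_to_hyperplane(np.array([3.0, 1.0]), W, b))  # 2.0
print(margin)                                              # 1.0
```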

Dual problem calculation

$$L = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\vec{x_i} \cdot \vec{x_j}) - \sum_{i=1}^{N} \alpha_i$$

$$\vec{w}^* = \sum_{i=1}^{N} \alpha_i^* y_i \vec{x_i}, \qquad b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (\vec{x_i} \cdot \vec{x_j})$$

Kernel function

Simplify calculations: kernel functions compute inner products in a high-dimensional feature space without explicitly mapping the data into that space

KNN

When predicting a new value $x$, judge which category $x$ belongs to according to the categories of the $K$ nearest points.
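A minimal NumPy sketch of this prediction rule (the training data and labels are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the K nearest points
    votes = Counter(y_train[i] for i in nearest)      # majority vote among their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = ["a", "a", "a", "b", "b"]
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))  # "a"
```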

advantage
  1. Simple and easy to use
  2. Model training time is fast (KNN algorithm is lazy )
  3. Good prediction effect
  4. Not sensitive to outliers
shortcoming
  1. High memory requirements (because the algorithm stores all training data )
  2. The prediction phase can be slow
  3. Sensitive to irrelevant features and data size

K-means clustering

Given a sample set $D = \{\vec{x_1}, \vec{x_2}, \cdots, \vec{x_m}\}$, the "$k$-means" algorithm minimizes the squared error of the resulting clustering $C = \{C_1, C_2, \cdots, C_k\}$:

$$E = \sum_{i=1}^{k} \sum_{\vec{x} \in C_i} ||\vec{x} - \vec{\mu_i}||_2^2$$

where $\vec{\mu_i} = \frac{1}{|C_i|} \sum_{\vec{x} \in C_i} \vec{x}$ is the mean vector of cluster $C_i$.

Select k initial "mean vector centers"
repeat
	1. Assign each point in the data set to the nearest "mean vector center" (Euclidean distance is generally used as the metric)
	2. Recompute the "mean vector center" of each cluster
until no data point is reassigned to a different cluster | no cluster center changes any more | the squared error reaches a local minimum
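A NumPy sketch of the same loop, with a fixed iteration cap standing in for the full set of termination conditions (the data and random seed are arbitrary):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k initial mean vectors
    for _ in range(n_iter):
        # 1. assign every point to the nearest center (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # 2. recompute the mean vector of each cluster (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # no cluster center changes any more
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centers = k_means(X, k=2)
print(centers)   # roughly one center near (0, 0) and one near (5, 5)
```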
advantage
  1. concise
  2. Efficient
shortcoming
  1. Only applicable to data sets where a "mean" can be defined; difficult to apply to categorical data
  2. The user needs to specify the number of clusters $k$ in advance
  3. Very sensitive to outliers

hierarchical clustering

Merged (bottom-up) clustering
  • Start: the lowest level of the tree diagram
  • Process: at each step, the most similar (closest) clusters of the previous layer are merged to form the clusters of the next layer
  • Termination: Stop when all data points are merged into one cluster (root node clustering)

The similarity between data points or clusters is determined by computing the distance between them (the smaller the distance, the higher the similarity), and the two closest data points or clusters are merged at each step, producing a clustering tree.

Split (top-down) clustering
  • Start: a cluster (root) containing all data points
  • Process: Split the root node cluster into some sub-clusters, and each sub-cluster continues to split recursively.
  • Termination: Include only one data point in each cluster
Hierarchical clustering method (see the sketch after this list)
  1. single link method
    1. Distance between two clusters: the distance between the two closest data points in the two clusters
    2. Each step merges the two clusters whose closest data points are at the smallest distance
    3. Sensitive to noisy data: may cause a chaining effect that forms long chains
  2. complete link (full link) method
    1. Distance between two clusters: the maximum of the pairwise distances between all data points in the two clusters
    2. Each step merges the two clusters whose farthest data points are at the smallest distance
    3. Sensitive to outliers
  3. average link method
    1. A compromise between the complete-link method's sensitivity to outliers and the single-link method's tendency to form long chains
    2. Distance between two clusters: the average of the pairwise distances between data points in the two clusters
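Assuming SciPy is available, the three linkage criteria can be tried with `scipy.cluster.hierarchy.linkage`; the sketch below is only an illustration on made-up points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # the bottom-up merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into at most 3 clusters
    print(method, labels)
```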

Neural Networks

Ensemble classifier

Bagging

Bagging uses bootstrap sampling to obtain data subsets for training the base learners.

Usually classification tasks are integrated by voting, while regression tasks are integrated by averaging.

The variance of the results obtained after bagging is smaller .

Specific process
  1. Training sets are drawn from the original sample set. In each round, n training samples are drawn from the original set using bootstrap sampling (within a training set some samples may be drawn several times while others may never be drawn). A total of k rounds are performed, giving k training sets (the k training sets are independent of each other).
  2. Each training set is used to train one model, so the k training sets yield k models. (Note: no specific classification or regression algorithm is fixed here; depending on the problem we can use different methods, such as decision trees, perceptrons, etc.)
  3. Obtain the result (all models carry equal importance), as in the sketch after this list:
    1. For classification problems: the k models obtained in the previous step vote to produce the classification result.
    2. For regression problems: the mean of the models' outputs is taken as the final result.
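A hedged sketch of this process for a classification problem, assuming scikit-learn's `DecisionTreeClassifier` as the base learner and synthetic data; each tree is trained on a bootstrap sample and the final prediction is a majority vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

k = 10
models = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (drawn with replacement)
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([m.predict(X) for m in models]) # each model votes with equal importance
y_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote
print((y_pred == y).mean())                      # training accuracy of the ensemble
```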
Boosting

Boosting refers to a family of algorithms that convert weak learners into a strong learner.

The training method uses weighted data: samples misclassified in earlier rounds are given greater weight in later rounds of training.

The bias of the results obtained after boosting is smaller .

Specific process
  1. The base models are combined linearly through an additive model.
  2. Each round of training increases the weights of base models with low error rates and decreases the weights of models with high error rates.
  3. The weights (or probability distribution) of the training data are changed in each round: the weights of the samples misclassified by the weak classifier in the previous round are increased and the weights of the correctly classified samples are decreased, so that the classifier focuses on the misclassified data (a re-weighting sketch follows this list).
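A minimal sketch of the re-weighting step in the AdaBoost style, which is one common concrete instance of the scheme above (labels are in $\{-1, +1\}$ and the example arrays are invented):

```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """Return the model weight alpha and the renormalized sample weights for the next round."""
    err = np.sum(w * (y_true != y_pred)) / np.sum(w)   # weighted error rate (assumed 0 < err < 1)
    alpha = 0.5 * np.log((1 - err) / err)              # low error rate -> large model weight
    w = w * np.exp(-alpha * y_true * y_pred)           # up-weight mistakes, down-weight correct samples
    return alpha, w / w.sum()

w = np.ones(4) / 4
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])                     # one misclassified sample (index 1)
alpha, w = update_weights(w, y_true, y_pred)
print(alpha, w)   # the misclassified sample now carries the largest weight
```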
The difference between Bagging and Boosting
Sample selection

Bagging : The training set is selected with replacement from the original set , and each training set selected from the original set is independent.

Boosting : The training set in each round remains unchanged , but the weight of each sample in the training set in the classifier changes. The weights are adjusted based on the classification results of the previous round.

Sample weight

Bagging : Use uniform sampling, with equal weight for each sample

Boosting : Continuously adjust the weight of the sample according to the error rate . The greater the error rate, the greater the weight.

prediction function

Bagging : All prediction functions are equally weighted .

Boosting : Each weak classifier has a corresponding weight , and classifiers with small classification errors will have greater weights .

parallel computing

Bagging : Each prediction function can be generated in parallel

Boosting: The prediction functions can only be generated sequentially, because each model's parameters depend on the results of the previous round of models.

semi-supervised learning

In many practical problems, labeled samples and unlabeled samples often exist at the same time, and there are more unlabeled samples, while there are relatively few labeled samples.

Semi-supervised learning: guided by a small number of labeled samples, it makes full use of a large number of unlabeled samples to improve learning performance. This avoids wasting data resources and, at the same time, addresses both the weak generalization of supervised learning when labeled samples are scarce and the inaccuracy of unsupervised learning when no label guidance is available.

Semi-supervised algorithms can be divided into two categories: transductive and inductive.

Transductive semi-supervised learning involves only a labeled sample set and a test sample set, and the test samples are also the unlabeled samples. A transductive semi-supervised algorithm first treats the test samples as unlabeled samples, then trains the model using both labeled and unlabeled samples, predicting the unlabeled samples during training. Therefore a transductive algorithm can only handle the current unlabeled samples (the test samples) and cannot directly extrapolate beyond them; for new test samples, the model has to be retrained to predict their labels.

Inductive semi-supervised algorithms use an independent test sample set in addition to the labeled and unlabeled sample sets, and can handle samples from the entire sample space. An inductive algorithm trains a learning model on the labeled and unlabeled samples; this model can not only predict the labels of the unlabeled training samples but also directly predict the labels of new test samples.
