Classical machine learning models

Naive Bayes (classification)

\begin{aligned}
P(C|F)=\frac{P(F|C)P(C)}{P(F)}
\end{aligned}

  • \(C\): the class (category)
  • \(F\): the features
  • \(P(C)\): the prior probability
  • \(P(F|C)\): the likelihood
  • \(P(F)\): the normalization term; it is the same for every class, so it can be ignored
  • If all features are assumed to be independent of each other, this becomes:
    \begin{aligned}
    P(C|F) \propto \prod_iP(F_i|C)P(C)
    \end{aligned}
    where \(P(F_i|C)\) and \(P(C)\) can be estimated from counts in the training data; compute this score for every class and pick the class with the highest value (a small sketch of this counting procedure follows the list).
  • "All features are independent of each other" is a very strong assumption that rarely holds in reality, but it greatly simplifies the computation, and studies have shown it has little effect on classification accuracy.

Reference: Ruan Yifeng's blog

Decision tree (classification)

The core algorithm

Select the optimal splitting feature at each node. After every split, the samples in each branch node should belong to the same class as much as possible; that is, the purity of the samples contained in a node keeps increasing.

How do we measure purity?

Entropy

Quantifying information

How should the information content \(H(x)\) be measured?

  1. The information content of an event is related to the probability of the event: \(H(x)\Leftrightarrow\frac{1}{P(x)}\) (rarer events carry more information).
  2. The information content of multiple events is additive: \(H(x_1,x_2)\Leftrightarrow H(x_1)+H(x_2)\)
  3. \(H(x)\ge0\)

A function \(H(x)\) that satisfies these requirements is \(H(x)=\log\frac{1}{P(x)}=-\log P(x)\).

Entropy

Entropy is the mathematical expectation of \(H(x)\) over the distribution \(P(x)\); it represents the uncertainty of a system. The larger the entropy, the greater the uncertainty.
\begin{aligned}
Entropy(X)=E_x[H(x)]=-\sum_xP(x)\log P(x)
\end{aligned}

  • If all elements belong to a single class, entropy is minimal: \(Entropy(X)=0\)
  • If the elements are uniformly distributed across classes, entropy is maximal: \(Entropy(X)=1\) (for two classes, with base-2 logs)
  • When splitting the elements of the system, we want elements of the same class grouped together, so each split should minimize the entropy of the resulting system (a small sketch follows)
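
A minimal sketch (my own illustration, not from the post) of the entropy formula above, using base-2 logarithms so that a uniform two-class split has entropy 1, as in the bullets:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(X) = -sum_x P(x) * log2 P(x), estimated from a list of class labels."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["a"] * 8))              # all elements in one class -> minimum entropy (0)
print(entropy(["a"] * 4 + ["b"] * 4))  # uniform over two classes  -> maximum entropy (1.0)
```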

Information gain

Information Gain (IG)

\begin{aligned}
IG = OriginalEntropy - \sum_i\frac{|A_i|}{|A|}Entropy(A_i)
\end{aligned}
Entropy measures the uncertainty of a system; information gain represents the degree to which a split reduces that uncertainty.

For every split, the larger the information gain, the better.

ID3 algorithm

ID3 uses information gain as the purity measure: compute the information gain of each candidate feature and split on the feature with the largest gain (see the sketch below).
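
A small sketch of ID3-style feature selection by information gain (my own illustration; the toy dataset is invented):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    probs = [c / len(labels) for c in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, feature):
    """IG = Entropy(parent) - sum_i |A_i|/|A| * Entropy(A_i) for one candidate feature."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[feature]].append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Invented toy data: ID3 picks the feature with the largest information gain.
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "stay", "stay", "play"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, "-> split on:", best)     # ID3 would split on "windy" here
```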

Tree splitting

Feature types:

  1. ordinal: discrete and ordered
  2. nominal: discrete and unordered
  3. continuous

The Naive Bayes classifier above is suited to unordered discrete (nominal) features.

Partitioning method:

  1. Multi-way split
    • one branch per distinct value
  2. Binary split
    • values are grouped into two subsets

Discretizing continuous data:

  1. Ordered discretization
    • Static: discretize once at the start
    • Dynamic: equal-width intervals, equal-frequency intervals, or clustering
  2. Binary split
    • Similar to ID3: search for the split point with the largest information gain
    • More computationally expensive

Gini Index

In addition to entropy, there are other measures that can guide how a decision tree is split.
\begin{aligned}
GINI(t)=1-\sum_jP(j|t)^2
\end{aligned}
\(P(j|t)\): the frequency of class \(j\) at node \(t\)

  • If the classes are uniformly distributed (entropy = 1), the Gini index reaches its maximum of 0.5
  • If a single class accounts for all the elements (entropy = 0), the Gini index reaches its minimum of 0

The Gini index is relatively cheap to compute, since no logarithm is needed.

Gini Split

When a node is split into \(k\) partitions, compute the weighted Gini index over the partitions (the analogue of information gain for this split):

\begin{aligned}
GINI_{split}=\sum_{i=1}^k\frac{n_i}{n}GINI(i)
\end{aligned}

  • \(n_i\): the number of elements in the current partition
  • \(n\): the total number of elements in the system

The smaller the \(GINI_{split}\) of a split, the better (a short sketch follows).
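
A minimal sketch (my own, not from the post) of the Gini index and the weighted Gini of a split; the inputs are invented toy labels:

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j P(j|t)^2."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_split(partitions):
    """GINI_split = sum_i n_i/n * GINI(partition_i); smaller is better."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

print(gini(["a", "a", "b", "b"]))                      # 0.5 -> uniform node, maximum for two classes
print(gini(["a", "a", "a", "a"]))                      # 0.0 -> pure node, minimum
print(gini_split([["a", "a", "b"], ["b", "b", "b"]]))  # weighted Gini after a candidate split
```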

Misclassification Error

Misclassification rate

\begin{aligned}
Error(t)=1-\max_iP(i|t)
\end{aligned}

  • \(P(i|t)\): the frequency of class \(i\) at node \(t\)
  • If a single class accounts for all the elements (entropy = 0, Gini index = 0), Error = 0, the minimum
  • If the classes are uniformly distributed (entropy = 1, Gini index = 0.5), Error = 0.5, the maximum (the three measures are compared numerically in the sketch below)
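
For comparison, a tiny sketch (again my own) that computes all three impurity measures on the same node, matching the boundary cases listed in the bullets:

```python
import math
from collections import Counter

def impurities(labels):
    """Return (entropy, Gini index, misclassification error) for one node."""
    total = len(labels)
    probs = [c / total for c in Counter(labels).values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1.0 - sum(p ** 2 for p in probs)
    error = 1.0 - max(probs)
    return entropy, gini, error

print(impurities(["a", "a", "b", "b"]))  # uniform node -> (1.0, 0.5, 0.5)
print(impurities(["a", "a", "a", "a"]))  # pure node    -> all three measures are 0
```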


Training and Test Errors

When should training of the decision tree stop? If it stops too early, the model has not learned the data well; if it stops too late, it overfits.

Tree Ensemble (ensemble learning)

Number of nodes: the complexity of the model

CART regression tree (prediction)

Predicts continuous values (regression).

Given training data \(D=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(N)},y^{(N)})\}\), learning a CART tree amounts to minimizing the loss function

\begin{aligned}
Loss = \min_{j,s}\Big[\min_{C_1}\sum_{x^{(i)}\in R_1(j,s)}L(y^{(i)},C_1)+\min_{C_2}\sum_{x^{(i)}\in R_2(j,s)}L(y^{(i)},C_2)\Big] \newline
where \quad C_m=ave(y^{(i)}|x^{(i)} \in R_m)
\end{aligned}

  • \(R_m\): a region of the partitioned input space
  • \(C_m\): the output value corresponding to region \(R_m\)
  • \(L\): the distance between \(y\) and \(C\)
  • \(j\): the index of the feature variable
  • \(s\): the split point
  • Traverse every feature variable \(j\); for each feature, choose the split point \(s\) that minimizes the sum of distances; finally select the \(j\) and \(s\) corresponding to the best overall result (see the sketch after this list)
  • cost function: the sum of distances between \(y\) and the mean of each region
  • The commonly used distance \(L\) is the \(L_2\) (Euclidean) distance
  • CART trees always split a region into two sub-regions, \(R_1\) and \(R_2\), because the partitioning is applied recursively
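
A sketch of the search over feature \(j\) and split point \(s\) for a single CART regression split, using the squared \(L_2\) distance as the loss (my own illustration; the toy data is invented):

```python
import numpy as np

def best_split(X, y):
    """Search every feature j and split point s, minimizing the summed squared
    distance of y to the mean C_m of each of the two regions R_1, R_2."""
    n, d = X.shape
    best = (None, None, float("inf"))          # (feature j, split point s, loss)
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Toy 1-D regression data: a step function, so the best split is near x = 3.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
j, s, loss = best_split(X, y)
print(f"split feature {j} at s <= {s}, loss {loss:.3f}")
```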

Ensemble learning

Bagging (Bootstrap Aggregating)

  • Given a training set \(D\) of size \(n\), sample uniformly with replacement to draw \(m\) subsets \(D_i\) of size \(n'\) as new training sets. Run a classification or regression algorithm on each of the \(m\) training sets to obtain \(m\) models, then combine them by averaging, majority voting, or similar, to obtain the Bagging result.
  • Each sampled training set is processed (trained on) independently once it has been drawn
  • The procedure is easy to parallelize

  • Bagging can be combined with other classification and regression algorithms; it improves accuracy and stability by reducing the variance of the results, helping to avoid overfitting.

  • Bootstrap is a statistical sampling technique: sampling with replacement from the data itself. (A minimal sketch of Bagging follows.)
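
A minimal Bagging sketch (my own, assuming numpy and scikit-learn are available, with `DecisionTreeRegressor` as a stand-in base model): draw \(m\) bootstrap samples with replacement, fit one model per sample, and average the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def bagging_fit(X, y, m=25):
    """Draw m bootstrap samples (uniform, with replacement, size n) and fit one model on each."""
    models = []
    n = len(X)
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                 # sample with replacement
        models.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the m predictions (for classification one would take a majority vote instead)."""
    return np.mean([mdl.predict(X) for mdl in models], axis=0)

# Invented noisy toy data.
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]))
```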

Random Forest

A random forest is built by applying the Bagging procedure to CART trees: many trees are generated and combined into a forest (a usage sketch follows).
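
In practice one would usually reach for a library; a hedged usage sketch with scikit-learn's `RandomForestClassifier` (assuming scikit-learn is installed; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic toy data, just for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged CART trees in the forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```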

Boosting

AdaBoost

Given training data \(D=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(N)},y^{(N)})\}\), \(x^{(i)}\in R^n\), \(y\in\left\{+1,-1\right\}\). Each \(x^{(i)}\) has a corresponding weight \(w^{(i)}\in R\) (a scalar).

  1. initialization

\begin{aligned}
w=(w^{(1)},w^{(2)},...,w^{(N)})\newline w^{(i)}=\frac{1}{N},\quad i=1,2,...,N
\end{aligned}

  2. For \(k=1,2,\dots,K\), iterate:

\(G_k(x)\) is a weak classifier (e.g. Naive Bayes or a decision tree)

  3. Compute the misclassification rate of each weak classifier

\begin{aligned}
e_k=P(G_k(x^{(i)})\neq y^{(i)}) \newline =\sum_iw^{(i)}I(G_k(x^{(i)})\neq y^{(i)})
\end{aligned}

  • \(I(condition)\): the indicator function; it equals 1 when the condition in brackets holds and 0 otherwise, so the sum here is the total weight of the misclassified samples
  • \(w^{(i)}\): initialized to \(\frac{1}{N}\)
  4. Compute the weight of each weak classifier
    \begin{aligned}
    \alpha_k=\frac{1}{2}\ln\Big(\frac{1-e_k}{e_k}\Big)
    \end{aligned}
  • If a classifier has a relatively large error, we do not want it to receive a large weight
  5. Update the training-data weights \(w^{(k)}\)
    \begin{aligned}
    w_{k+1}^{(i)}=\frac{w_k^{(i)}}{Z_k}\exp(-\alpha_ky^{(i)}G_k(x^{(i)}))
    \end{aligned}
  • \(Z_k\): a normalization factor, so that the updated weights \(w_{k+1}^{(i)}\) sum to 1
    \begin{aligned}
    Z_k=\sum_{i=1}^Nw_k^{(i)}\exp(-\alpha_ky^{(i)}G_k(x^{(i)}))
    \end{aligned}
  6. Obtain the final result
    \begin{aligned}
    f(x^{(i)})=\sum_{k=1}^K\alpha_kG_k(x^{(i)})
    \newline
    F(x^{(i)})=sign(f(x^{(i)})) \quad \text{(final classification result)}
    \end{aligned}
  • \(sign()\): outputs +1 for inputs greater than 0 and -1 for inputs less than 0

To summarize, Boosting trains the weak classifiers one after another: 1) initialize the data weights; 2) after each weak classifier, use its error to recompute the data weights — the weights of misclassified samples increase and the weights of correctly classified samples decrease; 3) feed the data to be classified into every weak classifier and combine the classifiers' outputs, weighted by their \(\alpha_k\), to obtain the final result. A compact sketch of this loop follows.
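
A compact sketch of the AdaBoost loop above (my own illustration; it uses single-feature threshold stumps as the weak classifiers \(G_k\), and the toy data is invented):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak classifier G_k: the single feature/threshold/sign with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)                  # (feature j, threshold, sign, error e_k)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = w[pred != y].sum()        # e_k = sum_i w_i * I(G_k(x_i) != y_i)
                if err < best[3]:
                    best = (j, t, sign, err)
    return best

def adaboost(X, y, K=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                     # 1) initialize w_i = 1/N
    F = np.zeros(N)
    for _ in range(K):                          # 2) iterate k = 1..K
        j, t, sign, e_k = fit_stump(X, y, w)    # 3) weak classifier and its weighted error
        alpha = 0.5 * np.log((1 - e_k) / max(e_k, 1e-12))   # 4) classifier weight alpha_k
        pred = np.where(X[:, j] <= t, sign, -sign)
        w = w * np.exp(-alpha * y * pred)       # 5) re-weight data: misclassified points grow
        w /= w.sum()                            #    Z_k normalization
        F += alpha * pred                       # accumulate f(x) = sum_k alpha_k G_k(x)
    return np.sign(F)                           # 6) final classification F(x) = sign(f(x))

# Toy data: labels in {+1, -1}, separable on the first feature.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, 1, -1, -1, -1])
print(adaboost(X, y))                           # recovers the labels [ 1.  1.  1. -1. -1. -1.]
```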

Gradient Boosting

The Boosting idea: keep what is already classified correctly, and let the next classifier focus on the points that were misclassified; each round adjusts for the residuals.

Gradient Boosting: each new model is built along the gradient-descent direction of the loss function of the model built so far.

The gradient of the loss function is taken in function space (rather than in parameter space).

  1. Given training data \(D=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(N)},y^{(N)})\}\), we want each new weak learner to fit the residual of the previous model

\begin{aligned}
F_{m+1}(x)=F_{m}(x)+h(x)
\end{aligned}
Our goal is to learn an \(h(x^{(i)})\) such that:

\begin{aligned}
F_{m+1}(x^{(i)})=F_{m}(x^{(i)})+h(x^{(i)})=y^{(i)}
\end{aligned}

  2. Supervised learning: the goal is to find an approximation \(\widehat F(x)\) of the function \(F(x)\) by minimizing the expected value of the loss function

\begin{aligned}
\widehat F(x)=argminE_{x,y}[L(y,F(x))]
\newline
F(x)=\sum_{i=1}^M \alpha_ih_i(x)
\newline
F_0(x)=argmin_C\sum_{i=1}^NL(y^{(i)},C)
\newline
F_m(x)=F_{m-1}(x)+argmin_{h_m}\sum_{i=1}^NL(y^{(i)},F_{m-1}(x^{(i)})+h_m(x^{(i)}))
\end{aligned}

  • \(argmin\): the value of the variable that minimizes the expression that follows
  • \(F_0(x)\): for a CART tree this is the starting point; a constant \(C\) can be learned for it
  • Starting from \(F_0(x)\), models that fit the residual error are then added on top
  3. Learn the residuals

\begin{aligned}
r_m^{(i)}=-[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}]
\end{aligned}

  • The negative gradient is the direction of improvement
  4. Learn a decision tree to fit the residuals
    Learn a CART tree that fits the residuals (a short sketch follows below)
    \begin{aligned}
    C_{mj}=argmin_C\sum_{x^{(i)}\in R_{mj}}L(y^{(i)},F_{m-1}(x^{(i)})+C)
    \end{aligned}
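
A small gradient-boosting sketch (mine, not from the post) for the squared-error loss, where the negative gradient is simply the residual \(y-F_{m-1}(x)\); it assumes scikit-learn's `DecisionTreeRegressor` as the weak learner \(h_m\), and the data is a toy example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, lr=0.1):
    """Squared-error gradient boosting: each new tree h_m fits the current residuals,
    which are the negative gradient of 0.5*(y - F(x))^2 with respect to F(x)."""
    base = y.mean()                              # F_0(x): the best constant C
    F = np.full(len(y), base)
    trees = []
    for _ in range(M):
        residual = y - F                         # r_m = -(dL/dF) for squared error
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        F = F + lr * tree.predict(X)             # F_m = F_{m-1} + lr * h_m (shrinkage)
        trees.append(tree)
    return base, trees, lr

def gb_predict(model, X):
    base, trees, lr = model
    return base + lr * np.sum([t.predict(X) for t in trees], axis=0)

# Invented toy regression data: fit a sine curve.
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X[:, 0])
model = gradient_boost(X, y)
print(np.abs(gb_predict(model, X) - y).mean())   # small training error after boosting
```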


Source: www.cnblogs.com/ColleenHe/p/11564768.html