Follow a unified framework for understanding SVM machine learning

I. Introduction

  1. This blog is not meant as a popular-science blog; it only records my own ideas and thought processes. You are welcome to point out my blind spots, but I hope everyone forms their own understanding.
  2. This post draws on course lectures and on the explanation of SVM in Li Hang's Statistical Learning Methods.

II. Understanding

A unified machine learning framework:

1. Model
2. Strategy (Loss)
3. Algorithm

Under this framework, the core of SVM is the hinge loss together with the kernel method.

SVM: Hinge Loss + Kernel Method

Model

Given a data set \((x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots, (x^n, \hat{y}^n)\), where \(\hat{y}^i \in \{1, -1\}\), and the linear function:
\[f(x) = w^Tx + b\]

\[y = \begin{cases} 1, & f(x) > 0 \\ -1, & f(x) < 0 \end{cases}\]

At the same time: when \(\hat{y} = 1\), the larger \(f(x)\) the better; when \(\hat{y} = -1\), the smaller \(f(x)\) the better.
In summary: the larger \(\hat{y}f(x)\), the better.
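As a minimal sketch of the model (the weights `w` and bias `b` below are made-up values, not learned ones), prediction is just the sign of the linear function:

```python
import numpy as np

def f(x, w, b):
    # the linear function f(x) = w^T x + b
    return np.dot(w, x) + b

def predict(x, w, b):
    # classify by the sign of f(x): +1 if f(x) > 0, -1 if f(x) < 0
    return 1 if f(x, w, b) > 0 else -1

# hypothetical 2-D example
w = np.array([1.0, -2.0])
b = 0.5
print(predict(np.array([3.0, 1.0]), w, b))   # f = 3 - 2 + 0.5 = 1.5  -> +1
print(predict(np.array([0.0, 1.0]), w, b))   # f = -2 + 0.5 = -1.5    -> -1
```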

Loss

Structural risk minimization: empirical risk + regularization term

Empirical risk

As mentioned above, we want \(\hat{y}f(x)\) to be as large as possible; that is, the larger \(\hat{y}f(x)\) is, the smaller the loss should be (large value, small loss).
1. Consider the sigmoid + cross-entropy loss function:
\[\hat{y} = \begin{cases} +1, & f(x) > 0,\ \sigma(f(x)) \longrightarrow 1,\ Loss = -\ln(\sigma(f(x))) \\ -1, & f(x) < 0,\ \sigma(f(x)) \longrightarrow 0,\ Loss = -\ln(1-\sigma(f(x))) \end{cases}\]
Considering that \(1-\sigma(f(x)) = 1-\frac{1}{1+\exp(-f(x))} = \frac{1}{1+\exp(f(x))} = \sigma(-f(x))\), the two cases combine into
\[Loss = -\ln(\sigma(\hat{y}f(x))) = \ln(1+\exp(-\hat{y}f(x)))\]

This is the logistic loss from the watermelon book (Zhou Zhihua's Machine Learning).
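A quick numerical sanity check of the identity above (the sample values of \(\hat{y}f(x)\) are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss_direct(yf):
    # -ln(sigma(yhat * f(x)))
    return -np.log(sigmoid(yf))

def logistic_loss_rewritten(yf):
    # ln(1 + exp(-yhat * f(x)))
    return np.log(1.0 + np.exp(-yf))

# the two forms agree for any value of yhat * f(x)
for yf in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert np.isclose(logistic_loss_direct(yf), logistic_loss_rewritten(yf))
```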
2. Use the hinge loss:
With the logistic loss, we want \(\hat{y}f(x)\) to be larger and larger, better and better, with no end. From a different point of view, if we only want \(\hat{y}f(x)\) to be good enough, then once \(\hat{y}f(x) > 1\) we consider it good enough and set the loss to zero.

Digression: the hinge loss is like learning in breadth; for the many fields outside our own, a rough understanding is enough. The logistic loss is like studying in depth within our own field, where nothing is ever good enough.

\[Loss = max(0,1-\hat{y}f(x))\]
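The hinge loss is easy to state in code; note how it is exactly zero once \(\hat{y}f(x) \geq 1\) ("good enough"), whereas the logistic loss above never reaches zero:

```python
def hinge_loss(yf):
    # max(0, 1 - yhat*f(x)): zero once yhat*f(x) >= 1
    return max(0.0, 1.0 - yf)

print(hinge_loss(2.0))   # 0.0  (well past the margin: no loss)
print(hinge_loss(1.0))   # 0.0  (exactly good enough)
print(hinge_loss(0.5))   # 0.5  (correct side, but inside the margin)
print(hinge_loss(-1.0))  # 2.0  (misclassified)
```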

Regularization term

\[\frac{1}{2}||w||^2\]
In summary, the final loss function:
\[Loss = \frac{1}{2}\lambda||w||^2 + \sum_{i=1}^n max(0,1-\hat{y}^i f(x^i))\]

Note that the regularization term is a convex function and the empirical loss term is also convex, so the loss can be minimized directly by gradient descent.

Algorithm

Gradient descent

\[\frac{\partial L}{\partial w} = \lambda w+ \sum_{i=1}^n -\delta(\hat{y}^i f(x^i) < 1)\hat{y}^i x^i\]

\[\frac{\partial L}{\partial b} = \sum_{i=1}^n -\delta(\hat{y}^i f(x^i) < 1)\hat{y}^i\]

where \(\delta(\hat{y}^i f(x^i) < 1)\) is the indicator function.

\[w^{k+1}=w^k-\eta(\lambda w^k+ \sum_{i=1}^n -\delta(\hat{y}^i f(x^i) < 1)\hat{y}^i x^i)\]

\[b^{k+1}=b^k-\eta(\sum_{i=1}^n -\delta(\hat{y}^i f(x^i) < 1)\hat{y}^i)\]
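The update rules above can be sketched directly in code (the toy data, learning rate \(\eta\), and \(\lambda\) below are made-up; the indicator \(\delta(\cdot)\) is implemented as a 0/1 mask):

```python
import numpy as np

def train_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Minimize (lambda/2)||w||^2 + sum_i max(0, 1 - y^i f(x^i)) by (sub)gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        f = X @ w + b
        active = (y * f < 1).astype(float)    # indicator delta(y^i f(x^i) < 1)
        grad_w = lam * w - (active * y) @ X   # dL/dw
        grad_b = -np.sum(active * y)          # dL/db
        w -= eta * grad_w                     # w^{k+1} = w^k - eta * dL/dw
        b -= eta * grad_b                     # b^{k+1} = b^k - eta * dL/db
    return w, b

# hypothetical linearly separable data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_svm(X, y)
print(np.sign(X @ w + b))   # recovers the labels [ 1.  1. -1. -1.]
```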

Summary

What we have done up to this point: for a given data set, find a hyperplane that separates the points and classifies them, doing as well as possible (the hinge-loss strategy). If the data cannot be separated well in the current dimension or space, we can transform the data points into another space (for example by raising the dimension) where they are more separable, so that the task can be done better.

\[z = \phi(x)\]

where \(z\) is the transform of \(x\) (the new space may be higher- or lower-dimensional); then apply the method described above:

\[Loss = \frac{1}{2}\lambda||w||^2 + \sum_{i=1}^n max(0,1-\hat{y}^i f(z^i))\]

\[Loss = \frac{1}{2}\lambda||w||^2 + \sum_{i=1}^n max(0,1-\hat{y}^i f(\phi(x^i)))\]

Weakness: after transforming \(x\) to obtain \(z\), we must first compute \(z\) and then carry out the subsequent calculations. When the dimension of \(z\) is large, separability improves but the amount of computation grows greatly; in special cases, for example when \(z\) is infinite-dimensional, \(z\) simply cannot be computed. This leads to the kernel method.
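As a concrete illustration (the degree-2 feature map below is a standard textbook example, not from this post): for \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\), the inner product in the transformed space equals a simple kernel computed entirely in the original space, \(\kappa(x, x') = (x^Tx')^2\), so \(\phi\) never has to be computed explicitly:

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for 2-D input
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, xp):
    # polynomial kernel: computed entirely in the original low-dimensional space
    return np.dot(x, xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(xp))   # inner product in the transformed space
rhs = kernel(x, xp)             # kernel in the original space
print(lhs, rhs)                 # both equal (x . x')^2 = (3 - 2)^2 = 1
```

For the RBF kernel, the corresponding \(\phi\) is infinite-dimensional, which is exactly the case where \(z\) cannot be computed but \(\kappa\) still can.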

Extensions

  1. For a deep neural network doing a binary classification task, we generally use the cross-entropy loss function; if we replace that loss with the hinge loss, we get a deep learning version of SVM.
  2. Treat the first n-1 layers of the deep neural network as feature-transformation layers and the last layer as the classification layer; this is very similar to the summary above: transform \(x\), then classify. The difference is that in SVM the transform function is defined by us and is fixed, while in deep learning the transform function is not fixed but is learned from the data.
    In general, SVM and deep learning follow a unified idea for classification tasks, and there is no need to distinguish them in essence.

III. Dual Form

The purpose of writing the dual form: express \(w, b\) as linear combinations of the data points. This turns the high-dimensional computation \(\phi(x^i)^T \phi(x^j)\) into \(\kappa(x^i, x^j)\), computed in the low-dimensional space, obtaining the final value directly via the kernel.
The implicit idea is: we do not need to know the intermediate process (the value of \(z\) in the high-dimensional space), only the relationship between the points; the kernel function \(\kappa\) expresses that relationship.

According to the update formulas for \(w, b\), when \(w^0 = 0, b^0 = 0\), it is easy to see that \(w, b\) are linear combinations of the data points:
\[w = \sum_{i=1}^n \alpha_i \hat{y}^i x^i\]

\[b = \sum_{i=1}^n \beta_i \hat{y}^i\]

\[\alpha_i= \eta\{(1-\eta \lambda)^k \delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to 0}+(1-\eta \lambda)^{k-1} \delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to 1}+\dots\\+(1-\eta \lambda)^0 \delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to k}\}\]

\[\beta_i= \eta\{\delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to 0}+ \delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to 1}+\dots\\+ \delta(\hat{y}^i (w^Tx^i+b)<1)_{w,b\to k}\}\]

This differs from the perceptron because of the regularization term here, with \(\lambda > 0\); if \(\lambda = 0\), then \(\alpha_i = \beta_i\).

At this point:

\[f(x) = w^Tx+b= (\sum_{i=1}^n \alpha_i \hat{y}^i x^i)^{T}x+\sum_{i=1}^n \beta_i \hat{y}^i\]

\[f(x) = w^Tz+b= (\sum_{i=1}^n \alpha_i \hat{y}^i z^i)^{T}z+\sum_{i=1}^n \beta_i \hat{y}^i\]

\[f(x) = w^Tz+b= \sum_{i=1}^n \alpha_i \hat{y}^i \kappa (x^i,x)+\sum_{i=1}^n \beta_i \hat{y}^i\]
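A sketch of prediction in the dual form, assuming the coefficients \(\alpha_i, \beta_i\) have already been obtained, and using the RBF kernel as an example (the training points and coefficients below are made-up for illustration):

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    # kappa(x, x') = exp(-gamma * ||x - x'||^2); corresponds to an infinite-dimensional phi
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def dual_predict(x, X, y, alpha, beta, gamma=1.0):
    # f(x) = sum_i alpha_i y^i kappa(x^i, x) + sum_i beta_i y^i
    fx = sum(a * yi * rbf_kernel(xi, x, gamma) for a, yi, xi in zip(alpha, y, X))
    fx += np.dot(beta, y)
    return fx

# made-up training points and dual coefficients
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
beta = np.array([0.1, 0.1])

# a query point near the positive example gets a positive score
print(dual_predict(np.array([0.1, 0.0]), X, y, alpha, beta))
```

Note that only \(\kappa\) appears at prediction time; \(\phi(x)\) itself is never computed.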

Origin www.cnblogs.com/SpingC/p/11619814.html