SVM algorithm (2)

      • 2.1 Introduction to SVM algorithm

        learning target

        • Understand the definition of the SVM algorithm
        • Know hard and soft margins

        1 SVM algorithm introduction

        A long time ago, on Valentine's Day, a hero set out to rescue his lover, but the devil made him play a game first.

        The devil placed balls of two colors on the table in a seemingly regular pattern and said:

        "Use a stick to separate them. Requirement: the separation should still work after more balls are put in."


        So the hero placed the stick like this. Did he do a good job?


        Then the devil placed more balls on the table, and one ball seemed to be on the wrong side.


        What to do?

        Make the dividing stick thicker.

        SVM is trying to place the stick in the best position so that there is as much space as possible on both sides of the stick.


        Now even if the devil puts in more balls, the stick is still a good dividing line.


        Then comes another, more important trick in the SVM toolbox. The devil saw that the hero had learned a trick, so he gave the hero a new challenge.


        Now there is no straight stick that can separate the two kinds of balls. What should the hero do?

        Of course, as in all martial arts movies, the hero slaps the table and the balls fly into the air. Then, relying on his lightness skill, the hero grabs a piece of paper and inserts it between the two kinds of balls.


        Later, the bored adults gave the objects above other names:

        Ball - "data"

        Stick - "classifier"

        Maximum gap - "optimization"

        Slapping the table - "kernelling"

        Paper - "hyperplane"

        2 SVM algorithm definition

        2.1 Definition

        SVM stands for Support Vector Machine: it finds a hyperplane that divides the samples into two categories with the largest margin.

        SVM is capable of performing linear or nonlinear classification, regression, and even outlier detection tasks. It is one of the most popular models in the field of machine learning. SVM is particularly suitable for classification of small and medium-sized complex data sets.


        2.2 Introduction to the maximum margin of the hyperplane

        [Figure: left — three candidate linear classifiers on the same training set; right — the SVM decision boundary with the maximum margin]

        The figure above on the left shows the decision boundaries of three possible linear classifiers:

        The model represented by the dashed line performs so poorly that it does not even classify correctly. The remaining two models perform perfectly on this training set, but their decision boundaries are so close to the instances that they may not perform well on new instances.

        The solid line in the right figure represents the decision boundary of the SVM classifier; it not only separates the two classes but also stays as far away as possible from the nearest training instances.

        2.3 Hard margin and soft margin

        2.3.1 Hard margin classification

        When using a hyperplane to split the data as above, if we strictly require that all instances lie outside the margin and on the correct side, this is hard margin classification.

        There are two problems with hard margin classification. First, it only works when the data is linearly separable. Second, it is very sensitive to outliers.

        When there is an extra outlier in the iris data: for the data on the left, no hard margin can be found at all; on the right, the decision boundary ends up very different from the one we saw without the outlier, and it will probably not generalize well.


        2.3.2 Soft margin classification

        To avoid these problems, it is better to use a more flexible model. The goal is to find a good balance between keeping the margin as large as possible and limiting misclassifications, which is soft margin classification.
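        In scikit-learn this balance is controlled by the regularization parameter C of SVC (or LinearSVC): a small C gives a wider, softer margin that tolerates more violations, while a very large C approaches hard margin behavior. A minimal sketch, with toy data and C values chosen only for illustration:

        from sklearn.svm import SVC

        # tiny 2-D toy data set (illustrative only)
        X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [3, 3], [3, 2], [2, 3]]
        y = [0, 0, 0, 0, 1, 1, 1, 1]

        soft = SVC(kernel="linear", C=0.1)    # small C: softer margin, more tolerant of violations
        hard = SVC(kernel="linear", C=1000)   # large C: close to hard margin behavior
        soft.fit(X, y)
        hard.fit(X, y)

        # the softer margin typically keeps more support vectors
        print(len(soft.support_vectors_), len(hard.support_vectors_))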


        3 Summary

        • SVM algorithm definition [understand]
          • Find a hyperplane that divides the samples into two categories with the largest margin.
        • Hard margin and soft margin [know]
          • Hard margin
            • Only valid when the data is linearly separable
            • Very sensitive to outliers
          • Soft margin
            • The goal is to find a good balance between keeping the margin as large as possible and not misclassifying the data

2.2 Initial use of SVM algorithm API

learning target

  • Know the usage of SVM algorithm API
  • SVC (Support Vector Classifier) is the name of the SVM model used for classification

from sklearn import svm

# training data: two samples with two features each, and their class labels
X = [[0, 0], [1, 1]]
y = [0, 1]

# create an SVC classifier (RBF kernel by default) and fit it to the training data
clf = svm.SVC()
clf.fit(X, y)

After fitting, this model can be used to predict new values:

clf.predict([[2., 2.]])   # expected output: array([1])
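After training, the fitted model also exposes the support vectors it found. A short sketch of inspecting them (these attribute names come from scikit-learn's SVC):

# support vectors found during training
clf.support_vectors_   # the support vector samples themselves
clf.support_           # indices of the support vectors within X
clf.n_support_         # number of support vectors for each class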

2.3 Principle of SVM algorithm

learning target

  • Know about linearly separable support vector machines in SVM
  • Know the derivation process of the objective function in SVM

1 Define input data

Assume that a training set on a feature space is given:

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}, \qquad x_i \in \mathbb{R}^n, \ \ y_i \in \{+1, -1\}, \ \ i = 1, 2, \dots, N$$

Among them, $(x_i, y_i)$ is called a sample point.

  • $x_i$ is the $i$-th instance (sample);
  • $y_i$ is the label of $x_i$:
    • when $y_i = +1$, $x_i$ is a positive example
    • when $y_i = -1$, $x_i$ is a negative example

Why are positive and negative examples labeled +1 and -1?

This choice simplifies the later derivation: with $y_i \in \{+1, -1\}$, the two classification constraints can be written as the single condition $y_i(w^T x_i + b) \ge 1$.

2 Linearly separable support vector machine

Given the linearly separable training data set above, the separating hyperplane obtained by maximizing the margin is: $y(x) = w^T \Phi(x) + b$

The corresponding classification decision function is: $f(x) = \operatorname{sign}\big(w^T \Phi(x) + b\big)$

The above decision function is called a linearly separable support vector machine.

Here is an explanation of $\Phi(x)$.

$\Phi(x)$ is a feature space transformation function: it maps $x$ into a higher-dimensional space. It is closely related to the **"kernel function"** that we will see often later on.

For example, suppose we observe 2 features:

A first linear function of $x_1$ and $x_2$ could be $w_1 x_1 + w_2 x_2$.

But perhaps these two features cannot describe the data well (we cannot separate the data with only these two features), so we raise the dimension and the function becomes $w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$.

So we now have three more features; this is one concrete instance of the general mapping of $x$. A small code sketch of this idea follows below.
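As a small illustration of this idea (the weights below are arbitrary placeholders, only the structure matters), the expanded model is still a linear function, just applied to the mapped features:

import numpy as np

def phi(x):
    # map 2 original features to 5: (x1, x2, x1*x2, x1^2, x2^2)
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

w = np.array([1.0, 1.0, -2.0, 0.5, 0.5])   # placeholder weights w1..w5
b = -1.0                                   # placeholder bias

def decision(x):
    # linear decision function in the mapped space: sign(w^T Phi(x) + b)
    return np.sign(w @ phi(x) + b)

print(decision([1.0, 2.0]))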

The simplest direct mapping is the identity: $\Phi(x) = x$.

The above is the model expression of linearly separable support vector machine. We need to find such a model, or such a hyperplane y(x), which can optimally separate the two sets.

In fact, we need to find a set of parameters (w, b) so that the hyperplane function constructed can optimally separate the two sets.

The following is an optimal hyperplane:

[Figure: the optimal separating hyperplane $w^T x + b = 0$ (solid line), with the margin boundaries $w^T x + b = \pm 1$ (dashed lines) passing through the support vectors]

Some students may ask: if the solid line is $w^T x + b = 0$, why are the dashed lines exactly $w^T x + b = \pm 1$? Because scaling $w$ and $b$ by the same positive constant does not change the hyperplane, we are free to rescale them so that the samples closest to the hyperplane satisfy $|w^T x + b| = 1$; the values $\pm 1$ are just a convenient normalization.

3 SVM calculation process and algorithm steps

3.1 Derivation of objective function

We now know what a support vector machine is. Now we are going to find this support vector machine, that is, to find an optimal hyperplane.

We need to establish an objective function. How do we build it?

Let's look at our hyperplane expression again: $y(x) = w^T \Phi(x) + b$

For convenience we let: $\Phi(x) = x$

Then, in the sample space, the dividing hyperplane can be described by the linear equation: $w^T x + b = 0$

The distance from any point x in the sample space to the hyperplane (w, b) can be written as

$$r = \frac{|w^T x + b|}{\|w\|}$$

Assuming that the hyperplane $(w, b)$ can correctly classify the training samples, then for every $(x_i, y_i) \in D$:

  • if $y_i = +1$, then $w^T x_i + b > 0$
  • if $y_i = -1$, then $w^T x_i + b < 0$

Let:

$$\begin{cases} w^T x_i + b \ge +1, & y_i = +1 \\ w^T x_i + b \le -1, & y_i = -1 \end{cases}$$

As shown in the figure above, the few training sample points closest to the hyperplane are exactly the ones for which the equalities above hold; they are called "support vectors".

The sum of the distances from two support vectors of different classes to the hyperplane is $\gamma = \dfrac{2}{\|w\|}$,

which is called the "margin".

To find the dividing hyperplane with the maximum margin, we need to find the parameters $w$ and $b$ that satisfy the constraints above and make $\gamma$ as large as possible, that is:

$$\max_{w,b} \ \frac{2}{\|w\|} \qquad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, N$$

Obviously, to maximize the margin we only need to maximize $\|w\|^{-1}$, which is equivalent to minimizing $\|w\|^2$. So the above problem can be rewritten as:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, N$$

This is the basic form of the support vector machine.

3.2 Solving (understanding) the objective function

At this point, the objective function is finally established.

Then the next step is to find the optimal value of the objective function.

Because the objective function has a constraint, we can solve it using the Lagrange multiplier method .

3.2.1 Lagrange multiplier method

What is the Lagrange multiplier method?

The Lagrange multiplier method is a way of finding the extreme values of a multivariate function under a set of constraints.

By introducing Lagrange multipliers, the optimization problem with d variables and k constraints can be transformed into an unconstrained optimization problem with d + k variables.
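As a quick illustration of the idea (a standard textbook example, not part of the SVM derivation itself): to minimize $f(x, y) = x^2 + y^2$ subject to the constraint $x + y = 1$, introduce a multiplier $\lambda$ and build the Lagrangian

$$L(x, y, \lambda) = x^2 + y^2 + \lambda (1 - x - y)$$

Setting $\frac{\partial L}{\partial x} = 2x - \lambda = 0$, $\frac{\partial L}{\partial y} = 2y - \lambda = 0$ and $\frac{\partial L}{\partial \lambda} = 1 - x - y = 0$ gives $x = y = \frac{1}{2}$: a constrained problem in 2 variables with 1 constraint has become an unconstrained problem in 2 + 1 = 3 variables.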


After applying the Lagrange multiplier method, we can transform the objective function into:

$$\min_{w,b} \ \max_{\alpha_i \ge 0} \ L(w, b, \alpha), \qquad L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i\big(1 - y_i(w^T x_i + b)\big)$$

Among them, the second half of the formula, the term $\sum_i \alpha_i\big(1 - y_i(w^T x_i + b)\big)$, enforces the constraints:

if any constraint $y_i(w^T x_i + b) \ge 1$ is violated, the inner maximization over $\alpha_i \ge 0$ drives $L$ to $+\infty$; if all constraints are satisfied, the maximum of this term is $0$, so the minimax problem is equivalent to the original constrained problem.

At this point, the objective function still cannot be solved. Now our problem is a minimax problem.

3.2.2 Dual problem

We need to convert it into its dual problem, turning the min-max into a max-min:

From $\min\limits_{w,b} \max\limits_{\alpha \ge 0} L(w, b, \alpha)$ it becomes $\max\limits_{\alpha \ge 0} \min\limits_{w,b} L(w, b, \alpha)$.

How to get the dual function?

  • First, take the partial derivatives of the original objective function with respect to $w$ and $b$ and set them to zero:

    • Original objective function: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \alpha_i\big(1 - y_i(w^T x_i + b)\big)$
    • Partial derivative with respect to $w$: $\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{N} \alpha_i y_i x_i$
    • Partial derivative with respect to $b$: $\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0 \ \Rightarrow\ \sum_{i=1}^{N} \alpha_i y_i = 0$
  • Then substitute these results for $w$ and $b$ back into the original objective function; the result is the dual function of the original function:

$$\min_{w,b} L(w, b, \alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)$$

  • This dual function is exactly the inner part $\min\limits_{w,b} L(w, b, \alpha)$ of the max-min problem (because the partial derivatives were taken with respect to $w$ and $b$).
  • What remains is to maximize this function over $\alpha$, which can be written as:

$$\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \qquad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \ \ \alpha_i \ge 0$$

  • Now we only need to find the $\alpha$ that maximizes the formula above, and then substitute that $\alpha$ back into the expression obtained from the partial derivative with respect to $w$:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

  • This gives $w$.
  • Substituting $w$ (together with any support vector) into the hyperplane equation gives the value of $b$;
  • The resulting $w$ and $b$ are the parameters of the optimal hyperplane we are looking for.
3.2.3 Summary of the overall process

We use mathematical expressions to illustrate the above process:

  • 1) Write down the mathematical formulation of the model

  • 2) Solve for the parameters w and b

  • 3) Find the hyperplane:

$$w^{*} \cdot \Phi(x) + b^{*} = 0$$

  • 4) Obtain the classification decision function:

$$f(x) = \operatorname{sign}\big(w^{*} \cdot \Phi(x) + b^{*}\big)$$
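To connect the derivation with code: for a linear kernel, scikit-learn's SVC exposes the pieces of the dual solution, so we can check that $w = \sum_i \alpha_i y_i x_i$ (dual_coef_ stores $\alpha_i y_i$ for the support vectors). A minimal sketch on made-up toy data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard-margin SVM
clf.fit(X, y)

# w reconstructed from the dual solution: sum over support vectors of (alpha_i * y_i) * x_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual, clf.coef_)       # the two should match
print(clf.intercept_)               # this is b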

2.4 Kernel method of SVM

learning target

  • Know the kernel method of SVM
  • Understand common kernel functions

The combination of SVM and a kernel function is very powerful.

The kernel function is not unique to SVM; it can be combined with other algorithms as well. However, the benefit of combining kernel functions with SVM is particularly large.

1 What is a kernel function?

1.1 Concept of kernel function

The kernel function maps the original input space to a new feature space, so that samples that are originally linearly inseparable may be separable in the kernel space.

[Figure: samples that are linearly inseparable in the original input space become separable after being mapped into the new feature space]

  • Suppose X is the input space,
  • H is the feature space,
  • and there exists a mapping φ such that every point x in X can be mapped to a point h in H;
  • then for all points in X the following holds:

$$h = \phi(x)$$

If x and z are points in the input space and the function k(x, z) satisfies the condition below for all such points, then k is called a kernel function and ϕ the mapping function (the dot denotes the dot product):

$$k(x, z) = \phi(x) \cdot \phi(z)$$

The reason the kernel function is defined through a dot product is that, when solving for the SVM parameters w and b, the data only appear through dot products of pairs of samples; defining the kernel in this form simplifies the computation.

1.2 Kernel function example

1.2.1 Kernel method example 1:

Our data is two-dimensional, and our mapping function takes the square of the first value as the first feature, $\sqrt{2}$ times the product of the first and second values as the second feature, and the square of the second value as the third feature.

$$\Phi(x_1, x_2) = \big(x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2\big)$$

Then our kernel function is $k(v_1, v_2) = (v_1^T \cdot v_2)^2$, that is, the square of the dot product of $v_1$ and $v_2$.

Writing $v_1 = (x_1, x_2)$ and $v_2 = (y_1, y_2)$, we can verify the equality directly:

$$\langle \Phi(v_1), \Phi(v_2) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (v_1 \cdot v_2)^2 = k(v_1, v_2)$$

(Angle brackets indicate the dot product.)


Using the formulas above, the concrete computation for a specific example is:

  • $v_1 = (x_1, x_2) = (1, 1)$

  • $v_2 = (y_1, y_2) = (2, 2)$

  • $\Phi(v_1) = (1, \sqrt{2}, 1)$

  • $\Phi(v_2) = (4, 4\sqrt{2}, 4)$

  • $\langle \Phi(v_1), \Phi(v_2) \rangle = 4 + 8 + 4 = 16$

  • $K(v_1, v_2) = (v_1^T \cdot v_2)^2 = (2 + 2)^2 = 16$

    Here $\langle\cdot,\cdot\rangle$ and $\cdot$ both denote the dot product; a quick check of this computation in code is given below.
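A quick numerical check of this example (a sketch with NumPy; phi is the mapping defined above):

import numpy as np

def phi(v):
    # explicit feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

v1 = np.array([1.0, 1.0])
v2 = np.array([2.0, 2.0])

lhs = phi(v1) @ phi(v2)   # dot product in the mapped feature space: 16.0
rhs = (v1 @ v2) ** 2      # kernel computed directly in the input space: 16.0
print(lhs, rhs)           # both equal 16: the kernel avoids building phi explicitly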

Link: https://zhuanlan.zhihu.com/p/261061617

2 Common kernel functions

  • Linear kernel: $k(x_i, x_j) = x_i^T x_j$
  • Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j)^d$, where $d \ge 1$ is the degree of the polynomial
  • Gaussian (RBF) kernel: $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where $\sigma > 0$ is the bandwidth
  • Laplacian kernel: $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$, where $\sigma > 0$
  • Sigmoid kernel: $k(x_i, x_j) = \tanh(\beta\, x_i^T x_j + \theta)$, where $\beta > 0$ and $\theta < 0$
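In scikit-learn these choices correspond to the kernel parameter of SVC; the gamma, degree and coef0 values below are only illustrative settings for a sketch:

from sklearn.svm import SVC

linear_svc  = SVC(kernel="linear")                         # linear kernel
poly_svc    = SVC(kernel="poly", degree=3, coef0=1)        # polynomial kernel: (gamma * x.z + coef0)^degree
rbf_svc     = SVC(kernel="rbf", gamma=0.5)                 # Gaussian / RBF kernel (the default kernel)
sigmoid_svc = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0)  # sigmoid kernel: tanh(gamma * x.z + coef0)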

2.5 Case: Number Recognizer

learning target

  • Apply the SVM algorithm to implement a digit recognizer

1 Case background introduction


MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset for computer vision. Since its release in 1999, this classic handwritten image dataset has become the basis for benchmarking classification algorithms. As new machine learning technologies emerge, MNIST remains a reliable resource for researchers and learners.

In this case, our goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.

2 Data introduction

The data files train.csv and test.csv contain grayscale images of hand-drawn numbers from 0 to 9.

Each image has a height of 28 pixels and a width of 28 pixels, for a total of 784 pixels .

Each pixel has a single pixel value associated with it that indicates how bright or dark that pixel is, with higher numbers meaning darker. The pixel value is an integer between 0 and 255, inclusive.

The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining columns contain the pixel values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel in the image, decompose x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located in row i and column j of the 28 x 28 matrix (indexing from zero).

For example, pixel31 represents the pixel in the fourth column from the left and the second row from the top, as shown in the ASCII diagram below.

Visually, if we omit the "pixel" prefix, the pixels that make up the image look like this:

000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
 | | | | ...... | |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
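A small sketch of this indexing with NumPy (the flat 784-value row is assumed to be one image from train.csv with the label column already removed):

import numpy as np

x = 31
i, j = divmod(x, 28)          # i = 1 (second row), j = 3 (fourth column)

row = np.arange(784)          # stand-in for one flattened 28 x 28 image
image = row.reshape(28, 28)   # back to a 28 x 28 matrix
print(i, j, image[i, j])      # 1 3 31: pixel31 sits in row 1, column 3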


The test data set (test.csv) is the same as the training set except that it does not contain the "label" column.

3 Case implementation

Reference: Case_Handwritten digit classification.ipynb
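The notebook itself is not reproduced here. The following is a minimal sketch of how the case could be implemented with scikit-learn; the file name train.csv matches the data description above, while the subsampling to 5,000 rows, the 0-1 scaling and the default SVC settings are assumptions made for speed, not taken from the original notebook:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# load the training data: the first column is the label, the remaining 784 columns are pixel values
data = pd.read_csv("train.csv")
data = data.sample(n=5000, random_state=0)   # subsample: SVC training is slow on all 42,000 rows

y = data["label"]
X = data.drop(columns=["label"]) / 255.0     # scale pixel values from [0, 255] to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC()          # default RBF kernel
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))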


Origin: blog.csdn.net/weixin_52733693/article/details/127932291