Machine Learning Basics (7) - Support Vector Machines

Support Vector Machines

1 Optimization goal

In the SVM, cost functions cost_1(z) and cost_0(z) replace the log (logistic) loss terms, and their curves look very similar to the curves of the log terms.

In the SVM we no longer weight the regularization with λ; instead, a constant C multiplies the first (error) term, and the 1/m factor is dropped.
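Written out, the objective described above takes the standard form (with cost_1 and cost_0 as the hinge-style costs):

```latex
\min_{\theta}\; C \sum_{i=1}^{m} \Big[ y^{(i)}\,\mathrm{cost}_1\!\big(\theta^{T}x^{(i)}\big) + \big(1-y^{(i)}\big)\,\mathrm{cost}_0\!\big(\theta^{T}x^{(i)}\big) \Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2}
```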

In support vector machines, to minimize our cost function:

  • When y=1, we want z = θᵀx to be greater than or equal to 1, because when z ≥ 1 the cost_1 term is 0, which minimizes the cost function

  • When y=0, we want z to be less than or equal to -1, so that cost_0 is 0 and the cost function is minimized (both conditions are restated in symbols below).

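Restating the two bullets above in symbols:

```latex
\mathrm{cost}_1(z) = 0 \;\text{ for } z = \theta^{T}x \ge 1, \qquad \mathrm{cost}_0(z) = 0 \;\text{ for } z = \theta^{T}x \le -1
```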

  • The SVM chooses the black line as the decision boundary, separating the positive samples from the negative samples with the largest possible distance. For this reason the SVM is sometimes called a large margin classifier.

  • In the support vector machine, if C is set too large, the classifier becomes very sensitive to outliers; this produces the magenta (rose-red) decision boundary.

  • If C is not too large, the SVM will keep the black line (see the sketch below).
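As an illustrative sketch (not from the original notes), the following scikit-learn snippet shows how C controls sensitivity to an outlier with a linear kernel; the toy data and the specific C values are made up for the example:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated clusters plus one outlier in the positive class.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],   # positive samples
              [3.0, 3.0], [3.5, 3.2], [3.2, 2.8],   # negative samples
              [2.9, 1.0]])                           # outlier labelled positive
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # With a large C, margin violations are heavily penalized, so the boundary
    # is pulled toward the outlier; with a small C, violations are cheap and
    # the boundary stays closer to the one suggested by the two main clusters.
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}")
```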

2 The large margin principle

2.1 The vector inner product

In the upper part of the figure below:

  • The length of the vector u is ||u|| = √(u₁² + u₂²) (Pythagorean theorem); the length of v is obtained the same way. Drop a perpendicular from the tip of v onto u; p is the signed length of the projection of v onto u. Then uᵀv = p·||u|| (written out below).
  • When the angle between u and v is less than 90°, p is positive.
  • When the angle between u and v is greater than 90°, p is negative.
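In symbols, the projection fact used above is:

```latex
u^{T}v = u_1 v_1 + u_2 v_2 = p \cdot \lVert u \rVert, \qquad \lVert u \rVert = \sqrt{u_1^{2} + u_2^{2}}
```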

2.2 The principle of support vector machine selection decision boundary

  • Assume the green line in the left figure is the decision boundary (a support vector machine would not pick this line). Using the vector inner product, we can draw the projections of two samples onto θ: p(1) is a very small positive number and p(2) is a negative number of small magnitude. Then, to satisfy p(i)·||θ|| ≥ 1 (for positive samples) or p(i)·||θ|| ≤ -1 (for negative samples), ||θ|| would have to be very large. But a large ||θ|| conflicts with the optimization term (1/2)·||θ||², which we want to be as small as possible. So the support vector machine will not choose this green decision boundary.

  • If instead we choose this other green decision boundary, the projections p(i) become larger, which allows ||θ|| to be smaller. Since the SVM wants ||θ|| as small as possible, it looks for the boundary that makes the projections p(i), i.e. the margin, as large as possible; the ultimate goal is to make ||θ|| as small as possible (see the one-line summary below).
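The reasoning above condenses to one line: the constraint ties the projection to 1/||θ||, so shrinking ||θ|| forces the projections (the margin) to grow.

```latex
p^{(i)} \cdot \lVert \theta \rVert \ge 1 \;\Rightarrow\; p^{(i)} \ge \frac{1}{\lVert \theta \rVert}, \qquad \text{so minimizing } \tfrac{1}{2}\lVert \theta \rVert^{2} \text{ makes the } p^{(i)} \text{ large.}
```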

2.3 Kernel function

Computing many high-order polynomial features is computationally expensive, so instead we use a kernel (similarity) function to construct new features.
The similarity kernel between x and a landmark l is expressed as follows:
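In the standard Gaussian (RBF) form, this similarity is:

```latex
f = \mathrm{similarity}(x, l) = \exp\!\left( -\frac{\lVert x - l \rVert^{2}}{2\sigma^{2}} \right)
```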

  • When x and l are very similar (close together), f is very close to 1
  • When x and l are far apart, f is very close to 0 (see the NumPy sketch below)

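A minimal NumPy sketch of this behaviour (the function name, σ value, and sample points are illustrative, not from the original notes):

```python
import numpy as np

def gaussian_similarity(x, l, sigma=1.0):
    # f = exp(-||x - l||^2 / (2 * sigma^2)): close to 1 when x is near the
    # landmark l, close to 0 when x is far from it.
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_similarity(x, np.array([1.1, 2.0])))  # near the landmark -> ~0.995
print(gaussian_similarity(x, np.array([8.0, 9.0])))  # far from the landmark -> ~0.0
```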

  • Suppose we now have three features f1, f2, f3, whose similarity landmarks are l(1), l(2), l(3) respectively, and the parameters θ0, θ1, θ2, θ3 are known. When the real sample (the rose-red point) is very close to l(1) and very far from l(2) and l(3), f1 is very close to 1 while f2 and f3 are very close to 0. Substituting into θ0 + θ1·f1 + θ2·f2 + θ3·f3 gives a value of about 0.5 ≥ 0, so the predicted value is 1.

  • In the same way, when the sample is very far from all three landmarks, f1, f2, f3 are all close to 0, the value is approximately θ0 (negative in this example), and the prediction is 0.

  • And because θ3 is 0, the prediction is 1 only when the sample falls inside the red boundary region; in this way we obtain the decision boundary.

How to choose the landmarks: look at the two pictures in the lower part of the figure. The picture on the left shows where the given training samples lie. We place a landmark at the position corresponding to each sample point, and then form a feature vector from the similarities to these landmarks.

  • For example, for a training example x(i) we compute f(i)_1, f(i)_2, ..., f(i)_m, where each f(i)_j is the similarity between x(i) and the j-th sample point (landmark). In particular, f(i)_i is the similarity between x(i) and itself, which is 1.

  • Then, by stacking these f(i)_j together, we create a new feature vector f(i) that replaces the original feature vector x(i) (a sketch of this construction follows).

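A minimal sketch of this feature construction, assuming the landmarks are simply the m training samples themselves (function and variable names are illustrative):

```python
import numpy as np

def gaussian_similarity(x, l, sigma=1.0):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def kernel_features(X, sigma=1.0):
    # Map each x(i) to f(i) = [f(i)_1, ..., f(i)_m], its similarities to all
    # m training samples used as landmarks; the diagonal entries equal 1.
    m = X.shape[0]
    F = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            F[i, j] = gaussian_similarity(X[i], X[j], sigma)
    return F

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0]])
print(kernel_features(X))  # 3x3 similarity matrix with ones on the diagonal
```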

  • Then train the SVM using these kernel features (f(i) takes the place of x(i) in the cost function)
  • When C is large, high variance (overfitting) is likely
  • When C is small, high bias (underfitting) is likely
  • When σ² is large, the features vary more smoothly, so high bias (underfitting) is likely
  • When σ² is small, the features vary less smoothly, so high variance (overfitting) is likely (see the scikit-learn sketch below)
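As a hedged illustration (not from the original notes), scikit-learn's SVC exposes these two knobs directly: C, and gamma for the Gaussian (RBF) kernel, where gamma plays the role of 1/(2σ²); the data below is synthetic:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Large C and large gamma (small sigma^2): flexible fit, risk of overfitting.
# Small C and small gamma (large sigma^2): smooth fit, risk of underfitting.
for C, gamma in [(100.0, 10.0), (0.01, 0.001), (1.0, 0.5)]:
    scores = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5)
    print(f"C={C}, gamma={gamma}: CV accuracy = {scores.mean():.2f}")
```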

2.4 Applying SVMs

  • Use an existing SVM library to solve for the optimal θ; there is no need to write the optimizer yourself
  • Choose the parameter C
  • Choose the kernel function (a "linear kernel" means using no kernel at all)
  • If you use the Gaussian kernel, you must also choose σ²


  • When n (the number of features) is large relative to m (the number of samples), use logistic regression or a linear (no-kernel) SVM
  • When n is small and m is moderate, use an SVM with the Gaussian kernel
  • When n is small and m is large, add or create more features, then use logistic regression or a linear SVM
  • Neural networks are likely to work well in all of these settings, but they can be slower to train.

Source: blog.csdn.net/weixin_44027006/article/details/124079252