Support Vector Machines
1 Optimization goal
In the SVM, a hinge-shaped cost replaces the log term of logistic regression, and its graph is very similar to the graph of the log term.
In the SVM we no longer use λ for regularization but C, and C multiplies the first (cost) term instead of the regularization term; the 1/m factor is also removed.
In support vector machines, to minimize our cost function:
- When y = 1, we want z = θ^T x to be greater than or equal to 1, because when z ≥ 1 the cost_1 term is 0, which minimizes the cost function.
- When y = 0, we want z to be less than or equal to -1, because when z ≤ -1 the cost_0 term is 0 and the cost function is minimized.
- The SVM chooses the black line as the decision boundary, separating the positive samples from the negative samples with the largest possible distance. For this reason the SVM is also called a large margin classifier.
- If C is set too large, the SVM becomes very sensitive to outliers, which results in the rose-red line.
- If C is relatively small, the SVM will still use the black line.
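The two piecewise costs and the overall objective described above can be sketched in numpy (a minimal illustration; the function names `cost1`, `cost0`, and `svm_objective` are my own, not from any library):

```python
import numpy as np

def cost1(z):
    # cost for a y = 1 example: zero once z >= 1, linear penalty below that
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # cost for a y = 0 example: zero once z <= -1, linear penalty above that
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * (sum of per-example hinge costs) + (1/2)||theta||^2,
    # with no 1/m factor and the bias theta[0] left unregularized
    z = X @ theta
    hinge = y * cost1(z) + (1 - y) * cost0(z)
    return C * hinge.sum() + 0.5 * np.sum(theta[1:] ** 2)
```

For a positive example with z = 2, `cost1` returns 0, matching the bullet above: the cost vanishes once z ≥ 1.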
2 The large margin principle
2.1 Vector inner products
In the upper plot of the figure below:
- The length of the vector u is ||u|| = sqrt(u_1^2 + u_2^2) (Pythagorean theorem); the length of v is obtained the same way. Draw a line through the tip of v perpendicular to u; p is the (signed) length of the projection of v onto u. Then u^T v = p · ||u||.
- When the angle between u and v is less than 90°, p is positive.
- When the angle between u and v is greater than 90°, p is negative.
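The projection identity u^T v = p · ||u|| is easy to check numerically (the toy vectors here are my own choice):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

norm_u = np.sqrt(u[0] ** 2 + u[1] ** 2)  # ||u|| = 5 by the Pythagorean theorem
p = (u @ v) / norm_u                     # signed length of v's projection onto u

# the inner product equals the projection length times ||u||
assert np.isclose(u @ v, p * norm_u)
```

If v pointed away from u (angle above 90°), p would come out negative and so would the inner product.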
2.2 How the support vector machine chooses its decision boundary
- Assume the green line in the left figure is the decision boundary (a support vector machine would not pick this line). Using the vector inner product, we can draw the projections of two samples onto θ: p^(1) is a very small positive number and p^(2) is a negative number of small magnitude. For the constraints p^(i) · ||θ|| ≥ 1 (when y^(i) = 1) or p^(i) · ||θ|| ≤ -1 (when y^(i) = 0) to hold, ||θ|| must then be very large; but a large ||θ|| conflicts with the optimization term (1/2)||θ||^2, which we want as small as possible. So the support vector machine will not choose this decision boundary.
- If we instead choose the other green decision boundary, the projections p^(i) become larger, which allows ||θ|| to be smaller. In order to make θ as small as possible, the SVM therefore looks for the gap that makes the projections p^(i) as large as possible; the ultimate goal is still to make ||θ|| as small as possible.
2.3 Kernel functions
When we need to compute high-order polynomial terms, the computation becomes very expensive, so here we use a kernel function instead.
The Gaussian similarity between x and a landmark l is expressed as follows: f = similarity(x, l) = exp(-||x - l||^2 / (2σ^2))
- When x and l are very similar (close together), f will be very close to 1
- When x and l are far apart, f will be very close to 0
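Both limiting cases can be verified with a direct implementation of the Gaussian similarity (a sketch; the choice σ = 1 is arbitrary):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
near = gaussian_kernel(x, np.array([1.0, 2.0]), sigma=1.0)     # x == l, f is exactly 1
far = gaussian_kernel(x, np.array([100.0, 100.0]), sigma=1.0)  # very distant, f is ~0
```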
- Suppose we now have three landmarks l(1), l(2), l(3), whose similarity features are f_1, f_2, f_3, and the parameters θ_0, θ_1, θ_2, θ_3 are known (in the lecture's example, θ_0 = -0.5, θ_1 = 1, θ_2 = 1, θ_3 = 0). When the real sample (the rose-red point) is very close to l(1) and very far from l(2) and l(3), f_1 is very close to 1 while f_2 and f_3 are very close to 0. Substituting in, θ_0 + θ_1 f_1 ≈ -0.5 + 1 = 0.5 ≥ 0, so the predicted value is 1.
- In the same way, when the actual sample is very far from all three landmarks, f_1, f_2, and f_3 are all close to 0, so the hypothesis reduces to θ_0 < 0 and the predicted value is 0.
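Both predictions can be replayed in code, assuming the example parameter values θ_0 = -0.5, θ_1 = 1, θ_2 = 1, θ_3 = 0 (these particular numbers are illustrative, not prescriptive):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])  # illustrative [theta0, theta1, theta2, theta3]

def predict(f1, f2, f3):
    # predict 1 when theta0 + theta1*f1 + theta2*f2 + theta3*f3 >= 0, else 0
    return int(theta @ np.array([1.0, f1, f2, f3]) >= 0)

near_l1 = predict(1.0, 0.0, 0.0)  # close to l(1): -0.5 + 1 = 0.5 >= 0, predict 1
far_all = predict(0.0, 0.0, 0.0)  # far from every landmark: score is theta0 = -0.5, predict 0
```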
- Because θ_3 is 0, the predicted value is 1 only when the actual sample lies inside the red-line region; this is how we obtain our decision boundary.

How to choose the landmarks: look at the two pictures in the lower part of the figure below. The left picture shows where the given samples are located. We place a landmark at the position of each sample point, and then form a vector from them.
- For example, for x^(i) we obtain f^(i)_1, f^(i)_2, ..., f^(i)_m, where each f^(i)_j is the similarity between x^(i) and the j-th sample point. In particular, f^(i)_i is the similarity between x^(i) and itself, which is 1.
- Combining these f^(i)_j gives a new feature vector f^(i) that describes the sample x^(i).
- Then train the SVM using this kernel-based feature representation
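Putting the landmark construction together: with every training point used as a landmark, each sample x^(i) maps to an m-dimensional feature vector f^(i). A sketch with a toy 3-point training set (the data and helper names are my own):

```python
import numpy as np

def similarity(a, b, sigma):
    # Gaussian kernel: exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def feature_vector(x_i, landmarks, sigma):
    # f^(i)_j = similarity(x^(i), l^(j)), one entry per landmark
    return np.array([similarity(x_i, l, sigma) for l in landmarks])

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # toy training set, m = 3
f1 = feature_vector(X[0], X, sigma=1.0)
# f1[0] compares x^(1) with itself, so it is exactly 1
```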
- When C is large, high variance (overfitting) is likely
- When C is small, high bias (underfitting) is likely
- When σ^2 is large, there will be high bias
- When σ^2 is small, there will be high variance
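These bias/variance effects can be observed with an off-the-shelf solver, assuming scikit-learn is available; note that its `gamma` parameter plays the role of 1/(2σ^2), so a small σ corresponds to a large `gamma` (the toy data and parameter values here are my own):

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data: two Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 3.0])
y = np.array([0] * 20 + [1] * 20)

# large C and large gamma (small sigma): very flexible boundary, variance risk
flexible = SVC(kernel="rbf", C=100.0, gamma=10.0).fit(X, y)
# small C and small gamma (large sigma): smooth boundary, bias risk
smooth = SVC(kernel="rbf", C=0.01, gamma=0.01).fit(X, y)
print(flexible.score(X, y), smooth.score(X, y))  # training accuracy of each
```

The flexible model fits the training set almost perfectly, which is exactly what makes it prone to overfitting on new data.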
2.4 Applying SVMs
- You can call an existing library to fit the optimal θ
- Choose C
- Choose the kernel function (using a linear kernel means using no kernel at all)
- If you use the Gaussian kernel, you must also choose σ^2
- When n is large relative to m, use logistic regression or an SVM with a linear kernel
- When n is small and m is moderate, use an SVM with the Gaussian kernel
- When n is small and m is large, add or create more features, then use logistic regression or an SVM with a linear kernel
- Neural networks work well in all of these regimes, but are slower to train.