I've finally arrived at Machine Learning Techniques. This time I'll try to finish each chapter and write it up right away. For the Foundations course I didn't keep the notes up, and looking back now I can't tell what I was writing; it's a mess and I feel like the whole thing crashed. I'll improve slowly.
I hear this course is quite difficult, so first, links to a master's blog posts for good luck:
Red Stone's notes. I think they sum it up very well!!
Lin Xuantian's Machine Learning Techniques Notes (2)
Lin Xuantian's Machine Learning Techniques Notes (3)
1. Linear SVM
P1 1.1
After the introduction, this course revolves around three [techniques] built on feature transforms:
1. How to exploit feature transforms while controlling their complexity: SVM (Support Vector Machine, which sounds quite difficult)
2. How to find predictive features and blend them together to make the model perform better: AdaBoost (adaptive boosting)
3. How to find and learn hidden features to make the machine perform better: Deep Learning (deep learning!!!)
P2 1.2
In PLA, a single data set can actually be separated in different ways. All three pictures above are "correct": every point is classified correctly, and according to the VC bound their Eout is the same.
But to the human brain, the split in the rightmost picture is clearly better.
Why? Because the data carries noise or measurement error: a real point won't necessarily sit exactly on an observed o or x, it may land anywhere in the gray region around it, which is also perfectly reasonable. In the left picture, an x close to the dividing line only needs a small jiggle to cross into the o region and cause an error. So to improve error tolerance (the ability to absorb errors, the legendary robustness?), we need to call up a "stronger" line. Clearly the strongest line is the one that, while keeping everything correct, is farthest from its nearest point.
Equivalently, think of the line as "fat" versus "thin": the fatter the line, the stronger it is. Academically this fatness is called the margin. What follows is a formula for the w that maximizes the margin: "the strongest line is the one farthest from its nearest point while everything is guaranteed correct."
P3 1.3
We begin by finding distance(xn, b, w). Previously a w0 was prepended alongside w1~wd, but because this w0 behaves differently from the other weights in the computation, it is pulled back out and renamed b, so we have: (this w0, i.e. b, is the bias term; for why there is a bias term, see the watermelon book for details)
Next, find distance(x, b, w). Let x' and x'' be points on the hyperplane and x a data point (not necessarily on the hyperplane). From $w^T x' + b = 0$ we get $w^T x' = -b$, and likewise $w^T x'' = -b$.
There is a special step here: proving that w is the normal vector of the hyperplane. (I read someone else's article about hyperplanes, but it didn't seem to explain why w is a normal vector either. In fact it follows directly: for any two points x', x'' on the plane, $w^T(x'' - x') = -b - (-b) = 0$, so w is perpendicular to every vector lying in the plane.)
Knowing the normal vector, take any point x' on the plane; the distance from x to the plane is just the projection of the vector (x - x') onto w, so:
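As a quick sanity check on this formula, here is a minimal numpy sketch (the function names are my own) computing $\text{distance}(x, b, w) = \frac{|w^T x + b|}{\|w\|}$, plus the margin of a separating line as the distance to its nearest point:

```python
import numpy as np

def distance(x, b, w):
    """Unsigned distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

def margin(X, y, b, w):
    """Margin of a separating (b, w): distance to the nearest data point.
    Uses y_n * (w^T x_n + b), so a misclassified point gives a negative value."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# the line x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1
w, b = np.array([1.0, 1.0]), -1.0
print(distance(np.array([0.0, 0.0]), b, w))  # |0 + 0 - 1| / sqrt(2) ≈ 0.7071
```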
Because this is a hard-margin SVM, the line must classify all points correctly, so:
And since $y_n = \pm 1$, the absolute value can be removed:
Next, for convenience of solving:
Definition:
Then there is:
As for why it is 1: in fact any constant works. The bullet comments say this involves the notions of functional margin and geometric margin?? Looking at Red Stone's explanation: scaling w and b by the same factor leaves the hyperplane unchanged, so we are free to rescale until $\min_{n} y_n(w^T x_n + b) = 1$ (oh, OK??)
At this point, since we want the largest margin (the fattest line) and the normalization gives $\text{margin} = \frac{1}{\|w\|}$, we need to maximize $\frac{1}{\|w\|}$ subject to $\min_{n=1...N} y_n(w^T x_n + b) = 1$.
But this is still hard to solve, so we relax the condition to $y_n(w^T x_n + b) \ge 1$ for all n, and then show that even after the relaxation the optimal solution still satisfies $\min_n y_n(w^T x_n + b) = 1$.
Suppose an optimal solution $(b_1, w_1)$ satisfies $y_n(w_1^T x_n + b_1) > 1.126$ for every n. Then $(\frac{b_1}{1.126}, \frac{w_1}{1.126})$ is also feasible, and since $\text{margin} = \frac{1}{\|w\|}$ and $\|\frac{w_1}{1.126}\| < \|w_1\|$, its margin is larger. So the supposed optimum $(b_1, w_1)$ was not optimal, a contradiction. Hence whenever a solution has $y_n(w^T x_n + b) > 1$ for all n, a strictly better one exists, and the optimum must satisfy $\min_n y_n(w^T x_n + b) = 1$.
Finally, since everything before was a minimization, for uniformity we convert $\max \frac{1}{\|w\|}$ into $\min \|w\|$. Because $\|w\|$ carries a square root, we remove it by squaring, which in matrix form is $w^T w$, and finally a factor of $\frac{1}{2}$ is added (it feels like it was added just to make the derivative cleaner??). The problem finally becomes:
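Written out in full, the hard-margin SVM problem this chain of steps arrives at is:

```latex
\begin{aligned}
\min_{b,\,w}\quad & \tfrac{1}{2}\, w^T w \\
\text{s.t.}\quad  & y_n \left( w^T x_n + b \right) \ge 1, \quad n = 1, \dots, N
\end{aligned}
```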
The final fun time: note that x1 and x2 in the formula correspond to the x and y in y = kx + b. Using the point-to-line distance formula $d = \frac{|Ax_1 + Bx_2 + C|}{\sqrt{A^2 + B^2}}$, rewrite $x_1 + x_2 = 1$ as $1 \cdot x_1 + 1 \cdot x_2 - 1 = 0$, so $A = 1, B = 1, C = -1$; substituting each point's coordinates for x1 and x2 gives:
P4 1.4
Taking this set of (X, Y) as an example, constraints (i)~(iv) can be obtained, from which $w_1 \ge 1$ and $w_2 \le -1$ follow, so $w_1^2 + w_2^2 \ge 2$ and therefore $\frac{1}{2} w^T w \ge 1$. Assigning the tight values to w1, w2 and b then yields $g_{svm} = \text{sign}(x_1 - x_2 - 1)$.
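To double-check that line, a tiny numpy sketch. I'm assuming the lecture's usual toy data here, x1=(0,0) and x2=(2,2) labeled -1, x3=(2,0) and x4=(3,0) labeled +1, so treat the exact points as my assumption rather than something stated in these notes:

```python
import numpy as np

# toy data (my assumption of the lecture's four points)
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])

# candidate solution read off from the constraints: w = (1, -1), b = -1
w, b = np.array([1., -1.]), -1.

print(np.sign(X @ w + b))       # g_svm(x) = sign(x1 - x2 - 1): matches y
print(np.min(y * (X @ w + b)))  # tightest constraint value: exactly 1
print(1 / np.linalg.norm(w))    # margin 1/||w|| = 1/sqrt(2) ≈ 0.707
```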
So how do we deal with the general case? Solve this problem:
It has two characteristics:
This is quadratic programming (a QP problem, a form of convex optimization), which already has well-known solution routines, so we only need to map our problem onto the standard QP form and substitute. Finally:
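As a sketch of "hand the problem to a generic solver": below I use scipy's SLSQP (a general constrained optimizer standing in for a dedicated QP solver, since the notes don't name one) on a small toy data set of my own choosing, with variables $u = (b, w_1, w_2)$:

```python
import numpy as np
from scipy.optimize import minimize

# toy data (illustrative, not from the notes): two -1 points, two +1 points
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])

# variables u = (b, w1, w2); objective is (1/2) w^T w, b unpenalized
def objective(u):
    return 0.5 * u[1:] @ u[1:]

# hard-margin constraints y_n (w^T x_n + b) >= 1, written as g(u) >= 0
constraints = [
    {"type": "ineq", "fun": lambda u, i=i: y[i] * (X[i] @ u[1:] + u[0]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
b, w = res.x[0], res.x[1:]
print(np.round(w, 3), round(b, 3))  # expect roughly w = (1, -1), b = -1
```

A production system would use a real QP solver (cvxopt, quadprog, or libsvm internally), but the structure of the problem handed over is the same.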
For non-linear problems, just use the z-space (feature transform) from before.
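A quick illustration of the z-space idea (the data set and transform are my own, purely for illustration): points inside versus outside the unit circle are not linearly separable in x-space, but after the transform $z = (x_1^2, x_2^2)$ the separator $z_1 + z_2 = 1$ is linear:

```python
import numpy as np

X = np.array([[0.2, 0.1], [0.5, -0.3], [-0.3, 0.2],   # inside the unit circle: -1
              [1.5, 0.2], [-1.2, 1.0], [0.3, -1.4]])  # outside: +1
y = np.array([-1, -1, -1, 1, 1, 1])

Z = X ** 2                           # feature transform z = (x1^2, x2^2)
w_z, b_z = np.array([1., 1.]), -1.0  # linear separator z1 + z2 - 1 = 0 in z-space
print(np.sign(Z @ w_z + b_z))        # matches y: separable after the transform
```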
P5 1.5
Now let's look at the connection between SVM and the earlier regularization (the z-space material and so on):
As you can see, the two look almost like each other's goals and constraints swapped (regularization minimizes Ein subject to a bound on $w^T w$; SVM minimizes $w^T w$ subject to Ein = 0), so SVM is also a kind of regularization, just one that insists on Ein = 0.
When the required margin is 0 ($A_0$), this is the same as PLA. When the required width is larger, say $A_{1.126}$, any line that does not meet the rule is excluded, so $A_{1.126}$ contains fewer hypotheses than $A_0$: fewer possible dichotomies, hence a smaller (effective) VC dimension, hence better generalization.
For points inside this circle, $\rho = 0$ can shatter 3 points, so $d_{vc} = 3$. But if $\rho > \frac{\sqrt{3}}{2}$: the three points spread farthest apart on the unit circle form an equilateral triangle with side length $\sqrt{3}$, and any two points placed in opposite classes must be at least $2\rho > \sqrt{3}$ apart, so some dichotomy cannot be realized and the 3 points cannot be shattered; at this point $d_{vc} < 3$. So we have:
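A quick numeric check of the $\sqrt{3}$ claim (the specific points are my own choice): three equally spaced points on the unit circle form an equilateral triangle whose every side has length $\sqrt{3} \approx 1.732$, so a margin of $\rho > \frac{\sqrt{3}}{2}$ leaves no room to put any pair into opposite classes:

```python
import numpy as np

# three equally spaced points on the unit circle (120 degrees apart)
angles = np.array([np.pi / 2, np.pi / 2 + 2 * np.pi / 3, np.pi / 2 + 4 * np.pi / 3])
pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# all pairwise distances equal sqrt(3)
for i in range(3):
    for j in range(i + 1, 3):
        print(np.linalg.norm(pts[i] - pts[j]))
# two opposite-class points must be at least 2*rho apart;
# 2 * (sqrt(3)/2) = sqrt(3) is exactly the side length, so rho > sqrt(3)/2 fails
```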
The next lesson introduces the non-linear SVM, which combines large-margin hyperplanes with feature transforms: