Suddenly other things came up, so I feel my progress is slow O~o
Kernel Logistic Regression
P18 5.1
This lecture combines Logistic Regression with the kernel trick.
Comparing the hard-margin and soft-margin formulations: we can rewrite the slack variable ζ_n. When a point violates the margin, ζ_n > 0 records how far it falls inside; when there is no violation, ζ_n = 0. So both cases can be summarized with one max function: ζ_n = max(1 − y_n(w^T z_n + b), 0). Tidying this up, the result looks very similar to the regularization we saw before (L2-regularized error minimization). But why can't we just solve this unconstrained form directly? Because once the constraints are substituted away it is no longer a QP, so the dual and kernel tricks can't be used, and the max function is non-differentiable, which also makes it hard to optimize directly:
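Written out explicitly (a reconstruction in the lecture's notation), the rewriting above is:

```latex
% soft-margin SVM with slack variables \zeta_n:
%   \min_{b,w,\zeta}\ \tfrac{1}{2}w^Tw + C\sum_{n=1}^{N}\zeta_n
%   \text{s.t. } y_n(w^Tz_n+b) \ge 1-\zeta_n,\quad \zeta_n \ge 0
% the optimal slack is \zeta_n=\max\big(1-y_n(w^Tz_n+b),\,0\big),
% so substituting it back removes the constraints:
\min_{b,w}\ \ \tfrac{1}{2}w^Tw \;+\; C\sum_{n=1}^{N}\max\big(1-y_n(w^Tz_n+b),\,0\big)
```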
Summarizing the differences and connections between SVM and regularization: the margin is actually a kind of regularization (a constraint on w^T w), and the C of the soft-margin SVM is related to the λ of regularization: a large C corresponds to a small λ, i.e. weaker regularization.
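Putting the two objectives side by side makes the correspondence concrete:

```latex
% L2 regularization:  \min_{w}\ \frac{\lambda}{N}\,w^Tw \;+\; \frac{1}{N}\sum_{n=1}^{N}\mathrm{err}(y_n,\,w^Tz_n)
% soft-margin SVM:    \min_{b,w}\ \frac{1}{2}\,w^Tw \;+\; C\sum_{n=1}^{N}\widehat{\mathrm{err}}_n
% Both trade a w^Tw penalty against a total error; C plays the role of
% 1/\lambda, so large C <=> small \lambda <=> weaker regularization.
```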
Therefore we will consider linking SVM with other previous models.
P19 5.2
Draw err_0/1 and, on the same graph, err_SVM(s, y) = max(1 − ys, 0). It can be seen that err_SVM completely covers err_0/1, i.e. it is an upper bound on err_0/1. As with logistic regression before, once we find an upper-bound function we can minimize that bound directly to indirectly improve the original error. Moreover err_SVM is a convex upper bound; err_SVM is also called the hinge error measure.
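The upper-bound claim is easy to check numerically. A minimal sketch (the function names are mine, not the course's):

```python
import numpy as np

def err_01(s, y):
    # 0/1 error: 1 if sign(score) disagrees with the label y in {-1, +1}
    return (np.sign(s) != y).astype(float)

def err_svm(s, y):
    # hinge error measure: max(1 - y*s, 0)
    return np.maximum(1.0 - y * s, 0.0)

def err_sce(s, y):
    # scaled cross-entropy: log2(1 + exp(-y*s)), also an upper bound on err_01
    return np.log2(1.0 + np.exp(-y * s))

scores = np.linspace(-3, 3, 601)
y = 1.0
assert np.all(err_svm(scores, y) >= err_01(scores, y))  # hinge covers 0/1
assert np.all(err_sce(scores, y) >= err_01(scores, y))  # so does scaled CE
```

Both bounds touch err_0/1 exactly at the score s = 0, which is why minimizing either of them controls the 0/1 error.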
Comparing with the earlier err_SCE (the scaled cross-entropy error of logistic regression), the two are very similar, so we can put all three error measures into one comparison. We find that by solving an L2-regularized logistic regression problem, we have in fact almost obtained the solution of the SVM.
So, conversely: after solving the SVM, have we also obtained a LogReg solution?
P20 5.3
How do we integrate LogReg and SVM? The two naive methods below are each too biased toward one model and lose the advantages of the other.
We can introduce a scaling coefficient A and a constant B (used to translate the boundary) to fuse the two. Generally speaking A > 0, because w_SVM usually already does a good job and we should not run counter to its direction, and B ≈ 0, because b_SVM is generally not bad:
Tidying up, we get a new LogReg: the transform φ of the earlier LogReg is replaced by the one-dimensional φ_SVM(x) = w_SVM^T φ(x) + b_SVM, and only the two variables A and B remain to be tuned. There are two steps in total:
This is the model proposed by Platt; the general steps are as follows:
The kernel SVM thus gives an approximate optimal LogReg solution in the Z-space; the next section finds the exact optimal LogReg solution in the Z-space!
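The two-step idea can be sketched in a few lines: given the raw SVM scores, fit just A and B by gradient descent on the cross-entropy error. This is a minimal sketch of the two-parameter fit (plain gradient descent, toy scores; Platt's original uses a more careful Newton-style solver and regularized targets):

```python
import numpy as np

def platt_scaling(svm_scores, y, lr=0.1, steps=2000):
    """Fit g(x) = sigmoid(A*score + B) on fixed SVM scores; y in {-1, +1}."""
    A, B = 1.0, 0.0                      # A > 0: keep the SVM's direction; B ~ 0
    for _ in range(steps):
        s = A * svm_scores + B
        # gradient of (1/N) * sum log(1 + exp(-y*s)) w.r.t. the score s
        g = -y / (1.0 + np.exp(y * s))
        A -= lr * np.mean(g * svm_scores)  # chain rule: ds/dA = score
        B -= lr * np.mean(g)               # chain rule: ds/dB = 1
    return A, B

# toy SVM scores: positives tend to get positive scores
scores = np.array([2.0, 1.0, 0.5, -0.5, -1.0, -2.0])
y      = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
A, B = platt_scaling(scores, y)
assert A > 0   # the fitted scaling keeps the SVM's direction
```

Note that w_SVM and b_SVM are frozen: only the one-dimensional scores enter the LogReg, which is why this second step is cheap.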
P21 5.4
This section is a bit hard to understand, because it combines a lot of earlier material, some of which I had forgotten or never understood clearly, so I had to go back to Red Stone's notes for the Foundations (基石) course. After reading it a few times it felt familiar again; I'll try to summarize it here and then digest it together with Da Niu's notes hhh
Let's first review how SVM did it. SVM is a quadratic program; after taking the dual it is still a QP, and in the dual QP the kernel can be used to cut the cost of the Z-space inner products from O(d̃) down to O(d). But LogReg is not a quadratic program.
Think back to the condition for using the kernel: if w can be written as a linear combination of the z_n, then w^T z contains z^T z inner products, which the kernel can compute. Looking at the w produced by the different algorithms below, every one of them is a linear combination of the z_n:
There is a representer theorem in mathematics: for any L2-regularized linear model

min_w (λ/N) w^T w + (1/N) Σ_{n=1}^{N} err(y_n, w^T z_n),

the optimal w_* is a linear combination of the z_n, consistent with the observation above.

Simple proof (by contradiction): decompose w_* = w_∥ + w_⊥, where w_∥ lies in span{z_1, …, z_N} and w_⊥ is orthogonal to that span; we claim w_⊥ = 0. Suppose instead w_⊥ ≠ 0. For the error term: since w_⊥^T z_n = 0, we have w_*^T z_n = w_∥^T z_n, so no matter what err is, the error terms are identical for w_* and w_∥. For the regularizer in front: w_*^T w_* = w_∥^T w_∥ + 2 w_∥^T w_⊥ + w_⊥^T w_⊥ = w_∥^T w_∥ + w_⊥^T w_⊥ > w_∥^T w_∥ whenever w_⊥ ≠ 0. So w_∥ achieves a strictly smaller objective than w_*, contradicting the optimality of w_*.
In this way our w can be expressed linearly in the z_n, which means LogReg can also use the kernel; here err is the cross-entropy error built from the sigmoid. Substituting w = Σ_m β_m z_m and simplifying gives Kernel Logistic Regression (the KLR of this chapter's title). Brief summary: KLR uses the representer theorem so that the kernel can replace every z^T z; the only remaining variable is β, which is unconstrained, so any GD/SGD method can be used to solve for the optimal β.
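The whole recipe fits in a short sketch: build the kernel matrix, then run plain gradient descent on the N-dimensional β. This is a minimal illustration under my own choices (RBF kernel, fixed step size, toy 1-D data), not the course's reference implementation:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def klr_fit(X, y, lam=0.1, lr=0.5, steps=500, gamma=1.0):
    """Minimize (lam/N) * beta^T K beta
               + (1/N) * sum_n log(1 + exp(-y_n * (K beta)_n))
    by gradient descent on the unconstrained N-dimensional beta."""
    N = len(y)
    K = rbf_kernel(X, X, gamma)
    beta = np.zeros(N)
    for _ in range(steps):
        s = K @ beta                           # scores (K beta)_n
        g = -y / (1.0 + np.exp(y * s))         # d err / d s_n
        grad = (2 * lam / N) * (K @ beta) + (K @ g) / N
        beta -= lr * grad
    return beta, K

# toy 1-D data: the label is the sign of x
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
beta, K = klr_fit(X, y)
assert np.all(np.sign(K @ beta) == y)   # separates the toy training set
```

Everything happens through K: the z-vectors never appear explicitly, which is exactly the point of the representer-theorem substitution.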
KLR's expression for w can also be viewed from another perspective: each β_n is the weight on a kernel "similarity feature" K(x_n, x), so KLR is itself a linear model in β, with the kernel acting as the transform and β^T K β as a special regularizer:
(If you want to know, watch the video...)
Question: how many dimensions does this KLR linear model have?
Since the problem has been converted to study only the β_n variables, one per training example, it is N-dimensional.
Summary: (Chapter 5 is finally over!!!! The last section was so hard to read)
The next section will talk about using the kernel to do general regression~