Lin Xuantian's Notes on Machine Learning Techniques (4)

I suddenly noticed that I'm writing down more and more details? It feels like the pace has slowed down.
Lin Xuantian's Machine Learning Techniques Notes (1)
Lin Xuantian's Machine Learning Techniques Notes (2)
Lin Xuantian's Machine Learning Techniques Notes (3)

P14 4.1
How to reduce the overfitting of Gaussian SVM

  1. The feature transform φ may be too powerful
  2. You may have noticed that so far I have always insisted on separating the data perfectly, which is not necessarily the best classifier, since it gets thrown off by the inevitable noise

Recall how we dealt with noise before:
pocket: just find the line with the fewest errors.
So, like pocket, we can tolerate some errors here; combining the two ideas, we use C to weigh the errors against the large margin.
Thus we get the soft-margin SVM.
When $[y_n \neq \operatorname{sign}(w^T z_n + b)] = 0$, the point is classified correctly, and we require $y_n(w^T z_n + b) \ge 1$. When $[y_n \neq \operatorname{sign}(w^T z_n + b)] = 1$, an error occurs and gets counted, with C controlling how much the total error count weighs in the objective; in that case the constraint only requires $y_n(w^T z_n + b) \ge -\infty$, i.e., nothing at all.
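Written out in the lecture's notation, this 0/1-error version of the problem is:

$$
\begin{aligned}
\min_{b,\,w}\quad & \frac{1}{2} w^T w + C \sum_{n=1}^{N} \big[\, y_n \neq \operatorname{sign}(w^T z_n + b) \,\big] \\
\text{s.t.}\quad & y_n (w^T z_n + b) \;\ge\; 1 - \infty \cdot \big[\, y_n \neq \operatorname{sign}(w^T z_n + b) \,\big]
\end{aligned}
$$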
But written this way, the error term (the blue part in the slide) is not linear, so the problem is no longer a QP, and the dual and kernel machinery no longer apply. Also, merely counting errors treats small violations the same as large ones, which does not work well. So we introduce a linear slack variable $\zeta_n$ that records how much each point violates the margin, rather than just whether a mistake was made, and the problem becomes a QP again.
$\zeta_n$ measures the degree of violation at each point, i.e., the violation distance (the quantity in the red box in the slide). A large C means low tolerance for violations, and a small C means the opposite. Because of the added $\zeta_n$, the number of variables becomes $\tilde d + 1 + N$, and because we also add the constraints $\zeta_n \ge 0$, there are 2N constraints in total.
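For reference, the resulting soft-margin primal (the standard form from the lecture, in the notation above) is:

$$
\begin{aligned}
\min_{b,\,w,\,\zeta}\quad & \frac{1}{2} w^T w + C \sum_{n=1}^{N} \zeta_n \\
\text{s.t.}\quad & y_n (w^T z_n + b) \ge 1 - \zeta_n \quad\text{and}\quad \zeta_n \ge 0, \qquad n = 1, \dots, N
\end{aligned}
$$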


P15 4.2
Next, derive the dual problem, and then use the kernel to solve it.
As before, introduce the Lagrange multipliers α (for the margin constraints) and β (for the constraints $\zeta_n \ge 0$).
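Written out, the Lagrangian of the soft-margin primal is:

$$
\mathcal{L}(b, w, \zeta, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{n=1}^{N} \zeta_n + \sum_{n=1}^{N} \alpha_n \big( 1 - \zeta_n - y_n (w^T z_n + b) \big) + \sum_{n=1}^{N} \beta_n (-\zeta_n)
$$

with $\alpha_n \ge 0$ and $\beta_n \ge 0$.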
Take the gradient of $\mathcal{L}$ with respect to $\zeta_n$ and set it to 0; this lets us express β in terms of α ($\beta_n = C - \alpha_n$). Substituting back, the new formula obtained is actually the same as the previous hard-margin dual, except that the condition changes a little: here $0 \le \alpha_n \le C$.
This QP problem has N variables, because of the $\alpha_n$, and 2N+1 constraints: each $\alpha_n$ has a lower bound 0 and an upper bound C, which gives 2N, plus 1 for the $\sum_n y_n \alpha_n = 0$ constraint, so 2N+1 in total.
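For reference, the soft-margin dual (standard form from the lecture) is:

$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n \\
\text{s.t.}\quad & \sum_{n=1}^{N} y_n \alpha_n = 0; \qquad 0 \le \alpha_n \le C,\ n = 1, \dots, N
\end{aligned}
$$

with, implicitly, $w = \sum_n \alpha_n y_n z_n$ and $\beta_n = C - \alpha_n$.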


P16 4.3
Now we can use QP to compute α, but the value of b is still unknown: before, b was computed from the fourth KKT condition (complementary slackness), but now, because the margin is soft instead of hard, that condition also involves ζ. Compared with hard margin, to solve for b you need ζ, and to solve for ζ you need b, which is a bit like "to beat the boss you need to unlock the skill, but to unlock the skill you have to beat the boss". So define the free support vectors: points with $0 < \alpha_s < C$. For such a point, the second complementary slackness condition gives $\zeta_s = 0$, so b can be solved from it. (If there were no free SV, b would not be fixed; any value satisfying the conditions would do.) And generally there will be at least one free SV with $\alpha_s < C$ (shown in red in the slide).
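Concretely, the two complementary slackness conditions, and the resulting formula for b obtained from any free SV $(x_s, y_s)$ with $0 < \alpha_s < C$ (so $\zeta_s = 0$), are:

$$
\alpha_n \big( 1 - \zeta_n - y_n (w^T z_n + b) \big) = 0, \qquad (C - \alpha_n)\, \zeta_n = 0
$$

$$
\Rightarrow\quad b = y_s - w^T z_s = y_s - \sum_{\text{SV}} \alpha_n y_n K(x_n, x_s)
$$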
Although the soft margin now offers some protection, the value of C cannot be chosen arbitrarily, otherwise the model may still overfit.
Now the points in the figure can be divided into three types, according to where $\alpha_n$ falls between 0 and C:
①: non-SV ($\alpha_n = 0$): by the second complementary slackness condition, $\zeta_n = 0$, so the point commits no violation; but $\alpha_n = 0$ means it is not an SV, so most of these points lie outside the fat boundary (a very few sit exactly on it).
②: free SV ($0 < \alpha_n < C$): again by the second condition, $\zeta_n = 0$, so there is no violation; but these points are SVs, so they sit exactly on the boundary. Since $\zeta_n = 0$, b can be computed from the first condition.
③: bounded SV ($\alpha_n = C$): these points are inside the fat boundary (a very few are exactly on it); here $\zeta_n$ records their degree of violation, i.e., the distance from the point to the boundary.
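As a small add-on (not from the original post), here is a Python/NumPy sketch that buckets points into these three types given the solved α and C; the tolerance `tol` is just an assumed value for numerical comparisons:

```python
import numpy as np

def categorize_points(alpha, C, tol=1e-8):
    """Bucket training points by their dual variable alpha_n after solving the soft-margin dual.

    alpha : array of shape (N,) with the solved alpha_n
    C     : the soft-margin trade-off parameter
    Returns a dict mapping each type to the indices of the points in it.
    """
    alpha = np.asarray(alpha)
    non_sv = np.where(alpha <= tol)[0]                        # alpha_n = 0: no violation, mostly away from the boundary
    free_sv = np.where((alpha > tol) & (alpha < C - tol))[0]  # 0 < alpha_n < C: zeta_n = 0, exactly on the boundary
    bounded_sv = np.where(alpha >= C - tol)[0]                # alpha_n = C: zeta_n >= 0 records the violation
    return {"non-SV": non_sv, "free SV": free_sv, "bounded SV": bounded_sv}
```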


P17 4.4
You can use cross-validation (CV) to check whether the chosen parameters are good, and pick the parameter set with the smallest CV error.
For leave-one-out cross-validation on the SVM (I don't remember exactly what that is — click here; roughly, the training set has N−1 points and the test set has the remaining 1, i.e., k=1), there is a proof that $E_{\text{loocv}} \le \#\text{SV}/N$: a non-SV has leave-one-out error 0, i.e., points away from the boundary contribute nothing to the SVM's error, while each SV contributes an error of at most 1, so $E_{\text{loocv}} \le \#\text{SV}/N$.
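In symbols, the argument is just:

$$
E_{\text{loocv}} = \frac{1}{N} \sum_{n=1}^{N} e_n, \qquad e_n = 0 \text{ if } (x_n, y_n) \text{ is a non-SV}, \quad e_n \le 1 \text{ if it is an SV} \quad\Rightarrow\quad E_{\text{loocv}} \le \frac{\#\text{SV}}{N}
$$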
Then #SV / N can be used as a quick screen to throw out models with a high ratio, since too many SVs may mean the model is not very good; after this rough screening, CV can be done on the remaining models.
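A small sketch of this workflow (my own, using scikit-learn rather than anything from the course; the grid of C and γ values and the 0.5 SV-ratio threshold are just assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_gaussian_svm(X, y, Cs=(0.1, 1, 10), gammas=(0.1, 1, 10), max_sv_ratio=0.5, cv=5):
    """Screen (C, gamma) pairs by #SV/N, then pick the survivor with the best CV accuracy."""
    best, best_score = None, -np.inf
    for C in Cs:
        for gamma in gammas:
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
            sv_ratio = clf.n_support_.sum() / len(y)   # #SV / N, an upper bound on E_loocv
            if sv_ratio > max_sv_ratio:                # quick screen: too many SVs -> likely a poor model
                continue
            score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=cv).mean()
            if score > best_score:
                best, best_score = (C, gamma), score
    return best, best_score
```

Usage would be something like `best, score = select_gaussian_svm(X_train, y_train)`, then training the final model with the returned (C, γ).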
Summary: soft-margin SVM tolerates violations via slack variables $\zeta_n$, with C trading off the large margin against violations; its dual is the same QP as the hard-margin one except that $0 \le \alpha_n \le C$; the values of $\alpha_n$ split the points into non-SVs, free SVs, and bounded SVs; and C (together with the kernel parameters) is chosen by cross-validation, with #SV/N as a quick safety check.


Origin blog.csdn.net/Only_Wolfy/article/details/89537802