Text Classification Learning (VII): Prelude to SVM, Structural Risk Minimization and VC Dimension Theory

Foreword:

After extracting features from the texts and testing with the LibSvm toolkit, the SVM algorithm turned out to work quite well, so I started working through the principles of SVM one by one.

SVM is built on the theory of structural risk minimization and the VC dimension, so this article only introduces this theoretical foundation of SVM.

 

 

Content:

 

1. Generalization error bound

The ability and performance of a machine learning method are measured by its generalization error bound. The so-called generalization error refers to the prediction error of the learned model on data outside the training set, i.e., on the test set. Traditional machine learning seeks to minimize the prediction error on the training set (the empirical risk, discussed in detail below); but when the resulting model is then used to predict the texts of the test set, it can fail miserably. That is poor generalization performance, and the generalization error bound is a limit on this error, which will be explained below.

Machine learning actually builds a predictive model to approximate the real, unknown model. There is inevitably an error (risk) between the learned model and the real model, and this risk can of course be quantified.

Suppose we have l observations, each consisting of a pair: a vector x_i ∈ R^n (n-dimensional space), i = 1, 2, 3, ..., l, together with the label y_i corresponding to that vector. In two-class text classification, as mentioned earlier, x_i is the feature vector of the i-th text and y_i ∈ {+1, -1} indicates which of the two categories it belongs to.

We then assume that there is an unknown distribution P(X, Y) governing the mapping from X to Y: given an X, there is a corresponding Y determined by this distribution. At the same time, machine learning gives us a function f(x, α) which maps an input x to a predicted label y, where α denotes the parameters of the learning machine; for example, in a neural network with a fixed architecture, α represents the weights and biases. The risk (error) of this learning machine is then:

$$R(\alpha) = \int \frac{1}{2}\,\lvert y - f(x,\alpha)\rvert \, dP(x,y)$$

R(α) is called the expected risk; we can also call it the true risk, because it measures the error between the learning machine and the actual data it will be tested on, not just the training set.

The error between the learning machine and the training data set also has a name, the empirical risk, denoted R_emp(α):

$$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} \lvert y_i - f(x_i,\alpha)\rvert$$

 

Here, given the parameters α and the training set {x_i, y_i}, R_emp(α) is a fixed number. Now choose some η with 0 ≤ η ≤ 1. Then, with probability 1 - η, the expected risk R(α) has an upper bound that can be expressed as (Vapnik, 1995):

$$R(\alpha) \le R_{emp}(\alpha) + \Phi(h/l)$$

Φ(h/l) is called the confidence risk (VC confidence), h is called the VC dimension (Vapnik-Chervonenkis dimension), and l is the number of samples. The formula for Φ(h/l) is as follows:

$$\Phi(h/l) = \sqrt{\frac{h\left(\ln(2l/h) + 1\right) - \ln(\eta/4)}{l}}$$

 

     

It can be seen from the above formula that, in order to minimize the true risk, statistical learning must minimize the upper bound of R(α), that is, minimize Φ(h/l) + R_emp(α); this is what is meant by minimizing the structural risk. Φ(h/l) + R_emp(α) is called the structural risk; it is an actual bound, the upper limit of R(α), which in statistics is the generalization error bound. So how do we minimize the structural risk? That is where the VC dimension theory below comes in. (Note: there is no need to dwell on the derivation of the formula above; it is taken directly from the paper.)
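As a small illustration, here is a minimal sketch in Python of how this bound is assembled; the labels, predictions, h, l and η below are hypothetical example values, not numbers from this article.

```python
import math

import numpy as np


def empirical_risk(y_true, y_pred):
    """R_emp(alpha) = (1 / 2l) * sum_i |y_i - f(x_i, alpha)| for labels in {+1, -1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.abs(y_true - y_pred).sum() / (2 * len(y_true))


def vc_confidence(h, l, eta):
    """Phi(h/l) = sqrt((h * (ln(2l/h) + 1) - ln(eta / 4)) / l)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


# Hypothetical example: 8 training labels and the classifier's outputs on them.
y_true = [+1, +1, -1, +1, -1, -1, +1, -1]
y_pred = [+1, -1, -1, +1, -1, -1, +1, +1]   # two mistakes -> R_emp = 0.25

r_emp = empirical_risk(y_true, y_pred)
phi = vc_confidence(h=3, l=len(y_true), eta=0.05)   # h = 3: straight lines in the plane
print(f"R_emp = {r_emp:.2f}, Phi = {phi:.2f}, bound = {r_emp + phi:.2f}")
# With so few samples the confidence risk dominates and the bound is loose (> 1),
# which is why the bound only becomes meaningful for large l.
```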

 

2. VC dimension

As mentioned above, the h in the confidence risk Φ(h/l) is called the VC dimension. So what is the VC dimension? Let's look at an example:

Suppose we have l points, and each point gets a label y ∈ {+1, -1}; labeling the l points as +1 or -1 in every possible way gives 2^l labelings. If, for every one of these 2^l labelings, we can find a function in the function set {f(x, α)} (where f(x, α) is the classification function obtained by the learning machine described above) that reproduces that labeling, we say the function set shatters those l points. The VC dimension of the function set {f(x, α)} is then the maximum number of data points it can shatter ("shatter" being the jargon for realizing every possible labeling). To make this concrete:

 

On the two-dimensional coordinate plane, take 3 (non-collinear) points and label them into 2 categories; there are 2^3 = 8 possible labelings, and for every one of them we can find a straight line that separates the points by category, so the 3 points are shattered. But with 4 points, there are labelings for which no straight line can put the two categories on opposite sides. This shows that in two-dimensional space, the VC dimension of the set of straight lines is 3.
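To make the shattering check concrete, here is a minimal sketch, assuming NumPy and a recent SciPy are available; the point coordinates and the helper names linearly_separable and shattered are my own illustration, not from the article. It brute-forces every ±1 labeling and tests each one for linear separability with a small feasibility LP.

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog


def linearly_separable(points, labels):
    """True if some line w.x + b = 0 separates the labeled points.

    Separability is tested as feasibility of y_i * (w.x_i + b) >= 1
    over the variables (w1, w2, b), written as a linear program."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])  # -y_i * (x_i, 1)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success


def shattered(points):
    """A point set is shattered by lines if every +/-1 labeling is separable."""
    return all(linearly_separable(points, labels)
               for labels in product([1, -1], repeat=len(points)))


three = [(0, 0), (1, 0), (0, 1)]          # a non-collinear triple
four = [(0, 0), (1, 1), (1, 0), (0, 1)]   # contains the XOR labeling
print(shattered(three))  # True:  all 8 labelings separable -> VC dim >= 3
print(shattered(four))   # False: the XOR labeling cannot be split by one line
```

The first check confirms that 3 non-collinear points can be shattered by straight lines, while the 4-point configuration cannot, matching the claim that the VC dimension of lines in the plane is 3.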

 

The VC dimension of the set of oriented hyperplanes in the n-dimensional space R^n is n+1. The proof is based on the following theorem:

Given a set of m points in the n-dimensional space R^n, choose any one of them as the origin; then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining m-1 points are linearly independent.

To apply this: in R^n we can always select n+1 points such that, taking one of them as the origin, the position vectors of the remaining n points are linearly independent, so these n+1 points can be shattered by the set of hyperplanes in n-dimensional space (and no set of n+2 points can be, since n+1 vectors in R^n are never linearly independent).

 

The higher the VC dimension, the higher the complexity (capacity) of the function set; and the more complex the function set, the higher its VC dimension tends to be. Such a function set may classify every text in the training set correctly, that is, minimize the empirical risk, and yet, as mentioned earlier, still make a mess of samples outside the training set. The reason is that the VC dimension is too high. Why? Because the confidence risk has been ignored:

$$\Phi(h/l) = \sqrt{\frac{h\left(\ln(2l/h) + 1\right) - \ln(\eta/4)}{l}}$$

Taking η = 0.05 and l = 10,000 and plotting Φ(h/l) against h (figure omitted), it can be seen that the higher the VC dimension h, the higher the confidence risk: for any fixed l, the confidence risk is an increasing function of h.
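A quick numeric check of this claim, with the same η = 0.05 and l = 10,000 used above (the particular values of h are just illustrative):

```python
import math

# Confidence risk Phi(h/l) for eta = 0.05 and l = 10,000: it grows with h.
eta, l = 0.05, 10_000
for h in (10, 100, 1_000, 5_000, 10_000):
    phi = math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)
    print(f"h = {h:>6}  Phi(h/l) = {phi:.3f}")
```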

In general, therefore, the higher the VC dimension, the higher the confidence risk of the learning machine and the worse its generalization ability. However, this is not an absolute rule; there is a counterexample:

Counterexample:

Consider the k-nearest-neighbour algorithm with k = 1, i.e., using only a single nearest neighbour: every training sample is classified correctly (each point is its own nearest neighbour), so the empirical risk is 0, and the VC dimension of this classifier is infinite, yet its generalization ability can still be very good.

Of course, when the VC dimension is infinite, the generalization error bound given above is of no practical use; the point is simply that a very high, even infinite, VC dimension is not necessarily bad.
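A small sketch of the zero-empirical-risk part of this claim, assuming scikit-learn is available; the random data below is purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A 1-nearest-neighbour classifier memorises the training set: each training
# point is its own nearest neighbour, so the empirical risk is 0 even for
# completely random labels, although the classifier's VC dimension is infinite.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.choice([-1, 1], size=200)

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("training accuracy:", clf.score(X, y))  # 1.0 -> empirical risk R_emp = 0
```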

 

3. SRM: Structural Risk Minimization

As mentioned in Section 1, minimizing Φ(h/l) + R_emp(α) is what minimizing the structural risk means, i.e., SRM (Structural Risk Minimization).

The specific idea is:

Divide the function set into a sequence of nested subsets, ordered by VC dimension. Within each subset, find the minimum empirical risk; then trade off across the subsets, choosing the one that minimizes the sum of empirical risk and confidence risk, thereby obtaining the minimum generalization error bound.
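Here is a minimal sketch of that selection rule; the VC dimensions and per-subset empirical risks listed below are hypothetical placeholders, not values from this article:

```python
import math


def vc_confidence(h, l, eta=0.05):
    """The same confidence risk term Phi(h/l) as in Section 1."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


l = 10_000  # number of training samples
# Nested subsets S_1 in S_2 in ...: (VC dimension, minimum empirical risk found inside).
subsets = [(5, 0.20), (50, 0.11), (500, 0.08), (5_000, 0.07)]

# SRM picks the subset whose empirical risk + confidence risk (the structural
# risk, i.e. the generalization error bound) is smallest.
best_h, best_remp = min(subsets, key=lambda s: s[1] + vc_confidence(s[0], l))
print(f"SRM selects the subset with VC dimension {best_h}; "
      f"bound = {best_remp + vc_confidence(best_h, l):.3f}")
```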

SVM is one of the better algorithms for realizing structural risk minimization. Exactly where this is reflected, I will point out later when we come to it.

 
