SVM Learning: Statistical Learning Theory

       Statistical learning theory is profound; understanding it thoroughly takes real effort and draws on many areas of mathematics (functional analysis, calculus, probability theory, statistics, and so on). What follows is a tidy overview of some of its basic concepts and theorems.

       Suppose there is an unknown system S, a space of input samples X, and the outputs Y produced when S processes those inputs. The machine learning process can be viewed as follows: from X and Y we obtain a learning machine (also called a model); when the trained machine receives a test sample X' outside the training set, its output Y' can be regarded as an approximation of the output that the unknown system S would produce for input X'. The learning machine can therefore be seen as an approximation of the law inherent in S.

       In fact, the input samples (vectors) x can be seen as drawn independently of one another from an objectively existing but unknown probability distribution F(x); the outputs Y that S generates from these x obey F(y|x). The learner is a set of functions f(x,\beta), where \beta is a set of parameters; for example, the set of linear classifiers is f(x,w,b), and different values of the parameters (w,b) give different members of the set. The learning process then becomes: from this function set, find the function that best approximates the mapping from inputs to outputs. The input x and output y jointly obey the distribution F(x,y)=F(x)F(y|x); that is, all training and test pairs (x,y) are samples drawn independently from F(x,y).

        So how do we measure whether an approximation is the best one? We need to define a loss function Loss(f(x,\beta),y), which, for input x, measures the difference between the learner's output f(x,\beta) and the system S's output y.

Recall the mathematical expectation of a function of a continuous random variable: let the continuous random variable \zeta have probability density \varphi(x); if \eta = f(\zeta), then the mathematical expectation of the random variable \eta is defined as:

                                                                 E\eta = Ef(\zeta) = \int_{-\infty}^{+\infty} f(x)\varphi(x)\,dx = \int_{-\infty}^{+\infty} f(x)\,dF(x)

With this concept we can write the mathematical expectation of the loss:

                                                                 R(\beta) = E[Loss(f(x,\beta),y)] = \int Loss(f(x,\beta),y)\,dF(x,y)

Here R(\beta) is the risk functional, also called the expected risk. Note that the samples x and y are known while F(x,y) is unknown, and which function f(x,\beta) to use, determined by the choice of \beta, is what we must find.

The learning process can now be described as using empirical data (our samples (x,y)) to minimize the risk functional R(\beta). Obviously we cannot know F(x,y), so we need a substitute: R_{emp}(\beta)=\frac{1}{n}\sum\limits_{i=1}^{n}Loss(f(x_i,\beta),y_i). The learning process thus becomes minimizing this empirical risk functional as an approximation to minimizing the true risk functional; this principle is the legendary empirical risk minimization (ERM) induction principle. Two methods that embody ERM: least squares for regression (taking (y-f(x,\beta))^2 as the loss function) and maximum likelihood for probability density estimation (taking -\ln p(x;\beta) as the loss function).
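To make the ERM principle concrete, here is a minimal sketch (not from the original post; the data, model, and names such as `empirical_risk` are my own): minimizing the empirical risk under squared loss over a linear model is exactly least squares.

```python
import numpy as np

def empirical_risk(beta, X, y):
    """R_emp(beta) = (1/n) * sum of squared losses for a linear model X @ beta."""
    return np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # design matrix [1, x]
true_beta = np.array([0.5, 2.0])
y = X @ true_beta + rng.normal(0, 0.1, 50)                  # noisy outputs of the "system S"

# ERM with squared loss: the empirical-risk minimizer is the least-squares solution.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(empirical_risk(beta_hat, X, y) <= empirical_risk(true_beta, X, y))  # True
```

Since `beta_hat` minimizes the empirical risk exactly, no other parameter value (including the true one) can achieve a smaller value on the training sample.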

        The learning problem can be stated more generally as follows: there is a probability distribution F(z) on a space Z; we write z in place of (x,y) and use (z_1,z_2,\ldots,z_l) to denote independent, identically distributed samples; writing the loss as L(z,\beta), the risk functional becomes:

                                          R(\beta)=\int L(z,\beta)\,dF(z), where \beta is a set of parameters

The empirical risk functional is then:

                                          R_{emp}(\beta)=\frac{1}{l}\sum\limits_{i=1}^l L(z_i,\beta)

Our final learner minimizes R_{emp} in order to approximate the minimizer of R over the function set L(z,\beta).

1. Consistency of the learning process

        The material above is all preparation for the following definitions and theorems, which are the cornerstones of learning theory. I will explain what I understand and "borrow" a few classic figures along the way.

Definition 1: The ERM principle is consistent for the function set L(z,\beta) and the probability distribution F(z) if the following two sequences converge in probability to the same limit:

                                          R(\beta_l) \xrightarrow[l\to\infty]{P} \inf\limits_{\beta} R(\beta)

                                          R_{emp}(\beta_l) \xrightarrow[l\to\infty]{P} \inf\limits_{\beta} R(\beta)

where \beta ranges over the parameter set and \beta_l minimizes the empirical risk on l samples.

My understanding of this definition: when the condition holds, learning under the ERM principle is guaranteed, as the number of training samples tends to infinity, to drive the expected risk to its minimum, so the learner best imitates the unknown system S. Equivalently, ERM provides a sequence of functions L(z,\beta_l), l=1,2,\ldots, along which both the expected risk and the empirical risk converge to the minimal possible risk. There is, however, a problem with this definition: it may happen that consistency holds only because of one special function in the set, and removing that function destroys consistency. Consistency determined by a special function in this way is clearly not what we want; it is called trivial consistency, and truly meaningful learning should rest on non-trivial consistency of the whole function set. Here is a figure:

image

To rule out trivial consistency, it suffices to modify the definition.

Definition 2: For the function set L(z,\beta), consider every non-empty subset of the form:

                                          \Lambda(c) = \{\beta : \int L(z,\beta)\,dF(z) > c\}, \quad c \in (-\infty,+\infty)

If for every such subset

                                          \inf\limits_{\beta\in\Lambda(c)} R_{emp}(\beta) \xrightarrow[l\to\infty]{P} \inf\limits_{\beta\in\Lambda(c)} R(\beta)

holds, the ERM principle is said to be non-trivially consistent for the function set L(z,\beta) and the distribution F(z).

This definition is equivalent to requiring that convergence still holds after the functions with the smallest risks are removed from the set.

       The following result is the legendary key theorem of learning theory, due to Vapnik and Chervonenkis. It says that the consistency of the ERM principle is determined by the worst function in the set; that is, any analysis based on the ERM principle is a "worst-case analysis", and a learning theory that could "analyze the real situation" while still resting on the ERM principle is, in principle, impossible.

Theorem 1: Suppose the function set L(z,\beta) satisfies A \leq \int L(z,\beta)\,dF(z) \leq B. Then a necessary and sufficient condition for consistency of the ERM principle is that the empirical risk R_{emp}(\beta) converge to the actual risk R(\beta) uniformly over the function set L(z,\beta) in the following sense:

                                         \lim\limits_{l\to\infty}P\left(\sup\limits_{\beta}\left(R(\beta)-R_{emp}(\beta)\right)>\epsilon\right)=0, \quad \forall \epsilon>0

where \beta ranges over the parameter set. (From this formula one can see that this is one-sided uniform convergence, also called one-sided consistency.)
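A minimal simulation of the uniform convergence in Theorem 1, under toy choices that are entirely my own (a small set of threshold classifiers 1[x > t], uniform inputs, labels 1[x > 0] with 10% noise, 0/1 loss): the supremum of R - R_emp over the set shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(-1.0, 1.0, 21)  # hypothetical function set: f_t(x) = 1[x > t]

def true_risk(t):
    """Exact expected 0/1 risk of f_t when x ~ U(-1,1) and y = 1[x > 0] flipped w.p. 0.1."""
    return 0.1 + 0.8 * (abs(t) / 2)

def sup_deviation(l):
    """One draw of sup_t (R(t) - R_emp(t)) over the threshold set, on l samples."""
    x = rng.uniform(-1.0, 1.0, l)
    y = (x > 0) ^ (rng.random(l) < 0.1)  # noisy labels
    return max(true_risk(t) - np.mean((x > t) != y) for t in thresholds)

small = np.mean([sup_deviation(50) for _ in range(200)])
large = np.mean([sup_deviation(5000) for _ in range(200)])
print(small, large)  # the sup deviation shrinks as l grows
```

Averaging over 200 draws, the sup deviation at l = 5000 is far smaller than at l = 50, which is the behavior the theorem's condition requires in the limit.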

2. Convergence rate of the learning process

        A few basic concepts: let N(z_1,z_2,\ldots,z_l) denote the number of different ways the sample z_1,z_2,\ldots,z_l can be classified by the functions of an indicator-function set L(z,\beta) (an indicator function takes only the two values 0 and 1).

Definition 3: The random entropy H(z_1,z_2,\ldots,z_l)=\ln N(z_1,z_2,\ldots,z_l) describes the diversity of the function set on the given data set; it is a random quantity, since it is computed on an i.i.d. sample.

Definition 4: The VC entropy is the mathematical expectation of the random entropy over the distribution F(z_1,z_2,\ldots,z_l): H(l)=E\ln N(z_1,z_2,\ldots,z_l).
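As an illustrative sketch (my own toy example, not from the original): for the set of threshold indicator functions 1[x > t] on the line, N(z_1,...,z_l) can be counted by brute force. It equals l + 1, so the entropy rate ln(N)/l tends to 0 as l grows.

```python
import numpy as np

def num_labelings(x):
    """Brute-force N(z_1,...,z_l): distinct 0/1 labelings produced by thresholds 1[x > t]."""
    xs = np.sort(x)
    # one candidate threshold below all points, one between each adjacent pair, one above all
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    return len({tuple(x > t) for t in candidates})

rng = np.random.default_rng(0)
for l in (5, 50, 500):
    N = num_labelings(rng.uniform(0.0, 1.0, l))
    print(l, N, np.log(N) / l)  # N = l + 1, so the rate ln(N)/l tends to 0
```

Because the points are distinct (almost surely, for a continuous distribution), each midpoint threshold produces a new labeling, giving exactly l + 1 labelings.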

        Theorem 1 describes one-sided uniform convergence for ERM; it is natural to ask when two-sided convergence holds, namely:

                                        \lim\limits_{l\to\infty}P\left(\sup\limits_{\beta}\left|R(\beta)-R_{emp}(\beta)\right|>\epsilon\right)=0, \quad \forall \epsilon>0, \ \beta in the parameter set

Theorem 2: A necessary and sufficient condition for two-sided uniform convergence of the learning process for indicator functions is:

                                        \lim\limits_{l\to\infty}\frac{H(l)}{l}=0

        Because the two-sided condition is equivalent to the pair \lim_{l\to\infty}P(\sup_{\beta}(R(\beta)-R_{emp}(\beta))>\epsilon)=0 and \lim_{l\to\infty}P(\sup_{\beta}(R_{emp}(\beta)-R(\beta))>\epsilon)=0, Theorem 2 is in fact a sufficient condition for one-sided consistency.

         Next, two new concepts are built on top of N(z_1,z_2,\ldots,z_l), and then the legendary three milestones of learning theory are summarized.

Definition 5: The annealed VC entropy, H'(l)=\ln E\,N(z_1,z_2,\ldots,z_l).

Definition 6: The growth function, G(l)=\ln\left(\sup\limits_{z_1,z_2,\ldots,z_l} N(z_1,z_2,\ldots,z_l)\right).

Milestone 1: a sufficient condition for ERM consistency, which every learner that minimizes empirical risk must satisfy:

                                       \lim\limits_{l\to\infty}\frac{H(l)}{l}=0

Milestone 2: a sufficient condition for fast convergence; it guarantees a fast asymptotic rate of convergence:

                                       \lim\limits_{l\to\infty}\frac{H'(l)}{l}=0

Milestone 3: a necessary and sufficient condition for a learner following the ERM principle to have a fast asymptotic rate of convergence on any problem:

                                       \lim\limits_{l\to\infty}\frac{G(l)}{l}=0

          Now it is the turn of the legendary VC dimension, which is tied to the growth function by yet another theorem.

Theorem 3: Every growth function either satisfies the equality

                                                 G(l)=l\,\ln 2

or is bounded above as follows:

                                       G(l)\leq h\left(\ln\frac{l}{h}+1\right), where h is the integer such that G(h)=h\ln 2 and G(h+1)<(h+1)\ln 2

Put plainly: a growth function is either linear or bounded above by a logarithmic function.

Definition 7: If a sample set of h samples can be separated into two classes by some indicator-function set in all 2^h possible ways, the function set is said to shatter that sample set. For any indicator-function set, the size of the largest sample set it can shatter is its VC dimension. From the definition of the growth function it follows that if an indicator-function set has a linear growth function, its VC dimension is infinite, while if its growth function is bounded as in Theorem 3 with parameter h, its VC dimension is h. This is the important relation between the growth function and the VC dimension.

To make the VC dimension intuitive, here is an example. In two-dimensional space, let the indicator-function set L(z,\beta) consist of linear functions. For three points A, B, C, assigning them to the two classes 0 and 1 gives 2^3=8 possible labelings: {A→0, BC→1}, {B→0, AC→1}, {C→0, AB→1}, {A→1, BC→0}, {B→1, AC→0}, {C→1, AB→0}, {ABC→1}, {ABC→0}.

image

Add a fourth point D and there should be 2^4=16 labelings; I will not list them all. Notice anything? No linear function can put B and C in one class and A and D in the other. (This is just like a single-layer perceptron being unable to solve the XOR problem.)

image

So in two-dimensional space the VC dimension of linear functions is 3; more generally, in n-dimensional space the VC dimension of linear functions is always n+1.
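The shattering argument above can be checked by brute force. This is a hypothetical sketch of my own: it enumerates all labelings of a point set and tests linear separability with the perceptron algorithm, using an update cap as a stand-in for "not separable" (the perceptron is guaranteed to converge on separable data).

```python
import itertools
import numpy as np

def separable(points, labels, max_updates=10_000):
    """Perceptron check: True if some linear function w.x + b realizes the labeling."""
    X = np.column_stack([points, np.ones(len(points))])  # append bias coordinate
    y = np.where(labels, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = [(xi, yi) for xi, yi in zip(X, y) if yi * (w @ xi) <= 0]
        if not mistakes:
            return True
        xi, yi = mistakes[0]
        w += yi * xi  # standard perceptron update
    return False      # hit the cap: treat as not linearly separable

def shattered(points):
    """True if linear classifiers realize every one of the 2^n labelings of the points."""
    pts = np.asarray(points, dtype=float)
    return all(separable(pts, np.array(lab, dtype=bool))
               for lab in itertools.product([0, 1], repeat=len(pts)))

three = [(0, 0), (1, 0), (0, 1)]             # non-collinear: shattered
four = [(0, 0), (1, 1), (0, 1), (1, 0)]      # the 'XOR' labeling blocks shattering
print(shattered(three), shattered(four))     # True False
```

Three non-collinear points pass; four points fail because of the XOR labeling, consistent with a VC dimension of 3 for lines in the plane.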

        Unfortunately, there is as yet no unified theoretical framework for computing the VC dimension of an arbitrary indicator-function set, but this does not diminish its importance: the VC dimension effectively measures the learning capacity of the function set.

3. Controlling the generalization ability of the learning process
       
The goal of the theory of controlling generalization ability is to construct an induction principle for minimizing the risk functional from a small sample of training examples, where "small sample" means that the ratio \frac{l}{h} is small, for example \frac{l}{h} < 20.

        On the generalization ability of a learner, the following is known.

Assume the real-valued function set L(z,\beta) satisfies A \leq L(z,\beta) \leq B (A may be -\infty and B may be +\infty), and introduce the notation \epsilon = 4\,\frac{G^{A,B}(2l)-\ln(\eta/4)}{l}. Three cases are then distinguished:

Case 1: L(z,\beta) is a totally bounded function set (A \leq L(z,\beta) \leq B). Then with probability at least 1-\eta the following inequality holds simultaneously for all functions of L(z,\beta) (including the one minimizing the empirical risk):

                                      R(\beta)\leq R_{emp}(\beta)+\frac{B-A}{2}\sqrt{\epsilon}

(For a two-class problem where L(z,\beta) takes only the values 0 and 1, this becomes R(\beta)\leq R_{emp}(\beta)+\frac{1}{2}\sqrt{\epsilon}.)

Case 2: L(z,\beta) is a totally bounded non-negative function set (0 \leq L(z,\beta) \leq B). Then with probability at least 1-\eta the following inequality holds simultaneously for all functions of L(z,\beta) (including the one minimizing the empirical risk):

                                       R(\beta)\leq R_{emp}(\beta)+\frac{B\epsilon}{2}\left(1+\sqrt{1+\frac{4R_{emp}(\beta)}{B\epsilon}}\right)

Case 3: L(z,\beta) is an unbounded non-negative function set (0 \leq L(z,\beta)). In this case no inequality describing the generalization ability of the learner can be derived without a further assumption:

                                       \sup\limits_{\beta}\frac{\left(\int L^p(z,\beta)\,dF(z)\right)^{1/p}}{\int L(z,\beta)\,dF(z)}\leq \tau <\infty, \quad \text{where } p>1

Then with probability at least 1-\eta the following inequality holds simultaneously for all functions satisfying the assumption above (including the one minimizing the empirical risk):

                                       R(\beta) \leq \frac{R_{emp}(\beta)}{\left(1-\beta(p)\tau\sqrt{\epsilon}\right)_+}, \quad \text{where } \beta(p)=\sqrt[p]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}

For the cases above, if L(z,\beta) contains infinitely many functions and has finite VC dimension h, then:

                                       \epsilon=4\,\frac{h\left(\ln\frac{2l}{h}+1\right)-\ln\frac{\eta}{4}}{l}

If L(z,\beta) contains N elements, then:

                                       \epsilon=2\,\frac{\ln N-\ln\eta}{l}
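To see how the Case-1 bound for 0/1 loss behaves numerically, here is a sketch with made-up numbers (the helper names, the 5% empirical error, and the choice h = 3 are all mine), plugging the finite-VC-dimension expression for \epsilon into R \leq R_emp + (1/2)\sqrt{\epsilon}:

```python
import math

def epsilon_vc(l, h, eta):
    """epsilon = 4 * (h*(ln(2l/h) + 1) - ln(eta/4)) / l  (infinite class, VC dimension h)."""
    return 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l

def risk_bound(r_emp, l, h, eta):
    """0/1-loss Case-1 bound: R <= R_emp + (1/2) * sqrt(epsilon), w.p. >= 1 - eta."""
    return r_emp + 0.5 * math.sqrt(epsilon_vc(l, h, eta))

# Hypothetical numbers: VC dimension 3 (lines in the plane), 5% empirical error.
for l in (1_000, 10_000, 100_000):
    print(l, round(risk_bound(0.05, l, 3, eta=0.05), 4))  # bound tightens as l/h grows
```

The bound always exceeds the empirical risk, and the gap (the confidence risk) shrinks as the ratio l/h grows.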

The point of all of the above is to work out what the actual risk consists of under the framework of statistical learning theory:

                                      R(\beta)\leq R_{emp}(\beta)+\Phi\left(\frac{l}{h}\right)

The first part is the familiar empirical risk; the second is called the confidence risk.

This inequality is quite exciting. What does it tell us?

Let's sketch it roughly with R:

h <- 10                      # fix the VC dimension
l <- seq(10, 2000, by = 1)   # number of samples
x <- l / h
eta <- 0.25
e <- 4 * (log(2 * x) + 1) / x - (4 / l) * log(eta / 4)
plot(x, e, type = "l", xlab = "l/h", ylab = "epsilon")

image

       When \frac{l}{h} is relatively large, \epsilon is relatively small, so \Phi(\frac{l}{h}) is small; the actual risk is then close to the empirical risk, and a small empirical risk guarantees a small expected risk. Conversely, when \frac{l}{h} is small, a small empirical risk cannot guarantee a small expected risk.

       If instead the number of samples is fixed, look at how the confidence risk varies with the VC dimension:

l <- 200
h <- seq(1, 100, by = 1)
x <- l / h
eta <- 0.25
e <- 4 * (log(2 * x) + 1) / x - (4 / l) * log(eta / 4)
plot(h, e, type = "l", xlab = "h", ylab = "epsilon")

image

            This shows that for a fixed number of samples, the larger the VC dimension, the larger the confidence risk; minimizing the empirical risk then does not guarantee a small expected risk, and the resulting learner generalizes poorly. This also explains why a more complex learner is more prone to overfitting.
       In R(\beta)\leq R_{emp}(\beta)+\Phi(\frac{l}{h}), the right-hand side is called the structural risk, and minimizing it is structural risk minimization (SRM). From the analysis above we can see what the real objective of statistical learning is: to turn the learning process from one that follows the ERM induction principle into one that follows the SRM induction principle. The SVM algorithm is precisely an implementation of this principle.
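The SRM idea can be sketched as a model-selection rule: among nested classes of increasing VC dimension, pick the one minimizing the guaranteed (structural) risk R_emp + \Phi(l/h). The empirical risks below are invented numbers for illustration, and all names are my own.

```python
import math

def confidence_term(l, h, eta=0.05):
    """Phi(l/h) = (1/2)*sqrt(4*(h*(ln(2l/h)+1) - ln(eta/4))/l), the 0/1-loss confidence risk."""
    eps = 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l
    return 0.5 * math.sqrt(eps)

l = 1_000
# Hypothetical nested classes: richer classes (larger h) fit the training data better.
candidates = [  # (VC dimension h, empirical risk R_emp) -- illustrative numbers only
    (2, 0.20), (5, 0.10), (10, 0.07), (50, 0.05), (200, 0.04),
]
guaranteed = [(h, r + confidence_term(l, h)) for h, r in candidates]
best_h, best_bound = min(guaranteed, key=lambda t: t[1])
print(best_h, round(best_bound, 3))  # SRM picks an intermediate complexity, not the best fit
```

Even though the richest class has the smallest empirical risk, its confidence risk dominates, so SRM selects a class of intermediate VC dimension.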

Reproduced from: https://www.cnblogs.com/vivounicorn/archive/2010/12/18/1909709.html

Origin: blog.csdn.net/weixin_34184561/article/details/93642173