100 Days of Machine Learning | Day 22: Why Can Machines Learn?

Previous posts in this series

100 Days of Machine Learning | Day 1: Data Preprocessing
100 Days of Machine Learning | Day 2: Simple Linear Regression
100 Days of Machine Learning | Day 3: Multiple Linear Regression
100 Days of Machine Learning | Day 4-6: Logistic Regression
100 Days of Machine Learning | Day 7: K-NN
100 Days of Machine Learning | Day 8: The Math Behind Logistic Regression
100 Days of Machine Learning | Day 9-12: Support Vector Machines
100 Days of Machine Learning | Day 11: Implementing KNN
100 Days of Machine Learning | Day 13-14: Implementing SVM
100 Days of Machine Learning | Day 15: Naive Bayes
100 Days of Machine Learning | Day 16: Implementing SVM with Kernel Tricks
100 Days of Machine Learning | Day 17-18: The Magic of Logistic Regression
100 Days of Machine Learning | Day 19-20: Caltech Open Course: Machine Learning and Data Mining

For Day 22, following Avik-Jain's plan, we study Lesson 2 of the Caltech machine learning course CS156, taught by Yaser Abu-Mostafa.

1 Hoeffding's inequality

Suppose we have a jar full of orange and green balls. To estimate the proportions of orange and green in the jar, we randomly grab a handful of balls, called a sample:

(figure: the jar of orange and green balls, and a sample drawn from it)

Let μ be the proportion of orange balls in the jar, ν the proportion of orange balls in the sample, N the sample size, and ε the tolerance we allow between the true proportion μ and the sample proportion ν. Then the following inequality holds:

P[ |ν − μ| > ε ] ≤ 2·exp(−2ε²N)

That is, the probability has an upper bound: as long as the sample size N is large, the event "μ and ν differ by a lot" has a very small probability.
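A quick way to get a feel for this bound is to simulate the jar. The sketch below is my own illustration, with arbitrarily chosen values μ = 0.6, N = 500 and ε = 0.05; it draws many samples and compares the observed frequency of "ν is far from μ" with the Hoeffding bound 2·exp(−2ε²N):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, N, eps = 0.6, 500, 0.05     # true orange fraction, sample size, tolerance (all illustrative)
trials = 100_000

# Draw `trials` samples of N balls each and record the orange fraction nu in each sample.
nu = rng.binomial(N, mu, size=trials) / N

empirical = np.mean(np.abs(nu - mu) > eps)   # observed frequency of a "big difference"
bound = 2 * np.exp(-2 * eps**2 * N)          # Hoeffding upper bound

print(f"P[|nu - mu| > eps] (simulated): {empirical:.4f}")
print(f"Hoeffding bound:               {bound:.4f}")
```

With these numbers the simulated probability comes out around 0.02, well below the bound of about 0.16, and both shrink rapidly as N grows.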

2 The case of a single hypothesis h

If the hypothesis h is fixed, we can map our learning problem onto the jar model: each ball represents an input x; a ball is orange if h's prediction differs from the true value f(x), and green if they agree. That is:

orange ball ⇔ h(x) ≠ f(x),   green ball ⇔ h(x) = f(x)

All the balls in the jar then correspond to all possible inputs x, and a handful of drawn balls corresponds to our training set (note that this already makes an assumption: the training set and the test set are generated by the same unknown probability distribution P, i.e. they come from the same jar). The proportion of orange balls in the jar, μ, is the out-of-sample error rate E_out of our hypothesis h over the whole input space (what we ultimately want to reduce), and ν is the in-sample error rate E_in on the training set (what our algorithm can minimize). Hoeffding's inequality then gives:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2·exp(−2ε²N)

In other words, as long as the training set size N is large enough, the error rate on the training set is, with high probability, close to the true error rate.

3 The case of finitely many hypotheses h

We showed above that, for a given hypothesis h, as long as the training set is large enough, its error on the training set is close to its true error with high probability. However, we have only guaranteed that the two are close; could the prediction still be bad?

Our learning algorithm picks, from the hypothesis space, an h whose error rate on the training set is very small. Is the error rate of that h on the entire input space also very small? In this section we show that, for a hypothesis space with only finitely many h, this holds with high probability as long as the training set size N is large enough.

First, let's look at this table:

(table: rows are the hypotheses h₁, …, h_M, columns are possible training sets; BAD marks the training sets that are bad for that hypothesis)

First, for a given h, we define the notion of a "bad training set" (the red BAD entries in the table): a training set on which the gap between E_in of h and the true E_out exceeds our tolerance ε. Hoeffding's inequality guarantees that, for a given h (a row of the table), the probability of drawing a bad training set is very low.

Then, for the M candidate hypotheses h in the hypothesis space, we redefine "bad training set" (the orange BAD entries in the table): a training set is bad if it is bad for any of the hypotheses. The probability of drawing such a bad training set can be bounded as follows:

P[D is BAD] = P[D is BAD for h₁ or D is BAD for h₂ or … or D is BAD for h_M]
            ≤ P[D is BAD for h₁] + P[D is BAD for h₂] + … + P[D is BAD for h_M]
            ≤ 2·exp(−2ε²N) + 2·exp(−2ε²N) + … + 2·exp(−2ε²N)
            = 2·M·exp(−2ε²N)

Since M is finite, as long as the training set size N is large enough, the probability of drawing a bad training set is still small. In other words, our training set is very likely a good one, on which every h is good; the algorithm then simply picks an h that performs well on the training set, and its predictive ability is good in the PAC sense. We also have the inequality:

P[ ∃h ∈ H: |E_in(h) − E_out(h)| > ε ] ≤ 2·M·exp(−2ε²N)
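To see what this union bound means in practice, here is a small sketch (the tolerance ε = 0.05 and confidence level δ = 0.05 are arbitrary choices of mine) that solves 2·M·exp(−2ε²N) ≤ δ for N, showing that the required training set size grows only logarithmically in the number of hypotheses M:

```python
import math

def sample_size(M, eps, delta):
    """Smallest N such that 2 * M * exp(-2 * eps**2 * N) <= delta."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

# Required N for eps = 0.05, delta = 0.05, as the hypothesis space grows:
for M in (1, 10, 1_000, 1_000_000):
    print(f"M = {M:>9,}  ->  N >= {sample_size(M, eps=0.05, delta=0.05)}")
```

Going from 1 hypothesis to a million raises the required N only from roughly 740 to roughly 3,500, which is why a finite M, however large, is not a fundamental obstacle.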

The machine learning process is therefore as shown below:

(figure: the learning diagram, now with the probability distribution P that generates both the training and test data)

(The extra orange part in the figure indicates that the training set and the test set are generated by the same probability distribution.)

Therefore, when there are only finitely many hypotheses h, a machine learning algorithm really can learn something.

Next we discuss whether machine learning still works when the hypothesis space contains infinitely many hypotheses h.

In the previous section we proved that when the hypothesis space has size M, we get the probability upper bound:

P[ ∃h ∈ H: |E_in(h) − E_out(h)| > ε ] ≤ 2·M·exp(−2ε²N)

That is, as long as the amount of training data N is large enough, E_in on the training set and the true error rate E_out are PAC close (close with high probability).

However, the argument above only holds when the hypothesis space is finite. If the hypothesis space is infinite, the probability bound on the right-hand side blows up to infinity.

In fact, the right-hand side is a rather loose bound. In this section we look for a tighter one, to show that machine learning algorithms remain viable even when the hypothesis space is infinite. We will replace the big M with a quantity m, and show that when the hypothesis space has a break point, m has an upper bound that is polynomial in N.

2 The growth function

For a given training set x₁, x₂, …, x_N, define H(x₁, x₂, …, x_N) to be the number of distinct ways the hypotheses h in the hypothesis space H can label the training set with o's and x's (i.e., the number of dichotomies produced, which is at most 2^N).

For example, if the hypothesis space is the set of all lines in the plane and the training set consists of N points in the plane, then:

For N = 1, there are 2 dichotomies:

(figure: the 2 dichotomies for N = 1)

For N = 2, there are 4 dichotomies:

(figure: the 4 dichotomies for N = 2)

For N = 3, there are 8 dichotomies:

(figure: the 8 dichotomies for N = 3)

For N = 4, there are 14 dichotomies (because 2 of the 16 labelings cannot be produced by a single line):

(figure: the 14 dichotomies for N = 4, with the 2 labelings that no line can realize)

…………

Moreover, the number of dichotomies depends on the particular training set (for example, when N = 3, if the three points are collinear, two of the dichotomies cannot be produced, so there are only 6 instead of 8):

(figure: 3 collinear points, for which only 6 dichotomies are possible)

To remove this dependence on the particular training data, we define the growth function:

m_H(N) = max over x₁, x₂, …, x_N of |H(x₁, x₂, …, x_N)|

So the meaning of the growth function is: using the hypothesis space H, the maximum number of ways a training set of size N can be dichotomized. The growth function depends on only two things: the training set size N and the type of hypothesis space H.

Here are the growth functions of a few hypothesis spaces:

(table: growth functions of a few hypothesis spaces, e.g. positive rays: N + 1; positive intervals: N²/2 + N/2 + 1; convex sets: 2^N)
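These counts are easy to check by brute force for a simple hypothesis space. The sketch below is my own example: it enumerates the dichotomies that 1-D "positive rays" h_a(x) = sign(x − a) produce on N points, and confirms the growth function N + 1, far below 2^N:

```python
def positive_ray_dichotomies(xs):
    """All distinct labelings of the points xs produced by h_a(x) = +1 if x > a else -1."""
    xs = sorted(xs)
    # One representative threshold below all points, between each adjacent pair, and above all points.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in xs) for a in thresholds}

for N in range(1, 7):
    points = list(range(N))          # any N distinct points give the maximum here
    print(f"N = {N}: m_H(N) = {len(positive_ray_dichotomies(points))}, 2^N = {2**N}")
```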

3 break point

Now we define the break point. A break point k is a training set size for which the growth function satisfies:

m_H(k) < 2^k

that is, a training set size that the hypothesis space cannot shatter.

It is easy to see that if k is a break point, then k + 1, k + 2, … are also break points.
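Given a formula for the growth function, finding the break point is mechanical: it is the smallest k with m_H(k) < 2^k. A small sketch, using the growth functions listed above:

```python
def break_point(growth_fn, max_k=30):
    """Smallest k with growth_fn(k) < 2**k, i.e. the first size that cannot be shattered."""
    for k in range(1, max_k + 1):
        if growth_fn(k) < 2 ** k:
            return k
    return None   # no break point found up to max_k (e.g. convex sets: m_H(N) = 2^N)

print(break_point(lambda n: n + 1))                 # positive rays      -> 2
print(break_point(lambda n: n * (n + 1) // 2 + 1))  # positive intervals -> 3
print(break_point(lambda n: 2 ** n))                # convex sets        -> None
```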

4 An upper bound on the growth function

Since the first break point constrains the growth function from then on, we define the bounding function B(N, k): the largest possible value of the growth function m_H(N) given that the first break point is k:

B(N, k) = max { m_H(N) : H has break point k },   so   m_H(N) ≤ B(N, k)

Now we derive an upper bound on this bounding function.

First, the dichotomies counted by B(N, k) can be split into two types: those whose pattern on the first N−1 points appears in pairs (differing only in the N-th point), and those whose pattern on the first N−1 points appears only once:

(table: the B(N, k) dichotomies arranged as 2α rows whose first N−1 points appear in pairs and β rows whose first N−1 points appear only once)

So clearly:

B(N, k) = 2α + β

Next, consider all the dichotomies these rows induce on the first N−1 points:

(figure: the α + β distinct dichotomies on the first N−1 points)

Clearly there are α + β of them. These dichotomies on the first N−1 points are still constrained by the break point k, so:

α + β ≤ B(N−1, k)

Next, consider the first N−1 points of the paired dichotomies:

(figure: the α paired dichotomies restricted to the first N−1 points)

These first N−1 points are constrained by the break point k−1. By contradiction: if k−1 of these points could be shattered, then together with the N-th point (which takes both labels in each pair) we could shatter k points, contradicting the definition of B(N, k). Therefore:

α ≤ B(N−1, k−1)

Putting the above together, we have:

B(N, k) ≤ B(N−1, k) + B(N−1, k−1)

Using this recurrence and the boundary cases, a simple proof by induction gives (in fact, equality can be shown to hold):

B(N, k) ≤ C(N, 0) + C(N, 1) + … + C(N, k−1) = Σ_{i=0}^{k−1} C(N, i)

Therefore the growth function has an upper bound that is polynomial in N.
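The recurrence and the binomial-sum bound are easy to check numerically. The sketch below computes B(N, k) from the recurrence, treating it as an equality (which in fact holds), with the boundary cases B(N, 1) = 1 and B(1, k) = 2 for k ≥ 2 (these follow directly from the definition), and compares it with Σ_{i<k} C(N, i):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Bounding function via B(N, k) = B(N-1, k) + B(N-1, k-1)."""
    if k == 1:
        return 1          # break point 1: no single point may take both labels, so only one dichotomy
    if N == 1:
        return 2          # a single point can take both labels whenever k >= 2
    return B(N - 1, k) + B(N - 1, k - 1)

def binom_sum(N, k):
    return sum(comb(N, i) for i in range(k))

for N in range(1, 8):
    print(f"N = {N}: B(N, 3) = {B(N, 3)}, sum of C(N, i) for i < 3 = {binom_sum(N, 3)}")
```

For k = 3 the two columns agree (2, 4, 7, 11, 16, 22, 29) and grow like N², i.e. N^(k−1).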

5 VC-Bound

We will not go through the rigorous mathematical proof here; instead we motivate the VC bound informally, i.e., explain how M can be replaced by m.

(figures: the three steps of the argument: replace E_out with the error E_in' measured on a second, independently drawn "ghost" data set of size N; group the hypotheses by the dichotomies they produce on the combined 2N points, which replaces M with m_H(2N); and apply a Hoeffding-style bound to the two samples)

This gives the PAC probability bound for the machine learning problem, known as the VC bound:

P[ ∃h ∈ H: |E_in(h) − E_out(h)| > ε ] ≤ 4·m_H(2N)·exp(−ε²N/8)

We thus obtain a stronger bound: when the growth function on the right-hand side has a break point k, it is bounded by a term of order N^(k−1), so as long as N is large enough, the probability of the event "there exists a hypothesis h for which the bad case occurs" becomes very small.
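To make "N large enough" concrete, here is a small sketch that evaluates the VC bound with m_H(2N) replaced by its polynomial upper bound Σ_{i<k} C(2N, i), for a break point k = 4 (the case of 2-D linear separators) and a tolerance ε = 0.1 that I picked arbitrarily:

```python
from math import comb, exp

def vc_bound(N, eps, k):
    """4 * m_H(2N) * exp(-eps^2 * N / 8), with m_H(2N) bounded by sum_{i<k} C(2N, i)."""
    m_2N = sum(comb(2 * N, i) for i in range(k))
    return 4 * m_2N * exp(-(eps ** 2) * N / 8)

for N in (1_000, 10_000, 100_000):
    print(f"N = {N:>7,}: VC bound <= {vc_bound(N, eps=0.1, k=4):.3e}")
```

For small N the bound is vacuous (far above 1), but because the exponential eventually beats any polynomial, by N = 100,000 it has dropped to an astronomically small number: exactly the "large enough N" the conclusion asks for.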

6 Conclusion

Conclusion: when the growth function of the hypothesis space has a break point, then as long as N is large enough, we can guarantee in the PAC sense that the training set is a good one, on which E_in and E_out are close for every h, and the algorithm is free to choose among these h. In other words, a machine learning algorithm really can work.

In plain terms, the conditions for machine learning to work are:

1. A good hypothesis space, so that the growth function has a break point.

2. A good training set, so that N is large enough.

3. A good algorithm, which can pick an h that performs well on the training set.

4. A bit of good luck, because there is still a small probability that a bad case occurs.

END


Reposted from:
https://www.cnblogs.com/coldyan/

Source: www.cnblogs.com/jpld/p/11361030.html