Machine Learning Foundations (Hsuan-Tien Lin) Notes 4: Feasibility of Learning

The previous lessons introduced how machine learning can be divided into different types; among them, supervised learning for binary classification and regression is the most common and most important class of problems. In this lesson we discuss the feasibility of learning: whether a problem can actually be solved by machine learning.

1. Learning is Impossible

First consider an example: there are three squares labeled -1 and three labeled +1. Given these six samples and the features extracted from them, should the square on the right be predicted -1 or +1? If we go by symmetry, we would classify it as +1; if we go by whether the top-left cell of the square is black, we would classify it as -1. Classifying by other features yields yet other answers. Moreover, each of these contradictory answers looks correct and reasonable, because each chosen model classifies the six training samples perfectly.

Now look at a more mathematical binary example: the input x is a three-dimensional binary vector, so there are 2^3 = 8 possible inputs, of which the training set D contains five, each with its output label y. Suppose there are eight hypotheses, and all eight classify the five training samples in D perfectly. On the remaining three test points, however, the hypotheses behave differently. On the known data D we have g ≈ f; on the unknown data outside D, g ≈ f does not necessarily hold. Yet the goal of machine learning is precisely that the chosen model predict correctly on unknown data, not merely fit the known dataset D.
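To make the counting concrete, here is a minimal Python sketch (the five training labels below are chosen arbitrarily for illustration) that enumerates every boolean function on the 8-point input space and keeps those consistent with D:

```python
from itertools import product

# The full input space X = {0,1}^3 has 2^3 = 8 points.
X = list(product([0, 1], repeat=3))

# A hypothetical training set D: the first 5 points with arbitrary labels.
D = dict(zip(X[:5], [-1, +1, +1, -1, +1]))

# Enumerate all 2^8 = 256 boolean functions on X; keep those that
# classify every training sample in D correctly.
consistent = []
for labels in product([-1, +1], repeat=len(X)):
    h = dict(zip(X, labels))
    if all(h[x] == y for x, y in D.items()):
        consistent.append(h)

print(len(consistent))            # 8 = 2^3: one per labeling of the 3 unseen points
for h in consistent:
    print([h[x] for x in X[5:]])  # every labeling of the unseen points appears
```

All eight surviving hypotheses have zero error on D, yet between them they realize every possible labeling of the three unseen points, so D alone cannot tell us which one is f.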

This example tells us that getting close to the target function outside of D seems impossible; we can only guarantee good classification results on D itself. This observation is captured by the No Free Lunch (NFL) theorem. The NFL theorem states that no learning algorithm can always produce the most accurate learner in every domain. When we call one learning algorithm "superior" to another, the claim only makes sense with respect to a particular problem: a particular prior, data distribution, number of training samples, cost or reward function, and so on. In terms of this example, NFL says that no machine learning algorithm can guarantee correct classification or prediction outside of D unless we make additional assumptions, which we introduce next.

2. Probability to the Rescue

The previous section led to a discouraging conclusion: on samples outside the training set D, it seems impossible for a machine learning model to predict or classify correctly. Are there any statistical tools or methods that let us make inferences about the unknown target function f, so that our models become useful?

Consider a jar containing a very large number (too many to count) of orange and green marbles. Can we infer the proportion μ of orange marbles? The statistical approach is to draw N marbles at random from the jar as a sample and compute the fraction ν of orange marbles among the N; then the proportion of orange marbles in the jar is approximately ν. Hoeffding's inequality makes this precise: for any tolerance ε > 0,

$$P[\,|\nu - \mu| > \epsilon\,] \le 2\,e^{-2\epsilon^2 N}$$

so when N is large, the statement ν ≈ μ is probably approximately correct (PAC).
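A quick simulation (with illustrative numbers μ = 0.6, N = 1000, ε = 0.05, all chosen here for the example) shows ν concentrating around μ, with the observed deviation probability comfortably under the Hoeffding bound:

```python
import math
import random

mu = 0.6       # true (unknown) fraction of orange marbles in the jar
N = 1000       # sample size
eps = 0.05     # tolerance
trials = 10000

bad = 0
for _ in range(trials):
    # Draw N marbles i.i.d.; each is orange with probability mu.
    nu = sum(random.random() < mu for _ in range(N)) / N
    if abs(nu - mu) > eps:
        bad += 1

print("empirical P[|nu - mu| > eps]:", bad / trials)
print("Hoeffding bound 2*exp(-2*eps^2*N):", 2 * math.exp(-2 * eps**2 * N))
```

The bound evaluates to about 0.013 for these numbers, while the empirical deviation rate comes out far smaller, as the inequality (which is a worst-case guarantee) allows.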

3. Connection to Learning

Now let us map the jar onto machine learning concepts. The unknown proportion μ corresponds to the probability that a hypothesis h agrees with the target function f; each marble in the jar corresponds to a data point x. Fix one hypothesis h: for each x, paint the marble green if h(x) = f(x) and orange otherwise. Drawing N marbles from the jar corresponds to drawing the N training samples in D, and in both settings the sample is i.i.d. from the population. Therefore, if N is large enough and the samples are i.i.d., the frequency of h(x) = f(x) within the sample lets us infer the probability of h(x) = f(x) over all the data outside the sample.

The key point of this mapping is to interpret the probability of drawing an orange marble as the probability that h(x) is wrong on the sample dataset D, and from it to estimate the probability that h(x) is wrong over all data. This is the essence of why machine learning can work: why can a hypothesis obtained on sampled data be extended to the whole population? Because the two error rates are PAC (probably approximately correct) — as long as we keep the former small, the latter will be small too.

Here we introduce two quantities, Ein(h) and Eout(h). Ein(h), the in-sample error, is the fraction of the sample on which h(x) ≠ f(x); Eout(h), the out-of-sample error, is the probability that h(x) ≠ f(x) over the entire input distribution.
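As a toy illustration of the two quantities (the population, target f, and hypothesis h below are all invented for this example), Eout(h) is computed over the whole population while Ein(h) is computed on an i.i.d. sample D:

```python
import random

# Toy population: the integers 0..9999.
population = range(10000)
f = lambda x: +1 if x % 2 == 0 else -1                  # the "unknown" target
h = lambda x: +1 if x % 2 == 0 or x % 10 == 1 else -1   # a fixed hypothesis

# E_out(h): disagreement rate over the entire population (0.1 here,
# since h and f differ exactly on the points with x % 10 == 1).
E_out = sum(h(x) != f(x) for x in population) / len(population)

# E_in(h): disagreement rate on an i.i.d. sample D of size N.
N = 500
D = random.choices(population, k=N)
E_in = sum(h(x) != f(x) for x in D) / N

print(E_out, E_in)  # for large N, E_in tracks E_out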

4. Connection to Real Learning

If some h agrees with f perfectly on your data, is h necessarily the best hypothesis?

Not necessarily. By analogy with coin flipping: is a coin that comes up heads five times in a row necessarily better than the other coins?
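The coin analogy is easy to simulate. With enough coins, some coin almost surely shows five heads in a row even though every coin is fair (the count of 150 coins below is just an illustrative choice):

```python
import random

# Flip each of M fair coins 5 times; how often does SOME coin come up
# heads all 5 times? Analytically: 1 - (31/32)^150, about 0.99.
M, flips, trials = 150, 5, 10000
hits = sum(
    any(all(random.random() < 0.5 for _ in range(flips)) for _ in range(M))
    for _ in range(trials)
)
print(hits / trials)  # about 0.99: "5 heads in a row" proves nothing about a coin
```

The more hypotheses we compare on the same data, the more likely one of them looks perfect by sheer luck; this is exactly the danger addressed next.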

In other words, a given dataset D can be Bad Data with respect to different hypotheses. If D is Bad Data for any single hypothesis, then D counts as Bad Data. Only when D is good data for every hypothesis can we say D is not Bad Data, and only then may the algorithm A choose freely among the hypotheses. By Hoeffding's inequality, the probability of Bad Data can be bounded in union-bound form:

$$P_{\mathcal{D}}[\text{BAD}\ \mathcal{D}] \le \sum_{m=1}^{M} P_{\mathcal{D}}[\text{BAD}\ \mathcal{D}\ \text{for}\ h_m] \le 2M\,e^{-2\epsilon^2 N}$$

Here M is the number of hypotheses, N is the size of the sample D, and ε is the tolerance. The union bound shows that when M is finite and N is large enough, the probability of Bad Data is low; that is, D is guaranteed to satisfy Ein(h) ≈ Eout(h) for every h, the PAC condition holds, and the choice of algorithm A is unrestricted. Under this union bound we can proceed as before: pick a reasonable algorithm (PLA/pocket), choose the hypothesis hm with the smallest Ein as the final hypothesis g, and in general g ≈ f will hold, i.e., the model generalizes well.
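The sketch below checks the union bound empirically. It assigns each of M hypotheses an arbitrary true error rate, redraws a dataset of size N many times, and counts how often any hypothesis has |Ein − Eout| > ε. (This is a simplified model: each hypothesis's in-sample error is drawn independently here, which the union bound does not require but which suffices for illustration; all numbers are assumptions.)

```python
import math
import random

M, N, eps, trials = 50, 200, 0.15, 2000
E_out = [random.uniform(0.1, 0.5) for _ in range(M)]  # arbitrary true error rates

bad = 0
for _ in range(trials):
    # D is "bad" if ANY hypothesis's in-sample error strays far from its true error.
    for e in E_out:
        E_in = sum(random.random() < e for _ in range(N)) / N
        if abs(E_in - e) > eps:
            bad += 1
            break

print("empirical P[BAD D]          :", bad / trials)
print("union bound 2M e^(-2eps^2 N):", 2 * M * math.exp(-2 * eps**2 * N))
```

For these numbers the bound is about 0.012, while the empirical bad-data rate is typically an order of magnitude smaller, as a worst-case bound permits.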

So if the number of hypotheses M is finite and N is large enough, then for any final hypothesis g chosen by the algorithm A, Ein(g) ≈ Eout(g) holds; and if the algorithm finds a g with Ein(g) ≈ 0, PAC guarantees Eout(g) ≈ 0 as well. Together these establish that machine learning is feasible.
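To see how strong this guarantee becomes, plug some illustrative numbers (M = 100 hypotheses, N = 10000 samples, ε = 0.05, all assumed here) into the bound:

$$P[\text{BAD}\ \mathcal{D}] \le 2M\,e^{-2\epsilon^2 N} = 2 \cdot 100 \cdot e^{-2 \cdot 0.05^2 \cdot 10000} = 200\,e^{-50} \approx 3.9 \times 10^{-20}$$

With a finite hypothesis set and enough data, bad generalization is vanishingly unlikely.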

However, as noted at the lower-right of the learning flow diagram, what if M is infinite? For example, the PLA has infinitely many candidate lines. Do these conclusions still hold? Can machines still learn? We take up these questions in the next lesson.

5. Summary

This lesson discussed the feasibility of machine learning. We first introduced the NFL theorem, which shows that without assumptions machine learning cannot find a g guaranteed to match the target function f everywhere. We then introduced statistical assumptions, such as Hoeffding's inequality, to build the connection between Ein and Eout, showing that for a fixed h, when N is large enough, Ein ≈ Eout is PAC. Finally, for the case of many hypotheses, as long as the number of hypotheses M is finite and N is large enough, Ein ≈ Eout is still guaranteed, which proves machine learning is feasible.
