My personal understanding of Chapter 1 of Machine Learning -- the No Free Lunch theorem

This is my first blog post; I'm writing it to deepen my own learning and memory. When I saw the first formula in the book, I originally wanted to just look at the conclusion and move on. However... I noticed the author's note: it only uses some very basic mathematics, and readers who only plan to read the first chapter and are "afraid of math" can skip it... Stubborn, stubborn, stubborn, I wasn't convinced, so I tried to figure some of it out.

I found this article on Zhihu; it can be regarded as my first reference, and I'm grateful for it. https://zhuanlan.zhihu.com/p/48493722

Below I add my own understanding on top of that article.

The No Free Lunch theorem says that no matter how clever algorithm La is and how clumsy algorithm Lb is, their expected performance turns out to be the same.

(1) Expectation

Expectation is also called the mean: it can be thought of as the value a random quantity inherently tends toward. A simple explanation:

In probability theory and statistics, the expected value (also mathematical expectation or mean; in physics it is called the expectation value) of a discrete random variable is the sum, over every possible outcome, of that outcome multiplied by its probability. In other words, the expectation is the average of all the possible results, as if the same random experiment were repeated many times. Note that the expected value is not necessarily the "expected" result in the everyday sense: it may not equal any individual outcome. (In other words, the expected value is an average of the variable's outputs; it is not necessarily contained in the set of values the variable can actually take.)

For example, when throwing a fair six-sided die, the expected value of the "points" is 3.5, calculated as follows:
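Each face comes up with probability 1/6, so:

$$E(X)=\sum_{i=1}^{6} i\cdot\frac{1}{6}=\frac{1+2+3+4+5+6}{6}=3.5$$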

However, as explained above, although 3.5 is the expected value of the "points", it does not correspond to any single outcome; it is impossible to actually throw 3.5 points.

(2) The algorithms and their hypotheses

The premise of the NFL theorem is: all problems occur with equal probability, or in other words, all problems are equally important.

The No Free Lunch Theorem states: if learning algorithm La performs better than learning algorithm Lb on some problems, then there must exist other problems on which Lb performs better than La.

(An analogy used throughout: La and Lb are like two sets of parents. Children educated by the La parents study well, while children educated by the Lb parents are physically strong.)

Generalization is this: now let the two sets of parents train other children, using the same methods they originally used on their own children.

"Performing well" here means the generalization ability mentioned earlier. Then comes the following formula:

$$E_{ote}(\mathfrak{L}_a\mid X,f)=\sum_h \sum_{x\in\mathcal{X}-X} P(x)\,\mathbb{I}(h(x)\neq f(x))\,P(h\mid X,\mathfrak{L}_a)$$

The formula looks long and intimidating, but let's interpret it piece by piece.

𝒳: the sample space. What is the sample space? It is the space spanned by the sample attributes described earlier in the book.

Take the watermelons in the book as an example:

The attributes and attribute values of each watermelon are:

color = green || dark || light, x = 0 || 1 || 2

root = curled || slightly curled || stiff, y = 0 || 1 || 2

knock = muffled || dull || crisp, z = 0 || 1 || 2

Think of color, root, and knock sound as the x, y, and z axes, each taking the values 0, 1, 2. Doesn't that look like a three-dimensional cube? Of course there may be more attributes, in which case the space becomes higher-dimensional and harder to picture.
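To make the sample space concrete, here is a tiny sketch (the attribute names and the 0/1/2 encoding are just my own illustration, not code from the book): enumerating every combination of the three attributes gives the 3 × 3 × 3 = 27 points of 𝒳.

```python
from itertools import product

# Illustrative encoding of the three watermelon attributes, three values each.
color = ["green", "dark", "light"]              # x = 0, 1, 2
root = ["curled", "slightly curled", "stiff"]   # y = 0, 1, 2
knock = ["muffled", "dull", "crisp"]            # z = 0, 1, 2

# The sample space is the set of all attribute combinations.
sample_space = list(product(color, root, knock))
print(len(sample_space))   # 27 = 3 * 3 * 3
print(sample_space[0])     # ('green', 'curled', 'muffled')
```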

H: the hypothesis space. What is the hypothesis space?

A hypothesis is what was earlier also called a learned model; let's get these concepts straight here. (See the blogger's article for an explanation; after reading it you should be able to understand the hypothesis space and the version space.)

For the La parents, the hypothesis space is the space composed of everything the child might develop into: police officer, programmer, astronaut, waiter, rich-and-pretty, naive-and-sweet... and so on.

But under the traditional definition, only astronauts and programmers count as "pillars of the nation"; the subset consistent with that definition of "pillar of the nation" is the version space.

La: a learning algorithm. Learning algorithms have their own biases: given the same training data, different learning algorithms can produce different hypotheses and learn different models. That is exactly why the question "which learning algorithm is better for a given problem?" arises. What the No Free Lunch theorem proves is: if algorithm La learns better models on some problems, then there must exist other problems on which algorithm Lb learns better models. "Better" and "worse" are measured below by the algorithm's total error over all samples. (In the analogy: the same child raised by different parents may turn out differently.)

P(h|X, La): the probability that algorithm La produces hypothesis h based on training data X. (In the analogy: after the La parents have trained their own child into a police officer, they start training other children (X); P is the probability that another child, under the La parents' training, becomes a police officer.)

Here is my own understanding. Since this is the probability that La produces hypothesis h based on X, there must be more than one hypothesis (you might say that's obvious: we already have a hypothesis space, so of course there is more than one). The thing to note is that a hypothesis is a mapping, y = h(x) (in the analogy: whether this child-turned-police-officer counts as a pillar of the nation), i.e. a prediction of the learning target (judging whether a melon is good) produced from the data X. Because the data X differs, different hypotheses h may be produced; and since the hypotheses can differ, each hypothesis has its corresponding probability P(h|X, La). Moreover, the probabilities of all the hypotheses h add up to 1, which is easy to understand: probabilities sum to 1. (The children who do not become police officers end up in all kinds of other professions; their probabilities add up to 1, because every child has to become something.)
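In symbols, that last point is simply:

$$\sum_h P(h\mid X,\mathfrak{L}_a)=1$$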

f: the true target function we hope to learn. Note that this function is not unique either; there is a whole space of candidate functions, distributed over that space according to some probability distribution. The proof below uses the uniform distribution. (In the analogy, f means actually ending up as a pillar of the nation.)

E_ote(La|X, f) (the off-training-set error; in the analogy: the La parents go professional and become training parents for hire, and E_ote is the expectation of the event that they fail to turn the children out in society into police officers).

First look at the E: it means expectation. The subscript "ote" stands for off-training-set error, so E_ote(La|X, f) is the expectation of the error, over all samples outside the training set, of the hypotheses learned by algorithm La.

P(x): my understanding here is that each sample in the sample space is drawn with a different probability. What does that mean? Take watermelons: melons with (color = light, root = stiff, knock = crisp) may be more numerous than melons with (color = light, root = slightly curled, knock = dull), and therefore more likely to be drawn. That is why the probability P(x) appears.

I(h(x) ≠ f(x)): according to the symbol table at the front of the book, this is called the indicator function. It is easy to understand: just like the expression inside the parentheses of an if statement, it equals 1 when the condition is true and 0 when it is false. (False here means the child successfully became a police officer, which we no longer need to consider; we only count the unsuccessful cases.) (This term is weighted by the child itself, P(x), and by La's training ability, P(h|X, La).)
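A minimal sketch of the indicator function as code (the function name is mine, just to mirror the if-statement intuition):

```python
def indicator(condition: bool) -> int:
    """I(condition): 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0

# I(h(x) != f(x)) counts 1 only when the hypothesis gets the sample wrong.
h_x, f_x = 0, 1
print(indicator(h_x != f_x))  # 1, a misclassification
```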

P(h|X, La): as said before, but to review: the probability that algorithm La produces hypothesis h based on training set X.


The first summation symbol:

Σ_h: to be honest, I do not fully understand this summation over hypotheses myself. My understanding is mainly that an algorithm can produce different hypotheses from the same training set, each with its own probability. (Besides police officer, the child could also end up in all sorts of other professions.)

The second summation symbol:

Σ_{x∈𝒳−X}: for every sample in the sample space that is outside the training set, carry out the computation on the right. (The children outside the training set.)

 

Alright, every part of the formula has now been explained; let's understand it as a whole. The formula says:

For each of the different hypotheses h that algorithm La produces, test it on every sample outside the training set; when the test fails (we are computing error, after all), the indicator function is 1; multiply it by the two probabilities; and finally add everything up. The result is the algorithm's error on the samples outside the training set.
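Here is a small sketch of the whole formula on a toy binary problem (all the numbers, and the names `P_x`, `f`, and `hypotheses`, are made up purely for illustration; this is not data from the book):

```python
# Toy samples outside the training set, with their draw probabilities P(x).
P_x = {"x1": 0.5, "x2": 0.3, "x3": 0.2}

# The true target function f on those samples (binary labels).
f = {"x1": 0, "x2": 1, "x3": 1}

# Hypotheses that La might produce, each with its probability P(h | X, La); they sum to 1.
hypotheses = {
    "h1": ({"x1": 0, "x2": 1, "x3": 0}, 0.7),
    "h2": ({"x1": 1, "x2": 1, "x3": 1}, 0.3),
}

# E_ote = sum over h, sum over x outside the training set of
#         P(x) * I(h(x) != f(x)) * P(h | X, La)
E_ote = sum(
    p_h * sum(P_x[x] * (1 if h[x] != f[x] else 0) for x in P_x)
    for h, p_h in hypotheses.values()
)
print(E_ote)  # 0.7 * 0.2 + 0.3 * 0.5 = 0.29
```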

Now consider a binary classification problem. First, an explanation: the true target function f that we hope to learn may not be unique. This is easy to understand, because any function in the function space that is consistent with the version space could be the true target function, and these different candidates f are equally likely (uniformly distributed). The functions map the sample space 𝒳 into {0, 1}. How many such functions are there? Look at the prediction on a single sample: for a sample x in the sample space 𝒳, if f1(x) = 0 and f2(x) = 1, then f1 and f2 are two different true target functions, so a single sample already distinguishes two target functions. There are |𝒳| samples in total, so there are 2^|𝒳| possible true target functions, and these target functions are equally likely (uniform distribution). Therefore, for a hypothesis h with, say, h(x) = 0, h(x) equals the true target function's value with probability 1/2.

Therefore, let's derive the following, summing the off-training-set error over all the possible true target functions f:
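Written out (this is the standard computation for the binary case, using the uniform distribution over the 2^|𝒳| target functions assumed above):

$$
\begin{aligned}
\sum_f E_{ote}(\mathfrak{L}_a\mid X,f)
&=\sum_f\sum_h\sum_{x\in\mathcal{X}-X}P(x)\,\mathbb{I}(h(x)\neq f(x))\,P(h\mid X,\mathfrak{L}_a)\\
&=\sum_{x\in\mathcal{X}-X}P(x)\sum_h P(h\mid X,\mathfrak{L}_a)\sum_f\mathbb{I}(h(x)\neq f(x))\\
&=\sum_{x\in\mathcal{X}-X}P(x)\sum_h P(h\mid X,\mathfrak{L}_a)\,\frac{1}{2}\,2^{|\mathcal{X}|}\\
&=2^{|\mathcal{X}|-1}\sum_{x\in\mathcal{X}-X}P(x)\sum_h P(h\mid X,\mathfrak{L}_a)\\
&=2^{|\mathcal{X}|-1}\sum_{x\in\mathcal{X}-X}P(x)\cdot 1
\end{aligned}
$$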

(The factor 2^|𝒳| · 1/2 comes out of the sum over f, and the probabilities of the hypotheses sum to 1. It's that simple.)

After this derivation, we find that the final expression for the expected error contains nothing about the specific algorithm, so the result is independent of the algorithm!

 

I wrote this for my own reference; if you don't like it, please don't flame me.


Original post: www.cnblogs.com/caigouba/p/11311693.html