Dry goods|PRML reading postscript (1): fitting learning

1

Beautiful Gaussian distribution

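For reference, the univariate Gaussian density being discussed here is

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},

with mean \mu and variance \sigma^2.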

[P29] Figure 1.16 is a good depiction of the beauty of this expression.

2

Ill-conditioned fitting of maximum likelihood estimation

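Presumably this refers to PRML's polynomial curve-fitting example: under Gaussian noise, maximum likelihood reduces to least squares, and with a high-order polynomial it fits the few training points almost perfectly while generalizing badly. A minimal Python sketch (the sine data, noise level, and degrees are invented for illustration):

    import numpy as np

    rng = np.random.RandomState(1)
    x_train = np.linspace(0, 1, 10)
    t_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.randn(10)   # noisy observations
    x_test = np.linspace(0, 1, 100)
    t_test = np.sin(2 * np.pi * x_test)                           # underlying function

    for degree in (3, 9):
        w = np.polyfit(x_train, t_train, degree)   # maximum likelihood = least squares here
        train_rmse = np.sqrt(np.mean((np.polyval(w, x_train) - t_train) ** 2))
        test_rmse = np.sqrt(np.mean((np.polyval(w, x_test) - t_test) ** 2))
        print(degree, round(train_rmse, 3), round(test_rmse, 3))
    # The degree-9 fit drives the training error to ~0 (it interpolates the 10 points),
    # while the test error typically blows up -- the ill-conditioned side of ML fitting.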

3

Parameter Regularizer

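What this section presumably refers to is the regularized error function from PRML Chapter 1: the sum-of-squares error plus a penalty on the parameters,

\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{y(x_n,\mathbf{w}) - t_n\bigr\}^2 + \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^2,

where the coefficient \lambda controls the trade-off between fitting the data and keeping the parameters W small.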

4

Prior distribution: Gaussian distribution

The Gaussian distribution should be regarded as the most basic, hard-and-fast piece of prior knowledge for describing the uncertainty of any continuous quantity in our knowledge.

No matter what kind of monster a variable is, as long as it is continuous rather than discrete, the first thing to do is hand it a Gaussian distribution.

Of course, from a mathematical point of view, the Gaussian distribution earns this anointed status thanks to its beautiful conjugate form.

An exercise on [P98] proves that a Gaussian likelihood multiplied by a Gaussian prior gives a result that is still a Gaussian distribution.

(Working through this proof requires familiarity with the 150-odd Gaussian formulas in Chapter 2, as well as a solid grounding in probability theory and linear algebra.)
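The simplest case of that conjugacy, inferring the mean \mu of a Gaussian with known variance \sigma^2 from N observations, gives

p(\mu \mid \mathbf{x}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2), \qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}, \qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},

where \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) is the prior: the Gaussian prior times the Gaussian likelihood yields a Gaussian posterior whose mean interpolates between \mu_0 and \mu_{\mathrm{ML}}.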

The Gaussian distribution has many conveniences of mathematical form, such as the zero-mean simplified version discussed below, and this is precisely what has drawn a lot of criticism onto Bayesian methods. [P23] puts it this way: one reason the Bayesian approach is widely criticized is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any genuine prior belief.

Bayesian methods insist on rigorous derivation and complete formulas. Broad prior knowledge about nature that is strange and cannot be expressed in mathematical language, such as the ideas behind Deep Learning, is naturally never considered; this is why some people regard Deep Learning and Bayesian methods as opposites. [Quora]

5

Volatility penalty: simplified Gaussian distribution

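The "simplified Gaussian distribution" here is presumably the zero-mean Gaussian prior over the parameters, p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}). Taking the negative log of the posterior, maximizing it is equivalent to minimizing

\frac{\beta}{2}\sum_{n=1}^{N}\bigl\{y(x_n,\mathbf{w}) - t_n\bigr\}^2 + \frac{\alpha}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w},

which is exactly the sum-of-squares error with an L2 (weight-decay) penalty and regularization coefficient \lambda = \alpha/\beta: large fluctuations of W away from 0 are penalized.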

6

Sparsity penalty: L1 Regularizer


I. The brain has more than a hundred billion neurons, but only 1% to 4% of them are activated at the same time, and the activated region differs each time. This is sparsity in biological neural systems.

II. Sparsification turns the originally dense data into a sparse representation, yielding a sparse feature expression. For example, sparsifying the real number 5 into the vector [1, 0, 1] makes it easier to separate linearly. Or, to recognize a bird, as long as the noise is sparsified away and the key parts are kept, a better feature expression results in the end. This is sparsity of feature expression; practical applications include [sparse coding] and [deep neural networks], and of course our biological neural network.
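One possible reading of the [1, 0, 1] example above (an assumption on my part) is the binary expansion of 5; a toy Python sketch:

    # Hypothetical illustration: "sparsifying" the integer 5 into a vector of
    # binary indicator features, most-significant bit first: 5 -> [1, 0, 1].
    def to_binary_features(n, width=3):
        return [(n >> i) & 1 for i in reversed(range(width))]

    print(to_binary_features(5))   # [1, 0, 1]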

Of course, none of the above has much to do with the L1 Regularizer, because the way it produces sparsity is wrong; otherwise, why would we need Deep Learning at all?

First of all, this sparsification strategy is not Adaptive: it does not intelligently work out where sparsity is needed and where it is not. From the viewpoint of mathematical programming it is just a multivariate constraint, and which element is unlucky enough to be constrained to 0 is something nobody can determine in advance.

Secondly, the parameters W directly determine the model's fitting capacity. If the wrong elements are sparsified to 0, serious underfitting follows.

Based on these two points, one should not assume that L1 is similar to L2 and can likewise alleviate overfitting; in fact it is more likely to cause underfitting.
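A small sketch of this contrast using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data with only three informative features (the data and the alpha value are invented for illustration):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(50, 10)
    true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 informative features
    y = X @ true_w + 0.5 * rng.randn(50)

    lasso = Lasso(alpha=1.0).fit(X, y)   # L1 Regularizer
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 Regularizer

    print("L1 (Lasso) coefficients:", np.round(lasso.coef_, 2))
    print("L2 (Ridge) coefficients:", np.round(ridge.coef_, 2))
    # Lasso sets many coefficients exactly to 0 while Ridge only shrinks them;
    # raising alpha makes Lasso zero out informative weights too, which is the
    # "unlucky element" underfitting risk described above.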

7

Graphical understanding of L1&L2 Regularizer

The interesting figures on [P146] ([P107] in the Chinese (CHS) translation by Ma Chunpeng, HIT) seem to explain why L1 drives weights exactly to 0 while L2 only pushes them infinitely close to 0.
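The idea behind those figures can also be written out: regularized least squares is equivalent to minimizing the unregularized error under a constraint on the weights,

\min_{\mathbf{w}} \; \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(x_n)\bigr\}^2
\quad \text{subject to} \quad \sum_{j=1}^{M}\lvert w_j \rvert^{q} \le \eta.

For q = 1 the constraint region is a diamond whose corners lie on the coordinate axes, so the constrained optimum tends to sit exactly at a corner where some w_j = 0; for q = 2 the region is a circle with no corners, so the weights are merely shrunk toward 0 without reaching it.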

8

Find features better: Adaptive Representation Regularizer

[Erhan10] argues that Deep Learning's Pre-Training also acts as a Regularizer, for two reasons:

First, the search direction of the parameters W after Pre-Training is more likely to escape from poor local minima.

Second, the search direction of the parameters W after Pre-Training makes the likelihood value larger, yet generalization is better (the test error rate drops).

The first point is the more magical Regularizer effect; even the Bayesian method, crowned with a Turing Award halo, cannot explain it.

The second point looks a bit like the effect of the L2 Regularizer, but it is more likely related to the Attention mechanism inside the model.

If the parameters W obtained from Pre-Training are then held fixed, Pre-Training is equivalent to a nonlinear PCA: it pre-injects the prior knowledge carried by the unlabeled observation data and yields a more reasonable P(W), which, again, the Bayesian method cannot explain.
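A minimal sketch of this "Pre-Training as nonlinear PCA" reading (not [Erhan10]'s actual experimental setup; the shapes, data, and hyperparameters below are invented): train an autoencoder on unlabeled data, freeze the encoder weights W, then fit only a classifier head on the encoded features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    X_unlabeled = torch.randn(512, 20)            # stand-in unlabeled observations
    X_labeled = torch.randn(128, 20)              # stand-in labeled data
    y_labeled = (X_labeled[:, 0] > 0).long()

    encoder = nn.Sequential(nn.Linear(20, 8), nn.Sigmoid())   # holds the W being pre-trained
    decoder = nn.Linear(8, 20)

    # 1) Unsupervised Pre-Training: learn to reconstruct the unlabeled data.
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = F.mse_loss(decoder(encoder(X_unlabeled)), X_unlabeled)
        loss.backward()
        opt.step()

    # 2) Freeze W: the encoder now acts like a nonlinear PCA, carrying the prior
    #    knowledge injected from the unlabeled observation data.
    for p in encoder.parameters():
        p.requires_grad_(False)

    # 3) Supervised fine-tuning of only the classifier head on top of it.
    head = nn.Linear(8, 2)
    opt2 = torch.optim.Adam(head.parameters(), lr=1e-2)
    for _ in range(200):
        opt2.zero_grad()
        loss = F.cross_entropy(head(encoder(X_labeled)), y_labeled)
        loss.backward()
        opt2.step()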

9

Reliable sparsity: Adaptive Sparsity Regularizer

There are two methods that can adaptively introduce sparsity in Deep Learning, [ReLU] & [Dropout].

I. [ReLU] sparsifies the outputs of neurons, and which outputs are zeroed clearly varies with the input.

II. [Dropout] also sparsifies the outputs of neurons, but in a somewhat special way: which outputs are dropped is determined by random probability rather than by an adaptive rule.

But this does not mean that [Dropout] fails to obtain adaptive sparsity; its adaptivity comes from the randomness itself.

Because of the randomness, the network structure is different every time, which forces the parameters W to adjust in a direction that stays stable under these changes.

As analyzed in 2.1.2, [I] can be regarded as finding sparse features, replacing L1; [II] can be regarded as a sparse activation mechanism similar to the biological neural network, replacing L2.

The two do not conflict, so [I]+[II] is standard equipment in a conventional Deep Learning model.
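A tiny numpy sketch of the two mechanisms [I] and [II], with made-up activations:

    import numpy as np

    rng = np.random.RandomState(0)
    pre_activation = rng.randn(5, 8)              # made-up pre-activations of 8 neurons

    # [I] ReLU: which outputs become 0 depends on the input itself (adaptive sparsity).
    relu_out = np.maximum(pre_activation, 0.0)

    # [II] Dropout: outputs are zeroed by a random mask, independent of the input;
    # the "inverted dropout" rescaling keeps the expected activation unchanged.
    p_keep = 0.5
    mask = rng.binomial(1, p_keep, size=relu_out.shape)
    dropout_out = relu_out * mask / p_keep

    print("zeros after ReLU:   ", np.mean(relu_out == 0.0))
    print("zeros after Dropout:", np.mean(dropout_out == 0.0))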

Blog Park: http://www.cnblogs.com/neopenx/p/4820567.html

