In-Depth Understanding of Generalization

Table of Contents

1. Introduction
2. Definition of Generalization
3. Dataset Splits
4. Categories of Generalization
5. Understanding Generalization from the Training Process

1. Introduction

What is generalization?

Let's start with an example:

  Xiao Ming and Xiao Li are both in their third year of high school. Xiao Ming is clever: as he works through "Five Years of College Entrance Exams, Three Years of Mock Exams", he sets the problems aside now and then to summarize the patterns behind them. Xiao Li, on the other hand, is obsessed with grinding through problems: he finishes one set after another and piles up mountains of paper, but never reviews which questions he got right or wrong. When the college entrance exam results are announced, Xiao Ming scores far above the first-tier cutoff, while Xiao Li barely reaches the second-tier line. Why is this?

  The questions on the actual college entrance exam are new; no one has seen them before. The point of doing practice problems is to grasp the patterns behind the questions, so that one can reason by analogy, apply what one has learned, and stay calm when facing new problems. This mastery of the underlying patterns is generalization.

  In this example, Xiao Ming is good at summarizing the patterns behind the problems, so we can say his generalization ability is strong; Xiao Li only grinds through problems without grasping the patterns behind them, so his generalization ability is poor.

2. Definition of Generalization

  The fundamental problem in machine learning (and deep learning) is the tension between optimization and generalization.

  Optimization means adjusting the model to obtain the best possible performance on the training data (this is the "learning" in machine learning), while generalization refers to how well the trained model performs on data it has never seen before.

  The goal of machine learning is, of course, good generalization, but generalization cannot be controlled directly; we can only adjust the model based on the training data.

Generalization can be understood from the following six perspectives:

  1. The most direct definition of generalization is the gap between performance on the training data and on real data; a trained model is ultimately tested on data it has never encountered;

  2. Generalization can also be seen as model sparsity. As Occam's razor puts it, when faced with competing explanations, the simplest one is the best. In machine learning, a model that generalizes well should have many parameters close to zero; in deep learning, the weight matrices being optimized should have a preference for sparsity (a minimal code sketch of this idea follows after this list);

  3. A third interpretation is generative ability: a model that generalizes well should be able to reconstruct, with high fidelity, the features at every level of abstraction;

  4. The fourth interpretation is that the model can effectively ignore trivial features, or equivalently, can find the characteristics that remain the same under irrelevant changes;

  5. Generalization can also be viewed as the model's ability to compress information.

      This relates to a hypothesis proposed to explain why deep learning is effective, the information bottleneck: the stronger a model's ability to compress (reduce the dimensionality of) its features, the more likely it is to classify accurately. Information compression can subsume the four interpretations of generalization above: a sparse model structure is what remains after the information has been compressed, strong generative ability and low generalization error can be obtained by compressing information, and ignoring irrelevant features is a by-product of information compression.

  6. The last way to understand generalization is as risk minimization.

      This is a game-theoretic point of view: a model with strong generalization ability minimizes its risk of accidents in a real environment; internally it builds an early-warning mechanism for unknown features and prepares countermeasures in advance. This is an abstract and not very precise explanation, but as the technology advances, people will find evaluation methods that quantify a model's generalization ability under this interpretation.
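As a concrete illustration of the sparsity view (point 2 above), here is a minimal sketch of my own, not taken from the original post: it fits a linear model with an L1 penalty (Lasso) on synthetic data in which only a few features matter. The data, the penalty strength, and the use of scikit-learn are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 50 features, but only 3 of them actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

# Ordinary least squares: essentially no coefficients end up exactly zero.
ols = LinearRegression().fit(X, y)

# L1-penalized regression: most coefficients are driven to exactly zero (sparsity).
lasso = Lasso(alpha=0.1).fit(X, y)

print("non-zero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-6)))
print("non-zero Lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```

The sparse Lasso solution is the "simplest explanation" in the sense of Occam's razor: it keeps only the few weights that the data actually supports.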

3. Dataset Splits

A dataset can be divided into:

  • Training set: the dataset actually used to train the algorithm; it is used to compute the gradients (or the Jacobian matrix) and to determine the weight updates of the network at each iteration;
  • Validation set: the dataset used to track the progress of learning; it acts as an indicator of how the network behaves between training data points, and the error on the validation set is monitored throughout training;
  • Test set: the dataset used to produce the final results.

For the test set to effectively reflect the generalization ability of the network, two things must be kept in mind:

  First, the test set must never be used to train the network in any way, not even for choosing a network from a set of candidate networks. The test set may only be used after all training and model selection is complete;

  Second, the test set must be representative of all the situations in which the network will be used (which is hard to guarantee when the input space is high-dimensional or has a complicated shape).
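To make these rules concrete, below is a minimal sketch (my own, with assumed split proportions) of partitioning a dataset into training, validation, and test sets: the validation set is used for tracking learning and for model selection, and the test set is touched exactly once at the very end.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the data once, then carve out validation and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])

# Usage: train on the training set, choose hyperparameters on the validation
# set, and report the final number on the test set only once.
X = np.random.default_rng(1).normal(size=(1000, 10))
y = X[:, 0] + 0.1 * np.random.default_rng(2).normal(size=1000)
X_tr, y_tr, X_val, y_val, X_te, y_te = train_val_test_split(X, y)
print(len(X_tr), len(X_val), len(X_te))  # 700 150 150
```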

Another example:

  A teacher gives the students 10 exercises to practice, and then uses exactly the same 10 questions on the exam. Can the exam results effectively reflect how well the students have learned?

  The answer is no; some students may only know how to do these 10 questions and still get a high score. Coming back to our problem: we want a model with good generalization performance, just as we want students who have truly mastered the material and acquired knowledge that carries over to new problems. The training process corresponds to the students doing practice exercises, and testing corresponds to the exam. Obviously, if the test samples are also used for training, the resulting performance estimate will be overly "optimistic".

4. Categories of Generalization

According to the strength of generalization, a model's fit can fall into one of the following categories:


  • Underfitting: the model cannot achieve a sufficiently low error on the training set;
  • Good fit: the gap between training error and test error is small;
  • Overfitting: the gap between training error and test error is too large;
  • Non-convergence: the model was not obtained by training on the training set.

  In machine learning, bias and variance can be used to characterize underfitting, good fit, and overfitting.

For Bias:

  • Bias measures the model's ability to fit the training data (here "training data" is not necessarily the entire training set, but only the portion used for training, e.g. a mini-batch);
  • Bias reflects the error between the model's output and the true value on the samples, i.e. the accuracy of the model itself;
  • The smaller the bias, the stronger the fitting ability (and the more likely overfitting becomes); the larger the bias, the weaker the fitting ability (and the more likely underfitting becomes).

For Variance:

  • Variance describes the spread of the predictions, i.e. how far they scatter around their expected value. The larger the variance, the more dispersed the predictions and the less stable the model;
  • Variance reflects the error between each individual model's output and the expected output of the model, i.e. the stability of the model;
  • The smaller the variance, the stronger the generalization ability of the model; the larger the variance, the weaker it is.
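The bias and variance described above can be estimated empirically. The sketch below (my own construction, with an assumed true function and noise level) repeatedly draws noisy training sets, fits a very simple model and a very flexible one, and compares the squared bias and the variance of their predictions at fixed test points.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                        # assumed "true" underlying function
x_test = np.linspace(0.0, np.pi, 20)

def bias_variance(degree, n_repeats=200, n_train=30, noise=0.3):
    """Fit polynomials of a given degree on many independent noisy training sets."""
    preds = np.empty((n_repeats, len(x_test)))
    for r in range(n_repeats):
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + noise * rng.normal(size=n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # squared bias
    variance = np.mean(preds.var(axis=0))                    # variance
    return bias2, variance

print("degree 1:", bias_variance(1))   # high bias, low variance -> underfitting
print("degree 9:", bias_variance(9))   # low bias, high variance -> overfitting
```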

(Figures: high-bias fit, just-right fit, and high-variance fit)

  The left plot shows high bias, corresponding to underfitting; the right plot shows high variance, corresponding to overfitting; the middle one is just right, corresponding to a good fit.

Overfitting can be understood as follows:

  Overfitting means that, given a pile of data that contains noise, the model ends up fitting the noise as well. This is very harmful: on the one hand it makes the model more complex (think about it: data that could have been fit with a simple low-degree function ends up being fit with a fifth-degree polynomial because of the noise), and on the other hand the generalization performance of the model becomes poor (the data was generated by a simple function, but because of the noise the fitted model is a fifth-degree polynomial), so when new test data arrives, the overfitted model achieves very poor accuracy.
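The paragraph above can be reproduced with a few lines of code. In this sketch (assumed data: a straight line observed with noise), a fifth-degree polynomial achieves a lower training error than the straight-line fit, but typically a worse error on fresh data drawn from the same line.

```python
import numpy as np

rng = np.random.default_rng(42)
def truth(x):
    return 2.0 * x + 1.0            # assumed noise-free data-generating function

x_train = rng.uniform(-1, 1, 12)
y_train = truth(x_train) + 0.3 * rng.normal(size=12)
x_new = rng.uniform(-1, 1, 1000)    # fresh data from the same line
y_new = truth(x_new) + 0.3 * rng.normal(size=1000)

for degree in (1, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")
```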

5. Understanding Generalization from the Training Process

(Figures: training error and test error curves during training)

  Model capacity: generally refers to the number of learnable parameters in the model (or, more broadly, the model's complexity).

  At the start of training, the model is still learning and has a high error on both the training set and the test set; at this point the bias is large. The model has not yet acquired the knowledge and fits poorly, so it falls in the underfitting region of the curve.

  As training progresses, both the training error and the test error decline. With further training the performance on the training set gets better and better, and eventually, after a turning point, the training error keeps decreasing while the test error rises; at this point the variance is large and the model has entered the overfitting region. This usually happens when the model is too complex, for example when it has too many parameters, which weakens its predictive performance and amplifies small fluctuations in the data. Although the model can perform almost perfectly on the training data, essentially memorizing every feature of it, its performance on unknown data is greatly reduced, because a model that simply memorizes the training data usually generalizes poorly.

  Thus the model should have enough parameters to prevent underfitting, but not so many that it simply memorizes the training data. A compromise has to be found between too much capacity and too little: the optimal capacity.
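A minimal sketch of searching for this optimal capacity (my own example, with assumed data): sweep the polynomial degree, measure the error on the training set and on a held-out validation set, and keep the degree with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=60)   # assumed data

# Hold out a validation set for choosing the capacity (the degree).
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

best_degree, best_val = None, np.inf
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: train {train_mse:.3f}  val {val_mse:.3f}")
    if val_mse < best_val:
        best_degree, best_val = degree, val_mse

print("optimal capacity (degree):", best_degree)
```

Training error keeps falling as the degree grows, while validation error falls and then rises again; the turning point is the optimal capacity.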

Two common questions arise here.

(1) During training, must the error on the training set always be lower than the error on the test set?

  Not necessarily. If the two sets are drawn from the same data distribution, for example randomly sampled from the same dataset, the test error may well be lower than the training error at the beginning. The overall trend, however, is unchanged: both errors start out decreasing slowly, until eventually the model overfits and the training error ends up lower than the test error.

(2) Will a trained model necessarily overfit?

  Not necessarily! If the dataset is large enough, the model's capacity may never be sufficient to overfit. Moreover, there are many ways to prevent or slow down overfitting, such as regularization.
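As a sketch of the regularization point (again my own example, not taken from the original post), the snippet below fits the same high-degree polynomial features with and without an L2 penalty; shrinking the weights narrows the gap between training and test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40).reshape(-1, 1)
y = np.sin(3 * x).ravel() + 0.2 * rng.normal(size=40)        # assumed data
x_test = rng.uniform(-1, 1, 500).reshape(-1, 1)
y_test = np.sin(3 * x_test).ravel() + 0.2 * rng.normal(size=500)

poly = PolynomialFeatures(degree=12)
X, X_test = poly.fit_transform(x), poly.transform(x_test)

for name, model in [("no regularization", LinearRegression()),
                    ("ridge, alpha=1.0  ", Ridge(alpha=1.0))]:
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```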

