Model improvement and generalization (overfitting)

1 Fitting

Since the concept of fitting has not been introduced yet, here is a brief explanation. The so-called process of solving a model is actually the process of fitting the model parameters by some method (such as gradient descent). Fitting refers to the dynamic process of finding the model parameters. When this process finishes, the model can end up in one of several states, such as overfitting and underfitting.

1.1 An introductory example

Suppose we have a batch of sample points that were originally generated by the function sin(x) (which is unknown in reality). Due to noise and other factors, the sample points we obtain do not fall exactly on the sin(x) curve, but are distributed around it.

The red dots in the figure above are the data set we obtained, and the blue curve is the true underlying curve. Now we need to model the data and solve for a prediction function. Suppose we use polynomials with degree = 1, 5, 10 (where degree denotes the highest power of the polynomial) to model these 12 sample points and solve, with the results visualized as follows:

It can be seen from the figure that as the degree of the polynomial increases, the value of the R^2 metric becomes higher and higher (a larger R^2 indicates a better fit), and when the degree is set to 10, R^2 reaches 1.0. But should we choose the model with degree = 10 in the end?
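Below is a minimal sketch of this experiment, assuming Python with NumPy and scikit-learn; the 12 noisy samples of sin(x), the noise level, and the random seed are illustrative assumptions, not values from the original post.

```python
# Minimal sketch: fit polynomials of degree 1, 5, 10 to 12 noisy samples of
# sin(x) and report the training R^2. Noise level and seed are assumptions.
import numpy as np
from numpy.polynomial import Polynomial
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 12)                     # 12 sample points
y = np.sin(x) + rng.normal(scale=0.15, size=x.shape)  # noisy observations

for degree in (1, 5, 10):
    p = Polynomial.fit(x, y, deg=degree)  # least-squares fit (inputs rescaled internally)
    print(f"degree={degree:2d}  train R^2 = {r2_score(y, p(x)):.3f}")
```

As in the figure, the training R^2 typically rises with the degree, and at degree = 10 (eleven coefficients for twelve points) the curve can pass almost exactly through every sample.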

Some time later, a customer wants to buy your model for commercial use; at the same time, the customer brings a new batch of labeled data to evaluate it (you have reported an R^2 score before, but the customer will not take it entirely on faith, in case you cheated). So you take the customer's new data and re-test your model. The visualization is as follows:

What puzzles you is that the degree = 5 result is actually better than the degree = 10 result. What went wrong? The reason is this: when we first modeled these 12 sample points, we used a very complex model in order to make the model "as good as possible" (that is, to make R^2 as large as possible), which caused the final model to deviate severely from the true model (sin(x) here), even though every sample point falls "exactly" on the prediction curve. This is obviously not what we want.
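Here is a hedged sketch of this re-test: fit on the original 12 points, then score on fresh data drawn from the same sin(x) process, standing in for the customer's new labeled data (the test-set size and seed are assumptions).

```python
# Sketch: compare train vs. test R^2 for each degree. The held-out points
# stand in for the customer's new labeled data (an assumption of this sketch).
import numpy as np
from numpy.polynomial import Polynomial
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 12)
y_train = np.sin(x_train) + rng.normal(scale=0.15, size=x_train.shape)
x_test = rng.uniform(0, 2 * np.pi, size=30)  # fresh data from the same process
y_test = np.sin(x_test) + rng.normal(scale=0.15, size=x_test.shape)

for degree in (1, 5, 10):
    p = Polynomial.fit(x_train, y_train, deg=degree)
    print(f"degree={degree:2d}  "
          f"train R^2 = {r2_score(y_train, p(x_train)):.3f}  "
          f"test R^2 = {r2_score(y_test, p(x_test)):.3f}")
```

Typically the degree = 10 model, despite a near-perfect training score, scores far worse on the held-out data than degree = 5; the gap between training error and generalization error is the signature of overfitting.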

1.2 Overfitting and underfitting

In machine learning, the phenomenon produced by degree = 10 above is called overfitting, the opposite phenomenon produced by degree = 1 is called underfitting, and the degree = 5 case is called a good fit. At the same time, the data used during modeling is called the training data (training set), the error produced on the training set is called the training error, the data used during testing is called the test set or validation set (test data), the error produced by the model on the test set is called the generalization error, and the entire modeling and solving process is called training.

It should be noted that the example above uses only linear (polynomial) regression to give an intuitive picture of overfitting and underfitting; it does not mean that these phenomena occur only in linear regression. In fact, all machine learning models have this problem. Generally speaking, overfitting means the model performs very well on the training set but badly on the test set; underfitting means it performs badly on both; and a good fit means it performs well on the training set (though perhaps not as well as an overfit model) and also performs well on the test set.

1.3 How to solve underfitting?

After the description above, we have an intuitive understanding of underfitting: the trained model cannot fit the existing training data well. The remedies for underfitting are relatively simple and fall into the following two categories:

  • Redesign a more complex model

       For example, in polynomial regression, increase the degree of the polynomial feature mapping (see the sketch after this list);

  • Add more feature dimensions as input

      Collect or design more feature dimensions to feed into the model.
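A minimal sketch of the first remedy, assuming a scikit-learn pipeline (the library choice and the degree-5 polynomial are illustrative assumptions, not from the original post):

```python
# Sketch: fix an underfit degree-1 model by mapping to degree-5 polynomial
# features, i.e. by redesigning a more complex model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 12).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.15, size=12)

underfit = make_pipeline(PolynomialFeatures(degree=1), LinearRegression()).fit(X, y)
better = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(X, y)
print("degree=1 train R^2:", round(underfit.score(X, y), 3))  # low: underfitting
print("degree=5 train R^2:", round(better.score(X, y), 3))    # much higher
```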

1.4 How to solve overfitting?

Common methods for solving model overfitting fall into two main categories:

  • Collect more data

       This is the most effective method, but also the most difficult in practice. The more training data there is, the better the model can average out the influence of noise during training, making it less prone to overfitting. However, collecting new data is often hard.

  • Regularization

        This is an effective and easy-to-apply way to alleviate model overfitting (see the sketch below).
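A hedged sketch of regularization, using ridge (L2) regression to shrink the coefficients of the degree-10 model; the penalty strength alpha = 1.0 and the data are illustrative assumptions (the original post defers the details of regularization to the next article):

```python
# Sketch: ridge (L2) regularization penalizes large coefficients so the
# degree-10 polynomial stops chasing the noise. alpha is an assumed value.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 12).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.15, size=12)
X_test = rng.uniform(0, 2 * np.pi, size=(30, 1))
y_test = np.sin(X_test).ravel() + rng.normal(scale=0.15, size=30)

plain = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      StandardScaler(), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print("unregularized test R^2:", round(plain.score(X_test, y_test), 3))  # typically poor
print("ridge         test R^2:", round(ridge.score(X_test, y_test), 3))  # typically better
```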

2 How to avoid overfitting

To avoid overfitting the trained model, before training we generally divide the available data set into two parts, the training set and the test set, usually in a ratio of 7:3. The training set is used to train the model (to reduce the model's error on the training data), and the test set is then used to estimate the model's generalization error on unseen data and to check for overfitting. However, a complete training process usually looks like "train -> test -> train -> test -> ...", because you rarely choose the right model on the first attempt; in the process, the test set is unwittingly used as part of training. Hence there is another way of splitting: training set, development set (dev data), and test set, usually in a ratio of 7:2:1. Why are there two splitting schemes? It generally depends on the trainer's requirements for the model: with a three-way split, the test set is reserved for the final test after the model has been selected using the training set and the development set. There is no hard and fast standard for how to split.
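A small sketch of the two splitting schemes, assuming scikit-learn's train_test_split (the tool is an assumption; the post only specifies the ratios 7:3 and 7:2:1):

```python
# Sketch: the 7:3 and 7:2:1 splits described above. X and y are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features
y = np.arange(100)                 # placeholder labels

# Scheme 1: training set / test set = 7 : 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scheme 2: training / development / test = 7 : 2 : 1
# (split off 30% first, then divide that holdout 2:1 into dev and test)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_hold, y_hold, test_size=1/3, random_state=0)
print(len(X_train), len(X_dev), len(X_test))  # 70 20 10
```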

3 Summary

In this article, the author first introduced what fitting is, and then the three states it can produce: underfitting, good fitting, and overfitting; the well-fitted model is the final result we want. The author then introduced methods for dealing with underfitting and overfitting; the concrete solutions to overfitting will be explained in the next article. Finally, the author introduced how to split the data set so as to detect and avoid overfitting. This is the end of this content, thanks for reading!
 
