Who will save the fit? Seeing the essence through the phenomenon: how to use regularization to solve overfitting

Preface


At work, I believe many of you have run into overfitting: you build a machine learning model that fits the training samples perfectly, yet it gives very poor predictions on the samples it actually needs to predict. Have you ever wondered why this happens?

This article analyzes overfitting through a regression example and explains how to use regularization techniques to avoid it.

 

[Figure: the classic illustration of underfitting, a good fit, and overfitting]

Every time overfitting comes up, this picture gets dragged out yet again. As shown above, at first the model cannot fit the data points well, i.e., it does not reflect the data distribution; this is underfitting. As training continues, it gradually picks up the pattern of the data and fits as many points as possible while still reflecting the overall trend; at this point it is a well-performing model. If we keep training beyond this, the model starts digging into the details and noise of the training data and fits every data point at any cost; it then overfits.

In other words, reading the figure from left to right, the complexity of the model gradually increases and the prediction error on the training set gradually decreases, but the error on the test set follows a U-shaped curve: it falls at first and then rises again.
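To make this U-shaped curve concrete, here is a minimal sketch (my own illustration on synthetic data, not code from the original post) that fits polynomials of increasing degree with numpy and compares the error on the training points with the error on held-out test points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy quadratic trend (chosen only for illustration).
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=1.0, size=x.shape)

# Simple alternating split into training and test points.
train, test = np.arange(0, 60, 2), np.arange(1, 60, 2)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

for degree in (1, 2, 4, 8, 12):
    # np.polyfit minimizes the squared error on the training points only.
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    train_err = mse(y[train], np.polyval(coeffs, x[train]))
    test_err = mse(y[test], np.polyval(coeffs, x[test]))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

As the degree grows, the training error keeps shrinking, while the test error typically falls at first and then climbs again, tracing the U-shaped curve described above.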

Polynomial regression and overfitting

 

The focus of machine learning (ML) is to train an algorithm on data in order to create a model that makes correct predictions on unseen data (the test data). For example, if we want to create a classifier, we first need to collect the data used to train the ML algorithm. We are responsible for finding the features that best distinguish the different classes, so that the computer can tell them apart; these features are then used to train the ML algorithm. Suppose we want to build an ML model that classifies images into those that contain cats and those that do not, and we train it with the following pictures.

[Figure: training images of cats]

The first question to answer is, "What are the best features for distinguishing the different classes?" This is the key question in machine learning, because better features let the ML model produce better predictions. Taking these images as examples, let's extract some representative features of cats, such as two dark pupils and two pointed ears. Suppose we extract these features in some way and train an ML model on the images above. The model can be applied to images of many different cats, because most cats share these characteristics. We can then test the model on data that needs to be predicted, as shown below. Assume the classification accuracy on the test data is x%.

[Figure: test images for the cat classifier]

You may want to increase the classification accuracy, and the first thing to consider is using more features, since more discriminative features generally mean higher accuracy. By inspecting the training data again we can find additional features, such as the overall body color and the iris color, because all the cats in the training samples are white with yellow irises. The four features shown below form the feature vector, which will be used to retrain the ML model.

[Figure: feature vector with four features: dark pupils, pointed ears, white body color, yellow iris]

After creating the trained model, we need to test it. The expected result with the new feature vector is that the classification accuracy drops below x%. Why? The accuracy decreases because we used features that are present in the training data but not shared by all cat images. All the training images are of cats with white fur and yellow irises, so the machine believes every cat has these two characteristics. In the test data, however, some cats are black and some are yellow, which a model trained only on white cats cannot handle, and some cats' irises are not yellow.

In this case, the features we used are effective and powerful on the training samples but perform very poorly on the test samples. This phenomenon is called overfitting: the model is trained on certain features of the training data that do not carry over to the test data.

The purpose of the discussion so far is to explain, through a high-level example, what overfitting is. To understand its details, it is better to use a simpler example, so the rest of the discussion is based on a regression example.

 

Understand regularization based on regression examples

Suppose we want to create a regression model that fits the data shown below. We can use polynomial regression.

[Figure: data points to be fit with polynomial regression]

The simplest model we can use first is a linear model with a first-order polynomial equation:

y1 = θ0 + θ1·x

θ0 and θ1 are the model parameters and x is the only feature used.

The diagram of the previous model looks like this:

[Figure: the first-degree (linear) model plotted against the data]

According to the loss function shown below, we can draw the conclusion that the model cannot fit the data well.

L = Σ (d_i − f_i(x_i))²   (summed over all training samples i)

f_i(x_i) is the predicted output for sample i, and d_i is the desired output of the same sample.
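In code this loss is just a sum of squared differences between the desired outputs d_i and the model's predictions f_i(x_i). A minimal sketch (the data and parameter values below are made up for illustration):

```python
import numpy as np

def sse_loss(d, f):
    """Sum of squared errors between desired outputs d_i and predictions f_i(x_i)."""
    d, f = np.asarray(d, dtype=float), np.asarray(f, dtype=float)
    return float(np.sum((d - f) ** 2))

# A poor first-degree fit leaves a large loss on these made-up points.
x = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1.0, 0.2, 1.5, 4.0])      # desired outputs d_i
theta0, theta1 = 0.5, 0.9               # hypothetical parameters of y1 = θ0 + θ1·x
predictions = theta0 + theta1 * x       # f_i(x_i)
print(sse_loss(d, predictions))         # the value the training procedure tries to minimize
```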

This model is too simple and many of its predictions are inaccurate, so we should create a more complex model that can fit the data well. We can increase the degree of the equation from one to two, as shown below:

y2 = θ0 + θ1·x + θ2·x²

By raising the same feature x to the power of two (x²), we create a new feature; the model can now capture not only the linear structure of the data but also some of its non-linear structure. The diagram of the new model is as follows:

 

[Figure: the second-degree model plotted against the data]

The figure shows that the second-degree polynomial fits the data better than the first-degree one, but the quadratic model still fails on some of the data samples. That is why we can create a more complex third-degree model, as shown below:

y3 = θ0 + θ1·x + θ2·x² + θ3·x³

The diagram will look like this:

[Figure: the third-degree model plotted against the data]

After adding a new feature that captures the third-order structure of the data, the model fits the data even better. To fit the data more closely still, we can raise the degree of the equation to four, as shown in the following equation:

y4 = θ0 + θ1·x + θ2·x² + θ3·x³ + θ4·x⁴

The diagram will look like this:

[Figure: the fourth-degree model plotted against the data]
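All four models can be fit with ordinary least squares by building a design matrix whose columns are powers of x. The sketch below (my own illustration on synthetic data, not code from the post) fits y1 through y4 and prints their coefficients and training loss:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 30)
d = np.sin(1.5 * x) + rng.normal(scale=0.2, size=x.shape)    # synthetic desired outputs

def fit_polynomial(x, d, degree):
    """Least-squares fit of y = θ0 + θ1·x + ... + θn·x^n for n = degree."""
    X = np.vander(x, N=degree + 1, increasing=True)           # columns: 1, x, ..., x^degree
    theta, *_ = np.linalg.lstsq(X, d, rcond=None)
    return theta

for degree in (1, 2, 3, 4):
    theta = fit_polynomial(x, d, degree)
    X = np.vander(x, N=degree + 1, increasing=True)
    sse = float(np.sum((d - X @ theta) ** 2))
    print(f"y{degree}: θ = {np.round(theta, 3)}, training SSE = {sse:.3f}")
```

The training loss drops as the degree increases, which is exactly what tempts us toward ever higher degrees.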

The higher the degree of the polynomial, the better it fits the data. But some important questions remain. If accuracy improves by adding new features and increasing the degree of the polynomial, why not use a very high degree, say 100? In other words, what is the optimal degree for this kind of problem?

 

Model capacity/complexity

There is a notion called model capacity or complexity. It refers to the range of variation the model is able to represent: the higher the capacity, the more variation the model can handle. Compared with y4, the first model y1 is a low-capacity model. In our example, we increase the capacity by increasing the degree of the polynomial.

Of course, the higher the degree of the polynomial, the more closely it matches the data. But keep in mind that increasing the degree also increases the complexity of the model. Using a model with higher capacity than required may lead to overfitting: the model becomes very complex, fits the training data well, but fails badly on the data it needs to predict. The goal of ML is not only to build a model that fits the training data well, but also one that generalizes to the samples it needs to predict.

The fourth-degree model (y4) is very complex. Yes, it fits the data we already have very well, but it will not carry over to the data that needs to be predicted. In this case, the feature newly introduced in y4 (namely x⁴) captures more detail than we need. This new feature makes the model too complex, so we should give up using it.

In this example we actually know which feature to remove, so we can simply drop it and go back to the third-degree model used before. But in real work we do not know which features to remove. Moreover, suppose the new feature is not that bad and we do not want to remove it completely, only to reduce its negative impact as much as possible; what should we do then?

Let's revisit the loss function, whose only goal so far has been to minimize the prediction error. We can set a new goal: also minimize, as much as possible, the effect of the new feature. After modifying the loss function to penalize x⁴, the formula becomes:

 

L = Σ (d_i − f_i(x_i))² + θ4·x⁴

 

Our goal is still to minimize the loss function, and now we also want to make the term θ4·x⁴ as small as possible. Obviously, to minimize θ4·x⁴ we should minimize θ4, since it is the only free parameter. We can set its value to zero; if the feature turns out to hurt performance badly, this amounts to removing it completely, as shown below:

y4 = θ0 + θ1·x + θ2·x² + θ3·x³ + 0·x⁴ = y3

By removing it, we return to the third-degree polynomial (y3). y3 does not match every existing data point as perfectly as y4, but in general it performs better than y4 on the data that needs to be predicted.

However, if x⁴ is a reasonably good feature and we only want to limit it rather than remove it completely, we can set θ4 to a small value close to, but greater than, zero (such as 0.1), as shown below. By doing this we limit the effect of x⁴, and the new model is not as complex as before.

y4 = θ0 + θ1·x + θ2·x² + θ3·x³ + 0.1·x⁴
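Here is a small sketch of this idea (my own illustration, not the post's code). Instead of the θ4·x⁴ term written above, it adds a penalty proportional to θ4² to the squared-error loss, which serves the same purpose of shrinking only the x⁴ coefficient: a zero penalty leaves θ4 free, and a huge penalty drives θ4 to zero, effectively removing the feature.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 40)
# Synthetic data generated from a cubic, so the x^4 feature is genuinely "extra".
d = 0.3 + 1.2 * x - 0.7 * x**2 + 0.5 * x**3 + rng.normal(scale=0.05, size=x.shape)

X = np.vander(x, 5, increasing=True)        # columns: 1, x, x^2, x^3, x^4 -> θ0..θ4

def fit_penalizing_theta4(X, d, penalty):
    """Minimize ||X·θ − d||² + penalty·θ4², i.e. shrink only the x^4 coefficient."""
    P = np.zeros((5, 5))
    P[4, 4] = penalty                        # the penalty touches θ4 only
    return np.linalg.solve(X.T @ X + P, X.T @ d)

for penalty in (0.0, 10.0, 1e6):
    theta = fit_penalizing_theta4(X, d, penalty)
    print(f"penalty {penalty:>8g}: θ4 = {theta[4]: .5f}")
```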

Looking back at y2, it seems even simpler than y3, and it handles both the data we already have and the data that needs to be predicted well. Therefore, we should also remove the feature introduced in y3 (i.e., x³), or, if its performance is not that bad, merely shrink it. We can modify the loss function to do this:

L = Σ (d_i − f_i(x_i))² + θ3·x³ + θ4·x⁴

y4 = θ0 + θ1·x + θ2·x² + 0.1·x³ + 0.1·x⁴

Regularization

 

Note that we only know y2 is the best model for this data because we can see a plot of the data. This is a very simple task that we can solve by hand. But when that information is not available, and the number of samples and the complexity of the data grow, we can no longer reach such a conclusion by inspection. There must be an automatic way to tell us which degree best fits the data provided, and which features' influence must be reduced to get the best predictions. This is regularization.

Regularization helps us choose a model complexity that fits the data. It automatically identifies the disruptive features that make the model too complex, which greatly improves prediction quality. Remember that regularization is most useful when those features are still somewhat informative, so that we only want to shrink them rather than remove them completely. Also note that regularization penalizes all of the features used, not a hand-picked subset. Earlier we restricted only the two features x³ and x⁴, not all features, so that case does not strictly count as regularization.

When regularization is used, a new term is added to the loss function to penalize the features, so the loss function looks like this:

L = Σ (d_i − f_i(x_i))² + λ · Σ θj²   (j = 1, …, 4)

By moving λ inside the summation, it can also be written as the following formula:

L = Σ (d_i − f_i(x_i))² + Σ λ·θj²   (j = 1, …, 4)

The newly added term controls the complexity of the model by penalizing the features through their parameters. Before adding the regularization term, our goal was only to minimize the prediction error; now the goal is to minimize the error while keeping the model from becoming too complex and overfitting.
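As a minimal sketch of this regularized fit (my own illustration; it uses the squared-coefficient penalty shown above and the closed-form solution of the penalized least-squares problem), the penalty simply adds λ to the diagonal of the normal equations for θ1…θ4:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
d = 0.3 + 1.2 * x - 0.7 * x**2 + rng.normal(scale=0.1, size=x.shape)   # synthetic targets

def fit_regularized(x, d, lam, degree=4):
    """Minimize Σ(d_i − f(x_i))² + λ·Σ_{j=1..degree} θj² for a polynomial model."""
    X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, ..., x^degree
    P = lam * np.eye(degree + 1)
    P[0, 0] = 0.0                                   # θ0 is not penalized (see below)
    return np.linalg.solve(X.T @ X + P, X.T @ d)

print("θ0..θ4 with λ = 1:", np.round(fit_regularized(x, d, lam=1.0), 3))
```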

The regularization parameter lambda (λ) controls how strongly the features are penalized. It is a hyperparameter with no fixed value; its value changes from task to task. As λ increases, the penalty on the features grows and the model becomes simpler. As λ decreases, the penalty weakens and the complexity of the model increases. When λ is zero, the features are not penalized at all.

When λ is zero, the values of θj are not restricted at all, as shown in the following equations. Setting λ to zero removes the regularization term and leaves only the error term, so the goal goes back to minimizing the prediction error, driving it as close to zero as possible. When minimizing the training error is the only goal, the model may overfit.

With λ = 0:

L = Σ (d_i − f_i(x_i))² + 0 · Σ θj² = Σ (d_i − f_i(x_i))²

However, when the value of λ is very large (for example, 10⁹), a heavy penalty is placed on the parameters θj, and to keep the loss small the parameters θj are driven toward zero. As a result, the model (y4) is pruned, as shown below.

With λ = 10⁹:

L = Σ (d_i − f_i(x_i))² + 10⁹ · Σ θj²

θ1 ≈ θ2 ≈ θ3 ≈ θ4 ≈ 0, so y4 ≈ θ0
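Continuing the same kind of sketch (again my own illustration on synthetic data), sweeping λ shows both extremes described above: with λ = 0 we recover the plain least-squares fit with no restriction at all, and with a huge λ such as 10⁹ the penalized coefficients collapse to roughly zero, leaving essentially only the intercept θ0.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
d = 0.3 + 1.2 * x - 0.7 * x**2 + rng.normal(scale=0.1, size=x.shape)

X = np.vander(x, 5, increasing=True)       # columns: 1, x, x^2, x^3, x^4
for lam in (0.0, 1.0, 1e9):
    P = lam * np.eye(5)
    P[0, 0] = 0.0                          # θ0 is left unpenalized
    theta = np.linalg.solve(X.T @ X + P, X.T @ d)
    print(f"λ = {lam:>6g}: θ =", np.round(theta, 4))
# λ = 0    -> the unregularized least-squares coefficients.
# λ = 1e9  -> θ1..θ4 are driven to ~0, so the model is roughly the constant y ≈ θ0.
```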

 

Note that the index j of the regularization term starts from 1, not 0. We use the regularization term to penalize the features through their parameters; since θ0 is not associated with any feature, there is no reason to penalize it. In the extreme case above, the model therefore reduces to y4 = θ0, as shown in the following graph:

 

[Figure: with a very large λ, the model collapses to the horizontal line y4 = θ0]
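Finally, as noted earlier, λ is a hyperparameter with no fixed value; in practice it is usually chosen by trying several candidates and keeping the one that gives the lowest error on held-out data. A minimal sketch of that selection loop (my own illustration, under the same assumptions as the sketches above):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 60)
d = 0.3 + 1.2 * x - 0.7 * x**2 + rng.normal(scale=0.1, size=x.shape)

train, valid = np.arange(0, 60, 2), np.arange(1, 60, 2)        # simple alternating split

def fit_regularized(x, d, lam, degree=4):
    X = np.vander(x, degree + 1, increasing=True)
    P = lam * np.eye(degree + 1)
    P[0, 0] = 0.0                                              # do not penalize θ0
    return np.linalg.solve(X.T @ X + P, X.T @ d)

best = None
for lam in (0.0, 0.01, 0.1, 1.0, 10.0, 100.0):
    theta = fit_regularized(x[train], d[train], lam)
    X_valid = np.vander(x[valid], 5, increasing=True)
    err = float(np.mean((d[valid] - X_valid @ theta) ** 2))
    best = (lam, err) if best is None or err < best[1] else best
    print(f"λ = {lam:>6g}: validation MSE = {err:.4f}")
print("selected λ:", best[0])
```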


Origin blog.csdn.net/wenyusuran/article/details/113868226