Machine Learning Practical Tutorial (9): Model Generalization

Generalization

Model generalization refers to the ability of a machine learning model to adapt to new, unseen data. In machine learning, we usually divide the existing data set into a training set and a test set, use the training set to train the model, and then use the test set to evaluate the performance of the model. A model that performs well on the training set may not necessarily perform well on the test set or in actual applications. Therefore, we need to ensure that the model has good generalization capabilities to ensure its effectiveness in actual scenarios.
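
As a minimal sketch of this train/test workflow (the dataset and model below are placeholders chosen for illustration, not part of the original tutorial):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the training set, then evaluate on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))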

In order to improve the generalization ability of the model, we usually need to take a series of measures, such as increasing the size of the data set, feature selection, feature scaling, regularization, cross-validation, etc. Through these methods, the overfitting of the model can be reduced and the prediction ability of new data can be improved.

In short, model generalization is a very important concept in machine learning. It is directly related to the effect of the model in practical applications, and is also one of the important indicators for evaluating machine learning algorithms and models.

Model evaluation and selection

Error analysis

The machine's prediction is like throwing darts: the closer to the bull's-eye, the more accurate the prediction. Errors can be divided into two categories, bias and variance, which the following figure depicts vividly:
[Figure: dartboard illustration of bias and variance]
For a learning task, if the hypothesis function is not chosen well, two problems can appear in the fit:

  • Underfitting (underfit): there are too few parameters, the hypothesis function is too constrained and too simple, and it does not even fit the training samples well. The predictions are easily biased to one side, so the bias is large.
  • Overfitting (overfit): there are too many parameters, the hypothesis function is too free and not resistant to noise. It fits the training samples very well, but the hypothesis function is badly distorted, so the predictions swing from side to side and the variance is large.

Underfitting and overfitting can be vividly illustrated by the following figure:
[Figure: underfitting vs. overfitting]
When the complexity of the model or the size of the training set is changed, the training-set and test-set errors behave as shown below:
[Figure: error when changing the model complexity]
[Figure: error when changing the training-set size]

Solving underfitting is relatively simple: just add parameters or features. The real trouble is overfitting.
Solutions to overfitting include:

  • Reduce the number of parameters of the model, or switch to a simpler model.
  • Regularization.
  • Enlarge the training set, reduce the noise in it, etc.

Generalization error

$\{\theta_i\}$ denotes the parameters (including hyperparameters) of a trained model. $J_{unknown}(\{\theta_i\})$ denotes the trained model's error on unknown data: the smaller it is, the stronger the generalization ability. $J_{test}(\{\theta_i\})$ denotes the model's error on the test set: again, the smaller the error, the stronger the generalization ability.

We hope that our model generalizes, that is, that it also works in untrained, unknown situations. The generalization error is the value of the model's cost function on unknown data, $J_{unknown}(\{\theta_i\})$, and it quantifies the model's generalization ability.
However, while training and testing the model we have no truly unknown data. We improve the model based on its performance on the training set, then train and test again, and what we finally compute on the test set is $J_{test}(\{\theta_i\})$. Since the model has been tuned with the test set in the loop, this estimate of the generalization error is clearly too optimistic, i.e. too low. In other words, the model will perform noticeably worse in a real application than expected.
To solve this problem, the method of cross-validation was proposed.

Cross-validation

Cross-validation steps

  1. Further divide the training set into a sub-training set and a cross-validation set. Set the test set aside and do not use it yet (the test set simulates unknown data).
  2. Train each candidate model on the sub-training set and measure its $J_{cv}(\{\theta_i\})$ on the cross-validation set.
  3. Select the model with the smallest $J_{cv}(\{\theta_i\})$ as the best one, then merge the sub-training set and cross-validation set back into the training set to train the final model (a minimal sketch follows below).
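
A minimal sketch of these three steps (the dataset and candidate models are placeholders chosen only for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# The test set is hidden away; it simulates unknown data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Step 1: split the training set into a sub-training set and a cross-validation set
X_sub, X_cv, y_sub, y_cv = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Step 2: train each candidate model on the sub-training set and measure its error on the CV set
candidates = [KNeighborsClassifier(n_neighbors=k) for k in (1, 3, 5, 7)]
cv_errors = []
for model in candidates:
    model.fit(X_sub, y_sub)
    cv_errors.append(1 - model.score(X_cv, y_cv))   # J_cv as a misclassification rate

# Step 3: keep the candidate with the smallest J_cv and retrain it on the whole training set
best = candidates[int(np.argmin(cv_errors))]
best.fit(X_train, y_train)
print("estimated generalization error:", 1 - best.score(X_test, y_test))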

[Figure: splitting the data into sub-training, cross-validation, and test sets]
An improved method of cross-validation is K-fold cross-validation (see the figure below): the training set is divided into K small blocks; in each round one block is used as the cross-validation set and the remaining blocks are merged into the sub-training set, and the model's $J_{cv}(\{\theta_i\})$ is computed. After every block has served as the cross-validation set once, the $J_{cv}(\{\theta_i\})$ values are averaged, and the model with the smallest average is considered the best. In the end, the best model is still trained on the entire training set, and the generalization error is estimated on the test set.

[Figure: K-fold cross-validation]
The advantage of K-fold cross-validation is that it further ensures the cross-validation set is not a special, unrepresentative split, so the estimate of the generalization error is more accurate.

KFold split

In sklearn, we can use the KFold class to implement k-fold cross-validation.
When performing k-fold cross-validation, the KFold object divides the original data set into k subsets of approximately equal size; each subset is called a "fold". For example, a 10-element array with k=5 is split into 5 folds of 2 elements each; each fold serves as the test set in turn, so there are 5 combinations.

from sklearn.model_selection import KFold
import numpy as np

# Create an array of 10 elements as sample data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Define the value of K
k = 5

# Create a KFold object with n_splits set to K
kf = KFold(n_splits=k)

# Iterate over each train/test split produced by the KFold object
for train_index, test_index in kf.split(X):
    print("train_index:", train_index, "test_index:", test_index)

The output is as follows:

train_index: [2 3 4 5 6 7 8 9] test_index: [0 1]
train_index: [0 1 4 5 6 7 8 9] test_index: [2 3]
train_index: [0 1 2 3 6 7 8 9] test_index: [4 5]
train_index: [0 1 2 3 4 5 8 9] test_index: [6 7]
train_index: [0 1 2 3 4 5 6 7] test_index: [8 9]

The number of folds also determines how many validation scores are finally produced on the cross-validation sets.
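
The index arrays returned by KFold can then be used to train and score a model fold by fold; a minimal sketch (with a placeholder dataset and model) looks like this:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data and model; KFold was imported above
X_iris, y_iris = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kf.split(X_iris):
    model = KNeighborsClassifier()
    model.fit(X_iris[train_index], y_iris[train_index])
    fold_scores.append(model.score(X_iris[test_index], y_iris[test_index]))

print(fold_scores)                           # one score per fold
print(sum(fold_scores) / len(fold_scores))   # their average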

cross_val_score in practice

The cross_val_score function is one of the quick ways in the Scikit-learn library to evaluate model performance. It computes cross-validated scores for a model and returns the test score for each fold. Unlike KFold, cross_val_score does not require you to split the dataset explicitly: you only provide a model and a dataset, and the function handles the whole cross-validation process automatically, which makes the code cleaner and easier to understand.
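
For example, a minimal sketch (again with a placeholder dataset and model):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
# cross_val_score splits the data, then trains and scores the model once per fold
scores = cross_val_score(KNeighborsClassifier(), X_iris, y_iris, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # the averaged score used to compare models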

Datasets and models

load_digits is a function in the Scikit-learn library for loading a handwritten digit image dataset. This dataset contains 1797 handwritten digit images of 8x8 pixel size, each image corresponding to a numeric label from 0 to 9.

from sklearn.datasets import load_digits

digits = load_digits()
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 3))

for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray')
    ax.set_title(digits.target[i])

plt.show()

[Figure: the first five handwritten digit images with their labels]
The KNeighborsClassifier in the Scikit-learn library implements the k-nearest neighbor algorithm, in which the hyperparameters k and p affect the performance of the model.

  • n_neighbors (i.e. k): the number of nearest neighbors to consider. The default is 5, which means that when predicting a new sample, the labels of its 5 closest points in the training data are used, and the majority label among those 5 becomes the prediction.
  • p: the power parameter of the Minkowski distance metric. The default is p=2, i.e. the Euclidean distance. Different p values correspond to different distance measures: p=1 is the Manhattan distance, and other values such as p=3 give other Minkowski distances.

Get the best k, p using a train/test split

Split the data set into a training set and a test set, let k range from 2 to 10 and p from 1 to 5, score each combination on the test set, and keep the best k and p.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=666)

best_score, best_p, best_k = 0, 0, 0 
for k in range(2, 11):
    for  p in range(1, 6):
        knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_score, best_p, best_k = score, p, k

print("Best K=", best_k)
print("Best P=", best_p)
print("Best score=", best_score)

Output result:

Best K= 3
Best P= 4
Best score= 0.9860917941585535

Get the optimal k, p using cross-validation

The cross-validation used by the cross_val_score function by default was 3-fold in older versions of Scikit-learn, which divides the data set into 3 equal parts, 2 for training and 1 for testing; in recent versions (0.22 and later) the default is 5-fold. In each fold iteration, the held-out fold is used to obtain a performance score, and the scores of all folds are averaged and returned.

It should be noted that cross_val_score has a parameter named cv, which specifies the number of folds for cross-validation, i.e. the k value. For example, cv=5 means 5-fold cross-validation: the dataset is split into 5 equal parts, 4 for training and 1 for testing. For classification and regression problems, 3-, 5- or 10-fold cross-validation is usually chosen. In general, the more folds, the more reliable the evaluation, but the higher the computational cost.

In short, when the cv parameter is not explicitly set, cross_val_score uses its version-dependent default number of folds (3 in older releases, 5 since Scikit-learn 0.22).

from sklearn.model_selection import cross_val_score

best_score, best_p, best_k = 0, 0, 0 
for k in range(2, 11):
    for  p in range(1, 6):
        knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        scores = cross_val_score(knn_clf, x_train, y_train)
        score = np.mean(scores)
        if score > best_score:
            best_score, best_p, best_k = score, p, k

print("Best K=", best_k)
print("Best P=", best_p)
print("Best score=", best_score)

Output:

Best K= 2
Best P= 2
Best score= 0.9823599874006478

Comparing with the first case, we find that the optimal hyperparameters obtained are different. Although the score is slightly lower, the second result is generally more credible. However, this score only tells us that this set of hyperparameters is the best; it is not the model's accuracy on the test set, so let's check that accuracy next.

best_knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=2, p=2)
best_knn_clf.fit(x_train, y_train)
best_knn_clf.score(x_test, y_test)

Output result: 0.980528511821975. This is the model's accuracy on the test set.

Regularization

Principle

[Figure: underfitting (left), appropriate fitting (middle), and overfitting (right), with the corresponding hypothesis equations]
To understand what regularization is, first look at the equations in the figure above. When there are very few features or little data for training, underfitting often occurs, which corresponds to the plot on the left; the goal we want to reach is usually the plot in the middle, trained with an appropriate set of features and data; but in real life many factors affect the result, that is, there are many feature values, so when training the model, overfitting often occurs, as shown on the right of the figure above.
Taking the formula in the picture as an example, the model we often get is:

$$\theta_{0}+\theta_{1}x+\theta_{2}x^{2}+\theta_{3}x^{3}+\theta_{4}x^{4}$$

To obtain the graph in the middle, we want $\theta_{3}$ and $\theta_{4}$ to be as small as possible: the closer those two terms are to 0, the closer the fit is to the middle graph.
For the loss function
$$\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)^{2}\right]$$
linear regression finds its minimum
$$\min\left(\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)^{2}\right]\right)$$
and computes the corresponding $\theta$ values.
If an extra term is added to the loss function and we still minimize it, the minimization will also push that extra term toward 0, making the minimum smaller.
So what should the added term be? We want it to push $\theta$ toward 0, so that the corresponding features have as little influence on the loss function as possible; in effect, the features are reduced.
Generalizing the formula:
$$\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)^{2}\right]+\lambda\sum_{j=1}^{n}\theta_{j}^{2}$$

Minimizing this new loss function drives the $\theta_j$ values toward 0, which is exactly our goal.
It is equivalent to adding a penalty term (the $\lambda$ term) to the original loss function.
This method of preventing overfitting is usually called L2 regularization; linear regression with an L2 penalty is also called ridge regression.
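
To see concretely why the penalty keeps the parameters small (a worked step added here, not part of the original derivation), take the penalized loss above with the $\lambda$ term outside the $\frac{1}{2m}$ factor, assume the linear hypothesis $h_\theta(x)=\theta^{T}x$, and apply gradient descent with learning rate $\alpha$. For every $j \ge 1$ the update becomes:

$$\theta_j := \theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)x_j^{i}+2\lambda\theta_j\right]=(1-2\alpha\lambda)\,\theta_j-\frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)x_j^{i}$$

Each step first scales $\theta_j$ by the factor $(1-2\alpha\lambda)$ before applying the ordinary gradient step, which is what drives the parameters toward 0.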

We can think of the L2 penalty as making the estimated parameter vector shorter; mathematically this is called shrinkage.

Introduction to shrinkage methods: they take the size of the coefficients into account while solving for the parameters. By setting a penalty coefficient, the coefficients of less influential features are attenuated toward 0 and only the important features are retained, which reduces model complexity and avoids overfitting. Commonly used shrinkage methods are Lasso (L1 regularization) and ridge regression (L2 regularization).
Lasso (L1 regularization) formula:
$$\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{i})-y^{i}\right)^{2}\right]+\lambda\sum_{j=1}^{n}\left|\theta_{j}\right|$$

The logic above can be seen as an application of the Lagrange multiplier method.

There are two main purposes for using shrinkage methods:

  1. On the one hand, the model may include many unnecessary features that act as noise; shrinkage reduces model complexity by eliminating this noise.
  2. On the other hand, if the model's features are multicollinear (the variables are correlated with each other), the model may have multiple solutions, and a single solution of such a model often does not reflect the true situation. Shrinkage can eliminate correlated features and improve the model's stability.

The corresponding geometry

We can simplify the equation for L2 regularization:
$$J=J_{0}+\lambda\sum_{w}w^{2}$$
Here $J_0$ is the original loss function. Assume there are two features, so $w$ has the two components $w_1$ and $w_2$; the regularization term is then:
$$L=\lambda(w_{1}^{2}+w_{2}^{2})$$
We might as well recall the equation of a circle:
$$(x-a)^{2}+(y-b)^{2}=r^{2}$$
where $(a, b)$ is the center of the circle and $r$ is the radius. A circle centered at the origin can therefore be written as $w_{1}^{2}+w_{2}^{2}=r^{2}$, which has exactly the same form as the L2 regularization term. Meanwhile, the task of machine learning is to find the minimum of the loss function by some method (such as gradient descent).
Our task thus becomes finding the solution that minimizes $J_0$ under the constraint $L$ (the Lagrange multiplier method).

The process of minimizing $J_0$ can be drawn as contour lines, and the L2 regularization function $L$ can be drawn on the same two-dimensional $(w_1, w_2)$ plane, as shown in the figure below:
[Figure: contour lines of $J_0$ and the circular L2 constraint in the $(w_1, w_2)$ plane]
$L$ is shown as the black circle in the figure. As gradient descent keeps approaching the minimum, the contours intersect the circle for the first time at a point that is unlikely to lie on a coordinate axis.

This shows that L2 regularization does not easily produce a sparse solution; instead, in minimizing the loss function, $w_1$ and $w_2$ are pushed ever closer to 0, which prevents overfitting.

Ridge Regression

Ridge regression is linear regression with L2 regularization.
Test case:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3 + np.random.normal(0, 1, size=100)

plt.scatter(x, y)
plt.show()

[Figure: scatter plot of the sample data]

Fit with a degree-20 polynomial (to simulate overfitting)

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def PolynomiaRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', LinearRegression()),
    ])


np.random.seed(666)
x_train, x_test, y_train, y_test = train_test_split(X, y)

poly_reg = PolynomiaRegression(degree=20)
poly_reg.fit(x_train, y_train)

y_poly_predict = poly_reg.predict(x_test)
print(mean_squared_error(y_test, y_poly_predict))
# 167.9401085999025
import matplotlib.pyplot as plt
x_plot = np.linspace(-3, 3, 100).reshape(100, 1)
y_plot = poly_reg.predict(x_plot)

plt.scatter(x, y)
plt.plot(x_plot[:,0], y_plot, color='r')
plt.axis([-3, 3, 0, 6])
plt.show()

[Figure: degree-20 polynomial fit showing overfitting]
Encapsulate a function that plots a model's predictions over the interval [-3, 3]:

def plot_model(model):
    x_plot = np.linspace(-3, 3, 100).reshape(100, 1)
    y_plot = model.predict(x_plot)

    plt.scatter(x, y)
    plt.plot(x_plot[:,0], y_plot, color='r')
    plt.axis([-3, 3, 0, 6])
    plt.show()

Use ridge regression:

from sklearn.linear_model import Ridge
def RidgeRegression(degree, alpha):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', Ridge(alpha=alpha)),
    ])

ridege1_reg = RidgeRegression(20, alpha=0.0001)
ridege1_reg.fit(x_train, y_train)

y1_predict = ridege1_reg.predict(x_test)
print(mean_squared_error(y_test, y1_predict))
# Much smaller than the error of the unregularized model above
plot_model(ridege1_reg)

Output error: 1.3233492754136291
[Figure: ridge regression fit with α = 0.0001]
Adjust α = 1

ridege2_reg = RidgeRegression(20, alpha=1)
ridege2_reg.fit(x_train, y_train)

y2_predict = ridege2_reg.predict(x_test)
print(mean_squared_error(y_test, y2_predict))
plot_model(ridege2_reg)

Output: 1.1888759304218461
[Figure: ridge regression fit with α = 1]
Adjust α = 100

ridege2_reg = RidgeRegression(20, alpha=100)
ridege2_reg.fit(x_train, y_train)

y2_predict = ridege2_reg.predict(x_test)
print(mean_squared_error(y_test, y2_predict))
# 1.3196456113086197
plot_model(ridege2_reg)

Output: 1.3196456113086197
[Figure: ridge regression fit with α = 100]
Adjust α = 1000000

ridege2_reg = RidgeRegression(20, alpha=1000000)
ridege2_reg.fit(x_train, y_train)

y2_predict = ridege2_reg.predict(x_test)
print(mean_squared_error(y_test, y2_predict))
# 1.8404103153255003
plot_model(ridege2_reg)

Output: 1.8404103153255003
[Figure: ridge regression fit with α = 1000000]
From these α values we can see that a finer search between 1 and 100 should find the most suitable, relatively smooth curve for the fit. This is L2 regularization.
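
For example, a finer search could simply loop over candidate α values, reusing the RidgeRegression pipeline, the train/test split, and plot_model defined above (a sketch; strictly speaking this selection should itself use cross-validation rather than the test set, for the reasons discussed earlier):

# Sketch: a finer search over alpha between 1 and 100
best_alpha, best_mse = None, float("inf")
for alpha in [1, 3, 10, 30, 100]:
    reg = RidgeRegression(20, alpha)
    reg.fit(x_train, y_train)
    mse = mean_squared_error(y_test, reg.predict(x_test))
    print(alpha, mse)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

print("best alpha:", best_alpha, "MSE:", best_mse)
plot_model(RidgeRegression(20, best_alpha).fit(x_train, y_train))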

LASSO Regularization

Encapsulation

#%%

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
np.random.seed(42)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x + 3 + np.random.normal(0, 1, size=100)
np.random.seed(666)
x_train, x_test, y_train, y_test = train_test_split(X, y)

plt.scatter(x, y)
plt.show()

#%%

from sklearn.linear_model import Lasso
def plot_model(model):
    x_plot = np.linspace(-3, 3, 100).reshape(100, 1)
    y_plot = model.predict(x_plot)

    plt.scatter(x, y)
    plt.plot(x_plot[:,0], y_plot, color='r')
    plt.axis([-3, 3, 0, 6])
    plt.show()
def LassoRegression(degree, alpha):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', Lasso(alpha=alpha)),
    ])
def TestRegression(degree, alpha):
    lasso1_reg = LassoRegression(degree, alpha) 
    # The alpha here is much smaller than the one used for Ridge, because Ridge penalizes the squared coefficients
    lasso1_reg.fit(x_train, y_train)
    
    y1_predict = lasso1_reg.predict(x_test)
    print(mean_squared_error(y_test, y1_predict))
    # 1.149608084325997
    plot_model(lasso1_reg)

Using lasso regression:
Adjust α = 0.01

TestRegression(20,0.01)

Output: 1.149608084325997
[Figure: LASSO fit with α = 0.01]

Adjust α = 0.1

TestRegression(20,0.1)

Output: 1.1213911351818648
[Figure: LASSO fit with α = 0.1]
Adjust α = 1

TestRegression(20,1)

Output: 1.8408939659515595
[Figure: LASSO fit with α = 1]

Explain Ridge and LASSO

[Figure: comparison of the Ridge and LASSO fitted curves]
Comparing the two figures, the model fitted by LASSO is much closer to a straight line, while the model fitted by Ridge remains a smooth curve. This is because the two penalties are essentially different: Ridge tends to make the sum of all squared coefficients as small as possible, shrinking every coefficient toward 0, while LASSO tends to drive some of the coefficients exactly to 0, so it can be used for feature selection. That is where the "Selection Operator" in its name comes from.
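
A quick way to see this difference (a sketch that reuses the RidgeRegression and LassoRegression pipelines and the training data defined above; the α values are only illustrative) is to count the non-zero coefficients each model learns:

# Count the non-zero polynomial coefficients learned by each model
ridge_model = RidgeRegression(20, alpha=1)
ridge_model.fit(x_train, y_train)
lasso_model = LassoRegression(20, alpha=0.1)
lasso_model.fit(x_train, y_train)

print("Ridge non-zero coefficients:", np.sum(ridge_model.named_steps['lin_reg'].coef_ != 0))
print("LASSO non-zero coefficients:", np.sum(lasso_model.named_steps['lin_reg'].coef_ != 0))

Typically Ridge leaves all of the polynomial coefficients small but non-zero, while LASSO sets most of them exactly to 0 and keeps only a few features.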

Origin: blog.csdn.net/liaomin416100569/article/details/130289602