Learning Curve and Validation Curve

Learning curve

  • A learning curve plots the model's score on the training set and on the cross-validation set as the size of the training set varies. It shows how the model performs on new data, and it helps judge whether the model's variance or bias is too high and whether adding training data would reduce overfitting.
  • The leftmost and rightmost cases in the figure differ in whether the accuracy converges to a value above 0.5.
    [Figure: example learning curves]
  • Learning curve code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    # Create the figure inside the function so each call draws on its own canvas.
    plt.figure(figsize=(18, 10), dpi=150)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # Both score arrays have shape (n_train_sizes, n_cv_splits).
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # Shade one standard deviation around each mean curve.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, "o-", color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt


train_data2 = pd.read_csv('./data/zhengqi_train.txt', sep='\t')
test_data2 = pd.read_csv('./data/zhengqi_test.txt', sep='\t')

X = train_data2[test_data2.columns].values
y = train_data2["target"].values
title = "LinearRegression"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = SGDRegressor()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
plt.show()
  • The program draws the following figure:
    [Figure: learning curve produced by the code above]
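
To read a learning curve as numbers rather than by eye, one option is to print the gap between the mean training score and the mean cross-validation score at each training-set size. The sketch below is only an illustration: it uses synthetic data from make_regression as a stand-in for the steam data set, adds a StandardScaler step that the original code does not have, and the names X_demo, y_demo, cv_demo, model and gap are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the data set used in the post).
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# 10 random 80/20 splits, so every training size is scored 10 times.
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))

sizes, train_scores, val_scores = learning_curve(
    model, X_demo, y_demo, cv=cv_demo, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Average over the CV splits (axis=1) to get one value per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n={int(n):4d}  train R^2={tr:.3f}  cv R^2={va:.3f}  gap={g:.3f}")
```

As a rough reading guide: a large gap that shrinks as the training set grows points to high variance, while two curves that converge at a low score point to high bias.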

Validation curve

  • Unlike the learning curve, the horizontal axis of the validation curve is a range of values of one hyperparameter, so it compares the model's score under different parameter settings (rather than different training-set sizes).
  • As the validation curve in the figure below shows, when the hyperparameter changes the model may move from underfitting, to a good fit, to overfitting. An appropriate setting can then be chosen to improve the model's performance.
    [Figure: example validation curve]
  • Validation curve code:
# Validation curve
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

mpl.rcParams.update({
    "text.usetex": False,
    "font.family": "stixgeneral",
    "mathtext.fontset": "stix",
})
X = train_data2[test_data2.columns].values
y = train_data2["target"].values

param_range = [0.1,0.01,0.001,0.0001,0.00001,0.000001]
# Note: the penalty name must be lowercase ("l1"); recent scikit-learn versions reject "L1".
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty="l1"),
    X, y,
    param_name="alpha", param_range=param_range,
    cv=10, scoring="r2", n_jobs=1)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)

test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range,
                train_scores_mean-train_scores_std,
                train_scores_mean+train_scores_std,
                alpha=0.2,
                color="r")
plt.semilogx(param_range,test_scores_mean,
            label="Cross-validation score",
            color="g")
plt.fill_between(param_range,
                test_scores_mean-test_scores_std,
                test_scores_mean+test_scores_std,
                alpha=0.2,
                color="g")
plt.legend(loc="best")
plt.show()
  • The program draws the following figure:
    [Figure: validation curve produced by the code above]
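
Besides plotting, the scores returned by validation_curve can be used directly to pick a setting. A minimal sketch, assuming synthetic make_regression data instead of the steam data set and using hypothetical names (X_demo, y_demo, alphas, best_alpha), is to take the alpha with the highest mean cross-validation score:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (assumption: this is NOT the data set used in the post).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Candidate regularization strengths: 1e-6, 1e-5, ..., 1e-1.
alphas = np.logspace(-6, -1, 6)

train_scores, val_scores = validation_curve(
    SGDRegressor(penalty="l1", max_iter=1000, tol=1e-3, random_state=0),
    X_demo, y_demo,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="r2",
)

# One row per alpha, one column per CV fold; average over the folds.
val_mean = val_scores.mean(axis=1)
best_idx = int(np.argmax(val_mean))
best_alpha = alphas[best_idx]

print(f"best alpha = {best_alpha:g}, mean CV R^2 = {val_mean[best_idx]:.3f}")
```

Choosing by the cross-validation score rather than the training score matters here, because the training score alone tends to favor the setting that overfits.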

Error curve

  • When training is insufficient, the learner's fitting ability is weak and perturbations of the training data are not enough to change the learner significantly. At this stage, bias dominates the generalization error. As training deepens, the learner's fitting ability gradually increases, perturbations of the training data are gradually picked up by the learner, and variance comes to dominate the generalization error. Once training is sufficient, the fitting ability is already very strong, and even a slight perturbation of the training data changes the learner significantly. If the learner picks up characteristics that are specific to the training data rather than global, overfitting occurs.
    [Figure: error curve]
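
The trade-off described above is commonly written as a decomposition of the expected squared error. As a sketch in notation introduced here (not taken from the original post), assume the label is $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$ that is independent of the training set $D$, and let $\hat f_D$ be the learner trained on $D$:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

Under these assumptions, the first term is large when the learner is too weak to follow $f$, and the second term grows as the learner becomes sensitive to the particular training set it saw.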

The relationship between bias, variance and model complexity

  • Variance and bias cannot be avoided, so is there a way to minimize their impact on the model?
  • A good approach is to choose the model's complexity correctly. A model with high complexity usually fits the training data well, but not necessarily the test data. A model with too little complexity cannot even fit the training data well, let alone the test data. Model complexity is therefore related to bias and variance as shown in the figure below.

[Figure: bias and variance as functions of model complexity]
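
The relationship can be checked with a small simulation. The sketch below is an illustration under assumptions that are not in the original post: the true function is sin(2πx), model complexity is the degree of a polynomial fitted with np.polyfit, and bias² and variance are estimated by refitting on many freshly sampled training sets. All names (true_fn, results, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    """Assumed ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)


x_test = np.linspace(0.05, 0.95, 50)        # fixed evaluation points
n_repeats, n_train, noise_sd = 200, 40, 0.3

results = {}
for degree in (1, 3, 7):                    # polynomial degree = model complexity
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set and refit the model.
        x_train = rng.uniform(0.0, 1.0, n_train)
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_test)

    # Bias^2: squared distance between the average prediction and the truth.
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    # Variance: spread of the predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup the straight line (degree 1) should show the largest bias and the smallest variance, and the degree-7 polynomial the reverse, which is the pattern the figure describes.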

Origin blog.csdn.net/weixin_51524504/article/details/130092326