Andrew Ng (Wu Enda) Machine Learning Homework, Python Implementation (5): Bias and Variance

Table of contents

Regularized Linear Regression

data visualization

Regularized linear regression cost function

Regularized Linear Regression Gradients

Fit Linear Regression

Bias and Variance

learning curve

polynomial regression

 Select λ using the validation set

Calculating the test set error

reference article


Regularized Linear Regression

        In the first half of the exercise, you will implement regularized linear regression to predict the amount of water flowing out of a dam from the change in the water level of a reservoir. In the second half, you will run some diagnostics for debugging learning algorithms and examine the effects of bias and variance.

data visualization

        First, we will visualize a dataset containing historical records of the change in water level, X, and the amount of water flowing out of the dam, y. The dataset is divided into three parts:

        Training set X, y, used to train the model

        Validation set Xval, yval, used to select regularization parameters

        Test set Xtest, ytest, for evaluating performance

        First, import the libraries used

import numpy as np
import scipy.io as sio
import scipy.optimize as opt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

        Then load the data used in this experiment, extract each data set separately, and check its size

# Load the data
path = r'E:\Code\ML\ml_learning\ex5-bias vs variance\ex5data1.mat'
data = sio.loadmat(path)
X, y, Xval, yval, Xtest, ytest = data['X'], data['y'], data['Xval'], data['yval'], data['Xtest'], data['ytest']
# X.shape, y.shape, Xval.shape, yval.shape, Xtest.shape, ytest.shape
# ((12, 1), (12, 1), (21, 1), (21, 1), (21, 1), (21, 1))

        Next, visualize the training data

# Visualization
def plotdata(X,y):
    plt.figure(figsize=(12,8))
    plt.scatter(X, y,c='r', marker='x')
    plt.xlabel('Change in water level (x)')
    plt.ylabel('Water flowing out of the dam (y)')
    plt.grid(True)
plotdata(X,y)

Regularized linear regression cost function

        The cost function of regularized linear regression is as follows

        J(\theta) = \frac{1}{2m}(\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}) +\frac{\lambda}{2m}(\sum_{j=1}^{n}\theta_{j}^{2})

        Here λ is the regularization parameter. It controls the degree of regularization and thereby helps prevent overfitting. The regularization term adds a penalty to the total cost J(θ): as the magnitudes of the model parameters θj grow, the penalty grows too. Note that the θ0 term is not penalized, that is, the index j starts from 1

# Insert the column x0 = 1
X, Xval, Xtest = [np.insert(x, 0, np.ones(x.shape[0]), axis=1) for x in (X, Xval, Xtest)]
# X.shape, Xval.shape, Xtest.shape
# ((12, 2), (21, 2), (21, 2))

# Cost function
def costReg(theta, X, y, l=1):
    m = X.shape[0] # 12
    theta = np.matrix(theta) # (1,n)
    X = np.matrix(X) # (12,n)
    y = np.matrix(y) # (12,1)
    inner = (X * theta.T) - y # (12,1)
    part1 = ((1 / (2 * m)) * inner.T * inner).item()  # squared-error term, (1,1) -> scalar
    # Penalize every parameter except theta_0 (columns 1 onward)
    part2 = ((l / (2 * m)) * theta[:, 1:] * theta[:, 1:].T).item()  # (1,1) -> scalar
    cost = part1 + part2
    
    return cost

        With the initial θ = [1; 1] and λ = 1, the output is 303.993. Note that the regularization term in the second half of the formula must not include θ0; otherwise the output becomes 304.04. I had to check the code several times before I found this mistake.
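        To make the effect of the θ0 term concrete, here is a minimal sketch of the same cost written with plain NumPy arrays. The function name cost_reg_array, the penalize_bias flag, and the three-point toy data are assumptions for illustration only; they are not part of the exercise.

```python
import numpy as np

def cost_reg_array(theta, X, y, l=1.0, penalize_bias=False):
    """Regularized linear regression cost on plain ndarrays (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    m = X.shape[0]
    residual = X @ theta - y
    fit_term = residual @ residual / (2 * m)
    # Skip theta[0] unless the caller deliberately asks for the incorrect variant
    penalized = theta if penalize_bias else theta[1:]
    reg_term = l * (penalized @ penalized) / (2 * m)
    return fit_term + reg_term

# Toy data (an assumption): x = 1, 2, 3 with a leading column of ones
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1.0, 2.0, 3.0])
theta_toy = np.ones(2)

correct = cost_reg_array(theta_toy, X_toy, y_toy, l=1.0)
wrong = cost_reg_array(theta_toy, X_toy, y_toy, l=1.0, penalize_bias=True)
print(correct, wrong)
```

        The two values differ by exactly λ·θ0²/(2m), which is the same kind of small offset as the one between 303.993 and 304.04.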

Regularized Linear Regression Gradients

        Correspondingly, the regularized linear regression gradient formula is defined as

        \frac{\partial J(\theta)}{\partial \theta_{0}}=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} \qquad for \ \ j = 0

        \frac{\partial J(\theta)}{\partial \theta_{j}}=(\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)})+\frac{\lambda}{m}\theta_{j} \qquad for \ \ j \geq 1

# Regularized gradient
def gradientReg(theta, X, y, l=1):
    m = X.shape[0] # 12
    theta = np.matrix(theta) # (1,2)
    X = np.matrix(X) # (12,2)
    y = np.matrix(y) # (12,1)
    # (2,12) * ((12,2) * (2,1) - (12,1)) = (2,1)
    inner = (1 / m) * X.T * ((X * theta.T) - y)
    reg = (l / m) * theta # (1,2)
    reg[0,0] = 0  # theta_0 is not regularized
    # Flatten (2,1) to (2,) so that scipy.optimize receives a 1-D gradient
    return np.asarray(inner + reg.T).ravel()

        Note in the code that the gradient of θ0 is not regularized

        With the initial θ = [1; 1] and λ = 1, the computed gradient is [-15.303; 598.250]
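        A common way to validate a gradient like this is to compare it against a finite-difference approximation of the cost. The sketch below does that on a small made-up dataset; the helper names cost, grad, and numerical_grad and the toy values are assumptions, not part of the exercise code.

```python
import numpy as np

def cost(theta, X, y, l):
    """Regularized cost; theta[0] is excluded from the penalty."""
    m = X.shape[0]
    r = X @ theta - y
    return (r @ r + l * (theta[1:] @ theta[1:])) / (2 * m)

def grad(theta, X, y, l):
    """Analytic gradient matching the formulas for j = 0 and j >= 1."""
    m = X.shape[0]
    g = X.T @ (X @ theta - y) / m
    g[1:] += (l / m) * theta[1:]   # theta_0 is not regularized
    return g

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

# Toy data (an assumption for illustration)
X_toy = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0], [1.0, 4.0]])
y_toy = np.array([0.5, 1.0, 4.0, 7.0])
theta_toy = np.ones(2)

analytic = grad(theta_toy, X_toy, y_toy, l=1.0)
numeric = numerical_grad(lambda t: cost(t, X_toy, y_toy, 1.0), theta_toy)
print(np.max(np.abs(analytic - numeric)))
```

        If the θ0 entry were regularized by mistake, the first component of the two gradients would disagree by λ·θ0/m.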

Fit Linear Regression

        In this part we set λ to 0: the current model has only two parameters, so regularization will not help much. We call the minimize function in scipy.optimize to compute the optimal parameters, then plot the fitted line together with the raw data

# Fit linear regression
theta = np.ones(X.shape[1])  # initial parameters
final_theta = opt.minimize(fun=costReg, x0=theta, args=(X, y, 0), method='TNC', jac=gradientReg, options={'disp': True}).x
def plotdata1(theta, X, y):
    fig,ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,1], y, c='r', label="Training data")
    plt.plot(X[:,1], X @ theta, c = 'b', label="Prediction")
    ax.set_xlabel("water_level")
    ax.set_ylabel("flow")
    ax.legend()
    plt.show()

plotdata1(final_theta, X, y)
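        As a cross-check on the optimizer, the minimizer of this cost also has a closed form: setting the gradient to zero gives (XᵀX + λL)θ = Xᵀy, where L is the identity matrix with its top-left entry set to 0. This is a swapped-in technique, not the method used in this post, and the name normal_equation_reg and the toy data are assumptions.

```python
import numpy as np

def normal_equation_reg(X, y, l=0.0):
    """Closed-form minimizer of the regularized cost (hypothetical helper)."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                      # do not penalize the intercept
    # Solve (X^T X + l * L) theta = X^T y
    return np.linalg.solve(X.T @ X + l * L, X.T @ np.ravel(y))

# Toy data (an assumption): points that lie exactly on y = 1 + 2x
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

theta_closed = normal_equation_reg(X_toy, y_toy, l=0.0)
print(theta_closed)   # intercept close to 1, slope close to 2
```

        On the real data, comparing this result with final_theta is a quick way to confirm that the optimizer converged.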

        The line of best fit tells us that the model does not fit the data well because of the nonlinearity in the data. While visualizing the best fit as shown is one possible way to debug a learning algorithm, it is not always easy to visualize the data and model. We can then implement a function that generates a learning curve, which can help debug the learning algorithm even if the data is not easily visualized.

Bias and Variance

        An important concept in machine learning is the bias-variance tradeoff. Bias reflects how well the model can fit the data: a model with high bias tends to underfit. Variance reflects how well the model generalizes: a model with high variance overfits the training data, that is, it generalizes poorly to new samples. We can diagnose bias and variance problems by plotting the training set error and the validation set error

learning curve

        To plot the learning curve, we need the training and cross-validation set errors for different training set sizes. To obtain different training set sizes, use different subsets of the original training set X. Specifically, for a training set of size i, use the first i examples (X(1:i, :) and y(1:i) in Octave notation, X[:i, :] and y[:i] in NumPy).

        The optimal parameters of the model can be found with the minimize function mentioned above. Once the parameters are obtained, the errors on the training set and the validation set are computed. The training error of a data set is defined as

J_{train}(\theta)=\frac{1}{2m}[\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}]

        In particular, note that the training error does not include a regularization term. One way to compute it is to reuse the existing cost function with λ set to 0, but only when computing the training error and the cross-validation error, not when fitting the parameters. When computing the training set error, make sure it is computed on the training subset (i.e. the first i examples), not on the entire training set. The cross-validation error, however, is computed on the entire cross-validation set. Finally, store the computed errors in lists so that they can be plotted against the training set size

# Learning curve
def linear_regression(X,y,l=1):
    """Find the optimal parameters"""
    theta = np.ones(X.shape[1]) # initialize the parameters
    # train the parameters
    res = opt.minimize(fun=costReg,x0=theta,args=(X,y,l),method='TNC',jac=gradientReg, options={'disp': True})
    return res.x

def plot_learning_curve(X, y, Xval, yval, l):
    """Plot the learning curve"""
    m = X.shape[0]
    training_cost, cv_cost = [], []
    
    for i in range(1,m+1):
        res = linear_regression(X[:i, :], y[:i], l)

        tc = costReg(res, X[:i, :], y[:i], 0)
        cv = costReg(res, Xval, yval, 0)

        training_cost.append(tc)
        cv_cost.append(cv)
    
    plt.figure(figsize=(12,8))
    plt.plot(np.arange(1, m+1), training_cost, label='training cost')
    plt.plot(np.arange(1, m+1), cv_cost, label='cv cost')
    plt.legend()
    plt.xlabel("Number of training examples")
    plt.ylabel("Error")
    plt.title("Learning curve for linear regression")
    plt.grid(True)
    plt.show()

plot_learning_curve(X, y, Xval, yval, 0)

        The figure shows that as the number of training samples increases, both the training error and the validation error remain high. Linear regression cannot fit this data set well, which is a high-bias problem.
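        The same high-bias pattern can be reproduced without the exercise data. The sketch below fits a straight line to made-up quadratic data and records both errors for each training set size; it uses np.linalg.lstsq in place of the optimizer, and the names half_mse and learning_curve_errors are assumptions.

```python
import numpy as np

def half_mse(theta, X, y):
    """Unregularized error: the cost with lambda = 0."""
    r = X @ theta - y
    return r @ r / (2 * X.shape[0])

def learning_curve_errors(X, y, Xval, yval):
    """Train on the first i examples, evaluate on that subset and on all of Xval."""
    train_err, val_err = [], []
    for i in range(1, X.shape[0] + 1):
        theta, *_ = np.linalg.lstsq(X[:i], y[:i], rcond=None)
        train_err.append(half_mse(theta, X[:i], y[:i]))
        val_err.append(half_mse(theta, Xval, yval))
    return np.array(train_err), np.array(val_err)

# Synthetic data (an assumption): y = x^2, which a straight line cannot fit
x_train = np.linspace(-3, 3, 12)
x_val = np.linspace(-2.5, 2.5, 9)
X_toy = np.column_stack([np.ones_like(x_train), x_train])
Xval_toy = np.column_stack([np.ones_like(x_val), x_val])
y_toy, yval_toy = x_train ** 2, x_val ** 2

train_err, val_err = learning_curve_errors(X_toy, y_toy, Xval_toy, yval_toy)
print(train_err[-1], val_err[-1])
```

        With one or two examples the line passes through every training point, so the training error starts near zero; with all twelve examples both errors stay well above zero, which is the signature of underfitting.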

polynomial regression

        The problem with the previous model is that it is too simple for the data, which leads to underfitting, so we add more features to address it. For polynomial regression, the hypothesis takes the following form, where the added features are powers of the original value.

        h_{\theta}(x)=\theta_{0}+\theta_{1}x+\theta_{2}x^{2}+\cdots+\theta_{p}x^{p}

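def poly_features(X, power):
    """
    Polynomial features.
    Each iteration appends the next power of x as the last column of X.
    Powers are taken from the second column, because the first column of X is x0 = 1.
    """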
    Xpoly = X.copy()
    for i in range(2, power + 1):
        Xpoly = np.insert(Xpoly, Xpoly.shape[1], np.power(Xpoly[:,1], i), axis=1)    
    return Xpoly
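        To see what the expanded matrix looks like, here is a small sketch that builds the same kind of feature matrix with np.column_stack and applies it to two made-up rows. The function name poly_features_demo and the input values are assumptions for illustration.

```python
import numpy as np

def poly_features_demo(X, power):
    """Append x^2 ... x^power to a matrix whose columns are [1, x] (hypothetical helper)."""
    extra = [X[:, 1] ** p for p in range(2, power + 1)]
    return np.column_stack([X] + extra)

# Two made-up rows with x = 2 and x = 3
X_toy = np.array([[1.0, 2.0], [1.0, 3.0]])
X_poly = poly_features_demo(X_toy, 3)
print(X_poly)   # rows are [1, x, x^2, x^3]
```

        Each row keeps the leading 1 and x and gains one column per extra power, so power = 6 turns the (12, 2) training matrix into a (12, 7) matrix.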

        Because the added features are powers of x, their scales differ greatly. For example, for x = 40 the eighth power is about 6.6 × 10^12, so we need to standardize the features.

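def get_means_std(X):
    """Get the means and standard deviations of the training set"""
    means = np.mean(X,axis = 0) # column-wise
    # ddof = 1 gives the sample standard deviation
    # ddof = 0 gives the population standard deviation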
    stds = np.std(X, axis=0, ddof=1)
    
    return means, stds


def featureNormalize(X, means, stds):
    """Standardize every column except x0"""
    X_norm = X.copy()
    X_norm[:,1:] = (X_norm[:,1:] - means[1:]) / stds[1:]
    return X_norm
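        The scale problem and its fix can be checked on a few made-up values. The sketch below is only an illustration under assumed inputs: it builds a matrix with columns 1, x, and x^8, then standardizes all columns except the bias column using the sample standard deviation (ddof=1), as in the functions above.

```python
import numpy as np

# Made-up water-level changes (an assumption for illustration)
x = np.array([-40.0, -10.0, 0.0, 20.0, 40.0])
X_raw = np.column_stack([np.ones_like(x), x, x ** 8])
print(40 ** 8)   # 6553600000000, on the order of 10^12

means = X_raw.mean(axis=0)
stds = X_raw.std(axis=0, ddof=1)   # sample standard deviation

X_scaled = X_raw.copy()
# Leave column 0 (all ones) untouched: its standard deviation is 0
X_scaled[:, 1:] = (X_raw[:, 1:] - means[1:]) / stds[1:]
print(X_scaled[:, 1:].mean(axis=0), X_scaled[:, 1:].std(axis=0, ddof=1))
```

        After scaling, every non-bias column has mean 0 and sample standard deviation 1, so no single power dominates the optimization.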

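        Before plotting the learning curve, we need to preprocess the data

# Data preprocessing
power = 6  # expand up to the 6th power of x
# Means and standard deviations of the training set
train_means, train_stds = get_means_std(poly_features(X, power))
# Standardize all three sets with the training-set statistics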
X_norm = featureNormalize(poly_features(X, power), train_means, train_stds)
Xval_norm = featureNormalize(poly_features(Xval, power), train_means, train_stds)
Xtest_norm = featureNormalize(poly_features(Xtest, power), train_means, train_stds)

        Then plot the fitted curve and the learning curve. Here is the case λ = 0

def plot_fit(means, stds, l):
    """Plot the fitted curve"""
    theta = linear_regression(X_norm,y, l)
    
    x = np.linspace(-75,55,50)
    
    xmat = x.reshape(-1, 1) # (50,)->(50,1)
    xmat = np.insert(xmat,0,1,axis=1) # add x0 = 1
    
    Xmat = poly_features(xmat, power) # add polynomial features
    Xmat_norm = featureNormalize(Xmat, means, stds) # normalize the features

    
    plotdata(X[:,1], y) # plot the raw data
    plt.plot(x, Xmat_norm @ theta, 'b--') # plot the fitted curve
    
plot_fit(train_means, train_stds, 0)
plot_learning_curve(X_norm, y, Xval_norm, yval, 0) # plot the learning curve

        The fit plot shows that the polynomial follows the training data closely. The learning curve shows that the training error stays low as the number of training samples increases. Although the validation error trends downward overall, its final value is still large. The gap between the training error and the validation error indicates that the model is overfitting the training data: it has a high-variance problem and does not generalize well.

        When λ = 1, both the training error and the validation error converge to a relatively small value, which means the model fits reasonably well.

        When λ = 100, the fit is very poor, and both the training error and the validation error converge to a large value, indicating that there is too much regularization and the model cannot fit the data.
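        Why a large λ flattens the fit can be seen in a one-feature worked example. For centered data, minimizing (1/2m)·Σ(θ1·x − y)² + (λ/2m)·θ1² gives θ1 = Σxy / (Σx² + λ). The sketch below evaluates this for the three values of λ discussed above; the toy data and the function name slope are assumptions.

```python
import numpy as np

# Toy data (an assumption): centered, and exactly y = 2x
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-2.0, 0.0, 2.0])

def slope(l):
    """Minimizer of (1/2m) * sum((theta1 * x - y)^2) + (l/2m) * theta1^2."""
    return (x @ y) / (x @ x + l)

slopes = [slope(l) for l in (0.0, 1.0, 100.0)]
print(slopes)
```

        With λ = 0 the true slope 2 is recovered, with λ = 1 it shrinks to 4/3, and with λ = 100 it is close to 0, so the prediction is almost a flat line regardless of the data.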

 

 Select λ using the validation set

        The value of λ can be selected as [0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1., 3., 10.]

lambdas = [0., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1., 3., 10.]
errors_train, errors_val = [], []
for l in lambdas:
    theta = linear_regression(X_norm, y, l)
    errors_train.append(costReg(theta,X_norm,y,0))  # remember to set lambda = 0 when evaluating
    errors_val.append(costReg(theta,Xval_norm,yval,0))
    
plt.figure(figsize=(8,5))
plt.plot(lambdas,errors_train,label='Train')
plt.plot(lambdas,errors_val,label='Cross Validation')
plt.legend()
plt.xlabel('lambda')
plt.ylabel('Error')
plt.grid(True)

        

lambdas[np.argmin(errors_val)]  # 3.0

        We conclude that the validation error is smallest when λ = 3
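        The selection procedure can be wrapped in a small reusable function. The sketch below is one possible arrangement, not the code used in this post: it swaps the optimizer for a closed-form fit, and the names fit_ridge, half_mse, select_lambda, and design as well as the synthetic target are assumptions.

```python
import numpy as np

def fit_ridge(X, y, l):
    """Closed-form regularized fit; a stand-in for the optimizer (assumption)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                      # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + l * L, X.T @ y)

def half_mse(theta, X, y):
    """Unregularized error, i.e. the cost with lambda = 0."""
    r = X @ theta - y
    return r @ r / (2 * X.shape[0])

def select_lambda(lambdas, X, y, Xval, yval):
    """Fit once per lambda and keep the one with the lowest validation error."""
    errors = [half_mse(fit_ridge(X, y, l), Xval, yval) for l in lambdas]
    return lambdas[int(np.argmin(errors))], errors

def design(x, power=6):
    """Columns 1, x, ..., x^power."""
    return np.column_stack([x ** p for p in range(power + 1)])

def target(x):
    """Made-up target function for the synthetic data."""
    return np.sin(3 * x) + 0.1 * np.cos(25 * x)

x_tr = np.linspace(-1, 1, 12)
x_va = np.linspace(-0.95, 0.95, 21)
candidates = [0.0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]

best_l, val_errors = select_lambda(candidates, design(x_tr), target(x_tr),
                                   design(x_va), target(x_va))
print(best_l)
```

        The important detail is the same as in the loop above: λ is used only while fitting, and the errors being compared are always computed without the penalty.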

Calculating the test set error

theta = linear_regression(X_norm, y, 3)
print('test cost(l={}) = {}'.format(3, costReg(theta, Xtest_norm, ytest, 0)))
# test cost(l=3) = 4.407884454040075

reference article

https://blog.csdn.net/Cowry5/article/details/80421712

Origin blog.csdn.net/weixin_50345615/article/details/126274374