Programming Assignment (Python) | Andrew Ng Machine Learning (5): Bias and Variance, Training/Validation/Test Sets, Learning Curves

Assignment and code: https://pan.baidu.com/s/1L-Tbo3flzKplAof3fFdD1w (password: oin0)
Theory for this assignment: Notes | Model Selection and Evaluation, Error Analysis and Optimization
Programming environment: Jupyter Notebook


Programming Exercise 5: Bias vs. Variance

Problem

Predict the amount of water flowing out of a dam from the change in reservoir water level. Dataset: ex5data1.mat

(X is the history of water-level changes, y is the amount of water flowing out of the dam.)

1. Loading the Data

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
from scipy.optimize import minimize

data = loadmat('ex5data1.mat')
>>> data.keys()
dict_keys(['__header__', '__version__', '__globals__', 'X', 'y', 'Xtest', 'ytest', 'Xval', 'yval'])

1.1 Training, Validation, and Test Sets

# Training set
X_train, y_train = data['X'], data['y']
>>> X_train.shape, y_train.shape
((12, 1), (12, 1))

# Validation set
X_val, y_val = data['Xval'], data['yval']
>>> X_val.shape, y_val.shape
((21, 1), (21, 1))

# Test set
X_test, y_test = data['Xtest'], data['ytest']
>>> X_test.shape, y_test.shape
((21, 1), (21, 1))

# Insert a bias (intercept) column of ones into the inputs
X_train = np.insert(X_train,0,1,axis=1)
X_val = np.insert(X_val,0,1,axis=1)
X_test = np.insert(X_test,0,1,axis=1)

1.2 Visualizing the Data

Plot the training set (X is the history of water-level changes, y is the amount of water flowing out of the dam):

def plot_data():
    fig,ax = plt.subplots()
    ax.scatter(X_train[:,1],y_train)
    ax.set(xlabel = 'change in water level(x)',
          ylabel = 'water flowing out of the dam(y)')

plot_data()

(Figure: scatter plot of the training data)

2. Linear Regression

2.1 Cost Function (with Regularization)

The regularized cost for linear regression:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

# Regularized cost function
def reg_cost(theta,X,y,lamda):    
    cost = np.sum(np.power((X@theta-y.flatten()), 2))
    reg = theta[1:] @ theta[1:] * lamda  # theta[0] (bias) is not regularized
    return (cost + reg) / (2 * len(X))

# Initialize the parameters
theta = np.ones(X_train.shape[1])
lamda = 1
>>> reg_cost(theta,X_train,y_train,lamda)
303.9931922202643

2.2 Gradient (with Regularization)

The corresponding gradient (the bias term $\theta_0$ is not regularized):

$$\frac{\partial J(\theta)}{\partial\theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \qquad (j\ge1)$$

def reg_gradient(theta,X,y,lamda):   
    grad = (X @ theta - y.flatten()) @ X
    reg = lamda * theta
    reg[0] = 0  # the bias term is not regularized
    return (grad + reg) / len(X)

reg_gradient(theta,X_train,y_train,lamda)
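Before handing the gradient to an optimizer, it is worth verifying it numerically. The sketch below restates the two functions above so it runs stand-alone and compares the analytic gradient against centered finite differences on small made-up data (the synthetic `X`, `y` here are purely for the check):

```python
import numpy as np

# restated from above so this snippet runs stand-alone
def reg_cost(theta, X, y, lamda):
    cost = np.sum(np.power(X @ theta - y.flatten(), 2))
    reg = theta[1:] @ theta[1:] * lamda
    return (cost + reg) / (2 * len(X))

def reg_gradient(theta, X, y, lamda):
    grad = (X @ theta - y.flatten()) @ X
    reg = lamda * theta
    reg[0] = 0  # the bias term is not regularized
    return (grad + reg) / len(X)

# hypothetical synthetic data, just for the check
rng = np.random.default_rng(0)
X = np.insert(rng.normal(size=(5, 1)), 0, 1, axis=1)  # bias column + one feature
y = rng.normal(size=(5, 1))
theta = np.ones(2)
lamda = 1.0

# centered finite differences, one coordinate at a time
eps = 1e-5
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (reg_cost(theta + e, X, y, lamda)
                  - reg_cost(theta - e, X, y, lamda)) / (2 * eps)

analytic = reg_gradient(theta, X, y, lamda)
print(np.max(np.abs(numeric - analytic)))  # should be tiny (well below 1e-6)
```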

2.3 Optimization with minimize()

For minimize(), see Section 1.2, "Using SciPy's optimization functions", in the earlier assignment.

def train_model(X,y,lamda):    
    theta = np.ones(X.shape[1])    
    res  = minimize(fun = reg_cost,
                   x0 = theta,
                   args =(X,y,lamda),
                   method = 'TNC',
                   jac = reg_gradient)    
    return res.x  # return the optimized parameters

theta_final = train_model(X_train,y_train,lamda=1)
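As a quick sanity check of `train_model`, on noiseless data with `lamda=0` the optimizer should recover the generating parameters almost exactly. A stand-alone sketch (restating the functions above; the line y = 2 + 3x and its sample points are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# restated from above so this snippet runs stand-alone
def reg_cost(theta, X, y, lamda):
    cost = np.sum(np.power(X @ theta - y.flatten(), 2))
    reg = theta[1:] @ theta[1:] * lamda
    return (cost + reg) / (2 * len(X))

def reg_gradient(theta, X, y, lamda):
    grad = (X @ theta - y.flatten()) @ X
    reg = lamda * theta
    reg[0] = 0
    return (grad + reg) / len(X)

def train_model(X, y, lamda):
    theta = np.ones(X.shape[1])
    res = minimize(fun=reg_cost, x0=theta, args=(X, y, lamda),
                   method='TNC', jac=reg_gradient)
    return res.x

# made-up noiseless data from the line y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=(30, 1))
X = np.insert(x, 0, 1, axis=1)
y = 2.0 + 3.0 * x

theta = train_model(X, y, lamda=0)
print(theta)  # close to [2, 3]
```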

2.4 Visualizing the Fitted Line (Training Set)

plot_data()
plt.plot(X_train[:,1],X_train@theta_final,c='r')
plt.show()

(Figure: the fitted line over the training data)
The plot shows the model underfits badly: it has high bias.

2.5 Learning Curves

A learning curve plots the training error $J_{train}(\theta)$ and the cross-validation error $J_{cv}(\theta)$ as functions of the number of training examples $m$.
$$J_{train}(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

def plot_learning_curve(X_train,y_train,X_val,y_val,lamda):
    
    x = range(1,len(X_train)+1)
    training_cost = []  # training error
    cv_cost = []        # cross-validation error
    
    for i in x:        
        res = train_model(X_train[:i,:],y_train[:i,:],lamda)  # train on the first i examples
        
        # note: when measuring training, cross-validation, and test error,
        # the regularization term is dropped, so set λ = 0
        training_cost_i = reg_cost(res,X_train[:i,:],y_train[:i,:],lamda = 0)
        cv_cost_i = reg_cost(res,X_val,y_val,lamda = 0)  # the validation error uses the full validation set
        training_cost.append(training_cost_i)   
        cv_cost.append(cv_cost_i)
        			
    plt.plot(x,cv_cost,label = 'cv cost')
    plt.plot(x,training_cost,label = 'training cost')
    plt.legend()
    plt.xlabel('number of training examples')
    plt.ylabel('error')
    plt.show()
    
plot_learning_curve(X_train,y_train,X_val,y_val,lamda)

(Figure: learning curves for the linear model)

From the plot: when both $J_{train}(\theta)$ and $J_{cv}(\theta)$ are large, the model has high bias (underfitting). Adding more training data will not help; instead, add more relevant features to reduce the bias.

3. Polynomial Regression

3.1 Building Polynomial Features

Map the input feature to a polynomial feature vector: $1, x, x^2, x^3, \dots, x^n$

def poly_feature(X,power):    
    for i in range(2,power+1):
        X = np.insert(X,X.shape[1],np.power(X[:,1],i),axis=1)  # append as the last column
    return X
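A tiny stand-alone example (with a made-up two-row input) shows what `poly_feature` produces: each row gains the powers x^2 … x^power as extra columns:

```python
import numpy as np

# restated from above so this snippet runs stand-alone
def poly_feature(X, power):
    for i in range(2, power + 1):
        X = np.insert(X, X.shape[1], np.power(X[:, 1], i), axis=1)  # append as the last column
    return X

# made-up input: a bias column plus one raw feature
X = np.array([[1.0, 2.0],
              [1.0, 3.0]])
result = poly_feature(X, 3)
print(result)
# columns are 1, x, x^2, x^3:
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```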

After the feature mapping, the features should normally be normalized, using the training-set mean and standard deviation:
$$x_i=\frac{x_i-\text{mean}}{\text{std}}$$

# Compute the per-column mean and standard deviation
def get_means_stds(X):
    means = np.mean(X, axis=0)  # axis=0: statistics for each column (feature)
    stds = np.std(X, axis=0)  
    return means,stds
    
# Feature normalization
def feature_normalize(X,means,stds):    
    X[:,1:] = ( X[:,1:]  - means[1:]) / stds[1:]  # column 0 is all ones (bias), so skip it
    return X
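A stand-alone sketch (on a made-up three-row input) confirms the behavior: after normalizing with the training statistics, each non-bias column has mean 0 and standard deviation 1:

```python
import numpy as np

# restated from above so this snippet runs stand-alone
def get_means_stds(X):
    means = np.mean(X, axis=0)
    stds = np.std(X, axis=0)
    return means, stds

def feature_normalize(X, means, stds):
    # note: modifies X in place, like the version in the assignment
    X[:, 1:] = (X[:, 1:] - means[1:]) / stds[1:]  # column 0 is the bias, skip it
    return X

# made-up input: bias column + one feature
X_train = np.array([[1.0, -10.0],
                    [1.0,   0.0],
                    [1.0,  10.0]])
means, stds = get_means_stds(X_train)
X_norm = feature_normalize(X_train, means, stds)

print(X_norm[:, 1].mean(), X_norm[:, 1].std())  # 0.0 and 1.0 (up to float rounding)
```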

Build the polynomial features:

power = 6

X_train_poly = poly_feature(X_train,power)  # build polynomial features
X_val_poly = poly_feature(X_val,power)
X_test_poly = poly_feature(X_test,power)

train_means,train_stds = get_means_stds(X_train_poly)  # training-set means and standard deviations

X_train_norm = feature_normalize(X_train_poly,train_means,train_stds)  # normalize with the training-set statistics
X_val_norm = feature_normalize(X_val_poly,train_means,train_stds)
X_test_norm = feature_normalize(X_test_poly,train_means,train_stds)

theta_fit = train_model(X_train_norm,y_train,lamda=0)  # train on the new features to get new model parameters

3.2 Visualizing the Fitted Curve (Training Set)

def plot_poly_fit():  # plot the fitted curve
    plot_data()  # plot the raw data
    
    x = np.linspace(-60,60,100)
    xx = x.reshape(100,1)
    xx = np.insert(xx,0,1,axis=1)
    xx = poly_feature(xx,power)
    xx = feature_normalize(xx,train_means,train_stds)  # map the x-axis values to normalized polynomial features
    
    plt.plot(x, xx @ theta_fit,'r--')  # xx @ theta_fit gives the curve's y values

plot_poly_fit()

(Figure: polynomial fit over the training data)

3.3 Learning Curves as $\lambda$ Varies

  • $\lambda=0$: the training set is overfit (high variance), which shows up as a large validation error; increasing $\lambda$ appropriately can help.
    plot_learning_curve(X_train_norm,y_train,X_val_norm,y_val,lamda=0) 
    

(Figure: learning curves, λ = 0)

  • $\lambda=1$: the overfitting is essentially gone; $J_{train}(\theta)$ and $J_{cv}(\theta)$ end up close to each other, so the model generalizes well.
    plot_learning_curve(X_train_norm,y_train,X_val_norm,y_val,lamda=1)
    

(Figure: learning curves, λ = 1)

  • $\lambda=100$: the regularization term is too large; the model has high bias and underfits.
    plot_learning_curve(X_train_norm,y_train,X_val_norm,y_val,lamda=100)
    

(Figure: learning curves, λ = 100)

3.4 Choosing the Regularization Parameter $\lambda$

lamdas = [0,0.001,0.003,0.01,0.03,0.1,0.3,1,3,10]  # candidate values for lamda

training_cost = []
cv_cost = []

for lamda in lamdas:
    res = train_model(X_train_norm,y_train,lamda)
    # note: the regularization term is dropped when measuring training, validation, and test error
    tc = reg_cost(res,X_train_norm,y_train,lamda = 0)
    cv = reg_cost(res,X_val_norm,y_val,lamda = 0)    
    training_cost.append(tc)
    cv_cost.append(cv)

plt.plot(lamdas,training_cost,label='training cost')
plt.plot(lamdas,cv_cost,label='cv cost')
plt.legend()

plt.show()

(Figure: training and validation cost as a function of λ)

# Minimum of cv_cost
>>> min(cv_cost)
3.540898729951058

# The lamda value that achieves it
>>> lamdas[np.argmin(cv_cost)]
3

# Test-set error
res = train_model(X_train_norm,y_train,lamda=3)
test_cost = reg_cost(res,X_test_norm,y_test,lamda=0)
>>> print(test_cost)
4.397616335103924

# With lamda=3 fixed, plot the learning curves for the training, validation, and test sets
def plot_learning_curve_3(X_train,y_train,X_val,y_val,X_test,y_test,lamda):
    
    x = range(1,len(X_train)+1)
    training_cost = []
    cv_cost = []
    test_cost = []
    
    for i in x:
        
        res = train_model(X_train[:i,:],y_train[:i,:],lamda)
        training_cost_i = reg_cost(res,X_train[:i,:],y_train[:i,:],lamda = 0)
        cv_cost_i = reg_cost(res,X_val,y_val,lamda = 0)
        test_cost_i = reg_cost(res,X_test,y_test,lamda = 0)
        
        training_cost.append(training_cost_i)
        cv_cost.append(cv_cost_i)
        test_cost.append(test_cost_i)
        
    plt.plot(x,training_cost,label = 'training cost')    
    plt.plot(x,cv_cost,label = 'cv cost')   
    plt.plot(x,test_cost,label = 'test cost')
    plt.legend()
    plt.xlabel('number of training examples')
    plt.ylabel('error')
    plt.show()
    
plot_learning_curve_3(X_train_norm, y_train,
                      X_val_norm, y_val,
                      X_test_norm, y_test,
                      lamda=3)

(Figure: learning curves for the training, validation, and test sets, λ = 3)


Reposted from blog.csdn.net/m0_37867091/article/details/105058430