Create polynomial features: sklearn.preprocessing.PolynomialFeatures
Parameters used:
—degree: the order of the polynomial features; the default is 2.
—include_bias: whether to include a bias (all-ones) column; the default is True.
Call its fit_transform method to transform the data.
Feature standardization: sklearn.preprocessing.StandardScaler (subtract the mean and divide by the standard deviation).
Call its fit_transform method to transform the data.
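A minimal sketch of these two transformers in isolation (the input matrix below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# degree=2, include_bias=False -> columns [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (4, 2)

# Standardize: each column ends up with zero mean and unit variance
scaler = StandardScaler()
X_std = scaler.fit_transform(X_poly)
print(np.allclose(X_std.mean(axis=0), 0))  # True
```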
3.1 Building a polynomial feature model training function
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def poly_fit_train(degree, X, y, y_real, model=None):
    # Raise an error if degree is not a positive integer
    if not isinstance(degree, int):
        raise ValueError('degree should be an integer.')
    if degree <= 0:
        raise ValueError('degree should be greater than 0.')
    # Reshape the features from (samples,) to (samples, 1), as model.fit requires
    X_2D = X.reshape(-1, 1)
    # Generate polynomial features if degree is greater than 1
    if degree > 1:
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_2D = poly.fit_transform(X_2D)
    # Standardize the data (subtract the mean, divide by the standard deviation)
    scaler = StandardScaler()
    X_2D = scaler.fit_transform(X_2D)
    # Create and train a linear regression model if none was passed in
    if model is None:
        model = LinearRegression()
    model.fit(X_2D, y)
    # Model prediction
    y_pred = model.predict(X_2D)
    # Scatter plot of the samples
    plt.scatter(X, y, marker='o', color='g', label='train dataset')
    # Plot the true function curve
    plt.plot(np.sort(X), y_real[np.argsort(X)], color='b', label='real curve')
    # Plot the predicted curve
    plt.plot(np.sort(X), y_pred[np.argsort(X)], color='r', label='predict curve')
    plt.legend()
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    return y_pred, model
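The core pipeline inside the training function (reshape, polynomial expansion, standardization, fit) can be exercised on synthetic data; the cubic target below is an assumption for illustration, and the plotting calls are omitted:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, 50)
y_real = 0.5 * X**3 - X                 # assumed true function
y = y_real + rng.normal(0, 1, 50)       # noisy observations

degree = 3
X_2D = X.reshape(-1, 1)                 # (50,) -> (50, 1)
X_2D = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X_2D)
X_2D = StandardScaler().fit_transform(X_2D)

model = LinearRegression().fit(X_2D, y)
y_pred = model.predict(X_2D)
print(X_2D.shape)  # (50, 3): columns x, x^2, x^3
```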
from sklearn.metrics import mean_squared_error
# Compute the MSE on the training set
mse_train1 = mean_squared_error(y_train, y_train_pred1)
mse_train3 = mean_squared_error(y_train, y_train_pred3)
mse_train10 = mean_squared_error(y_train, y_train_pred10)
mse_train30 = mean_squared_error(y_train, y_train_pred30)
# Print the results
print('MSE:')
print('1 order polynomial: {:.2f}'.format(mse_train1))
print('3 order polynomial: {:.2f}'.format(mse_train3))
print('10 order polynomial: {:.2f}'.format(mse_train10))
print('30 order polynomial: {:.2f}'.format(mse_train30))
Output:
MSE:
1 order polynomial: 149.92
3 order polynomial: 24.32
10 order polynomial: 23.64
30 order polynomial: 15.05
Indicator description:
Ranked by training-set MSE from best to worst, the models are: the 30th-order polynomial, the 10th-order polynomial, the 3rd-order polynomial, and the 1st-order polynomial.
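mean_squared_error simply averages the squared residuals, MSE = mean((y_true − y_pred)^2); a quick manual check on made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = mean of the squared residuals
manual = np.mean((y_true - y_pred) ** 2)
print(manual)                              # 0.375
print(mean_squared_error(y_true, y_pred))  # 0.375
```

Note that MSE is symmetric in its two arguments, which is why the argument order does not change the value, though sklearn's convention is (y_true, y_pred).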
4. Test set inspection
4.1 Polynomial feature model prediction function
def poly_fit_predict(degree, X, y, model):
    # Raise an error if degree is not a positive integer
    if not isinstance(degree, int):
        raise ValueError('degree should be an integer.')
    if degree <= 0:
        raise ValueError('degree should be greater than 0.')
    # Reshape the features from (samples,) to (samples, 1), as model.predict requires
    X_2D = X.reshape(-1, 1)
    # Generate polynomial features if degree is greater than 1
    if degree > 1:
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_2D = poly.fit_transform(X_2D)
    # Standardize the data (subtract the mean, divide by the standard deviation);
    # strictly speaking, the scaler fitted on the training set should be reused here
    scaler = StandardScaler()
    X_2D = scaler.fit_transform(X_2D)
    # Model prediction
    y_pred = model.predict(X_2D)
    # Scatter plot of the samples
    plt.scatter(X, y, marker='o', color='c', label='test dataset')
    # Plot the predicted curve
    plt.plot(np.sort(X), y_pred[np.argsort(X)], color='r', label=str(degree) + ' order fitting')
    plt.legend()
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    return y_pred
# Compute the MSE on the test set
mse_test1 = mean_squared_error(y_test, y_test_pred1)
mse_test3 = mean_squared_error(y_test, y_test_pred3)
mse_test10 = mean_squared_error(y_test, y_test_pred10)
mse_test30 = mean_squared_error(y_test, y_test_pred30)
# Print the results
print('MSE:')
print('1 order polynomial: {:.2f}'.format(mse_test1))
print('3 order polynomial: {:.2f}'.format(mse_test3))
print('10 order polynomial: {:.2f}'.format(mse_test10))
print('30 order polynomial: {:.2f}'.format(mse_test30))
Output:
MSE:
1 order polynomial: 659.95
3 order polynomial: 39.71
10 order polynomial: 41.00
30 order polynomial: 85.45
Ranked by test-set MSE from best to worst, the models are:
the 3rd-order polynomial, the 10th-order polynomial, the 30th-order polynomial, and the 1st-order polynomial.
5. Definition of underfitting and overfitting
Underfitting: the chosen model is too simple, so its predictions are poor on both the training set and unseen data.
Overfitting: the chosen model is too complex; it fits the training set very well but predicts unseen data poorly (poor generalization ability).
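These definitions can be illustrated on synthetic data. The sine target and the degree choices below are assumptions for illustration; unlike the tutorial functions above, this sketch fits the transformers on the training split only and reuses them on the test split:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

def fit_mse(degree, X_train, y_train, X_test, y_test):
    # Build polynomial features and standardize: fit on train, apply to test
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    scaler = StandardScaler()
    Xtr = scaler.fit_transform(poly.fit_transform(X_train.reshape(-1, 1)))
    Xte = scaler.transform(poly.transform(X_test.reshape(-1, 1)))
    model = LinearRegression().fit(Xtr, y_train)
    return (mean_squared_error(y_train, model.predict(Xtr)),
            mean_squared_error(y_test, model.predict(Xte)))

X = rng.uniform(-3, 3, 40)
y = np.sin(X) + rng.normal(0, 0.3, 40)   # assumed true function plus noise
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# Low degree tends to underfit (both MSEs high); high degree tends to
# overfit (low train MSE, higher test MSE)
for d in (1, 3, 15):
    tr, te = fit_mse(d, X_train, y_train, X_test, y_test)
    print('degree {}: train MSE {:.3f}, test MSE {:.3f}'.format(d, tr, te))
```

Training MSE can only decrease as the degree grows (the higher-degree feature set spans the lower-degree one), which is why training error alone cannot detect overfitting; the test MSE is what exposes it.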