02-08 Polynomial Regression (Boston House Price Prediction)



First, import the modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')  # font used for the plot labels (macOS path)

Second, get the data

In the "Code - ordinary linear regression" when it comes to features and LSTAT mark MEDV have the highest correlation, but not a linear relationship between them, so this attempt to use polynomial regression relationship between them.

df = pd.read_csv('housing-data.txt', sep='\s+', header=0)
X = df[['LSTAT']].values
y = df['MEDV'].values
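
To double-check that LSTAT really is the feature most strongly correlated with MEDV, a quick sketch (assuming housing-data.txt contains the standard Boston housing columns) is to print each feature's correlation with the target:

# correlation of every feature with the target MEDV;
# LSTAT should show the strongest (negative) correlation
print(df.corr()['MEDV'].sort_values())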

Third, train the model

# Add squared terms, i.e. quadratic (degree-2) polynomial regression
quadratic = PolynomialFeatures(degree=2)
# Add cubed terms, i.e. cubic (degree-3) polynomial regression
cubic = PolynomialFeatures(degree=3)
# Fit the transformers to get the quadratic and cubic feature matrices
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)

# Evenly spaced x values for plotting the fitted curves
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]

lr = LinearRegression()

# Linear regression
lr.fit(X, y)
lr_predict = lr.predict(X_fit)
# Compute the R^2 of the linear fit
lr_r2 = r2_score(y, lr.predict(X))

# Quadratic regression
lr = lr.fit(X_quad, y)
quad_predict = lr.predict(quadratic.fit_transform(X_fit))
# Compute the R^2 of the quadratic fit
quadratic_r2 = r2_score(y, lr.predict(X_quad))

# Cubic regression
lr = lr.fit(X_cubic, y)
cubic_predict = lr.predict(cubic.fit_transform(X_fit))
# Compute the R^2 of the cubic fit
cubic_r2 = r2_score(y, lr.predict(X_cubic))
print(lr.score(X_cubic, y))
print(cubic_r2)
0.6578476405895719
0.6578476405895719
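
The two identical printed values show that LinearRegression.score is simply \(R^2\) on the given data, the same quantity r2_score computes.

To see what PolynomialFeatures actually does to X, here is a minimal sketch on a toy column vector (not from the original post): it prepends a bias column of ones and appends the higher powers of x.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[1.0], [2.0], [3.0]])
print(PolynomialFeatures(degree=2).fit_transform(toy))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]  -> columns are [1, x, x^2]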

3.1 The coefficient of determination

r2_score reports the coefficient of determination \(R^2\), which can be understood as a standardized version of the MSE. The formula for \(R^2\) is

\[R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2}{\frac{1}{n}\sum_{i=1}^n (y^{(i)} - \mu_y)^2}\]

where \(\mu_y\) is the mean of \(y\), so \(\frac{1}{n}\sum_{i=1}^n (y^{(i)} - \mu_y)^2\) is the variance of \(y\), and the formula can be rewritten as

\[R^2 = 1 - \frac{MSE}{Var(y)}\]

\(R^2\) ranges between \(0\) and \(1\); if \(R^2 = 1\), the mean squared error \(MSE = 0\), i.e. the model fits the data perfectly.

Fourth, visualization

plt.scatter(X, y, c='gray', edgecolor='white', marker='s', label='training data')
plt.plot(X_fit, lr_predict, c='r',
         label='linear (d=1), $R^2={:.2f}$'.format(lr_r2), linestyle='--', lw=3)
plt.plot(X_fit, quad_predict, c='g',
         label='quadratic (d=2), $R^2={:.2f}$'.format(quadratic_r2), linestyle='-', lw=3)
plt.plot(X_fit, cubic_predict, c='b',
         label='cubic (d=3), $R^2={:.2f}$'.format(cubic_r2), linestyle=':', lw=3)
plt.xlabel('percentage of lower-status population [LSTAT]', fontproperties=font)
plt.ylabel('house price in 1000s of dollars [MEDV]', fontproperties=font)
plt.title('Boston house price prediction', fontproperties=font, fontsize=20)
plt.legend(prop=font)
plt.show()

[Figure: scatter of the training data with the linear, quadratic, and cubic fits and their \(R^2\) values]

As the figure shows, the cubic fit is better than both the quadratic fit and plain linear regression, but the added model complexity also means we need to consider whether the model will overfit.
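
One way to check for overfitting (a sketch not in the original post, reusing the X and y defined above) is to hold out part of the data and compare train and test \(R^2\) as the degree grows:

from sklearn.model_selection import train_test_split

# hold out 30% of the data; a large train/test R^2 gap hints at overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

for degree in (1, 2, 3, 4, 5):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_r2 = model.score(poly.transform(X_train), y_train)
    test_r2 = model.score(poly.transform(X_test), y_test)
    print('degree={}: train R^2={:.3f}, test R^2={:.3f}'.format(
        degree, train_r2, test_r2))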


Original post: https://www.cnblogs.com/nickchen121/p/11686755.html