Polynomial Regression for Machine Learning

0. The model

1. Formula

y = b0 + b1*x + b2*x^2 + ... + bn*x^n
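
Although the fitted curve is nonlinear in x, the model is linear in the coefficients b0...bn, which is why ordinary least squares can fit it. A minimal sketch of this idea (my addition, with made-up numbers), building the matrix of powers by hand:

import numpy as np

# Made-up data generated exactly from y = 1 + 2x + 3x^2 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2

# Design matrix with columns [1, x, x^2]: polynomial regression is just
# linear regression on these expanded features
X_design = np.vander(x, N=3, increasing=True)

# Ordinary least squares recovers the coefficients [b0, b1, b2]
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)  # approximately [1. 2. 3.]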

2. R-squared

R^2 = 1 - SS_res / SS_tot

where SS_res is the residual sum of squares and SS_tot is the total sum of squares.

R^2 ranges from 0 to 1; the closer it is to 1, the better the fit.

3. Adjusted R-squared (adds a penalty for extra predictors)

R^2_adj = 1 - (SS_res / SS_tot) * ((n - 1) / (n - p - 1))

n: number of observations; p: number of independent variables
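
A minimal sketch of both statistics, assuming y_true and y_pred are numpy arrays (these names and helper functions are mine, not from the notebook):

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    # p: number of independent variables; the (n-1)/(n-p-1) factor
    # penalizes predictors that do not improve the fit
    n = len(y_true)
    return 1 - (1 - r_squared(y_true, y_pred)) * (n - 1) / (n - p - 1)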

1. Import the libraries

In [16]:
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Enable interactive figure display in the notebook
%matplotlib notebook
# Font that can render Chinese characters in plots
plt.rc('font', family='SimHei', size=8)

2. Import the data

In [10]:
# Import the dataset
dataset = pd.read_csv('./Position_Salaries.csv')

X = dataset.iloc[:, 1:2].values  # position level (1:2 keeps X two-dimensional)
y = dataset.iloc[:, 2].values    # salary
dataset
Out[10]:
            Position  Level   Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000
In [11]:
X 
Out[11]:
array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]], dtype=int64)
In [12]:
y
Out[12]:
array([  45000,   50000,   60000,   80000,  110000,  150000,  200000,
        300000,  500000, 1000000], dtype=int64)

3. Create the linear regression model

In [13]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X,y)
Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

4. Create the polynomial regression model

In [29]:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)  # columns are the feature raised to powers 0 through 4
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
X_poly
Out[29]:
array([[  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          1.00000000e+00,   1.00000000e+00],
       [  1.00000000e+00,   2.00000000e+00,   4.00000000e+00,
          8.00000000e+00,   1.60000000e+01],
       [  1.00000000e+00,   3.00000000e+00,   9.00000000e+00,
          2.70000000e+01,   8.10000000e+01],
       [  1.00000000e+00,   4.00000000e+00,   1.60000000e+01,
          6.40000000e+01,   2.56000000e+02],
       [  1.00000000e+00,   5.00000000e+00,   2.50000000e+01,
          1.25000000e+02,   6.25000000e+02],
       [  1.00000000e+00,   6.00000000e+00,   3.60000000e+01,
          2.16000000e+02,   1.29600000e+03],
       [  1.00000000e+00,   7.00000000e+00,   4.90000000e+01,
          3.43000000e+02,   2.40100000e+03],
       [  1.00000000e+00,   8.00000000e+00,   6.40000000e+01,
          5.12000000e+02,   4.09600000e+03],
       [  1.00000000e+00,   9.00000000e+00,   8.10000000e+01,
          7.29000000e+02,   6.56100000e+03],
       [  1.00000000e+00,   1.00000000e+01,   1.00000000e+02,
          1.00000000e+03,   1.00000000e+04]])
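
As a side note (my suggestion, not part of the original notebook), sklearn's Pipeline can chain the feature expansion and the regression into a single estimator, which avoids transforming inputs by hand at prediction time. A sketch reusing X and y from above:

from sklearn.pipeline import make_pipeline

# Equivalent model as one estimator: expand features, then regress
poly_model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
poly_model.fit(X, y)
poly_model.predict([[6.5]])  # the pipeline applies PolynomialFeatures automatically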

5. Plot and compare

1. Linear regression

In [24]:
plt.scatter(X, y, color="red")
plt.plot(X, lin_reg.predict(X), color="blue")
plt.title("Truth or Bluff (linear model)")
plt.xlabel("Position level")
plt.ylabel("Salary")
plt.show()
print('r-squared', lin_reg.score(X, y))
r-squared 0.66904123319298947

2. Polynomial regression

In [28]:
X_grid = np.arange(min(X),max(X),0.1)
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(X,y,color="red")
plt.plot(X_grid,lin_reg_2.predict(poly_reg.fit_transform(X_grid)),color ="blue")
plt.title(u"真话还是假话(多项式模型)")
plt.xlabel(u"职位水平")
plt.ylabel(u"薪水")
plt.show()
print('r-squared', lin_reg_2.score(X_poly,y))
('r-squared', 0.99739228917066114)

Conclusion: on this data the polynomial regression clearly outperforms the simple linear regression (in real applications, keep the risk of overfitting in mind).
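
To make that overfitting caveat concrete: with ordinary least squares, the training R^2 can only rise as the degree grows, so a high training score alone never proves the model generalizes. A small sketch (my addition, reusing X and y from above) that prints the training R^2 for several degrees:

# Training R^2 is non-decreasing in the degree for nested models,
# so it cannot by itself detect overfitting; only held-out data can
for degree in range(1, 7):
    features = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(features.fit_transform(X), y)
    print(degree, model.score(features.transform(X), y))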

6. Salary prediction

1. Linear regression

In [19]:
lin_reg.predict([[6.5]])  # predict expects a 2-D array, not a scalar
Out[19]:
array([ 330378.78787879])

2. Polynomial regression prediction

In [20]:
lin_reg_2.predict(poly_reg.transform([[6.5]]))
Out[20]:
array([ 158862.45265154])
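
Note that transform, not fit_transform, is the right call at prediction time: poly_reg was already fitted when X_poly was built, and refitting it on a single point would be conceptually wrong even though PolynomialFeatures happens to produce the same features either way.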


7. Project link

