Learning linear regression with scikit-learn

We use a public data set from the UCI Machine Learning Repository to run a linear regression. The data set describes a combined cycle power plant and contains 9568 samples, each with five columns: AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity) and PE (net electrical power output). Our goal is to learn a linear relationship in which AT, V, AP and RH are the four sample features and PE is the sample output. That is, the result of machine learning is a linear regression model:

\[PE=\theta _{0}+\theta _{1}*AT+\theta _{2}*V+\theta _{3}*AP+\theta _{4}*RH\]

1 Reading the data with pandas

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

data = pd.read_csv('D:/Python/Mechine_learning/data/CCPP/ccpp.csv')
print(data.head())

# Output
      AT      V       AP     RH      PE
0  23.25  71.29  1008.05  71.36  442.21
1  13.87  42.99  1007.45  81.52  471.12
2  16.91  43.96  1013.32  79.87  465.86
3  10.09  37.14  1012.99  72.59  473.66
4  12.72  40.60  1013.45  86.16  471.23
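
Before modelling, it can help to confirm that the columns were parsed as numbers and that nothing is missing. The following is a minimal sketch, assuming only the five rows printed above; they are embedded as text (the name `csv_text` is made up for this example) so that it runs without the local CSV file.

```python
import io

import pandas as pd

# The five rows printed above, embedded as text so the example
# runs without access to the original CSV file.
csv_text = """AT,V,AP,RH,PE
23.25,71.29,1008.05,71.36,442.21
13.87,42.99,1007.45,81.52,471.12
16.91,43.96,1013.32,79.87,465.86
10.09,37.14,1012.99,72.59,473.66
12.72,40.60,1013.45,86.16,471.23
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks: column names, numeric dtypes, missing values.
expected_columns = ["AT", "V", "AP", "RH", "PE"]
columns_ok = list(sample.columns) == expected_columns
missing_per_column = sample.isnull().sum()  # number of NaN values per column

print(columns_ok)
print(sample.dtypes)
print(missing_per_column)
```

The same checks can be run on the full `data` frame once it has been read from disk.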

2 Preparing the data for the algorithm

Check the dimensions of the data:

print(data.shape)  # (9568, 5): the data set has 9568 rows and 5 columns

Prepare the sample features X, using the four columns AT, V, AP and RH:

X = data[['AT','V','AP','RH']]
print(X.head())
# Output
      AT      V       AP     RH
0  23.25  71.29  1008.05  71.36
1  13.87  42.99  1007.45  81.52
2  16.91  43.96  1013.32  79.87
3  10.09  37.14  1012.99  72.59
4  12.72  40.60  1013.45  86.16

Prepare the sample output y, using the PE column:

y = data[['PE']]
print(y.head())

# Output
       PE
0  442.21
1  471.12
2  465.86
3  473.66
4  471.23

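3 Splitting into training and test sets

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split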
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)  #(7176, 4)
print(y_train.shape)  #(7176, 1)
print(X_test.shape)   #(2392, 4)
print(y_test.shape)   #(2392, 1)

By default, train_test_split uses 75% of the samples as the training set and the remaining 25% as the test set.
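
The split ratio can also be written out explicitly. The sketch below assumes placeholder arrays of zeros with the same number of rows as the data set (the names `X_demo` and `y_demo` are made up for this example) and passes `test_size=0.25`, which is the default value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the CCPP data;
# the values do not matter here, only the shapes.
n_samples = 9568
X_demo = np.zeros((n_samples, 4))
y_demo = np.zeros((n_samples, 1))

# test_size=0.25 is the default; it is spelled out here for clarity.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

print(X_tr.shape, X_te.shape)  # (7176, 4) (2392, 4)
```

Changing `test_size` changes the ratio; `random_state` only fixes which rows end up in each part.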

4 Running the scikit-learn linear model

scikit-learn's LinearRegression fits the model with the least squares method:

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)
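
To make the least squares step concrete, the following sketch solves the same kind of problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. It is only an illustration on synthetic data: the arrays `X_syn` and `y_syn` and the coefficients used to generate them are assumptions, not values from the power plant data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the CCPP data).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
true_coef = np.array([-2.0, -0.2, 0.06, -0.15])
y_syn = 450.0 + X_syn @ true_coef + rng.normal(scale=0.5, size=200)

# Add a column of ones so that the first parameter plays the role of theta_0.
A = np.column_stack([np.ones(len(X_syn)), X_syn])

# Least squares: find theta that minimises ||A @ theta - y||^2.
theta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

# The scikit-learn estimator should arrive at the same parameters.
model = LinearRegression().fit(X_syn, y_syn)

print(theta[0], model.intercept_)
print(theta[1:], model.coef_)
```

Both approaches minimise the same squared error, so the two sets of parameters should agree up to floating-point precision.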

After fitting, inspect the fitted intercept and coefficients:

print(linreg.intercept_)  #[ 452.50329853]
print(linreg.coef_)       #[[-1.98558313 -0.23170236  0.06410905 -0.15673512]]
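
These numbers are the parameters of the model equation given at the start: the intercept is θ0 and the four coefficients are θ1 to θ4, in the order AT, V, AP, RH. As a small worked example, the sketch below plugs the printed values into that equation for the first row of the data; the variable names are chosen only for this illustration.

```python
import numpy as np

# Parameters copied from the printed fit results above.
intercept = 452.50329853
coef = np.array([-1.98558313, -0.23170236, 0.06410905, -0.15673512])  # AT, V, AP, RH

# First row of the data set: AT, V, AP, RH.
first_row = np.array([23.25, 71.29, 1008.05, 71.36])

# PE = theta_0 + theta_1*AT + theta_2*V + theta_3*AP + theta_4*RH
pe_hat = intercept + coef @ first_row
print(round(float(pe_hat), 2))
```

The result is about 443.26, against a measured PE of 442.21 for that row. This is the same calculation that `linreg.predict` performs for every row.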

5 Evaluating the model

To assess the quality of a linear regression model, we generally use the mean squared error (MSE) or the root mean squared error (RMSE) on the test set.

# Predict on the test set
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Compute MSE and RMSE with scikit-learn
print("MSE:",metrics.mean_squared_error(y_test, y_pred))    #MSE: 19.4303412392
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  #RMSE: 4.4079860752
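
For reference, MSE is the mean of the squared differences between the true and predicted values, and RMSE is its square root. The sketch below computes both by hand on three made-up values (`y_true_demo` and `y_pred_demo` are illustrative, not results from the model) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Tiny made-up example: errors are 1, 0 and -2.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# MSE = mean((y - y_hat)^2) = (1 + 0 + 4) / 3
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
# RMSE = sqrt(MSE), expressed in the same units as y.
rmse_manual = np.sqrt(mse_manual)

mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
print(rmse_manual)
```

Because RMSE is in the same units as PE, it is often the easier of the two to interpret.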

The MSE or RMSE gives a basis for comparison: if another method or another set of features yields different coefficients, choose the model with the smaller MSE on the test set.
For example, use only the three columns AT, V and AP as sample features, dropping RH. The output is still PE.

X = data[['AT','V','AP']]
y = data[['PE']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)
y_pred = linreg.predict(X_test)
from sklearn import metrics
print("MSE:",metrics.mean_squared_error(y_test, y_pred))    #MSE: 22.8503207832
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))    #RMSE: 4.78020091452

After removing RH, the model fits worse: the MSE on the test set increases.
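
When several feature sets are to be compared, the repeated split, fit and score steps can be wrapped in a small helper. The following is a sketch on synthetic data: the frame `demo`, the coefficients used to generate it and the helper name `held_out_mse` are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data (not the CCPP values): PE depends on all four columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["AT", "V", "AP", "RH"])
demo["PE"] = (
    450.0
    - 2.0 * demo["AT"]
    - 0.2 * demo["V"]
    + 0.1 * demo["AP"]
    - 1.0 * demo["RH"]
    + rng.normal(scale=0.1, size=500)
)


def held_out_mse(frame, features, target="PE"):
    """Fit on the training split and return the MSE on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[features], frame[target], random_state=1
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


results = {
    "with RH": held_out_mse(demo, ["AT", "V", "AP", "RH"]),
    "without RH": held_out_mse(demo, ["AT", "V", "AP"]),
}
for name, mse in results.items():
    print(name, mse)
```

Using the same `random_state` for every feature set keeps the comparison fair, since each model is scored on the same test rows.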

6 Cross-validation

To evaluate the model more robustly, we can use cross-validation. Here we use 10-fold cross-validation, that is, we pass cv=10 to cross_val_predict:

X = data[['AT','V','AP','RH']]
y = data[['PE']]

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(linreg, X, y, cv=10)

print("MSE:",metrics.mean_squared_error(y, predicted))  #MSE: 20.7892840922
print("RMSE:",np.sqrt(metrics.mean_squared_error(y, predicted)))    #RMSE: 4.55952673994

The MSE from cross-validation is larger than the one in section 5. This is mainly because here the MSE is computed over the predictions for all samples, each predicted while it was in the held-out fold, whereas section 5 computed the MSE only on the 25% test set. The two setups are different, so the numbers are not directly comparable.
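
The per-fold errors can also be inspected individually with `cross_val_score`. The sketch below runs on synthetic data (the arrays `X_cv` and `y_cv` are assumptions for illustration) and uses the `neg_mean_squared_error` scorer, which returns the negated MSE for each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data (an assumption for illustration, not the CCPP data).
rng = np.random.default_rng(2)
X_cv = rng.normal(size=(200, 4))
y_cv = 450.0 + X_cv @ np.array([-2.0, -0.2, 0.06, -0.15]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Pooled MSE: every sample is predicted once, while it is in the held-out fold.
pooled_pred = cross_val_predict(model, X_cv, y_cv, cv=10)
pooled_mse = mean_squared_error(y_cv, pooled_pred)

# Per-fold MSE: the scorer returns negative MSE, so flip the sign.
fold_mse = -cross_val_score(model, X_cv, y_cv, cv=10, scoring="neg_mean_squared_error")

print(pooled_mse)
print(fold_mse.mean(), fold_mse.std())
```

With 200 samples and 10 equally sized folds, the mean of the per-fold values equals the pooled MSE; the spread across folds gives a rough idea of how much the estimate depends on the particular split.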

7 Plotting the results

Plot the measured values against the predicted values. The dashed line in the middle is y = x; the closer a point lies to this line, the smaller its prediction error.

fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

Reference: Learning linear regression with scikit-learn and pandas

Origin www.cnblogs.com/eugene0/p/11620711.html