Learn Linear Regression using scikit-learn and pandas

For readers who want to learn more about linear regression, here is a complete example. After working through it in detail, you should have no trouble running linear regression with scikit-learn and evaluating the resulting model.

1. Get the data, define the problem

Without data, there is of course no machine learning. Here we use a dataset from the UCI Machine Learning Repository to run linear regression.

The introduction of the data is here:  http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The download address of the data is here:  http://archive.ics.uci.edu/ml/machine-learning-databases/00294/

Inside is data from a combined cycle power plant: 9568 samples in total, each with 5 columns, namely AT (ambient temperature), V (exhaust vacuum), AP (ambient pressure), RH (relative humidity), and PE (electrical energy output). We don't need to dwell on the exact physical meaning of each term.

Our problem is to find a linear relationship in which PE is the sample output and AT/V/AP/RH are the sample features. The goal of the machine learning step is to obtain a linear regression model, namely:

PE = \theta_0 + \theta_1 \times AT + \theta_2 \times V + \theta_3 \times AP + \theta_4 \times RH

What needs to be learned are the 5 parameters \theta_0, \theta_1, \theta_2, \theta_3, \theta_4.

2. Organize the data

The downloaded data is a compressed file; after decompressing it you will find an xlsx file inside. Open it with Excel, then "Save As" csv format. We will use this csv file to run linear regression later.

Open this csv and you can see that the data is already tidy, with no invalid rows, so no preprocessing is required. The data is not standardized (transformed to mean 0 and variance 1), but we do not need to do that ourselves: ordinary least squares does not require standardized features, and scikit-learn's LinearRegression works directly on the raw values.
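
If you do want standardized features, here is a minimal optional sketch using scikit-learn's StandardScaler (assuming X holds the feature columns built in section 4 below):

from sklearn.preprocessing import StandardScaler

# Optional: rescale each feature column to mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)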

With the data in csv format, we are ready to get to work.

3. Use pandas to read data

First open ipython notebook and create a new notebook. You could also type the code directly into the Python interactive shell, but a notebook is recommended. The examples and output below were run in a notebook.

  First declare the library to be imported:

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

        Then we can read the data with pandas:

# The argument to read_csv is the path to the csv on your machine;
# here the file sits in a CCPP directory under the notebook's working directory
data = pd.read_csv('./CCPP/ccpp.csv')

Test whether the data was read successfully:

# Read the first five rows; for the last five, use data.tail()
data.head()

The result should look like the following; if you see this output, pandas has read the data successfully:

AT V AP RH PE
0 8.34 40.77 1010.84 90.01 480.48
1 23.64 58.49 1011.40 74.20 445.75
2 29.74 56.90 1007.15 41.91 438.76
3 19.07 49.69 1007.22 76.79 453.09
4 11.80 40.66 1017.13 97.20 464.43
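
As an extra sanity check (not in the original run), pandas can also summarize each column:

# Optional: per-column statistics (count, mean, std, min, quartiles, max)
data.describe()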

4. Prepare the data to run the algorithm

  Let's look at the dimensions of the data:

data.shape

The result is (9568, 5), meaning we have 9568 samples and each sample has 5 columns.

Now we prepare the sample features X, using the 4 columns AT, V, AP, and RH:

X = data[['AT', 'V', 'AP', 'RH']]
X.head()

The first five rows of X are as follows:

AT V AP RH
0 8.34 40.77 1010.84 90.01
1 23.64 58.49 1011.40 74.20
2 29.74 56.90 1007.15 41.91
3 19.07 49.69 1007.22 76.79
4 11.80 40.66 1017.13 97.20

Then we prepare the sample output y, using PE as the output column:

y = data[['PE']]
y.head()

The first five rows of y are as follows:

PE
0 480.48
1 445.75
2 438.76
3 453.09
4 464.43
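
A side note: because y is a one-column DataFrame here, scikit-learn will later return the intercept and coefficients as arrays (see section 6). If you prefer scalar outputs, an optional alternative is a 1-D Series:

# Optional alternative: a 1-D Series instead of a one-column DataFrame
y = data['PE']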

5. Divide training set and test set

We split the samples in X and y into two parts: a training set and a test set. The code is as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

  Look at the dimensions of the training set and test set:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

  The result is as follows:

(7176, 4)
(7176, 1)
(2392, 4)
(2392, 1) 

It can be seen that 75% of the samples are used as the training set and 25% as the test set.
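
This 75/25 split is the default of train_test_split; the test fraction can also be set explicitly with the test_size parameter:

# Equivalent to the call above, with the 25% test fraction made explicit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)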
  

6. Run the scikit-learn linear model

Finally we reach the main event: fitting our problem with scikit-learn's linear model. scikit-learn's linear regression is implemented with ordinary least squares. The code is as follows:

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
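
For reference, least squares chooses the parameters that minimize the residual sum of squares. Writing X for the design matrix (with a leading column of ones for \theta_0) and y for the output vector, the closed-form solution is:

\theta = (X^TX)^{-1}X^Ty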

After the fitting is complete, let's look at the fitted model's intercept and coefficients:

print(linreg.intercept_)
print(linreg.coef_)

  The output is as follows:

[ 447.06297099]
[[-1.97376045 -0.23229086  0.0693515  -0.15806957]]

This gives us the 5 parameter values required in step 1. In other words, the relationship between PE and the other four variables is:

PE = 447.06297099 - 1.97376045 \times AT - 0.23229086 \times V + 0.0693515 \times AP - 0.15806957 \times RH
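
As a quick sanity check (not part of the original post), we can have the fitted model predict PE for the first test sample and compare it with the actual value:

# Predict PE for the first row of the test set and compare with the true value
print(linreg.predict(X_test.iloc[[0]]))
print(y_test.iloc[0])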

    

7. Model evaluation

We need to evaluate the quality of the model. For linear regression, we generally use the mean squared error (MSE) or the root mean squared error (RMSE) on the test set to judge how good a model is.
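
With m test samples, actual outputs y_i, and predictions \hat{y}_i, these are defined as:

MSE = \frac{1}{m}\sum\limits_{i=1}^{m}(y_i - \hat{y}_i)^2 , \quad RMSE = \sqrt{MSE}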

Let's compute the MSE and RMSE of our model; the code is as follows:

# Predict on the test set with the fitted model
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    The output is as follows:

MSE: 20.0804012021
RMSE: 4.48111606657

Once we have the MSE or RMSE, we can use it for model selection: if different methods or feature sets give us different coefficients, we pick the model whose MSE on the test set is smaller.

For example, suppose we use only the three columns AT, V, and AP as sample features and drop RH, keeping PE as the output. The code is as follows:

X = data[['AT', 'V', 'AP']]
y = data[['PE']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
# Predict on the test set with the fitted model
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

        The output is as follows:

MSE: 23.2089074701
RMSE: 4.81756239919

It can be seen that after removing RH the fit is worse than with RH included: the MSE has grown.

8. Cross Validation

We can further evaluate the model through cross-validation. The code below uses 10-fold cross-validation, i.e. the cv parameter of cross_val_predict is set to 10:

X = data[['AT', 'V', 'AP', 'RH']]
y = data[['PE']]
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(linreg, X, y, cv=10)
# Compute MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y, predicted))
# Compute RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, predicted)))

9. Draw and observe the results

Here we plot the actual values against the predicted values. The closer a point lies to the diagonal line y = x, the smaller its prediction error. The code is as follows:

fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

The output is a scatter plot of measured versus predicted PE values; the points cluster around the dashed y = x reference line.
