[Python][Scikit-learn][ML Study Notes 01] Example Analysis of Boston Housing Prices by Linear Regression

> Data selection

Load the Boston housing-price dataset from scikit-learn's datasets module:

from sklearn import datasets
boston = datasets.load_boston()

The Boston dataset is a classic linear-regression dataset with 13 features, and it is the first example in Andrew Ng's online machine learning course. We can print its description document to see its various properties:

print(boston.DESCR)

The result is as follows:

Data Set Characteristics:  

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.
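
As a quick sanity check, we can confirm the loaded arrays match the description above (a minimal sketch):

print(boston.data.shape)    # (506, 13): 506 samples, 13 features
print(boston.target.shape)  # (506,): MEDV, the regression target
print(boston.feature_names) # the 13 feature names, CRIM through LSTAT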

> Linear Regression Model - Manually split training and test sets

We first choose a sampling ratio, such as 0.5, to split the data into training and test sets of equal size:

sampleRatio = 0.5
n_samples = len(boston.target)
sampleBoundary = int(n_samples * sampleRatio)

Next, shuffle the sample indices and slice out the training and test data:

import numpy
shuffleIdx = list(range(n_samples)) # materialize as a list so it can be shuffled in place
numpy.random.shuffle(shuffleIdx)
# Features and regression values ​​of the training set
train_features = boston.data[shuffleIdx[:sampleBoundary]]
train_targets = boston.target[shuffleIdx[:sampleBoundary]]
# Test set features and regression values
test_features = boston.data[shuffleIdx[sampleBoundary:]]
test_targets = boston.target[shuffleIdx[sampleBoundary:]]
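
For reference, scikit-learn's train_test_split from sklearn.model_selection does the same shuffle-and-split in one call; a minimal equivalent sketch (random_state is an arbitrary seed added here for reproducibility):

from sklearn.model_selection import train_test_split
# shuffle the data and hold out half of it as the test set
train_features, test_features, train_targets, test_targets = train_test_split(
    boston.data, boston.target, test_size=0.5, random_state=0)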

Next, get the regression model, fit it and get the predictions on the test set:

from sklearn import linear_model # the linear models live in sklearn.linear_model
lr = linear_model.LinearRegression()
lr.fit(train_features, train_targets) # fit on the training set
y = lr.predict(test_features) # predict on the test set
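
After fitting, the learned parameters of y = ωX + b can be inspected on the model, for example:

print(lr.coef_)      # ω: one learned weight per feature, shape (13,)
print(lr.intercept_) # b: the learned bias term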

Finally, plot the predictions against the real prices with matplotlib:

import matplotlib.pyplot as plt
plt.plot(test_targets, y, 'rx') # real price on x, predicted price on y
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'b-.', lw=4) # the ideal line f(x) = x
plt.ylabel("Predicted Price")
plt.xlabel("Real Price")
plt.show()

The result obtained is as follows:


Points that fall on the blue line are predicted exactly, while points below and above the line are under-predictions and over-predictions, respectively.
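
Rather than only eyeballing the plot, we can quantify the error on the test set, e.g. with the mean squared error and the R^2 score (a minimal sketch):

from sklearn.metrics import mean_squared_error
print(mean_squared_error(test_targets, y))   # mean squared prediction error
print(lr.score(test_features, test_targets)) # R^2 score; 1.0 would be a perfect fit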

> Linear Regression Model - KFold Cross-Validation

From the official sample:

from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, y, cv=10)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

The key call is cross_val_predict(): we hand it the linear regression model (linear_model.LinearRegression(); any estimator that implements fit() will do) and cv=10, which splits the data into 10 cross-validation folds. The code is shorter and more readable than splitting the set by hand, so there is not much to analyze. For a regression target the default splitting strategy is KFold, and the results are as follows:


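For a numeric target like this, cv=10 is shorthand for a 10-fold KFold splitter; passing the splitter explicitly makes that visible (a sketch; shuffle=True and the seed are assumptions added here, since plain cv=10 does not shuffle):

from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=0) # explicit 10-fold splitter
predicted = cross_val_predict(lr, boston.data, y, cv=kf)
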
> Scoring of cross-validated models

Considering the use of cross-validation, we can also score an estimator with cross_val_score() from sklearn.model_selection:

from sklearn.model_selection import cross_val_score
print(cross_val_score(lr, boston.data, y, cv=10))

This prints one score for each of the 10 cross-validation folds:

[ 0.73334917  0.47229799 -1.01097697  0.64126348  0.54709821  0.73610181
  0.37761817 -0.13026905 -0.78372253  0.41861839]

Obviously, the results are not good.
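
These per-fold numbers are R^2 scores (the estimator's default scorer for regression), so a negative entry means that fold's predictions were worse than always predicting the mean. A common way to summarize them is the mean and standard deviation across folds, for example:

scores = cross_val_score(lr, boston.data, y, cv=10)
# report mean R^2 and spread across the 10 folds
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))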
