> Data selection
Load the Boston housing prices dataset from scikit-learn:
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release.
boston = datasets.load_boston()
The Boston dataset is a commonly used linear regression dataset with 13 features, and it is the first example in Andrew Ng's online class. We can print its description to see its properties:
print(boston.DESCR)
The result is as follows:
Data Set Characteristics:
    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
> Linear Regression Model - Manually split training and test sets
We first pick a sampling ratio, such as 0.5, which splits the data into a training set and a test set of equal size:
sampleRatio = 0.5
n_samples = len(boston.target)
sampleBoundary = int(n_samples * sampleRatio)
Next, shuffle the entire set and take out the corresponding training set and test set data:
import numpy
shuffleIdx = numpy.random.permutation(n_samples)  # shuffled row indices
# Training set features and regression targets
train_features = boston.data[shuffleIdx[:sampleBoundary]]
train_targets = boston.target[shuffleIdx[:sampleBoundary]]
# Test set features and regression targets
test_features = boston.data[shuffleIdx[sampleBoundary:]]
test_targets = boston.target[shuffleIdx[sampleBoundary:]]
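The shuffle-and-slice step can also be wrapped in a small helper. The sketch below is only an illustration: the function name `shuffle_split`, the `seed` parameter, and the synthetic arrays are assumptions introduced here and are not part of the original code.

```python
import numpy as np


def shuffle_split(features, targets, sample_ratio=0.5, seed=0):
    """Shuffle the rows, then cut them into a training part and a test part."""
    n_samples = len(targets)
    boundary = int(n_samples * sample_ratio)
    # A seeded generator makes the shuffle reproducible (an addition for this sketch).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    train_idx, test_idx = idx[:boundary], idx[boundary:]
    return (features[train_idx], targets[train_idx],
            features[test_idx], targets[test_idx])


# Synthetic stand-in data: 10 samples with 3 features each.
X = np.arange(30, dtype=float).reshape(10, 3)
t = np.arange(10, dtype=float)
X_train, t_train, X_test, t_test = shuffle_split(X, t, sample_ratio=0.5)
print(X_train.shape, X_test.shape)
```

scikit-learn also ships `train_test_split` in `sklearn.model_selection`, which covers the same need without hand-written index handling.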
Next, get the regression model, fit it and get the predictions on the test set:
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(train_features, train_targets)  # fit the model on the training set
y = lr.predict(test_features)  # predict on the test set
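To make the fit step less of a black box: LinearRegression fits an ordinary least squares model, that is, it chooses the weights and intercept that minimize the squared prediction error. A minimal sketch of that idea with `numpy.linalg.lstsq` follows. The helper names `fit_least_squares` and `predict` and the synthetic data are assumptions made for illustration, and this is not meant to mirror scikit-learn's internal implementation.

```python
import numpy as np


def fit_least_squares(features, targets):
    """Return (weights, intercept) minimizing the squared prediction error."""
    # Append a column of ones so the intercept is learned as one more weight.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    solution, _, _, _ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]


def predict(features, weights, intercept):
    """Apply the fitted linear model to new feature rows."""
    return features @ weights + intercept


# Noise-free synthetic data generated from known weights (2, -1) and intercept 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
t = X @ np.array([2.0, -1.0]) + 3.0

weights, intercept = fit_least_squares(X, t)
print(weights, intercept)
```

Because the synthetic targets contain no noise, the fit should recover the generating weights and intercept up to floating-point error.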
Finally, draw the prediction results through matplotlib:
import matplotlib.pyplot as plt
plt.plot(test_targets, y, 'rx')  # real price on the x axis, predicted price (y = ωX) on the y axis
plt.plot([test_targets.min(), test_targets.max()], [test_targets.min(), test_targets.max()], 'b-.', lw=4)  # f(x)=x
plt.xlabel("Real Price")
plt.ylabel("Predicted Price")
plt.show()
The result obtained is as follows:
The points on the blue line are the accurately predicted points, while the points below and above the blue line are the result of under-prediction and over-prediction, respectively.
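The same reading of the plot can be done numerically by comparing each prediction with its real value. The sketch below is a hypothetical helper (the name `count_prediction_errors`, the `tolerance` parameter, and the sample prices are all made up for illustration) that counts how many points would fall below, above, and on the diagonal.

```python
import numpy as np


def count_prediction_errors(real, predicted, tolerance=1e-9):
    """Count under-, over-, and exactly predicted samples."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(real, dtype=float)
    under = int(np.sum(diff < -tolerance))  # predicted below the real value
    over = int(np.sum(diff > tolerance))    # predicted above the real value
    exact = len(diff) - under - over        # within tolerance of the real value
    return under, over, exact


# Made-up prices: one under-prediction, one exact hit, two over-predictions.
real = [20.0, 25.0, 30.0, 35.0]
predicted = [18.0, 25.0, 33.0, 36.0]
print(count_prediction_errors(real, predicted))  # (1, 2, 1)
```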
> Linear Regression Model - KFold Cross Validation
From the official sample:
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, y, cv=10)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
The key call is cross_val_predict from sklearn.model_selection. It takes the estimator (here linear_model.LinearRegression(); the model must implement the fit() method), the data, and cv=10, which splits the data into 10 cross-validation folds. The code is shorter and more readable than the manual split and needs little further analysis. For a regressor, an integer cv uses the KFold method by default. The results are as follows:
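To see what the cross-validated prediction does, the loop can be written out by hand: each fold is held out once, the model is fitted on the remaining folds, and the held-out predictions are written back into one array. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, not the Boston data) and assumes that an integer `cv` for a regressor corresponds to an unshuffled `KFold`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data: 100 samples, 5 features.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Manual K-fold: every sample is predicted by a model that never saw it.
manual = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual[test_idx] = model.predict(X[test_idx])

# The one-line equivalent used in the sample above.
predicted = cross_val_predict(LinearRegression(), X, y, cv=10)
print(np.allclose(manual, predicted))
```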
> Scoring of cross-validated models
Since we are using cross-validation, we can also score the estimator with cross_val_score() from sklearn.model_selection (the older sklearn.cross_validation module was removed in scikit-learn 0.20):
from sklearn.model_selection import cross_val_score
print(cross_val_score(lr, boston.data, y, cv=10))
This gives one score for each of the 10 cross-validation folds:
[ 0.73334917  0.47229799 -1.01097697  0.64126348  0.54709821  0.73610181
  0.37761817 -0.13026905 -0.78372253  0.41861839]
The result is clearly not good: the scores vary widely from fold to fold, and three of them are negative.
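For a regressor such as LinearRegression, cross_val_score reports by default the estimator's own score, which is the coefficient of determination R². The sketch below computes R² by hand on made-up numbers to show how to read such values; the helper name `r2` is an assumption for illustration.

```python
import numpy as np


def r2(real, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((real - predicted) ** 2)
    ss_tot = np.sum((real - real.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


real = [1.0, 2.0, 3.0]
print(r2(real, [1.0, 2.0, 3.0]))  # 1.0: perfect predictions
print(r2(real, [2.0, 2.0, 2.0]))  # 0.0: always predicting the mean
print(r2(real, [3.0, 2.0, 1.0]))  # -3.0: worse than predicting the mean
```

A negative score therefore means that, on that fold, the model did worse than simply predicting the mean of the fold's targets.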