9. Machine learning with sklearn: ridge regression and an application example

1. Basic Concepts

For a general linear regression problem, the least squares method is used to solve for the parameters; the objective function is:

    min_w ||Xw - y||^2
The parameter w can also be solved for directly in matrix form (the normal equations):

    w = (X^T X)^(-1) X^T y

For the matrix X, if some columns are linearly correlated (that is, some attributes of the training samples are linearly correlated), X^T X becomes nearly singular (its determinant is close to 0), and computing (X^T X)^(-1) is numerically unstable.

Conclusion: The traditional linear regression method based on least squares lacks stability.
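This instability can be demonstrated numerically: when two attribute columns are nearly collinear, the determinant of X^T X is tiny and the normal-equation solution becomes wildly sensitive to noise. A minimal sketch with synthetic data (all values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
x2 = x1 + 1e-4 * rng.standard_normal(50)    # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.standard_normal(50)  # true model uses only x1

XtX = X.T @ X
print(np.linalg.det(XtX))   # tiny compared to the entries of XtX

# Solving the normal equations yields huge, meaningless coefficients,
# even though the underlying relationship is just y ~ 3 * x1
w = np.linalg.solve(XtX, X.T @ y)
print(w)
```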



Ridge regression is a biased-estimation regression method designed specifically for analyzing collinear data. It adds an L2 penalty term to the least squares objective:

    min_w ||Xw - y||^2 + alpha * ||w||^2

so the closed-form solution becomes w = (X^T X + alpha*I)^(-1) X^T y; the added alpha*I keeps the matrix invertible even when X^T X is (nearly) singular.
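As a sanity check, the ridge closed form w = (X^T X + alpha*I)^(-1) X^T y can be computed directly with NumPy and compared against sklearn's Ridge (with fit_intercept=False so both solve the same problem); the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

alpha = 1.0
# Closed-form ridge solution: adding alpha*I makes the matrix invertible
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# sklearn solves the same penalized least squares problem
clf = Ridge(alpha=alpha, fit_intercept=False)
clf.fit(X, y)
print(np.allclose(w_closed, clf.coef_))   # True
```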

In the sklearn library, the ridge regression model can be called using sklearn.linear_model.Ridge, and its main parameters are:

• alpha: regularization strength, corresponding to alpha in the loss function
• fit_intercept: whether to compute the intercept
• solver: the method used to compute the parameters; options include 'auto', 'svd', 'sag', etc.
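A quick illustration of these parameters on synthetic data (the direct solvers all return the same closed-form answer; the iterative 'sag' solver agrees only approximately, so it is left out here):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(60)

coefs = {}
for solver in ('auto', 'svd', 'cholesky'):
    clf = Ridge(alpha=1.0, fit_intercept=True, solver=solver)
    clf.fit(X, y)
    coefs[solver] = clf.coef_
    print(solver, clf.coef_.round(3))
```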

2. Examples

Data introduction: The data is the traffic flow monitoring data of a certain intersection, which records the hourly traffic flow throughout the year. 

Experiment purpose: Create polynomial features based on existing data, use ridge regression model instead of general linear model, and perform polynomial regression on traffic flow information.


import numpy as np
import pandas as pd
# Load the ridge regression model from sklearn.linear_model
from sklearn.linear_model import Ridge
# Load the cross-validation module
from sklearn import model_selection
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

# Load the data from the csv file with pandas
a = pd.read_csv('data.csv')
data = np.array(a)

# Use plt to display the traffic flow information
plt.plot(data[:, 5])
plt.show()

# X holds columns 1 to 4, i.e. the attributes
X = data[:, 1:5]
# y holds column 5, i.e. the traffic flow
y = data[:, 5]
# Create polynomial features up to degree 6; after many trials, degree 6 was chosen
poly = PolynomialFeatures(6)
# X becomes the created polynomial features
X = poly.fit_transform(X)
# Split the data into a training set and a test set; test_size is the proportion
# of the test set, random_state is the random seed
train_set_X, test_set_X, train_set_y, test_set_y = \
    model_selection.train_test_split(X, y, test_size=0.3, random_state=0)
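PolynomialFeatures(6) expands the 4 attribute columns into every monomial of total degree at most 6, so the feature count grows from 4 to C(4+6, 4) = 210. A quick shape check on a toy array (not the traffic data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.arange(8.0).reshape(2, 4)      # 2 samples, 4 attributes, like data[:, 1:5]
X_expanded = PolynomialFeatures(6).fit_transform(X_toy)
print(X_expanded.shape)                   # (2, 210)
```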

# Create a ridge regression instance
clf = Ridge(alpha=1.0, fit_intercept=True)
# Call fit to train the regressor on the training set
clf.fit(train_set_X, train_set_y)
# Use the test set to compute the goodness of fit of the regression curve;
# clf.score returns about 0.7375. The maximum is 1 and there is no minimum;
# a model that outputs the same value for every input scores 0.
clf.score(test_set_X, test_set_y)
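The alpha=1.0 above is taken as-is; one common refinement (not part of the original post) is to pick alpha by cross-validation with sklearn's RidgeCV, sketched here on synthetic data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 5))
y = X @ rng.standard_normal(5) + 0.2 * rng.standard_normal(120)

# RidgeCV evaluates each candidate alpha by cross-validation and keeps the best
clf = RidgeCV(alphas=[0.1, 1.0, 10.0])
clf.fit(X, y)
print(clf.alpha_)   # the selected regularization strength
```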

# Draw the fitted curve over the range 200 to 300
start = 200
end = 300
# y_pre holds the fitted values returned by predict
y_pre = clf.predict(X)
time = np.arange(start, end)
plt.plot(time, y[start:end], 'b', label='real')
plt.plot(time, y_pre[start:end], 'r', label='predict')
plt.legend(loc='upper left')
plt.show()

Result: (figure showing the real and predicted traffic flow curves over hours 200 to 300)
Analysis: the predicted values follow roughly the same trend as the actual values.

