Lasso regression in Python: regression practice
Basic concepts
Regularization
Regularization means placing explicit constraints on a model to avoid overfitting. The lasso regression used in this article applies L1 regularization. (Mathematically, lasso penalizes the L1 norm of the coefficient vector, in other words the sum of the absolute values of the coefficients.)
The underlying theory of regularization is not covered here. Interested readers can refer to the article "Intuitive understanding of regularization terms L1 and L2 in machine learning".
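As a small illustration of the L1 penalty described above, the sketch below computes the sum of the absolute coefficient values with NumPy. The coefficient vector and the `alpha` value are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical coefficient vector, chosen only for illustration
w = np.array([0.5, -2.0, 0.0, 1.5])

# L1 norm: the sum of the absolute values of the coefficients
l1_norm = np.sum(np.abs(w))

# The penalty lasso adds to its loss is alpha times the L1 norm
alpha = 0.1  # assumed regularization strength
l1_penalty = alpha * l1_norm

print(l1_norm)  # 4.0
print(l1_penalty)
```

A larger `alpha` makes the same coefficients more expensive, which is what pushes them toward 0.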
Introduction to the algorithm
Lasso regression
Before studying lasso regression, it helps to have some understanding of ordinary least squares and ridge regression; see these two articles: "least squares-regression practice" and "ridge regression-regression practice".
Like ridge regression, lasso is a regularized linear regression model, so its prediction formula is the same as that of ordinary least squares:
y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + … + w[p]*x[p] + b
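The formula above can be checked numerically. The following sketch evaluates it for a single sample with three features, once term by term and once as a dot product. The weights, bias, and feature values are hypothetical.

```python
import numpy as np

# Hypothetical weights, bias, and one sample with three features
w = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

# y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b, written term by term
y_terms = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b

# The same prediction written as a dot product
y_dot = np.dot(w, x) + b

print(y_terms, y_dot)  # 2.0 2.0
```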
Like ridge regression, lasso constrains the coefficients w to be close to 0, but it does so in a different way, called L1 regularization. A consequence of L1 regularization is that some coefficients become exactly 0, which means the model ignores those features entirely. This can be viewed as a form of automatic feature selection. With some coefficients exactly 0, the model is easier to interpret and reveals which features matter most.
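To make the "exactly 0" behaviour concrete, here is a minimal sketch that fits lasso and ridge on synthetic data and counts the coefficients that are exactly zero. The data, the true weights, and `alpha=0.5` are assumptions for illustration; they are not taken from the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration, not the housing data):
# 10 features, but only the first 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count the coefficients that are exactly zero in each model
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print("lasso zero coefficients:", lasso_zeros)
print("ridge zero coefficients:", ridge_zeros)
```

Ridge shrinks the irrelevant coefficients toward 0 without setting them to exactly 0, while lasso removes them from the model.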
Data Sources
Boston housing prices: https://www.kaggle.com/altavish/boston-housing-dataset
This is a classic dataset.
The main variables in this dataset are:
ZN: proportion of residential land zoned for lots over 25,000 square feet.
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built before 1940.
CHAS: whether the area borders the river (1 means yes, 0 means no).
CRIM: per capita crime rate.
MEDV: housing price (median home value), the target to be predicted.
The remaining variables are other housing-related indicators; interested readers can look them up themselves.
Data mining
1. Import third-party libraries
import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso  # import the lasso regression algorithm
from sklearn.metrics import r2_score
As usual, we start by importing the modules needed for modeling.
2. Read the file
import winreg
# Look up the current user's Desktop folder in the Windows registry
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
# Absolute path of the source data file inside a folder on the desktop
file_origin = file_address + "\\源数据-分析\\HousingData.csv"
house_price = pd.read_csv(file_origin)  # data from https://www.kaggle.com/altavish/boston-housing-dataset
Moving each downloaded file into the Python working directory, or reading it from the downloads folder, is tedious. Instead, I look up the absolute path of the desktop through the winreg library, so I only need to download the data to the desktop, or paste it into a specific folder on the desktop, to read it without mixing it up with other data.
Everything up to this point is routine and has to be repeated for nearly every data mining task, so there is not much more to say about it.
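The registry lookup in step 2 only works on Windows. As a possible alternative, the sketch below builds a desktop path with the standard-library `pathlib` module. The helper name `desktop_file` and the folder name `data` are hypothetical, and the sketch assumes the Desktop folder sits directly under the home directory, which is not guaranteed on every machine.

```python
from pathlib import Path

def desktop_file(*parts):
    """Build a path under the current user's Desktop folder.

    Assumes the Desktop lives directly under the home directory, which
    is not guaranteed (for example, when the folder has been relocated).
    """
    return Path.home() / "Desktop" / Path(*parts)

# Hypothetical folder and file names, for illustration only
csv_path = desktop_file("data", "HousingData.csv")
print(csv_path)
```

`pd.read_csv` accepts a `Path` object, so `csv_path` could be passed to it directly.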
3. Clean the data
1. Find missing values
This dataset does not contain many missing values, so the rows that contain them are simply dropped.
house_price1 = house_price.dropna().reset_index(drop=True)  # drop rows with missing values and renumber the index
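The missing-value step can be sketched on a toy table: count the missing values per column first, then drop the affected rows. The values below are made up and only stand in for the housing data.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values standing in for the housing data
df = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.2],
    "RM": [6.5, 6.1, np.nan, 5.9],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row with at least one missing value and renumber the rows
clean = df.dropna().reset_index(drop=True)
print(clean)
```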
2. Find abnormal values
This usually means checking whether any feature contains values equal to zero. Put plainly, the goal is to see whether the data contains unrealistic values. Take the crime rate: in practice, no area has a crime rate of exactly 0. From the above results, there are no other problems with this data.
CHAS is a 0/1 indicator and ZN is a proportion that is 0 for many areas, so zeros in these two columns are normal.
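One way to carry out the zero check described above is to count zeros per column and set aside the columns where a zero is legitimate. The table below uses made-up values, and the `zero_allowed` list is an assumption based on the remarks about ZN and CHAS.

```python
import pandas as pd

# Toy table with made-up values standing in for the housing data
df = pd.DataFrame({
    "CRIM": [0.03, 0.0, 0.27, 0.09],
    "ZN": [18.0, 0.0, 0.0, 12.5],
    "CHAS": [0, 1, 0, 0],
    "RM": [6.6, 6.4, 7.2, 0.0],
})

# Count the zeros in each column
zero_counts = (df == 0).sum()

# Columns where a zero is a legitimate value
zero_allowed = ["ZN", "CHAS"]

# Zeros in the remaining columns deserve a closer look
suspicious = zero_counts.drop(zero_allowed)
suspicious = suspicious[suspicious > 0]
print(suspicious)
```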
4. Modeling
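train = house_price1.drop(["MEDV"], axis=1)  # features: every column except the target
X_train, X_test, y_train, y_test = train_test_split(train, house_price1["MEDV"], random_state=23)
lasso = Lasso(alpha=10, max_iter=0)  # value kept from the original run; recent scikit-learn versions may require max_iter >= 1
lasso.fit(X_train, y_train)
print("Lasso training score: " + str(r2_score(y_train, lasso.predict(X_train))))  # training set
print("Lasso test score: " + str(r2_score(y_test, lasso.predict(X_test))))  # test set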
With the lasso model fitted, we score it on the training set and the test set. The results are as follows:
As the results show, lasso performs poorly on both the training set and the test set, which indicates underfitting. Like ridge regression, lasso has a regularization parameter alpha that controls how strongly the coefficients are pushed toward 0. The previous model used alpha=10. To reduce underfitting, we try a smaller alpha. At the same time, we need to increase max_iter (the maximum number of iterations to run). The result is as follows:
After adjusting the parameters step by step, the model's score on the training set is about 79%, and its score on new data is about 60%.
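The scores quoted here come from `r2_score`, which computes the coefficient of determination. As a rough sketch of what that number means, the function below reimplements the same formula with NumPy. The function name `r2` and the sample prices are hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up prices and predictions, for illustration only
y_true = [20.0, 25.0, 30.0, 35.0]
perfect = r2(y_true, y_true)        # predictions match the targets exactly
mean_only = r2(y_true, [27.5] * 4)  # always predicting the mean price
print(perfect, mean_only)  # 1.0 0.0
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean.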
At this point, the modeling of this data set is complete.
PS: If max_iter is too small, a warning appears saying the value should be increased. Once the warning no longer appears, raising max_iter further does not affect the accuracy of the model.
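The effect of the two parameters can be sketched on synthetic data: a smaller alpha improves the training fit, and once the solver has converged, a larger max_iter leaves the score unchanged. The data and the helper `train_score` are assumptions for illustration and do not reproduce the housing results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration, not the housing data)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X @ np.array([4.0, -2.0, 1.0, 0.5, 0, 0, 0, 0]) + rng.normal(size=150)

def train_score(alpha, max_iter):
    """Fit lasso and return its R^2 on the training data."""
    model = Lasso(alpha=alpha, max_iter=max_iter).fit(X, y)
    return r2_score(y, model.predict(X))

# A smaller alpha constrains the coefficients less, so the training fit improves
strong = train_score(alpha=10, max_iter=10000)
weak = train_score(alpha=0.01, max_iter=10000)

# Once the solver has converged, extra iterations leave the score unchanged
more_iters = train_score(alpha=0.01, max_iter=100000)

print(strong, weak, more_iters)
```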
Discussion
Comparison with the ridge regression algorithm
Let's look at the strengths and weaknesses of lasso and ridge regression by varying the regularization parameter alpha.
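from sklearn.linear_model import Ridge  # import the ridge regression algorithm
# Column names: 参数 = alpha, 训练模型得分 = training score, 待测模型得分 = test score, 岭回归 = ridge regression
columns = ["参数", "lasso训练模型得分", "lasso待测模型得分", "岭回归训练模型得分", "岭回归待测模型得分"]
rows = []
for i in range(1, 100):
    alpha = i / 10
    ridge = Ridge(alpha=alpha)
    lasso = Lasso(alpha=alpha, max_iter=10000)
    ridge.fit(X_train, y_train)
    lasso.fit(X_train, y_train)
    rows.append({
        "参数": alpha,
        "lasso训练模型得分": r2_score(y_train, lasso.predict(X_train)),
        "lasso待测模型得分": r2_score(y_test, lasso.predict(X_test)),
        "岭回归训练模型得分": r2_score(y_train, ridge.predict(X_train)),
        "岭回归待测模型得分": r2_score(y_test, ridge.predict(X_test)),
    })
# DataFrame.append was removed in pandas 2.0, so the table is built from the collected rows
result = pd.DataFrame(rows, columns=columns)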
The result is as follows:
Both algorithms show a clear pattern as alpha changes, on the training set as well as on the test set. Next, we plot the data above as a line chart to make it easier to read:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif': ['SimHei', 'Arial']})  # font that supports Chinese characters, so they do not render as boxes
%matplotlib inline
# show the chart directly in the Jupyter notebook
fig, ax = plt.subplots(figsize=(15, 5))
plt.plot(result["参数"], result["lasso训练模型得分"], label="lasso training score")  # draw the line chart
plt.plot(result["参数"], result["lasso待测模型得分"], label="lasso test score")
plt.plot(result["参数"], result["岭回归训练模型得分"], label="ridge training score")
plt.plot(result["参数"], result["岭回归待测模型得分"], label="ridge test score")
plt.rcParams.update({'font.size': 15})
plt.legend()
plt.xticks(fontsize=15)  # font size of the tick labels on the axes
plt.yticks(fontsize=15)
plt.xlabel("alpha", fontsize=15)  # axis label text and font size
plt.ylabel("score", fontsize=15)
The result is as follows:
When alpha is small, lasso can fit a more complex model and performs better on both the training set and the test set. Its generalization ability is then slightly better than that of ridge regression (red and green lines). As alpha increases, however, lasso underfits more and more clearly (red line and blue line): both its accuracy and its generalization ability gradually decrease.
If alpha is set too small, on the other hand, the effect of regularization disappears and overfitting occurs, giving a result similar to ordinary least squares.
The plot also shows that at a certain value of alpha, the predictive performance of ridge regression is close to that of lasso (see where the two lines intersect).
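The intersection mentioned above can also be located numerically rather than by eye. The sketch below looks for the alpha at which the two test-score curves are closest. The scores and the English column names are made up for illustration.

```python
import pandas as pd

# Made-up scores for illustration; the column names are hypothetical
scores = pd.DataFrame({
    "alpha": [0.1, 0.2, 0.3, 0.4, 0.5],
    "lasso_test": [0.62, 0.60, 0.57, 0.53, 0.50],
    "ridge_test": [0.55, 0.56, 0.57, 0.58, 0.58],
})

# Absolute gap between the two test-score curves at each alpha
gap = (scores["lasso_test"] - scores["ridge_test"]).abs()

# The row with the smallest gap approximates where the two lines cross
closest = scores.loc[gap.idxmin()]
print(closest["alpha"])  # 0.3
```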
In practice, therefore, ridge regression is generally the first choice between these two models. The figure shows that its score changes smoothly as alpha varies, and its generalization ability even improves slightly as alpha increases (green line). But if there are many features and you expect only a few of them to matter, lasso may be the better choice. Likewise, if you want a model that is easier to explain, lasso provides one, because it uses only a subset of the features as input.
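To see which features a fitted lasso model actually keeps, the coefficients can be paired with the column names and filtered for nonzero values. This is a minimal sketch on synthetic data; the feature names `x0` to `x4` and `alpha=0.1` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): only x0 and x1 matter
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=["x0", "x1", "x2", "x3", "x4"])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

# Pair each coefficient with its feature name
coefs = pd.Series(lasso.coef_, index=X.columns)

# Features with a nonzero coefficient are the ones the model actually uses
selected = list(coefs[coefs != 0].index)
print(selected)
```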
That concludes this walkthrough of the lasso algorithm and my views on it. There is plenty of room for improvement, so suggestions from readers are welcome, and I hope to meet others to discuss these topics with.