Lasso regression - regression practice

Hands-on practice with the lasso regression algorithm in Python

Basic concepts

Regularization

Regularization means placing explicit constraints on a model to avoid overfitting. The lasso regression used in this article applies L1 regularization. (Mathematically, lasso penalizes the L1 norm of the coefficient vector, in other words, the sum of the absolute values of the coefficients.)
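For reference, this is the objective that scikit-learn's Lasso minimizes, as documented by the library (reproduced here for context):

(1 / (2 * n_samples)) * ||y - X*w||^2_2 + alpha * ||w||_1

Here alpha is the regularization strength and the second term is the L1 penalty described above.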

The underlying principles of regularization are not covered here. Interested readers can refer to this article: Intuitive understanding of regularization terms L1 and L2 in machine learning.

Introduction to the Algorithm

Lasso regression

Before diving into lasso regression, it is recommended that you first get familiar with ordinary least squares and ridge regression; you can refer to these two articles: least squares-regression practice, ridge regression-regression practice.

Alongside ridge regression, lasso is another regularized linear regression model, so its prediction formula is the same as that of the least squares method, as shown below:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + … + w[p]*x[p] + b

Like ridge regression, lasso also restricts the coefficients w to be close to 0, but it does so differently, using what is called L1 regularization. The practical consequence of L1 regularization is that some coefficients become exactly 0, meaning some features are completely ignored by the model. This can be seen as a form of automatic feature selection: with some coefficients exactly 0, the model is easier to interpret and highlights its most important features.
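To make the feature-selection effect concrete, here is a minimal sketch on synthetic data (the dataset and the alpha value are made up for illustration, not taken from this article's example):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 samples, 10 features, of which only 3 are actually informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5, random_state=0)
demo_lasso = Lasso(alpha=1.0)  # alpha chosen arbitrarily for the demonstration
demo_lasso.fit(X, y)
print("Non-zero coefficients:", np.sum(demo_lasso.coef_ != 0))  # typically far fewer than 10

The uninformative features tend to receive coefficients of exactly 0, which is the automatic feature selection described above.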

Data Sources

Boston housing prices: https://www.kaggle.com/altavish/boston-housing-dataset
This is a very classic dataset.

Briefly explain the main indicators of this data:
ZN: the proportion of residential land zoned for lots over 25,000 square feet.
RM: the average number of rooms per dwelling.
AGE: the proportion of owner-occupied units built before 1940.
CHAS: whether the tract bounds the Charles River (1 means yes, 0 means no).
CRIM: the per capita crime rate.
MEDV: the median home value.
The other indicators are self-explanatory; they are all additional housing-related measures, and interested readers can look them up.

Data mining

1. Import third-party libraries

import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso  # import the Lasso regression algorithm
from sklearn.metrics import r2_score

As usual, we start by importing each module needed for modeling.

2. Read the file

import winreg
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
file_address = winreg.QueryValueEx(real_address, "Desktop")[0]
file_origin = file_address + "\\源数据-分析\\HousingData.csv"  # absolute desktop path to the source data file
house_price = pd.read_csv(file_origin)  # https://www.kaggle.com/altavish/boston-housing-dataset

Every time you download data you would otherwise have to move the file into the Python working directory, or read it from the Downloads folder, which is troublesome. So I resolve the desktop's absolute path through the winreg library; that way I only need to download the data to the desktop (or paste it into a specific folder on the desktop) to read it, and it won't get mixed up with other data.
In fact, up to this step we are just going through the motions; essentially every data mining project starts like this, so there is nothing more to say.
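If you are not on Windows (or would rather avoid the registry), here is a minimal alternative sketch using only the standard library, assuming the CSV sits in the same desktop folder:

from pathlib import Path

file_origin = Path.home() / "Desktop" / "源数据-分析" / "HousingData.csv"  # hypothetical cross-platform path; adjust to wherever the CSV actually lives
house_price = pd.read_csv(file_origin)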

3. Clean the data

1. Find missing values
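The original screenshot is not reproduced here; a sketch of the kind of check it showed:

print(house_price.isnull().sum())  # count missing values per column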

You can see that the data contains only a small number of missing values, so we simply drop those rows.

house_price1 = house_price.dropna().reset_index(drop=True)

2. Find anomalous values

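Again, the screenshot is omitted; a sketch of the kind of check it showed:

print((house_price1 == 0).sum())  # count zeros per column to spot implausible values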
Generally, this means checking whether any feature contains values equal to zero; put plainly, it is checking whether the data contains unrealistic values, such as a crime rate of 0 (in practice, no area has a crime rate of exactly 0). From the results above, this data has no such problems.
Both ZN and CHAS use 0 and 1 as indicator values, so it is normal for them to contain 0.

4. Modeling

train = house_price1.drop(["MEDV"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, house_price1["MEDV"], random_state=23)
lasso = Lasso(alpha=10, max_iter=1)  # max_iter must be >= 1 in scikit-learn; kept deliberately tiny here (see the note below)
lasso.fit(X_train, y_train)
print("Lasso training set score: " + str(r2_score(y_train, lasso.predict(X_train))))  # training set
print("Lasso test set score: " + str(r2_score(y_test, lasso.predict(X_test))))  # test set

After building the model with the lasso algorithm, we score its accuracy on the training and test sets.

As you can see from the scores, lasso performs poorly on both the training set and the test set, which indicates underfitting. Like ridge regression, lasso has a regularization parameter alpha that controls how strongly the coefficients are pushed toward 0. The model above used alpha=10; to reduce the underfitting we try decreasing alpha, and at the same time we need to increase max_iter (the maximum number of iterations to run).
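The exact values used are not shown here, so the following is only an illustrative sketch with an assumed alpha:

lasso = Lasso(alpha=0.1, max_iter=10000)  # alpha=0.1 is an assumed value for illustration
lasso.fit(X_train, y_train)
print("Lasso training set score: " + str(r2_score(y_train, lasso.predict(X_train))))
print("Lasso test set score: " + str(r2_score(y_test, lasso.predict(X_test))))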

After adjusting the parameters in turn, the model's training accuracy comes out to about 79%, while on new data its accuracy is about 60%.
At this point, the modeling of this data set is complete.
ps: If max_iter is too small, a warning appears saying that a larger value is needed; once the solver converges, increasing max_iter further does not affect the model's accuracy.
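As an illustration of that note, the warning can be observed (or not, if the solver happens to converge) like this:

import warnings
from sklearn.exceptions import ConvergenceWarning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Lasso(alpha=10, max_iter=1).fit(X_train, y_train)
print(any(issubclass(w.category, ConvergenceWarning) for w in caught))  # True if the solver did not converge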

Discussion

Comparison with the ridge regression algorithm

Let's examine the strengths and weaknesses of lasso and ridge regression by varying the value of the constraint parameter alpha.

from sklearn.linear_model import Ridge  # import the ridge regression algorithm
rows = []
for i in range(1, 100):
    alpha = i / 10
    ridge = Ridge(alpha=alpha)
    lasso = Lasso(alpha=alpha, max_iter=10000)
    ridge.fit(X_train, y_train)
    lasso.fit(X_train, y_train)
    rows.append({"alpha": alpha,
                 "lasso train score": r2_score(y_train, lasso.predict(X_train)),
                 "lasso test score": r2_score(y_test, lasso.predict(X_test)),
                 "ridge train score": r2_score(y_train, ridge.predict(X_train)),
                 "ridge test score": r2_score(y_test, ridge.predict(X_test))})
result = pd.DataFrame(rows)  # build the table in one go (DataFrame.append was removed in pandas 2.0)

For each alpha, this produces the training and test scores of both models (the original screenshot of the table is omitted).

It can be seen that, as alpha changes, both algorithms show clear patterns in both the training and test scores. Next, we use a line chart to present the data above more intuitively:

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif': ['SimHei', 'Arial']})  # set a CJK-capable font (the original labels were Chinese)
%matplotlib inline
# display charts directly in Jupyter Notebook
fig, ax = plt.subplots(figsize=(15, 5))
plt.plot(result["alpha"], result["lasso train score"], label="lasso train score")  # line plot
plt.plot(result["alpha"], result["lasso test score"], label="lasso test score")
plt.plot(result["alpha"], result["ridge train score"], label="ridge train score")
plt.plot(result["alpha"], result["ridge test score"], label="ridge test score")
plt.rcParams.update({'font.size': 15})
plt.legend()
plt.xticks(fontsize=15)  # tick font size on the axes
plt.yticks(fontsize=15)
plt.xlabel("alpha", fontsize=15)  # axis label text and font size
plt.ylabel("score", fontsize=15)

This produces the line chart analyzed below (the original screenshot is omitted).

It can be seen that when alpha is small, we can fit a more complex model that performs better on both the training set and the test set, with generalization slightly better than ridge regression's (red and green lines). But as the alpha parameter increases, the underfitting of the lasso model becomes more and more obvious (red and blue lines); that is, the model's accuracy and generalization ability gradually decrease.

But if alpha is set too small, the effect of regularization is essentially removed and overfitting appears, yielding results similar to the least squares method.

At the same time, it can be seen that for certain values of alpha, the predictive performance of ridge regression is close to that of the lasso model (look at the intersections of the lines).

Therefore, in practice, ridge regression is generally preferred between these two models. As the figure shows, the model's score changes smoothly as the parameter changes, and the generalization ability can even improve slightly as the parameter increases (green line). But if there are many features and you believe only a few of them are important, lasso may be the better choice. Likewise, if you want a model that is easier to explain, lasso yields a more interpretable one, because it selects only a subset of the features as input.
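As a quick check of that last point, a sketch using the models left over from the final loop iteration above (alpha = 9.9):

print("lasso uses", np.sum(lasso.coef_ != 0), "of", train.shape[1], "features")
print("ridge uses", np.sum(ridge.coef_ != 0), "of", train.shape[1], "features")  # ridge typically keeps all of them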

That covers the hands-on practice of, and my views on, the lasso algorithm. There are plenty of places where it falls short; suggestions from readers are welcome, and I hope to meet some like-minded friends to discuss these topics together.


Origin blog.csdn.net/weixin_43580339/article/details/112983192