Ridge Regression: Regression in Practice

A hands-on regression exercise with the ridge regression algorithm in Python

Basic concepts

Regularization

Regularization means placing explicit constraints on the model to avoid overfitting. The ridge regression used in this article applies L2 regularization. (Mathematically, ridge regression penalizes the L2 norm of the coefficient vector w, i.e. its Euclidean length.)

The detailed principles of regularization are not covered here; interested readers can refer to this article: Intuitive understanding of the L1 and L2 regularization terms in machine learning.

Introduction to the algorithm

Ridge regression

Ridge regression is also a linear model used for regression, so its model formula is the same as that of ordinary least squares, as shown below:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + … + w[p]*x[p] + b

In ridge regression, however, the coefficients w are chosen not only to give good predictions on the training data but also to satisfy an additional constraint: all elements of w should be close to zero. Intuitively, this means each feature should influence the output as little as possible (i.e. have a small slope) while still giving good predictions. This constraint is the regularization.
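For reference, here is a minimal sketch of the loss that ridge regression minimizes (a standard formulation; the names w, b, X, y and alpha are illustrative, and the intercept b is not penalized):

import numpy as np

def ridge_loss(w, b, X, y, alpha):
    # squared prediction error plus an L2 penalty on the coefficients
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2) + alpha * np.sum(w ** 2)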

Data Sources

Boston housing prices: https://www.kaggle.com/altavish/boston-housing-dataset
This is a very classic dataset.

Briefly, the main indicators in this data:
ZN: proportion of residential land zoned for lots over 25,000 square feet
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied homes built before 1940
CHAS: whether the tract bounds the Charles River (1 means yes, 0 means no)
CRIM: crime rate
MEDV: house price (the prediction target)
The other indicators are further attributes of the housing; interested readers can look them up for themselves.

Data mining

1. Import third-party libraries

import pandas as pd
import numpy as np
import winreg
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge  # import the ridge regression algorithm
from sklearn.metrics import r2_score

As usual, we start by importing, in turn, each module needed for modeling.
2. Read the file

import winreg
real_address = winreg.OpenKey(winreg.HKEY_CURRENT_USER,r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders',)
file_address=winreg.QueryValueEx(real_address, "Desktop")[0]
file_address+='\\'
file_origin=file_address+"\\源数据-分析\\HousingData.csv"  # absolute desktop path of the source data file
house_price=pd.read_csv(file_origin)  # https://www.kaggle.com/altavish/boston-housing-dataset

Every time data is downloaded, it would otherwise have to be moved into the Python root directory or read from the download folder, which is troublesome. So I build an absolute path to the desktop through the winreg library; that way I only need to download the data to the desktop, or paste it into a specific folder there, to read it, and it does not get mixed up with other data.
Up to this step we are just going through the routine that basically every data-mining exercise repeats, so there is nothing more to say.

3. Clean the data
1. Find missing values
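The check itself is a one-liner; a minimal sketch of what the original screenshot showed (assumed, not the exact original code):

print(house_price.isnull().sum())  # number of missing values in each column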
You can see that this data contains only a small number of missing values, so the affected rows are simply dropped.

house_price1=house_price.dropna().reset_index()  # drop rows containing missing values
del house_price1["index"]  # discard the old index column left by reset_index

2. Check for abnormal values

Generally this means checking whether the feature columns contain values of zero, or more bluntly, values that cannot occur in reality, such as a crime rate of exactly 0: in practice no area has a crime rate of zero. From the results, this data has no such problems.
Both ZN and CHAS use 0 and 1 as indicator values, so it is normal for them to contain zeros.
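For reference, a quick way to run this kind of sanity check (a sketch, not the exact code behind the original screenshot):

print((house_price1 == 0).sum())  # how many zeros each column contains
print(house_price1.describe())    # minima and maxima also reveal implausible values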
4. Modeling

train=house_price1.drop(["MEDV"],axis=1)
X_train,X_test,y_train,y_test=train_test_split(train,house_price1["MEDV"],random_state=1)
# MEDV is the prediction target, the remaining columns are the features; split the data into training and test sets.
ridge=Ridge(alpha=10)  # set the constraint (regularization) parameter
ridge.fit(X_train,y_train)
print("岭回归训练模型得分:"+str(r2_score(y_train,ridge.predict(X_train))))  # training set
print("岭回归待测模型得分:"+str(r2_score(y_test,ridge.predict(X_test))))  # test set

After fitting the ridge regression model, we score it on the training and test sets. The model's R² score on the training set is about 79%, and on the new (test) data about 63%.
At this point, the modeling of this data set is complete.

Discussion

1. Discussion of parameters

Since ridge regression and linear regression (ordinary least squares) share the same model formula, we compare the two here. Readers unfamiliar with linear regression can read my other article: Regression Practice of the Least Squares Algorithm.

The constraint parameter we set earlier was 10; running the same model with the parameter set to 0, the training score improves but the generalization ability drops. At the same time, the scores are exactly the same as those of the linear regression model. So when the constraint parameter of ridge regression is set to 0, the unconstrained ridge regression is the same algorithm as ordinary least squares.
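To verify this equivalence, a small sketch (reusing the train/test split from the modeling step; LinearRegression comes from sklearn.linear_model) could be:

from sklearn.linear_model import LinearRegression
ridge0=Ridge(alpha=0).fit(X_train,y_train)       # alpha=0 removes the L2 penalty
linear=LinearRegression().fit(X_train,y_train)
print(r2_score(y_test,ridge0.predict(X_test)))   # the two scores should be (almost) identical
print(r2_score(y_test,linear.predict(X_test)))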

2. Comparison with ordinary least squares

Let's look at the strengths and weaknesses of ridge regression versus ordinary least squares by varying the value of the constraint parameter.

from sklearn.linear_model import LinearRegression

result_b=pd.DataFrame(columns=["参数","岭回归训练模型得分","岭回归待测模型得分","线性回归训练模型得分","线性回归待测模型得分"])
train=house_price1.drop(["MEDV"],axis=1)
X_train,X_test,y_train,y_test=train_test_split(train,house_price1["MEDV"],random_state=23)
for i in range(21):
    alpha=i/10  # the constraint parameter can take fractional values
    ridge=Ridge(alpha=alpha)
    ridge.fit(X_train,y_train)
    linear=LinearRegression()
    linear.fit(X_train,y_train)
    row=pd.DataFrame([{"参数":alpha,
                       "岭回归训练模型得分":r2_score(y_train,ridge.predict(X_train)),
                       "岭回归待测模型得分":r2_score(y_test,ridge.predict(X_test)),
                       "线性回归训练模型得分":r2_score(y_train,linear.predict(X_train)),
                       "线性回归待测模型得分":r2_score(y_test,linear.predict(X_test))}])
    result_b=pd.concat([result_b,row],ignore_index=True)

The results show that, looking only at the score on the training data, ordinary least squares does better than ridge regression; but when predicting on new data, that is, when the model's generalization ability is considered, the ridge regression model scores better than least squares.
We use a line chart to show the above data more intuitively:

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif':['SimHei','Arial']})  # set a Chinese-capable font so labels are not rendered as boxes
%matplotlib inline
# display charts directly in the Jupyter notebook
fig, ax = plt.subplots(figsize=(15,5))
plt.plot(result_b["参数"],result_b["岭回归训练模型得分"],label="岭回归训练模型得分")  # line chart of each score column
plt.plot(result_b["参数"],result_b["岭回归待测模型得分"],label="岭回归待测模型得分")
plt.plot(result_b["参数"],result_b["线性回归训练模型得分"],label="线性回归训练模型得分")
plt.plot(result_b["参数"],result_b["线性回归待测模型得分"],label="线性回归待测模型得分")
plt.rcParams.update({'font.size': 12})
plt.legend()
plt.xticks(fontsize=15)  # tick-label font size
plt.yticks(fontsize=15)
plt.xlabel("参数",fontsize=15)  # axis labels and their font size
plt.ylabel("得分",fontsize=15)

The resulting line chart shows that ridge regression makes a trade-off between the simplicity of the model (coefficients all close to 0) and its performance on the training set. How much weight simplicity and training performance each get is up to the user, set through the alpha parameter. Increasing alpha pushes the coefficients further toward 0, which lowers performance on the training set but can improve generalization.
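The shrinkage can be seen directly in the fitted coefficients; a short sketch (reusing the split above, with alpha values chosen arbitrarily for illustration):

for alpha in [0.1,1,10,100]:
    ridge=Ridge(alpha=alpha).fit(X_train,y_train)
    print(alpha, np.abs(ridge.coef_).max())  # the largest coefficient shrinks as alpha grows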

For both ridge regression and linear regression, the training score is higher than the test score at every dataset size. Because ridge regression is regularized, its training score is lower than that of linear regression overall, but its test score is higher, especially for smaller datasets. If the amount of data falls below a certain level, linear regression learns almost nothing. As more and more data becomes available, both models improve, and linear regression eventually catches up with ridge regression. So with enough training data, regularization becomes less important, and ridge regression and linear regression end up performing the same.
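This learning-curve behaviour is not plotted in this article, but a sketch of how it could be reproduced with sklearn's learning_curve helper (an assumed illustration, not the original code):

from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
sizes=np.linspace(0.1,1.0,10)
for name,model in [("ridge",Ridge(alpha=1)),("linear",LinearRegression())]:
    train_sizes,train_scores,test_scores=learning_curve(
        model,train,house_price1["MEDV"],train_sizes=sizes,cv=5,scoring="r2")
    print(name,test_scores.mean(axis=1))  # average test R² at each training-set size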

That is all for this hands-on look at ridge regression and my observations on it. There are many places that could be done better; suggestions from readers are welcome, and I hope to meet some friends to discuss these topics with.

Origin blog.csdn.net/weixin_43580339/article/details/112931842