Machine Learning Fundamentals: Feature Selection with Lasso

The first model most people meet in machine learning is simple linear regression, yet Lasso is often skipped over when people learn it. In fact, Lasso regression is an evergreen among machine learning models and is widely used in industry; you will see it in many projects, especially for feature selection.

Lasso adds L1 regularization to simple linear regression, which can shrink the coefficients of unimportant variables to exactly 0 and thereby performs feature selection. This article first explains the principle and then demonstrates how to use Lasso for feature selection. I hope you pick up something new.

Lasso principle

Lasso adds an L1-norm penalty to the objective function of simple linear regression.

Recall: in linear regression, when the parameters θ grow too large and there are too many features, the model easily overfits, as shown in the figure below.

This slide from Prof. Hung-yi Lee (Li Hongyi) makes the point even more visually.


To prevent overfitting (θ becoming too large), a complexity penalty factor, i.e. a regularization term, is added to the objective function $J(\theta)$, which enhances the generalization ability of the model. The regularization term can be the L1-norm (Lasso), the L2-norm (Ridge), or a combination of the two (Elastic Net).

Cost function for lasso regression

$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^{T}x^{(i)}\right)^2+\lambda \sum_{j=1}^{n}|\theta_j|$$

Matrix form:

$$J(\theta) = \frac{1}{2n}(X\theta - Y)^T(X\theta - Y) + \alpha\,\|\theta\|_1$$

In the classic comparison figure, the left panel is Lasso and the right is ridge regression; β1 and β2 are the model parameters to be optimized, the red ellipses are contours of the objective function, and the blue area is the constraint (solution) region.

Whether ridge regression or Lasso regression, the essence is to tune λ so as to balance model error against model complexity. The point where the red ellipse first touches the blue region is the optimal solution of the objective function. The Lasso optimum tends to land on a coordinate axis, producing a sparse result (some coefficients are exactly zero). Ridge regression shrinks the coefficients without discarding any feature, which makes the model relatively stable, but compared with Lasso the model keeps more features and is harder to interpret.
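Why the corners produce exact zeros can also be seen algebraically. As an illustrative aside (not part of the original derivation): in the special case of an orthonormal design, $X^TX = I$, the Lasso minimizer of the cost function above has a closed form obtained by soft-thresholding the ordinary least-squares estimate:

$$\hat{\theta}_j^{\text{lasso}} = \operatorname{sign}\!\left(\hat{\theta}_j^{\text{OLS}}\right)\max\!\left(\left|\hat{\theta}_j^{\text{OLS}}\right| - \lambda,\; 0\right)$$

Any OLS coefficient whose magnitude falls below λ is set exactly to zero, which is precisely the sparsity Lasso exploits for feature selection.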

Today our focus is on Lasso, whose optimization objective (as written in scikit-learn) is:

$$\frac{1}{2\,n_{\text{samples}}}\,\|y - Xw\|_2^2 + \alpha\,\|w\|_1$$

The expression above is not continuously differentiable, so standard solvers such as gradient descent or Newton's method cannot be applied directly. The commonly used methods are coordinate descent and least-angle regression (LARS); a minimal sketch of the coordinate-descent update is given after the references below.

We won't expand on this here; interested readers can look at Liu Jianping's article "Lasso回归算法: 坐标轴下降法与最小角回归法小结" (a summary of coordinate descent and least-angle regression for Lasso).
www.cnblogs.com/pinard/p/60…

If you want to dig deeper, the Coordinate Descent and LARS papers are worth reading: www.stat.cmu.edu/~ryantibs/c…
arxiv.org/pdf/math/04…
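To make the idea concrete, here is a minimal sketch of cyclic coordinate descent with soft-thresholding for the Lasso objective above. This is only an illustration under the stated scaling, not scikit-learn's actual implementation, and the helper names are made up:

import numpy as np

def soft_threshold(rho, lam):
    # Soft-thresholding operator: the solution of the 1-D Lasso subproblem
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    # Minimizes (1/(2n)) * ||y - Xw||^2_2 + alpha * ||w||_1 by cycling over coordinates
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual that excludes feature j's current contribution
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n          # correlation of feature j with the residual
            z = (X[:, j] @ X[:, j]) / n      # per-feature scaling term
            w[j] = soft_threshold(rho, alpha) / z
    return w

For standardized features (unit variance), z is approximately 1 and each update reduces to plain soft-thresholding of the correlation between feature j and the current residual.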

scikit-learn provides Lasso implementations of both optimization algorithms, namely:

sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, 
normalize='deprecated', precompute=False, copy_X=True,
max_iter=1000, tol=0.0001, warm_start=False, 
positive=False, random_state=None, selection='cyclic')


sklearn.linear_model.lars_path(X, y, Xy=None, *, Gram=None,
max_iter=500, alpha_min=0, method='lar', copy_X=True, 
eps=2.220446049250313e-16, copy_Gram=True, verbose=0, 
return_path=True, return_n_iter=False, positive=False)
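
A quick usage sketch of the two. The synthetic dataset from make_regression and the parameter values are only illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, lars_path

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Coordinate-descent Lasso with a fixed alpha
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))

# Full Lasso path computed with the LARS algorithm
alphas, active, coefs = lars_path(X, y, method='lasso')
print("path shape (n_features, n_alphas):", coefs.shape)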

Using Lasso to find feature importance

In machine learning, when facing massive amounts of data, the first instinct is dimensionality reduction: try to solve the problem with as little data as possible. Lasso compresses the feature coefficients and drives some regression coefficients to exactly 0, thereby achieving feature selection, and it can be widely applied to model improvement and model selection.

Feature selection is a big topic in machine learning

Among scikit-learn's Lasso implementations, the one used more often in practice is actually LassoCV (a Lasso linear model fit iteratively along a regularization path), which cross-validates the hyperparameter α to help us choose a suitable value. GridSearchCV + Lasso can also do the tuning. Below are LassoCV's parameters, attributes and methods; a short usage sketch follows the method list.

### Parameters
eps: Length of the path. eps=1e-3 means that alpha_min / alpha_max = 1e-3.
n_alphas: Number of alphas along the regularization path, 100 by default.
alphas: List of alphas used to compute the models. If None, the alphas are set automatically.
fit_intercept: Whether to estimate the intercept, True by default. If False, the data is assumed to be already centered.
tol: Tolerance of the optimization, 1e-4 by default: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
cv: Specifies the cross-validation splitting strategy.

### Attributes

alpha_: The amount of penalization chosen by cross-validation.
coef_: The parameter vector (w in the cost-function formula).
intercept_: The intercept in the objective function.
mse_path_: Mean squared error on the test set of each fold, for the different alphas.
alphas_: The grid of alphas used for fitting.
dual_gap_: The dual gap at the end of the optimization for the optimal alpha (alpha_).
n_iter_: Number of iterations run by the coordinate-descent solver to reach the specified tolerance for the optimal alpha.

### Methods

fit(X, y[, sample_weight, check_input]): Fit the model with coordinate descent.
get_params([deep]): Get the parameters of this estimator.
path(X, y, *[, l1_ratio, eps, n_alphas, …]): Compute the elastic net path with coordinate descent.
predict(X): Predict using the linear model.
score(X, y[, sample_weight]): Return the coefficient of determination R² of the prediction.
set_params(**params): Set the parameters of this estimator.
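
A minimal LassoCV usage sketch. The synthetic data and the cv value are only placeholders; any feature matrix X and target y would do:

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Cross-validated search over the regularization path
lasso_cv = LassoCV(n_alphas=100, cv=5, random_state=0)
lasso_cv.fit(X, y)

print("best alpha:", lasso_cv.alpha_)                 # alpha_ chosen by CV
print("non-zero coefs:", (lasso_cv.coef_ != 0).sum())
print("MSE path shape:", lasso_cv.mse_path_.shape)    # (n_alphas, n_folds)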

Python in practice

We use the Boston housing data as an example.

## Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
import warnings
warnings.filterwarnings('ignore')
## Read the data
url = r'F:\100-Days-Of-ML-Code\datasets\Regularization_Boston.csv'
df = pd.read_csv(url)

scaler=StandardScaler()
df_sc= scaler.fit_transform(df)
df_sc = pd.DataFrame(df_sc, columns=df.columns)
y = df_sc['price']
X = df_sc.drop('price', axis=1)  # be careful: inplace=False, so df_sc itself is unchanged
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Tuning Lasso mainly means choosing a suitable alpha. As mentioned above, both LassoCV and GridSearchCV can do this; here we implement the search manually so that we can plot the coefficient paths.

alpha_lasso = 10**np.linspace(-3,1,100)
lasso = Lasso()
coefs_lasso = []

for i in alpha_lasso:
    lasso.set_params(alpha = i)
    lasso.fit(X_train, y_train)
    coefs_lasso.append(lasso.coef_)
    
plt.figure(figsize=(12,10))
ax = plt.gca()
ax.plot(alpha_lasso, coefs_lasso)
ax.set_xscale('log')
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights: scaled coefficients')
plt.title('Lasso regression coefficients Vs. alpha')
plt.legend(df.drop('price',axis=1, inplace=False).columns)
plt.show()

(Figure: Lasso regression coefficients vs. alpha.) The figure shows how each variable's coefficient changes as the alpha penalty grows; the variables we want to keep are those whose coefficients are not 0. The later a variable's coefficient shrinks to 0 as alpha increases, the more important that variable is in the model.

We can also inspect feature importance by sorting the absolute values of the coefficients in descending order. If you set a larger alpha, you will see more coefficients compressed to 0.

lasso = Lasso(alpha=10**(-3))
model_lasso = lasso.fit(X_train, y_train)
coef = pd.Series(model_lasso.coef_,index=X_train.columns)
print(coef[coef != 0].abs().sort_values(ascending = False))

LSTAT2 2.876424
LSTAT 2.766566
LSTAT4 0.853773
LSTAT5 0.178117
LSTAT10 0.102558
LSTAT9 0.088525
LSTAT8 0.001112
dtype: float64

lasso = Lasso(alpha=10**(-2))
model_lasso = lasso.fit(X_train, y_train)
coef = pd.Series(model_lasso.coef_,index=X_train.columns)
print(coef[coef != 0].abs().sort_values(ascending = False))

LSTAT 1.220552
LSTAT3 0.625608
LSTAT10 0.077125
dtype: float64

Or simply draw a bar chart:

fea = X_train.columns
a = pd.DataFrame()
a['feature'] = fea
a['importance'] = coef.values

a = a.sort_values('importance',ascending = False)
plt.figure(figsize=(12,8))
plt.barh(a['feature'],a['importance'])
plt.title('Feature importance')
plt.show()

(Figure: bar chart of feature importances.)
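If you want to go one step further and actually use the fitted Lasso as a feature selector, scikit-learn's SelectFromModel can wrap it. A hedged sketch reusing the model_lasso, X_train and X_test from above; the threshold of 1e-5 is just an illustrative cutoff:

from sklearn.feature_selection import SelectFromModel

# Reuse the Lasso fitted above; keep only features with non-negligible coefficients
selector = SelectFromModel(model_lasso, prefit=True, threshold=1e-5)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

print("kept features:", list(X_train.columns[selector.get_support()]))
print("shape before/after:", X_train.shape, X_train_sel.shape)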

Summary

The advantage of the Lasso regression method is that it makes up for the shortcomings of ordinary least-squares estimation and the local-optimum behaviour of stepwise regression; it selects features well and helps to address multicollinearity among the features.

The disadvantage is that when a group of features is highly correlated, Lasso tends to select only one of them and ignore the others, which can make the results unstable.

Although the Lasso regression method has drawbacks, it can still play a good role in suitable scenarios.

References

www.biaodianfu.com/ridge-lasso…
machinelearningcompass.com/machine_lea…
www.cnblogs.com/pinard/p/60…
