Lasso model selection: Cross-Validation / AIC / BIC

Akaike information criterion (AIC), Bayes information criterion (BIC) and cross-validation are used to select the optimal value of the regularization parameter α of the Lasso estimator.

Results obtained with LassoLarsIC are based on the AIC/BIC criteria.

Model selection with information criteria is very fast, but it relies on a proper estimation of the degrees of freedom, is derived for large samples (asymptotic results), and assumes the model is correct, i.e. that the data are actually generated by this model. Information criteria also tend to break down when the problem is badly conditioned (more features than samples).
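For reference, the generic definitions of the two criteria are given below. This is a sketch of the general form, not the exact expression implemented by LassoLarsIC, which additionally plugs in an estimate of the noise variance. Here L̂ is the maximized likelihood, n the number of samples, and d the degrees of freedom of the model (for the Lasso, commonly estimated by the number of non-zero coefficients):

AIC = 2d - 2 ln(L̂)
BIC = d ln(n) - 2 ln(L̂)

Both criteria reward goodness of fit through the likelihood term and penalize complexity through d; since ln(n) > 2 once n ≥ 8, BIC penalizes complexity more heavily and tends to select sparser models.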

For cross-validation, we use two algorithms to compute the Lasso path: coordinate descent (implemented by the LassoCV class) and Lars (least angle regression, implemented by the LassoLarsCV class). Both algorithms give roughly the same results. They differ in their execution speed and in their sources of numerical error.

Lars computes a path solution only for each kink in the path, so it is very efficient when there are few kinks, which is the case when there are few features or samples. It is also able to compute the full path without setting any meta parameter. In contrast, coordinate descent computes the path points on a pre-specified grid (here we use the default), so it is more efficient when the number of grid points is smaller than the number of kinks in the path. Such a strategy can be interesting if the number of features is really large and there are enough samples to select a large number of them. In terms of numerical error, Lars accumulates more errors for heavily correlated variables, while the coordinate descent algorithm only samples the path on a grid.
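The difference between the two path algorithms can be seen directly, without any cross-validation, by comparing lars_path (one solution per kink) with lasso_path (solutions on a grid of alpha values). This is a minimal sketch, not part of the original example; it loads the same diabetes data used in the script below:

from sklearn import datasets
from sklearn.linear_model import lars_path, lasso_path

X, y = datasets.load_diabetes(return_X_y=True)

# Lars: one solution per kink of the piecewise-linear coefficient path
alphas_lars, _, coefs_lars = lars_path(X, y, method='lasso')
print("Lars path: %d alphas (one per kink)" % len(alphas_lars))

# Coordinate descent: solutions evaluated on a pre-specified grid of alphas
alphas_cd, coefs_cd, _ = lasso_path(X, y, n_alphas=100)
print("Coordinate-descent path: %d grid points" % len(alphas_cd))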

Note how the optimal value of alpha varies for each fold. This illustrates why nested cross-validation is necessary when trying to evaluate the performance of a method whose parameter is chosen by cross-validation: this choice of parameter may not be optimal for unseen data.
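A minimal sketch of nested cross-validation for this setting (assuming the same diabetes data as in the script below; cross_val_score and the cv values here are illustrative choices, not part of the original example):

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

X, y = datasets.load_diabetes(return_X_y=True)

# Inner loop: LassoCV picks alpha by cross-validation on each training split.
# Outer loop: cross_val_score evaluates that whole selection procedure on held-out folds.
scores = cross_val_score(LassoCV(cv=5), X, y, cv=5)
print("Nested CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))

Without the outer loop, reporting the cross-validated error that LassoCV itself minimized would give an optimistically biased estimate of performance on unseen data. The full example script, which times and compares all three approaches, follows.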

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets

# This is to avoid division by zero while doing np.log10
EPSILON = 1e-4

X, y = datasets.load_diabetes(return_X_y=True)  # load the diabetes dataset

rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]  # add some bad (uninformative) features

# Normalize the data as done by Lars, to allow for comparison
X /= np.sqrt(np.sum(X ** 2, axis=0))

# #############################################################################
# LassoLarsIC: least angle regression with the BIC/AIC criterion

model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1  # time taken to fit the model
alpha_bic_ = model_bic.alpha_  # alpha selected by the BIC

model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_

# Plot the value of the information criterion ('aic', 'bic') across all alphas
def plot_ic_criterion(model, name, color):
    criterion_ = model.criterion_  # criterion values computed by the model
    plt.semilogx(model.alphas_ + EPSILON, criterion_, '--', color=color,
                 linewidth=3, label='%s criterion' % name)
    plt.axvline(model.alpha_ + EPSILON, color=color, linewidth=3,
                label='alpha: %s estimate' % name)  # mark the optimal alpha
    plt.xlabel(r'$\alpha$')
    plt.ylabel('criterion')


plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)'
          % t_bic)

# #############################################################################
# LassoCV: coordinate descent

# Compute paths
print("Computing regularization path using the coordinate descent lasso...")
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1

# Display results
plt.figure()
ymin, ymax = 2300, 3800
plt.semilogx(model.alphas_ + EPSILON, model.mse_path_, ':')
plt.plot(model.alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_ + EPSILON, linestyle='--', color='k',
            label='alpha: CV estimate')

plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent '
          '(train time: %.2fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

# #############################################################################
# LassoLarsCV: least angle regression

# Compute paths
print("Computing regularization path using the Lars lasso...")
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1

# Display results
plt.figure()
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_, ':')
plt.semilogx(model.cv_alphas_ + EPSILON, model.mse_path_.mean(axis=-1), 'k',
             label='Average across the folds', linewidth=2)
plt.axvline(model.alpha_, linestyle='--', color='k',
            label='alpha CV')
plt.legend()

plt.xlabel(r'$\alpha$')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (train time: %.2fs)'
          % t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)

plt.show()

[Figure: Information-criterion for model selection (AIC/BIC) as a function of alpha]

[Figure: Mean square error on each fold, coordinate descent (LassoCV)]

[Figure: Mean square error on each fold, Lars (LassoLarsCV)]

Origin: blog.csdn.net/qq_42946328/article/details/111592023