1 parameter
sklearn's LinearRegression has a normalize parameter that scales the regressors before training (note: this parameter was later deprecated and removed in scikit-learn 1.2; the documentation itself recommends StandardScaler instead):
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
The documentation describes it as follows:
normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.
The interesting distinction: the normalization applied here subtracts the mean and divides by the l2-norm of each column, whereas standardization (StandardScaler) subtracts the mean and divides by the standard deviation.
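To make the difference concrete, here is a small sketch on a toy column of values (hypothetical data, not from the competition dataset), comparing the l2-norm scaling that normalize=True applied with the standard-deviation scaling that StandardScaler applies:

```python
import numpy as np

# A toy feature column (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0])

centered = x - x.mean()

# What normalize=True did: divide the centered column by its l2-norm
l2_scaled = centered / np.linalg.norm(centered)

# What StandardScaler does: divide the centered column by its standard deviation
std_scaled = centered / centered.std()

print(np.linalg.norm(l2_scaled))  # unit l2-norm
print(std_scaled.std())           # unit variance
```

Both versions are centered, but they differ by a constant factor (the l2-norm grows with the number of samples, the standard deviation does not), so the two scalings are not interchangeable.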
2 coefficient
After training, the coefficients of the linear regression model can indicate feature importance (provided the features are on comparable scales):
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
We can also plot them:
model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
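As a self-contained illustration of reading importance off the coefficients, here is a sketch on synthetic data (the feature names and generating weights are hypothetical): the target depends strongly on one feature, weakly on another, and not at all on the third, and ranking by absolute coefficient recovers that order:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ['f0', 'f1', 'f2']  # hypothetical names

X = rng.normal(size=(200, 3))
# y depends strongly on f2, weakly on f0, and not at all on f1
y = 0.5 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Rank features by absolute coefficient magnitude
ranking = sorted(zip(feature_names, model.coef_),
                 key=lambda kv: abs(kv[1]), reverse=True)
print(ranking)  # f2 should come first, f1 last
```

Sorting by absolute value matters here: a large negative coefficient is just as important as a large positive one.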
3 Check the model
After training the model, we need to compare the predicted values against the true values to judge whether the model is reasonable:
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()
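Beyond eyeballing the scatter plot, a numeric error metric quantifies the gap. A minimal sketch using mean_absolute_error on synthetic data (the data and weights here are hypothetical, standing in for train_X and train_y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Mean absolute gap between predictions and truth
mae = mean_absolute_error(y, model.predict(X))
print(mae)  # small value means predictions track the truth
```

On the real data, a large value of this metric points to the same problem the scatter plot reveals visually.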
If the deviation is large, something is wrong with the model, and the label is a likely culprit: a long-tailed label distribution violates the model's assumptions and should be transformed toward a normal distribution:
train_y_ln = np.log(train_y + 1)  # equivalently np.log1p(train_y)
Retraining on the transformed label gives much better results.
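To see why this transform helps, here is a sketch on synthetic long-tailed data (hypothetical draws from a log-normal, standing in for the real price column): the skewness drops from strongly positive to near zero after the log transform:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical long-tailed target, e.g. prices drawn from a log-normal
train_y = rng.lognormal(mean=10, sigma=1, size=1000)

train_y_ln = np.log(train_y + 1)  # the transform used above

def skewness(a):
    """Sample skewness: third standardized central moment."""
    a = np.asarray(a, dtype=float)
    m = a.mean()
    return ((a - m) ** 3).mean() / a.std() ** 3

print(skewness(train_y))     # strongly positive: long right tail
print(skewness(train_y_ln))  # near zero: roughly symmetric
```

The +1 inside the log guards against zero-valued labels; for strictly positive prices it barely changes the result.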