Preface
I was busy with a project contract for a while, so this series was delayed. Now that things have quieted down, I am taking the chance to finish it.
Underfitting and overfitting
- Underfitting
Definition: the model fails to fit the training set well, and consequently cannot fit the test set well either.
Cause: the model is too simple, or the training data has too few features.
Solution: increase the number of features in the data.
- Overfitting
Definition: a hypothesis fits the training data better than other hypotheses but fails to generalize to the test data; such a hypothesis is said to overfit.
Cause: the model is too complex, there are too many original features, and some of the features are noisy.
Solution: regularization (mainly used in regression), which reduces the influence of higher-order terms.
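As a quick illustration of the points above (a sketch of my own, not from the original post): fitting a high-degree polynomial to a few noisy points gives a model with far more flexibility than the data supports, and an L2 penalty visibly shrinks the weights:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.3, size=20)

# Expand to a degree-15 polynomial: far more flexibility than 20 points support
X = PolynomialFeatures(degree=15, include_bias=False).fit_transform(x)
X = StandardScaler().fit_transform(X)

ols = LinearRegression().fit(X, y)   # no penalty: prone to overfitting
reg = Ridge(alpha=1.0).fit(X, y)     # L2 penalty shrinks the weights

print("OLS  weight norm:", np.linalg.norm(ols.coef_))
print("Ridge weight norm:", np.linalg.norm(reg.coef_))
```

The regularized model's weight vector is much smaller in norm, which is exactly the mechanism that tames the higher-order terms.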
Regularization
There are two main methods: L1 regularization and L2 regularization.
L1 regularization
Principle: add a penalty term to the original loss function; the penalty term is the sum of the absolute values of the weights w. The formula is as follows:

J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\lvert w_j\rvert

where m is the number of samples, n is the number of features, λ is the penalty coefficient, and w_j is the weight of the j-th feature.
Function: some of the weights w can be driven exactly to 0, which removes the influence of the corresponding features entirely.
The model using L1 regularization is also called Lasso regression.
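To see L1's feature-zeroing effect in action, here is a small sketch (my own illustration, not from the original post): only the first three features actually drive the target, and Lasso with a moderate alpha sets most of the irrelevant weights exactly to 0:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
# Only features 0-2 matter; the other 7 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
print("weights:", np.round(lasso.coef_, 3))
print("zeroed features:", int(np.sum(lasso.coef_ == 0)))
```

Note that the zeroed coefficients are exactly 0, not merely small — this is what distinguishes L1 from L2 and makes Lasso usable as a feature-selection tool.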
L2 regularization
Principle: add a penalty term to the original loss function; the penalty term is the sum of the squares of all weights w. The formula is as follows:

J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n} w_j^2

where m is the number of samples, n is the number of features, λ is the penalty coefficient, and w_j is the weight of the j-th feature.
Function: some of the weights w are made very small, close to 0, which weakens the influence of the corresponding features without removing them entirely.
The model using L2 regularization is also called Ridge regression (ridge regression).
The optimization goal of linear regression is to reduce the loss function. After the penalty term is added, the optimizer must reduce not only the original loss but also the penalty term.
As the formulas above show, the larger the regularization strength λ, the closer the corresponding weights w are pushed toward 0; conversely, the smaller λ is, the larger the weights w can remain.
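This shrinkage can be checked directly (an illustrative sketch, not from the original post): as alpha — scikit-learn's name for λ — grows, the overall size of the Ridge weight vector shrinks toward 0:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 3.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=100)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>7}: ||w|| = {norms[-1]:.4f}")
```

The weight norm decreases monotonically with alpha, matching the formula's prediction.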
Ridge regression
Ridge regression is linear regression with L2 regularization: when the algorithm builds the regression equation, an L2 penalty is added to mitigate overfitting.
The corresponding API in sklearn is:

```python
Ridge(alpha=1.0, fit_intercept=True, normalize=False,
      copy_X=True, max_iter=None, tol=1e-3, solver="auto",
      random_state=None)
```
The meanings of the parameters are as follows:

| parameter | meaning |
| --- | --- |
| alpha | The penalty coefficient, i.e. the regularization strength λ from the formula above |
| fit_intercept | Whether to add a bias term; the default is True |
| normalize | Whether to standardize the data; the default is False. If set to True, the features no longer need to be standardized beforehand (this parameter has been removed in recent scikit-learn versions) |
| copy_X | The default is True. If True, X is copied; otherwise it may be overwritten |
| max_iter | The maximum number of iterations |
| solver | The default is "auto", which selects an optimization algorithm based on the data; for large datasets with many features, choose "sag". The other values are "svd", "cholesky", "lsqr", "sparse_cg", and "sag" |
| random_state | Random state |
In fact, Ridge is roughly equivalent to SGDRegressor(penalty='l2', loss="squared_loss"); the difference is that SGDRegressor optimizes with stochastic gradient descent (SGD), while Ridge can use the SAG solver.
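The rough equivalence can be verified on synthetic data (a sketch of my own, not from the original post). Note that in recent scikit-learn the squared-error loss was renamed from 'squared_loss' to 'squared_error'; since it is the default in all versions, the loss argument is simply omitted here:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 4)))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=500)

ridge = Ridge(alpha=0.01).fit(X, y)
# squared-error loss is the default, so only the L2 penalty is spelled out
sgd = SGDRegressor(penalty='l2', alpha=0.01, max_iter=2000, random_state=0).fit(X, y)

print("Ridge coef:", np.round(ridge.coef_, 2))
print("SGD   coef:", np.round(sgd.coef_, 2))
```

With enough iterations on standardized data, the two sets of coefficients land close together; they are not identical, because SGD is a stochastic approximation.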
The fitted model exposes these attributes:

| attribute | meaning |
| --- | --- |
| coef_ | The regression coefficients, i.e. w in the linear model |
| intercept_ | The bias, i.e. b in the linear model |
The example revisits the Boston house-price prediction problem from the earlier post in this series, machine learning introductory research (12). The code is as follows:
```python
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def ridge():
    # 1) load the dataset
    boston = load_boston()
    # 2) split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3) standardize the features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4) build the estimator
    estimator = Ridge()
    # estimator = SGDRegressor(penalty='l2', loss="squared_loss")
    estimator.fit(x_train, y_train)
    # 5) predict
    y_predict = estimator.predict(x_test)
    print("linear model weights w:", estimator.coef_)
    print("linear model bias b:", estimator.intercept_)
    # 6) evaluate the model
    error = mean_squared_error(y_test, y_predict)
    print("ridge MSE:", error)
    return
```
The result of running it is:

```
linear model weights w: [-0.63591916  1.12109181 -0.09319611  0.74628129 -1.91888749  2.71927719
 -0.08590464 -3.25882705  2.41315949 -1.76930347 -1.74279405  0.87205004
 -3.89758657]
linear model bias b: 22.62137203166228
ridge MSE: 20.656448214354967
```
The error still differs slightly from the error obtained with the normal-equation solver LinearRegression. The accuracy of ridge regression can be improved by tuning parameters such as alpha and max_iter in Ridge.
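One systematic way to tune alpha (a sketch of my own, not from the original post) is RidgeCV, which cross-validates over a list of candidate values. Since load_boston was removed in scikit-learn 1.2, the diabetes dataset stands in here:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

data = load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, random_state=22)

# Standardize the features, fitting the scaler on the training set only
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# RidgeCV picks the best alpha from the candidates by cross-validation
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = RidgeCV(alphas=alphas).fit(x_train, y_train)
print("best alpha:", model.alpha_)
print("test MSE:", mean_squared_error(y_test, model.predict(x_test)))
```

The selected alpha is stored in `model.alpha_`, so the whole alpha search collapses into a single fit call.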