Introductory Research on Machine Learning (13) - Ridge Regression

Table of Contents

 

Preface

Underfitting and overfitting

Regularization

L1 regularization

L2 regularization

Ridge regression


Preface

I was busy with a project contract for a while, so this series was delayed. Taking advantage of some free time before the New Year, I am hurrying to finish it.

Underfitting and overfitting

  • Underfitting

Definition: the model (hypothesis) cannot fit the training set well, and it does not fit the test set well either.

Reason: the model is too simple and the data has too few features

Solution: increase the number of features in the data

  • Overfitting

Definition: a hypothesis that fits the training data better than other hypotheses but fails to fit the test data well is considered to be overfitting.

Reason: the model is too complex, there are too many original features, and some of the features are noisy

Solution: regularization (mainly used for regression), which reduces the influence of higher-order terms.

Regularization

There are two main methods: L1 regularization and L2 regularization.

L1 regularization

Principle: add a penalty term to the original loss function; the penalty term is the sum of the absolute values of the weights w, scaled by \lambda. The formula is as follows:

J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|w_j|

Here m is the number of samples, n is the number of features, \lambda is the penalty coefficient, and w_j is the weight of the j-th feature.

Effect: some of the weights w can be driven exactly to 0, which removes the influence of the corresponding features.

The model using L1 regularization is also called Lasso regression.
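
As a quick illustration, the sketch below (assuming scikit-learn and NumPy are available; the data set here is synthetic and only for demonstration) fits a Lasso model and shows that some of the weights come out exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 3 of them are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alpha is the penalty coefficient (the \lambda in the formula above)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Several weights are driven exactly to 0, removing those features' influence
print("Lasso weights:", lasso.coef_)
print("Number of zero weights:", int(np.sum(lasso.coef_ == 0)))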

L2 regularization

Principle: add a penalty term to the original loss function; the penalty term is the sum of the squares of all weights w, scaled by \lambda. The formula is as follows:

J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}w_j^2

Here m is the number of samples, n is the number of features, \lambda is the penalty coefficient, and w_j is the weight of the j-th feature.

Effect: some of the weights w are made very small (close to 0), which weakens the influence of the corresponding features without removing them entirely.

The model using L2 regularization is also called Ridge regression.

Linear regression is optimized by minimizing the loss function. Once the penalty term is added, the optimization must keep not only the original loss but also the penalty term small.

It can be seen from the formulas above that the greater the regularization strength \lambda, the closer the corresponding weights w are pushed toward 0; conversely, the smaller \lambda, the larger the weights w are allowed to be.
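
A minimal sketch of this effect on synthetic data (assuming scikit-learn and NumPy are available; the data set and alpha values are made up for illustration): as alpha, the \lambda above, grows, the learned weights shrink toward 0.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The larger the regularization strength alpha, the smaller the weights become
for alpha in [0.1, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    print(f"alpha={alpha:>8}: mean |w| = {np.mean(np.abs(model.coef_)):.3f}")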

Ridge regression

Ridge regression is linear regression with L2 regularization: when the algorithm builds the regression equation, an L2 penalty term is added in order to mitigate overfitting.
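
For intuition, ridge regression (without an intercept) also has a closed-form solution, w = (X^T X + \lambda I)^{-1} X^T y. The NumPy sketch below is only a minimal illustration of that formula on synthetic data; sklearn's Ridge additionally handles the intercept and offers several iterative solvers.

import numpy as np

def ridge_closed_form(X, y, lam):
    # Minimal ridge solution w = (X^T X + lam * I)^(-1) X^T y (no intercept)
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Tiny synthetic example (values chosen only for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(100)

print(ridge_closed_form(X, y, lam=1.0))  # should be close to [2.0, -1.0, 0.5]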

The corresponding API in sklearn is:

Ridge(alpha=1.0, fit_intercept=True, normalize=False,
      copy_X=True, max_iter=None, tol=1e-3, solver="auto",
      random_state=None)

The meanings of the parameters are as follows:

parameter      meaning
alpha          Penalty coefficient (regularization strength); this is the \lambda in the formulas above
fit_intercept  Whether to fit the intercept (bias); the default is True
normalize      Whether to normalize the data; the default is False. If set to True, we no longer need to standardize the features ourselves
copy_X         The default is True. If True, X is copied; otherwise it may be overwritten
max_iter       The maximum number of iterations
solver         The default is "auto": the optimization algorithm is chosen automatically based on the data. If the data set is large and has many features, choose "sag". Other possible values are "svd", "cholesky", "lsqr", "sparse_cg", and "sag"
random_state   Random state (seed)
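
For example, an estimator with an explicit penalty and solver might be constructed like this (the particular values are only illustrative):

from sklearn.linear_model import Ridge

# alpha is the regularization strength \lambda; "sag" suits larger data sets
estimator = Ridge(alpha=0.5, fit_intercept=True, solver="sag",
                  max_iter=1000, random_state=22)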

In fact, Ridge is roughly equivalent to SGDRegressor(penalty='l2', loss='squared_loss'); the difference is that SGDRegressor optimizes with stochastic gradient descent (SGD), while Ridge uses solvers such as SAG.
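
A rough comparison of the two on synthetic data (a sketch only; note that in scikit-learn 1.2 and later the loss is spelled "squared_error" instead of "squared_loss"):

from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data purely for illustration
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X, y)
sgd = SGDRegressor(penalty='l2', loss="squared_loss",  # "squared_error" in newer versions
                   max_iter=5000, random_state=0).fit(X, y)

# The two sets of weights are comparable but not identical: Ridge uses a
# direct or iterative solver, SGDRegressor takes stochastic gradient steps
print("Ridge weights:", ridge.coef_)
print("SGD weights:  ", sgd.coef_)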

After fitting, the estimator exposes the following attributes:

attribute    meaning
coef_        Regression coefficients, i.e. the w in the linear model
intercept_   Bias, i.e. the b in the linear model

 

The example analyzes the Boston housing price prediction problem from the earlier post, Introductory Research on Machine Learning (12) - Predicting Boston Housing Prices with Linear Regression. The code is as follows:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def ridge():
    # 1) Load the data set
    boston = load_boston()
    # 2) Split it into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3) Standardize the features
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4) Build and fit the estimator
    estimator = Ridge()
    # estimator = SGDRegressor(penalty='l2', loss="squared_loss")
    estimator.fit(x_train, y_train)
    # 5) Predict on the test set
    y_predict = estimator.predict(x_test)
    print("Weights w of the linear model:", estimator.coef_)
    print("Bias b of the linear model:", estimator.intercept_)

    # 6) Evaluate the model
    error = mean_squared_error(y_test, y_predict)
    print("MSE of ridge:", error)
    return

Running it gives the following output:

Weights w of the linear model: [-0.63591916  1.12109181 -0.09319611  0.74628129 -1.91888749  2.71927719
 -0.08590464 -3.25882705  2.41315949 -1.76930347 -1.74279405  0.87205004
 -3.89758657]
Bias b of the linear model: 22.62137203166228
MSE of ridge: 20.656448214354967

The error is still slightly different from the error obtained with the normal-equation solution LinearRegression. We can improve the accuracy of ridge regression by tuning alpha, max_iter, and other parameters of Ridge, as sketched below.
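
One simple way to do that (a sketch; it assumes the standardized splits x_train, x_test, y_train, y_test produced inside the ridge() function above are available) is to loop over several alpha values, or to let RidgeCV pick alpha by cross-validation:

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

def tune_ridge(x_train, x_test, y_train, y_test):
    # Try a few penalty strengths by hand and compare the test error
    for alpha in [0.1, 1.0, 10.0, 100.0]:
        model = Ridge(alpha=alpha, max_iter=10000)
        model.fit(x_train, y_train)
        error = mean_squared_error(y_test, model.predict(x_test))
        print(f"alpha={alpha:>6}: MSE = {error:.3f}")

    # Or let RidgeCV choose alpha by cross-validation on the training set
    cv_model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]).fit(x_train, y_train)
    print("alpha chosen by RidgeCV:", cv_model.alpha_)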

 

Origin blog.csdn.net/nihaomabmt/article/details/103911488