Summary of logistic regression optimization techniques (full)

Starting from practical applications, this article summarizes logistic regression (LR) optimization techniques from three angles: data characteristics, optimization algorithms, and model-level optimization.

1. Feature generation for LR

Logistic regression is a simple generalized linear model. Its fitting ability is very limited, and it cannot learn the nonlinear information contained in feature interactions. A classic example is that LR cannot correctly classify the nonlinear XOR data; by introducing nonlinear features (feature generation), the XOR data can be made linearly separable in a higher-dimensional feature space, as shown in the following sample code:


# Generate the XOR dataset
import pandas as pd
xor_dataset = pd.DataFrame([[1,1,0],[1,0,1],[0,1,1],[0,0,0]], columns=['x0','x1','label'])
x, y = xor_dataset[['x0','x1']], xor_dataset['label']
xor_dataset.head()

# Logistic regression implemented with Keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
from tensorflow import random

np.random.seed(5)  # fix the random seed
random.set_seed(5)

model = Sequential()
model.add(Dense(1, input_dim=3, activation='sigmoid'))  # a single sigmoid unit = logistic regression
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy')

xor_dataset['x2'] = xor_dataset['x0'] * xor_dataset['x1']  # add the nonlinear (interaction) feature
x, y = xor_dataset[['x0','x1','x2']], xor_dataset['label']
model.fit(x, y, epochs=10000, verbose=False)
print("True labels:", y.values)
print("Model predictions:", model.predict(x).round())
# True labels: [0 1 1 0]   Model predictions: [0 1 1 0]

It is often said in the industry that "data and features determine the upper limit of machine learning, while models and algorithms only approach this upper limit." Since LR is a simple model, the quality of its features largely determines its final performance (that is, a simple model relies heavily on feature engineering).

There are three main approaches to feature generation (extraction) for LR:

  • **Manually derived business features:** the advantage of hand-crafted features is that they are more interpretable from a business point of view and closer to the actual business; the disadvantage is that they depend heavily on domain knowledge and are time-consuming to build.

  • Feature-derivation tools: for example, brute-force feature derivation with featuretools; its common primitives include aggregation (mean, maximum, minimum, etc.) and transformation (addition, subtraction, multiplication, and division between features). Brute-force derivation is fast; the disadvantages are that it consumes more computing resources, easily introduces noise, and is not suitable for scenarios that require feature interpretation. (Note: purely linear additions and subtractions of features are unnecessary for LR, since the model can express them by itself.)

  • Model-based approaches:

For example, POLY2 and the factorization machine (FM), which introduces latent vectors, can be regarded as extensions built on top of LR: they cross all features pairwise to generate nonlinear feature combinations.

However, methods such as FM can only perform second-order feature crosses. A more effective approach is to use GBDT to automatically select features and generate feature combinations, i.e., to extract the split-and-combination path of each GBDT tree as a new feature and feed the resulting feature vector into the LR model; this is the classic GBDT + LR method used in recommender systems. (Note: if the GBDT trees are too deep, the feature combinations are of high order; while this greatly improves the fitting ability of the LR model, it also easily introduces noise and causes overfitting.)


The following code implements GBDT + LR (based on the cancer-cell dataset): it extracts the GBDT leaf features and concatenates them with the original features, and training and evaluating the model gives a better classification result:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
# Assumed data preparation (the original post loads the cancer dataset in an earlier step):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10)

gbdt = GradientBoostingClassifier(n_estimators=50, random_state=10, subsample=0.8, max_depth=6,
                                  min_samples_split=20)
gbdt.fit(x_train, y_train)  # train GBDT on the training set

train_new_feature = gbdt.apply(x)  # leaf index of each sample in every tree of the trained model
print(train_new_feature.shape)
train_new_feature = train_new_feature.reshape(-1, 50)  # (n_samples, n_estimators)
display(train_new_feature)
print(train_new_feature.shape)

enc = OneHotEncoder()
enc.fit(train_new_feature)
train_new_feature2 = np.array(enc.transform(train_new_feature).toarray())  # one-hot representation

print(train_new_feature2.shape)
train_new_feature2
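The remaining step described above (concatenating the one-hot leaf features with the original features and training LR on them) is not shown in the original code; below is a minimal sketch, assuming the variables from the previous block are available:

# Sketch (assumed continuation): original features + one-hot GBDT leaf features -> LR
from sklearn.linear_model import LogisticRegression

x_combined = np.hstack([x, train_new_feature2])       # concatenate original and GBDT-derived features
lr = LogisticRegression(max_iter=5000)
lr.fit(x_combined, y)                                 # in practice, fit on the training split only
print("Accuracy (illustrative, on the full set):", lr.score(x_combined, y))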

2. Feature discretization and encoding

For continuous numerical inputs, LR usually requires max-min normalization of the features (x' = (x - min) / (max - min)), which scales each value into the range 0-1 and speeds up model computation and training convergence. In practice, however, the industry rarely feeds continuous values directly into a logistic regression model. Instead, continuous features are first discretized (common methods include equal-width, equal-frequency, chi-square, and decision-tree binning; the choice of binning directly affects model performance) and then encoded (one-hot, WOE) before being fed into the model.

The reason goes back to the principle of the model. Logistic regression is a generalized linear model: the model is nothing more than a weighted sum of the features, squashed into a probability by the sigmoid, so its feature expressiveness is very limited. Take the age feature in predicting whether a customer will make a deposit: in LR, age corresponds to a single weight w, and the output is sigmoid(... + age * w + ...), so the age value can only be expressed linearly through that one parameter w.

But for a feature like age, different age values should not have a linear relationship with the predicted probability of a deposit. For example, ages 0-18 may be negatively correlated with deposits while 19-55 may be positively correlated, which means different value ranges need different model parameters to be represented well. By discretizing the feature, for example splitting age into 4 bins and dummy-coding (one-hot) them into 4 features (if_age<18, if_18<age<30, if_30<age<55, if_55<age) as inputs to the LR model, 4 separate parameters can control the expression of these 4 discrete features: sigmoid(... + age1 * w1 + age2 * w2 + ...). This clearly increases the nonlinear expressiveness of the model and improves its fitting ability.
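A minimal sketch of this age discretization with pandas (the bin edges and data are illustrative):

# Sketch: discretize age into the 4 bins above and dummy (one-hot) encode them
import pandas as pd

df = pd.DataFrame({'age': [15, 25, 40, 60]})                        # illustrative data
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 30, 55, 200],
                       labels=['lt18', '18_30', '30_55', 'gt55'])   # hand-chosen bin edges
age_onehot = pd.get_dummies(df['age_bin'], prefix='age')            # 4 binary features for the LR input
print(age_onehot)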

In the risk-control field, the more common representation (encoding) after feature discretization is not one-hot but WOE encoding. WOE encoding computes, for each bin i, WOE_i = ln(Pyi / Pni), where Pyi is the share of all positive samples that fall into bin i (positives in the bin / all positives) and Pni is the share of all negative samples that fall into bin i (negatives in the bin / all negatives); this WOE value is used as the numerical representation of the bin.


Features that have been binned and WOE-encoded behave much like the decision process of a decision tree. Taking the age feature as an example: if age > 18 and age < 22 then return -0.57 (the age value is converted into the corresponding WOE value); if age > 44 then return 1.66; and so on. Feeding such binned and encoded features into LR (the bins and WOE values correspond to a tree's feature splits and leaf values) is quite similar to fusing a decision tree with LR, which improves the nonlinear expressiveness of the model.
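A minimal sketch of computing the WOE value of each bin with pandas, following the definition above (data and column names are illustrative):

# Sketch: WOE_i = ln(Pyi / Pni) per bin
import numpy as np
import pandas as pd

df = pd.DataFrame({'age_bin': ['lt18'] * 4 + ['18_44'] * 4 + ['gt44'] * 4,
                   'label':   [0, 0, 0, 1,   0, 1, 1, 0,    1, 1, 1, 0]})   # illustrative data
grouped = df.groupby('age_bin')['label']
pos = grouped.sum()                      # positive samples per bin
neg = grouped.count() - pos              # negative samples per bin
pyi = pos / pos.sum()                    # share of all positives falling into the bin
pni = neg / neg.sum()                    # share of all negatives falling into the bin
woe = np.log(pyi / pni)                  # bins with zero counts need smoothing in practice
print(woe)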

To summarize the advantages of discretization and encoding:

  • The fitting ability of logistic regression is limited. When a variable is discretized into N bins, each bin gets its own weight, which is equivalent to introducing nonlinearity into the model; this improves the fitting ability while keeping good interpretability. Moreover, after discretization it is easy to do feature crosses, going from M + N variables to M * N variables, which further improves expressiveness.

  • Discretized features are more robust to abnormal data: for example, if a feature is defined as 1 when age > 44 and 0 otherwise, an abnormal input such as "age = 200" would strongly interfere with the model if the feature were not discretized, whereas after discretization it simply falls into the corresponding bin and its impact is limited.

  • After discretization the model is more stable and less easily affected by noise, reducing the risk of overfitting: for example, with user age discretized so that 18-22 forms one interval, a user does not become a completely different sample just because they are one year older.

3. Feature selection

Feature selection is used to keep significant features and discard insignificant ones. It reduces computation, filters out noise, lowers the risk of overfitting, and improves model performance. For logistic regression, the following three selection methods are commonly used:

Filter methods: screen features using indicators such as missing rate, single-value rate, variance, Pearson correlation coefficient, VIF, IV value, PSI, and p-value;

Embedded methods: logistic regression with an L1 regularization term has a built-in feature-selection effect (sparse solution), as sketched right after this list;

Wrapper methods: select features with stepwise logistic regression and bidirectional search.
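A minimal sketch of the embedded method (L1-penalized LR with scikit-learn; the dataset and penalty strength are illustrative):

# Sketch: embedded feature selection via an L1-penalized logistic regression
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
l1_lr = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
l1_lr.fit(X, y)
coefs = l1_lr.named_steps['logisticregression'].coef_.ravel()
print("Features kept by L1 (non-zero weights):", int(np.sum(coefs != 0)), "of", X.shape[1])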

Among the filter indicators, VIF is a collinearity measure. The idea is to take each feature in turn as the target, fit it with the other features using linear regression, obtain the R^2 of that fit, and compute VIF = 1 / (1 - R^2). A VIF of 1 means the other features cannot fit the current feature at all, i.e., there is no collinearity between features (VIF < 10 is a commonly used threshold in engineering). For generalized linear models, collinearity mainly distorts the practical significance of features and their weight parameters (for example, a feature that should be positively correlated in business terms ends up with a negative weight), and it also weakens model interpretability and training stability.
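A minimal sketch of computing VIF with statsmodels (the dataset and the choice of columns are illustrative):

# Sketch: VIF of each feature (VIF = 1 / (1 - R^2) of regressing that feature on the others)
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(load_breast_cancer(as_frame=True).data.iloc[:, :5])   # first 5 features + intercept
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop('const'))   # VIF > 10 is a commonly used collinearity warning threshold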

4. Optimization at the model level

4.1 Intercept term

The fitting ability of logistic regression can be improved by including an intercept (bias) term b. The intercept can simply be understood as an extra model parameter b (equivalently, a weight w0 on an added constant feature column); such a model is slightly more expressive and fits the data better.

What if there is no intercept term b? The decision boundary of logistic regression is linear (W * X + b = 0). Without the intercept (i.e., W * X = 0), the decision boundary is forced to pass through the origin of the coordinate system. Such a restriction is likely to make the model converge slowly, reduce accuracy, and fit the data poorly, i.e., it easily underfits.
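A minimal sketch contrasting LR with and without an intercept in scikit-learn (synthetic data shifted away from the origin so that the restriction matters; the size of the gap depends on the data):

# Sketch: without an intercept the decision boundary must pass through the origin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=2, n_redundant=0, random_state=0)
X = X + 5.0                                   # shift the data away from the origin
acc_with_b = LogisticRegression(fit_intercept=True).fit(X, y).score(X, y)
acc_without_b = LogisticRegression(fit_intercept=False).fit(X, y).score(X, y)
print("Accuracy with intercept:", acc_with_b, "| without intercept:", acc_without_b)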

4.2 Regularization strategy

The risk of overfitting can be reduced by adding regularization terms. Commonly used strategies are L1 and L2 regularization:

  • L2 regularization (also known as ridge regression or Tikhonov regularization), usually called weight decay, adds a penalty term Ω(θ) to the objective function that pulls the weights toward the origin and makes the model simpler. From a Bayesian perspective, the L2 constraint can be viewed as placing a Gaussian prior on the model parameters (see "Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression"). The objective with the L2 term is J~(w) = J(w) + (α/2) * ||w||^2, and the corresponding weight update, with learning rate ϵ, is w ← (1 − ϵα) * w − ϵ * ∇J(w).

From the update rule above, adding weight decay modifies the learning rule: at each step the weight is first shrunk (multiplied by 1 − ϵα) before the gradient update is applied, which produces the weight-decay effect.

  • L1 regularization (as used in Lasso regression) adds a parameter penalty term Ω(θ) equal to the sum of the absolute values of the parameters. From a Bayesian point of view, the L1 constraint can be viewed as placing a Laplace prior on the model parameters. The objective with the L1 term is:

J~(w) = J(w) + α * ||w||_1

The corresponding weight update (where sgn(w) is the sign function, taking the sign of each parameter element-wise) is w ← w − ϵ * (α * sgn(w) + ∇J(w)). Under the effect of the −ϵα * sgn(w) term, each element of w is pushed toward 0 by a constant amount at every step, so some elements of w are likely to become exactly 0, producing sparsity.

To summarize L1 and L2 regularization:

Both L1 and L2 restrict the solution space and reduce model capacity, thereby reducing overfitting. The L2 norm constraint produces a smooth solution but not a sparse one, i.e., the parameters will not contain many zeros. Suppose the decision depends on two features: L2 regularization tends to combine their influence, giving high weight to both influential features, while L1 regularization tends to keep the parameter with the larger influence and drive the less influential one toward zero (a sparse-solution effect). In practice, L2 regularization often performs better than L1, but L1 compresses the model and reduces the amount of computation.
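A minimal sketch comparing the sparsity of L1 and L2 penalties in scikit-learn (dataset and penalty strength are illustrative):

# Sketch: L1 produces sparse weights, L2 produces small but dense weights
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)
print("Zero weights with L1:", int(np.sum(l1.coef_ == 0)), "| with L2:", int(np.sum(l2.coef_ == 0)))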

4.3 Multi-class Classification Tasks

There are two main approaches when logistic regression is applied to multi-class classification tasks:

  • One approach follows the binary-classification idea of the sigmoid activation and turns the multi-class problem into multiple binary classifications, in two ways: OvR (one-vs-rest) trains one binary classifier per class against all remaining classes combined and picks the class with the highest predicted probability; OvO (one-vs-one) trains a binary classifier for every pair of classes and, at prediction time, combines the results of all classifiers and picks the class chosen most often.

  • The other approach replaces the sigmoid activation with the softmax function; the corresponding model is also called multinomial logistic regression (softmax regression) and applies directly to multi-class scenarios. The softmax function simply maps the outputs of multiple neurons (one per class) to their proportion of the total output (range 0-1, interpretable as probabilities), and the class with the largest probability is taken as the prediction.

The softmax function, softmax(z_k) = exp(z_k) / Σ_j exp(z_j), and the corresponding multi-class objective assume that the classes are mutually exclusive: a sample's softmax probabilities over all classes always sum to 1. When using logistic regression with OvR for multi-class problems, each classifier outputs the probability of the sample belonging to one class versus the rest, and the values from the different classifiers do not necessarily sum to 1. Therefore, when the target classes are mutually exclusive (e.g., distinguishing pictures of cats, pigs, and dogs), softmax regression is usually used, while when the classes are not strictly mutually exclusive (e.g., distinguishing pop, rock, and Chinese music), multiple binary logistic regression classifiers can be built (multi-label classification can also be considered).
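A minimal sketch of the two strategies with scikit-learn on a 3-class dataset (in recent versions, plain LogisticRegression uses the multinomial/softmax formulation by default with the lbfgs solver; OvR is made explicit via a wrapper):

# Sketch: OvR (one binary LR per class) vs. multinomial (softmax) logistic regression
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # one-vs-rest
softmax = LogisticRegression(max_iter=1000).fit(X, y)                    # multinomial under default settings
print("OvR accuracy:", ovr.score(X, y), "| softmax accuracy:", softmax.score(X, y))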

4.4 Learning Objectives

Logistic regression minimizes the cross-entropy loss as its objective function: J(w) = −(1/N) * Σ [ y_i * log(p_i) + (1 − y_i) * log(1 − p_i) ], where p_i = sigmoid(w * x_i + b).

Why not use the MSE (mean squared error) instead? In short, for the following reasons (a small numerical check follows the list):

  • The underlying assumption of the MSE loss is that the errors follow a Gaussian distribution, which a binary classification problem does not satisfy.

  • The cross-entropy loss only focuses on the prediction error for the true class, whereas MSE pays equal attention to the gap between the predicted probability and the target on every class; besides rewarding the correct class, it also averages in the errors on the wrong classes.

  • Combined with the sigmoid, the MSE loss is non-convex for binary classification, and its derivative contains an extra sigmoid-derivative factor, which can be very small and slow down convergence, so minimization of the loss cannot be guaranteed. MSE is not completely unusable for classification, however; it can be considered when training with soft labels.
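A minimal numerical sketch of the slow-convergence point: for a single sample with sigmoid output p, the gradient of the cross-entropy loss with respect to the pre-sigmoid score z is p − y, while for MSE (taken as ½(p − y)^2) it is (p − y) * p * (1 − p), so a confidently wrong prediction receives almost no MSE gradient.

# Sketch: gradient w.r.t. z for a confidently wrong prediction (true label 1, very negative score)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_true, z = 1.0, -8.0
p = sigmoid(z)                              # about 0.0003
grad_bce = p - y_true                       # cross-entropy gradient: close to -1, learning stays fast
grad_mse = (p - y_true) * p * (1 - p)       # MSE gradient: nearly zero, learning stalls
print("BCE gradient:", grad_bce, "| MSE gradient:", grad_mse)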

4.5 Optimization algorithm

Logistic regression under maximum likelihood has no closed-form (analytical) solution, so iterative algorithms such as gradient descent are used to obtain good (locally optimal) parameter values.

If the model is built with a neural-network library such as Keras, gradient-descent-style optimizers such as SGD, Momentum, and Adam are available. For most tasks Adam is a reasonable first choice, and the effect of different optimizers can then be verified on the specific task.

If you use the scikit-learn library, the available solvers mainly include liblinear (coordinate descent), newton-cg (Newton conjugate gradient), lbfgs (quasi-Newton), and sag (stochastic average gradient descent). liblinear supports both L1 and L2 penalties but only OvR for multi-class problems; lbfgs, sag, and newton-cg support only L2 but allow both OvR and multinomial (MvM) multi-class; when the data volume is very large, sag is preferred.
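A minimal sketch of the solver and penalty choices above in scikit-learn (data and parameters are illustrative):

# Sketch: solver/penalty combinations for sklearn LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)                                      # scaling helps sag/lbfgs converge
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)     # liblinear: L1 or L2, OvR only
lr_l2 = LogisticRegression(penalty='l2', solver='lbfgs').fit(X, y)         # lbfgs: L2 only
lr_sag = LogisticRegression(solver='sag', max_iter=1000).fit(X, y)         # sag: preferred for very large data
print(lr_l1.score(X, y), lr_l2.score(X, y), lr_sag.score(X, y))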

4.6 Model Evaluation

**Optimize the model threshold (cutoff point):** when the evaluation metrics are classification precision, recall, and the like, the classification effect can be improved by tuning the decision threshold (default 0.5). A common practice is to examine the precision and recall at different thresholds (the PR curve) and pick an appropriate threshold as a trade-off.
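A minimal sketch of choosing a threshold from the precision-recall curve with scikit-learn (maximizing F1 here is just one illustrative trade-off rule):

# Sketch: pick a decision threshold from the PR curve instead of the default 0.5
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])                 # thresholds has one fewer element than precision/recall
print("Chosen threshold:", thresholds[best], "| precision:", precision[best], "| recall:", recall[best])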

4.7 Interpretability

A great advantage of the logistic regression model is its interpretability. As mentioned in the earlier section, discretization and encoding (such as one-hot) improve both the fitting effect and the interpretability. With one-hot encoding of discretized features, the decision process is simply the weighted sum of the feature bins Xn with the model weights Wn, converted into a probability by the sigmoid; the actual influence of each feature on the decision can be read off from the magnitude and sign of its weight. For example, if the learned weight W for the feature "age in [18, 30]" is -0.8, that feature is negatively correlated with the prediction.
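A minimal sketch of reading feature influence from the learned weights (the one-hot age bins and labels are illustrative):

# Sketch: interpret LR by pairing each one-hot feature with its learned weight
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'age_lt18':  [1, 0, 0, 0], 'age_18_30': [0, 1, 0, 0],
                  'age_30_55': [0, 0, 1, 0], 'age_gt55':  [0, 0, 0, 1]})   # illustrative one-hot bins
y = [0, 1, 1, 0]
lr = LogisticRegression().fit(X, y)
weights = pd.Series(lr.coef_.ravel(), index=X.columns).sort_values()
print(weights)   # a negative weight means the bin pushes the prediction toward the negative class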


Origin blog.csdn.net/qq_34160248/article/details/130715626