Logistic regression [Brief summary of machine learning notes]

Simply put, logistic regression is a machine learning method for solving binary classification (0 or 1) problems; it estimates the probability that an example belongs to a class.

So what is the relationship between logistic regression and linear regression?

Logistic regression and linear regression are both generalized linear models. Logistic regression assumes that the dependent variable y follows a Bernoulli distribution, while linear regression assumes that y follows a Gaussian distribution. The two therefore have much in common: if the Sigmoid mapping function is removed, logistic regression reduces to linear regression. In other words, logistic regression is theoretically grounded in linear regression, but it introduces a nonlinearity through the Sigmoid function, which lets it handle 0/1 classification problems easily.

No matter how far machine learning advances, logistic regression remains a model loved by industry and widely used, because it has some irreplaceable advantages:
1. Logistic regression fits linear relationships extremely well. Data in which the features and the label have a very strong linear relationship — credit-card fraud detection and scorecard construction in finance, marketing forecasting in e-commerce, and similar data — are exactly where logistic regression excels. Although gradient boosted trees (GBDT) now often outperform it and are used by many data-consulting companies, the dominance of logistic regression in finance, and in banking in particular, is still unshaken. (Conversely, on nonlinear data logistic regression often performs worse than guessing, so if you already know the relationships in your data are nonlinear, do not put blind faith in logistic regression.)
2. Logistic regression is fast to compute: for linear data, its fitting and prediction are (most of the time) very fast, and its computational efficiency is better than that of SVM and random forests; in personal tests the difference is especially visible on large data sets.
3. The classification results returned by logistic regression are not fixed 0/1 labels but probability-like decimals, so they can be used as continuous values. For example, when building a scorecard we need not only to decide whether a customer will default but also to give a "credit score", and computing that score requires the log odds derived from the class probability. Classifiers such as decision trees and random forests can produce class labels, but they cannot help us compute such scores. (Of course, in sklearn a decision tree can also output probabilities through the predict_proba interface, but generally speaking an ordinary decision tree does not provide this kind of score.)
In addition, logistic regression is quite robust to noise. When Forbes magazine discussed the advantages of logistic regression, it even claimed that "technically speaking, when the AUC of the best model is below 0.8, logistic regression is clearly better than tree models." Moreover, logistic regression tends to perform better on small data sets, while tree models perform better on large ones.
In essence, logistic regression is a classifier that returns log-odds-based probabilities and performs well on linear data; it is used mainly in the financial field. Its mathematical goal is to solve for the parameter values that make the model fit the data best, so as to construct a prediction function; the feature matrix is then fed into this function to compute the logistic regression output y. Note that although logistic regression is most familiar for binary classification, it can also handle multi-class problems.

To master logistic regression, you must master two points:
1. What is the input value in logistic regression?
2. How to interpret the output of logistic regression.
The input of logistic regression is the result of a linear regression.

activation function

sigmoid function

Formulas and properties of Sigmoid function
The Sigmoid function is an S-shaped function: as the independent variable z approaches positive infinity, the dependent variable g(z) approaches 1, and as z approaches negative infinity, g(z) approaches 0. It maps any real number into the (0, 1) interval, which makes it useful for converting an arbitrary-valued function into one suited to binary classification.
Because of this property, the Sigmoid function can also be viewed as a kind of normalization. Like MinMaxScaler, it is a "scaling" function used in data preprocessing that compresses data into [0, 1]. The difference is that MinMaxScaler can actually reach 0 and 1 (after scaling, the maximum becomes 1 and the minimum becomes 0), whereas the Sigmoid function only approaches 0 and 1 asymptotically.

$g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
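As a quick illustration, here is a minimal NumPy sketch of the Sigmoid function (the variable names are illustrative, not from the original notes):

import numpy as np

def sigmoid(z):
    # map any real number z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # approaches 0 on the left, equals 0.5 at z = 0, approaches 1 on the right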
Judgment criteria

  • The linear regression result is fed into the sigmoid function
  • Output: a probability value in the interval [0, 1]; by default 0.5 is used as the threshold

Logistic regression makes its final classification by comparing the probability of belonging to a given class with the threshold; that class is marked as 1 (the positive class) by default, and the other class as 0 (the negative class). (This convention is convenient for computing the loss.)

Interpretation of the output (important): suppose there are two classes A and B, and suppose the probability we compute is the probability of belonging to class A (labelled 1). If a sample fed into the logistic regression produces an output of 0.6, that probability exceeds 0.5, so the training or prediction result is class A (1). Conversely, if the result is 0.3, the training or prediction result is class B (0).

Recall that for linear regression we measured the prediction error with the mean squared error. For logistic regression, how should we measure the loss when a prediction is wrong?

Loss and optimization

The loss of logistic regression is called the log-likelihood loss. The formula is as follows.
For the two classes separately:

$\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

How should the per-class formula be understood? It helps to look at the graph of the log function.
The combined, complete loss function:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log\big(h_\theta(x_i)\big) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big)\right]$

Next, let us work through the earlier example with this loss to see what it means.
Since the loss uses $-\log(P)$, the larger $P$ is, the smaller the loss; with this in mind we can analyse the loss formula.

Optimization
Logistic regression likewise uses the gradient descent optimization algorithm to reduce the value of the loss function. This updates the weight parameters of the linear part of the model, increasing the predicted probability for samples that truly belong to class 1 and decreasing it for samples that truly belong to class 0.
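To make the optimization step concrete, here is a minimal sketch of batch gradient descent on the log-likelihood loss above; it is an illustration under the usual definitions (X is the feature matrix, y the 0/1 labels), not code from the original notes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, lr=0.1):
    # one batch gradient-descent step on the average log loss
    p = sigmoid(X @ theta)            # predicted probability of class 1
    grad = X.T @ (p - y) / len(y)     # gradient of the average log loss w.r.t. theta
    return theta - lr * grad          # move against the gradient

# toy data: a bias column plus one feature, four samples (purely illustrative)
X = np.array([[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1, 1, 0, 0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y)
print(theta)  # the learned weights separate the two classes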

logistic regression api

Logistic-regression-related classes and what they do:
• linear_model.LogisticRegression — logistic regression classifier (also called logit regression, maximum-entropy classifier)
• linear_model.LogisticRegressionCV — logistic regression classifier with built-in cross-validation
• linear_model.logistic_regression_path — compute logistic regression models along a path of regularization parameters
• linear_model.SGDClassifier — linear classifiers (SVM, logistic regression, etc.) fitted with stochastic gradient descent
• linear_model.SGDRegressor — linear regression model fitted with stochastic gradient descent, minimizing a regularized loss function
• metrics.log_loss — logarithmic loss, also known as logistic loss or cross-entropy loss

Other classes involved:
• metrics.confusion_matrix — confusion matrix, one of the model evaluation metrics
• metrics.roc_auc_score — area under the ROC curve (AUC), one of the model evaluation metrics
• metrics.accuracy_score — accuracy, one of the model evaluation metrics
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
  • Solver options: {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}.

    • For small data sets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster for large data sets.
    • For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle multinomial losses; 'liblinear' is limited to 'one-versus-rest' classification.
    • 'liblinear': uses coordinate descent to iteratively optimize the loss function. 'lbfgs': a quasi-Newton method that uses the second-derivative (Hessian) matrix of the loss function to optimize it iteratively; recommended for smaller data sets. 'newton-cg': a variant of Newton's method. 'sag': stochastic average gradient descent; each iteration uses only a subset of the samples to compute the gradient, which suits data with many samples. 'saga' is a variant of 'sag' that also supports the non-smooth L1 penalty (penalty="l1"), so it is often the solver of choice for sparse multinomial logistic regression.
  • multi_class='auto':

    • 'ovr' or 'multinomial': one-vs-rest or multinomial. If 'ovr' is chosen, a binary classification problem is trained for each class. For 'multinomial', the loss minimized is the multinomial loss fit across the entire probability distribution, even when the data is binary. 'multinomial' is not available when solver='liblinear'. 'auto' selects 'ovr' if the data is binary or if solver='liblinear', and otherwise selects 'multinomial'.
  • penalty: type of regularization.
    You can pass "l1" or "l2" to specify which regularization to use; if left unspecified the default is "l2".
    Note that if you choose "l1" regularization the solver can only be "liblinear" (or "saga", which also supports L1 as noted above),
    whereas with "l2" regularization all of the solvers listed above can be used.

  • C: regularization strength — actually the inverse of the regularization strength.

    C must be a floating-point number greater than 0; if left unspecified it defaults to 1.0, i.e. by default the regularization term and the loss have equal weight.
    The smaller C is, the more heavily the penalty term weighs on the loss function and the stronger the regularization effect.
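A small sketch of the effect of C (using the breast-cancer data purely as an example): the smaller C is, the more strongly the coefficients are shrunk toward zero.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

for C in (1.0, 0.01):
    clf = LogisticRegression(C=C, penalty="l2", solver="liblinear").fit(X, y)
    # a smaller C means stronger regularization, so the coefficients are pulled toward 0
    print(C, np.abs(clf.coef_).sum())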

method:

(1) decision_function(self, X): predict confidence scores for samples. The confidence score of a sample is its signed distance from the separating hyperplane. Parameters: X: array-like or sparse matrix of shape (n_samples, n_features). Returns: an array of shape (n_samples,) for binary problems, otherwise (n_samples, n_classes), giving the confidence score of each (sample, class) pair. In binary classification, a score greater than 0 means the sample would be predicted as self.classes_[1].
(2) densify(self): convert the coefficient matrix to a dense numpy.ndarray. This is the default format of coef_ and is required for fitting, so this method only needs to be called on models that were previously sparsified; otherwise it is a no-op. Returns:
self : estimator.
(3) fit(self, X, y, sample_weight=None): fit the model on the given training data. Parameters: X: {array-like, sparse matrix} of shape (n_samples, n_features), the training vectors. y: array-like of shape (n_samples,), the target vector relative to X. sample_weight: array-like of shape (n_samples,), optional, the weight assigned to each sample; if not provided, every sample gets unit weight. Returns: the fitted estimator.
(4) get_params(self, deep=True): get the estimator's parameters. Parameters: deep: boolean, optional; if True, also return the parameters of contained sub-objects that are estimators. Returns: params: a mapping from parameter name to its value.
(5) predict(self, X): predict the class labels of the samples in X. Parameters: X: array_like or sparse matrix of shape (n_samples, n_features), the samples. Returns: C: array of shape [n_samples], the predicted label of each sample.
(6) predict_log_proba(self, X): the logarithm of the probability estimates. The returned estimates for all classes are ordered by class label. Parameters: X, the samples to predict.
(7) predict_proba(self, X): class probability estimates. The returned estimates for all classes are ordered by class label. For multi-class problems, if multi_class is set to "multinomial", the softmax function is used to obtain the predicted probability of each class; otherwise the one-vs-rest approach is used, i.e. the logistic function gives the probability of each class being positive, and these values are normalized across all classes. Parameters: X, the samples to predict.
(8) score(self, X, y, sample_weight=None): return the mean accuracy on the given test data and labels. In multi-label classification this is the subset accuracy, a harsh metric, since the entire label set of each sample must be predicted correctly. X: array-like, shape = (n_samples, n_features), the test samples; y: array-like, shape = (n_samples) or (n_samples, n_outputs), the corresponding labels; sample_weight: array-like, shape = [n_samples], optional, sample weights. Returns: score: float, the mean accuracy of self.predict(X) with respect to y.
(9) set_params(self, **params): set the parameters of this estimator.
(10) sparsify(self): convert the coefficient matrix to sparse format. Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more memory- and storage-efficient than the usual numpy.ndarray. The intercept_ member is not converted.
NOTE: for non-sparse models, i.e. when coef_ does not contain many zeros, this may actually increase memory usage, so use the method with care. As a rule of thumb, the number of zero elements, which can be computed with (coef_ == 0).sum(), should exceed 50% for this to provide a significant advantage. After calling this method, further use of the partial_fit method (if any) will have no effect until densify is called.
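The following sketch exercises the main methods listed above on a small binary problem (the breast-cancer data is used only as a convenient example):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(solver="liblinear")
clf.fit(X_train, y_train)                     # (3) fit the model

print(clf.predict(X_test[:5]))                # (5) predicted class labels
print(clf.predict_proba(X_test[:5]))          # (7) class probabilities, columns ordered by clf.classes_
print(clf.decision_function(X_test[:5]))      # (1) signed distance to the hyperplane
print(clf.score(X_test, y_test))              # (8) mean accuracy
print(clf.get_params()["C"])                  # (4) estimator parameters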

By default, the class with the smaller number of samples is treated as the positive class.
The LogisticRegression estimator is conceptually equivalent to SGDClassifier(loss="log", penalty="l2"): SGDClassifier fits the model with ordinary stochastic gradient descent, whereas LogisticRegression uses dedicated solvers (for example SAG).
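As a rough illustration of that equivalence, the sketch below fits a logistic regression with plain stochastic gradient descent via SGDClassifier; note that the loss is named "log_loss" in scikit-learn 1.1+ and "log" in older releases, so adjust to your version:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier

X, y = load_breast_cancer(return_X_y=True)

# logistic regression fitted by ordinary stochastic gradient descent
sgd_log_reg = SGDClassifier(loss="log_loss", penalty="l2", max_iter=1000)
sgd_log_reg.fit(X, y)
print(sgd_log_reg.score(X, y))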

Case: Cancer Classification Prediction - Benign/Malignant Breast Cancer Tumor Prediction
Download address of original data: https://archive.ics.uci.edu/ml/machine-learning-databases/

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


# 1. Load the data

names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                   'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                   'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                  names=names)
data.head()
# 2. Basic data processing
# 2.1 Handle missing values
data = data.replace(to_replace="?", value=np.NaN)
data = data.dropna()
# 2.2 Select the feature values and the target
x = data.iloc[:, 1:10]
x.head()

y = data["Class"]
y.head()

# 2.3 Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# 3. Feature engineering (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 4. Machine learning (logistic regression)
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
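A natural continuation of the case (a sketch, not from the original notes; it reuses the classification_report imported above and the fact that in this data set the Class column is 2 for benign and 4 for malignant):

# 5. Model evaluation
y_predict = estimator.predict(x_test)
print("accuracy:", estimator.score(x_test, y_test))

print(classification_report(y_test, y_predict,
                            labels=(2, 4),
                            target_names=("benign", "malignant")))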

Classification assessment method

1. Confusion matrix
    True positives (TP)
    False negatives (FN)
    False positives (FP)
    True negatives (TN)
2. Precision and recall
    Accuracy — how often the prediction is right overall:
        (TP+TN)/(TP+TN+FN+FP)
    Precision — of the samples predicted positive, how many really are:
        TP/(TP+FP)
    Recall — of the samples that really are positive, how many are found:
        TP/(TP+FN)
    F1-score
        reflects the robustness of the model
3. API
    sklearn.metrics.classification_report(y_true, y_pred)


Precision and Recall

confusion matrix

  • Precision: of the samples predicted positive, the proportion that are actually positive: $precision = \frac{TP}{TP+FP}$
  • Recall: of the samples that are actually positive, the proportion predicted positive (how completely the positives are found, i.e. the ability to pick out positive samples): $recall = \frac{TP}{TP+FN}$
  • F1-score: reflects the robustness of the model: $F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2TP}{2TP + FN + FP}$
    1) Basic theory
  • Adjusting the threshold changes the trade-off between precision and recall;
  1. Threshold: the classification boundary value; a sample is classified as 1 when score > threshold and as 0 when score < threshold;
  2. As the threshold increases, precision rises and recall falls; as the threshold decreases, precision falls and recall rises;
  • Precision and recall are two quantities in tension with each other; they cannot both be increased at the same time;
  • The decision boundary of logistic regression does not have to be $\theta^{T} \cdot x_b = 0$; it can be any value, chosen according to the business need: with $\theta^{T} \cdot x_b = threshold$, a sample is classified as 1 when its score is greater than the threshold and as 0 when it is smaller;
  • Extended to other algorithms: first compute a score, then compare it with the threshold to make the classification decision;

2) An example of the relationship between precision and recall (1)

  • When the computed score > 0, the sample is classified as ★; when score < 0, it is classified as ●;
  • ★ is the class of interest;
  • Scenario 1: threshold = 0
  1. Precision: 4 / 5 = 0.80;
  2. Recall: 4 / 6 = 0.67;
  • Scenario 2: threshold > 0
  1. Precision: 2 / 2 = 1.00;
  2. Recall: 2 / 6 = 0.33;
  • Scenario 3: threshold < 0
  1. Precision: 6 / 8 = 0.75;
  2. Recall: 6 / 6 = 1.00;

3) Give an example to illustrate the relationship between precision rate and recall rate (2)

  • In the predict() method of the LogisticRegression() class, the default threshold is 0; classification compares each sample's score, computed by the decision_function() method, against it: score < 0 gives class 0, score > 0 gives class 1;

  • .decision_function(X_test): computes the score of every sample to be predicted and returns the result as a vector;

The score here is not a probability; it is the sample's score under an alternative way of making the classification decision, and the sample is classified according to this score;

example

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

Threshold = 0

y_predict_1 = log_reg.predict(X_test)

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predict_1)
# confusion matrix: array([[403, 2],
#                          [9, 36]], dtype=int64)

from sklearn.metrics import precision_score
print('precision_score',precision_score(y_test, y_predict_1))
# precision: 0.9473684210526315

from sklearn.metrics import recall_score
print('recall_score',recall_score(y_test, y_predict_1))
# recall: 0.8

Threshold = 5

decision_score = log_reg.decision_function(X_test)

# change the threshold on decision_score and obtain the new prediction y_predict_2 through a vector comparison;
# decision_score >= 5: raise the threshold to 5 (i.e. a stricter criterion for predicting the positive class)
y_predict_2 = np.array(decision_score >= 5, dtype='int')

confusion_matrix(y_test, y_predict_2)
# confusion matrix: array([[404,   1],
#                          [ 21,  24]], dtype=int64)

print('precision_score',precision_score(y_test, y_predict_2))
# precision: 0.96

print('recall_score',recall_score(y_test, y_predict_2))
# recall: 0.5333333333333333

The idea of changing the threshold: start from the decision_function() scores, reset the threshold, and instead of calling predict(), obtain the new classification result by comparing the score vector with the new threshold;

Threshold = -5

decision_score = log_reg.decision_function(X_test)
y_predict_3 = np.array(decision_score >= -5, dtype='int')

confusion_matrix(y_test, y_predict_3)
# confusion matrix: array([[390,  15],
#                          [5,  40]], dtype=int64)

print('precision_score',precision_score(y_test, y_predict_3))
# precision: 0.7272727272727273

print('recall_score',recall_score(y_test, y_predict_3))
# recall: 0.8888888888888888
  • Analysis:
  1. Precision and recall constrain and balance each other: when one rises, the other falls;
  2. The larger the threshold, the higher the precision and the lower the recall; the smaller the threshold, the lower the precision and the higher the recall;
  3. To change the threshold: 1) obtain the prediction scores with the decision_function() method of the LogisticRegression() class; 2) instead of using predict(), reset the threshold and classify the samples directly from their scores via a vector comparison;

Classification Assessment Report API

sklearn.metrics.classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')
    • y_true: the true target values
    • y_pred: the target values predicted by the estimator
    • labels: the label values of the classes to include in the report
    • target_names: the display names of the target classes
    • return: precision and recall for each class
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

y_pred = [1, 1, 0]
y_true = [1, 1, 1]
print(classification_report(y_true, y_pred, labels=[1, 2, 3]))

Consider the following situation: if 99 samples are cancer and 1 sample is non-cancer, then even if I simply predict every sample as positive (cancer is the positive class by default), the accuracy is 99%, yet the model is useless. This is the evaluation problem under sample imbalance.
Question: how should a model be evaluated when the samples are imbalanced?

Precision-recall curve (P-R curve)

  • Corresponding to the classification algorithm, you can call its decision_function() method to get the score of the algorithm's decision for each sample;

  • In the LogisticRegression() algorithm, the default decision boundary threshold is 0. If the score value of a sample is greater than 0, the sample is classified as 1; if the score value of the sample is less than 0, the sample is classified as 0.

  • Idea : As the threshold changes, the precision rate and recall rate change accordingly;

  1. Set different threshold values:

    decision_scores = log_reg.decision_function(X_test)
    thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), 0.1)
    

0.1 is the step size of the interval value;
1) Code to plot the threshold–precision/recall curves and the P-R curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)

precisions = []
recalls = []
thresholds = np.arange(np.min(decision_scores), np.max(decision_scores), 0.1)

for threshold in thresholds:
    y_predict = np.array(decision_scores >= threshold, dtype='int')
    precisions.append(precision_score(y_test, y_predict))
    recalls.append(recall_score(y_test, y_predict))

threshold - Precision, Recall curve

plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.show()

P-R curve

plt.plot(precisions, recalls)
plt.show()

2) precision_recall_curve() method in scikit-learn

It computes precisions, recalls, and thresholds directly from y_test and decision_scores;

from sklearn.metrics import precision_recall_curve
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()
y[digits.target==9] = 1
y[digits.target!=9] = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
decision_scores = log_reg.decision_function(X_test)

precisions, recalls, thresholds = precision_recall_curve(y_test, decision_scores)

precisions.shape
# (145,)

recalls.shape
# (145,)

thresholds.shape
# (144,)
  1. Phenomenon : The number of elements in thresholds is 1 less than the number of elements in precisions and recalls;
  2. Reason : When precision = 1 and recall = 0, there is no threshold;

threshold - Precision, Recall curve

plt.plot(thresholds, precisions[:-1])
plt.plot(thresholds, recalls[:-1])
plt.show()

P-R curve

plt.plot(precisions, recalls)
plt.show()

The point where the curve starts to drop sharply is often near where precision and recall are balanced; one way to locate it is sketched below.
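One simple way to locate that balance point (a sketch using the precisions, recalls, and thresholds arrays computed above) is to find the threshold at which precision and recall are closest:

import numpy as np

# precision_recall_curve returns one more precision/recall value than thresholds, so drop the last pair
best_idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
print("threshold:", thresholds[best_idx])
print("precision:", precisions[best_idx], "recall:", recalls[best_idx])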

3) Analysis
Different models correspond to different precision-recall curves:
1. The model whose curve lies further to the outside is better; equivalently, the model whose curve encloses a larger area with the coordinate axes is better.
2. The P-R curve can also be used as a criterion for choosing algorithms, models, and hyperparameters; in practice, however, the ROC curve is more commonly used for that purpose.

ROC curve and AUC indicator

    ROC curve
        plotted from the TPR and the FPR; once drawn, it yields a summary metric, the AUC
    AUC
        the closer to 1, the better the model
        the closer to 0, the worse the model
        the closer to 0.5, the closer the model is to random guessing
    Note:
        this metric is mainly used to evaluate imbalanced binary classification problems
   API
    sklearn.metrics.roc_auc_score(y_true, y_score)
        y_true — the positive class must be encoded as 1 and the negative class as 0
        
Drawing the ROC curve
    1. Build the model and sort its predicted probabilities from largest to smallest.
    2. Starting from the largest probability, take each value in turn as the threshold, compute the TPR and FPR each time, and plot the resulting points.
    3. In effect this computes an integral (an area).

TPR and FPR

  • TPR = TP / (TP + FN)

    • among all samples whose true class is 1, the proportion predicted as class 1
  • FPR = FP / (FP + TN)

    • among all samples whose true class is 0, the proportion predicted as class 1
  • The horizontal axis of the ROC curve is the FP rate and the vertical axis is the TP rate. When the two are equal, it means that for any sample, whether its true class is 1 or 0, the classifier predicts 1 with the same probability; in that case the AUC is 0.5.
    AUC indicator

  • The probabilistic meaning of AUC: if one positive and one negative sample are drawn at random, AUC is the probability that the positive sample's score is higher than the negative sample's

  • The minimum value of AUC is 0.5, the maximum value is 1, the higher the value, the better

  • AUC=1, perfect classifier. When using this prediction model, perfect predictions can be obtained no matter what threshold is set. In most prediction situations, there is no perfect classifier.

  • 0.5<AUC<1, better than random guessing. This classifier (model) can have predictive value if the threshold is properly set.

The final AUC ranges between [0.5, 1], and the closer to 1 the better

AUC calculation API
from sklearn.metrics import roc_auc_score

sklearn.metrics.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)
  • Calculate the ROC curve area, that is, the AUC value
  • y_true: the true class of each sample, which must be encoded as 0 (negative) or 1 (positive)
  • y_score: prediction score, which can be the estimated probability of the positive class, the confidence value, or the return value of the classifier method
import numpy as np
from sklearn import metrics
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
fpr
# array([ 0. ,  0.5,  0.5,  1. ])

tpr
# array([ 0.5,  0.5,  1. ,  1. ])

thresholds
# array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])

This is the basic computation. Taking each of the thresholds above in turn as the cut-off, with the FPR on the x axis and the TPR on the y axis, the ROC curve can be drawn point by point; the area enclosed under it equals 0.75, as the sketch below confirms.
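A minimal sketch of that plot, reusing the fpr and tpr arrays computed by roc_curve above:

import matplotlib.pyplot as plt
from sklearn import metrics

plt.plot(fpr, tpr, marker='o')            # ROC curve: FPR on the x axis, TPR on the y axis
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal of random guessing, AUC = 0.5
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()

print(metrics.auc(fpr, tpr))  # 0.75, the same area as described above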

Call roc_auc_score

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)
#0.75

Differences from P-R Curve

  • P-R curve: used to judge the quality of a model trained on extremely skewed data;

  • ROC curve: used to compare the relative merits of two models;

Here a "model" can mean models obtained from the same algorithm with different hyperparameters, or models obtained from different algorithms;

Decision boundaries for a data set

(Limited to linear regression and logistic regression)

  • In the two-dimensional feature space, the decision boundary is a theoretical straight line , which is determined by the coefficients and intercepts of the linear model. There may not necessarily be samples that meet this condition;

  • If the sample has only two features, the decision boundary can be expressed as:

  • If $\theta^T \cdot x_b = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, the boundary is a straight line, because in a classification problem the coordinate axes of the feature space all represent features;
    it follows that $x_2 = \dfrac{-\theta_0 - \theta_1 x_1}{\theta_2}$
    1) Drawing the decision boundary in a two-dimensional feature space

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


iris = datasets.load_iris()

X = iris.data
y = iris.target
X = X[y<2, :2]
y = y[y<2]

plt.scatter(X[y==0, 0], X[y==0, 1], color='red')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')
plt.show()


x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)

# x2(): for a given x1, return the x2 value on the line that satisfies the decision-boundary relation
def x2(x1):
    return (-log_reg.coef_[0][0] * x1 - log_reg.intercept_) / log_reg.coef_[0][1]


x1_plot = np.linspace(4, 8, 100).reshape(-1, 1)
x2_plot = x2(x1_plot)

plt.scatter(X[y==0, 0], X[y==0, 1], color='red')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')
plt.plot(x1_plot, x2_plot)
plt.axis([4, 8, 2, 4.5])
plt.show()

How to draw irregular decision boundaries

  • Idea: the feature space contains infinitely many points; by subdividing it into a dense grid of points and predicting the class of every grid point with the model, we can plot the predictions, and the boundary between points of different colors is the classification decision boundary;

  • Subdivision method: divide each coordinate axis of the feature space into n equal parts (only two features are shown when visualizing), so the feature space becomes an n × n grid of points (each point acting as a sample); use the model to predict the class of these n² points and display the predictions in the feature space;

# plot_decision_boundary(): draw the model's decision boundary in a 2-D feature space
def plot_decision_boundary(model, axis):
    # model: the fitted model
    # axis: the extent of the plotted region; axis[0], axis[1], axis[2], axis[3] are the x-axis and y-axis ranges
    
    # 1) split the axes into a dense grid: each axis is divided into (range max - range min) * 100 points
    # np.meshgrid() builds the grid coordinates
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)).reshape(-1,1),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)).reshape(-1,1)
    )
    # np.c_ stacks the flattened grid coordinates into sample points
    X_new = np.c_[x0.ravel(), x1.ravel()]
    
    # 2) model.predict(X_new): predict every grid point with the model
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    
    # 3) plot the prediction results
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    
    plt.contourf(x0, x1, zz, linewidth=5, cmap=custom_cmap)
plot_decision_boundary(log_reg, axis=[4, 7.5, 1.5, 4.5])
plot_decision_boundary(log_reg, axis=[4, 7.5, 1.5, 4.5])
plt.scatter(X[y==0, 0], X[y==0, 1], color='red')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')
plt.show()

The two colored regions are the classification results of the n² grid points that subdivide the feature space; the line separating the two regions is the model's decision boundary.

2) Draw the decision boundary of the kNN algorithm model (2 classes)

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train, y_train)

plot_decision_boundary(knn_clf, axis=[4, 7.5, 1.5, 4.5])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.show()

3) Draw the decision boundary of the kNN algorithm model (3 classes)

knn_clf_all = KNeighborsClassifier()
knn_clf_all.fit(iris.data[:,:2], iris.target)

plot_decision_boundary(knn_clf_all, axis=[4, 8, 1.5, 4.5])
plt.scatter(iris.data[iris.target==0,0], iris.data[iris.target==0,1])
plt.scatter(iris.data[iris.target==1,0], iris.data[iris.target==1,1])
plt.scatter(iris.data[iris.target==2,0], iris.data[iris.target==2,1])
plt.show()


  • question
  1. The decision boundary is irregular, and the boundary between the yellow area and the blue area in the figure is not obvious;
  2. There are green points in the yellow area and orange points in the blue area;
  • Reason : The model may be overfitting;
  • Solution: retrain the model with adjusted parameters;
  1. n_neighbors=5, the k parameter of the model is 5, which is too small, causing the model to be too complex; (in the kNN algorithm, the smaller the k value, the more complex the model)

Change the k parameter and redraw

knn_clf_all = KNeighborsClassifier(n_neighbors=50)
knn_clf_all.fit(iris.data[:,:2], iris.target)

plot_decision_boundary(knn_clf_all, axis=[4, 8, 1.5, 4.5])
plt.scatter(iris.data[iris.target==0,0], iris.data[iris.target==0,1])
plt.scatter(iris.data[iris.target==1,0], iris.data[iris.target==1,1])
plt.scatter(iris.data[iris.target==2,0], iris.data[iris.target==2,1])
plt.show()


Logistic regression uses polynomial features

  • The decision boundary in logistic regression is essentially equivalent to finding a straight line in the feature plane, and using this straight line to divide the corresponding categories of all samples;

  • Logistic regression can only solve binary classification problems (including linear and nonlinear problems), so its decision boundary can only divide the feature plane into two parts;

  • Problem: classifying with a straight line is too simple, because in many cases the samples' true decision boundary is not a straight line — the distribution of the sample points is nonlinear, as in the data generated below;
  • Scheme: introduce polynomial terms, which changes the features and hence how the samples are distributed in feature space;

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
X = np.random.normal(0, 1, size=(200, 2))
y = np.array(X[:,0]**2 + X[:,1]**2 < 1.5, dtype='int')

plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()

Use logistic regression algorithm (without adding polynomial terms)

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)

def plot_decision_boundary(model, axis):
    
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)).reshape(-1,1),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)).reshape(-1,1)
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    
    plt.contourf(x0, x1, zz, linewidth=5, cmap=custom_cmap)

plot_decision_boundary(log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()

Use logistic regression algorithm (adding polynomial terms)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialLogisticRegression(degree):
    return Pipeline([
        # pipeline step 1: add polynomial terms to the sample features;
        ('poly', PolynomialFeatures(degree=degree)),
        # pipeline step 2: standardize the data;
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])

poly_log_reg = PolynomialLogisticRegression(degree=2)
poly_log_reg.fit(X, y)

plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()


Logistic regression regularization

1. Basic understanding

  • When using the logistic regression algorithm to train the model, polynomial terms are introduced into the model to make the model generate irregular decision boundaries and classify nonlinear data ;

  • Problem : After introducing polynomial terms, the model becomes complex and may cause overfitting ;

  • Plan : Regularize the model, add a regularization term (αL2) to the loss function, generate a new loss function, and optimize the new loss function ;

  • Optimizing the new loss function:

  1. on the one hand, we still want the original loss function to be as small as possible;
  2. on the other hand, the L2 regularization term (which contains the parameter vector θ) limits the size of θ;
  3. the parameter α is introduced to balance the importance of the two parts (the original loss function and the L2 term) in the new loss function; an αL1 regularization term can of course be used instead;

2. Other methods of regularization

  • Another way of writing the regularization: only the form differs, the regularization itself is the same;

$C \cdot J(\theta) + L1 \quad \text{or} \quad C \cdot J(\theta) + L2$ — the hyperparameter C is placed in front of the original loss $J(\theta)$, while the regularization term keeps a coefficient of 1.

  1. Changed the position of hyperparameters: α, C;
  2. If the hyperparameter C is larger, the status of the original loss function J(θ) is relatively important, and when optimizing the loss function, focus on optimizing J(θ) to minimize it;
  3. If the hyperparameter C is very small, the status of the regular term L2 is relatively important. When optimizing the loss function, we mainly focus on optimizing L2 to make the elements in the parameter θ as small as possible;
  4. If you want to make the regular term unimportant, you need to increase parameter C ;
  • In fact, placing the parameter C in front of J(θ) is equivalent to changing the original α·L2 into (1/C)·L2; the two formulations are equivalent;

  • α, C: Balance the relationship between the two parts of the new loss function;

  • In logistic regression and SVM algorithms, the CJ(θ) + L2 method is preferred; this method is also used in scikit-learn's logistic regression algorithm ;

  1. Reason : When using the CJ(θ) + L2 method, the coefficient of the regularization term is 1, which means that regularization must be used when optimizing the algorithm model;

3. Example of logistic regression algorithm in scikit-learn

  • The logistic regression algorithm in scikit-learn automatically encapsulates the regularization function of the model, and only needs to adjust C and penalty;
  • Main parameters: degree, C, penalty; (there are other parameters)

1) Directly use the logistic regression algorithm

np.random.seed(666)
X = np.random.normal(0, 1, size=(200, 2))
y = np.array(X[:,0]**2 + X[:,1] < 1.5,dtype='int')

# randomly pick 20 samples and force their label to 1 — deliberately altering the data to add noise
for _ in range(20):
    y[np.random.randint(200)] = 1

plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)


# C=1.0: the hyperparameter C defaults to 1.0;
# penalty='l2': the L2 penalty is used by default;
def plot_decision_boundary(model, axis):
    
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)).reshape(-1,1),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)).reshape(-1,1)
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    
    plt.contourf(x0, x1, zz, linewidth=5, cmap=custom_cmap)

plot_decision_boundary(log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()

Adding polynomial terms to the model of the logistic regression algorithm

# degree = 2, C defaults to 1.0

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialLogisticRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])

# when using a pipeline, first create the pipeline instance, then call fit;
poly_log_reg = PolynomialLogisticRegression(degree=2)
poly_log_reg.fit(X_train, y_train)

plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()


# degree = 20, C defaults to 1.0

poly_log_reg2 = PolynomialLogisticRegression(degree=20)
poly_log_reg2.fit(X_train, y_train)

plot_decision_boundary(poly_log_reg2, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()


# degree = 20, C = 0.1

def PolynomialLogisticRegression(degree, C):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression(C=C))
    ])

poly_log_reg3 = PolynomialLogisticRegression(degree=20, C=0.1)
poly_log_reg3.fit(X_train, y_train)

plot_decision_boundary(poly_log_reg3, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()


# degree = 20, C = 0.1, penalty = 'l1' (penalty: type of regularization, default 'l2')

def PolynomialLogisticRegression(degree, C, penalty='l2'):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression(C=C, penalty=penalty,solver='liblinear'))
    ])

poly_log_reg4 = PolynomialLogisticRegression(degree=20, C=0.1, penalty='l1')
poly_log_reg4.fit(X_train, y_train)

plot_decision_boundary(poly_log_reg4, axis=[-4, 4, -4, 4])
plt.scatter(X[y==0,0], X[y==0,1])
plt.scatter(X[y==1,0], X[y==1,1])
plt.show()

# The solver parameter selects the optimizer: the 'newton-cg', 'sag' and 'lbfgs' solvers only support L2 regularization, while the 'liblinear' solver supports both L1 and L2; if dual=True, only the L2 penalty is supported.

# Two parameters determine which penalty can be chosen: dual and solver. To use the L1 norm, dual must be False and solver must be 'liblinear'.
  1. Analysis: with degree = 20 the model's decision boundary is too complex and the model is probably overfitted, so the L1 regularization term is used to regularize it;
  2. Analysis: once the model overfits, it contains many polynomial terms; the L1 regularization term drives the coefficients of most of these terms to 0, which makes the decision boundary more regular and less curved and therefore easier to visualize — a quick check is sketched below;
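A quick check of that effect (a sketch assuming the pipeline step name 'log_reg' used above):

import numpy as np

# count how many polynomial-term coefficients the L1 penalty has driven to exactly 0
coef = poly_log_reg4.named_steps['log_reg'].coef_
print("non-zero coefficients:", np.sum(coef != 0), "out of", coef.size)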

Logistic regression OvR and OvO

1. Basic understanding

  • Problem : Logistic regression algorithm uses regression to solve classification problems, and can only solve binary classification problems ;
  • Solution : It can be modified so that the logistic regression algorithm can solve multi-classification problems;
  • Transformation method:
  1. OvR (One vs Rest), one class versus all the rest, sometimes also called OvA (One vs All); OvR is the more standard name and is generally used;
  2. OvO (One vs One) means one-on-one;
  • The transformation method does not refer to the logistic regression algorithm, but is universal in the field of machine learning. All binary classification machine learning algorithms can be transformed using this method to solve multi-classification problems ;

2. Principle

1)OvR

  • Idea: when classifying n classes of samples, take one class as one category and treat all the remaining classes together as the other category. This yields n binary classification problems; logistic regression is trained on these n data sets to obtain n models, the sample to be predicted is fed to all n models, and the class whose model outputs the highest probability is taken as the predicted class;
  • Time complexity: if handling one binary classification problem takes time T, this method takes about n·T;

2)OvO

  • Idea: among the n classes, pick two at a time and combine them pairwise, which gives $C_n^2$ binary classification problems; the $C_n^2$ models each predict the sample's class, producing $C_n^2$ predictions, and the class that is predicted most often is taken as the sample's final predicted class (see the sketch after this section using scikit-learn's generic wrappers);
  • Time complexity: if handling one binary classification problem takes time T, this method takes about $C_n^2 \cdot T = \frac{n(n-1)}{2} \cdot T$;
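Because OvR and OvO are generic transformations, scikit-learn also offers wrapper classes that apply them to any binary classifier; the sketch below (an illustration on the iris data, not part of the original notes) wraps LogisticRegression with both:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=666)

ovr = OneVsRestClassifier(LogisticRegression())  # trains n binary classifiers
ovo = OneVsOneClassifier(LogisticRegression())   # trains C(n, 2) binary classifiers
ovr.fit(X_train, y_train)
ovo.fit(X_train, y_train)
print(ovr.score(X_test, y_test), ovo.score(X_test, y_test))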

3) Difference

  • OvO takes more time, but its classification results are more accurate, because each comparison is made between two real classes only, without interference from the other categories;

1) Example (3 classes): LogisticRegression() uses OvR by default

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
# [:, :2]: all rows, columns 0 and 1 (column 2 excluded);
X = iris.data[:,:2]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

log_reg.score(X_test, y_test)
# accuracy: 0.7894736842105263
# draw the decision boundary
def plot_decision_boundary(model, axis):
    
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1]-axis[0])*100)).reshape(-1,1),
        np.linspace(axis[2], axis[3], int((axis[3]-axis[2])*100)).reshape(-1,1)
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]
    
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)
    
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    
    plt.contourf(x0, x1, zz, linewidth=5, cmap=custom_cmap)

plot_decision_boundary(log_reg, axis=[4, 8.5, 1.5, 4.5])
# the visualization can only show two features in the same two-dimensional plane;
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.scatter(X[y==2, 0], X[y==2, 1])
plt.show()

2) Use OvO classification

log_reg2 = LogisticRegression(multi_class='multinomial', solver='newton-cg')
# 'multinomial': the multinomial (softmax) option — what this section refers to as OvO;

log_reg2.fit(X_train, y_train)
print(log_reg2.score(X_test, y_test))
# accuracy: 0.7894736842105263


plot_decision_boundary(log_reg2, axis=[4, 8.5, 1.5, 4.5])
plt.scatter(X[y==0, 0], X[y==0, 1])
plt.scatter(X[y==1, 0], X[y==1, 1])
plt.scatter(X[y==2, 0], X[y==2, 1])
plt.show()

3) Use the full data set (all features)

OvR

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg_ovr = LogisticRegression()
log_reg_ovr.fit(X_train, y_train)
log_reg_ovr.score(X_test, y_test)

OvO

log_reg_ovo = LogisticRegression(multi_class='multinomial', solver='newton-cg')
log_reg_ovo.fit(X_train, y_train)
log_reg_ovo.score(X_test, y_test)

Other logistic regression parameters

1. max_iter learning curve

The parameter max_iter is the maximum number of iterations and stands in for the step size: it lets us control how long the model iterates and stop it in time. The larger max_iter is, the smaller the effective step size and the longer the model takes to fit; conversely, a small max_iter corresponds to a large step size and a very short fitting time.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression as LR
from matplotlib import  pyplot as plt
import numpy as np

data = load_breast_cancer()
x = data.data
y = data.target
#print(x.shape)  #(569, 30)
#print(y.shape)  #(569,)

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)

lr1 = []
lr1test = []
for i in range(1, 201, 10):
    LR_ = LR(penalty='l1', solver='liblinear', C=0.8, max_iter=i)
    lr = LR_.fit(xtrain, ytrain)
    score_1 = accuracy_score(lr.predict(xtrain), ytrain)
    lr1.append(score_1)
    # evaluate the same fitted model on the test set (do not refit on the test data)
    score_2 = accuracy_score(lr.predict(xtest), ytest)
    lr1test.append(score_2)

graph = [lr1, lr1test]
color = ['black', 'red']
label = ['l1', 'l1test']
plt.figure(figsize=(20, 5))
for i in range(2):
    plt.plot(np.arange(1, 201, 10), graph[i], color[i], label=label[i])
plt.legend()
plt.xticks(np.arange(1, 201, 10))
plt.show()

You can use the attribute .n_iter_ to retrieve the number of iterations actually performed in this fit. If the minimum of the loss function is found before max_iter is reached, the solver stops early instead of continuing to iterate.

# use the attribute .n_iter_ to see how many iterations this fit actually performed
lr=LR(penalty='l2',solver='liblinear',C=0.8,max_iter=200).fit(xtrain,ytrain)
lr.n_iter_
#array([20], dtype=int32)

When the number of steps allowed by max_iter has been used up but logistic regression has still not found the minimum of the loss function and the parameter values have not yet converged, sklearn raises a warning like this:

# with max_iter=10 the solver stops before the parameters converge, and sklearn issues a warning
lr=LR(penalty='l2',solver='liblinear',C=0.8,max_iter=10).fit(xtrain,ytrain)
lr.n_iter_

This is a reminder that the parameters did not converge and that max_iter should be increased. But we do not have to obey sklearn: a larger max_iter means a smaller step size and a slower model, and although gradient descent pursues the minimum of the loss function, reaching it can also mean the model overfits (performing very well on the training set but not necessarily on the test set). So if max_iter triggers the warning but the model already trains and predicts well, there is no need to increase max_iter. Everything is judged by the model's predictive performance: as long as the final predictions are good and the model runs fast, it does not matter whether the warning appears.

2. Selection of solver
(the solver options are compared in the parameter description earlier in these notes)
3. class_weight: the sample-imbalance parameter

Sample imbalance means that in a data set one type of label naturally makes up a very large proportion, or that the cost of misclassification is very high, i.e. we particularly want to capture one specific class. When is misclassification costly? For example, when a bank decides whether a new customer will default, the ratio of non-defaulters to defaulters is usually around 99:1 — people who actually default are very rare. In this situation, even a model that does nothing and treats everyone as a non-defaulter reaches 99% accuracy, which makes the evaluation metric meaningless and defeats the modeling goal of "identifying customers who will default".

Therefore we use the parameter class_weight to rebalance the sample labels to some extent, giving more weight to the minority label so that the model leans toward the minority class and models in the direction of capturing it. The parameter defaults to None, which automatically gives every label in the data set the same weight (i.e. 1:1). When the cost of misclassification is high we use the "balanced" mode; when we simply want the labels treated equally, leaving the parameter unset is enough.

However, the parameter class_weight in sklearn is hard to reason about: if you run the model, you will find it difficult to see the trend the parameter induces or to evaluate its effect with a learning curve, so it can be quite awkward to use. There are various ways to handle sample imbalance; the mainstream approach is sampling, which balances the labels by repeating or removing samples — either upsampling (adding minority-class samples, e.g. with SMOTE) or downsampling (removing majority-class samples). For logistic regression, upsampling is usually the best approach. A small sketch of the class_weight API follows.
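For completeness, a small sketch of the class_weight API itself (the breast-cancer data is used only as an example; the effect on a mildly imbalanced set like this is modest):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = load_breast_cancer(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

for cw in (None, "balanced"):
    # "balanced" automatically weights each class by the inverse of its frequency,
    # tilting the model toward the minority class
    clf = LogisticRegression(class_weight=cw, solver="liblinear").fit(Xtrain, ytrain)
    print(cw, recall_score(ytest, clf.predict(Xtest)))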
