Logistic Regression Principles and Applications

Table of contents

Chapter 1: Application Scenarios of Logistic Regression

Chapter 2: Principles of Logistic Regression

1. Input

2. Sigmoid function

3. Loss function

4. Optimize the loss

Use gradient descent:

Chapter 3: Logistic Regression Application Cases

1. Dataset

2. Specific process

1. Read data

2. Missing value processing

3. Divide the dataset

4. Standardization

5. Estimator workflow

6. Model evaluation

7. Result display

Chapter 4: Classification Evaluation Algorithms

1. Classification evaluation method: precision and recall

Precision:

Recall:

F1-score

2. Classification evaluation method: ROC curve and AUC metric


Chapter 1: Application Scenarios of Logistic Regression

  • Ad click-through rate (will a user click an ad?)
  • Is an email spam?
  • Is a patient sick?
  • Financial fraud detection
  • Fake account detection

Looking at the examples above, we can see a common characteristic: each is a judgment between two categories. Logistic regression is a powerful tool for solving such binary classification problems.

Note: Although logistic regression has the word "regression" in its name, it is not a regression algorithm but a classification algorithm.

Chapter 2: Principles of Logistic Regression

1. Input

The input of logistic regression is the output of a linear regression, which is generally written as:

h(w) = w1*x1 + w2*x2 + ... + wn*xn + b

After the weights and bias are represented as matrices, the formula above can be written as:

h(w) = w^T x + b

 Key point: The input of logistic regression is the result of linear regression.

2. Sigmoid function

The sigmoid function is g(z) = 1 / (1 + e^(-z)); its graph (omitted here) is an S-shaped curve.

Observing this graph, the range of the independent variable is (-∞, +∞) and the range of the dependent variable is (0, 1). This means that no matter what value the independent variable takes, the sigmoid function maps it into the interval (0, 1).

Summary: the sigmoid function maps the result of linear regression into (0, 1). Taking 0.5 as the threshold, by default values below 0.5 are assigned to class 0 and values above 0.5 are assigned to class 1, which is how classification is performed.

Assume the prediction function is:

h(x) = g(w^T x + b) = 1 / (1 + e^-(w^T x + b))

where g(z) = 1 / (1 + e^(-z)) is the sigmoid function.

These two formulas mean that the result of linear regression is first expressed in matrix form, and that result is then passed into the sigmoid function.
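To make this concrete, here is a minimal NumPy sketch (the weights, bias and sample values are made-up numbers for illustration, not taken from the original) that computes the linear part, passes it through the sigmoid function, and thresholds at 0.5:

import numpy as np

def sigmoid(z):
    # maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# hypothetical weights, bias and a batch of 3 samples with 2 features each
w = np.array([0.8, -0.4])
b = 0.1
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [-1.0, 1.0]])

z = X @ w + b                    # linear regression output
h = sigmoid(z)                   # probability of the positive class
y_pred = (h > 0.5).astype(int)   # threshold at 0.5 to get class labels
print(h, y_pred)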

For the classification task:

P(y = 1 | x) = h(x)
P(y = 0 | x) = 1 - h(x)

Understanding: take flipping a coin as an example; if the probability of heads is 0.7, then the probability of tails is 1 - 0.7 = 0.3.

Combining the two formulas above, we get:

P(y | x) = h(x)^y * (1 - h(x))^(1 - y)

The characteristic of this formula is that when y = 1 it reduces to the left factor h(x), and when y = 0 it reduces to the right factor 1 - h(x).

3. Loss function

To find a good logistic regression model, a loss function is derived. Two points about loss functions:

① The loss function reflects how close the "predicted values" are to the "true values".

② The smaller the loss, the better the model.

The loss of logistic regression is called the log-likelihood loss, and the formula is:

cost = -Σ [ y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i)) ]

This formula should look familiar: it is obtained by taking the logarithm of the combined probability above. Taking logarithms turns the original multiplication into addition, and the exponents y and 1 - y move to the front as coefficients.

Assuming the samples are independent of each other, the probability of generating the entire sample set is the product of the generation probabilities of all the samples. Taking the logarithm of that product (and negating it) yields exactly the loss formula above.

Example: computing the loss.

Here y is the true label, and h(x) or 1 - h(x) is the logistic regression output (i.e. the predicted probability); substituting these values into the formula above gives the loss.
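As a hedged illustration (the labels and predicted probabilities below are made up, not the numbers from the original example), the log-likelihood loss can be computed like this:

import numpy as np

# hypothetical true labels and predicted probabilities h(x) for 4 samples
y_true = np.array([1, 0, 1, 1])
h = np.array([0.9, 0.2, 0.6, 0.3])

# log-likelihood loss: -sum( y*log(h) + (1-y)*log(1-h) )
loss = -np.sum(y_true * np.log(h) + (1 - y_true) * np.log(1 - h))
print(loss)  # the smaller the loss, the better the model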

4. Optimize the loss

Use gradient descent:

w := w - α * (∂cost / ∂w)

Understanding: α is the learning rate, which has to be specified manually; the term next to α is the gradient, which gives the direction along which the loss changes fastest.

Searching along the direction in which the loss function decreases eventually reaches the lowest point of the "valley", updating the value of w at every step.

Usage: even for tasks with very large amounts of training data, gradient descent can find good results.

The accompanying figure (omitted here) shows the loss value being reduced step by step until the lowest point of the curve is reached.
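Putting the pieces of this chapter together, here is a minimal from-scratch sketch of logistic regression trained with batch gradient descent (the toy data, learning rate and iteration count are arbitrary choices for illustration, not values from the original):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy binary-classification data: 4 samples, 2 features
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])   # weights
b = 0.0                    # bias
alpha = 0.1                # learning rate, specified manually

for _ in range(1000):
    h = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (h - y) / len(y)   # gradient of the log-likelihood loss w.r.t. w
    grad_b = np.mean(h - y)           # gradient w.r.t. b
    w -= alpha * grad_w               # step against the gradient
    b -= alpha * grad_b

print("weights:", w, "bias:", b)
print("predictions:", (sigmoid(X @ w + b) > 0.5).astype(int))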

Chapter 3: Logistic Regression Application Cases

1. Dataset

Original data set download

URL: Index of /ml/machine-learning-databases/breast-cancer-wisconsin https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

 

After opening the page, download the two files breast-cancer-wisconsin.data and breast-cancer-wisconsin.names (the two items marked in red in the original screenshot).

The .data file contains the data: 699 samples and 11 columns in total. The first column is the sample code number (ID), the next 9 columns are medical features related to the tumor, and the last column is the tumor class. The data contain 16 missing values, marked with "?".

The .names file contains a description of the data file, mainly an explanation of each column; the last column is the class.

2. Specific process

1. Read data

Note that the data file and the column names are stored separately, so the column names must be supplied when the data is read.

import pandas as pd
import numpy as np

# 1. Read the data
path = "breast-cancer-wisconsin.data"
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
               'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
               'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(path, names=column_name)
# print(data)

2. Missing value processing

# 2. Handle missing values
# 1) Replace "?" with np.nan
data = data.replace(to_replace="?", value=np.nan)
# 2) Drop the samples with missing values
data.dropna(inplace=True)

3. Divide the dataset

# 3. Split the dataset
from sklearn.model_selection import train_test_split

# select the feature values and the target values
x = data.iloc[:, 1:-1]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y)

4. Standardization

Transform the features so that each has mean 0 and standard deviation 1: x' = (x - mean) / std.

# 4. Standardization
from sklearn.preprocessing import StandardScaler
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

5. Estimator workflow

from sklearn.linear_model import LogisticRegression

# 5. Estimator workflow
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# Model parameters of logistic regression: the regression coefficients and the bias
# estimator.coef_
# estimator.intercept_
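As an optional check, the fitted parameters can be inspected and used to reproduce a prediction by hand, tying the code back to the sigmoid formula from Chapter 2 (a sketch, assuming the estimator above has already been fitted; the variable names are only illustrative):

# inspect the learned coefficients and intercept
print("coefficients:", estimator.coef_)
print("intercept:", estimator.intercept_)

# reproduce the predicted probability of the malignant class for the first test sample:
# sigmoid(w . x + b)
z = x_test[0] @ estimator.coef_[0] + estimator.intercept_[0]
manual_prob = 1 / (1 + np.exp(-z))
print("manual probability:", manual_prob)
print("predict_proba:", estimator.predict_proba(x_test[:1]))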

6. Model evaluation

# 6. Model evaluation
# Method 1: directly compare the true values and the predicted values
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)
# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

7. Result display

The printed output shows the predicted labels, the element-wise comparison with the true labels, and the accuracy score. The code is not finished yet; the evaluation code continues in the next chapter.

Chapter 4: Classification Evaluation Algorithms

1. Classification evaluation method: precision and recall

In a problem like cancer detection we often care less about the overall accuracy than about whether the actual cancer patients are detected, which is why precision and recall are introduced.

Under a classification task, the predicted condition and the true condition have four possible combinations, which form the confusion matrix:

                        Predicted positive      Predicted negative
True positive           TP (true positive)      FN (false negative)
True negative           FP (false positive)     TN (true negative)

Precision:

Among the samples whose predicted result is positive, the proportion whose true result is also positive. In the confusion matrix this is:

Precision = TP / (TP + FP)

Recall:

Among the samples whose true result is positive, the proportion that are predicted as positive. In the confusion matrix this is:

Recall = TP / (TP + FN)

Summary:

Precision: of the samples predicted as positive, how many were predicted correctly (are truly positive).

Recall: of the samples that are truly positive, how many were predicted correctly (were found).

That covers precision and recall; now we introduce the F1-score.

F1-score

The F1-score reflects the robustness of the model: it is large only when both precision and recall are large. The formula is:

F1 = 2 * Precision * Recall / (Precision + Recall)

Now compute precision, recall and F1-score in code:

# View the precision, recall and F1-score
# in this dataset the class labels are 2 (benign) and 4 (malignant)
from sklearn.metrics import classification_report
report = classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"])
print(report)

The result is a report listing precision, recall, F1-score and support for the benign and malignant classes.

Before introducing the ROC curve and the AUC metric, consider an example of sample imbalance.

Think about it:

Suppose 99 out of 100 samples are cancer and 1 sample is non-cancer, and a model simply predicts every sample as positive (taking cancer as the positive class).

Writing this information into the confusion matrix gives:

                        Predicted positive      Predicted negative
True positive (99)      TP = 99                 FN = 0
True negative (1)       FP = 1                  TN = 0

Calculating each metric separately:

Accuracy: 99 / 100 = 99%

Precision: 99 / (99 + 1) = 99%

Recall: 99 / (99 + 0) = 100%

F1-score: 2 * 99% * 100% / (99% + 100%) ≈ 99.497%

All the metrics look excellent, yet this is an irresponsible model. The root cause is that the samples are unbalanced: there are far too many positive examples and too few negative ones. This is why the ROC curve and the AUC metric are introduced.

2. Classification evaluation method: ROC curve and AUC metric

Before introducing the ROC curve and the AUC metric, we need to understand TPR and FPR.

TPR = TP / (TP + FN)

Among all samples whose true class is 1, the proportion predicted as class 1 (this is the same as recall).

FPR = FP / (FP + TN)

Among all samples whose true class is 0, the proportion (incorrectly) predicted as class 1.


In the figure (omitted here), the blue line is the ROC curve, and the AUC metric is the area enclosed by the ROC curve and the axes, i.e. the area under the curve.

Some explanation of this figure:

The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal, it means that for any sample, whether its true class is 1 or 0, the classifier predicts class 1 with equal probability; in that case the AUC is 0.5, i.e. random guessing.

The minimum value of AUC is 0.5 and the maximum is 1; the higher the value, the better.

AUC = 1: a perfect classifier. With this model, perfect predictions are obtained no matter what threshold is set. In most real prediction problems there is no perfect classifier.

0.5 < AUC < 1: better than random guessing. Such a classifier (model) has predictive value if the threshold is set properly.

Conclusion:

The final AUC ranges between [0.5, 1], and the closer to 1 the better.

Implemented in code, the area under the ROC curve, i.e. the AUC metric, is computed as follows:

from sklearn.metrics import roc_auc_score

# y_true: the true class of each sample, which must be labeled 0 (negative) or 1 (positive)
# convert y_test (2 = benign, 4 = malignant) into 0/1
y_true = np.where(y_test > 3, 1, 0)
# y_predict contains the labels 2/4; since 4 > 2 they work as scores for ranking
print("AUC metric:", roc_auc_score(y_true, y_predict))

The output prints the AUC score.
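Besides the AUC score, the ROC curve itself can be drawn. Here is a hedged sketch using sklearn's roc_curve and matplotlib (it assumes the fitted estimator, x_test and y_true from the case study are still in scope, and uses the predicted probability of the malignant class as the score):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# predicted probability of the malignant class (classes_ = [2, 4], so column 1 is class 4)
y_score = estimator.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()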

Summary:

AUC can only be used to evaluate binary classifiers.

AUC is well suited to evaluating classifier performance when the samples are unbalanced.

Now that we know the ROC curve and the AUC metric, let's go back to the earlier sample-imbalance example (the "think about it" case).

TPR: 99 / (99 + 0) = 100%

FPR: 1 / (1 + 0) = 100%

TPR = FPR

AUC = 0.5

For this unbalanced situation, the AUC metric is 0.5, which shows that the model is very poor.
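As a hedged sketch (the 99-to-1 labels below simply reproduce the thought experiment above, not real data), sklearn confirms that accuracy, precision, recall and F1 all look excellent while the AUC exposes the useless model:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 99 cancer samples (label 1) and 1 non-cancer sample (label 0); the model predicts everything as 1
y_true_toy = np.array([1] * 99 + [0])
y_pred_toy = np.ones(100, dtype=int)

print("accuracy:", accuracy_score(y_true_toy, y_pred_toy))    # 0.99
print("precision:", precision_score(y_true_toy, y_pred_toy))  # 0.99
print("recall:", recall_score(y_true_toy, y_pred_toy))        # 1.0
print("f1:", f1_score(y_true_toy, y_pred_toy))                # about 0.995
print("auc:", roc_auc_score(y_true_toy, y_pred_toy))          # 0.5, no better than random guessing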

Note:

The content above comes from studying the machine learning videos by Dark Horse Programmers.

This material was also an assignment set by the teacher of my machine learning course. It was originally in PPT format; I thought it was fairly complete overall, so I posted it on my blog for myself and others to learn from. 2022.6.16

Origin: blog.csdn.net/qq_39031009/article/details/125305196