Logistic Regression Explained: From Classification Principles to Machine Learning Practice

Overview

Although Logistic Regression has the word "regression" in its name, it is mainly used to solve classification problems, especially binary classification. The core idea of logistic regression is to pass the output of linear regression through an activation function (Activation Function), such as the Sigmoid function, which converts a continuous value into a probability between 0 and 1, and then to assign a class by comparing that probability against a threshold (Threshold).
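A minimal sketch of this pipeline is shown below; the weights, bias and feature values are made up purely for illustration:

import numpy as np

# Hypothetical weights, bias and a single sample (illustrative values only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

z = np.dot(w, x) + b             # linear-regression-style score
p = 1 / (1 + np.exp(-z))         # Sigmoid squashes the score into (0, 1)
label = int(p >= 0.5)            # apply the threshold
print(p, label)                  # p ≈ 0.75, so the sample is assigned class 1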


Logistic regression application areas

Logistic Regression has wide applications in many fields, including but not limited to:

  • Medicine: disease prediction
  • Finance: credit scoring, assessing someone's risk of defaulting
  • Marketing: predicting whether a customer will buy a product or click on an ad
  • Social networks: predicting user behavior, such as whether a friend request will be accepted or whether a message is spam

As our understanding of the data deepens, especially when facing classification tasks, logistic regression is often the first algorithm we try. Compared with other algorithms, logistic regression is computationally very efficient, and the mathematical principles behind it are simple and clear.

Logistic regression vs linear regression

In machine learning, linear regression and logistic regression are the most basic algorithms. Although their names are very similar, they differ greatly in how they work and what they are used for. To understand both algorithms more deeply, we will compare them along the following dimensions.

Basic definition

  • Linear Regression: Linear regression is used to estimate the linear relationship between one or more independent variables and the dependent variable, and is used for regression tasks
  • Logistic Regression: Logistic regression estimates the probability of an event, used for classification tasks

Output type

  • Linear Regression: linear regression outputs continuous values, with range $-\infty \sim +\infty$
  • Logistic Regression: logistic regression outputs probability values, ranging from 0 to 1

Functional relationship

  • Linear Regression: The linear regression relationship is linear, using the identity link function
  • Logistic Regression: The logistic regression relationship is a Sigmoid function

Error calculation

  • Linear Regression: Linear regression uses mean square error (MSE) to calculate model error
  • Logistic Regression: Logistic regression uses the Logarithmic Loss Function to calculate the model error

Use cases

  • Linear Regression: The goal of linear regression is to predict a continuous value, such as house prices, temperature, etc.
  • Logistic Regression: The goal of logistic regression is classification, especially binary classification, such as spam detection, disease diagnosis, etc.

Data distribution

  • Linear Regression: requires a linear relationship between the independent variables and the dependent variable, and the errors should follow a Normal Distribution; in addition, the data should not contain severe multicollinearity, heteroscedasticity or autocorrelation
  • Logistic Regression: does not require a linear relationship between the dependent variable and the independent variables, but assumes a linear relationship between each independent variable and log(odds); it also assumes that the observations are independent of each other

Mathematical Principles of Logistic Regression

Before we delve into logistic regression, it is crucial to understand the mathematical principles behind it. By understanding the principles, we can better understand the algorithm, and it can also help us make informed decisions in practical applications.

Sigmoid function

The core of logistic regression is the Sigmoid function, also known as the Logistic function. Through the Sigmoid function, we can map any real number to a value between 0 and 1.

Sigmoid formula:
$$S(z) = \frac{1}{1 + e^{-z}}$$

  • z is a linear combination of inputs

Formula for z:
$$z = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n$$

  • As z approaches positive infinity, S(z) approaches 1
  • As z approaches negative infinity, S(z) approaches 0
  • When z=0, S(z) = 0.5

The S-shaped curve of the Sigmoid function is well suited to binary classification, because its output can be interpreted as the probability of belonging to a particular class.

Implementation of Sigmoid function in Python:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
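Note that np.exp(-z) can overflow for very negative z. A common numerically stable variant (SciPy also ships one as scipy.special.expit) is sketched below; this is an optional refinement rather than part of the basic algorithm:

import numpy as np

def stable_sigmoid(z):
    # Sigmoid that avoids overflow for large |z|
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp never overflows
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out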

Log-odds

Logistic regression involves an important concept, the log-odds (logit): the logarithm of the ratio between the probability that an event occurs and the probability that it does not occur.

$$logit(p) = \log\left(\frac{p}{1-p}\right)$$

where p is the probability of the event occurring (Probability). Our goal is to establish the relationship between the explanatory variables and the log odds.
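As a quick numeric illustration (the value 0.8 is chosen arbitrarily), a probability of 0.8 corresponds to odds of 4 and log-odds of about 1.39:

import numpy as np

p = 0.8
odds = p / (1 - p)        # ≈ 4
log_odds = np.log(odds)   # ≈ 1.386
print(odds, log_odds)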

Likelihood function

In order to estimate the parameters in a logistic regression model, we need to use a likelihood function. Given a set of parameters, the function describes the probability of observing the data.

Likelihood function and probability:

  • Likelihood: Given a certain set of observed data, we are trying to estimate the likelihood of a certain parameter (such as the probability of a coin coming up heads).
  • Probability: Given a fixed parameter value (such as the probability that a coin lands heads), we calculate the probability of observing a particular data set. For example, when tossing a coin whose heads probability is 0.5, the probability of getting 5 heads in a row is $P(5\ \text{heads} \mid p) = (0.5)^5 = 0.03125$
  • Likelihood vs Probability:
    • Function perspective: Likelihood is a function of parameters, given observed data; Probability is a function of data, given fixed parameter values
    • Normalization: The sum of likelihoods does not necessarily equal 1 and is not a true probability distribution; the sum of probabilities for all possible events is always equal to 1
    • Viewpoint: Likelihood describes the possibility of different scenarios (model parameters) given an event that has already occurred; probability describes the possibility of something happening in the future

Likelihood function formula:
$$L(\theta \mid x) = P(X = x \mid \theta)$$

  • $\theta$: the model parameters
  • $X$: the random variable
  • $x$: the observed data
  • $P(X = x \mid \theta)$: the probability of observing the data $x$ given the parameters $\theta$

A major application of the likelihood function in statistics is Maximum Likelihood Estimation (MLE). The goal is to find the parameter values that maximize the likelihood function. Formula:

$$\hat{\theta} = \arg\max_{\theta}\ L(\theta \mid x)$$

  • $\hat{\theta}$ is the best estimate of the model parameters

Why use the likelihood function:

  • The likelihood function gives us a framework for evaluating how well different candidate values of the model parameters fit the observed data. Maximum likelihood estimation provides a consistent and asymptotically unbiased estimator: the larger the sample size, the closer the MLE gets to the true parameter values. A small numeric sketch follows below.
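A minimal sketch of MLE for the coin example, assuming (purely for illustration) 7 heads in 10 tosses; the Bernoulli likelihood $L(p) = p^7(1-p)^3$ peaks at $p = 7/10$:

import numpy as np

heads, n = 7, 10
p_grid = np.linspace(0.01, 0.99, 99)                     # candidate parameter values
likelihood = p_grid**heads * (1 - p_grid)**(n - heads)   # L(p) for each candidate
best_p = p_grid[np.argmax(likelihood)]
print(best_p)  # ≈ 0.7, matching the analytic MLE heads / n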

Logistic regression loss function

Logistic regression is used to solve binary classification problems; the main goal is to predict the probability that the output class for a given input is 1 (True). To measure the gap between the model's prediction $\hat{y}$ and the actual class label $y$, we need a loss function. The loss function of logistic regression is usually called log loss (Log Loss) or cross-entropy loss (Cross Entropy Loss).

Log loss function:
$$J(w, b) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})$$

  • $y$: the class label (0 or 1)
  • $\hat{y}$: the model's predicted probability

The log loss function penalizes the gap between the model's predicted probability and the actual class:

  • When the actual class is 1 (y = 1): the loss is $-\log(\hat{y})$. The closer $\hat{y}$ is to 1, the closer the loss is to 0; conversely, the closer $\hat{y}$ is to 0, the larger the loss
  • When the actual class is 0 (y = 0): the loss is $-\log(1-\hat{y})$. The closer $\hat{y}$ is to 0, the closer the loss is to 0; conversely, the closer $\hat{y}$ is to 1, the larger the loss

Note: Cross-entropy loss and logarithmic loss are equivalent in binary classification problems.
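A minimal sketch of this loss in code (the predicted probabilities are made-up numbers for illustration):

import numpy as np

def log_loss_single(y, y_hat, eps=1e-15):
    # Clip the prediction away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(log_loss_single(1, 0.9))   # ≈ 0.105: confident and correct, small loss
print(log_loss_single(1, 0.1))   # ≈ 2.303: confident but wrong, large loss
print(log_loss_single(0, 0.1))   # ≈ 0.105: correct again, small loss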

Regularization

Regularization is a key concept in machine learning. It limits the complexity of the model by adding a "penalty" term to the loss function, which improves the model's generalization ability (Generalization Ability) and helps avoid overfitting. In logistic regression, we usually use L1 and L2 regularization for this purpose.

L1 regularization

L1 regularization (Lasso):
$$J_{L1} = J + \lambda \sum_{i=1}^{n}|w_i|$$

  • $J$: the original loss function
  • $w_i$: the model parameters
  • $\lambda$: the regularization coefficient

L1 regularization tends to produce some features with a weight of 0, which means that some features in the model may be completely ignored, thus achieving feature selection.

L2 regularization

L2 regularization (Ridge):
$$J_{L2} = J + \lambda \sum_{i=1}^{n}w_i^2$$

  • $J$: the original loss function
  • $w_i$: the model parameters
  • $\lambda$: the regularization coefficient

Unlike L1 regularization, L2 regularization will not make the weights exactly 0, but it will try to make them as small as possible, which helps prevent any single feature from having an excessive impact on the prediction results, making the model more robust.

Choosing appropriate regularization:

  • Whether to choose L1 or L2 regularization (or their combination, Elastic Net) depends on the problem we need to solve. If only a few features are important, L1 regularization may be more suitable, since it leads to a sparse model. If all features are meaningful but we do not want the model to rely too heavily on any single feature, L2 regularization may be the better choice. An Elastic Net sketch is shown below.
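If a mix of the two penalties is wanted, scikit-learn's LogisticRegression also supports Elastic Net (penalty='elasticnet' together with the 'saga' solver, available since scikit-learn 0.21). A minimal sketch, with l1_ratio=0.5 chosen arbitrarily:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same kind of simulated data as in the example below
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=7, random_state=8)

# l1_ratio balances the two penalties: 1.0 = pure L1, 0.0 = pure L2
enet_model = LogisticRegression(penalty='elasticnet', solver='saga',
                                l1_ratio=0.5, max_iter=5000)
enet_model.fit(X, y)
print(enet_model.coef_)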

L1 vs L2 Example

Suppose we have a data set with 10 features, of which only 3 are truly informative and the rest are redundant. In this case L1 regularization is the natural choice, because it can set the weights of the unimportant features to 0.

Let's first look at sklearn's API for generating simulated classification data:

sklearn.datasets.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

Parameters:

  • n_samples: number of samples to generate
  • n_features: number of features
  • n_informative: number of informative features (the truly important ones)
  • n_redundant: number of redundant features
  • random_state: random seed

Example:

"""
@Module Name: 逻辑回归 正则化.py
@Author: CSDN@我是小白呀
@Date: October 18, 2023

Description:
Logistic regression with regularization
"""
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score

# Generate simulated data: 1000 samples, 10 features, of which 3 are informative and 7 redundant
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=7, random_state=8)

# Print the data dimensions
print("特征维度:", X.shape)  # (1000, 10)
print("标签维度:", y.shape)  # (1000, )

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Use L1 regularization
l1_model = LogisticRegression(penalty='l1', solver='liblinear')
l1_model.fit(X_train, y_train)


# Use L2 regularization
l2_model = LogisticRegression(penalty='l2', solver='liblinear')
l2_model.fit(X_train, y_train)

# Compare the learned weights
print("L1正则化的特征权重:\n{}".format(l1_model.coef_))
print("L2正则化的特征权重:\n{}".format(l2_model.coef_))

# Compare model performance
# (log_loss below is computed on hard 0/1 predictions rather than predicted probabilities)
l1_pred = l1_model.predict(X_test)
l2_pred = l2_model.predict(X_test)

print("L1 逻辑回归误差:", log_loss(y_test, l1_pred))
print("L2 逻辑回归误差:", log_loss(y_test, l2_pred))
print("L1 逻辑回归 acc:", accuracy_score(y_test, l1_pred))
print("L2 逻辑回归 acc:", accuracy_score(y_test, l2_pred))

Output result:

特征维度: (1000, 10)
标签维度: (1000,)
L1正则化的特征权重:
[[ 0.          0.          0.         -0.66172034  0.          0.
   0.         -0.75254448 -0.27951727  0.41781586]]
L2正则化的特征权重:
[[-0.14042143 -0.21016038  0.12523401 -0.43610954  0.05762657  0.08849344
  -0.19145278 -0.51859205 -0.42032689  0.5046308 ]]
L1 逻辑回归误差: 9.901243835463227
L2 逻辑回归误差: 10.016373090112928
L1 逻辑回归 acc: 0.7133333333333334
L2 逻辑回归 acc: 0.71

Standardization

Standardization is commonly used to preprocess data, especially when using machine learning algorithms that need to consider feature scales, such as support vector machines, logistic regression, and K-nearest neighbors.


Why standardize?

Machine learning algorithms are sensitive to the scale and distribution of features. When the scales (ranges) of features vary greatly, those features with a larger range may dominate the behavior of the algorithm, resulting in poor model performance.

How to standardize?

The main idea of ​​standardization is to scale the feature data so that its mean is 0 and its standard deviation is 1.

Standardization formula:

$$z = \frac{x - \mu}{\sigma}$$

  • $z$: the standardized value
  • $x$: the original feature value
  • $\mu$: the mean of the feature
  • $\sigma$: the standard deviation of the feature

Standardize in sklearn:

from sklearn.preprocessing import StandardScaler

# Original data
data = [[0, 1], [2, 3], [4, 5]]
print(data)

# Standardize
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
print(data)

Output result:

[[0, 1], [2, 3], [4, 5]]
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]

Logistic regression predicts breast cancer, without standardization:

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.96      0.94        47
           1       0.97      0.94      0.95        67

    accuracy                           0.95       114
   macro avg       0.94      0.95      0.95       114
weighted avg       0.95      0.95      0.95       114

With standardization:

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96        47
           1       0.97      0.97      0.97        67

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

Gradient descent

Gradient Descent is an iterative method for optimizing functions. In machine learning and deep learning, we often use gradient descent to minimize the loss function and thereby find the model's optimal parameters.

Working principle

The core idea of ​​gradient descent is simple: find the slope (gradient) of the loss function, and then update the parameters of the model along the opposite direction of the slope to gradually reduce the loss.


Gradient descent formula

Formula:
$$w_{next} = w - lr \times \frac{\partial loss}{\partial w}$$

  • $w_{next}$: the next weight in the gradient descent process
  • $lr$: the learning rate (Learning Rate)
  • $\frac{\partial loss}{\partial w}$: the derivative of the loss function with respect to the weight

Taking the derivative. We start from the known quantities:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad y = w_0 + w_1x_1$$

In gradient descent, our goal is to adjust the model parameters (the weight and bias, w and b, written here as w1 and w0) to minimize the loss function, so we need the partial derivatives of the loss with respect to these parameters (w1 and w0):

Derivative with respect to w1 (the weight):
$$\frac{\partial MSE}{\partial w_1} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_1}\big(y_i - (w_0 + w_1x_1)\big)^2 = -\frac{2}{n}\sum_{i=1}^{n}x_1\big(y_i - (w_0 + w_1x_1)\big)$$

Derivative with respect to w0 (the bias):
$$\frac{\partial MSE}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_0}\big(y_i - (w_0 + w_1x_1)\big)^2 = -\frac{2}{n}\sum_{i=1}^{n}\big(y_i - (w_0 + w_1x_1)\big)$$

If there were more weights w2, w3 and so on, the same pattern would apply, so they are not written out here.

Calculating the new weights: substituting the derivatives above into the gradient descent formula, we get:
$$w_{1,next} = w_1 - lr \times \frac{\partial loss}{\partial w_1} = w_1 - lr \times \frac{2}{n}\sum_{i=1}^{n}x_1\big((w_0 + w_1x_1) - y_i\big)$$

Similarly:
$$w_{0,next} = w_0 - lr \times \frac{\partial loss}{\partial w_0} = w_0 - lr \times \frac{2}{n}\sum_{i=1}^{n}\big((w_0 + w_1x_1) - y_i\big)$$

In code (inside a loop over the N samples, accumulating the averaged gradients):

# Gradient w.r.t. w for one sample: 2x(wx + b - y)
w_gradient += (2 / N) * x * ((w_current * x + b_current) - y)

# Gradient w.r.t. b for one sample: 2(wx + b - y)
b_gradient += (2 / N) * ((w_current * x + b_current) - y)

Variations of Gradient Descent

  • Batch Gradient Descent (BGD, Batch Gradient Descent): Use this training set to calculate the gradient each time
  • Stochastic Gradient Descent (SGD): only uses one sample at a time to calculate the gradient
  • Mini-Batch Gradient Descent: Use a mini-batch of samples to calculate gradients
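A minimal sketch of mini-batch gradient descent on the simple model y = w·x + b; the data, learning rate and batch size are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(0, 0.5, size=x.size)   # roughly y = 2x plus noise

w, b, lr, batch_size = 0.0, 0.0, 0.01, 10
for epoch in range(200):
    idx = rng.permutation(x.size)
    for start in range(0, x.size, batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_hat = w * xb + b
        # MSE gradients computed on this mini-batch only
        dw = (2 / batch_size) * np.sum(xb * (y_hat - yb))
        db = (2 / batch_size) * np.sum(y_hat - yb)
        w -= lr * dw
        b -= lr * db

print(w, b)   # w should land near 2, b near 0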

Learning rate

Gradient descent is like walking down to the lowest point of a valley: the bottom of the valley is the goal of our objective function (the parameter values that take the objective function to its extreme point).

How do we walk down the mountain?

  1. Find the most appropriate direction at the current position
  2. Take a small step
  3. Update our parameters according to that direction and step size

Learning rate (learning_rate): has a large impact on the result; smaller values are generally safer, at the cost of slower convergence.

Batch size (batch_size): chosen mainly with memory and efficiency in mind; its effect on the result is secondary.

Choosing an appropriate learning rate is very important:

  • Learning rate that is too small: slow convergence
  • Too large a learning rate: the minimum will be missed, or the gradient will explode

Common techniques include:

  • Learning Rate Decay or Cosine Annealing, where the learning rate gradually decreases as the iterations progress (a small sketch follows below)
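A minimal sketch of the two schedules; the initial rate, decay constant and step counts are arbitrary:

import math

initial_lr, total_steps = 0.1, 100

def exponential_decay(step, k=0.05):
    # Learning rate shrinks smoothly as the step count grows
    return initial_lr * math.exp(-k * step)

def cosine_annealing(step, min_lr=0.001):
    # Learning rate follows half a cosine wave from initial_lr down to min_lr
    cos_factor = (1 + math.cos(math.pi * step / total_steps)) / 2
    return min_lr + (initial_lr - min_lr) * cos_factor

for step in (0, 50, 100):
    print(step, exponential_decay(step), cosine_annealing(step))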

Forward propagation vs back propagation

Forward Propagation and Backpropagation are concepts from training Neural Networks in Deep Learning. Strictly speaking they go beyond logistic regression, but they are worth a short detour here.

Logistic Regression can be thought of as a single-layer neural network, so the concepts of forward and backpropagation also apply.

Forward propagation

Forward Propagation refers to the flow of information from the input layer to the output layer. In this process, we calculate the output value of each layer to know the final predicted value.

For logistic regression, viewed as a single-layer neural network, forward propagation can be described in the following steps (a code sketch follows the list):

  1. Compute the linear part: given the input features $X$, the weights $w$ and the bias $b$, compute $z = wX + b$
  2. Apply the activation function: pass the result $z$ of the previous step through the Sigmoid activation function to obtain the prediction $\hat{y} = \sigma(z)$
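A minimal vectorized sketch of these two steps (the inputs and weights are made-up values):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 0.5]])      # two samples, two features (illustrative values)
w = np.array([0.4, -0.2])
b = 0.1

z = X @ w + b                   # step 1: linear part
y_hat = 1 / (1 + np.exp(-z))    # step 2: Sigmoid activation -> probabilities
print(y_hat)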

Backpropagation

Backward Propagation updates the parameters of the model by calculating the gradient of the loss function for each parameter. These gradients can tell us how to adjust the parameters to minimize the loss.

For logistic regression, the backpropagation process can be described in the following steps:

  1. Compute the error: calculate $y - \hat{y}$, the difference between the true value and the predicted value
  2. Compute the gradients: calculate the gradient of the loss function $L$ with respect to the weights $w$ and the bias $b$, i.e. the partial derivatives with respect to $w$ and $b$. The purpose of computing the gradient (Gradient) is to find a direction along which adjusting the parameters reduces the loss
  3. Update the parameters: use the gradients computed in the previous step to update the weights and bias, with the learning rate (Learning Rate) controlling the step size

In a neural network this process is more involved, because we have to start from the output layer and work back toward the input layer, computing the gradient of each layer in turn. The basic idea is the same, though: compute the gradients of the loss and use them to update the weights and biases in the network. Forward propagation gives us the model's predictions; backpropagation gives us a way to update the model parameters and thereby improve the model's predictive ability.
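A minimal self-contained sketch of one such gradient step for logistic regression (the data, labels and learning rate are illustrative):

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.5]])   # two illustrative samples
y = np.array([0.0, 1.0])                 # their true labels
w, b, lr = np.array([0.4, -0.2]), 0.1, 0.1

y_hat = 1 / (1 + np.exp(-(X @ w + b)))   # forward pass
error = y_hat - y                        # step 1: prediction error
dw = X.T @ error / len(y)                # step 2: gradient of the log loss w.r.t. w
db = error.mean()                        #         ... and w.r.t. b
w, b = w - lr * dw, b - lr * db          # step 3: parameter update
print(w, b)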

Step-by-step regression calculation

We use a simple linear regression model $y = wx + b$ to fit the given data.

Data:

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

Model initial parameters:

  • w = 0
  • b = 0

Objective: minimize the mean square error (MSE)
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\big(y - (wx + b)\big)^2$$

Forward propagation

Forward propagation calculation:

For $x_1 = 1$, the prediction is:
$$\hat{y}_1 = w \times x_1 + b = 0 \times 1 + 0 = 0$$
The true value is $y_1 = 2$, so the error is $e_1 = y_1 - \hat{y}_1 = 2$.

And so on:

  1. For $x_1 = 1$:
    • $\hat{y}_1 = w \times x_1 + b = 0$
    • $e_1 = y_1 - \hat{y}_1 = 2$
  2. For $x_2 = 2$:
    • $\hat{y}_2 = w \times x_2 + b = 0$
    • $e_2 = y_2 - \hat{y}_2 = 4$
  3. For $x_3 = 3$:
    • $\hat{y}_3 = w \times x_3 + b = 0$
    • $e_3 = y_3 - \hat{y}_3 = 6$
  4. For $x_4 = 4$:
    • $\hat{y}_4 = w \times x_4 + b = 0$
    • $e_4 = y_4 - \hat{y}_4 = 8$
  5. For $x_5 = 5$:
    • $\hat{y}_5 = w \times x_5 + b = 0$
    • $e_5 = y_5 - \hat{y}_5 = 10$

Backpropagation

The partial derivatives of the MSE are:
$$\frac{\partial MSE}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n}x\big(y - (wx + b)\big)$$
$$\frac{\partial MSE}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\big(y - (wx + b)\big)$$

Substituting $x_1 = 1$ and $y_1 = 2$ for this single data point:

$$\frac{\partial MSE}{\partial w} = -2 \times 1 \times 2 = -4$$
$$\frac{\partial MSE}{\partial b} = -2 \times 2 = -4$$

This is the gradient contributed by a single data point. For the whole data set we need to compute the contributions from $x_1$ through $x_5$ and average them:

$$\frac{\partial MSE}{\partial w} = \frac{-2}{5} \times\big[(1 \times 2) + (2 \times 4) + (3 \times 6) + (4 \times 8) + (5 \times 10)\big] = \frac{-2}{5} \times 110 = -44$$

$$\frac{\partial MSE}{\partial b} = \frac{-2}{5} \times(2+4+6+8+10) = \frac{-2}{5} \times 30 = -12$$

Parameter update

Parameter update formulas:
$$w_{new} = w_{old} - LearningRate \times \frac{\partial loss}{\partial w}$$

$$b_{new} = b_{old} - LearningRate \times \frac{\partial loss}{\partial b}$$

Substituting $\frac{\partial MSE}{\partial w} = -44$ and $\frac{\partial MSE}{\partial b} = -12$, with a learning rate of 0.01:

$$w_{new} = 0 - 0.01 \times \frac{\partial loss}{\partial w} = 0 - 0.01 \times (-44) = 0.44$$

$$b_{new} = 0 - 0.01 \times \frac{\partial loss}{\partial b} = 0 - 0.01 \times (-12) = 0.12$$

In this way we have completed one iteration. To obtain more accurate weights we need to iterate many times so that the model converges; after enough iterations we will find that $w$ approaches 2 and $b$ approaches 0.

After iteration 1: w=0.44, b=0.12
After iteration 2: w=0.776, b=0.2112
After iteration 3: w=1.032608, b=0.280416
After iteration 4: w=1.22860928, b=0.3328512
After iteration 5: w=1.3783441663999998, b=0.3724776192
After iteration 6: w=1.4927597926399998, b=0.402327416832
After iteration 7: w=1.5802129932492799, b=0.42471528093696004
After iteration 8: w=1.6470832178782207, b=0.44140819572326406
After iteration 9: w=1.6982404182016162, b=0.45375503873610556
After iteration 10: w=1.7374022238730944, b=0.4627855128692865
After iteration 11: w=1.7674066038488565, b=0.4692856691795151
After iteration 12: w=1.7904200108513373, b=0.4738555595649934
After iteration 13: w=1.8080962748901435, b=0.4769532477226133
After iteration 14: w=1.8216978995509552, b=0.47892840627475247
After iteration 15: w=1.83218865727326, b=0.4800479641762001
After iteration 16: w=1.8403042748225709, b=0.48051568545628054
After iteration 17: w=1.8466063932342285, b=0.4804871152578007
After iteration 18: w=1.8515237598072303, b=0.48008098935859095
After iteration 19: w=1.8553836732881241, b=0.47938794398298534
After iteration 20: w=1.8584359885257578, b=0.4784771647060382
After iteration 21: w=1.8608714411677287, b=0.47740146210037193
After iteration 22: w=1.862835636384806, b=0.47620114638830074
After iteration 23: w=1.8644397275968507, b=0.47490698527744635
After iteration 24: w=1.8657685684088967, b=0.4735424619160864
After iteration 25: w=1.8668869356439743, b=0.4721254985732309
After iteration 26: w=1.8678442798879062, b=0.4706697724631278
After iteration 27: w=1.8686783519647792, b=0.46918572022059085
After iteration 28: w=1.8694179713192922, b=0.4676813046982923
After iteration 29: w=1.8700851393471505, b=0.4661626003251689
After iteration 30: w=1.8706966526712672, b=0.4646342399578365
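A minimal script that reproduces the hand calculation and the iterations listed above (learning rate 0.01, 30 iterations):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

w, b, lr = 0.0, 0.0, 0.01
for i in range(1, 31):
    y_hat = w * x + b
    # Partial derivatives of the MSE, as derived above
    dw = (-2 / len(x)) * np.sum(x * (y - y_hat))
    db = (-2 / len(x)) * np.sum(y - y_hat)
    w -= lr * dw
    b -= lr * db
    print(f"After iteration {i}: w={w}, b={b}")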

Hands-on practice

Logistic regression predicts breast cancer

"""
@Module Name: 逻辑回归 乳腺癌.py
@Author: CSDN@我是小白呀
@Date: October 19, 2023

Description:
Logistic regression on the breast cancer data set
"""
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the data
data = load_breast_cancer()
X = data.data
y = data.target

# Print basic information about the data
print("输出特征:", X[:5])
print("输出标签:", y[:5])

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("acc:", accuracy)
print("\n混淆矩阵:\n", conf_matrix)
print("\nClassification Report:\n", report)

Output result:

输出特征: [[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
  1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01
  9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01
  2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01
  6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
  1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
  1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
  1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
  2.364e-01 7.678e-02]]
输出标签: [0 0 0 0 0]
acc: 0.9649122807017544

混淆矩阵:
 [[45  2]
 [ 2 65]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96        47
           1       0.97      0.97      0.97        67

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

Logistic Regression Iris

"""
@Module Name: 逻辑回归 鸢尾花.py
@Author: CSDN@我是小白呀
@Date: October 19, 2023

Description:
Logistic regression on the iris data set
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the data
data = load_iris()
X = data.data
y = data.target

# Print basic information about the data
print("输出特征:", X[:5])
print("输出标签:", y[:5])

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("acc:", accuracy)
print("\n混淆矩阵:\n", conf_matrix)
print("\nClassification Report:\n", report)

Output result:

输出特征: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
输出标签: [0 0 0 0 0]
acc: 1.0

混淆矩阵:
 [[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Implementing logistic regression by hand

"""
@Module Name: 手把手教你实现逻辑回归.py
@Author: CSDN@我是小白呀
@Date: October 19, 2023

Description:
A hand-rolled logistic regression implementation
"""
class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        """
        Initialize the parameters
        :param learning_rate: learning rate
        :param num_iterations: number of training iterations
        """
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = 0

    def _sigmoid(self, z):
        """
        Sigmoid activation function
        :param z: input value
        :return: probability between 0 and 1
        """
        return 1 / (1 + self._exp(-z))

    def _exp(self, value):
        # Exponential function; clip the input to prevent overflow
        value = value.clip(-500, 500)
        return 2.718281828459045 ** value

    def fit(self, X, y):
        """
        Train the model
        :param X: features
        :param y: labels
        :return:
        """
        num_samples, num_features = X.shape
        self.weights = [0] * num_features

        for _ in range(self.num_iterations):
            model_output = self._predict(X)

            # Compute the gradients
            d_weights = (1 / num_samples) * (X.T.dot(model_output - y))
            d_bias = (1 / num_samples) * sum(model_output - y)

            # Update the weights and bias
            self.weights -= self.learning_rate * d_weights
            self.bias -= self.learning_rate * d_bias

    def predict(self, X):
        """
        Predict
        :param X: features
        :return: predicted class labels
        """
        linear_model_output = X.dot(self.weights) + self.bias
        return [1 if i > 0.5 else 0 for i in self._sigmoid(linear_model_output)]

    def _predict(self, X):
        linear_model_output = X.dot(self.weights) + self.bias
        return self._sigmoid(linear_model_output)

if __name__ == '__main__':
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    # Load the data
    data = load_breast_cancer()
    X = data.data
    y = data.target

    # Print basic information about the data
    print("输出特征:", X[:5])
    print("输出标签:", y[:5])

    # Split the data set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Standardize
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Predict
    y_pred = clf.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print("acc:", accuracy)
    print("\n混淆矩阵:\n", conf_matrix)
    print("\nClassification Report:\n", report)

Output result:

输出特征: [[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
  1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01
  9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01
  2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01
  6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
  1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
  1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
  1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
  2.364e-01 7.678e-02]]
输出标签: [0 0 0 0 0]
acc: 0.9649122807017544

混淆矩阵:
 [[45  2]
 [ 2 65]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.96      0.96        47
           1       0.97      0.97      0.97        67

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114


Origin blog.csdn.net/weixin_46274168/article/details/133903411