Machine Learning Practical Tutorial (10): Logistic Regression

Overview

Logistic Regression is a statistical learning method used to solve binary or multi-class classification problems. It models a linear combination of the independent variables and uses the Sigmoid function to map the result into the range (0, 1), which is interpreted as the probability that the sample belongs to a certain category.
Logistic Regression is one of the most widely used algorithms and consistently ranks among the top 10 machine learning algorithms.

Logistic regression derivation

linear regression

Expression of linear regression:
$$f(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$$
Converted to matrix multiplication:
$$[[x_1,x_2,\dots,x_n]] \cdot [[\theta_1,\theta_2,\dots,\theta_n]]^T$$
matrix demonstration:
First, suppose we have a training set X containing 3 samples, each sample has 2 features:

X = [[1, 2], [3, 4], [5, 6]]

Next, we initialize the parameter vector θ:

θ = [[0.5,0.5]]
θ.T=[[0.5],[0.5]]
X * θ.T = [[1, 2], [3, 4], [5, 6]] * [[0.5], [0.5]] = [[1*0.5+2*0.5], [3*0.5+4*0.5], [5*0.5+6*0.5]] = [[1.5], [3.5], [5.5]]
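The matrix product above can be checked with a minimal NumPy sketch. The variable names `X` and `theta` simply mirror the symbols in the text, and `@` is NumPy's matrix-multiplication operator.

```python
import numpy as np

# Training set: 3 samples, 2 features each (values taken from the text above)
X = np.array([[1, 2], [3, 4], [5, 6]])
# Parameter vector as a 1x2 row; its transpose is the 2x1 column used in the product
theta = np.array([[0.5, 0.5]])

# (3x2) @ (2x1) -> (3x1): one linear output per sample
f = X @ theta.T
print(f)
```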

Let:
$$f(x)=\theta_0+\theta^Tx$$
If a column of constant 1 is added to the x data set and $\theta_0$ is merged into the $\theta$ matrix, this can be abbreviated as
$$f(x)=\theta^Tx$$

$\theta$ is the weight: it expresses the strength of the relationship between an input feature and the output y. The larger the weight (in absolute value), the greater the impact of the input feature on the output; the smaller the weight, the smaller the impact.

logistic regression

The logistic regression (LR) model simply applies a logistic function on top of linear regression. However, because of this logistic function, the logistic regression model has become a dazzling star in the field of machine learning, and it is also at the core of computational advertising.
The output of a linear equation ranges over $(-\infty, +\infty)$, while a probability takes values in [0, 1], so we apply a transformation that maps $(-\infty, +\infty)$ into [0, 1].
The hypothesis function of logistic regression can be expressed as
$$h_\theta(x)=g(\theta^Tx)$$
This conversion function g is called the Sigmoid function. Its expression is:
$$g(z)=\frac{1}{1+e^{-z}}$$
Let's take a look at the graph of the Sigmoid function:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.linspace(-10, 10, 500)
y = sigmoid(x)
plt.plot(x, y)
plt.show()

[Figure: graph of the Sigmoid function, an S-shaped curve that rises from 0 to 1 and passes through 0.5 at x = 0]
So we get this relationship:

  • When X (that is, $\theta^Tx$) > 0, the value (probability) of Y is greater than 0.5, and the label is 1
  • When X < 0, the value (probability) of Y is less than 0.5, and the label is 0
  • When X = 0, the value of Y = 0.5

For binary classification problems, such as whether there is a tumor, if the probability is greater than 0.5, the prediction is a clear 1, and if the probability is less than 0.5, the prediction is a clear 0.
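The thresholding rule above can be sketched in a few lines. The helper name `predict_label` and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_label(theta, x, threshold=0.5):
    """Return (probability, label) for one sample x under parameters theta."""
    p = sigmoid(np.dot(theta, x))
    return p, int(p > threshold)

theta = np.array([1.0, -2.0])   # hypothetical parameters, for illustration only
print(predict_label(theta, np.array([3.0, 1.0])))   # theta^T x = 1 > 0
print(predict_label(theta, np.array([1.0, 1.0])))   # theta^T x = -1 < 0
```

The first sample has $\theta^Tx > 0$ and is labeled 1; the second has $\theta^Tx < 0$ and is labeled 0.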

decision boundary

Let's look at another example. Suppose we have many samples plotted in a figure, and assume that we have already found the parameters of the LR model through some method (as shown below).
[Figure: the samples and the parameters of the fitted LR model]
According to the relationship obtained above, we can get:
[Figure: the resulting condition for predicting y=1]
Drawing it on the plot, we get:
[Figure: the samples separated by a straight line]
All samples above the straight line are then positive samples (y=1), and all samples below it are negative samples (y=0). We therefore call this line the decision boundary.
The following code is for reference only. It generates random points, labels them by whether x1+x2>=3, trains sklearn's logistic regression on them, and draws the boundary (this requires installing the mlxtend library).

#%%
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate 100 random points; a point is labeled 1 when x1 + x2 >= 3, otherwise 0.
np.random.seed(100)
x1 = np.random.uniform(0, 6, 100)
x2 = x1 - 3 + np.random.normal(0, 3, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
y = np.array([1 if i1 + i2 >= 3 else 0 for i1, i2 in zip(x1, x2)])
color = np.array(['red' if i1 + i2 >= 3 else 'blue' for i1, i2 in zip(x1, x2)])
plt.scatter(x1, x2, c=color)
plt.show()

# Train a logistic regression model
lr_model = LogisticRegression(max_iter=2100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr_model.fit(X_train, y_train)

# Plot the decision regions and the learned boundary
plot_decision_regions(X, y, clf=lr_model, legend=2)
plt.show()

Output:
[Figure: the sample points with the two decision regions separated by the learned straight-line boundary]

In the same way, for cases that are not linearly separable, we only need to introduce polynomial features to make classification predictions, as shown below:
[Figure: samples separated by a circular decision boundary]
The following code generates random points near the circle $x^2+y^2=1$.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate random points near the circle x**2 + y**2 = 1.
np.random.seed(100)
x1 = np.random.uniform(-1, 1, 100)
x2 = np.sqrt(1 - x1**2) + np.random.normal(-1, 1, size=100)
X = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
# Label 1 for points on or outside the unit circle, 0 for points inside it.
y = np.array([1 if i1**2 + i2**2 >= 1 else 0 for i1, i2 in zip(x1, x2)])
# Equivalent vectorized form:
# y = np.where(x1**2 + x2**2 < 1, 0, 1)

# Train a logistic regression model
lr_model = LogisticRegression(max_iter=2100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr_model.fit(X_train, y_train)
print("intercept", lr_model.intercept_)
print("coefficients", lr_model.coef_)

# Draw the unit circle, i.e. the boundary that was used to label the data
circle = plt.Circle((0, 0), radius=1, color='black', fill=False)
fig, ax = plt.subplots()
ax.add_artist(circle)
ax.scatter(x1[y == 0], x2[y == 0], color='blue', s=5)
ax.scatter(x1[y == 1], x2[y == 1], color='red', s=5)
# Axis limits: [xmin, xmax, ymin, ymax]
plt.axis([-2, 2, -2, 2])
#plt.axis('equal')
plt.show()

Output
[Figure: the blue (y=0) and red (y=1) points with the unit circle drawn as the boundary]
It should be noted that the decision boundary is not a property of the training set, but a property of the hypothesis itself and the parameters. Because the training set cannot define the decision boundary, it is only responsible for fitting parameters; and only when the parameters are determined, the decision boundary can be determined.
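As a rough illustration of the polynomial-feature idea mentioned above, the sketch below compares plain logistic regression with a pipeline that first expands the inputs to degree-2 polynomial features. The synthetic data, the variable names, and the use of `PolynomialFeatures` with `make_pipeline` are assumptions made for this example; they are not taken from the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for this sketch): points uniform in a square,
# labeled 1 on or outside the unit circle and 0 inside it.
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 1).astype(int)

# Plain logistic regression: the boundary is a straight line in (x1, x2).
linear_model = LogisticRegression(max_iter=2100).fit(X, y)

# Degree-2 polynomial features add x1^2, x1*x2 and x2^2, so a boundary that is
# linear in the expanded features can be a circle in the original plane.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=2100),
).fit(X, y)

linear_acc = linear_model.score(X, y)
poly_acc = poly_model.score(X, y)
print("linear features accuracy:", linear_acc)
print("degree-2 features accuracy:", poly_acc)
```

The parameters still only fix a linear boundary in the expanded feature space; it is the choice of hypothesis (which features are fed in) that allows the boundary to be curved in the original coordinates.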

loss function

The loss function is used to measure the difference between the model's output and the real output.
Suppose there are only two labels, 1 and 0: $y_n\in\{0,1\}$. If we regard a collected sample as an event, we denote by p the probability that its label is 1. The output y of our model is this probability p.

Cross-Entropy (CE) and Mean Squared Error (MSE) are two loss functions commonly used in machine learning. They are widely used in many fields, such as classification, regression and other tasks. The difference between them is that for different types of problems, the theoretical basis and optimization effect are different.

  • MSE is often used in regression problems to measure the difference between predicted values and true values
  • CE is usually used in classification problems to measure the difference between the predicted results and the true labels. For a two-class classification problem with true label y and predicted probability p, its formula for a single sample is $-[y\ln(p)+(1-y)\ln(1-p)]$, which is derived in the CE section below.

MSE (Mean Square Error)

Why doesn't the loss function use least squares? That is, why does the logistic regression loss function use cross-entropy (next section) instead of MSE?
Intuitively, the predicted value of logistic regression is a probability, and cross entropy measures how close the predicted probability distribution is to the true probability distribution, so cross entropy is the natural choice.
MSE, in contrast, measures Euclidean distance. In a classification problem the labels are categories with no ordering or magnitude, so the Euclidean distance between a predicted probability and a label is not a meaningful error measure, and MSE is not a good fit.

Reason 1: The convexity of the loss function (using MSE may fall into a local optimum).
When we introduced the linear regression model earlier, we gave the cost function of linear regression (the sum of squared errors). Its form is:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m\frac{1}{2}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
Logistic regression can also be regarded as a generalized linear model. So can the most widely used cost function for linear models, the sum of squared errors, be applied to logistic regression? The answer is no. The reason is that the outer function of the LR hypothesis is the Sigmoid function, which is nonlinear. Substituting the logistic regression hypothesis
$$h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$$
into the equation above, we get
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m\frac{1}{2}\left(\frac{1}{1+e^{-\theta^Tx^{(i)}}}-y^{(i)}\right)^2$$

which is a non-convex function, as shown below:
[Figure: a non-convex loss curve with several local minima]
Such a function can have multiple local minima, so when we use gradient descent to minimize it, the result is not guaranteed to be the global minimum and may well be a local minimum.

Logistic regression with MSE as the loss function gives a non-convex function. How can this be proved? To prove that a function is convex, it is enough to show that its second-order derivative is always greater than or equal to 0. If the second-order derivative is not always greater than or equal to 0, the function is non-convex.

  • Convex: on the given interval, every point of the line segment connecting any two points on the graph of the function lies on or above the graph.
    A typical convex function is $y=x^2$.
    All points on the line connecting any two points of its graph are on or above the graph, as shown below:
    [Figure: graph of a convex function with a chord between two points]

  • Non-Convex: the function can have multiple extreme values in the interval, that is, the system has multiple stable equilibrium states.
    For the non-convex function $y=\sin(x)$, the points on the line connecting two points of the graph may lie on both sides of the graph, as shown below:
    [Figure: graph of $y=\sin(x)$ with a chord crossing the curve]
    Here it is enough to show that the second-order derivative of the loss above is not always greater than or equal to 0. The proof is omitted here; it can be found with a web search.
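Although the proof is omitted, the claim can be checked numerically in a simple case. The sketch below is only an illustration under stated assumptions: a single training sample with x = 1 and y = 1 and a scalar parameter, so that both losses become one-dimensional curves. It is not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One training sample with x = 1, y = 1, and a scalar parameter theta,
# chosen only to keep the illustration one-dimensional.
theta = np.linspace(-6, 6, 241)
p = sigmoid(theta)

mse_loss = 0.5 * (p - 1.0) ** 2   # squared-error loss for label y = 1
ce_loss = -np.log(p)              # cross-entropy loss for label y = 1

# Second-order finite differences approximate the second derivative
# (up to a positive factor h^2), so their sign indicates convexity.
mse_second_diff = np.diff(mse_loss, n=2)
ce_second_diff = np.diff(ce_loss, n=2)

print("MSE: smallest second difference =", mse_second_diff.min())
print("CE:  smallest second difference =", ce_second_diff.min())
```

In this setting the squared-error curve has a region of negative curvature, while the cross-entropy curve stays positively curved over the whole range.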

CE (cross entropy)

The logistic regression loss function is used to measure the difference between the model's output and the true output.
Suppose there are only two labels, 1 and 0: $y_n\in\{0,1\}$. For a collected sample, the output of our model is the probability that its label is 1, which we denote by p:
$$P_{y=1}=\frac{1}{1+e^{-\bm{\theta}^T\bm{x}}} = p$$
Because the label is either 1 or 0, the probability of the label being 0 is $P_{y=0} = 1-p$.
If we regard a single sample as an event, the probability of this event occurring is:
$$P(y|\bm{x})=\left\{ \begin{aligned} p, \quad & y=1 \\ 1-p, \quad & y=0 \end{aligned} \right.$$
This piecewise function is inconvenient to calculate with. It is equivalent to:
$$P(y_i|\bm{x}_i) = p^{y_i}(1-p)^{1-y_i}$$
To explain this function: for a collected sample $(\bm{x}_i,y_i)$, the probability that its label is $y_i$ is $p^{y_i}(1-p)^{1-y_i}$ (when y=1, the result is p; when y=0, the result is 1-p).
If we collect a set of N data points $\{(\bm{x}_1,y_1),(\bm{x}_2,y_2),(\bm{x}_3,y_3),\dots,(\bm{x}_N,y_N)\}$, how do we find the total probability of this combined event? Assuming the samples are independent, we simply multiply the probabilities of the individual samples, which gives the probability of collecting this set of samples:
$$\begin{aligned} P_{total} &= P(y_1|\bm{x}_1)P(y_2|\bm{x}_2)P(y_3|\bm{x}_3)\cdots P(y_N|\bm{x}_N) \\ &= \prod_{n=1}^{N}p^{y_n}(1-p)^{1-y_n} \end{aligned}$$

Note that $P_{total}$ is a function whose only unknown quantity is $\bm{\theta}$ (inside p).
Since a product of many terms is awkward to work with, we take the logarithm of both sides to turn the product into a sum. The logarithm prevents numerical underflow and, being monotonic, does not change where the extreme value is attained, so the same $\bm{\theta}$ is found:
$$\begin{aligned} F(\bm{\theta})=\ln(P_{total}) &= \ln\left(\prod_{n=1}^{N}p^{y_n}(1-p)^{1-y_n}\right) \\ &= \sum_{n=1}^{N}\ln\left(p^{y_n}(1-p)^{1-y_n}\right) \\ &= \sum_{n=1}^{N}\left(y_n\ln(p) + (1-y_n)\ln(1-p)\right) \end{aligned}$$

where $p = \frac{1}{1+e^{-\bm{\theta}^T\bm{x}}}$.
The value of $F(\bm{\theta})$ is the logarithm of the total probability of the observed samples, and the larger it is, the better. This is contrary to the meaning of a loss, so we put a negative sign in front: the smaller $-F(\bm{\theta})$ is, the better, and $-F(\bm{\theta})$ is called the loss function. The loss function can be understood as a function that measures the difference between the output of our current model and the actual output.
This loss function is the cross entropy:

$$\begin{aligned} J(\bm{\theta}) = -\ln(P_{total}) = -\sum_{n=1}^{N}\left(y_n\ln(p) + (1-y_n)\ln(1-p)\right) \end{aligned}$$

In logistic regression, we usually use cross-entropy as a loss function to evaluate the difference between model predictions and actual labels. The goal of the loss function is to minimize the model's prediction error so that the model can more accurately predict the label of the data.

In actual training, we usually divide the entire training set into several small batches (batch), and use the data of each small batch to update the model parameters. The advantage of this is that it can increase the speed of model training and reduce the volatility of the model during the training process.

To keep the number of samples from affecting the magnitude of the loss, the average of the per-sample loss values is usually used as the final loss value. In other words, the loss is computed as the average over all samples in the training set:
$$\begin{aligned} J(\bm{\theta})= -\dfrac{1}{N}\sum_{n=1}^{N}\left(y_n\ln(p) + (1-y_n)\ln(1-p)\right) \end{aligned}$$
Dividing by the number of samples N converts the total loss into the average loss of a single sample, that is, the mean of the per-sample loss values measures the model's error on a single sample. Doing so makes the optimization algorithm more stable and reliable.
For example, suppose we have 1000 samples in our training set and we divide it into 10 mini-batches (each mini-batch contains 100 samples) for training. The model makes predictions for each mini-batch of samples.

Then, use the following formula to calculate the cross-entropy loss of this model:

$$\text{CE}(\boldsymbol{y}, \boldsymbol{\hat{y}}) = J(\bm{\theta}) = -\dfrac{1}{10 \times 100} \sum_{i=1}^{10 \times 100} \bigg[ y_i \log_e \hat{y}_i + (1 - y_i) \log_e (1 - \hat{y}_i) \bigg]$$

This is the average loss value over the 1000 samples. In actual training, the optimization algorithm usually updates the parameters iteratively on small batches, and using the average loss keeps the loss computed on a single small batch on the same scale as the loss over the entire training set.

If we did not divide the loss by the number of samples N, training sets (or batches) of different sizes would produce losses of different magnitudes, causing unnecessary fluctuation and instability in the optimization.
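The averaged cross-entropy can be written directly in NumPy. This is a minimal sketch: the function name `binary_cross_entropy`, the clipping constant `eps`, and the sample values are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy over all samples (natural logarithm)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_sample.mean()

y_true = np.array([1, 0, 1, 0])           # made-up labels
y_prob = np.array([0.9, 0.1, 0.8, 0.3])   # made-up predicted probabilities
loss_small = binary_cross_entropy(y_true, y_prob)
# Repeating the same samples 10 times leaves the average unchanged,
# whereas an un-averaged sum would grow by a factor of 10.
loss_large = binary_cross_entropy(np.tile(y_true, 10), np.tile(y_prob, 10))
print(loss_small, loss_large)
```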

CE gradient derivation

To find the $\bm{\theta}$ that minimizes the loss using the gradient descent method, we first need the gradient of the loss function.
We also need to know how to differentiate with respect to a vector. For the specific derivation process and principles, please refer to the Gradient Descent article.

First we know that
$$p=\frac{1}{1+e^{-\bm{\theta}^T\bm{x}}}$$
Then
$$1-p=1-\frac{1}{1+e^{-\bm{\theta}^T\bm{x}}}=\frac{1+e^{-\bm{\theta}^T\bm{x}}}{1+e^{-\bm{\theta}^T\bm{x}}}-\frac{1}{1+e^{-\bm{\theta}^T\bm{x}}}=\frac{e^{-\bm{\theta}^T\bm{x}}}{1+e^{-\bm{\theta}^T\bm{x}}}$$
p is a function of the variable $\bm{\theta}$. Differentiating p with respect to $\bm{\theta}$ and expanding step by step with the chain rule, we get:
$$\begin{aligned} p' = f'(\bm{\theta}) &= \left(\frac{1}{1+e^{-\bm{\theta}^T\bm{x}}}\right)' \\ &= -\frac{1}{(1+e^{-\bm{\theta}^T\bm{x}})^2} \cdot (1+e^{-\bm{\theta}^T\bm{x}})' \\ &= -\frac{1}{(1+e^{-\bm{\theta}^T\bm{x}})^2} \cdot e^{-\bm{\theta}^T\bm{x}} \cdot (-\bm{\theta}^T\bm{x})' \\ &= -\frac{1}{(1+e^{-\bm{\theta}^T\bm{x}})^2} \cdot e^{-\bm{\theta}^T\bm{x}} \cdot (-\bm{x}) \\ &= \frac{e^{-\bm{\theta}^T\bm{x}}}{(1+e^{-\bm{\theta}^T\bm{x}})^2} \cdot \bm{x} \\ &= \frac{1}{1+e^{-\bm{\theta}^T\bm{x}}} \cdot \frac{e^{-\bm{\theta}^T\bm{x}}}{1+e^{-\bm{\theta}^T\bm{x}}} \cdot \bm{x} \\ &= p(1-p)\bm{x} \end{aligned}$$

The above is all preparation. In short, we have to remember:
$$p' = p(1-p)\bm{x}$$
Then
$$(1-p)'=1'-p'=-p'= -p(1-p)\bm{x}$$
Next we take the gradient of the log-likelihood $F(\bm{\theta})$ (recall that the loss is $J(\bm{\theta})=-F(\bm{\theta})$). During the derivation, always remember that our only variable is $\bm{\theta}$; $y_n$ and $\bm{x}_n$ are known constants, and p in the n-th term is $p=\frac{1}{1+e^{-\bm{\theta}^T\bm{x}_n}}$:
$$\begin{aligned} \nabla F(\bm{\theta}) &= \nabla\left(\sum_{n=1}^{N}\left(y_n\ln(p) + (1-y_n)\ln(1-p)\right)\right) \\ &= \sum_{n=1}^{N}\left(y_n\ln'(p) + (1-y_n)\ln'(1-p)\right) \\ &= \sum_{n=1}^{N}\left(y_n\frac{1}{p}p' + (1-y_n)\frac{1}{1-p}(1-p)'\right) \\ &= \sum_{n=1}^{N}\left(y_n(1-p)\bm{x}_n - (1-y_n)p\bm{x}_n\right) \\ &= \sum_{n=1}^{N}(y_n-p)\bm{x}_n \end{aligned}$$

Finally, we have found the expression of the gradient $\nabla F(\bm{\theta})$. Let's take a look at what it looks like:
$$\nabla F(\bm{\theta}) = \sum_{n=1}^{N}(y_n-p)\bm{x}_n$$
It is simple and elegant, which is one of the reasons why we chose the sigmoid function. The gradient of the loss is its negative, $\nabla J(\bm{\theta}) = -\nabla F(\bm{\theta})$. Of course, we can also expand p again:

$$\nabla F(\bm{\theta}) = \sum_{n=1}^{N}\left(y_n-\frac{1}{1+e^{-\bm{\theta}^T\bm{x}_n}}\right)\bm{x}_n$$
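One way to gain confidence in a derived gradient is to compare it with a finite-difference approximation. The sketch below does this for the sum $\sum_n (y_n\ln p + (1-y_n)\ln(1-p))$ and the closed form $\sum_n (y_n-p)\bm{x}_n$. The random data and all function names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # sum_n (y_n - p_n) x_n, written as a matrix product
    return X.T @ (y - sigmoid(X @ theta))

def numeric_grad(theta, X, y, h=1e-6):
    # Central finite differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = h
        grad[k] = (log_likelihood(theta + step, X, y)
                   - log_likelihood(theta - step, X, y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # 20 random samples, 3 features
y = rng.integers(0, 2, size=20).astype(float)    # random 0/1 labels
theta = rng.normal(size=3)

print(analytic_grad(theta, X, y))
print(numeric_grad(theta, X, y))
```

The two printed vectors should agree to several decimal places.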

Gradient Descent (GD) and Stochastic Gradient Descent (SGD)

Now we have obtained the gradient $\nabla J(\bm{\theta})$ of the loss function $J(\bm{\theta})$ at any $\bm{\theta}$, but how do we compute $\bm{\theta}^*$? Returning to the earlier question, we want the value $\bm{\theta}^*$ at which the loss function takes its minimum:
$$\bm{\theta}^* = \arg\min_{\bm{\theta}}J(\bm{\theta})$$

Gradient Descent can be used to solve this problem. The core idea is to initialize $\bm{\theta}_0$ arbitrarily.
Then, given a step size $\eta$, we repeatedly modify $\bm{\theta}$ so that it finally reaches the point where the minimum is attained. That is, the following iteration is repeated until the specified number of iterations is reached or the gradient is (approximately) equal to 0:
$$\bm{\theta}_{t+1} = \bm{\theta}_t + \eta\nabla F(\bm{\theta}_t)$$
Since $J(\bm{\theta})=-F(\bm{\theta})$, moving along $+\nabla F$ is the same as moving along $-\nabla J$, i.e. descending the loss.

In Stochastic Gradient Descent (SGD), a little noise is added to each update, which may let the iteration approach the optimal value more quickly. In SGD we do not use $\nabla F(\bm{\theta})$ directly; instead we use a substitute function $G(\bm{\theta})$ whose output is a random variable:
$$\bm{\theta}_{t+1} = \bm{\theta}_t + \eta G(\bm{\theta}_t)$$
Of course, the substitute function $G(\bm{\theta})$ must have an expected value equal to $\nabla F(\bm{\theta})$, which means that its output fluctuates randomly around $\nabla F(\bm{\theta})$.

Here I will first explain a question: Why can the gradient descent method be used?

Because the loss function of logistic regression is a continuous convex function. Such a function has only a global optimum and no other local optima. The biggest potential problem with GD and SGD is that they may get stuck in a local optimum, but this problem does not arise in logistic regression because of this property of its loss function. When GD or SGD converges, the extreme point obtained must be the global optimum, so we can safely use GD and SGD to solve the problem.

So how do we implement the learning algorithm? It is quite simple. Note that the GD derivation uses all sample points every time: all samples from 1 to N take part in the gradient calculation.
$$\nabla J(\bm{\theta}) = -\sum_{n=1}^{N}\left(y_n-\frac{1}{1+e^{-\bm{\theta}^T\bm{x}_n}}\right)\bm{x}_n$$
In SGD, each time we only need to pick one sample $(\bm{x}_i,y_i)$ uniformly at random and use it to represent the whole sample set. Multiplying its contribution by N gives an unbiased estimate of the gradient, that is, $E(G(\bm{\theta})) = \nabla F(\bm{\theta})$.

E denotes the expected value, usually written E(X), where X is a random variable; it is the probability-weighted average of all possible values of this random variable.

In this way the summation disappears. Since $\eta$ and N are both constants, N can simply be absorbed into $\eta$, and the SGD update becomes
$$\bm{\theta}_{t+1} = \bm{\theta}_t + \eta\left(y_i-\frac{1}{1+e^{-\bm{\theta}_t^T\bm{x}_i}}\right)\bm{x}_i$$
where $(\bm{x}_i,y_i)$ is a sample drawn at random from all samples.
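The two update rules above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the function names `fit_gd` and `fit_sgd`, the step sizes, the iteration counts, and the synthetic data are all assumptions. In the full-batch version the gradient is divided by N, which corresponds to absorbing the constant N into $\eta$ as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, eta=0.5, n_iter=1000):
    """Full-batch update: theta <- theta + eta * mean_n (y_n - p_n) x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ theta))   # sum over all N samples
        theta = theta + eta * grad / len(y)     # N absorbed into the step size
    return theta

def fit_sgd(X, y, eta=0.1, n_iter=5000, seed=0):
    """Stochastic update: one uniformly sampled (x_i, y_i) per step."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(y))
        theta = theta + eta * (y[i] - sigmoid(X[i] @ theta)) * X[i]
    return theta

def accuracy(theta, X, y):
    return np.mean((sigmoid(X @ theta) > 0.5) == y)

# Synthetic, linearly separable data; the leading column of ones plays the
# role of the constant feature that carries theta_0.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
y = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), features])

theta_gd = fit_gd(X, y)
theta_sgd = fit_sgd(X, y)
print("GD :", theta_gd, accuracy(theta_gd, X, y))
print("SGD:", theta_sgd, accuracy(theta_sgd, X, y))
```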

Interpretability

The biggest feature of logistic regression is its strong interpretability.
After model training is completed, we obtain an n-dimensional weight vector $\bm{\theta}^*$ and a bias $\theta_0$.
For the weight vector $\bm{\theta}^*$, the value of each dimension represents the contribution of that dimension's feature to the final classification result. If the value is positive, the feature contributes positively to the result, and the larger the value, the more important the feature is for classifying a sample as positive.

The bias $\theta_0$ represents, to a certain extent, how easily a sample is judged positive or negative. If $\theta_0$ is 0, the positive and negative categories are equally favored. If $\theta_0$ is greater than 0, a sample is more likely to be classified into the positive category, and vice versa.

From the size of the weight on each feature in logistic regression, you can get a clear and quantitative understanding of the importance of each feature (provided the features are on comparable scales). This is why the logistic regression model has strong interpretability.
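As a small illustration of reading the weights, the sketch below fits sklearn's `LogisticRegression` on synthetic data whose generating rule is known. The data, the feature names, and the "ground truth" coefficients are hypothetical and exist only so that the learned weights can be compared with what was put in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Hypothetical ground truth: feature 0 pushes towards class 1,
# feature 1 pushes towards class 0, feature 2 is irrelevant.
logits = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
weights = model.coef_[0]
for name, w in zip(["feature_0", "feature_1", "feature_2"], weights):
    print(f"{name}: weight = {w:+.3f}")
print(f"bias: {model.intercept_[0]:+.3f}")
```

The sign of each learned weight indicates the direction of the feature's contribution, and the magnitude indicates its strength; here all features share the same scale, which is what makes the magnitudes comparable.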

regular term

For linear regression models, the model using L1 regularization is called Lasso regression, and the model using L2 regularization is called Ridge regression. Regularization is used to solve the over-fitting problem; for details, please refer to: https://blog.csdn.net/liaomin416100569/article/details/130289602?spm=1001.2014.3001.5501.

How to handle multi-class problems with logistic regression

Logistic regression in its basic form handles only binary classification problems. If the actual problem is multi-class, some changes to the model are required. The following are commonly used methods of using logistic regression for multi-class classification:

One vs One

OvO's method is to pick two categories out of the multiple categories, feed the corresponding samples into a logistic regression model, and learn a classifier for these two categories; these steps are repeated until there is a classifier for every pair of categories.
Assuming that there are four categories, the number of classifiers is 6. The table is as follows:
[Table: one classifier for each of the 6 category pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)]
The number of classifiers is given directly by $C_k^2=\frac{k(k-1)}{2}$, where k represents the number of categories.

When predicting, you need to run each model and then record the prediction results of each classifier. That is, each classifier votes once, and the category with the most votes is the final multi-classification result.

For example, in the above example, if 3 of the 6 classifiers voted for category 3, 1 voted for category 2, 1 voted for category 1, and the last one voted for category 0, then category 3 is taken as the final prediction result.

In the OvO method, when there are many categories to predict, there are also many classifiers to train, which increases the training overhead. On the other hand, each classifier only needs the training samples of its two categories as input, which reduces the overhead of each individual training run.

From the perspective of prediction, this method requires running a lot of classifiers and cannot reduce the prediction time complexity of each classifier, so the prediction overhead is large.
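The OvO scheme described above can be tried with sklearn's `OneVsOneClassifier` wrapper. This is a sketch under assumptions: the Iris dataset (3 classes) is used only because it ships with sklearn, and the `itertools.combinations` line merely enumerates the pairs for the four-category example.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

# With k = 4 categories there are k*(k-1)/2 = 6 pairs, matching the text
pairs = list(combinations(range(4), 2))
print(pairs)

# Iris has k = 3 classes, so OvO trains 3 pairwise classifiers
X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of classifiers:", len(ovo.estimators_))
print("predicted classes for the first 5 samples:", ovo.predict(X[:5]))
```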

One vs All

Applicable problem: one sample may correspond to multiple labels.
OvA's method is to select one category out of all categories as 1 and treat all other categories as 0 to train a classifier, so the number of classifiers is much smaller than that of OvO.

[Table: one classifier per category, each trained as that category versus all the others]
As you can see from the above example, the number of classifiers is simply the number of categories, which is k.

Although the number of classifiers decreases, every classifier needs all of the training data as input, so the training time complexity of each classifier is higher than in OvO.

From the perspective of prediction, because the number of classifiers is small and the prediction time complexity of each classifier remains unchanged, the overall prediction time complexity is smaller than that of OvO.

The prediction results are determined by ranking each classifier according to the probability of its corresponding category 1, and selecting the category with the highest probability as the final prediction category.
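The OvA procedure, including the "pick the highest probability" rule, can be sketched with sklearn's `OneVsRestClassifier` (sklearn's name for OvA). The Iris dataset is again an assumption made only to have three classes at hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # k = 3 classes
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of classifiers:", len(ova.estimators_))   # one per class

# Probability of "category 1" from each binary classifier, for the first sample
scores = np.array([est.predict_proba(X[:1])[0, 1] for est in ova.estimators_])
chosen = ova.classes_[scores.argmax()]
print("per-class probabilities:", scores)
print("chosen class:", chosen)
```

Selecting the class whose binary classifier reports the highest probability by hand gives the same answer as calling `ova.predict` on that sample.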

sklearn in practice

LogisticRegression trains breast cancer tumor classification

The sklearn.linear_model module provides many models for us to use, such as Logistic regression, Lasso regression, Bayesian ridge regression, etc., so there is a lot to learn. Here we use LogisticRegression.
Let us first take a look at the LogisticRegression class, whose constructor has a total of 14 parameters:

Parameter description is as follows:

  • penalty: the penalty term, str type; the options are l1 and l2, and the default is l2. It specifies the norm used in the penalty term. The newton-cg, sag and lbfgs solvers only support the L2 norm. The L1 norm assumes that the model parameters follow a Laplace distribution, and the L2 norm assumes that they follow a Gaussian distribution. The norm adds a constraint on the parameters so that the model does not overfit. Whether adding the constraint actually gives a better result cannot be answered in general; we can only say that, in theory, adding the constraint should give results with stronger generalization ability.
  • dual: dual or primal formulation, bool type, default is False. The dual formulation is only implemented for the L2 penalty with the liblinear solver. When the number of samples > the number of features, dual is usually set to False.
  • tol: the criterion for stopping the solver, float type, default is 1e-4. When the solution reaches this tolerance, the solver stops and the optimal solution is considered found.
  • C: the reciprocal of the regularization coefficient λ, float type, default is 1.0. It must be a positive floating point number. As in SVM, smaller values indicate stronger regularization.
  • fit_intercept: whether to include an intercept (bias), bool type, default is True.
  • intercept_scaling: only useful when the solver is liblinear and fit_intercept is set to True. float type, default is 1.
  • class_weight: Used to mark various types of weights in the classification model. It can be a dictionary or a 'balanced' string. The default is not to input, that is, the weight is not considered, which is None. If you choose to input, you can select balanced to let the class library calculate the type weights by itself, or enter the weights of each type yourself. For example, for a binary model of 0,1, we can define class_weight={0:0.9,1:0.1}, so that the weight of type 0 is 90%, and the weight of type 1 is 10%. If class_weight selects balanced, the class library will calculate the weight based on the training sample size. The larger the sample size of a certain type, the lower the weight, and the smaller the sample size, the higher the weight. When class_weight is balanced, the class weight calculation method is as follows: n_samples / (n_classes * np.bincount(y)). n_samples is the number of samples, n_classes is the number of categories, np.bincount(y) will output the number of samples of each class, for example, y=[1,0,0,1,1], then np.bincount(y)=[2 ,3].
    So what does class_weight do?
    In classification models, we often encounter two types of problems:
    1. The first is that misclassification is very costly. For example, when classifying legal and illegal users, classifying an illegal user as legal is very costly. We would rather classify a legal user as illegal, because that mistake can be caught by a manual review afterwards, than let an illegal user pass as legal. In this case we can appropriately increase the weight of the illegal-user class.
    2. The second is that the samples are highly unbalanced. For example, we have 10,000 binary samples of legal and illegal users, of which 9,995 are legal and only 5 are illegal. If we ignore the weights, we can predict every sample in the test set as a legal user and reach a theoretical accuracy of 99.95%, but this is meaningless. In this case we can select balanced to let the library automatically increase the weight of the illegal-user samples. Increasing the weight of a class causes more samples to be assigned to that high-weight class than when weights are ignored, which addresses both of the problems above.
  • random_state: random number seed, int type, optional, default is None. It is only used when the solver is sag, saga or liblinear.
  • solver: selects the optimization algorithm. There are five options: newton-cg, lbfgs, liblinear, sag and saga. The default is lbfgs in scikit-learn 0.22 and later (liblinear in earlier releases). The solver determines how the logistic regression loss function is optimized. The algorithms are:
    liblinear: implemented with the open source liblinear library, which internally uses coordinate descent to iteratively optimize the loss function.
    lbfgs: a quasi-Newton method that iteratively optimizes the loss function using an approximation of its second-derivative matrix (the Hessian).
    newton-cg: also a member of the Newton method family. It uses the second-derivative matrix of the loss function (the Hessian) to iteratively optimize the loss function.
    sag: stochastic average gradient descent, a variant of gradient descent. Unlike ordinary gradient descent, each iteration uses only part of the samples to compute the gradient, which makes it suitable when there is a lot of sample data.
    saga: a variant of sag, a linearly convergent stochastic optimization algorithm, that also supports L1 regularization.
    Summary:
    liblinear is suitable for small data sets, while sag and saga are suitable for large data sets because they are faster.
    For multi-class problems, only newton-cg, sag, saga and lbfgs can handle the multinomial loss, while liblinear is limited to one-vs-rest (OvR). This means that with liblinear, a multi-class problem is solved by treating one class as the positive class and all remaining classes as the negative class, and repeating this for every class in turn.
    The three solvers newton-cg, sag and lbfgs all require the loss function to have continuous first or second derivatives, so they cannot be used with L1 regularization, which has no continuous derivative, but only with L2 regularization. liblinear and saga support both L1 and L2 regularization.
    In addition, sag uses only part of the samples in each gradient iteration, so do not choose it when the sample size is small. If the sample size is very large, for example greater than 100,000, sag is the first choice. But sag cannot be used with L1 regularization, so when you have a large number of samples and need L1 regularization you have to make a trade-off: switch to saga, reduce the sample size by subsampling, or go back to L2 regularization.
    From the description above, you may think that since newton-cg, lbfgs and sag have so many restrictions, we can simply choose liblinear whenever the sample is not large. This is wrong, because liblinear has its own weakness. Logistic regression includes binary logistic regression and multi-class logistic regression. There are two common approaches to multi-class logistic regression: one-vs-rest (OvR) and many-vs-many (MvM). MvM classification is generally more accurate than OvR classification. Unfortunately, liblinear only supports OvR and does not support MvM, so if we need relatively accurate multi-class logistic regression, we cannot choose liblinear. It also means that liblinear's L1 regularization is not available in that case; of the solvers listed above, only saga supports both.
  • max_iter: the maximum number of iterations the solver may take to converge, int type, default is 100. Only useful when the solver is newton-cg, sag or lbfgs.
  • multi_class: selects the multi-class strategy, str type. The options discussed here are ovr and multinomial, and the default is ovr (auto in scikit-learn 0.22 and later). ovr is the one-vs-rest (OvR) approach mentioned above. multinomial is what this tutorial groups with the many-vs-many (MvM) approach mentioned above; strictly speaking, it fits a single model with the multinomial (softmax) loss over all classes. For binary logistic regression there is no difference between ovr and multinomial. The difference matters mainly in multi-class logistic regression.
    What is the difference between OvR and MvM?
    The idea of OvR is very simple. No matter how many classes the problem has, we can reduce it to binary logistic regression. Specifically, to decide the Kth class, we take all samples of the Kth class as positive examples and all other samples as negative examples, then fit a binary logistic regression on them to obtain the classification model for the Kth class. The models for the other classes are obtained in the same way.
    MvM is relatively complicated. Here we use a special case of MvM, one-vs-one (OvO), for explanation. If the model has T classes, we select two classes of samples at a time from all T classes. Call them class T1 and class T2. We put all samples labeled T1 or T2 together, use T1 as the positive example and T2 as the negative example, fit a binary logistic regression, and obtain the model parameters. We need a total of T(T-1)/2 classifiers.
    It can be seen that OvR is relatively simple, but its classification performance is relatively poor (this holds for most sample distributions; OvR may be better under certain distributions). MvM classification is relatively accurate, but it is not as fast as OvR. If ovr is selected, any of the solvers liblinear, newton-cg, lbfgs, sag and saga can be used. If multinomial is selected, only newton-cg, lbfgs, sag and saga can be used.
  • verbose: log verbosity, int type, default is 0, which means the training process is not printed. With 1, progress is printed occasionally. With a value greater than 1, output is printed for each sub-model.
  • warm_start: warm start parameter, bool type, default is False. If True, the next call to fit reuses the solution of the previous call as its initialization instead of starting from scratch.
  • n_jobs: number of parallel jobs, int type. The default is None, which normally means 1 (older releases used 1 as the default). With 1, the program runs on one CPU core; with 2, it runs on 2 CPU cores. With -1, all CPU cores are used.
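
The balanced class_weight formula given above can be checked numerically. The following is a minimal sketch that applies n_samples / (n_classes * np.bincount(y)) to the example y=[1,0,0,1,1] and compares the result with scikit-learn's compute_class_weight helper. The variable names manual_weights and library_weights are only illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Example labels from the class_weight description: two samples of class 0, three of class 1
y = np.array([1, 0, 0, 1, 1])

n_samples = len(y)
classes = np.unique(y)
n_classes = len(classes)

# Formula from the text: n_samples / (n_classes * np.bincount(y))
manual_weights = n_samples / (n_classes * np.bincount(y))

# The same weights as computed by scikit-learn for class_weight='balanced'
library_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("bincount:", np.bincount(y))      # [2 3]
print("manual weights:", manual_weights)
print("library weights:", library_weights)
```

The rarer class 0 receives the larger weight (5 / (2 * 2) = 1.25) and the more frequent class 1 the smaller one (5 / (2 * 3) ≈ 0.83), which matches the description that a smaller sample size leads to a higher weight.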

Binary classification using clinical measures of breast cancer tumors to demonstrate logistic regression

load_breast_cancer: breast cancer data set, with a total of 569 samples, of which 212 are malignant and 357 are benign.
In the label, 0 means malignant and 1 means benign.

#%%
"""
The load_breast_cancer dataset in sklearn is a binary classification dataset
containing clinical measurements of breast cancer tumors.
"""
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X = data.data    # features
y = data.target  # labels

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

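# Define the logistic regression model
"""
Logistic regression is a widely used binary classification model that is usually
fitted with an iterative optimization algorithm such as gradient descent.
Convergence means that the parameter updates have settled at a stable value, so
further iterations no longer improve the fit. max_iter is a parameter of
scikit-learn's LogisticRegression that sets the maximum number of iterations;
the solver stops when it reaches this limit, even if it has not converged yet.
"""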
lr_model = LogisticRegression(max_iter=2100)

# Fit the model
lr_model.fit(X_train, y_train)

# Predict on the training and test sets
train_pred = lr_model.predict(X_train)
test_pred = lr_model.predict(X_test)

# Print the accuracy
print('Train accuracy score:', accuracy_score(y_train, train_pred))
print('Test accuracy score:', accuracy_score(y_test, test_pred))



Output:

Train accuracy score: 0.9538461538461539
Test accuracy score: 0.956140350877193

Pay attention to the logistic regression configuration max_iter=2100, which sets the maximum number of iterations. With the default of 100, the solver stops with the warning STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. If the iteration limit is too low, the optimizer may stop before it reaches the minimum of the loss.
With the default of 100, the measured accuracy is 0.94. With 2100 it is 0.95. Values above 2100 still give 0.95 but take noticeably longer to train, so 2100 is a suitable value.
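
The effect of max_iter can be observed directly. The sketch below refits the same model with two iteration limits, records whether scikit-learn raised a ConvergenceWarning, and reads the number of iterations actually used from the fitted model's n_iter_ attribute. The results dictionary and its keys are illustrative names, and the exact numbers may differ between scikit-learn versions.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for max_iter in (100, 2100):
    # Record warnings instead of printing them, so we can inspect them afterwards
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X_train, y_train)

    results[max_iter] = {
        # Iterations the solver actually ran
        "n_iter": int(model.n_iter_[0]),
        # True if the solver stopped at the limit without converging
        "warned": any(issubclass(w.category, ConvergenceWarning) for w in caught),
        "test_accuracy": model.score(X_test, y_test),
    }
    print(max_iter, results[max_iter])
```

The text of scikit-learn's convergence warning also suggests scaling the features as an alternative to raising max_iter.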

Supplementary knowledge
accuracy_score and mean_squared_error are both indicators used to evaluate model performance, but they are suitable for different types of problems.
accuracy_score is usually used in classification problems. It measures the classification accuracy of the classifier on the data set. Its formula is as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
where:

  • TP represents True Positive, that is, the number of samples that are actually positive and predicted as positive by the classifier;
  • TN represents True Negative, that is, the number of samples that are actually negative and predicted as negative by the classifier;
  • FP represents False Positive, that is, the number of samples that are actually negative but are predicted to be positive by the classifier;
  • FN stands for False Negative, which is the number of samples that are actually positive but are predicted to be negative by the classifier.

mean_squared_error is usually used in regression problems. It measures the difference between the predicted values and the true values. Its formula is as follows:
$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

In short, accuracy_score and mean_squared_error are both standard indicators for evaluating model performance, but for different kinds of problems: accuracy_score is suited to evaluating classification tasks, while mean_squared_error is suited to evaluating regression tasks.
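
Both formulas can be verified by hand on a few values. The sketch below uses small made-up arrays (they are assumptions for illustration, not data from this tutorial), computes accuracy from the TP, TN, FP and FN counts and MSE from the squared differences, and compares each with the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up binary labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels 0/1, confusion_matrix(...).ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
print("TP, TN, FP, FN:", tp, tn, fp, fn)                                 # 3 3 1 1
print("accuracy:", manual_accuracy, accuracy_score(y_true, y_pred))      # 0.75 0.75

# Made-up regression targets and predictions, for illustration only
y_reg_true = np.array([3.0, -0.5, 2.0, 7.0])
y_reg_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of the squared differences between true and predicted values
manual_mse = np.mean((y_reg_true - y_reg_pred) ** 2)
print("MSE:", manual_mse, mean_squared_error(y_reg_true, y_reg_pred))    # 0.375 0.375
```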

OneVsRestClassifier Wine Dataset

OvR (One vs Rest) classification of the wine data; this is the same approach as the One vs All mentioned earlier.
The load_wine data set is a classic, easy-to-understand, multi-class classification data set. It contains a total of 178 wine samples, each with 13 features, divided into three classes that represent three different wine varieties. Specifically, the three classes are:

  • class_0: Represents the first wine variety.
  • class_1: Represents the second wine variety.
  • class_2: Represents the third wine variety.

code

#%%

"""
The load_wine dataset in sklearn is a three-class classification dataset
containing 13 features for each of 178 wine samples.
"""
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier,OneVsOneClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = load_wine()
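X = data.data    # features
y = data.target  # labels
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the logistic regression model and wrap it in a one-vs-rest classifier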
lr_model = LogisticRegression(max_iter=2100)
ovr = OneVsRestClassifier(lr_model)
# Fit the model
ovr.fit(X_train, y_train)

# Predict on the training and test sets
train_pred = ovr.predict(X_train)
test_pred = ovr.predict(X_test)

# Print the accuracy
print('Train accuracy score:', accuracy_score(y_train, train_pred))
print('Test accuracy score:', accuracy_score(y_test, test_pred))

Output:

Train accuracy score: 0.9788732394366197
Test accuracy score: 1.0

The accuracy rate of the test data set is 100%
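
The OvR procedure described in the parameter section (one binary model per class, with that class as the positive example and all others as negative examples) can be written out by hand. The sketch below is one possible manual version, not the library's internal code: it fits one LogisticRegression per class on the wine data, takes for each test sample the class whose binary model gives the highest decision score, and compares the result with OneVsRestClassifier. The names scores, manual_pred and library_pred are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classes = np.unique(y_train)
scores = []
for k in classes:
    # Class k is the positive example, every other class is the negative example
    binary_model = LogisticRegression(max_iter=2100)
    binary_model.fit(X_train, (y_train == k).astype(int))
    # Signed distance to the decision boundary: larger means "more likely class k"
    scores.append(binary_model.decision_function(X_test))

# For each test sample, pick the class whose binary model scores highest
manual_pred = classes[np.argmax(np.column_stack(scores), axis=1)]

# The same strategy through the library wrapper
library_pred = (
    OneVsRestClassifier(LogisticRegression(max_iter=2100))
    .fit(X_train, y_train)
    .predict(X_test)
)

print("Number of binary models:", len(scores))
print("Same predictions as OneVsRestClassifier:", np.array_equal(manual_pred, library_pred))
```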

OneVsOneClassifier Wine Dataset

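This uses the same data set and train/test split as above. The snippet continues from the previous code and reuses its imports, lr_model, X_train, X_test, y_train and y_test.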

# Wrap the logistic regression model in a one-vs-one classifier
ovo = OneVsOneClassifier(lr_model)
# Fit the model
ovo.fit(X_train, y_train)

# Predict on the training and test sets
train_pred1 = ovo.predict(X_train)
test_pred1 = ovo.predict(X_test)

# Print the accuracy
print('Train accuracy score:', accuracy_score(y_train, train_pred1))
print('Test accuracy score:', accuracy_score(y_test, test_pred1))

Output:
Train accuracy score: 0.9929577464788732
Test accuracy score: 1.0

The accuracy rate of the test data set is 100%
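
With three classes, OvR and OvO happen to train the same number of binary models (3), so the wine data does not show the difference in cost between T and T(T-1)/2 classifiers. The sketch below uses a synthetic four-class data set from make_classification (an assumption made only for illustration; it is not part of the original tutorial) and counts the fitted binary models through the estimators_ attribute of each wrapper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with 4 classes, for illustration only
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

base = LogisticRegression(max_iter=1000)
ovr = OneVsRestClassifier(base).fit(X, y)
ovo = OneVsOneClassifier(base).fit(X, y)

n_classes = len(ovr.classes_)
print("classes:", n_classes)                        # 4
print("OvR binary models:", len(ovr.estimators_))   # 4, one per class
print("OvO binary models:", len(ovo.estimators_))   # 6, one per pair of classes
```

The gap grows quickly with the number of classes: for 10 classes, OvR needs 10 binary models while OvO needs 45.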


Origin blog.csdn.net/liaomin416100569/article/details/130362121