Machine Learning Algorithm Principles and Practical Code - Logistic Regression (LR)

✍️Level 1 - Logistic Regression

1. Introduction to the basics of logistic regression

Logistic regression is a classification model.

It can be used to predict whether or not something will happen. Classification problems are among the most common problems we meet:

Everyday life: predicting whether the Shanghai Composite Index will rise tomorrow, whether it will rain in a certain area tomorrow, or whether a watermelon is ripe

Financial field: whether a certain transaction is suspected of violating regulations, whether a certain company is currently violating regulations, and whether it will do so in the future

Internet: whether users will buy something, whether they will click something

For each of these questions, the answer can only be one of two values: $\{0,1\}$

Let's take the following binary classification as an example. For a given data set, there is a straight line that can divide the entire data set into two parts:

At this point, the decision boundary is $w_1x_1+w_2x_2+b=0$, and we can simply label every sample with $h(x)=w_1x_1+w_2x_2+b>0$ as 1 and every other sample as 0. But this is really the decision process of a perceptron. Logistic regression adds one more layer on top of this: it models the relationship between the class probability and the input variables, and decides the class from that probability.
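The perceptron-style rule described above can be sketched as follows. This is only a minimal illustration: the function name `perceptron_decision` and the weights for the line $x_1+x_2-1=0$ are assumptions made for the example, not part of the original article.

```python
import numpy as np

def perceptron_decision(x, w, b):
    """Return 1 if the sample lies on the positive side of the line, else 0."""
    h = np.dot(w, x) + b
    return 1 if h > 0 else 0

# Hypothetical parameters for the boundary x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

label_above = perceptron_decision(np.array([2.0, 2.0]), w, b)  # h = 3 > 0
label_below = perceptron_decision(np.array([0.0, 0.0]), w, b)  # h = -1 <= 0
```

The output is a hard 0/1 label with no notion of how confident the decision is, which is the gap that the probability layer of logistic regression fills.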

We can review the linear regression model first:

$$h(x)=w^T x+b$$

def linearRegression(x, w, b):
    # weighted sum of the features plus the bias term
    return sum([x[i]*w[i] for i in range(len(x))])+b

Apply a function $g$ to the linear model, i.e. $h(x)=g(w^T x+b)$ with $g(z)=1/(1+e^{-z})$. This function is the sigmoid function, also known as the logistic function.
It converts the output of the linear model into a probability value. $h(x)$ now represents the probability that the event happens, which we can also write as $p(Y=1|x)$.

import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))

You can look at the graph of the sigmoid function:

import matplotlib.pyplot as plt
x = np.arange(-10, 10, 0.01)
y = sigmoid(x)
plt.plot(x, y)

From the above we know the expression of logistic regression, so how do we optimize it? $x$ is the input, which is known to us: to predict whether a watermelon is ripe, we need to know its size, color and other information. We feed this into the prediction model, which returns the probability that the watermelon is ripe. So how do we obtain the parameters $w$ and $b$ in the model?

2. Optimization methods for logistic regression

1: Loss function of logistic regression

Logistic regression uses the cross-entropy loss function. For ordinary binary logistic regression, the cross-entropy of a single sample is $J(\theta)=-[y\ln(y')+(1-y)\ln(1-y')]$, where $y'$ is the predicted value.

Note: In some places the logarithm in the cross-entropy has base 2, in others base e. Since $\frac{\log_2(x)}{\log_e(x)}=\log_2(e)$ is a constant, the choice of base does not affect the final result; because it makes the calculation simpler, e is used more often.
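A quick numerical check of the note above, as a sketch with an arbitrarily chosen label and prediction (the values of `y` and `p` are illustrative assumptions):

```python
import numpy as np

y, p = 1.0, 0.8  # illustrative label and predicted probability

loss_e = -(y * np.log(p) + (1 - y) * np.log(1 - p))    # natural log
loss_2 = -(y * np.log2(p) + (1 - y) * np.log2(1 - p))  # base-2 log

ratio = loss_2 / loss_e  # the same constant, log2(e), for any valid p
```

Because the two losses differ only by this constant factor, they have the same minimizer; only the scale of the loss (and of its gradient) changes.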

In fact, what we are looking for is the loss of all samples in training, so:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y_i\ln(y_i')+(1-y_i)\ln(1-y_i')]$$

Note: $\theta$ denotes the set of all parameters

Q: Why not use the least squares method (squared error) for optimization?

A: Because the loss function becomes non-convex in the parameters if the squared error is applied to the sigmoid output. (Loosely speaking, for a convex function any local minimum is also the global minimum, so gradient-based optimization cannot get stuck in a poor local minimum.) For a more detailed explanation, you can look here:

Some points are hard to explain or would take a lot of space. For most of this extended material I may just provide a link, and interested readers can treat it as optional further reading.
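To make the non-convexity claim in the Q&A above a little more concrete, here is a small numerical sketch. It assumes a single illustrative sample ($x=1$, $y=1$, no bias) and checks the sign of the second differences of each loss along $w$; it is a demonstration on one example, not a proof.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One illustrative sample: x = 1, y = 1, no bias term
w = np.linspace(-6, 6, 241)
pred = sigmoid(w * 1.0)

squared_loss = (pred - 1.0) ** 2
cross_entropy = -np.log(pred)

# A convex curve has non-negative second differences everywhere
sq_second_diff = np.diff(squared_loss, n=2)
ce_second_diff = np.diff(cross_entropy, n=2)

squared_is_convex = bool(np.all(sq_second_diff >= 0))
cross_entropy_is_convex = bool(np.all(ce_second_diff >= 0))
```

On this example the squared error curves downward for strongly negative $w$, where the sigmoid saturates, while the cross-entropy stays convex over the whole range.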

Note: The origin of the loss function

In statistics, suppose we already have a set of samples (X, Y) and want to estimate the parameters that could have generated them. We usually use maximum likelihood estimation (a commonly used method of parameter estimation). Maximum likelihood estimation requires a distributional assumption; here we assume that $Y$ follows a Bernoulli distribution.

$$P(Y=1|x)=p(x)$$

$$P(Y=0|x)=1-p(x)$$

Because $Y$ follows a Bernoulli distribution, the likelihood function is easy to write down:

$$L=\prod_{i}[p(x_i)]^{y_i}[1-p(x_i)]^{(1-y_i)}$$

For the convenience of solving, we can take the logarithm on both sides:

$$\ln(L)=\sum_{i}[y_i\ln(p(x_i))+(1-y_i)\ln(1-p(x_i))]$$

You may have noticed the relationship between the expressions for $L$ and $J$: maximizing $L$ is equivalent to minimizing $J$. This is the relationship between maximizing the likelihood function and minimizing the loss function that you will hear about later. If you are asked why logistic regression uses cross-entropy as its loss function, you can answer: it follows from assuming that $Y$ in logistic regression obeys a Bernoulli distribution and then applying maximum likelihood estimation.
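The equivalence between the likelihood and the cross-entropy can be checked numerically. The labels and probabilities below are made-up values used only for illustration:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # illustrative labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # illustrative p(x_i)

# Likelihood L, log-likelihood ln(L), and mean cross-entropy J
likelihood = np.prod(p ** y * (1 - p) ** (1 - y))
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Here `cross_entropy` equals `-log_likelihood / m`, so any parameters that increase the likelihood decrease the loss by a corresponding amount.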

2: Gradient descent method

The optimization method of logistic regression is the gradient descent method.
From a mathematical point of view, the direction of the gradient of the function is the direction in which the function grows fastest, whereas the opposite direction of the gradient is the direction in which the function decreases the fastest. So if we want to calculate the minimum of a function, we go in the opposite direction of the gradient of the function.
Let's briefly introduce the algorithm first.
Suppose we need to optimize the function $f(X)=f(x_1,...,x_n)$.

First we initialize the independent variables, starting from $X^{(0)}=(x_1^{(0)},...,x_n^{(0)})$, and set a learning rate $\eta$.
For any $i\ge 0$:

If we are minimizing $f$:

$$x_1^{(i+1)}=x_1^{(i)}-\eta\frac{\partial f}{\partial x_1}(x^{(i)})$$

$$x_n^{(i+1)}=x_n^{(i)}-\eta\frac{\partial f}{\partial x_n}(x^{(i)})$$

Conversely, if we are maximizing $f$:

$$x_1^{(i+1)}=x_1^{(i)}+\eta\frac{\partial f}{\partial x_1}(x^{(i)})$$

$$x_n^{(i+1)}=x_n^{(i)}+\eta\frac{\partial f}{\partial x_n}(x^{(i)})$$
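The update rule above can be tried on a simple function. The quadratic `f`, the starting point, the learning rate and the iteration count below are all assumptions chosen for illustration:

```python
import numpy as np

def f(x):
    # Illustrative convex function with its minimum at (1, -2)
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

def grad_f(x):
    # Partial derivatives of f with respect to x1 and x2
    return np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x = np.array([0.0, 0.0])  # X^(0)
eta = 0.1                 # learning rate
for i in range(200):
    x = x - eta * grad_f(x)  # step against the gradient to minimize f
```

After the loop `x` is very close to the minimizer `(1, -2)`. Replacing the minus sign in the update with a plus sign would move uphill instead, as in the maximization case.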

Many people may have heard of stochastic gradient descent, batch gradient descent, and plain gradient descent. The difference between the three is simply how many samples each update looks at: stochastic gradient descent computes the loss of one sample at a time and then updates the parameters $\theta$; batch gradient descent (often called mini-batch gradient descent) updates the parameters from a batch of samples each time; and gradient descent uses all samples for every update. Generally speaking, stochastic gradient descent is fast, because the parameters are updated after every single sample. The price is that its updates are noisy, so its convergence is relatively poor and it more easily ends up at a local optimum. Conversely, the more samples are used for each parameter update, the slower each update is, but the more likely it is to reach the global optimum.
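The three variants differ only in how many samples feed each update, which a single `batch_size` argument can express. The sketch below is an assumption-laden illustration: the function `train`, the synthetic data and all hyperparameters are invented for the example, and it uses the gradient $(g-y)x$ that is derived in the next subsection.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Gradient descent for logistic regression; batch_size picks the variant."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            err = sigmoid(X[idx] @ w + b) - y[idx]  # g - y for each sample in the batch
            w -= lr * (X[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

# Synthetic, linearly separable toy data (an assumption for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

results = {}
for name, bs in [("stochastic", 1), ("mini-batch", 32), ("full", len(X))]:
    w, b = train(X, y, batch_size=bs)
    results[name] = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

With the same number of epochs, `batch_size=1` performs 200 parameter updates per pass over the data while the full-batch variant performs one, which is the speed difference the paragraph above describes.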

3: Optimization of logistic regression

As shown above, the objective function optimized in logistic regression is $J(w,b)=-\frac{1}{m}\sum_{i=1}^{m}[y_i\ln(\sigma(w^T x_i+b))+(1-y_i)\ln(1-\sigma(w^T x_i+b))]$

We need to optimize the parameters $w,b$ so that the value of $J$ on our known samples $X,y$ is as small as possible. This is what is usually called minimizing the empirical risk.

Since we want to optimize the objective function, we first need to differentiate $J(w,b)$.

Let $g=\sigma(w^T x+b)$

$$\frac{\partial J(g)}{\partial g}=-\frac{\partial}{\partial g}[y\ln(g)+(1-y)\ln(1-g)]=-\frac{y}{g}+\frac{1-y}{1-g}$$

Next, let $a=w^T x+b$

$$\frac{\partial g}{\partial a}=\frac{\partial\left(\frac{1}{1+e^{-a}}\right)}{\partial a}=-(1+e^{-a})^{-2}\cdot(-e^{-a})=\frac{1}{1+e^{-a}}\cdot\frac{1+e^{-a}-1}{1+e^{-a}}=\sigma(a)(1-\sigma(a))=g(1-g)$$

Note that $g=\sigma(a)$, yet the derivative of $g$ with respect to $a$ turns out to be $g(1-g)$. This is a particularly interesting property of the sigmoid function: in the subsequent gradient descent optimization it saves a lot of unnecessary computation.
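The identity $\partial g/\partial a=g(1-g)$ can be sanity-checked against a numerical derivative. The grid of test points and the step size `eps` are arbitrary choices for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a = np.linspace(-5, 5, 11)
eps = 1e-6

numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)  # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))                     # g(1 - g)

max_gap = np.max(np.abs(numeric - analytic))
```

In practice this means the backward pass can reuse the value of `g` already computed in the forward pass instead of evaluating another exponential.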

With this groundwork, we can compute the gradients of the parameters $w,b$ that we need to optimize. By the chain rule:

$$\frac{\partial J}{\partial w}=\frac{\partial J}{\partial g}\frac{\partial g}{\partial a}\frac{\partial a}{\partial w}=\left(-\frac{y}{g}+\frac{1-y}{1-g}\right)g(1-g)x=(g-y)x$$

$$\frac{\partial J}{\partial b}=\frac{\partial J}{\partial g}\frac{\partial g}{\partial a}\frac{\partial a}{\partial b}=\left(-\frac{y}{g}+\frac{1-y}{1-g}\right)g(1-g)=(g-y)$$
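A finite-difference gradient check is a common way to confirm formulas like $(g-y)x$ and $(g-y)$. The sample, the parameter values and the helper `loss` below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    # Cross-entropy of a single sample
    g = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

# Illustrative single sample and parameters
x, y = np.array([0.5, -1.2, 2.0]), 1.0
w, b = np.array([0.1, 0.4, -0.3]), 0.2

g = sigmoid(np.dot(w, x) + b)
grad_w = (g - y) * x  # analytic dJ/dw
grad_b = g - y        # analytic dJ/db

eps = 1e-6
num_w = np.array([
    (loss(w + eps * e, b, x, y) - loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])
num_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
```

The analytic and numerical gradients agree to within the accuracy of the finite-difference approximation.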

That concludes the introduction to optimizing logistic regression. Based on the formulas above, we can write a simple implementation optimized with stochastic gradient descent.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cal_grad(y, t):
    # dJ/da = g - y, averaged over the samples in t
    grad = np.sum(t - y) / t.shape[0]
    return grad

def cal_cross_loss(y, t):
    # mean cross-entropy between labels y and predictions t
    loss = np.sum(-y * np.log(t) - (1 - y) * np.log(1 - t)) / t.shape[0]
    return loss

class LR:
    def __init__(self, in_num, lr, iters, train_x, train_y, test_x, test_y):
        self.w = np.random.rand(in_num)
        self.b = np.random.rand(1)
        self.lr = lr
        self.iters = iters
        self.x = train_x
        self.y = train_y
        self.test_x = test_x
        self.test_y = test_y

    def forward(self, x):
        self.a = np.dot(x, self.w) + self.b
        self.g = sigmoid(self.a)
        return self.g

    def backward(self, x, grad):
        w = grad * x
        b = grad
        self.w = self.w - self.lr * w
        self.b = self.b - self.lr * b

    def valid_loss(self):
        pred = sigmoid(np.dot(self.test_x, self.w) + self.b)
        return cal_cross_loss(self.test_y, pred)

    def train_loss(self):
        pred = sigmoid(np.dot(self.x, self.w) + self.b)
        return cal_cross_loss(self.y, pred)

    def train(self):
        for iter in range(self.iters):
            # stochastic gradient descent: update after every single sample
            for i in range(self.x.shape[0]):
                t = self.forward(self.x[i])
                grad = cal_grad(self.y[i], t)
                self.backward(self.x[i], grad)

            train_loss = self.train_loss()
            valid_loss = self.valid_loss()
            if iter % 5 == 0:
                print("Iteration:", iter, "train loss:", train_loss, "validation loss:", valid_loss)

The above is a hand-written logistic regression, meant only for practice. In real work we usually just call the sklearn package.

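import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X_train and y_train are the training features and labels, prepared beforehand
clf = LogisticRegression(penalty="l2", class_weight=None,
                         random_state=None, max_iter=100).fit(X_train, y_train)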

penalty : The type of penalty term, i.e. what we usually call regularization. It defaults to "l2"; "l1" can also be used (with a solver that supports it). The l1 and l2 regularization of logistic regression is introduced later.

class_weight : The class weights, generally used when the classes are imbalanced. For example, {0:0.1,1:1} means that when the loss is computed, the loss of class 0 is multiplied by 0.1. When there is far too much class 0 data, this is equivalent to increasing the weight of class 1. Personally, when the classes are imbalanced I prefer this to resampling.

max_iter : The maximum number of iterations.
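The effect of `class_weight` described above can be sketched in plain NumPy. This only illustrates the idea of scaling each sample's loss by its class weight; it is not scikit-learn's actual implementation, and the function name `weighted_cross_entropy` and the data are assumptions.

```python
import numpy as np

def weighted_cross_entropy(y, p, class_weight):
    """Mean cross-entropy where each sample's loss is scaled by its class weight."""
    weights = np.where(y == 1, class_weight[1], class_weight[0])
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.mean(weights * per_sample)

y = np.array([0, 0, 0, 0, 1])            # class 0 dominates
p = np.array([0.3, 0.2, 0.4, 0.1, 0.6])  # illustrative predictions

plain = weighted_cross_entropy(y, p, {0: 1.0, 1: 1.0})
reweighted = weighted_cross_entropy(y, p, {0: 0.1, 1: 1.0})
```

With `{0: 0.1, 1: 1.0}` the four class 0 samples contribute far less to the total, so the single class 1 sample has relatively more influence on the gradient.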

That is the overall introduction to logistic regression. I believe that after reading it you will have a fairly comprehensive understanding of it. But logistic regression is a peculiar algorithm: it looks simple on the surface, yet there is a great deal to learn once you dig deeper, so do not lightly claim to have mastered it.

3. Extensions of logistic regression

1: Regularization of logistic regression

First, let's introduce regularization; readers already familiar with it can skip this part. Normally a model is trained to minimize the empirical risk. Regularization adds a constraint on top of this (it can also be seen as introducing prior knowledge). The constraint acts as a guide: when the error function is optimized, gradient descent tends to choose directions that satisfy the constraint.

Note: here we can add a few words about empirical risk, expected risk and structural risk.
The empirical risk is the average loss on the training set, and the expected risk is the expected loss under the joint distribution of (X, y). As the number of samples N tends to infinity, the empirical risk tends to the expected risk. What machine learning does is estimate the expected risk through the empirical risk.
Structural risk adds a regularization term to the empirical risk in order to prevent overfitting.

Common in logistic regression are L1 regularization and L2 regularization.

L1 regularization adds the sum of the absolute values of the parameters $w$:

$$L1: J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y_i\ln(y_i')+(1-y_i)\ln(1-y_i')]+|w|$$

L2 regularization adds the sum of the squares of the parameters $w$:

$$L2: J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y_i\ln(y_i')+(1-y_i)\ln(1-y_i')]+||w^2||$$

Taking $L2$ as an example, the objective we optimize is no longer just the empirical risk: we want the empirical risk to be as small as possible while $||w^2||$ is also kept as small as possible. In everyday terms, it means achieving a goal at the lowest possible cost, which is clearly the most reasonable choice.
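The two regularized objectives can be written as small functions. The coefficient `lam` is an assumption added for this sketch (the formulas above correspond to `lam=1.0`), and the labels, predictions and weights are illustrative values.

```python
import numpy as np

def cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def l1_loss(y, p, w, lam=1.0):
    # empirical risk plus the sum of absolute weights
    return cross_entropy(y, p) + lam * np.sum(np.abs(w))

def l2_loss(y, p, w, lam=1.0):
    # empirical risk plus the sum of squared weights
    return cross_entropy(y, p) + lam * np.sum(w ** 2)

y = np.array([1, 0, 1])
p = np.array([0.8, 0.3, 0.6])
w = np.array([0.5, -2.0])

base = cross_entropy(y, p)
with_l1 = l1_loss(y, p, w)  # adds |0.5| + |-2.0| = 2.5
with_l2 = l2_loss(y, p, w)  # adds 0.25 + 4.0 = 4.25
```

Note how the large weight `-2.0` dominates the L2 penalty much more than the L1 penalty, since squaring amplifies large values.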

It may feel easy if you stop here, but people will generally go on to ask:

1: What is the theoretical basis of $L1$ and $L2$ regularization?

2: Why does $L1$ regularization tend to produce sparse solutions (weights that are exactly 0)?

Rant: you just add $|w|$ or $||w^2||$ so that the values of $w$ become a bit smaller. What else is there to say?

Ranting aside, the questions still need to be answered.

When we assume that the parameter $w$ follows a normal distribution, the Bayesian model yields $L2$ regularization; when we assume that $w$ follows a Laplace distribution, the Bayesian model yields $L1$ regularization.

The specific derivation is as follows:

Logistic regression actually assumes that the parameters are fixed values that we need to solve for. The Bayesian view instead assumes that the parameters of logistic regression follow a certain distribution. Suppose the probability model of the parameter $w$ is $p(w)$. Using Bayesian inference:

$$p(w|D)\propto p(w)p(D|w)=\prod_{j=1}^{M}p(w_j)\prod_{i=1}^{N}p(D_i|w)$$

Taking the $\log$ of the above expression:

$$\begin{aligned}
&\Rightarrow \arg\max_w\Big[\sum_{j=1}^{M}\log p(w_j)+\sum_{i=1}^{N}\log p(D_i|w)\Big]\\
&\Rightarrow \arg\max_w\Big[\sum_{i=1}^{N}\log p(D_i|w)+\sum_{j=1}^{M}\log p(w_j)\Big]\\
&\Rightarrow \arg\min_w\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(D_i|w)-\frac{1}{N}\sum_{j=1}^{M}\log p(w_j)\Big)
\end{aligned}$$

The first term is already familiar: it is the loss function of logistic regression. Now suppose $p(w)$ follows a certain distribution, which is equivalent to introducing prior knowledge. If $p(w)$ follows a Laplace distribution with mean 0:

$$p(w)=\frac{1}{2\lambda}e^{-\frac{|w|}{\lambda}}$$

Substituting $p(w)$ into the expression above gives:

$$\sum_{j=1}^{M}\log p(w_j)=\sum_{j=1}^{M}\left(\log\frac{1}{2\lambda}-\frac{|w_j|}{\lambda}\right)$$

The term $\log\frac{1}{2\lambda}$ does not depend on $w$ and can be dropped from the optimization. Replacing the remaining coefficient of $|w_j|$ (the factor $\frac{1}{\lambda}$ together with the $\frac{1}{N}$ from above) with a new parameter $\lambda$, we obtain the regularization term of L1 regularization:

$$\lambda\sum_{j=1}^{M}|w_j|$$

The sum of the two parts is the loss function under the logistic regression L1 regularization.

The same holds for L2 regularization, except that $p(w)$ is assumed to follow a normal distribution with mean 0.
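The link between the two priors and the two penalties can be checked numerically: up to a constant that does not depend on $w$, the negative log of a Laplace density is proportional to $|w|$ and the negative log of a zero-mean normal density is proportional to $w^2$. The scale parameters `lam` and `sigma` and the test values of `w` are assumptions for this sketch.

```python
import numpy as np

def neg_log_laplace(w, lam):
    # -log of p(w) = 1/(2*lam) * exp(-|w|/lam)
    return np.log(2 * lam) + np.abs(w) / lam

def neg_log_gaussian(w, sigma):
    # -log of a zero-mean normal density with standard deviation sigma
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + w ** 2 / (2 * sigma ** 2)

w = np.array([-2.0, 0.0, 0.5, 3.0])
lam, sigma = 0.5, 1.5  # illustrative scale parameters

# Subtracting the value at w = 0 removes the constant that does not depend on w
l1_part = neg_log_laplace(w, lam) - neg_log_laplace(0.0, lam)
l2_part = neg_log_gaussian(w, sigma) - neg_log_gaussian(0.0, sigma)
```

A narrower prior (smaller `lam` or `sigma`) gives a larger coefficient on the penalty, which matches the intuition that a stronger prior belief in small weights means stronger regularization.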

As for $L1$ and $L2$, everyone has probably seen the contour plots used to compare them.
Here we look at it from the perspective of model optimization:

When $w$ is greater than 0, the derivative of $|w|$ is 1. According to gradient descent, each parameter update of logistic regression therefore subtracts $lr*1$ from $w$ on account of the penalty, so $w$ moves toward 0.
Conversely, when $w$ is less than 0, the gradient of $|w|$ also pushes $w$ toward 0 under gradient descent. This continues until $w$ is 0, at which point the gradient of $|w|$ is also taken to be 0.
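The difference in behaviour can be seen by applying only the penalty part of the update (the data loss is deliberately ignored here, which is a simplifying assumption). The starting weights, learning rate and the rule of stopping at 0 when a step would cross it are choices made for this sketch.

```python
import numpy as np

lr = 0.1
w_l1 = np.array([0.35, -0.2])
w_l2 = w_l1.copy()

for _ in range(10):
    # L1 penalty: the (sub)gradient is sign(w), a constant-size step toward 0
    stepped = w_l1 - lr * np.sign(w_l1)
    # If a step would cross 0, stop at 0 (np.sign(0) is 0, so it then stays there)
    w_l1 = np.where(stepped * w_l1 < 0, 0.0, stepped)
    # L2 penalty: the gradient is 2w, a step proportional to the current value
    w_l2 = w_l2 - lr * 2 * w_l2
```

The L1 steps have a fixed size, so the weights reach exactly 0 after a few updates, while the L2 steps shrink together with the weights, which therefore get small but stay non-zero.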

$L2$ does not behave this way. Suppose we duplicate a feature. Before the duplication the weight of this feature is w1. After the duplication, L1 regularization tends to optimize the weight of one of the two copies to 0, while L2 regularization tends to optimize the weights of both copies to w1/2, because $||w^2||$ is clearly smallest when both weights are w1/2.
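The duplicated-feature argument can be illustrated by splitting a fixed total weight between the two copies and comparing the penalties. The value of `w1` and the grid of splits are assumptions for this sketch, and only the penalty terms are considered (any split with the same total gives the same predictions).

```python
import numpy as np

w1 = 1.0                       # weight of the feature before it was duplicated
a = np.linspace(0.0, w1, 101)  # weight given to the first copy
b = w1 - a                     # weight given to the second copy (a + b = w1)

l1_penalty = np.abs(a) + np.abs(b)  # identical for every split
l2_penalty = a ** 2 + b ** 2        # depends on the split

best_split = a[np.argmin(l2_penalty)]
```

The L2 penalty is smallest at the even split `w1 / 2`, whereas the L1 penalty is the same for every split, so nothing in it discourages putting all the weight on one copy and leaving the other at 0.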

The above is just my personal opinion. I have also seen various explanations of this issue from experts online, which you can refer to:

2: Why are features often discretized in logistic regression?

This is a common practice in industry. Generally we do not feed continuous values directly into a logistic regression model as features; instead we discretize them into 0/1 variables. The benefits are:

1: The inner product multiplication of sparse variables is fast, the calculation results are convenient to store, and easy to expand;

2: The discretized features are very robust to abnormal data: for example, if a feature is age > 30, it is 1, otherwise it is 0. If the features are not discretized, an abnormal data "300 years old" will cause great disturbance to the model.

3: Logistic regression is a generalized linear model and its expressive power is limited. After a single variable is discretized into N variables, each of them has its own weight, which is equivalent to introducing nonlinearity into the model; this improves the model's expressive power and its ability to fit;

4: After discretization, feature crossover can be performed, changing from M+N variables to M*N variables, further introducing nonlinearity and improving expression ability;

5: After the features are discretized, the model is more stable. For example, if the user's age is discretized with 20-30 as one interval, a user does not become a completely different person just because they are one year older. Of course, samples next to an interval boundary behave in exactly the opposite way, so how to choose the intervals takes some skill.
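Points 2 to 4 above can be sketched with NumPy. The bucket edges, the sample ages and the second feature (a two-level `gender` code) are hypothetical values chosen for illustration.

```python
import numpy as np

ages = np.array([18, 25, 31, 47, 300])  # 300 is an abnormal value
bins = np.array([20, 30, 40])           # hypothetical bucket edges

bucket = np.digitize(ages, bins)        # bucket index 0..3 for each age
age_onehot = np.eye(len(bins) + 1)[bucket]

# Crossing with a second one-hot feature (hypothetical 2-level "gender")
gender = np.array([0, 1, 1, 0, 1])
gender_onehot = np.eye(2)[gender]
crossed = np.einsum("ni,nj->nij", age_onehot, gender_onehot).reshape(len(ages), -1)
```

The abnormal age 300 falls into the same last bucket as 47, so it can no longer distort the model through its magnitude, and crossing a 4-level feature with a 2-level feature yields 4*2 = 8 binary columns, each of which gets its own weight.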

4. Homework

STEP1: Calculate the results of the following questions according to the requirements

Homework 1: Expression for Logistic Regression:

A: h(x)=wx+b


C: h(x)=sigmoid(wx+b)

D: h(x)=sigmoid(wx)


Homework 2: The following statements about logistic regression are correct (multiple choices):

A: The output of logistic regression is a probability value, between 0-1

B: Using regularization can improve the generalization of the model

C: Logistic regression can be used directly for multi-classification

D: Logistic regression is a non-parametric model

E: The loss function of logistic regression is cross entropy


Homework 3: For $y=sigmoid(w_1*x_1+w_2*x_2+1)$ with w=(0.2, 0.3), calculate the gradients of w1 and w2 and the loss for the sample X=(1,1), y=1 (keep 3 decimal places, rounded):


Homework 4: Add L2 regularization to the cal_grad gradient function, is the following function correct? (Y/N)

def cal_grad(y, t, x, w):
    # x: input X
    # y: sample y
    # t: prediction t
    # w: parameter w
    grad = np.sum(t - y) / t.shape[0]
    return grad * x + 2 * w

