Foreword

Supervised learning can be roughly divided into two major tasks: regression and classification. What is the difference between them? Stated more formally, a prediction problem in which both the input variable and the output variable are continuous is a regression problem, while a prediction problem in which the output variable takes one of a finite number of discrete values is called a classification problem.
For example, suppose we input features such as a person's daily exercise time, sleep time, working time, and diet to predict that person's weight. Weight can take unlimited values, so the prediction is a continuous value drawn from an infinite range. Such a machine learning task is a regression task.
If instead the same features are used to judge whether the person is healthy, the final judgment has only two possible outcomes: healthy or unhealthy. Such outputs are discrete values, and the prediction is one of a finite set of categories, so this is a classification task.
The linear regression algorithm, explained earlier, can solve simple regression tasks. Logistic regression, by contrast, is used to solve binary classification problems. The name easily confuses beginners, since a method named "regression" is used for a classification task. On closer study, you will find that logistic regression really is a regression method applied to classification: its essential output is still a continuous value between 0 and 1, and a threshold is set artificially to map that output to a category. Let's take a closer look at the principle of the logistic regression algorithm.

1. The principle of logistic regression algorithm

First, let's introduce the sigmoid function, defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Here e is Euler's constant, and the independent variable z can be any real number; the range of the function is (0, 1), so the sigmoid function maps any input value to a value between 0 and 1. The graph of the sigmoid function is shown in the figure below:
We can think of the output as the probability that something happens, which must lie between 0 and 1, so the output value can be used to judge that probability. Suppose a threshold of 0.3 is set; then a score greater than 0.3 is classified as a positive example and one less than 0.3 as a negative example, so the threshold does not have to be 0.5. For instance, when a machine learning model is used to judge whether a person is sick, the threshold is often set to a smaller value. If the threshold were 0.5, a person whose logistic regression score is 0.49 would be judged not sick even though they may well be, which would be irresponsible. Setting the threshold lower allows further testing to check whether there really is a physical problem, which reduces the chance of a missed diagnosis while still screening out healthy people and improving doctors' efficiency.
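The sigmoid-plus-threshold idea can be sketched in a few lines of Python. This is a minimal illustration (the function and threshold values are just examples, not from any particular library): a score of about 0.49 is classified as negative with the default threshold of 0.5, but positive once the threshold is lowered to 0.3, exactly as in the medical example above.

```python
import math

def sigmoid(z):
    """Map any real number z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, threshold=0.5):
    """Label a probability score as positive (1) or negative (0)."""
    return 1 if score >= threshold else 0

score = sigmoid(-0.04)            # about 0.49
print(classify(score))            # 0: judged "not sick" with threshold 0.5
print(classify(score, 0.3))       # 1: flagged for further testing with threshold 0.3
```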
Next, we express the independent variable in terms of the input data features, so that the sigmoid function can map them:

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n + b = W^\top X + b$$

The features are represented by a vector X, each feature has a corresponding weight in the parameter vector W representing its importance, and there is also a bias parameter b. The logistic regression model can therefore be written as:

$$g(X) = \frac{1}{1 + e^{-(W^\top X + b)}}$$

By feeding the features of a sample into this model, we obtain the probability that the sample is a positive example. For a two-class problem, the expressions for the positive and negative cases are as follows.
The probability that the prediction is a positive example:

$$p(y=1 \mid X) = \frac{1}{1 + e^{-(W^\top X + b)}}$$

The probability that the prediction is a negative example:

$$p(y=0 \mid X) = \frac{e^{-(W^\top X + b)}}{1 + e^{-(W^\top X + b)}} = 1 - p(y=1 \mid X)$$

These two cases can be expressed uniformly in a single formula:

$$p(y \mid X) = p(y=1 \mid X)^{y}\,\bigl(1 - p(y=1 \mid X)\bigr)^{1-y}$$

When it is a positive example, that is, when y = 1, substituting into the formula gives:

$$p(1 \mid X) = p(y=1 \mid X) = \frac{1}{1 + e^{-(W^\top X + b)}}$$

When it is a negative example, that is, when y = 0, substituting gives:

$$p(0 \mid X) = 1 - p(y=1 \mid X) = \frac{e^{-(W^\top X + b)}}{1 + e^{-(W^\top X + b)}}$$

Clearly this single formula covers both the positive and the negative case, which is also convenient for the derivations that follow.
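As a small sketch of the model's prediction step (the weight and feature values below are made up for illustration), p(y=1 | x) is the sigmoid of the weighted feature sum, and p(y=0 | x) is its complement:

```python
import math

def predict_proba(w, b, x):
    """p(y=1 | x) = sigmoid(w . x + b) for one sample x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [0.8, -0.5]   # hypothetical weights for two features
b = 0.1           # hypothetical bias
x = [1.0, 2.0]    # one sample's feature vector

p1 = predict_proba(w, b, x)   # probability of the positive class
p0 = 1.0 - p1                 # probability of the negative class
```

By construction the two probabilities always sum to 1, matching the unified formula above.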

2. Loss function and parameter update

We now have the logistic regression model, and it can classify input samples. However, the model's judgment depends on its parameters w and b, so we need to learn continuously from the provided data samples, updating w and b to maximize the probability that all predictions are correct. Simply put, we multiply together the probability of a correct prediction for every sample and seek the parameters that make this product as large as possible. Writing out this requirement gives the following expression, which leads to the loss function of logistic regression.
$$L(w, b) = \prod_{i=1}^m P(1 \mid x_i)^{y_i} \bigl(1 - P(1 \mid x_i)\bigr)^{1-y_i}$$

where:

$$P(1 \mid x_i) = g(x_i) = \frac{1}{1 + e^{-(w^\top x_i + b)}}$$

We usually write the sigmoid expression in the following shorthand:

$$g(x_i) = \frac{1}{1 + e^{-(w^\top x_i + b)}} = \sigma(w^\top x_i + b)$$

so the above likelihood can be written as:

$$L(w, b) = \prod_{i=1}^m \sigma(w^\top x_i + b)^{y_i} \bigl(1 - \sigma(w^\top x_i + b)\bigr)^{1-y_i}$$

Addition is much simpler to compute with than multiplication, so we take the logarithm of both sides to convert the product into a sum:

$$l(w, b) = \log L(w, b) = \sum_{i=1}^m \Bigl( y_i \log \sigma(w^\top x_i + b) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i + b)\bigr) \Bigr)$$

At this point we only need to find a set of parameters w and b that maximizes l(w, b). In machine learning, however, it is customary to convert a maximization problem into a minimization problem, so we simply negate the function (and average it over the m samples):

$$J(w, b) = -\frac{1}{m}\, l(w, b)$$

This function J(w, b) is the final loss function of logistic regression; we call it the cross-entropy loss function.
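The cross-entropy loss J(w, b) can be sketched directly from its formula. This is a plain-Python illustration (the tiny data set at the bottom is made up): with all parameters at zero, every prediction is 0.5, so the loss comes out to log 2 regardless of the labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(w, b, X, y):
    """J(w, b) = -(1/m) * sum( y*log(p) + (1-y)*log(1-p) )."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / m

# With zero parameters every prediction is 0.5, so the loss equals log 2.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
loss = cross_entropy_loss([0.0, 0.0], 0.0, X, y)
```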
At this point, we can use gradient descent to update the parameters, learning from the data samples iteratively. The steps of gradient descent are as follows.

Step 1: take the partial derivative of the loss function with respect to each parameter.

The partial derivative with respect to w is computed as follows:

$$\begin{aligned} \frac{\partial J(w, b)}{\partial w} &= -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial w}\Bigl( y_i \log \sigma(w^\top x_i + b) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i + b)\bigr) \Bigr) \\ &= -\frac{1}{m} \sum_{i=1}^m \left( y_i \frac{\sigma(w^\top x_i + b)\bigl[1 - \sigma(w^\top x_i + b)\bigr]}{\sigma(w^\top x_i + b)}\, x_i + (1 - y_i) \frac{-\sigma(w^\top x_i + b)\bigl[1 - \sigma(w^\top x_i + b)\bigr]}{1 - \sigma(w^\top x_i + b)}\, x_i \right) \\ &= -\frac{1}{m} \sum_{i=1}^m \bigl( y_i - \sigma(w^\top x_i + b) \bigr) x_i \\ &= \frac{1}{m} \sum_{i=1}^m \bigl( \sigma(w^\top x_i + b) - y_i \bigr) x_i \end{aligned}$$

So the final result for the w gradient is:

$$\frac{\partial J(w, b)}{\partial w} = \frac{1}{m} \sum_{i=1}^m \bigl( \sigma(w^\top x_i + b) - y_i \bigr) x_i$$

The partial derivative with respect to b follows the same steps, except that the inner derivative of z with respect to b is 1 instead of x_i, giving:

$$\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^m \bigl( \sigma(w^\top x_i + b) - y_i \bigr)$$

Step 2: use the computed partial derivatives to update the parameters, with learning rate α:

$$\begin{aligned} w &= w - \alpha \frac{\partial J(w, b)}{\partial w} \\ b &= b - \alpha \frac{\partial J(w, b)}{\partial b} \end{aligned}$$

Step 3: repeat steps 1 and 2 until the loss function stabilizes and no longer decreases.
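The three steps above can be sketched as a small batch gradient descent loop. This is a minimal illustration, not a production implementation; the toy one-feature data set, learning rate, and epoch count are all made-up choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent for logistic regression (steps 1-3)."""
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        # Step 1: accumulate the gradients over all m samples.
        grad_w = [0.0] * n
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                  # sigma(w^T x_i + b) - y_i
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step 2: update the parameters with learning rate alpha (lr).
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
        # Step 3: repeating for `epochs` iterations stands in for
        # "repeat until the loss no longer decreases".
    return w, b

# Toy data: label 1 when the single feature is large.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
```

After training, `sigmoid(w[0] * x + b)` gives a probability below 0.5 for the small feature values and above 0.5 for the large ones.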
Following these steps, the best-fitting model parameters are found, and the trained parameters can then be used to perform logistic regression classification.

Summary

Linear regression and logistic regression, two core algorithms in machine learning, are applied to regression tasks and classification tasks respectively. The core of both is using gradient descent to update the parameters, a process that reflects how a model continuously learns from existing data and renews itself. Learning linear regression and logistic regression well is therefore conducive to understanding the neural network algorithms of deep learning that come later; they are necessary prerequisite knowledge for studying deep neural networks.

Origin blog.csdn.net/didiaopao/article/details/126483343