Machine Learning Algorithms --- Logistic Regression and Gradient Descent

1. An introduction to logistic regression

  Logistic regression is a generalized linear regression analysis model, often used in data mining, automatic disease diagnosis, economic forecasting, and other fields.

  Logistic regression is a generalized linear model, so it has a lot in common with multiple linear regression analysis.

  Its formula is as follows:

        g(z) = 1 / (1 + e^(-z))

  Its image is as follows:

        [Figure: the S-shaped sigmoid curve, rising from 0 toward 1 and passing through 0.5 at z = 0]

  By observing the curve above, we can see that the output of logistic regression lies in the range (0, 1). When the input is 0, the output is 0.5; when the input is less than 0 and keeps getting smaller, the output gets closer and closer to 0; conversely, when the input is greater than 0 and keeps getting larger, the output gets closer and closer to 1.
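As a quick sketch in plain Python (the input values below are just illustrative), the behavior described above looks like this:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(-6))   # ~0.0025, close to 0
print(sigmoid(6))    # ~0.9975, close to 1
```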

  We usually use linear regression to predict continuous values. Although logistic regression carries the word "regression" in its name, it is usually used to solve binary classification problems.

  When its output is greater than 0.5, we can consider the sample to belong to class A; when it is less than 0.5, we consider the sample to belong to class B.

  However, since a sample usually has multiple features, we cannot feed it directly into the logistic regression formula. We therefore first use the linear regression introduced earlier to combine the sample's multiple feature values into a single value z, which is then passed into the logistic function for classification. The expression for z is as follows:

    z = θ^T x = θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n

  Substituting z, we get the full expression of logistic regression for a single sample:

    h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))

  With the above formula we can perform logistic regression analysis on arbitrary data, but there is one problem: the value of θ. Only when θ in the formula is known can we apply the formula to unclassified data. So how do we find θ?
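As a minimal sketch in plain Python (the parameter values and the sample below are hypothetical, chosen only to show the shape of the computation), classifying one sample looks like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """h_theta(x): combine the features linearly, then squash with the sigmoid.

    theta and x are plain lists; theta[0] is the intercept term, and x is
    assumed to carry a leading 1 to match it.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and one sample with features [1 (bias), x1, x2]:
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, 1.2]
p = predict(theta, x)
print(p)                              # ~0.769: probability of class A
print("A" if p > 0.5 else "B")        # A
```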

Please see the formula derivation below.

2. Derivation of Logistic Regression formula

  The previous section ended with the need to obtain θ. How to obtain θ will be analyzed in detail here.

  In machine learning we usually have a process called training. Training means obtaining a model (or classifier) from data whose classifications (labels) are known, and then using this model to label (or classify) data whose labels are unknown.

  So, we use samples (that is, data with known classifications) and make a series of estimates to obtain θ. This process is called parameter estimation in probability theory.

  Here, we will use the derivation of maximum likelihood estimation to find the formula for calculating θ:

    (1) First we let:

      P(y = 1 | x; θ) = h_θ(x)
      P(y = 0 | x; θ) = 1 - h_θ(x)

    (2) Combine the above two formulas into one:

        P(y | x; θ) = h_θ(x)^y · (1 - h_θ(x))^(1-y)

    (3) Write down its likelihood function:

      L(θ) = ∏_{i=1}^{m} P(y_i | x_i; θ) = ∏_{i=1}^{m} h_θ(x_i)^{y_i} · (1 - h_θ(x_i))^{1-y_i}

    (4) Take the logarithm of the likelihood function:

       l(θ) = log L(θ) = Σ_{i=1}^{m} [ y_i · log h_θ(x_i) + (1 - y_i) · log(1 - h_θ(x_i)) ]

    (5) The θ that maximizes the likelihood function can be taken as the parameters of the model. To find that maximum we could use gradient ascent, but with a little processing of the likelihood function we can turn the problem into one of gradient descent and then solve it with the gradient descent idea. The transformed expression is as follows:

       J(θ) = -(1/m) · l(θ)   (Gradient ascent becomes gradient descent because a negative coefficient is multiplied in.)
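A minimal sketch of computing J(θ) in plain Python (the toy samples and parameter values below are invented purely for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum over samples of y*log(h) + (1-y)*log(1-h)."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * f for t, f in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

# Toy data: two samples with features [bias, x1] and labels 0/1.
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
print(cost([0.0, 0.0], X, y))   # log(2) ~ 0.693, since h = 0.5 everywhere at theta = 0
```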

    (6) Because we want to update the current θ value to obtain a new θ value, we need to know the direction of the update (that is, whether the current θ should have a number added to it or subtracted from it to approach the final result). So after obtaining J(θ), we take its derivative to get the update direction (why the update direction is obtained this way, and why it is processed according to the following formula, is explained in the deductive derivation of the gradient descent formula below). The derivation yields:

      ∂J(θ)/∂θ_j = (1/m) · Σ_{i=1}^{m} (h_θ(x_i) - y_i) · x_i^j

    (7) After the update direction is obtained, the following formula can be iterated to obtain the final result:

        θ_j := θ_j - α · (1/m) · Σ_{i=1}^{m} (h_θ(x_i) - y_i) · x_i^j
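Putting steps (1) through (7) together, a small training loop might look like the following sketch (the toy data, learning rate, and iteration count are hypothetical choices, not values from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for logistic regression.

    Update rule: theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j).
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            h = sigmoid(sum(t * f for t, f in zip(theta, xi)))
            for j in range(n):
                grad[j] += (h - yi) * xi[j]
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

# Toy one-feature data with a leading bias column: class 1 roughly when x1 > 2.5.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
h = lambda x: sigmoid(theta[0] + theta[1] * x)
print(h(1.0), h(4.0))   # small for x = 1, large for x = 4
```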

3. Deductive derivation of gradient descent formula

  To find the optimal solution (maximum or minimum) of a function, in mathematics we generally take the derivative of the function, set the derivative equal to 0, and solve the resulting equation directly. In machine learning, however, our functions are often high-dimensional and high-order, and the equation obtained by setting the derivative to 0 is difficult (sometimes even impossible) to solve directly, so we need other methods to obtain this result. Gradient descent is one of them.

  Take the simplest function, y = x². How do we find the x that minimizes y, without solving 2x = 0?

    (1) First pick an arbitrary value of x, such as x = -4; this gives a value of y.

    (2) Find the update direction. Suppose we updated x without finding the direction first, say x - 0.5 or x + 0.5.

      We would find that updating x in the negative direction moves us away from the final result; here we should update in the positive direction. So before updating x we must find its update direction (this direction is not fixed; it depends on the current value; for example, when x = 4 we should instead update in the negative direction).

      The direction comes from the value of the derivative at the current point: y' = 2x, so at x = -4 we have y' = -8. The update direction is given by y', and to update x we only need x := x - α·y' (α > 0 is the update step size; in machine learning we call it the learning rate).

      PS: What was said earlier is that a multi-dimensional, high-order equation may not be solvable, not that it cannot be differentiated. So we can still differentiate, and then substitute the current x into the derivative.

    (3) Repeat step (2) until x converges.
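These steps can be sketched directly for y = x² (step size and iteration count are arbitrary illustrative choices):

```python
def gradient_descent_1d(x0, alpha=0.1, iterations=100):
    """Minimize y = x**2 starting from x0, using the derivative y' = 2x."""
    x = x0
    for _ in range(iterations):
        x = x - alpha * 2 * x   # x := x - alpha * y'
    return x

x = gradient_descent_1d(-4.0)
print(x)   # very close to 0, the minimizer of y = x**2
```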

  

  Gradient descent method:

    For the update formula θ_j := θ_j - α · (1/m) · Σ_{i=1}^{m} (h_θ(x_i) - y_i) · x_i^j, if:

      (1) m is the total number of samples, that is, every iteration considers all samples when updating, then the method is called batch gradient descent (BGD). Its feature is that it easily obtains the global optimal solution, but when the number of samples is large, the training process becomes slow. Use it when the sample size is small.

      (2) When m = 1, that is, each iteration considers only a single sample when updating, the formula is called stochastic gradient descent (SGD). Each update is cheap, but because it follows one sample at a time, the path toward the optimum is noisier than BGD's.

      (3) Summing up the two methods above: when m is a part of the total number of samples (such as m = 10), that is, each iteration considers a small batch of samples when updating, the method is called mini-batch gradient descent (MBGD). It overcomes the shortcomings of the two methods above while keeping their advantages, and is the most commonly used in practical environments.
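The three variants differ only in how many samples feed each update. A minimal sketch (the toy data and batch sizes are hypothetical):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, batch):
    """Gradient of J(theta) averaged over a batch of (features, label) pairs."""
    n = len(theta)
    grad = [0.0] * n
    for xi, yi in batch:
        h = sigmoid(sum(t * f for t, f in zip(theta, xi)))
        for j in range(n):
            grad[j] += (h - yi) * xi[j]
    return [g / len(batch) for g in grad]

def step(theta, batch, alpha=0.1):
    g = gradient(theta, batch)
    return [t - alpha * gj for t, gj in zip(theta, g)]

data = [([1.0, 1.0], 0), ([1.0, 2.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
theta = [0.0, 0.0]

theta = step(theta, data)                    # BGD: the whole sample set
theta = step(theta, [random.choice(data)])   # SGD: one random sample
theta = step(theta, random.sample(data, 2))  # MBGD: a small random batch
```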

 
