Machine Learning: Logistic Regression


1. Problem

In actual work, we may encounter the following problems:

  1. Predict whether a user clicks on a specific product
  2. Determine the gender of the user
  3. Predict whether users will buy a given category
  4. Determine whether a comment is positive or negative

These can all be regarded as classification problems, and more precisely as binary classification problems.

2. Model

2.1 sigmoid function

Before introducing the logistic regression model, we first introduce the sigmoid function, whose mathematical form is:

                                        g(x) = \frac{1}{1+e^{-x}}

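The corresponding curve is S-shaped: g(x) increases monotonically from 0 (as x goes to negative infinity) to 1 (as x goes to positive infinity) and passes through g(0) = 0.5.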
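
As a minimal sketch of the function above, assuming Python with NumPy (neither is named in the original), the sigmoid can be evaluated through the identity 1 / (1 + e^{-x}) = 0.5 * (1 + tanh(x / 2)), which avoids overflow in exp for large negative inputs:

```python
import numpy as np

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x)), computed via the equivalent tanh form."""
    x = np.asarray(x, dtype=float)
    # 0.5 * (1 + tanh(x / 2)) is algebraically the same function,
    # but never evaluates exp on a huge argument.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

print(sigmoid(0.0))                  # 0.5, the midpoint of the curve
print(sigmoid([-5.0, 0.0, 5.0]))     # values strictly increasing in x
```

The direct formula works just as well for moderate inputs; the tanh form is only a convenience for extreme values.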


2.2 Decision function

Let x be an m-dimensional sample feature vector (the input) and y its label, which is either positive (y = 1) or negative (y = 0). Here θ is the vector of model parameters (the regression coefficients) and σ is the sigmoid function. The probability that the sample is a positive example is then:

                                        P(y = 1|x; \theta) = \sigma(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}

For binary classification, the decision rule is simple: if the probability that sample x belongs to the positive class is greater than 0.5, it is classified as positive; otherwise it is classified as negative.
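
The decision rule can be sketched as follows, assuming Python with NumPy. The coefficient vector and the two samples are hypothetical values chosen only for illustration:

```python
import numpy as np

def predict_proba(theta, X):
    """P(y = 1 | x; theta) for every row x of X."""
    z = X @ theta
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # equals 1 / (1 + exp(-z))

def predict(theta, X, threshold=0.5):
    """Label 1 if the positive-class probability exceeds the threshold, else 0."""
    return (predict_proba(theta, X) > threshold).astype(int)

theta = np.array([1.0, -2.0])    # hypothetical regression coefficients
X = np.array([[3.0, 1.0],        # theta^T x =  1, so p > 0.5
              [0.0, 1.0]])       # theta^T x = -2, so p < 0.5
print(predict(theta, X))         # [1 0]
```

Because σ(z) > 0.5 exactly when z > 0, thresholding at 0.5 is the same as checking the sign of θ^T x, so the decision boundary is linear in x.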

2.3 Parameter solving

Once the mathematical form of the model is fixed, what remains is to solve for the model parameters \theta. A commonly used method in statistics is maximum likelihood estimation: find the set of parameters under which the probability of the observed data is greatest.

Suppose we have n independent training samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, with each y_i in {0, 1}. Then the probability of each observed sample (x_i, y_i) is:

                                        P(y_i|x_i) = P(y_i = 1|x_i)^{y_i} (1 - P(y_i = 1|x_i))^{1-y_i}

This is because the probability is P(y_i = 1|x_i) when y_i = 1, and 1 - P(y_i = 1|x_i) when y_i = 0.

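Because the samples are independent, the probability of observing all samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, that is, the likelihood, is the product of the individual probabilities:

                                        L(\theta) = \prod_{i=1}^n P(y_i = 1|x_i)^{y_i} (1 - P(y_i = 1|x_i))^{1-y_i}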

Take the logarithm and get:

                                  \log L(\theta) = \sum_{i=1}^n [ y_i \log P(y_i = 1|x_i) + (1-y_i) \log(1-P(y_i = 1|x_i)) ]
                                                 = \sum_{i=1}^n [ y_i \log \sigma(\theta^T x_i) + (1-y_i) \log(1-\sigma(\theta^T x_i)) ]

Next, differentiate \log L(\theta) with respect to θ. Using \sigma'(z) = \sigma(z)(1 - \sigma(z)), this gives:

                                  \frac{\partial \log L(\theta)}{\partial \theta} = \sum_{i=1}^n (y_i - \sigma(\theta^T x_i)) x_i

Setting this derivative to 0 yields, disappointingly, an equation with no closed-form solution. The parameters therefore have to be found iteratively, for example with gradient descent on the negative log-likelihood.
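
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the log-likelihood. The sketch below assumes Python with NumPy; the random data, the helper names, and the step size eps are all illustrative choices, not part of the original text:

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # equals 1 / (1 + exp(-z))

def log_likelihood(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # Analytic form: sum_i (y_i - sigma(theta^T x_i)) x_i
    return X.T @ (y - sigmoid(X @ theta))

def numerical_gradient(f, theta, eps=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 synthetic samples, 3 features
y = (rng.random(20) < 0.5).astype(float)      # synthetic 0/1 labels
theta = rng.normal(size=3)

analytic = gradient(theta, X, y)
numeric = numerical_gradient(lambda t: log_likelihood(t, X, y), theta)
print(np.max(np.abs(analytic - numeric)))     # should be close to zero
```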

2.4 Iterative solution

For a given sample (x_i, y_i), the probability of being a positive example is \sigma(\theta^T x_i), and the probability of being a negative example is 1 - \sigma(\theta^T x_i).

The cross-entropy loss function (log loss) of this sample is:

                                          l_i(\theta) = -y_i \log(\sigma(\theta^T x_i)) - (1-y_i) \log(1-\sigma(\theta^T x_i))

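Take a gradient descent step on this sample. Since the loss is the negative of the sample's log-likelihood, its gradient is (\sigma(\theta^T x_i) - y_i) x_i, so the update is:

\theta^{t+1} = \theta^t - \alpha \frac{\partial l_i(\theta)}{\partial \theta} = \theta^t - \alpha (\sigma(\theta^T x_i) - y_i) x_i = \theta^t + \alpha (y_i - \sigma(\theta^T x_i)) x_i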
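
A single stochastic gradient step can be sketched as below, assuming Python with NumPy. The sample, the starting point θ = 0, and the learning rate α = 0.1 are hypothetical; the point is only that the step moves against the gradient of the log loss and lowers the loss on that sample:

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # equals 1 / (1 + exp(-z))

def log_loss(theta, x_i, y_i):
    p = sigmoid(x_i @ theta)
    return -y_i * np.log(p) - (1 - y_i) * np.log(1 - p)

def sgd_step(theta, x_i, y_i, alpha):
    # Gradient of the per-sample log loss: (sigma(theta^T x_i) - y_i) x_i
    grad = (sigmoid(x_i @ theta) - y_i) * x_i
    return theta - alpha * grad

theta = np.zeros(2)
x_i, y_i = np.array([1.0, 2.0]), 1.0          # one hypothetical positive sample

before = log_loss(theta, x_i, y_i)
theta = sgd_step(theta, x_i, y_i, alpha=0.1)
after = log_loss(theta, x_i, y_i)
print(theta)            # [0.05 0.1]
print(after < before)   # True: the loss on this sample decreased
```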


Averaging the log loss over the entire data set (N = n samples) gives

                                          J(\theta) = \frac{1}{N} \sum_{i=1}^N l_i(\theta) = -\frac{1}{N} \log L(\theta)

That is, for the logistic regression model, maximizing the likelihood function and minimizing the log loss function are equivalent.

2.5 Parallelization

With stochastic gradient descent, only one sample takes part in each iteration, so traversing the data set is slow. Mini-batch gradient descent parallelizes this: each step takes m samples (here m is the batch size) and computes their gradients in parallel:

\theta^{t+1} = \theta^t - \alpha \frac{\partial \sum_{i=1}^m l_i(\theta)}{\partial \theta} = \theta^t + \alpha \sum_{i=1}^m (y_i - \sigma(\theta^T x_i)) x_i

A very important benefit of LR is that it can be parallelized in this way, which makes it highly efficient in engineering practice.
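
A minimal mini-batch training loop might look like the following sketch, assuming Python with NumPy. The synthetic data, the function names, and the hyperparameters are illustrative assumptions. The batch gradient is divided by the batch size here, which only rescales the learning rate α relative to the summed form in the formula:

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # equals 1 / (1 + exp(-z))

def mean_log_loss(theta, X, y):
    # Clip probabilities so log never sees exactly 0 or 1.
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def minibatch_gd(X, y, alpha=0.1, batch_size=32, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))        # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # All per-sample gradients of the batch in one matrix product.
            grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
            theta -= alpha * grad
    return theta

# Synthetic data drawn from a known (hypothetical) parameter vector.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])  # bias + 2 features
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(500) < sigmoid(X @ true_theta)).astype(float)

theta = minibatch_gd(X, y)
print(mean_log_loss(np.zeros(3), X, y), mean_log_loss(theta, X, y))
```

The matrix product over the batch is what makes this step easy to parallelize: the per-sample terms are independent and are only summed at the end.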



Q1: The differences and connections between LR and linear regression

  • The optimization objective of linear regression is least squares, while that of logistic regression is the likelihood function.
  • Linear regression predicts over the entire real line, while logistic regression narrows the prediction range and limits the predicted value to the interval (0, 1).

Q2: Discretization of continuous features: under what circumstances does discretizing continuous features give better results? I recently started working on CTR estimation and found that it generally uses LR with all-discrete features. Why use discrete features? What are the benefits?

A2: In industry, continuous values are rarely fed directly into a logistic regression model as features. Instead, each continuous feature is discretized into a series of 0/1 features before being passed to the model. The advantages are as follows:

  1. Discrete features are easy to add and remove, which makes it easy to iterate on the model quickly;
  2. Inner products of sparse vectors are fast to compute, and the results are convenient to store and easy to scale;
  3. Discretized features are very robust to abnormal data: for example, a feature that is 1 if age > 30 and 0 otherwise. Without discretization, an abnormal value such as "age 300 years" would interfere greatly with the model;
  4. Logistic regression is a generalized linear model with limited expressive power; after a single variable is discretized into N features, each one has its own weight, which is equivalent to introducing nonlinearity into the model and improves its expressive power and fit;
  5. After discretization, features can be crossed, going from M+N variables to M*N variables, which introduces further nonlinearity and improves expressive power;
  6. After discretization, the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by getting one year older. Samples near an interval boundary are, of course, affected in exactly the opposite way, so choosing the intervals takes skill;
  7. Discretization simplifies the logistic regression model and reduces the risk of overfitting.

Simply put, whether a model uses discrete or continuous features is really a trade-off between "massive discrete features + simple model" and "a few continuous features + complex model". You can either discretize and use a linear model, or keep continuous features and use deep learning. It depends on whether you prefer to tinker with features or with models.
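
The discretization and feature-crossing ideas in points 3 to 5 can be sketched as follows, assuming Python with NumPy. The bucket boundaries, the sample ages, and the second (gender) feature are hypothetical:

```python
import numpy as np

def one_hot_buckets(values, edges):
    """Map each value to a one-hot row over len(edges) + 1 buckets."""
    idx = np.digitize(values, edges)            # bucket index of each value
    return np.eye(len(edges) + 1, dtype=int)[idx]

def cross(a, b):
    """Cross two one-hot matrices: M and N columns become M * N columns."""
    return np.einsum("im,in->imn", a, b).reshape(len(a), -1)

age_edges = [20, 30, 40]                        # hypothetical boundaries
ages = np.array([18, 25, 31, 300])              # 300 is an abnormal value
age_feat = one_hot_buckets(ages, age_edges)
print(age_feat)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]

gender_feat = np.eye(2, dtype=int)[[0, 1, 1, 0]]   # hypothetical second feature
crossed = cross(age_feat, gender_feat)
print(crossed.shape)                            # (4, 8)
```

The abnormal age 300 simply falls into the last bucket, the same one as any age of 40 or more, so it cannot dominate the inner product the way the raw value would. Each bucket and each crossed column then gets its own weight in the model.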


Generative models and discriminative models


An example of a discriminative model: to determine whether an animal is a goat or a sheep, the discriminative approach learns a model from historical data, then extracts the animal's features and predicts the probability that it is a goat and the probability that it is a sheep.

An example of a generative model: the generative approach first learns a goat model from the features of goats and a sheep model from the features of sheep. It then extracts the features of the animal in question, evaluates them under the goat model to get one probability and under the sheep model to get another, and picks whichever is larger.


Logistic regression is a discriminative model: it models the conditional probability P(y|x) directly, without modeling the joint data distribution P(x,y) behind it.
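
For contrast, the generative goat/sheep procedure can be sketched as below, assuming Python with NumPy. Everything here is hypothetical: a single made-up feature, a Gaussian model per class, and equal class priors. It illustrates the idea and is not a method taken from the original text:

```python
import numpy as np

def fit_gaussian(x):
    """Fit a 1-D Gaussian to the samples of one class."""
    return x.mean(), x.std()

def log_density(x, mean, std):
    """Log of the Gaussian density N(x; mean, std^2)."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

# Hypothetical 1-D feature measured on known goats and known sheep.
goat = np.array([30.0, 32.0, 35.0, 31.0])
sheep = np.array([10.0, 12.0, 9.0, 11.0])

goat_model = fit_gaussian(goat)       # "learn a goat model"
sheep_model = fit_gaussian(sheep)     # "learn a sheep model"
log_prior = np.log(0.5)               # assumed equal class priors

def classify(x):
    # Score the new animal under each class model and keep the larger one.
    goat_score = log_density(x, *goat_model) + log_prior
    sheep_score = log_density(x, *sheep_model) + log_prior
    return "goat" if goat_score > sheep_score else "sheep"

print(classify(28.0), classify(13.0))   # goat sheep
```

Logistic regression skips the per-class models entirely and fits P(y|x) directly, which is what makes it discriminative.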

The main common discriminative models are:

  • Logistic Regression
  • Traditional Neural Networks
  • Nearest Neighbor
  • Linear Discriminant Analysis
  • Linear Regression

The main common generative models are:

  • Naive Bayes
  • Mixtures of Multinomials
  • Mixtures of Gaussians
  • Mixtures of Experts
  • Sigmoidal Belief Networks, Bayesian Networks
  • Markov Random Fields
  • Latent Dirichlet Allocation

