Machine learning (2) - Logistic regression

1. What is logistic regression

Logistic regression is used as a classification algorithm. Everyone is familiar with linear regression: the general form is Y = aX + b, and the range of Y is (-∞, +∞). With so many possible values, how can it be used for classification? Don't worry, mathematicians have found a way for us.

That is, pass the result Y through a non-linear transformation, the sigmoid function, which yields a number S in the range (0, 1). S can be regarded as a probability. If we set the probability threshold to 0.5, then samples with S greater than 0.5 can be regarded as positive and those with S less than 0.5 as negative, and classification is achieved.

2. What is the Sigmoid function

The function formula is as follows:

S(t) = 1 / (1 + e^(-t))

Regardless of the value of t, the result always lies in the range (0, 1). Recall that a binary classification problem has two answers, "yes" and "no": 0 corresponds to "no" and 1 corresponds to "yes". Then someone asks: isn't the output the whole interval (0, 1), so how can there be only 0 and 1? Good question. We assume the classification threshold is 0.5: outputs above 0.5 are classified as 1 and outputs below 0.5 as 0. The threshold can be set yourself.
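
As a quick sanity check, here is a minimal sketch of the sigmoid and the 0.5 threshold in Python (NumPy assumed; all names are illustrative):

```python
import numpy as np

def sigmoid(t):
    """Map any real number t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Values far below 0 map near 0, values far above 0 map near 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.99995]

# Thresholding at 0.5 turns probabilities into class labels.
labels = (sigmoid(np.array([-1.0, 2.0])) >= 0.5).astype(int)
print(labels)  # [0 1]
```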

Well, next we substitute aX + b for t to get the general model equation of logistic regression:

P = 1 / (1 + e^(-(aX + b)))

The result P can also be understood as a probability. In other words, a probability greater than 0.5 belongs to class 1, and a probability less than 0.5 belongs to class 0. This achieves the goal of classification.
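
To make the model equation concrete, a small sketch reusing the sigmoid above (the coefficients a and b are made-up values, not fitted ones):

```python
# Hypothetical coefficients for a single feature.
a, b = 2.0, -1.0
X = np.array([0.0, 0.5, 2.0])

P = sigmoid(a * X + b)              # P = 1 / (1 + e^(-(aX + b)))
predictions = (P > 0.5).astype(int) # probability above 0.5 -> class 1
print(P)            # ~[0.269, 0.5, 0.953]
print(predictions)  # [0 0 1]
```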

3. What is the loss function

The loss function of logistic regression is the log loss, i.e. the negative log-likelihood. The function formula is as follows:

cost(h(x), y) = -log(h(x))        if y = 1
cost(h(x), y) = -log(1 - h(x))    if y = 0

The y = 1 in the formula means the first expression is used when the true label is 1, and the second expression is used to compute the loss when the true label is 0. Why use the log function? Imagine that the true label is 1 but the model predicts a probability h close to 0: then -log(0) = +∞, the maximum possible penalty for the model. When h = 1, -log(1) = 0, which is equivalent to no penalty, i.e. no loss, the optimal result. That is why mathematicians chose the log function to express the loss.
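
A small numerical check of this penalty behaviour (a sketch; the eps clipping is added only to keep log(0) from overflowing):

```python
def log_loss(y_true, h, eps=1e-15):
    """Per-sample log loss: -log(h) if y = 1, -log(1 - h) if y = 0."""
    h = np.clip(h, eps, 1 - eps)
    return -(y_true * np.log(h) + (1 - y_true) * np.log(1 - h))

# True label 1 but predicted probability near 0 -> huge penalty.
print(log_loss(1, 1e-9))   # ~20.7
# True label 1 and predicted probability near 1 -> almost no penalty.
print(log_loss(1, 0.999))  # ~0.001
```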

Finally, gradient descent is applied to find the minimum of this loss function, which yields the desired model.
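
A minimal batch-gradient-descent sketch for this loss, reusing the sigmoid above (learning rate and iteration count are arbitrary choices, not tuned values):

```python
def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit w, b by gradient descent on the mean log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        h = sigmoid(X @ w + b)           # predicted probabilities
        grad_w = X.T @ (h - y) / len(y)  # gradient of mean log loss w.r.t. w
        grad_b = np.mean(h - y)          # gradient w.r.t. the bias b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny illustration: four 1-D points, boundary should land near x = 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
print((sigmoid(X @ w + b) > 0.5).astype(int))  # expect [0 0 1 1] once converged
```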

4. Can logistic regression perform multi-class classification?

Yes. In fact, we can extend the binary problem to a multi-class problem (one vs rest). The steps are as follows:

1. Treat type class1 as the positive class and all other types as negative; this gives the probability p1 that the sample's label is class1.

2. Then treat type class2 as the positive class and all other types as negative; similarly obtain p2.

3. Repeating this cycle, we obtain the probability pi that the sample to be predicted belongs to type class i; finally, take the class corresponding to the largest probability as the predicted type of the sample.


In short, each class is separated from the rest in turn as a binary problem, and the class with the largest probability is taken as the result.
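
A sketch of this one-vs-rest procedure, built on the fit_logistic trainer from section 3 (helper names are illustrative):

```python
def fit_one_vs_rest(X, y, classes):
    """Train one binary logistic model per class."""
    models = {}
    for c in classes:
        y_binary = (y == c).astype(int)  # class c vs. everything else
        models[c] = fit_logistic(X, y_binary)
    return models

def predict_one_vs_rest(models, X):
    """Pick the class whose model assigns the highest probability."""
    classes = list(models)
    probs = np.stack([sigmoid(X @ w + b) for w, b in models.values()])
    return np.array(classes)[np.argmax(probs, axis=0)]
```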

5. What are the advantages of logistic regression

  • LR can output results in the form of probabilities, not just hard 0/1 judgments.
  • LR is highly interpretable and controllable (you have to be able to explain it to the boss ...).
  • Training is fast, and the results are strong after good feature engineering.
  • Because the output is a probability, it can serve as a ranking model.

6. What are the applications of logistic regression

  • CTR prediction / learning to rank in recommendation systems / all kinds of classification scenarios.
  • The baseline version of ad CTR estimation at a certain search-engine company is LR.
  • The baseline version of search ranking / ad CTR at a certain e-commerce company is LR.
  • The shopping-match recommendations of a certain e-commerce company use a fair amount of LR.
  • The ranking baseline of a news app earning 10 million+ a day is LR.

7. What are the commonly used optimization methods for logistic regression

7.1 First-order methods

Gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent. Stochastic gradient descent is not only faster than plain gradient descent; its randomness can also suppress convergence to local optima to some extent.
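
A mini-batch stochastic gradient descent variant of the earlier training loop (a sketch; batch size, epoch count and shuffling strategy are arbitrary choices):

```python
def fit_logistic_sgd(X, y, lr=0.1, n_epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD: each step uses only a random subset of the data."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        idx = rng.permutation(len(y))              # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            h = sigmoid(X[batch] @ w + b)
            w -= lr * X[batch].T @ (h - y[batch]) / len(batch)
            b -= lr * np.mean(h - y[batch])
    return w, b
```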

7.2 Second-order methods: Newton's method and quasi-Newton methods

Here is a brief description of the basic principle of Newton's method and its application. Newton's method finds a root of an equation by repeatedly taking the intersection of the tangent line with the x-axis as the new expansion point, until that intersection converges to the point where the curve crosses the x-axis. In practice we often need to solve convex optimization problems, i.e. find the point where the first derivative of the objective is 0, and Newton's method applied to the derivative provides exactly that. Concretely, Newton's method picks a starting point, makes a second-order Taylor expansion there, moves to the point where the derivative of that expansion is 0, and repeats the update until the stopping requirement is met. Because it uses second-order information, it converges faster. In practice x is usually a multidimensional vector, which leads to the concept of the Hessian matrix (the matrix of second derivatives with respect to x).

Disadvantages: Newton's method takes a fixed full step with no step-size factor, so it cannot guarantee that the function value decreases steadily, and in severe cases it can even fail to converge. Newton's method also requires the function to be twice differentiable. Moreover, computing the inverse of the Hessian matrix is very expensive.

Quasi-Newton methods: methods that construct an approximate positive-definite symmetric matrix in place of the Hessian, without computing second-order partial derivatives, are called quasi-Newton methods. The idea is to approximate the Hessian matrix or its inverse with a special expression that satisfies the quasi-Newton condition. The main variants are DFP (approximates the inverse of the Hessian), BFGS (approximates the Hessian directly), and L-BFGS (reduces the storage required by BFGS).
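
In practice these second-order ideas are usually reached through a library. A sketch using SciPy's L-BFGS implementation (scipy.optimize.minimize with method="L-BFGS-B" is a real API; the loss wrapper and parameter packing are illustrative, and sigmoid/NumPy are reused from the earlier snippets):

```python
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    """Mean log loss as a function of the packed parameters [w..., b]."""
    w, b = params[:-1], params[-1]
    h = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic_lbfgs(X, y):
    x0 = np.zeros(X.shape[1] + 1)  # initial guess: all parameters zero
    result = minimize(neg_log_likelihood, x0, args=(X, y), method="L-BFGS-B")
    return result.x[:-1], result.x[-1]  # (w, b)
```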

8. Why does logistic regression discretize features?

  1. Non-linearity! Non-linearity! Non-linearity! Logistic regression is a generalized linear model with limited expressive power; after a single variable is discretized into N variables, each one gets its own weight, which introduces non-linearity into the model, improves its expressive power, and strengthens the fit. Discretized features are also easy to add and remove, making model iteration fast.
  2. Speed! Speed! Speed! Inner products with sparse vectors are fast to compute, and the results are easy to store and scale.
  3. Robustness! Robustness! Robustness! Discretized features are very robust to abnormal data: for example, with a feature "age > 30 is 1, otherwise 0", an abnormal record like "age = 300" causes no trouble, whereas without discretization it would interfere heavily with the model.
  4. Convenient feature crossing and combination: after discretization, features can be crossed, going from M + N variables to M * N variables, which further introduces non-linearity and improves expressive power.
  5. Stability: after discretization, the model is more stable. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by turning one year older. Of course, samples right next to an interval boundary behave the opposite way, so how to divide the intervals is an art; a binning sketch follows this list.
  6. Model simplification: discretizing features simplifies the logistic regression model and reduces the risk of overfitting.
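
A sketch of points 1, 3 and 5: discretizing a continuous age feature into one-hot interval indicators (the bin edges are arbitrary examples):

```python
import numpy as np

age = np.array([18, 25, 33, 47, 300])  # note the abnormal value "300"
bins = [20, 30, 40, 50]                # interval boundaries

# np.digitize maps each value to a bin index; the outlier simply lands
# in the last bin instead of distorting a linear weight on raw age.
bin_index = np.digitize(age, bins)          # [0 1 2 3 4]
one_hot = np.eye(len(bins) + 1)[bin_index]  # one separate weight per interval
print(one_hot)
```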

9. What happens when the L1 regularization term in the objective function of logistic regression is increased?

More and more of the parameters w are driven to exactly 0 (the model becomes sparse); with a sufficiently large L1 coefficient, all parameters w become 0.
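
This sparsity effect can be checked empirically with scikit-learn, where penalty="l1" with the liblinear solver is a real option and smaller C means a stronger L1 penalty (the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for C in [1.0, 0.1, 0.01]:  # decreasing C = increasing L1 strength
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    n_zero = (clf.coef_ == 0).sum()
    print(f"C={C}: {n_zero} of {clf.coef_.size} weights are exactly 0")
```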
