Logistic regression is a classification algorithm that can handle both binary and multiclass classification. Although its name contains the word "regression", it is not a regression algorithm. So why does the misleading word "regression" appear in the name? Personally, I think that although logistic regression is a classification model, its principle still carries the shadow of the regression model. From <https://www.cnblogs.com/pinard/p/6029432.html>
Prelude:
How do you show that you understand logistic regression really well? Summarize it in one sentence: logistic regression assumes the data follow a Bernoulli distribution and, by maximizing the likelihood function, uses gradient descent to solve for the parameters, in order to classify data into two classes.
That sentence actually contains five points: 1. the assumption of logistic regression; 2. the loss function of logistic regression; 3. how logistic regression is solved; 4. the purpose of logistic regression; 5. how logistic regression classifies. These questions are a basic test of your knowledge of logistic regression.
1. From linear regression to logistic regression
We know that a linear regression model finds the linear relationship, expressed by a coefficient vector θ, between the output vector Y and the input sample matrix X, so that Y = Xθ. Here Y is continuous, so this is a regression model.
What if we want Y to be discrete? One idea is to apply a further transformation to Y, turning it into g(Y). If g(Y) falling in one real interval means class A, falling in another interval means class B, and so on, we get a classification model. If there are only two classes, it is a binary classification model. This is the starting point of logistic regression. We begin with binary logistic regression.
2. The binary logistic regression model
In the previous section we mentioned applying a transformation function g to the result of linear regression, which turns it into logistic regression. In logistic regression, g is generally taken to be the sigmoid function, which has the following form:
$$g(z) = \frac{1}{1 + e^{-z}}$$
It has a very nice property: as z tends to positive infinity, g(z) tends to 1, and as z tends to negative infinity, g(z) tends to 0, which suits a classification probability model very well. It also has a convenient derivative:
$$g'(z) = g(z)\,(1 - g(z))$$
This follows easily by differentiating g(z), and we will use the formula later.
If we let the z in g(z) be z = xθ, we obtain the general form of the binary logistic regression model:
$$h_\theta(x) = g(x\theta) = \frac{1}{1 + e^{-x\theta}}$$
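As a small illustration of the sigmoid, its derivative property, and the model form, here is a minimal NumPy sketch. The function names `sigmoid`, `sigmoid_grad` and `predict_proba` are chosen for this example only and do not come from the original article.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def predict_proba(X, theta):
    """Model output g(X @ theta): one probability per row of X."""
    return sigmoid(X @ theta)

# g(0) is exactly 0.5, and the slope there is 0.25
print(sigmoid(0.0), sigmoid_grad(0.0))
```

For very negative z, `np.exp(-z)` can overflow and emit a warning; the result still evaluates to 0, but a production implementation would normally use a numerically stable variant.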
3. The loss function of binary logistic regression
Recall the loss function of linear regression: because the output of linear regression is continuous, the sum of squared model errors can be used to define the loss. The output of logistic regression is not continuous, so the definition that comes naturally for linear regression does not carry over. We can, however, derive the loss function by maximum likelihood.
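By the definition of binary logistic regression in Section 2, we assume the sample output is either class 0 or class 1. Writing h_θ(x) = g(xθ) for the model output, we have:
$$P(y = 1 \mid x, \theta) = h_\theta(x), \qquad P(y = 0 \mid x, \theta) = 1 - h_\theta(x)$$
Combining the two formulas into one gives:
$$P(y \mid x, \theta) = h_\theta(x)^{y}\,(1 - h_\theta(x))^{1 - y}$$
where y takes only the value 0 or 1.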
With the probability distribution of y in hand, we can solve for the model coefficients θ by maximizing the likelihood function.
For convenience we maximize the log-likelihood instead; the negative of the log-likelihood is our loss function J(θ). In algebraic form, the likelihood function is:
$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$$
where m is the number of samples, (x^{(i)}, y^{(i)}) is the i-th sample, and h_θ(x) = g(xθ) is the model output.
Negating the log-likelihood gives the loss function:
$$J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{m} \left[ y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_\theta(x^{(i)})\right) \right]$$
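The negative log-likelihood can be evaluated directly from its definition. The sketch below is one way to do so in NumPy; the name `log_loss` and the clipping constant `eps` are assumptions added for this example (clipping only guards against taking log(0)).

```python
import numpy as np

def log_loss(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta), summed over the m rows of X."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted P(y = 1 | x)
    h = np.clip(h, eps, 1.0 - eps)           # assumption: avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# With theta = 0 every prediction is 0.5, so J = m * ln 2.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(log_loss(np.zeros(2), X, y), 3 * np.log(2))
```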
4. Optimizing the loss function of binary logistic regression
There are many ways to minimize the loss function of binary logistic regression; the most common are gradient descent, coordinate descent, Newton's method, and so on. Here we derive the gradient descent update of θ for each iteration.
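Matrix derivation: differentiating J(θ) with respect to the vector θ gives
$$\frac{\partial J(\theta)}{\partial \theta} = X^{T}\left(h_\theta(X) - Y\right)$$
where h_θ(X) = g(Xθ) is applied elementwise. Each gradient descent iteration therefore updates θ by
$$\theta = \theta - \alpha X^{T}\left(h_\theta(X) - Y\right)$$
where α is the step size of gradient descent.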
In practice we generally do not need to worry about the optimization method, since most machine learning libraries have various optimizers for logistic regression built in. Still, it is necessary to understand at least one of them.
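To make the matrix-form update concrete, here is a minimal gradient descent loop in NumPy. It is a sketch under stated assumptions, not a library implementation: the function name `fit_logistic_gd`, the zero initialization, the fixed iteration count and the toy data are all choices made for this example.

```python
import numpy as np

def fit_logistic_gd(X, y, alpha=0.1, n_iters=1000):
    """Repeat theta <- theta - alpha * X^T (g(X theta) - y)."""
    theta = np.zeros(X.shape[1])                  # assumption: start from zero
    for _ in range(n_iters):                      # assumption: fixed iteration count
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))    # g(X theta), elementwise
        theta -= alpha * (X.T @ (h - y))          # matrix-form gradient step
    return theta

# Toy data: the first column is a constant 1 acting as the intercept.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic_gd(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ theta))) >= 0.5).astype(int)
print(theta, pred)
```

On this toy set the fitted model separates the two classes, so `pred` matches `y`. Because the data are linearly separable, θ keeps growing with more iterations; regularization is the usual remedy.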
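Algebraic derivation: differentiating J(θ) with respect to each component θ_j and using g′(z) = g(z)(1 − g(z)) gives
$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left(g(x^{(i)}\theta) - y^{(i)}\right) x_j^{(i)}$$
which is the j-th component of the matrix form.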
5. Regularization of binary logistic regression
Logistic regression also faces the problem of overfitting, so we have to consider regularization. The common choices are L1 regularization and L2 regularization.
Compared with the ordinary logistic regression loss, the L1-regularized loss adds an L1-norm penalty. The hyperparameter α is the penalty coefficient and controls the size of the penalty term (it is a different quantity from the gradient descent step size, which is also written α). Writing J_0(θ) for the unregularized loss of Section 3, the L1-regularized loss function of binary logistic regression is:
$$J(\theta) = J_0(\theta) + \alpha \lVert \theta \rVert_1$$
The common optimization methods for the L1-regularized loss are coordinate descent and least-angle regression.
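The L2-regularized loss function of binary logistic regression, with J_0(θ) again denoting the unregularized loss of Section 3, is:
$$J(\theta) = J_0(\theta) + \frac{1}{2}\alpha \lVert \theta \rVert_2^2$$
The L2-regularized loss is optimized in much the same way as the ordinary logistic regression loss.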
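The two penalties can be sketched in code as additions to the unregularized loss. In the example below, the names `regularized_loss` and `l2_gradient` and the argument name `penalty` (standing in for the penalty coefficient α) are assumptions made for illustration. As in the formulas, every component of θ is penalized, including any intercept term.

```python
import numpy as np

def regularized_loss(theta, X, y, penalty=1.0, kind="l2", eps=1e-12):
    """Unregularized loss plus an L1 or L2 penalty on theta."""
    h = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), eps, 1.0 - eps)
    data_term = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    if kind == "l1":
        return data_term + penalty * np.sum(np.abs(theta))
    if kind == "l2":
        return data_term + 0.5 * penalty * np.sum(theta ** 2)
    raise ValueError("kind must be 'l1' or 'l2'")

def l2_gradient(theta, X, y, penalty=1.0):
    """Gradient of the L2-regularized loss: X^T (h - y) + penalty * theta."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) + penalty * theta
```

The L2 gradient only adds the term `penalty * theta` to the ordinary gradient, which is why the same gradient descent loop still applies. The L1 penalty is not differentiable at zero, which is one reason methods such as coordinate descent are used for it instead.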
6. Generalizing binary logistic regression: multinomial logistic regression
The model and loss function in the previous sections are limited to binary logistic regression, but both can be extended to multinomial logistic regression quite easily. For example, we can always treat one class as positive and all remaining classes as class 0. This is the most commonly used method, one-vs-rest, abbreviated OvR.
Another multinomial approach is many-vs-many (MvM), which each time selects the samples of some classes and the samples of some other classes and performs binary logistic regression on them. The most commonly used variant is one-vs-one (OvO), which is a special case of MvM: each time we select the samples of two classes and perform binary logistic regression.
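Here we derive only one special case of multinomial logistic regression, softmax regression.
To extend to multinomial logistic regression, the model needs a slight extension. The binary model can be written in log-odds form as
$$\ln \frac{P(y = 1 \mid x, \theta)}{P(y = 0 \mid x, \theta)} = x\theta$$
We now assume a K-class classification model, i.e. the sample output y takes the values 1, 2, ..., K. Taking class K as the reference class, the log-odds equations become
$$\ln \frac{P(y = k \mid x, \theta)}{P(y = K \mid x, \theta)} = x\theta_k, \qquad k = 1, 2, \ldots, K - 1$$
and, together with the constraint that the K probabilities sum to 1, this gives K equations.
Solving this system of K equations gives the probability distribution of K-class logistic regression:
$$P(y = k \mid x, \theta) = \frac{e^{x\theta_k}}{1 + \sum_{t=1}^{K-1} e^{x\theta_t}}, \qquad k = 1, 2, \ldots, K - 1$$
$$P(y = K \mid x, \theta) = \frac{1}{1 + \sum_{t=1}^{K-1} e^{x\theta_t}}$$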
The derivation of the loss function and the optimization methods for multinomial logistic regression are similar to those for binary logistic regression, so we do not repeat them here.
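As an illustration of a K-class probability distribution of this kind, the sketch below uses the common parameterization in which class K is the reference class and each of the other K − 1 classes has its own coefficient vector. The function name `multinomial_proba` and the layout of `Theta` (one column per non-reference class) are assumptions of this example.

```python
import numpy as np

def multinomial_proba(X, Theta):
    """K-class probabilities with the last class as reference.

    X:     (m, n) sample matrix
    Theta: (n, K-1) matrix, one coefficient column per non-reference class
    Returns an (m, K) matrix whose rows sum to 1.
    """
    expo = np.exp(X @ Theta)                         # e^{x theta_k}, shape (m, K-1)
    denom = 1.0 + expo.sum(axis=1, keepdims=True)    # 1 + sum_t e^{x theta_t}
    return np.hstack([expo / denom, 1.0 / denom])    # classes 1..K-1, then class K

X = np.array([[1.0, 0.5], [1.0, -1.5]])
Theta = np.array([[0.2, -0.1], [1.0, 0.3]])   # K = 3 classes
print(multinomial_proba(X, Theta).sum(axis=1))
```

With K = 2 the first column reduces to the binary sigmoid model. This sketch does not guard against overflow in `np.exp` for large scores.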
7. Summary
Logistic regression, especially binary logistic regression, is a very common model and trains quickly. Although it is not as mainstream as the support vector machine (SVM), it is sufficient for common classification problems, and it trains much faster than SVM. If you want to understand machine learning classification algorithms, I personally think the first one to learn should be logistic regression. Once you understand logistic regression, other classification algorithms should not be that hard to learn.
Summary: in practical applications the classification threshold should be taken into account, especially when positive and negative samples are severely imbalanced or when a certain rate of false alarms is acceptable, in which case the probability threshold can be lowered appropriately. (For example, in atrial fibrillation detection, if 10 out of 10,000 people are abnormal and you predict everyone as normal, the accuracy is 99.9%, which is meaningless.)
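The arithmetic behind the 99.9% figure can be checked with a few lines of NumPy. The arrays below are synthetic labels constructed to match the example (10 abnormal cases out of 10,000), and recall is added here as one possible complementary metric; it is not mentioned in the original text.

```python
import numpy as np

# Synthetic labels: 1 = abnormal (10 cases), 0 = normal (9,990 cases)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "classifier" that predicts everyone as normal
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall: fraction of truly abnormal cases that were detected
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(accuracy, recall)   # accuracy is 0.999, recall is 0.0
```

The accuracy looks excellent while not a single abnormal case is found, which is why accuracy alone says little on severely imbalanced data.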
Description of the logistic regression function parameters: https://blog.csdn.net/jark_/article/details/78342644
Common questions:
Question: What is the relationship between the error in linear regression and the likelihood function here, and how should it be understood?
Answer: Hello. Error is generally measured by the mean squared error or the absolute error. As for the likelihood function, I personally think it can be understood simply as the opposite of the loss function: if the loss function minimizes f(x), then the likelihood function maximizes −f(x). Of course, the loss function and the likelihood function satisfy this relationship only for some algorithms.
Explanation and role of norms:


How logistic regression is solved
Because the maximum of the likelihood function cannot be solved for directly, we generally use gradient descent to approach the optimal solution step by step. There is actually a bonus point here that tests your knowledge of other optimization methods. Gradient descent itself comes in three forms: stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. The interviewer may ask about the pros and cons of the three and how to choose the most suitable one.
1. Briefly, batch gradient descent reaches the global optimum (for a convex loss such as the one here). Its drawback is that every parameter update has to traverse all the data, so the amount of computation is large and much of it is redundant. As a result, when the dataset is large, each parameter update is very slow.
2. Stochastic gradient descent (SGD) updates frequently with high variance. The advantage is that SGD can jump to new and potentially better local optima; the disadvantage is that it makes convergence to a local optimum more complicated.
3. Mini-batch gradient descent combines the advantages of batch gradient descent and SGD by using n samples for each update. This reduces the number of parameter updates and gives more stable convergence. It is the approach usually adopted in deep learning.
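The three variants differ only in how many samples feed each update, which a single loop with a batch-size parameter can show. This is a minimal sketch: the function name `fit_minibatch`, the per-epoch shuffling, the seed and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_minibatch(X, y, batch_size, alpha=0.1, n_epochs=200, seed=0):
    """Gradient descent on the logistic loss using batches of `batch_size` rows.

    batch_size == len(X) -> batch gradient descent
    batch_size == 1      -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)       # assumption: fixed seed for repeatability
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(m)           # reshuffle the samples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            h = 1.0 / (1.0 + np.exp(-(X[idx] @ theta)))
            theta -= alpha * (X[idx].T @ (h - y[idx]))
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
for size in (4, 1, 2):                       # batch, stochastic, mini-batch
    print(size, fit_minibatch(X, y, batch_size=size))
```

Each epoch costs the same number of gradient evaluations in all three cases; what changes is how often θ is updated and how noisy each update is.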
 There is an even more deeply hidden bonus point here: whether you understand optimization methods such as Adam and the momentum method. The methods above actually have two serious problems.
1. The first is how to choose a suitable learning rate for the model. Keeping the same learning rate from start to finish is not really appropriate. When the parameters have just begun to be learned, they are still far from the optimal solution, so a larger learning rate is needed to approach the optimum quickly. Later, when the parameters are already fairly close to the optimal solution, keeping the initial learning rate makes it easy to step over the optimum and oscillate back and forth around it. Put simply, the model learns too much and overshoots.
2. The second is how to choose a suitable learning rate for each parameter. In practice, keeping the same learning rate for every parameter is not reasonable. Some parameters are updated frequently, so their learning rate can be appropriately smaller. Some parameters are updated slowly, so their learning rate should be larger. We will not expand on this here; I will cover it separately when I have time.
The purpose of logistic regression
The purpose of this function is to classify the data into two classes and to improve accuracy.
Here we summarize some of the advantages of logistic regression in industrial applications:
 Simple form and very good interpretability. The feature weights show how much each feature affects the final result: if a feature's weight is relatively large, that feature has a relatively large impact on the result.
 Good results. The model is acceptable in engineering terms (as a baseline). If feature engineering is done well, the results will not be too bad, and feature engineering can be developed in parallel, which greatly speeds up development.
 Fast training. At classification time, the amount of computation depends only on the number of features. Distributed SGD optimization for logistic regression is also relatively mature, so training can be sped up further by adding machines, and we can iterate through several versions of the model in a short time.
 Small resource footprint, especially memory, because only the feature value of each dimension needs to be stored.
 Easy output adjustment. Logistic regression makes it easy to get the final classification result: because the output is a probability score for each sample, we can simply apply a cutoff to these scores, i.e. a threshold (scores above the threshold are one class, scores below it are the other).
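The cutoff described in the last point can be sketched in a couple of lines. The helper name `classify` and the example scores below are made up for illustration.

```python
import numpy as np

def classify(proba, threshold=0.5):
    """Return 1 where the probability score reaches the threshold, else 0."""
    return (np.asarray(proba) >= threshold).astype(int)

scores = np.array([0.05, 0.30, 0.55, 0.90])   # hypothetical model outputs
print(classify(scores))                        # [0 0 1 1]
print(classify(scores, threshold=0.2))         # [0 1 1 1]
```

Lowering the threshold moves borderline samples into the positive class without retraining the model.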
But logistic regression also has many drawbacks:
 Accuracy is not very high. Because its form is very simple (very similar to a linear model), it has difficulty fitting the true distribution of the data.
 It has difficulty with imbalanced data. For example, for a problem with very uneven positive and negative samples, such as a positive-to-negative ratio of 10,000:1, predicting every sample as positive still gives a relatively small value of the loss function. As a classifier, though, its ability to distinguish positive from negative samples will not be very good.
 Nonlinear data is troublesome to handle. Without introducing other methods, logistic regression can only handle linearly separable data or, going further, binary classification problems.
 Logistic regression cannot select features by itself. Sometimes we use GBDT to select features and then apply logistic regression.
From <https://www.cnblogs.com/ModifyRong/p/7739955.html>