## Machine Learning: Logistic Regression

Definition: Logistic regression assumes the labels follow a Bernoulli distribution and fits the parameters by maximizing the likelihood function (typically via gradient descent), in order to classify the data into two categories.

Input: Like linear regression, logistic regression takes a linear combination of the features as input, but its output is a probability. The sigmoid function can be derived from the probability mass function of the Bernoulli distribution (a distribution similar to a coin toss).

The final form of logistic regression:

$h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T}x}}$
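This hypothesis can be sketched directly in NumPy. The clipping bound and function names here are illustrative choices, not part of the original text:

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow in exp() for very large |z|
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # linear combination theta^T x, squashed into (0, 1)
    return sigmoid(X @ theta)
```

`sigmoid(0)` returns exactly 0.5, the usual decision boundary.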

The sigmoid maps any real input to a value in (0, 1); samples are generally classified using a threshold of 0.5.

The loss function of logistic regression comes from its likelihood function, which is maximized during training:

$L(\theta)=\prod_{i=1}^{m}h_{\theta}(x^{(i)})^{y^{(i)}}\,(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}$
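In practice the negative log of this product is used as the loss (the cross-entropy), since sums are numerically better behaved than products. A minimal sketch, with `eps` clipping added here as an assumption to keep `log(0)` out of the computation:

```python
import numpy as np

def neg_log_likelihood(theta, X, y, eps=1e-12):
    # p_i = h_theta(x^(i)) for each sample
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    # -log of the product = -sum of per-sample log terms
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

With all-zero weights every prediction is 0.5, so the loss is m·log 2, a useful sanity check.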

This likelihood has no closed-form maximizer, so the optimum is approached iteratively by running gradient descent on the negative log-likelihood.

Handling overfitting: apply L1 or L2 regularization, which adds a penalty on the weights. L2 regularization is used most often; L1 regularization has a truncation effect (it drives weights to exactly zero, producing sparsity), while L2 regularization has a scaling effect (it shrinks all weights toward zero).
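For L2 regularization, the penalty $\frac{\lambda}{2}\lVert\theta\rVert^{2}$ simply adds $\lambda\theta$ to the gradient. A sketch under the common convention (assumed here, not stated in the text) that the bias term in column 0 is not penalized:

```python
import numpy as np

def fit_logreg_l2(X, y, lam=1.0, lr=0.1, n_iters=1000):
    # lam is the L2 penalty strength; lam=0 recovers plain logistic regression
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        penalty = lam * theta
        penalty[0] = 0.0  # conventionally the bias is not regularized
        grad = (X.T @ (p - y) + penalty) / len(y)
        theta -= lr * grad
    return theta
```

A larger `lam` visibly shrinks the learned weight, which is the scaling effect described above.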

Advantages of logistic regression:

1) The form is simple and highly interpretable: the influence of each feature on the result can be read from the size of its weight, so the top k most influential features can be identified;
2) The model works well in practice and can serve as a baseline; if the feature engineering is done well, the results will not be bad;
3) Training is fast: the amount of computation depends only on the number of features, and resource consumption, especially memory, is small, since only the feature values of each dimension need to be stored;
4) The output is easy to adjust: the model produces a probability score, and a threshold splits the samples, with results above the threshold assigned to one class and results below it to the other.
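Point 4 above, adjusting the decision via the threshold, can be sketched as follows (the function name and example weights are illustrative):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    # probability at or above the threshold -> class 1, otherwise class 0
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (p >= threshold).astype(int)
```

Raising the threshold makes the classifier more conservative about assigning class 1, which is useful when false positives are costly.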