Logistic Regression (brief introduction)

Logistic regression (also called logistic regression analysis) is a generalized linear model and a form of supervised learning. Its derivation and computation resemble those of ordinary regression, but it is mainly used to solve binary classification problems (it can also handle multi-class problems). Given n groups of data (the training set), we train a model, then classify one or more given groups of data (the test set). Each group of data consists of p indicators (features).

-- Reposted from another author

(1) Data processed by logistic regression

Logistic regression assigns each data point to a category. Suppose, for example, that we know the height, weight, and category label of n people.

We then train a model (a line, or more generally a plane) that separates the scatter points, putting each category on one side of the plane.

(2) Algorithm principle

First, consider the binary classification problem. Since there are two categories, we let one of the labels be 0 and the other be 1.

The input x is mapped into the interval (0, 1): an output greater than 0.5 is classified as 1, and one less than 0.5 as 0. The mapping uses the sigmoid function

\sigma (x)=\frac{1}{1+e^{-x}}

and we define the hypothesis

h(\mathbf{x}^{i})=\frac{1}{1+e^{-(\mathbf{w}^{T}\mathbf{x}^{i}+b)}}

Here \mathbf{x}^{i} is the i-th sample, a p-dimensional column vector \begin{pmatrix} x^{i}_{1}&x^{i}_{2}&\dots &x^{i}_{p}\end{pmatrix}^{T};

\mathbf{w} is a p-dimensional column vector \begin{pmatrix}w_{1}&w_{2}&\dots &w_{p} \end{pmatrix}^{T}, which is a parameter to be estimated;

b is a scalar and is also a parameter to be estimated.

Note that w^{T}x+b expands to w_{1}x_{1}+w_{2}x_{2}+\dots+w_{p}x_{p}+b. So we can rewrite \mathbf{w} as \begin{pmatrix}w_{1}&w_{2}&\dots&w_{p}&b \end{pmatrix}^{T} and \mathbf{x}^{i} as \begin{pmatrix}x_{1}^{i}&x_{2}^{i}&\dots&x_{p}^{i}&1 \end{pmatrix}^{T}, so that the hypothesis becomes simply:

 h(\mathbf{x}^{i})=\frac{1}{1+e^{-\textbf{w}^{T}\textbf{x}^{i}}}

This absorbs the parameter b into w, which also makes the later derivation much more convenient. Of course, we could work with the first form instead; the essence is the same. The parameter w is then estimated from the training samples.
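As a minimal numpy sketch of this setup (the numbers below are made up for illustration; w holds [w_1, ..., w_p, b], and the constant feature 1 is appended to x as described above):

```python
import numpy as np

def sigmoid(z):
    # map any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    # hypothesis h(x) = sigmoid(w^T x), with the bias b absorbed into w:
    # w has p + 1 entries, the last one playing the role of b
    x_aug = np.append(x, 1.0)  # append the constant feature 1
    return sigmoid(w @ x_aug)

# illustrative numbers only: p = 2 features, w = [w1, w2, b]
w = np.array([0.8, -0.5, 0.1])
x = np.array([1.7, 0.6])
print(h(w, x))  # estimated probability that x belongs to class 1
```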

(3) Solving parameters

(1) Maximum likelihood estimation

That is, we choose the parameters under which the observed outcomes are most probable.

For sample i, the label y_{i}\in \{0,1\}, and we regard h(\mathbf{x}_{i}) as a probability: when y_{i}=1, the probability is h(\mathbf{x}_{i}), the likelihood that \mathbf{x}_{i} belongs to class 1; when y_{i}=0, the probability is 1-h(\mathbf{x}_{i}), the likelihood that \mathbf{x}_{i} belongs to class 0. We then construct the maximum likelihood function

\prod _{i=1}^{k}h(\mathbf{x}_{i})\prod _{i=k+1}^{n}(1-h(\mathbf{x}_{i}))

where (assuming the samples are ordered so that class 1 comes first) i=1 to k are the k samples belonging to category 1, and i=k+1 to n are the n-k samples belonging to category 0. Since y_{i} is the label 0 or 1, the above formula can also be written as:

\prod _{i=1}^{n} h(\mathbf{x}_{i})^{y_{i}}(1-h(\mathbf{x}_{i}))^{1-y_{i}}

Regardless of whether y_{i} is 0 or 1, one of the two factors has exponent 0 and equals 1, so this single expression conveniently covers both cases. We take the logarithm to turn the product into a sum, and multiply by minus one so that maximizing the likelihood becomes minimizing a loss; since the sum over n samples can be large, we also average by 1/n:

L(\mathbf{w})=\frac{1}{n}\sum_{i=1}^{n}\left[-y_{i}\ln(h(\mathbf{x}_{i}))-(1-y_{i})\ln(1-h(\mathbf{x}_{i}))\right]

There are many ways to find the minimum. Gradient descent methods are the most common in machine learning; Newton's method can also be used, or one can solve for the w at which the derivative is zero, etc.
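For instance, here is a small gradient descent sketch. It uses the standard result that the gradient of L(\mathbf{w}) above is \frac{1}{n}\sum_{i}(h(\mathbf{x}_{i})-y_{i})\mathbf{x}_{i}; the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    # minimize L(w) by plain batch gradient descent;
    # the gradient of the averaged loss is (1/n) * X_aug^T (h - y)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb b into w
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X_aug @ w)                 # h(x_i) for every sample
        w -= lr * (X_aug.T @ (p - y)) / len(y)
    return w

# toy data: class 1 lies to the right of class 0
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
print(w)  # learned [w1, b]
```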

(2) Loss function

The cross-entropy loss (the negative log-likelihood derived above) is most commonly used; a squared loss function is sometimes used as well.

Below we explain some of the parameters of scikit-learn's LogisticRegression:

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

1.penalty : str, default:'l2'

The choice of regularization. There are two main types, l1 and l2, and the default is l2 regularization. liblinear supports both, while newton-cg, lbfgs and sag support only l2 regularization.

2.dual : bool(True, False), default:False

If True, solve the dual form. The dual form exists only when penalty='l2' and solver='liblinear'. When the number of samples exceeds the number of features, the default False (the primal form) is usually preferred.

3.tol : float, default:1e-4

Tolerance for the stopping criterion: iteration stops once the error no longer exceeds tol (1e-4 by default), so you can control how precisely the solver converges.

4.C : float, default:1.0

The reciprocal of the regularization coefficient λ; must be a positive float. As in SVM, the smaller the value, the stronger the regularization. The default is 1.0.

5.fit_intercept : bool(True, False), default:True

Whether to fit an intercept; the default is True.

6.intercept_scaling : float, default:1.0

Only useful when the solver is "liblinear" and fit_intercept=True. In this case, x becomes [x, intercept_scaling], i.e. a "synthetic" feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight. Note: the synthetic feature weight is subject to l1/l2 regularization like all other features; to reduce the effect of regularization on the synthetic feature weight (and thus on the intercept), intercept_scaling must be increased. In effect this adds an artificial feature that is always 1, whose weight is b.
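A hypothetical illustration of that synthetic feature (the data and scaling value are made up; this only mimics what the prose describes, not scikit-learn's internals):

```python
import numpy as np

X = np.array([[1.7, 60.0],
              [1.6, 55.0]])
intercept_scaling = 2.0

# a constant column equal to intercept_scaling is appended to each instance
X_aug = np.hstack([X, np.full((len(X), 1), intercept_scaling)])
print(X_aug)
# the reported intercept is then
# intercept_scaling * (the weight learned for this constant column)
```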


7.class_weight :dict or ‘balanced’,default:None

The class_weight parameter specifies the weights of the classes in the model. It can be omitted, meaning weights are not considered and all classes have the same weight. Otherwise, you can pass 'balanced' to let the library compute the weights itself, or supply the weight of each class yourself: for a binary model with classes 0 and 1, for example, class_weight={0:0.9, 1:0.1} gives class 0 a weight of 90% and class 1 a weight of 10%.

If class_weight is 'balanced', the library computes the weights from the training sample counts: the larger a class's sample count, the lower its weight, and the smaller the sample count, the higher the weight. The formula is n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples, n_classes is the number of classes, and np.bincount(y) gives the number of samples in each class.
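A small sketch of both options, using made-up labels (9 samples of class 0 and 1 of class 1) to check the 'balanced' formula by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 9 samples of class 0, 1 of class 1

# 'balanced' reproduces n_samples / (n_classes * np.bincount(y)):
print(len(y) / (2 * np.bincount(y)))  # -> [0.556  5.0]: the rare class gets more weight

# explicit weights, matching the {0: 0.9, 1: 0.1} example above
clf = LogisticRegression(class_weight={0: 0.9, 1: 0.1})
clf_balanced = LogisticRegression(class_weight='balanced')
```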

8.random_state:int,default:None

Random number seed, int, optional; the default is None. It is only used when the solver is 'sag' or 'liblinear'.

9.solver :‘newton-cg’,‘lbfgs’,‘liblinear’,‘sag’,'saga',default:liblinear

liblinear: The open source liblinear library is used to implement it, and the coordinate axis descent method is used internally to iteratively optimize the loss function.

lbfgs: A kind of quasi-Newton method, which uses the second-order derivative matrix of the loss function, namely the Hessian matrix, to iteratively optimize the loss function.

newton-cg: Also a member of the Newton family of methods, it uses the second derivative matrix of the loss function, namely the Hessian matrix, to iteratively optimize the loss function.

sag: Stochastic average gradient descent, a variant of gradient descent. It differs from ordinary gradient descent in that each iteration uses only a subset of the samples to compute the gradient, which suits datasets with many samples.

saga: A linearly convergent stochastic optimization algorithm.

For small datasets 'liblinear' can be chosen, while 'sag' and 'saga' are faster for large datasets.

For multi-class problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-vs-rest (that is, with liblinear a multi-class problem is handled by taking one class as positive and all remaining classes as negative, then traversing all classes in turn).

The three optimization algorithms 'newton-cg', 'lbfgs' and 'sag' only deal with the L2 penalty (these three algorithms require the first or second order continuous derivative of the loss function), while 'liblinear' and 'saga' can handle L1 and L2 penalties.

10.max_iter:int ,default:100

Only available for newton-cg, sag and lbfgs solvers. The maximum number of iterations for the solver to converge.

11.multi_class:str,{‘ovr’, ‘multinomial’},default:‘ovr’

'ovr': use one-vs-rest strategy, 'multinomial': directly use multi-class logistic regression strategy.

If you choose ovr, you can choose the 4 loss function optimization methods liblinear, newton-cg, lbfgs and sag. But if you choose multinomial, you can only choose newton-cg, lbfgs and sag.
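For example, the following combinations are consistent with the compatibility rules above (assuming a scikit-learn version matching the signature shown earlier, where 'liblinear' is still the default solver and multi_class is a constructor argument):

```python
from sklearn.linear_model import LogisticRegression

# ovr works with all four solvers discussed above
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')

# multinomial excludes liblinear; newton-cg, lbfgs and sag remain
softmax = LogisticRegression(multi_class='multinomial', solver='lbfgs')

# the l1 penalty narrows the choice further, to liblinear or saga
lasso_like = LogisticRegression(penalty='l1', solver='saga')

# incompatible combinations raise an error at fit time, e.g.
# LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)  # ValueError
```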

12.verbose:int,default:0

Log verbosity, used to turn on/off the log output in the middle of iteration.

13.warm_start:bool(True、False),default:False

Warm start parameter: if True, the previous training result is reused to continue training; otherwise training starts from scratch. Useless for the liblinear solver.

14.n_jobs:int,default:1

Number of parallel jobs; the default is 1. With 1, the program runs on one CPU core; with 2, on two cores; with -1, on all CPU cores.
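Putting it together, a minimal end-to-end sketch with made-up height/weight data (all values are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up height (m) / weight (kg) data in the spirit of the example above
X_train = np.array([[1.55, 50.0], [1.60, 52.0], [1.62, 55.0],
                    [1.75, 70.0], [1.80, 80.0], [1.78, 75.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
clf.fit(X_train, y_train)

print(clf.coef_, clf.intercept_)          # the learned w and b
print(clf.predict([[1.70, 65.0]]))        # predicted class for a new sample
print(clf.predict_proba([[1.70, 65.0]]))  # probabilities [1 - h(x), h(x)]
```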

Summary:
The purpose of logistic regression is to find the best-fitting parameters of the nonlinear sigmoid function, and the solving process can be completed by an optimization algorithm.


Origin: blog.csdn.net/jcandzero/article/details/127820864