Logistic Regression Principle

1. Construct the prediction function h(x)

1) Logistic function (also called the Sigmoid function), whose form is:

g(z) = 1 / (1 + e^(-z))

Its output lies in (0, 1): g(z) → 0 as z → -∞, g(z) → 1 as z → +∞, and g(0) = 0.5.

For the case of a linear boundary, the boundary takes the form:

θ_0 + θ_1·x_1 + … + θ_n·x_n = Σ_{i=0}^{n} θ_i·x_i = θ^T·x

where a training sample is the vector x = [x_0, x_1, …, x_n]^T (taking x_0 = 1) and the best parameters form the vector θ = [θ_0, θ_1, …, θ_n]^T.

Construct the prediction function as:

h_θ(x) = g(θ^T·x) = 1 / (1 + e^(-θ^T·x))

The value of h_θ(x) has a special meaning: it represents the probability that the result is 1. So the probabilities that input x is classified as category 1 or category 0 are:

P(y=1 | x; θ) = h_θ(x)
P(y=0 | x; θ) = 1 - h_θ(x)
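
A minimal NumPy sketch of these two functions (the names sigmoid and predict_proba are illustrative, not from the original post; x is assumed to carry a leading 1 for the intercept term):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(theta, x):
        # h_θ(x) = g(θ^T·x): the probability that y = 1 given x
        return sigmoid(x @ theta)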

2. Construct the loss function J (m samples, each sample has n features)

The Cost function and the J function are as follows; they are derived from maximum likelihood estimation:

Cost(h_θ(x), y) = -y·log(h_θ(x)) - (1-y)·log(1-h_θ(x))

J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)) = -(1/m) Σ_{i=1}^{m} [ y^(i)·log(h_θ(x^(i))) + (1-y^(i))·log(1-h_θ(x^(i))) ]
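
A direct translation of this formula into code, as a sketch (cross_entropy_loss is an illustrative name; X is the m×(n+1) sample matrix, y the 0/1 label vector):

    def cross_entropy_loss(theta, X, y):
        # J(θ) = -(1/m) Σ [ y·log(h) + (1-y)·log(1-h) ]
        m = len(y)
        h = sigmoid(X @ theta)
        return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))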

3. Detailed derivation of the loss function

1) Finding the cost function from probabilities.
The probability of a single sample can be written compactly as:

P(y | x; θ) = (h_θ(x))^y · (1 - h_θ(x))^(1-y)

The likelihood function is:

L(θ) = Π_{i=1}^{m} P(y^(i) | x^(i); θ) = Π_{i=1}^{m} (h_θ(x^(i)))^(y^(i)) · (1 - h_θ(x^(i)))^(1-y^(i))

The log-likelihood function is:

l(θ) = log L(θ) = Σ_{i=1}^{m} [ y^(i)·log(h_θ(x^(i))) + (1-y^(i))·log(1-h_θ(x^(i))) ]

Maximum likelihood estimation finds the θ at which l(θ) attains its maximum. Gradient ascent can be used to solve this, and the resulting θ is the required optimal parameter vector.

In Andrew Ng's course, J(θ) is taken to be the following, namely:

J(θ) = -(1/m)·l(θ)

so that maximizing l(θ) is equivalent to minimizing J(θ), which can be solved by gradient descent.

2) Gradient descent method to find the minimum value.

The general update rule is:

θ_j := θ_j - α·∂J(θ)/∂θ_j

Using g'(z) = g(z)·(1 - g(z)), the partial derivative works out to:

∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))·x_j^(i)

The θ update process can therefore be written as:

θ_j := θ_j - α·(1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))·x_j^(i)
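
One gradient-descent step over all parameters, written as an explicit loop to match the formula (gradient_descent_step and alpha are illustrative names; a sketch rather than the post's own code):

    def gradient_descent_step(theta, X, y, alpha):
        # θ_j := θ_j - α·(1/m)·Σ_i (h_θ(x^(i)) - y^(i))·x_j^(i), for each j
        m, n = X.shape
        h = sigmoid(X @ theta)
        new_theta = theta.copy()
        for j in range(n):
            grad_j = np.sum((h - y) * X[:, j]) / m
            new_theta[j] = theta[j] - alpha * grad_j
        return new_theta

The per-feature loop is exactly what Section 4 removes by vectorization.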

4. Vectorization

Vectorization uses matrix operations instead of for loops to simplify the computation and improve efficiency.
Vectorization process:
By convention the training data is arranged in matrix form as follows: each row of x is one training sample, and each column holds the values of one feature (with x_0^(i) = 1 for the intercept):

x = [ (x^(1))^T; (x^(2))^T; …; (x^(m))^T ], an m×(n+1) matrix whose i-th row is [x_0^(i), x_1^(i), …, x_n^(i)], and y = [y^(1), y^(2), …, y^(m)]^T is the column vector of labels.
The argument A of g(A) is a column vector, so the implementation of g should accept a column vector as its parameter and return a column vector.
The θ update process can then be changed to:

θ := θ - α·(1/m)·x^T·(g(x·θ) - y)

To sum up, the steps of the θ update after vectorization are as follows (a code sketch follows the list):

  1. Compute A = x·θ
  2. Compute E = g(A) - y
  3. Update θ := θ - α·(1/m)·x^T·E
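
A minimal NumPy sketch of these three steps (vectorized_step is an illustrative name; x, y, theta, and alpha are as above):

    def vectorized_step(theta, x, y, alpha):
        m = len(y)
        A = x @ theta                            # step 1: A = x·θ
        E = sigmoid(A) - y                       # step 2: E = g(A) - y
        return theta - (alpha / m) * (x.T @ E)   # step 3: θ := θ - α·(1/m)·x^T·E

Compared with the looped version above, one matrix product replaces the per-feature loop.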

5. Applicability of Logistic Regression

1) It can be used for probability prediction and classification.

       Not all machine learning methods can output probabilities (for example, SVM can only output 1 or -1). The advantage of probability prediction is that the results are comparable: for instance, after obtaining the click probability of different advertisements, we can show the N ads with the highest predicted click probability. This way, whether the obtained probabilities are all very high or all very low, we can still take the optimal top N. When used for classification, only a single threshold needs to be set: inputs whose probability is above the threshold fall into one class, and those below it into the other.
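
A small sketch of the thresholding step, reusing the predict_proba helper from above (the 0.5 default threshold is an assumed convention, not from the post):

    def predict(theta, X, threshold=0.5):
        # class 1 if P(y=1|x;θ) is at least the threshold, else class 0
        return (predict_proba(theta, X) >= threshold).astype(int)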

2) Can only be used for linear problems

       Logistic Regression can only be used when the features and the target have a linear relationship (unlike SVM, which can deal with nonlinear problems). This has two practical implications: on the one hand, when the problem is known in advance to be nonlinear, do not use Logistic Regression; on the other hand, when using Logistic Regression, take care to select features that are linearly related to the target.

3) The conditional independence assumption does not need to hold between features, but the contribution of each feature is computed independently.

       Logistic regression does not require a conditional independence assumption the way Naive Bayes does (it models P(y|x) directly rather than factoring a per-feature likelihood). But the contribution of each feature is computed independently; that is, LR will not automatically combine different features to generate new ones (do not harbor that illusion; such combination is the job of decision trees, LSA, pLSA, LDA, or your own feature engineering). For example, if you need a feature such as TF*IDF, you must supply it explicitly. Giving the two dimensions TF and IDF separately is not enough: you will only get results of the form a*TF + b*IDF, never the effect of c*TF*IDF.
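
A brief sketch of supplying such a product feature explicitly (tf, idf, and the column layout are illustrative assumptions):

    # tf and idf are per-sample feature arrays; LR over [tf, idf] alone can
    # only learn a*tf + b*idf, so the product is added as its own column.
    tf_idf = tf * idf
    X = np.column_stack([np.ones_like(tf), tf, idf, tf_idf])  # intercept + features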

