1. Construct the prediction function h(x)
1) Logistic function (also called the Sigmoid function), whose form is:

g(z) = 1 / (1 + e^(−z))
For the case of a linear boundary, the boundary takes the form:

θ_0 + θ_1·x_1 + … + θ_n·x_n = θ^T·x (with x_0 = 1)
where x = (x_0, x_1, …, x_n)^T is a training sample and θ = (θ_0, θ_1, …, θ_n)^T is the vector of best parameters to be learned.
Construct the prediction function as:

h_θ(x) = g(θ^T·x) = 1 / (1 + e^(−θ^T·x))
The value h_θ(x) has a special meaning: it represents the probability that the result is 1. The probabilities that input x is classified as category 1 or category 0 are therefore:
P(y=1|x;θ) = h_θ(x)
P(y=0|x;θ) = 1 − h_θ(x)
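The two functions above can be sketched in NumPy as follows; the names `sigmoid` and `h` are my own, not from the original text:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Prediction h_theta(x) = g(theta^T x): the probability that y = 1."""
    return sigmoid(x @ theta)
```

For example, with all-zero parameters every input lands exactly on the boundary, so `h` returns 0.5.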
2. Construct the loss function J (m samples, each sample has n features)
The Cost function and the J function, derived from maximum likelihood estimation, are as follows:

Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1−y)·log(1 − h_θ(x))

J(θ) = (1/m)·Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)) = −(1/m)·Σ_{i=1}^{m} [y^(i)·log(h_θ(x^(i))) + (1−y^(i))·log(1 − h_θ(x^(i)))]
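A minimal sketch of J(θ) in NumPy, assuming the same conventions as above (each row of `x` is a sample, `y` holds 0/1 labels); the function name `cost_J` is my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J(theta, x, y):
    """J(theta) = -(1/m) * sum over samples of
    y*log(h_theta(x)) + (1-y)*log(1 - h_theta(x))."""
    m = len(y)
    h = sigmoid(x @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```

At θ = 0, h_θ(x) = 0.5 for every sample, so J(θ) = log 2 ≈ 0.693 regardless of the labels.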
3. Detailed derivation process of loss function
1) Since P(y=1|x;θ) = h_θ(x) and P(y=0|x;θ) = 1 − h_θ(x), the probability of a single sample can be written compactly as:

P(y|x;θ) = (h_θ(x))^y · (1 − h_θ(x))^(1−y)
The likelihood function is:

L(θ) = Π_{i=1}^{m} P(y^(i)|x^(i);θ) = Π_{i=1}^{m} (h_θ(x^(i)))^(y^(i)) · (1 − h_θ(x^(i)))^(1−y^(i))
The log-likelihood function is:

l(θ) = log L(θ) = Σ_{i=1}^{m} [y^(i)·log(h_θ(x^(i))) + (1−y^(i))·log(1 − h_θ(x^(i)))]
Maximum likelihood estimation finds the θ at which l(θ) reaches its maximum. Gradient ascent can be used to solve this, and the resulting θ is the required optimal parameter.
In Andrew Ng's course, J(θ) is taken as the following formula:

J(θ) = −(1/m)·l(θ)

so minimizing J(θ) by gradient descent is equivalent to maximizing l(θ).
2) Gradient descent method to find the minimum value
Taking the partial derivative of J(θ) with respect to θ_j gives ∂J(θ)/∂θ_j = (1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i), so the θ update process can be written as:

θ_j := θ_j − α·∂J(θ)/∂θ_j = θ_j − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i)

where α is the learning rate.
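One gradient-descent step, written with an explicit loop over the parameters to mirror the per-θ_j formula (a sketch with hypothetical names; the vectorized form in the next section removes this loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, x, y, alpha):
    """One update: theta_j := theta_j - alpha*(1/m)*sum_i (h(x_i) - y_i)*x_ij."""
    m, n = x.shape
    h = sigmoid(x @ theta)          # predictions for all m samples
    new_theta = theta.copy()
    for j in range(n):              # update each theta_j separately
        grad_j = np.sum((h - y) * x[:, j]) / m
        new_theta[j] = theta[j] - alpha * grad_j
    return new_theta
```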
4. Vectorization
Vectorization means using matrix calculations instead of for loops, which simplifies the computation and improves efficiency.
Vectorization process:
By convention the training data is arranged in matrix form as follows: each row of x is one training sample, each column is one feature (with x_0 = 1 for the intercept term), and y is the column vector of labels.
The argument A of g(A) is a column vector, so the implementation of g should accept a column vector as its parameter and return a column vector.
The θ update process can then be written in matrix form as:

θ := θ − (α/m)·x^T·(g(x·θ) − y)
To sum up, the steps of θ update after Vectorization are as follows:
- Find A=x*θ
- Find E=g(A)-y
- Find θ := θ − (α/m)·x^T·E
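The three steps above can be sketched as a vectorized training loop in NumPy (a minimal illustration, with my own function name `train` and arbitrary default hyperparameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, alpha=0.1, iters=1000):
    """Vectorized gradient descent for logistic regression."""
    m, n = x.shape
    theta = np.zeros(n)
    for _ in range(iters):
        A = x @ theta                     # step 1: A = x * theta
        E = sigmoid(A) - y                # step 2: E = g(A) - y
        theta -= (alpha / m) * (x.T @ E)  # step 3: theta update
    return theta
```

On a small linearly separable toy set (first column is the intercept x_0 = 1), the learned θ separates the two classes at the 0.5 threshold.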
5. Applicability of Logistic Regression
1) It can be used for probability prediction and classification.
Not all machine learning methods can output probabilities (SVM, for example, only outputs 1 or −1). The advantage of probability prediction is that the results are comparable: for example, after estimating the probability that each advertisement will be clicked, we can show the N ads with the highest predicted click probability. Whether the absolute probabilities are high or low, we can still take the optimal top N. When used for classification, only one threshold needs to be set: samples with probability above the threshold belong to one class, and those below it to the other.
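Both uses can be illustrated in a few lines; the probability values below are hypothetical, invented only for the example:

```python
import numpy as np

# Hypothetical predicted click probabilities for 6 ads.
probs = np.array([0.02, 0.45, 0.10, 0.33, 0.07, 0.21])

# Classification: a single threshold splits the two classes.
labels = (probs >= 0.5).astype(int)

# Ranking: take the top-N ads by predicted probability; this works
# regardless of whether the absolute probabilities are high or low.
N = 3
top_n = np.argsort(probs)[::-1][:N]   # indices of the N largest probabilities
```

Here no ad crosses the 0.5 threshold, yet the top-3 ranking (ads 1, 3, 5) is still well defined.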
2) Can only be used for linear problems
Logistic Regression can only be used when the features and the target have a linear relationship (unlike SVM, which can handle nonlinear problems). This has two practical implications. On the one hand, when the model is known in advance to be nonlinear, Logistic Regression should not be used; on the other hand, when using Logistic Regression, take care to select features that are linearly related to the target.
3) The conditional independence assumption need not hold between features, but each feature's contribution is computed independently.
Logistic regression does not require the conditional independence assumption that Naive Bayes does (since it does not model the posterior through per-feature likelihoods), but the contribution of each feature is computed independently. That is, LR will not automatically combine different features to generate new ones (do not harbor this illusion; that is the job of decision trees, LSA, pLSA, LDA, or of your own feature engineering). For example, if you need a feature such as TF*IDF, you must supply it explicitly; giving only the two features TF and IDF is not enough, as you will only get results of the form a·TF + b·IDF, without the c·TF·IDF term.
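The TF*IDF point can be made concrete: the interaction term must be added as its own column before training. The TF and IDF values below are hypothetical, used only to show the feature construction:

```python
import numpy as np

# Hypothetical TF and IDF values for 4 documents.
tf = np.array([0.10, 0.30, 0.05, 0.20])
idf = np.array([2.0, 1.0, 3.0, 1.5])

# Giving LR only these two columns lets it learn a*TF + b*IDF.
x_linear = np.column_stack([tf, idf])

# To obtain a c*TF*IDF term, the product must be supplied
# explicitly as a third feature column.
x_with_interaction = np.column_stack([tf, idf, tf * idf])
```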