Machine Learning Foundations Lecture 10 Notes

Lecture 10: Logistic Regression

10-1 Logistic Regression Problem

Running example: heart disease prediction.

This is a soft binary classification problem: instead of answering only "diseased" or "not diseased", we want the probability that a patient has heart disease (20%? 80%?).

In an ideal training set, the target f(x) would be a number between 0 and 1 (the probability of disease); in reality, each example is labeled only 0 or 1 (diseased or not).

So how do we obtain a good hypothesis from such data?

As in the other linear models, add the constant feature x0 = 1 and compute a weighted score

s = w^T x.

Then convert s into a number between 0 and 1 with the logistic (sigmoid) function

θ(s) = 1 / (1 + exp(-s)),

which is smooth, monotonically increasing, and maps (-∞, +∞) onto (0, 1).

So the expression for logistic regression: use h to approximate the ideal target function f(x) = P(+1 | x), where

h(x) = θ(w^T x) = 1 / (1 + exp(-w^T x)).
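A minimal NumPy sketch of this hypothesis (my own illustration, not from the original notes; the names sigmoid and logistic_h are hypothetical):

```python
import numpy as np

def sigmoid(s):
    """Logistic (sigmoid) function: theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_h(w, x):
    """Logistic regression hypothesis h(x) = theta(w^T x).

    Assumes x already contains the constant feature x0 = 1.
    """
    return sigmoid(np.dot(w, x))
```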


10-2 Logistic Regression Error

Compare logistic regression with linear classification and linear regression:

All three start from the same score s = w^T x. Linear classification takes sign(s) and measures the 0/1 error; linear regression uses s directly and measures the squared error. So what error measure should logistic regression use?

The answer comes from likelihood: a good h should make the observed data likely, where y = +1 occurs with probability f(x) and y = -1 with probability 1 - f(x). For the logistic hypothesis, 1 - h(x) = h(-x) (easy to see from the symmetry of the sigmoid), so both cases can be written as h(y x), and

likelihood(h) ∝ h(y_1 x_1) · h(y_2 x_2) · ... · h(y_N x_N).

We want the h, that is, the w, with the highest likelihood. Take the log to turn the product into a sum, add a minus sign to turn maximization into minimization, and divide by N:

min_w (1/N) Σ_{n=1}^{N} -ln θ(y_n w^T x_n).


Substituting θ(s) = 1 / (1 + exp(-s)) gives the cross-entropy error

err(w, x, y) = ln(1 + exp(-y w^T x)),

so the in-sample error is

E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(-y_n w^T x_n)).
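A minimal NumPy sketch of this in-sample error (my own illustration; X is assumed to be the N×d data matrix with the constant feature already added, and y a vector of ±1 labels):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    scores = y * (X @ w)                      # y_n * w^T x_n for every example
    return np.mean(np.log(1.0 + np.exp(-scores)))
```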



10-3 Gradient of Logistic Regression Error

At this point we know E_in for logistic regression. The next question is how to find the w that minimizes it, namely

min_w E_in(w) = min_w (1/N) Σ_{n=1}^{N} ln(1 + exp(-y_n w^T x_n)).

As with linear regression, we look for the place where the gradient is 0 (the bottom of the valley); since E_in(w) is continuous, differentiable, and convex, that is where its minimum lies.

Taking the partial derivatives of E_in with respect to the components of w gives the gradient

∇E_in(w) = (1/N) Σ_{n=1}^{N} θ(-y_n w^T x_n)(-y_n x_n).

One way for this gradient to be 0 is for every θ(·) term to be 0; that requires each exponent -y_n w^T x_n to go to negative infinity, i.e. every y_n and w^T x_n to have the same sign, which can only happen when the data are linearly separable.

So in most cases a gradient of 0 comes from the whole weighted sum being 0, and unlike linear regression there is no closed-form solution; we have to solve for w iteratively.
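A matching NumPy sketch of this gradient (my own illustration, under the same assumptions about X and y as above):

```python
import numpy as np

def gradient_ein(w, X, y):
    """Gradient of E_in: (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    scores = y * (X @ w)                       # y_n * w^T x_n
    weights = 1.0 / (1.0 + np.exp(scores))     # theta(-y_n w^T x_n)
    return -(weights * y) @ X / len(y)         # average of theta(...) * (-y_n x_n)
```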

Because there is no analytic solution, we borrow the step-by-step correction idea from PLA. The two cases of PLA ("update on a mistake, do nothing otherwise") can be written as a single formula:

w_{t+1} ← w_t + 1 · [y_n ≠ sign(w_t^T x_n)] · y_n x_n.

When a point is misclassified, the bracketed indicator is 1, so we add y_n x_n; when the point is correct, it is 0 and no update is needed. In other words, PLA is iterative optimization of the form w_{t+1} ← w_t + η v.
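For comparison, a small sketch of this unified PLA update (illustrative only; the function name is my own):

```python
import numpy as np

def pla_update(w, x_n, y_n):
    """One PLA step: w <- w + [y_n != sign(w^T x_n)] * y_n * x_n."""
    mistake = np.sign(np.dot(w, x_n)) != y_n   # the indicator in the brackets
    return w + (1.0 if mistake else 0.0) * y_n * x_n
```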


10-4 Gradient Descent

Iterative optimization: update w step by step, w_{t+1} ← w_t + η v, until we reach (approximately) the optimal solution.

Be greedy: at each step, move in the direction of steepest descent. For a small step size η, a first-order (linear) approximation of E_in around w_t shows that the best unit direction is

v = -∇E_in(w_t) / ||∇E_in(w_t)||,

that is, go in the direction opposite to the gradient.

What about η (the step size)? If η is too small, progress is slow; if η is too large, the steps overshoot and the result is inaccurate.

What is just right? Take large steps where the slope is steep and small steps where the slope is gentle, i.e. use a changing η that is positively correlated with ||∇E_in(w_t)||.

Substituting such an η cancels the normalization, and the update simplifies to

w_{t+1} ← w_t - η ∇E_in(w_t),

where the new η (the purple one in the slides) is a fixed value, the learning rate. (If η is too small, learning is slow; if η is too large, learning is unstable.)

As in PLA, iterate step by step; stop when the gradient is approximately 0 (we are probably at the bottom of the valley) or after enough iterations, and return the final w as g.
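Putting the pieces together, a minimal fixed-learning-rate gradient-descent sketch for logistic regression (my own illustration; eta, the iteration count, and the stopping tolerance are arbitrary choices, not values from the notes):

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, num_steps=1000):
    """Gradient descent on the cross-entropy error E_in.

    X: N x d data matrix with the constant feature x0 = 1 already added.
    y: length-N vector of +/-1 labels.
    """
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        scores = y * (X @ w)                        # y_n * w^T x_n
        grad = -((1.0 / (1.0 + np.exp(scores))) * y) @ X / len(y)
        if np.linalg.norm(grad) < 1e-6:             # gradient approximately 0: near the valley bottom
            break
        w = w - eta * grad                          # step opposite to the gradient
    return w                                        # return the final w as g
```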


