Andrew Ng Machine Learning Notes, Lesson 3

Locally weighted regression

Difference between "parametric" learning and "non-parametric" learning:

  • Parametric learning: the algorithm's goal is to learn specific values for a fixed set of parameters; the number of parameters does not change with the data.
  • Non-parametric learning: the number of parameters grows (roughly linearly) with the amount of data, which effectively amounts to keeping the training set as part of the final model.

Locally weighted regression is a form of non-parametric learning: the training data is kept, and when predicting at a new input, the training points close to that input are given higher weight; those weighted points are used to fit the model parameters, which are then plugged in to produce the prediction.

To be precise:

Compared with ordinary linear regression, the cost function is modified by attaching a weight to each training example.
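In the usual notation (with w^(i) the weight on training example i and τ the bandwidth discussed below), the weighted cost and the standard Gaussian weighting are:

    J(\theta) = \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,
    \qquad
    w^{(i)} = \exp\left( -\frac{\left( x^{(i)} - x \right)^2}{2\tau^2} \right)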

If |x^(i) - x| is small (x being the point we want to predict for), the weight above is close to 1; if it is large, the weight is close to 0. In effect only data close to the query point is used for the prediction, which is why the method is called "locally" weighted.

τ is the "bandwidth" (width): the larger τ is, the more of the surrounding data effectively contributes to the prediction.

Locally weighted regression is better suited to data sets with a relatively small amount of data, because it has to keep all of the training data.
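To make the procedure concrete, here is a minimal NumPy sketch of locally weighted linear regression (my own illustration, not from the original notes; the function name, default bandwidth, and toy data are all made up):

    import numpy as np

    def lwr_predict(X, y, x_query, tau=0.5):
        """Predict y at x_query with locally weighted linear regression.

        X: (m, n) design matrix, already including a bias column of ones.
        y: (m,) targets.
        x_query: (n,) query point, also including the bias entry.
        tau: bandwidth of the Gaussian weighting kernel.
        """
        # Gaussian weights: points near the query get weight close to 1,
        # faraway points get weight close to 0.
        dists = np.sum((X - x_query) ** 2, axis=1)
        w = np.exp(-dists / (2 * tau ** 2))

        # Solve the weighted least-squares problem (X^T W X) theta = X^T W y.
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta

    # Toy usage: a noisy sine curve, predicted at one query point.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 6, 100)
    X = np.column_stack([np.ones_like(x), x])   # bias column plus the feature
    y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
    print(lwr_predict(X, y, np.array([1.0, 3.0]), tau=0.5))

Note that every prediction solves a small weighted least-squares problem from scratch, which is exactly why the whole training set has to be kept.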

Probabilistic interpretation of the squared error

Using least squares is equivalent to maximum likelihood estimation under the assumption that the errors are Gaussian and independent of each other. For details on MLE, see the separate article.
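As a brief sketch of why: assume y^(i) = θ^T x^(i) + ε^(i) with the ε^(i) independent and Gaussian, ε^(i) ~ N(0, σ²) (σ and the example count m are notation introduced here). The log-likelihood is then

    \ell(\theta) = \sum_{i=1}^{m} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right) \right]
                 = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

so maximizing the likelihood over θ is the same as minimizing the sum of squared errors.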

Logistic regression

Used to solve classification problems.

When y is 0 or 1 (discrete), we use logistic regression.

First, define the sigmoid function:
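In the standard form:

    g(z) = \frac{1}{1 + e^{-z}}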

Then, set our hypothesis function to be:
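That is, the sigmoid applied to a linear function of the input:

    h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}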

We can then assume that the model assigns the following probabilities to y given x, first as two separate cases and then combined into a single expression:
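Written out:

    P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)

which combine, since y is either 0 or 1, into

    P(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}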

Then the MLE method can be used to find θ.

It is easier to work with the log-likelihood:
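For m training examples the standard expression is:

    \ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]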

We need to find the θ that maximizes l(θ).

The method is gradient ascent:
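In general form, gradient ascent repeatedly steps in the direction of the gradient (α is the learning rate):

    \theta := \theta + \alpha \, \nabla_\theta \ell(\theta)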

See the second lesson for details.

Working out the gradient algebraically, the above update becomes:
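Using the sigmoid identity g'(z) = g(z)(1 - g(z)), the per-example, per-component form is:

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}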

Note that this update has exactly the same form as the LMS rule from the second lesson; the reason will be revealed later.

Unfortunately, logistic regression has no analytical solution (no analogue of the normal equations), so the parameters cannot be obtained by direct substitution; an iterative scheme must be used.
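To make the iteration concrete, here is a minimal NumPy sketch of batch gradient ascent for logistic regression (my own illustration, not from the notes; the learning rate, iteration count, and toy data are arbitrary choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_ga(X, y, alpha=0.1, n_iters=1000):
        """Fit logistic regression by batch gradient ascent on the log-likelihood.

        X: (m, n+1) design matrix with a leading column of ones (intercept).
        y: (m,) labels in {0, 1}.
        """
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            h = sigmoid(X @ theta)
            # Average gradient of the log-likelihood: X^T (y - h) / m.
            theta += alpha * X.T @ (y - h) / len(y)
        return theta

    # Toy usage: two Gaussian blobs, one per class.
    rng = np.random.default_rng(0)
    X0 = rng.normal(loc=-1.0, size=(50, 2))
    X1 = rng.normal(loc=+1.0, size=(50, 2))
    X = np.column_stack([np.ones(100), np.vstack([X0, X1])])
    y = np.concatenate([np.zeros(50), np.ones(50)])
    theta = logistic_regression_ga(X, y)
    print("training accuracy:", np.mean((sigmoid(X @ theta) > 0.5) == y))

The update line is the component-wise rule above, summed over the whole training set and divided by m (batch rather than stochastic gradient ascent).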

Newton's Method

Another method besides gradient descent is Newton's method.

Newton's method finds a zero of a function f(θ) by repeatedly updating θ.

Update with the following formula:
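For a scalar θ, the standard Newton-Raphson step is:

    \theta := \theta - \frac{f(\theta)}{f'(\theta)}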

When θ is a vector:

Illustration:

Since we are looking for the maximum point of l(θ) (or the minimum point of J), we use Newton's method to find the zero of the derivative l'(θ):
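Substituting f = l'(θ) into the same update gives:

    \theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}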

Newton's method converges quadratically, so it converges much faster than gradient descent.

When θ is a vector:
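In the standard multivariate form, with H the Hessian of l(θ):

    \theta := \theta - H^{-1} \nabla_\theta \ell(\theta)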

H is an (n+1)×(n+1) matrix (the extra dimension accounts for the 0th component, i.e., the intercept term).
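As a sketch of how this looks in code (my own illustration; the gradient and Hessian expressions below are the standard ones for the logistic-regression log-likelihood, and the function name and iteration count are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_newton(X, y, n_iters=10):
        """Fit logistic regression with Newton's method.

        X: (m, n+1) design matrix with a leading column of ones (intercept).
        y: (m,) labels in {0, 1}.
        """
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            h = sigmoid(X @ theta)
            grad = X.T @ (y - h)               # gradient of l(theta)
            S = np.diag(h * (1 - h))
            H = -X.T @ S @ X                   # Hessian of l(theta)
            theta -= np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
        return theta

    # Usage would mirror the earlier sketch, e.g. logistic_regression_newton(X, y).

Here np.linalg.solve avoids forming H^{-1} explicitly, but each step is still roughly cubic in the dimension of θ, which is the drawback discussed next.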

The disadvantage of Newton's method is that when the dimensionality of θ is large (say tens of thousands), each step is very expensive, because it involves inverting a high-dimensional matrix.

For linear regression, Newton's method is guaranteed to reach the solution in a single iteration (the cost function is quadratic, so one Newton step lands exactly on the minimum).


Origin blog.csdn.net/u012009684/article/details/112970389