Andrew Ng Machine Learning Notes, Lesson 2

Notation

Hypothesis $h(x)$: a function that maps inputs to outputs; the goal of learning is to find this function.

$(x^{(i)}, y^{(i)})$: the $i$-th training example, where $x$ is a vector (the input) and $y$ is a scalar (the output).

Training set size $m$: the number of training examples in the training set.

Number of features $n$: the number of features in each training example.

Parameters $\theta$: the parameters that determine $h$.

Linear regression

The goal is to choose $\theta$ so that the prediction

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

(assuming $x_0 = 1$) has the smallest error (cost function)

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
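For concreteness, here is a minimal NumPy sketch of this hypothesis and cost function (the names `hypothesis` and `cost`, and the assumption that the $x_0 = 1$ column is already included in `X`, are mine, not from the lecture):

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x for every row of X
    # (X is assumed to already contain the x_0 = 1 intercept column)
    return X @ theta

def cost(theta, X, y):
    # J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)
```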

Gradient descent

The process of repeatedly updating $\theta$ so that $J$ keeps moving toward its minimum:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

$\alpha$ is the learning rate; together with the magnitude of the gradient, it determines how fast the descent proceeds.

Gradient descent converges to a local optimum.
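As a generic sketch (not yet specific to linear regression), a single gradient descent step only needs the gradient of $J$; `grad_J` is a placeholder supplied by the caller, and the default `alpha` is purely illustrative:

```python
import numpy as np

def gradient_descent_step(theta, grad_J, alpha=0.01):
    # theta_j := theta_j - alpha * dJ/dtheta_j, applied to every component at once
    return theta - alpha * grad_J(theta)

# example: one step on J(theta) = theta^2, whose gradient is 2 * theta
theta = gradient_descent_step(np.array([1.0]), lambda t: 2 * t, alpha=0.1)
```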

Batch gradient descent

For linear regression, substituting $J$ into the update rule transforms it into the following form:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

This method is called batch gradient descent.

If the training set contains a single example, the formula above becomes

$$\theta_j := \theta_j + \alpha \left( y - h_\theta(x) \right) x_j$$

This is called the least mean squares (LMS) update rule.

The advantage of batch gradient descent is that (with a suitable learning rate) it eventually converges; for linear regression the cost function is convex, so it is guaranteed to converge to the global optimum.

The disadvantage is that it is slow: every single update requires a pass over the entire training set.
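A minimal sketch of batch gradient descent for linear regression, under the same assumptions as above (intercept column already included in `X`); `alpha` and `num_iters` are illustrative defaults, not values from the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # one update uses ALL m training examples
        errors = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every example
        theta = theta + alpha * (X.T @ errors)  # theta_j += alpha * sum_i errors_i * x_j^(i)
    return theta
```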

Stochastic gradient descent

To address the slowness of batch gradient descent, we can apply the LMS rule to one training example at a time and update $\theta$ immediately:

$$\text{for } i = 1, \ldots, m: \quad \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The advantage is that progress toward the minimum is much faster; the disadvantage is that it never converges exactly, instead wandering within a small neighborhood of the minimum.
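A corresponding sketch of stochastic gradient descent, applying the LMS rule to one example at a time; shuffling the examples each pass is a common practice I have added, not something stated in the notes:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        for i in rng.permutation(m):              # visit examples in random order (optional)
            error = y[i] - X[i] @ theta           # y^(i) - h_theta(x^(i)) for one example
            theta = theta + alpha * error * X[i]  # immediate LMS update from this example
    return theta
```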

Normal equation

For linear regression, we do not actually need an iterative algorithm: there is a way to obtain the analytical (closed-form) solution directly.

Matrix derivatives and the trace

For a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$, we define the derivative operator $\nabla_A$, which acts on $f$ to give a new function whose $(i, j)$ entry is $\partial f / \partial A_{ij}$:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

Define the trace of a square matrix $A$ as $\operatorname{tr} A = \sum_i A_{ii}$.

It has the following properties:

$$\operatorname{tr} AB = \operatorname{tr} BA, \qquad \operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA, \qquad \operatorname{tr} A = \operatorname{tr} A^T$$

$$\nabla_A \operatorname{tr} AB = B^T, \qquad \nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$$
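As an illustration (not part of the original notes), two of these trace properties can be checked numerically with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # tr AB == tr BA
print(np.isclose(np.trace(A @ A.T), np.trace((A @ A.T).T)))  # tr C == tr C^T
```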

Define the design matrix $X$: each training example in the training set forms one of its rows, and the first column is fixed to 1 (the intercept term $x_0$). It is an $m \times (n+1)$ matrix:

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

Define the target vector $\vec{y}$, which stacks the answers (outputs) of the training examples:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Expressed with matrices, the vector of predictions $h$ is $X\theta$, so

$$X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$

The cost $J$ can then be expressed as

$$J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})$$

Using the properties above, take the derivative with respect to $\theta$:

$$\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}$$

Setting the derivative to zero gives the minimum point:

$$X^T X \theta = X^T \vec{y} \quad \Longrightarrow \quad \theta = (X^T X)^{-1} X^T \vec{y}$$

This is the analytical solution of linear regression, called the normal equation.
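A sketch of the normal equation in NumPy; solving the linear system with `np.linalg.solve` rather than explicitly inverting $X^T X$ is my own choice for numerical stability, and it assumes $X^T X$ is invertible:

```python
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```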

 

 

Origin blog.csdn.net/u012009684/article/details/112955884