Linear regression derivation

Remarks (this is my first blog post; please point out and correct anything that is wrong)

Understanding linear regression

Machine learning is divided into supervised learning and unsupervised learning, and linear regression belongs to supervised learning. It requires features X and labels Y, and it trains a mapping function between the input (features X) and the output (labels Y).

At this point you may ask: what is a feature, what is a label, and what is this mapping function?

A simple linear function:

Y = w1 * x1 + w0 * x0

If we set x0 = 1, then

Y = w1 * x1 + w0

Doesn't this look like the equation y = k * x + b? Yes, it means pretty much the same thing.

The above is a very simple linear function, but it may still not be clear what the features and labels are. No problem, look at the table below.

| Salary (x0) | Age (x1) | Huabei credit limit (Y) |
| --- | --- | --- |
| 3000 | 22 | 1600 |
| 5000 | 26 | 4000 |
| 10000 | 30 | 8000 |

As the table shows, salary and age together determine the Huabei credit limit: salary and age are the features x0 and x1 that we need, and the Huabei limit is the final result Y.

Now it should be much clearer: linear regression looks for the relationship between the known conditions, salary and age (x0 and x1), and the Huabei limit Y. That relationship is determined by the w values in the formula. You may still wonder: what if there are more features x? No problem, we represent them all with a feature vector X.
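To make the features and labels concrete, here is a minimal sketch (it assumes NumPy, which the post itself does not mention) that puts the table above into a feature matrix X and a label vector y:

```python
import numpy as np

# Features from the table above: each row is one sample [salary, age]
X = np.array([
    [3000.0, 22.0],
    [5000.0, 26.0],
    [10000.0, 30.0],
])

# Labels: the Huabei credit limit of each sample
y = np.array([1600.0, 4000.0, 8000.0])

print(X.shape)  # (3, 2) -> 3 samples, 2 features
print(y.shape)  # (3,)   -> one label per sample
```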
Feature vector: X = (x0, x1, ..., xn)
There are likewise several w values, so we represent them with a vector as well, the weight vector.
Weight vector: W = (w0, w1, ..., wn)
The linear function is then written as

h_W(x) = w0 * x0 + w1 * x1 + ... + wn * xn

where h_W(x) is what we call the predicted value, and it plays the same role as the label value described above. Treating the feature vector X and the weight vector W in the formula as column vectors, the final result is simply h_W(x) = W^T * X.
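As a small sketch of this formula, a prediction is just a dot product between the weight vector and a sample's feature vector. The weights below are invented purely for illustration, not values derived in this post:

```python
import numpy as np

# One sample's feature vector: [x0 (salary), x1 (age)]
x = np.array([3000.0, 22.0])

# Hypothetical weights, chosen only to illustrate the form h_W(x) = W^T * x
W = np.array([0.5, 10.0])

h = W.T @ x   # for 1-D arrays, W.T @ x is the same as np.dot(W, x)
print(h)      # 0.5*3000 + 10.0*22 = 1720.0
```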

Loss

At this point we have both the predicted value and the label value (the true value). Is there an error between the predicted value and the label?

The error of a single sample is

ε^(i) = y^(i) - W^T * x^(i)

Here i indexes the samples, i.e. the i-th sample; y^(i) is the label value (true value) of the i-th sample, and W^T * x^(i) is its predicted value.
The loss function then uses the least squares method to obtain the loss value. In the equation below, J(w) is the total loss over all the samples, and i again indexes the samples:

J(w) = Σ_{i=1..n} ( y^(i) - W^T * x^(i) )^2

Converting this to matrix form gives

J(w) = (y - X·w)^T (y - X·w)

where X is the matrix that stacks the feature vectors of all the samples (one sample per row), y stacks all of their label values, and w is the weight vector.
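A minimal sketch of this matrix-form loss, reusing the toy X and y from the table and the same invented weights as before (still assuming NumPy):

```python
import numpy as np

X = np.array([[3000.0, 22.0],
              [5000.0, 26.0],
              [10000.0, 30.0]])
y = np.array([1600.0, 4000.0, 8000.0])
w = np.array([0.5, 10.0])   # hypothetical weights, for illustration only

residual = y - X @ w           # y - X*w, one error per sample
loss = residual.T @ residual   # (y - X*w)^T (y - X*w), the sum of squared errors
print(loss)
```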

Now, X and y are known; how do we find out what W is?

The first method: the normal equation

Start from the matrix form of the loss:

J(w) = (y - X·w)^T (y - X·w)

Take the partial derivative of J(w) with respect to w:

∂J(w)/∂w = 2 X^T (X·w - y)

Setting this derivative to zero gives the normal equation, so w equals

w = (X^T X)^{-1} X^T y
Then comes the problem: this requires inverting the matrix X^T X, but not every matrix is invertible; a matrix is invertible only when it has full rank. If X^T X is not invertible, w cannot be found this way.
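Here is a minimal sketch of solving the normal equation with NumPy on the toy data above. It uses np.linalg.pinv (the Moore-Penrose pseudo-inverse) rather than a plain inverse because, as just noted, X^T X is not always invertible:

```python
import numpy as np

X = np.array([[3000.0, 22.0],
              [5000.0, 26.0],
              [10000.0, 30.0]])
y = np.array([1600.0, 4000.0, 8000.0])

# Normal equation: w = (X^T X)^{-1} X^T y
# pinv still returns a sensible solution when X^T X is singular.
w = np.linalg.pinv(X.T @ X) @ X.T @ y
print(w)
print(X @ w)   # predictions; compare against y
```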

The second method: gradient descent

Well, let's look at the second method, one that is used all over machine learning: gradient descent.
What is a gradient? It is a vector. By analogy: suppose you want to find the minimum value of a function; you need to search in the direction in which the function decreases. But the gradient we compute actually points in the direction of fastest increase, so how do we get a direction of decrease? It doesn't matter: just put a minus sign in front of the gradient, i.e. use the negative gradient. So how do we actually work it out?
First, the loss function. We modify it slightly by dividing by 2n, where n is the number of samples, so that the sample size does not affect the magnitude of the loss; in other words, we take an average loss. (The extra factor of 2 simply cancels when we take the derivative.)

J(w) = (1/(2n)) * Σ_{i=1..n} ( y^(i) - W^T * x^(i) )^2
The gradient is a vector, so we take the partial derivative of the loss J(w) with respect to each weight w_j:

∂J(w)/∂w_j = -(1/n) * Σ_{i=1..n} ( y^(i) - W^T * x^(i) ) * x_j^(i)

Taking this partial derivative for every weight in w and stacking the results gives the full gradient

∇J(w) = -(1/n) * X^T (y - X·w)
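A minimal sketch of evaluating this gradient on the toy data, with the weights again invented only for illustration:

```python
import numpy as np

X = np.array([[3000.0, 22.0],
              [5000.0, 26.0],
              [10000.0, 30.0]])
y = np.array([1600.0, 4000.0, 8000.0])
w = np.array([0.5, 10.0])   # hypothetical starting weights

n = X.shape[0]
grad = -(1.0 / n) * X.T @ (y - X @ w)   # gradient of J(w) = 1/(2n) * sum of squared errors
print(grad)                             # one partial derivative per weight
```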
Good. Now that we have worked out the gradient, we can start descending along it to look for the minimum. As pointed out above, the gradient we just computed points in the direction of fastest increase, so we have to reverse it to move in the direction of fastest decrease.
w := w - α * ∇J(w)

Here the w on the left is the updated weight and the w on the right is the current weight, and α is the learning rate (the step size). Because we subtract the gradient, which points in the fastest-rising direction, every update moves the weights along the fastest-falling direction, so the loss keeps dropping.

Ways of performing gradient descent

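As one common way of running these updates, here is a minimal sketch of batch gradient descent (every sample is used in every update) on the toy data above. The feature scaling, the added bias column (the x0 = 1 trick from the start of the post), the learning rate, and the iteration count are all assumptions chosen just so the sketch converges:

```python
import numpy as np

X = np.array([[3000.0, 22.0],
              [5000.0, 26.0],
              [10000.0, 30.0]])
y = np.array([1600.0, 4000.0, 8000.0])

# Scale the features so one learning rate suits both weights,
# then add a column of ones (the x0 = 1 trick) as a bias feature.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
X_b = np.hstack([np.ones((X_scaled.shape[0], 1)), X_scaled])

n = X_b.shape[0]
w = np.zeros(X_b.shape[1])   # start from all-zero weights
alpha = 0.5                  # learning rate (step size), an assumed value
num_iters = 2000

for _ in range(num_iters):
    grad = -(1.0 / n) * X_b.T @ (y - X_b @ w)   # gradient of the averaged loss J(w)
    w = w - alpha * grad                        # step against the fastest-rising direction

print(w)         # learned weights (bias first)
print(X_b @ w)   # predictions; compare against y
```

Scaling matters here: salary is in the thousands while age is in the tens, and without scaling a single learning rate that suits one weight would make the other diverge.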

Origin blog.csdn.net/weixin_44771532/article/details/103987016