Detailed mathematical derivation of linear regression

Assumption: all the data come from the same distribution (i.e., they conform to the same underlying function).
Function to be fitted: $y^* = kx + b$.
Error function (L2 norm): $L(y, y^*) = (y^* - y)^2 = (kx + b - y)^2$.
Let $a = b - y$. Viewed as a function of the slope $k$, the loss becomes:
$L(k) = x^2k^2 + 2axk + a^2$
This is a quadratic in $k$ with positive leading coefficient, so its minimum value is given by the vertex formula $\frac{4AC - B^2}{4A}$ with $A = x^2$, $B = 2ax$, $C = a^2$:
$\frac{4x^2a^2 - 4a^2x^2}{4x^2} = 0$
For such a simple function the minimizer can be computed directly:
$k = -\frac{2ax}{2x^2} = -\frac{a}{x}$
Therefore there is a value of $k$ at which $L$ equals 0, and the error is smallest there.
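To make this step explicit, the same conclusion follows from a standard completing-the-square rewriting (added here as a sanity check, not part of the original derivation):

$L(k) = x^2k^2 + 2axk + a^2 = x^2\left(k + \frac{a}{x}\right)^2 \ge 0$

with equality exactly at $k = -\frac{a}{x}$, which matches the minimizer found above.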
The solution for $b$ is obtained in the same way:
$L(b) = b^2 + 2(kx - y)b + (kx - y)^2$
The minimum value of $L(b)$ is
$\frac{4(kx - y)^2 - 4(kx - y)^2}{4} = 0$
and for this simple function the minimizer can again be computed directly:
$b = -\frac{2(kx - y)}{2} = y - kx$
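As a quick numeric illustration of these single-point closed forms (the snippet and its values are my own, not from the original post): for one data point, fixing any slope $k$ and then setting $b = y - kx$ drives the squared error to exactly zero.

```python
# Sketch: for a single data point (x, y), the closed form b = y - k*x
# makes the squared error (k*x + b - y)**2 exactly zero, for any chosen k.
x, y = 2.0, 5.0          # one data point (values chosen arbitrarily)
k = 1.5                  # any slope works when there is only one point
b = y - k * x            # closed-form solution for b derived above
loss = (k * x + b - y) ** 2
print(b, loss)           # -> 2.0 0.0
```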
For more complicated functions, however, it is often hard to determine the extreme points directly, so we consider gradient descent instead.
Partial derivative of $L$ with respect to $k$:
$\frac{\partial L}{\partial k} = \frac{d(kx + a)^2}{dk} = \frac{d(k^2x^2 + 2akx + a^2)}{dk} = 2x^2k + 2ax$
In the same way, the partial derivative of $L$ with respect to $b$:
$\frac{\partial L}{\partial b} = \frac{d(b + kx - y)^2}{db} = 2b + 2(kx - y)$
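Written as code, the two per-point gradients look as follows (this snippet is my own illustration; the finite-difference comparison is only there to confirm the formulas numerically):

```python
# Per-point loss L(k, b) = (k*x + b - y)**2 and its gradients from the derivation.
def loss(k, b, x, y):
    return (k * x + b - y) ** 2

def grad_k(k, b, x, y):
    a = b - y
    return 2 * x**2 * k + 2 * a * x        # dL/dk = 2x^2 k + 2ax

def grad_b(k, b, x, y):
    return 2 * b + 2 * (k * x - y)          # dL/db = 2b + 2(kx - y)

# Finite-difference check at an arbitrary point (values made up).
k, b, x, y, eps = 0.3, -0.7, 2.0, 5.0, 1e-6
num_dk = (loss(k + eps, b, x, y) - loss(k - eps, b, x, y)) / (2 * eps)
num_db = (loss(k, b + eps, x, y) - loss(k, b - eps, x, y)) / (2 * eps)
print(grad_k(k, b, x, y), num_dk)   # both about -20.4
print(grad_b(k, b, x, y), num_db)   # both about -10.2
```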
For any $k$ the partial derivative can be computed. If it is positive, $k$ should be decreased (and increased if it is negative); both cases are covered by the update rule ($\eta$ is the learning rate):
$k^* = k - \eta \nabla_k L = k - \eta(2x^2k + 2ax)$
Similarly, the update rule for $b$ is:
$b^* = b - \eta \nabla_b L = b - \eta(2b + 2(kx - y))$
Repeating these updates many times with an appropriate learning rate, the parameters slowly approach the optimal solution; a minimal sketch of this loop is shown below.
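Below is a minimal sketch of that loop, applying the per-point update rules above to a small made-up dataset (the data, learning rate, and number of epochs are assumptions for illustration):

```python
# Sketch: fit y = k*x + b by per-point gradient descent with the updates above.
# Note 2*x*err equals 2x^2*k + 2*a*x, and 2*err equals 2b + 2(kx - y).
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]  # roughly y = 2x + 1
k, b = 0.0, 0.0
eta = 0.01                              # learning rate
for epoch in range(2000):
    for x, y in data:
        err = k * x + b - y
        k -= eta * 2 * x * err          # k* = k - eta * dL/dk
        b -= eta * 2 * err              # b* = b - eta * dL/db
print(k, b)                             # approaches roughly k ≈ 2, b ≈ 1
```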
This is the idea behind fitting with a linear function. If you add more parameters or raise the order of the model, the fitting becomes more involved, and the model can capture more complex relationships.
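For instance, raising the order to a quadratic $y = w_2x^2 + w_1x + w_0$ can be done with NumPy's least-squares polynomial fit; note this is a direct least-squares solve rather than the gradient descent used above, and the data is invented for illustration:

```python
import numpy as np

# Sketch: a higher-order model, y = w2*x^2 + w1*x + w0, fitted by least squares.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 5.2, 10.1, 16.9])   # roughly y = x^2 + 1
w2, w1, w0 = np.polyfit(x, y, deg=2)        # coefficients, highest order first
print(w2, w1, w0)
```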

Origin blog.csdn.net/qq_45931661/article/details/124539353