Linear regression: derivation and Python implementation

Linear Regression:

\(h_\theta(x) = \theta_0 + \theta_1x_1 + \cdots + \theta_nx_n = X\theta\)

Here the hypothesis function \(h_\theta(x)\), evaluated on all samples, is an \(m \times 1\) vector, \(\theta\) is an \((n+1) \times 1\) vector containing the model's \(n+1\) parameters, and \(X\) is an \(m \times (n+1)\) matrix.
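As a quick illustration of the matrix form above, a minimal NumPy sketch (the sizes and variable names here are placeholders, not part of the derivation):

```python
import numpy as np

# Hypothetical sizes: m = 5 samples, n = 3 features, plus a bias column of ones for theta_0
m, n = 5, 3
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])  # design matrix, shape (m, n+1)
theta = np.random.randn(n + 1, 1)                        # parameter vector, shape (n+1, 1)

h = X @ theta   # h_theta(x) for all samples at once, shape (m, 1)
print(h.shape)  # (5, 1)
```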

Maximum likelihood estimate

Principle: an event with high probability is more likely to occur in a single observation; conversely, an event that did occur in an observation should be assumed to have a high probability.
Goal: find the parameters under which the observed data is produced with the highest probability.

Derivation

\(y^{(i)} = \theta^Tx^{(i)} + \epsilon ^{(i)}\)

  • \(y^{(i)}\): the i-th label value
  • \(x^{(i)}\): the i-th sample
  • \(\theta^Tx^{(i)}\): the predicted value for the i-th sample under the current \(\theta\)
  • \(\epsilon^{(i)}\): the error between the predicted and actual values under the current \(\theta\)
  1. The errors \(\epsilon^{(i)}\ (1 \leq i \leq m)\) are independent and identically distributed, following a Gaussian distribution with mean 0 and variance \(\sigma^2\) (central limit theorem)
  2. In practical problems, many random phenomena can be viewed as the combined effect of a large number of independent factors, and therefore often follow a normal distribution

Gaussian distribution: \(p(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\)

For the i-th sample:

  • \(p(\epsilon^{(i)})=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(\epsilon^{(i)})^2}{2\sigma^2}}\)
  • \(p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\)

Likelihood function:

  • \(L(\theta)=\prod^m_{i=1}p(y^{(i)}|x^{(i)};\theta)=\prod^m_{i=1}\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\)
    Take the logarithm (this does not change where the extremum is attained and simplifies the calculation):
  • \(l(\theta)=logL(\theta)\)
    =\(log\prod^m_{i=1}\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\)
    = \(\sum^m_{i=1}log\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\)
    = \(\sum_{i=1}^mlog\frac{1}{\sigma \sqrt{2\pi}}-\frac{1}{\sigma^2}\cdot{\frac{1}{2}}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2\)

Maximizing \(l(\theta)\) is therefore equivalent to minimizing \({\frac{1}{2}}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2\), because the error variance \(\sigma^2\) is a constant.

Loss function (a code sketch follows the list below): \(loss(y,\hat{y})=J(\theta)={\frac{1}{2}}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2\)

  • \(y^{(i)}\): the i-th label value
  • \(x^{(i)}\): the i-th sample
  • \(\theta\): the model parameters to learn; the goal is to minimize the loss function
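A minimal NumPy sketch of this loss (the 1/2 factor is kept exactly as in the formula; the function name and shapes are illustrative assumptions):

```python
import numpy as np

def loss(theta, X, y):
    """J(theta) = 1/2 * sum_i (y^(i) - theta^T x^(i))^2, with X of shape (m, n+1)."""
    residuals = y - X @ theta           # y^(i) - theta^T x^(i) for every sample
    return 0.5 * np.sum(residuals ** 2)
```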

Solution 1: set the derivative equal to 0:
\(J(\theta)={\frac{1}{2}}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2=\frac{1}{2}(X\theta-Y)^T(X\theta-Y)\) -> \(min_\theta J(\theta)\)
\(\nabla_\theta J(\theta)=\nabla_\theta(\frac{1}{2}(X\theta-Y)^T(X\theta-Y))\)
= \(\nabla_\theta(\frac{1}{2}(\theta^TX^T-Y^T)(X\theta-Y))\)
= \(\nabla_\theta(\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^TY-Y^TX\theta+Y^TY))\)
= \(\frac{1}{2}(2X^TX\theta-X^TY-(Y^TX)^T)\)
= \(X^TX\theta-X^TY\)
If \(X^TX\) is invertible => \(\theta=(X^TX)^{-1}X^TY\)
In practice \(X^TX\) is often not invertible, for example because the number of features exceeds the number of samples; a perturbation term can be added so that the resulting matrix is invertible => \(\theta=(X^TX+\lambda I)^{-1}X^TY\) (this is exactly the solution of ridge regression, whose objective is \(J(\theta)={\frac{1}{2}}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2+\lambda\sum_{j=1}^n\theta^2_j\))
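A minimal NumPy sketch of both closed-form solutions above, on synthetic data (`lam` stands for the regularization strength \(\lambda\); all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # design matrix with bias column
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)          # y = X theta + noise

# Ordinary least squares: theta = (X^T X)^{-1} X^T Y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: theta = (X^T X + lambda I)^{-1} X^T Y
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(theta_ols)
print(theta_ridge)
```

`np.linalg.solve` is used instead of forming the explicit inverse; this is numerically safer but computes the same \(\theta\) as the formulas above.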

Matrix calculus identities used in the derivation:
$(XY)^T=Y^TX^T$
$\frac{d(u^Tv)}{dx} = \frac{du^T}{dx}\cdot v + \frac{dv^T}{dx}\cdot u$
$\frac{\partial AX}{\partial X} = A^T$
$\frac{\partial AX}{\partial X^T} =\frac{\partial AB^T}{\partial B} = A$
$\frac{\partial X^TA}{\partial X} =\frac{\partial A^TX}{\partial X} =(A^T)^T = A$
$\frac{\partial X^TX}{\partial X} = \frac{\partial X^T \cdot{(X)}}{\partial X} + \frac{ \partial (X^T)\cdot{} X}{\partial X} = 2X$
$\frac{\partial X^TAX}{\partial X} = \frac{\partial X^T \cdot{(AX)}}{\partial X} + \frac{ \partial (X^TA)\cdot{} X}{\partial X} = AX+A^TX$
If A is a symmetric matrix, then $\frac{\partial X^TAX}{\partial X} = 2AX$
$\frac{\partial A^TXB}{\partial X} = AB^T$
$\frac{\partial A^TX^TXA}{\partial X} = XAA^T + XAA^T = 2XAA^T$
$\frac{\partial |A|}{\partial A} =|A|(A^{-1})^T$
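As a sanity check, the identity \(\frac{\partial X^TAX}{\partial X} = AX+A^TX\) (with \(X\) taken as a vector here) can be verified numerically against a finite-difference estimate; a minimal sketch, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=(n, 1))

def f(x):
    # scalar function f(x) = x^T A x
    return float(x.T @ A @ x)

# Analytic gradient from the identity: d(x^T A x)/dx = A x + A^T x
grad_analytic = A @ x + A.T @ x

# Central finite-difference estimate of the same gradient
eps = 1e-6
grad_fd = np.zeros_like(x)
for i in range(n):
    e = np.zeros_like(x)
    e[i] = eps
    grad_fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # expected: True
```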

Solution 2: gradient descent, which finds the global (or a local) optimum:

Publishing this first without polishing; typesetting and the remaining sections will be filled in later.

Code
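A minimal sketch that fits a linear model with scikit-learn on synthetic data and checks it against the closed-form solution derived above (the dataset sizes and true coefficients below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
m, n = 200, 2
X = rng.normal(size=(m, n))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=m)

model = LinearRegression()            # fits the intercept theta_0 by default
model.fit(X, y)
print(model.intercept_, model.coef_)  # should be close to 4.0 and [3.0, -2.0]

# Closed-form check: theta = (Xb^T Xb)^{-1} Xb^T y with a bias column prepended
Xb = np.hstack([np.ones((m, 1)), X])
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta)                          # first entry ~ intercept, rest ~ coefficients
```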

Least squares method

Derivation

Code
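A minimal sketch of solving the least-squares problem directly with `np.linalg.lstsq`, which minimizes \(\|X\theta - y\|^2\) (data and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
m = 50
X = np.hstack([np.ones((m, 1)), rng.uniform(-1, 1, size=(m, 1))])  # bias column + one feature
y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.05, size=m)

# np.linalg.lstsq minimizes ||X theta - y||^2, i.e. the least-squares objective above
theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [1.5, -0.7]
```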

Gradient descent

Derivation

Code
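From the derivation above, \(\nabla_\theta J(\theta)=X^T(X\theta-Y)\), so batch gradient descent repeats \(\theta \leftarrow \theta - \alpha\, X^T(X\theta-Y)\). A minimal sketch (the learning rate, iteration count, and division by \(m\) are arbitrary implementation choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Minimize J(theta) = 1/2 ||X theta - y||^2 by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of the averaged loss; same minimizer
        theta -= alpha * grad
    return theta

# Synthetic data with a bias column, to sanity-check the implementation
rng = np.random.default_rng(3)
m = 100
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=m)
print(gradient_descent(X, y))  # approximately [1.0, 2.0, -0.5]
```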


Source: www.cnblogs.com/yunp-kon/p/11134816.html