Derivatives of Scalars and Vectors

1. Scalar Derivatives

| $y$ | $a$ (constant) | $x^n$ | $\exp(x)$ | $\log(x)$ | $\sin(x)$ |
|---|---|---|---|---|---|
| $\frac{dy}{dx}$ | $0$ | $nx^{n-1}$ | $\exp(x)$ | $\frac{1}{x}$ | $\cos(x)$ |
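
The rules in this table can be checked numerically with central differences; the snippet below is a small illustrative sketch (the helper name `numerical_derivative` and the test point are my own choices, not part of the original post):

```python
import numpy as np

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
n = 3
checks = {
    "d/dx x^n = n*x^(n-1)": (numerical_derivative(lambda t: t**n, x), n * x**(n - 1)),
    "d/dx exp(x) = exp(x)": (numerical_derivative(np.exp, x), np.exp(x)),
    "d/dx log(x) = 1/x":    (numerical_derivative(np.log, x), 1 / x),
    "d/dx sin(x) = cos(x)": (numerical_derivative(np.sin, x), np.cos(x)),
}
for rule, (approx, exact) in checks.items():
    print(f"{rule}: numeric={approx:.6f}, exact={exact:.6f}")
```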

2. Vector derivatives

| $\mathbf{y}$ | $\mathbf{a}$ | $\mathbf{x}$ | $\mathbf{A}\mathbf{x}$ | $\mathbf{x}^T\mathbf{A}$ | $a\mathbf{u}$ | $\mathbf{A}\mathbf{u}$ | $\mathbf{u}+\mathbf{v}$ |
|---|---|---|---|---|---|---|---|
| $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ | $\mathbf{0}$ | $\mathbf{I}$ | $\mathbf{A}$ | $\mathbf{A}^T$ | $a\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ | $\mathbf{A}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ | $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}+\frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ |
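
As one concrete check of the table, the entry $\frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = \mathbf{A}$ can be verified numerically. This NumPy sketch (my own illustration, with arbitrary shapes) builds the Jacobian by finite differences and compares it to $\mathbf{A}$:

```python
import numpy as np

# Build the Jacobian of y = Ax column by column with central differences
# and compare it against A, which the table gives as the analytic answer.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
h = 1e-6

jacobian = np.zeros((3, 4))
for j in range(4):
    e_j = np.zeros(4)
    e_j[j] = h
    jacobian[:, j] = (A @ (x + e_j) - A @ (x - e_j)) / (2 * h)

print(np.allclose(jacobian, A, atol=1e-6))  # expected: True, i.e. d(Ax)/dx = A
```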

3. Linear regression

3.1 Square loss

Assume $y$ is the true value and $\hat{y}$ is the estimated value. The square loss is $\ell(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2$.
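
As a direct translation of this formula into code (a minimal sketch; the function name is my own):

```python
def square_loss(y, y_hat):
    """Square loss: l(y, y_hat) = 1/2 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

print(square_loss(3.0, 2.5))  # 0.5 * 0.5**2 = 0.125
```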

3.2 Training data

Assume there are $n$ samples, and write $\mathbf{X} = [x_1, x_2, \ldots, x_n]^T$, $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$.

3.3 Parameter Learning

Training loss: $\ell(\mathbf{X}, \mathbf{y}, \mathbf{w}, b) = \frac{1}{2n}\sum\limits_{i=1}^{n}\left(y_i - \langle \mathbf{x}_i, \mathbf{w}\rangle - b\right)^2 = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w} - b\|^2$
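
The vectorized form of this training loss is straightforward to write in NumPy. The sketch below assumes `X` has shape `(n, d)` and `y` has shape `(n,)`; the names are my own choices:

```python
import numpy as np

def training_loss(X, y, w, b):
    """l(X, y, w, b) = 1/(2n) * ||y - X w - b||^2."""
    residual = y - X @ w - b              # shape (n,)
    return residual @ residual / (2 * len(y))
```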

Minimize the loss to learn the parameters $\mathbf{w}^*, b^*$:

$\mathbf{w}^*, b^* = \arg\min\limits_{\mathbf{w}, b} \ell(\mathbf{X}, \mathbf{y}, \mathbf{w}, b)$

3.4 Closed-form solution

With the bias absorbed into the weights, the loss in matrix form and its gradient with respect to $\mathbf{w}$ are:

$\ell(\mathbf{X}, \mathbf{y}, \mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2, \qquad \frac{\partial \ell(\mathbf{X}, \mathbf{y}, \mathbf{w})}{\partial \mathbf{w}} = -\frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^T\mathbf{X}$

Since the loss is a convex function, the optimal solution is the point where the gradient equals 0, that is:

$\frac{\partial \ell(\mathbf{X}, \mathbf{y}, \mathbf{w})}{\partial \mathbf{w}} = 0$

$\Rightarrow \frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^T\mathbf{X} = 0$

$\Rightarrow \mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
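
A NumPy sketch of this closed-form solution (the synthetic data, the helper name, and the trick of folding $b$ into $\mathbf{w}$ via a column of ones are my own illustration, not from the original post):

```python
import numpy as np

def fit_closed_form(X, y):
    """Return w* = (X^T X)^{-1} X^T y, solved without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage: append a column of ones so the last weight plays the role of b.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -3.4, 1.7]), 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

Xb = np.hstack([X, np.ones((100, 1))])   # last column handles the bias b
w_star = fit_closed_form(Xb, y)
print(w_star)  # approximately [2.0, -3.4, 1.7, 4.2]
```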

3.5 Gradient Descent

Pick an initial value $\mathbf{w}_0$, then repeatedly update the parameters for $t = 1, 2, 3, \ldots$ so that $\mathbf{w}$ approaches a minimizer (a code sketch follows the notes below):

$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta\,\frac{\partial \ell}{\partial \mathbf{w}_{t-1}}$

  • The loss function increases along the gradient direction, so the update moves in the negative gradient direction $-\frac{\partial \ell}{\partial \mathbf{w}_{t-1}}$ in order to decrease the loss.
  • $\eta$ is the learning rate: a hyperparameter that controls the step size of gradient descent.
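
Here is a minimal NumPy sketch of this update rule applied to the linear regression loss from section 3.3 (the function name, default learning rate, and iteration count are illustrative assumptions of mine, not the post's reference implementation):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, num_iters=1000):
    """Full-batch gradient descent for l(X, y, w, b) = 1/(2n) * ||y - Xw - b||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(num_iters):
        residual = y - X @ w - b          # shape (n,)
        grad_w = -(X.T @ residual) / n    # d l / d w
        grad_b = -residual.mean()         # d l / d b
        w -= eta * grad_w                 # step in the negative gradient direction
        b -= eta * grad_b
    return w, b
```

On well-scaled data this should converge to roughly the same $\mathbf{w}^*, b^*$ as the closed-form solution above.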

So how should this hyperparameter, the learning rate, be chosen properly?

3.5.1 Mini-batch Gradient Descent

In practice, plain gradient descent is rarely used directly; a variant such as mini-batch stochastic gradient descent is used instead. The reason is that each descent step takes the gradient of the average loss over the entire sample set, so computing one gradient requires a full pass over all samples, which in a deep neural network model can take minutes or even hours. The cost of computing a single gradient is therefore enormous.

So can we instead randomly sample $b$ examples $i_1, i_2, i_3, \ldots$ and compute an approximate loss over them?

The answer is yes, this is mini-batch stochastic gradient descent:
$\frac{1}{b}\sum\limits_{i \in I_b} \ell(\mathbf{x}_i, y_i, \mathbf{w})$

where:

  • $b$ is the batch size, which is another important hyperparameter alongside the learning rate (a mini-batch SGD code sketch follows the notes on batch size below)

So how should the batch size be chosen?

  • It cannot be too small: each batch involves too little computation, which does not suit GPU parallel computing.
  • It cannot be too large: memory consumption grows, and computation is wasted (for example, when all the samples in the batch are the same).
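
Putting sections 3.5 and 3.5.1 together, a minimal mini-batch SGD sketch in NumPy might look like this (the sampling scheme, defaults, and function name are illustrative assumptions of mine, not the post's reference implementation):

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, eta=0.1, num_epochs=10, seed=0):
    """Mini-batch SGD: each step uses the gradient of the average loss over I_b."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for epoch in range(num_epochs):
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)  # sample I_b
            Xb, yb = X[idx], y[idx]
            residual = yb - Xb @ w - b
            w -= eta * (-(Xb.T @ residual) / batch_size)  # approximate gradient step
            b -= eta * (-residual.mean())
    return w, b
```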

4. Summary

  • Gradient descent solves for the parameters by repeatedly updating them in the direction opposite to the gradient, steadily approaching the point where the loss function is smallest.
  • Mini-batch stochastic gradient descent is the default optimization algorithm in deep learning.
  • Its two important hyperparameters are the learning rate $\eta$ and the batch size $b$.


Origin: blog.csdn.net/qq_45801179/article/details/132416438