1. Scalar Derivatives
| $y$ | $a$ (constant) | $x^n$ | $\exp(x)$ | $\log(x)$ | $\sin(x)$ |
|---|---|---|---|---|---|
| $\frac{dy}{dx}$ | $0$ | $nx^{n-1}$ | $\exp(x)$ | $\frac{1}{x}$ | $\cos x$ |
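As a quick sanity check, these rules can be verified numerically with a central finite difference; a minimal sketch (the test point and step size `h` are arbitrary choices):

```python
import numpy as np

def numeric_diff(f, x, h=1e-6):
    """Central finite difference: approximates dy/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
print(numeric_diff(lambda t: t**3, x), 3 * x**2)  # x^n with n=3: derivative n*x^(n-1)
print(numeric_diff(np.exp, x), np.exp(x))         # exp(x) -> exp(x)
print(numeric_diff(np.log, x), 1 / x)             # log(x) -> 1/x
print(numeric_diff(np.sin, x), np.cos(x))         # sin(x) -> cos(x)
```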
2. Vector Derivatives
| $y$ | $a$ | $x$ | $Ax$ | $x^T A$ | $au$ | $Au$ | $u+v$ |
|---|---|---|---|---|---|---|---|
| $\frac{\partial y}{\partial x}$ | $0$ | $I$ | $A$ | $A^T$ | $a\frac{\partial u}{\partial x}$ | $A\frac{\partial u}{\partial x}$ | $\frac{\partial u}{\partial x} + \frac{\partial v}{\partial x}$ |
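The same finite-difference idea extends to vectors: build the Jacobian $\frac{\partial y}{\partial x}$ one input coordinate at a time and compare against the table. A sketch with arbitrary random shapes:

```python
import numpy as np

def numeric_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian of f: one column per input coordinate."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))
x = rng.normal(size=4)

print(np.allclose(numeric_jacobian(lambda v: v, x), np.eye(4)))  # dx/dx       = I
print(np.allclose(numeric_jacobian(lambda v: A @ v, x), A))      # d(Ax)/dx    = A
print(np.allclose(numeric_jacobian(lambda v: v @ B, x), B.T))    # d(x^T B)/dx = B^T
```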
3. Linear Regression
3.1 Squared Loss
Assume $y$ is the true value and $\hat{y}$ the predicted value. The squared loss is:

$$\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$
3.2 Training Data
Assume there are $n$ samples, and write $X = [x_1, x_2, \ldots, x_n]^T$, $y = [y_1, y_2, \ldots, y_n]^T$.
3.3 Parameter Learning
Training loss:

$$\ell(X, y, w, b) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \langle x_i, w\rangle - b\bigr)^2 = \frac{1}{2n}\|y - Xw - b\|^2$$
Minimize the loss to learn the parameters $w^*$ and $b^*$:

$$w^*, b^* = \arg\min_{w, b} \ell(X, y, w, b)$$
3.4 Closed-Form Solution
To simplify, absorb the bias into the weights by appending a column of ones to $X$ and appending $b$ to $w$. The loss and its gradient then become:

$$\ell(X, y, w) = \frac{1}{2n}\|y - Xw\|^2, \qquad \frac{\partial \ell(X, y, w)}{\partial w} = -\frac{1}{n}(y - Xw)^T X$$
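As a check on this gradient, the analytic expression can be compared against finite differences of the loss on synthetic data; a minimal sketch (shapes, seed, and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def loss(w):
    r = y - X @ w
    return (r @ r) / (2 * n)              # (1/2n) * ||y - Xw||^2

analytic = -(y - X @ w) @ X / n           # -(1/n) (y - Xw)^T X

h = 1e-6
numeric = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric))     # True: the formulas agree
```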
Since the loss is a convex function, the optimal solution is the point where the gradient equals zero, that is:
$$\frac{\partial \ell(X, y, w)}{\partial w} = 0$$

$$\Rightarrow \frac{1}{n}(y - Xw)^T X = 0$$

$$\Rightarrow w^* = (X^T X)^{-1} X^T y$$
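This closed-form solution maps directly to code. A minimal sketch on synthetic data; `np.linalg.solve` is used rather than forming the explicit inverse, which is the numerically safer route:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
X = np.hstack([X, np.ones((n, 1))])        # absorb the bias: append a column of ones
w_true = np.array([2.0, -3.4, 4.2])        # last entry plays the role of b
y = X @ w_true + 0.01 * rng.normal(size=n)

# w* = (X^T X)^{-1} X^T y, computed without forming an explicit inverse
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)                              # close to w_true
```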
3.5 Gradient Descent
Pick an initial value $w_0$, then iterate for $t = 1, 2, 3, \ldots$, updating the parameters so that $w$ approaches a minimizer:
$$w_t = w_{t-1} - \eta \frac{\partial \ell}{\partial w_{t-1}}$$
- The loss function value increases along the gradient direction, so the update moves along the negative gradient $-\frac{\partial \ell}{\partial w_{t-1}}$ to decrease the loss.
- $\eta$ is the learning rate: a hyperparameter that controls the step size of gradient descent (see the code sketch after this list).
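A minimal sketch of this update rule applied to the linear regression loss from Section 3.3, on synthetic data; the learning rate and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -3.4, 1.2])
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)                        # w_0: initial value
eta = 0.1                              # learning rate (hyperparameter)
for t in range(200):
    grad = -(y - X @ w) @ X / n        # full gradient of (1/2n)||y - Xw||^2
    w = w - eta * grad                 # w_t = w_{t-1} - eta * dl/dw_{t-1}
print(w)                               # approaches w_true
```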
So how should this hyperparameter, the learning rate, be chosen? It cannot be too small (progress is slow) or too large (the updates oscillate or diverge).
3.5.1 Mini-batch Stochastic Gradient Descent
In practice, plain gradient descent is rarely used directly; a variant such as mini-batch stochastic gradient descent is used instead. The reason: each descent step differentiates the loss averaged over the entire sample set, so computing a single gradient requires a full pass over all samples, which for a deep neural network can take minutes or even hours. The cost of each gradient computation is therefore huge.
So can we randomly sample $b$ samples $i_1, i_2, \ldots, i_b$ to form an approximate loss?
The answer is yes; this is mini-batch stochastic gradient descent (sketched in code at the end of this subsection):
$$\frac{1}{b}\sum_{i \in I_b} \ell(x_i, y_i, w)$$
where:
- $b$ is the batch size, another important hyperparameter alongside the learning rate.
So how should the batch size be chosen?
- It cannot be too small: each batch carries too little computation to exploit GPU parallelism.
- It cannot be too large: memory consumption grows, and computation is wasted, for example when all samples in the batch are identical.
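A minimal sketch of the full procedure: each step draws $b$ random indices $I_b$, averages the per-sample gradients over that mini-batch, and applies the same update rule. Batch size, learning rate, and step count here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -3.4, 1.2])
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)
eta, b = 0.1, 32                          # learning rate and batch size
for t in range(500):
    I_b = rng.integers(0, n, size=b)      # sample b indices uniformly at random
    Xb, yb = X[I_b], y[I_b]
    grad = -(yb - Xb @ w) @ Xb / b        # average gradient over the mini-batch
    w = w - eta * grad                    # same update rule, cheaper per step
print(w)                                  # noisy, but close to w_true
```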
4. Summary
- Gradient descent solves for the parameters by updating them repeatedly along the direction opposite to the gradient, continually approaching the point where the loss function is smallest.
- Mini-batch stochastic gradient descent is the default optimization algorithm in deep learning.
- Its two important hyperparameters are the learning rate $\eta$ and the batch size $b$.