Deep Learning - Optimization Algorithms

Optimization algorithms that speed up training

1. Mini-batch gradient descent

Divide the training set into small subsets (mini-batches)

X^{1}: the first mini-batch (curly braces index mini-batches)

X^(1): the first training sample (parentheses index samples)

X^[1]: the input to the first layer (square brackets index layers)

1.1 The mini-batch gradient descent process: each step computes the forward pass, the cost J, and the updates to w and b on one mini-batch, i.e. on a subset of the training set rather than the whole set

Epoch: one "generation", meaning one full pass over the entire training set (rather than a single subset)
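As a concrete illustration of the process above, here is a minimal NumPy sketch of mini-batch gradient descent; the linear model and squared-error cost are assumptions chosen only to keep the example self-contained, not part of the original notes:

```python
import numpy as np

def minibatch_gd(X, Y, learning_rate=0.01, batch_size=64, num_epochs=10):
    """Mini-batch gradient descent for a toy linear model y_hat = w X + b
    with squared-error cost. X has shape (n_x, m), Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((1, n_x))
    b = 0.0
    for epoch in range(num_epochs):              # one epoch = one full pass over the data
        perm = np.random.permutation(m)          # reshuffle the training set each epoch
        X_s, Y_s = X[:, perm], Y[:, perm]
        for t in range(0, m, batch_size):
            X_t = X_s[:, t:t + batch_size]       # X^{t}: the t-th mini-batch
            Y_t = Y_s[:, t:t + batch_size]
            m_t = X_t.shape[1]
            Y_hat = w @ X_t + b                  # forward pass on this mini-batch only
            J_t = np.sum((Y_hat - Y_t) ** 2) / (2 * m_t)  # J^{t}: cost on this subset
            dw = (Y_hat - Y_t) @ X_t.T / m_t     # gradients computed from the subset
            db = np.sum(Y_hat - Y_t) / m_t
            w -= learning_rate * dw              # parameters are updated after every mini-batch
            b -= learning_rate * db
    return w, b
```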

 

1.2 How the cost behaves with mini-batches: each iteration processes one mini-batch, so the cost being plotted is J^{t}, computed on mini-batch t; it trends downward but with noise, rather than decreasing smoothly as in batch gradient descent

1.3 How to choose the mini-batch size

If the mini-batch is too large (close to the full training set), each iteration takes too long; if it is too small (close to a single sample), each step processes only a few examples and loses the speed-up from vectorization

2. Exponentially weighted average (exponentially weighted moving average)

v_t = β·v_{t-1} + (1−β)·θ_t can be seen as an average of the temperature over roughly the previous 1/(1−β) days. v_t is a weighted sum of the temperatures of days 1 through t, but the weights are powers of β, so once the power becomes large the temperatures from many days ago contribute almost nothing to v_t
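A minimal sketch of this exponentially weighted average; the temperature series below is made up just to have something to smooth:

```python
import numpy as np

def ewma(thetas, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t.
    With beta = 0.9 this averages roughly the last 1 / (1 - beta) = 10 values."""
    v = 0.0
    vs = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        vs.append(v)
    return np.array(vs)

# Made-up daily temperatures, just to have something to smooth
temps = 25 + 5 * np.sin(np.linspace(0, 6, 100)) + np.random.randn(100)
smoothed = ewma(temps, beta=0.9)
```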

 

2.1 Bias correction

Initializing v_0 = 0 makes the early estimates much smaller than the actual values, so the estimate is inaccurate at the start

So use v_t / (1 − β^t) as the bias-corrected estimate. When t is large, β^t is close to 0 and the denominator is close to 1, so the purple and green curves coincide in the later stage
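A small sketch of the same average with the bias correction v_t / (1 − β^t) applied; it assumes nothing beyond the formula above:

```python
def ewma_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta**t).
    The correction matters only for small t; as t grows, beta**t -> 0 and the
    corrected and uncorrected estimates coincide."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):   # t starts at 1
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))
    return corrected
```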

3. Optimizing the Gradient Descent Algorithm

- When starting from some point and descending toward the minimum, plain gradient descent oscillates; the goal is to damp the oscillation in the vertical direction while moving faster in the horizontal direction toward the minimum

- A larger learning rate cannot be used, because the oscillations would grow and the iterates could overshoot and leave the region of the function we care about

3.1 Momentum gradient descent (gradient descent with momentum)
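A minimal sketch of one momentum step; `params`, `grads`, and `velocities` are assumed to be dictionaries of NumPy arrays keyed the same way (an illustrative convention, not a fixed API):

```python
def momentum_update(params, grads, velocities, learning_rate=0.01, beta=0.9):
    """One step of gradient descent with momentum:
    v = beta * v + (1 - beta) * grad   (exponentially weighted average of gradients)
    param = param - learning_rate * v  (step in the averaged direction, damping oscillation)"""
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] -= learning_rate * velocities[key]
    return params, velocities
```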

3.2 RMSprop (root mean square prop)
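A minimal sketch of one RMSprop step, under the same dictionary convention as the momentum sketch above:

```python
import numpy as np

def rmsprop_update(params, grads, s, learning_rate=0.001, beta2=0.999, eps=1e-8):
    """One RMSprop step: keep an exponentially weighted average of the squared
    gradients and divide each gradient by its root mean square, so directions
    that oscillate strongly take smaller steps."""
    for key in params:
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
        params[key] -= learning_rate * grads[key] / (np.sqrt(s[key]) + eps)
    return params, s
```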

3.3 Adam: the combination of momentum and RMSprop; a commonly used method
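A minimal sketch of one Adam step, combining the two running averages above and applying the bias correction from section 2.1:

```python
import numpy as np

def adam_update(params, grads, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: a momentum-style average v of the gradients plus an
    RMSprop-style average s of the squared gradients, both bias-corrected."""
    for key in params:
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
        v_hat = v[key] / (1 - beta1 ** t)   # bias correction; t is the step count, starting at 1
        s_hat = s[key] / (1 - beta2 ** t)
        params[key] -= learning_rate * v_hat / (np.sqrt(s_hat) + eps)
    return params, v, s
```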

3.4 Learning rate decay: gradually reduce the learning rate over time

Why: with a fixed, relatively large learning rate, the iterates may keep wandering in a large region around the minimum (hard to converge) instead of settling into a small neighborhood very close to it
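One common decay schedule is α = α₀ / (1 + decay_rate · epoch_num); the concrete numbers below are only an example:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate after epoch_num epochs: alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0 gives 0.2, 0.1, 0.067, 0.05, ... over epochs 0, 1, 2, 3
```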

Importance: learning-rate decay is not the first thing to tune; start with a fixed learning rate and adjust the other hyperparameters first

4. Local Optimum

- Local optima are usually not a problem: for a point to be a local optimum it must be a minimum along every dimension (curving the same way in all of them), which is very unlikely when the number of dimensions is large

- It is much more likely that a point where every partial derivative is 0 is a saddle point

The real issue is plateaus, long flat stretches where the gradient stays close to zero and learning becomes slow, but the algorithms above give us a way to speed it up
