Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent: a comparison

Three algorithms are commonly used to optimize the loss function in the back-propagation algorithm. Each one updates the weights w on every iteration, and as the number of iterations grows, the weights approach the optimum value we expect.

1. Batch gradient descent algorithm:

(1) If the data set is relatively small, the full data set can be used for each update (Full Batch Learning). Using the full data set has two advantages:

a.  The gradient computed from the full data set better represents the whole sample population, so it points more accurately toward the extremum.
b.  When different weights have gradients of very different magnitudes, choosing a single global learning rate is difficult. With Full Batch Learning, Rprop can be used instead, which updates each weight individually based only on the sign of its gradient (a minimal sketch follows this list).
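
As a rough illustration of the Rprop idea, here is a minimal sketch of a simplified sign-based per-weight update; the growth/shrink factors (1.2, 0.5) and the step-size bounds are typical choices, not values from the original text:

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Simplified Rprop step: every weight keeps its own step size, and only
    the SIGN of its gradient is used, never the magnitude."""
    sign_change = grad * prev_grad
    # Gradient kept its sign: grow that weight's step; sign flipped: shrink it.
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # Move each weight against the sign of its own gradient by its own step size.
    # (The caller keeps `grad` around as `prev_grad` for the next call.)
    return w - np.sign(grad) * step, step
```

Because only the sign is used, weights whose gradients differ greatly in magnitude can still each be trained with a sensible individual step size, which is exactly the difficulty point b describes.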

(2) However, batch gradient descent also has the following shortcomings:

a.  Gradient descent does not guarantee that the optimized function reaches a globally optimal solution; only when the loss function is convex is gradient descent guaranteed to reach the global optimum.

b.  Another problem is that computing the gradient takes too long, because the loss has to be minimized over all the training data; on massive training sets this is very time-consuming.
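
A minimal sketch of a single batch gradient descent step; the linear least-squares model, the data names, and the learning rate here are illustrative assumptions, not part of the original text:

```python
import numpy as np

def batch_gd_step(w, X, y, lr=0.1):
    """One batch gradient descent step: the gradient is averaged over ALL
    training examples before the single weight update is applied."""
    residual = X @ w - y               # predictions minus targets on the full data set
    grad = X.T @ residual / len(y)     # gradient of the mean squared error
    return w - lr * grad
```

The whole data set is touched on every single step, which is exactly why point b becomes a problem on large training sets.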

2. Stochastic gradient descent algorithm:

To accelerate the training process, the stochastic gradient descent algorithm (SGD) can be used. Stochastic gradient descent is also referred to as "online learning."

(1)  This algorithm does not optimize the loss function over all the training data; instead, in each iteration it optimizes the loss on a single randomly chosen training example, so each round of parameter updates is greatly accelerated.

(2)  But it brings the following problem:

A smaller loss on one piece of data does not mean a smaller loss over all the data, so a neural network optimized with stochastic gradient descent may not even reach the global optimum.
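
For contrast with the batch step above, here is a sketch of the stochastic variant, which updates on one randomly chosen example per step (same illustrative linear least-squares model; the names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, lr=0.1):
    """One stochastic gradient descent step: pick a single training example
    at random and update the weights using only its gradient."""
    i = rng.integers(len(y))
    xi, yi = X[i], y[i]
    grad = xi * (xi @ w - yi)          # gradient of the squared error on this one example
    return w - lr * grad
```

Each step is cheap, but it follows a noisy estimate of the true gradient, which is the source of the problem described above.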

3. Mini-batch gradient descent algorithm:

To combine the advantages of batch gradient descent and stochastic gradient descent while avoiding their disadvantages, a compromise between the two is commonly used in practice: compute the loss function on only a small portion of the training data at a time. Mini-batch gradient descent (Mini-batches Learning) is very commonly used in the back-propagation algorithm of deep learning. This small portion of the training data is also called a batch, which introduces the concept of batch_size; as the name suggests, batch_size measures the number of examples in each batch.

(1)  Introducing batches has clear advantages:

a.  Thanks to matrix operations, optimizing the neural network parameters on one batch at a time is not much slower than optimizing on a single example.

b.  Using a batch each time greatly reduces the number of iterations needed for convergence, while letting the result converge closer to the effect of full batch gradient descent.

---> But mini-batch gradient descent creates a problem of its own: how do we select the optimal batch_size?

(2)  Can we just select a moderate batch_size value?

Certainly; this is exactly mini-batch gradient descent (Mini-batches Learning). If the data set is large enough, the gradient computed from half of the data (or even much less) is almost the same as the gradient computed from all of the data.
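
A minimal sketch of one training epoch with mini-batch gradient descent; the linear least-squares model, the batch_size of 32, and the learning rate are illustrative assumptions:

```python
import numpy as np

def minibatch_epoch(w, X, y, batch_size=32, lr=0.1, seed=0):
    """One epoch of mini-batch gradient descent: shuffle the data, split it
    into batches of `batch_size` examples, and update once per batch."""
    order = np.random.default_rng(seed).permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(yb)   # average gradient over this batch
        w = w - lr * grad
    return w
```

With batch_size=1 this reduces to stochastic gradient descent, and with batch_size=len(y) it reduces to batch gradient descent, which is why mini-batch sits between the two algorithms above.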

(3)  Within a reasonable range, what are the benefits of increasing batch_size?

a.  Memory utilization improves, and the parallelization efficiency of large matrix multiplications increases.

b.  The number of iterations needed to finish one epoch (one pass over the full data set) decreases, so processing the same amount of data becomes even faster.

c.  Within a certain range, the larger the batch_size, the more accurate the estimated descent direction, and the smaller the oscillation during training.

(4)  What are the disadvantages of blindly increasing batch_size?

a.  Memory utilization goes up, but memory capacity may no longer be enough.

b.  The number of iterations needed to finish one epoch (one pass over the full data set) decreases, but reaching the same accuracy takes much more total time, so the parameters are corrected more slowly.

c.  Once batch_size grows beyond a certain point, the estimated descent direction essentially stops changing.
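
A quick arithmetic illustration of points (3)b and (4)b, assuming a hypothetical data set of 50,000 examples; the batch sizes shown are arbitrary examples:

```python
import math

n_examples = 50_000                              # hypothetical data set size
for batch_size in (1, 32, 1_000, 50_000):
    iters = math.ceil(n_examples / batch_size)   # weight updates in one epoch
    print(f"batch_size={batch_size:>6}: {iters:>6} iterations per epoch")
```

Larger batches mean fewer weight updates per epoch; each update is better averaged, but if the batch is too large the model simply gets fewer chances per epoch to correct its parameters.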

 

---------------------------------------

Original: https://zhuanlan.zhihu.com/p/37714263


Origin: blog.csdn.net/lcczzu/article/details/91413125