[Paper] An overview of gradient descent optimization algorithms

This article was first published on Sebastian Ruder's blog on January 16, 2016.

Paper translation:
An Overview of Gradient Descent Optimization Algorithms
0. Abstract:
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.
In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.

1. Introduction:
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).
These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.
In section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.
Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.

Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in \mathbb{R}^d\) by updating the parameters in the opposite direction of the gradient of the objective function \({\nabla}_{\theta} J(\theta)\) w.r.t. the parameters.
The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
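The update rule above can be sketched in a few lines of code. The objective \(J(\theta) = \theta^2\) and its gradient below are illustrative assumptions, not from the paper; they only serve to show the step \(\theta \leftarrow \theta - \eta \cdot \nabla_{\theta} J(\theta)\) walking downhill to a minimum.

```python
import numpy as np

# Hypothetical objective J(theta) = theta^2, whose gradient is 2*theta.
# Its unique (and therefore global) minimum is at theta = 0.
def grad_J(theta):
    return 2.0 * theta

theta = np.array([5.0])  # initial parameter value
eta = 0.1                # learning rate: size of each downhill step

for _ in range(100):
    # Move in the opposite direction of the gradient.
    theta = theta - eta * grad_J(theta)

print(float(theta[0]))  # approaches the minimum at 0
```

With a larger \(\eta\) the steps are bigger but may overshoot the valley; with a smaller \(\eta\) convergence is slower but more stable.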

2. Gradient descent variants
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

2.1 Batch gradient descent
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:
\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J(\theta)\) ---- (1)
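Update rule (1) can be sketched as follows. The linear-regression model, synthetic data, and mean-squared-error loss here are illustrative assumptions; the point is that each update in batch gradient descent uses the entire training set.

```python
import numpy as np

# Synthetic training data (hypothetical): 100 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

theta = np.zeros(3)  # model parameters
eta = 0.1            # learning rate

for epoch in range(500):
    # Gradient of the mean squared error over ALL training examples.
    grad = (2.0 / len(X)) * X.T @ (X @ theta - y)
    # Update rule (1): step opposite to the gradient.
    theta = theta - eta * grad

print(np.round(theta, 3))  # recovers the generating weights
```

Because every step touches the whole dataset, batch gradient descent is accurate per update but slow on large datasets, which motivates the stochastic and mini-batch variants discussed next.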

Origin: www.cnblogs.com/yanqiang/p/11301079.html