Neural network optimization method

I. Overview

After a neural network has been built, it only achieves the effect we want through optimization, i.e. what we usually call parameter tuning, which makes the parameter updates more accurate.
The neural network training process generally consists of two stages:

  • In the first stage, forward propagation computes the predicted value, which is then compared with the actual value to obtain the gap between the two.
  • In the second stage, the back-propagation algorithm computes the gradient of the loss function with respect to each parameter, and then gradient descent updates each parameter according to its gradient and the learning rate (see the sketch below).
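As a minimal sketch of these two stages, assuming the TensorFlow 1.x API used later in this post (the single-layer linear model, placeholder shapes and learning rate here are illustrative, not taken from the original post):

import tensorflow as tf

x  = tf.placeholder(tf.float32, [None, 2])    # input features
y_ = tf.placeholder(tf.float32, [None, 1])    # actual (target) values

w = tf.Variable(tf.random_normal([2, 1]))
b = tf.Variable(tf.zeros([1]))

# Stage 1: forward propagation produces the predicted value,
# which is compared with the actual value through a loss function
y    = tf.matmul(x, w) + b
loss = tf.reduce_mean(tf.square(y - y_))

# Stage 2: back-propagation computes the gradient of the loss for every
# parameter, and gradient descent updates the parameters using the
# gradients and the learning rate
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)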

II. Methods of optimizing the network

1. Optimizing the learning rate

Different parameters have different dependencies on the objective function: some parameters have already been optimized to near a minimum, while others still have large gradients, so a single uniform learning rate cannot be used. If the learning rate is too small, convergence will be very slow; if it is too large, the parameters that are already nearly optimal will become unstable. The usual practice is therefore to set a suitable learning rate for each parameter. There are three common adaptive learning-rate optimization algorithms (a minimal sketch of each update rule follows the list):

  • AdaGrad
    The AdaGrad algorithm independently adapts the learning rate of every model parameter, scaling each one inversely proportionally to the square root of the sum of all of its historical squared gradients. Parameters with the largest gradients of the cost function therefore have a rapidly decreasing learning rate, while parameters with small gradients have a correspondingly small decrease in learning rate.
  • RMSProp
    RMSProp modifies AdaGrad by changing the gradient accumulation into an exponentially weighted moving average, which makes it perform better in the non-convex setting.
  • Adam
    Adam combines an RMSProp-style moving average of the squared gradients with a moving average of the gradients themselves (momentum), and applies bias correction to both estimates.
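The three update rules differ mainly in how the gradients are accumulated. Below is a minimal NumPy sketch of a single parameter update for each method; the function names and hyperparameter values are illustrative, not the TensorFlow defaults.

import numpy as np

def adagrad_step(theta, g, r, lr=0.01, eps=1e-8):
    # AdaGrad: accumulate the sum of all historical squared gradients
    r = r + g * g
    theta = theta - lr * g / (np.sqrt(r) + eps)
    return theta, r

def rmsprop_step(theta, g, r, lr=0.001, rho=0.9, eps=1e-8):
    # RMSProp: exponentially weighted moving average of the squared gradients
    r = rho * r + (1 - rho) * g * g
    theta = theta - lr * g / (np.sqrt(r) + eps)
    return theta, r

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: moving averages of the gradient and its square, with bias correction (t starts at 1)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v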

Setting the learning rate separately: exponentially decaying learning rate.
An exponentially decaying learning rate starts with a larger learning rate to quickly obtain a reasonably good solution, and then, as the iterations continue, gradually reduces the learning rate so that the model is more stable in the later stages of training.

tf.train.exponential_decay(learning_rate,global_step,decay_steps,decay_rate,staircase=False,name=None)

Parameters:

learning_rate: the initial learning rate
decay_rate: the decay coefficient
decay_steps: controls the decay speed (how many steps between decays)
global_step: the current training step
staircase: Boolean. If True, global_step / decay_steps is an integer division, so the learning rate is updated only every decay_steps iterations and the resulting learning-rate schedule is a staircase (step) function.
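A minimal sketch of how the decayed learning rate is typically wired into training, assuming the TensorFlow 1.x API shown above; the decay values are illustrative and loss is assumed to be a previously defined loss tensor. The decayed rate equals learning_rate * decay_rate ^ (global_step / decay_steps), with the exponent truncated to an integer when staircase=True.

global_step = tf.Variable(0, trainable=False)

# Start at 0.1 and multiply by 0.96 every 100 steps (staircase decay)
learning_rate = tf.train.exponential_decay(learning_rate=0.1, global_step=global_step,
                                           decay_steps=100, decay_rate=0.96, staircase=True)

# Passing global_step to minimize() increments it at every update,
# which drives the decay schedule forward
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)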

2. Gradient descent optimization

https://blog.csdn.net/itchosen/article/details/77200322
General idea:
Gradient descent iteratively updates the parameters along the downhill direction of the loss function (the direction opposite to the derivative, i.e. the gradient). Its purpose is to find parameter values that minimize the loss function. A learning rate must also be defined, which controls the step size of each update. For non-convex functions, gradient descent does not guarantee reaching the global optimum, because the final result is affected by the initial parameter values; the global optimum is guaranteed only when the loss function is convex. Gradient descent comes in three variants, whose main difference is the number of samples used per update:

  • Batch gradient descent (Batch Gradient Descent)

Batch gradient descent uses all of the training samples for every parameter update.

  • Stochastic gradient descent (Stochastic Gradient Descent, SGD)

Stochastic gradient descent follows essentially the same principle as batch gradient descent; the only difference is that instead of using all of the samples to compute the gradient, it uses a single sample.

  • Mini-batch gradient descent (Mini-batch Gradient Descent)

Mini-batch gradient descent is a compromise between batch and stochastic gradient descent: a subset (mini-batch) of the samples is used for each gradient update, as sketched below.
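A minimal NumPy sketch of the three variants for a linear model with squared loss; X, y, theta and the batch size are illustrative placeholders.

import numpy as np

def gradient(theta, X, y):
    # Gradient of the mean squared error of a linear model on the given samples
    return 2.0 / len(X) * X.T @ (X @ theta - y)

def batch_gd_step(theta, X, y, lr=0.01):
    # Batch gradient descent: use all samples for one update
    return theta - lr * gradient(theta, X, y)

def sgd_step(theta, X, y, lr=0.01):
    # Stochastic gradient descent: use a single randomly chosen sample
    i = np.random.randint(len(X))
    return theta - lr * gradient(theta, X[i:i+1], y[i:i+1])

def minibatch_gd_step(theta, X, y, lr=0.01, batch_size=32):
    # Mini-batch gradient descent: use a random subset of the samples
    idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
    return theta - lr * gradient(theta, X[idx], y[idx])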

Comparison of stochastic gradient descent and batch gradient descent:

  • Training speed: stochastic gradient descent trains quickly because each iteration uses only a single sample, whereas batch gradient descent is unsatisfactorily slow when the number of samples is large.
  • Accuracy: because stochastic gradient descent decides the gradient direction from a single sample, the resulting solution is very likely not optimal. SGD usually needs longer training, but with a good initialization and learning-rate schedule it gives reliable results.
  • Convergence: because stochastic gradient descent iterates on one sample at a time, the update direction changes a lot between iterations, so it cannot converge quickly to a local optimum. SGD is likely to converge to a local optimum and in some cases may get trapped at a saddle point (a point that is neither a minimum nor a maximum).

Tuning points when using gradient descent:

  1. Choice of the learning rate (step size). In the algorithm description referenced above, the step size was taken as 1, but in practice the value depends on the data samples, so several values should be tried, from large to small, running the algorithm with each and observing the iterations: if the loss function keeps decreasing, the chosen value is effective; otherwise a different learning rate should be tried. As mentioned earlier, a learning rate that is too large makes the iterations move too fast and may even skip over the optimal solution, while one that is too small makes the iterations so slow that the algorithm takes a very long time to finish. So the algorithm usually has to be run several times before a reasonably good learning rate is found.

  2. Choice of the initial parameter values. Different initial values may lead to different minima, so gradient descent only guarantees a local minimum; of course, if the loss function is convex, the minimum found is necessarily the global optimum. Because of the risk of local optima, the algorithm should be run several times with different initial values, and the initial values whose final loss is smallest should be chosen.

  3. Normalization. Because different features have different value ranges, iteration can be very slow. To reduce the influence of feature scale, the feature data can be normalized, for example so that each new feature has mean 0 and variance 1, which can greatly reduce the number of iterations (see the sketch below).
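A minimal sketch of the normalization mentioned in point 3, standardizing each feature column to mean 0 and variance 1 (plain NumPy, for illustration only):

import numpy as np

def standardize(X):
    # Scale each feature column to mean 0 and variance 1
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8   # avoid division by zero for constant features
    return (X - mean) / std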

In TensorFlow, gradient descent and adaptive-learning-rate optimization are usually written as follows, where learning_rate is the learning rate and loss is the loss function. See the introductions on page 1 and page 2.

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
## Gradient descent: update the parameters using all samples

train_step = tf.train.AdagradOptimizer(learning_rate).minimize(loss)
## Adagrad optimizer

train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
## RMSProp optimizer

train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
## Adam optimizer

3. Underfitting, overfitting, and their optimization

There are three general classes of methods; see the linked introductions to the underfitting/overfitting optimization process and to underfitting/overfitting optimization methods.

  • Dropout
    Dropout is used to prevent neural networks from overfitting. Its main idea is to give every node in the hidden layers a chance (1 - keep_prob) of being dropped in each iteration (both forward and backward propagation), which prevents overfitting. It mainly avoids strong dependence on any single node, so that the corrections from back-propagation are distributed more evenly across the parameters (a minimal sketch follows this item).
    (1) Dropout is applied only during the training phase of the model; it is not applied when predicting in the test phase.
    (2) After Dropout randomly removes neurons, the network becomes smaller, so the training phase is faster.
    (3) With Dropout, the network will not give excessively high weight to any single feature (because the input neuron carrying that feature may be randomly dropped); the final effect of Dropout is similar to shrinking the squared norm of the weights.
    (4) A major drawback of Dropout is that the loss function is no longer well defined, so during training the loss value does not decrease monotonically.
    (5) In practice, first turn Dropout off by setting keep_prob to 1, check that the loss decreases monotonically, and then turn Dropout on.
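    A minimal sketch of applying Dropout to one hidden layer, in the TensorFlow 1.x style used elsewhere in this post; x, w1, b1, w2 and b2 are assumed to be previously defined tensors/variables, and keep_prob is fed as e.g. 0.5 during training and 1.0 at test time.

    keep_prob = tf.placeholder(tf.float32)            # probability of keeping a unit
    hidden = tf.nn.relu(tf.matmul(x, w1) + b1)        # a hidden layer
    hidden_drop = tf.nn.dropout(hidden, keep_prob)    # randomly drops units and scales the rest by 1/keep_prob
    y = tf.matmul(hidden_drop, w2) + b2               # output layer built on the dropped-out activations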

  • Bagging: see the introduction to ensemble learning

  • Regularization method
    (1) L1 regularization
    (2) L2 regularization

     # Apply L2 regularization to the weight parameters
     regularizer = tf.contrib.layers.l2_regularizer(0.01)
     regularization = regularizer(weight1)+regularizer(weight2)+regularizer(weight3)
     tf.add_to_collection("losses", regularization)    # add the regularization term to the "losses" collection
     tf.add_to_collection("losses", error_loss)        # add the data (error) loss to the same collection

     # get_collection() returns every element of the given collection (here, all loss terms),
     # and add_n() sums them up
     loss = tf.add_n(tf.get_collection("losses"))      # total loss = error term + regularization term
    
  • Adjusting the data and features

1) Re-clean the data. One cause of overfitting can be impure data; if overfitting occurs, the data may need to be cleaned again.

2) Increase the amount of training data. Another cause is that the amount of training data is too small, i.e. the training data is too small a proportion of the total data.

3) For underfitting, add more feature terms. Sometimes a model underfits because there are not enough features, and adding other features can solve the problem well. For example, adding "combination", "generalization" and "correlation" features is an important means of feature engineering. Adding polynomial terms is another example, e.g. adding quadratic or cubic terms to a linear model to improve its generalization ability (a small example follows below). For overfitting, features should instead be removed.

4) Underfitting can be addressed by relaxing the regularization constraints, and overfitting by strengthening them.
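As an illustration of adding polynomial feature terms (point 3 above), a minimal NumPy sketch that augments a single feature column with its square and cube; the helper name is hypothetical.

import numpy as np

def add_polynomial_terms(x):
    # x has shape (n_samples, 1); return [x, x^2, x^3] as the new feature matrix
    return np.hstack([x, x ** 2, x ** 3])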


Source: blog.csdn.net/qq_39751437/article/details/88541337