TensorFlow Learning Three: Optimization Problems in Neural Networks

One: Backpropagation:

       The gradient descent algorithm mainly optimizes the value of a single parameter, which I documented in a previous article. Backpropagation gives an efficient way to apply gradient descent to all parameters at once, so that the loss function of the neural network model on the training set becomes as small as possible. The details of backpropagation are not covered here; it is enough to know that backpropagation computes the gradient of the loss function with respect to each parameter, and gradient descent then updates each parameter according to that gradient and the learning rate.

It should be noted that backpropagation does not guarantee that the optimized loss reaches the global optimum. When training the network, the initial values of the parameters can greatly affect the final result.

Two: Stochastic gradient descent and mini-batch gradient descent:

      Since gradient descent minimizes the loss over the entire dataset, the loss function is the sum of the losses over all training examples, which means every iteration has to compute the loss over all of the training data. This is very time consuming. To speed up training, the stochastic gradient descent algorithm can be used instead: in each iteration it optimizes the loss function on a single randomly chosen training example. Its problem is equally obvious: making the loss smaller on one example does not necessarily make the loss on all of the data smaller, so stochastic gradient descent may not reach the global optimum. In actual network training, a compromise between the two is generally taken: each iteration computes the loss function over a small subset of the training data, called a batch (or mini-batch). Practice has shown that this compromise improves both the training speed and the quality of the result, so it is widely used; a minimal sketch is shown below.
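As an illustration (this code is not from the original article), here is a minimal TensorFlow 1.x sketch of mini-batch training on a made-up toy regression problem. The point is only that each call to train_step feeds a small random batch through feed_dict instead of the whole training set; the model, learning rate, and step count are arbitrary.

import numpy as np
import tensorflow as tf

# Hypothetical toy problem: fit y = 2x + 1 from noisy samples
X = np.random.rand(1000, 1).astype(np.float32)
Y = 2 * X + 1 + np.random.normal(0, 0.05, (1000, 1)).astype(np.float32)

x = tf.placeholder(tf.float32, shape=[None, 1])
y_ = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, w) + b
loss = tf.reduce_mean(tf.square(y_ - y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

batch_size = 8  # each parameter update only looks at this small slice of the data
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # draw a random mini-batch instead of feeding the whole training set
        idx = np.random.randint(0, len(X), batch_size)
        sess.run(train_step, feed_dict={x: X[idx], y_: Y[idx]})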

Three: Setting the learning rate:

   The learning rate has a great influence on model training. TensorFlow provides a flexible way to set it: exponential decay. The tf.train.exponential_decay function implements an exponentially decaying learning rate. With it, you can start with a relatively large learning rate to quickly reach a good solution, and then gradually reduce the learning rate as the training iterations continue, making the model more stable in the later stages of training. It computes the equivalent of the following code:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
Here decayed_learning_rate is the learning rate used in each round of optimization, and learning_rate is the initial learning rate. decay_rate is the decay coefficient (a value less than 1), and decay_steps is the decay speed, i.e. the number of steps over which one full decay is applied. global_step is the current iteration number, usually initialized to 0. The corresponding TensorFlow call looks like this:
learn_rate = tf.train.exponential_decay(0.1, global_step, decay_steps, decay_rate, staircase=True)
Different decay behaviors can be selected with the staircase parameter. Its default value is False, in which case the learning rate decays along a smooth curve; when it is set to True, global_step / decay_steps is truncated to an integer and the learning rate decays in a staircase pattern. In the staircase setting, decay_steps usually represents the number of iterations needed for one complete pass over the training data, i.e. the total number of training samples divided by the batch size. The typical scenario is to reduce the learning rate once after every complete pass over the training data, so that all examples in the training set have the same influence on the model. With the continuous (smooth) exponential decay, different training examples are seen under different learning rates, and once the learning rate has decayed, the corresponding examples have less influence on the final model.

global_step = tf.Variable(0)
learn_rate = tf.train.exponential_decay(0.1, global_step, 100, 0.96, staircase=True)
The above code multiplies the learning rate by 0.96 after every 100 training steps.
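In a real training loop, the decayed learning rate is handed to an optimizer. The lines below are a minimal sketch (not from the original article), assuming a loss tensor already exists; passing global_step to minimize() makes TensorFlow increment it on every step, which is what drives the decay.

global_step = tf.Variable(0, trainable=False)
learn_rate = tf.train.exponential_decay(0.1, global_step, 100, 0.96, staircase=True)
# Passing global_step makes minimize() increment it after each update, so the decay advances
train_step = tf.train.GradientDescentOptimizer(learn_rate).minimize(loss, global_step=global_step)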


Four: Overfitting problem:

     To combat overfitting, L1 regularization and L2 regularization are commonly used. Both work by limiting the magnitude of the weights, so that the model cannot arbitrarily fit the noise in the training data. The differences between the two are: L1 regularization makes the parameters sparser, while L2 does not. "Sparser" means that more parameters are driven to exactly 0, which acts somewhat like feature selection. Also, the L1 regularization term is not differentiable everywhere, while the L2 term is, so L2 is simpler to optimize. In practice the two can also be used together.

loss = tf.reduce_mean(tf.square(y_ - y)) + tf.contrib.layers.l2_regularizer(lam)(w)

(Here lam is the regularization weight, the λ in the usual formula; lambda itself is a reserved word in Python and cannot be used as a variable name.)
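To see the difference between the two regularizers concretely, both can be evaluated on a fixed weight matrix. This is a small illustrative sketch with made-up values:

import tensorflow as tf

w = tf.constant([[1.0, -2.0], [-3.0, 4.0]])
with tf.Session() as sess:
    # L1: lambda * sum(|w|) = 0.5 * (1 + 2 + 3 + 4) = 5.0
    print(sess.run(tf.contrib.layers.l1_regularizer(0.5)(w)))
    # L2: lambda * sum(w^2) / 2 = 0.5 * (1 + 4 + 9 + 16) / 2 = 7.5
    print(sess.run(tf.contrib.layers.l2_regularizer(0.5)(w)))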

Next, I will walk through the code from page 88 of the book "TensorFlow: Practical Google Deep Learning Framework".

When the network becomes complex, writing the loss function plus all the regularization terms in one expression becomes too long. In that case we can use the collections provided by TensorFlow: a collection can hold a group of entities, such as tensors, in a computation graph.

import tensorflow as tf

# Get the weight variable for one layer of the network, and add the L2 regularization loss
# of this weight to the collection named 'losses'
def get_weight(shape, lam):
    var = tf.Variable(tf.random_normal(shape), dtype=tf.float32)
    # add_to_collection adds the L2 loss of this newly created variable to the collection
    tf.add_to_collection('losses', tf.contrib.layers.l2_regularizer(lam)(var))
    return var
     
x = tf.placeholder(shape=[None,2], dtype=tf.float32)
y_ = tf.placeholder(shape=[None, 1], dtype=tf.float32)
batch_size = 8

#define the number of nodes in each layer
layer_dimension = [2, 10, 10, 10, 1]
n_layers = len(layer_dimension)

# This variable keeps track of the deepest layer reached so far in forward propagation; at the start it is the input layer
cur_layer = x
# Number of nodes in the current layer
in_dimension = layer_dimension[0]

#Generate a 5-layer fully connected network structure through a loop
for i in range(1, n_layers):
    out_dimension = layer_dimension[i]
    # generate the weight variable for this layer and add its L2 regularization loss to the collection
    weight = get_weight([in_dimension, out_dimension], 0.001)
    bias = tf.Variable(tf.constant(0.1, shape=[out_dimension]))
    # use the ReLU activation function and move cur_layer forward to this layer's output
    cur_layer = tf.nn.relu(tf.matmul(cur_layer, weight) + bias)
    # the number of input nodes for the next layer is the number of output nodes of this layer
    in_dimension = layer_dimension[i]

# When defining the forward propagation above, all the L2 regularization losses have already been added
# to the collection on the graph, so here we only need to define the loss of the model on the training data
mse_loss = tf.reduce_mean(tf.square(y_ - cur_layer))

#Add the mean squared error loss function to the loss set
tf.add_to_collection('losses', mse_loss)

#get_collection returns a list of all the elements in the collection
loss = tf.add_n(tf.get_collection('losses'))
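The code above only builds the graph. The snippet below is a minimal sketch (not from the book) of how this combined loss might be trained, using a made-up toy dataset of random 2-D points labeled by whether they fall inside the unit circle; the optimizer, learning rate, and step count are illustrative choices only.

import numpy as np

# Hypothetical toy data: 128 random 2-D points, label 1 if the point lies inside the unit circle
data = np.random.rand(128, 2).astype(np.float32)
labels = (np.sum(data ** 2, axis=1) < 1).astype(np.float32).reshape(-1, 1)

train_step = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        start = (step * batch_size) % 128
        end = start + batch_size
        sess.run(train_step, feed_dict={x: data[start:end], y_: labels[start:end]})
        if step % 200 == 0:
            # total loss = MSE + all L2 regularization terms in the 'losses' collection
            print(sess.run(loss, feed_dict={x: data, y_: labels}))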



