One of the most popular algorithms: backpropagation training

Backpropagation is one of the most common methods for training neural networks. Introduced by Rumelhart, Hinton, and Williams (1986), it remains popular today. Programmers often use backpropagation to train deep neural networks because it scales well on graphics processing units. To understand this algorithm, we must explore how to train a neural network and how it handles patterns.

Classic backpropagation has been extended and modified into many different training algorithms. In this chapter, we discuss the most commonly used training algorithms for neural networks, starting with classic backpropagation and ending with stochastic gradient descent.

6.1 Understanding the gradient

Backpropagation is a type of gradient descent, and the two terms are often used interchangeably in textbooks. Gradient descent refers to calculating a gradient on each weight in the neural network for each training element. Because the neural network does not initially output the expected value for a training element, the gradient of each weight indicates how to modify that weight to achieve the desired output. If the neural network did output exactly the expected result, the gradient of each weight would be 0, meaning that no weight needs to be modified.

The gradient is the derivative of the error function at the weight's current value. The error function measures the gap between the neural network's output and the expected output. Using gradient descent, we follow each weight's gradient to bring the error function to a lower value.

The gradient is essentially the partial derivative of the error function with respect to each weight in the neural network. Each weight has a gradient, which is the slope of the error function at that weight. A weight is the connection between two neurons. Calculating the gradient of the error function determines whether the training algorithm should increase or decrease a weight; in turn, this determination reduces the error of the neural network, which is the difference between the expected and actual output. Many different training algorithms, called "propagation training algorithms," make use of gradients. In general, the gradient tells the neural network the following:

  • Zero gradient: the weight is not contributing to the error of the neural network;
  • Negative gradient: the weight should be increased to reduce the error;
  • Positive gradient: the weight should be decreased to reduce the error.

Since many algorithms rely on gradient calculations, we start by analyzing this process.

 6.1.1 What is a gradient

First, let's explore the gradient. Essentially, training is a search for the set of weights that gives the neural network the smallest error over the training set. If we had unlimited computing resources, we would simply try every possible combination of weights and keep the one that produced the least error during training.

Because we do not have unlimited computing resources, we must use some kind of shortcut that avoids checking every possible combination of weights. An exhaustive search is impossible anyway, because even a small network has an essentially unlimited number of weight combinations. Training algorithms therefore use clever techniques to avoid a brute-force search over the weights.

Consider a chart showing the neural network's error for every possible value of a weight. Figure 6-1 shows the error for a single weight.

 

Figure 6-1 The error of a single weight

It is easy to see from Figure 6-1 that the best weight is the position on the curve with the lowest error value. The problem is that we can only see the error for the current weight; we cannot see the whole curve, because producing it would require an exhaustive search. However, we can determine the slope of the error curve at a specific weight. In Figure 6-1, the slope is shown at a weight of 1.5: the straight line tangent to the error curve at 1.5 gives the slope. In this example, the slope, or gradient, is −0.5622. A negative slope means that an increase in the weight will reduce the error.

The gradient is the instantaneous slope of the error function at a given weight. The derivative of the error curve at that point gives the gradient. The slope of the tangent line tells us how steep the error function is at that weight.

The derivative is one of the most fundamental concepts in calculus. For this book, you only need to understand that the derivative provides the slope of a function at a specific point. A training technique can use this slope to adjust the weights and reduce the error. Now, using this practical definition of the gradient, we will show how to calculate it.
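To make this concrete, here is a minimal sketch (not from the book) that estimates the gradient of an error function for a single weight using a finite difference, the numerical analogue of the tangent line in Figure 6-1. The one-weight error curve here is hypothetical:

```python
def error(w):
    # Hypothetical one-weight error curve, standing in for Figure 6-1.
    return (w - 0.5) ** 2 + 0.1

def numeric_gradient(f, w, h=1e-6):
    # Central finite difference: the slope of the tangent line at w.
    return (f(w + h) - f(w - h)) / (2.0 * h)

print(numeric_gradient(error, 1.5))  # about 2.0: positive, so decrease w
```

A positive result tells us to decrease the weight, matching the rules listed earlier. Real training algorithms compute this slope analytically rather than numerically, as we show next.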

 6.1.2 Calculating the gradient

We will calculate a gradient for each weight separately, focusing not only on the equations but also on their application to an actual neural network with real values. Figure 6-2 shows the neural network we will use: the XOR neural network.

 

Figure 6-2 XOR neural network

The same neural network is used in several examples in the online resources for this book (see the introduction). In this chapter, we show some calculations that illustrate the training of neural networks, and we must use the same starting weights to keep these calculations consistent. The starting weights are not special in any way, however; they were randomly generated by the program.

The neural network above is a typical three-layer feedforward neural network. As we studied earlier, the circles represent neurons, the lines connecting the circles represent weights, and the rectangle in the middle of each connection gives the value of that connection's weight.

The problem we now face is calculating the partial derivative for each weight in the neural network. We use partial derivatives when an equation has multiple variables. Each weight is treated as a variable, because each weight changes independently as the neural network learns. The partial derivative with respect to each weight shows that weight's independent effect on the error function. This partial derivative is the gradient.

Each partial derivative can be calculated using the chain rule of calculus. We start with one training set element. For Figure 6-2, we provide [1,0] as the input and expect 1 as the output: the input to the first input neuron is 1.0, and the input to the second input neuron is 0.0.

This input is fed forward through the neural network and ultimately produces an output. Chapter 4, "Feedforward Neural Networks," describes the exact process of calculating the outputs and weighted sums. Backpropagation has both a forward and a backward pass: forward propagation occurs while calculating the output of the neural network. We calculate the gradient only for this one item of the training set; other items in the training set will have different gradients. Later, we will discuss how the gradients of the individual training set elements are combined.

Now we are ready to calculate the gradient. The steps for calculating the gradient of each weight are summarized below:

  • Calculate the error, based on the ideal values in the training set;
  • Calculate the node delta for each output node (neuron);
  • Calculate the node delta for the interior neurons;
  • Calculate the individual gradients.

We will discuss these steps in the sections that follow.

6.2 Calculate the output node deltas

We will calculate a value for each node (neuron) in the neural network. We start with the output nodes and work gradually backward through the neural network; the term "backpropagation" comes from this process. We initially calculate the errors of the output neurons and then propagate these errors backward through the rest of the network.

The node delta is the value we will calculate for each node. The term layer delta also describes this value, because we can calculate the deltas one layer at a time. The method for determining the node delta differs for output nodes and interior nodes. We begin with the output nodes, whose deltas depend on the error function of the neural network. In this book, we study the quadratic error function and the cross-entropy error function.

 6.2.1 Quadratic error function

Neural network programmers often use the quadratic error function. In fact, you can find many examples of it on the Internet. If a sample program does not mention which error function it uses, it probably uses the quadratic error function, also known as the MSE function, which we discussed in Chapter 5, "Training and Evaluation". Equation 6-1 shows the MSE function:

$E = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$ (6-1)

Equation 6-1 compares the actual output of the neural network ($\hat{y}$) with the expected output ($y$). The variable $N$ is the number of training set elements multiplied by the number of output neurons; in this way, MSE treats multiple output neurons as a single combined output. Equation 6-2 shows the node delta for the quadratic error function:

$\delta_i = (\hat{y}_i - y_i)\,\phi'(z_i)$ (6-2)

The quadratic error function is very simple: the node delta is the difference between the expected and actual output of the neural network, multiplied by $\phi'$, the derivative of the activation function evaluated at the neuron's weighted sum $z_i$.
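As a small illustration of Equation 6-1 (our own sketch, not the book's code), where `actual` and `ideal` are flat lists covering every output neuron of every training element:

```python
def mse(actual, ideal):
    # Equation 6-1: the mean of the squared differences over all N values.
    n = len(actual)
    return sum((a - i) ** 2 for a, i in zip(actual, ideal)) / n

print(mse([0.8, 0.2], [1.0, 0.0]))  # 0.04
```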

 6.2.2 Cross entropy error function

The quadratic error function can sometimes take a long time to adjust the weights correctly. Equation 6-3 shows the cross-entropy error (CE) function:

$E = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right]$ (6-3)

As Equation 6-4 shows, calculating the node delta with the cross-entropy error function is much simpler than with the MSE function:

$\delta_i = \hat{y}_i - y_i$ (6-4)

The cross-entropy error function usually gives better results than the quadratic error function, because it produces a much steeper gradient when the error is large. We recommend the cross-entropy error function.
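To contrast Equations 6-2 and 6-4, here is a hedged sketch for a single sigmoid output neuron (the sigmoid derivative, covered in section 6.4.3, is assumed here; the numbers are made up):

```python
def output_delta_quadratic(y_hat, y, phi_prime):
    # Equation 6-2: error difference scaled by the activation derivative.
    return (y_hat - y) * phi_prime

def output_delta_cross_entropy(y_hat, y):
    # Equation 6-4: the activation derivative cancels out.
    return y_hat - y

y_hat, y = 0.75, 1.0
phi_prime = y_hat * (1.0 - y_hat)   # sigmoid derivative reuses the output
print(output_delta_quadratic(y_hat, y, phi_prime))  # about -0.047
print(output_delta_cross_entropy(y_hat, y))         # -0.25
```

The cross-entropy delta is roughly five times larger here, illustrating the steeper gradient mentioned above.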

6.3 Calculate the remaining node deltas

Now that the node deltas for the output nodes have been calculated with the appropriate error function, we can calculate the node deltas for the interior nodes, as shown in Equation 6-5:

$\delta_i = \phi'(z_i) \sum_{k} w_{ik}\,\delta_k$ (6-5)

Here the sum runs over the neurons $k$ in the next layer that neuron $i$ connects to, $w_{ik}$ is the weight from neuron $i$ to neuron $k$, and $z_i$ is neuron $i$'s weighted sum.

We will calculate node deltas for all hidden neurons. There is no need to calculate node deltas for the input and bias neurons: even though Equation 6-5 could easily compute them, these values are not required for the gradient calculation. As you will soon see, the gradient calculation for a weight considers only the neurons that the weight connects. Bias and input neurons are only the starting points of connections; they are never the endpoints.

If you want to see the gradient calculation process, several JavaScript examples demonstrate these calculations. They can be found at the following URL:

http://www.heatonresearch.com/aifh/vol3/
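The following is a hedged end-to-end sketch of the four steps for the [1,0] → 1 training element. The starting weights are made up for illustration (the book's randomly generated values are not reproduced here), and sigmoid activations are assumed throughout. Equations 6-2 and 6-5 supply the deltas, and each weight's gradient is the delta at the weight's target neuron times the output of its source neuron:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up starting weights: rows are [w_input1, w_input2, w_bias].
w_hidden = np.array([[0.1, 0.4, 0.3],
                     [0.2, -0.5, 0.6]])
w_output = np.array([0.7, -0.2, 0.9])   # [w_hidden1, w_hidden2, w_bias]

x = np.array([1.0, 0.0])                # training input
ideal = 1.0                             # expected output

# Forward pass (Chapter 4): weighted sums, then activations.
z_hidden = w_hidden[:, :2] @ x + w_hidden[:, 2]
h = sigmoid(z_hidden)
y = sigmoid(w_output[:2] @ h + w_output[2])

# Steps 1-2: error and output node delta (Equation 6-2, quadratic error).
delta_out = (y - ideal) * y * (1.0 - y)

# Step 3: interior node deltas (Equation 6-5).
delta_hidden = h * (1.0 - h) * (w_output[:2] * delta_out)

# Step 4: individual gradients = target delta * source output (bias outputs 1).
grad_output = delta_out * np.append(h, 1.0)
grad_hidden = np.outer(delta_hidden, np.append(x, 1.0))
print(grad_output)
print(grad_hidden)
```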

6.4 The derivative of the activation function

The backpropagation process requires the derivatives of the activation functions, and these derivatives largely determine how backpropagation performs. Most modern deep neural networks use the linear, Softmax, and ReLU activation functions. We will also examine the derivatives of the sigmoid and hyperbolic tangent activation functions, to understand why the ReLU activation function performs so well.

 6.4.1 Derivative of linear activation function

The linear activation function is barely an activation function at all, since it simply returns whatever value it is given. For this reason, it is sometimes called the identity activation function. Its derivative is 1, as shown in Equation 6-6:

$\phi'(x) = 1$ (6-6)

As mentioned earlier, the Greek letter $\phi$ (phi) represents the activation function. The prime mark at the upper right of $\phi$ indicates that we are using the derivative of the activation function; this is one of several mathematical notations for the derivative.

 6.4.2 Derivative of the Softmax activation function

In this book, the Softmax activation function and the linear activation function are used only on the output layer of the neural network. As mentioned in Chapter 1, "Neural Network Basics," the Softmax activation function differs from the other activation functions in that its value also depends on the other output neurons, not just on the output neuron currently being calculated. For convenience, Equation 6-7 shows the Softmax activation function again:

$\phi_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$ (6-7)

The vector $z$ represents the weighted sums of all the output neurons. Equation 6-8 shows the derivative of the Softmax activation function:

$\frac{\partial \phi_i}{\partial z_i} = \phi_i\,(1 - \phi_i)$ (6-8)

For this derivative, we use slightly different notation. The ratio written with the cursive $\partial$ symbol denotes a partial derivative, which is used when differentiating an equation of several variables: we differentiate with respect to one variable while holding all the others constant. The numerator names the function being differentiated, in this case the activation function $\phi_i$. The $\partial$ term in the denominator names the variable we differentiate with respect to; in this example, it is the weighted sum $z_i$ of the output neuron being calculated, and all other variables are treated as constants. A derivative is an instantaneous rate of change: only one variable may change at a time.

If the cross-entropy error function is used, the derivatives of the linear and Softmax activation functions are never needed to calculate the gradient of the neural network. Usually, the linear and Softmax activation functions appear only in the output layer, so we need not worry about their derivatives at interior nodes; and for output nodes that use the cross-entropy error function, the derivative term for the linear and Softmax activation functions is effectively 1. Consequently, you will rarely use the derivatives of the linear or Softmax activation functions at all.
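A brief sketch of Equation 6-7 and the cross-entropy shortcut (our own illustration; subtracting the maximum before exponentiating is a standard numerical-stability precaution not shown in the equation):

```python
import numpy as np

def softmax(z):
    # Equation 6-7: exponentiate each weighted sum, normalize by the group sum.
    e = np.exp(z - np.max(z))        # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)
print(y_hat, y_hat.sum())            # probabilities that sum to 1

# With cross-entropy error, the output node delta is simply y_hat - y,
# so the Softmax derivative never needs to be evaluated explicitly.
y = np.array([1.0, 0.0, 0.0])
print(y_hat - y)
```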

 6.4.3 Derivative of the sigmoid activation function

Equation 6-9 shows the derivative of the sigmoid activation function:

$\phi'(x) = \phi(x)\,(1 - \phi(x))$ (6-9)

Machine learning often uses the form of the sigmoid derivative shown in Equation 6-9. It is obtained by algebraic manipulation of the derivative so that the sigmoid activation function appears in its own derivative; here, $\phi$ represents the sigmoid activation function itself. This form improves computational efficiency: the feedforward pass has already computed the value of the sigmoid activation function, and keeping that value makes the derivative easy to calculate. If you are interested in how Equation 6-9 is derived, refer to the following website:

http://www.heatonresearch.com/aifh/vol3/deriv_sigmoid.html
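A short sketch of Equation 6-9, showing how the value already computed during the feedforward pass is reused:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(s):
    # Equation 6-9: s is the sigmoid value already computed in the forward pass.
    return s * (1.0 - s)

s = sigmoid(0.5)                 # computed during feedforward
print(sigmoid_derivative(s))     # about 0.235, no extra exp() call needed
```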

 6.4.4 The derivative of the hyperbolic tangent activation function

Equation 6-10 gives the derivative of the hyperbolic tangent activation function:

$\phi'(x) = 1 - \phi(x)^2$ (6-10)
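The same value-reuse trick applies to Equation 6-10, as this small sketch of ours shows:

```python
import math

t = math.tanh(0.5)        # activation value from the feedforward pass
print(1.0 - t * t)        # Equation 6-10: about 0.786
```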

In this book, we recommend using the hyperbolic tangent activation function instead of the sigmoid activation function.

 6.4.5 Derivative of ReLU activation function

Equation 6-11 shows the derivative of the ReLU activation function:

$\phi'(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$ (6-11)

Strictly speaking, the ReLU activation function has no derivative at $x = 0$; by convention, however, the value for $x \le 0$ is used as the gradient at 0. Deep neural networks with sigmoid and hyperbolic tangent activation functions can be difficult to train through backpropagation. Many factors contribute to this difficulty, and the vanishing gradient problem is one of the most common. Figure 6-3 shows the hyperbolic tangent activation function and its gradient/derivative.

 

Figure 6-3 Hyperbolic tangent activation function and its gradient/derivative

Figure 6-3 shows that as the hyperbolic tangent activation function (solid line) approaches −1 and 1, its derivative (dotted line) vanishes toward 0. Both the sigmoid and hyperbolic tangent activation functions have this problem; the ReLU activation function does not. Figure 6-4 shows the sigmoid activation function and its vanishing derivative.

 

Figure 6-4 Sigmoid activation function and its vanishing derivative
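The vanishing derivatives in Figures 6-3 and 6-4 are easy to verify numerically. This hedged sketch compares the three derivatives as the input grows:

```python
import math

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)                # Equation 6-9

def tanh_deriv(x):
    return 1.0 - math.tanh(x) ** 2      # Equation 6-10

def relu_deriv(x):
    return 1.0 if x > 0 else 0.0        # Equation 6-11

for x in (0.5, 2.0, 5.0):
    print(x, sigmoid_deriv(x), tanh_deriv(x), relu_deriv(x))
# As x grows, the sigmoid and tanh derivatives shrink toward 0,
# while the ReLU derivative stays at 1: its gradient does not vanish.
```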

6.5 Apply backpropagation

Backpropagation is a simple training algorithm that uses the calculated gradients to adjust the weights of the neural network. This method is a form of gradient descent, because we descend along the gradient toward lower error values. As the program adjusts the weights, the neural network produces more desirable output, and the overall error should decrease as training progresses. Before discussing how backpropagation updates the weights, we must first discuss two different ways of updating them.

 6.5.1 Batch training and online training

We have shown how to calculate the gradient for a single training set element. Earlier in this chapter, we calculated the gradient for the neural network input [1,0] with expected output 1. This works for a single training set element, but most training sets have many elements. There are two ways to handle multiple training set elements: online training and batch training.

Online training means that you modify the weights after every training set element. Using the gradients obtained from the first training set element, you calculate and apply the weight changes. Training then moves on to the next training set element, where the network is again evaluated and updated, and so on until every training set element has been used. At that point, one iteration, or epoch, of training is complete.

Batch training also uses every training set element, but we do not update the weights after each element. Instead, we sum the gradients for each training set element, and once the summation is complete, we update the neural network weights. At that point, the iteration is complete.

Sometimes we set a batch size. For example, if the training set has 10,000 elements, you might choose to update the neural network's weights every 1,000 elements, so that the weights are updated 10 times during one training iteration.

Online training was the original way of performing backpropagation. If you want to view the calculations of a batch version of the program, see the following online example:

http://www.heatonresearch.com/aifh/vol3/xor_batch.html
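The difference between the two modes is only where the weight update happens. A hedged sketch, assuming a `gradient(weights, x, y)` helper such as the one sketched at the end of section 6.3:

```python
def train_online(weights, dataset, lr, gradient):
    # Online: apply a weight update after every training element.
    for x, y in dataset:
        g = gradient(weights, x, y)
        weights = [w - lr * gi for w, gi in zip(weights, g)]
    return weights

def train_batch(weights, dataset, lr, gradient):
    # Batch: sum the gradients over all elements, then update once.
    total = [0.0] * len(weights)
    for x, y in dataset:
        g = gradient(weights, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return [w - lr * t for w, t in zip(weights, total)]
```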

 6.5.2 Stochastic gradient descent

Batch training and online training are not the only options for backpropagation. Stochastic gradient descent (SGD) is the most popular member of the backpropagation family. SGD can work in either batch or online mode. Online SGD simply selects a training set element at random, calculates the gradient, and performs a weight update; this process continues until the error reaches an acceptable level. Selecting random training set elements usually converges to acceptable weights faster than traversing the entire training set in each iteration.

Batch SGD works by choosing a batch size. For each iteration, a mini-batch of at most the chosen batch size is selected at random from the training set. The gradients over the mini-batch are summed, just as in a regular backpropagation batch update; the only difference is that each time a batch is needed, a mini-batch is chosen at random. In SGD, an iteration usually processes a single mini-batch, and the batch size is usually much smaller than the entire training set. A common choice for the batch size is 600.
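Mini-batch selection is then a single random sample per iteration; this sketch reuses the hypothetical `train_batch` helper from the previous example:

```python
import random

def sgd_iteration(weights, dataset, lr, gradient, batch_size=600):
    # Pick a random mini-batch no larger than batch_size, then do a
    # regular batch update on just that mini-batch.
    mini_batch = random.sample(dataset, min(batch_size, len(dataset)))
    return train_batch(weights, mini_batch, lr, gradient)
```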

 6.5.3 Backpropagation weight update

Now we are ready to update the weights. As mentioned earlier, we treat the weights and gradients as one-dimensional arrays. Given these two arrays, we can calculate the weight updates for an iteration of backpropagation training. Equation 6-12 gives the formula for the backpropagation weight update:

$\Delta w_t = -\varepsilon\,\frac{\partial E}{\partial w} + \alpha\,\Delta w_{t-1}$ (6-12)

Equation 6-12 calculates the weight change for each element of the weight array. Notice that it requires the weight change from the previous iteration, so you must store these values in another array. As mentioned earlier, the direction of the weight update is opposite to the sign of the gradient: a positive gradient should cause the weight to decrease, and a negative gradient should cause it to increase. Because of this inverse relationship, Equation 6-12 begins with a negative sign.

Equation 6-12 calculates the weight delta as the product of the gradient and the learning rate (denoted by ε). It then adds the product of the previous weight change and the momentum value (denoted by α). The learning rate and momentum are two parameters that we must provide to the backpropagation algorithm, and choosing their values is very important for training performance. Unfortunately, determining a good learning rate and momentum is mainly a matter of trial and error.
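Equation 6-12 translates directly into code. This hedged sketch keeps the previous deltas in a second array, as the text describes:

```python
def backprop_update(weights, gradients, prev_deltas, lr=0.1, momentum=0.9):
    # Equation 6-12: delta = -lr * gradient + momentum * previous delta.
    deltas = [-lr * g + momentum * pd
              for g, pd in zip(gradients, prev_deltas)]
    weights = [w + d for w, d in zip(weights, deltas)]
    return weights, deltas   # deltas feed the momentum term next iteration
```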

The learning rate scales the gradient, which can slow down or speed up learning. A learning rate below 1.0 slows learning; for example, a learning rate of 0.5 reduces each gradient step by 50%. A learning rate above 1.0 would accelerate training, but in practice the learning rate is almost always below 1.

Choosing a learning rate that is too high will prevent your neural network from converging: the global error stays high, or fluctuates, instead of converging to a lower value. Choosing a learning rate that is too low will make the neural network take a very long time to converge.

Like the learning rate, momentum is a scaling factor. Although optional, momentum determines what percentage of the previous iteration's weight change should be applied to the current iteration. If you do not want to use momentum, specify a value of 0.

Momentum is a technique that helps backpropagation escape local minima, which are low points on the error surface that are not the true global minimum. Backpropagation tends to settle into a local minimum and fail to climb back out, which leaves training converged at a higher error than we would like. Momentum applies force to the neural network in the direction of its current change, allowing it to break through a local minimum.

 6.5.4 Choosing the learning rate and momentum

Momentum and the learning rate contribute to the success of training, but they are not actually part of the neural network. Once training is complete, the trained weights remain, and the momentum and learning rate are no longer used; they are essentially temporary "scaffolding" used to create the trained network. Choosing the right momentum and learning rate affects the effectiveness of training.

The learning rate affects the speed of neural network training; reducing it makes training more fine-grained. A higher learning rate may skip past optimal weight settings, while a lower learning rate tends to produce better results at the cost of much longer running time. Reducing the learning rate as training progresses can be an effective technique.

You can use momentum to fight local minima. If you find that the neural network is stagnating, a higher momentum value may push training past the local minimum it has encountered. Ultimately, choosing good values for the momentum and learning rate is a process of trial and error, and you can adjust them as training progresses. Momentum is usually set to 0.9, and the learning rate to 0.1 or lower.

 6.5.5 Nesterov Momentum

Because of the randomness introduced by mini-batches, the SGD algorithm can sometimes make erratic progress. The weights may receive a very beneficial update in one iteration, only to have an unluckily chosen set of training elements undo it in the next mini-batch. Momentum is therefore a valuable tool that can mitigate this unstable training behavior.

Nesterov momentum is a relatively new application of a technique invented by Yu. Nesterov in 1983 and updated in his book Introductory Lectures on Convex Optimization: A Basic Course [1]. The technique is sometimes called Nesterov's accelerated gradient descent. A complete mathematical explanation of Nesterov momentum is beyond the scope of this book, but we will cover the weight update in detail so that you can implement it. The examples in this book, including the online JavaScript examples, contain implementations of the Nesterov momentum weight update.

Equation 6-13 calculates the partial weight update from the learning rate ($\varepsilon$) and the momentum ($\alpha$):

$n_t = \alpha\,n_{t-1} - \varepsilon\,\frac{\partial E}{\partial w}$ (6-13)

The subscript $t$ denotes the current iteration, and $t-1$ the previous iteration. The partial weight update, called $n$, starts out at 0; each subsequent partial weight update is computed from its previous value. The partial derivative in Equation 6-13 is the gradient of the error function at the current weight. Equation 6-14 shows the Nesterov momentum update, which replaces the standard backpropagation weight update of Equation 6-12:

$\Delta w_t = -\alpha\,n_{t-1} + (1 + \alpha)\,n_t$ (6-14)

The weight update in Equation 6-14 is an amplification of the partial weight update, and the resulting weight delta is added to the current weight. SGD with Nesterov momentum is one of the most effective training algorithms in deep learning.
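A hedged sketch of Equations 6-13 and 6-14, with the partial weight updates `n` starting at zero as the text describes:

```python
def nesterov_update(weights, gradients, n_prev, lr=0.1, momentum=0.9):
    # Equation 6-13: partial weight update from momentum and gradient.
    n = [momentum * p - lr * g for p, g in zip(n_prev, gradients)]
    # Equation 6-14: amplified update that replaces Equation 6-12.
    deltas = [-momentum * p + (1.0 + momentum) * ni
              for p, ni in zip(n_prev, n)]
    weights = [w + d for w, d in zip(weights, deltas)]
    return weights, n        # n is carried into the next iteration
```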

6.6 Summary of this chapter

This chapter introduced classic backpropagation and SGD. These methods are based on gradient descent: they optimize individual weights using derivatives. For a given weight, the derivative gives the program the slope of the error function, and the slope lets the program determine how to update the weight. Each training algorithm interprets this slope, or gradient, differently.

Although backpropagation is one of the oldest training algorithms, it remains one of the most popular. Backpropagation adds scaled gradients to the weights: a negative gradient increases a weight, and a positive gradient decreases it. We scale the gradients by the learning rate to prevent the weights from changing too quickly: a learning rate of 0.5 applies half of each gradient, while a learning rate of 2.0 would apply double each gradient.

There are many variants of the backpropagation algorithm, some of which, such as resilient propagation, are popular. Chapter 7 introduces several of these variants. Although understanding them is useful, SGD remains one of the most common training algorithms for deep learning.

This article is excerpted from "Artificial Intelligence Algorithms (Volume 3): Deep Learning and Neural Networks"

 

This book is Volume 3 in a series introducing AI, a broad field of study with many sub-disciplines. For readers who have not read Volume 1 or Volume 2 of the series, the introduction of this book provides the necessary background information; reading Volume 1 or Volume 2 first is not required.

Neural networks have played a vital role since the early days of artificial intelligence. Now, exciting new technologies such as deep learning and convolution are taking neural networks in a whole new direction. This book applies neural networks to real-world tasks such as image recognition and data science, and introduces current neural network technologies, including ReLU activation, stochastic gradient descent, cross-entropy, regularization, dropout, and visualization.

The target audience of this book is readers who are interested in artificial intelligence but lack a strong mathematical background; only a basic understanding of college algebra is needed. The book provides supporting sample code, currently in Java, C#, and Python versions.
