Detailed explanation of the backpropagation algorithm

Introduction

The BP algorithm, whose full name is the "error backpropagation algorithm" and which is also known simply as the "backpropagation algorithm", is by far the most widely used learning algorithm for multi-layer neural networks. A multilayer feedforward neural network trained with the BP algorithm is called a "BP network".

Algorithm flow

The BP algorithm adopts a gradient-descent strategy, adjusting the parameters in the direction of the negative gradient of the objective, with the goal of minimizing the training error. For each training example, the algorithm does the following:

  1. The input example is first fed to the input-layer neurons, and the signal is propagated forward layer by layer until the output layer produces a result;
  2. The error at the output layer is then computed and propagated back to the hidden-layer neurons;
  3. Finally, the connection weights and thresholds are adjusted according to the errors of the hidden-layer neurons.

This iterative process repeats until some stopping condition is met, for example the training error falling below a small value. The pseudocode is given in the watermelon book (P104); a minimal code sketch of the same procedure follows.
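The following is a minimal NumPy sketch of standard BP for a single-hidden-layer network with sigmoid activations and a squared-error loss; the layer size, learning rate eta, epoch count, and initialization scheme are illustrative assumptions, not the book's exact pseudocode.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_standard_bp(X, Y, n_hidden=5, eta=0.1, epochs=100, seed=0):
    """Standard BP: weights and thresholds are updated after every single training example."""
    n_in, n_out = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(n_in, n_hidden))     # input -> hidden connection weights
    gamma = np.zeros(n_hidden)                           # hidden-layer thresholds
    W = rng.normal(scale=0.1, size=(n_hidden, n_out))    # hidden -> output connection weights
    theta = np.zeros(n_out)                              # output-layer thresholds

    for _ in range(epochs):
        for x, y in zip(X, Y):
            # 1. forward pass: propagate the signal layer by layer to the output
            b = sigmoid(x @ V - gamma)        # hidden-layer outputs
            y_hat = sigmoid(b @ W - theta)    # output-layer outputs
            # 2. compute the output error and back-propagate it to the hidden layer
            g = y_hat * (1.0 - y_hat) * (y - y_hat)   # output-layer gradient term
            e = b * (1.0 - b) * (W @ g)               # hidden-layer gradient term
            # 3. gradient-descent updates of connection weights and thresholds
            W += eta * np.outer(b, g)
            theta -= eta * g
            V += eta * np.outer(x, e)
            gamma -= eta * e
    return V, gamma, W, theta
```

For instance, calling train_standard_bp(X, Y) with X of shape (m, d) and one-hot targets Y of shape (m, l) returns the learned weights and thresholds.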

Cumulative BP Algorithm

What was introduced above is the "standard BP algorithm", which updates the connection weights and thresholds for only one training example at a time; that is, its update rule is derived from a single per-example error E_k. If the update rule is instead derived from minimizing the cumulative error over the whole training set, we obtain the cumulative BP algorithm. The cumulative BP algorithm updates the parameters only after reading through the entire training set once, so its parameters are updated far less frequently than with standard BP. However, in many tasks the cumulative error decreases very slowly after falling to a certain level, and in such cases the standard BP algorithm is often the more suitable choice.
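To make the contrast concrete, here is a hedged sketch of cumulative BP that reuses sigmoid and the notation from the sketch above; averaging the accumulated gradients over the whole training set is an assumption about how the cumulative error is defined.

```python
def train_cumulative_bp(X, Y, n_hidden=5, eta=0.1, epochs=100, seed=0):
    """Cumulative BP: gradients are accumulated over the whole training set
    and the parameters are updated only once per full pass (epoch)."""
    n_in, n_out = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(n_in, n_hidden))
    gamma = np.zeros(n_hidden)
    W = rng.normal(scale=0.1, size=(n_hidden, n_out))
    theta = np.zeros(n_out)

    for _ in range(epochs):
        dV = np.zeros_like(V); dgamma = np.zeros_like(gamma)
        dW = np.zeros_like(W); dtheta = np.zeros_like(theta)
        for x, y in zip(X, Y):
            b = sigmoid(x @ V - gamma)
            y_hat = sigmoid(b @ W - theta)
            g = y_hat * (1.0 - y_hat) * (y - y_hat)
            e = b * (1.0 - b) * (W @ g)
            # accumulate the gradients instead of updating immediately
            dW += np.outer(b, g); dtheta -= g
            dV += np.outer(x, e); dgamma -= e
        # one parameter update per pass over the entire training set
        W += eta * dW / len(X); theta += eta * dtheta / len(X)
        V += eta * dV / len(X); gamma += eta * dgamma / len(X)
    return V, gamma, W, theta
```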

Handling overfitting

Because of the powerful representational ability of BP networks, they often run into overfitting, which can be alleviated with two strategies:

  • One is "early stopping": the data is split into a training set and a validation set, and training stops as soon as the training error keeps decreasing while the validation error starts to rise;
  • The other is "regularization": the basic idea is to add to the error objective function a term describing the complexity of the network, such as the sum of squared connection weights (see the sketch after this list).
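As a rough illustration of both strategies (not the book's exact formulation), the sketch below adds a weight-complexity term to the error objective and monitors the validation error for early stopping; lam, patience, and the helper names are hypothetical.

```python
import numpy as np

def regularized_error(per_example_errors, weight_matrices, lam=0.5):
    """Weighted sum of the mean empirical error and a complexity term
    (sum of squared connection weights); lam trades the two off."""
    empirical = np.mean(per_example_errors)
    complexity = sum(np.sum(w ** 2) for w in weight_matrices)
    return lam * empirical + (1.0 - lam) * complexity

def should_stop_early(val_errors, patience=5):
    """Early stopping: stop once the validation error has not improved
    for `patience` consecutive epochs."""
    if len(val_errors) <= patience:
        return False
    return min(val_errors[-patience:]) >= min(val_errors[:-patience])
```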

Local optima

Since the BP algorithm is based on gradient descent, it may get stuck in a local optimum. In practical tasks, the following strategies are often used to try to "jump out" of a local optimum and get closer to the global optimum:

  • Train multiple networks with different random initial values and take the solution with the smallest error as the final solution (see the sketch after this list);
  • Use the "simulated annealing" technique, that is, at each step accept a solution worse than the current one with a certain probability, which is gradually decreased as training proceeds;
  • Use stochastic gradient descent; unlike standard gradient descent, it introduces random factors into the gradient computation, so even at a local minimum the computed gradient may be nonzero, giving the search a chance to escape.
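As an illustration of the first strategy, a random-restart wrapper around the train_standard_bp sketch from earlier could look as follows; the squared-error evaluation and the number of restarts are assumptions.

```python
import numpy as np

def training_error(params, X, Y):
    """Mean squared error on (X, Y) of a network trained with train_standard_bp above."""
    V, gamma, W, theta = params
    B = sigmoid(X @ V - gamma)        # hidden-layer outputs for all examples
    Y_hat = sigmoid(B @ W - theta)    # network predictions
    return 0.5 * np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

def train_with_restarts(X, Y, n_restarts=10, **train_kwargs):
    """Train from several different random initializations and keep the
    solution with the smallest training error as the final answer."""
    best_params, best_err = None, np.inf
    for seed in range(n_restarts):
        params = train_standard_bp(X, Y, seed=seed, **train_kwargs)
        err = training_error(params, X, Y)
        if err < best_err:
            best_params, best_err = params, err
    return best_params
```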

For the content of this section, see Zhou Zhihua's "Machine Learning", P101-P107.
