Implementation of Caffe's backpropagation

1. The backpropagation algorithm

The theory of the backpropagation algorithm below is taken from http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

Suppose we have a fixed training set with m training samples. We can use "batch gradient descent" to train our neural network. In detail, for a single training sample (x,y), we define its corresponding loss function as follows:
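From the UFLDL notes cited above, the single-sample loss is:

    J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2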

This is the "(one-half) squared-error "loss function. Given a training set with m samples, we define the overall loss function as:
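From the same notes, with the first term the average single-sample loss and the second the weight decay term:

    J(W,b) = \frac{1}{m} \sum_{i=1}^{m} J(W,b; x^{(i)}, y^{(i)})
           + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2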

The first term in the definition of J(W,b) is the average of the single-sample losses over the training set. The second term is a regularization term (also called the "weight decay" term), which tends to decrease the magnitude of the weights and helps prevent overfitting.

Explanation: "weight decay" is not like what we defined for J(W,b), it is usually not applied to the bias term. Applying "weight decay" to the bias unit usually only makes a small difference to the final network. If you use Stanford’s CS229 (Machine Learning) or watch the course video on YouTube, you can also realize that this “weight decay” is essentially a variant of Bayesian regularization, on top of it We place a Gaussian prior and do a MAP (instead of maximum likelihood) estimation.

The weight decay parameter λ controls the relative importance of the two terms. Note: J(W,b;x,y) is the squared-error loss for a single sample, while J(W,b) is the overall loss function, which includes the weight decay term.

The above loss function is often used for both classification and regression problems. For classification, we use y = 0 or 1 to represent the two class labels (recall that the sigmoid activation function outputs values in [0,1]; if we were using a tanh activation function, we would instead use -1 and +1 to denote the labels). For regression, we first scale our outputs to ensure that they lie in the range [0,1] (or, if we were using a tanh activation function, in the range [-1,1]).

Our goal is to minimize J(W,b) as a function of W and b. To train our neural network, we initialize each weight and each bias to a small random value near zero (say, according to a Normal(0,ε2) distribution for some small ε, say 0.01), and then apply an optimization algorithm such as batch gradient descent. Since J(W,b) is a non-convex function, gradient descent is susceptible to local optima; however, in practice it usually works quite well. Finally, it is important to note that the parameters must be initialized randomly and cannot all be set to 0. If all parameters start off at identical values, then all the hidden units will end up learning the same function. Random initialization serves to break this symmetry.
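A minimal sketch of this initialization (illustrative plain C++, not Caffe code; in Caffe itself, initialization is configured through the layer's weight_filler parameter):

    #include <random>
    #include <vector>

    int main() {
      // Draw every weight from Normal(0, epsilon^2) with a small epsilon,
      // e.g. 0.01, so hidden units start out different (symmetry breaking).
      const double epsilon = 0.01;
      std::mt19937 gen(std::random_device{}());
      std::normal_distribution<double> dist(0.0, epsilon);
      std::vector<double> W(100);
      for (double& w : W) w = dist(gen);
      return 0;
    }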

One iteration of gradient descent updates the parameters W and b as follows:
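From the UFLDL notes, one iteration performs:

    W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)
    b_{i}^{(l)}  := b_{i}^{(l)}  - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)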

Here, α is the learning rate. The key step is computing the partial derivatives above. The backpropagation algorithm, described next, gives an efficient way to compute these partial derivatives.

We first describe how backpropagation can be used to compute the partial derivatives of the loss function J(W,b;x,y) defined for a single sample (x,y), with respect to each weight W_ij^(l) and each bias b_i^(l). Once we can compute these, we will see that the partial derivatives of the overall loss function J(W,b) can be calculated by the following formulas:
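From the UFLDL notes, these overall partial derivatives are:

    \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) = \left[ \frac{1}{m} \sum_{k=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(k)}, y^{(k)}) \right] + \lambda W_{ij}^{(l)}
    \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) = \frac{1}{m} \sum_{k=1}^{m} \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(k)}, y^{(k)})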

The above two formulas differ slightly because weight decay is applied to W but not to b.

The intuition behind the backpropagation algorithm is as follows. Given a training sample (x,y), we first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis hW,b(x). Then, for each node i in layer l, we compute an "error term" δ_i^(l) that measures how much that node was responsible for any errors in the output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define δ_i^(nl) (where layer nl is the output layer). For hidden units, we compute δ_i^(l) based on a weighted average of the error terms of the nodes that use a_i^(l) as an input. In detail, here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers L2, L3, and so on up to the output layer L_{n_l}.

2. For each output unit i in layer n_l (the output layer), set

    \delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2 = -\left( y_i - a_i^{(n_l)} \right) \cdot f'\left( z_i^{(n_l)} \right)

3. For l = n_l - 1, n_l - 2, ..., 2, and for each node i in layer l, set

    \delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'\left( z_i^{(l)} \right)

4. Compute the desired partial derivatives, which are given as:

    \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}
    \frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}
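As a concrete illustration of these four steps, here is a minimal, self-contained sketch for a tiny network with one hidden layer and sigmoid activations (plain illustrative C++, not Caffe code; for the sigmoid, f'(z) can be computed from the activation a as a(1 - a)):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;  // row-major: Mat[i] is row i

    static double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    // Step 1 for one layer: a = f(W x + b).
    static Vec forward(const Mat& W, const Vec& b, const Vec& x) {
      Vec a(W.size());
      for (std::size_t i = 0; i < W.size(); ++i) {
        double z = b[i];
        for (std::size_t j = 0; j < x.size(); ++j) z += W[i][j] * x[j];
        a[i] = sigmoid(z);
      }
      return a;
    }

    int main() {
      Mat W1 = {{0.1, 0.2}, {-0.3, 0.4}};  Vec b1 = {0.0, 0.0};
      Mat W2 = {{0.5, -0.5}};              Vec b2 = {0.0};
      Vec x = {1.0, 2.0}, y = {1.0};

      // Step 1: feedforward pass.
      Vec a2 = forward(W1, b1, x);   // hidden activations
      Vec a3 = forward(W2, b2, a2);  // output activations

      // Step 2: output error term, delta3_i = -(y_i - a3_i) * f'(z3_i).
      Vec delta3(a3.size());
      for (std::size_t i = 0; i < a3.size(); ++i)
        delta3[i] = -(y[i] - a3[i]) * a3[i] * (1.0 - a3[i]);

      // Step 3: hidden error term, delta2_i = (sum_j W2_ji * delta3_j) * f'(z2_i).
      Vec delta2(a2.size());
      for (std::size_t i = 0; i < a2.size(); ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < delta3.size(); ++j) s += W2[j][i] * delta3[j];
        delta2[i] = s * a2[i] * (1.0 - a2[i]);
      }

      // Step 4: gradients, dJ/dW2_ij = a2_j * delta3_i and dJ/db2_i = delta3_i
      // (the gradients for W1 and b1 follow the same pattern with delta2 and x).
      Mat gradW2(delta3.size(), Vec(a2.size()));
      for (std::size_t i = 0; i < delta3.size(); ++i)
        for (std::size_t j = 0; j < a2.size(); ++j)
          gradW2[i][j] = delta3[i] * a2[j];
      return 0;
    }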

We take the fully connected layer as an example to explain how the backpropagation algorithm is implemented.

The following describes the backpropagation implementation of Caffe's InnerProduct layer and convolution layer.

Implementation of Caffe's InnerProduct layer backpropagation
We know that this layer's blobs_[0] and blobs_[1] store the weight and bias, respectively. If bias_term_ is set, bias_multiplier_ is filled entirely with 1s.

The formula for forward propagation is as follows:
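In matrix form, with bottom data X of shape M_ x K_ (batch size x input dimension), weight W of shape N_ x K_, and bias b of length N_, the forward pass computes:

    Y = X W^T + \mathbf{1} b^T

A sketch of the corresponding Forward_cpu body, paraphrased from Caffe's inner_product_layer.cpp (assuming the default, non-transposed weight layout; M_, N_, K_, bias_term_, and bias_multiplier_ are members of the layer):

    const Dtype* bottom_data = bottom[0]->cpu_data();
    const Dtype* weight = this->blobs_[0]->cpu_data();
    Dtype* top_data = top[0]->mutable_cpu_data();
    // Y = X * W^T : (M_ x K_) * (K_ x N_) -> (M_ x N_)
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, M_, N_, K_,
        (Dtype)1., bottom_data, weight, (Dtype)0., top_data);
    if (bias_term_) {
      // Y += 1 * b^T : broadcast the bias over the batch.
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, 1,
          (Dtype)1., bias_multiplier_.cpu_data(),
          this->blobs_[1]->cpu_data(), (Dtype)1., top_data);
    }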

  1. The backward pass mainly computes the derivatives of the weight and bias (weight_diff and bias_diff), and computes bottom_diff for the layer below.

It is calculated as follows:

  1. Calculate weight_diff and bias_diff

This should correspond to the following formulas:
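Reconstructed in matrix form (the matrix products below also perform the summation over the M_ samples of the batch, and the results are accumulated into the existing diffs):

    weight_diff += top_diff^T * bottom_data        // (N_ x M_) * (M_ x K_) -> (N_ x K_)
    bias_diff   += top_diff^T * bias_multiplier_   // i.e., top_diff summed over the batch

A sketch of the corresponding code, paraphrased from Backward_cpu in inner_product_layer.cpp:

    // Gradient with respect to the weight.
    caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_,
        (Dtype)1., top_diff, bottom_data,
        (Dtype)1., this->blobs_[0]->mutable_cpu_diff());
    // Gradient with respect to the bias.
    caffe_cpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
        bias_multiplier_.cpu_data(),
        (Dtype)1., this->blobs_[1]->mutable_cpu_diff());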

  2. Based on top_diff and the layer's weight data (the weights themselves, not the weight_diff just computed), we then compute the bottom_diff of the InnerProduct layer.

This corresponds to the following formula:
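Reconstructed in matrix form:

    bottom_diff = top_diff * W        // (M_ x N_) * (N_ x K_) -> (M_ x K_)

Note that this is only the weighted-sum part of step 3 of the algorithm above; the f'(z) factor is applied by the activation layer (e.g. Sigmoid or ReLU) during its own backward pass, since activations are standalone layers in Caffe. A sketch of the corresponding code, paraphrased from Backward_cpu:

    // Gradient with respect to the bottom data.
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_,
        (Dtype)1., top_diff, this->blobs_[0]->cpu_data(),
        (Dtype)0., bottom[0]->mutable_cpu_diff());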

Updating weight_data and bias_data: after forward and backward propagation are complete, SGDSolver's ApplyUpdate operation performs the parameter update in the following steps.

First, normalize the accumulated diff based on the solver's iter_size. If iter_size is 1 (the default), no processing is performed.
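Reconstructed, this step is simply:

    diff := diff / iter_size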
Next, regularize the diff, based on the weight_decay configured in the solver prototxt, on the per-parameter params_weight_decay (the decay_mult fields in the layer's param blocks in train.prototxt, in the order weight then bias), and on the regularization_type configured in the solver prototxt (L2 by default). The formula is as follows:
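Reconstructed for the default L2 case (an axpy in the code, with local_decay = weight_decay * decay_mult):

    diff := diff + local_decay * data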

Here, diff denotes the derivative of the weight or bias, and data denotes the current value of the weight or bias.

Then compute the update value, based on the learning rate produced by the solver prototxt's lr policy, on the per-parameter params_lr (the lr_mult fields in the layer's param blocks in train.prototxt, in the order weight then bias), and on the momentum configured in the solver prototxt. The formula is as follows:
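Reconstructed, the effective per-parameter learning rate is:

    local_rate = rate * lr_mult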

The formula corresponding to SGD is:
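Reconstructed from SGDSolver's ComputeUpdateValue (history is the momentum buffer; the result is copied back into diff):

    history := momentum * history + local_rate * diff
    diff    := history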

Finally, SGDSolver calls net_->Update() to update the weight and bias blobs, which in turn calls each blob's Update function. The formula is as follows:
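Reconstructed (Blob::Update performs an axpy with coefficient -1):

    data := data - diff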

The formula corresponding to SGD is:
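Combining the two previous steps, and using the code's sign convention (Caffe's documentation writes the equivalent update with V negated):

    V_{t+1} = \mu V_t + \alpha \nabla L(W_t)
    W_{t+1} = W_t - V_{t+1}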

End.

Origin: blog.csdn.net/sunlin972913894/article/details/106273479