Deep Learning Paper Notes (4): Derivation and Implementation of CNNs (Convolutional Neural Networks)

https://blog.csdn.net/zouxy09/article/details/9993371

   I often read papers, but I find that after finishing one I slowly forget it, and when I pick it up again later it feels as if I had never read it at all. So I have made a habit of summarizing and organizing the knowledge points I find useful in the papers I read. On the one hand, the process of organizing deepens my understanding; on the other hand, it makes things easier to look up later. Better still, posting the notes on a blog lets me discuss them with others. Since my background is limited, some of my understanding of the paper may be incorrect; I hope readers will point out mistakes and share their thoughts. Thank you.

 

       The notes in this article are based on:

Notes on Convolutional Neural Networks, Jake Bouvrie.

         This paper is essentially a set of notes on the derivation and implementation of CNNs. Before reading these notes, it is best to have some background on CNNs. Here is a list of reference material:

[1] Deep Learning study notes series (7)

[2] LeNet-5, convolutional neural networks

[3] Convolutional Neural Network

[4] Neural Network for Recognition of Handwritten Digits

[5] Deep Learning: 38 (A brief introduction to Stacked CNNs)

[6] Gradient-based learning applied to document recognition.

[7] ImageNet classification with deep convolutional neural networks.

[8] "Convolutional Feature Extraction" and "Pooling" in UFLDL.

        In addition, there is a Deep Learning toolbox for MATLAB that contains CNN code. In the next blog post I will annotate that code in detail; these notes are very helpful for understanding it.

         What follows is my understanding of some of the key points:

 

"Notes on Convolutional Neural Networks"

1. Introduction

         This document discusses the derivation and implementation of CNNs. A CNN architecture has far more connections than weights, which in itself implies a form of regularization. This kind of network assumes that we want to learn a set of filters in a data-driven way as a means of extracting features from the input.

         In this paper, we first describe the classical BP algorithm for training fully connected networks, and then derive the BP weight updates for the convolutional and subsampling layers of a 2D CNN. Throughout the derivation we emphasize implementation efficiency, so some MATLAB code is given. Finally, we discuss how to automatically learn combinations of feature maps from the previous layer, and in particular how to learn sparse combinations.

 

2. The Fully Connected Back-Propagation Algorithm

         In a typical CNN, the first few layers alternate between convolution and downsampling, and the last layers (closer to the output) are fully connected one-dimensional networks. At that point all of the 2D feature maps have been transformed into the input of a fully connected 1D network. So, when you are ready to feed the final 2D feature maps into the 1D network, a very convenient approach is to concatenate all the output feature maps into one long input vector. Then we return to the BP algorithm. (For a more detailed derivation of the basics, refer to the "Backpropagation Algorithm" section of UFLDL.)
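As a small illustration (a sketch only, with made-up sizes; this is not the toolbox's code), flattening the final feature maps into one long vector can be done in MATLAB with reshape:

```matlab
% Minimal sketch: concatenate the final 2D feature maps into one long
% column vector that becomes the input of the fully connected 1D network.
% The sizes (twelve 4x4 maps) are made up for illustration.
maps = rand(4, 4, 12);        % final output feature maps, stacked along dim 3
fv   = reshape(maps, [], 1);  % 192x1 feature vector fed to the 1D network
```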

2.1 Feedforward Pass

         In the derivation below, we use the squared-error cost function, and we consider a multi-class problem with c classes and N training samples. The error over the whole training set is:

$$E^{N} = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^n - y_k^n\right)^2$$

         Here $t_k^n$ is the $k$th dimension of the label of the $n$th sample, and $y_k^n$ is the $k$th output of the network for the $n$th sample. For multi-class problems the target is usually organized in "one-of-c" form: only the output node of the class corresponding to the input is positive, while the nodes of the other classes are 0 or negative, depending on the activation function of the output layer (0 for the sigmoid, -1 for tanh).

         Because the error over the whole training set is just the sum of the errors of the individual training samples, we first consider BP for a single sample. The error of the $n$th sample is:

$$E^{n} = \frac{1}{2}\sum_{k=1}^{c}\left(t_k^n - y_k^n\right)^2$$

       In a traditional fully connected neural network, we need to compute the partial derivative of the cost function $E$ with respect to every weight of the network according to the BP rule. Using $\ell$ to denote the current layer, the output of the current layer can be written as:

$$x^{\ell} = f\left(u^{\ell}\right), \qquad u^{\ell} = W^{\ell} x^{\ell-1} + b^{\ell}$$

       There are many choices for the output activation function $f(\cdot)$; it is usually the sigmoid or the hyperbolic tangent. The sigmoid squashes its output into [0, 1], so the final outputs tend on average toward 0. If we normalize the training data to zero mean and unit variance, we can improve convergence during gradient descent. For normalized data sets, the hyperbolic tangent is also a good choice.

 

2.2 Backpropagation Pass

         The back-propagated error can be seen as the "sensitivity" of each neuron with respect to its bias (the sensitivity measures how much the error changes when the bias b changes, i.e. the rate of change of the error with respect to the bias, which is just a derivative). It is defined as follows (the second equality follows from the chain rule):

$$\delta = \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}\,\frac{\partial u}{\partial b}$$

         Because ∂u/∂b = 1, we have ∂E/∂b = ∂E/∂u = δ; that is, the sensitivity with respect to the bias, ∂E/∂b, equals the derivative of the error with respect to a node's total input u, ∂E/∂u. This derivative is what lets us propagate the error from higher layers back down to the bottom. Backpropagation uses the following recurrence, which gives the sensitivities of layer $\ell$:

$$\delta^{\ell} = \left(W^{\ell+1}\right)^{T} \delta^{\ell+1} \circ f'\left(u^{\ell}\right) \qquad (1)$$

         Here "◦" denotes element-wise multiplication. The sensitivities of the neurons in the output layer take a different form:

$$\delta^{L} = f'\left(u^{L}\right) \circ \left(y^{n} - t^{n}\right)$$

         Finally, each neuron's weights are updated using the delta (i.e. δ) rule. Specifically, for a given neuron we take its input and scale it by the neuron's delta. In vector form, for layer $\ell$, the derivative of the error with respect to the layer's weights (arranged as a matrix) is the outer product of the layer's input (equal to the previous layer's output) and the layer's sensitivities (the deltas of the layer's neurons arranged as a vector). Multiplying this partial derivative by the negative learning rate gives the update of the layer's weights:

$$\frac{\partial E}{\partial W^{\ell}} = x^{\ell-1}\left(\delta^{\ell}\right)^{T}, \qquad \Delta W^{\ell} = -\eta\,\frac{\partial E}{\partial W^{\ell}} \qquad (2)$$

         The update expression for the bias is analogous. In practice, each individual weight $(W)_{ij}$ can even have its own learning rate $\eta_{ij}$.
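To make the two formulas above concrete, here is a minimal MATLAB sketch of one BP step for a single fully connected layer with a sigmoid activation. All names and sizes are made up for illustration; this is not the toolbox code, and the weight-gradient line is written for the convention u = W*x (i.e. it is formula (2) up to a transpose).

```matlab
% One BP step for a fully connected layer l (sketch with made-up sizes).
x_in   = rand(4, 1);                     % x^(l-1): input to layer l
W      = 0.1 * randn(3, 4);              % weights of layer l (3 units)
b      = zeros(3, 1);                    % biases of layer l
W_next = 0.1 * randn(2, 3);              % weights of layer l+1
d_next = randn(2, 1);                    % sensitivities delta^(l+1)
eta    = 0.1;                            % learning rate

u  = W * x_in + b;                       % u^l = W^l x^(l-1) + b^l
x  = 1 ./ (1 + exp(-u));                 % x^l = f(u^l), sigmoid
fp = x .* (1 - x);                       % f'(u^l)
d  = (W_next' * d_next) .* fp;           % formula (1): sensitivities of layer l
dW = d * x_in';                          % error gradient w.r.t. W^l
db = d;                                  % error gradient w.r.t. b^l
W  = W - eta * dW;                       % formula (2): gradient-descent update
b  = b - eta * db;
```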

 

3. Convolutional Neural Networks

3.1 Convolution Layers

         We now turn to the BP updates for the convolutional layers of the network. In a convolutional layer, the feature maps of the previous layer are convolved with learnable kernels and passed through the activation function to form the output feature maps. Each output map may combine convolutions with several input maps:

$$x_j^{\ell} = f\!\left(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\right)$$

       Here $M_j$ denotes the set of selected input maps. Which input maps should be selected? Common choices are pairs or triplets of maps, but below we discuss how to learn the combination automatically. Each output map is given one additive bias $b$; however, for a particular output map, the kernel applied to each input map is different. That is, if output map $j$ and output map $k$ are both obtained by convolving and summing over input map $i$, the corresponding kernels are nevertheless distinct.
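The forward pass just described can be sketched in MATLAB as follows (a sketch only, with made-up map sizes, a made-up selection set Mj, and a sigmoid activation; it is not the toolbox implementation):

```matlab
% Forward pass of one convolutional output map j (sketch).
X  = rand(28, 28, 3);                    % input maps x_i^(l-1)
K  = randn(5, 5, 3);                     % one 5x5 kernel k_ij per input map
bj = 0;                                  % additive bias of output map j
Mj = [1 3];                              % indices of the selected input maps

z = zeros(24, 24);                       % 'valid' convolution: 28 - 5 + 1 = 24
for i = Mj
    z = z + conv2(X(:, :, i), K(:, :, i), 'valid');
end
xj = 1 ./ (1 + exp(-(z + bj)));          % x_j^l = f( sum of convolutions + b_j )
```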

3.1.1 Computing the Gradients

         We assume that each convolutional layer $\ell$ is followed by a downsampling layer $\ell+1$. For BP, as described above, to obtain the weight updates for the neurons of layer $\ell$ we first need the sensitivity $\delta$ of each node in layer $\ell$ (required by the weight-update formula (2)). To compute this sensitivity we sum over the sensitivities of the nodes in the next layer (the nodes of layer $\ell+1$ connected to the node of interest in layer $\ell$), multiplying each by the corresponding connection weight $W$ between the node in layer $\ell$ and the node in layer $\ell+1$. We then multiply by the derivative of the activation function $f$ evaluated at the node's input $u$ in layer $\ell$ (this is exactly the sensitivity backpropagation of formula (1)). In this way we obtain the sensitivity $\delta^{\ell}$ of every neuron in the current layer $\ell$.

      However, because of the downsampling, the sensitivity $\delta$ of a single pixel (neuron) of the subsampling layer corresponds to a whole block of pixels (the size of the sampling window) in the output map of the convolutional layer below it. Consequently, each node of a map in layer $\ell$ connects to only one node of the corresponding map in layer $\ell+1$.

     To compute the sensitivities of layer $\ell$ efficiently, we upsample the sensitivity map of the downsampling layer (each pixel of a feature map has its own sensitivity, so the sensitivities also form a map) so that this sensitivity map has the same size as the convolutional layer's map. We then multiply, element by element, the derivative of the activation of layer $\ell$'s map by the upsampled sensitivity map from layer $\ell+1$ (this is formula (1) again).

        In a downsampling layer, the weights of a map all share the same value $\beta$, which is a constant. So we only need to multiply the result of the previous step by $\beta$ to finish computing the sensitivity $\delta$ of layer $\ell$.

       We can repeat the same computation for every feature map $j$ of the convolutional layer, pairing it with the corresponding map of the subsampling layer (compare with formula (1)):

$$\delta_j^{\ell} = \beta_j^{\ell+1}\left(f'\!\left(u_j^{\ell}\right) \circ \mathrm{up}\!\left(\delta_j^{\ell+1}\right)\right)$$

        Here $\mathrm{up}(\cdot)$ denotes the upsampling operation. If the downsampling factor is $n$, it simply copies each pixel $n$ times horizontally and $n$ times vertically, restoring the original size. This function can be implemented with a Kronecker product:

$$\mathrm{up}(x) = x \otimes 1_{n \times n}$$
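For example (a minimal sketch, with a made-up 2x2 sensitivity map and a downsampling factor of n = 2), the up(.) operation is just a call to kron:

```matlab
% Upsample a sensitivity map by replicating each entry over an n-by-n block.
n        = 2;
delta    = [1 2; 3 4];                   % sensitivity map from layer l+1
up_delta = kron(delta, ones(n));         % 4x4: every value copied into a 2x2 block
```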

       At this point, for a given map we can compute its sensitivity map. The gradient of the additive bias is then obtained simply by summing over all entries of the layer-$\ell$ sensitivity map:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v}\left(\delta_j^{\ell}\right)_{uv} \qquad (3)$$

       Finally, the gradients of the kernel weights are computed with BP as in formula (2), except that here many connections share the same weight. For a given weight we therefore take the gradient contributed by every connection that uses that weight and sum them, just as we summed over the sensitivity map for the bias above:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \sum_{u,v}\left(\delta_j^{\ell}\right)_{uv}\left(p_i^{\ell-1}\right)_{uv}$$

       Here $(p_i^{\ell-1})_{uv}$ is the patch of the input map $x_i^{\ell-1}$ that is multiplied element-wise with the kernel $k_{ij}$ during the convolution; the value at position $(u, v)$ of the output map is the result of multiplying that patch element-wise with $k_{ij}$ and summing.

      At first glance it seems we would need to painstakingly keep track of which patch of the input map each pixel of the output map (and of the sensitivity map) corresponds to. In MATLAB, however, the formula above can be implemented in a single line using the convolution function conv2:

       We first rotate the sensitivity map $\delta$ so that a cross-correlation is computed instead of a convolution (in the mathematical definition of convolution the kernel is flipped before being passed to conv2, i.e. its rows and columns are reversed), and then rotate the result back, so that in the forward pass the kernel has the orientation we want.
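Under these conventions, the kernel gradient can be sketched in MATLAB as below (made-up sizes; rot90(A, 2) performs the 180-degree rotation; this is a sketch of the idea, not the toolbox code):

```matlab
% Gradient of a 5x5 kernel k_ij via conv2 (sketch).
xi      = rand(28, 28);                  % input map x_i^(l-1)
delta_j = randn(24, 24);                 % sensitivity map of output map j
dK = rot90(conv2(xi, rot90(delta_j, 2), 'valid'), 2);   % 5x5 kernel gradient
```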

 

3.2 Sub-sampling Layers

         A subsampling layer has N input maps and N output maps, but each output map is smaller:

$$x_j^{\ell} = f\!\left(\beta_j^{\ell}\,\mathrm{down}\!\left(x_j^{\ell-1}\right) + b_j^{\ell}\right)$$

        Here $\mathrm{down}(\cdot)$ is a downsampling function. A typical choice sums the pixels of each distinct $n \times n$ block of the input image, so the output image is smaller by a factor of $n$ in both dimensions. Each output map has its own multiplicative bias $\beta$ and additive bias $b$.

 

3.2.1 Computing the Gradients

         The hardest part here is computing the sensitivity maps. Once we have them, updating the only learnable parameters, the biases $\beta$ and $b$, is straightforward (formula (3)). If the subsampling layer is fully connected to the next convolutional layer, the sensitivity maps of the subsampling layer can be computed with BP.

         When we computed the kernel gradients, we had to find which patch of the input map corresponds to each pixel of the output map. Here, we have to find which patch of the current layer's sensitivity map corresponds to a given pixel of the next layer's sensitivity map, so that a $\delta$ recursion like formula (1) can be applied, i.e. so that the sensitivities can be back-propagated. In addition, the patch must be multiplied by the weights connecting it to the output pixel, which are exactly the (rotated) weights of the convolution kernel.

      As before, we rotate the kernel so that the convolution function performs a cross-correlation. We also have to handle the convolution boundaries, but this is easy in MATLAB: a 'full' convolution pads the missing input pixels with zeros.
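A minimal sketch of this step (made-up sizes, with the sigmoid derivative written out explicitly; not the toolbox code) is:

```matlab
% Back-propagate sensitivities from a conv layer into the subsampling layer below.
d_next = randn(10, 10);                  % sensitivity map delta^(l+1) of the conv layer
k_next = randn(5, 5);                    % kernel connecting this map to that layer
u_l    = randn(14, 14);                  % inputs u^l of the subsampling map
s      = 1 ./ (1 + exp(-u_l));           % sigmoid of u^l
d_l = (s .* (1 - s)) .* conv2(d_next, rot90(k_next, 2), 'full');   % 14x14 sensitivities
```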

      We can now compute the gradients for $b$ and $\beta$. The additive bias $b$ is handled exactly as in the convolutional layer above: simply sum all the elements of the sensitivity map:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v}\left(\delta_j^{\ell}\right)_{uv}$$

       The multiplicative bias $\beta$ involves the downsampled maps computed during forward propagation, so it is best to save those maps during the forward pass to avoid recomputing them during the backward pass. We define:

$$d_j^{\ell} = \mathrm{down}\!\left(x_j^{\ell-1}\right)$$

Then the gradient with respect to $\beta$ can be computed as:

$$\frac{\partial E}{\partial \beta_j} = \sum_{u,v}\left(\delta_j^{\ell} \circ d_j^{\ell}\right)_{uv}$$
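The two bias gradients of the subsampling layer then take one line each (a sketch with made-up maps; d_j stands for the saved down-sampled map):

```matlab
% Bias gradients of a subsampling map (sketch).
delta_j = randn(14, 14);                 % sensitivity map of this subsampling map
d_j     = rand(14, 14);                  % d_j^l = down(x_j^(l-1)), saved in the forward pass
db    = sum(delta_j(:));                 % additive bias gradient, formula (3)
dbeta = sum(sum(delta_j .* d_j));        % multiplicative bias gradient
```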

 

3.3 Learning Combinations of Feature Maps

         Often, an output map obtained by convolving several input maps and summing the results works better. In the literature, the input maps that are combined into each output map are usually chosen by hand. Here, however, we let the CNN learn these combinations during training, i.e. the network itself learns which input maps are best to use for each output map. Let $\alpha_{ij}$ denote the weight, or contribution, of the $i$th input map in forming the $j$th output map. Then the $j$th output map can be written as:

$$x_j^{\ell} = f\!\left(\sum_{i=1}^{N_{\mathrm{in}}} \alpha_{ij}\left(x_i^{\ell-1} * k_i^{\ell}\right) + b_j^{\ell}\right)$$

         subject to the constraints:

$$\sum_i \alpha_{ij} = 1, \qquad 0 \le \alpha_{ij} \le 1$$

These constraints can be enforced by expressing $\alpha_{ij}$ as a softmax over a set of unconstrained, hidden weights $c_{ij}$:

$$\alpha_{ij} = \frac{\exp\left(c_{ij}\right)}{\sum_k \exp\left(c_{kj}\right)}$$

(Because the softmax output is an exponential function of its argument, the rates of change of the two differ.)

         Because, for a fixed $j$, the group of weights $c_{ij}$ is independent of the groups for the other output maps, we drop the subscript $j$ for simplicity and consider the update of a single map; the other maps are updated in exactly the same way, only with a different map index $j$.

         The derivative of the softmax function is:

$$\frac{\partial \alpha_k}{\partial c_i} = \delta_{ki}\,\alpha_i - \alpha_i\,\alpha_k$$

        Here $\delta_{ki}$ is the Kronecker delta. The derivative of the error with respect to the layer-$\ell$ variable $\alpha_i$ is:

$$\frac{\partial E^n}{\partial \alpha_i} = \sum_{u,v}\left(\delta^{\ell} \circ \left(x_i^{\ell-1} * k_i^{\ell}\right)\right)_{uv}$$

         Finally, the partial derivative of the cost function with respect to the weights $c_i$ follows from the chain rule:

$$\frac{\partial E^n}{\partial c_i} = \sum_k \frac{\partial E^n}{\partial \alpha_k}\,\frac{\partial \alpha_k}{\partial c_i} = \alpha_i\left(\frac{\partial E^n}{\partial \alpha_i} - \sum_k \alpha_k\,\frac{\partial E^n}{\partial \alpha_k}\right)$$
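A minimal sketch of this softmax parameterization and of the chain-rule gradient with respect to the unconstrained weights c_i follows (all values made up; dE_dalpha stands for the per-map derivatives computed above):

```matlab
% Softmax combination weights and their gradient (sketch for one output map).
c     = randn(4, 1);                     % unconstrained weights c_i
alpha = exp(c) / sum(exp(c));            % alpha_i >= 0, sum(alpha) = 1
dE_dalpha = randn(4, 1);                 % dE/dalpha_i from the sensitivity maps
dE_dc = alpha .* (dE_dalpha - sum(alpha .* dE_dalpha));   % chain rule through the softmax
```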

 

3.3.1 Enforcing Sparse Combinations

         To encourage the weights $\alpha_i$ to be sparse, i.e. to make an output map connect to only some, rather than all, of the input maps, we add a sparsity penalty $\Omega(\alpha)$ to the overall cost function. For a single sample, the rewritten cost function is:

$$\tilde{E}^n = E^n + \lambda \sum_{i,j} \left|\alpha_{ij}\right|$$

We then find the contribution of this regularization term to the gradient of the weights $c_i$. The derivative of the regularization term $\Omega(\alpha)$ with respect to $\alpha_i$ is:

$$\frac{\partial \Omega}{\partial \alpha_i} = \lambda\,\mathrm{sign}\!\left(\alpha_i\right)$$

         Then, by the chain rule, the derivative with respect to $c_i$ is:

$$\frac{\partial \Omega}{\partial c_i} = \lambda\left(\left|\alpha_i\right| - \alpha_i \sum_k \left|\alpha_k\right|\right)$$

         Therefore, the final gradient with respect to the weights $c_i$ is:

$$\frac{\partial \tilde{E}^n}{\partial c_i} = \frac{\partial E^n}{\partial c_i} + \frac{\partial \Omega}{\partial c_i}$$

 

3.4 Making it Fast with MATLAB

        CNN training mainly alternates between convolutional layers and subsampling layers, and its main computational bottlenecks are:

1) Forward propagation process: downsample the maps of each convolutional layer;

2) Backpropagation process: up-sampling the sensitivity map of the high-level sub-sampling layer to match the size of the output maps of the underlying convolutional layer;

3) Applying the sigmoid and computing its derivative.

         For the first two problems, we consider how to use MATLAB's built-in image-processing functions to implement the up- and down-sampling operations. For upsampling, the imresize function works, but it carries a lot of overhead. A faster alternative is the Kronecker product function kron: taking the Kronecker product of an all-ones matrix with the matrix to be upsampled achieves the upsampling effect. For downsampling in the forward pass, imresize does not offer a way to sum the pixels of each n×n block while shrinking the image, so it is of no use here. A good, fast approach is to convolve the image with an all-ones kernel and then subsample the result with standard indexing. For example, if the downsampling window is 2×2, we convolve the image with a 2×2 kernel of ones, and then in the convolved image keep every second entry, y = x(1:2:end, 1:2:end); this yields a two-fold downsampling combined with the summation.
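A minimal sketch of this down-sampling trick for a 2x2 window (made-up input):

```matlab
% Sum each non-overlapping 2x2 block by convolving with ones(2) and indexing.
x = rand(28, 28);
s = conv2(x, ones(2), 'valid');          % every entry is the sum of a 2x2 block
y = s(1:2:end, 1:2:end);                 % keep every second entry: 14x14 block sums
```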

         For the third problem, some people assume that an inline definition of the sigmoid in MATLAB will be faster. In fact it is not: unlike C/C++, MATLAB's inline functions are even slower than ordinary function definitions. So it is better to write out the code that computes the sigmoid and its derivative directly where it is used.
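In other words, something like the following (a sketch) is preferred over an inline or anonymous function:

```matlab
% Compute the sigmoid and its derivative directly where they are needed.
u  = randn(24, 24);                      % pre-activations of a feature map
y  = 1 ./ (1 + exp(-u));                 % logistic sigmoid
dy = y .* (1 - y);                       % derivative, reusing the computed outputs
```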

 
