Convolutional Neural Networks

Convolutional Neural Networks (CNNs / ConvNets)

Convolutional neural networks are very similar to ordinary neural networks: they are made up of neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally applies a nonlinearity to produce its output. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores on the other. Convolutional networks also still have a loss function (such as SVM/Softmax) on the last (fully-connected) layer, and all the tips and tricks we developed for regular neural networks still apply.

So how are convolutional neural networks different? ConvNet architectures make the explicit assumption that the input is an image, which allows us to encode certain properties into the network architecture. This makes the forward function more efficient to implement and vastly reduces the number of parameters in the network.

Architecture Overview

Review: a regular neural network receives an input (a single vector) and transforms it through a series of hidden layers. Each hidden layer consists of a set of neurons that are fully connected to all neurons in the previous layer, while neurons within a single layer function completely independently and share no connections.

Regular neural networks do not scale well to full images. Taking CIFAR-10 as an example, images are only of size \(32\times32\times3\) (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in the first hidden layer would have \(32\times32\times3=3072\) weights. This amount still seems manageable, but clearly the fully-connected structure cannot scale to larger images. For example, an image of a more respectable size, say \(200\times200\times3\), would lead to 120,000 weights per neuron, and since we would want many such neurons, the parameter count adds up very quickly. Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.

3D volumes of neurons. Convolutional neural networks take full advantage of the fact that the input consists of images, and they constrain the architecture in a more sensible way. Unlike a regular neural network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, and depth. For example, an input image in CIFAR-10 forms an input volume of activations with dimensions \(32\times32\times3\). The neurons in a layer are connected only to a small region of the layer before it, instead of being fully connected. The final output volume for CIFAR-10 has dimensions \(1\times1\times10\), because the ConvNet reduces the full image to a single vector of class scores arranged along the depth dimension. Below is a visualization:


Above: a regular 3-layer neural network.
Below: a ConvNet arranges its neurons in 3 dimensions (width, height, depth), as visualized in one of the layers. Every layer of the ConvNet transforms a 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height equal the dimensions of the image, and its depth is 3 (the RGB channels).

A convolutional neural network consists of many layers, and each layer has a simple API: it transforms an input 3D volume into an output 3D volume via a differentiable function that may or may not have parameters.

Layers used to build ConvNets

As mentioned above, a simple convolutional neural network is a sequence of layers, and each layer transforms one 3D volume into another through a differentiable function. We use three main types of layers to build ConvNet architectures: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer. We build a ConvNet architecture by stacking these layers.

For example, a simple convolutional neural network for the CIFAR-10 classification task could have the architecture
\[\text{INPUT}\to\text{CONV}\to\text{RELU}\to\text{POOL}\to\text{FC} \]

  • INPUT [ \(32\times32\times3\) ]: the input layer holds the raw pixel values of the image, in this case an image of width 32, height 32, and 3 color channels
  • The CONV layer computes the output of neurons that are connected to local regions of the input; each neuron computes a dot product between its weights and the small region of the input volume it is connected to. If we use 12 filters (each of size \(1\times1\) in this toy example), the output volume is [ \(32\times32\times12\) ]
  • The RELU layer applies an elementwise activation function, such as \(\max(0,x)\), thresholding at zero. This leaves the size of the volume unchanged, still [ \(32\times32\times12\) ]
  • The POOL layer performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume of [ \(16\times16\times12\) ]
  • The FC (fully-connected) layer computes the class scores, resulting in a volume of size [ \(1\times1\times10\) ]. As the name implies, each neuron in a fully-connected layer is connected to all the neurons of the previous layer, just as in a regular neural network

In this way, ConvNets transform the original image layer by layer from raw pixel values to the final class scores. Note that some layers contain parameters while others do not. The parameters of the CONV and FC layers are trained with gradient descent so that the class scores the ConvNet computes are consistent with the labels of the training set.
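
To make the pipeline concrete, here is a minimal sketch of this exact INPUT → CONV → RELU → POOL → FC example. It assumes PyTorch is available (the source does not prescribe a framework); the filter count (12) and filter size (1×1) follow the toy example above, so this is illustrative, not a tuned model.

```python
import torch
import torch.nn as nn

# Sketch of the INPUT -> CONV -> RELU -> POOL -> FC example (assumes PyTorch).
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=12, kernel_size=1),  # [32x32x3] -> [32x32x12]
    nn.ReLU(),                                                 # elementwise, shape unchanged
    nn.MaxPool2d(kernel_size=2, stride=2),                     # [32x32x12] -> [16x16x12]
    nn.Flatten(),                                              # -> 3072-dim vector
    nn.Linear(16 * 16 * 12, 10),                               # class scores, i.e. [1x1x10]
)

x = torch.randn(1, 3, 32, 32)   # one CIFAR-10-sized image, in (N, C, H, W) layout
print(model(x).shape)           # torch.Size([1, 10])
```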


We now describe each individual layer in detail: its hyperparameters and its connectivity.

Convolutional Layer

The convolutional layer is the core building block of a convolutional network, and it does most of the computational heavy lifting.
The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height) but extends through the full depth of the input volume; for example, a typical filter on the first layer of a ConvNet has size \(5\times5\times3\).

During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute the dot product between the filter and the input at every position. As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blob of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Each CONV layer has an entire set of filters, and each filter produces a separate 2-dimensional activation map. We stack these activation maps along the depth dimension to produce the output volume.
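
The following NumPy sketch shows this computation for a single filter, assuming stride 1 and no zero-padding; the input and filter values are random placeholders. It is written for clarity, not efficiency.

```python
import numpy as np

def conv_forward_one_filter(x, w, b):
    """Slide one filter w of shape (F, F, D) over input volume x of shape
    (H, W, D), computing a dot product at each spatial position
    (stride 1, no zero-padding). Returns one 2-D activation map."""
    H, W, D = x.shape
    F = w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+F, j:j+F, :] * w) + b
    return out

x = np.random.randn(32, 32, 3)   # e.g. a CIFAR-10-sized input volume
w = np.random.randn(5, 5, 3)     # the classic 5x5x3 first-layer filter
print(conv_forward_one_filter(x, w, 0.0).shape)  # (28, 28): one activation map;
                                                 # stacking the maps of many filters
                                                 # along depth gives the output volume
```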

Local connectivity. When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is worth emphasizing this asymmetry again: the connections are local in the spatial dimensions (width and height), but always extend along the entire depth of the input volume.


Above: the red volume is the input, and the blue volume is an example first convolutional layer. Each neuron in the convolutional layer is connected only to a local region of the input volume spatially, but to its full depth (i.e., all color channels).
Below: the neurons themselves are unchanged: they still compute a dot product between their weights and the input, followed by a nonlinear activation function \(f(x)\); only their connectivity is now restricted to a local region.

Spatial arrangement. We have discussed the connectivity of each neuron in the convolutional layer to the input volume, but not how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding.

  1. First, the depth of the output volume is a hyperparameter: it equals the number of filters we use. We refer to a set of neurons that all look at the same region of the input as a depth column.
  2. Second, we must specify the stride with which we slide the filter. If the stride is 1, we move the filter one pixel at a time; if the stride is 2, the filter jumps two pixels at a time, and so on. Strides greater than 3 are rarely used in practice. A larger stride produces a smaller output volume.
  3. Third, it is sometimes convenient to pad the input volume with zeros around the border. The amount of zero-padding is also a hyperparameter; its nice feature is that it lets us control the spatial size of the output volume.

Given the receptive field size \(F\) of the convolutional layer, the stride \(S\) of the filter, the amount of zero-padding \(P\) on the border, and the input volume size \(W\), we can compute the spatial size of the output volume as
\[(W-F+2P)/S+1\]
For example, for a \(7\times7\) input and a \(3\times3\) filter with no zero-padding, strides of 1 and 2 give outputs of size \(5\times5\) and \(3\times3\) respectively. The figure below shows a 1-dimensional version of this computation, with an input of size 5 and zero-padding of 1.

The weights shared between the neurons are [1, 0, -1], and the bias is 0.
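
A tiny NumPy sketch that reproduces this 1-D setting (input of size 5, shared weights [1, 0, -1], bias 0, zero-padding 1); the input values here are made up for illustration.

```python
import numpy as np

def conv1d(x, w, b, stride, pad):
    """1-D convolution with zero-padding; output size is (W - F + 2P)/S + 1."""
    xp = np.pad(x, pad)                 # zeros on both borders
    F = len(w)
    out_size = (len(x) - F + 2 * pad) // stride + 1
    return np.array([xp[i * stride : i * stride + F] @ w + b
                     for i in range(out_size)])

x = np.array([0, 1, 2, -1, 1])   # hypothetical input of size W = 5
w = np.array([1, 0, -1])         # the shared weights from the figure
print(conv1d(x, w, 0, stride=1, pad=1))  # 5 outputs: (5 - 3 + 2)/1 + 1 = 5
print(conv1d(x, w, 0, stride=2, pad=1))  # 3 outputs: (5 - 3 + 2)/2 + 1 = 3
```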

Use of zero-padding. In the example on the left of the figure above, the input dimension is 5 and the output dimension is also 5; if zero-padding were not used, the spatial dimension of the output volume would be only 3. In general, when the stride is \(S=1\), setting the zero-padding to \(P=\frac{F-1}{2}\) makes the input and output volumes have the same spatial size. It is very common to use zero-padding this way; we will discuss the full reasons when we talk about ConvNet architectures.

Constraints on strides. Note again that the spatial-arrangement hyperparameters have mutual constraints. For example, when the input size is \(W=10\) with no zero-padding (\(P=0\)) and a filter size of \(F=3\), a stride of \(S=2\) is impossible, since \((W-F+2P)/S+1 = 4.5\). Not being an integer indicates that the neurons cannot "fit" neatly and symmetrically across the input. This setting of the hyperparameters is therefore invalid, and a ConvNet library could throw an exception, zero-pad the rest to make it fit, or crop the input to make it fit. Sizing ConvNets so that all the dimensions "work out" can be a real headache; using zero-padding and some design guidelines significantly alleviates it.
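
A small helper in plain Python that applies the formula above and rejects hyperparameter settings that do not fit:

```python
def conv_output_size(W, F, S, P):
    """Spatial output size (W - F + 2P)/S + 1; raises if the stride doesn't fit."""
    numer = W - F + 2 * P
    if numer % S != 0:
        raise ValueError(f"invalid hyperparameters: ({W} - {F} + 2*{P})/{S} + 1 "
                         "is not an integer")
    return numer // S + 1

print(conv_output_size(W=10, F=3, S=1, P=0))   # 8: stride 1 fits
# conv_output_size(W=10, F=3, S=2, P=0)        # raises: 7/2 + 1 = 4.5
```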

Parameter sharing. The convolutional layer uses a parameter sharing scheme to control the number of parameters. It turns out that we can dramatically reduce the parameter count by making one reasonable assumption:

If computing a feature at some spatial position \((x,y)\) is useful, then it should also be useful to compute it at a different position \((x_2,y_2)\).
With parameter sharing, the number of parameters shrinks dramatically. During backpropagation, every neuron in the volume computes the gradient for its weights, but these gradients are summed across each depth slice, and only a single set of weights per slice is updated.
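
As a concrete illustration (using the well-known AlexNet first-layer numbers: an output volume of 55×55×96 computed with 11×11×3 filters, which are not stated in the text above), the arithmetic below compares the parameter counts with and without sharing:

```python
# Parameter count of an AlexNet-style first CONV layer, with and without sharing.
neurons = 55 * 55 * 96             # 290,400 neurons in the output volume
weights_per_neuron = 11 * 11 * 3   # 363 weights (plus 1 bias) per neuron

no_sharing = neurons * (weights_per_neuron + 1)   # 105,705,600 parameters
with_sharing = 96 * weights_per_neuron + 96       # one filter per depth slice: 34,944

print(no_sharing, with_sharing)
```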

Convolutional Neural Network Example


Pooling Layer

It is common to periodically insert a pooling layer between successive convolutional layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, to reduce the number of parameters and the amount of computation in the network, and hence to also control overfitting. The pooling layer operates independently on every depth slice of the input using the MAX operation. The most common form uses filters of size 2×2 applied with a stride of 2, downsampling every depth slice of the input by 2 along both width and height and discarding 75% of the activations. Every MAX operation here takes a max over 4 numbers. The depth dimension remains unchanged.

  • Accepts an input volume of size \(W_1\times H_1\times D_1\)
  • Requires two hyperparameters:
    • the spatial extent of the filter \(F\)
    • the stride \(S\)
  • Produces an output volume of size \(W_2\times H_2\times D_2\), where (see the sketch after this list):
    • \(W_2 = (W_1 - F)/S + 1\)
    • \(H_2 = (H_1 - F)/S + 1\)
    • \(D_2 = D_1\)
  • Introduces zero parameters, since it computes a fixed function of the input
  • Note that it is relatively uncommon to use zero-padding in pooling layers
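
A minimal NumPy sketch of the common \(F=2, S=2\) case; it assumes the input height and width are divisible by 2 and pools a single depth slice:

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with F=2, S=2 on one depth slice of shape (H, W)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6 8]
                         #  [3 4]]  -- each output is the max over one 2x2 block
```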

It is worth noting that in practice only two variants of the max pooling layer are commonly seen:

  • Overlapping pooling: \(F=3, S=2\), where the pooling kernel is larger than the stride
  • Non-overlapping pooling: \(F=2, S=2\), which is more common
    Pooling with larger receptive fields is too destructive in practice.

Commonly used pooling operations. Besides max pooling, pooling units can perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has fallen out of favor compared to max pooling, mainly because max pooling works better in practice.


The pooling layer downsamples the volume spatially, independently in every depth slice of the input volume.
Above: in this example, an input volume of size [224×224×64] is pooled with filter size 2 and stride 2 into an output volume of size [112×112×64]. Note that the volume depth is preserved.
Below: the max pooling operation itself, here with a 2×2 filter and a stride of 2.

Getting rid of the pooling layer. Many people dislike the pooling operation and go out of their way to get rid of it. For example, the article Striving for Simplicity: The All Convolutional Net advocates discarding the pooling layer in favor of an architecture consisting only of convolutional layers, using a larger stride in the CONV layers to reduce the size of the representation. Discarding pooling layers has also been found important in training generative models that generalize well, such as variational autoencoders (VAEs) and generative adversarial networks (GANs).

ConvNet Architectures

Convolutional neural networks are usually built from three types of layers: convolutional layers, pooling layers (max pooling assumed unless stated otherwise), and fully-connected layers. We will also explicitly write RELU as a layer. In this section we discuss how these layers are commonly stacked together to form entire ConvNets.

Layer Patterns

The most common form of ConvNet architecture stacks a few CONV-RELU layers, follows them with a POOL layer, and repeats this pattern until the image has been spatially merged to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In summary, the most common ConvNet architectures follow the pattern
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

In the pattern above, * indicates repetition and POOL? indicates an optional pooling layer. Moreover, N >= 0 (usually N <= 3), M >= 0, and K >= 0 (usually K <= 3). Below are some common ConvNet architectures; you will find that they all satisfy this pattern:

  • INPUT -> FC implements a linear classifier. Here N = M = K = 0.
  • INPUT -> CONV -> RELU -> FC
  • INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here there is a single CONV layer between every POOL layer (a sketch of this architecture follows this list).
  • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here two CONV layers are stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
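
For example, the third architecture above could be sketched as follows, assuming PyTorch (the source does not prescribe a framework) and a CIFAR-10-sized input; the filter counts 16 and 32 and the hidden size 128 are arbitrary choices for illustration.

```python
import torch.nn as nn

# Sketch of INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC for a
# 32x32x3 input; hidden sizes are illustrative, not tuned.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # [32x32x3]  -> [32x32x16]
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                           # [32x32x16] -> [16x16x16]
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # [16x16x16] -> [16x16x32]
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                           # [16x16x32] -> [8x8x32]
    nn.Flatten(),
    nn.Linear(8 * 8 * 32, 128),                   # FC -> RELU
    nn.ReLU(),
    nn.Linear(128, 10),                           # final FC: class scores
)
```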

Prefer a stack of small-filter convolutional layers to one convolutional layer with a large receptive field. Suppose we stack three 3×3 CONV layers on top of each other (with nonlinearities between them, of course). In this arrangement, each neuron in the first CONV layer has a 3×3 view of the input volume; each neuron in the second CONV layer has a 3×3 view of the first layer, and hence a 5×5 view of the input volume; and each neuron in the third layer has a 7×7 view of the input. Suppose that instead of three 3×3 CONV layers we used a single CONV layer with a 7×7 receptive field. These neurons would have the same spatial extent of view of the input, but the single layer has several shortcomings:

  1. The single 7×7 layer computes a linear function over its input (with one nonlinearity on top), while the stack of three CONV layers interleaves three nonlinearities, making the extracted features more expressive.
  2. Assuming all volumes have \(C\) channels, the single 7×7 convolutional layer contains \(C\times(7\times7\times C) = 49C^2\) parameters, while the three 3×3 convolutional layers contain only \(3\times(C\times(3\times3\times C)) = 27C^2\) parameters (worked out in the snippet below). Intuitively, stacking CONV layers with small filters extracts more expressive features from the input with fewer parameters than a single CONV layer with large filters. One practical disadvantage is that backpropagation may need more memory to hold the intermediate results of the extra convolutional layers.
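
The parameter counts above as straightforward arithmetic (biases ignored in both counts; C = 64 is an arbitrary choice for illustration):

```python
C = 64                                # channels of every volume (arbitrary example)
single_7x7 = C * (7 * 7 * C)          # 49*C^2 = 200,704 weights
three_3x3 = 3 * (C * (3 * 3 * C))     # 27*C^2 = 110,592 weights

print(single_7x7, three_3x3)          # same 7x7 effective view, ~45% fewer weights
```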

Recent departures.
It should be noted that the conventional paradigm of a linear stack of layers has recently been challenged several times, notably by Google's Inception architecture and by the residual networks (ResNets) from Microsoft Research (currently state of the art). Both feature more intricate and different connectivity structures.

Layer Sizing

Until now we have not mentioned common settings for the hyperparameters in a ConvNet. Here are some rules of thumb for sizing the layers.

Input layer (containing the image)

The size of the input layer should be divisible by 2 many times. Common numbers include 32 ( CIFAR-10 ), 64, 96 ( STL-10 ), 224 (common in ImageNet ConvNets), 384, and 512.

Convolutional layer

Convolutional layers should use small filters (3×3, or at most 5×5) with a stride of 1. Crucially, the input volume should be zero-padded so that the convolutional layer does not alter the spatial dimensions of the input. That is, for \(F=3\), \(P=1\) retains the original size of the input; likewise \(F=5\) needs \(P=2\). In general, \(P=(F-1)/2\) preserves the input size. If larger filters (such as 7×7) must be used, it is common to see them only on the very first convolutional layer that looks at the input image.
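
A quick check of the \(P=(F-1)/2\) rule with the output-size formula from earlier:

```python
W = 32                         # arbitrary input size
for F in (3, 5, 7):
    P = (F - 1) // 2           # "same" padding for stride 1
    out = (W - F + 2 * P) + 1  # (W - F + 2P)/S + 1 with S = 1
    print(F, P, out)           # output stays 32 in every case
```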

Pooling layer

The pooling layer is responsible for downsampling the spatial dimensions of the input. The most common setting is max pooling with a 2×2 receptive field and a stride of 2, which discards exactly 75% of the activations of the previous layer. A slightly less common setting is a 3×3 receptive field with a stride of 2. Receptive fields larger than 3 are very rare, because pooling is then too lossy and aggressive, which usually degrades performance.

In the scheme described above, the convolutional layers preserve the spatial dimensions of their input, while the pooling layers alone are responsible for downsampling them. In an alternative scheme, the convolutional layers use strides greater than 1 and no zero-padding; in that case we must carefully track the input volume sizes throughout the whole network and make sure all the strides and filters "work out".

Why use a stride of 1 in the convolutional layers?
Smaller strides work better in practice. Moreover, a stride of 1 lets us leave all spatial downsampling to the pooling layers, with the convolutional layers only transforming the input volume in the depth direction.

Compromising based on memory constraints
In some cases (especially in the early layers of a convolutional network), the amount of memory builds up very quickly under the rules described above. For example, applying three 3×3 convolutional layers (64 filters each, with zero-padding) to a [224×224×3] image creates about 10,000,000 activations. Since memory is often the bottleneck on GPUs, a compromise may be necessary. In practice, people usually compromise only at the first convolutional layer: for example, ZF Net uses a 7×7 filter with a stride of 2 in its first layer, and AlexNet uses an 11×11 filter with a stride of 4.
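
The arithmetic behind the example above (assuming 4-byte float32 activations):

```python
# Three zero-padded 3x3 CONV layers with 64 filters each on a [224x224x3] image:
acts_per_layer = 224 * 224 * 64        # 3,211,264 activations per volume
total_acts = 3 * acts_per_layer        # 9,633,792: about 10 million

print(total_acts, total_acts * 4)      # ~38 MB of float32 for the forward
                                       # activations alone (backprop roughly doubles it)
```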

Why use padding?
Besides keeping the spatial sizes constant after CONV as discussed above, padding preserves information at the borders. If the CONV layers did not zero-pad their inputs and performed only "valid" convolutions, the size of the volume would shrink by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.
