An easy-to-understand explanation of convolutional neural networks

What is a convolutional neural network? Why are they important?

Convolutional neural networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. Convolutional neural networks can successfully recognize faces, objects and traffic signs, and they provide vision for robots and self-driving cars.

Figure 1

In the figure above, the convolutional neural network recognizes scenes and suggests relevant labels such as "bridge", "train" and "tennis", while the figure below shows a convolutional neural network being used to recognize everyday objects, people and animals. Recently, convolutional neural networks have also shown good results on some natural language processing tasks (such as sentence classification).

Figure 2

Therefore, convolutional neural networks are an important tool for most machine learning practitioners today. However, understanding convolutional neural networks and learning to use them for the first time can sometimes be an intimidating experience. The main purpose of this blog post is to develop a basic understanding of how convolutional neural networks work on images.

If you are new to neural networks, I suggest you read this short tutorial on multi-layer perceptrons before reading further, so that you have a basic understanding of how neural networks work. In this blog post, multi-layer perceptrons are referred to as "fully connected layers".

LeNet architecture (1990s)

LeNet is one of the earliest convolutional neural networks and helped advance the field of deep learning. This pioneering work by Yann LeCun, named LeNet5, was the result of many successful iterations since 1988. At that time, the LeNet architecture was mainly used for character recognition tasks, such as reading zip codes and digits.

Next, we will look at how the LeNet architecture learns to recognize images. Many new architectures improving on LeNet have been proposed in recent years, but they all use the main concepts from LeNet and are relatively easy to understand if you have a clear grasp of LeNet.

Figure 3

The structure of the convolutional neural network in the picture above is similar to that of the original LeNet. It can classify the input image into four categories: dog, cat, boat or bird (the original LeNet is mainly used for character recognition tasks). As the figure above illustrates, when the input is a picture of a boat, the network can correctly assign the highest probability (0.94) to the boat from the four categories. The sum of all probabilities at the output layer should be one (explained later in this article).

There are four main operations in the ConvNet shown in Figure 3 above:

  1. Convolution
  2. Nonlinear processing (ReLU)
  3. Pooling or subsampling
  4. Classification (fully connected layer)

These operations are the basic building blocks of every convolutional neural network, so understanding how they work is an important step towards fully understanding convolutional neural networks. Below we will try to understand the intuition behind each of these operations.

An image is a matrix of pixel values

Essentially, each image can be represented as a matrix of pixel values:

Figure 4

A channel is a conventional term used to refer to a certain component of an image. An image captured by a standard digital camera has three channels - red, green and blue; you can imagine them as three two-dimensional matrices stacked on top of each other (one for each color), each with pixel values in the range 0 to 255.

A grayscale image has only one channel. In this article, we only consider grayscale images, so a single two-dimensional matrix represents the image. The value of each pixel in the matrix ranges from 0 to 255 - zero means black and 255 means white.
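To make this representation concrete, here is a minimal sketch (assuming NumPy, which is not used anywhere in the original article) of how a three-channel color image and a single-channel grayscale image look as arrays:

```python
import numpy as np

# A hypothetical 4 x 4 color image: height x width x 3 channels (R, G, B),
# with pixel values in the range 0-255.
color_image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# A hypothetical 4 x 4 grayscale image: a single 2D matrix,
# where 0 is black and 255 is white.
gray_image = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)

print(color_image.shape)  # (4, 4, 3) -> three stacked 2D matrices
print(gray_image.shape)   # (4, 4)    -> one 2D matrix
```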

Convolution

The name of the convolutional neural network comes from the convolution operation. The main purpose of convolution is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data. We won't go into the mathematical details of convolution here, but we will try to understand how it works on images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where the pixel values are only 0 and 1):

Figure 5

Also, consider another 3 x 3 matrix, as shown below:

Figure 6

Next, the convolution of the 5 x 5 image and the 3 x 3 matrix can be calculated as shown in the animation below:

Figure 7

Now stop for a moment to understand how the computation above is done. We slide the orange matrix over the original image (green) one pixel at a time (this is also called the "stride"). At every position, we multiply the corresponding elements of the two matrices and add the products to obtain the final integer, which forms a single element of the output matrix (pink). Note that the 3 x 3 matrix only "sees" part of the input image in each step.

In CNN terminology, the 3x3 matrix is called a "filter", "kernel" or "feature detector". The matrix obtained by sliding the filter over the image and computing the dot product is called the "Convolved Feature", "Activation Map" or "Feature Map". Remember that a filter acts as a feature detector on the original input image.
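For readers who prefer code, here is a small sketch of the sliding computation just described (NumPy assumed; the matrices follow the shapes of Figures 5-7, but the exact numbers are illustrative):

```python
import numpy as np

# 5 x 5 input image with pixel values 0 and 1 (illustrative values in the
# spirit of the green matrix in Figure 5).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

# 3 x 3 filter / kernel (illustrative values, like the orange matrix in Figure 6).
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image, multiply corresponding elements
    and sum them to produce each element of the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

print(convolve2d(image, kernel))  # a 3 x 3 feature map, like the pink matrix in Figure 7
```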

As can be seen from the animation above, for the same input image, filters with different values will generate different feature maps. For example, consider the following input image:

[Figure: example input image]

In the table below, we can see the effect of convolving the above image with different filters. As shown, we can perform operations such as edge detection, sharpening and blurring just by changing the numeric values of the filter matrix before the convolution operation [8] - this means that different filters can detect different features from an image, such as edges, curves, etc. More such examples are available in Section 8.2.4 here.
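As a rough illustration, a few classic kernels of this kind can be defined and applied as follows (SciPy is assumed here; the exact values in the table above may differ from these textbook examples):

```python
import numpy as np
from scipy.signal import convolve2d

# Classic 3 x 3 kernels (standard textbook examples; the table above may use
# slightly different variants).
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
edge_detect = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
box_blur = np.ones((3, 3)) / 9.0

image = np.random.rand(8, 8)  # stand-in for a grayscale image

# Convolving the same image with different kernels yields different feature maps.
edges = convolve2d(image, edge_detect, mode='same', boundary='fill')
sharp = convolve2d(image, sharpen, mode='same', boundary='fill')
blurred = convolve2d(image, box_blur, mode='same', boundary='fill')
```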

convolution

Another good way to understand the convolution operation is to look at the animation below:

convolution

The filter (red box) slides over the input image (the convolution operation) to produce a feature map. Convolving another filter (the green box) over the same image produces a different feature map. Note that the convolution operation captures local dependencies in the original image. Also note how the two different filters generate different feature maps from the same image. Remember that the image and the two filters above are just numeric matrices, as we discussed above.

In practice, a CNN will learn the values of these filters during the training process (although we still need to specify parameters such as the number of filters, filter size, network architecture, etc. before training). The more filters we use, the more image features we extract, and the better the network can recognize patterns in unseen images.

The size of the feature map (the convolved feature) is controlled by the following three parameters, which we need to decide before the convolution step is performed (a short sketch of how they interact follows this list):

  • Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown below, we convolve the original image with three distinct filters, producing three different feature maps. You can think of these three feature maps as stacked 2D matrices, so the "depth" of the feature map would be three.

depth

  • Stride: The stride is the number of pixels by which we slide the filter matrix over the input matrix. When the stride is 1, we move the filter one pixel at a time. When the stride is 2, the filter jumps 2 pixels at a time as we slide it around. A larger stride produces a smaller feature map.

  • Zero-padding: Sometimes it is convenient to pad the edges of the input matrix with zeros, so that the filter can also be applied to the bordering elements of the input image matrix. A nice benefit of zero padding is that it lets us control the size of the feature maps. Adding zero-padding is also called wide convolution, while not using zero-padding is called narrow convolution. This is explained in detail in reference [14] below.
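Here is the promised sketch: a tiny helper (illustrative only, not from the original article) that computes the spatial size of a feature map from these parameters using the standard formula (input − filter + 2 × padding) / stride + 1:

```python
def feature_map_size(input_size, filter_size, padding=0, stride=1):
    """Spatial size of the output feature map along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 32 x 32 input, 5 x 5 filter, no padding, stride 1  ->  28 x 28 per filter
print(feature_map_size(32, 5, padding=0, stride=1))   # 28

# The same input with zero-padding of 2 keeps the spatial size at 32 x 32
print(feature_map_size(32, 5, padding=2, stride=1))   # 32

# Stride 2 roughly halves the spatial size
print(feature_map_size(32, 5, padding=2, stride=2))   # 16

# "Depth" is simply the number of filters: with 6 filters the output volume
# in the first example above would be 28 x 28 x 6.
```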

Introduction to Nonlinearity (ReLU)

An additional operation called ReLU is used after every convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by Output = max(0, Input), as shown below:

[Figure: the ReLU function]

ReLU is an element-wise operation (applied per pixel) and replaces all negative pixel values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into the ConvNet, since most of the real-world data we want the ConvNet to learn is non-linear (convolution is a linear operation - element-wise matrix multiplication and addition - so we account for non-linearity by introducing a non-linear function such as ReLU).

The ReLU operation can be understood from the figure below. It shows ReLU applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the "rectified" feature map.

[Figure: ReLU applied to a feature map]

Other nonlinear functions, such as tanh or sigmoid, can also be used instead of ReLU, but ReLU performs better in most cases.
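In code, these nonlinearities are one-liners; the sketch below (NumPy assumed) is only meant to make the element-wise nature of ReLU concrete:

```python
import numpy as np

feature_map = np.array([[ 3, -5,  2],
                        [-1,  0,  4],
                        [ 7, -2, -8]])

relu = np.maximum(0, feature_map)             # negative values become 0
sigmoid = 1.0 / (1.0 + np.exp(-feature_map))  # alternative nonlinearity
tanh = np.tanh(feature_map)                   # another alternative

print(relu)
# [[3 0 2]
#  [0 0 4]
#  [7 0 0]]
```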

Pooling operation

Spatial Pooling (also called sub-sampling or downsampling) reduces the dimensionality of each feature map, but can maintain most of the important information. There are several methods of spatial pooling: maximization, averaging, summation, etc.

For max pooling, we define a spatial neighborhood (for example, a 2x2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element, we could also take the average (average pooling) or the sum of all elements in the window. In practice, max pooling has been shown to work better.

The figure below shows an example of using max pooling on the rectified feature map (obtained after convolution + ReLU operation) using a 2x2 window.

[Figure: max pooling with a 2x2 window]

We slide our 2x2 window by 2 elements (also called the "stride") and take the maximum value in each region. As shown in the figure above, this reduces the dimensionality of our feature map.
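A minimal sketch of 2x2 max pooling with stride 2 (NumPy assumed; swapping max for mean gives average pooling):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum value in each size x size window, moving by `stride`."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()   # use window.mean() for average pooling
    return pooled

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])

print(max_pool(rectified))
# [[6 8]
#  [3 4]]
```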

In the network shown below, the pooling operation is applied to each feature map separately (note that because of this operation, we can get three output maps from three input maps).

network

The figure below shows the effect of the pooling operation on the rectified feature maps we obtained after the ReLU operation in Figure 9.

Pooling

The pooling function can gradually reduce the spatial scale of the input representation. In particular, pooling:

  • Makes the input representations (feature dimensions) smaller and more manageable, reducing the number of parameters and computations in the network and therefore helping to control overfitting
  • Makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in the input will not change the output of pooling, since we take the max/average value in a local neighborhood)
  • Helps us arrive at an almost scale-invariant representation of the image (the exact term is "equivariant"). This is very powerful, since we can detect objects in an image no matter where they are located (see references [18] and [19] for details).

The story so far

network

So far we have seen how convolution, ReLU and pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in the figure above, we have two sets of convolution, ReLU & pooling layers - the second set of convolutional layers convolves the output of the first pooling layer with six filters, producing a total of six feature maps. ReLU is then applied to all six feature maps, and max pooling is then performed on each of the six rectified feature maps.

Together, these layers can extract useful features from images and introduce nonlinearities in the network, reducing feature dimensions while keeping these features somewhat invariant to changes in scale.

The output of the second set of pooling layers serves as the input to the fully connected layer, which we will introduce in the next section.

Fully connected layer

The fully connected layer is a traditional multi-layer perceptron, and the softmax activation function is used in the output layer (other classifiers like SVM can also be used, but only softmax is used in this article). The term "Fully Connected" indicates that all neurons in the previous layer are connected to all neurons in the next layer. If you are not familiar with multilayer perceptrons, I recommend reading this article.

The outputs of convolution and pooling layers represent high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image based on the training data set. For example, in the figure below, the image classification we perform has four possible output results (note that the figure below does not show the node connections of the fully connected layer).

Fully connected

In addition to classification, adding a fully connected layer is also (generally) a simple way to learn non-linear combinations of these features. Most features obtained from convolution and pooling layers may be effective for classification tasks, but a combination of these features may be better.

The output probabilities from the fully connected layer sum to 1. This is ensured by using softmax as the activation function in the output layer. The softmax function takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one.
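A short sketch of the softmax function just described (NumPy assumed), with the usual subtraction of the maximum score for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Convert arbitrary real-valued scores into probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

scores = np.array([1.3, 5.1, 2.2, 0.7])   # hypothetical scores for dog, cat, boat, bird
probs = softmax(scores)
print(probs)          # roughly [0.02 0.92 0.05 0.01]
print(probs.sum())    # 1.0
```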

Putting it all together - training using backpropagation

As discussed above, the role of the convolution + pooling layer is to extract features from the input image, while the role of the fully connected layer is the classifier.

Note that in the figure below, since the input image is a boat, the target probability for the boat class is 1 and the target probabilities for the other three classes are 0, that is:

  • input image = boat
  • target vector = [0, 0, 1, 0]

network

The training process of the complete convolutional network can be summarized as follows:

  • Step 1: We initialize all filters and parameters/weights with random values
  • Step 2: The network receives a training image as input and finds the output probability of each class through the forward propagation process (convolution, ReLU and pooling operations, and forward propagation of the fully connected layer)
    • Suppose the output probabilities for the boat image are [0.2, 0.4, 0.1, 0.3]
    • Since the weights are randomly assigned for the first training example, the output probabilities are also random
  • Step 3: Calculate the total error at the output layer (summed over all 4 classes)
    • Total Error = Σ ½ (target probability − output probability)²
  • Step 4: Use the backpropagation algorithm to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameter values so as to minimize the output error.

    • The weights are adjusted in proportion to their contribution to the total error
    • When the same image is used as input again, the output probability at this time may be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0]
    • This indicates that the network has correctly classified this particular image by adjusting the weights/filters, so that the error in the output is reduced.
    • Parameters like the number of filters, filter sizes and network architecture are fixed before Step 1 and do not change during training - only the values of the filter matrices and connection weights are updated
  • Step 5: Repeat steps 2 ~ 4 with all images in the training set

The above steps train the ConvNet - this essentially means that all the weights and parameters of the ConvNet have been optimized to correctly classify images from the training set.
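To make steps 3 and 4 a little more concrete, here is a toy sketch (not actual training code; the learning rate and gradient value below are made up) showing how the total error from the formula above is computed and how a single gradient-descent update would adjust one weight:

```python
import numpy as np

target = np.array([0.0, 0.0, 1.0, 0.0])     # boat
output = np.array([0.2, 0.4, 0.1, 0.3])     # network output with random weights

# Step 3: total error over the 4 classes, as in the formula above
total_error = np.sum(0.5 * (target - output) ** 2)
print(total_error)   # 0.5 * (0.04 + 0.16 + 0.81 + 0.09) = 0.55

# Step 4 (schematically): every weight w is nudged against its gradient
# dE/dw, which backpropagation computes layer by layer.
learning_rate = 0.01                         # hypothetical value
w = 0.37                                     # some filter weight (made up)
dE_dw = 1.4                                  # gradient of the error w.r.t. this weight (made up)
w = w - learning_rate * dE_dw                # gradient-descent update
```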

When a new (unseen) image is given as input to the ConvNet, the network performs the forward propagation step again and outputs a probability for each class (for a new image, the output probabilities are calculated using the weights that were optimized on the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into the correct categories.

Note 1: The steps above have been simplified and mathematical details have been avoided to give an intuitive picture of the training process. See references [4] and [12] for the mathematical formulation and a thorough treatment.

Note 2: In the above example we used two sets of convolution and pooling layers. However, keep in mind that these operations can be repeated multiple times within a ConvNet. In fact, some of the best-performing ConvNets today have as many as a dozen convolution and pooling layers! At the same time, each convolutional layer does not necessarily have to be followed by a pooling layer. As shown in the figure below, we can use multiple convolution + ReLU operations in succession before the pooling operation. Also, notice how the layers of ConvNet are visualized in the image below.

car

Visualization of convolutional neural networks

Generally speaking, the more convolution steps we have, the more complicated the features the network can learn to recognize. For example, in image classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use those edges to detect simple shapes in the second layer, and then use those shapes to detect higher-level features, such as facial shapes, in deeper layers. The figure below illustrates this - these features were learned using a Convolutional Deep Belief Network, and the figure is included here only to illustrate the idea (this is just an example: real convolutional filters may detect things that mean nothing to us).

demo

Adam Harley created an excellent visualization of a convolutional neural network trained on the MNIST database of handwritten digits [13]. I highly recommend playing around with it to understand how CNNs work.

We can see in the image below how the network recognizes the input "8". Note that the visualization in the image below does not show the ReLU operation alone.

Conv_all

The input image contains 1024 pixels (32 x 32 size), and the first convolutional layer (convolutional layer 1) consists of six unique 5x5 (stride 1) filters. As can be seen in the figure, a feature map with a depth of six is ​​obtained using six different filters.

Convolutional layer 1 is followed by pooling layer 1, which performs 2x2 max pooling (stride 2) on each of the six feature maps obtained from convolutional layer 1. You can hover the mouse over any pixel in the pooling layer and observe the 2x2 grid it comes from in the previous convolutional layer (as shown in the figure above). You will notice that the largest (brightest) pixel in the 2x2 grid makes it to the pooling layer.

pooling

Pooling layer 1 is followed by sixteen 5x5 (stride 1) convolutional filters that perform the convolution operation. Next comes pooling layer 2, which performs 2x2 max pooling (stride 2). These two layers use the same concepts as described previously.

Next we arrive at three fully connected layers. They are:

  • The first fully connected layer has 120 neurons
  • The second fully connected layer has 100 neurons
  • The third fully connected layer has 10 neurons, corresponding to 10 numbers - also known as the output layer

Notice in the image below that each of the 10 nodes in the output layer is connected to all 100 nodes in the second fully connected layer (hence the name fully connected).

Also, notice how the only bright node in the output layer corresponds to the number "8" - this indicates that the network correctly classified our handwritten digits (a brighter node indicates a higher output value from it, i.e. 8 is the most probable of all numbers).

final

The same 3D visualization can be seen here.
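As a rough sketch (PyTorch is assumed here; the visualization above is not implemented this way), the network just described - two convolution + pooling stages followed by three fully connected layers - could be written as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetLike(nn.Module):
    """Rough sketch of the architecture described above:
    conv(6 @ 5x5) -> pool -> conv(16 @ 5x5) -> pool -> FC 120 -> FC 100 -> FC 10."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 1 x 32x32 -> 6 x 28x28
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 6 x 14x14 -> 16 x 10x10
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 100)
        self.fc3 = nn.Linear(100, 10)                  # one output per digit

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))           # conv + ReLU + max pool
        x = self.pool(F.relu(self.conv2(x)))
        x = x.flatten(1)                               # 16 x 5 x 5 -> 400 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)           # class probabilities summing to 1

model = LeNetLike()
digits = torch.randn(1, 1, 32, 32)                     # a dummy 32x32 grayscale input
print(model(digits).shape)                             # torch.Size([1, 10])
```

In practice one would usually return the raw scores and let a cross-entropy loss apply softmax internally during training; the explicit softmax here simply matches the description above.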

Other ConvNet architectures

Convolutional neural networks have been around since the early 1990s. The LeNet we discussed above was one of the earliest convolutional neural networks. Other influential architectures are listed below [3]:

  • LeNet (1990s): Covered in this article.
  • 1990s to 2012: From the late 1990s to the early 2010s, convolutional neural networks were in their incubation period. As the amount of available data and computing power grew, the problems that convolutional neural networks could tackle became more and more interesting.
  • AlexNet (2012) – In 2012, Alex Krizhevsky (and others) released AlexNet, a deeper and wider version of LeNet, which won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a huge margin. It was a significant breakthrough over previous approaches, and the current widespread use of CNNs can be attributed to this work.
  • ZF Net (2013) – The winner of ILSVRC 2013 was a convolutional neural network from Matthew Zeiler and Rob Fergus. It is known as ZFNet (short for Zeiler & Fergus Net). It is an improvement obtained by adjusting the hyperparameters of the AlexNet architecture.
  • GoogLeNet (2014) – The winner of ILSVRC 2014 is the convolutional neural network of Szegedy et al. from Google. Its main contribution is the use of an Inception module, which can significantly reduce the number of network parameters (4M, AlexNet has 60M parameters).
  • VGGNet (2014) – One of the leaders at ILSVRC 2014 was VGGNet. Its main contribution is to show that the depth (number of layers) of the network has a strong impact on performance.
  • ResNets (2015) – Residual Networks was developed by Kaiming He (and others) and won ILSVRC 2015. ResNets are currently the best model among convolutional neural networks and are the default choice for using ConvNet in practice (as of May 2016).
  • DenseNet (August 2016) – In the Densely Connected Convolutional Network recently published by Gao Huang (and others), each layer is directly connected to every other layer in a feed-forward fashion. DenseNet has shown significant improvements over previous best architectures on five highly competitive object recognition benchmark tasks. You can see the Torch implementation here.

Summary

In this article, I have tried to explain the main concepts behind convolutional neural networks in simple terms. I have simplified or skipped several details, but hopefully this article gives you some sense of how they work.

This article was originally inspired by Denny Britz's Understanding Convolutional Neural Networks for Natural Language Processing (which I highly recommend reading), and much of the explanation here is based on that article. For a deeper understanding of these concepts, I recommend browsing Stanford's ConvNet course notes and the references listed below. If you have any questions about the concepts above, or any comments or suggestions, please leave a message below.

All images and animations used in this article are copyrighted by the corresponding authors listed in the references below.

References

  1. Clarifai Home Page
  2. Shaoqing Ren, et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015, arXiv:1506.01497
  3. Neural Network Architectures, Eugenio Culurciello’s blog
  4. CS231n Convolutional Neural Networks for Visual Recognition, Stanford
  5. Clarifai / Technology
  6. Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
  7. Feature extraction using convolution, Stanford
  8. Wikipedia article on Kernel (image processing)
  9. Deep Learning Methods for Vision, CVPR 2012 Tutorial
  10. Neural Networks by Rob Fergus, Machine Learning Summer School 2015
  11. What do the fully connected layers do in CNNs?
  12. Convolutional Neural Networks, Andrew Gibiansky
  13. A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015 (link)
  14. Understanding Convolutional Neural Networks for NLP
  15. Backpropagation in Convolutional Neural Networks
  16. arXiv:1603.07285
  17. What is the difference between deep learning and usual machine learning?
  18. How is a convolutional neural network able to learn invariant features?
  19. A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
