Plain Talk on Convolutional Neural Networks (CNN)

In recent years, deep learning has developed rapidly and achieved great success in scenarios such as image recognition, speech recognition, and object recognition. For example, AlphaGo defeated the world Go champion, and the iPhone X has a built-in face recognition unlock function. Many AI products have caused a great stir around the world. In this deep learning revolution, convolutional neural networks (CNNs) are the main driving force behind these breakthroughs and hold a very important position in the current development of artificial intelligence.

So the question is: what is a convolutional neural network (CNN)?

1. For beginners: what is a neural network?
The neural network here, also called an artificial neural network (ANN), is an algorithmic mathematical model that imitates the behavioral characteristics of biological neural networks. It consists of neurons (nodes) and the connections (synapses) between nodes, as shown below:
 
The mathematical model abstracted from each neural network unit is shown below. It is also called a perceptron: it receives multiple inputs (x1, x2, x3...) and produces one output, much like a nerve ending that senses changes in the external environment (external stimuli) and then generates electrical signals that are transmitted to nerve cells (also called neurons).
 
A single perceptron constitutes a simple model, but real-world decision-making models are much more complex, often a multi-layer network composed of many perceptrons. The figure below shows this classic neural network model, which consists of an input layer, a hidden layer, and an output layer.
 
Artificial neural networks can approximate complex nonlinear relationships, have strong robustness, memory, and self-learning capabilities, and are widely used in classification, prediction, pattern recognition, and so on.
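As a minimal sketch of the perceptron and the multi-layer idea described above, the following Python code computes a weighted sum of the inputs and applies a step activation. The weights and biases are hypothetical values chosen by hand for illustration (they are not from the original figures); in practice they would be learned from data.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a step activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


# A single perceptron with hand-picked (hypothetical) weights acts like AND.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_outputs = [perceptron([a, b], [1.0, 1.0], -1.5) for a, b in pairs]
print(and_outputs)  # [0, 0, 0, 1]


def two_layer_xor(a, b):
    """Input layer -> hidden layer (OR, NAND) -> output layer (AND)."""
    hidden_or = perceptron([a, b], [1.0, 1.0], -0.5)
    hidden_nand = perceptron([a, b], [-1.0, -1.0], 1.5)
    return perceptron([hidden_or, hidden_nand], [1.0, 1.0], -1.5)


# XOR cannot be computed by one perceptron, but a hidden layer makes it possible.
xor_outputs = [two_layer_xor(a, b) for a, b in pairs]
print(xor_outputs)  # [0, 1, 1, 0]
```

The second example hints at why multi-layer networks are used: stacking perceptrons lets the model represent relationships that a single unit cannot.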

2. The main topic: what is a convolutional neural network?
Convolutional neural networks shine in image recognition, reaching unprecedented accuracy, and have a wide range of applications. Next, we take image recognition as an example to introduce the principles of convolutional neural networks.
(1) Case
Suppose we are given a picture (which may be the letter X or the letter O) and want a CNN to identify whether it is X or O, as shown in the figure below. How can this be done?
 
(2) Image input
If the classic neural network model is adopted, the entire image needs to be read as the input of the model (that is, the fully connected approach). Every neuron is then connected to every pixel, so the number of connection parameters grows quickly with the image size.
Human cognition of the outside world generally proceeds from the local to the global: we first perceive the parts and then gradually form an understanding of the whole. Spatial relationships in an image are similar: pixels within a local range are closely related, while pixels farther apart are only weakly related. Therefore, each neuron does not need to perceive the whole image; it only needs to perceive a local region, and the local information is then combined at a higher level to obtain global information. This pattern is an important tool for reducing the number of parameters in convolutional neural networks: the local receptive field.
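To make the parameter saving concrete, here is a small arithmetic sketch in Python. The image size (1000*1000), the number of hidden neurons (1,000,000), and the 10*10 receptive field are illustrative assumptions, not values from this article.

```python
def fully_connected_weights(height, width, hidden_units):
    """Every hidden neuron is connected to every pixel of the image."""
    return height * width * hidden_units


def local_receptive_field_weights(field_height, field_width, hidden_units):
    """Every hidden neuron is connected only to a small local patch."""
    return field_height * field_width * hidden_units


# Hypothetical sizes, chosen only to show the order of magnitude.
full = fully_connected_weights(1000, 1000, 1_000_000)
local = local_receptive_field_weights(10, 10, 1_000_000)

print(full)           # 1000000000000
print(local)          # 100000000
print(full // local)  # 10000
```

Under these assumptions the local receptive field needs 10,000 times fewer weights. Sharing one filter across all positions, as the convolution step does later, reduces the count even further.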
 
(3) Extracting features
If the letters X and O were always fixed, the easiest way would be to compare the images pixel by pixel. In real life, however, characters appear with many morphological changes (for example, in handwriting recognition), such as translation, scaling, rotation, and slight deformation, as shown in the figure below:
 
Our goal is to use a CNN to accurately identify X and O under these morphological changes, and the key factor for identification is how to extract features effectively.
Recall the "local receptive field" pattern mentioned earlier. A CNN compares small patches: it looks for rough features (small patches) at roughly the same positions in the two images and matches them. Compared with the traditional approach of comparing whole images pixel by pixel, this patch matching method is better at measuring the similarity between two images, as shown in the figure below:
 
Taking the letter X as an example, three important features can be extracted (two intersecting lines, one diagonal line), as shown in the figure below:
 
If the pixel value "1" represents white and the pixel value "-1" represents black, the three important features of the letter X are as follows:
 
So how are these features used in the matching calculation? (Don't tell me the pixels are matched one by one!)
(4) Convolution
Now it is time to introduce today's important guest: convolution. So what is convolution? Don't worry, let's take it step by step.
When given a new image, the CNN does not know exactly which parts of the original image these features should match, so it tries every possible position in the original image, which is equivalent to turning each feature into a filter. This matching process is called the convolution operation, which is where the name of the convolutional neural network comes from.
The operation of convolution is shown in the figure below:
 
Doesn't it look a lot like rolling up a towel diagonally? The figure below vividly illustrates why it is called "convolution".
 
In this case, to compute the result of matching a feature against its corresponding small block in the original image, multiply the pixel values at corresponding positions in the two blocks, sum the products over the whole block, and finally divide by the total number of pixels in the block (note: dividing by the total is optional).
If both pixels are white (both values are 1), then 1*1 = 1; if both are black, then (-1)*(-1) = 1. That is, every pair of matching pixels gives a product of 1. Similarly, any pair of mismatched pixels gives -1. The specific process is as follows (matching results of the first, second, ..., last pixel):
 
 
 
According to the calculation method of convolution, the convolution calculation after the first block feature matching is as follows, and the result is 1.
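The single-block calculation described above can be sketched in Python as follows. The 3*3 diagonal feature is a hypothetical stand-in for the feature in the missing figure, and match_score is an illustrative helper name, not part of any library.

```python
def match_score(feature, patch):
    """Multiply corresponding pixels, sum the products, divide by the pixel count."""
    size = len(feature)
    total = sum(
        feature[r][c] * patch[r][c]
        for r in range(size)
        for c in range(size)
    )
    return total / (size * size)


# Hypothetical diagonal feature: 1 is white, -1 is black.
feature = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

identical_patch = [row[:] for row in feature]
inverse_patch = [[-v for v in row] for row in feature]
one_pixel_off = [row[:] for row in feature]
one_pixel_off[0][0] = -1  # flip a single pixel so that it no longer matches

print(match_score(feature, identical_patch))  # 1.0
print(match_score(feature, inverse_patch))    # -1.0
print(match_score(feature, one_pixel_off))    # 7/9, roughly 0.78
```

A perfect match scores 1, a perfect inverse scores -1, and eight matching pixels plus one mismatch give (8 - 1) / 9.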
 
Matching at other positions is similar (for example, matching in the middle part):

The result of the convolution calculation is as follows:

And so on. Repeating the above process for each of the three feature images, the convolution operation for each feature produces a new two-dimensional array, called the feature map. The closer a value is to 1, the more completely the corresponding position matches the feature; the closer it is to -1, the more completely the position matches the inverse of the feature; and a value close to 0 means that the position has no match or no relationship with the feature, as shown in the following figure:
 
It can be seen that as the image size increases, the number of additions, multiplications, and divisions grows rapidly, and the amount of computation also grows linearly with the size of each filter and the number of filters. With so many factors, the total amount of computation can easily become very large.
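The sliding-window matching in this section can be sketched in Python as below. The 9*9 image of an X and the 3*3 diagonal feature are assumptions built to resemble the missing figures, and convolve is an illustrative helper, not a library function. The sketch also counts multiplications to show how the work grows.

```python
def convolve(image, feature):
    """Slide the feature over every position and record the match score."""
    size = len(feature)
    feature_map = []
    for top in range(len(image) - size + 1):
        row = []
        for left in range(len(image[0]) - size + 1):
            total = sum(
                image[top + r][left + c] * feature[r][c]
                for r in range(size)
                for c in range(size)
            )
            row.append(total / (size * size))
        feature_map.append(row)
    return feature_map


# Hypothetical 9*9 image of an X: white (1) on the two diagonals, black (-1) elsewhere.
image = [
    [1 if 1 <= r <= 7 and (r == c or r + c == 8) else -1 for c in range(9)]
    for r in range(9)
]
diagonal = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]

feature_map = convolve(image, diagonal)
print(len(feature_map), len(feature_map[0]))  # 7 7
print(feature_map[1][1])                      # 1.0 (a perfect match)

# One multiplication per feature pixel at every position.
positions = len(feature_map) * len(feature_map[0])
multiplications = positions * len(diagonal) * len(diagonal[0])
print(multiplications)  # 441 for a single feature
```

Even this tiny example needs 441 multiplications per feature, and the count is multiplied again by the number of features.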
(5) Pooling
To reduce the amount of computation, another effective tool used by CNNs is "pooling". Pooling shrinks the input image, reducing the pixel information and retaining only the important information.
The pooling operation is also very simple. Usually the pooling area is 2*2 in size, and it is converted into a single value according to a certain rule, such as taking the maximum value (max-pooling) or the average value (mean-pooling); this value becomes the resulting pixel value.
For the 2*2 pooling area in the upper left corner, max-pooling takes the maximum value of the area, max(0.77,-0.11,-0.11,1.00), as the pooling result, as shown in the following figure:
 
Sliding the pooling area to the next block, the second block takes the maximum value max(0.11,0.33,-0.11,0.33) as the pooling result, as shown in the following figure:
 
The other areas are handled in the same way, taking the maximum value in each area as the pooling result. After pooling, the final results are as follows:
 
Perform the same operation on all feature maps, and the results are as follows:
 
Max-pooling retains the maximum value in each small block, which is equivalent to retaining the best matching result of that block (because a value closer to 1 means a better match). In other words, it does not care exactly where in the window the match occurred, only whether there was a match somewhere in it.
By adding the pooling layer, the image is reduced, which can greatly reduce the amount of calculation and reduce the machine load.
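A minimal Python sketch of 2*2 max-pooling is shown below. The eight input values are the ones quoted in the text above; their arrangement inside each 2*2 block is an assumption, since the original figure is not available. Windows that overhang the edge are simply cropped, which is one possible convention.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size*size window; edge windows are cropped."""
    pooled = []
    for top in range(0, len(feature_map), stride):
        row = []
        for left in range(0, len(feature_map[0]), stride):
            window = [
                value
                for line in feature_map[top:top + size]
                for value in line[left:left + size]
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Two adjacent 2*2 blocks using the values quoted in the text.
two_blocks = [
    [0.77, -0.11, 0.11, 0.33],
    [-0.11, 1.00, -0.11, 0.33],
]
print(max_pool(two_blocks))  # [[1.0, 0.33]]

# A 7*7 feature map shrinks to 4*4 with this cropping convention.
blank_map = [[0.0] * 7 for _ in range(7)]
pooled = max_pool(blank_map)
print(len(pooled), len(pooled[0]))  # 4 4
```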
(6) Activation function ReLU (Rectified Linear Units)
Commonly used activation functions include sigmoid, tanh, and ReLU. The first two, sigmoid and tanh, are more common in fully connected layers, while ReLU is common in convolutional layers.
Recall the perceptron mentioned earlier: it receives the inputs, sums them, and produces its output through an activation function. The role of the activation function is to introduce nonlinearity, applying a nonlinear mapping to the output of the convolutional layer.
 
In convolutional neural networks, the activation function is generally ReLU (Rectified Linear Unit), which is characterized by fast convergence and simple gradient calculation. The formula is also very simple: max(0,T). That is, negative input values are output as 0, and positive values are output unchanged.
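The max(0,T) formula can be sketched in Python as follows. The function names relu and relu_map are illustrative, and the sample values are the ones used in this article's example.

```python
def relu(value):
    """max(0, T): negative values become 0, positive values pass through."""
    return max(0.0, value)


def relu_map(feature_map):
    """Apply ReLU to every value of a two-dimensional feature map."""
    return [[relu(value) for value in row] for row in feature_map]


print(relu(0.77))   # 0.77
print(relu(-0.11))  # 0.0

corner = [
    [0.77, -0.11],
    [-0.11, 1.00],
]
activated = relu_map(corner)
print(activated)  # [[0.77, 0.0], [0.0, 1.0]]
```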
Let's take a look at the operation process of the ReLU activation function in this case:
For the first value, take max(0,0.77); the result is 0.77, as shown in the following figure:

For the second value, take max(0,-0.11); the result is 0, as shown in the following figure:
 
By analogy, after the ReLU activation function, the results are as follows:
 
Perform the ReLU activation function operation on all feature maps, and the results are as follows:
 
(7) Deep neural network
Combining the convolution, activation function, and pooling described above gives the following:
 
By increasing the depth of the network and adding more layers, a deep neural network is obtained, as shown in the figure below:
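In code, stacking these steps can be sketched as below. This is an illustrative Python sketch, not the article's exact network: the 9*9 image of an X and the diagonal feature are assumptions, and reusing the same feature in the second layer is done only to keep the example short.

```python
def convolve(image, feature):
    """Slide the feature over the image and average the pixel products."""
    size = len(feature)
    return [
        [
            sum(
                image[top + r][left + c] * feature[r][c]
                for r in range(size)
                for c in range(size)
            ) / (size * size)
            for left in range(len(image[0]) - size + 1)
        ]
        for top in range(len(image) - size + 1)
    ]


def relu_map(feature_map):
    """Replace negative values with 0."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each window; edge windows are cropped."""
    return [
        [
            max(
                value
                for line in feature_map[top:top + size]
                for value in line[left:left + size]
            )
            for left in range(0, len(feature_map[0]), stride)
        ]
        for top in range(0, len(feature_map), stride)
    ]


def conv_block(image, feature):
    """One stage of the network: convolution -> ReLU -> pooling."""
    return max_pool(relu_map(convolve(image, feature)))


# Hypothetical 9*9 image of an X and a 3*3 diagonal feature.
image = [
    [1 if 1 <= r <= 7 and (r == c or r + c == 8) else -1 for c in range(9)]
    for r in range(9)
]
diagonal = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]

first = conv_block(image, diagonal)   # 9*9 -> 7*7 -> 4*4
second = conv_block(first, diagonal)  # 4*4 -> 2*2 -> 1*1
print(len(first), len(first[0]), len(second), len(second[0]))  # 4 4 1 1
```

Each added stage shrinks the data while keeping the strongest matches, which is what makes deeper stacks affordable.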
 
(8) Fully connected layers
The fully connected layer plays the role of a "classifier" in the whole convolutional neural network. That is, after the input passes through the deep network of convolution, activation function, and pooling layers, the fully connected layer identifies and classifies the result.
First, concatenate the results of the deep network after convolution, activation function, and pooling, as shown in the following figure:
 
Since this neural network is trained by supervised learning, the model is fitted to the training samples during training, which yields the weights of the fully connected layer (such as the weights of all connections that predict the letter X):

When the model is used for recognition, the results computed by the preceding convolution, activation function, and pooling layers are weighted and summed using the weights just trained, giving a predicted value for each possible result. The largest value is then taken as the recognition result (as shown in the figure below, the recognition value of the letter X is calculated to be 0.92 and that of the letter O is 0.51, so the result is determined to be X):
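The weighted-sum-and-pick-the-largest step can be sketched in Python as follows. The pooled values and the per-class weights here are hypothetical numbers chosen for illustration; they are not the trained weights behind the 0.92 and 0.51 in the article's figure.

```python
def flatten(feature_maps):
    """Concatenate all feature maps into one list of input values."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(inputs, weights_by_class):
    """Weighted sum of the inputs for each class."""
    return {
        label: sum(x * w for x, w in zip(inputs, weights))
        for label, weights in weights_by_class.items()
    }


def classify(scores):
    """Take the class with the largest score as the recognition result."""
    return max(scores, key=scores.get)


# Hypothetical output of the convolution / ReLU / pooling layers.
pooled_maps = [[[1.0, 0.5], [0.5, 1.0]]]

# Hypothetical weights; in practice these come from supervised training.
weights_by_class = {
    "X": [0.4, 0.1, 0.1, 0.4],
    "O": [0.1, 0.4, 0.4, 0.1],
}

inputs = flatten(pooled_maps)
scores = fully_connected(inputs, weights_by_class)
print(classify(scores))  # X
```

With these made-up numbers the X score is about 0.9 and the O score about 0.6, so X wins, mirroring the decision rule described above.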
 
The operation defined in the above process is the "fully connected layer", and there can be multiple fully connected layers, as shown in the figure below:

(9) Convolutional Neural Networks
Stringing all of the above together forms the structure of a "convolutional neural network" (CNN), as shown in the following figure:

Finally, to review and summarize: a convolutional neural network consists mainly of two parts. One part is feature extraction (convolution, activation function, pooling), and the other is classification and recognition (fully connected layers). The following picture is the structure diagram of the famous convolutional neural network for handwriting recognition:
 


Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab) for more details.
