How Convolutional Neural Networks Work

A translation. Original: http://brohrer.github.io/how_convolutional_neural_networks_work.html

Nine times out of ten, when you hear about a new breakthrough in deep learning, a convolutional neural network is lurking somewhere in the background. Convolutional neural networks (CNNs or ConvNets) are the mainstay of deep neural networks, and in some areas of image recognition they are even more accurate than humans. They are one of the most exciting techniques in the field.

Another exciting point is that CNNs are easy to understand, at least when you break them down into their basic parts. The details follow.


X's and O's

A simple example: decide whether a picture shows an X or an O. As shown below:

[Figure: example images of an X and an O]

An obvious solution is to save one picture of an X and one of an O, compare every new picture with both, and pick whichever matches better. But computers are very literal. To a computer, a picture is just a two-dimensional matrix of pixels (imagine a large checkerboard) with a number at each position; suppose white is 1 and black is -1, as shown in the figure below. In a pixel-by-pixel comparison, if some pixels differ, the whole match fails. The behavior we want is for an X or an O to be recognized correctly no matter how the image is translated, scaled, rotated, or distorted. CNNs solve this problem.

[Figure: an image as a two-dimensional matrix of 1 (white) and -1 (black) pixels]
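To make this concrete, here is a minimal sketch in Python with NumPy (the article itself shows no code, and the exact pixel values are hypothetical) of a tiny 5x5 X encoded this way:

```python
import numpy as np

# A toy 5x5 "X": the white strokes are 1, the black background is -1.
x_image = np.array([
    [ 1, -1, -1, -1,  1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [ 1, -1, -1, -1,  1],
])
```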

Features

CNNs compare two images piece by piece, and the pieces are called features. Instead of matching whole images pixel by pixel, CNNs capture similarity much better by finding rough feature matches at roughly the same positions in the two images. As shown in the figure, the local features correspond to parts of the character.

[Figure: matching local features between two images of an X]

Each feature can be thought of as a small picture in its own right, and feature matching means matching these small pictures. In a picture of an X, the diagonal and crossing features appear in almost every X: the center of every X is a crossing feature, and its four arms are diagonal features (pointing in different directions). As shown below.

[Figure: the diagonal and crossing features of an X]
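Continuing the NumPy sketch above, with hypothetical values in the same 1/-1 convention, the three features might look like this:

```python
# The three features from the figure: two diagonals and a crossing.
diag1 = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]])
diag2 = diag1[:, ::-1]  # the mirror-image diagonal
cross = np.array([[ 1, -1,  1],
                  [-1,  1, -1],
                  [ 1, -1,  1]])
```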

Convolution

Given a new image, a CNN does not know in advance where a feature will appear, so it tries every possible position in the image. Scoring a feature against every location turns the feature into a filter (a convolution template, or kernel), and the math used to do the scoring is called convolution, which is where convolutional neural networks get their name.

The mechanics of convolution give many people a headache, but they are simple. To measure how well a feature matches a patch of the image, multiply each pixel value in the feature matrix (the convolution template) by the pixel value at the corresponding position in the image patch, add up the results, and divide by the total number of pixels in the feature. If both pixels are white (1, the case in the figure), the product is 1; if both are black (-1), the product is also 1; only a mismatched pair gives -1. If the feature matches its patch exactly, the final result is 1 (as in this case); otherwise it is less than 1.

[Figure: multiplying a feature by an image patch, pixel by pixel]
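Using the hypothetical arrays sketched above, the calculation for a single position looks like this:

```python
# Score one 3x3 patch of the image against the diag1 feature:
# multiply corresponding pixels, then average. A perfect match is 1.0.
patch = x_image[0:3, 0:3]       # the top-left corner of the X
score = np.mean(patch * diag1)  # 1.0 here: the patch matches exactly
```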

Continuing this process, the feature matrix (convolution template) is convolved with every possible patch of the image, and the value from each convolution takes its place in a new two-dimensional matrix: a map of where the feature is found in the image. Values close to 1 indicate a strong match, values close to -1 indicate a strong match with the negative of the feature (as in a photographic negative), and values close to 0 indicate no match at all.
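Here is a minimal sketch of the full operation in the same NumPy style; a real implementation would call an optimized library routine instead of Python loops:

```python
import numpy as np

def convolve_match(image, feature):
    """Slide `feature` over every patch of `image` and score each
    position with the mean of elementwise products, so scores run
    from -1.0 (perfect mismatch) to 1.0 (perfect match)."""
    ih, iw = image.shape
    fh, fw = feature.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(ih - fh + 1):
        for c in range(iw - fw + 1):
            patch = image[r:r + fh, c:c + fw]
            out[r, c] = np.mean(patch * feature)
    return out
```

For the toy X, convolve_match(x_image, diag1) yields a 3x3 map with 1.0 wherever the diagonal stroke lines up exactly.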

Next, the same operation is carried out between the image and each of the other feature matrices (the other diagonal features, the crossing feature, and so on from the example), yielding several feature maps, one per convolution template. All of these operations take place in the convolutional layers of a CNN.

[Figure: the feature maps produced by convolving each feature with the image]

It is easy to see that CNNs get their recognition power by brute force: every feature is convolved with every pixel. Any single multiply-add-divide step could be done quickly on paper, but the number of them adds up. In mathematical terms, the computational cost of a convolutional layer grows linearly with the number of pixels in the image, the number of pixels in each convolution template, and the number of templates. Today's computing chips (CPUs, GPUs, and so on) handle large-scale convolution easily, which is a large part of why CNNs have only recently become popular.
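As a rough illustration of that linear scaling (all the sizes here are made up for the example):

```python
# Multiplies in one convolutional layer grow linearly with each factor.
image_pixels = 256 * 256
feature_pixels = 3 * 3
n_features = 32
multiplies = image_pixels * feature_pixels * n_features  # 18,874,368
```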


Pooling (downsampling)

Besides convolutional layers, another powerful tool is the pooling layer, a way of shrinking an image while preserving most of the important information in it. The math behind it is very simple (second-grade level?): slide a small window (n x n) across the entire image and, at each position, keep the maximum (max pooling), the average (average pooling), a randomly chosen value (stochastic pooling), or the like, of the pixels in the window. The figure below shows max pooling. In practice, a 2x2 or 3x3 pooling window with a stride of 2 pixels works best.
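A minimal sketch of max pooling under the same NumPy assumptions as before:

```python
import numpy as np

def max_pool(image, window=2, stride=2):
    """Slide a window across `image` and keep the largest value at
    each position: the 2x2, stride-2 case described above."""
    h = (image.shape[0] - window) // stride + 1
    w = (image.shape[1] - window) // stride + 1
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            out[r, c] = image[r * stride:r * stride + window,
                              c * stride:c * stride + window].max()
    return out
```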

With a 2x2 pooling window and stride 2, the final image is a quarter the size of the original. Because max pooling keeps the largest value in each window, it preserves the best feature match within that window. This means it does not care exactly which pixel inside the window the feature matched, only that the feature matched somewhere in the window. As a result, CNNs can discover whether a feature is present in an image without caring where it is. This addresses the earlier problem of the computer being too literal: a model that insists on knowing exactly where a feature sits generalizes very poorly.

A pooling operation is applied to each image in a set, so the number of images coming out equals the number going in; only the pixel count drops. That reduces the amount of data: for example, an 8-megapixel image shrinks to 2 megapixels after one 2x2 pooling step, which lightens all the calculations that follow.


Rectified Linear Units (ReLU)

ReLU (an activation function) is a small but important component of CNNs. The math behind it is also very simple: whenever an input value is negative, the ReLU function outputs 0; otherwise the value passes through unchanged.


This operation keeps the math of a CNN healthy by keeping the learned values in the range from 0 to infinity (the one-sided suppression brings sparsity and a wide excitation boundary). It is like axle grease for the CNN cart: simple and not fancy, but without it the cart will not go far.

The output of a ReLU layer is the same size as its input, except that all the negative values have become 0.
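In code the whole layer is essentially one line; a sketch in the same NumPy style:

```python
import numpy as np

def relu(feature_map):
    """Set every negative value to 0; non-negative values pass through."""
    return np.maximum(feature_map, 0)
```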


Deep learning

Notice from the above that the input and output of every layer look alike (both are two-dimensional matrices), and because of this the layers can be stacked like Lego bricks. An original image, after convolution filtering, ReLU rectification, and pooling, becomes a set of smaller feature images. The process can be repeated as many times as we like, producing ever more complex feature maps while the images shrink and the data grows more compressed.
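Using the sketch functions defined earlier (convolve_match, relu, max_pool), one such stackable block might look like this:

```python
def conv_block(image, features):
    """One convolution -> ReLU -> pooling stage. Takes one image in
    and returns one smaller feature image per feature."""
    return [max_pool(relu(convolve_match(image, f))) for f in features]
```

Deeper layers would simply call conv_block again on each of the returned maps.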


With this structure, the lower feature-extraction layers pick out simple properties of the image, such as edges and brightness (local features), as shown on the left below. Higher layers gradually extract higher-level features of the image, such as shapes and patterns, which become easier and easier to tell apart. In a CNN for face recognition, for example, the top-level features look very much like faces, as shown on the right below.

[Figure: low-level edge features (left) and face-like top-level features from a face-recognition CNN (right)]

Fully connected layers

Another powerful tool of CNNs is the fully connected layer, which turns the high-level feature maps of an image into votes, in our case votes for X and votes for O. Fully connected layers are the main building block of traditional neural networks. The input is not a two-dimensional matrix but a single list of values, each initially treated the same. Every value gets a vote on whether the image is an X or an O. But an equal vote would not be fair: some values are much stronger evidence that the image is an X, others that it is an O, and those standout values deserve bigger votes. These vote sizes are the weights, or connection strengths, between each feature value and each category. When an image enters the CNN and flows from the lower layers up to the fully connected layer, an election is held, and the category with the most votes becomes the output.
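A minimal sketch of the vote; the feature values and weights below are hypothetical, not taken from the article:

```python
import numpy as np

def fully_connected(feature_values, weights):
    """Weighted voting: `feature_values` is the flattened list of
    high-level feature values, `weights` has one column of voting
    strengths per category. Returns the vote totals."""
    return feature_values @ weights

# Hypothetical example: 4 feature values voting between X and O.
values = np.array([0.9, 0.65, 0.45, 0.87])
weights = np.array([[ 0.9, -0.3],   # strong evidence for X
                    [ 0.6,  0.1],
                    [-0.2,  0.8],   # strong evidence for O
                    [ 0.7, -0.1]])
votes = fully_connected(values, weights)
winner = ["X", "O"][int(np.argmax(votes))]  # "X" here
```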

[Figure: feature values casting weighted votes for the X and O categories]

Naturally, fully connected layers can also be stacked (their output, a list of values, looks just like their input). In practice several fully connected layers are often chained together, with each intermediate layer voting for phantom, hidden categories. Every additional layer lets the network learn richer combinations of features, which makes its predictions more accurate.


Backpropagation

So far the story looks complete, but it leaves open questions: where did the features come from, and how do we find the weights in the fully connected layers? By hand? Certainly not, or CNNs would never have become so popular. In fact, backpropagation does this work for us.

To use backpropagation, we first need a large collection of images whose labels are already known: lots of pictures known to be X's and lots known to be O's (preparing such data takes a great deal of patience). We start with an untrained network, meaning that every pixel of every feature template and every weight in the fully connected layers is set to a random value, and feed the labeled images into it one by one.

For each image the CNN processes, it produces a vote. The amount by which the vote is wrong (it should say X but says O), the error, tells us how good our current feature templates and weights are. The templates and weights can then be adjusted to shrink the error. Each value is tried a little higher and a little lower, the error is recomputed in each case, and whichever change reduces the error is kept. This is done for every pixel of every feature template in the convolutional layers and for every weight in the fully connected layers, and the new values make the error slightly smaller. Then the whole procedure is repeated with every other labeled image. Quirks that appear in only one image are quickly forgotten, but the features and weights that fit the whole data set are retained (individual images differ, but the overall trend persists). Given enough data, the result is a network model that generalizes well.
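Here is a toy sketch of that try-both-directions adjustment for a single value. (Real backpropagation computes the helpful direction analytically from gradients rather than by trial, but the spirit is the same.)

```python
def adjust_weight(weight, error_fn, step=0.001):
    """Nudge one weight up and down, keep whichever change lowers
    the error, and leave the weight alone if neither helps.
    `error_fn(w)` must return the network's total error with this
    weight set to w."""
    base = error_fn(weight)
    up, down = error_fn(weight + step), error_fn(weight - step)
    if up < base and up <= down:
        return weight + step
    if down < base:
        return weight - step
    return weight
```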


Of course, backpropagation is computationally expensive and demands capable hardware, which is why deep learning systems commonly train offline and only run recognition online.


Hyperparameters

CNNs still leave a lot of decisions to their designer (a hypothetical set of choices is sketched after this list):

(1) How many features should each convolutional layer have? How many pixels should each feature contain?

(2) What window size should each pooling layer use? What stride?

(3) How many hidden neurons should each fully connected layer have?
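A hypothetical set of such choices, only to make the decisions concrete (none of these numbers come from the article):

```python
# Hypothetical hyperparameter choices for the X/O network.
hyperparams = {
    "features_per_conv_layer": [8, 16],  # (1) how many features per layer
    "feature_size": (3, 3),              # (1) pixels in each feature
    "pool_window": (2, 2),               # (2) pooling window size
    "pool_stride": 2,                    # (2) pooling stride
    "fc_hidden_units": [64],             # (3) neurons per hidden layer
}
```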


Beyond these there are higher-level architectural decisions: how many layers should the network have, and in what order? Some deep neural networks have thousands of layers and can predict many more categories.

There are so many possible permutations and combinations of CNN structure that only a few have ever been tested. The growth of CNNs has been driven by the community, and surprising results keep turning up. We have covered only plain CNNs here; many, many variants exist, with new types of layers, more complex interconnections between layers, and so on.


Beyond images

CNNs can classify more than image files; other data types work too, as long as they can be converted into something image-like (a matrix). For example, an audio signal can be cut into short time blocks, and each block can then be broken into bass, midrange, treble, or finer frequency bands. The result can be treated as a two-dimensional matrix in which each column is a time block and each row is a frequency band. As shown below.

[Figure: a sound clip arranged as a two-dimensional time-frequency matrix]
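A minimal sketch of the conversion, assuming a 1-D NumPy array of audio samples (the block size is arbitrary):

```python
import numpy as np

def sound_to_matrix(signal, block=256):
    """Cut a 1-D audio signal into short blocks and take each block's
    frequency content. Rows are frequency bands, columns are time
    blocks, so the result can be treated like an image."""
    n_blocks = len(signal) // block
    columns = [np.abs(np.fft.rfft(signal[i * block:(i + 1) * block]))
               for i in range(n_blocks)]
    return np.stack(columns, axis=1)
```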

This looks enough like an image that CNNs perform well on it too. The same trick has been used on text data in natural language processing and even on chemical data in drug discovery, among others.


There is no one-size-fits-all technique, though, and some data types do not suit CNNs. Take customer data, where each row represents a customer and each column an attribute: name, gender, address, email, purchase and browsing history, and so on. In data like this the positions of the rows and columns carry no meaning: the rows can be shuffled and the columns rearranged without losing anything important. Rearranging an image, by contrast, leaves it scrambled and useless.

Rule of thumb: if your data is just as useful after swapping its rows and columns, CNNs are not a good fit. Conversely, if your problem looks like finding patterns in an image, CNNs are exactly what you want.


Learn more

To keep digging into deep learning, there are many materials to learn from:


http://brohrer.github.io/deep_learning_demystified.html

http://cs231n.github.io/convolutional-networks/

http://colah.github.io/archive.html

https://brohrer.mcknote.com/zh-Hans/how_machine_learning_works/how_convolutional_neural_networks_work.html


There are also many excellent frameworks worth exploring:

· Caffe

· CNTK

· Deeplearning4j

· TensorFlow

· Theano

· Torch

· Many others
