Don't get hung up on the convolution formula! A zero-formula deep dive into convolution, starting from fully connected feedforward neural networks

This article is reprinted from the WeChat subscription account "Xi Xiaoyao's Cute-Selling House", from the article of the same title: "Don't get hung up on the convolution formula! A zero-formula deep dive into convolution, starting from fully connected feedforward neural networks."


Hello~ your Xiao Xi has finally surfaced to blow some bubbles~ A few days ago Xiao Xi got dragged into an overtime project that went on for days and days, and nearly developed draft-procrastination anxiety _(:з」∠)_

As for convolutional neural networks, Xiao Xi won't start from the convolution operation itself. More than one reader has asked me what the convolution in a convolutional neural network actually means; I was even invited on Zhihu to answer the odd question "why isn't a convolutional neural network called a cross-correlation neural network?" So there was no way around it: time to hurry up and write about CNNs (not that I secretly wanted to go save the world or anything \(//∇//)\

Let's start from the simpler machine learning models we covered earlier. Recall the fully connected feedforward neural network that has been mentioned N * N * N times before; in a previous article Xiao Xi walked through a fully connected feedforward network with one hidden layer:


[figure: a fully connected feedforward network with one hidden layer]


Here the network can be seen as two classifiers cascaded front to back: the output of the first-stage classifier is the input to the second-stage classifier. Clearly, what each output of the first stage (i.e., each hidden unit) means is not something we specify; in other words, the first-stage classifiers learn categories whose meaning is unknown to us! The second-stage classifier then learns our explicitly defined final output categories directly on top of those unknown categories from the stage before it! Let me give an example.

Say the input is an image:

[figure: an example input image]


Assume this is a 100 * 100 image, i.e., there are 10,000 pixels, and each pixel takes a value from 0 to 255.

Imagine that we don't want to hand-engineer features; we want to throw the raw image in directly and classify whether it contains a dog. Then the input layer has dimension 10,000, i.e., 10,000 features, where each feature is the value of one pixel.

If our machine learning model has no hidden layer:

[figure: a network with no hidden layer, pixels connected directly to the output]


Then the model clearly connects every pixel directly to the two final categories, "dog" and "not dog". But a moment's thought shows that the value of any single pixel has essentially no connection to whether the image is a dog (you can't say that because a pixel is black (value 0) it belongs to a dog, and likewise a white pixel (value 255) doesn't prove it isn't part of a dog). So making the dog/not-dog decision directly from individual pixel values obviously won't fly! (Each individual feature has only a tiny correlation with the category.)

But what if we add a hidden layer? Would things get better then?

[figure: the network with one hidden layer added]

Imagine: with one hidden layer added as above, the first-stage model can learn unknown categories, and these categories can be strongly correlated with the pixels! For example, one hidden category might be "is there a circle of radius 50 centered at the middle of the image?"

[figure: a circle of radius 50 centered in the image]

This sub-classifier is very easy to learn: the model just makes the weights of the pixels lying on the circle large (say, 2) and the weights of all other features close to 0. Then the more clearly the circle appears (those pixel values closer to 0, i.e., darker), the closer the sub-classifier's output is to 0; and when no circle is present there (pixel values close to 255), the sub-classifier's output is very large. See, this simple classification task is easy to learn! (Of course, to really decide whether a circle is there you'd also need to consider the pixels around it, so that there's enough contrast for the circle to count as a circle; but never mind those details, just grasp the idea Xiao Xi wants to convey.)
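The weighting idea above is just a dot product. Here is a toy numeric sketch, with made-up sizes (5 pixels instead of 10,000, the middle three standing in for "pixels on the circle"); the weight value 2 comes from the text, everything else is illustrative:

```python
import numpy as np

# hypothetical miniature of the circle detector described above:
# 5 pixel features; the middle 3 lie on the circle (weight 2), the rest weight 0
weights = np.array([0.0, 2.0, 2.0, 2.0, 0.0])

dark_circle = np.array([200.0, 0.0, 0.0, 0.0, 200.0])      # circle present (dark pixels)
no_circle   = np.array([200.0, 255.0, 255.0, 255.0, 200.0])  # no circle (all bright)

print(weights @ dark_circle)  # 0.0    -> small output: circle found
print(weights @ no_circle)    # 1530.0 -> large output: no circle here
```

Exactly as the text says: the darker the on-circle pixels, the smaller the output; a blank white patch drives the output up.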

Great, this sub-classifier is trained; that's what one hidden node does. By the same token, the other sub-classifiers (hidden nodes) can learn other odd, simple hidden categories, and this series of categories can perfectly well combine into something like this:

[figure: simple line segments combining into the outline of a dog]

See, based on these categories (which are the features for the next-stage classifier), the next-stage classifier can easily decide whether this image is a dog. For the picture above, suppose there are seven hidden nodes, each responsible for deciding whether one of seven line segments is present. Then the next-stage classifier just needs to give these seven features large weights, so that when all of them are present, it's obviously a dog! The later classifier can then make a very confident decision: "this picture is a dog!" See, this approach of assembling a dog out of parts is much more direct, and much more confident, than deciding from single pixels~

This is the basic principle of how a fully connected deep feedforward neural network does classification.


But! You can surely spot the problem! This approach clearly has severe limitations! What if the dog is in a different position? What if the dog's size changes? What if the dog is curled up in a corner of the image?

Obviously, the workload of the hidden layer of this fully connected feedforward network is going to explode! It would need a huge number of hidden nodes learning a huge number of hidden categories / hidden features just to cope with so many complicated situations!

And a large number of hidden nodes makes the number of network parameters grow rapidly! In the example above, each additional hidden node adds 10000 + 2 parameters; clearly the cost is steep. Is there a better way?
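The 10000 + 2 figure is easy to check: a hidden node connects to all 10,000 pixel inputs and to both output classes (biases left out, as in the text). A quick sanity check:

```python
# Parameter cost per hidden node for a fully connected net on a
# 100 x 100 grayscale image with two output classes ("dog" / "not dog").
# Biases are ignored here, matching the back-of-envelope count in the text.
n_inputs = 100 * 100   # 10,000 pixel features
n_outputs = 2

def params_per_hidden_node(n_in, n_out):
    # each hidden node has one weight per input plus one weight per output
    return n_in + n_out

print(params_per_hidden_node(n_inputs, n_outputs))  # 10002
```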

Obviously yes! Since our simple classifiers each learn a simple decision task, a simple circle, a simple straight line, all the hidden nodes that learn circles can be merged into a single node! We can then represent that node with a "window" much smaller than the whole image. For example, use a 20 * 20 window (only 400 parameters, versus 100 * 100 = 10000 parameters in the fully connected case), and make this window responsible for finding small circles anywhere in the picture! This "window" is called a "convolution kernel" (essentially a scaled-down version of the connection weights from the input to a hidden node). To find small circles in every corner of the image, we just let this convolution kernel slide over every corner of the image in turn; wherever it finds a small circle, it is activated there, i.e., it marks "there is a small circle here".
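The sliding-window idea above can be sketched in a few lines. This is a minimal stride-1, no-padding version with made-up sizes (a 6 * 6 image and a 3 * 3 kernel instead of 100 * 100 and 20 * 20); as the Zhihu question at the top hints, what frameworks call "convolution" is computed exactly like this, i.e., as cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel ("window") over a 2D image, stride 1, no padding."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # one position: multiply the window's pixels by the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[0, 0] = 1.0                  # a tiny "feature" hiding in the top-left corner
kernel = np.ones((3, 3))           # one 3 * 3 window: just 9 parameters
response = conv2d(image, kernel)   # (6, 6) input -> (4, 4) map of responses
print(response[0, 0], response[3, 3])  # strong where the feature is, zero elsewhere
```

The same 9 parameters get reused at every position, which is exactly why the window is so much cheaper than a fully connected hidden node.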

So, just as in the fully connected case earlier, to learn more features we naturally want to set up multiple convolution kernels~ Each kernel is responsible for one simple classification task (and by this point you've probably guessed: just as in the fully connected case, the "simple classification tasks" here really amount to producing new features to throw to the next layer of sub-classifiers).

Obviously, again as in the fully connected case, kernels of the same size, say 20 * 20, with different parameters extract different features: some extract the small circles inside a 20 * 20 patch, some extract triangles, and so on. A group of convolution kernels of the same size is referred to here as a 20 * 20 filter (i.e., within one filter size we can set up several convolution kernels to learn).

Since we may need 20 * 20 patches to classify small circles, we may of course also need 50 * 50 patches to find big circles. Therefore, in one convolution layer we can set up filters of several sizes, and of course, within each filter size, several convolution kernels to extract different features.

Now let's consider a more complex case!

We know that much of the time the input is not just a single layer! For example, a color image contains three layers, red, green, and blue, unlike the earlier grayscale image, which has only one.

[figure: a color image split into red, green, and blue layers]

Sometimes a circle in a color image appears only in the blue layer and not in the other two, so obviously, if our circle-extracting convolution kernel slides over only one layer, it may miss a lot of information at many positions. So when the input data comes as multiple layers (i.e., there are many representations of the data from different angles), the kernel should, at each position, map / "convolve" all the layers and sum the results to decide whether the feature it is looking for is really present at that position. The multiple layers of the input here are called multiple input channels (channel-in). So in one convolution layer, we can not only set up filters of several sizes, with several convolution kernels per filter size, but also let the same convolution kernel take all the input channels into account every time!
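The "map every layer and sum" step is a one-line change to the sliding window above. A minimal sketch with made-up sizes (3 channels, a 5 * 5 image, a 3 * 3 window), where the "circle" lives only in the blue channel:

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """image: (C, H, W); kernel: (C, kh, kw).
    At each position the kernel looks at ALL input channels and
    sums their responses into a single output map."""
    C, H, W = image.shape
    _, kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the sum runs over the channel axis too
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * kernel)
    return out

rgb = np.zeros((3, 5, 5))
rgb[2, 2, 2] = 1.0                       # the feature appears only in the blue layer
kernel = np.ones((3, 3, 3))              # one kernel spanning all 3 channels
out = conv2d_multichannel(rgb, kernel)   # still a single (3, 3) response map
print(out[0, 0])                         # nonzero: the blue-only feature was not missed
```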

Let's consider an even more complex case!

Suppose our classification task has now changed! It has gotten harder! Now we want to directly recognize whether a picture shows a scene of cats and dogs!

[figure: a scene containing both cats and dogs]

In this case, we have many, many classifiers! We need to recognize cats, recognize dogs, recognize heads, recognize Band-Aids, and so on. How do we handle so many sub-classification tasks? Can't they be merged directly into one convolution layer?

Of course they can! Since we can have multiple input channels, we can of course also have multiple output channels (channel-out)! One output channel stands for one sub-classification task~ (And naturally, each sub-classification task gets its own filters and its own pile of convolution kernels.) (Of course, what exactly these sub-classification tasks are is unclear to humans; only the neural network knows.)

At this point, the definition of a complete convolution layer is done. To sum up: a convolution kernel takes in all the channel-in at each position; under one filter size we can set up several convolution kernels to extract different features; we can set up filters of several different sizes to control the granularity of feature extraction; and we can set up several output channels to represent several channel-out classification tasks. And just as in the fully connected feedforward network before, after the convolution kernel finishes its mapping (i.e., the linear mapping), remember to feed the result through the activation function, oh~

Therefore, for images, a convolution layer is split into filters of several sizes, and each filter size corresponds to a 4D tensor of parameters of shape channel-in * width * height * channel-out, where the width and height are the width and height of the convolution window~ A convolution kernel with such a two-dimensional window is called a 2D convolution kernel; by the same logic, a 3D convolution has a three-dimensional window, and its parameters form a 5D tensor.
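The bookkeeping above is easy to make concrete. A sketch with hypothetical sizes (3 input channels, 20 * 20 windows, 8 output channels); note the axis order channel-in * height * width * channel-out follows this text, while real frameworks often order the axes differently:

```python
import numpy as np

# One filter size in a convolution layer, as described above:
# a 4D parameter tensor of shape (channel_in, height, width, channel_out).
channel_in, height, width, channel_out = 3, 20, 20, 8

weights = np.random.randn(channel_in, height, width, channel_out)
print(weights.shape)  # (3, 20, 20, 8)
print(weights.size)   # 3 * 20 * 20 * 8 = 9600 parameters for this filter size
```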

Think about it: is there any problem left to solve?


Everything we have discussed so far concerns a convolution kernel's operation at a single position! Clearly, once the kernel has slid over the entire input, it produces a great many outputs at all sorts of positions. With so many outputs, how do we choose?

Imagine we are looking for a cat. Then no matter whether the cat is in the top-left corner of the picture, the bottom-right corner, or covers the whole picture, we still say the image contains a cat! Most of the time we aren't interested in its position, only in whether the thing is in the picture at all. So we only need to take the strongest output among all the positions the kernel produced and check whether it is strong enough, simply discarding the outputs at all other positions~ This operation is called max pooling (max-pooling)! This is also why max pooling is the most effective pooling method in most cases.
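The "keep only the strongest response" idea can be sketched directly. Below, a minimal non-overlapping 2 * 2 max pooling over a made-up 4 * 4 response map; the sizes are illustrative:

```python
import numpy as np

def global_max_pool(feature_map):
    # keep only the single strongest response, wherever the feature appeared
    return feature_map.max()

def max_pool2d(feature_map, size=2):
    """Local max pooling with a non-overlapping size x size window."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    trimmed = feature_map[:H2 * size, :W2 * size]
    # group pixels into size x size blocks, then take the max of each block
    return trimmed.reshape(H2, size, W2, size).max(axis=(1, 3))

fm = np.array([[1., 3., 0., 2.],
               [4., 2., 1., 0.],
               [0., 1., 5., 2.],
               [2., 0., 1., 3.]])
print(max_pool2d(fm))       # [[4. 2.] [2. 5.]]
print(global_max_pool(fm))  # 5.0
```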

Obviously, besides taking the maximum, there are surely scenarios that call for other, more suitable ways of handling a kernel's outputs over all positions. These methods are collectively called pooling.

Besides max pooling, there are also average pooling (average-pooling), taking the top n maxima (max-n pooling), and so on; you can guess what they do from their names, so I won't drone on~ Of course, if the pooling runs over all positions it is called global pooling, which amounts to extracting one global feature / category. If we restrict the pooling operation to a window (i.e., a pooling kernel), it is local pooling, and what we get are local features / categories~
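The global variants mentioned above are one-liners; a tiny sketch on a made-up 2 * 2 response map:

```python
import numpy as np

fm = np.array([[1., 3.],
               [4., 2.]])

# global average pooling: one averaged global feature
print(fm.mean())  # 2.5

# global max pooling: "does the feature appear anywhere at all?"
print(fm.max())   # 4.0
```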

Oh right, one more thing to point out here: pooling is obviously done right after the convolution layer (well, since convolution always comes with an activation function to guarantee the model's nonlinearity, if we count the activation function as its own layer, then the pooling layer sits after the activation layer).

There~ that wraps up convolutional neural networks: convolution - activation - pooling, it's that simple. Of course, as said before, if the pooling result is still a hidden category (i.e., the pooling output isn't yet our final classification), then just as before, it can serve as a feature for the next layer. So after the pooling layer we can of course start a new round of convolution - activation - pooling, and that is how a neural network that is deep in the true sense is formed.
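Stacking the three steps is just function composition. A minimal sketch of two rounds of convolution - activation - pooling on a made-up 12 * 12 input with random 3 * 3 kernels (one channel, one kernel per round, purely to show the shapes shrinking):

```python
import numpy as np

def conv2d(image, kernel):
    # stride-1, no-padding sliding window, as earlier in the article
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # the activation step that keeps the model nonlinear
    return np.maximum(x, 0)

def max_pool2d(x, size=2):
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

x = np.random.randn(12, 12)
k1, k2 = np.random.randn(3, 3), np.random.randn(3, 3)

# round 1: convolution -> activation -> pooling
h = max_pool2d(relu(conv2d(x, k1)))  # (12,12) -> (10,10) -> (5,5)
# round 2: the pooled output becomes the next round's features
y = max_pool2d(relu(conv2d(h, k2)))  # (5,5) -> (3,3) -> (1,1)
print(h.shape, y.shape)              # (5, 5) (1, 1)
```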


Origin blog.csdn.net/xixiaoyaoww/article/details/104553510