A Wonderful Visualization: What Is Convolution in Deep Learning?

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. Please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/jayandchuxu/article/details/100590657

In recent years, with the emergence of powerful, versatile deep learning frameworks, adding a convolutional layer to a deep learning model has become trivially easy, often just a single line of code. But do you really understand what "convolution" is? When beginners first encounter the word, with terms like kernel and channel stacked on top of it, they often get confused. As a concept, "convolution" is complex and multi-layered.

In this article, we will break down the mechanics of the convolution operation, gradually relate it to the standard neural network, and explore how convolutional layers build up a strong visual hierarchy, ultimately becoming powerful image feature extractors.

2D Convolution: The Operation

2D convolution is a fairly simple operation: we start with a small weight matrix, the convolution kernel, and let it gradually "scan" over the two-dimensional input data. As the kernel slides, it multiplies its weights element-wise with the patch of input it currently covers, then sums the result into a single output pixel.

Standard convolution
The kernel repeats this operation at every position it slides over, transforming one two-dimensional feature matrix into another. In short, each output feature is essentially a weighted sum of input features (the weights being the kernel values), drawn from input pixels located at roughly the same position as the output pixel.
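
To make the mechanics concrete, here is a minimal NumPy sketch of the operation just described (stride 1, no padding; the function name conv2d and the toy arrays are illustrative, not from the article):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`; each output pixel is the weighted
    sum of the input patch currently under the kernel."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1           # output shrinks without padding
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]   # local region "under" the kernel
            out[i, j] = np.sum(patch * kernel)  # weighted sum -> one output pixel
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # the 5x5 input from the example
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging kernel
print(conv2d(image, kernel).shape)                # (3, 3), as in the figure
```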

So why do the output features fall into the "same general area" as their inputs? It depends on the size of the convolution kernel. The kernel size directly determines how many input features are combined into each output feature: the smaller the kernel, the closer the correspondence between input and output positions; the larger the kernel, the farther apart they can be.

This is very different from a fully connected layer. In the example above, we have 5 × 5 = 25 input features and 3 × 3 = 9 output features. If this were a fully connected layer, the weight matrix would contain 25 × 9 = 225 parameters, with every output feature being a weighted sum of every input feature.
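
To check the counting, a quick PyTorch sketch (assuming a single input and output channel for the convolution, so the shapes mirror the example above):

```python
import torch.nn as nn

dense = nn.Linear(25, 9)               # fully connected: 25 inputs -> 9 outputs
conv = nn.Conv2d(1, 1, kernel_size=3)  # one 3x3 convolution kernel

print(dense.weight.numel())  # 225 weights (plus 9 biases)
print(conv.weight.numel())   # 9 weights (plus 1 bias)
```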

The convolution, by contrast, transforms each output feature using only 9 parameters. What it cares about is not what every individual feature is, but what features are present at roughly which location. Keeping this in mind is important; it is the basis for the deeper discussion that follows.

Some commonly used techniques

Before we continue the discussion, let's look at two techniques that appear in convolutional neural networks all the time: padding and strides.

Padding

If you look carefully at the GIF above, you will notice that our 5 × 5 feature matrix was transformed into a 3 × 3 matrix: the edges of the input image were "trimmed" away, because pixels on the edge can never sit at the center of the kernel, and the kernel cannot extend beyond the edge. This is often not ideal, since we usually want the output size to match the input size.

Padding
Padding solves this problem: it fills the edges with extra "fake" pixels (usually of value 0). That way, when the kernel scans the input data, it can extend over the dummy pixels beyond the edge, making the output the same size as the input.
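
In most frameworks padding is a single argument. A short PyTorch sketch (one ring of zero padding keeps a 5×5 input at 5×5 under a 3×3 kernel):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)                        # one single-channel 5x5 input
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # pad edges with one ring of zeros
valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # no padding: edges get trimmed

print(same(x).shape)   # torch.Size([1, 1, 5, 5]) -> size preserved
print(valid(x).shape)  # torch.Size([1, 1, 3, 3]) -> size shrinks
```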

Stride

If the role of padding is to make the output match the height and width of the input, then sometimes in a convolutional layer we want the opposite: an output smaller than the input. How do we do that? This is actually a common need in convolutional neural networks: as the number of channels grows, we need to reduce the spatial dimensions of the features. There are two ways to achieve this; the first is to use a pooling layer, the second is to use a stride.


Stride
When sliding the kernel, we start from the top-left corner of the input and move one step at a time, either to the right or down to the next row, computing one output per position. The number of rows or columns moved per step is called the stride: in the figures of the previous section, stride = 1; in the figure above, stride = 2. The effect of stride on size is multiplicative rather than a fixed subtraction: with a stride of 2, the output is half the size of the input; with a stride of 3, the output is one third of the input, and so on.
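
The general output size follows the standard formula out = ⌊(in + 2·padding − kernel) / stride⌋ + 1. A short sketch of the halving effect described above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)

# (8 + 2*1 - 3) // 2 + 1 = 4: stride 2 halves each spatial side.
print(conv_s2(x).shape)  # torch.Size([1, 1, 4, 4])
```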

Some of the more advanced modern network architectures, such as ResNet, have chosen to use fewer pooling layers, opting for strided convolutions when the output needs to shrink.

Multi-channel convolution

Of course, the examples above contain only one input channel. In practice, most input images have three RGB channels, and going deeper into the network means increasing the number of channels. A helpful way to think about channels is as different "views" of the whole image: each may ignore certain features, but it will also emphasize others.

Color images generally have red, green, and blue channels
Here we must distinguish between two terms: "convolution kernel" and "filter". When there is only one channel, "kernel" and "filter" are equivalent and the two words are interchangeable; in general, however, they are two different concepts. **Each "filter" is in fact a collection of "kernels"**: in the current layer, each input channel has its own kernel, and every kernel is unique.


Filter: a set of convolution kernels
Each filter in a convolutional layer has one and only one output channel. As the filter's individual kernels slide over the input data, each produces a differently processed version of its channel. Some kernels may have larger weights than others, so the corresponding channel gets more emphasis (for example, if the red channel's kernel has larger weights, the filter will care more about differences in that channel).

Once each kernel has processed its channel, we have three processed versions of the data; the filter then adds them together to form a single overall channel. In short, the kernels process the individual channels, and the filter as a whole combines them into one overall output.

Finally, there is the bias term. We all know that a bias is added after each filter's output, but why at this position? If you recall the role of the filter, it is not hard to understand: only here can the bias act together with the filter to produce the final output channel.

The above is the case of a single filter, but multiple filters work in exactly the same way: each filter processes the input with its own set of kernels to form a single-channel output, and after adding the bias term we get one final single-channel output per filter. With multiple filters, we stack these single-channel outputs into one overall output whose channel count equals the number of filters. After a nonlinearity, this overall output is fed as the input into the next convolutional layer, and the whole process repeats.
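
The weight shapes in a framework like PyTorch make this structure explicit; a small sketch (the layer sizes here are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)  # one RGB image

print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): 16 filters, each a set of 3 kernels
print(conv.bias.shape)    # torch.Size([16]): one bias per filter / output channel
print(conv(x).shape)      # torch.Size([1, 16, 224, 224]): channel count equals filter count
```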

2D Convolution: The Intuition

Convolution is still a linear transform
Although we have explained the mechanics of a convolutional layer, it is still hard to relate it back to a standard feedforward network. Likewise, we still cannot explain why convolution scales so well, or why it works so well on image data.

Suppose we have a 4×4 input, and the goal is to turn it into a 2×2 output. If we used a feedforward network, we would reshape the 4×4 input into a vector of length 16 and feed those 16 values into a densely connected layer with 4 outputs. Below is the weight matrix W of such a layer:

In summary, there are 64 parameters
Although the kernel-based convolution operation may look odd at first, it is still a linear transform with an equivalent transformation matrix. If we apply a kernel K of size 3 to the reshaped 4×4 input, the equivalent matrix becomes:

Here there are really only 9 parameters

Note: although the matrix above is an equivalent transformation matrix, the actual operation is usually implemented differently.

Notice that the whole convolution is still a linear transform, yet at the same time a very different kind of transform. Compared with the feedforward network's 64 parameters, the convolution's 9 parameters are reused many times over. And because the weight matrix contains a large number of zeros, each output node sees only a selected subset of the inputs (the inputs inside the kernel's window).
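
Here is a sketch that builds this equivalent matrix for an arbitrary 3×3 kernel K over a 4×4 input and verifies it against the direct sliding-window computation (the function name conv_matrix is illustrative, not from the article):

```python
import numpy as np

def conv_matrix(kernel, in_h, in_w):
    """Equivalent dense matrix of a stride-1, no-padding 2D convolution."""
    kh, kw = kernel.shape
    oh, ow = in_h - kh + 1, in_w - kw + 1
    M = np.zeros((oh * ow, in_h * in_w))  # 4 x 16 for the 4x4 -> 2x2 case
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    # each row holds the same 9 kernel values; the rest stay 0
                    M[i * ow + j, (i + a) * in_w + (j + b)] = kernel[a, b]
    return M

K = np.arange(1, 10, dtype=float).reshape(3, 3)
x = np.arange(16, dtype=float).reshape(4, 4)

M = conv_matrix(K, 4, 4)
direct = np.array([[np.sum(x[i:i+3, j:j+3] * K) for j in range(2)] for i in range(2)])
print(np.allclose(M @ x.ravel(), direct.ravel()))  # True: the same linear transform
```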

More efficiently still, the predefined structure of convolution can be viewed as a prior on the weight matrix. The kernel size and the number of filters are network parameters that can be defined ahead of time. When we use a pretrained model for image classification, we can take its pretrained network parameters as our own and train our feature extractor on top of them, which saves a great deal of time.

In this sense, even though both are linear transforms, the advantage of convolution over a feedforward network can now be explained. Unlike random initialization, using pretrained parameters lets us optimize only the parameters of the final fully connected layer, which means better performance; and the drastic cut in parameter count means higher efficiency.
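
A minimal sketch of this workflow with torchvision (assuming a ResNet-50 backbone and a hypothetical num_classes for your own task; the exact weights argument varies across torchvision versions):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical: the class count of your own task

# Load convolutional features pretrained on ImageNet and freeze them.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer; only its parameters get trained.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```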

In the figures above we only showed 64 independent parameters being reduced to 9 shared ones, but the gap is far larger in practice: moving from MNIST's 784 inputs to a realistic 224×224×3 image means more than 150,000 inputs, and a comparable fully connected layer would need more than 10 billion parameters. By contrast, the whole of ResNet-50 has only about 25 million parameters.

So fixing some parameters to 0 greatly improves efficiency. But unlike with transfer learning, how can we tell that these priors will actually have a positive effect?

The answer lies in the feature combinations that the priors guide the parameters to learn.

Locality

At the start of the article, we discussed the following points:

  • A kernel combines only a few pixels from a local region to form each output; that is, an output feature represents the input features of only that small local region.
  • The kernel generates the output matrix only after it has "scanned" the entire image.

So as backpropagation flows backward from the classification nodes, the kernel keeps adjusting its weights, working to extract useful features from a set of local inputs. Moreover, because the kernel itself is applied across the entire image, whatever features it learns, from whichever region, must be general enough to apply everywhere.

If this were any other kind of data, say application install serial numbers, this behavior of convolution would be useless: serial numbers are ordered sequences of digits, but they share no information and have no underlying relationship with one another. In images, however, pixels always appear in a consistent order and always influence their neighbors: if all the nearby pixels are red, our target pixel is very likely red too. And if that pixel turns out to deviate and not be red, that interesting point may itself be turned into a feature.

Learning features by comparing a pixel with its neighbors is, in fact, the basis of many early computer vision feature extraction methods. For edge detection, for example, we can use a Sobel operator:

The Sobel operator for vertical edge detection

For a patch without edges (such as sky), most pixels have the same value, so the kernel's total output there is 0. For a patch with a vertical edge, pixels to the left and right of the edge differ, so the kernel's output is nonzero, activating the edge region. Although the kernel only sees a 3×3 region at a time, once it has scanned the whole image, it is able to detect that feature anywhere in the image.
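
A small SciPy sketch of exactly this behavior on a synthetic image with one vertical edge (the image and shapes here are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernel for vertical edges: compares pixels left vs. right of center.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Synthetic image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 8))
img[:, 4:] = 1.0

# convolve2d performs true convolution (it flips the kernel); flip it back
# to get the correlation-style scan shown in the figures above.
response = convolve2d(img, sobel_x[::-1, ::-1], mode='valid')
print(response)  # nonzero only in the columns straddling the edge, 0 elsewhere
```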

So what distinguishes deep learning from this traditional approach? For early image processing we really could use low-level feature detectors to find the lines and edges in a picture. The question is: can convolution learn the role of the Sobel operator on its own?

One branch of deep learning research studies the interpretability of neural network models, and among its most powerful tools is feature visualization through optimization. The idea is simple: optimize an image so that it activates a filter as strongly as possible. This makes intuitive sense: if the optimized image ends up filled entirely with edges, that is strong evidence the filter itself is looking for edge features and being activated by them.
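
A minimal sketch of that core idea, gradient ascent on the input image (assumptions: the conv layer here is a randomly initialized stand-in, whereas a real experiment would take a layer from a trained model, and published visualizations add regularization and transformations on top of this):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stand-in for a trained layer

# Start from a random image and ascend the gradient of one channel's mean
# activation, so the image drifts toward what that channel responds to.
img = torch.randn(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.1)
channel = 0

for _ in range(100):
    optimizer.zero_grad()
    activation = conv(img)[0, channel]
    loss = -activation.mean()  # maximize activation = minimize its negative
    loss.backward()
    optimizer.step()
```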

Feature visualizations of 3 different channels from GoogLeNet's first convolutional layer; note that although they detect different kinds of edges, they are all still low-level edge detectors
Feature visualizations of 12 channels from GoogLeNet's second and third convolutional layers
One thing to note here: convolved images are still images. The kernel starts sliding from the top-left of the image, and correspondingly its output still sits in the top-left. So we can convolve on top of this convolutional layer to extract and visualize deeper features.

However, no matter how deep our feature detectors get, without any further change they still operate on very small image patches. However deep the detector is, if its size is only 3×3, it cannot possibly detect a complete face. This is the problem of the receptive field.

Receptive field

Whatever the CNN architecture, its basic design keeps compressing the image's height and width while increasing the number of channels, that is, the depth. As mentioned before, this is achieved through pooling and strides. Locality governs which region of the adjacent layer's input an output looks at; the receptive field determines which region of the network's original input it sees.

The idea behind strided convolution is that we slide the kernel only at fixed intervals, skipping the positions in between.

As shown above, adjusting the stride to 2 greatly shrinks the convolution's output. Now, if we apply a nonlinear activation to this output and then stack another convolutional layer on top, something interesting happens: a 3×3 kernel applied to the strided convolution's output has a larger receptive field than one applied to a normal convolution's output.

This is because each region of its input already corresponds to a larger region of the original input than in a normal convolution, and this affects all subsequent feature extraction.
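
A small helper (hypothetical, not from the article) that tracks this growth using the standard receptive-field recurrence, rf ← rf + (kernel − 1) · jump and jump ← jump · stride:

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers, given (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two plain 3x3 convs see a 5x5 region of the original input...
print(receptive_field([(3, 1), (3, 1)]))  # 5
# ...but a stride-2 conv followed by a 3x3 conv sees 7x7.
print(receptive_field([(3, 2), (3, 1)]))  # 7
```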

This expansion of the receptive field allows convolutional layers to combine low-level features (lines, edges) into higher-level features (curves, textures), as we saw in the mixed3a layer above. And as we add more convolutional and stride layers, the network exhibits ever higher-level features, as in mixed4a and mixed5a.

By detecting low-level features and using them to detect higher-level ones, the network moves up the visual hierarchy until it can eventually detect entire visual concepts, such as faces, birds, and trees. This is one reason why convolution is so powerful and efficient on image data.

Conclusion

Today, CNNs have carried developers from building simple CV applications to powering complex technical products and services: they are both the small photo gallery tool that detects faces and the close clinical aide that helps physicians screen for cancer. They may be a key to the future of computer vision, and of course new breakthroughs may be just around the corner.

But in any case, one thing is certain: CNNs are at the core of many of today's most innovative applications, their results are truly remarkable, and the technology itself is well worth grasping and understanding.
