Deep Learning: Convolution

Filter (convolution kernel)

The traditional image filter operators are as follows:

  • blur kernel: Reduces the difference between adjacent pixels, smoothing the image.
  • sobel: Shows the difference between adjacent pixels along a specific direction (a directional gradient).
  • sharpen: Amplifies the difference between adjacent pixels, making the image look crisper.
  • outline: Also known as the edge kernel; pixels whose neighbours have similar brightness are set to black, and pixels that differ strongly from their neighbours are set to white.

For more, see the image-kernels online demo of different convolutional filters.
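As a rough sketch (not from the original article), the kernels above can be applied to a grayscale image stored as a 2D numpy array; the kernel values below are the commonly used variants:

```python
import numpy as np
from scipy.signal import convolve2d

# Classic hand-designed kernels (commonly used values; other variants exist)
kernels = {
    "blur":    np.full((3, 3), 1 / 9.0),                            # box blur: average of neighbours
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),      # horizontal gradient
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]),
    "outline": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]), # edge kernel
}

image = np.random.rand(64, 64)  # stand-in for a grayscale image
for name, k in kernels.items():
    out = convolve2d(image, k, mode="same", boundary="symm")
    print(name, out.shape)
```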

CNN convolutional layers

A CNN does not fix the filters in advance; it treats the filter weights as parameters and learns suitable filters during training. The first convolutional layer usually ends up performing edge detection, while later layers extract more abstract features. An important property of CNN convolutional layers is weight sharing.

Weight sharing: different receptive fields share the same set of weights, which is why it is called a filter. This greatly reduces the number of weights (and the memory they occupy), and it usually works well because a filter detects a feature regardless of its spatial location. But consider face images: faces are usually centred in the picture, so here spatial location does carry useful information. In such cases the weight-sharing mechanism can be dropped, and the resulting layer is called a Locally-Connected Layer.
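A back-of-the-envelope comparison, with hypothetical layer sizes, of how many weights sharing saves compared with a locally-connected layer:

```python
# Hypothetical layer: input depth 20, 3x3 receptive field, 96 output channels,
# output spatial size 32x32.
c_in, k, c_out, h_out, w_out = 20, 3, 96, 32, 32

shared = c_out * c_in * k * k                  # conv layer: one filter bank reused at every position
local  = h_out * w_out * c_out * c_in * k * k  # locally-connected: separate weights per position

print(shared)   # 17,280 weights
print(local)    # 17,694,720 weights (~1000x more)
```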

Sometimes only the height and width of the filter are given, without the depth; the depth is then the full depth of the input volume (which is why a 1x1 convolution kernel is still meaningful). For example, if the previous layer's output is [16x16x20] and the receptive field is 3x3, then each neuron in the convolutional layer has 3x3x20 = 180 connections to the previous layer.
If the input depth is specified as, say, 96, then each neuron has 3x3x96 connections to the previous layer; the 96 channels connect to the same spatial area, but with different weights. (Note that sometimes the three RGB channels are treated as a whole, so the count is also multiplied by 3.)
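For the [16x16x20] example, a quick sketch in PyTorch (the framework choice is an assumption, not something the text prescribes) showing the filter weight shape and the 180 connections per neuron:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=20, out_channels=96, kernel_size=3)  # 3x3 receptive field
print(conv.weight.shape)       # torch.Size([96, 20, 3, 3])
print(conv.weight[0].numel())  # 180 weights per output neuron (3*3*20), plus 1 bias

x = torch.randn(1, 20, 16, 16)  # previous layer's output: [16x16x20]
print(conv(x).shape)            # torch.Size([1, 96, 14, 14]) with no padding
```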

Small convolution kernel

Most popular network architectures now follow the small-kernel design principle. The advantages of small convolution kernels:
stacking three 3x3 convolutions covers the same receptive field as one 7x7 convolution, but with fewer parameters, less computation, and more nonlinearities in between. Computation can be reduced further by inserting 1x1 "bottleneck" convolutions (GoogLeNet and others use this trick to build deeper networks). If an input of size HxWxC goes through the following steps, the output dimensions remain unchanged:
\[ \require{AMScd} \begin{CD} H\times W\times C @>{\text{C/2 x Conv1x1}}>> H\times W\times C/2 \\ @. @V { \text{C/2 x Conv3x3}} VV \\ H\times W\times C @< \text{C x Conv1x1} << H\times W\times C/2 \end{CD} \]
However, the pipeline above still uses a 3x3 kernel; this can be further factorized into a 1x3 convolution followed by a 3x1 convolution.
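A small sketch, assuming C = 256 channels, comparing the parameter count of a plain 3x3 convolution with the 1x1 → 3x3 → 1x1 bottleneck shown in the diagram:

```python
import torch.nn as nn

C = 256

plain = nn.Conv2d(C, C, kernel_size=3, padding=1)  # 3x3 convolution on the full depth

bottleneck = nn.Sequential(
    nn.Conv2d(C, C // 2, kernel_size=1),                  # C/2 x Conv1x1: reduce depth
    nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1),  # C/2 x Conv3x3 on the reduced depth
    nn.Conv2d(C // 2, C, kernel_size=1),                  # C x Conv1x1: restore depth
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(plain))       # ~590k parameters
print(count(bottleneck))  # ~213k parameters, same input/output shape
```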

Dilated convolutions. As the kernel slides over the input, gaps can be left between the input elements it samples; the gap size (dilation) is a hyperparameter. The effect is that the receptive field grows faster with fewer layers, so spatial information from a wider context is captured sooner.
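A minimal sketch of the dilation hyperparameter in PyTorch; a 3x3 kernel with dilation 2 covers a 5x5 region of the input using the same nine weights:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

dense   = nn.Conv2d(1, 1, kernel_size=3, dilation=1)  # ordinary 3x3, sees a 3x3 patch
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # same 9 weights, sampled with gaps -> 5x5 effective field

print(dense(x).shape)    # torch.Size([1, 1, 14, 14])
print(dilated(x).shape)  # torch.Size([1, 1, 12, 12]): effective kernel size is 5
```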

Convolution implementation

There are three main ways of computing convolution: conversion to matrix multiplication, Winograd, and FFT.

Modern DL frameworks usually compute convolution as a matrix multiplication: the im2col operation unfolds the input data (and the weights) into two-dimensional matrices, so that image patches and convolution kernels can be multiplied directly (the inverse of this unfolding is col2im), and the multiplication is done with an efficient BLAS routine. The drawback is that it uses a lot of memory. The same idea can also be applied to pooling operations.
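A naive im2col sketch in numpy (single channel, stride 1, no padding) that checks the GEMM result against a direct sliding-window computation (cross-correlation, as DL frameworks use); real frameworks do the same over batches and channels:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold every kh x kw patch of a 2D array into a column."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.random.rand(6, 6)
k = np.random.rand(3, 3)

# Convolution as one matrix multiplication: (1, 9) @ (9, 16) -> (16,) -> (4, 4)
out_gemm = (k.ravel() @ im2col(x, 3, 3)).reshape(4, 4)

# Direct sliding-window computation for comparison
out_direct = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(4)] for i in range(4)])

print(np.allclose(out_gemm, out_direct))  # True
```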

Why FFT is not widely used:
FFT only has a significant speed advantage when the convolution kernel is relatively large, but CNN kernels are generally no larger than 5x5, so FFT is rarely used in deep learning. On general-purpose hardware the more suitable choice is Winograd convolution, while on dedicated hardware it is more effective to simply reduce the arithmetic precision. However, as more and more 1×1 and depthwise convolutions appear in CNNs, the value of Winograd convolution is also shrinking. [1]

The following is a brief introduction to the fast Fourier transform.

Convolution theorem [2][3]

The Fast Fourier Transform is known as one of the most important algorithms of the 20th century, and one reason is the convolution theorem.
The Fourier transform can be seen as reorganizing data such as images or audio from the time/space domain into the frequency domain; a complicated convolution in the time/space domain corresponds to a simple element-wise product in the frequency domain.
Convolution of two continuous functions over a one-dimensional continuous domain:

\[ h(x)=f\bigotimes g=\int_{-\infty}^\infty f(x-u)g(u)du=\mathcal F^{-1}(\sqrt{2\pi}\mathcal F[f]\mathcal F[g]) \]

By the convolution theorem, convolving two matrices is equivalent to taking the Fourier transform of both ( \(\mathcal F\) ), multiplying them element-wise, and then applying the inverse Fourier transform ( \(\mathcal F^{-1}\) ). \(\sqrt{2\pi}\) is a normalization factor.

Convolution on a 2D discrete domain (image):

\[ \begin{align} \text{feature map}(a,b)=&(\text{input}\bigotimes\text{kernel})(a,b) \\ =&\sum_{y=0}^M \sum_{x=0}^N \text{input}(a-x,b-y)\cdot \text{kernel}(x,y) \\ \text{feature map}=&\mathcal F^{-1}(\sqrt{2\pi}\mathcal F[\text{input}]\mathcal F[\text{kernel}]) \end{align} \]
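A quick numerical check of the convolution theorem in the 1D discrete case (the signals are zero-padded to the full output length, and no \(\sqrt{2\pi}\) factor appears in the discrete convention numpy uses):

```python
import numpy as np

f = np.random.rand(64)  # signal
g = np.random.rand(5)   # kernel

n = len(f) + len(g) - 1                    # pad so circular convolution equals linear convolution
direct = np.convolve(f, g)                 # direct convolution
via_fft = np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(g, n)).real  # element-wise product in frequency domain

print(np.allclose(direct, via_fft))        # True
```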

The Fast Fourier Transform is an algorithm that converts data from the time or spatial domain into the frequency domain; the Fourier transform represents the original function as a sum of sine and cosine waves. Note that the Fourier transform generally involves complex numbers: a real-valued input is transformed into complex values with real and imaginary parts. The imaginary part is mainly needed when transforming from the frequency domain back to the time/space domain.
[Figure: Fourier transform diagram [4]]

Orientation information in an image can be seen in its Fourier transform:
[Figure: FFT of an image reveals orientation information. Images by Fisher & Koryllos (1998).]

Convolution implementation in Caffe

The schematic diagram of the convolution operation is as follows. The input image has dimensions [c0, h0, w0]; the convolution kernel has dimensions [c1, c0, hk, wk] (c0 is not shown in the figure), i.e. it can be seen as c1 three-dimensional filters each of size [c0, hk, wk]; the output features have dimensions [c1, h1, w1].
[Figure: conv — schematic of the convolution operation]
The convolution is then computed efficiently as a two-dimensional matrix multiplication:
[Figure: im2col]
A more detailed im2col diagram:
[Figure: im2col-detail]
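To make the GEMM shapes concrete, here is a sketch using PyTorch's nn.Unfold (an im2col operation) with hypothetical sizes c0 = 3, hk = wk = 3, c1 = 8:

```python
import torch
import torch.nn as nn

c0, h0, w0 = 3, 32, 32  # input [c0, h0, w0]
c1, hk, wk = 8, 3, 3    # kernel [c1, c0, hk, wk]

x = torch.randn(1, c0, h0, w0)
weight = torch.randn(c1, c0, hk, wk)

cols = nn.Unfold(kernel_size=(hk, wk))(x)        # im2col: (1, c0*hk*wk, h1*w1) = (1, 27, 900)
out = weight.view(c1, -1) @ cols                 # GEMM: (c1, c0*hk*wk) x (c0*hk*wk, h1*w1)
out = out.view(1, c1, h0 - hk + 1, w0 - wk + 1)  # output features [c1, h1, w1]

print(torch.allclose(out, nn.functional.conv2d(x, weight), atol=1e-4))  # True
```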
