[Personal Notes | Convolution Bundle | Work in Progress]

【Reading list】

Read all 20 types of convolutions in deep learning in one article (CVHub): introduction to principles, advantages, and disadvantages (conceptual)

Summary of convolution (ordinary convolution, transposed convolution, dilated convolution, grouped convolution and depthwise separable convolution)

[Classic convolution]

Read all 20 types of convolutions in deep learning in one article (with source code compilation and paper interpretation) - Zhihu (zhihu.com)

1. Vanilla Convolution

It consists of a set of convolution kernels with a fixed window size and learnable parameters, which can be used to extract features.

Sparse connectivity: each output unit connects only to a local window of the input, as opposed to the dense all-to-all connectivity of fully connected layers

Shared weights : convolution kernel parameter sharing

Translation invariance: when the target in the image shifts, the network can still output a consistent result for the same source image. For image classification tasks we want CNNs to be translation invariant (in practice this property is fragile and has to be learned from sufficiently varied data)

Translation equivariance: when the input shifts, the output of the network shifts accordingly. This property is better suited to tasks such as object detection and semantic segmentation.
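A minimal PyTorch sketch of these properties (assuming torch is installed; the layer sizes are illustrative): it shows weight sharing via the parameter count and checks translation equivariance by shifting the input.

import torch
import torch.nn as nn

# a single 3x3 convolution layer with 16 learnable kernels shared across all positions
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 16*3*3*3 + 16 = 448 parameters, reused everywhere

x = torch.randn(1, 3, 32, 32)                      # one 3-channel 32x32 image
y = conv(x)
print(y.shape)                                     # -> torch.Size([1, 16, 32, 32])

# translation equivariance: shifting the input shifts the output accordingly
x_shift = torch.roll(x, shifts=2, dims=-1)         # shift the image 2 pixels along the width
y_shift = conv(x_shift)
# compare away from the borders, where padding / wrap-around makes columns differ
print(torch.allclose(y_shift[..., 3:-1], torch.roll(y, 2, dims=-1)[..., 3:-1], atol=1e-5))   # True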

Supplement:

Feedforward Neural Networks

A feedforward neural network is an artificial neural network that processes information in a single direction, from the input layer to the output layer, without any cycles or feedback connections. It is called "feedforward" because information flows forward through the network, with each layer processing the output of the previous one.

In a feedforward neural network, the input data first passes through the input layer, which usually contains one neuron per feature of the input data. The output of the input layer then flows through one or more hidden layers, where each neuron computes a weighted sum of its inputs and applies an activation function. The output of each hidden layer is passed on to the next layer until the output layer is reached; the output layer contains the neurons that produce the network's final output.

The weights and biases of neurons in a feedforward neural network are learned during the training phase, typically using backpropagation, which adjusts the weights and biases based on the difference between the network's predicted and actual outputs. Once the network is trained, it can make predictions on new data by passing the data into the network and using the output produced by the output layer.

【Convolution Series】Plain Convolution (qq.com)

From the perspective of function mapping, convolution linearly transforms each position of the image and maps it to a new value, so stacking convolution layers is simply composing these mappings layer by layer into one complex function. Training the network is really learning the weights required for each local mapping, and the trained result can be seen as the network fitting the input distribution.

From the perspective of template matching, the convolution kernel can essentially be regarded as a pattern template. The convolution operation computes how well each position in the image matches the template (the kernel); the better the match, the stronger the response (activation value). Extracting more discriminative features is therefore the key to the convolution operation. However, plain convolution cannot effectively suppress background noise, which is why attention-like mechanisms are introduced to enhance network performance.

2. Group convolution

Split the network to enable parallel computation on multiple GPUs

Reduces the number of parameters, improves training efficiency (distributed resources / parallel computation), and improves generalization (the decoupling between groups acts as a form of regularization)

Disadvantages: in the original implementation of group convolution, the features of different channels are divided into separate groups and are not fused until the end of the network, so the intermediate layers clearly lack information exchange across groups (bearing in mind that different filters extract different features). To solve this, ShuffleNet [3] combines pointwise group convolution (PGC) with channel shuffle to achieve an efficient, lightweight mobile network design.
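A hedged PyTorch sketch of grouped convolution (torch assumed; channel counts illustrative), together with a toy channel_shuffle helper in the spirit of ShuffleNet:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# ordinary convolution: 64*64*3*3 = 36,864 weights
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
# grouped convolution with 4 groups: each group maps 16 -> 16 channels,
# so only 4 * 16*16*3*3 = 9,216 weights (a quarter of the ordinary conv)
gconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4, bias=False)
print(sum(p.numel() for p in conv.parameters()), sum(p.numel() for p in gconv.parameters()))

# channel shuffle, the ShuffleNet-style remedy for the missing cross-group interaction
def channel_shuffle(t, groups):
    n, c, h, w = t.shape
    t = t.view(n, groups, c // groups, h, w)   # split channels into groups
    t = t.transpose(1, 2).contiguous()         # interleave channels across groups
    return t.view(n, c, h, w)

y = channel_shuffle(gconv(x), groups=4)
print(y.shape)                                 # -> torch.Size([1, 64, 32, 32])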

 3. Transposed Convolution

Feature upsampling, feature visualization.
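For example, a rough PyTorch sketch (torch assumed) of 2x feature upsampling with a transposed convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                 # a low-resolution feature map

# kernel_size=4, stride=2, padding=1 is a common setting that exactly doubles H and W
up = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=4, stride=2, padding=1)
print(up(x).shape)                             # -> torch.Size([1, 32, 32, 32])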

4. 1x1 Convolution

Enhanced feature expression ability: a 1×1 convolution is essentially a learnable filter that can deepen the network without changing the spatial size of the feature map, and following it with a nonlinear activation function effectively strengthens the network's expressive power. (It changes the depth of the feature map by changing the number of output channels, since the depth of the output feature map is determined by the number of convolution kernels. A 1×1 convolution is not a fully connected layer: it is still a convolution operation, typically used to adjust the number of channels and help the network learn better representations, whereas a fully connected layer connects every input to every output and is commonly used for classification and regression.)

Dimensionality increase and reduction: a 1×1 convolution can raise or lower the channel dimension by increasing or decreasing the number of filters. Unlike a fully connected layer, convolution is based on weight sharing, so it effectively reduces the network's parameter count and computation. On the other hand, dimensionality reduction can be viewed as pruning redundant feature maps in the middle layers, reducing the sparsity of their weights and yielding a more compact network structure.

Cross-channel information interaction: similar to a multi-layer perceptron applied at each position, a 1×1 convolution is essentially a linear combination of the input feature maps, so cross-channel information interaction and fusion can easily be achieved with a 1×1 convolution.
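A small PyTorch sketch (torch assumed; 256 -> 64 channels is an arbitrary example) of a 1x1 convolution used for channel reduction and cross-channel mixing:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)

# 1x1 convolution for channel reduction: 256 -> 64, spatial size unchanged;
# each output value is a linear combination of the 256 channels at that position
reduce = nn.Conv2d(256, 64, kernel_size=1)
y = torch.relu(reduce(x))                              # nonlinearity after the 1x1 conv
print(y.shape)                                         # -> torch.Size([1, 64, 28, 28])
print(sum(p.numel() for p in reduce.parameters()))     # 256*64 + 64 = 16,448 parameters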

 5. Dilated Conv

A dilation rate is introduced on top of ordinary convolution to control the spacing between the sampling positions of the convolution kernel.

The purpose of dilated (atrous) convolution is to enlarge the network's receptive field without downsampling, i.e., without discarding spatial information, so that the network can capture contextual information over different ranges. This is beneficial for dense prediction tasks such as semantic segmentation. At the same time, multi-scale contextual information can easily be extracted by applying dilated convolutions with different rates.

Motivation: it was first applied to semantic segmentation (a dense prediction task that requires pixel-wise classification of the image). A major difficulty in semantic segmentation is how to efficiently extract multi-scale contextual information to cope with objects of different scales. CNN-based methods usually rely on a hierarchical feature representation, stacking convolution and pooling layers to extract features at different levels. However, the following problems remain: ① as network depth increases, computation and parameters grow sharply; ② too much downsampling means small, low-resolution objects cannot be reconstructed; ③ the upsampling and downsampling layers themselves have no learnable parameters; ④ the internal data structure and spatial hierarchy are severely lost. Dilated convolution was therefore proposed: it can effectively extract contextual information at different scales without reducing spatial resolution (enlarging the network's receptive field with the same kernel size).

Increase the receptive field: dilated convolution obtains a larger receptive field with the same number of kernel parameters, so for tasks that need more global semantic information or long-range dependencies such as speech and text, dilated convolution is worth trying. (Receptive field: the region of the input image that a neuron can perceive. Features extracted with convolution and pooling are locally connected and lack global correlation, so to capture global context the receptive field is usually enlarged by stacking deeper convolution and pooling structures. The larger the receptive field, the more global the extracted features; the smaller it is, the more sensitive the network is to local details.)

Representing multi-scale information: using convolutions with different dilation rates also captures multi-scale contextual semantics. Different dilation rates correspond to different receptive fields, which means the network can perceive targets of different sizes.

Disadvantages :

Hard to optimize (the receptive field is enlarged with the parameter count unchanged, but because the spatial resolution is kept high, it is often difficult to optimize in practice; speed is a frequent criticism, so for the real-time requirements of industry most applications are still based on FCN-like structures),

Gridding/checkerboard effect: when convolutions with the same dilation rate are stacked repeatedly, the continuity of information is lost. Because the kernel's sampling points are not contiguous, many pixels never take part in the computation at all, which is effectively wasted input (the blank cells in the gridding illustration). This is very unfriendly to dense prediction tasks such as semantic segmentation, especially for small targets (by design, dilated convolution strengthens the network's ability to capture long-range features, so it mainly helps with segmenting larger objects; for small objects the effect is often counterproductive. For crack segmentation, for example, cracks are usually thin, elongated strips, and increasing the dilation rate would seriously destroy the spatial coherence of the feature itself).

One solution is to ensure that the dilation rates of the stacked convolutions do not share a common divisor greater than 1, e.g., [1, 2, 5], so that the sampling pattern forms a zigzag (sawtooth) structure that covers all pixels.
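A minimal PyTorch sketch (torch assumed) of dilated convolution with the [1, 2, 5] rates mentioned above; padding is set equal to the dilation so the spatial resolution is preserved:

import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)

# 3x3 convolutions with dilation rates 1, 2, 5; effective kernel size = k + (k-1)*(d-1)
for d in (1, 2, 5):
    conv = nn.Conv2d(32, 32, kernel_size=3, padding=d, dilation=d)
    print(f"dilation={d}  effective kernel={3 + 2 * (d - 1)}  output={tuple(conv(x).shape)}")
# with padding=d the 64x64 resolution is preserved while the receptive field grows,
# which is exactly the point of dilated convolution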

6. Depthwise Separable Convolution (DSC)

Used to reduce the amount of network parameters and calculations to improve network operating efficiency

DSC first applies a single depthwise convolution filter to each input channel, and then uses pointwise convolution to combine the outputs of the different depthwise convolutions. DSC thus breaks an ordinary convolution into two stages: depthwise convolution and pointwise convolution. The depthwise convolution is used for filtering and the pointwise convolution for combining. This decomposition greatly reduces the network's parameter count and computation.

In depthwise convolution, each kernel is responsible for a single channel and convolves that channel independently in space. The number of output feature maps of the depthwise convolution therefore equals the number of input feature maps, so it cannot effectively expand the channel dimension on its own.

Because each feature map is convolved by only one filter, the feature information of different channels at the same spatial position is not combined, so pointwise convolution is added. Pointwise convolution is simply a 1×1 convolution that projects the output of the depthwise convolution channel-wise onto a new feature map.

Filter each channel first, then combine

Reduce the amount of parameters and calculations: Depthwise separable convolution divides the original convolution operation into two layers, one for filtering (depth convolution) and one for combination (point-wise convolution). This decomposition process can greatly reduce the number of parameters and calculations in the model .
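A rough PyTorch sketch (torch assumed; the 64 -> 128 channel sizes are chosen only for illustration) comparing the parameter counts of an ordinary convolution and a depthwise separable one:

import torch
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 64, 128, 3

# ordinary convolution: c_in * c_out * k * k weights
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# depthwise separable convolution = depthwise (filtering) + pointwise (combining)
depthwise = nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False)   # one k x k filter per channel
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)              # 1x1 conv mixes the channels
dsc = nn.Sequential(depthwise, pointwise)

print(count_params(standard))   # 64*128*3*3 = 73,728
print(count_params(dsc))        # 64*3*3 + 64*128 = 8,768, roughly 1/8 of the ordinary conv

x = torch.randn(1, c_in, 32, 32)
print(dsc(x).shape)             # -> torch.Size([1, 128, 32, 32])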

Reduced model capacity: depthwise separable convolution is often applied without an activation function in between. In addition, although it significantly reduces the computation of the model, it also significantly reduces the model's capacity, which leads to lower accuracy.

DSC is far superior to ordinary convolution in parameter count and computational efficiency; as a substitute for ordinary convolution, its biggest advantage is its very high computational efficiency, so using DSC to build lightweight models is common practice today. This efficiency, however, comes at the expense of accuracy. When building lightweight models with DSC, accuracy and computational efficiency must be balanced; trading a small amount of accuracy for a large gain in efficiency is usually the right choice.

 7. Spatially Separable Convolution

Convolution is performed separately along the spatial dimensions of the image (width and height).

Spatially separable convolution is rarely used widely in practice. One of the main reasons is that not all convolution kernels can be effectively split into two smaller kernels (to reduce the amount of computation).
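A small PyTorch sketch (torch assumed) using the classic Sobel kernel, which happens to be rank-1 and therefore factorizes exactly into a 3x1 and a 1x3 kernel:

import torch
import torch.nn.functional as F

# the 3x3 Sobel kernel is rank-1, so it factorizes exactly into a 3x1 and a 1x3 kernel
sobel = torch.tensor([[-1., 0., 1.],
                      [-2., 0., 2.],
                      [-1., 0., 1.]])
col = torch.tensor([[1.], [2.], [1.]])        # 3x1 kernel
row = torch.tensor([[-1., 0., 1.]])           # 1x3 kernel
print(torch.equal(sobel, col @ row))          # True: sobel = col * row

x = torch.randn(1, 1, 16, 16)
y_full = F.conv2d(x, sobel.view(1, 1, 3, 3))
y_sep = F.conv2d(F.conv2d(x, col.view(1, 1, 3, 1)), row.view(1, 1, 1, 3))
print(torch.allclose(y_full, y_sep, atol=1e-5))   # same result with 6 weights instead of 9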

8. Deformable convolution

It can be seen as a kind of self-attention over a local region (it is not an actual enlargement of the convolution kernel; rather, the pixels of the feature map are re-integrated before convolution, which achieves the effect of an enlarged kernel in disguise).

The unknown geometric transformations of the same object across different scenes and viewing angles are a major challenge for these tasks (ordinary convolution encodes feature information through a fixed geometric structure to build its receptive field; because object shapes are variable and irregular, this encoding struggles to capture an appropriate receptive field, which limits the network's expressive power). The usual remedies are either sufficient data augmentation, expanding the training samples so the model adapts to scale changes, or hand-designed features and algorithms that are invariant to geometric transformations, such as SIFT or sliding windows. Since the fixed geometric structure of traditional CNNs cannot effectively model unknown object deformations, deformable convolution was proposed to address this problem.

No increase in the number of sampling points: Capture object semantics more effectively and comprehensively without increasing the amount of additional parameters

Deformable convolution does not really learn a deformable kernel; instead, an additional convolution layer learns the corresponding offsets, which are then added to the pixels at the corresponding positions of the input feature map. However, the generated offsets are floating-point numbers while pixel indices must be integers, and directly rounding them would make backpropagation impossible, so the original paper uses bilinear interpolation to compute the corresponding pixel values indirectly.

Calculation process (four steps):

(1) Compute the relative offsets: a convolution applied to the input feature map learns an offset for each pixel. The 2N channels in the figure correspond to the N sampling points of the kernel, each with an offset in the x and y directions.

(2) Obtain the absolute offsets: add the relative offsets computed in step (1) to the original pixel indices to obtain the absolute (offset) indices into the input feature map.

(3) Pixel re-integration: round the absolute indices both down and up, combine them into four pairs of integer coordinates, and obtain the pixel value at the offset index by bilinear interpolation over these four neighbours.

(4) Encode the new pixels: once the pixel values at the offset indices have been computed, a new feature map is obtained; applying an ordinary convolution to it completes the deformable convolution.
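A hedged sketch of this process using torchvision.ops.DeformConv2d (assuming torchvision is installed; offset_pred is a hypothetical name for the parallel offset-predicting convolution), which performs the bilinear interpolation internally:

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d       # requires torchvision

in_c, out_c, k = 32, 64, 3
x = torch.randn(1, in_c, 28, 28)

# a parallel ordinary convolution predicts 2*k*k offsets per position
# (an x and a y offset for each of the k*k sampling points)
offset_pred = nn.Conv2d(in_c, 2 * k * k, kernel_size=3, padding=1)
nn.init.zeros_(offset_pred.weight)             # start from zero offsets, a common initialization
nn.init.zeros_(offset_pred.bias)

deform_conv = DeformConv2d(in_c, out_c, kernel_size=k, padding=1)
offset = offset_pred(x)                        # shape (1, 18, 28, 28)
y = deform_conv(x, offset)                     # fractional offsets are handled by bilinear interpolation
print(y.shape)                                 # -> torch.Size([1, 64, 28, 28])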

This method uses a clever idea to implement feature extraction with a deformable geometric structure: a parallel branch learns the offsets, letting the convolution kernel sample the input feature map divergently, so the network can focus on the core of the target and its ability to model object deformation improves. Deformable convolution does not enlarge the kernel itself; it re-integrates the pixels of the feature map before convolution, achieving the effect of an enlarged kernel in disguise.

It can improve the generalization ability of feature extraction to a certain extent, but it also introduces some irrelevant background noise. Three corresponding remedies: use more deformable convolution layers, attach a modulation weight to each offset, and mimic the features of R-CNN. The computational efficiency of deformable convolution is also questionable: although it can bring higher accuracy, when the kernel is large it occupies a very large amount of memory, so its use in real deployments is quite limited; it can, however, be treated as a handy trick for boosting scores in competitions.

 9. Graph Convolution

Paper: https://arxiv.org/abs/1609.02907

Code: https://github.com/tkipf/gcn

(others' reading notes, CVHub code)

A graph convolutional network is the simplest branch of graph networks. It is designed to handle data in non-Euclidean spaces that traditional networks such as CNNs and RNNs cannot process. (In a graph convolutional network, the similarity or distance between nodes does not depend only on their distance in physical space; it is determined by the nodes' feature vectors and their connections on the graph. Graph convolution can therefore capture abstract correlations between nodes, not just physical proximity.)

Principle: use edge information to aggregate node information and generate new node representations. In short, the convolution in CNNs is a weighted sum over the kernel's positions; extended to GCNs, edge information is used to repeatedly aggregate the information of neighbouring nodes and update each node's representation.

Features: Node features (each node can be used for feature representation), structural features (nodes are related through edges that carry information)
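A minimal sketch of the Kipf & Welling propagation rule (torch assumed; SimpleGCNLayer and the toy 4-node graph are made up for illustration): each node aggregates its neighbours through the normalized adjacency matrix and is then transformed by a shared weight matrix.

import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    # one propagation step: H' = relu(D^-1/2 (A + I) D^-1/2 H W)
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0))               # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)            # D^-1/2 as a vector
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return torch.relu(a_norm @ self.linear(h))         # aggregate neighbours, then transform

h = torch.randn(4, 8)                                      # 4 nodes with 8-dim features
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 0.],
                    [0., 1., 0., 0.]])                     # undirected toy graph
print(SimpleGCNLayer(8, 16)(h, adj).shape)                 # -> torch.Size([4, 16])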

Overview of Graph Neural Network GNN

Getting Started with GCN Graph Convolutional Networks

Figure ML: Using Transformer to alleviate GNN limitations

 10. Inception Block

Multi-scale contextual information is captured mainly by several convolution operations with different kernel sizes run in parallel, and the outputs are finally concatenated to obtain a multi-scale feature representation. The original purpose was to obtain feature representations at different scales from the same layer of feature maps; increasing the network's width helps obtain richer features. (Multiple convolution and pooling operations are performed in parallel on the input and all outputs are concatenated into a very deep feature map. Because different operations such as 1x1, 3x3, and 5x5 convolutions and pooling extract different information from the input, processing them in parallel and combining all the results yields a better image representation.)

Full analysis from Inception v1 to Inception v4

Inception v1: ① 1x1 convolution kernels compress the number of channels, reducing parameters and computation while improving the results; ② convolution kernels of different sizes are used in parallel, letting the network choose the best feature maps by itself, which improves both speed and accuracy; ③ in addition, Inception v1 introduces max pooling and average pooling layers to enrich the feature extraction.

The main idea of Inception v1 is to convolve with kernels of different sizes (combining different kernels not only enlarges the receptive field but also improves the robustness of the network) and then concatenate the results along the channel dimension as the output, increasing the network's expressive power. In this process, 1x1 convolutions are used to control the number of channels and reduce computation, and they also help strengthen the nonlinear expressiveness of the features.

Specifically, Inception v1 uses multiple different convolution kernel sizes (such as 1x1, 3x3, 5x5) for convolution to extract features of different scales. In addition, Inception v1 also introduces a pooling layer (such as Max Pooling or Average Pooling) to further reduce the size of the feature map. Ultimately, Inception v1 obtains a more powerful feature representation by splicing these features of different scales together.
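A simplified PyTorch sketch of such a block (torch assumed; the branch widths roughly follow the first Inception block of GoogLeNet, but are otherwise illustrative):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # four parallel branches whose outputs are concatenated along the channel dimension
    def __init__(self, in_c):
        super().__init__()
        self.b1 = nn.Conv2d(in_c, 64, kernel_size=1)                           # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_c, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))              # 1x1 reduce, then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_c, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))               # 1x1 reduce, then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_c, 32, 1))                        # pool, then 1x1

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)   # -> (1, 64 + 128 + 32 + 32, 28, 28) = (1, 256, 28, 28)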

Supplement: 1x1 convolution and channels:

The 1×1 convolution branch refers to using a 1×1 size convolution kernel for the convolution operation, where 1 indicates that the convolution kernel only covers one pixel in the spatial dimension. The 1×1 convolution branch is mainly used to increase or decrease the number of channels to reduce the amount of calculation and model complexity. In Inception Block, the 1×1 convolution branch is usually used to reduce the number of channels of the input feature map so that the subsequent convolution branch and pooling branch require less calculation. In deeper networks such as Inception-ResNet, 1×1 convolution branches are also used to perform dimensionality operations to increase the nonlinear representation capabilities of the network. In practice, the 1×1 convolution branch is usually placed at the beginning or end of the Inception Block to take full advantage of it.

Regarding channel compression: 1x1 convolution is commonly used to compress the number of channels. Concretely, a layer of 1x1 kernels is inserted into the network, where the depth of each kernel equals the number of input channels; the input feature map is thus reduced along the channel dimension.

The content of the channel is learned by the network and usually includes the extraction of different features of the input data. In a convolutional neural network, the channels output by each convolutional layer correspond to different features, such as edges, textures, colors, etc. When performing channel compression, it is usually determined by calculating the importance of each channel to determine which channels need to be retained and which can be discarded. This calculation method can be determined according to different application scenarios. Commonly used methods include using dimensionality reduction methods such as PCA, or using gradient-based methods, such as calculating the gradient size of each channel during network training to determine its importance.

In general, the purpose of compressing the channel is to reduce the amount of calculation and the number of model parameters while retaining the characteristics of the input data as much as possible. The specific content and mechanism of compression will be adjusted according to specific application scenarios and algorithms, but the general principle is to retain the most representative features of the input data .

For each layer in the convolutional neural network, the input data generally includes three dimensions: height, width, and channel. The number of channels can be understood as the number of features or feature maps. Each channel contains the response value of the input data under certain characteristics. For example, for a color picture, its channels can include color channels (red, green, blue) and texture channels, etc. The value in each channel Represents the response degree of the image under this feature. In a convolutional neural network, the purpose of the convolution operation is to extract different features of the input data without changing the size. These features are usually nonlinear, and their number can vary in different layers. Therefore, by adjusting the number of channels in the convolutional layer, the degree to which the convolutional neural network extracts different features of the input data can be controlled, thereby affecting the performance of the network.

Supplement: batch and mini-batch

When we train a deep neural network, we usually need to iteratively train a large amount of training data. These training data may not be loaded into memory all at once, so they need to be divided into multiple batches for processing. The batch size determines the number of samples used in each iteration of training. Generally, choosing an appropriate batch size can improve the speed and stability of training. In addition, batch can also be used for GPU parallel computing, sending multiple samples to the GPU for calculation at the same time to improve training efficiency.
In fact, when training a deep neural network, we usually divide the data set into multiple batches for training, and mini-batch is the amount of small sample data contained in a batch. We divide a batch into multiple mini-batches in order to send the data set into the network in batches for training. Only the mini-batch data needs to be processed each time, which can reduce memory usage and accelerate training. At the same time, by continuously traversing the entire data set, the network can be more fully trained, thereby improving the performance of the model. Therefore, mini-batch is a part of batch, which can be said to be a more fine-grained division.
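A tiny PyTorch sketch (torch assumed; the dataset is random data for illustration) of splitting a dataset into mini-batches with DataLoader:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 1,000 random samples split into mini-batches of 32
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for images, labels in loader:
    print(images.shape, labels.shape)   # torch.Size([32, 3, 32, 32]) torch.Size([32])
    break                               # 1000 / 32 -> 32 iterations per epoch (the last one holds 8 samples)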

Inception v2: Batch Normalization is proposed, which effectively accelerates the training of deep networks by reducing internal covariate shift. In addition, borrowing from VGG-Net [19], v2 replaces the 5×5 convolutions in v1 with two 3×3 convolutions, further reducing the number of parameters and the amount of computation while keeping the same receptive field.

Compared with Inception v1, Inception v2 makes several improvements and optimizations while keeping the core ideas. First, it introduces Batch Normalization, which makes training more stable and also helps accuracy. Second, it uses a technique called "Factorizing Convolutions", replacing one larger kernel with two smaller ones, for example two 3x3 kernels instead of one 5x5 kernel; this reduces parameters and computation while keeping the receptive field unchanged and improving accuracy. In addition, Inception v2 introduces techniques such as Label Smoothing and Weight Decay to further improve accuracy and generalization.
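A small PyTorch sketch (torch assumed; 64 channels chosen for illustration) of the factorization idea, comparing one 5x5 convolution with two stacked 3x3 convolutions:

import torch
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

c = 64
conv5 = nn.Conv2d(c, c, kernel_size=5, padding=2, bias=False)                    # one 5x5 convolution
conv3x2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.ReLU(),
                        nn.Conv2d(c, c, 3, padding=1, bias=False))               # two stacked 3x3 convolutions

print(count_params(conv5))     # 64*64*25 = 102,400
print(count_params(conv3x2))   # 2 * 64*64*9 = 73,728, ~28% fewer, plus an extra nonlinearity

x = torch.randn(1, c, 28, 28)
print(conv5(x).shape, conv3x2(x).shape)   # both (1, 64, 28, 28): same output size, same 5x5 receptive field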

Supplement: regularization

Regularization refers to limiting the complexity of the model by adding extra constraints during training, preventing the model from overfitting the training data and improving its generalization ability. Regularization techniques are widely used in machine learning and deep learning; commonly used ones include L1 regularization, L2 regularization, Dropout, and Batch Normalization. L1 and L2 regularization add a penalty term to the loss function that pushes the parameters toward 0, reducing model complexity; Dropout reduces overfitting by randomly setting a fraction of neurons to 0; Batch Normalization standardizes the input of each batch to reduce internal covariate shift and improve training efficiency and generalization. The choice of regularization technique should be tailored to the specific problem and dataset.

These concepts are easy to confuse. Regularization and methods for mitigating exploding/vanishing gradients are both commonly used techniques in deep learning, and both ultimately help the model train well and generalize better.

Methods for mitigating exploding/vanishing gradients mainly address the problem that gradients in deep networks become too small or too large as the number of layers grows. The gradient magnitude is controlled through means such as weight initialization, the choice of activation function, and batch normalization, preventing it from becoming too small or too large so that the network can be trained properly.

Regularization is a method of adding penalty terms to the loss function so that the model's parameters become smoother or closer to 0, thereby reducing model complexity, avoiding overfitting, and improving generalization. Commonly used regularization methods include L1 regularization, L2 regularization, and dropout.
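A short PyTorch sketch (torch assumed; the model is a toy example) combining the regularizers mentioned above: L2 regularization via weight_decay, Dropout, and Batch Normalization:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),     # normalizes each mini-batch, stabilizing training
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly zeroes half of the activations during training
    nn.Linear(64, 10),
)
# weight_decay adds an L2 penalty on the weights to the update
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, target = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()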

Inception v3: mainly borrows the idea of spatially separable convolution, splitting the original k×k kernel into a 1×k and a k×1 one-dimensional convolution. On the one hand this effectively speeds up computation; on the other, the saved computation can be used to deepen the network and improve its nonlinear mapping ability.

Depthwise separable convolution and spatially separable convolution are two different concepts. Depthwise separable convolution decomposes an ordinary convolution into a k×k depthwise convolution applied to each channel separately, followed by a 1×1 pointwise convolution that mixes the channels; this reduces the amount of computation to a certain extent while still allowing the channel depth to be changed. Spatially separable convolution instead splits a k×k kernel into a 1×k kernel and a k×1 kernel applied one after the other along the spatial dimensions, which also reduces the amount of computation to a certain extent.

When we use 1x1 convolution in the depth direction, we can control the depth of the output feature map by adjusting the number of convolution kernels. For example, suppose we have an input feature map of size 28x28x192 and we want to reduce its depth to 64, then we can use 64 convolution kernels of size 1x1x192 to convolve the input feature map. This produces an output feature map of size 28x28x64, where the value of each pixel is obtained by taking a weighted sum of the 192 values ​​at each position of the input feature map.
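A quick PyTorch check of this example (torch assumed):

import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)              # the 28x28x192 feature map from the text
reduce = nn.Conv2d(192, 64, kernel_size=1)   # 64 kernels of size 1x1x192
print(reduce(x).shape)                       # -> torch.Size([1, 64, 28, 28])
print(reduce.weight.shape)                   # -> (64, 192, 1, 1): each output channel weights all 192 inputs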

Inception v4 : Drawing on the ideas of ResNet, it introduces Skip Connection, which not only greatly accelerates network training, but also significantly improves network performance.

Based on Inception v3, the skip connection technology in ResNet is further introduced and the Inception-ResNet structure is proposed. This structure can not only accelerate the training of the network, but also effectively alleviate the problems of gradient disappearance and gradient explosion, and improve the training effect and generalization ability of the network. Inception v4 also uses some other technologies, such as sparse matrix-based filtering, dynamic convolution, etc., to further improve the performance and efficiency of the network.

The skip connection technique in ResNet alleviates vanishing and exploding gradients mainly because it makes backpropagation smoother and more stable. In deep neural networks, gradients are repeatedly shrunk or amplified by successive matrix multiplications during backpropagation, so vanishing or exploding gradients easily occur in the deeper parts of the network, making training difficult. Skip connections introduce cross-layer paths that pass information directly between layers, so the gradient is not shrunk or amplified too much by successive matrix multiplications, avoiding the vanishing and exploding gradient problems and making the network easier to train and converge.
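A minimal sketch of such a skip connection in PyTorch (torch assumed; ResidualBlock is an illustrative name, not the exact Inception-ResNet module):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # output = F(x) + x, so gradients also flow unchanged through the identity path
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)   # the skip connection

x = torch.randn(1, 64, 28, 28)
print(ResidualBlock(64)(x).shape)             # -> torch.Size([1, 64, 28, 28])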

[Convolution variant]

Will update after using it

1. Asymmetric Convolution


2. Octave Convolution


3. Heterogeneous Convolution


4. Conditionally Parameterized Convolutions


5. Dynamic Convolution

Dynamic convolution: CondConv


6. Ghost Convolution


7. Self-Calibrated Convolution


8. Depthwise Over-parameterized Convolution


9. Split-Attention Module (ResNeSt Block)


10. Involution


Source: blog.csdn.net/sinat_40759442/article/details/129284792