[Neural Network]----Common Terms and Concepts of Neural Networks (continuously updated)

receptive field

The receptive field refers to the region of the input signal that a given neuron in a neural network can perceive. Its size determines how much of the input image the network takes into account at that point, and therefore affects how the network understands and represents the image.

For a convolutional neural network, the receptive field of each output unit can be computed from the kernel size and stride of each convolutional layer and the size and stride of each pooling layer. Because the receptive field grows with the depth of the network, the calculation is cumulative: each layer's receptive field is derived from that of the layer before it.
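As a minimal sketch (assuming a plain chain of convolution and pooling layers, with dilation and padding effects ignored), the receptive field can be accumulated layer by layer from each layer's kernel size and stride:

```python
# Minimal sketch: layer-by-layer receptive-field calculation for a plain
# chain of conv/pool layers (dilation ignored for simplicity).
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output."""
    r, j = 1, 1   # r: receptive field so far, j: cumulative stride ("jump")
    for k, s in layers:
        r = r + (k - 1) * j   # each layer widens the field by (k-1) input jumps
        j = j * s             # the cumulative stride grows multiplicatively
    return r

# Example: two 3x3 convs (stride 1) followed by a 2x2 max pool (stride 2)
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # -> 6
```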

Receptive-field analysis examines the receptive-field range of the neurons at each layer and visualizes it graphically. In general, neurons with larger receptive fields capture more global information, while neurons with smaller receptive fields attend to local detail. When designing a network architecture, the receptive-field sizes at different levels should therefore be considered together, so that the network can understand and classify images well.

Anchor box

An anchor box is an auxiliary construct for object detection. Its role is to generate multiple rectangular boxes of fixed sizes and aspect ratios in the image to match target objects.

In object detection, anchor boxes usually correspond to positions on a sliding window or feature map, so that objects can be detected at each location. They are defined as a set of base boxes, obtained by scaling and translating boxes over different positions and sizes in the image. Performing detection over anchor boxes of different sizes and aspect ratios adapts efficiently to objects of different sizes and shapes. Anchor boxes thus define candidate object locations and sizes, and during training they are used to compute how well each target object matches each anchor.
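As a small illustrative sketch (the scales, aspect ratios, and corner-coordinate box format below are arbitrary example choices, not any particular detector's defaults), base anchors for one location can be generated like this:

```python
import itertools
import math

def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate (x1, y1, x2, y2) anchor boxes centered at (cx, cy)."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s * math.sqrt(r)   # width grows with sqrt(ratio) ...
        h = s / math.sqrt(r)   # ... height shrinks, so that w * h == s * s
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors(16, 16)))  # 9 anchors per location (3 scales x 3 ratios)
```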

upsampling

Upsampling enlarges the spatial size of a feature map, usually by interpolation.

Upsampling operations are often combined with convolution operations to restore low-resolution feature maps toward the original image size, recovering spatial detail that helps the network localize and recognize objects.

In object detection, upsampling is often used to restore low-resolution feature maps to the original image size for object localization and detection.
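A minimal PyTorch sketch of upsampling a feature map by interpolation:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 256, 13, 13)  # a low-resolution feature map
y = F.interpolate(x, scale_factor=2, mode="nearest")                       # 13x13 -> 26x26
z = F.interpolate(x, size=(52, 52), mode="bilinear", align_corners=False)  # 13x13 -> 52x52
print(y.shape, z.shape)  # (1, 256, 26, 26) and (1, 256, 52, 52)
```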

downsampling

Downsampling, often realized by pooling, aggregates a region of the input feature map into a single value.

Commonly used pooling operations include max pooling and average pooling. The pooling operation can reduce the size of the feature map, thereby reducing the amount of computation while extracting features.

In object detection, downsampling is often used to extract features from raw images for object localization and detection.

convolutional layer

The convolutional layer is the core component of a convolutional neural network. It slides a set of learnable filters (also known as convolution kernels) over the input data and computes a dot product at each location. Convolutional layers are typically used to extract features from the input data.

pooling layer

The pooling layer is a downsampling operation used to reduce the dimensionality of the data. It shrinks the data by computing statistics such as the maximum or average over local regions. Pooling layers are typically used to reduce the dimensionality of convolutional-layer outputs and to improve robustness to small translations.

max pooling

Max pooling divides the input feature map into several regions and takes the maximum value in each region as the output.
Concretely, the input feature map is divided into rectangular regions of equal size, the maximum pixel value within each region is taken as that region's output, and the outputs of all regions are assembled into the output of the pooling layer. The effect is to reduce the spatial size of the feature map while extracting the strongest features: only one maximum is kept per region, and the maximum usually occurs where a local feature is most pronounced.

Max pooling is often used in convolutional neural networks to reduce the spatial resolution of feature maps and help the network extract features that are spatially invariant.

average pooling

Average Pooling is a pooling operation that reduces the spatial size of input tensors. Its operation is to average each subregion of the input tensor to produce a new tensor. This process can be seen as a downsampling operation on the input tensor, reducing the spatial resolution while preserving the number of channels of the input tensor.

Average pooling is usually used in convolutional neural networks to downsample during feature extraction and reduce the size of feature maps. A common configuration is a 2x2 pooling window with a stride of 2, which halves the spatial size of the feature map while preserving its number of channels. Its main advantage is that it introduces no parameters: it has no trainable weights, only a fixed operator. It can also help prevent overfitting to some extent, since downsampling the feature map reduces the redundant information it carries.
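A minimal PyTorch sketch contrasting max pooling and average pooling on a 4x4 input:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])  # shape (1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x))  # [[6., 8.], [14., 16.]]     strongest response per 2x2 region
print(avg_pool(x))  # [[3.5, 5.5], [11.5, 13.5]] mean response per 2x2 region
```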

global average pooling

Global average pooling is often used in the last layer of the convolutional neural network to convert the feature map of the last layer into a fixed-length vector for classification and other tasks.
The operation is simple: average-pool over the entire feature map. Suppose the final feature map has size $H \times W \times C$, where $H$ is the height, $W$ the width, and $C$ the number of channels. The output of global average pooling is then a $C$-dimensional vector whose $i$-th component equals the average of channel $i$ over all spatial positions of the feature map.

Global average pooling can be seen as a parameter-free feature-compression operation: it reduces the dimensionality of the feature map without introducing any additional parameters, which makes it well suited to the last layer of a convolutional neural network.
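A minimal PyTorch sketch of global average pooling feeding a classifier (the channel and class counts are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)      # a batch of final-layer feature maps
gap = nn.AdaptiveAvgPool2d(1)      # average over the entire HxW extent
v = gap(x).flatten(1)              # (8, 512): one value per channel
classifier = nn.Linear(512, 1000)  # e.g. feed the fixed-length vector to a classifier
print(v.shape, classifier(v).shape)  # (8, 512) and (8, 1000)
```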

min pooling

Min pooling is similar to max pooling and average pooling: it also samples regions of the input feature map and downsamples them. For each pooling region, the minimum of all elements in the region is taken as the region's output value.

The difference between min pooling and max pooling and average pooling is that min pooling keeps the smallest feature value in the input area and ignores other feature values. Therefore, min pooling may be more effective than max pooling and average pooling in some specific application scenarios. For example, when detecting local features such as edges and corners in an image, using minimum pooling can enhance these local features and reduce the influence of noise.

activation layer

The activation layer is a non-linear operation used to increase the expressive power of a neural network. It passes input values through a non-linear function (such as Sigmoid or ReLU) and produces new output values. Activation layers are usually applied after convolutional and fully connected layers to introduce non-linear transformations.

In a convolutional neural network, multiple convolutional, pooling, and activation layers are usually connected together to build a deep neural network.

fully connected layer

A fully connected layer, also commonly called a dense layer, is a common layer type in neural networks. It connects every neuron in the previous layer to every neuron in the current layer.

In a fully connected layer, each neuron computes a weighted sum over all neurons in the previous layer, applies a non-linear activation function, and passes the result to the next layer of the network.

The fully connected layer is often used in the last layer of the neural network to classify or regress the features of the previous layer. However, in deeper networks, fully connected layers may also appear in some of the middle layers.

feature pyramid

The feature pyramid is a structure for multi-scale object detection that aims to improve detection performance on objects of different sizes. It usually consists of feature maps at multiple scales, extracted from different network layers, so that objects can be detected across scales.

In object detection, small objects often require higher-resolution feature maps to be detected accurately, while large objects can be detected using low-resolution feature maps. The feature pyramid structure can extract feature maps with different scales by performing up- and down-sampling operations between feature maps at different levels, and these feature maps can be effectively utilized in the object detection process.

There are two common ways to realize the feature pyramid structure, which are bottom-up and top-down.

bottom up

The bottom-up feature pyramid structure obtains features of different scales by performing convolution and pooling operations on feature maps of different levels, and then fuses these feature maps to obtain more comprehensive information. This method performs better when dealing with smaller objects, but suffers from information loss when dealing with larger objects.

top down

The top-down feature pyramid structure performs an upsampling operation on the feature map of the highest layer, then fuses it with the feature map of the lower layer, and passes the fused feature map down layer by layer to obtain features of different scales. This approach handles larger objects better, but is computationally expensive.
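A minimal sketch of a single top-down fusion step (FPN-style; the channel counts are arbitrary, and a 1x1 lateral convolution is one common way to match channels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)  # project lower-level map to 256 channels

c4 = torch.randn(1, 512, 26, 26)  # lower-level (higher-resolution) feature map
p5 = torch.randn(1, 256, 13, 13)  # higher-level (coarser) pyramid feature map

# Upsample the coarse map and add it to the lateral projection of the fine map.
p4 = lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
print(p4.shape)  # torch.Size([1, 256, 26, 26])
```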

SPP

SPP stands for spatial pyramid pooling; its function is to produce a fixed-size output from inputs of varying size.

It applies pooling windows of different sizes to the input feature map at several scales and concatenates the pooled results into a fixed-size feature vector. Specifically, the feature map is divided evenly into grids of several sizes, a pooling operation is performed within each grid cell, and all pooled feature vectors are concatenated to form the input to the next part of the network. In this way, a fixed-length feature vector is obtained for training and inference even when input image sizes differ. The SPP module is widely used in tasks such as object detection and image classification, for example in Fast R-CNN, Faster R-CNN, and YOLOv3.
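A minimal sketch of spatial pyramid pooling using adaptive max pooling (the grid sizes 1, 2, and 4 are one example choice); note that the output length is the same for different input sizes:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, grid_sizes=(1, 2, 4)):
    """Pool onto fixed grids, flatten, and concatenate into one vector."""
    n, c = x.shape[:2]
    pooled = [F.adaptive_max_pool2d(x, g).reshape(n, -1) for g in grid_sizes]
    return torch.cat(pooled, dim=1)  # length c * (1 + 4 + 16), independent of HxW

print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # (1, 5376)
print(spatial_pyramid_pool(torch.randn(1, 256, 20, 20)).shape)  # (1, 5376)
```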

SPPF

The SPPF module is a further improvement on the SPP module, designed to reduce its computational cost while preserving the spatial information of the input feature map.

Unlike the SPP module, which applies max-pooling windows of several sizes in parallel, the SPPF module applies several identical, smaller max-pooling layers in sequence; their stacked receptive fields reproduce the effect of the larger windows at a lower cost. A convolutional layer after the pooling stage performs feature extraction on the concatenated results.

The size of the SPPF module's output feature map is fixed, which makes it suitable for tasks with strict requirements on output size. At the same time, the SPPF module retains the advantage of the SPP module: it extracts features over multiple scales, enlarging the network's receptive field and improving its feature representation.
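A simplified sketch of a YOLOv5-style SPPF block (the real module also places a 1x1 convolution before the pooling stage to reduce channels; that detail is omitted here):

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Three sequential 5x5 max pools (stride 1, padded) stand in for
    parallel 5x5 / 9x9 / 13x13 pooling windows, but run faster."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.conv = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x):
        y1 = self.pool(x)   # effective 5x5 window
        y2 = self.pool(y1)  # effective 9x9 window
        y3 = self.pool(y2)  # effective 13x13 window
        return self.conv(torch.cat([x, y1, y2, y3], dim=1))

print(SPPF(256)(torch.randn(1, 256, 13, 13)).shape)  # (1, 256, 13, 13), size preserved
```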

DarkNet53

DarkNet53 is a convolutional neural network architecture that serves as the backbone of the YOLOv3 network. It consists of 53 convolutional layers organized with residual blocks, making it a relatively deep network.

DarkNet53 adopts a residual structure whose design mitigates vanishing gradients, allowing deeper networks to be trained. In addition, it progressively increases the number of channels of its feature maps, which improves the network's expressive power. DarkNet53 plays an important role in YOLOv3: it is responsible for extracting features from the input image for the subsequent detection tasks. Because of its strong feature-extraction ability, it can also be used for other computer-vision tasks, such as image classification and object recognition.

CSPDarknet

CSPDarknet is one of the Darknet-series networks and serves as the backbone of YOLOv5. Its full name is Cross Stage Partial Darknet; it is an improved network based on Darknet53.

The improvements of CSPDarknet lie mainly in two aspects: cross-stage connections and partial connections. Cross-stage connection means that, in each stage of Darknet53, the input is split into two parts and a cross-stage path is added to one of them, strengthening the flow of information through the network. Partial connection means that the input and output are divided into several parts and a convolution is applied to each part separately, which reduces the network's computational load and improves its efficiency.

Compared with Darknet53, CSPDarknet ensures the accuracy and speed of the network while reducing the number of parameters.

Residual Block

Residual Block (residual block) is a network structure commonly used in deep learning for building deep neural networks.

The main idea of the Residual Block is to insert "shortcut connections" (cross-layer connections) into the network, adding the output of an earlier layer directly to the input of a later layer. This mitigates the vanishing-gradient problem of deep neural networks, making training more stable and faster. Residual blocks can also improve the fitting ability and generalization ability of the network.
The basic structure of a Residual Block is:

The input $x$ passes through a convolutional layer $f$, giving the output $y = f(x)$.
$y$ passes through another convolution, giving a second output $z$.
$z$ and $x$ are added to give the final output $z + x$.


In this structure, the addition is element-wise (denoted $\oplus$); $f$ is a convolutional layer with weights $W$ and bias $b$, and $\sigma$ is the activation function, which can be ReLU, LeakyReLU, etc.

During the training process, the gradient is obtained through the backpropagation algorithm, and the parameters are updated using the gradient descent algorithm. In this way, deep neural networks can be trained to improve the performance and generalization ability of the model.
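A minimal PyTorch sketch of a residual block, assuming the input and output shapes match so the identity shortcut can be used directly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut: add the input back before the final ReLU

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)
```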

Shortcut connections

Shortcut connections are a key feature of the Residual Block, enabling deep neural networks to be trained more effectively. In traditional convolutional neural networks, information is gradually lost after many layers of convolution, making deep networks difficult to train. By adding shortcut connections that link the input directly to the output, the Residual Block lets the model transfer information while learning the residual, which speeds up training and improves the model's accuracy and generalization ability.

Shortcut connections are usually implemented as a cross-layer connection inside the Residual Block: the input x is added to the output f(x), and the sum is passed to the next layer of the network. Shortcut connections are used in many deep neural networks, including ResNet and DenseNet.

ResNet

The main purpose of ResNet is to solve the problem of gradient disappearance and gradient explosion as the depth of the neural network increases. In traditional neural networks, as the number of network layers increases, the performance of the network tends to degrade, that is, the training error does not decrease but increases. ResNet proposes a method of residual learning, which enables the network to be deeper and wider while achieving higher accuracy.

The core idea of ResNet is residual learning, which introduces skip connections into the network. A traditional network is strictly layer-by-layer: information must pass through every layer to reach the output. With skip connections, information can be passed directly between certain layers, avoiding loss of information and vanishing gradients.

In ResNet, a residual block consists of two convolutional layers and a skip connection. Specifically, the input first passes through one convolutional layer, then a ReLU activation, then a second convolutional layer. The output of the second convolutional layer is added to the input carried by the skip connection, and the sum passes through a final ReLU activation to produce the block's output.

Another important component of ResNet is the residual stage, formed by stacking multiple residual blocks; a ResNet contains four such stages. The common variants with 18, 34, 50, and 101 layers differ in the depth and width of these stages, achieving different trade-offs between complexity and performance.

Differences and connections between ResNet and DarkNet53

  • Similarities:

     Both use residual blocks, enabling the network to learn deep features better.

     Both use convolutional layers, BN layers, and the ReLU activation function.

  • Differences:

     Compared with ResNet, DarkNet53 uses smaller convolution kernels, namely 3x3 kernels.

     DarkNet53 uses convolution kernels of several different sizes, while ResNet uses a more traditional kernel design.

     DarkNet53 applies a 1x1 convolution before each convolution to reduce the number of input channels and thus the amount of computation.

     DarkNet53 uses fewer layers than ResNet but still achieves good results.

Both DarkNet53 and ResNet are commonly used backbone networks. They have their own advantages and disadvantages. Which one to choose as the backbone network depends on specific tasks and requirements.

DarkNet53 is the backbone network used by YOLOv3. Compared with ResNet, it requires less computation and has fewer parameters, so it offers faster training and inference. Its accuracy on detection tasks is also quite good.

ResNet is the champion network of the ImageNet competition. Its residual structure effectively addresses the vanishing-gradient problem and allows very deep neural networks to be trained, improving the expressiveness and performance of the network. It performs well in tasks such as image classification, object detection, and semantic segmentation, especially with very deep networks.

conv

"conv" is short for convolution operation, which is usually used in deep learning to extract features or implement image processing tasks. Convolution is a mathematical operation that operates on an input image or feature map by sliding a convolution kernel (also known as a filter or filter), resulting in a new feature map or image. Convolution can effectively extract local features of images, so it is widely used in image processing and computer vision. In deep learning, convolution is usually combined with other operations (such as activation function, batch normalization, etc.) to form the various layers of the neural network.

conv2d

conv2d is one of the commonly used convolution operations in deep learning. It refers to the convolution operation on two-dimensional images, where 2d means two-dimensional.

In the convolutional neural network, the conv2d layer is used to perform convolution operations on the input data, and perform feature extraction through different convolution kernels. The convolution operation can learn some spatial local features, and by continuously stacking multiple convolution layers, more abstract and advanced feature representations can be obtained, which can then be used for tasks such as classification and target detection.
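A minimal example of conv2d in PyTorch (the channel counts are arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                 stride=1, padding=1)  # padding=1 keeps the spatial size at stride 1
x = torch.randn(1, 3, 224, 224)        # one RGB image
print(conv(x).shape)                   # torch.Size([1, 16, 224, 224])
```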

MLP

MLP is the abbreviation of Multilayer Perceptron, a common feedforward neural network model composed of multiple layers of neurons. Each neuron is connected to all neurons in the previous layer; the number of neurons per layer can vary, and full connectivity is generally used. MLPs are commonly used for classification and regression tasks: in classification, an MLP maps inputs to different categories, while in regression it maps inputs to a numerical value.
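A minimal PyTorch sketch of an MLP (the layer sizes are arbitrary, e.g. for flattened 28x28 inputs and 10 classes):

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 128),  # fully connected: every input feeds every hidden unit
    nn.ReLU(),            # nonlinearity between the layers
    nn.Linear(128, 10),   # output layer: one logit per class
)
print(mlp(torch.randn(32, 784)).shape)  # torch.Size([32, 10])
```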

MSA

MSA is the abbreviation of Multi-Head Self-Attention, that is, the multi-head self-attention mechanism. It is a key technology in natural language processing and computer vision, which can capture key information in a sequence by interacting with different positions in the input sequence. MSA was first introduced in the Transformer model. By mapping the input sequence to multiple heads, each head independently calculates the attention weight, thereby improving the expressiveness and generalization ability of the model. MSA has been widely used in text classification, machine translation, image classification, object detection and other tasks.
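A minimal example of multi-head self-attention using PyTorch's built-in layer (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)       # (batch, sequence length, embedding dim)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, weights.shape)  # (2, 10, 64) and (2, 10, 10)
```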

CBL

CBL is the abbreviation of Convolutions with Batch Normalization and Leaky ReLU, which is a basic module commonly used in convolutional neural networks.

The CBL module usually consists of three parts: a convolutional layer, Batch Normalization (BN), and a LeakyReLU activation function. The convolutional layer is responsible for feature extraction; the BN layer accelerates training and alleviates problems such as vanishing and exploding gradients; and the LeakyReLU activation avoids the "dying neuron" problem that can occur with the plain ReLU activation.

The advantages of the CBL module are that it effectively reduces overfitting, improves the model's generalization ability, and accelerates training: compared with some traditional convolution modules, it converges toward a good solution faster.

The CBL module is widely used in many deep learning frameworks, such as PyTorch, TensorFlow, etc.
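A minimal PyTorch sketch of a CBL block as described above (the kernel size and negative slope are typical example values):

```python
import torch
import torch.nn as nn

def cbl(in_ch, out_ch, k=3, s=1):
    """Conv -> BatchNorm -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),  # BN provides the bias
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

print(cbl(3, 32)(torch.randn(1, 3, 416, 416)).shape)  # (1, 32, 416, 416)
```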

C3 module

The C3 module is one of the modules introduced in YOLOv5. Its main function is to enlarge the model's receptive field and increase its depth, thereby improving detection accuracy. A C3 block is composed of three convolutional layers: 1x1 kernels are used to reduce and then restore the number of channels, and a 3x3 kernel performs feature extraction. This design keeps the number of model parameters down and alleviates overfitting to some extent. C3 modules can be nested several times to form hierarchies that gradually increase the receptive field and depth, improving detection performance.
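A simplified sketch following the description above (1x1 convolutions to change the channel count plus a 3x3 convolution for feature extraction); the actual YOLOv5 C3 module additionally uses a cross-stage partial split, which is omitted here:

```python
import torch
import torch.nn as nn

def c3_sketch(in_ch, mid_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1: reduce channels
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),             # 1x1: restore channels
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # 3x3: feature extraction
    )

print(c3_sketch(256, 64, 256)(torch.randn(1, 256, 52, 52)).shape)  # (1, 256, 52, 52)
```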

CA module

The CA module (Channel Attention Module) adaptively reweights a feature map along the channel dimension. It uses global average pooling to compute a weight for each channel and then applies that weight to all pixels of the channel. Specifically, given a feature map $F \in \mathbb{R}^{C \times H \times W}$, the CA module computes

$$F_{CA} = \sigma\big(\mathrm{MLP}_{CA}(\mathrm{AvgPool}(F))\big) \odot F$$

where $\mathrm{MLP}_{CA}$ is a multi-layer perceptron with two fully connected layers, $\sigma$ is the sigmoid function, $\odot$ denotes the element-wise product, and $F_{CA}$ is the feature map produced by the CA module.

SE module

The SE module (Squeeze-and-Excitation Module) adaptively recalibrates a feature map along the channel dimension. It uses global average pooling (the "squeeze") to compute a descriptor for each channel, passes it through a small MLP (the "excitation"), and applies the resulting weights to all pixels of the corresponding channel. Given a feature map $F \in \mathbb{R}^{C \times H \times W}$, the SE module computes

$$F_{SE} = \sigma\big(\mathrm{MLP}_{SE}(\mathrm{AvgPool}(F))\big) \odot F$$

where $\mathrm{MLP}_{SE}$ is a multi-layer perceptron with two fully connected layers, $\sigma$ is the sigmoid function, $\odot$ denotes the element-wise product, and $F_{SE}$ is the feature map produced by the SE module.
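A minimal PyTorch sketch of an SE block (a reduction ratio of 16 is a common default); the same code also matches the CA formulation above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(  # the two-layer MLP ("excitation")
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool -> (n, c)
        return x * w.view(n, c, 1, 1)    # excite: reweight every channel

print(SEBlock(256)(torch.randn(1, 256, 13, 13)).shape)  # (1, 256, 13, 13)
```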

In general, both the CA module and the SE module strengthen the model's attention mechanism and improve its performance by reweighting features along the channel dimension. As formulated above, the two are essentially the same operation; in practice they differ mainly in the variants built on them, for example adding a spatial-attention branch on top of the channel attention.

Both attention mechanisms have their advantages and applicable scenarios. If the task calls for reweighting the features of different channels, either module applies; if spatial features must also be emphasized, a variant with a spatial branch may be more suitable. In practical applications, they can be selected, or combined, according to the specific situation to achieve better results.

BiFPN

BiFPN (Bidirectional Feature Pyramid Network) is a feature pyramid network for object detection. Its main idea is to achieve high-quality feature extraction and object detection through multi-level feature fusion with bidirectional connections, extending the earlier FPN and PAN designs.
The formula of BiFPN is as follows:

$$y_{l,i}=\begin{cases}x_{l,i}, & l=l_{min}\\ \frac{1}{2}\,(y_{l+1,i}+x_{l,i})\times w_{1}+\sum_{j=i}^{i+3}\frac{1}{4}\,(y_{l+1,j}+x_{l,j})\times w_{2}, & l_{min}+1\leq l\leq l_{max}\\ \frac{1}{2}\,(y_{l-1,i}+x_{l,i})\times w_{1}+\sum_{j=i-3}^{i}\frac{1}{4}\,(y_{l-1,j}+x_{l,j})\times w_{2}, & l_{min}\leq l<l_{min}+1\end{cases}$$

Here $y_{l,i}$ is the output at position $i$ of the level-$l$ feature map in BiFPN, $x_{l,i}$ is the input feature map, $l_{min}$ and $l_{max}$ are the minimum and maximum levels in the feature pyramid, and $w_1$ and $w_2$ are learnable parameters. BiFPN performs adaptive bidirectional connections between multi-level feature maps, so that each level can acquire both global and local information, thereby improving the accuracy and efficiency of object detection.

$f_{\text{up}}(P_i)$: upsample the feature map $P_i$ and apply a 2D convolution to obtain a new feature map $U_i$.

Suppose the input feature maps are $x_1, x_2, \ldots, x_n$, where $n$ is the number of inputs. The fusion computed by a BiFPN node can then be written as

$$y_i = w_{i,1}\,\hat{x}_{i,1} + w_{i,2}\,\hat{x}_{i,2} + w_{i,3}\,x_i$$

where $y_i$ is the output feature map, $\hat{x}_{i,1}$ is the feature map after upsampling, $\hat{x}_{i,2}$ is the feature map after downsampling, and $w_{i,1}$, $w_{i,2}$, $w_{i,3}$ are the corresponding fusion weights, obtained through learning.

Specifically, the upsampled feature map $\hat{x}_{i,1}$ and the downsampled feature map $\hat{x}_{i,2}$ are computed as

$$\hat{x}_{i,1} = \text{Upsample}(x_{i+1}), \qquad \hat{x}_{i,2} = \text{Downsample}(x_{i-1})$$

where $\text{Upsample}(\cdot)$ denotes upsampling and $\text{Downsample}(\cdot)$ denotes downsampling.

The fusion weights are normalized so that they sum to one, for example

$$w_{i,k} = \frac{e^{z_{i,k}}}{\sum_{j} e^{z_{i,j}}}$$

where the $z_{i,k}$ are learnable parameters trained via backpropagation.
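A minimal sketch of BiFPN-style weighted fusion, assuming the input maps have already been resized to a common shape (this uses the ReLU-plus-normalization "fast fusion" variant rather than the softmax form above):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # one learnable weight per input
        self.eps = eps

    def forward(self, inputs):        # inputs: same-shape feature maps
        w = torch.relu(self.w)        # keep the weights non-negative
        w = w / (w.sum() + self.eps)  # normalize so the weights sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = WeightedFusion(3)
maps = [torch.randn(1, 64, 32, 32) for _ in range(3)]
print(fuse(maps).shape)  # torch.Size([1, 64, 32, 32])
```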
