Classic network structure diagram and code in deep learning

Inception model evolution history: from GoogLeNet to Inception-ResNet — Performance comparison between Inception network and other networks

PyTorch-Networks: PyTorch code for classification, detection, and pose-estimation networks

caffe-model-zoo: AlexNet, VGGNet, GoogLeNet, ResNet, and other networks with pre-trained weights

caffe-model: Performance comparison of different networks; network structures only, without weights

DWConv depthwise conv

Comparison of ordinary convolution and depthwise separable convolution

For example, in the picture above, the input is a 10x10-pixel, three-channel color image (shape 3x10x10). Depthwise separable convolution first applies a channel-by-channel (depthwise) convolution with one 3x3 kernel per input channel (shape 3x3x3, 27 parameters), and then a point-by-point (pointwise) convolution with 16 output channels (shape 3x1x1x16, 48 parameters). The total is 27 + 48 = 75 parameters, about 83% fewer than the 432 parameters of a conventional convolution (3x3x3x16). Biases are ignored in these counts.
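The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names conv_params and depthwise_separable_params are made up for illustration, and biases are ignored as in the text.

```python
def conv_params(in_ch, out_ch, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return in_ch * out_ch * k * k


def depthwise_separable_params(in_ch, out_ch, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = in_ch * k * k    # one k x k kernel per input channel
    pointwise = in_ch * out_ch   # 1x1 conv that mixes the channels
    return depthwise + pointwise


standard = conv_params(3, 16, 3)
separable = depthwise_separable_params(3, 16, 3)
reduction = 1 - separable / standard
print(standard, separable, f"{reduction:.1%}")  # 432 75 82.6%
```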

ShuffleNet Inverted residual block of CNN model

The inverted residual block comes from MobileNetV2. A traditional residual module first uses a 1x1 convolution to reduce the dimension of the input feature map, then performs a 3x3 convolution, and finally uses a 1x1 convolution to increase the dimension. The inverted residual module does the opposite: it first uses a 1x1 convolution to increase the dimension of the input feature map, then applies a 3x3 depthwise convolution, and finally uses a 1x1 convolution to reduce the dimension. Note that in MobileNetV2 the final 1x1 convolution is no longer followed by the ReLU6 activation function; a linear activation is used instead, to retain more feature information and preserve the expressiveness of the model.
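The expand, depthwise, project order described above might look roughly like this in PyTorch. It is a minimal sketch, not the reference implementation: the class name InvertedResidual, the expand_ratio default of 6, and the rule for when to use the skip connection are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """1x1 expand -> 3x3 depthwise -> 1x1 linear projection."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # Assumption: the skip connection is used only when shapes match
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 convolution increases the channel dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution: groups equals the channel count
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 convolution reduces the dimension; no activation (linear)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out


x = torch.randn(1, 16, 32, 32)
same = InvertedResidual(16, 16)(x)
down = InvertedResidual(16, 24, stride=2)(x)
print(same.shape)  # torch.Size([1, 16, 32, 32])
print(down.shape)  # torch.Size([1, 24, 16, 16])
```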

ShuffleNet of CNN model

basic unit

Channel shuffle implementation: assume the input layer is divided into g groups and the total number of channels is n. First, split the channel dimension into two dimensions (g, n/g), then transpose these two dimensions into (n/g, g), and finally reshape back into one dimension. A uniform shuffle is thus achieved with only a reshape and a transpose.

def shuffle_channels(x, groups):
    """Shuffle the channels of a 4-D tensor (batch, channels, height, width)."""
    batch_size, channels, height, width = x.size()
    assert channels % groups == 0
    channels_per_group = channels // groups
    # split the channel axis into (groups, channels_per_group)
    x = x.view(batch_size, groups, channels_per_group,
               height, width)
    # swap the group axis and the per-group axis (dims 1 and 2)
    x = x.transpose(1, 2).contiguous()
    # reshape back into the original shape
    x = x.view(batch_size, channels, height, width)
    return x
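To see which channel ends up where, the same reshape and transpose can be traced on plain channel indices. This is an illustrative sketch in pure Python; shuffled_channel_order is a hypothetical helper, not part of the code above.

```python
def shuffled_channel_order(num_channels, groups):
    """Return the source channel index found at each position after the shuffle."""
    assert num_channels % groups == 0
    per_group = num_channels // groups
    # view as (groups, per_group): row g holds the channels of group g
    grid = [[g * per_group + i for i in range(per_group)] for g in range(groups)]
    # transpose to (per_group, groups), then flatten row by row
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


order = shuffled_channel_order(6, 2)
print(order)  # [0, 3, 1, 4, 2, 5]
```

With 6 channels in 2 groups, the two groups {0, 1, 2} and {3, 4, 5} are interleaved, so every group of the next layer sees channels from both.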

ShuffleNetV2: The Crown of Lightweight CNN Networks

Equal input and output channel sizes minimize memory access cost (MAC)

Excessive use of group convolution increases MAC

Network fragmentation reduces parallelism

Element-wise operations are non-negligible

The guidelines are summarized below:

Use 1x1 convolutions with balanced input and output channel sizes;

Use group convolution with caution, and pay attention to the number of groups;

Avoid fragmentation of the network;

Reduce element-wise operations.
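The first guideline can be checked numerically. The sketch below assumes the memory access cost model from the ShuffleNetV2 analysis for a 1x1 convolution, MAC = hw(c1 + c2) + c1*c2, which is not spelled out in this article. The 56x56 feature map and the channel pairs are made-up values chosen so that the FLOPs stay constant.

```python
def conv1x1_flops(h, w, c_in, c_out):
    """Multiply-adds of a 1x1 convolution on an h x w feature map."""
    return h * w * c_in * c_out


def conv1x1_mac(h, w, c_in, c_out):
    """Memory access cost: read input, write output, read weights."""
    return h * w * (c_in + c_out) + c_in * c_out


h = w = 56
configs = [(128, 128), (64, 256), (32, 512)]  # same c_in * c_out in every case
flops = [conv1x1_flops(h, w, c_in, c_out) for c_in, c_out in configs]
macs = [conv1x1_mac(h, w, c_in, c_out) for c_in, c_out in configs]
for (c_in, c_out), f, m in zip(configs, flops, macs):
    print(f"c_in={c_in:4d} c_out={c_out:4d} flops={f} mac={m}")
```

All three configurations cost the same FLOPs, but the balanced 128/128 case has the smallest MAC under this model.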

SqueezeNet of CNN model

Comparison results of SqueezeNet and AlexNet

The last ImageNet champion model: SENet

The innovation of SENet is its focus on the relationship between channels: the model is expected to learn the importance of different channel features automatically. The SE module first performs a Squeeze operation on the feature map produced by convolution to obtain channel-level global features. It then performs an Excitation operation on these global features to learn the relationship between channels and obtain a weight for each channel. Finally, the weights are multiplied with the original feature map to produce the final features. In essence, the SE module performs attention, or gating, along the channel dimension. This attention mechanism lets the model pay more attention to the most informative channel features and suppress unimportant ones. The SE module is also generic, which means it can be embedded into existing network architectures.

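import torch.nn as nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        # Squeeze: global average pooling, one value per channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck MLP ending in a sigmoid gate
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)   # (b, c, 1, 1) -> (b, c)
        y = self.fc(y).view(b, c, 1, 1)   # per-channel weights in (0, 1)
        # rescale each channel of the input by its weight
        return x * y.expand_as(x)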

Inception and ResNet networks with the SE module added
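As a rough sketch of how the SE gate might be embedded in a ResNet-style block, the example below rescales the residual branch before it is added to the identity path. The names SEGate and SEBasicBlock are hypothetical, and SEGate is just a compact restatement of the SELayer idea so that the snippet runs on its own.

```python
import torch
import torch.nn as nn


class SEGate(nn.Module):
    """Compact squeeze-and-excitation gate (same idea as SELayer above)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))  # squeeze, then excitation
        return x * weights.view(b, c, 1, 1)


class SEBasicBlock(nn.Module):
    """Two 3x3 convolutions, an SE gate on the residual branch, then the skip."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEGate(channels, reduction)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The gate rescales the residual branch before the addition
        return self.relu(x + self.se(self.body(x)))


x = torch.randn(2, 64, 8, 8)
out = SEBasicBlock(64)(x)
print(out.shape)  # torch.Size([2, 64, 8, 8])
```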

MobileNetV3

The SE structure is added to the bottleneck structure and placed after the depthwise filter. Because the SE structure costs some time, the author fixed the number of channels in the SE bottleneck to 1/4 of the channels of the expansion layer in the structures that contain SE, and found that accuracy improved without increasing the time consumption.

In MobileNetV2 there is a 1x1 convolutional layer before the average pooling. Its purpose is to increase the dimension of the feature map, which benefits the final prediction, but it also adds a certain amount of computation. The author therefore moved it behind the average pooling: average pooling first reduces the feature map from 7x7 to 1x1, and the 1x1 convolution then increases the dimension, which cuts the computation of that layer by a factor of 7x7 = 49. To reduce the computation further, the author also removed the 3x3 and 1x1 convolutions of the preceding spindle-shaped (bottleneck) convolution block, giving the structure shown in the second row of the figure below. Accuracy is not lost after this removal, and the running time drops by about 15 ms.
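The factor of 49 follows directly from the multiply-add count of a 1x1 convolution, which scales with the spatial size of the feature map. A minimal sketch, in which the channel widths 960 and 1280 are assumed values used only for illustration:

```python
def conv1x1_madds(h, w, c_in, c_out):
    """Multiply-adds of a 1x1 convolution on an h x w feature map."""
    return h * w * c_in * c_out


c_in, c_out = 960, 1280  # assumed channel widths, for illustration only
before = conv1x1_madds(7, 7, c_in, c_out)  # 1x1 conv applied on the 7x7 map
after = conv1x1_madds(1, 1, c_in, c_out)   # 1x1 conv applied after avg pooling
print(before // after)  # 49
```

The ratio does not depend on the channel widths, only on the 7x7 versus 1x1 spatial size.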

Detailed explanation of RepVGG algorithm

Identity and residual branches are added to the blocks of the VGG network, which amounts to applying the essence of ResNet to VGG. In the inference stage, all layers are converted to Conv3x3 through an operator-fusion strategy, which simplifies model deployment and acceleration. Different architectures are thus used for training and for inference: the training phase favors accuracy, and the inference phase favors speed. It is a way to speed up deployed models.

First, the convolutional layer and the BN layer in each branch of the block are fused. Each fused convolution is then converted to a Conv3x3, and finally the Conv3x3 layers of all branches are merged into a single Conv3x3.
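The fusion steps can be sketched in PyTorch as follows. This is an illustration under simplifying assumptions, not the RepVGG reference code: only a 3x3 branch and a 1x1 branch are merged, the helper fuse_conv_bn is hypothetical, and the BN statistics are filled with random values so that the equivalence check is not trivial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_conv_bn(conv, bn):
    """Return (weight, bias) of one conv equivalent to conv -> BN in eval mode."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                       # one factor per output channel
    weight = conv.weight * scale.view(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    return weight, bias


torch.manual_seed(0)
channels = 4
conv3, bn3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels)
conv1, bn1 = nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels)

# Give BN non-trivial statistics and switch to inference behaviour
for bn in (bn3, bn1):
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2)
    bn.weight.data.uniform_(0.5, 1.5)
    bn.bias.data.uniform_(-1, 1)
    bn.eval()

with torch.no_grad():
    w3, b3 = fuse_conv_bn(conv3, bn3)
    w1, b1 = fuse_conv_bn(conv1, bn1)
    # A 1x1 kernel becomes a 3x3 kernel by zero-padding around its centre
    w1_as_3x3 = F.pad(w1, [1, 1, 1, 1])
    # Convolution is linear, so parallel branches merge by adding kernels and biases
    merged_w, merged_b = w3 + w1_as_3x3, b3 + b1

    x = torch.randn(2, channels, 8, 8)
    two_branch = bn3(conv3(x)) + bn1(conv1(x))
    single_conv = F.conv2d(x, merged_w, merged_b, padding=1)

matches = torch.allclose(two_branch, single_conv, atol=1e-4)
print(matches)
```

The identity branch can be handled the same way by treating it as a 1x1 convolution whose kernel is the identity matrix before padding it to 3x3.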

Read EfficientDet in one article

A complete explanation of the core basic knowledge of Yolov3&Yolov4 in the Yolo series

The three basic components of Yolov3 are shown in the three blue boxes above:

CBL: The smallest component in the Yolov3 network structure, consisting of Conv + BN + Leaky ReLU activation.

Res unit: Borrows the residual structure from ResNet so that the network can be built deeper.

ResX: Consists of one CBL and X residual units; it is the large component in Yolov3. The CBL in front of each Res module performs downsampling, so after 5 Res modules the feature map size goes 608->304->152->76->38->19.
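The three components can be sketched in PyTorch as below. The details are assumptions made for illustration rather than facts from the text: a Leaky ReLU slope of 0.1, a residual unit built from a 1x1 CBL that halves the channels followed by a 3x3 CBL, and the helper names cbl, ResUnit, and res_x.

```python
import torch
import torch.nn as nn


def cbl(in_ch, out_ch, kernel_size, stride=1):
    """Conv + BN + Leaky ReLU; the 0.1 negative slope is an assumption."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )


class ResUnit(nn.Module):
    """Residual unit: two CBLs plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(cbl(channels, channels // 2, 1),
                                  cbl(channels // 2, channels, 3))

    def forward(self, x):
        return x + self.body(x)


def res_x(in_ch, out_ch, num_units):
    """ResX: one stride-2 CBL for downsampling, then X residual units."""
    layers = [cbl(in_ch, out_ch, 3, stride=2)]
    layers += [ResUnit(out_ch) for _ in range(num_units)]
    return nn.Sequential(*layers)


x = torch.randn(1, 32, 64, 64)
y = res_x(32, 64, num_units=2)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])

# Five stride-2 stages halve a 608-pixel input five times
sizes = [608]
for _ in range(5):
    sizes.append(sizes[-1] // 2)
print(sizes)  # [608, 304, 152, 76, 38, 19]
```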

The innovations of Yolov4 can be divided into four parts:

Input: The innovations here are improvements to the input stage during training, mainly Mosaic data augmentation, CmBN, and SAT (self-adversarial training).

Backbone: Combines various new methods, including CSPDarknet53, the Mish activation function, and DropBlock.

Neck: Object detection networks often insert some layers between the backbone and the final output layer, such as the SPP module and the FPN+PAN structure in Yolov4.

Prediction: The anchor box mechanism of the output layer is the same as in Yolov3. The main improvements are the CIOU_Loss loss function used during training, and DIOU_nms, which replaces plain NMS for filtering the prediction boxes.
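CIOU_Loss and DIOU_nms are only named above. As a minimal pure-Python sketch, assuming boxes in (x1, y1, x2, y2) form with positive width and height, they could be written as follows. The function names and the 0.5 threshold are illustrative choices, not taken from the Yolov4 code.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def diou(a, b):
    """IoU minus the squared centre distance, normalised by the enclosing diagonal."""
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    ew = max(a[2], b[2]) - min(a[0], b[0])  # width of the smallest enclosing box
    eh = max(a[3], b[3]) - min(a[1], b[1])  # height of the smallest enclosing box
    return box_iou(a, b) - (dx * dx + dy * dy) / (ew * ew + eh * eh)


def ciou_loss(pred, target):
    """1 - DIoU plus an aspect-ratio consistency penalty."""
    iou = box_iou(pred, target)
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wt, ht = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - diou(pred, target) + alpha * v


def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a box when its DIoU with a kept box is too high."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = diou_nms(boxes, scores=[0.9, 0.8, 0.7])
print(kept)  # [0, 2]
```

Because DIoU subtracts a centre-distance term, two boxes with the same IoU are less likely to suppress each other when their centres are far apart.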

A complete explanation of the core basic knowledge of Yolov5 in the Yolo series

(1) Input: Mosaic data augmentation, adaptive anchor box calculation, adaptive image scaling
(2) Backbone: Focus structure, CSP structure
(3) Neck: FPN+PAN structure
(4) Prediction: GIOU_Loss
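GIOU_Loss in item (4) extends IoU with a penalty based on the smallest box that encloses both boxes, so it still gives a useful signal when the boxes do not overlap. A minimal sketch, assuming boxes in (x1, y1, x2, y2) form; the function name giou_loss is illustrative.

```python
def giou_loss(pred, target):
    """1 - GIoU, where GIoU = IoU - (enclosing area - union) / enclosing area."""
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    # area of the smallest box that encloses both boxes
    enclose = ((max(pred[2], target[2]) - min(pred[0], target[0]))
               * (max(pred[3], target[3]) - min(pred[1], target[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1 - giou


same = giou_loss((0, 0, 1, 1), (0, 0, 1, 1))
apart = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
print(same, apart)
```

For identical boxes the loss is 0. For the two disjoint unit boxes it exceeds 1, whereas a plain 1 - IoU loss would give exactly 1 for any pair of non-overlapping boxes.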

A complete explanation of the core foundation of Yolox in the Yolo series

1. Input: Strong data augmentation

2. Backbone: Unchanged; it is still Darknet53.

3. Neck: Unchanged; the neck of the Yolov3 baseline is still the FPN structure.

4. Prediction: Decoupled Head, End-to-End YOLO, Anchor-free, Multi positives

Understand HRNet in one article

The backbone is divided into 4 stages, and each stage has two parts, marked by a blue box and an orange box. The blue box is the basic structure of the stage and is composed of multiple branches: in HRNet, the blue box of stage 1 uses BottleNeck, and the blue boxes of stages 2, 3, and 4 use BasicBlock. The orange box is the transition structure of the stage: for stage 1 it is a TransitionLayer, for stages 2 and 3 it is a FuseLayer followed by a TransitionLayer, and for stage 4 it is a FuseLayer.
