[Neural Network] (19) ConvNeXt code reproduction, network analysis, with complete TensorFlow code

Hello everyone, today I will share how to build a ConvNeXt convolutional neural network model with TensorFlow.

Paper address : https://arxiv.org/pdf/2201.03545.pdf

The full code is in my Gitee: https://gitee.com/dgvv4/neural-network-model/tree/master/

In 2021, Transformers repeatedly crossed over into computer vision: Google's ViT first broke through in image classification, and Microsoft's Swin Transformer followed in object detection and image segmentation. With more and more researchers working on vision Transformers, the leaderboards of all three major tasks came to be dominated by Transformers or by hybrid models combining the two architectures. Against this backdrop, ConvNeXt stands up for the convolutional neural network.


1. ConvNeXt Block Module

1.1 Basic structure

(1) ConvNeXt uses the idea of grouped convolution, in the same way as the depthwise convolution (Depthwise Conv) in MobileNetV1: there are as many convolution kernels as there are channels in the input feature map, each kernel processes one corresponding channel, and each kernel produces one feature map. Stacking all the produced feature maps along the channel dimension yields an output with the same number of channels as the input. A quick shape check is sketched below.
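As a quick standalone check (my own sketch, not from the original post; assumes TF 2.x), Keras's DepthwiseConv2D keeps the channel count unchanged:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 56, 56, 96])  # dummy feature map with 96 channels
dw = layers.DepthwiseConv2D(kernel_size=(7,7), strides=1, padding='same')
print(dw(x).shape)  # (1, 56, 56, 96): one kernel per channel, channel count preserved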

(2) Inverted residual structure: a 1*1 convolution first expands the channels, then a 1*1 convolution reduces them, borrowing the inverted residual structure of MobileNetV2. Accuracy increased from 80.5% to 80.6% on the smaller model, and from 81.9% to 82.6% on the larger model.

You can refer to my previous article: https://blog.csdn.net/dgvv4/article/details/123476899

(3) Fewer normalization layers, and Layer Normalization instead of Batch Normalization. Borrowing from the Transformer structure, the author keeps only the normalization layer after the depthwise convolution; accuracy improves slightly after the replacement.
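As a quick illustration (my own sketch, using TF 2.x defaults): Keras's LayerNormalization normalizes each spatial position of a single sample over the channel axis, so unlike Batch Normalization it does not depend on batch statistics and behaves the same at training and inference time.

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([2, 56, 56, 96])
ln = layers.LayerNormalization()  # default axis=-1: normalize over the channel axis
y = ln(x)  # every spatial position is normalized independently over its 96 channels
print(y.shape)  # (2, 56, 56, 96)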


1.2 Code display

In the following code, gama scales the output feature map of the 1*1 dimension-reduction convolution. gama is a learnable variable whose value is optimized by backpropagation during training.

gama is a one-dimensional vector with as many elements as there are channels in the output feature map. Each element handles one corresponding feature map: all pixel values of that feature map are multiplied by the element, which scales the feature map's data.

The trainable parameter is created with add_weight(), a method of the Layer class, so a Layer instance, layers.Layer(), is created before calling it.

#(2)ConvNeXt Block
def block(inputs, dropout_rate=0.2, layer_scale_init_value=1e-6):
    '''
    layer_scale_init_value: initial value of the scaling vector gama
    '''
    # number of channels of the input feature map
    dim = inputs.shape[-1]

    # residual branch
    residual = inputs

    # 7*7 depthwise convolution
    x = layers.DepthwiseConv2D(kernel_size=(7,7), strides=1, padding='same')(inputs)
    # normalization
    x = layers.LayerNormalization()(x)
    # 1*1 standard convolution expands the channels 4x
    x = layers.Conv2D(filters=dim*4, kernel_size=(1,1), strides=1, padding='same')(x)
    # GELU activation
    x = layers.Activation('gelu')(x)
    # 1*1 standard convolution reduces the channels back
    x = layers.Conv2D(filters=dim, kernel_size=(1,1), strides=1, padding='same')(x)

    # create the learnable vector gama; add_weight() attaches a weight variable to a layer,
    # so it is called on an instance of layers.Layer()
    gama = layers.Layer().add_weight(shape=[dim],  # as many elements as output channels
                                   initializer=tf.initializers.Constant(layer_scale_init_value),  # weight initialization
                                   dtype=tf.float32,  # data type
                                   trainable=True)  # trainable, adjusted by backpropagation

    # layer scale: scale each channel of the feature map by the factor gama
    x = x * gama  # [56,56,96]*[96]==>[56,56,96]

    # Dropout layer randomly drops neurons
    x = layers.Dropout(rate=dropout_rate)(x)

    # residual connection between input and output
    x = layers.add([x, residual])

    return x
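One caveat about the gama trick above: add_weight() is called on a bare layers.Layer() that never becomes part of the model, so depending on the TF 2.x version, gama may not show up in model.trainable_weights. A more robust alternative (my own sketch, not from the original post) is to wrap layer scale in a subclassed layer, so the weight is registered automatically; x = LayerScale()(x) would then replace the add_weight/multiply pair in block():

import tensorflow as tf
from tensorflow.keras import layers

class LayerScale(layers.Layer):
    '''Scales each channel by a learnable factor (layer scale).'''
    def __init__(self, init_value=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.init_value = init_value

    def build(self, input_shape):
        # one learnable scaling factor per channel
        self.gama = self.add_weight(name='gama',
                                    shape=[input_shape[-1]],
                                    initializer=tf.initializers.Constant(self.init_value),
                                    trainable=True)

    def call(self, x):
        return x * self.gama  # broadcasts over height and width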

2. Backbone network

2.1 Network structure diagram


2.2 Design scheme

Network architecture. As shown above, in ResNet50 the numbers of stacked blocks from res2 to res5 are (3, 4, 6, 3), a ratio of roughly 1:1:2:1. In Swin Transformer the third stage takes a larger share: the ratio is 1:1:3:1 for Swin-T and 1:1:9:1 for Swin-L. The author therefore adjusted the stacking numbers in ResNet50 from (3, 4, 6, 3) to (3, 3, 9, 3), which gives FLOPs similar to Swin-T.
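For reference, the stage depths and channel widths of the ConvNeXt variants from the paper, written here as a small Python dict for convenience:

# (blocks per stage, channels per stage) for each ConvNeXt variant
convnext_configs = {
    'T': ([3, 3, 9, 3],  [96, 192, 384, 768]),
    'S': ([3, 3, 27, 3], [96, 192, 384, 768]),
    'B': ([3, 3, 27, 3], [128, 256, 512, 1024]),
    'L': ([3, 3, 27, 3], [192, 384, 768, 1536]),
}

The code in section 2.3 builds the 'T' configuration.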

Design of the downsampling layer. ResNet downsamples by setting the stride of the 3x3 convolution on the main branch to 2 and the stride of the 1x1 convolution on the shortcut branch to 2, while Swin Transformer does it with a separate Patch Merging layer. Following Swin-T, the author gives ConvNeXt a separate downsampling layer, consisting of a Layer Normalization followed by a convolutional layer with kernel_size=2 and strides=2.
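A quick shape check of this design (my own sketch, mirroring the downsampling layer defined in section 2.3): the 2*2 stride-2 convolution halves the spatial resolution while changing the channel count.

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 56, 56, 96])
y = layers.LayerNormalization()(x)
y = layers.Conv2D(filters=192, kernel_size=(2,2), strides=2, padding='same')(y)
print(y.shape)  # (1, 28, 28, 192)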


2.3 Complete code display

ConvNeXt uses only existing structures and methods; it introduces no new structure or technique. The code is also very streamlined: the whole network can be built in a little more than 100 lines.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model, layers

#(1) First convolution block that the input image passes through
def pre_Conv(inputs, out_channel):

    # 4*4 convolution + normalization
    x = layers.Conv2D(filters=out_channel,  # number of output channels
                      kernel_size=(4,4),
                      strides=4,  # downsampling
                      padding='same')(inputs)

    x = layers.LayerNormalization()(x)

    return x


#(2)ConvNeXt Block
def block(inputs, dropout_rate=0.2, layer_scale_init_value=1e-6):
    '''
    layer_scale_init_value: initial value of the scaling vector gama
    '''
    # number of channels of the input feature map
    dim = inputs.shape[-1]

    # residual branch
    residual = inputs

    # 7*7 depthwise convolution
    x = layers.DepthwiseConv2D(kernel_size=(7,7), strides=1, padding='same')(inputs)
    # normalization
    x = layers.LayerNormalization()(x)
    # 1*1 standard convolution expands the channels 4x
    x = layers.Conv2D(filters=dim*4, kernel_size=(1,1), strides=1, padding='same')(x)
    # GELU activation
    x = layers.Activation('gelu')(x)
    # 1*1 standard convolution reduces the channels back
    x = layers.Conv2D(filters=dim, kernel_size=(1,1), strides=1, padding='same')(x)

    # create the learnable vector gama; add_weight() attaches a weight variable to a layer,
    # so it is called on an instance of layers.Layer()
    gama = layers.Layer().add_weight(shape=[dim],  # as many elements as output channels
                                   initializer=tf.initializers.Constant(layer_scale_init_value),  # weight initialization
                                   dtype=tf.float32,  # data type
                                   trainable=True)  # trainable, adjusted by backpropagation

    # layer scale: scale each channel of the feature map by the factor gama
    x = x * gama  # [56,56,96]*[96]==>[56,56,96]

    # Dropout layer randomly drops neurons
    x = layers.Dropout(rate=dropout_rate)(x)

    # residual connection between input and output
    x = layers.add([x, residual])

    return x
    

#(3) Downsampling layer
def downsampling(inputs, out_channel):

    # normalization + 2*2 strided convolution for downsampling
    x = layers.LayerNormalization()(inputs)

    x = layers.Conv2D(filters=out_channel,  # number of output channels
                      kernel_size=(2,2),
                      strides=2,  # downsampling
                      padding='same')(x)
    
    return x


#(4) Stage: one downsampling layer + several block layers
def stage(x, num, out_channel, downsampe=True):
    '''
    num: how many times the block is repeated; out_channel: output channels of the downsampling layer
    downsampe: whether to apply the downsampling layer
    '''
    if downsampe is True:
        x = downsampling(x, out_channel)

    # repeat the block num times; the channel count is the same each time
    for _ in range(num):
        x = block(x)

    return x


#(5) Backbone network
def convnext(input_shape, classes):  # input image shape and number of classes

    # input layer
    inputs = keras.Input(shape=input_shape)

    # [224,224,3]==>[56,56,96]
    x = pre_Conv(inputs, out_channel=96)
    # [56,56,96]==>[56,56,96]
    x = stage(x, num=3, out_channel=96, downsampe=False)
    # [56,56,96]==>[28,28,192]
    x = stage(x, num=3, out_channel=192, downsampe=True)
    # [28,28,192]==>[14,14,384]
    x = stage(x, num=9, out_channel=384, downsampe=True)
    # [14,14,384]==>[7,7,768]
    x = stage(x, num=3, out_channel=768, downsampe=True)

    # [7,7,768]==>[None,768]
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.LayerNormalization()(x)

    # [None,768]==>[None,classes]
    logits = layers.Dense(classes)(x)  # no softmax applied (raw logits)

    # build the model
    model = Model(inputs, logits)

    return model


#(6) Build and inspect the model
if __name__ == '__main__':

    # build the network given the input image shape and the number of classes
    model = convnext(input_shape=[224,224,3], classes=1000)

    model.summary()  # print the network architecture

The network parameters are as follows

==================================================================================================
Total params: 28,582,504
Trainable params: 28,582,504
Non-trainable params: 0
__________________________________________________________________________________________________
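For reference, this is the same scale as ConvNeXt-T, which the paper reports at roughly 28.6 M parameters.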

3. Network Model Diagram

Thanks to the blogger Sunflower's little mung bean for the model diagram.


Origin: blog.csdn.net/dgvv4/article/details/123792313