ICLR 2023 | RevCol: Reversible Column Networks, a new paradigm for large-model architecture design

We have added a new dimension to neural network architecture design!

Since the arrival of the ViT era, a model built as a single stack of blocks has become the dominant design paradigm: the macro-architecture of a neural network is determined by its width (number of channels) and depth (number of blocks). But have you ever considered that a neural network need not be a single stack of blocks? What about 2 stacks, 4 stacks, or... 16 stacks?

We introduce our latest work, "Reversible Column Networks" (RevCol), which brings the idea of disentangled feature learning into model design. RevCol uses reversible columns as the unit of information transmission, which both decouples features and guarantees that information propagates through the network without loss. The network consists of multiple sub-networks (which we call columns) connected by reversible connections. By repeatedly feeding the input into the columns, low-level texture details and high-level semantic information are gradually separated as the columns iterate. The benefit is twofold: high accuracy in pre-training, and no loss of low-level information, which yields better results on downstream tasks (detection, segmentation).

To verify how this design pattern behaves with large models and big data, we built a 2B-parameter pure-CNN model on RevCol, using only 3x3 convolution kernels. It reaches 90% Top-1 accuracy on ImageNet-1K, and both downstream detection and segmentation reach the 60+ level: 63.8% box AP on COCO and 61.0% mIoU on ADE20K. In addition, RevCol still follows the design paradigm of reversible neural networks and inherits their natural memory-saving property: most of the experiments in this article can be run on a 2080 Ti. Saving GPU memory is undoubtedly important for large-model training.


arXiv: https://arxiv.org/pdf/2212.11696.pdf

GitHub: https://github.com/megvii-research/RevCol

(From the Foundation Model Group of Megvii Research; please credit the source when reprinting)


1. Background

As early as the CNN era, we found that a plain backbone such as ResNet is hard to exploit fully in downstream tasks, while multi-scale fusion methods such as HRNet and FPN achieve better results there. Yet simply using HRNet- or FPN-style networks for the upstream classification task is less effective than a straight network like ResNet. So what do multi-scale fusion networks and straight networks each get right, and where does each fall short?

To answer this question, we can examine these structures through the Information Bottleneck principle proposed by Tishby. It states that as a network propagates forward, it gradually discards (compresses) information irrelevant to the task while retaining information helpful to the task. For a network trained on classification, the shallow features close to the input contain a lot of low-level information that has nothing to do with classification, while the deep layers close to the output mainly contain semantic information. From the perspective of classification alone this seems reasonable, but if a network pre-trained on classification is then used downstream, the information discarded during pre-training will hurt the downstream task.

Therefore, an ideal backbone should disentangle the task-relevant semantic information into some feature dimensions while retaining as much of the input information as possible in the network. Bengio made the same point in the 2012 article on disentangling factors of variation: "we require a means of feature extraction that disentangles these factors in the data rather than simply learn to represent some of these factors at the expense of those that are lost in the filter pooling operation."

[Figure 1: information propagation in a plain single-column network vs. RevCol's multi-column design]

Our RevCol achieves this disentanglement within an end-to-end training pipeline through careful structural design. As shown in Figure 1, in an ordinary straight (single-column) network, information closer to the input is more low-level and information closer to the loss is more semantic. RevCol instead adopts a multi-input design: each column starts from low-level information, and as the columns iterate, semantic information is gradually extracted toward the end of each column. Reversible connections are placed between columns, meaning the information of earlier columns can be recovered from later ones: from the last column we can invert all the way back to the first, so information passes between columns losslessly. Meanwhile, intermediate supervision at the end of each column explicitly constrains what each column outputs, ensuring that semantic information is progressively decoupled as the columns iterate.

2. Method

[Figure 2: (a) the reversible unit of RevNet; (b) the multi-level reversible unit; (c) the simplified two-input form used in RevCol]

The RevCol architecture consists of many sub-networks, which we call columns. Arranging multi-level reversible units as iterated columns constitutes the macro-architecture of the network.

2.1 Multi-Level Reversible Unit

If you already know RevNet (the pioneering work on reversible networks), you will remember its two-stream crossing structure. Let us recap: as shown in Figure 2(a), RevNet first partitions the input into $x_0$ and $x_1$ (the two can be identical), and the input of each subsequent reversible block comes from the outputs of the previous two blocks.

$$x_t = F_t(x_{t-1}) + \gamma\, x_{t-2}, \qquad x_{t-2} = \gamma^{-1}\!\left[x_t - F_t(x_{t-1})\right] \tag{1}$$

Equation 1 gives RevNet's forward and inverse computation, where $x_t$ is the output of the current block, $x_{t-1}$ of the previous block, and $x_{t-2}$ of the one before that. The input of the current block consists of two parts: the first is the output of the previous block passed through a non-linear function $F_t(\cdot)$ (which can be Conv/BN/ReLU or a residual block); the second is the output from two blocks earlier passed through a simple reversible operation $\gamma$, such as channel-wise scaling. The two parts are summed. This construction guarantees that during the inverse pass, $x_{t-2}$ can be recovered by feeding $x_{t-1}$ into $F_t(\cdot)$ again and inverting the forward formula: elementary addition, subtraction, multiplication, and division suffice.
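To make the mechanics concrete, here is a minimal PyTorch sketch of Equation 1 (our own illustration, not the authors' code); `F` is a stand-in for the non-linear branch and `gamma` for the channel-wise scaling:

```python
import torch

# Minimal sketch of Equation 1 (illustrative, not the released code).
# F stands in for any non-linear branch (Conv/BN/ReLU, residual blocks...);
# gamma must be invertible, e.g. a non-zero channel-wise scale.
gamma = 2.0

def F(x):
    return torch.tanh(x)  # shapes must match across blocks

x0 = torch.randn(1, 8, 4, 4)  # x_{t-2}
x1 = torch.randn(1, 8, 4, 4)  # x_{t-1}

# forward: x_t = F_t(x_{t-1}) + gamma * x_{t-2}
x2 = F(x1) + gamma * x0

# inverse: x_{t-2} = (x_t - F_t(x_{t-1})) / gamma -- recovers x0 exactly
x0_rec = (x2 - F(x1)) / gamma
assert torch.allclose(x0, x0_rec, atol=1e-6)
```

Because only an addition and a scaling sit outside $F_t(\cdot)$, the inverse needs nothing more than a subtraction and a division.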

RevNet's equation has an inherent flaw: the tensors on both sides of the plus sign must share the same shape, which means the sequence of outputs $x_1, x_2, x_3, \dots$ cannot form a hierarchical (multi-resolution) feature pyramid. This is why RevNet cannot be reversible from input to output; instead, it is only reversible inside each stage (at the same resolution), and information is still lost at every down-sampling step. Under this constraint there is no way to string together multiple hierarchical columns, so we have to change the formulation.

We generalized RevNet's equation. In Equation 2 below, we feed more inputs into the first part, $F_t(\cdot)$, while keeping the second part unchanged. That is, the current output no longer depends on the previous two blocks but on the previous $m$: one of them goes into the second (reversible) part, and the remaining $m-1$ go into the first part. When inverting, as long as we have the $m-1$ inputs of the first part, we can compute the input of the second part; and those $m-1$ inputs were themselves recovered at earlier steps of the inversion.

$$x_t = F_t(x_{t-1}, x_{t-2}, \dots, x_{t-m+1}) + \gamma\, x_{t-m} \tag{2}$$

$$x_{t-m} = \gamma^{-1}\!\left[x_t - F_t(x_{t-1}, x_{t-2}, \dots, x_{t-m+1})\right]$$

【Putting this formula into columns】 If we regard all the features of a network as one long sequence and divide every $m$ consecutive features into a group, then as long as we hold one group, we can recover all the features of the previous group step by step; likewise, each group recovers the group before it during the backward pass. You can picture it as a sliding window. If each group's features are laid out as a column, then computing forward and inverse column by column constitutes the basic structure of RevCol.

【What benefits does Equation 2 bring?】 First, RevNet's shape restriction is relaxed: we only need $x_t$ and $x_{t-m}$ to share the same shape, because $F_t(\cdot)$ can project its other inputs to whatever shape is required. So we retain all the advantages of a hierarchical design. Moreover, this structure can wrap around any single-column network, whose features simply correspond to the features of one group. These properties are crucial for making the network bigger and stronger.
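The shape flexibility is easy to demonstrate. Below is a sketch of Equation 2 with $m=3$ and inputs at different resolutions (again our own illustration, not the released code); the stand-in `F_t` resamples its inputs, so only $x_t$ and $x_{t-m}$ need matching shapes:

```python
import torch
import torch.nn.functional as F

# Sketch of Equation 2 with m = 3 and hierarchical inputs (illustrative).
# Only x_t and x_{t-m} must share a shape; F_t may freely resize its other
# inputs, which is what restores the hierarchy RevNet could not have.
gamma = 2.0

def F_t(x_a, x_b, out_size):
    # stand-in non-linear branch: resample both inputs to the output size
    a = F.interpolate(x_a, size=out_size, mode="nearest")
    b = F.interpolate(x_b, size=out_size, mode="nearest")
    return torch.tanh(a + b)

x0 = torch.randn(1, 8, 16, 16)  # x_{t-3}: must match x_t's shape
x1 = torch.randn(1, 8, 8, 8)    # x_{t-2}: a different resolution
x2 = torch.randn(1, 8, 4, 4)    # x_{t-1}: yet another resolution

x3 = F_t(x2, x1, (16, 16)) + gamma * x0        # forward
x0_rec = (x3 - F_t(x2, x1, (16, 16))) / gamma  # inverse recovers x0
assert torch.allclose(x0, x0_rec, atol=1e-6)
```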

2.2 Basic structure

【Macro structure】

[Figure: the macro structure of RevCol with multiple columns]

When we actually implemented RevCol, we simplified the multi-level reversible unit above: instead of feeding $m-1$ inputs into $F_t(\cdot)$, we reduced it to 2, since adding more inputs brings limited benefit. Referring to Figure 2(c), one input comes from the previous level of the same column (red line), and the other comes from the next level of the previous column (blue line). Of the two, one carries high-level semantic information and the other low-level texture information, which is enough.
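A toy forward pass over the resulting column/level grid might look like this (our own simplified reading of Figure 2(c): the stem, boundary handling, and per-level shapes differ in the real model):

```python
import torch

# Toy column/level grid (illustrative). x[t][i] is the level-i feature of
# column t; all levels share one shape here to keep the sketch short,
# whereas RevCol's real levels are hierarchical.
num_cols, num_levels = 4, 4
shape = (1, 8, 16, 16)
gamma = 1.0

def F_unit(low, high):
    return torch.tanh(low + high)  # stand-in for Fusion + ConvNeXt blocks

stem = torch.randn(shape)  # the input feature fed to every column
x = [[torch.zeros(shape) for _ in range(num_levels)] for _ in range(num_cols + 1)]

for t in range(1, num_cols + 1):  # iterate columns
    for i in range(num_levels):   # iterate levels bottom-up
        low = stem if i == 0 else x[t][i - 1]                          # previous level, same column
        high = x[t - 1][i + 1] if i + 1 < num_levels else x[t - 1][i]  # next level, previous column
        x[t][i] = F_unit(low, high) + gamma * x[t - 1][i]              # reversible add
```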

【Micro structure】

[Figure 5: the micro design of each level — a Fusion unit followed by ConvNeXt blocks]

Within each level, a Fusion unit (Figure 5(c)) first adjusts the inputs of different shapes to a common shape, and then a stack of ConvNeXt blocks produces the output; together these form the $F_t(\cdot)$ in the formula. The result is then added to the input coming through the reversible operation to give the final output. Notably, we changed the kernel size in the original ConvNeXt block from 7x7 to 3x3: on RevCol the benefit of large kernels is limited, and the small kernel is really fast!
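As a hedged sketch of how such a level could be wired up (our reading of the description above, not the official implementation; module names, channel handling, and the norm layer are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNeXtBlock3x3(nn.Module):
    """ConvNeXt-style block with the 3x3 depthwise kernel used in RevCol."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm = nn.GroupNorm(1, dim)  # stand-in for LayerNorm
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        return x + self.pw2(F.gelu(self.pw1(self.norm(self.dw(x)))))

class Level(nn.Module):
    """One RevCol level: Fusion unit + ConvNeXt blocks, plus reversible add."""
    def __init__(self, c_low, c_high, c_out, depth=2):
        super().__init__()
        self.down = nn.Conv2d(c_low, c_out, 3, stride=2, padding=1)  # fuse level i-1, same column
        self.up = nn.Conv2d(c_high, c_out, 1)                        # fuse level i+1, previous column
        self.blocks = nn.Sequential(*[ConvNeXtBlock3x3(c_out) for _ in range(depth)])
        self.gamma = nn.Parameter(torch.ones(1, c_out, 1, 1))        # must stay non-zero to be invertible

    def forward(self, x_low, x_high, x_prev_col):
        # Fusion: align both inputs to this level's shape, then F_t + gamma branch
        h = self.down(x_low) + F.interpolate(self.up(x_high), scale_factor=2.0, mode="nearest")
        return self.blocks(h) + self.gamma * x_prev_col  # F_t(...) + gamma * x_{t-1}
```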

2.3 Intermediate Supervision

We also designed a plug-in intermediate supervision method which, without changing the end-to-end training pipeline, brings more than one extra point of gain both upstream and downstream. We observed a risk caused by the additive design of the reversible unit: the end of each column is connected, through the chain of reversible additions, almost directly to the loss attached after the last column. Being close to the loss means the features at those positions are pushed to contain mainly semantic information, so early columns themselves lose information. If an early column already extracts too much semantic information and discards too much texture, the reversible columns after it have little left to gain, and the design would collapse. We must prevent this.

So at the end of each column we add a classification head and a feature reconstruction head, attached to a CE loss and a BCE loss respectively, and we gradually shift the ratio of the two losses as the columns deepen: at the last column the reconstruction weight is 0 and the classification weight is 100%.

The compound loss of column $i$ and the total loss $L$ are:

$$L_i = \alpha_i\, L_{\mathrm{BCE}} + \beta_i\, L_{\mathrm{CE}}, \qquad L = \sum_{i=1}^{n} L_i \tag{3}$$

where $\alpha_i$ decays to 0 and $\beta_i$ grows to 1 as the column index $i$ increases.
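A minimal sketch of this weighting schedule (our illustration; the exact $\alpha_i$/$\beta_i$ schedule in the paper may differ, and the reconstruction target is assumed here to be normalized to [0, 1] so BCE applies):

```python
import torch
import torch.nn.functional as F

def compound_loss(recon_per_col, logits_per_col, target_feat, labels):
    # recon_per_col / logits_per_col: per-column head outputs, n >= 2 columns
    n = len(recon_per_col)
    total = 0.0
    for i, (recon, logits) in enumerate(zip(recon_per_col, logits_per_col)):
        alpha = 1.0 - i / (n - 1)  # reconstruction weight: 1 -> 0 across columns
        beta = i / (n - 1)         # classification weight: 0 -> 1 across columns
        l_bce = F.binary_cross_entropy_with_logits(recon, target_feat)
        l_ce = F.cross_entropy(logits, labels)
        total = total + alpha * l_bce + beta * l_ce
    return total
```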

2.4 Model Design

Under the RevCol framework, a model has three dimensions: the number of channels (width), the number of blocks in a single column (depth), and the number of columns. While designing the models we found that the gain from adding columns is almost equivalent to increasing width and depth simultaneously, so we adopted a simple and crude scale-up rule: a Small (8 columns) is two Tinys (2 x 4 columns), and a Base (16 columns) is four Tinys (4 x 4 columns), with a small adjustment on Base to align its compute with competing models.

[Table: RevCol model variants and their configurations]

Oh, and by the way, RevCol also has the feature everyone loves about reversible networks: the memory saving advertised in other reversible-network papers. RevCol-T/S/B use almost the same amount of computation per column, and adding a column adds only its parameters to GPU memory, so these three model sizes occupy essentially the same memory during training: all of them can be trained on an RTX 2080 Ti (11 GB). For Huge (2B parameters) we also enabled the reversible computation mode to improve training efficiency; without this memory saving, the training cost of Huge would grow many times over.
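Conceptually, the saving comes from recomputing freed activations via the inverse during backprop instead of storing them. A simplified sketch of one such step (not Megvii's code; gradients w.r.t. the newer inputs are omitted for brevity):

```python
import torch

def backward_one_step(x_t, x_newer, F_t, gamma, grad_x_t):
    # 1) recover the freed earlier feature via the inverse of Equation 2
    with torch.no_grad():
        x_old = (x_t - F_t(*x_newer)) / gamma
    # 2) redo the local forward with gradient tracking, then backprop through it
    x_old.requires_grad_(True)
    x_t_again = F_t(*x_newer) + gamma * x_old
    x_t_again.backward(grad_x_t)
    return x_old, x_old.grad  # recovered activation and its gradient
```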

3. Experimental results

【ImageNet Classification】

[Table: ImageNet classification results]

In addition to the 2B-parameter model, we also collected a private dataset of 168 million images (Megdata-168M) with weak labels for pre-training. The XL model (800M parameters) reaches 88.2% with ImageNet-22K pre-training, rising to 89.4% after Megdata-168M pre-training. Huge (2.1B parameters), pre-trained at 224 and fine-tuned at 640x640, reaches 90.0% Top-1 accuracy. The training cost of this model: pre-training totals 1600 ImageNet epochs, and one run takes 14 days on 80 A100s.

【COCO Object Detection】

[Table: COCO object detection results]

【ADE Semantic Segmentation】

[Table: ADE20K semantic segmentation results]

On COCO, using the DINO framework and after further fine-tuning on Objects365, RevCol-H reaches 63.8 box AP. On ADE20K, using the Mask2Former framework, mIoU reaches 61.0%.

【Foundation Models】

[Table: comparison with other foundation models]

We compared against various foundation models. RevCol-H is a single-modal model (Megdata-168M contains only images, no language) whose labels were produced semi-automatically; we did not use masked image modeling (MIM) pre-training, and it is still a CNN. Yet on both upstream and downstream tasks it achieves results comparable to other single-modal and multi-modal large models, such as the multi-modal BEiT-3 and Florence, and the single-modal giant SwinV2-G trained under the MIM pre-training setting.

4. Conclusion and Outlook

Our goal for this first version of RevCol is to verify its scaling-up capability on pure vision tasks. But large vision models should not be limited to classification, detection, and the like; especially after the phenomenal explosion of ChatGPT, where large vision models are headed is worth pondering. The potential futures we can see include video understanding, multimodal models, generative models, autonomous driving, and more. We firmly believe a macro-architecture like RevCol can serve universally: wherever the future of CV is, RevCol will be there. See you at the top.

