论文详解——《InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions》

论文地址:《InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions》

Abstract


Original translation:
Compared with the tremendous progress made by large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This paper proposes a new large-scale CNN-based foundation model called InternImage.
Similar to ViTs, this model can gain from increasing parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage adopts deformable convolution as the core operator, so that our model not only has the large effective receptive field required by downstream tasks such as detection and segmentation, but also has adaptive spatial aggregation conditioned on the input and task information.
As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs, making it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is verified on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieves 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, surpassing the current leading CNNs and ViTs.

1. Introduction


Original translation
With the remarkable success of Transformers in large-scale language models [3-8], vision transformers (ViTs) [2, 9-15] have also swept the field of computer vision and become the first choice for research and practice on large-scale vision foundation models. Some pioneers [16-20] have tried to extend ViTs to very large models with more than 1 billion parameters, surpassing convolutional neural networks (CNNs) and significantly pushing the performance limits of various computer vision tasks, including basic classification, detection, and segmentation. Although these results suggest that CNNs are inferior to ViTs in the era of massive parameters and data, we argue that with similar operator-/architecture-level designs, scaled-up parameters, and massive data, CNN-based foundation models can also achieve comparable or even better performance than ViTs.

To bridge the gap between CNNs and ViTs, we first summarize their differences from two aspects:
(1) At the operator level [9, 21, 22], ViT's multi-head self-attention (MHSA) has long-range dependencies and adaptive spatial aggregation (see Figure 1(a)). Benefiting from the flexible MHSA, ViTs can learn more powerful and robust representations than CNNs from massive data.
(2) From the architecture perspective [9, 22, 23], in addition to MHSA, ViTs also contain a series of advanced components that are not included in standard CNNs, such as layer normalization (LN) [24], the feed-forward network (FFN) [1], GELU [25], etc. Although recent studies [21, 22] have made meaningful attempts to introduce long-range dependencies into CNNs by using dense convolutions with very large kernels (e.g., 31×31), as shown in Figure 1(c), there is still a considerable gap compared with the state-of-the-art large-scale ViTs [16, 18-20, 26] in terms of performance and model size.

In this work, we focus on designing a CNN-based foundation model that can scale efficiently to large-scale parameters and data. Specifically, we start with a flexible convolution variant, deformable convolution (DCN) [27, 28]. By combining it with a series of tailored block-level and architecture-level designs similar to Transformers, we design a brand new convolutional backbone named InternImage.
Figure 1: Comparing different core operators.
(a) shows the global aggregation of multi-head self-attention (MHSA) [1], whose computational and memory overhead is expensive in downstream tasks that require high-resolution inputs.
(b) limits the scope of MHSA to a local window [2] (Swin Transformer) to reduce the cost.
(c) is a depth-wise convolution with a very large kernel to model long-range dependencies.
(d) is a deformable convolution that shares similar favorable properties with MHSA and is efficient enough for large-scale models. We start with it to build a large-scale CNN.

As shown in Figure 1, unlike CNNs with very large kernels such as 31×31 [22], the core operator of InternImage is a dynamic sparse convolution with a common 3×3 window size, which has the following properties:

(1) Its sampling offsets are flexible, so it can dynamically learn the appropriate receptive field (which can be long-range or short-range) from the given data;
(2) The sampling offsets and modulation scalars are adaptively adjusted according to the input data, achieving ViT-style adaptive spatial aggregation and reducing the over-strong inductive bias of regular convolution;
(3) The convolution window is a common 3×3, which avoids the optimization problems and expensive cost caused by large dense kernels [22, 29].

With the above designs, the proposed InternImage can be efficiently scaled to large parameter sizes and learn stronger representations from large-scale training data, achieving comparable or even better performance than large-scale ViTs [2, 11, 30] on a wide range of vision tasks. In summary, our main contributions are as follows:

(1) We propose InternImage, a large-scale CNN-based foundation model. To the best of our knowledge, it is the first CNN that scales efficiently to over 1 billion parameters and 400 million training images and achieves comparable or even better performance than state-of-the-art ViTs, suggesting that convolutional models are also a direction worth exploring for large-scale model research.

(2) We successfully scale CNNs to large-scale settings with an improved 3×3 DCN operator that introduces long-range dependencies and adaptive spatial aggregation, and explore operator-centric tailored basic blocks, stacking rules, and scaling strategies. These designs make effective use of the operator, enabling our models to benefit from large-scale parameters and data.

(3) We evaluate the model on representative vision tasks (including image classification, object detection, instance and semantic segmentation) and compare it with current state-of-the-art CNNs and large-scale ViTs, scaling the model size from 30 million to 1 billion parameters and the data size from 1 million to 400 million images. Specifically, our models with different parameter sizes consistently outperform prior arts on ImageNet [31]. Trained only on the ImageNet-1K dataset, InternImage-B achieves a top-1 accuracy of 84.9%, outperforming its CNN-based counterparts by at least 1.1 points [21, 22]. With large-scale parameters (i.e., 1 billion) and training data (i.e., 427 million), the top-1 accuracy of InternImage-H is further improved to 89.6%, which is close to well-engineered ViTs [2, 30] and hybrid ViTs [20]. Furthermore, on the challenging downstream benchmark COCO [32], our best model InternImage-H achieves a state-of-the-art 65.4% box mAP with 2.18 billion parameters, 2.3 points higher than SwinV2-G [16] (65.4 vs. 63.1) with 27% fewer parameters, as shown in Figure 2.

2. Related Work

  • Vision foundation models.

With large-scale datasets and computing resources, convolutional neural networks (CNNs) have become the mainstream for visual recognition. Starting from AlexNet [33], many deeper and more effective neural network architectures have been proposed, such as VGG [34], GoogLeNet [35], ResNet [36], ResNeXt [37], EfficientNet [38, 39], etc. In addition to architectural designs, more sophisticated convolution operations have been proposed, such as depth-wise convolution [40] and deformable convolution [27, 28]. Considering the advanced designs of Transformers, modern CNNs show promising performance on vision tasks by discovering better components at the macro/micro level and introducing improved convolutions with long-range dependencies [21, 41-43] or dynamic weights [44].

In recent years, a new line of vision foundation models has focused on Transformer-based architectures. ViT [9] is the most representative model among them and has achieved great success in vision tasks thanks to its global receptive field and dynamic spatial aggregation. However, global attention in ViT suffers from expensive computational/memory complexity, especially on large feature maps, which limits its application in downstream tasks. To address this problem, PVT [10, 11] and Linformer [45] perform global attention on downsampled key-value maps, DAT [46] employs deformable attention to sparsely sample information from value maps, while HaloNet [47] and Swin Transformer [2] develop local attention mechanisms and use haloing and shift operations to transfer information between adjacent local regions.

  • Large-scale models
    Scaling up models is an important strategy to improve the quality of feature representations and has been extensively studied in natural language processing [48]. Inspired by the success in the NLP field, Zhai et al. first extended ViT to 2 billion parameters, and Liu et al. [16] extended the hierarchically structured Swin Transformer to a deeper and wider model with 3 billion parameters. Some researchers combined the advantages of ViTs and CNNs at different levels to develop large-scale hybrid ViTs [20, 49]. More recently, BEiT-3 [17] further explores stronger ViT-based representations with large-scale parameters by exploiting multimodal pre-training. These methods significantly raise the upper bound of basic vision tasks. However, research on CNN-based large-scale models lags behind Transformer-based architectures in terms of total number of parameters and performance. Although newly proposed CNNs [21, 41-43] introduce long-range dependencies by using convolutions with very large kernels or recursively gated kernels, there is still a considerable gap compared with state-of-the-art ViTs. In this work, we aim to develop a CNN-based foundation model that can scale efficiently to a large scale comparable to ViTs.

3. Proposed Method

To design a large-scale CNN-based foundation model, we start with a flexible variant of convolution, deformable convolution v2 (DCNv2), and make some adjustments to better suit the requirements of large-scale foundation models. We then combine the tuned convolution operator with advanced block designs used in modern backbones (Swin Transformer V2, Scaling Vision Transformers) to build the basic block. Finally, we explore stacking and scaling principles based on the DCN-based block to construct a large-scale convolutional model that can learn strong representations from massive data.

3.1 Deformable Convolution v3


Original translation

  • Convolution vs. MHSA (ordinary convolution compared with MHSA)

Previous works [21, 22, 50] have extensively discussed the differences between CNNs and ViTs. Before deciding on the core operator of InternImage, we first summarize the main differences between ordinary convolution and MHSA.

(1) Long-range dependencies

Models with larger effective receptive fields (long-range dependencies) usually perform better on downstream vision tasks [51-53], but the de-facto effective receptive field of CNNs stacked with 3×3 regular convolutions is relatively small. Even with very deep models, CNN-based models still cannot acquire long-range dependencies like ViTs, which limits their performance.

(2) Adaptive spatial aggregation

Compared with MHSA (multi-head self-attention), whose weights are dynamically conditioned on the input, regular convolution [54] is a static-weight operator with strong inductive biases such as 2D locality, neighborhood structure, and translation equivariance. Because of these highly inductive properties, models composed of regular convolutions converge faster and require less training data than ViTs, but they also restrict CNNs from learning more general and robust patterns from web-scale data.

  • Revisiting DCNv2

A straightforward way to build a bridge between convolution and MHSA is to introduce long-range dependencies and adaptive spatial aggregation into regular convolution. We start with DCNv2 [28], which is a general variant of regular convolution. Given an input $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$ and the current pixel $p_0$, DCNv2 can be expressed as:

Eq. 1:

$$\mathbf{y}\left(p_0\right)=\sum_{k=1}^{K} \mathbf{w}_k\, \mathbf{m}_k\, \mathbf{x}\left(p_0+p_k+\Delta p_k\right)$$

where $K$ is the total number of sampling points and $k$ enumerates the sampling points. $\mathbf{w}_k \in \mathbb{R}^{C \times C}$ denotes the projection weights of the $k$-th sampling point, and $\mathbf{m}_k \in \mathbb{R}$ denotes the modulation scalar of the $k$-th sampling point, normalized by the sigmoid function. $p_k$ denotes the $k$-th location of the predefined grid sampling $\{(-1,-1), (-1,0), \ldots, (0,+1), \ldots, (+1,+1)\}$ as in regular convolution, and $\Delta p_k$ is the offset corresponding to the $k$-th grid sampling location.

It can be seen from the formula that (1) for long-range dependencies, the sampling offset $\Delta p_k$ is flexible and can interact with short-range or long-range features; and (2) for adaptive spatial aggregation, both the sampling offset $\Delta p_k$ and the modulation scalar $\mathbf{m}_k$ are learnable and conditioned on the input $\mathbf{x}$. It can thus be found that DCNv2 shares similar favorable properties with MHSA, which motivates us to develop a large-scale CNN-based foundation model on the basis of this operator.
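To make Eq. 1 concrete, the following is a minimal, per-pixel PyTorch sketch of the DCNv2 aggregation, assuming the offsets and modulation scalars have already been predicted elsewhere; the tensor shapes and the `bilinear_sample` helper are illustrative, and the official implementation uses a fused CUDA kernel rather than this naive loop.

```python
import torch
import torch.nn.functional as F

def bilinear_sample(x, py, px):
    """Bilinearly sample x (C, H, W) at a fractional location (py, px)."""
    C, H, W = x.shape
    # grid_sample expects normalized (x, y) coordinates in [-1, 1].
    grid = torch.tensor([[[[2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1]]]], dtype=x.dtype)
    return F.grid_sample(x[None], grid, align_corners=True).view(C)

def dcnv2_point(x, w, m, offsets, p0, grid):
    """Eq. 1: y(p0) = sum_k w_k * m_k * x(p0 + p_k + dp_k).

    x:       (C, H, W) input feature map
    w:       (K, C, C) independent projection weights per sampling point
    m:       (K,)      modulation scalars (already sigmoid-normalized)
    offsets: (K, 2)    learned offsets dp_k in (y, x) order
    p0:      (2,)      current pixel location (y, x)
    grid:    (K, 2)    predefined grid p_k, e.g. the 3x3 neighborhood
    """
    y = torch.zeros(w.shape[1], dtype=x.dtype)
    for k in range(w.shape[0]):
        py, px = (p0 + grid[k] + offsets[k]).tolist()
        y = y + m[k] * (w[k] @ bilinear_sample(x, py, px))
    return y
```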

Deformable Convolution (DCN) references:
"Paper and Code Detailed Explanation - Deformable Convolution (DCNv1)"
"Paper and Code Detailed Explanation - Deformable Convolution (DCNv2)"

  • Extending DCNv2 for Vision Foundation Models (DCNv3 proposed on the basis of DCNv2)

In common practice, DCNv2 is usually used as an extension of regular convolution, loading pre-trained weights of regular convolutions and fine-tuning for better performance, which is not entirely suitable for large-scale vision foundation models that need to be trained from scratch. In this work, to address this issue, we extend DCNv2 in the following aspects:

(1) Sharing weights among convolutional neurons.
Similar to regular convolution, different convolutional neurons in the original DCNv2 have independent linear projection weights, so its parameter and memory complexity is linear with the total number of sampling points, which greatly limits the efficiency of the model, especially for large-scale models. To solve this problem, we borrow the idea of separable convolution [55] and separate the original convolution weight $\mathbf{w}_k$ into two parts, depth-wise and point-wise, where the depth-wise part is handled by the original location-aware modulation scalar $\mathbf{m}_k$, and the point-wise part is the projection weight $\mathbf{w}$ shared among sampling points.
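As a rough sanity check on why this matters at scale, the snippet below (a back-of-the-envelope estimate, not the paper's accounting) compares the projection-parameter count of one layer with independent per-point weights against the shared point-wise projection described above; it ignores the offset/modulation prediction branches.

```python
def dcn_projection_params(channels: int, k: int = 9, shared: bool = True) -> int:
    """Rough projection-parameter count for one deformable-conv layer.

    k:      number of sampling points (3x3 kernel -> 9)
    shared: True  -> shared point-wise projection, C * C parameters
            False -> independent weights per sampling point, K * C * C parameters
    """
    return channels * channels if shared else k * channels * channels

for c in (64, 512, 1024):
    print(c, dcn_projection_params(c, shared=False), dcn_projection_params(c, shared=True))
```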

(2) Introducing a multi-group mechanism.
The multi-group (head) design first appeared in group convolution [33] and is widely used in the MHSA of Transformers [1], where it works with adaptive spatial aggregation to effectively learn richer information from different representation subspaces at different locations. Inspired by this, we divide the spatial aggregation process into $G$ groups, each of which has individual sampling offsets $\Delta p_{gk}$ and modulation scalars $\mathbf{m}_{gk}$, so that different groups in a single convolution layer can have different spatial aggregation patterns, leading to stronger features for downstream tasks.

(3) Normalizing modulation scalars along sampling points.
The modulation scalars in the original DCNv2 are normalized element-wise by the sigmoid function. Therefore, each modulation scalar lies in the range [0, 1], but the sum of the modulation scalars over all sampling points is not stable and varies between 0 and K, which leads to unstable gradients in the DCNv2 layers when training with large-scale parameters and data. To alleviate this instability, we change the element-wise sigmoid normalization to softmax normalization along the sampling points. In this way, the sum of the modulation scalars is constrained to 1, making the training process of models at different scales more stable.
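A minimal illustration of the difference, assuming raw modulation logits for K = 9 sampling points: sigmoid normalizes each scalar independently, so the sum can drift anywhere in [0, K], while softmax along the sampling points pins the sum to exactly 1.

```python
import torch

logits = torch.randn(9)                   # raw modulation logits for K = 9 sampling points
sigmoid_m = torch.sigmoid(logits)         # DCNv2: element-wise, sum anywhere in [0, K]
softmax_m = torch.softmax(logits, dim=0)  # DCNv3: along sampling points, sum == 1
print(sigmoid_m.sum().item(), softmax_m.sum().item())
```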

✨✨✨ Combining the above modifications, the extended DCNv2 is denoted as DCNv3, which can be written as Eq. (2):

$$\mathbf{y}\left(p_0\right)=\sum_{g=1}^{G} \sum_{k=1}^{K} \mathbf{w}_g\, \mathbf{m}_{gk}\, \mathbf{x}_g\left(p_0+p_k+\Delta p_{gk}\right)$$

where $G$ is the total number of aggregation groups. For the $g$-th group, $\mathbf{w}_g \in \mathbb{R}^{C \times C'}$ denotes the location-irrelevant projection weights of the group, where $C'=C/G$ denotes the group dimension. $\mathbf{m}_{gk} \in \mathbb{R}$ denotes the modulation scalar of the $k$-th sampling point in the $g$-th group, normalized by the softmax function along the dimension $K$. $\mathbf{x}_g \in \mathbb{R}^{C' \times H \times W}$ represents the sliced input feature map, and $\Delta p_{gk}$ is the offset corresponding to the grid sampling location $p_k$ in the $g$-th group.
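Below is a naive per-pixel PyTorch sketch of Eq. 2, meant only to illustrate the grouped aggregation with shared point-wise weights and softmax-normalized modulation; the tensor shapes and the `_sample` helper are assumptions for illustration, not the released fused implementation.

```python
import torch
import torch.nn.functional as F

def _sample(feat, py, px):
    """Bilinearly sample feat (C', H, W) at a fractional location (py, px)."""
    C, H, W = feat.shape
    grid = torch.tensor([[[[2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1]]]], dtype=feat.dtype)
    return F.grid_sample(feat[None], grid, align_corners=True).view(C)

def dcnv3_point(x, w, m, offsets, p0, grid):
    """Eq. 2: y(p0) = sum_g sum_k w_g * m_gk * x_g(p0 + p_k + dp_gk).

    x:       (C, H, W)   input feature map, sliced into G groups of C' = C // G channels
    w:       (G, C, C')  shared (location-irrelevant) projection weights per group
    m:       (G, K)      modulation scalars, softmax-normalized along K
    offsets: (G, K, 2)   learned offsets dp_gk in (y, x) order
    p0:      (2,)        current pixel location (y, x)
    grid:    (K, 2)      predefined grid p_k (e.g. the 3x3 neighborhood)
    """
    G, K = m.shape
    C = x.shape[0]
    x_groups = x.view(G, C // G, *x.shape[1:])   # slice the input feature map into groups
    y = torch.zeros(C, dtype=x.dtype)
    for g in range(G):
        for k in range(K):
            py, px = (p0 + grid[k] + offsets[g, k]).tolist()
            sampled = _sample(x_groups[g], py, px)     # (C',)
            y = y + m[g, k] * (w[g] @ sampled)         # project C' -> C and accumulate
    return y
```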

Overall, as an extension of the DCN series, the DCNv3 operator has the following three advantages:

(1) This operator makes up for the deficiencies of regular convolution in terms of long-range dependencies and adaptive spatial aggregation;
(2) Compared with attention-based operators such as plain MHSA and the closely related deformable attention [46, 56], this operator inherits the inductive bias of convolution, making our model more efficient with less training data and shorter training time;
(3) This operator is based on sparse sampling, which is more computationally and memory efficient than previous methods such as MHSA [1] and re-parameterized large kernels [22]. In addition, thanks to the sparse sampling, DCNv3 only needs a 3×3 kernel to learn long-range dependencies, which is easier to optimize and avoids extra auxiliary techniques such as re-parameterization [22] used with large kernels.

3.2 InternImage Model


Original translation

Using DCNv3 as the core operator brings a new problem: how to build a model that can effectively utilize the core operator?

In this section, we first present the details of the other integral layers of the basic block of our model, and then construct a new CNN-based foundation model, InternImage, by studying a tailored stacking strategy for these basic blocks. Finally, we study the scaling-up rules of the model to obtain gains from increased parameters.
[Figure 3: Overall architecture of InternImage]
As shown in Figure 3, the core operator is DCNv3; the basic block is composed of layer normalization (LN) [24] and a feed-forward network (FFN) [1], as in Transformers; the stem and downsampling layers follow the conventional CNN design, where "s2" and "p1" denote stride 2 and padding 1, respectively. Constrained by the stacking rules, only 4 hyperparameters (C1, C', L1, L3) are needed to determine a model variant.

  • Basic block

Different from the bottleneck widely used in traditional CNNs [36], our basic block design is closer to ViTs and is equipped with more advanced components, including LN [24], the feed-forward network (FFN) [1], and GELU [25]. This design has been proven effective in various vision tasks [2, 10, 11, 21, 22]. The details of our basic block are shown in Figure 3, where the core operator is DCNv3, and the sampling offsets and modulation scalars are predicted from the input feature $x$ by a separable convolution (a 3×3 depth-wise convolution followed by a linear projection). For the other components, we use post-normalization [57] by default and follow the same design as the plain Transformer [1, 9].
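A minimal sketch of how such an offset/modulation prediction branch could look, assuming G groups and a 3×3 kernel (K = 9); the module name `OffsetMaskHead` and the exact layer arrangement are an illustrative reading of the description above, not the official code.

```python
import torch
import torch.nn as nn

class OffsetMaskHead(nn.Module):
    """Predicts sampling offsets and modulation scalars from the input features with a
    separable convolution: a 3x3 depth-wise conv followed by linear (point-wise) projections."""
    def __init__(self, channels: int, groups: int, kernel_points: int = 9):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.offset = nn.Linear(channels, groups * kernel_points * 2)  # dp_gk in (y, x)
        self.mask = nn.Linear(channels, groups * kernel_points)        # modulation logits
        self.groups, self.k = groups, kernel_points

    def forward(self, x):                       # x: (N, C, H, W)
        feat = self.dw(x).permute(0, 2, 3, 1)   # depth-wise conv, then channels-last: (N, H, W, C)
        offsets = self.offset(feat)             # (N, H, W, G*K*2)
        mask = self.mask(feat).view(*feat.shape[:3], self.groups, self.k)
        mask = mask.softmax(dim=-1)             # normalize along sampling points (DCNv3 style)
        return offsets, mask
```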

  • Stem & downsampling layers

To obtain hierarchical feature maps, we use convolutional stem and downsampling layers to resize feature maps to different scales. As shown in Figure 3, the stem layer is placed before the first stage and reduces the input resolution by a factor of 4. It consists of two convolutions, two LN layers, and one GELU layer. Both convolutions have a kernel size of 3, a stride of 2, and a padding of 1, and the output channels of the first convolution are half of those of the second. Similarly, the downsampling layer consists of a 3×3 convolution with stride 2 and padding 1, followed by an LN layer. It sits between two stages and downsamples the input feature map by a factor of 2.
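A sketch of the stem and downsampling layers as described above, assuming LN is applied over the channel dimension in channels-last order; the helper names are hypothetical and this is an illustrative reading of the text rather than the released implementation.

```python
import torch.nn as nn

class ChannelsLastLN(nn.Module):
    """LayerNorm over the channel dimension of an (N, C, H, W) tensor."""
    def __init__(self, channels: int):
        super().__init__()
        self.ln = nn.LayerNorm(channels)
    def forward(self, x):
        return self.ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

def stem(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two stride-2 3x3 convs (4x downsampling); the first conv outputs out_ch // 2 channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1),
        ChannelsLastLN(out_ch // 2),
        nn.GELU(),
        nn.Conv2d(out_ch // 2, out_ch, kernel_size=3, stride=2, padding=1),
        ChannelsLastLN(out_ch),
    )

def downsample(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 stride-2 conv followed by LN, placed between two stages (2x downsampling)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        ChannelsLastLN(out_ch),
    )
```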

  • Stacking rules

To clarify the process of block stacking, we first list the complete hyperparameters of InternImage as follows:

$C_i$: the number of channels in the $i$-th stage;

$G_i$: the group number of DCNv3 in the $i$-th stage;

$L_i$: the number of basic blocks in the $i$-th stage.

Since our model has 4 stages and a variant is determined by 12 hyperparameters, the search space is too large to exhaustively enumerate and find the best variant. To reduce the search space, we summarize the design experience of prior works [2, 21, 36] into 4 rules:
(1) $C_i = 2^{i-1}C_1$;
(2) $G_i = C_i / C'$;
(3) $L_1 = L_2 = L_4$;
(4) $L_1 \leq L_3$.
The first rule makes the channel numbers of the last three stages determined by the channel number of the first stage $C_1$.
The second rule makes the group number correspond to the channel number of the stage.
For the number of stacked blocks in different stages, we simplify the stacking pattern to "AABA", i.e., the block numbers of stages 1, 2, and 4 are equal and not greater than that of stage 3, as expressed by the last two rules. With these rules, an InternImage variant can be defined by only 4 hyperparameters (C1, C', L1, L3).

We select a model with 30 million parameters as the origin, discretize C1 as {48, 64, 80}, L1 as {1, 2, 3, 4, 5}, and C' as {16, 32}. In this way, the originally huge search space is compressed to 30 candidates, and the optimal model is found among these 30 variants by training and evaluation on ImageNet [31]. In practice, we use the best hyperparameter setting (64, 16, 4, 18) to define the origin model and scale it to different sizes.
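The four stacking rules can be expressed as a small helper that expands the four hyperparameters (C1, C', L1, L3) into per-stage channels, groups, and depths; the example below simply plugs in the (64, 16, 4, 18) setting mentioned above.

```python
def internimage_config(c1: int, c_prime: int, l1: int, l3: int):
    """Expand (C1, C', L1, L3) into per-stage channels, DCNv3 groups and depths using
    the four stacking rules: C_i = 2^(i-1)*C1, G_i = C_i/C', L1 = L2 = L4, L1 <= L3."""
    assert l1 <= l3, "rule (4): L1 <= L3"
    channels = [c1 * 2 ** i for i in range(4)]   # rule (1)
    groups = [c // c_prime for c in channels]    # rule (2)
    depths = [l1, l1, l3, l1]                    # rules (3)-(4), the 'AABA' pattern
    return channels, groups, depths

# Origin model with the best setting (C1, C', L1, L3) = (64, 16, 4, 18):
print(internimage_config(64, 16, 4, 18))
# -> ([64, 128, 256, 512], [4, 8, 16, 32], [4, 4, 18, 4])
```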

  • Scaling rules

Based on the optimal origin model under the above constraints, we further explore parameter scaling rules inspired by [38]. Specifically, we consider two scaling dimensions: depth D (i.e., $3L_1 + L_3$) and width $C_1$, and scale the two dimensions with $\alpha$, $\beta$, and a composite factor $\varphi$.
The scaling rule can be written as $D' = \alpha^{\varphi} D$ and $C_1' = \beta^{\varphi} C_1$, where $\alpha \geq 1$, $\beta \geq 1$, and $\alpha\beta^{1.99} \approx 2$.
Here, 1.99 is specific to InternImage and is calculated by doubling the model width while keeping the depth constant. Experiments found that the best scaling setting is α = 1.09 and β = 1.36.
On this basis, we construct InternImage variants with different parameter scales, i.e., InternImage-T/S/B/L/XL, whose complexities are similar to those of ConvNeXt [21]. To further test the capability, we build a larger InternImage-H with 1 billion parameters, and to accommodate the very large model width, we also change the group dimension C' to 32. Table 1 summarizes these configurations.
[Table 1: Configurations of InternImage variants]
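A hedged sketch of the scaling rule: starting from the origin depth D = 3L1 + L3 and width C1, both dimensions are scaled by the compound factor φ with α = 1.09 and β = 1.36; the rounding of the scaled values here is illustrative, since the text does not specify how the continuous results map to the final variants.

```python
ALPHA, BETA = 1.09, 1.36   # best scaling setting reported above (alpha * beta**1.99 ~= 2)

def scale_model(l1: int, l3: int, c1: int, phi: float):
    """Apply D' = alpha**phi * D and C1' = beta**phi * C1 to the origin model."""
    depth = 3 * l1 + l3
    return round(ALPHA ** phi * depth), round(BETA ** phi * c1)

# Origin InternImage-T: L1 = 4, L3 = 18 -> D = 30, C1 = 64
for phi in (0.0, 1.0, 2.0):
    print(phi, scale_model(4, 18, 64, phi))
```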

4. Experiment

We analyze and compare InternImage with leading CNNs and ViTs on representative vision tasks, including image classification, object detection, instance and semantic segmentation.

4.1 Image Classification

Table 2 shows the classification results of models at different scales. With similar parameters and computational cost, our models are comparable to, or even better than, state-of-the-art Transformer-based and CNN-based models. For example, InternImage-T achieves a top-1 accuracy of 83.5%, surpassing ConvNeXt-T [21] by a clear margin of 1.4 points. InternImage-S/B keeps the leading position, and InternImage-B surpasses the hybrid ViT CoAtNet-2 [20] by 0.8 points. When pre-trained on ImageNet-22K and the large-scale joint dataset, the top-1 accuracy of InternImage-XL and -H is improved to 88.0% and 89.6%, respectively, outperforming previous CNNs also trained with large-scale data [22, 67] and narrowing the gap with state-of-the-art large-scale ViTs to about 1 point. This gap may be due to the discrepancy between their large-scale inaccessible private data and the aforementioned joint public data. These results demonstrate that our InternImage not only performs well on common parameter scales and public training data, but also scales effectively to large-scale parameters and data.
[Table 2: Image classification performance on ImageNet]

4.2 Object Detection

As shown in Table 3, when using Mask R-CNN for object detection, we find that our models significantly outperform others with comparable numbers of parameters. For example, with the 1× training schedule, the box AP (APb) of InternImage-T is 4.5 points higher than that of Swin-T [2] (47.2 vs. 42.7) and 3.0 points higher than that of ConvNeXt-T [21] (47.2 vs. 44.2). With the 3× multi-scale training schedule, more parameters, and the more advanced Cascade Mask R-CNN [71], InternImage-XL achieves an APb of 56.2, surpassing ConvNeXt-XL by 1.0 points (56.2 vs. 55.2). Similar results are obtained in instance segmentation experiments. With the 1× training schedule, InternImage-T obtains 42.5 mask AP (i.e., APm), which is 3.2 points (42.5 vs. 39.3) and 2.4 points (42.5 vs. 40.1) higher than Swin-T and ConvNeXt-T, respectively. With Cascade Mask R-CNN, InternImage-XL achieves the best APm of 48.8, which is at least 1.1 points higher than its counterparts.

To further push the performance boundary of object detection, we follow the advanced settings of leading methods [16, 17, 26, 74, 78], initialize the backbone with weights pre-trained on ImageNet-22K or the large-scale joint dataset, and double its parameters via the composite technique [78] (see the model with 2 billion parameters in Fig. 2). Then, fine-tuning with the DINO [74] detector is performed on the Objects365 [79] and COCO datasets for 26 epochs and 12 epochs, respectively. As shown in Table 4, our method achieves the best results of 65.0 APb and 65.4 APb on COCO val2017 and test-dev. Compared with the previous state-of-the-art model, our model outperforms FD-SwinV2-G [26] by 1.2 points (65.4 vs. 64.2) with 27% fewer parameters and without the complicated distillation process, which demonstrates the effectiveness of our model on detection tasks.
[Tables 3 and 4: Object detection and instance segmentation results on COCO]

4.3 Semantic Segmentation

[Table 5: Semantic segmentation results on ADE20K]
As shown in Table 5, our InternImage consistently outperforms prior arts [2, 21, 22, 29] when using UperNet [81] for semantic segmentation. For example, with almost the same number of parameters and FLOPs, our InternImage-B reports 50.8 mIoU on ADE20K val, outperforming strong counterparts such as ConvNeXt-B (50.8 vs. 49.1) and RepLKNet-31B (50.8 vs. 49.9). Furthermore, our InternImage-H yields 60.3 MS mIoU, which is better than SwinV2-G [16] while having a much smaller number of parameters (1.12B vs. 3.00B). Notably, when using Mask2Former [80] and multi-scale testing, our InternImage-H achieves the best mIoU of 62.9, higher than the current best BEiT-3 [17] on the ADE20K benchmark. These results show that CNN-based foundation models can also enjoy the benefits of massive data and challenge the leading position of Transformer-based models.

4.4 Ablation Study

  • Sharing weights among convolution neurons matters
    Due to hardware limitations, large-scale models are very sensitive to the parameters and memory cost of the core operator. To address this issue, we share weights among the convolutional neurons of DCNv3. As shown in Figure 4, we compare the parameters and memory cost of DCNv3-based models with shared and unshared weights. We can see that the parameter and memory cost of the unshared-weight model is much higher than that of the shared-weight model; at the -H scale in particular, the saved parameters and GPU memory amount to 42.0% and 84.2%, respectively. As shown in Table 6, we also verify that the two models at the -T scale have similar top-1 accuracy on ImageNet (83.5 vs. 83.6) and APb on COCO (47.2 vs. 47.4), even though the unshared-weight model has 66.1% more parameters.
    [Figure 4 and Table 6: Ablation of weight sharing among convolutional neurons]

  • Multi-group spatial aggregation brings stronger features
    We introduce aggregation groups to allow our model to learn information from different representation subspaces, similar to Transformers [9]. As shown in Figure 5, for the same query pixel, the offsets of different groups concentrate on different regions, resulting in hierarchical semantic features.
    [Figure 5: Sampling locations of different groups for the same query pixel]

We also compare the performance of models with and without multiple groups. As shown in Table 6, without multiple groups the model drops by 1.2 points on ImageNet and 3.4 points on COCO val2017. In addition, we find that the learned effective receptive field (ERF) is relatively small in the first two stages and grows to be global as the model goes deeper (i.e., stages 3 and 4). This phenomenon is different from ViTs [9, 10, 83], whose ERF is usually global.

5. Conclusion & Limitations

We introduce InternImage, a new large-scale CNN-based foundation model that provides strong representations for a wide range of vision tasks, such as image classification, object detection, and semantic segmentation. We tune the flexible DCNv2 operator to meet the needs of foundation models, and develop a series of block, stacking, and scaling rules centered on the core operator. Extensive experiments on object detection and semantic segmentation benchmarks verify that our InternImage can achieve comparable or better performance than well-designed large-scale vision transformers, showing that CNNs are also a worthwhile choice for research on large-scale vision foundation models. Nevertheless, latency remains an issue for DCN-based operators when adapting to downstream tasks that require high speed. Moreover, large-scale CNNs are still in an early stage of development, and we hope that InternImage can serve as a good starting point.

Source: blog.csdn.net/zyw2002/article/details/132351593