Fast-ParC study notes


Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs

Abstract

In recent years, Transformer models have made great progress in various fields. In computer vision, vision transformers (ViTs) have become a powerful alternative to convolutional neural networks (ConvNets). However, since ConvNets and ViTs each have their own advantages, neither can fully replace the other. For example, ViTs are good at extracting global features with the attention mechanism, while ConvNets are more efficient at modeling local relationships thanks to their strong inductive bias. A natural idea is to combine the advantages of ConvNets and ViTs to design new structures. This paper proposes a new basic neural network operator, position-aware circular convolution (ParC), together with its accelerated version Fast-ParC. The ParC operator captures global features by using a global kernel and circular convolution, while maintaining position sensitivity through position embeddings. Fast-ParC further reduces the $O(n^2)$ time complexity of ParC to $O(n \log n)$ using the Fast Fourier Transform. This speedup makes it possible to use global convolution in the early stages of models with large feature maps, while still keeping the overall computational cost comparable to using 3×3 or 7×7 kernels.

Code is available at https://github.com/yangtao2019yt/Fast_ParC.git.

Index Terms: global receptive field, position-aware circular convolution, position embedding, pure convolution operation, fast Fourier transform

1 INTRODUCTION

In recent years, vision transformers have gradually emerged. The Transformer was first proposed in 2017 to solve NLP tasks [2]. In 2020, Dosovitskiy et al. [3] applied the original transformer directly to the image classification task and found that, when pre-trained on large datasets (such as ImageNet-21K or JFT-300M [4]), it achieved better results than convolutional networks (ConvNets). ViT and its variants have subsequently been widely used in other downstream vision tasks such as object detection [5][6] and semantic segmentation [7], as well as multi-modal tasks such as human-object interaction (HOI) [8] and text-to-image synthesis (T2I) [9]. Despite the huge success of transformers, they still cannot completely replace ConvNets. As summarized in previous work [1][10][11][12], compared with ViTs, ConvNets have better hardware support and are easier to train. In addition, ConvNets still dominate the field of lightweight models [13][14] for mobile and other edge computing scenarios.

Both transformers and ConvNets have their own characteristics. For transformers, the widely recognized multi-head attention mechanism is designed to capture long-range pairwise relationships between tokens, which gives transformers powerful global modeling capability. This representation power, however, comes at a high computational budget: the time complexity of self-attention is quadratic in the number of tokens, so it is slow on high-resolution feature maps. In contrast, the convolution operation is good at extracting local information. It captures local information within a small sliding window (usually 3×3) and reuses the same convolution kernel across different inputs and different spatial locations. This can be interpreted as an effective implicit weight sharing scheme, so the parameters required for convolution grow only linearly with the input. In addition, ConvNets have been studied and used for a long time, so they have some other unique advantages. For example, compression algorithms like pruning [17] and quantization [18] are mature for ConvNets. As for hardware implementation, there are many existing acceleration schemes (e.g. Winograd [19], FFT [20], im2col [21]), whether on general platforms such as CPU and GPU or dedicated accelerators such as FPGA and ASIC. In summary, convolution operations are cheaper to implement but cannot capture global relationships like self-attention. Clearly, there is a complementary relationship between the representation ability of transformers and the efficiency of ConvNets, both of which are indispensable for practical applications.
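
To make the gap concrete (a back-of-the-envelope comparison, not a figure from the paper): for $n$ tokens of dimension $d$, self-attention forms an $n \times n$ attention map at a cost of $O(n^2 d)$, while a depth-wise convolution with kernel size $k$ costs $O(nkd)$ and shares its $kd$ weights across all positions.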

There have been some recent works that combine the advantages of transformers and ConvNets. PVT [22], Swin [23] and CoAtNet [24] attempt to reintroduce the inductive bias of convolution (such as its sliding window strategy) to help transformer models learn better. Others, like LeViT [25], MobileViT [11] and EfficientFormer [26], focus on designing efficient hybrid architectures. Most of these works combine the two types of networks but fail to address a key issue: the additional computational and engineering complexity of the newly introduced attention operator. One naturally asks: is it possible to design a new operator, different from both self-attention and traditional convolution, that has the advantages of both?

[Figure 1: Effective receptive fields of ConvNet models before and after applying the ParC operator]

This paper combines the advantages of transformers and ConvNets to construct a new plug-and-play operator, ParC. Thanks to the use of a global kernel ($K_h = H$ or $K_w = W$) and a circular convolution scheme, ParC has a global receptive field. Explicitly learnable position embeddings are applied before the convolution to maintain the position sensitivity of the model. As shown in Figure 1, different ConvNet models improve their effective receptive fields to global by simply applying the proposed ParC operator. Since ParC uses pure convolution operations, it can be deployed efficiently on different platforms. Finally, we decompose the 2D convolution into two 1D convolutions to avoid an increase in FLOPs/parameters. Based on the above design, we achieve the goal of extracting global features while keeping the space and time complexity low. Through experiments, we verify the effectiveness of the new operator on a wide range of tasks and models. In short, the contributions of this article can be summarized in the following three points:

1) Combining the advantages of ViTs and ConvNets, an effective new operator ParC is proposed. Experiments demonstrate the advantages of ParC by applying it to a wide range of models, including MobileViT [11], ResNet50 [28], MobileNetV2 [14] and ConvNeXt [27]. We also evaluate these models on multiple tasks, including classification, detection, and segmentation.

2) To address the problem that ParC becomes too expensive when the input feature resolution is large, a fast ParC algorithm is proposed. Fast-ParC is theoretically equivalent to ParC, i.e. their outputs are the same given the same input, but it is much more efficient than ParC at large resolutions (e.g. 112×112). Fast-ParC expands the usage scenarios of ParC, making it an operator with a wider range of applications.

3) The internal mechanism of the new operator is analyzed. Through visualization, we show several clear differences between ParC and plain convolutions. The results show that the effective receptive field [15] of an ordinary ConvNet is very limited, while that of a ParC-based network is global. We also show via Grad-CAM [29] that ParC-based networks are more comprehensive than ordinary ConvNets in focusing on important regions of images. We also analyze the differences between ParC and vanilla convolution in detail.

2 RELATED WORK

2.1 Theoretical/Effective Receptive Field

Hubel et al. [30] found in neuroscience that shallow neurons only extract local features, and that the covered range accumulates layer by layer; this range is called the "receptive field" (RF). Since the success of VGGNet [31], the design of CNN architectures has followed a similar pattern [28][32][14][13]: using stacks of small kernels, such as 3×3, instead of larger ones. Some previous work has given the theoretical calculation of the receptive field of CNNs [33][34], i.e. the theoretical receptive field (TRF). Under this notion, the receptive field of two stacked 3×3 layers equals that of one 5×5 layer. However, some works [15][34] have questioned this view, since the importance of pixels in the feature map decays rapidly from the center to the edge. Subsequently, the effective receptive field (ERF) was proposed to measure the area of the input image that actually affects the activation of a neuron. Luo et al. [15] back-propagated from the center pixel and computed the partial derivatives with respect to the input image to examine this region. By studying a series of convolutional networks, they found that the effective receptive field is often much smaller than the theoretical one. SKNet [35] uses an attention mechanism to select appropriate receptive fields. RF-Next [36] proposed a NAS-based workflow to automatically search the receptive field of a model. These works show that determining an appropriate receptive field is very beneficial to network performance. Recent research has also found that increasing the receptive field of a convolutional network can lead to better model performance; we refer to such models as "large kernel ConvNets", discussed later in Section 2.3.
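
As a refresher (the recursion below is the standard TRF formula from the receptive-field literature, stated here for convenience): a layer $l$ with kernel size $k_l$, preceded by layers with strides $s_i$, has theoretical receptive field

$$r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad r_0 = 1.$$

For two stacked stride-1 3×3 layers this gives $r = 1 + 2 + 2 = 5$, matching a single 5×5 layer.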

2.2 Vision Transformer and Hybrid Structures

ViTs achieve impressive performance on a variety of vision tasks. However, the original ViT [3] had some limitations: for example, it is heavy, computationally inefficient, and difficult to train. To overcome these problems, subsequent variants of ViT were proposed. From the perspective of improving training strategies, Touvron et al. [37] proposed using knowledge distillation to train ViT models, achieving competitive accuracy with less pre-training data. To further improve the model architecture, some researchers tried to borrow from ConvNets to optimize ViTs. Among them, PVT [22] and CvT [38] insert convolution operations into each stage of ViT to reduce the number of tokens and build a hierarchical multi-stage structure. Swin Transformer [23] computes self-attention within shifted local windows. PiT [39] jointly uses pooling layers and deep convolutional layers to achieve channel multiplication and spatial reduction. CCNet [40] proposed a simplified version of the self-attention mechanism, criss-cross attention, and inserted it into ConvNets to construct ConvNets with global receptive fields. These papers clearly show that some techniques from ConvNets can be applied to vision transformers to design better vision transformer models.

Another popular research direction is combining elements of ViTs and ConvNets to design new backbones. Graham et al. hybridized ConvNet and transformer in their LeViT [25] model, which significantly outperforms previous ConvNet and ViT models in terms of the speed/accuracy trade-off. BoTNet [41] replaces standard convolutions with multi-head attention in the last few blocks of ResNet. ViT-C [42] adds early convolutions to pure ViT. ConViT [43] incorporates a soft convolutional inductive bias via gated positional self-attention. The CMT [10] block consists of a depthwise-convolution-based local perception unit and a lightweight transformer module. CoAtNet [24] combines convolution and self-attention, and designs a new transformer module that attends to both local and global information.

2.3 Large Kernel Convolution Network

Early ConvNets such as AlexNet [44] and GoogLeNet [45] used large kernels, such as 5×5 or 7×7. But since the success of VGGNet [31], stacking small kernels like 3×3 and 1×1 has been considered the efficient choice for computation and storage. Recently, inspired by the success of vision transformers, large kernels have been revisited as a powerful tool for improving model performance. ConvNeXt [27] modernizes the standard ResNet towards the design of vision transformers by introducing a series of incremental but effective designs, where the 7×7 depthwise convolution follows the spirit of window self-attention in Swin [23]. RepLKNet [16] extends the convolution kernel to 31×31 and obtains performance gains, but the reparameterization technique used increases the burden of the training process and requires an additional transformation step to deploy the model. Later, Rao et al. used an even larger 51×51 kernel with dynamic sparsity [46]. GFNet [47] uses a global Fourier convolution, implemented with FFT, instead of self-attention in the transformer block.

Our work is closely related to RepLKNet [16] and GFNet [47]. Both of these methods and our proposed ParC focus on enlarging the effective receptive field, but our proposed operator differs in the following respects: 1) ParC uses learnable position embeddings to keep the resulting feature maps position sensitive. This is very important for position-sensitive tasks such as semantic segmentation and object detection, as our ablation studies confirm. 2) ParC adopts a lightweight design. RepLKNet uses heavy 2D convolution kernels and GFNet uses a learnable complex weight matrix of shape 2×C×H×W, while we use two 1D convolutions, reducing the kernel to C×H or C×W. 3) Different from RepLKNet and GFNet, which emphasize the overall design of the network, ParC is a new basic operator that can be plugged into both ViTs and ConvNets, plug and play. Our experimental results in Sections 4.1 and 4.2 verify this. In addition, we also propose Fast-ParC, which further broadens the usage scenarios of ParC.

3 THE PROPOSED FAST-PARC OPERATOR

In this section, we first introduce the proposed ParC operator through comparison with ordinary convolution operators. Then, we propose an FFT-accelerated version of ParC, named Fast-ParC. Finally, we explain how to use the proposed ParC in ViT and ConvNet models.

3.1 ParC Operation

3.1.1 Vanilla Depth-wise Convolution

To describe a horizontal one-dimensional depth-wise convolution (denoted Conv1d-H) on a 4D input tensor of shape B×C×H×W, we can focus on a single channel. We denote the output as $y = \{y_0, y_1, \ldots, y_{H-1}\}$, the input as $x = \{x_0, x_1, \ldots, x_{H-1}\}$, and the convolution weights as $w = \{w_0, w_1, \ldots, w_{K_h-1}\}$. A PyTorch-style convolution with zero padding (i.e. F.conv1d) can be expressed as

$$y_i = \sum_{k=0}^{K_h-1} w_k \, x_{i - \lfloor K_h/2 \rfloor + k}, \qquad i = 0, 1, \ldots, H-1 \tag{1}$$

Here $\lfloor K_h/2 \rfloor$ accounts for the $\lfloor K_h/2 \rfloor$ zeros padded at each end of the input. Equation 1 shows that $y_i$ is a function of its local neighboring inputs (i.e. $x_{i-\lfloor K_h/2 \rfloor}, \ldots, x_{i+\lceil K_h/2 \rceil - 1}$), where the size of the neighborhood is controlled by the kernel size $K_h$. Therefore, it is impossible to collect long-distance information with a single layer of small-kernel convolution. To address this shortcoming of vanilla convolution, we propose ParC with a global receptive field.
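
As a quick sanity check, a single channel of Conv1d-H is just F.conv1d with zero padding (a minimal sketch; shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

# Eq. 1 for one channel: zero-padded depth-wise 1D convolution along H.
H, K_h = 8, 3
x = torch.randn(1, 1, H)              # input sequence (one channel)
w = torch.randn(1, 1, K_h)            # local kernel of size K_h
y = F.conv1d(x, w, padding=K_h // 2)  # zero padding keeps output length H
print(y.shape)                        # torch.Size([1, 1, 8])
```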

3.1.2 ParC: Positional-Aware Circular Convolution

[Figure 2: Illustration of the ParC operator (global kernel, circular convolution, and position embedding)]

Define $w = \{w_0, w_1, \ldots, w_{K_h-1}\}$ as the kernel weights and $pe = \{pe_0, pe_1, \ldots, pe_{K_h-1}\}$ as the position embedding. Corresponding to Figure 2, ParC can be described as:

$$y_i = \sum_{k=0}^{H-1} w^H_k \, x^{pe}_{(i+k) \bmod H}, \qquad x^{pe} = x + f(pe, H), \quad w^H = f(w, H) \tag{2}$$

Here $i = 0, 1, \ldots, H-1$. $w$ is a fixed-size learnable kernel (its size is a hyperparameter), and $w^H$ is the adapted learnable kernel whose size matches the corresponding input feature map. $pe$ is the position embedding. $f(\cdot, N)$ is an interpolation function (such as bilinear or bicubic) used to adapt the sizes of the kernel and the position embedding (from $K_h$ to $H$), and mod denotes the modulo operation.

Compared with conventional convolution, the ParC operator has four main differences: 1) global kernel; 2) circular convolution; 3) position embedding; 4) 1D decomposition. These designs are essential for effectively extracting global features, as the ablation experiments in Section 4.4 will show. Below, we elaborate on the reasons for each design choice:

**Global Kernel and Circular Convolution.**

To extract global relationships over the entire input map, ParC uses a global kernel whose size equals that of the corresponding feature map, i.e. $K_h = H$ or $K_w = W$. In most architectures the feature resolution is halved at each stage; for example, in ResNet50 or ConvNeXt, the feature resolutions of the four stages are [56, 28, 14, 7] respectively. However, simply enlarging the kernel of an ordinary convolution cannot effectively extract global relationships. Due to zero padding, even if the kernel size grows to the full resolution, many kernel weights are aligned with zero padding, which provides no useful information other than absolute position. This effect is most severe when the kernel is aligned with the edge of the image: for a 2D convolution there, 3/4 of the input it sees is actually zeros. Therefore, we additionally propose to use circular convolution. With circular convolution, the kernel weights are always aligned with valid pixels as the window slides, as shown in Figure 2.
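
The wrap-around indexing of Eq. 2 is easy to see in isolation (a toy illustration, not from the paper):

```python
# Circular indexing with a global kernel (K = H): each output position
# sees every input pixel exactly once, with no zero padding involved.
H = 4
for i in range(H):
    print([(i + k) % H for k in range(H)])
# [0, 1, 2, 3]
# [1, 2, 3, 0]
# [2, 3, 0, 1]
# [3, 0, 1, 2]
```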

**Positional Embedding.**

As summarized in previous work [48], vanilla convolution can encode positional information when zero padding is used. Circular convolution, however, reuses the input periodically and thus loses part of the position information. To overcome this problem, we introduce a learnable position embedding inserted before the circular convolution. In the experiments below, we demonstrate that this is important for model performance, especially for downstream tasks that are sensitive to spatial information.

**1D Decomposition.**

Finally, to keep the model size and computational cost acceptable, we split the 2D convolution and position embedding into H (horizontal) and V (vertical) directions, which reduces the number of parameters and FLOPs of the kernel from O(H×W) to O(H+W), a considerable compression at larger resolutions.
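
To put numbers on this (our own quick arithmetic, not from the paper): at the 56×56 stage, a full 2D global kernel needs $56 \times 56 = 3136$ weights per channel, while the decomposed 1D kernels need only $56 + 56 = 112$, a 28× reduction.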

**Implementation of Circular Convolution.**

Conceptually, circular convolution needs an implementation separate from ordinary convolution, because an extra modulo operation is required when computing the indices of the convolved pixels. In practice, it can be easily implemented by padding the input feature map with the 'concat' function before calling an ordinary 1D convolution routine (see Algorithm 1).
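
A minimal PyTorch sketch of a depth-wise ParC-H along these lines is shown below (the function name and shapes are illustrative assumptions, not the paper's reference code):

```python
import torch
import torch.nn.functional as F

def parc_h(x, weight, pe):
    """Depth-wise ParC-H sketch.

    x:      (B, C, H, W) input feature map
    weight: (C, 1, K_h, 1) learnable kernel, interpolated to (C, 1, H, 1)
    pe:     (1, C, K_pe, 1) learnable position embedding, interpolated to H
    """
    B, C, H, W = x.shape
    # f(w, H) / f(pe, H): adapt kernel and position embedding to feature size.
    w_h = F.interpolate(weight, size=(H, 1), mode="bilinear", align_corners=True)
    pe_h = F.interpolate(pe, size=(H, 1), mode="bilinear", align_corners=True)
    x = x + pe_h                                   # add position embedding
    # Circular padding via concat: wrap the first H-1 rows to the bottom,
    # so the sliding kernel is always aligned with valid pixels.
    x_cat = torch.cat([x, x[:, :, : H - 1, :]], dim=2)
    return F.conv2d(x_cat, w_h, groups=C)          # output: (B, C, H, W)
```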

When the other spatial dimension W and the channel dimension C are taken into account, Equation 2 can be expanded to

$$y^{c}_{i,j} = \sum_{k=0}^{H-1} w^{c,H}_{k} \, x^{c,pe}_{(i+k) \bmod H,\, j} \tag{3}$$

for $\forall i \in [0, H-1]$, $\forall j \in [0, W-1]$ and $\forall c \in [0, C-1]$. This is the complete expression of a depth-wise ParC-H layer for channel $c$ with input resolution H×W. In ResNet50-ParC, we also extend the per-channel ParC to its dense counterpart, reintroducing channel interaction, which can be expressed as:

$$y^{c_o}_{i,j} = \sum_{c_i=0}^{C_i-1} \sum_{k=0}^{H-1} w^{c_o, c_i, H}_{k} \, x^{c_i, pe}_{(i+k) \bmod H,\, j} \tag{4}$$

considering $\forall i \in [0, H-1]$, $\forall j \in [0, W-1]$, $\forall c_i \in [0, C_i-1]$ and $\forall c_o \in [0, C_o-1]$.

3.2 Fast-ParC: Speed up ParC with its FFT Equivalent Form

[Algorithm 1: PyTorch-style pseudocode of ParC and Fast-ParC]

As shown in Figure 3, when the feature resolution is small (e.g. 7×7), ParC extracts global features while reducing computational complexity. But as the input resolution grows, the complexity of ParC quickly exceeds that of a 7×7 convolution. To overcome this problem, we propose an accelerated version of ParC named Fast-ParC. When the feature resolution is large (e.g. 56×56), Fast-ParC is more efficient than ParC. In fact, while retaining the global extraction capability, Fast-ParC is even more efficient than 7×7 convolution over a wide range of resolutions.

[Figure 3: FLOPs of ParC, Fast-ParC and 7×7 convolution as a function of feature resolution]

We derive Fast-ParC with the help of the Fast Fourier Transform (FFT). It is well known that FFT can accelerate linear convolution operations [50]. Moreover, according to the convolution theorem [50], for discrete signals, a point-wise product in the Fourier domain corresponds to circular convolution in the spatial domain, which is exactly one of the significant differences between ParC and ordinary convolution. Furthermore, the two other features of ParC, the global kernel and right-side padding, also fit the default convolution mode of the Fourier domain well. This interesting fact allows us to develop a very neat frequency-domain implementation for ParC. Defining $x(n), w(n), y(n)$ as the input, weight and output sequences in the spatial domain, and $X(k), W(k), Y(k)$ as the corresponding sequences in the Fourier domain, we obtain the following equivalence:

$$y(n) = \sum_{m=0}^{N-1} w(m)\, x\big((n+m) \bmod N\big) \quad \Longleftrightarrow \quad Y(k) = X(k)\, W^{*}(k), \;\; y(n) = \mathcal{F}^{-1}\{Y(k)\} \tag{5}$$

where $W^{*}(k)$ denotes the complex conjugate of $W(k)$ and $\mathcal{F}^{-1}$ the inverse DFT.

Eq. 5 gives two strictly equivalent mathematical expressions: ParC in the spatial domain requires a convolution operation, while in the Fourier domain it becomes a simple point-wise multiplication. On this basis, we propose the Fourier-domain version of the ParC operation, called Fast-ParC. It can be proven theoretically that Fast-ParC is strictly equivalent to ParC in the spatial domain (see Appendix A.1), and the numerical error between the two implementations is negligible. We can therefore choose the appropriate implementation for the actual platform, independently for training and for inference, which gives ParC great flexibility. Algorithm 1 gives the pseudocode. The advantages of Fast-ParC are obvious:
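
A minimal FFT-based sketch of Fast-ParC-H, mirroring the spatial version above (function name and shapes are illustrative assumptions; the conjugate appears because ParC's index $(i+k) \bmod H$ makes it a circular correlation):

```python
import torch

def fast_parc_h(x, w_h, pe_h):
    """Fast-ParC-H sketch: circular convolution along H via FFT.

    x:    (B, C, H, W) input feature map
    w_h:  (C, H) global kernel already interpolated to the feature height
    pe_h: (C, H) position embedding already interpolated to the feature height
    """
    B, C, H, W = x.shape
    x = x + pe_h.view(1, C, H, 1)                # add position embedding
    X = torch.fft.rfft(x, dim=2)                 # FFT along the H axis
    Wf = torch.conj(torch.fft.rfft(w_h, dim=1))  # conjugate spectrum: Y = X * W*
    Y = X * Wf.view(1, C, -1, 1)                 # point-wise multiplication
    return torch.fft.irfft(Y, n=H, dim=2)        # back to the spatial domain
```

Given matching inputs, this should agree with the spatial `parc_h` sketch above up to floating-point error, which can be checked with `torch.allclose(..., atol=1e-5)`.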

First, the multiplication complexity of a 1D FFT of length N is only O(N log N), while a 1D convolution with a global kernel in the spatial domain requires O(N²).

As can be seen from Table 1, when N is large, the complexity of spatial convolution greatly exceeds that of FFT-based convolution. ParC employs a global kernel and circular convolution, matching the default spatial pattern of Fourier-domain convolution. Downstream tasks, such as detection or segmentation of multiple instances, often require higher resolutions; for example, the resolution commonly used for testing is 1280×800 for COCO [51] and 2048×512 for ADE20K [52]. When N is large, Fast-ParC saves FLOPs and achieves better acceleration. Fast-ParC also allows us to use ParC in shallower stages while keeping the computational budget acceptable, which is necessary for implementing ParC in new architectures [53].
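
For a rough sense of scale (our own arithmetic, not a measurement from the paper): at N = 1280, a spatial global 1D convolution needs on the order of N² ≈ 1.6M multiplications per row, while an FFT-based one needs on the order of N log₂N ≈ 13K, roughly a 100× reduction.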

[Table 1: Multiplication complexity of spatial convolution vs. FFT-based convolution for sequence length N]

Another advantage comes from the software/hardware support for FFT.

Since FFT is a classic signal processing algorithm, many platforms have ready-made support for its acceleration. When ParC is applied to customized platforms such as FPGAs, it can effectively utilize resources such as on-chip DSPs and pre-designed IP cores. General computing platforms also have out-of-the-box toolkits (e.g. CPU: torch.fft [54], numpy.fft [55]; GPU: cuFFT [56]). The flexibility of Fast-ParC allows us to choose the better implementation based on different criteria (e.g., maximum throughput, minimum memory footprint), as well as the actual preference of a given computing platform for either algorithm.

Conveniently, Fast-ParC is completely equivalent to ParC, and switching between them requires no extra transformation. We can choose the concrete implementation of ParC according to the requirements of the platform; since the two forms are interchangeable, the implementations for training and inference can also be decoupled, offering maximal flexibility. Further exploring other efficient FFT-based convolution operations is also a promising direction, which we leave for the future.

3.3 Applying ParC on ViTs and ConvNets

To verify the effectiveness of ParC as a plug-and-play meta-operator, we construct a series of ParC-based models based on the operations proposed in Section 3.1. The baseline models include both ViTs and ConvNets. Specifically, for ViTs, MobileViT [11] is chosen as the baseline because it achieves the best parameter/accuracy trade-off among recently proposed lightweight hybrid structures. ResNet50 [28], MobileNetV2 [14] and ConvNeXt [27] are used as the ConvNet baselines: ResNet50 is the most widely used model in practical applications, MobileNetV2 is the most popular model on mobile devices, and ConvNeXt is the first ConvNet that retains a pure ConvNet architecture while integrating some features of ViTs. The four models we employ here are all representative.

3.3.1 ParC-ViTs

[Figure 4: The ParC-MetaFormer block]

[Figure 5: ParC-based blocks and how they are inserted into ConvNets]

ParC-MetaFormer block. As shown in Figures 4 and 5, ConvNets and ViTs differ considerably in their outer structure. ViTs generally use the MetaFormer block as the basic architecture. To apply the ParC operator in ViTs, we design the ParC-MetaFormer block and use it to replace the transformer block in ViTs.

Adopting a MetaFormer-like structure. The MetaFormer [49] block is the most common block structure in ViTs. It is usually composed of two sub-blocks in sequence, a token mixer and a channel mixer, both with residual connections. We adopt ParC as the token mixer to build the ParC-MetaFormer block, because ParC can extract global features and model interactions between pixels across the whole spatial extent, meeting the requirements of the token mixer module. Unlike self-attention, whose complexity is quadratic, ParC is much more computationally efficient, so replacing this part with ParC significantly reduces the computational cost. In the ParC-MetaFormer block, we adopt a serial arrangement of ParC-H and ParC-V; for symmetry, half of the channels pass through ParC-H first, and the other half pass through ParC-V first (as shown in Figure 4).
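
The channel-split serial arrangement could look like the following sketch (`parc_h` and `parc_v` are hypothetical callables standing in for the 1D ParC ops of Section 3.1, each taking and returning a (B, C/2, H, W) tensor):

```python
import torch

def parc_token_mixer(x, parc_h, parc_v):
    """Serial ParC token mixer: half the channels go H-first, half V-first."""
    x1, x2 = x.chunk(2, dim=1)         # split channels into two halves
    y1 = parc_v(parc_h(x1))            # ParC-H then ParC-V
    y2 = parc_h(parc_v(x2))            # ParC-V then ParC-H (symmetric path)
    return torch.cat([y1, y2], dim=1)  # merge the two halves back
```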

Adding channel-wise attention in the channel mixer part. Although ParC retains the global receptive field and position embedding, another advantage of ViTs over ConvNets is that they are data-driven: the self-attention module adaptively weights features according to the input, allowing the model to focus on important features and suppress unnecessary ones, which leads to better performance. Previous literature [57][58][59] has explained the importance of keeping a model data-driven. Replacing self-attention with the proposed global circular convolution yields a pure convolutional network that can extract global features, but the resulting model is no longer data-driven. To compensate, we insert a channel-wise attention module into the channel mixer part, as shown in Figure 4. Following SENet [57], we first aggregate the spatial information of the input features $x \in R^{c \times h \times w}$ via global average pooling, obtaining the aggregated features $x_a \in R^{c \times 1 \times 1}$. We then feed $x_a$ into a multi-layer perceptron to generate channel-wise weights $a \in R^{c \times 1 \times 1}$, and multiply $a$ with $x$ channel-wise to produce the output.
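
In code, this SENet-style channel attention might look as follows (a sketch; the reduction ratio 4 is an assumed hyperparameter, not from the paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SENet-style channel-wise attention for the channel mixer (sketch)."""
    def __init__(self, channels, reduction=4):  # reduction ratio is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        xa = x.mean(dim=(2, 3))                     # global average pooling -> (B, C)
        a = self.mlp(xa).view(x.size(0), -1, 1, 1)  # channel-wise weights
        return x * a                                # reweight the input channels
```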

MobileViT-ParC Network.

At present, existing hybrid structures can be roughly divided into three types: serial structures [25][42], parallel structures [12] and bifurcate structures [11][24]. Among these, the third currently performs best. MobileViT [11] also adopts a bifurcate structure; inspired by this, we build our bifurcate-structure model on top of MobileViT. MobileViT consists of two major categories of modules: the shallow stages consist of MobileNetV2 blocks with local receptive fields, and the deep stages consist of ViT blocks with global receptive fields. We retain all MobileNetV2 blocks and replace all ViT blocks with the corresponding ParC blocks. This replacement converts the hybrid model into a pure ConvNet model while retaining its global feature extraction capability.

3.3.2 ParC-ConvNets

For ParC-ConvNets, we focus on providing global receptive fields for ConvNets. Replacing vanilla convolutions with ParC operations (shown in Figure 5(a)), we construct different ParC-based blocks (shown in Figure 5(b)(c)(d)). Previous hybrid-structure works [25][42][11] reached a similar conclusion: models that use local extraction blocks in the early stages and global extraction blocks in the deep stages perform best. Since ParC has a global receptive field, we follow this rule when inserting ParC-based blocks into ConvNets (as shown in Figure 5(e)).

ParC Bottleneck and ResNet50-ParC Network.

ResNet [28] is one of the most classic ConvNets. By simply replacing the 3×3 convolution in the original ResNet50 bottleneck with the ParC operator, we obtain the ParC bottleneck (see Figure 5(b)). Because the properties of ParC-H and ParC-V may differ significantly, there is no channel interaction between them, similar to group convolution with groups=2 [60]. The main part of ResNet can be divided into 4 stages, each consisting of a number of repeated bottleneck blocks; specifically, ResNet50 has [3, 4, 6, 3] blocks in its four stages. Replacing the last 1/2 of the blocks in the penultimate stage and the last 1/3 of the blocks in the last stage of ResNet50 with ParC bottlenecks (i.e., the last 3 of the 6 blocks in stage 3 and the last 1 of the 3 blocks in stage 4) yields ResNet50-ParC.

ParC-MobileNetV2 Block and MobileNetV2-ParC Network.

MobileNetV2 [14] is a typical representative of lightweight models. By replacing the 3×3 depthwise convolution in the inverted bottleneck with depth-wise ParC, we obtain the ParC-MobileNetV2 block (see Figure 5(c)). MobileNetV2 is thinner and deeper than ResNet50, with [1, 2, 3, 4, 3, 3, 1] blocks in its 7 stages. MobileNetV2-ParC is obtained by replacing the last 1/2 of the blocks in stage 4 and the last 1/3 of the blocks in stages 5 and 6 with ParC-MobileNetV2 blocks.

ParC-ConvNeXt Block and ConvNeXt-ParC Network.

ConvNeXt [27] makes a series of modifications to the original ResNet50 structure, borrowing from transformers. Among them, the 3×3 convolutions are replaced by 7×7 depthwise convolutions, which enlarges the local receptive field but still fails to capture global information. We further replace the 7×7 depthwise convolution in the ConvNeXt block with depth-wise ParC, obtaining the ParC-ConvNeXt block (see Figure 5(d)). Replacing the last 1/3 of the blocks in the last two stages of ConvNeXt with ParC-ConvNeXt blocks yields a ConvNeXt-ParC instance. We also reduce the number of base channels of ConvNeXt-T to 48 (i.e., [48, 96, 192, 384] per stage), resulting in the lightweight ConvNeXt-XT, which is better suited for deployment on edge computing devices and has shorter experiment cycles.

It should be noted that in the ParC-MetaFormer block, the serial arrangement of ParC-H and ParC-V is used to keep the receptive field consistent with self-attention, since this design replaces self-attention. In ParC-ConvNets, we instead adopt a parallel (per-layer) arrangement of ParC-H and ParC-V, as shown in Figure 5. According to the experimental results, this setup already provides a sufficient performance gain over traditional ConvNets. In fact, since more than one ParC-based block is used, a ParC-ConvNet still has a global receptive field.

4 EXPERIMENT

[Tables and figures: classification, detection and segmentation results for ParC-based models]

5 DISCUSSION

In this section, we first analyze the spatial distribution of the weights learned by vanilla convolution and by ParC, and on this basis show that the two still learn similar feature extraction patterns.

In addition, we provide results of two commonly used visualization schemes, effective receptive field (ERF) [15] and Grad-CAM [29], to analyze the mechanism of ParC. The results show that ParC-based models obtain global features even at high resolution, which makes them more comprehensive in capturing instance-related semantic information.

[Visualizations: effective receptive fields and Grad-CAM comparisons between ParC-based models and vanilla ConvNets]

6 CONCLUSION

We design a new plug-and-play operator, ParC (position-aware circular convolution). ParC has a global receptive field like self-attention in ViTs, but since it is a pure convolution operation, it is more convenient to support on different hardware platforms. We demonstrate that it improves classification performance whether inserted into a transformer-based network or a convolution-based network, and ParC-based models also show advantages on downstream tasks. We analyze its internal mechanism and its differences from ordinary large-kernel convolution, giving a convincing explanation of its superiority. To apply ParC at large resolutions, Fast-ParC based on FFT is also proposed. Fast-ParC keeps the computational budget low at high input resolutions, making ParC a competitive and versatile choice for most computer vision tasks.
