PConv: Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks

Summary

To design fast neural networks, much research has focused on reducing the number of floating-point operations (FLOPs). However, we observe that a reduction in FLOPs does not necessarily translate into a comparable reduction in latency. This is mainly caused by inefficiently low floating-point operations per second (FLOPS). To build truly fast networks, we revisit popular operators and show that such low FLOPS stems largely from frequent memory accesses, especially in depthwise convolution.
We therefore propose a novel partial convolution (PConv) that extracts spatial features more effectively by cutting both redundant computation and memory access. Building on PConv, we further propose FasterNet, a new family of neural networks that achieves considerably higher running speed than others on a wide range of devices across various vision tasks, without compromising accuracy.

Introduction

Neural networks have developed rapidly in computer vision tasks such as image classification, detection, and segmentation. While their impressive performance powers many applications, the current trend is to pursue fast neural networks with low latency and high throughput, in order to meet requirements such as good user experience, instant response, and security.

How to increase speed?
Rather than demanding more expensive computing hardware, researchers and practitioners increasingly prefer to design cost-effective, fast neural networks with low computational complexity, primarily measured by the number of floating-point operations (FLOPs). Networks such as MobileNets [24, 25, 54], ShuffleNets [46, 84], and GhostNet [17] adopt depthwise convolution (DWConv) [55] and/or group convolution (GConv) [31] to extract spatial features with fewer FLOPs. However, in the effort to reduce FLOPs, these operators often suffer from the side effect of increased memory access. MicroNet [33] goes further by decomposing and sparsifying the network, pushing its FLOPs to extremely low levels.
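
As a concrete illustration of where those FLOP savings come from (a generic PyTorch sketch, not code from the paper), the groups argument of nn.Conv2d turns a regular convolution into a group or depthwise convolution:

import torch
import torch.nn as nn

c, h, w, k = 64, 56, 56, 3
x = torch.randn(1, c, h, w)

regular = nn.Conv2d(c, c, k, padding=1, bias=False)              # h*w*k^2*c^2 multiply-adds
grouped = nn.Conv2d(c, c, k, padding=1, groups=4, bias=False)    # 1/4 of the regular FLOPs
depthwise = nn.Conv2d(c, c, k, padding=1, groups=c, bias=False)  # h*w*k^2*c, i.e. 1/c of regular

for name, m in [("regular", regular), ("grouped", grouped), ("depthwise", depthwise)]:
    params = sum(p.numel() for p in m.parameters())
    print(name, tuple(m(x).shape), f"{params} params")
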
Despite the improvement in FLOPs, this approach suffers from inefficient fragmented computation. In addition, such networks are usually accompanied by extra data operations such as concatenation, shuffling, and pooling, whose running time often matters for small models. Beyond these pure convolutional neural networks, there is also growing interest in shrinking and accelerating ViT and MLP architectures; for example, MobileViT and MobileFormer reduce computational complexity by combining DWConv with a modified attention mechanism. However, they still suffer from the DWConv issues mentioned above, and the modified attention mechanisms also need dedicated hardware support. Their use of advanced yet time-consuming normalization and activation layers may further limit their speed on devices.
Together, these issues raise the question: Are these "fast" neural networks really fast?
To answer this question, we studied the relationship between latency and FLOPs, where FLOPS (floating-point operations per second) measures effective computational speed. Although many works try to reduce FLOPs, they rarely consider optimizing FLOPS at the same time to achieve truly low latency. We compared the FLOPS of typical neural networks on an Intel CPU. The results in Figure 2 show that many existing networks have low FLOPS, generally lower than the popular ResNet50. With such low FLOPS, these "fast" neural networks are not actually fast enough: their reduction in FLOPs does not fully translate into a reduction in latency. Previous studies [46, 48] also noted the gap between FLOPs and latency, but it remains partially unresolved because they still rely on low-FLOPS DWConv/GConv and various extra data operations, for lack of a better alternative.
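
To get a feel for this gap on your own machine, you can time a network and divide a known FLOP count by the measured latency to obtain its effective FLOPS. The sketch below does this for torchvision's ResNet50, using the commonly quoted figure of roughly 4.1 GFLOPs at 224×224 input; it illustrates the measurement idea and is not the paper's benchmark script:

import time
import torch
from torchvision.models import resnet50

model = resnet50().eval()          # random weights are fine; only timing matters here
x = torch.randn(1, 3, 224, 224)
flops = 4.1e9                      # approx. FLOPs of ResNet-50 at 224x224 (commonly quoted figure)

with torch.no_grad():
    for _ in range(5):             # warm-up runs
        model(x)
    n = 20
    t0 = time.perf_counter()
    for _ in range(n):
        model(x)
    latency = (time.perf_counter() - t0) / n

print(f"latency: {latency*1e3:.1f} ms, effective FLOPS: {flops/latency/1e9:.1f} GFLOPS")
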
This paper aims to eliminate this discrepancy by designing a simple yet fast and efficient operator that reduces FLOPs while maintaining high FLOPS. Specifically, we revisit existing operators, especially DWConv, from the perspective of computational speed, and find that the main cause of the low-FLOPS problem is frequent memory access. We then propose a novel partial convolution (PConv) as a competitive alternative that reduces both computational redundancy and the number of memory accesses.
Essentially, PConv has lower FLOPs than a regular Conv but higher FLOPS than DWConv/GConv. In other words, PConv makes better use of the computing power on the device, and experiments show that it is also effective at extracting spatial features.
We further introduce FasterNet, built mainly on our PConv block, as a new family of networks that runs very fast on a variety of devices. In particular, FasterNet achieves state-of-the-art performance in classification, detection, and segmentation tasks, with both lower latency and higher throughput.

The contributions of this article are as follows:

  • We introduce a simple, fast, and efficient operator named PConv, which has strong potential to replace the widely used DWConv.
  • FasterNet delivers excellent performance and consistently high speed on a variety of devices.

Related work

We briefly review previous work on fast and efficient neural networks and distinguish this work from them.

CNN

For mainstream architectures in computer vision, especially in practical deployment, speed and accuracy are equally important. Although many studies have pursued higher efficiency, the underlying principle is more or less a low-rank approximation. Specifically, group convolution [31] and depthwise separable convolution [55] (depthwise convolution followed by pointwise convolution) are probably the most popular choices. They have been widely used in mobile/edge-oriented networks such as MobileNets [24, 25, 54], ShuffleNets [46, 84], GhostNet [17], EfficientNets [61, 62], TinyNet [18], Xception [8], CondenseNet [27, 78], TVConv [4], MnasNet [60], and FBNet [74]. Although they exploit redundancy in the filters to reduce parameters and FLOPs, they incur increased memory access when the network width is enlarged to compensate for the accuracy drop. In contrast, we exploit redundancy in the feature maps and propose partial convolution to reduce FLOPs and memory access simultaneously.

ViT, MLP and variants.

Research on ViT has attracted increasing attention since Dosovitskiy et al. [12] extended the application of transformers [69] from machine translation [69] and forecasting [73] to computer vision.
Many subsequent studies have improved ViT in terms of training settings and model design [58, 65, 66]. A notable trend is to pursue a better accuracy-latency trade-off by introducing convolutions into ViTs [6, 10, 57], by reducing the complexity of the attention operation [1, 29, 45, 63, 68], or by doing both [3, 34, 49, 52]. Other studies [5, 35, 64] propose replacing attention with simple MLP-based operators, though these often evolve into CNN-like forms [39]. In this paper we focus mainly on convolution operations, especially DWConv, for the following reasons. First, the advantage of attention over convolution is unclear or debatable [42, 71]. Second, attention-based mechanisms are generally slower than convolutions and therefore less favorable for current industrial use [26, 48]. Finally, DWConv remains a popular choice in many hybrid models, so it deserves a closer look.

Design of PConv and FasterNet

In this section, we first revisit DWConv and analyze its frequent memory access issues. Then, we introduce PConv as a competitive alternative operator to solve this problem. Afterwards, we introduce FasterNet and explain its details, including design considerations.
DWConv is a popular variant of Conv that has been widely adopted as a key component of many neural networks. For an input I ∈ R^(c×h×w), DWConv applies c filters W ∈ R^(k×k) to compute the output O ∈ R^(c×h×w). As shown in Figure 1(b), each filter slides over one input channel and contributes to one output channel. This depthwise computation gives DWConv only h × w × k² × c FLOPs, lower than the h × w × k² × c² of a regular Conv. Although effective in reducing FLOPs, DWConv is usually followed by a pointwise convolution (PWConv) and cannot simply replace a regular Conv, since doing so would cause a severe accuracy drop.
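
As a quick sanity check of the two FLOP formulas above, the following few lines plug in an illustrative feature-map size (the numbers are placeholders, not from the paper):

# FLOPs (multiply-adds) of a regular Conv vs. a depthwise Conv, per the formulas above
h, w, k, c = 56, 56, 3, 64

flops_conv   = h * w * k**2 * c**2   # regular Conv: h*w*k^2*c^2
flops_dwconv = h * w * k**2 * c      # DWConv:       h*w*k^2*c

print(f"regular Conv: {flops_conv/1e6:.1f} MFLOPs")
print(f"DWConv:       {flops_dwconv/1e6:.2f} MFLOPs  (x{c} fewer)")
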

Partial convolution as a basic operator

Feature maps share high similarity across channels. This redundancy has been noted in many other studies [17, 82], but few exploit it in a simple and effective way. Specifically, we propose a simple convolution named PConv that reduces computational redundancy and memory access at the same time. It applies a regular convolution on only a part of the input channels for spatial feature extraction and leaves the remaining channels untouched. For contiguous or regular memory access, the first or last c_p consecutive channels are taken as representatives of the whole feature map in the computation. With a typical partial ratio r = c_p/c = 1/4, PConv has only 1/16 of the FLOPs of a regular convolution, and its memory access is also much smaller (about 1/4 of a regular Conv).
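
The following short sketch works through these ratios for r = 1/4; the feature-map dimensions are illustrative only, and only the dominant memory-access term (reading and writing the feature map) is counted:

# PConv cost relative to a regular Conv, following the ratios discussed above
h, w, k, c = 56, 56, 3, 64
r = 1 / 4                                   # partial ratio c_p / c
cp = int(c * r)

flops_conv  = h * w * k**2 * c**2
flops_pconv = h * w * k**2 * cp**2          # conv applied to c_p channels only -> r^2 = 1/16 of regular
mem_conv    = h * w * 2 * c                 # dominant term: read + write the full feature map
mem_pconv   = h * w * 2 * cp                # only c_p channels are touched -> 1/4 of regular

print(f"FLOPs ratio:         {flops_pconv / flops_conv:.4f}")   # 0.0625 = 1/16
print(f"memory access ratio: {mem_pconv / mem_conv:.2f}")       # 0.25   = 1/4
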

PConv followed by PWConv

To use the information from all channels fully and efficiently, we append a pointwise convolution (PWConv) after PConv. Together, their effective receptive field on the input feature map resembles a T-shaped convolution, which focuses more on the center position than a regular convolution that treats a patch uniformly. To justify this T-shaped receptive field, we evaluate the importance of each position by computing its position-wise Frobenius norm: for a regular convolution filter, the norm at a position is taken over all input channels, and we hypothesize that a position with a larger Frobenius norm than the others tends to be more important. We regard the position with the largest Frobenius norm as the salient position, examine every filter in a pretrained ResNet18 to find its salient position, and plot a histogram of the salient positions. The center position is salient far more often than its neighbors, which is consistent with the T-shaped computation.
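A minimal sketch of this kind of analysis (not the authors' script), using torchvision's pretrained ResNet18 and the 0.13+ weights API: for every 3×3 filter, take the Frobenius norm over input channels at each of the nine kernel positions and record which position is largest.

import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()

counts = torch.zeros(9)                      # the 9 positions of a 3x3 kernel
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d) and m.kernel_size == (3, 3):
        w = m.weight.detach()                # (out_c, in_c, 3, 3)
        w = w.flatten(2)                     # (out_c, in_c, 9)
        norms = w.pow(2).sum(dim=1).sqrt()   # per-filter Frobenius norm at each position: (out_c, 9)
        salient = norms.argmax(dim=1)        # salient position of each filter
        counts += torch.bincount(salient, minlength=9).float()

print(counts.reshape(3, 3))                  # per the observation above, the center entry dominates
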

FasterNet as a universal backbone network

With our novel PConv and off-the-shelf PWConv as the main building operators, we further propose FasterNet, a new family of neural networks that runs fast and performs well on many vision tasks.

Architecture overview

Our goal is to keep the architecture as simple as possible, without bells and whistles, so that it is hardware-friendly overall. The overall architecture is shown in the figure below. It has four stages, each preceded by either an embedding layer (a regular 4×4 Conv with stride 4) or a merging layer (a regular 2×2 Conv with stride 2) for spatial downsampling and channel expansion. Each stage then contains a stack of FasterNet blocks. We observe that the blocks in the last two stages require less memory access and tend to have higher FLOPS, as verified in Table 1, so we place more FasterNet blocks, and hence more computation, in the last two stages. Each FasterNet block has a PConv layer followed by two PWConv (or 1×1 Conv) layers. Together they form an inverted residual block, where the middle layer has an expanded channel count and a shortcut connection is added to reuse the input features.
Beyond the operators above, normalization and activation layers are also essential for a well-performing neural network. However, many previous works [17, 20, 54] overuse these layers, which may limit feature diversity, hurt performance, and slow down the overall computation. In contrast, we add them only after each intermediate PWConv, to preserve feature diversity and achieve lower latency. Furthermore, we use batch normalization (BN) instead of other alternatives: BN can be merged into the adjacent convolutional layer for faster inference while being as effective as the alternatives.
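
The BN-folding trick mentioned here is standard; a minimal sketch is given below, assuming the usual Conv2d → BatchNorm2d pairing (as in the PatchEmbed/PatchMerging classes in the appendix), where the BN statistics and affine parameters are absorbed into the convolution's weight and bias for inference:

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold an inference-time BatchNorm2d into the preceding Conv2d.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight.detach() / torch.sqrt(bn.running_var + bn.eps)   # per-channel scale
    fused.weight.data = conv.weight.detach() * scale.reshape(-1, 1, 1, 1)
    b = conv.bias.detach() if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (b - bn.running_mean) * scale + bn.bias.detach()
    return fused

# quick numerical check
conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
x = torch.randn(1, 16, 8, 8)
bn(conv(x))            # one training-mode pass so BN has non-trivial running statistics
bn.eval()
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
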
For the activation layer, we empirically choose GELU for the smaller FasterNet variants and ReLU for the larger ones, taking both running time and effectiveness into account. The last three layers, namely a global average pooling, a 1×1 convolutional layer, and a fully connected layer, are used together for feature transformation and classification.
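
The classifier head described here can be written down directly. A minimal sketch follows; the hidden width of 1280 and the activation choice are placeholders rather than the paper's exact configuration:

import torch.nn as nn

# Global average pool -> 1x1 conv (feature transform) -> fully connected classifier.
def classifier_head(in_dim, num_classes, hidden_dim=1280, act_layer=nn.GELU):
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Conv2d(in_dim, hidden_dim, 1, bias=False),
        act_layer(),
        nn.Flatten(1),
        nn.Linear(hidden_dim, num_classes),
    )
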

Appendix

Paper address: https://export.arxiv.org/pdf/2303.03667v1.pdf

FasterNet code:

# --------------------------FasterNet----------------------------
import torch
import torch.nn as nn

from timm.models.layers import DropPath


class Partial_conv3(nn.Module):
    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)

        if forward == 'slicing':
            self.forward = self.forward_slicing
        elif forward == 'split_cat':
            self.forward = self.forward_split_cat
        else:
            raise NotImplementedError

    def forward_slicing(self, x):
        # only for inference
        x = x.clone()  # !!! Keep the original input intact for the residual connection later
        x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])

        return x

    def forward_split_cat(self, x):
        # for training/inference
        x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
        x1 = self.partial_conv3(x1)
        x = torch.cat((x1, x2), 1)
        return x


class MLPBlock(nn.Module):
    def __init__(self,
                 dim,
                 n_div,
                 mlp_ratio,
                 drop_path,
                 layer_scale_init_value,
                 act_layer,
                 norm_layer,
                 pconv_fw_type
                 ):

        super().__init__()
        self.dim = dim
        self.mlp_ratio = mlp_ratio
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.n_div = n_div

        mlp_hidden_dim = int(dim * mlp_ratio)
        mlp_layer = [
            nn.Conv2d(dim, mlp_hidden_dim, 1, bias=False),
            norm_layer(mlp_hidden_dim),
            act_layer(),
            nn.Conv2d(mlp_hidden_dim, dim, 1, bias=False)
        ]
        self.mlp = nn.Sequential(*mlp_layer)
        self.spatial_mixing = Partial_conv3(
            dim,
            n_div,
            pconv_fw_type
        )
        if layer_scale_init_value > 0:
            self.layer_scale = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.forward = self.forward_layer_scale
        else:
            self.forward = self.forward  # keep the plain forward defined below (no layer scale)

    def forward(self, x):
        shortcut = x
        x = self.spatial_mixing(x)
        x = shortcut + self.drop_path(self.mlp(x))
        return x

    def forward_layer_scale(self, x):
        shortcut = x
        x = self.spatial_mixing(x)
        x = shortcut + self.drop_path(
            self.layer_scale.unsqueeze(-1).unsqueeze(-1) * self.mlp(x))
        return x


class BasicStage(nn.Module):
    def __init__(self,
                 dim,
                 depth=1,
                 n_div=4,
                 mlp_ratio=2,
                 layer_scale_init_value=0,
                 norm_layer=nn.BatchNorm2d,
                 act_layer=nn.ReLU,
                 pconv_fw_type='split_cat'
                 ):
        super().__init__()
        # drop-path rate per block; all zeros here since stochastic depth is disabled in this standalone version
        dpr = [x.item()
               for x in torch.linspace(0, 0.0, sum([1, 2, 8, 2]))]
        blocks_list = [
            MLPBlock(
                dim=dim,
                n_div=n_div,
                mlp_ratio=mlp_ratio,
                drop_path=dpr[i],
                layer_scale_init_value=layer_scale_init_value,
                norm_layer=norm_layer,
                act_layer=act_layer,
                pconv_fw_type=pconv_fw_type
            )
            for i in range(depth)
        ]

        self.blocks = nn.Sequential(*blocks_list)

    def forward(self, x):
        x = self.blocks(x)
        return x


class PatchEmbed_FasterNet(nn.Module):

    def __init__(self, in_chans, embed_dim, patch_size, patch_stride, norm_layer=nn.BatchNorm2d):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_stride, bias=False)
        if norm_layer is not None:
            self.norm = norm_layer(embed_dim)
        else:
            self.norm = nn.Identity()

    def forward(self, x):
        x = self.norm(self.proj(x))
        return x

    def fuseforward(self, x):
        x = self.proj(x)
        return x


class PatchMerging_FasterNet(nn.Module):

    def __init__(self, dim, out_dim, k, patch_stride2, norm_layer=nn.BatchNorm2d):
        super().__init__()
        self.reduction = nn.Conv2d(dim, out_dim, kernel_size=k, stride=patch_stride2, bias=False)
        if norm_layer is not None:
            self.norm = norm_layer(out_dim)
        else:
            self.norm = nn.Identity()

    def forward(self, x):
        x = self.norm(self.reduction(x))
        return x

    def fuseforward(self, x):
        x = self.reduction(x)
        return x
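
Below is a minimal sketch of how these pieces might be assembled into a FasterNet-style backbone and sanity-checked on a dummy input; the widths, depths, and the class name TinyFasterNet are illustrative for this example, not an official configuration:

class TinyFasterNet(nn.Module):
    # illustrative 4-stage assembly of the classes above; not the official FasterNet config
    def __init__(self, in_chans=3, dims=(40, 80, 160, 320), depths=(1, 2, 8, 2)):
        super().__init__()
        self.patch_embed = PatchEmbed_FasterNet(in_chans, dims[0], patch_size=4, patch_stride=4)
        stages = []
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            stages.append(BasicStage(dim=dim, depth=depth))
            if i < len(dims) - 1:
                # 2x2 stride-2 merging layer between stages, doubling the channel count
                stages.append(PatchMerging_FasterNet(dim, dims[i + 1], k=2, patch_stride2=2))
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(self.patch_embed(x))


model = TinyFasterNet()
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # expected: torch.Size([1, 320, 7, 7])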

Source: blog.csdn.net/shengweiit/article/details/132532205