ICCV 2023 | USTC and MSRA jointly propose AFFNet, a lightweight neural network architecture built on Adaptive Frequency Filters

guide

Paper: "Adaptive Frequency Filters As Efficient Global Token Mixers"

TL;DR: This paper shows that adaptive frequency filters can act as efficient global token mixers, building mainly on the convolution theorem. Global token mixing, which corresponds to large-kernel convolution in the latent space, can thus be implemented efficiently as a Hadamard (element-wise) product in the frequency domain.

problem definition

At present, the three mainstream visual backbones (CNNs, Transformers, and MLPs) all perform well on major visual tasks, largely thanks to their effective information fusion at a global scale. However, because of the high computational cost of the self-attention mechanism, large convolution kernels, and fully connected layers, efficient deployment, especially on mobile devices, remains a challenge.

solution

To this end, the paper introduces a novel adaptive frequency filter. The method transfers latent representations to the frequency domain and performs semantically adaptive frequency filtering via element-wise multiplication. This operation is mathematically equivalent to a token-mixing operation with dynamic convolution kernels in the original latent space. The authors further use the AFF token mixer as the main neural operator to build a lightweight neural network, called AFFNet, and demonstrate its effectiveness and efficiency through experiments.

Ultimately, global token mixing can be performed efficiently by transferring to the frequency domain and operating there. By employing the Fast Fourier Transform (FFT), the complexity of token mixing drops from O(N^2) to O(N log N).
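The equivalence behind this claim is the convolution theorem: circular convolution in the spatial domain equals element-wise multiplication in the frequency domain. A minimal numpy check (all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

# Convolution theorem sketch: circular convolution of a sequence with a kernel
# equals element-wise (Hadamard) multiplication of their spectra.
rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)   # a 1-D sequence of N tokens
k = rng.standard_normal(N)   # a kernel as large as the sequence itself

# Direct circular convolution: O(N^2) operations.
direct = np.array([sum(x[(n - m) % N] * k[m] for m in range(N)) for n in range(N)])

# FFT route: O(N log N) operations.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

assert np.allclose(direct, via_fft)
```

The FFT route gives the same result while replacing the O(N^2) double loop with three O(N log N) transforms and one O(N) multiply, which is exactly the saving the paper exploits.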

method

In many mainstream neural networks, token mixing is crucial, since learning non-local representations is key to visual understanding. The paper first describes a unified view of token mixing in which tokens are updated by mixing information from their context regions. The authors then review existing token mixing methods across CNNs, Transformers, and MLPs, and point out issues with their efficiency and effectiveness. Interested readers can refer to the original paper; the details are not repeated here.

So, what is token mixing?

In neural networks that process images, the input is often divided into small squares, or "tokens." These tokens are processed through the layers of the network. Token mixing refers to the way these little squares interact with each other and combine information. Think of it as a dialogue between different parts of an image, sharing information to better understand the image as a whole.
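The "small squares sharing information" picture can be made concrete with a toy patchify-and-mix example (the uniform mixing weights and tiny sizes here are illustrative stand-ins, not the paper's method):

```python
import numpy as np

# Toy illustration of tokens and token mixing: split a tiny "image" into
# patches, flatten each patch into a token, then let every token absorb
# information from all others via a global mixing matrix.
image = np.arange(16.0).reshape(4, 4)           # a 4x4 "image"

# Patchify into four 2x2 patches, each flattened to a 4-value token.
patches = image.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, 4)

weights = np.full((4, 4), 0.25)                 # uniform global mixing weights
mixed = weights @ patches                       # each token = average of all tokens

assert mixed.shape == patches.shape
```

With uniform weights every output token becomes the average of all input tokens; real token mixers (attention, large kernels, or frequency filters) replace this uniform matrix with something content-dependent.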

Next, how should we understand "adaptive frequency filtering"?

  • Adaptive: the system can change and adjust based on the data it is processing. It is not a one-size-fits-all method, but one that adjusts dynamically to the specific content of the image. Typically, something like a "weight" is computed from the available information and then applied to the region of interest.

  • Frequency filtering: in the context of images and signals, "frequency" refers to the different patterns or waveforms that make up an image. Filtering means selectively focusing on certain frequencies or patterns and ignoring others. Still unclear? Imagine a child tuning a radio to hear a particular station clearly by filtering out all the other noise.
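The radio analogy can be run in a few lines of numpy: mix a slow "station" wave with fast interference, then keep only the low frequencies. The fixed mask here is deliberately non-adaptive; AFF's point is to make this mask content-dependent and learned.

```python
import numpy as np

# Frequency filtering sketch ("tuning the radio"): a 1-second signal sampled
# at 256 Hz mixes a slow 3 Hz wave (the station) with 60 Hz interference.
t = np.linspace(0, 1, 256, endpoint=False)
station = np.sin(2 * np.pi * 3 * t)             # the content we want
noise = 0.5 * np.sin(2 * np.pi * 60 * t)        # the interference
signal = station + noise

spectrum = np.fft.rfft(signal)
mask = np.zeros_like(spectrum)                  # a fixed (non-adaptive) filter
mask[:10] = 1.0                                 # pass only bins below 10 Hz
filtered = np.fft.irfft(spectrum * mask)

# The filtered signal is essentially the clean station; the noise is gone.
assert np.abs(filtered - station).max() < 1e-6
```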

The radio dial spins and spins, and from here, the gears of fate begin to turn...

So, how does AFF work?

Transforming the image

The AFF token mixer uses a weapon called the Fourier transform to convert the image from a spatial description into a frequency description. It is like translating the image into a different language, one that describes the patterns and waveforms within the image.

Filtering the frequencies

Once in this frequency "language", the AFF system applies a learned filter that focuses attention on the important parts of the image and ignores the unimportant ones. The filter is adaptive, meaning it changes according to the content of the specific image being processed.

Reconstructing the image

Finally, the system translates the filtered frequencies back into the regular pixel-based description, but now the parts that are not "needed" have been filtered out, while the important parts are emphasized.

The whole process is computationally efficient, meaning it runs quickly without requiring a great deal of computing power. In short, the AFF token mixer gives neural networks a more efficient and effective way to understand and process images. By focusing on the important patterns and ignoring the noise, it lets the network see the "big picture" more clearly and make more accurate predictions and analyses.
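The three steps above (transform, adaptively filter, transform back) can be sketched in numpy. This is a much-simplified stand-in, not the authors' implementation: the real AFF module derives its filter from learned layers on multi-channel latent features, whereas here the "adaptive" weights are just a sigmoid of the spectrum's own magnitude.

```python
import numpy as np

# Simplified AFF-style token mixing: FFT -> content-dependent element-wise
# filter -> inverse FFT. Sizes and the filter rule are illustrative only.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 8))            # an 8x8 grid of scalar tokens

# 1) Translate to the frequency domain.
freq = np.fft.fft2(tokens)

# 2) "Adaptive" filter stand-in: weights derived from the representation
#    itself (sigmoid of each frequency's magnitude, so weights lie in [0.5, 1)).
filt = 1.0 / (1.0 + np.exp(-np.abs(freq)))

# 3) Element-wise (Hadamard) product, then translate back.
mixed = np.fft.ifft2(freq * filt).real          # every output token now depends
                                                # on every input token
assert mixed.shape == tokens.shape
```

Because the multiply happens in the frequency domain, each output token mixes information from the entire grid in O(N log N) time, which is the efficiency argument made above.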

We can also try to understand it from another angle. The adaptive frequency filtering token mixer is like a smart translator and editor inside the neural network. By combining deep learning with frequency-domain analysis, this work designs a brand-new token mixing method. Through the FFT and inverse FFT, it translates the image into the frequency language, adapts to and focuses on the key parts, removes the noise, and then translates back. In this way, the global token mixing operation reduces to element-wise multiplication in the frequency domain, achieving greater efficiency and flexibility. This offers the deep learning field a new perspective and a possible direction for optimization. Got all that?

Here is the overall framework diagram:

As shown, AFFNet is a lightweight backbone built from multiple AFF Blocks. Its main features are:

  • AFFNet is built by stacking multiple AFF Blocks.
  • Convolution Stem: used for tokenization.
  • Plain Fusion: used to combine local and global features at each stage.

In addition, AFFNet comes in three versions for different application scenarios; they differ in channel width and therefore in parameter count.

  • AFFNet: 5.5M
  • AFFNet-T(Tiny):2.6M
  • AFFNet-ET(Extremely Tiny):1.4M
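The overall layout (convolution stem for tokenization, stacked blocks with global frequency mixing, plain fusion of local and global branches) can be sketched end to end. Everything below is a heavily simplified toy: layer sizes are arbitrary, the stem is an average-pool plus a random channel lift, the local branch is an identity, and fusion is plain addition.

```python
import numpy as np

rng = np.random.default_rng(0)

def aff_block(x):
    """x: (H, W, C) tokens. Mix globally in frequency space, fuse with a local branch."""
    freq = np.fft.fft2(x, axes=(0, 1))
    filt = 1.0 / (1.0 + np.exp(-np.abs(freq)))          # stand-in adaptive filter
    global_branch = np.fft.ifft2(freq * filt, axes=(0, 1)).real
    local_branch = x                                     # identity as a local stand-in
    return global_branch + local_branch                  # "plain fusion" by addition

image = rng.standard_normal((16, 16, 3))

# Convolution-stem stand-in: 2x2 average-pool patchify + random channel lift to 8 dims.
stem_w = rng.standard_normal((3, 8)) * 0.1
tokens = image.reshape(8, 2, 8, 2, 3).mean(axis=(1, 3)) @ stem_w   # (8, 8, 8)

for _ in range(3):                                       # stack three AFF-style blocks
    tokens = aff_block(tokens)

features = tokens.mean(axis=(0, 1))                      # globally pooled features
assert features.shape == (8,)
```

The three released sizes (AFFNet / -T / -ET) keep this layout and vary only the channel widths, which is why their parameter counts differ.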

AFF module highlights

experiments

quantitative analysis

qualitative analysis

Detection of small objects appears to be quite good.

summary

By introducing the adaptive frequency filtering (AFF) token mixer, this paper proposes a novel global token mixing method and builds a lightweight visual network architecture, AFFNet. The approach effectively addresses the computational challenges that traditional deep learning models face on mobile and edge devices, and demonstrates excellent performance across a wide range of visual tasks.



Origin juejin.im/post/7266299564344999955