ICCV 2023 | Pixel-based MIM: A Simple and Efficient Self-Supervised Method for Multilevel Feature Fusion

Overview

Paper: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability"

Problem background: Masked Image Modeling (MIM) is an effective self-supervised learning framework, but existing pixel-based MIM methods tend to focus too heavily on high-frequency details. As a result, part of the model's capacity is wasted and low-frequency semantic information is not fully captured.

Main work: To address this problem, the paper proposes a new method that explicitly uses shallow, low-level features to assist pixel reconstruction. Integrated into MAE, this design reduces the wasted modeling capability of pixel-based MIM, improves convergence, and yields solid gains on a range of downstream tasks. The improvement is especially pronounced on smaller models.

Motivation

Self-supervised learning has made remarkable progress in computer vision. Within it, the MIM paradigm learns the semantics of an input image by reconstructing its masked parts, combining a simple training process with strong downstream performance. Basic pixel-based methods such as MAE have simple pre-training pipelines and minimal computational overhead, but they are usually biased towards capturing high-frequency details, wasting modeling capability that could be better spent on low-frequency semantics. The authors aim to reduce this waste in order to improve the quality of the learned representations for downstream vision tasks. To this end, they design two pilot experiments and propose a corresponding solution, MFF:

Fusing Shallow Layers: instead of using only the output layer for pixel reconstruction, a weighted-average strategy fuses it with all previous layers. The weights are updated dynamically during pre-training and reveal how important each layer is for the reconstruction task.

Frequency analysis: the frequency response of each layer's features is analyzed, showing that shallow layers contain more high-frequency components, which correspond to low-level details such as texture.

Multi-level Feature Fusion: by explicitly incorporating shallow, low-level features into the output layer, the model no longer has to over-focus on these low-level details and can better capture high-level semantics.
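
The frequency analysis mentioned above can be made concrete with a small sketch. It assumes NumPy and a feature channel already reshaped to the 2D patch grid; the function name `high_freq_ratio` and the `cutoff` value are hypothetical choices for illustration, not the paper's exact protocol.

```python
import numpy as np

def high_freq_ratio(feature_map: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above a radial frequency `cutoff`.

    feature_map: (H, W) array, e.g. one channel of a layer's tokens
    reshaped to the patch grid. Frequencies are in cycles per sample,
    so the per-axis Nyquist limit is 0.5.
    """
    h, w = feature_map.shape
    # Remove the mean so the DC component does not dominate the spectrum.
    centered = feature_map - feature_map.mean()
    power = np.abs(np.fft.fft2(centered)) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)

# Toy inputs on a 14x14 grid: a smooth pattern and a fine texture.
ys, xs = np.mgrid[0:14, 0:14]
smooth = np.sin(2 * np.pi * xs / 14)   # one cycle across the grid
texture = (-1.0) ** (xs + ys)          # checkerboard, highest frequency

smooth_ratio = high_freq_ratio(smooth)
texture_ratio = high_freq_ratio(texture)
```

Applied per layer, a statistic like this would be expected to be larger for shallow layers than for deep ones if the observation reported in the paper holds.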

Method

As mentioned above, the paper proposes a new method for pixel-based masked image modeling (MIM) built around Multi-level Feature Fusion (MFF). The sections below follow the paper's structure and describe the method in detail.

An Introduction to Pixel-Level MIM

Pixel-based MIM aims to predict the raw pixel values of the original or post-processed image. The process can be viewed as a denoising autoencoder and follows a simple pipeline. Given a masked image, the visible tokens and/or mask tokens are fed into the encoder; if only the visible tokens are encoded, then both the mask tokens and the latent features output by the encoder must be fed into the decoder.
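
As a rough illustration of this pipeline, the sketch below splits an image into patches, hides a random subset, and computes the reconstruction error on the hidden patches only. It assumes NumPy; the helper names, the toy image size, and the 75% mask ratio (the value commonly used with MAE) are assumptions for the example, and the all-zeros prediction is a placeholder for a real encoder and decoder.

```python
import numpy as np

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into (num_patches, patch*patch*C) rows."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches: int, mask_ratio: float, rng) -> np.ndarray:
    """Boolean mask; True means the patch is hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared pixel error, averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
target = patchify(img, patch=8)              # 16 patches of 8*8*3 values
mask = random_mask(len(target), 0.75, rng)   # hide 12 of the 16 patches
visible = target[~mask]                      # what the encoder would see
pred = np.zeros_like(target)                 # placeholder decoder output
loss = masked_mse(pred, target, mask)
```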

Multi-level Feature Fusion

The paper proposes a multi-level feature fusion mechanism and integrates it into existing pixel-based MIM methods. The specific steps are as follows:

  • Input and encoding: given an image $I$, the encoder $E$ produces the latent representation $X$.

  • Select fusion layers: given the encoder depth $N$, choose the number of additional layers to fuse, $M$ (in this paper, $M = 5$). The authors first pick the shallow layer through ablation studies, then settle experimentally on fusing 6 layers in total, including the last layer.

  • Projection layer: before fusion, a projection layer $P_i$ is applied to each of the $M$ additional layers to align the feature spaces across levels.

  • Fusion layer: a fusion layer $F$ is introduced to fuse the multi-level features $\tilde{X}$, and the corresponding output is fed to the decoder for pixel reconstruction.
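
The steps above can be sketched as a single forward pass. This is a minimal illustration in NumPy under several assumptions that are not taken from the paper: random arrays stand in for the per-layer encoder outputs, the layer indices are hypothetical, the last layer is left unprojected, and the fusion weights are normalized with a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 49, 64                 # toy sizes
layer_ids = [0, 2, 4, 6, 8, 11]          # hypothetical: M = 5 extra layers + last

# Stand-ins for the encoder outputs X_i of the selected layers.
features = {i: rng.standard_normal((num_tokens, dim)) for i in layer_ids}

# P_i: one linear projection per additional layer (assumption: the
# output layer itself is passed through unchanged).
projections = {
    i: rng.standard_normal((dim, dim)) / np.sqrt(dim) for i in layer_ids[:-1]
}

# F: weighted average. The logits would be learnable in practice;
# zeros give uniform weights here.
logits = np.zeros(len(layer_ids))
weights = np.exp(logits) / np.exp(logits).sum()

def fuse(features, projections, weights, layer_ids):
    """Project each selected layer, then take the weighted average."""
    aligned = [
        features[i] @ projections[i] if i in projections else features[i]
        for i in layer_ids
    ]
    stacked = np.stack(aligned)                    # (M + 1, tokens, dim)
    return np.tensordot(weights, stacked, axes=1)  # (tokens, dim)

fused = fuse(features, projections, weights, layer_ids)
# `fused` would replace the plain last-layer output as decoder input.
```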

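Instantiating the Projection and Fusion Layers

The projection layer can be either linear or nonlinear, but the paper's experiments show that a simple linear layer is effective enough within this framework.

The fusion layer, in turn, is meant to gather low-level information from the shallow features. The paper evaluates two common fusion methods: weighted-average pooling and self-attention-based fusion. Weighted-average pooling is implemented with dynamically updated weights, while the self-attention variant uses an existing Transformer layer. The experiments show that weighted-average pooling performs on par with self-attention while being simpler and more computationally efficient.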
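
To show what "dynamically updated weights" can mean in practice, here is a toy sketch in which the fusion weights are learned by gradient descent. Everything in it is an assumption for illustration: the features are random, the target is constructed so that it depends mostly on the shallowest layer, and the weights are left unnormalized, so this is a weighted sum rather than the paper's exact pooling.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 6, 256, 32
feats = rng.standard_normal((num_layers, num_tokens, dim))

# Toy reconstruction target that leans on the shallowest layer.
true_w = np.array([0.7, 0.0, 0.0, 0.0, 0.0, 0.3])
target = np.tensordot(true_w, feats, axes=1)

w = np.full(num_layers, 1.0 / num_layers)   # start from uniform weights
lr = 0.1
for _ in range(300):
    fused = np.tensordot(w, feats, axes=1)
    err = fused - target
    # Gradient of mean(err**2) with respect to each fusion weight.
    grad = 2.0 * (err[None] * feats).mean(axis=(1, 2))
    w -= lr * grad

most_important_layer = int(np.argmax(w))
```

After training, the learned weights recover the contribution of each layer to the target, which is the sense in which such weights can indicate layer importance.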

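Overall, by combining shallow and deep features, the method counters the tendency of pixel-based MIM to capture high-frequency details at the expense of low-frequency semantic information, and thereby improves model performance.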

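Experiments

The experimental results show that most MIM models gain accuracy when combined with the MFF strategy.

The ablation studies examine three key aspects: the importance of shallow layers, the number of layers used for fusion, and the effect of the projection layer and the fusion strategy.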

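Are shallow layers important?

The experiments compare fusing the output layer with either a shallow layer or a deep layer. Fusing with a deep layer brings only a marginal improvement, whereas fusing low-level features from a shallow layer directly into the output layer improves performance significantly, because the model can then concentrate more on semantic information. The method therefore uses a shallow layer (namely the first layer) for multi-level feature fusion.

How many layers should be fused?

Beyond the output layer and the shallow layer selected above, it is reasonable to also fuse intermediate layers, since they may contain additional low-level features or high-level meaning that helps the reconstruction task. The experiments try evenly selecting 1, 2, and 5 layers between the shallow layer and the output layer. Introducing more layers brings consistent improvements, as these layers may contain distinctive features, such as texture or color, that help the model complete the reconstruction task. However, when all layers are fused, performance drops on every downstream task, probably because redundancy among the layers makes optimization harder.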
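
One simple way to pick evenly spaced layers between the first layer and the output layer is sketched below. The function `select_layers` and its floor-based spacing rule are hypothetical; the text does not say how the paper rounds the indices.

```python
def select_layers(depth: int, num_intermediate: int) -> list[int]:
    """Return 0-based layer indices: the first layer, `num_intermediate`
    evenly spaced intermediate layers, and the last layer."""
    if depth < 2 or not 0 <= num_intermediate <= depth - 2:
        raise ValueError("need 0 <= num_intermediate <= depth - 2")
    last = depth - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    middle = [
        last * k // (num_intermediate + 1)
        for k in range(1, num_intermediate + 1)
    ]
    return [0, *middle, last]

# For a 12-layer encoder, four intermediate layers give six fused
# layers in total, counting the first and the last.
six_layers = select_layers(12, 4)
```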

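Do the projection layer and fusion strategy matter?

The experiments also examine how the projection layer affects the final results. Compared with using no projection layer or a nonlinear one, a simple linear projection is enough to achieve satisfactory results. A linear projection helps reduce the domain or distribution gap between layers, whereas a nonlinear projection adds computational overhead and is harder to optimize, which leads to suboptimal performance. As for the fusion strategy, the authors find weighted-average pooling the most effective: compared with attention-based fusion, it is simpler and has lower computational overhead.

To summarize briefly, the ablations highlight the importance of shallow layers, of choosing an appropriate number of layers, and of using a linear projection with weighted-average pooling. These findings help improve multi-level feature fusion in pixel-based MIM methods and give concrete guidance for achieving such gains. By mixing shallow and intermediate layers and choosing suitable projection and fusion strategies, the method improves accuracy on the image reconstruction task and offers a useful reference for future research.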

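Summary

In this study, the researchers systematically explore multi-level feature fusion for masked image modeling in isotropic architectures such as ViT. A pilot experiment reveals the importance of shallow, low-level features for the pixel reconstruction task, and a simple, intuitive multi-level feature fusion strategy applied to two pixel-based MIM methods, MAE and PixMIM, yields significant performance gains. The ablation studies further refine the choice of layers and of the projection and fusion strategies, and the analysis finds that the fusion suppresses high-frequency information and flattens the loss landscape. This work offers a new perspective on pixel-based MIM methods and advances this simple and efficient self-supervised learning paradigm.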



Origin juejin.im/post/7266299564344934419