90.94% accuracy! Google sets a new ImageNet record! Model soups improve model accuracy and robustness


Fengse, from Aofei Temple
Reposted from: QbitAI (量子位)

How do you maximize model accuracy?

Recently, researchers from Google and elsewhere found that:

Don't throw away the poorly performing fine-tuned models; average their weights instead!

This improves both the accuracy and the robustness of the model without increasing inference time or memory overhead, because the averaged weights still form a single model rather than an ensemble.

For example, the researchers used this method to set a new ImageNet1K record: 90.94%.


Extending the method to multiple image classification and natural language processing tasks also improves out-of-distribution performance and zero-shot performance on new downstream tasks.


The method also has an interesting name: Model soup.

Does it remind you of the Fibonacci soup joke? (Yesterday's soup + the day before yesterday's soup = today's new soup.)


△ From Zhihu user @hzwer, reposted with permission

A total of three recipes

Recall how model accuracy was usually pushed higher before this:

Train multiple fine-tuned models with different hyperparameter configurations, keep the one that performs best on the validation set, and discard the rest.

Since neural networks are nonlinear and their solutions can lie in many different loss basins, it is somewhat surprising that Model soup improves performance by keeping the weights of all the fine-tuned models and averaging them.

However, recent studies have found that fine-tuned models optimized independently from the same pre-trained initialization lie in the same basin of the error landscape.

Previous work has also shown that averaging weights along a single training trajectory can improve the performance of models trained from random initialization.

The authors were inspired by these findings.

Model soup has three "recipes" (implementations): uniform soup, greedy soup, and learned soup.

Among them, greedy soup is the one used most in the paper, because it outperforms directly averaging all of the weights evenly (the uniform soup), as sketched below.
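For reference, the "average all weights evenly" baseline (uniform soup) is just an element-wise average of the fine-tuned checkpoints. Below is a minimal PyTorch-style sketch, assuming every checkpoint is a state_dict of the same architecture; the file names are hypothetical placeholders, and this is an illustration rather than the paper's official code.

```python
import torch

def uniform_soup(checkpoint_paths):
    """Element-wise average of several fine-tuned checkpoints (uniform soup)."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical usage: all checkpoints come from fine-tuning the same
# pre-trained model with different hyperparameters.
# model.load_state_dict(uniform_soup(["ft_run1.pt", "ft_run2.pt", "ft_run3.pt"]))
```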

Specifically, greedy soup is built by sequentially adding each model to the "soup" as a potential ingredient, and keeping a model in the "soup" only if it improves performance on the held-out validation set.

The models are considered in descending order of their individual validation accuracy.
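Here is a minimal sketch of that greedy procedure in the same PyTorch style. The `evaluate(state_dict) -> float` callable is a hypothetical placeholder you would supply (it loads the weights into your model and returns validation accuracy); this is an illustration of the recipe described above, not the paper's official implementation.

```python
import torch

def greedy_soup(checkpoint_paths, evaluate):
    """Greedy soup sketch: add each model only if the soup's accuracy holds up."""
    def average(state_dicts):
        # Element-wise mean of the tensors currently in the soup.
        return {
            k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]
        }

    # 1. Load all candidates and sort by individual validation accuracy, best first.
    candidates = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    candidates.sort(key=evaluate, reverse=True)

    # 2. Start the soup with the single best model.
    soup = [candidates[0]]
    best_acc = evaluate(candidates[0])

    # 3. Try adding each remaining model; keep it only if the averaged
    #    soup's validation accuracy does not drop.
    for candidate in candidates[1:]:
        acc = evaluate(average(soup + [candidate]))
        if acc >= best_acc:
            soup.append(candidate)
            best_acc = acc

    return average(soup)
```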


Outperforms the single best fine-tuned model

The authors conducted comprehensive fine-tuning experiments to test the effectiveness of Model soup.

The first set of experiments fine-tunes CLIP and ALIGN, two models pre-trained with a contrastive loss on image-text pairs.

After the Model soup operation, both outperformed the best single fine-tuned model on both in-distribution and natural-distribution-shift test sets.


△ Left: CLIP; right: ALIGN

Next is the ViT-G model pre-trained on the JFT dataset.

This is the model that achieved 90.94% accuracy on ImageNet1K, breaking CoAtNet's previous record of 90.88%, while using 25% fewer FLOPs at inference.


In addition to image classification tasks, the authors also evaluated Model soup on NLP tasks.

The following table shows the results of the BERT and T5 models on four text classification tasks on the GLUE benchmark:


Although the improvement is not as pronounced as in image classification, greedy soup improves on the best single model for most tasks.

Of course, the authors also point out that Model soup has limited applicability: the models tested so far are all pre-trained on large, heterogeneous datasets, and the benefit is much less pronounced outside this setting.

Finally, Zhihu user @Gongjiang pointed out that averaging model parameters is actually a classic trick; the original Transformer paper already used it.


Did you notice that?

Paper address:
https://arxiv.org/abs/2203.05482

Zhihu answers by @Gongjiang and @hzwer (reposted with permission): https://www.zhihu.com/question/521497951



Origin blog.csdn.net/amusi1994/article/details/123625737