Artificial Intelligence | ShowMeAI News Daily #2022.06.12

Keep creating, keep growing! This is day 14 of my participation in the "Juejin Daily New Plan · June Writing Challenge"; click to view the event details.

The ShowMeAI Daily series has been fully upgraded! It covers AI topics across Tools & Frameworks | Projects & Code | Blogs & Shares | Data & Resources | Research & Papers. Click to view the article archive, and subscribe to the topic #ShowMeAI资讯日报 in the official account to receive the latest daily updates. Click Topic Collections & Monthly e-Magazine to quickly browse the complete collections by topic.

1. Tools & Frameworks

Tool: Watermark-Removal - automatic watermark removal for images

tags: [watermark removal, machine learning, artificial intelligence, tools]

'Watermark-Removal - a machine learning image inpainting task that instinctively removes watermarks from image indistinguishable from the ground truth image' by Zuruoke Okafor

GitHub: github.com/zuruoke/wat…

Tool: ImageNet Annotation Tool - Frontend (FE)

tags: [image annotation, tools]

'ImageNet Annotation Tool - Frontend (FE)' by NAVER AI

GitHub: github.com/naver-ai/im…

Tool: feder - visualization tool for ANN (approximate nearest neighbor) indexes such as hnsw and faiss

tags: [ANN index, visualization, tools]

'feder - Visualization for hnsw, faiss and other anns index' by Zilliz

GitHub: github.com/zilliztech/…

Library: Maua - deep learning library for image, video, and audio synthesis

tags: [image synthesis, video synthesis, audio synthesis, deep learning]

'Maua - Deep learning toolkit for image, video, and audio synthesis'

GitHub: github.com/maua-maua-m…

Tool: cfonts - fancy fonts for the console

tags: [fonts, console]

'cfonts - Sexy fonts for the console' by Dominik Wilkowski

GitHub: github.com/dominikwilk…

Framework: Nebulgym - accelerate deep network training by adding just a few lines of code

tags: [deep learning, training framework, acceleration]

'Nebulgym - Accelerate AI training in a few lines of code without changing the training setup.' by Nebuly

GitHub: github.com/nebuly-ai/n…

2. Blogs & Shares

Tutorial: Microsoft's official Rust tutorial for beginners and students

tags: [Rust, Microsoft, tutorial]

Link: docs.microsoft.com/zh-cn/learn…

Free book: Fundamentals of Data Visualization

tags: [visualization, books, data visualization]

Link: clauswilke.com/dataviz/

3. Data & Resources

Resource list: Rust concurrency cheat sheet

tags: [Rust, concurrency, development, cheat sheet]

'Rust Concurrency Cheat Sheet' by quambene

GitHub: github.com/quambene/ru…

Resource list: Awesome-Implicit-NeRF-Robotics - a comprehensive list of papers on implicit representations and NeRF in the robotics/RL domain

'Awesome-Implicit-NeRF-Robotics - A comprehensive list of Implicit Representations and NeRF papers relating to Robotics/RL domain, including papers, codes, and related websites' by Zubair Irshad

GitHub: github.com/zubair-irsh…

4. Research & Papers

Reply with the keyword 日报 in the official account to receive the curated June paper collection for free.

Paper: Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

Title: Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation

Date: 6 Jun 2022

Field: computer vision

Tasks: instance segmentation, object detection, panoptic segmentation, semantic segmentation

Paper link: arxiv.org/abs/2206.02…

Code: github.com/IDEACVR/Mas…, github.com/IDEACVR/DIN…, github.com/IDEA-openso…, github.com/fengli-ust/…, github.com/IDEA-openso…

Authors: Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, Heung-Yeung Shum

Summary: In this paper we present Mask DINO, a unified object detection and segmentation framework.

Abstract: In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with a SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K). Code will be available at github.com/IDEACVR/Mas…
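As a concrete illustration of the mask branch the abstract describes (decoder query embeddings dot-multiplied with a high-resolution pixel embedding map to yield one binary mask per query), here is a minimal PyTorch sketch. It is not the authors' implementation; the MLP head and all shapes are illustrative assumptions.

```python
# Minimal sketch of a query-to-mask branch; not the IDEACVR/MaskDINO code.
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Small MLP mapping decoder queries into the mask-embedding space (assumed).
        self.mask_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, queries: torch.Tensor, pixel_embed: torch.Tensor) -> torch.Tensor:
        # queries:     (batch, num_queries, hidden_dim) from the DINO decoder
        # pixel_embed: (batch, hidden_dim, H, W) high-resolution pixel embedding map
        mask_embed = self.mask_mlp(queries)                             # (B, Q, C)
        mask_logits = torch.einsum("bqc,bchw->bqhw", mask_embed, pixel_embed)
        return mask_logits                                              # one logit map per query

queries = torch.randn(2, 100, 256)
pixel_embed = torch.randn(2, 256, 200, 200)
masks = MaskBranch()(queries, pixel_embed).sigmoid()                    # (2, 100, 200, 200) binary masks
```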

Paper: Diffusion-LM Improves Controllable Text Generation

Title: Diffusion-LM Improves Controllable Text Generation

Date: 27 May 2022

Field: natural language processing

Tasks: language modelling, text generation

Paper link: arxiv.org/abs/2205.14…

Code: github.com/xiangli1999…

Authors: Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori B. Hashimoto

Summary: Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation.

Abstract: Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.
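To make the mechanism concrete: the model starts from Gaussian latents and repeatedly denoises them toward word vectors, while the gradient of an attribute classifier steers each step. Below is a rough classifier-guided sampling sketch under common DDPM conventions; `denoiser` and `attribute_classifier` are hypothetical trained networks, and the noise schedule and update rule are illustrative, not the released implementation.

```python
# Rough sketch of gradient-guided iterative denoising over continuous word vectors.
import torch

def make_schedule(num_steps: int = 200):
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

def controllable_sample(denoiser, attribute_classifier, seq_len=16, dim=64,
                        num_steps=200, guidance_scale=1.0):
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = torch.randn(1, seq_len, dim)                        # x_T: pure Gaussian noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((1,), t)
        with torch.no_grad():
            eps = denoiser(x, t_batch)                      # predicted noise at step t
            # Standard DDPM posterior mean given the predicted noise.
            mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()

        # Gradient-based control: nudge the latent toward a higher score from a
        # classifier of the target attribute (sentiment, syntax, ...).
        x_in = x.detach().requires_grad_(True)
        score = attribute_classifier(x_in, t_batch).sum()
        grad = torch.autograd.grad(score, x_in)[0]
        mean = mean + guidance_scale * betas[t] * grad

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # afterwards, map each position to its nearest word embedding
```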

Paper: Separable Self-attention for Mobile Vision Transformers

Title: Separable Self-attention for Mobile Vision Transformers

Date: 6 Jun 2022

Field: computer vision

Tasks: object detection

Paper link: arxiv.org/abs/2206.02…

Code: github.com/apple/ml-cv…

Authors: Sachin Mehta, Mohammad Rastegari

Summary: The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection.

Abstract: Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires O(k²) time complexity with respect to the number of tokens (or patches) k. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. O(k). A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running 3.2× faster on a mobile device. Our source code is available at github.com/apple/ml-cv…
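The core trick is to replace token-to-token attention with attention against a single learned context summary, so the cost grows linearly in the token count k and the expensive batched matrix multiplications disappear. The sketch below follows that description from the abstract; the layer names and exact placement of the projections are my assumptions, not the apple/ml-cvnets code.

```python
# Sketch of separable (linear-cost) self-attention as described in the abstract.
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)      # one scalar "context score" per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        scores = self.to_scores(x).softmax(dim=1)                       # (B, k, 1), O(k)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)    # (B, 1, dim) summary
        gated = torch.relu(self.to_value(x)) * context                  # element-wise broadcast
        return self.out(gated)                                          # (B, k, dim)

x = torch.randn(2, 196, 64)
y = SeparableSelfAttention(64)(x)   # same shape as x, cost linear in the 196 tokens
```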

Paper: Masked Unsupervised Self-training for Zero-shot Image Classification

Title: Masked Unsupervised Self-training for Zero-shot Image Classification

Date: 7 Jun 2022

Field: computer vision

Tasks: classification, image classification, representation learning, zero-shot image classification

Paper link: arxiv.org/abs/2206.02…

Code: github.com/salesforce/…

Authors: Junnan Li, Silvio Savarese, Steven C. H. Hoi

Summary: We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.

Abstract: State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved impressive progress, it still requires a second stage of finetuning on labeled data. On the other hand, models pre-trained with large-scale text-image supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image classification tasks. However, the zero-shot performance of CLIP-like models is often insufficient for real-world adoption. In this paper, we aim to leverage the abundant unlabeled data to improve the performance of a pre-trained zero-shot classifier on downstream tasks. We propose Masked Unsupervised Self-Training (MUST), a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images. MUST jointly optimizes three objectives to learn both class-level global feature and pixel-level local feature and enforces a regularization between the two. We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification. For instance, MUST achieves a zero-shot top-1 accuracy of 77.7% on ImageNet using ViT-B, +9.4% higher than CLIP. Our code is available at github.com/salesforce/…
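For intuition, here is a generic sketch of the self-training ingredient only: confident pseudo-labels on unlabeled images train a student, while a slowly updated (EMA) copy produces those labels. This is a common self-training recipe rather than MUST itself; MUST additionally optimizes a masked-image objective and a global-local regularizer that are not reproduced here, and all names below (`student`, `teacher`, the threshold, the EMA rate) are placeholders.

```python
# Generic pseudo-label self-training step; illustrative only, not the salesforce/MUST code.
import torch
import torch.nn.functional as F

def self_training_step(student, teacher, images, optimizer, threshold=0.7, ema=0.999):
    # Pseudo-labels from the teacher on unlabeled images, kept only when confident.
    with torch.no_grad():
        probs = teacher(images).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > threshold
    if keep.any():
        loss = F.cross_entropy(student(images[keep]), pseudo[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Slowly update the teacher as an exponential moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
```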

Paper: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Title: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Date: 26 May 2022

Field: computer vision

Tasks: 3D object detection, autonomous driving, object detection, scene segmentation

Paper link: arxiv.org/abs/2205.13…

Code: github.com/mit-han-lab…

Authors: Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han

Summary: Multi-sensor fusion is essential for an accurate and reliable autonomous driving system.

Abstract: Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
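Conceptually, once camera features have been splatted onto the same bird's-eye-view grid as the voxelized LiDAR features, fusion reduces to an ordinary convolution over concatenated channels that any task head can share. The sketch below shows only that fusion step; the view-transform modules, channel sizes, and the optimized BEV pooling are assumptions or omissions, not the mit-han-lab implementation.

```python
# Conceptual sketch of fusing camera and LiDAR features in a shared BEV grid.
import torch
import torch.nn as nn

class BEVFuser(nn.Module):
    def __init__(self, cam_channels: int, lidar_channels: int, out_channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev:   (B, C_cam,   H, W) camera features already projected onto the BEV grid
        # lidar_bev: (B, C_lidar, H, W) voxelized LiDAR features on the same grid
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))  # shared BEV feature map

cam_bev = torch.randn(1, 80, 180, 180)
lidar_bev = torch.randn(1, 256, 180, 180)
bev = BEVFuser(80, 256)(cam_bev, lidar_bev)   # feed this map to detection / segmentation heads
```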

Paper: Can CNNs Be More Robust Than Transformers?

Title: Can CNNs Be More Robust Than Transformers?

Date: 7 Jun 2022

Field: computer vision

Tasks: image classification, image recognition

Paper link: arxiv.org/abs/2206.03…

Code: github.com/ucsc-vlaa/r…

Authors: Zeyu Wang, Yutong Bai, Yuyin Zhou, Cihang Xie

Summary: The recent success of Vision Transformers is shaking the long dominance of Convolutional Neural Networks (CNNs) in image recognition for a decade.

Abstract: The recent success of Vision Transformers is shaking the long dominance of Convolutional Neural Networks (CNNs) in image recognition for a decade. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, yet simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging kernel size, and c) reducing activation layers and normalization layers. Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures. The code is publicly available at github.com/UCSC-VLAA/R…
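The three design changes the abstract lists really are a few lines each: (a) a patchified stem, (b) enlarged depthwise kernels, and (c) fewer activation and normalization layers per block. Here is a minimal PyTorch sketch of those ideas; the dimensions, kernel size, and block layout are my illustrative choices, not the UCSC-VLAA reference code.

```python
# Sketch of the three robustness-oriented design tweaks described in the abstract.
import torch
import torch.nn as nn

def patchify_stem(in_ch: int = 3, dim: int = 96, patch: int = 8) -> nn.Module:
    # (a) patchify: stride equals kernel size, so each output position sees one patch
    return nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

class LargeKernelBlock(nn.Module):
    def __init__(self, dim: int = 96, kernel: int = 11):
        super().__init__()
        # (b) enlarged depthwise kernel; (c) a single norm and a single activation per block
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

x = torch.randn(1, 3, 224, 224)
feats = LargeKernelBlock()(patchify_stem()(x))   # (1, 96, 28, 28)
```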

Paper: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance

Title: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance

Date: 19 Nov 2020

Field: fintech

Tasks: reinforcement learning, stock market prediction

Paper link: arxiv.org/abs/2011.09…

Code: github.com/AI4Finance-…, github.com/AI4Finance-…, github.com/AI4Finance-…, github.com/polixir/Neo…, github.com/forrestneo/…

Authors: Xiao-Yang Liu, Hongyang Yang, Qian Chen, Runjia Zhang, Liuqing Yang, Bowen Xiao, Christina Dan Wang

Summary: In this paper, we introduce a DRL library FinRL that facilitates beginners to expose themselves to quantitative finance and to develop their own stock trading strategies.

Abstract: As deep reinforcement learning (DRL) has been recognized as an effective approach in quantitative finance, getting hands-on experience is attractive to beginners. However, to train a practical DRL trading agent that decides where to trade, at what price, and what quantity involves error-prone and arduous development and debugging. In this paper, we introduce a DRL library FinRL that facilitates beginners to expose themselves to quantitative finance and to develop their own stock trading strategies. Along with easily-reproducible tutorials, FinRL library allows users to streamline their own developments and to compare with existing schemes easily. Within FinRL, virtual environments are configured with stock market datasets, trading agents are trained with neural networks, and extensive backtesting is analyzed via trading performance. Moreover, it incorporates important trading constraints such as transaction cost, market liquidity and the investor's degree of risk-aversion. FinRL is featured with completeness, hands-on tutorials and reproducibility that favor beginners: (i) at multiple levels of time granularity, FinRL simulates trading environments across various stock markets, including NASDAQ-100, DJIA, S&P 500, HSI, SSE 50, and CSI 300; (ii) organized in a layered architecture with modular structure, FinRL provides fine-tuned state-of-the-art DRL algorithms (DQN, DDPG, PPO, SAC, A2C, TD3, etc.), commonly-used reward functions and standard evaluation baselines to alleviate the debugging workloads and promote reproducibility, and (iii) being highly extendable, FinRL reserves a complete set of user-import interfaces. Furthermore, we incorporate three application demonstrations, namely single stock trading, multiple stock trading, and portfolio allocation. The FinRL library is available on GitHub at github.com/AI4Finance-…
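The workflow the abstract describes (build a market environment from price data, train a DRL agent, then evaluate it by backtesting) looks roughly like the toy example below. It deliberately avoids FinRL's own class names, which are not verified here, and instead uses a hand-rolled Gymnasium environment with Stable-Baselines3 PPO; the environment, reward, and hyperparameters are purely illustrative.

```python
# Generic train-an-agent-on-a-trading-env loop; assumes gymnasium + stable-baselines3 >= 2.0,
# and is NOT FinRL's API.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class ToyTradingEnv(gym.Env):
    """Hold/buy/sell a single synthetic asset; reward is the change in portfolio value."""
    def __init__(self, prices: np.ndarray):
        super().__init__()
        self.prices = prices
        self.action_space = spaces.Discrete(3)   # 0 hold, 1 buy, 2 sell
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.cash, self.shares = 0, 1_000.0, 0.0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.prices[self.t], self.shares], dtype=np.float32)

    def step(self, action):
        price = float(self.prices[self.t])
        if action == 1 and self.cash >= price:
            self.cash -= price; self.shares += 1
        elif action == 2 and self.shares > 0:
            self.cash += price; self.shares -= 1
        self.t += 1
        value_prev = self.cash + self.shares * price
        value_now = self.cash + self.shares * float(self.prices[self.t])
        done = self.t >= len(self.prices) - 1
        return self._obs(), value_now - value_prev, done, False, {}

prices = (100 + np.cumsum(np.random.randn(500))).astype(np.float32)
model = PPO("MlpPolicy", ToyTradingEnv(prices), verbose=0)
model.learn(total_timesteps=10_000)   # "training" stage; backtest by rolling the policy forward
```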

Paper: Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Title: Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Date: 8 May 2022

Field: computer vision, natural language processing

Tasks: vision-language

Paper link: arxiv.org/abs/2205.03…

Code: github.com/yuxie11/R2D…

Authors: Chunyu Xie, Heng Cai, Jianfei Song, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Henrique Morimitsu, Lin Yao, Dexin Wang, Dawei Leng, Xiangyang Ji, Yafeng Deng

Summary: Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks.

Abstract: Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. A complete and fair benchmark (i.e., including large-scale pre-training datasets and diverse downstream tasks) is essential for VLP. While there are plenty of benchmarks with English corpus, building a rich benchmark for VLP with other languages, such as Chinese, remains a critical problem. To this end, we build a large-scale Chinese cross-modal benchmark called Zero for the research community to fairly compare VLP models. We release two pre-training datasets and five fine-tuning datasets for downstream tasks. Alongside, we propose a novel pre-training framework of pre-Ranking + Ranking for cross-modal learning. Specifically, we apply global contrastive pre-ranking to learn the individual representations of images and texts, respectively. We then fuse the representations in a fine-grained ranking manner via an image-text cross encoder and a text-image cross encoder. To further enhance the capability of the model, we propose a two-way distillation strategy consisting of target-guided distillation and feature-guided distillation. For brevity, we name our model R2D2. We achieve state-of-the-art performance on four public cross-modal datasets and the proposed five downstream datasets. When conducting zero-shot tasks on Flickr30k-CN, COCO-CN, and MUGE, R2D2 pre-trained on a 250 million dataset achieves significant improvements of 4.7%, 5.4%, and 6.3% in mean recall compared to the state-of-the-art. The datasets, models, and codes are available at github.com/yuxie11/R2D…
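For the "global contrastive pre-ranking" stage, a natural reading is the standard symmetric in-batch contrastive (CLIP-style InfoNCE) objective over pooled image and text embeddings, sketched below; the later cross-encoder ranking and two-way distillation stages are not shown, and the pooling and temperature choices are assumptions rather than the yuxie11/R2D2 code.

```python
# Symmetric in-batch contrastive loss over matched image-text pairs.
import torch
import torch.nn.functional as F

def contrastive_pre_ranking_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # img_feat, txt_feat: (batch, dim) pooled embeddings of matched image-text pairs
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)   # matched pairs lie on the diagonal
    # Symmetric loss: image-to-text and text-to-image retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_pre_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```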

We are ShowMeAI, dedicated to spreading high-quality AI content and sharing industry solutions, using knowledge to accelerate every step of technical growth! Click to view the article archive, subscribe to the topic #ShowMeAI资讯日报 in the official account to receive the latest daily updates, and click Topic Collections & Monthly e-Magazine to quickly browse the complete collections by topic.


Reposted from juejin.im/post/7108203596592152589