Real-time tracking of scientific research trends | New papers selected on September 25 from Microsoft, Nanyang Technological University, Meta AI, and other institutions

As a researcher, you need to search and skim large volumes of academic literature every day to keep up with the latest scientific progress and results. Traditional retrieval and reading workflows, however, can no longer keep pace with these needs.

AMiner AI is a literature knowledge tool that integrates retrieval, reading, and knowledge Q&A. It helps you search and read papers more efficiently, keeps you up to date with the latest research trends in your field, and makes research work more comfortable.

Combined with the frontier-news subscription feature, AMiner AI selects the day's most popular new arXiv papers and compiles them into a digest, so everyone can catch up on cutting-edge developments more quickly.

If you want to have an in-depth conversation about a particular paper, you can copy the paper's link into your browser, or go directly to the AMiner AI page: https://www.aminer.cn/chat/g/explain

List of selected new papers on September 25, 2023:

1.CodePlan: Repository-level Coding using LLMs and Planning

The paper observes that, in software engineering, editing an entire code repository (for example, migrating a package, fixing error reports from static analysis or testing, or adding type annotations or other specifications to a codebase) is a complex task that traditional methods cannot solve directly. Recent LLM-powered tools such as GitHub Copilot provide high-quality solutions to localized coding problems, but they cannot be applied directly to repository-wide editing: code across the repository is interdependent, and the whole repository may be too large to fit into the model's input. The paper therefore proposes CodePlan, a framework that frames repository-wide editing as a planning problem. CodePlan decomposes the editing process into a multi-step plan, where each step edits one location in the repository via an LLM call whose context is derived from the entire repository, the edits made so far, and task-specific instructions. CodePlan rests on a novel combination of incremental dependency analysis, change may-impact analysis, and an adaptive planning algorithm. Experiments show that CodePlan's edits match the ground truth more closely than the baselines'; on validity checks (such as building without errors and making the correct edits), CodePlan passes on 5 of 6 repositories, while the baselines fail the validity checks on all of them.

https://www.aminer.cn/pub/6510edb83fda6d7f06b90db1/?f=cs
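To make the planning idea concrete, here is a minimal sketch of a CodePlan-style adaptive editing loop, reconstructed only from the description above. The `llm_edit` callable and the dependency-graph layout are hypothetical stand-ins, not the paper's actual interfaces.

```python
from collections import deque

def plan_repository_edit(seed_locations, dependency_graph, llm_edit):
    """Repository-level editing as adaptive planning (conceptual sketch).

    dependency_graph: dict mapping a code location to the locations that
        depend on it (from incremental dependency analysis).
    llm_edit: callable(location, context) -> (new_code, interface_changed);
        a hypothetical wrapper around an LLM editing call.
    """
    frontier = deque(seed_locations)  # the plan, maintained as a work queue
    edits = {}                        # location -> new code
    while frontier:
        loc = frontier.popleft()
        if loc in edits:
            continue
        # Context for this step: the edits made so far (plus, in the real
        # system, task-specific instructions and repository-wide information).
        new_code, interface_changed = llm_edit(loc, edits)
        edits[loc] = new_code
        # Change-impact analysis: if the edit altered an interface, its
        # dependents may now need edits too, so the plan grows adaptively.
        if interface_changed:
            frontier.extend(dependency_graph.get(loc, []))
    return edits
```

The key property mirrored here is that the plan is not fixed up front: each edit's observed impact decides which locations get scheduled next.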

2.MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

This research addresses data augmentation for large-vocabulary instance segmentation. It proposes MosaicFusion, a diffusion-based data augmentation method that is training-free, requires no label supervision, and turns off-the-shelf text-to-image diffusion models into useful dataset generators. The method generates multiple instances simultaneously by dividing the image canvas into several regions and running a single round of the diffusion process conditioned on a different text prompt per region. The corresponding instance masks are then obtained by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Experiments show that MosaicFusion can generate large amounts of synthetic labeled data, especially for rare and novel categories, and significantly improves the performance of existing instance segmentation models on the LVIS long-tail and open-vocabulary benchmarks.

https://www.aminer.cn/pub/6510edb83fda6d7f06b90fd8/?f=cs
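As a rough illustration of the pipeline, the sketch below mosaics per-region generations and thresholds attention maps into masks. For simplicity it runs one diffusion call per region, whereas the paper runs a single joint diffusion round over the shared canvas, and `run_diffusion_with_attention` is a hypothetical stub returning an image patch plus the aggregated cross-attention map for the object prompt.

```python
import numpy as np

def mosaic_fusion(prompts, run_diffusion_with_attention,
                  canvas_hw=(1024, 1024), grid=(2, 2), rel_thresh=0.5):
    """Toy MosaicFusion-style generator: a mosaic image plus instance masks."""
    H, W = canvas_hw
    rows, cols = grid
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    instances = []
    for i, prompt in enumerate(prompts[: rows * cols]):
        r, c = divmod(i, cols)
        y0, y1 = r * H // rows, (r + 1) * H // rows
        x0, x1 = c * W // cols, (c + 1) * W // cols
        # Each region is conditioned on its own text prompt; the stub returns
        # the generated patch and a cross-attention map for the object token,
        # already aggregated over layers and diffusion time steps.
        patch, attn = run_diffusion_with_attention(prompt, size=(y1 - y0, x1 - x0))
        canvas[y0:y1, x0:x1] = patch
        # Simple relative thresholding yields the instance mask; the paper
        # additionally applies edge-aware refinement.
        mask = np.zeros((H, W), dtype=bool)
        mask[y0:y1, x0:x1] = attn > rel_thresh * attn.max()
        instances.append({"category": prompt, "mask": mask})
    return canvas, instances
```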

3.Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models

The research points out that large language models (LLMs) often struggle with complex dialogue-understanding tasks. To address this, the study proposes a strategy called "Self-Explanation," which improves understanding by having the model analyze each dialogue utterance before executing the task. Experiments show that this method consistently outperforms other zero-shot prompts on six benchmark datasets and matches or exceeds few-shot prompting, demonstrating its potential for improving LLMs' comprehension of complex dialogue tasks.

https://www.aminer.cn/pub/6510edb83fda6d7f06b90f71/?f=cs
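Since the strategy is purely a prompting scheme, it is easy to sketch. The prompt wording below is an assumption for illustration (the paper's exact template may differ), and `chat` stands for any LLM completion function.

```python
def self_explanation_prompt(utterances, task_instruction):
    """Ask the model to explain every utterance before doing the task."""
    dialogue = "\n".join(f"Turn {i + 1}: {u}" for i, u in enumerate(utterances))
    return (
        f"Dialogue:\n{dialogue}\n\n"
        "First, briefly explain what each turn above means and what the "
        "speaker intends. Then complete the task:\n"
        f"{task_instruction}"
    )

def answer_with_self_explanation(utterances, task_instruction, chat):
    # `chat` is a hypothetical stand-in: prompt string in, completion out.
    return chat(self_explanation_prompt(utterances, task_instruction))
```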

4.Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

This article describes how to efficiently prune multilingual automatic speech recognition (ASR) models using an adaptive masking approach. The authors note that neural network pruning can compress multilingual ASR models with almost no loss in performance, but each language requires multiple rounds of pruning and retraining, which is tedious. They therefore propose an adaptive masking method that efficiently prunes a multilingual ASR model in two scenarios, yielding either sparse monolingual models or a sparse multilingual model (called Dynamic ASR Pathways). The method adapts the sub-network dynamically during training, avoiding premature commitment to a fixed sub-network structure. Results show that it outperforms existing pruning methods when targeting sparse monolingual models. Furthermore, Dynamic ASR Pathways finds and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, reducing the need for language-specific pruning.

https://www.aminer.cn/pub/6510edb83fda6d7f06b90fc0/?f=cs
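A conceptual sketch of the adaptive-masking idea, assuming PyTorch, is shown below: magnitude-based masks are periodically re-derived during sparse training so the sub-network can change, rather than being fixed after an initial pruning pass. The schedule and sparsity target are illustrative, not the authors' settings.

```python
import torch

def magnitude_mask(weight, sparsity):
    """Binary mask that prunes the smallest-magnitude weights."""
    n_prune = int(weight.numel() * sparsity)
    if n_prune == 0:
        return torch.ones_like(weight)
    thresh = weight.abs().flatten().kthvalue(n_prune).values
    return (weight.abs() > thresh).float()

def train_with_adaptive_masks(model, batches, loss_fn,
                              sparsity=0.7, remask_every=100, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    masks = {}
    for step, (x, y) in enumerate(batches):
        # Periodically re-derive the masks so the sub-network ("pathway")
        # adapts during training instead of being fixed early.
        if step % remask_every == 0:
            masks = {n: magnitude_mask(p.data, sparsity)
                     for n, p in model.named_parameters() if p.dim() > 1}
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])  # enforce the current sparsity pattern
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```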

5.Robotic Offline RL from Internet Videos via Value-Function Pre-Training

The paper discusses applying Internet videos to robotic reinforcement learning. Pre-training on Internet data has proven to be a key enabler of broad generalization in many modern machine learning systems, but transferring this capability to robotic reinforcement learning remains challenging. Offline RL methods can leverage datasets of robot experience, yet they suffer from a "type mismatch" with video data: videos (such as Ego4D) provide only observations, lacking the action and reward annotations that RL methods require. The paper therefore develops a system based entirely on temporal-difference learning that can use large-scale human video datasets for robotic offline RL. It shows that value learning on video datasets yields representations better suited to downstream robotic offline RL. The system combines the benefits of pre-training on video data with offline RL on diverse robot data, producing value functions and policies that perform better, act more robustly, and generalize more broadly in manipulation tasks. The framework achieves significant improvements over prior methods on several manipulation tasks on a real WidowX robot.

https://www.aminer.cn/pub/6510edb83fda6d7f06b90fd7/?f=cs
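To make the "action-free TD" idea concrete, here is a toy sketch of value pre-training on a single video, assuming PyTorch. The reward definition (-1 per step until the final frame, treated as the goal), the encoder interface, and all hyperparameters are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Predicts a scalar value from an encoded frame."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def td_pretrain_on_video(encoder, frames, gamma=0.98, steps=1000, batch=64):
    """frames: [T, C, H, W] from a human video; note: no actions are needed."""
    value = ValueHead()
    params = list(value.parameters()) + list(encoder.parameters())
    opt = torch.optim.Adam(params, lr=3e-4)
    T = frames.shape[0]
    for _ in range(steps):
        t = torch.randint(0, T - 1, (batch,))
        z, z_next = encoder(frames[t]), encoder(frames[t + 1])
        # Illustrative reward: 0 on reaching the last ("goal") frame, -1
        # otherwise, so the value approximates negative time-to-goal.
        done = (t + 1 == T - 1).float()
        reward = done - 1.0
        target = reward + gamma * (1.0 - done) * value(z_next).detach()
        loss = ((value(z) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder, value
```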

6.DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

The paper notes that Vision Transformers (ViTs) with self-attention are competitive in computer vision and enable global information sharing. However, the quadratic complexity of self-attention makes ViTs computationally expensive, and their lack of inductive biases for local information and translation equivariance means they need larger model sizes than convolutional neural networks (CNNs) to learn visual features effectively. To address this, the paper proposes DualToken-ViT, a lightweight and efficient vision Transformer that combines the advantages of CNNs and ViTs. DualToken-ViT achieves an efficient attention structure by fusing local information obtained from a convolutional structure with global information obtained via self-attention. In addition, the paper uses position-aware global tokens to enrich the global information; because these tokens also carry the image's positional information, the model is better suited to vision tasks. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate DualToken-ViT's effectiveness: on ImageNet-1K, models at two scales reach 75.4% and 79.4% accuracy with only 0.5G and 1.0G FLOPs, respectively, and the 1.0G-FLOPs model outperforms LightViT-T, which also uses global tokens, by 0.7%.

https://www.aminer.cn/pub/6510eda73fda6d7f06b901e3/?f=cs
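A simplified sketch of the dual-token idea, assuming PyTorch: a depthwise-convolutional branch provides local features, image tokens attend to a small set of position-aware global tokens for global context, and the two are fused. The dimensions, token count, and fusion layer are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    """Fuses a local (convolutional) branch with attention over global tokens."""
    def __init__(self, dim=64, num_global=8, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise: local detail
        self.global_tokens = nn.Parameter(torch.randn(1, num_global, dim))
        self.pos = nn.Parameter(torch.randn(1, num_global, dim))    # position-aware part
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                          # x: [B, C, H, W]
        B, C, H, W = x.shape
        local = self.local(x)
        # Image tokens query the position-augmented global tokens.
        q = x.flatten(2).transpose(1, 2)           # [B, H*W, C]
        kv = (self.global_tokens + self.pos).expand(B, -1, -1)
        glob, _ = self.attn(q, kv, kv)
        glob = glob.transpose(1, 2).reshape(B, C, H, W)
        return self.fuse(torch.cat([local, glob], dim=1))
```

For example, `DualTokenBlock(dim=64)(torch.randn(2, 64, 14, 14))` returns a tensor of the same shape; attention cost scales with the number of global tokens rather than quadratically in the number of image tokens.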
