Real-time tracking of research developments | The first pixel-grounded large multimodal model: selected papers for November 7

As a researcher, you need to search and read a large volume of academic literature every day to keep up with the latest scientific and technological progress and research results.

However, traditional retrieval and reading methods can no longer meet the needs of scientific researchers.

AMiner AI is a literature knowledge tool that integrates retrieval, reading, and knowledge Q&A. It helps you search and read papers more efficiently, keeps you up to date with the latest research trends in your field, and makes research work easier.


If you want to have an in-depth conversation about a particular paper, you can paste the paper link into your browser or go directly to the AMiner AI page: https://www.aminer.cn/chat/g/explain

List of selected new papers on November 7, 2023:

1.Relax: Composable Abstractions for End-to-End Dynamic Machine Learning

This paper introduces Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Dynamic shape computation has become critical, especially in emerging large language models, and the success of these models has created a need to deploy them across diverse backend environments. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across a program. It also introduces a cross-level abstraction that encapsulates computation graphs, loop-level tensor programs, and library calls in a single representation, enabling cross-level optimizations. The authors build an end-to-end compilation framework on this approach to optimize dynamic-shape models. Experimental results on large language models show that Relax delivers performance comparable to state-of-the-art hand-optimized systems across a variety of platforms and enables the deployment of emerging dynamic models to a wider range of environments, including mobile phones, embedded devices, and web browsers.
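
To make the idea of first-class symbolic shape annotations more concrete, here is a minimal sketch in plain Python (not the actual Relax API): it shows how a compiler can carry a named symbolic dimension, such as an unknown sequence length, through shape inference instead of treating the whole shape as unknown.

```python
# Illustrative sketch (not the Relax API): first-class symbolic shapes let a
# compiler track dynamic dimensions across operators instead of erasing them.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class SymVar:          # a named symbolic dimension, e.g. the sequence length "n"
    name: str

Dim = Union[int, SymVar]

def matmul_shape(a: tuple, b: tuple) -> tuple:
    """Infer (m, k) @ (k, n) -> (m, n), allowing symbolic dims."""
    (m, k1), (k2, n) = a, b
    assert k1 == k2, f"inner dims must match: {k1} vs {k2}"
    return (m, n)

n = SymVar("n")                        # unknown batch/sequence length
x_shape = (n, 4096)                    # activations with a dynamic leading dim
w_shape = (4096, 11008)                # static weight shape
print(matmul_shape(x_shape, w_shape))  # (SymVar(name='n'), 11008)
```

In Relax itself, such symbolic dimensions are tracked across the whole program and shared across abstraction levels, which is what enables the cross-level optimizations described above.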

https://www.aminer.cn/pub/65499d88939a5f4082be98c0/?f=cs

2.S-LoRA: Serving Thousands of Concurrent LoRA Adapters

The paper introduces S-LoRA, a system for serving a large number of Low-Rank Adaptation (LoRA) adapters at scale. When deploying large language models, the "pretrain-then-finetune" paradigm is commonly adopted, and LoRA, a parameter-efficient fine-tuning method, is often used to adapt a base model to many tasks, resulting in a large number of LoRA adapters derived from a single base model. The authors observe that this paradigm presents significant opportunities for batched inference during serving. To exploit these opportunities, they propose S-LoRA, a system for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging, which uses a unified memory pool to manage dynamic adapter weights of different ranks and KV cache tensors of different sequence lengths. In addition, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Together, these features allow S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with small overhead. Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM, S-LoRA improves throughput by up to 4x and increases the number of served adapters by several orders of magnitude. S-LoRA thus enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
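
The unified paging idea can be illustrated with a small sketch. The code below is only an assumption-laden toy (the page size, tensor shapes, and the `UnifiedPool` interface are hypothetical, not S-LoRA's implementation); it shows how one paged memory pool can hold both KV-cache entries and adapter weights, so that fragmentation is bounded by the page granularity.

```python
# Illustrative sketch (assumptions, not the S-LoRA implementation): a single
# paged memory pool holds both KV-cache entries and LoRA adapter weights.
import torch

PAGE_TOKENS = 16          # tokens (or adapter rows) per page; hypothetical granularity
HIDDEN = 4096

class UnifiedPool:
    def __init__(self, num_pages: int, device="cpu"):
        # One big buffer; each page can hold either KV entries or adapter rows.
        self.pages = torch.empty(num_pages, PAGE_TOKENS, HIDDEN, device=device)
        self.free = list(range(num_pages))
        self.owner = {}   # page id -> ("kv", seq_id) or ("lora", adapter_id)

    def alloc(self, n_pages: int, owner):
        ids = [self.free.pop() for _ in range(n_pages)]
        for i in ids:
            self.owner[i] = owner
        return ids

    def release(self, ids):
        for i in ids:
            del self.owner[i]
            self.free.append(i)

pool = UnifiedPool(num_pages=256)
kv_pages = pool.alloc(8, ("kv", "request-0"))          # a growing sequence
lora_pages = pool.alloc(4, ("lora", "adapter-r16"))    # a rank-16 adapter's weights
pool.release(kv_pages)                                 # freed pages return to the shared pool
```

Because KV cache and adapter weights draw pages from the same pool, memory freed by a finished request can immediately be reused for a newly activated adapter, which is the fragmentation benefit the paper describes.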

https://www.aminer.cn/pub/65499d90939a5f4082bea069/?f=cs

3.MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning

The paper introduces MFTCoder, a multi-task fine-tuning framework for improving the performance of code LLMs. Existing code LLMs are typically improved by fine-tuning them for a specific downstream task or scenario, but this approach requires separate fine-tuning for each task, consumes substantial training resources, and is difficult to deploy and maintain. Moreover, it fails to exploit the inherent connections between different code-related tasks. MFTCoder adopts a multi-task learning framework that fine-tunes on multiple tasks in parallel at the same time. By combining various loss functions, it effectively addresses common problems in multi-task learning such as data imbalance, varying difficulty levels, and inconsistent convergence speeds. Experimental results show that MFTCoder's multi-task fine-tuning outperforms both single-task fine-tuning and fine-tuning on a mixed task set across multiple tasks. In addition, MFTCoder offers efficient training capabilities, including efficient data tokenization and PEFT fine-tuning, which significantly improve speed compared with traditional fine-tuning methods. MFTCoder integrates seamlessly with mainstream open-source code LLMs such as CodeLlama and Qwen. Using CodeLlama as the base model, the MFTCoder fine-tuned model achieves a pass@1 score of 74.4% on the HumanEval benchmark, exceeding GPT-4's performance (67%, zero-shot). MFTCoder is open-sourced on GitHub.
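
As an illustration of how combining loss functions can counteract data imbalance in multi-task fine-tuning, the sketch below shows one simple weighting scheme (equal per-task averaging). It is a generic example, not MFTCoder's exact objective, and the task names and batch sizes are made up.

```python
# Illustrative sketch (a generic scheme, not MFTCoder's formulation): combine
# per-task losses so that large tasks do not dominate a multi-task step.
import torch

def multitask_loss(per_task_token_losses: dict) -> torch.Tensor:
    """Average token losses within each task first, then average across tasks,
    so every task contributes equally regardless of its share of the batch."""
    task_means = [losses.mean() for losses in per_task_token_losses.values()]
    return torch.stack(task_means).mean()

batch_losses = {
    "code-completion":  torch.rand(512),   # many tokens from a large task
    "unit-test-gen":    torch.rand(64),    # few tokens from a small task
    "code-translation": torch.rand(128),
}
loss = multitask_loss(batch_losses)
loss_naive = torch.cat(list(batch_losses.values())).mean()  # size-biased baseline
print(float(loss), float(loss_naive))
```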

https://www.aminer.cn/pub/65499d88939a5f4082be9990/?f=cs

4.Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

The paper proposes PASTA (Post-hoc Attention STeering Approach), a method for steering large language models (LLMs) toward user-specified information. Plain-text prompts offer no built-in way to emphasize such information. PASTA directs the model's attention to user-specified parts of the input by identifying a small subset of attention heads and applying precise attention reweighting on them. Like prompting, PASTA is applied at inference time and does not change any model parameters. Experiments show that PASTA significantly improves an LLM's ability to follow user instructions or integrate new knowledge from user input, yielding substantial performance gains across a variety of tasks, for example an average accuracy improvement of 22% for LLaMA-7B.
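
The following sketch illustrates post-hoc attention steering in the spirit of PASTA: for a chosen subset of heads, attention mass outside a user-highlighted span is downscaled and the rows are renormalized, all at inference time. The head indices, scaling factor, and function interface here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (assumptions, not the PASTA implementation): reweight the
# attention of selected heads toward a user-highlighted span, then renormalize.
import torch

def steer_attention(attn: torch.Tensor, highlight: torch.Tensor,
                    heads: list, alpha: float = 0.01) -> torch.Tensor:
    """attn: (n_heads, q_len, k_len) post-softmax attention for one layer.
    highlight: boolean mask (k_len,) marking the user-specified span.
    Downscale non-highlighted keys by alpha for the selected heads, renormalize."""
    out = attn.clone()
    scale = torch.where(highlight,
                        torch.ones_like(highlight, dtype=attn.dtype),
                        torch.full_like(highlight, alpha, dtype=attn.dtype))
    out[heads] = out[heads] * scale                               # emphasize the span
    out[heads] = out[heads] / out[heads].sum(-1, keepdim=True)    # re-normalize rows
    return out

attn = torch.softmax(torch.randn(32, 10, 10), dim=-1)   # toy attention maps
highlight = torch.zeros(10, dtype=torch.bool)
highlight[3:6] = True                                    # user-specified tokens 3..5
steered = steer_attention(attn, highlight, heads=[0, 5, 17])
```

No parameters change; only the attention maps of the chosen heads are edited during the forward pass, which is what makes the approach post-hoc.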

https://www.aminer.cn/pub/65499d88939a5f4082be9966/?f=cs

5.Ultra-Long Sequence Distributed Transformer

The paper proposes a novel distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformer models on long sequences. It splits a long sequence into segments and distributes them among GPUs. Each GPU computes partial self-attention for its segment, and fused communication together with a novel double gradient averaging technique avoids the need to aggregate partial self-attention results, minimizing communication overhead. Evaluated on the Wikipedia enwik8 dataset, the approach is 5.6x faster and 10.2x more memory efficient than state-of-the-art Nvidia sequence parallelism on 144 Nvidia V100 GPUs. Furthermore, the algorithm scales to an extreme sequence length of 50,112 on 3,456 GPUs, achieving 161% superlinear parallel efficiency and a throughput of 32 petaflops.
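
The sketch below shows the generic sequence-parallel pattern that this line of work builds on: each worker attends from its own chunk of queries to the full key/value sequence, and the concatenated partial outputs equal the single-worker result. It deliberately omits the paper's fused communication and double gradient averaging, and the shapes and worker count are arbitrary.

```python
# Illustrative sketch (generic sequence parallelism, not the LSS Transformer's
# communication scheme): split queries across workers, compute partial attention.
import torch

def partial_attention(q_chunk, k_full, v_full):
    scores = q_chunk @ k_full.transpose(-1, -2) / k_full.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_full

seq_len, d, n_workers = 1024, 64, 4
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

# Each "worker" owns one contiguous chunk of queries (in practice, one GPU each).
chunks = [partial_attention(qc, k, v) for qc in q.chunk(n_workers, dim=0)]
out_parallel = torch.cat(chunks, dim=0)

out_reference = partial_attention(q, k, v)                      # single-worker reference
print(torch.allclose(out_parallel, out_reference, atol=1e-5))   # True
```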

https://www.aminer.cn/pub/65499d88939a5f4082be99e1/?f=cs

6.Levels of AGI: Operationalizing Progress on the Path to AGI

The paper proposes a framework for classifying the capabilities and behavior of artificial general intelligence (AGI) models and their precursors. The framework introduces levels of AGI based on performance, generality, and autonomy. The authors hope it will be useful in the same way as the levels of autonomous driving, providing a common language for comparing models, assessing risks, and measuring progress along the path to AGI. In developing the framework, the authors analyze existing definitions of AGI and distill six principles that a useful AGI ontology should satisfy, including focusing on capabilities rather than mechanisms, evaluating generality and performance separately, and defining stages along the path to AGI rather than focusing on the endpoint. Based on these principles, they propose "Levels of AGI" defined by depth (performance) and breadth (generality) of capability, and consider how current systems fit into this ontology. They discuss the challenging requirements for future benchmarks that quantitatively measure the behavior and capabilities of AGI models against these levels. Finally, they discuss how these AGI levels interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting human-AI interaction paradigms for the responsible and safe deployment of highly capable AI systems.

https://www.aminer.cn/pub/65499d88939a5f4082be9a34/?f=cs

7.GLaMM: Pixel Grounding Large Multimodal Model

This paper introduces GLaMM, the first pixel-level grounded large multimodal model. Large multimodal models (LMMs) extend large language models to the visual domain. Earlier work used whole images and text prompts to generate ungrounded textual responses, and recent region-level LMMs generate visually grounded responses, but they can only handle one object category at a time, require the user to specify regions in the input, or cannot provide dense pixel-level object grounding. GLaMM is the first model that can generate natural language responses seamlessly interleaved with the corresponding object segmentation masks. GLaMM can not only ground objects that appear in the conversation but is also flexible enough to accept text and optional visual prompts (regions of interest) as input, allowing users to interact with the model at different levels of granularity in both the textual and visual domains. Because standard benchmarks for generating visually grounded detailed conversations are lacking, the authors introduce a comprehensive evaluation protocol with carefully curated grounded conversations. Their proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale. To this end, they propose the densely annotated Grounding-anything Dataset (GranD), built with an automated annotation pipeline and containing 7.5 million unique concepts grounded in a total of 810 million regions with segmentation masks. Beyond GCG, GLaMM also performs well on downstream tasks such as referring expression segmentation, image and region-level captioning, and vision-language conversations.
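
To picture what a grounded conversational response looks like, here is a hypothetical data structure (not GLaMM's actual output format) that ties phrases in the generated text to pixel-level segmentation masks; the example sentence, offsets, and image size are made up.

```python
# Illustrative sketch (hypothetical structure, not GLaMM's output format): a
# grounded response pairs phrase spans in the text with segmentation masks.
from dataclasses import dataclass
import numpy as np

@dataclass
class GroundedPhrase:
    text: str                 # the phrase as it appears in the response
    start: int                # character offsets into the full response
    end: int
    mask: np.ndarray          # binary segmentation mask, shape (H, W)

@dataclass
class GroundedResponse:
    text: str
    phrases: list

h, w = 480, 640
resp = GroundedResponse(
    text="A person is walking a dog on the beach.",
    phrases=[
        GroundedPhrase("person", 2, 8, np.zeros((h, w), dtype=bool)),
        GroundedPhrase("dog", 22, 25, np.zeros((h, w), dtype=bool)),
        GroundedPhrase("beach", 33, 38, np.zeros((h, w), dtype=bool)),
    ],
)
for p in resp.phrases:
    assert resp.text[p.start:p.end] == p.text   # spans stay aligned with the text
```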

https://www.aminer.cn/pub/65499e11939a5f4082beca5d/?f=cs

8.CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

This paper introduces CoVLM (Composing Visual Entities and Relationships in Large Language Models via Communicative Decoding), which guides a large language model (LLM) to explicitly compose visual entities and the relationships between them in text, and to communicate dynamically with the visual encoder and detection network, enabling vision-language communicative decoding. Specifically, the authors first design a set of novel communication tokens for dynamic communication between the visual detection system and the language system. The LLM generates a communication token for a visual entity or relationship, informing the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation conditioned on the relevant regions. The LLM is thus able to compose visual entities and relationships through the communication tokens. Language-to-vision and vision-to-language communication proceed iteratively until an entire sentence is generated. The framework seamlessly bridges the gap between visual perception and LLMs and significantly outperforms previous VLMs on compositional reasoning benchmarks, while also achieving state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
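
The communicative decoding loop can be sketched as follows. All function names, the communication token strings, and the toy outputs are placeholders invented for illustration, not CoVLM's actual API; the point is only the control flow in which a special token emitted by the LLM triggers the detection network and the resulting ROIs are fed back into the decoding context.

```python
# Illustrative sketch (stubbed, hypothetical interfaces; not CoVLM's API):
# the iterative language-to-vision-to-language loop during decoding.
from collections import deque

COMM_TOKENS = {"<visual>", "<previsual>"}   # hypothetical special tokens

# Toy "model outputs" so the loop below runs end to end.
_toy_tokens = deque(["a", "dog", "<visual>", "is", "running", "<eos>"])

def llm_step(context):
    """Stand-in for the LLM's next-token step."""
    return _toy_tokens.popleft()

def detect_regions(image, context):
    """Stand-in for the detection network proposing ROIs for the text so far."""
    return [(10, 20, 80, 90)]                 # one toy bounding box

def encode_rois(image, rois):
    """Stand-in for the visual encoder turning ROIs into embeddings."""
    return [f"<roi:{r}>" for r in rois]       # placeholder "embeddings"

def communicative_decode(image, prompt, max_steps=64):
    context = list(prompt)
    for _ in range(max_steps):
        token = llm_step(context)
        if token in COMM_TOKENS:
            rois = detect_regions(image, context)      # language -> vision
            context.extend(encode_rois(image, rois))   # vision -> language
        else:
            context.append(token)
        if token == "<eos>":
            break
    return context

print(communicative_decode("image.jpg", ["<image>", "Describe the scene:"]))
```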

https://www.aminer.cn/pub/65499e0e939a5f4082bec940/?f=cs

9.Ziya2: Data-centric Learning is All LLMs Need

This paper introduces Ziya2, a large language model (LLM) that adopts LLaMA2 as its foundation model and is further pre-trained on 70 billion tokens. The work focuses on pre-training techniques and data-centric optimization to enhance Ziya2's learning process at different stages. Experimental results show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially encouraging results compared with representative open-source models.

https://www.aminer.cn/pub/65499deb939a5f4082bebeea/?f=cs

10.Co-training and Co-distillation for Quality Improvement and Compression of Language Models

This paper proposes CTCD (Co-Training and Co-Distillation), a framework that improves both quality and inference speed by training two models jointly and transferring knowledge between them in both directions. The framework rests on two key findings: 1) during joint training, transferring knowledge from the small model to the large model improves the large model's performance; and 2) the improved performance of the large model in turn further improves the performance of the small model. CTCD has strong potential to be combined with existing techniques such as architecture design or data augmentation, replacing one-way knowledge distillation and delivering further gains. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled with CTCD outperforms the original large model by 1.66 points on the GLUE benchmark.
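
A generic bidirectional distillation step, in the spirit of co-training plus co-distillation, might look like the sketch below. The temperature, loss weighting, and detaching of the teacher side are standard knowledge-distillation choices assumed for illustration, not the exact CTCD objective.

```python
# Illustrative sketch (a generic bidirectional-distillation step, not the exact
# CTCD objective): each model is trained on the task loss while also matching
# the other model's softened predictions.
import torch
import torch.nn.functional as F

def co_distillation_step(logits_small, logits_large, labels, tau=2.0, lam=0.5):
    def distill(student_logits, teacher_logits):
        # KL between softened distributions; the teacher side is detached so
        # knowledge flows into the student without backpropagating through it.
        return F.kl_div(
            F.log_softmax(student_logits / tau, dim=-1),
            F.softmax(teacher_logits.detach() / tau, dim=-1),
            reduction="batchmean",
        ) * tau ** 2

    loss_small = F.cross_entropy(logits_small, labels) + lam * distill(logits_small, logits_large)
    loss_large = F.cross_entropy(logits_large, labels) + lam * distill(logits_large, logits_small)
    return loss_small + loss_large

logits_s = torch.randn(8, 10, requires_grad=True)   # toy batch, 10 classes
logits_l = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = co_distillation_step(logits_s, logits_l, labels)
loss.backward()
```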

https://www.aminer.cn/pub/65499d88939a5f4082be9bb8/?f=cs

11.VR-NeRF: High-Fidelity Virtualized Walkable Spaces

The paper introduces VR-NeRF, an end-to-end system for high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality using neural radiance fields (NeRF). To this end, the authors designed and built a custom multi-camera rig to densely capture walkable spaces as multi-view high dynamic range images of unprecedented quality and density. They extend Instant Neural Graphics Primitives with a novel perceptual color space for learning accurate high-dynamic-range appearance, and with an efficient mip-mapping mechanism for anti-aliased level-of-detail rendering, while carefully optimizing the trade-off between quality and speed. The system renders their neural radiance field models with high fidelity at the full VR resolution of dual 2K×2K at 36 Hz on their custom demo machine. The authors demonstrate the quality of the results on their challenging high-fidelity dataset, compare their method and dataset with existing baselines, and release the dataset on the project website.
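
The idea of learning HDR appearance in a perceptual color space can be illustrated with a toy loss. The mu-law-style encoding below is only a stand-in chosen for this sketch; the paper's actual perceptual color space may differ.

```python
# Illustrative sketch (a simple log-style stand-in, not the paper's actual
# perceptual color space): compute the reconstruction loss in a roughly
# perceptually uniform encoding so bright HDR values do not dominate training.
import torch

def perceptual_encode(hdr, mu=5000.0):
    """Mu-law-style compression of linear HDR radiance into values in [0, 1]."""
    return torch.log1p(mu * hdr.clamp(min=0)) / torch.log1p(torch.tensor(mu))

def hdr_loss(pred_hdr, target_hdr):
    return torch.mean((perceptual_encode(pred_hdr) - perceptual_encode(target_hdr)) ** 2)

pred = torch.rand(4, 3) * 100.0       # toy linear-HDR radiance samples
target = torch.rand(4, 3) * 100.0
print(float(hdr_loss(pred, target)))
```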

https://www.aminer.cn/pub/65499d88939a5f4082be9a84/?f=cs
