Real-time tracking of scientific research trends | New papers selected by Zhou Jingren, Sun Maosong, Gabriel Synnaeve and others on August 25, with ChatPaper reviews

As a researcher, you need to search and read a large amount of academic literature every day to keep up with the latest scientific and technological progress and research results. Traditional retrieval and reading methods, however, can no longer keep up with researchers' needs.
ChatPaper is a literature knowledge tool that integrates retrieval, reading, and knowledge Q&A. It helps you quickly improve the efficiency of finding and reading papers, stay on top of the latest research trends in your field, and make research work more comfortable.

Combined with the frontier-news subscription feature, ChatPaper selects the most popular new arXiv papers of the day and compiles them into a review, so that everyone can keep up with cutting-edge developments more quickly.
If you want to have an in-depth conversation about a particular paper, you can copy the paper link into your browser or go directly to the ChatPaper page: https://www.aminer.cn/chat/g/explain

List of selected new papers on August 25, 2023:

1. Code Llama: Open Foundation Models for Code (Read the original text)

Code Llama is a family of large language models for code, built on Llama 2, with leading performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. It is released in multiple variants to cover a wide range of applications: a foundation model (Code Llama), a Python specialization (Code Llama-Python), and an instruction-following model (Code Llama-Instruct), each available with 7B, 13B, and 34B parameters. All models are trained on sequences of 16k tokens and show improvements on inputs of up to 100k tokens. The 7B and 13B versions of Code Llama and Code Llama-Instruct support infilling based on surrounding content. Code Llama achieves state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% on HumanEval and 55% on MBPP. Notably, Code Llama-Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all of the models outperform every other publicly available model on the MultiPL-E benchmark. Code Llama is released under a permissive license that allows both research and commercial use.

https://www.aminer.cn/pub/64e82e45d1d14e646633f5aa/
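
As an illustration of the infilling capability described above, here is a minimal sketch of fill-in-the-middle generation with a released Code Llama checkpoint through Hugging Face Transformers; the checkpoint name, the <FILL_ME> placeholder convention, and the example prompt are assumptions based on the public release, not details taken from the paper summary.

# Minimal infilling sketch (assumptions noted above); the 7B checkpoint still
# needs a GPU with substantial memory to run.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The tokenizer splits the prompt at <FILL_ME> into a prefix and a suffix,
# and the model generates the missing middle from the surrounding context.
prompt = 'def remove_non_ascii(s: str) -> str:\n    """<FILL_ME>"""\n    return result\n'
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=64)
filling = tokenizer.batch_decode(output[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(prompt.replace("<FILL_ME>", filling))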

2. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (Read the original text)

The authors introduce the Qwen-VL series of models, including Qwen-VL and Qwen-VL-Chat, which demonstrate excellent performance on tasks such as image captioning, question answering, visual grounding, and flexible interaction. The evaluation covers a wide range of tasks, including zero-shot image captioning, visual question answering over images or documents, and visual grounding. The authors show that Qwen-VL performs better than existing large vision-language models. The article also describes the architecture, training methods, capabilities, and performance of these models, and highlights their contribution to advancing multimodal artificial intelligence.

https://www.aminer.cn/pub/64e826d63fda6d7f06c3150c/
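
For readers who want to try the released chat variant, here is a rough usage sketch based on the public Qwen-VL repository; the model ID, the from_list_format/chat helper methods (loaded via trust_remote_code), and the image path are assumptions and may differ from the current release.

# Rough usage sketch (assumptions noted above), not an official example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True).eval()

# Build a mixed image+text query; "demo.jpg" is a placeholder for a local image.
query = tokenizer.from_list_format([
    {"image": "demo.jpg"},
    {"text": "Describe this image and locate the main object."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)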

3. Large Language Model as Autonomous Decision Maker (Read the original text)

The paper points out that when large language models (LLMs) act as autonomous decision makers, their decision-making ability still relies heavily on the guidance of task-specific expert knowledge. To unleash the potential of LLMs as autonomous decision makers, this paper proposes a method called JuDec, which gives LLMs the ability to judge their own decisions, enabling autonomous judgment and exploration during decision-making. Specifically, JuDec designs a self-judgment mechanism based on Elo ratings: by comparing two solutions pairwise, it assigns Elo scores to decision steps to judge their value and utility, and then uses these scores to steer the decision search toward the optimal solution. Experimental results on the ToolBench dataset show that JuDec outperforms baseline models, improving the pass rate by more than 10% across different tasks. It delivers higher-quality solutions and reduces cost (ChatGPT API calls), highlighting its effectiveness and efficiency.

https://www.aminer.cn/pub/64e826d03fda6d7f06c2e109/
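
To make the Elo-based self-judgment idea concrete, here is a toy sketch; the judge call, the K factor, and the search bookkeeping are illustrative assumptions, not the paper's actual implementation.

# Toy Elo update for pairwise self-judgment: a judge (e.g. an LLM prompt)
# declares which of two candidate decision steps looks more promising, and
# the ratings accumulated over many comparisons decide which branch to expand.
def elo_update(rating_a, rating_b, a_wins, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

ratings = {"step_A": 1000.0, "step_B": 1000.0}
a_wins = True  # pretend a hypothetical LLM judge preferred step_A in this comparison
ratings["step_A"], ratings["step_B"] = elo_update(ratings["step_A"], ratings["step_B"], a_wins)
best_step = max(ratings, key=ratings.get)  # expand the higher-rated step first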

4. VIGC: Visual Instruction Generation and Correction (Read the original text)

The paper points out the lack of high-quality instruction-tuning data for vision-language tasks. Existing methods rely on language-only models to generate such data, but because those models lack an understanding of image details, this approach requires pre-annotated image captions and detected bounding boxes. To solve this problem, the paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which uses multimodal large language models (MLLMs) to generate instruction-tuning data and progressively improves its quality during generation. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data, while Visual Instruction Correction (VIC) uses an iterative update mechanism to correct inaccuracies in the data generated by VIG, thereby reducing the risk of false information. By leveraging the diverse, high-quality data generated by VIGC, mainstream models can be fine-tuned and the data quality verified through various evaluations. Experimental results show that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively improves benchmark performance.

https://www.aminer.cn/pub/64e826d63fda6d7f06c31394/
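
The generate-then-correct loop can be summarized with a schematic sketch; the mllm callable, the prompts, and the number of correction rounds are hypothetical placeholders for whatever multimodal model and wording the reader actually uses, not the paper's code.

# Schematic VIG/VIC loop (assumptions noted above).
def vigc_sample(mllm, image, correction_rounds=2):
    # mllm(image, prompt) is a hypothetical callable wrapping a multimodal LLM.
    instruction = mllm(image, "Ask one diverse question about this image.")  # VIG: generate
    answer = mllm(image, f"Answer the question about the image: {instruction}")
    for _ in range(correction_rounds):  # VIC: iteratively correct
        answer = mllm(
            image,
            "Check the answer below against the image and rewrite any detail "
            f"that is not clearly visible.\nQuestion: {instruction}\nAnswer: {answer}",
        )
    return {"image": image, "instruction": instruction, "answer": answer}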

5. Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI (Read the original text)

The paper introduces an AI-native game called "1001 Nights", in which a narrative co-created by the player and characters driven by a large language model becomes reality inside the game. The game concept was inspired by Wittgenstein's idea that the limits of a person's world are defined by the boundaries of his or her language. Using advanced AI tools such as GPT-4 and Stable Diffusion, the second version of the game enables the protagonist Shahrzad to turn words and stories into reality in her world. Players can steer the conversation with the AI king toward specific keywords, and these keywords become combat equipment in the game. This combination of interactive storytelling and text-to-image transformation challenges the traditional boundary between the game world and reality from a dual perspective. The paper focuses primarily on Shahrzad, her attempt to change her fate compared with the original folk tale, and the role of players who collaborate with the AI to create the narrative and shape the game world. The paper explores the technical and design elements of implementing such a game, aiming to enhance the narrative game genre with AI-generated content and to explore the possibilities of AI-native games.

https://www.aminer.cn/pub/64e826d63fda6d7f06c314d5/

6. DLIP: Distilling Language-Image Pre-training (Read the original text)

ChatPaper review: The paper points out that Vision-Language Pre-training (VLP) has made significant progress, but at the cost of ever larger model sizes, which poses challenges for deployment in practical applications. Existing knowledge distillation techniques lack in-depth study and analysis of VLP and offer no practical guidelines for VLP-oriented distillation. The paper therefore proposes a simple and efficient Distilling Language-Image Pre-training (DLIP) framework and uses it to explore how to distill lightweight VLP models. It analyzes model distillation from multiple dimensions, such as the architectural characteristics of different modules and the information transfer of different modalities. Through comprehensive experiments, the paper provides insights on achieving accuracy/efficiency trade-offs on different cross-modal tasks such as image-text retrieval, image captioning, and visual question answering. For example, DLIP compresses the parameters of BLIP by 1.9x, from 213M to 108M, while achieving comparable or better performance. Moreover, compared with the teacher model, DLIP retains more than 95% of the performance while using only a fraction of the parameters and FLOPs, and accelerates inference by 2.7x.

https://www.aminer.cn/pub/64e826d63fda6d7f06c31502/
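
As background for the distillation framing above, here is a generic soft-target knowledge-distillation loss in PyTorch; it is only a sketch of the general technique, and DLIP's actual objectives and module-level design choices are more involved.

# Generic soft-target distillation loss (a sketch, not DLIP's exact objective).
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The student matches the teacher's softened output distribution; the T^2
    # factor keeps gradient magnitudes comparable across temperature settings.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(4, 1000)  # dummy student predictions
teacher_logits = torch.randn(4, 1000)  # dummy teacher predictions
loss = soft_distillation_loss(student_logits, teacher_logits)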
