LLM Paper Weekly | Research on cutting-edge papers from Tsinghua University, Peking University, Meta AI and other institutions

A large language model (LLM) is an artificial intelligence model designed to understand and generate human language. LLMs are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and more. They are characterized by their large scale, often billions of parameters, which helps them learn complex patterns in linguistic data. These models are typically based on deep learning architectures such as the Transformer, which helps them achieve impressive performance on a variety of NLP tasks.

At the end of 2022, OpenAI launched ChatGPT, a large-scale language model based on GPT-3.5. Thanks to its excellent performance, ChatGPT and the large language models behind it quickly became a hot topic in artificial intelligence, attracting the attention and participation of a large number of researchers and developers.

This week we have selected 10 outstanding papers in the field of LLM, from institutions such as Meta AI, Peking University, and Tsinghua University.

1. Reinforcement Learning for Generative AI: A Survey

The paradigm currently used to train most generative models is maximum likelihood estimation, which captures and approximates the target data distribution by reducing the divergence between the model distribution and the target distribution. Although this approach successfully formalizes the objective of the generative task, it cannot satisfy all user requirements for generative models. Reinforcement learning, a competitive alternative that injects new training signals by creating new objectives, has demonstrated the power and flexibility to incorporate human inductive preferences from multiple perspectives (e.g., adversarial learning, manually designed rules, and learned reward models) to build high-performance models. Reinforcement learning has therefore become a research hotspot and has expanded the boundaries of generative AI in both model design and application. The article presents a comprehensive review of the progress made in this field in recent years. Although there have been some recent surveys of individual application areas, this article aims to provide a high-level overview across multiple application areas, with a rigorous taxonomy of the field and adequate coverage of various models and applications. Notably, it also surveys the rapidly growing field of large language models. The article concludes with potential directions that might address current model limitations and further expand the boundaries of generative AI.
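The contrast between the two training signals can be sketched in a few lines. This is a toy illustration, not code from the survey: `mle_loss` is the standard negative log-likelihood, while `reward_weighted_loss` is a minimal REINFORCE-style objective that scales each sample's log-likelihood by a reward (e.g., from a learned reward model), so the model is pushed toward high-reward outputs rather than toward matching a fixed data distribution.

```python
def mle_loss(log_probs):
    """Maximum-likelihood objective: average negative log-likelihood
    of the target sequences under the model."""
    return -sum(log_probs) / len(log_probs)


def reward_weighted_loss(log_probs, rewards, baseline=0.0):
    """REINFORCE-style objective: weight each sequence's negative
    log-likelihood by its (baseline-subtracted) reward, so high-reward
    samples are reinforced and low-reward ones are not."""
    return -sum(lp * (r - baseline)
                for lp, r in zip(log_probs, rewards)) / len(log_probs)
```

With rewards of 1.0 and 0.0 on two samples, only the rewarded sample contributes to the gradient signal, which is exactly the mechanism that lets human preferences reshape the training objective.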

Link: https://www.aminer.cn/pub/64ed716d3fda6d7f0658aa83

2. Nougat: Neural Optical Understanding for Academic Documents

The article notes that scientific knowledge is mainly stored in books and scientific journals, often as PDFs. However, the PDF format loses semantic information, especially for mathematical expressions. To address this, the authors propose Nougat, a Visual Transformer model that performs optical character recognition (OCR) on scientific documents and converts them into a markup language. By demonstrating the model's effectiveness on a new dataset of scientific documents, the authors show that this approach offers a promising way to improve the accessibility of scientific knowledge in the digital age, bridging the gap between human-readable documents and machine-readable text. The authors release the model and code to accelerate future work on scientific text recognition.

Link: https://www.aminer.cn/pub/64ec1b7e3fda6d7f06270245

3. InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

Multimodal large language models gain the ability to follow instructions through a two-stage process: pre-training on image-text pairs and fine-tuning on vision-language instruction data. Recent research shows that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. This paper introduces InstructionGPT-4, which is fine-tuned on a small dataset of only 200 examples, roughly 6% of the instruction-following data used to align MiniGPT-4. The authors first propose several metrics to evaluate the quality of multimodal instruction data. Based on these metrics, they build a simple yet effective data selector that automatically identifies and filters out low-quality vision-language data. With this approach, InstructionGPT-4 outperforms the original MiniGPT-4 on various evaluations (e.g., visual question answering, GPT-4 preference). Overall, the results show that a small amount of high-quality instruction fine-tuning data can effectively enable multimodal large language models to generate better outputs.
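A metric-based data selector of this kind can be sketched as follows. This is a minimal stand-in, not the paper's implementation: it assumes each metric is a callable returning a higher-is-better score, averages the scores per example, and keeps the top-k.

```python
def select_high_quality(examples, metrics, k):
    """Score every example on every quality metric, average the scores,
    and keep the k best examples, filtering out low-quality data."""
    scored = [(sum(m(ex) for m in metrics) / len(metrics), ex)
              for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]


# Illustrative usage with made-up examples and a single toy metric.
data = [
    {"id": 1, "quality": 0.9},
    {"id": 2, "quality": 0.1},
    {"id": 3, "quality": 0.5},
]
kept = select_high_quality(data, [lambda ex: ex["quality"]], k=2)
```

In the paper's setting the metrics would score properties of the vision-language instruction pairs; here they are placeholders to show the selection mechanics.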

Link: https://www.aminer.cn/pub/64e6d5bd3fda6d7f0652c7f8

4. Large Graph Models: A Perspective

The paper points out that large models have achieved major breakthroughs in artificial intelligence and machine learning, but in the graph domain they have not yet achieved the same success as in fields such as natural language processing and computer vision. To advance the adoption of large graph models, this perspective paper discusses the challenges and opportunities in developing them. It first discusses the desirable properties of large graph models, then examines three important perspectives in detail: representation basis, graph data, and graph models. Within each category, the paper briefly introduces recent advances, highlights remaining challenges, and offers an outlook. Finally, it discusses valuable applications of large graph models. The authors believe this perspective paper can encourage further exploration of large graph models, ultimately bringing us one step closer to artificial general intelligence (AGI).

Link: https://www.aminer.cn/pub/64ed716d3fda6d7f0658ab4a

5. Computation-efficient Deep Learning for Computer Vision: A Survey

While deep learning models have made great progress on computer vision tasks, the computing resources they require keep growing, posing challenges for real-world applications. Existing advanced models often demand large amounts of compute, which can translate into impractical power consumption, latency, or carbon emissions in real-world scenarios. To minimize computational cost during inference, the computer vision community has turned its attention to computation-efficient deep learning. This survey provides an extensive analysis of this rapidly growing field, covering four main aspects: 1) the development of static or dynamic lightweight backbone models for efficiently extracting discriminative deep representations; 2) specialized network structures or algorithms designed for specific computer vision tasks; 3) techniques for compressing deep learning models; and 4) strategies for deploying efficient deep networks on hardware platforms. The survey also systematically discusses key challenges facing the field, such as network architecture design, training schemes, practical efficiency, and more realistic model compression methods, as well as possible future research directions.

Link: https://www.aminer.cn/pub/64ed716d3fda6d7f0658a92f

6. LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

This paper is an overview of research on autonomous agents based on large language models. Previous research often focused on training agents in isolated environments with limited knowledge, far from how humans learn, which makes it difficult for agents to achieve human-like decision-making. In recent years, large language models (LLMs) have shown great potential for human-level intelligence by acquiring large amounts of web knowledge, triggering a surge in research on LLM-based autonomous agents. To fully exploit the potential of LLMs, researchers have designed a variety of agent architectures for different applications. This paper conducts a systematic review of these studies as a whole, focusing on the construction of LLM-based agents and proposing a unified framework that covers most previous work. It also surveys applications of LLM-based agents in the social sciences, natural sciences, and engineering, discusses common strategies for evaluating them, and, based on previous research, proposes several challenges and future directions for the field.

Link: https://www.aminer.cn/pub/64f00ff53fda6d7f06eced18

7. LLaSM: Large Language and Speech Model

Most current research focuses on vision-language multimodal models, which are strong at understanding and executing vision-language instructions. However, the authors argue that speech is also an important way humans interact with the world, so it is crucial that a universal assistant be able to understand and follow multimodal speech-language instructions. To this end, the authors propose LLaSM, a large language and speech model. LLaSM is an end-to-end trained large multimodal speech-language model with cross-modal conversational capability and the ability to follow speech-language instructions. Early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. The authors also release LLaSM-Audio-Instructions, a large speech instruction dataset.

Link: https://www.aminer.cn/pub/64f00ff43fda6d7f06ecec49

8. Dual-Stream Diffusion Net for Text-to-Video Generation

An important bottleneck in text-to-video generation is that generated videos often contain flicker and artifacts. The authors propose a dual-stream diffusion network (DSDN) to improve the consistency of content changes in generated videos. The method designs two diffusion streams, a video content branch and a motion branch, which run separately in their own spaces to produce personalized video content and motion, and uses a cross-transformer interaction module designed by the authors to align the content and motion domains well, which benefits the smoothness of the generated videos. The authors also introduce a motion decomposer and combiner to facilitate the manipulation of video motion. Qualitative and quantitative experiments show that the method generates impressive continuous videos with less flicker.

Link: https://www.aminer.cn/pub/64dd9b053fda6d7f0622e793

9. Teach LLMs to Personalize – An Approach inspired by Writing Education

The paper proposes a new method for personalized text generation. Current research in this area mainly tackles domain-specific personalized text generation by designing customized features or models. The method proposed here instead draws on the practice of writing education, developing a multi-stage, multi-task framework to teach large language models (LLMs) personalized generation. The framework decomposes the personalized text generation task into stages of retrieval, ranking, summarization, synthesis, and generation. It also introduces a multi-task setting to further improve the model's generation ability, based on the observation from education that students' reading and writing abilities are usually correlated. Evaluated on three public datasets, the proposed method achieves significant improvements in personalized text generation over a variety of baselines.
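The staged decomposition described above can be sketched as a chain of stage functions over a shared context. The stage names and context keys below are assumptions for illustration, not the paper's API; the point is only that retrieval, ranking, summarization, synthesis, and generation compose as successive transformations of one working state.

```python
from functools import reduce


def run_pipeline(context, stages):
    """Apply each stage (retrieve -> rank -> summarize -> synthesize
    -> generate) in order; every stage maps the current context dict
    to an updated one, so later stages build on earlier results."""
    return reduce(lambda ctx, stage: stage(ctx), stages, context)


# Illustrative toy stages; real stages would call an LLM or a retriever.
stages = [
    lambda c: {**c, "retrieved": ["doc_b", "doc_a"]},          # retrieval
    lambda c: {**c, "ranked": sorted(c["retrieved"])},         # ranking
    lambda c: {**c, "output": f"answer using {c['ranked'][0]}"},  # generation
]
result = run_pipeline({"query": "q"}, stages)
```

Keeping each stage as an independent function mirrors the paper's motivation: every sub-skill can be supervised or improved separately, as in writing education.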

Link: https://www.aminer.cn/pub/64dd9b053fda6d7f0622e61f

10. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Recent post-training quantization (PTQ) methods can reduce the memory footprint and improve the computational efficiency of LLMs, but they rely on hand-crafted quantization parameters, which limits performance and fails at extremely low-bit quantization. To address this, the authors introduce Omnidirectionally calibrated Quantization (OmniQuant), which achieves good performance under diverse quantization settings by efficiently optimizing various quantization parameters while retaining the computational efficiency of PTQ.
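A toy sketch of why optimizing quantization parameters beats hand-designing them: below, a uniform quantizer's clipping threshold is grid-searched to minimize reconstruction error instead of being fixed at the tensor's max value. This is a crude stand-in for the idea behind OmniQuant's learnable clipping, not the paper's actual (gradient-based) method.

```python
def fake_quant(xs, scale, zero_point, n_bits=4):
    """Uniform affine quantize-dequantize ("fake quantization")."""
    qmin, qmax = 0, 2 ** n_bits - 1
    out = []
    for x in xs:
        q = round(x / scale) + zero_point
        q = max(qmin, min(qmax, q))      # clip to the representable range
        out.append((q - zero_point) * scale)
    return out


def search_clip(xs, n_bits=4, grid=100):
    """Grid-search a symmetric clipping threshold minimizing squared
    reconstruction error; hand-designed PTQ would just use max|x|."""
    xmax = max(abs(x) for x in xs)
    zero_point = (2 ** n_bits - 1) // 2
    best_err, best_clip = None, xmax
    for i in range(1, grid + 1):
        clip = xmax * i / grid
        scale = 2 * clip / (2 ** n_bits - 1)
        deq = fake_quant(xs, scale, zero_point, n_bits)
        err = sum((x - y) ** 2 for x, y in zip(xs, deq))
        if best_err is None or err < best_err:
            best_err, best_clip = err, clip
    return best_clip
```

With an outlier-heavy weight vector, the searched threshold sacrifices the outlier's precision to give the many small weights finer resolution, which is exactly the effect hand-set clipping ranges miss at low bit-widths.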

Link: https://www.aminer.cn/pub/64ec1b763fda6d7f0626f449

How to use ChatPaper?

Using ChatPaper is simple: open the AMiner homepage and go to the ChatPaper page from the navigation bar at the top of the page or from the lower-right corner.



Origin blog.csdn.net/AI_Conf/article/details/132691765