LLM Weekly Papers | Frontier research from Google, Huawei, Stanford University, the University of Hong Kong, and other institutions

A large language model (LLM) is an artificial intelligence model designed to understand and generate human language. LLMs are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and more. They are characterized by their large scale, often containing billions of parameters, which helps them learn complex patterns in linguistic data. These models are typically based on deep learning architectures such as the Transformer, which enables their impressive performance on a variety of NLP tasks.

At the end of 2022, OpenAI launched ChatGPT, a large-scale language model based on GPT-3.5. Thanks to its excellent performance, ChatGPT and the large language models behind it quickly became a hot topic in artificial intelligence, attracting the attention and participation of researchers and developers.

This week we selected 10 notable papers in the LLM field, from Google, Huawei, Stanford University, the University of Hong Kong, and other institutions.

For ease of reading, only the paper title, authors, ChatPaper summary, and related information are listed. If you are interested, you can click the link to view the original text. Data is synchronized with the PC side (your saved papers can be viewed there), and each day's new papers can also be viewed by logging in to the mini program.

ChatPaper entrance: https://www.aminer.cn/chat/g

1. CAME: Confidence-guided Adaptive Memory Efficient Optimization paper details page

Authors: Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You

Link: https://www.aminer.cn/pub/64a63bddd68f896efaec6604/?f=cs

ChatPaper review: This paper notes that adaptive gradient methods such as Adam and LAMB perform very well when training large language models, but they must maintain a second-moment estimate for every parameter gradient, which incurs extra memory overhead. To address this problem, the paper proposes CAME, a confidence-guided, adaptive, memory-efficient optimizer. CAME uses a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers, achieving two goals at once: fast convergence like traditional adaptive methods and low memory usage like memory-efficient methods. Extensive experiments show that CAME is stable and performs well on a variety of natural language processing tasks; in particular, for BERT pre-training at a large batch size of 32,768, the proposed method achieves faster convergence and higher accuracy than the Adam optimizer. Implementations of CAME are publicly available.
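
As a rough illustration of where the memory savings come from, here is a minimal NumPy sketch of the Adafactor-style factored second-moment estimate that this family of memory-efficient optimizers relies on: instead of a full matrix of squared-gradient statistics, only per-row and per-column vectors are stored. The confidence-guided correction that CAME adds on top is omitted, so this is not the paper's actual update rule.

```python
import numpy as np

def factored_second_moment_step(w, g, r, c, lr=1e-3, beta2=0.999, eps=1e-30):
    """One update step with an Adafactor-style factored second moment.

    Instead of storing a full (n, m) matrix of squared-gradient statistics
    (as Adam does), we keep only a row vector r (n,) and a column vector
    c (m,), cutting optimizer memory from O(n*m) to O(n + m).
    This is a simplified sketch of the memory-saving idea, not CAME's full
    confidence-guided update.
    """
    g2 = g * g + eps
    r = beta2 * r + (1 - beta2) * g2.mean(axis=1)   # per-row statistic
    c = beta2 * c + (1 - beta2) * g2.mean(axis=0)   # per-column statistic
    v_hat = np.outer(r, c) / r.mean()               # rank-1 reconstruction of the second moment
    w = w - lr * g / np.sqrt(v_hat)
    return w, r, c

# Toy usage on a 4x3 weight matrix with random stand-in gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
r, c = np.zeros(4), np.zeros(3)
for _ in range(10):
    g = rng.normal(size=(4, 3))
    w, r, c = factored_second_moment_step(w, g, r, c)
```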

2. BiPhone: Modeling Inter Language Phonetic Influences in Text paper details page

Authors: Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer

Link: https://www.aminer.cn/pub/64ab82833fda6d7f06f77db1/?f=cs

ChatPaper review: This paper observes that, for reasons such as technological asymmetry, many people are forced to communicate on the Internet in a second language (L2) they are not fluent in, so L2 text often contains many errors influenced by the writer's native language (L1). The paper proposes a method to mine phonetic confusions between L1 and L2 (sounds in L2 that an L1 speaker is likely to confuse) and injects them into a generative model (Bi-Phone) to synthetically corrupt L2 text. Through human evaluation, the paper shows that Bi-Phone generates plausible corruptions that differ across L1s, and that such corruptions are widespread on the Web. Furthermore, by applying this corruption technique to SuperGLUE, a popular language understanding benchmark, the paper finds that state-of-the-art language understanding models perform poorly on the corrupted data. The paper also introduces a new phoneme-prediction pre-training task that helps byte-level models recover performance close to their SuperGLUE levels. Finally, the paper releases a benchmark called FunGLUE to facilitate further research on phonetically robust language models.
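
To make the corruption idea concrete, the toy sketch below applies hand-written spelling substitutions to clean text. The confusion table here is purely hypothetical; the real Bi-Phone mines L1-specific phonetic confusions and uses a learned generative model rather than string replacement.

```python
import random

# A toy, hand-made confusion table: each entry maps a spelling pattern to
# plausible misspellings. These pairs are illustrative placeholders, not the
# L1/L2 phonetic confusions actually mined in the BiPhone paper.
CONFUSIONS = {
    "v": ["w"],        # e.g. "very" -> "wery"
    "th": ["d", "t"],  # e.g. "this" -> "dis"
    "ee": ["i"],       # e.g. "sheep" -> "ship"
}

def corrupt(text, p=0.5, seed=0):
    """Randomly apply spelling-level substitutions to mimic phonetically
    driven errors. A minimal sketch of the corruption step only."""
    rng = random.Random(seed)
    out = text
    for pattern, subs in CONFUSIONS.items():
        if pattern in out and rng.random() < p:
            out = out.replace(pattern, rng.choice(subs))
    return out

print(corrupt("i think this movie is very good"))
```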

3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models paper details page

Authors: Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei

Link: https://www.aminer.cn/pub/64abee0f286e8b4b6fcd5c84/?f=cs

ChatPaper review: This paper aims to use large language models (LLMs) to synthesize dense robot trajectories for a wide variety of manipulation tasks. Most prior robotic manipulation research relied on pre-defined motion primitives, which largely limits how robots can interact with their environment. The paper proposes a method that leverages the reasoning and code-writing abilities of LLMs, interacting with visual-language models (VLMs) to compose 3D value maps, which are then used in a model-based planning framework to synthesize closed-loop robot trajectories zero-shot while remaining robust to dynamic perturbations. The framework can also leverage online experience to efficiently learn dynamics models for scenes involving contact-rich interactions. The method is studied at scale in both simulated and real robot environments, demonstrating the ability to perform more than 30 everyday manipulation tasks specified by free-form natural language descriptions.
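
As a loose illustration of what a composed value map looks like, the sketch below builds a toy 3D grid that attracts a greedy planner toward a target voxel and repels it from obstacles. In the actual system the LLM writes this kind of value-map code on the fly and a VLM supplies the object locations; everything here (grid size, weights, the greedy planner) is a made-up simplification.

```python
import numpy as np

def build_value_map(shape, target, obstacles, obstacle_radius=2.0):
    """Compose a toy 3D value map: high value near the target voxel,
    low value near obstacle voxels."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    value = -np.linalg.norm(grid - np.array(target), axis=-1)      # attract to target
    for obs in obstacles:
        dist = np.linalg.norm(grid - np.array(obs), axis=-1)
        value -= 10.0 * (dist < obstacle_radius)                   # repel from obstacles
    return value

def greedy_plan(value, start, steps=50):
    """Follow the value map greedily from a start voxel (toy planner)."""
    pos, path = np.array(start), [tuple(start)]
    for _ in range(steps):
        neighbors = pos + np.array(list(np.ndindex(3, 3, 3))) - 1
        neighbors = np.clip(neighbors, 0, np.array(value.shape) - 1)
        best = neighbors[np.argmax([value[tuple(n)] for n in neighbors])]
        if np.array_equal(best, pos):
            break
        pos = best
        path.append(tuple(pos))
    return path

vmap = build_value_map((20, 20, 20), target=(15, 15, 5), obstacles=[(10, 10, 5)])
print(greedy_plan(vmap, start=(2, 2, 5))[-1])
```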

4. PolyLM: An Open Source Polyglot Large Language Model paper details page

Authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie

Link: https://www.aminer.cn/pub/64af76ed3fda6d7f0647132f/?f=cs

ChatPaper review: This paper introduces PolyLM, an open-source multilingual large language model. PolyLM improves its multilingual ability by incorporating bilingual data into the training corpus and adopting a curriculum learning strategy. In addition, the paper proposes a multilingual self-instruct method that automatically generates 132,700 diverse multilingual instructions for model fine-tuning. Extensive experiments show that PolyLM performs well on multilingual tasks while remaining on par with existing open-source models such as LLaMA and BLOOM on English.
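
The curriculum idea can be sketched as a data-sampling schedule that gradually raises the share of non-English and bilingual examples as training progresses. The ramp and proportions below are invented for illustration; the paper's actual mixing ratios and stages differ.

```python
import random

def sample_batch(english_pool, multilingual_pool, progress, batch_size=4, seed=0):
    """Curriculum-style data mixing (illustrative only): early in training the
    batch is dominated by English text, and the share of multilingual examples
    grows with `progress` in [0, 1]. The linear ramp here is hypothetical."""
    rng = random.Random(seed)
    p_multi = 0.1 + 0.5 * progress          # made-up ramp from 10% to 60%
    batch = []
    for _ in range(batch_size):
        pool = multilingual_pool if rng.random() < p_multi else english_pool
        batch.append(rng.choice(pool))
    return batch

english = ["an english sentence ..."]
multilingual = ["une phrase en français ...", "一句中文 ..."]
print(sample_batch(english, multilingual, progress=0.0))
print(sample_batch(english, multilingual, progress=1.0))
```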

5. Teaching Arithmetic to Small Transformers paper details page

Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Link: https://www.aminer.cn/pub/64ab82833fda6d7f06f77dee/?f=cs

ChatPaper review: This paper studies how to teach basic arithmetic to small Transformer models. We find that small Transformer models trained with the next-token prediction objective can, starting from random initialization, efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root. We first demonstrate that conventional training data is not the most efficient for learning arithmetic, and that simple changes to the data format can significantly improve accuracy. As the amount of training data grows, there is a sharp phase transition, which can be explained through a connection to low-rank matrix completion. Building on this, we train on chain-of-thought style data that includes intermediate step results; even without pre-training, this approach simultaneously improves accuracy, sample complexity, and convergence speed. We also investigate the interaction between arithmetic and text data during training, and examine the effects of few-shot prompting, pre-training, and model size. Additionally, we discuss the challenge of length generalization. Our work highlights the importance of high-quality, instructive data that accounts for the particular nature of the next-token prediction objective in order to rapidly elicit arithmetic ability.
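
To illustrate what a data-format change might look like, the sketch below contrasts a plain equation sample with a chain-of-thought style sample that spells out digit-by-digit addition with carries. The template is a made-up example in the spirit of the paper's intermediate-step data, not the exact format the authors used.

```python
def plain_format(a, b):
    """Baseline training sample: just the equation."""
    return f"{a}+{b}={a + b}"

def scratchpad_format(a, b):
    """A chain-of-thought style sample that spells out digit-by-digit
    addition with carries (illustrative template only)."""
    da, db = str(a)[::-1], str(b)[::-1]          # least-significant digit first
    steps, carry = [], 0
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"digit {i}: {x}+{y}+carry {carry} = {s % 10}, carry {s // 10}")
        carry = s // 10
    if carry:
        steps.append(f"final carry {carry}")
    return f"{a}+{b}\n" + "\n".join(steps) + f"\nanswer: {a + b}"

print(plain_format(128, 367))
print(scratchpad_format(128, 367))
```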

6. Lost in the Middle: How Language Models Use Long Contexts paper details page

Authors: Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang

Link: https://www.aminer.cn/pub/64a78f1fd68f896efa01eb25/?f=cs

ChatPaper review: This paper studies how language models use long contexts. Although several language models capable of handling long contexts have appeared in recent years, relatively little is known about how well they actually use the information in them. The paper analyzes two tasks that require identifying relevant information within the input context: multi-document question answering and key-value retrieval. It finds that language models tend to perform best when the relevant information appears at the beginning or end of the input context, while performance degrades significantly when the relevant information sits in the middle of a long context. Moreover, performance also decreases as the input context grows longer, even for models designed for long contexts. The analysis provides new insight into how language models use their input context and suggests new evaluation protocols for future long-context models.
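
The core experiment is easy to reproduce in spirit: place the one document that contains the answer at different positions among distractors and track accuracy. The sketch below shows that position sweep; `model_fn` is just a placeholder for whatever LLM call you use and is not part of the paper's code.

```python
def build_prompt(question, gold_doc, distractor_docs, gold_position):
    """Place the one relevant document at a chosen position among
    distractors, mirroring the paper's multi-document QA setup."""
    docs = list(distractor_docs)
    docs.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

def position_sweep(model_fn, question, answer, gold_doc, distractors):
    """Measure accuracy as a function of where the gold document sits.
    `model_fn(prompt) -> str` is a hypothetical stand-in for an LLM call."""
    results = {}
    for pos in range(len(distractors) + 1):
        prompt = build_prompt(question, gold_doc, distractors, pos)
        results[pos] = answer.lower() in model_fn(prompt).lower()
    return results
```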

7. VideoGLUE: Video General Understanding Evaluation of Foundation Models paper details page

Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

Link: https://www.aminer.cn/pub/64a78f1fd68f896efa01eb1f/?f=cs

ChatPaper review: This paper evaluates the video understanding ability of existing foundation models using a carefully designed experimental protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatio-temporal localization), eight datasets widely adopted by the community, and four ways of adapting a foundation model to downstream tasks. The paper also proposes a metric, the VideoGLUE Score (VGS), to measure the effectiveness and efficiency of foundation models on general video understanding tasks. The results show that task-specialized models still significantly outperform the six foundation models studied, which is quite different from what foundation models have achieved in natural language and image understanding. Furthermore, video-native foundation models (whose pre-training data includes the video modality) generally outperform image-native foundation models at classifying motion-rich videos, temporally localizing actions, and understanding videos with multiple actions. A third finding is that, on video tasks, video-native foundation models do well under light adaptation to downstream tasks (e.g. freezing the foundation model backbone), whereas image-native foundation models do better with full end-to-end fine-tuning. The first two observations point to the need for research on video-focused foundation models, and the last shows that both the tasks and the adaptation methods matter when evaluating foundation models.

8. Focused Transformer: Contrastive Training for Context Scaling paper details page

Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś

Link: https://www.aminer.cn/pub/64a78f1fd68f896efa01eb23/?f=cs

ChatPaper review: This paper studies how contrastive training can address a key limitation of extending a language model's context with external memory. The memory consists of (key, value) pairs, and as the number of documents in memory grows, the proportion of relevant keys shrinks, so the model increasingly attends to irrelevant keys; the authors call this the distraction issue. To solve it, they propose the Focused Transformer (FoT), which uses contrastive training to shape the structure of the (key, value) space and thereby extend the effective context length. The authors also show that existing large language models can be fine-tuned with FoT to lengthen their effective context. Empirically, the resulting LongLLaMA models make progress on tasks requiring long context, such as passkey retrieval, and can effectively handle a 256k-token context, which was previously intractable.
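
To give a flavour of the contrastive objective, the sketch below computes an InfoNCE-style loss that pushes each query toward keys from its own document and away from keys belonging to other documents mixed into the same memory. This is a simplified stand-in, not the paper's actual procedure, which operates inside the memory attention layers of the transformer.

```python
import torch
import torch.nn.functional as F

def contrastive_memory_loss(queries, keys, q_docs, k_docs, temperature=0.1):
    """Multi-positive InfoNCE-style loss: each query should place its
    probability mass on keys from its own document rather than on keys of
    the other documents sharing the memory.

    queries: (Nq, d), keys: (Nk, d); q_docs/k_docs hold document ids.
    """
    logits = queries @ keys.T / temperature                  # (Nq, Nk) similarities
    positives = q_docs.unsqueeze(1).eq(k_docs.unsqueeze(0))  # same-document key mask
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative log of the total probability assigned to same-document keys.
    return -(log_probs.masked_fill(~positives, float("-inf")).logsumexp(dim=-1)).mean()

# Toy usage: 2 documents, random query/key vectors.
torch.manual_seed(0)
q, k = torch.randn(8, 16), torch.randn(32, 16)
q_docs = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
k_docs = torch.arange(32) % 2
print(contrastive_memory_loss(q, k, q_docs, k_docs))
```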

9. GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest paper details page

Authors: Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo

Link: https://www.aminer.cn/pub/64ab828f3fda6d7f06f78840/?f=cs

ChatPaper review: This paper proposes GPT4RoI, a method that instruction-tunes a large language model (LLM) on region-of-interest data for more precise multimodal understanding. Traditional image-text instruction tuning only establishes image-level vision-language alignment and lacks region-level alignment, which limits progress on fine-grained multimodal understanding. The authors instead reformulate bounding boxes as spatial instructions: interleaved sequences of spatial-instruction (region) embeddings and language embeddings are fed to the LLM, which is trained on region-text data converted into instruction-tuning format. The resulting region-level vision-language model provides a new conversational and interactive experience beyond image-level understanding. (1) Controllability: users can interact with the model through both language and spatial instructions, flexibly adjusting the level of detail of a question. (2) Capability: the model supports not only single-region but also multi-region spatial instructions, unlocking more region-level multimodal capabilities such as detailed region captions and complex region reasoning. (3) Composability: any off-the-shelf object detector can serve as the spatial-instruction provider, allowing the model to surface useful object attributes such as color, shape, material, motion, and relations to other objects.
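
As a rough sketch of how a bounding box might be turned into something an LLM can consume, the code below pools RoI features from a visual feature map and splices them into the token embedding sequence in place of a region placeholder token. The feature extractor, projection, and placeholder handling here are simplified assumptions, not the actual GPT4RoI architecture.

```python
import torch
import torchvision.ops as ops

def region_embeddings(feature_map, boxes, proj):
    """Turn bounding boxes into LLM-space embeddings via RoI-Align pooling.
    feature_map: (1, C, H, W); boxes: (K, 4) in feature-map coordinates."""
    rois = ops.roi_align(feature_map, [boxes], output_size=(7, 7))   # (K, C, 7, 7)
    pooled = rois.mean(dim=(2, 3))                                   # (K, C)
    return proj(pooled)                                              # (K, d_model)

def splice_regions(token_embeds, region_token_positions, region_embeds):
    """Replace placeholder token embeddings (e.g. for '<region1>') with the
    corresponding region embeddings before feeding the sequence to the LLM."""
    out = token_embeds.clone()
    for pos, emb in zip(region_token_positions, region_embeds):
        out[pos] = emb
    return out

# Toy usage with hypothetical sizes.
C, d_model = 256, 512
feat = torch.randn(1, C, 32, 32)
boxes = torch.tensor([[4.0, 4.0, 12.0, 12.0]])          # one region of interest
proj = torch.nn.Linear(C, d_model)
reg = region_embeddings(feat, boxes, proj)               # (1, d_model)
tokens = torch.randn(10, d_model)                        # embedded instruction tokens
inputs = splice_regions(tokens, [3], reg)                # '<region1>' sits at index 3
```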

10. Generative Pretraining in Multimodality paper details page

Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

Link: https://www.aminer.cn/pub/64ae259c3fda6d7f0658f3b5/?f=cs

ChatPaper review: This paper introduces Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. Emu is an omnivorous model: it accepts any single-modal or multimodal input (e.g. interleaved images, text, and video) and is trained with a single, unified autoregressive procedure. Visual signals are first encoded into embeddings, which together with text tokens form an interleaved input sequence; Emu is then trained with the unified objective of classifying the next text token or regressing the next visual embedding over the entire multimodal sequence. This versatility lets the model exploit diverse large-scale pre-training data, such as interleaved frame-and-text sequences from videos, interleaved image-and-text sequences from web pages, and large-scale image-text and video-text pairs. Emu can serve as a general-purpose multimodal interface, supporting both image-to-text and text-to-image tasks and enabling in-context image and text generation. On a wide range of zero-shot and few-shot tasks, including image captioning, visual question answering, video question answering, and text-to-image generation, Emu outperforms state-of-the-art large multimodal models. Emu also shows strong extensibility, for example supporting multimodal assistants through instruction fine-tuning.
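
A minimal sketch of such a unified objective is shown below: cross-entropy at text positions and an l2 regression at visual-embedding positions, summed into one loss. The head names, shapes, and weighting are assumptions for illustration and not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def unified_multimodal_loss(hidden, text_targets, visual_targets, is_text,
                            lm_head, visual_head, alpha=1.0):
    """Unified next-step objective over an interleaved multimodal sequence:
    classify the next token at text positions, regress the next visual
    embedding at visual positions (simplified sketch).

    hidden: (T, d) transformer outputs; is_text: (T,) boolean mask.
    """
    text_logits = lm_head(hidden[is_text])                   # (Nt, vocab)
    text_loss = F.cross_entropy(text_logits, text_targets)
    visual_pred = visual_head(hidden[~is_text])               # (Nv, d_vis)
    visual_loss = F.mse_loss(visual_pred, visual_targets)     # l2 regression
    return text_loss + alpha * visual_loss

# Toy usage with hypothetical dimensions.
T, d, vocab, d_vis = 12, 64, 100, 32
hidden = torch.randn(T, d)
is_text = torch.tensor([True] * 8 + [False] * 4)
loss = unified_multimodal_loss(
    hidden,
    text_targets=torch.randint(0, vocab, (8,)),
    visual_targets=torch.randn(4, d_vis),
    is_text=is_text,
    lm_head=torch.nn.Linear(d, vocab),
    visual_head=torch.nn.Linear(d, d_vis),
)
print(loss)
```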


How to use ChatPaper?

Using ChatPaper is very simple. Open the AMiner homepage and enter the ChatPaper page from the navigation bar at the top of the page or from the lower right corner.

On the ChatPaper page, you can choose to have a dialogue based on a single document or a dialogue based on the entire library (personal library), and you can choose to upload a local PDF or directly search for documents on AMiner.
