Large Model Weekly Report | What are the problems in large model evaluation? University of Science and Technology of China and others proposed Ziya2

A large language model (LLM) is an artificial intelligence model designed to understand and generate human language. LLMs are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and more. They are characterized by their scale, often containing billions of parameters, which helps them learn complex patterns in linguistic data. These models are typically based on deep learning architectures such as the Transformer, which enables them to achieve impressive performance on a variety of NLP tasks.

At the end of 2022, OpenAI launched ChatGPT, a large-scale language model based on GPT-3.5. Thanks to its excellent performance, ChatGPT and the large language model behind it quickly became a hot topic in artificial intelligence, attracting the attention and participation of a large number of researchers and developers.

This week, 10 outstanding papers in the field of LLMs have been selected. For ease of reading, only the paper title, authors, the AMiner AI review, and other basic information are listed. If you are interested, you can scan the QR code to view the original paper; the data is synchronized to the PC side (after adding a paper to your collection, you can view it on a PC), and newly added papers can also be browsed each day by logging into the mini program.

1. Don’t Make Your LLM an Evaluation Benchmark Cheater

This paper discusses the potential risks and impacts of the inappropriate use of evaluation benchmarks and the misleading interpretation of evaluation results for large language models (LLMs). In particular, the authors focus on one issue that can lead to inappropriate evaluation, namely "benchmark leakage," in which data related to the evaluation set is inadvertently used for model training. This phenomenon is becoming increasingly common, since pre-training data is usually prepared before model testing. The authors conducted extensive experiments to study the impact of benchmark leakage and found that it can significantly inflate evaluation results, ultimately leading to unreliable assessments of model performance. To improve the use of existing evaluation benchmarks, the authors conclude by proposing several guidelines for LLM developers and benchmark maintainers, and they hope this work will draw attention to the appropriate training and evaluation of LLMs.

Link:https://www.aminer.cn/pub/65484c09939a5f4082af623d/?f=cs
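
For readers who want to probe their own pipelines for this problem, below is a minimal sketch of an n-gram overlap check between a training corpus and a benchmark: it flags benchmark items that share long word n-grams with the training data. The 13-gram window and the toy data are illustrative assumptions, not the exact detection procedure used in the paper.

```python
# Minimal sketch of an n-gram overlap check for benchmark leakage.
# The 13-gram window and the toy data below are illustrative assumptions,
# not the detection procedure used in the paper.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams contained in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_docs: Iterable[str],
                       benchmark_items: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training data."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)

    items = list(benchmark_items)
    flagged = sum(1 for item in items if ngrams(item, n) & train_ngrams)
    return flagged / max(len(items), 1)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog " * 3]
    bench = ["a completely unrelated question about physics",
             "the quick brown fox jumps over the lazy dog " * 2]
    print(f"contamination rate: {contamination_rate(train, bench):.2f}")  # 0.50
```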

2. Ziya2: Data-centric Learning is All LLMs Need

This paper introduces a large language model (LLM) called Ziya2, which uses LLaMA2 as its base model and is further pre-trained on 700 billion tokens. The research focuses on pre-training techniques and data-centric optimization to enhance Ziya2's learning process at different stages. Experimental results show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially encouraging results when compared with representative open-source models.

Link:https://www.aminer.cn/pub/65499deb939a5f4082bebeea/?f=cs

3. LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

This paper introduces LCM-LoRA, a universal acceleration module for Stable Diffusion. Latent Consistency Models (LCMs) have achieved remarkable results in accelerating text-to-image generation, producing high-quality images with very few inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs) and require only about 32 A100 GPU hours of training. This report extends the potential of LCMs in two ways. First, by applying LoRA distillation to Stable Diffusion models including SD-V1.5, SSD-1B, and SDXL, the authors broaden the applicability of LCMs to larger models with lower memory consumption while achieving better image generation quality. Second, they identify the LoRA parameters obtained through LCM distillation as a universal Stable Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be plugged directly into a variety of fine-tuned Stable Diffusion models or LoRAs without additional training, making it a versatile accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers (such as DDIM and DPM-Solver), LCM-LoRA can be regarded as a plug-in neural PF-ODE solver with strong generalization capabilities.

Project page: https://github.com/luosiallen/latent-consistency-model.

Link:https://www.aminer.cn/pub/654d9745939a5f40826b39ed/?f=cs
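
As a usage-level illustration of the plug-and-play claim above, the sketch below loads the publicly released LCM-LoRA weights into a standard Stable Diffusion pipeline with the Hugging Face diffusers library and samples in a few steps. It assumes a recent diffusers release with LCMScheduler support; the hub ids and sampling settings are taken from common examples and may need adjusting for your setup.

```python
# Hedged sketch: plugging LCM-LoRA into an off-the-shelf Stable Diffusion
# pipeline via diffusers. Assumes a recent diffusers release with
# LCMScheduler and that the LoRA weights are available under the hub id
# below; adjust model ids, device, and settings to your environment.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and load the distilled LoRA weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# Few-step sampling: LCM-LoRA typically targets 2-8 inference steps
# with low (or effectively disabled) classifier-free guidance.
image = pipe(
    prompt="a photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sample.png")
```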

4. LRM: Large Reconstruction Model for Single Image to 3D

This paper introduces LRM (Large Reconstruction Model), a method that predicts the 3D model of an object from a single input image in only about 5 seconds. Unlike previous methods, which are typically trained on small datasets (such as ShapeNet) in a category-specific manner, LRM employs a scalable transformer-based architecture with 500 million learnable parameters that predicts a neural radiance field (NeRF) directly from the input image. The authors train the model end-to-end on a large amount of multi-view data containing approximately 1 million objects, including synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data makes the model highly general, producing high-quality 3D reconstructions from a wide range of test inputs, including real-world captures and images produced by generative models. A video demonstration and interactive 3D meshes can be found at https://yiconghong.me/LRM/.

Link:https://www.aminer.cn/pub/654c4abd939a5f4082b35938/?f=cs
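
To make the pipeline above more concrete, here is a shape-level toy sketch of decoding a 3D representation from single-image features with a transformer: an image encoder produces patch features, a transformer decoder maps a fixed set of learned 3D tokens onto them, and the resulting tokens would feed a NeRF-style renderer. All module sizes, the token layout, and the module names are assumptions for illustration, not LRM's actual architecture or code.

```python
# Shape-level sketch of an image-to-3D-token pipeline (illustrative only).
import torch
import torch.nn as nn


class ImageTo3D(nn.Module):
    def __init__(self, d_model: int = 256, n_tokens: int = 3 * 16 * 16):
        super().__init__()
        # Patchify the image into a sequence of features (stand-in for a ViT encoder).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Learned query tokens that will become the 3D representation.
        self.query_tokens = nn.Parameter(torch.randn(n_tokens, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_features = nn.Linear(d_model, 32)  # per-token feature channels

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, 224, 224) -> patch features: (batch, 196, d_model)
        feats = self.patch_embed(image).flatten(2).transpose(1, 2)
        queries = self.query_tokens.expand(image.size(0), -1, -1)
        tokens_3d = self.decoder(queries, feats)   # (batch, n_tokens, d_model)
        return self.to_features(tokens_3d)         # tokens for a NeRF-style renderer


if __name__ == "__main__":
    model = ImageTo3D()
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 768, 32])
```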

5. GLaMM: Pixel Grounding Large Multimodal Model

This paper introduces GLaMM, the first pixel-level grounded large multimodal model. Large multimodal models (LMMs) extend large language models to the visual domain. Earlier work used whole images and text prompts to generate ungrounded textual responses, while more recent region-level LMMs generate visually grounded responses but can typically handle only one object category at a time, require the user to specify regions in the input, or cannot provide dense pixel-level object grounding. In this paper, the authors present GLaMM, the first model that can generate natural language responses seamlessly interleaved with corresponding object segmentation masks. GLaMM can not only ground objects that appear in the conversation, but is also flexible enough to accept text and optional visual prompts (regions of interest) as input, allowing users to interact with the model at different levels of granularity in both the textual and visual domains. Because standard benchmarks for generating visually grounded detailed dialogues are lacking, the authors introduce a comprehensive evaluation protocol with carefully curated grounded conversations. The proposed Grounded Conversation Generation (GCG) task requires densely grounded natural-scene concepts at a large scale. To this end, they propose the densely annotated Grounding-anything Dataset (GranD), built with an automated annotation pipeline and covering 7.5 million unique concepts grounded in a total of 810 million regions with segmentation masks. Beyond GCG, GLaMM also performs well on downstream tasks such as referring expression segmentation, image- and region-level captioning, and vision-language dialogue. Project page: https://mbzuai-oryx.github.io/groundingLMM.

Link:https://www.aminer.cn/pub/65499e11939a5f4082beca5d/?f=cs

6. CogVLM: Visual Expert for Pretrained Language Models

This paper introduces CogVLM, a powerful open-source visual language foundation model. Unlike popular shallow alignment methods, CogVLM bridges the gap between a pre-trained language model and an image encoder through trainable visual expert modules in the attention and FFN layers. As a result, CogVLM achieves a deep fusion of visual and language features without sacrificing performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and other tasks, surpassing or matching PaLI-X 55B.

Link:https://www.aminer.cn/pub/65260ee8cd549670787e1513/?f=cs
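
The sketch below illustrates the routing idea behind a "visual expert" layer as described above: inside a transformer block, image tokens pass through their own trainable weights while text tokens keep using the original (frozen) language-model weights. It is a toy FFN-only example with made-up shapes, not CogVLM's implementation.

```python
# Toy sketch of visual-expert routing: image tokens use a separate trainable
# FFN, text tokens use the frozen language-model FFN. Illustrative only.
import torch
import torch.nn as nn


class VisualExpertFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                      nn.Linear(d_ff, d_model))
        self.image_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                       nn.Linear(d_ff, d_model))
        # Keep the original language-model path frozen; only the expert trains.
        for p in self.text_ffn.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); image_mask: (batch, seq) bool, True for image tokens
        out_text = self.text_ffn(x)
        out_image = self.image_ffn(x)
        return torch.where(image_mask.unsqueeze(-1), out_image, out_text)


if __name__ == "__main__":
    ffn = VisualExpertFFN(d_model=64, d_ff=256)
    x = torch.randn(2, 10, 64)
    mask = torch.zeros(2, 10, dtype=torch.bool)
    mask[:, :4] = True  # pretend the first 4 positions are image tokens
    print(ffn(x, mask).shape)  # torch.Size([2, 10, 64])
```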

7. Levels of AGI: Operationalizing Progress on the Path to AGI

This paper proposes a framework for classifying the capabilities and behaviors of artificial general intelligence (AGI) models and their precursors. The framework introduces levels of AGI based on performance, generality, and autonomy. The authors hope it will be as useful as the levels of autonomous driving, providing a common language for comparing models, assessing risks, and measuring progress on the path to AGI. In developing the framework, the authors analyzed existing definitions of AGI and distilled six principles that a useful AGI ontology should satisfy, including focusing on capabilities rather than mechanisms, assessing generality and performance separately, and defining stages along the road to AGI rather than focusing only on the endpoint. Based on these principles, they propose "Levels of AGI" organized by depth (performance) and breadth (generality) of capability, and consider how current systems fit into this ontology. They discuss the challenging requirements for future benchmarks that would quantitatively measure the behavior and capabilities of AGI models against these levels. Finally, they discuss how these AGI levels interact with deployment considerations such as autonomy and risk, and highlight the importance of carefully selecting human-AI interaction paradigms for the responsible and safe deployment of highly capable AI systems.

Link:https://www.aminer.cn/pub/65499d88939a5f4082be9a34/?f=cs

8. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

This paper introduces mPLUG-Owl2, a multimodal large language model that improves performance on both text and multimodal tasks by leveraging modality collaboration. mPLUG-Owl2 adopts a modular network design, with the language decoder serving as a universal interface for managing different modalities. Specifically, it introduces shared functional modules to promote modality collaboration and a modality-adaptive module to preserve modality-specific features. Experimental results show that mPLUG-Owl2 generalizes across text tasks and multimodal tasks, achieving state-of-the-art performance with a single general model. Notably, mPLUG-Owl2 is the first multimodal large language model to exhibit modality collaboration in both pure-text and multimodal scenarios, opening up a new path for the development of future multimodal foundation models.

Link:https://www.aminer.cn/pub/654c4abd939a5f4082b35909/?f=cs

9. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

This paper introduces Skywork-13B, a bilingual foundation model trained on 3.2 trillion tokens of English and Chinese text. It is the largest and most thoroughly trained bilingual base model among the large language models released to date. The paper describes a two-stage training method covering general-purpose training and domain-specific enhancement training. The research shows that the model not only performs well on popular benchmarks but also reaches state-of-the-art Chinese language modeling performance across diverse domains. In addition, the paper proposes a novel leakage detection method, indicating that test data contamination is a pressing issue that deserves further study by the LLM community. To promote future research, the authors release Skywork-13B together with intermediate checkpoints obtained during training. They also release part of the SkyPile corpus, a collection of more than 150 billion tokens of web text that is the largest high-quality open Chinese pre-training corpus to date. The authors hope that Skywork-13B and the open corpus will serve as valuable open-source resources so that more people can access high-quality language models.

Link:https://www.aminer.cn/pub/654d96ff939a5f40826adf41/?f=cs
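
The leakage-detection idea mentioned above can be illustrated with a simple perplexity-style probe: if a model's loss on a benchmark's test split is markedly lower than on comparable freshly written text, that split may have leaked into training. The sketch below is a generic probe of this kind, not necessarily the method proposed in the paper; the gpt2 model id and the example texts are placeholders.

```python
# Illustrative perplexity-gap probe for test-set contamination.
# Generic sketch under stated assumptions, not the paper's exact method.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def mean_nll(model, tokenizer, texts, device="cpu"):
    """Average per-token negative log-likelihood of the model over a list of texts."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n
        total_tokens += n
    return total_nll / max(total_tokens, 1)


if __name__ == "__main__":
    name = "gpt2"  # placeholder model id; swap in the model being audited
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    benchmark_split = ["Question: What is the capital of France? Answer: Paris."]
    reference_texts = ["Question: Which river flows through a made-up town? Answer: unknown."]

    gap = mean_nll(model, tok, reference_texts) - mean_nll(model, tok, benchmark_split)
    print(f"NLL gap (reference - benchmark): {gap:.3f}, "
          f"perplexity ratio ≈ {math.exp(gap):.2f} (a large gap hints at leakage)")
```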

10. TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

This paper introduces TEAL (Tokenize and Embed ALL), an approach for handling non-textual inputs and outputs in multimodal large language models (MM-LLMs). TEAL treats the input from any modality as a token sequence and learns a joint embedding space for all modalities. Specifically, TEAL first discretizes the input from any modality into a token sequence with an off-the-shelf tokenizer, and then embeds the token sequence into the joint embedding space with a learnable embedding matrix. The MM-LLM then only needs to predict multimodal tokens autoregressively, just like a text LLM. Finally, the corresponding de-tokenizer is applied to generate the output in each modality from the predicted token sequence. Thanks to the joint embedding space, TEAL enables frozen LLMs to perform understanding and generation tasks involving non-textual modalities such as images and audio, while the text LLM can simply act as an interface and maintain its high performance on text understanding and generation. Experiments show that TEAL achieves significant improvements in multimodal understanding and provides a simple scheme for multimodal generation.

Link:https://www.aminer.cn/pub/654c4abd939a5f4082b35985/?f=cs
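
As a small illustration of the tokenize-and-embed recipe described above, the sketch below maps text token ids and (offset) discrete image codes into one shared embedding table, producing a single multimodal sequence that a causal LLM could model autoregressively. The VQ-style image codebook size, the offsetting scheme, and the shapes are assumptions for illustration, not TEAL's actual tokenizers or code.

```python
# Toy sketch of tokenize-and-embed: all modalities become token ids in one
# joint embedding space. Codebook sizes and shapes are illustrative only.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000
IMAGE_VOCAB = 8_192  # e.g. codebook size of a VQ-style image tokenizer


class JointEmbedding(nn.Module):
    """Single embedding table covering text ids and (offset) image codes."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)

    def forward(self, text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
        # Offset image codes so they occupy their own slice of the vocabulary,
        # then concatenate into one multimodal token sequence.
        image_ids = image_codes + TEXT_VOCAB
        tokens = torch.cat([image_ids, text_ids], dim=1)  # (batch, seq)
        return self.embed(tokens)                         # (batch, seq, d_model)


if __name__ == "__main__":
    text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
    image_codes = torch.randint(0, IMAGE_VOCAB, (1, 64))  # from an image tokenizer
    seq = JointEmbedding()(text_ids, image_codes)
    print(seq.shape)  # torch.Size([1, 80, 512]); fed to the LLM as usual
```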


How to use AMiner AI?

Using AMiner AI is very simple: open the AMiner homepage and enter the AMiner AI page from the navigation bar at the top of the page or from the entry in the lower right corner.


On the AMiner AI page, you can choose to have a conversation based on a single paper or on your entire library (personal document database). You can upload a local PDF or search for papers directly on AMiner.

AMiner AI entry: https://www.aminer.cn/chat/g/explain


Source: blog.csdn.net/AI_Conf/article/details/134398506