Inventory of open source large model papers, with PDF download links included

Large models have entered their "Android era": open source and closed source models continue to emerge, becoming two parallel forces in the field of large models.

Open source large models have brought new vitality to the AI field: industry applications, and even new models built on top of open source models, continue to emerge. They also give researchers and developers broader room for innovation, allowing them to experiment with these models without needing massive resources or access to proprietary systems.

Abroad, after the release of ChatGPT, Meta released LLaMA, and this year it released Llama 2, which is open source and commercially usable. Stanford University released Alpaca by fine-tuning LLaMA, and the 180-billion-parameter Falcon was also recently announced as open source.

In China, Tsinghua University and Zhipu AI released the open source ChatGLM-6B, Shanghai AI Laboratory released InternLM (Scholar PuYu), Baichuan Intelligence released baichuan-7B, and so on.

Open source models are advancing rapidly around the world.

In this article, we collect some of the current papers on open source large models and, with the help of the AMiner AI feature, provide a short review of each so that readers can grasp the details of these papers more quickly.

Let’s take a closer look at these exciting developments.

1.Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

The paper explains that transfer learning, in which a model is first pre-trained on a data-rich task and then fine-tuned on downstream tasks, has become a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a variety of approaches, methodologies, and practices. This paper explores the limits of NLP transfer learning by introducing a unified framework that converts every text-based language problem into a text-to-text format. By comparing pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks, and by combining the insights gained with scale and a new "Colossal Clean Crawled Corpus," the study achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on NLP transfer learning, the researchers publicly released their dataset, pre-trained models, and code.
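
To make the core idea concrete, below is a minimal usage sketch (not code from the paper) that assumes the public t5-small checkpoint on Hugging Face and the transformers library: different tasks are cast into the same text-to-text interface simply by changing the task prefix in the input string.

```python
# Minimal sketch of T5-style text-to-text task casting (assumes `transformers`
# and `sentencepiece` are installed; t5-small is a small public checkpoint).
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Every task is just text in, text out -- only the prefix changes.
examples = {
    "translation": "translate English to German: The house is wonderful.",
    "summarization": "summarize: Transfer learning pre-trains a model on a "
                     "data-rich task and then fine-tunes it on downstream tasks.",
    "acceptability": "cola sentence: The book was read by me quickly happy.",
}

for task, prompt in examples.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```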

Paper link: https://www.aminer.cn/pub/5db1765a3a55ac101c887e97/?f=cs

2.mT5: A massively multilingual pre-trained text-to-text transformer

This paper introduces mT5, a massively multilingual pre-trained text-to-text Transformer. The recent Text-to-Text Transfer Transformer (T5) achieved state-of-the-art results on a variety of English natural language processing tasks by using a unified text-to-text format and scale. mT5 is a multilingual variant of T5 pre-trained on a Common Crawl-based dataset covering 101 languages. The authors describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.

Paper link: https://www.aminer.cn/pub/5f92ba5191e011edb3573ba5/?f=cs

3.PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

This paper introduces PanGu-α, a large-scale autoregressive pre-trained Chinese language model with up to 200 billion parameters. PanGu-α was developed with the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. Training uses a parallelization strategy based on MindSpore Auto-parallel that combines data parallelism, operator-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To improve the generalization ability of PanGu-α, 1.1TB of high-quality Chinese data from a wide range of domains was collected for pre-training. The authors test PanGu-α's generation capabilities in scenarios such as text summarization, question answering, and dialogue generation. In addition, they study the effect of model size on various Chinese natural language processing tasks and show that PanGu-α performs strongly in few-shot and zero-shot settings.

Paper link: https://www.aminer.cn/pub/6087f2ff91e011e25a316d31/?f=cs

4.CPM-2: Large-scale cost-effective pre-trained language models

This paper introduces CPM-2, a suite of large-scale, cost-effective pre-trained language models that uses a series of techniques to address efficiency problems across pre-training, fine-tuning, and inference. These include knowledge inheritance to speed up pre-training, prompt tuning for efficient adaptation of large pre-trained language models, and a new inference toolkit, INFMOE, for using large pre-trained language models in resource-constrained environments. Based on these techniques, the paper presents CPM-2, an 11-billion-parameter bilingual (Chinese-English) encoder-decoder model, together with a Mixture-of-Experts (MoE) version with 198 billion parameters. In experiments, CPM-2 is compared with mT5 on downstream tasks, and the results show that it has strong general language intelligence. The authors also verify the efficiency of INFMOE for running inference with large models on a single GPU. The source code and model parameters are available at https://github.com/TsinghuaAI/CPM.

Paper link: https://www.aminer.cn/pub/60d30ac49e795e035c9e5884/?f=cs

5.Multitask Prompted Training Enables Zero-Shot Task Generalization

The article addresses the question of whether zero-shot task generalization can be induced through explicit multitask learning. It notes that the reasonable zero-shot generalization recently observed in large language models is likely due to implicit multitask learning during language model pre-training. The authors develop a system for converting common natural language tasks into human-readable prompt forms in order to test whether explicit multitask learning can directly induce zero-shot generalization. By fine-tuning a pre-trained encoder-decoder model on this multitask mixture, they find that the model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16 times its size. In addition, the model performs well on some tasks from the BIG-Bench benchmark, outperforming models 6 times its size.
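
Below is a minimal sketch (not the authors' code) of the kind of prompted inference the paper enables, assuming the public bigscience/T0_3B checkpoint on Hugging Face: a supervised NLI example is rewritten as a plain-English question so a single text-to-text model can answer it zero-shot.

```python
# Minimal zero-shot prompting sketch (assumes `transformers` is installed and
# the bigscience/T0_3B checkpoint, a smaller public member of the T0 family).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# An NLI instance rephrased as a natural-language prompt (template is illustrative).
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
prompt = f'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes or no?'

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```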

Paper link: https://www.aminer.cn/pub/616ce5a55244ab9dcbacff30/?f=cs

6.GPT-NeoX-20B: An Open-Source Autoregressive Language Model

This paper introduces GPT-NeoX-20B, an open source autoregressive language model with 20 billion parameters trained on the Pile. The model's weights are freely and publicly released under an open source license. To the authors' knowledge, it was the largest dense autoregressive model with publicly available weights at the time of submission. The paper describes the architecture and training of GPT-NeoX-20B and evaluates its performance on language understanding, mathematics, and knowledge-based tasks.

Paper link: https://www.aminer.cn/pub/6258e26b5aee126c0fbc7a9a/?f=cs

7.CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

This paper studies program synthesis, which aims to generate computer programs from natural language descriptions or input-output examples. While large language models have advanced program synthesis, limited training resources and data have restricted public access to such models. To address this, the authors train and release CODEGEN, a family of large language models of up to 16.1B parameters trained on natural language and programming language data, and open source the training library, JAXFORMER. The model performs well on the zero-shot Python code generation benchmark HumanEval, demonstrating its practicality. In addition, the paper investigates a multi-turn program synthesis paradigm, in which a single program is specified over multiple turns by decomposing the user's intent into sub-problems. To verify the effectiveness of this paradigm, the authors build an open benchmark called MTPB consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Analysis on MTPB shows that providing the same intent to CODEGEN over multiple turns significantly improves program synthesis compared to providing it in a single turn. The training library JAXFORMER and model checkpoints are open sourced at https://github.com/salesforce/CodeGen.
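
As an illustration of the multi-turn idea, here is a minimal sketch (not the paper's evaluation code) that assumes the public Salesforce/codegen-350M-mono checkpoint on Hugging Face: a task is split into sub-prompts, and each turn's output becomes part of the context for the next turn.

```python
# Minimal multi-turn generation sketch (assumes `transformers` is installed;
# Salesforce/codegen-350M-mono is a small public CodeGen checkpoint).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The user's intent, decomposed into two sub-prompts (turns).
turns = [
    "# Step 1: write a function that loads a CSV file into a list of rows.\n",
    "# Step 2: write a function that computes the mean of a numeric column.\n",
]

context = ""
for turn in turns:
    context += turn
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    # The decoded output contains the context so far plus the newly generated
    # code; it becomes the context for the next turn.
    context = tokenizer.decode(outputs[0], skip_special_tokens=True) + "\n"

print(context)
```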

Paper link: https://www.aminer.cn/pub/6241273e5aee126c0f292b68/?f=cs

8.Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

This paper explores whether natural language processing models can generalize to a wide variety of unseen tasks when given task instructions. To study this question, the authors first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks together with expert-written instructions. The collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large-scale task collection enables rigorous evaluation of cross-task generalization under instructions: models are trained to follow instructions on a subset of tasks and evaluated on the remaining unseen ones. The authors also build Tk-Instruct, a Transformer model trained to follow a variety of in-context instructions (plain task definitions or k-shot examples). On the benchmark, Tk-Instruct outperforms existing instruction-following models such as InstructGPT despite being an order of magnitude smaller. The authors further analyze how generalization varies with the number of observed training tasks, the number of instances per task, and model size. They hope the data and models will facilitate progress toward more general-purpose NLP models.
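
For concreteness, here is a minimal sketch of the declarative instruction format the benchmark relies on (the field wording below is illustrative, not copied from the dataset): a natural-language task definition, optional in-context examples, and a new instance, flattened into a single prompt that can be fed to an instruction-following model such as Tk-Instruct.

```python
# Minimal sketch of a definition-plus-examples instruction prompt (the exact
# field wording used by Super-NaturalInstructions may differ).
definition = ("Definition: Given a sentence, classify its sentiment as "
              "'positive' or 'negative'.")
positive_example = ("Positive Example:\n"
                    "Input: I loved this film.\n"
                    "Output: positive")
new_instance = ("Now complete the following example:\n"
                "Input: The plot was dull and the acting was worse.\n"
                "Output:")

prompt = "\n\n".join([definition, positive_example, new_instance])
print(prompt)  # feed this string to any released Tk-Instruct checkpoint
```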

Paper link: https://www.aminer.cn/pub/625e1a335aee126c0feca4ca/?f=cs

9.UL2: Unifying Language Learning Paradigms

This paper presents a unified framework for pre-training language models that is universally effective across datasets and setups. The approach begins by disentangling architectural archetypes from pre-training objectives, two concepts that are often conflated, and shows that different pre-training objectives can be cast as one another and that interpolating between objectives can be effective. The paper then proposes Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms, and introduces a notion of mode switching, in which downstream fine-tuning is associated with a specific pre-training scheme. Extensive ablation experiments comparing multiple pre-training objectives show that this method pushes the Pareto frontier across many diverse setups, outperforming T5- and GPT-like models in each setting. By scaling the model up to 20 billion parameters, it achieves state-of-the-art performance on 50 widely used supervised fine-tuning NLP tasks. The model also performs well at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On zero-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an attractive choice for studying reasoning at a small-to-medium 20B-parameter scale. Finally, FLAN instruction tuning is applied to the model, yielding MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B are released.
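
The sketch below illustrates the Mixture-of-Denoisers idea in simplified form (it is not the paper's implementation, and the corruption rates, span handling, and mode-token strings are illustrative only): each training example is corrupted by one of three denoiser families, and a paradigm/mode token tells the model which regime it is in.

```python
# Simplified Mixture-of-Denoisers sketch: R = regular span denoising,
# X = extreme denoising (much heavier corruption), S = sequential prefix-LM.
# Real implementations mask multiple spans with tuned rates; this masks a
# single contiguous span purely for illustration.
import random

def corrupt_single_span(tokens, corruption_rate):
    """Mask one contiguous span; return (corrupted input, denoising target)."""
    n_mask = max(1, int(len(tokens) * corruption_rate))
    start = random.randint(0, len(tokens) - n_mask)
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + n_mask:]
    targets = ["<extra_id_0>"] + tokens[start:start + n_mask]
    return inputs, targets

def mixture_of_denoisers(tokens):
    mode = random.choice(["[R]", "[X]", "[S]"])  # illustrative mode tokens
    if mode == "[R]":    # regular denoising: modest corruption
        inputs, targets = corrupt_single_span(tokens, corruption_rate=0.15)
    elif mode == "[X]":  # extreme denoising: a much larger corrupted portion
        inputs, targets = corrupt_single_span(tokens, corruption_rate=0.5)
    else:                # sequential: predict the suffix from the prefix
        split = len(tokens) // 2
        inputs, targets = tokens[:split], tokens[split:]
    return [mode] + inputs, targets

example = "different pre training objectives can be cast as span denoising tasks".split()
print(mixture_of_denoisers(example))
```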

Paper link: https://www.aminer.cn/pub/627c6cf55aee126c0f831748/?f=cs

10.OPT: Open Pre-trained Transformer Language Models

This paper presents Open Pre-trained Transformer (OPT) language models, a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. These models can perform zero-shot and few-shot learning and roughly match the performance of existing language models such as GPT-3, while OPT's development required only about 1/7 of GPT-3's carbon footprint. The authors also release the code needed to run the experiments, along with a logbook of the infrastructure challenges they faced. Through this work, researchers can better understand the inner workings of large language models, providing a firmer foundation for future research.

Paper link: https://www.aminer.cn/pub/62708f625aee126c0fa694a0/?f=cs

11.No Language Left Behind: Scaling Human-Centered Machine Translation

This paper notes that machine translation, with its goal of eliminating language barriers at scale, has become a key focus of artificial intelligence research. However, these efforts have mainly concentrated on a small subset of languages, while most low-resource languages have been neglected. To address this, the researchers first conducted exploratory interviews with native speakers to understand the need for low-resource language translation support. They then created datasets and models designed to narrow the performance gap between low- and high-resource languages. Specifically, they developed a conditional-compute model based on a Sparsely Gated Mixture of Experts, trained on data obtained with novel data mining techniques tailored to low-resource languages. They proposed multiple architectural and training improvements to counteract the overfitting that arises when training on thousands of translation directions. Importantly, they evaluated more than 40,000 translation directions with the human-translated benchmark Flores-200 and combined human evaluation with a new toxicity benchmark covering all Flores-200 languages to assess translation safety. Their model improves BLEU by 44% relative to the previous state of the art, laying an important foundation for a universal translation system.
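
Below is a minimal usage sketch (not from the paper) that assumes the public facebook/nllb-200-distilled-600M checkpoint on Hugging Face; the full NLLB-200 MoE model follows the same interface. Source and target languages are selected with FLORES-200-style language codes.

```python
# Minimal NLLB translation sketch (assumes `transformers` is installed;
# facebook/nllb-200-distilled-600M is a small distilled public checkpoint).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Machine translation should not leave low-resource languages behind."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force the decoder to start in the target language (Swahili here).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```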

Paper link: https://www.aminer.cn/pub/62cce6795aee126c0f2a85b2/?f=cs

12.BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

This paper addresses a problem in the development and use of large language models (LLMs): although LLMs can perform new tasks from a few demonstrations or natural language instructions, most of them are developed by resource-rich organizations and are often not available to the public. To help democratize this powerful technology, the authors introduce BLOOM, a 176B-parameter open-access language model designed and built collaboratively by hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural languages and 13 programming languages (59 languages in total). The authors find that BLOOM achieves competitive performance on a wide variety of benchmarks, with even stronger results after multitask prompted fine-tuning. To facilitate future research and applications of LLMs, the model and code are publicly released.

Paper link: https://www.aminer.cn/pub/636c6bec90e50fcafd2d3ff2/?f=cs

13.GLM-130B: An Open Bilingual Pre-trained Model

This article introduces GLM-130B, an open bilingual (English and Chinese) pre-trained language model with 130 billion parameters. The model is intended to be at least as good as GPT-3 and to unveil how models of this scale can be successfully pre-trained. During development, the authors faced many unexpected technical and engineering challenges, especially loss spikes and divergence. This article describes the training process of GLM-130B, including its design choices, training strategies for efficiency and stability, and engineering efforts. The resulting GLM-130B significantly outperforms GPT-3 175B on a wide range of popular English benchmarks, an advantage that is not observed for OPT-175B or BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, on related benchmarks. Finally, the authors leverage a unique scaling property of GLM-130B to achieve INT4 quantization with almost no performance loss, making it the first 100B-scale model to do so. The model weights have been made public, and the code, training logs, related toolkits, and lessons learned are open sourced at https://github.com/THUDM/GLM-130B.
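
To give a feel for what INT4 weight quantization involves, here is a generic illustration (not the GLM-130B implementation, whose quantization scheme and scaling details differ): weights are rounded to 4-bit signed integers with a per-row scale and dequantized on the fly when used.

```python
# Generic absmax-style INT4 weight quantization sketch (illustrative only;
# values are stored in int8 here since PyTorch has no native 4-bit tensor type).
import torch

def quantize_int4(weight: torch.Tensor):
    # Per-row symmetric quantization into the signed 4-bit range [-7, 7].
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```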

Paper link: https://www.aminer.cn/pub/633e476890e50fcafde59595/?f=cs

14.Scaling Instruction-Finetuned Language Models

This paper explores instruction finetuning of language models, with particular emphasis on scaling the number of tasks, scaling model size, and finetuning on chain-of-thought data. The study shows that instruction finetuning with these ingredients markedly improves a variety of model families (such as PaLM, T5, and U-PaLM) and evaluation settings (zero-shot, few-shot, and chain-of-thought). For example, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by an average of 9.4%. Flan-PaLM 540B also reaches state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. The authors additionally release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pre-trained language models.
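
A minimal usage sketch follows (not from the paper), assuming the public google/flan-t5-small checkpoint on Hugging Face: with an instruction-finetuned model, the task is phrased as a plain instruction rather than a task-specific prefix.

```python
# Minimal Flan-T5 sketch (assumes `transformers` is installed; flan-t5-small
# is the smallest public checkpoint -- larger ones follow the same interface).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = ("Answer the following question by reasoning step by step. "
          "If a train travels 60 miles in 1.5 hours, what is its average speed?")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```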

Paper link: https://www.aminer.cn/pub/63520de890e50fcafd60f4dd/?f=cs

15.Crosslingual Generalization through Multitask Finetuning

This paper explores crosslingual generalization through multitask finetuning, using it to improve the ability of large language models to generalize to new tasks. Previous work has shown that multitask prompted finetuning (MTF) helps large language models generalize to new tasks in a zero-shot setting, but exploration of MTF has so far focused on English data and models. This paper applies MTF to the pre-trained multilingual BLOOM and mT5 model families, producing finetuned variants called BLOOMZ and mT0. The authors find that finetuning a large multilingual language model on English tasks with English prompts allows it to generalize to non-English languages that appear only in the pre-training corpus. Finetuning on multilingual tasks with English prompts further improves performance on both English and non-English tasks, leading to various state-of-the-art zero-shot results. The authors also study finetuning on machine-translated prompts, where English prompts are translated into the language of each dataset, and find that training on these machine-translated prompts improves performance on human-written prompts in the corresponding languages. Surprisingly, the models achieve zero-shot generalization to tasks in languages they have never intentionally seen, suggesting they learn higher-level capabilities that are both task- and language-agnostic. In addition, the paper introduces xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. The code, datasets, and models are publicly available at https://github.com/bigscience-workshop/xmtf.

Paper link: https://www.aminer.cn/pub/636482d790e50fcafdccae4e/?f=cs

16.Galactica: A Large Language Model for Science

This paper introduces Galactica, a large language model that can store, combine, and reason about scientific knowledge. The model is trained on a large corpus of scientific papers, references, knowledge bases, and other sources, and outperforms existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3, scoring 68.2% versus 49.0%. Galactica also excels at reasoning, beating Chinchilla on mathematical MMLU with 41.3% and PaLM 540B on MATH with 20.4%. It also sets a new state of the art on downstream tasks, such as the dev versions of the PubMedQA and MedMCQA question-answering tasks, reaching 77.6% and 52.9% accuracy, respectively. Despite not being trained on a general corpus, Galactica still outperforms models such as BLOOM and OPT-175B. The authors believe these results demonstrate the potential of language models as a new interface for science, and they release the model publicly for the benefit of the scientific community.

Paper link: https://www.aminer.cn/pub/6375a67190e50fcafd3e1d4a/?f=cs

17.OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

This paper investigates instruction-tuning: fine-tuning a large pre-trained language model on a collection of tasks described via instructions to improve its zero-shot and few-shot generalization to unseen tasks. However, relatively little is understood about how different decisions made during instruction-tuning affect downstream performance. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training with specialized datasets for reasoning and dialogue, and the fine-tuning objectives themselves. To study this, the authors create OPT-IML Bench, a large instruction-tuning benchmark of 2,000 natural language processing tasks consolidated into task categories from 8 existing benchmarks. They also prepare an evaluation framework that measures three types of generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances of seen tasks. Using this framework, they first characterize the effect of instruction-tuning decisions applied to OPT-30B, and then use these insights to train OPT-IML 30B and 175B, instruction-tuned versions of OPT. OPT-IML exhibits all three generalization abilities on four evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks, it is also highly competitive with existing models fine-tuned on each specific benchmark. OPT-IML is released at both scales, together with the OPT-IML Bench evaluation framework.

Paper link: https://www.aminer.cn/pub/63a910a290e50fcafd2a84fd/?f=cs

19.Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

This article introduces Pythia, a suite designed to support in-depth analysis of how large language models (LLMs) develop and evolve over the course of training, and how these patterns change as models scale. The suite contains 16 LLMs, all trained on the same public data in the same order and ranging in size from 70M to 12B parameters. It provides 154 checkpoints per model, along with tools to download and exactly reconstruct the training data for further study. The article presents several case studies, including novel results on memorization, the effect of term frequency on few-shot performance, and reducing gender bias. By showing how this highly controlled setup can yield new insights into LLMs and their training dynamics, the authors demonstrate that Pythia can facilitate research in this area. All trained models, analysis code, training code, and training data are available at https://github.com/EleutherAI/pythia.
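
A minimal sketch of how the suite's intermediate checkpoints are typically accessed (based on the public EleutherAI/pythia checkpoints on Hugging Face; the step names below assume the documented step-numbered revisions): the same model is loaded at different training steps via the revision argument and compared.

```python
# Minimal sketch: load the same Pythia model at two training steps and compare
# generations (assumes `transformers`; revisions follow the "step<N>" naming).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-70m-deduped"
for step in ["step1000", "step143000"]:  # an early and the final checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=step)
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=step)
    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=5,
                             pad_token_id=tokenizer.eos_token_id)
    print(step, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
```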

Paper link: https://www.aminer.cn/pub/642ce6f390e50fcafde74c79/?f=cs

20.LLaMA: Open and Efficient Foundation Language Models

This article introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens, showing that it is possible to train state-of-the-art models using only publicly available datasets, without resorting to proprietary and inaccessible data. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The authors release all of the models to the research community.

Paper link: https://www.aminer.cn/pub/63fd715e90e50fcafd14767c/?f=cs

21.CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X

This paper introduces CodeGeeX, a multilingual code generation model with 13 billion parameters, pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Experiments show that CodeGeeX outperforms multilingual code models of similar scale on both code generation and translation tasks. The authors also build HumanEval-X, a benchmark for evaluating multilingual code models, with hand-written solutions in C++, Java, JavaScript, and Go. In addition, they develop CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, which generate 4.7 billion tokens per week for tens of thousands of active users. A user study shows that CodeGeeX helps 83.4% of its users improve their coding efficiency. Finally, CodeGeeX is publicly accessible, and in September 2022 the code, model weights (the version trained on 850B tokens), API, extensions, and the HumanEval-X benchmark were open sourced on GitHub.

Paper link: https://www.aminer.cn/pub/64264f7b90e50fcafd68e145/?f=cs

22.MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

This paper explores how to use more advanced large language models (LLMs) to enhance vision-language understanding. The authors introduce MiniGPT-4, which aligns a frozen visual encoder with the frozen LLM Vicuna through a single projection layer. The work shows that MiniGPT-4 exhibits many capabilities similar to GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts. The authors also observe emerging abilities such as writing stories and poems inspired by images and solving problems shown in pictures. In experiments, they found that pre-training on raw image-text pairs alone can produce unnatural language output lacking coherence, including repetition and fragmented sentences. To address this, a second-stage fine-tuning is performed on a high-quality, well-aligned dataset using a conversational template. This step proves crucial for improving the model's generation reliability and overall usability. Notably, the model is highly computationally efficient, since only a projection layer is trained, using roughly 5 million aligned image-text pairs.
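
The sketch below illustrates the alignment idea schematically (it is not the released MiniGPT-4 code, and the feature dimensions are illustrative): a single trainable linear projection maps frozen vision-encoder features into the frozen LLM's embedding space, and the projected visual tokens are prepended to the text token embeddings.

```python
# Schematic sketch of vision-to-LLM alignment via one trainable projection
# layer (dimensions are illustrative, not MiniGPT-4's actual sizes).
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable piece

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor):
        # vision_feats: (batch, num_visual_tokens, vision_dim) from a frozen encoder
        # text_embeds:  (batch, seq_len, llm_dim) from the frozen LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        # Prepend projected visual tokens to the text sequence fed to the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=1)

projector = VisionToLLMProjector()
fused = projector(torch.randn(1, 32, 768), torch.randn(1, 16, 4096))
print(fused.shape)  # torch.Size([1, 48, 4096])
```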

Paper link: https://www.aminer.cn/pub/6442336c4c80727584270e42/?f=cs

23.Alpaca: A Strong, Replicable Instruction-Following Model

This paper introduces Alpaca, a strong, replicable instruction-following model. As instruction-following models such as GPT-3.5 (text-davinci-003), ChatGPT, Claude, and Bing Chat have become more powerful, many users now interact with them regularly and even use them for work. Yet despite their widespread deployment, these models still have notable flaws: they can generate misinformation, propagate social stereotypes, and produce toxic language. Alpaca, fine-tuned from Meta's LLaMA 7B model on 52K instruction-following demonstrations generated with text-davinci-003, behaves qualitatively similarly to text-davinci-003 while being small and inexpensive to reproduce, giving the academic community a model with which to study these issues.

Paper link: https://www.aminer.cn/pub/64eef34b12da7235fe62adac/?f=cs

24.Llama 2: Open Foundation and Fine-Tuned Chat Models

This paper introduces Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Llama 2-Chat, the variant optimized for dialogue use cases, outperforms open source chat models on most benchmarks tested and, based on the authors' human evaluations of helpfulness and safety, may be a suitable substitute for closed-source models. The paper describes the fine-tuning methodology and safety improvements of Llama 2-Chat in detail so that the community can build on this work and contribute to the responsible development of LLMs.

Paper link: https://www.aminer.cn/pub/64b758dd1a5852438b7976ff/?f=cs

Origin blog.csdn.net/AI_Conf/article/details/133019675