[World Premiere] InternLM-20B, the 20-billion-parameter Shusheng·Puyu model, is now open source!


On September 20, Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory), together with SenseTime, the Chinese University of Hong Kong, and Fudan University, officially released InternLM-20B, the 20-billion-parameter version of the Shusheng·Puyu large model (InternLM), open-sourcing it for the first time on Alibaba Cloud's ModelScope platform. At the same time, Shusheng·Puyu's full tool chain for large-model R&D and application has been upgraded across the board; it will remain fully open alongside InternLM-20B, with free commercial licensing for enterprises and developers.

Riding the current wave, the application value of large models has drawn increasing attention. As with any new technology in history, its vitality ultimately depends on whether it can be widely deployed and bring real, positive change to the world. Against this background, Shanghai AI Laboratory teamed up with several institutions to launch InternLM-20B, a middle-weight large model that combines advanced performance with ease of application. With less than one third of the parameters, it reaches the capability level of Llama2-70B, widely regarded as the benchmark among open-source models.

Code library link: https://github.com/InternLM/InternLM

ModelScope community link: https://modelscope.cn/organization/Shanghai_AI_Laboratory

Since its first release in June this year, Shusheng·Puyu has gone through multiple rounds of upgrades and has had a broad impact on the open-source community and industry.


Shusheng·Puyu "Enhanced Edition": more than just a larger parameter count

Compared with the 7B and 13B models previously open-sourced by the domestic community, the 20B model has stronger comprehensive capabilities, particularly in complex reasoning and reflection, and can therefore provide more powerful support for practical applications. At the same time, a 20B-scale model can run inference on a single card, and after low-bit quantization it can run on a single consumer-grade GPU, making it especially convenient to deploy.
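The single-card claim can be checked with simple arithmetic. Below is a minimal sketch; the figures are rough, counting weights only and ignoring KV cache and activation overhead:

```python
# Rough GPU-memory estimate for serving a 20B-parameter model at
# different weight precisions (weights only; real deployments also
# need memory for the KV cache and activations).

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1024**3

N = 20e9  # ~20 billion parameters

fp16 = weight_memory_gb(N, 16)  # ~37 GB: needs a large data-center card
int4 = weight_memory_gb(N, 4)   # ~9 GB: fits a consumer GPU after 4-bit quantization

print(f"fp16 weights: {fp16:.1f} GB")
print(f"int4 weights: {int4:.1f} GB")
```

This is why low-bit quantization is the step that moves a 20B model from data-center hardware onto a single consumer-grade card.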

InternLM-20B is a middle-weight large language model trained from scratch on a 2.3T-token pre-training corpus. Compared with InternLM-7B, the training corpus has undergone more thorough multi-level cleaning and has been supplemented with high-knowledge-density training data to strengthen understanding and reasoning. As a result, InternLM-20B shows significant improvements in understanding, reasoning, mathematics, and programming, the abilities that best test a language model's technical level.

Compared with previous open source models, the capability advantages of InternLM-20B are mainly reflected in:

Excellent overall performance. InternLM-20B delivers outstanding comprehensive performance: it not only leads open-source models of similar scale (including Llama-33B, Llama2-13B, and the mainstream domestic 7B and 13B open-source models), but with less than one third of the parameters also reaches the level of Llama2-70B in evaluations.

Powerful tool-calling capabilities. InternLM-20B expands the model's capability boundary, effectively connecting large models to real-world scenarios. It supports dozens of plug-in types and tens of thousands of API functions, and achieved the best results on the ToolBench evaluation set, with a 63.5% win rate against ChatGPT. InternLM-20B also has code-interpretation and reflective self-correction capabilities, laying a solid technical foundation for building agents.

Longer context. Through multi-stage training expansion, InternLM-20B supports a 16K context length, enabling more effective long-text understanding, long-text generation, and extended multi-turn conversations.

Safer value alignment. Compared with previous versions, InternLM-20B is safer and more reliable in value alignment. During development and training, the research team substantially improved its safety through two-stage alignment based on SFT (supervised fine-tuning) and RLHF (reinforcement learning from human feedback), as well as adversarial training by an expert red team. When users ask biased questions, the model provides positive guidance.

Fully upgraded open-source tools and data systems. The Shusheng·Puyu open-source tool chain has been upgraded across the board into a more complete tool system, including the pre-training framework InternLM-Train, the low-cost fine-tuning framework XTuner, the inference and deployment framework LMDeploy, the evaluation framework OpenCompass, and the agent framework Lagent for scenario applications. Together with the open data platform OpenDataLab, the Shusheng·Puyu tool chain forms a powerful open-source tool and data system, jointly providing full-chain R&D and application support for academia and industry.

Comprehensively upgraded full-chain tool system

Architecture enhancement: deep structure, long context

Over the past several months, domestic institutions have successively open-sourced a number of 7B- and 13B-parameter models, which have achieved good evaluation results. However, researchers have found that these models still fall short on downstream tasks, especially those demanding higher accuracy and stronger reasoning. To better support such tasks, the industry has been calling for a middle-weight open-source model offering stronger understanding, reasoning, and long-text generation capabilities.

With a relatively limited parameter budget, researchers face an important architectural trade-off: increase the model's depth, or its width? Through extensive controlled experiments, the Shusheng·Puyu team found that deeper models are more conducive to developing complex reasoning ability. The researchers therefore set the number of layers to 60, exceeding the 32- or 40-layer designs typically used in 7B and 13B models, while keeping the internal dimension at a moderate 5120. Through these architectural choices, InternLM-20B achieves significant gains in complex reasoning while remaining computationally efficient.
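As a sanity check on the 60-layer, 5120-dimension design, a back-of-the-envelope parameter count can be sketched. The roughly 12·d² weights-per-layer approximation and the vocabulary size below are illustrative assumptions, not InternLM-20B's exact configuration:

```python
# Back-of-the-envelope parameter count for a decoder-only transformer.
# Each layer is approximated as 12 * d_model^2 weights (attention
# projections plus feed-forward); exact totals depend on the FFN width,
# vocabulary size, and attention variant.

def approx_params(n_layers: int, d_model: int, vocab: int = 100_000) -> float:
    per_layer = 12 * d_model**2   # QKV/O projections + FFN, approximate
    embeddings = vocab * d_model  # token embedding table (vocab size assumed)
    return n_layers * per_layer + embeddings

total = approx_params(60, 5120)
print(f"~{total / 1e9:.1f}B parameters")  # lands near the 20B scale
```

The estimate shows that 60 layers at dimension 5120 indeed sits in the 20B range, deeper but narrower than a typical 13B design.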

InternLM-20B also supports longer context lengths. During training, the model's context length was expanded in stages from 2K to 8K; at inference time, Dynamic NTK techniques further extend the supported context length to 16K. The long context provides more headroom for extending the model's capabilities, including tool invocation, code interpretation, and reflective self-correction, and has become a key technical foundation for building agents on top of InternLM-20B.
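The idea behind NTK-aware context extension can be sketched as scaling the rotary-embedding base with sequence length. The exact schedule InternLM-20B uses is not given here; the formula below is the commonly cited NTK-aware scaling, with an assumed head dimension and training length:

```python
# Sketch of dynamic NTK-aware rotary-base scaling: instead of retraining,
# the RoPE base is enlarged as a function of how far the current sequence
# exceeds the training length. The exponent d/(d-2) comes from the
# NTK-aware interpolation derivation.

def dynamic_ntk_base(seq_len: int, train_len: int = 8192,
                     base: float = 10000.0, head_dim: int = 128) -> float:
    scale = max(1.0, seq_len / train_len)  # only scale beyond training length
    return base * scale ** (head_dim / (head_dim - 2))

print(dynamic_ntk_base(8192))   # within training length: unchanged, 10000.0
print(dynamic_ntk_base(16384))  # 2x context: base roughly doubles
```

Because the base changes only when the sequence exceeds the training length, short-context behavior is untouched while long inputs get smoothly stretched position frequencies.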

Comprehensive performance enhancement: Leading in multiple evaluations

Based on the OpenCompass large-model evaluation platform, researchers comprehensively tested and compared InternLM-20B against open-source models of similar scale on 50 mainstream evaluation sets covering five dimensions: language, knowledge, understanding, reasoning, and subject examinations. The results show that InternLM-20B leads open-source 13B models across all dimensions, and its average score not only clearly surpasses Llama-33B but even edges out Llama2-70B, the benchmark among open-source models.
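As an illustration of how such per-dimension scores are typically aggregated, here is a minimal sketch; the dataset names and numbers are invented for the example and are not OpenCompass's actual data or API:

```python
# Aggregating per-dataset scores into per-dimension averages: each
# evaluation set belongs to one capability dimension, and a model's
# dimension score is the mean over that dimension's sets.
from collections import defaultdict
from statistics import mean

# (dimension, dataset) -> score for one hypothetical model
scores = {
    ("language", "WiC"): 52.1,
    ("language", "CHID"): 61.4,
    ("reasoning", "GSM8K"): 52.6,
    ("reasoning", "BBH"): 39.8,
}

by_dim = defaultdict(list)
for (dim, _dataset), s in scores.items():
    by_dim[dim].append(s)

dim_avg = {dim: round(mean(v), 2) for dim, v in by_dim.items()}
print(dim_avg)
```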


Evaluation results of InternLM-20B and similar magnitude open source models based on OpenCompass

The following table shows the average scores in each dimension for mainstream open-source models with 13B or more parameters (red indicates the highest score within the 13B-33B range for each capability dimension). InternLM-20B surpasses Llama2-70B in the overall language and subject-examination evaluations and matches it in reasoning ability, though a certain gap remains in the knowledge dimension. In all of these dimensions, however, InternLM-20B is significantly ahead of the mainstream 13B open-source models.

[Table: average scores by capability dimension for mainstream open-source models with 13B or more parameters]

The following table compares the performance of mainstream open-source models on several important and influential evaluation sets (red indicates the best result within the 13B-33B parameter range):

[Table: results of mainstream open-source models on typical evaluation sets]

The results show that InternLM-20B performs excellently on the comprehensive subject evaluations MMLU, C-Eval, and AGIEval, leading open-source models of similar scale. MMLU is generally regarded as a key indicator of a language model's comprehensive ability; InternLM-20B scores 62.05 on MMLU, close to the level of Llama-65B, while on C-Eval and AGIEval, which include Chinese subject examinations, its performance significantly exceeds Llama2-70B.

Knowledge question-answering evaluations such as BoolQ, TriviaQA, and NaturalQuestions mainly measure a model's grasp of factual knowledge. In this dimension, InternLM-20B surpasses the 13B models and is roughly on par with Llama-33B, though a certain gap remains compared with Llama-65B and Llama2-70B.

CMRC, CSL, and RACE are evaluation sets for encyclopedic knowledge, scientific literature, and students' reading comprehension respectively, while XSum is a challenging summarization benchmark; all of them test a model's understanding ability. In this dimension, InternLM-20B performs outstandingly, surpassing open-source models of all sizes, including Llama2-70B.

Reasoning, especially complex reasoning, is a common weakness of current language models and a key determinant of whether a model can support practical applications. WinoGrande, GSM-8K, PIQA, and BigBench-Hard (BBH) in the table above respectively examine common-sense reasoning, mathematical reasoning, physics-related reasoning, and challenging comprehensive reasoning. InternLM-20B significantly surpasses the mainstream 13B open-source models and comes very close to the reasoning level of heavyweight models such as Llama-65B on the WinoGrande, GSM8K, and PIQA evaluations.

InternLM-20B's programming capability is also significantly improved. On the two typical evaluation sets HumanEval and MBPP, it comprehensively surpasses the mainstream 13B open-source models as well as Llama-33B and Llama-65B, approaching the level of Llama2-70B.

Overall, InternLM-20B leads 13B-scale open-source models in comprehensive capability, approaches or even surpasses Llama-65B on multiple reasoning and programming evaluation sets, and generally surpasses Llama2-70B on Chinese-language evaluations.

Enhanced tool calling: learning tools it has never seen

Tool calling is an important means of extending the capabilities of large language models, and it is one of the key features of OpenAI's recent model releases. The InternLM-20B dialogue model supports output in dozens of domains such as dates, weather, travel, and sports, and can work with tens of thousands of different APIs.

On ToolBench, a tool-calling evaluation set for large models jointly released by Tsinghua University and other institutions, InternLM-20B achieved a 63.5% win rate against ChatGPT, the best result on the leaderboard, demonstrating strong tool-calling capability.


The InternLM-20B model also shows a certain degree of zero-shot generalization. Even for tools it has not seen during training, InternLM-20B can call them to complete tasks based on the tool descriptions and the user's question. For example, given a set of AI tools, the model can plan and reason on its own to answer the user's query.
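The mechanism described above can be illustrated with a minimal tool-dispatch loop. The registry, tool names, and JSON action format below are hypothetical stand-ins, not InternLM's actual tool-calling protocol:

```python
import json

# Minimal tool-dispatch loop: the model emits a structured action
# (tool name + arguments), a dispatcher looks the tool up, runs it,
# and the result is fed back to the model as an observation.

TOOLS = {
    "get_weather": lambda city: f"Sunny, 22C in {city}",
    "get_date": lambda: "2023-09-20",
}

def dispatch(model_action: str) -> str:
    """Parse a JSON action emitted by the model and invoke the named tool."""
    action = json.loads(model_action)
    tool = TOOLS[action["tool"]]
    return tool(*action.get("args", []))

# A model that never saw `get_weather` during training can still call it,
# as long as the tool's name and signature appear in the prompt.
print(dispatch('{"tool": "get_weather", "args": ["Shanghai"]}'))
```

Zero-shot generalization in this setting means the model composes a valid action for an unseen tool purely from its in-prompt description.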

InternLM-20B can independently call tools to complete tasks

Value alignment enhancement: a safer open-source model

Only a large language model that aligns well with human values can truly serve as a "human assistant". InternLM-20B incorporated a large amount of value-aligned data during its iterations, and the research team organized domain experts to conduct multiple rounds of red-team attacks on the model, greatly improving its safety.

When a user asks InternLM-20B a biased question, the model can identify the unsafe elements and provide constructive value guidance in its answer.


Enhanced dialogue capabilities: context length reaches 16K

The context length of InternLM-20B has been expanded to 8K in stages during the training phase, and the context length during inference has been expanded to 16K through means such as Dynamic NTK. Based on the context length of 16K, InternLM-20B can effectively support long text understanding, long text generation and ultra-long conversations.

The following example demonstrates InternLM-20B's long-text understanding: given a recent news article about a well-known coffee brand, the model accurately answers all three questions asked about it.

[Example: InternLM-20B answering questions about a long news article]

InternLM-20B can also produce accurate summaries of long papers and reports. When the researchers fed the Introduction section of the classic ResNet paper into the model, it produced a high-quality summary that accurately captured ResNet's core ideas and experimental results.

[Example: InternLM-20B's summary of the ResNet Introduction]

Full-chain tool system: further consolidated and comprehensively upgraded

In July this year, Shanghai AI Laboratory and SenseTime jointly launched Shusheng·Puyu and became the first in the industry to open-source a full-chain tool system covering data, pre-training, fine-tuning, deployment, and evaluation. After several months of upgrades, Shusheng·Puyu's full-chain open-source tool system has been consolidated and improved, and is now freely available for commercial use.

Data - OpenDataLab open-sources the "Shusheng·Wanjuan" pre-training corpus

Shusheng·Wanjuan is an open-source multi-modal corpus from Shanghai AI Laboratory comprising three parts: a text dataset, an image-text dataset, and a video dataset, with a total volume exceeding 2TB. Shusheng·Wanjuan 1.0 has already been used to train the Shusheng multi-modal models and Shusheng·Puyu. By "digesting" this high-quality corpus, the Shusheng series models show excellent performance on generative tasks such as semantic understanding, knowledge question answering, visual understanding, and visual question answering.

Pre-training - the InternLM efficient pre-training framework

Beyond the models themselves, the InternLM code base open-sources the pre-training framework InternLM-Train. Deep fusion of Transformer operators improves training efficiency, and the novel Hybrid Zero technique achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. Thanks to these deep performance optimizations, the framework achieves high efficiency in thousand-GPU parallel training, reaching industry-leading training performance.

Fine-tuning - InternLM full-parameter fine-tuning and XTuner lightweight fine-tuning

InternLM supports full-parameter fine-tuning for a variety of downstream applications. In addition, XTuner, a low-cost fine-tuning toolbox for large models, was recently open-sourced; it supports a variety of large models and fine-tuning algorithms such as LoRA and QLoRA. With XTuner, a 7B model can be fine-tuned at low cost with as little as 8 GB of GPU memory, and a 20B model can be fine-tuned on a consumer-grade graphics card with 24 GB of memory.
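A rough memory budget shows why QLoRA-style fine-tuning makes a 24 GB card sufficient for a 20B model: the frozen base weights sit in 4-bit, and optimizer state is kept only for the small adapters. The adapter size below is an assumed typical order of magnitude, not XTuner's actual configuration:

```python
# Rough memory budget for QLoRA-style fine-tuning of a 20B model
# (weights and optimizer state only; activations add extra overhead).

def qlora_memory_gb(n_params: float, lora_params: float) -> float:
    base = n_params * 0.5 / 1024**3       # frozen base weights in 4-bit
    adapters = lora_params * 2 / 1024**3  # fp16 trainable LoRA adapters
    optimizer = lora_params * 8 / 1024**3 # Adam moments, adapters only
    return base + adapters + optimizer

mem = qlora_memory_gb(20e9, 50e6)  # ~50M LoRA params assumed
print(f"~{mem:.1f} GB")            # comfortably under a 24 GB card
```

Full fine-tuning, by contrast, would need optimizer state for all 20B parameters, which is far beyond any single consumer GPU.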

Deployment - LMDeploy: efficient inference for models with billions to hundreds of billions of parameters

LMDeploy provides a complete set of solutions for model lightweighting, inference, deployment, and serving. It supports efficient inference for models with billions to hundreds of billions of parameters, and surpasses mainstream community open-source projects such as FasterTransformer, vLLM, and DeepSpeed in throughput and other performance metrics.

Evaluation - OpenCompass: a one-stop, all-round large-model evaluation platform

OpenCompass is Shanghai AI Laboratory's open-source large-model evaluation platform. It provides an evaluation system covering five dimensions (subject examinations, language, knowledge, understanding, and reasoning), supports more than 50 evaluation datasets and some 300,000 evaluation questions, and supports zero-shot, few-shot, and chain-of-thought evaluation, making it currently the most comprehensive open-source evaluation platform. Since its release in July, it has attracted wide attention from academia and industry and has been used in large-model R&D by dozens of companies and research institutions, including Alibaba, Tencent, and Tsinghua University.

Application - Lagent: a lightweight, flexible agent framework

The Shusheng·Puyu team has also open-sourced Lagent, an agent framework that lets users quickly turn a large language model into several types of agents, and provides typical tools to empower them. Lagent supports large language models such as InternLM, Llama, and ChatGPT, and integrates agent paradigms such as ReAct, AutoGPT, and ReWOO. With Lagent's support, these agents can invoke large language models for planning, reasoning, and tool calling, and can reflect and self-correct in a timely manner during execution.
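The ReAct pattern that Lagent integrates can be sketched as a Thought/Action/Observation loop. The scripted stand-in "model" and tool below are purely illustrative; Lagent's real API differs:

```python
# Minimal ReAct-style loop: alternate Thought -> Action -> Observation
# until the model emits a final answer. A real agent framework would
# call an actual LLM where fake_llm stands in.

def fake_llm(transcript: str) -> str:
    """Scripted stand-in for a language model."""
    if "Observation" not in transcript:
        return "Thought: need the date.\nAction: get_date[]"
    return "Final Answer: today is 2023-09-20"

TOOLS = {"get_date": lambda: "2023-09-20"}

def react(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Parse the named tool out of the Action line and run it.
        tool_name = reply.split("Action:", 1)[1].strip().split("[", 1)[0].strip()
        observation = TOOLS[tool_name]()
        transcript += f"\n{reply}\nObservation: {observation}"
    return "no answer"

print(react("What is today's date?"))
```

The loop is where "timely reflection and self-correction" lives: each observation re-enters the transcript, so the model can revise its plan on the next step.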

Building on the Shusheng·Puyu large model, Shanghai AI Laboratory has developed a richer set of downstream applications, which will be shared with academia and industry in the near future.

Facing the new wave of innovation driven by large models, Shanghai AI Laboratory is committed to leading technological progress through original innovation: continuing to build foundation models with more comprehensive capabilities, building a more complete and easier-to-use full-chain tool system, and remaining committed to open source and free commercial use. The aim is to fully empower a thriving AI community ecosystem, help enterprises and research institutions lower the barrier to developing and applying large models, and let the value of large models blossom across all industries.


Source: blog.csdn.net/Datawhale/article/details/133108906