[LLM] Half a day of training for a few hundred dollars yields results comparable to mainstream large models: an open-source, commercially usable domain-specific LLM solution

 The most significant difference between LLaMA-1 and LLaMA-2 is the inclusion of a higher quality corpus, which is a key factor leading to the significantly enhanced performance of LLaMA-2. This, combined with its commercial availability, expands the potential for creative applications of large models within the open source community. 

However, pre-training a large model from scratch is widely considered cost-prohibitive, humorously described as a game only those with "$50 million" to spare can play. This deters many companies and developers, so how can you build your own large model at a lower cost?

The Colossal-AI team is at the forefront of cost reduction and efficiency improvement for large models, maximizing the core capabilities of LLaMA-2. Through innovative training techniques, Colossal-AI used only about 8.5 billion tokens of data, 15 hours of training, and a compute cost of a few hundred dollars to achieve remarkable results: a high-performance Chinese LLaMA-2 model that consistently outperforms competitors on multiple evaluation benchmarks.

Compared with the original LLaMA-2, Colossal-AI's model not only enhanced its Chinese ability, but also further improved its English proficiency. Notably, it demonstrates performance levels that are comparable to similarly sized state-of-the-art (SOTA) models in the open source community.

Colossal-AI's approach is grounded in strong open-source principles: the model carries no commercial-use restrictions, and the entire training process, code, and model weights are fully open. Colossal-AI also provides ColossalEval, a comprehensive evaluation framework that supports cost-effective reproducibility.

In addition, the methods developed by Colossal-AI can easily be applied to other domains to economically build large models pre-trained from scratch.

Open source code and weights are available at:

GitHub - hpcaitech/ColossalAI: Making large AI models cheaper, faster and more accessible

 

Performance is as follows:

[Figure: benchmark results comparing Colossal-LLaMA-2-7B-base with other models]

Note: Scores are based on ColossalEval; scores in parentheses come from the official leaderboard results of the corresponding models, and C-Eval scores come from the official leaderboard.

On the common English benchmarks, most notably MMLU, Colossal-LLaMA-2-7B-base overcomes catastrophic forgetting through low-cost continual pre-training: its score improves steadily from 44.47 to 53.06, an excellent result among all 7B-scale models.

On the Chinese benchmarks, the comparison mainly covers CMMLU, AGIEval, GAOKAO, and C-Eval. Colossal-LLaMA-2 performs significantly better than other Chinese-localized models based on LLaMA-2. Even compared with well-known models trained from scratch on Chinese corpora, potentially at a cost of millions of dollars, Colossal-LLaMA-2 still stands out at the same scale. Notably, its Chinese proficiency leaps far beyond that of the original LLaMA-2 (CMMLU: 32.97 -> 49.89).

Furthermore, fine-tuning methods such as SFT and LoRA are limited in how much new knowledge and capability they can inject into the base model, so they are not well suited to building high-quality domain-specific knowledge or specialized model applications.

To better evaluate the model's performance, the Colossal-AI team relies not only on quantitative metrics but also performs manual evaluation of different aspects of the model. Here are some examples:

[Figures: examples from the manual evaluation of the model's outputs]

Looking at the full training loss curve, it is clear that model convergence is well preserved while leveraging the cost-saving capabilities of the Colossal-AI system. The model achieves these results with a training dataset of only about 8.5 billion tokens and a compute cost of a few hundred dollars. By contrast, many large models on the market require training on trillions of tokens to be effective, at a far higher cost.

[Figure: training loss curve of Colossal-LLaMA-2]

So, how did the Colossal-AI team reduce training costs and achieve such impressive results?

Vocabulary expansion and model initialization

The original vocabulary of LLaMA-2 is not optimized specifically for Chinese, and the Chinese vocabulary is limited, resulting in insufficient understanding of Chinese data. Therefore, the first step involved expanding the vocabulary of LLaMA-2.

The Colossal-AI team found: 

  1. Vocabulary expansion not only significantly improves the encoding efficiency of string sequences, but also enriches the encoded sequences with more meaningful information, which proves very beneficial for document-level encoding and understanding.
  2. However, because the amount of continual pre-training data is limited, expanding the vocabulary too widely may leave some words or combinations without practical meaning, making them hard to learn effectively from the continual pre-training dataset and hurting final performance.
  3. An excessively large vocabulary also increases the number of embedding-related parameters and reduces training efficiency.

Therefore, after weighing training quality against efficiency and running extensive experiments, the Colossal-AI team decided to expand the vocabulary of LLaMA-2 from the original 32,000 tokens to 69,104.
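As an illustration of the encoding-efficiency point above, the short sketch below counts how many tokens the original LLaMA-2 tokenizer and an extended tokenizer need for the same Chinese sentence. Both the model name and the local path are placeholders, not the team's released artifacts.

```python
# Illustrative check of encoding efficiency: count tokens for the same Chinese sentence
# under the original LLaMA-2 tokenizer and an extended tokenizer. Both the model name
# and the local path are placeholders, not the team's released artifacts.
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
extended = AutoTokenizer.from_pretrained("./llama2-extended-69104-tokenizer")  # hypothetical path

text = "大语言模型的持续预训练需要高质量的中文语料。"
for name, tok in [("original", original), ("extended", extended)]:
    ids = tok.encode(text, add_special_tokens=False)
    # Fewer tokens per character means higher encoding efficiency for Chinese text.
    print(f"{name}: {len(ids)} tokens ({len(ids) / len(text):.2f} tokens per character)")
```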

With the extended vocabulary in place, the next step is to initialize the embeddings of the new vocabulary entries from the original LLaMA-2 weights. To transfer capabilities smoothly from the original LLaMA-2 to the Chinese LLaMA-2 while leaving English proficiency unaffected in the initial state, the Colossal-AI team initialized the new embeddings with the mean of the original embedding weights. This not only retains the model's English capabilities but also eases their transfer to the Chinese model.
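A minimal sketch of this mean-initialization idea, assuming a Hugging Face LLaMA-2 checkpoint whose tokenizer has already been extended to 69,104 entries; it mirrors the description above but is not the team's exact implementation.

```python
# Minimal sketch of mean-initializing the new vocabulary rows, assuming a Hugging Face
# LLaMA-2 checkpoint and an already-extended tokenizer; not the Colossal-AI team's exact code.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
old_vocab_size = model.get_input_embeddings().weight.shape[0]   # 32,000 for LLaMA-2
new_vocab_size = 69104                                           # extended size reported above

model.resize_token_embeddings(new_vocab_size)

with torch.no_grad():
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight
    # Every newly added row starts from the average of the original rows, so the
    # model's initial behaviour on English text is essentially unchanged.
    in_emb[old_vocab_size:] = in_emb[:old_vocab_size].mean(dim=0, keepdim=True)
    out_emb[old_vocab_size:] = out_emb[:old_vocab_size].mean(dim=0, keepdim=True)
```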

Data construction

High-quality data plays a key role in further reducing training costs; continual pre-training in particular places strict requirements on data quality and distribution. To better screen for high-quality data, the Colossal-AI team built a complete data-cleaning system and toolkit for selecting higher-quality data for continual pre-training.

The figure below shows the complete data governance process of the Colossal-AI team:

[Figure: the Colossal-AI team's complete data governance pipeline]

In addition to heuristic filtering and deduplication, key data is also scored and filtered by classification. Well-chosen data is crucial for stimulating LLaMA-2's Chinese ability while overcoming catastrophic forgetting in English.
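For illustration, here is a toy sketch of the kind of heuristic filtering and exact deduplication such a pipeline might perform; the thresholds and rules are assumptions, not the team's actual cleaning criteria.

```python
# Toy sketch of heuristic filtering plus exact deduplication; thresholds and rules
# are illustrative assumptions, not the team's actual cleaning criteria.
import hashlib

def heuristic_ok(doc: str) -> bool:
    """Simple quality heuristics: length bounds and alphanumeric-character ratio."""
    if not (16 <= len(doc) <= 100_000):
        return False
    alnum = sum(ch.isalnum() for ch in doc)      # CJK characters count as alphanumeric
    return alnum / len(doc) > 0.5                # drop symbol-heavy boilerplate

def deduplicate(docs):
    """Exact deduplication by content hash; real pipelines often add fuzzy dedup (e.g. MinHash)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# cleaned = [d for d in deduplicate(raw_docs) if heuristic_ok(d)]
```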

Finally, to improve training efficiency on data of the same subject, the Colossal-AI team sorted the data by length and concatenated it into sequences up to the maximum length of 4,096 tokens.
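A rough sketch of this length-sort-and-pack step might look like the following; the separator handling and greedy packing policy are assumptions rather than the team's exact code.

```python
# Rough sketch of length-sorted greedy packing into 4096-token training sequences;
# separator handling and the packing policy are assumptions, not the team's exact code.
MAX_LEN = 4096

def pack_sequences(tokenized_docs, eos_id, max_len=MAX_LEN):
    """tokenized_docs: list of token-id lists from the same subject/category."""
    packed, current = [], []
    for doc in sorted(tokenized_docs, key=len):
        piece = doc + [eos_id]                   # separate documents with an EOS token
        if current and len(current) + len(piece) > max_len:
            packed.append(current)               # flush the current sequence
            current = []
        current.extend(piece[:max_len])          # clip documents longer than max_len
    if current:
        packed.append(current)
    return packed
```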

Training Strategy

Multi-stage Training

For training, the Colossal-AI team designed a multi-stage, hierarchical continual pre-training plan tailored to the characteristics of continual pre-training, dividing the process into three stages:

[Figure: the three-stage continual pre-training scheme]

  1. Large-scale pre-training stage: The goal of this stage is to train the model on a large corpus until it produces reasonably fluent text. This stage was completed by the original LLaMA-2; after it, the model has mastered a large amount of English knowledge and can produce fluent output via next-token prediction.
  2. Chinese knowledge injection stage: This stage relies on high-quality Chinese knowledge. On the one hand, it enhances the model's grasp of Chinese knowledge, and on the other hand, it improves the model's understanding of new words in Chinese vocabulary.
  3. Relevant knowledge replay stage: This stage is dedicated to enhancing the model's understanding and generalization ability of knowledge and alleviating the problem of catastrophic forgetting.

The stages complement each other and ultimately ensure that the model improves in both Chinese and English proficiency in a balanced way.
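For concreteness, the three-stage schedule could be written down as a training configuration like the sketch below; the data-mix ratios and the checkpoint name are illustrative placeholders, not values reported by the Colossal-AI team.

```python
# Hedged sketch of the three-stage schedule as a configuration; the data-mix ratios
# and checkpoint name are illustrative placeholders, not reported values.
STAGES = [
    {
        # Stage 1 is effectively inherited from the original LLaMA-2 pre-training,
        # so continual pre-training simply starts from its released weights.
        "name": "large_scale_pretraining",
        "init_from": "meta-llama/Llama-2-7b-hf",
        "train": False,
    },
    {
        "name": "chinese_knowledge_injection",
        "data_mix": {"zh_high_quality": 0.9, "en_replay": 0.1},  # assumed ratio
        "max_seq_len": 4096,
    },
    {
        "name": "relevant_knowledge_replay",
        "data_mix": {"zh_high_quality": 0.5, "en_replay": 0.5},  # assumed ratio
        "max_seq_len": 4096,
    },
]
```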

Bucket Training

Continual pre-training is extremely sensitive to data distribution, so balance is particularly important. To ensure a balanced distribution, the Colossal-AI team designed a data bucketing strategy that divides each type of data into 10 buckets. During training, every bucket contains one bin of each data type, ensuring that each type of data is consumed evenly by the model.
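A minimal sketch of such a bucketing scheme, assuming each data category arrives as a pre-shuffled list of samples; the helper name and details are illustrative, not the team's implementation.

```python
# Minimal sketch of the 10-bucket balancing idea; helper names and details are illustrative.
NUM_BUCKETS = 10

def make_buckets(categorized_data, num_buckets=NUM_BUCKETS):
    """categorized_data: dict mapping category name -> list of samples."""
    buckets = [[] for _ in range(num_buckets)]
    for category, samples in categorized_data.items():
        bin_size = (len(samples) + num_buckets - 1) // num_buckets
        for i in range(num_buckets):
            # Bucket i receives exactly one bin of every category, so each category
            # is consumed evenly as training walks through the buckets in order.
            buckets[i].extend(samples[i * bin_size:(i + 1) * bin_size])
    return buckets
```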

Evaluation System

To better evaluate model performance, the Colossal-AI team built a complete evaluation framework, ColossalEval, which evaluates large language models along multiple dimensions. The framework and code are fully open source, support reproducible results, and let users customize datasets and evaluation methods for their own application scenarios. Its features are summarized below:

  1. Includes the datasets commonly used to evaluate the knowledge of large language models, such as MMLU and CMMLU. For multiple-choice questions, more comprehensive scoring methods have been added beyond simple ABCD probability comparison, such as exact matching and single-choice perplexity (a sketch of the latter follows this list), to measure a model's knowledge more thoroughly.
  2. Supports both multiple-choice questions and long-text evaluation.
  3. Supports evaluation methods for different application scenarios such as multi-turn dialogue, role play, information extraction, and content generation, so users can selectively evaluate different aspects of a model's capabilities according to their needs. The framework can also be extended with custom prompts and evaluation methods.
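As a generic illustration of perplexity-style multiple-choice scoring mentioned in point 1, the sketch below appends each option to the question and picks the option whose concatenation yields the lowest average token loss. This is not ColossalEval's actual implementation, and the checkpoint path in the usage comment is a placeholder.

```python
# Generic illustration of perplexity-style multiple-choice scoring (not ColossalEval's
# actual implementation): pick the option whose question+option concatenation has the
# lowest mean per-token loss under the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choose_by_perplexity(model, tokenizer, question, options):
    losses = []
    for opt in options:
        enc = tokenizer(question + opt, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())           # mean per-token cross-entropy
    return options[losses.index(min(losses))]

# Example usage (placeholder checkpoint path):
# model = AutoModelForCausalLM.from_pretrained("path/to/colossal-llama-2-7b-base")
# tokenizer = AutoTokenizer.from_pretrained("path/to/colossal-llama-2-7b-base")
# answer = choose_by_perplexity(model, tokenizer, "中国的首都是", ["北京", "上海", "广州", "深圳"])
```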

A bridge from general large-scale models to domain-specific large-scale models

Judging from the experience of the Colossal-AI team, building the Chinese version of LLaMA-2 can be summarized as the following process:

[Figure: the overall process of building the Chinese LLaMA-2]

So, can this process be reused?

The answer is yes, and it has important implications for practical deployment scenarios.

With the AI wave driven by ChatGPT, Internet giants, AI companies, startups, universities, and research institutes are all actively competing to build general-purpose large models. Behind their generality, however, these models often lack domain-specific knowledge, which makes practical application a serious problem. While application-specific fine-tuning brings some benefit, the lack of a strong domain-specific base model can create performance bottlenecks in application deployment.

Being able to quickly and cost-effectively build large domain-specific models and then fine-tune them for specific business needs will undoubtedly advance application deployment and provide a competitive advantage. 

Applying the above process to transfer knowledge into any domain makes it possible to cost-effectively build lightweight domain-specific base models:

[Figure: applying the same process to build domain-specific base models]

To build a base model entirely from scratch, you can likewise draw on the experience above and on Colossal-AI's cost-reduction and efficiency techniques to reach that goal efficiently at minimal cost.

System Optimization

Colossal-LLaMA-2’s impressive performance and cost advantages build on Colossal-AI, the low-cost AI large model development system. 

Colossal-AI is based on PyTorch and uses efficient multi-dimensional parallelism, heterogeneous memory management, and other techniques to reduce the cost of training, fine-tuning, and deploying large AI models, while improving model task performance and lowering GPU requirements. In just over a year it has earned more than 30,000 GitHub stars, ranking first worldwide among large-model development tools and communities, and it has worked with many Fortune 500 and other well-known companies to develop and optimize models at the tens-of-billions and hundreds-of-billions parameter scale and to build domain-specific models.

Colossal-AI Cloud Platform

In order to further improve the efficiency of large model development and deployment, Colossal-AI has been upgraded to the Colossal-AI cloud platform. The platform allows users to cost-effectively train, fine-tune, and deploy large models in the cloud through a low-code/no-code approach, while quickly integrating models for personalized applications.

Currently, the Colossal-AI cloud platform comes pre-installed with mainstream models and solutions such as Stable Diffusion and LLaMA-2. Users only need to upload their own data for fine-tuning, and the fine-tuned model can then be deployed as an API.

Translated from: One half-day of training using a few hundred dollars yields similar results to mainstream large models, open-source and commercial-free domain-specific LLM solution

 


Origin: blog.csdn.net/sikh_0529/article/details/133357624