Chinese LLaMA and Alpaca Large Language Models: An Open-Source Solution | Expanding the Chinese Vocabulary & Efficiently Encoding Chinese Corpora


Title: Efficient and Effective Text Encoding for Chinese Llama and Alpaca

PDF: https://arxiv.org/pdf/2304.08177v1.pdf

Code: https://github.com/ymcui/Chinese-LLaMA-Alpaca

Overview

Large language models (LLMs) such as ChatGPT and GPT-4 have revolutionized natural language processing research. However, the expensive training and deployment of LLMs pose challenges for transparent and open academic research. To address these problems, this project open-sources Chinese LLaMA and Alpaca large language models and emphasizes instruction fine-tuning. The original LLaMA vocabulary is expanded with 20K additional Chinese tokens, which increases encoding efficiency and improves basic semantic understanding of Chinese. By further pre-training on Chinese data and fine-tuning with Chinese instruction data, the models' ability to understand and execute instructions is greatly improved.

Introduction

The field of natural language processing (NLP) has undergone a revolutionary paradigm shift with the advent of large language models (LLMs). Characterized by their large scale and extensive training data, these models have demonstrated a strong ability to understand and generate human-like text. Unlike pre-trained language models geared toward text understanding, such as BERT, the GPT series focuses on text generation. As the latest LLMs in the GPT family, ChatGPT and GPT-4 have attracted a great deal of attention and have become the most representative, powerful models in this fast-growing field.

ChatGPT (OpenAI, 2022) is an advanced conversational AI model built on the GPT-3.5 architecture that can carry out context-aware, human-like interactions. Its success paved the way for GPT-4 (OpenAI, 2023), a more sophisticated LLM with even greater potential in natural language understanding, generation, and various other NLP tasks. These two models have opened up new research and application directions and sparked interest in exploring the capabilities of artificial general intelligence (AGI). These LLMs not only show impressive performance on multiple benchmarks, but also demonstrate the ability to learn and adapt to new tasks from small amounts of data.

Although LLMs are exceptionally powerful, these models have certain limitations:

  1. They are proprietary, which limits outside access to the source code of these models and hampers the ability of the wider research community to build upon their success.
  2. The enormous computing resources required to train and deploy these large language models are also a challenge for researchers with limited resources.

In response to these limitations, the natural language processing research community has turned to open-source alternatives. The most well known of these are LLaMA and Alpaca, where Alpaca is obtained by further fine-tuning LLaMA on instruction data. These open-source LLMs are designed to facilitate academic research and accelerate progress in the field of natural language processing. By open-sourcing these models, the community aims to create an environment that encourages model development, fine-tuning, and evaluation, ultimately building powerful LLMs for use in a variety of applications.

Existing large language models suffer from restricted access and heavy resource requirements, so academia has turned to open-source alternatives such as LLaMA and Alpaca to facilitate greater transparency and collaboration. However, these open-source models still struggle with Chinese tasks because their vocabularies contain only a few hundred Chinese tokens, which greatly reduces the efficiency of encoding and decoding Chinese text. To this end, the project proposes improved Chinese LLaMA and Alpaca models that expand the vocabulary with 20K additional Chinese tokens, thereby improving the models' ability to process and generate Chinese text. At the same time, the low-rank adaptation (LoRA) method is adopted to ensure efficient training and deployment, providing a reference for adapting models to other languages. This work lays a foundation for generalizing LLaMA and Alpaca to other languages, along with methods for extending their vocabularies and improving their performance.

The contributions of this technical report are as follows:

  • Enhanced Chinese encoding and decoding efficiency and improved LLaMA's Chinese comprehension by adding 20K Chinese tokens to the original LLaMA vocabulary.
  • Adopted the Low-Rank Adaptation (LoRA) method to achieve efficient training and deployment of the Chinese LLaMA and Alpaca models, enabling researchers to use them without incurring excessive computational costs.
  • Evaluated the performance of the Chinese Alpaca 7B and 13B models on various natural language understanding (NLU) and natural language generation (NLG) tasks, achieving significant improvements on Chinese tasks compared with the original LLaMA.
  • Made the resources and results of this research public, facilitating further research and collaboration within the NLP community and encouraging the adaptation of the LLaMA and Alpaca models to other languages.

Chinese LLaMA

LLaMA is a decoder-only foundational large language model based on the transformer architecture. Like other transformer-based large language models, LLaMA consists of an embedding layer, multiple transformer blocks, and a language model head. It also incorporates several improvements, such as pre-normalization, the SwiGLU activation, and rotary embeddings. LLaMA's total parameter count ranges from 7B to 65B. Experimental results show that LLaMA is quite competitive with other large language models such as GPT-3 while maintaining a smaller model size.
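For readers who want to inspect these components directly, the sketch below loads a LLaMA checkpoint through the Hugging Face transformers port and prints the relevant configuration fields; the checkpoint path is a placeholder rather than an official release.

```python
# Minimal sketch: inspect the decoder-only LLaMA architecture via the
# Hugging Face transformers port. "path/to/llama-7b-hf" is a placeholder.
from transformers import LlamaConfig, LlamaForCausalLM, LlamaTokenizer

model_path = "path/to/llama-7b-hf"

config = LlamaConfig.from_pretrained(model_path)
print(config.vocab_size)         # 32,000 for the original LLaMA tokenizer
print(config.hidden_size)        # embedding width H
print(config.num_hidden_layers)  # number of transformer blocks
print(config.hidden_act)         # "silu", the gate activation inside SwiGLU

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)  # embeddings + blocks + LM head
```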

LLaMA was pre-trained on 1T to 1.4T tokens from publicly available corpora, most of which are in English, so its ability to understand and generate Chinese is limited. To address this issue, the project further pre-trains LLaMA on a Chinese corpus to enhance its basic Chinese understanding and generation capabilities.

However, pre-training LLaMA on a Chinese corpus poses its own challenges:

  1. The original LLaMA tokenizer vocabulary contains fewer than a thousand Chinese characters. Although the tokenizer can cover all Chinese characters by falling back to bytes, this fallback strategy significantly increases the sequence length and reduces the efficiency of processing Chinese text.
  2. Byte tokens are used to represent not only Chinese characters but also other UTF-8 characters, which makes it difficult for byte tokens to learn the semantic meaning of Chinese characters.

To solve these problems, the researchers expand the Chinese vocabulary of the LLaMA tokenizer in two steps:

  1. Train a Chinese tokenizer on a Chinese corpus with SentencePiece, using a vocabulary size of 20,000. The Chinese tokenizer is then merged with the original LLaMA tokenizer by combining their vocabularies, yielding a merged tokenizer, called the Chinese LLaMA tokenizer, with a vocabulary size of 49,953.
  2. To fit the new tokenizer, the word embedding matrix and language model head are resized from shape V × H to V' × H, where V = 32,000 is the original vocabulary size and V' = 49,953 is the Chinese LLaMA tokenizer vocabulary size. The new rows are appended to the end of the original embedding matrix, ensuring that the embeddings of tokens in the original vocabulary are unaffected. (A sketch of both steps is shown after this list.)
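A condensed sketch of both steps is shown below. It uses sentencepiece and transformers; the file paths are placeholders, and the project's actual merge script (published in its GitHub repository) should be consulted for the authoritative details.

```python
# Condensed sketch of the two steps described above (placeholder paths,
# not the project's actual merge script):
#   1) train a 20K-piece Chinese SentencePiece model and merge it into the
#      original LLaMA tokenizer;
#   2) resize the word embeddings / LM head from V x H to V' x H so that
#      new rows are appended and the original 32,000 rows are untouched.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaForCausalLM, LlamaTokenizer

# 1) train the Chinese tokenizer, then merge the vocabularies
spm.SentencePieceTrainer.train(
    input="zh_corpus.txt", model_prefix="chinese_sp_20k", vocab_size=20000
)

llama_sp = sp_pb2.ModelProto()
llama_sp.ParseFromString(open("llama/tokenizer.model", "rb").read())
chinese_sp = sp_pb2.ModelProto()
chinese_sp.ParseFromString(open("chinese_sp_20k.model", "rb").read())

existing = {p.piece for p in llama_sp.pieces}
for p in chinese_sp.pieces:
    if p.piece not in existing:            # keep only genuinely new pieces
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0.0
        llama_sp.pieces.append(new_piece)

with open("chinese_llama_tokenizer.model", "wb") as f:
    f.write(llama_sp.SerializeToString())  # merged vocabulary (~49,953 pieces)

# 2) resize the embedding matrix and LM head to match the merged vocabulary
tokenizer = LlamaTokenizer(vocab_file="chinese_llama_tokenizer.model")
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")
model.resize_token_embeddings(len(tokenizer))  # new rows appended at the end
```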

The technical report points out that the Chinese LLaMA tokenizer produces roughly half as many tokens as the original LLaMA tokenizer. Compared with the original LLaMA tokenizer, the Chinese LLaMA tokenizer significantly reduces the encoding length, showing that the proposed method is effective at improving the model's handling of Chinese text. In the standard language modeling task, the Chinese LLaMA model is then pre-trained with the Chinese LLaMA tokenizer to predict the next token in an autoregressive manner, further improving its Chinese understanding and generation capabilities.

For a given input token sequence $x = (x_0, x_1, x_2, \ldots)$, the model is trained in an autoregressive manner to predict the next token, with the goal of minimizing the following negative log-likelihood:

$$\mathcal{L}(\Theta) = \sum_{i} -\log p(x_i \mid x_0, x_1, \ldots, x_{i-1}; \Theta)$$

Here, $\Theta$ denotes the model parameters, $x_i$ is the token to be predicted, and $x_0, x_1, \ldots, x_{i-1}$ is its context. The model predicts the probability distribution of the next token from the context it has seen so far, and the parameters are learned by minimizing this negative log-likelihood.
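As a concrete illustration of this objective, the snippet below is a minimal PyTorch sketch (not the project's training code) that computes the shifted next-token cross-entropy, i.e. the average negative log-likelihood above, from a batch of logits and input ids.

```python
# Minimal sketch of the autoregressive objective: shift the targets by one
# position and average -log p(x_i | x_0..x_{i-1}; Theta) over all tokens.
# Illustrative PyTorch, not the project's actual training loop.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions made at positions 0..n-2
    shift_labels = input_ids[:, 1:]    # the "next token" at positions 1..n-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```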

Table 1. The number of tokens generated by the Chinese LLaMA tokenizer is reduced by about half compared with the original tokenizer
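To make the reduction in Table 1 concrete, a comparison along the following lines can be run locally; the paths are placeholders for locally available tokenizer files, and the exact counts depend on the input text.

```python
# Sketch: count the tokens produced by the original LLaMA tokenizer versus
# the merged Chinese LLaMA tokenizer on the same Chinese sentence.
# Both paths are placeholders for locally available files.
from transformers import LlamaTokenizer

original = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")
chinese = LlamaTokenizer.from_pretrained("path/to/chinese-llama-7b")

text = "大规模语言模型正在改变自然语言处理研究。"
print(len(original.tokenize(text)))  # byte fallback inflates the length
print(len(chinese.tokenize(text)))   # roughly half as many tokens (cf. Table 1)
```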

Chinese Alpaca

After obtaining the pre-trained Chinese LLaMA model, training continues with instruction fine-tuning, following the approach used in Alpaca. Each training sample consists of an instruction and an output, combined according to a prompt template.
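The exact template is published with the project's code; as a reference, the sketch below follows the original Stanford Alpaca prompt format that this fine-tuning recipe builds on, with the precise wording and the helper function treated as assumptions rather than the project's verbatim template.

```python
# Sketch of an Alpaca-style prompt template (wording assumed from the
# original Stanford Alpaca project, not copied from this paper's release).
# Each training sample fills in {instruction}; the model is trained to
# generate the text that follows "### Response:".
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)

def build_sample(instruction: str, output: str) -> str:
    """Compose one instruction-tuning sample from an (instruction, output) pair."""
    return PROMPT_TEMPLATE.format(instruction=instruction) + output
```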

Experiment Settings

Pre-training stage

First, a large-scale Chinese corpus of about 20GB is used, similar to the corpora used to pre-train Chinese BERT-wwm, MacBERT, LERT, and other models. Pre-training is divided into two stages:

  1. Fix the parameters of the transformer blocks in the model and train only the word embeddings, adapting the newly added Chinese word vectors while minimizing interference with the original model.
  2. Add LoRA adapters to the attention mechanism and train the word embeddings, the language model head, and the newly added LoRA parameters (a sketch of this stage follows this list).
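The second stage can be sketched with the Hugging Face peft library as follows; the rank, dropout, and module names (q_proj, k_proj, v_proj, o_proj, embed_tokens, lm_head) follow the common LLaMA port and are illustrative, not the project's exact settings.

```python
# Sketch of pre-training stage 2: attach LoRA adapters to the attention
# projections while keeping the word embeddings and LM head trainable.
# Hyperparameters and module names are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("path/to/chinese-llama-7b")  # placeholder

lora_config = LoraConfig(
    r=8,                      # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, alongside LoRA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only LoRA weights, embeddings, and LM head
```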

Instruction Fine-tuning stage

The self-instruct method is used to automatically obtain training data from ChatGPT (the gpt-3.5-turbo API). A list of the hyperparameters is provided below, and details of the fine-tuning data are given in Table 3. The authors have made the templates and code publicly available on GitHub.
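A minimal sketch of this data-collection step is shown below, written against the pre-1.0 openai Python package that exposed the gpt-3.5-turbo chat API at the time; the seed prompt is purely illustrative and not the project's actual self-instruct template.

```python
# Sketch of self-instruct style data collection from gpt-3.5-turbo
# (pre-1.0 openai package). The seed prompt is illustrative only.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

seed_prompt = "请生成20条多样化的中文指令及对应的回答，输出为JSON格式。"  # illustrative seed

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": seed_prompt}],
    temperature=1.0,
)
print(response["choices"][0]["message"]["content"])  # raw instruction/output pairs
```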

Table 2 below shows the list of relevant hyperparameters for training 7B and 13B models

Table 2. List of hyperparameters for training the 7B and 13B LLaMA models

Experimental Evaluation Method

This project uses GPT-4 as a scoring tool. However, GPT-4 does not always provide accurate scores, so the authors manually check its scores and adjust them where necessary. These manual checks ensure that the scores are consistent and reflect the true performance of the models being evaluated. The authors use a fixed evaluation template (released with the project's code on GitHub) as the input to GPT-4 for scoring.

Adopting GPT-4 as the scorer, combined with human inspection, establishes a reliable evaluation framework that effectively measures the performance of the Chinese Alpaca models on various natural language understanding and generation tasks. The evaluation set includes 160 samples covering 10 different tasks, including question answering, reasoning, literature, entertainment, translation, multi-turn dialogue, coding, and ethics. The total score for each task is the sum of the scores of all samples within that task, normalized to a 100-point scale.
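The per-task aggregation can be sketched as simple arithmetic; the 10-point per-sample maximum below is an illustrative assumption, since the report's exact per-sample scale is not reproduced in this summary.

```python
# Sketch of the scoring arithmetic: sum the per-sample scores within a task
# and normalize to a 100-point scale. The 10-point per-sample maximum is an
# assumption for illustration.
def normalize_task_score(sample_scores, max_per_sample=10):
    total = sum(sample_scores)
    return 100 * total / (max_per_sample * len(sample_scores))

print(normalize_task_score([8, 6, 9, 7]))  # -> 75.0
```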

Experimental Results

The above results show that the Chinese Alpaca-13B model outperforms the 7B model in all tasks, highlighting the benefits of increased model capacity. In the question answering task, the Chinese Alpaca-13B model scored 77, while the 7B model scored 53. In terms of open-ended question answering, the 13B model scored 73 and the 7B model scored 64. In the numerical reasoning task, the 13B model scored 50, while the 7B model scored only 23.

The following example shows a comparison of the 7B and 13B Chinese LLaMA models under the same prompt:

Conclusion

This article introduces open-source Chinese LLaMA and Alpaca large language models and expands the original LLaMA vocabulary with 20K additional Chinese tokens. The Chinese LLaMA tokenizer produces roughly half as many tokens as the original tokenizer, which increases the efficiency of Chinese encoding and improves basic semantic understanding of Chinese. Readers who need to train Chinese LLaMA-style large language models can use this work as a reference.


