Technical Report: Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

Introduction

The author first notes that models such as ChatGPT have recently shown strong performance on the path toward AGI, but their closed-source nature and heavy compute requirements hinder research.

Meta then open-sourced LLaMA, and Stanford released Alpaca, which gave hope to open research.

The author then points out that these models were trained primarily on English corpora: the vocabulary contains only a few hundred Chinese characters, so their Chinese performance is poor. The author goes on to demonstrate the possibility of improving LLaMA and Alpaca in other languages by expanding the vocabulary and related methods.

The paper makes the following main contributions:

  1. Extended the original LLaMA/Alpaca vocabulary with 20,000 additional Chinese tokens.
  2. Used LoRA to reduce the compute required for training.
  3. Evaluated the Chinese performance of the resulting LLaMA and Alpaca models.
  4. Open-sourced the research results and resources.

Chinese LLaMA

LLaMA was pre-trained on roughly 1.4T tokens, but its Chinese ability is poor: although LLaMA supports byte fallback for Chinese characters, raw bytes cannot represent Chinese well. To address this, the author makes the following improvements:

  1. To encode Chinese text more effectively, the author uses SentencePiece to train a new Chinese tokenizer on Chinese corpora and merges it with the original LLaMA vocabulary.
  2. The embedding matrices are resized to fit the new vocabulary. To avoid disturbing the original tokens, the new vectors are appended at the end of the original embedding matrices (see the sketch after this list).
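
A minimal sketch of how such a merge-and-resize could be done with the sentencepiece and transformers libraries; the checkpoint and file names (llama-7b, chinese_sp.model, merged_sp.model) are placeholders rather than the paper's actual artifacts:

```python
# Sketch: merge a newly trained Chinese SentencePiece vocabulary into the LLaMA
# tokenizer, then resize the model embeddings so new rows are appended after
# the original ones. Paths and checkpoint names are hypothetical.
from transformers import LlamaTokenizer, LlamaForCausalLM
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama_tokenizer = LlamaTokenizer.from_pretrained("llama-7b")

# Load both SentencePiece models as protobufs.
llama_spm = sp_pb2.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2.ModelProto()
with open("chinese_sp.model", "rb") as f:
    chinese_spm.ParseFromString(f.read())

# Append Chinese pieces that are not already in the LLaMA vocabulary.
existing = {p.piece for p in llama_spm.pieces}
for p in chinese_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0
        llama_spm.pieces.append(new_piece)

with open("merged_sp.model", "wb") as f:
    f.write(llama_spm.SerializeToString())

# Load the merged tokenizer and extend the embedding / LM-head matrices;
# resize_token_embeddings keeps the original rows and appends new ones.
merged_tokenizer = LlamaTokenizer(vocab_file="merged_sp.model")
model = LlamaForCausalLM.from_pretrained("llama-7b")
model.resize_token_embeddings(len(merged_tokenizer))
```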

Preliminary experiments show that the merged tokenizer encodes Chinese text much more efficiently: the number of tokens needed for the same text is roughly halved (i.e., the encoding efficiency is nearly doubled).
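
A quick way to check this kind of claim, assuming both tokenizers are available locally (the checkpoint names below are placeholders):

```python
# Compare how many tokens each tokenizer needs for the same Chinese sentence.
from transformers import LlamaTokenizer

original = LlamaTokenizer.from_pretrained("llama-7b")         # original LLaMA tokenizer
chinese = LlamaTokenizer.from_pretrained("chinese-llama-7b")  # merged Chinese tokenizer

text = "人工智能是计算机科学的一个分支。"
print("original:", len(original.tokenize(text)))
print("merged:  ", len(chinese.tokenize(text)))
# The merged tokenizer is expected to use roughly half as many tokens.
```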

Chinese Alpaca

After obtaining Chinese LLaMA, instruction fine-tuning is applied to obtain Chinese Alpaca, using the following prompt format:
[Figure: prompt template for Chinese Alpaca instruction data]
The difference from the original Alpaca template is that there is no separate input field (which, in my view, better matches the Chinese question-and-answer style). If a downstream example does contain an input, it is joined to the instruction with "\n", where "\n" is treated as an additional padding token.
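
A hedged sketch of this prompt construction; the template wording below follows the Stanford Alpaca style and is illustrative rather than the paper's exact string:

```python
# Build a prompt with no separate "input" field; if an example has an input,
# join it to the instruction with "\n" as described above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction: str, input_text: str = "") -> str:
    if input_text:
        instruction = instruction + "\n" + input_text  # "\n" joins instruction and input
    return PROMPT_TEMPLATE.format(instruction=instruction)

print(build_prompt("Translate the following sentence into Chinese.", "Hello, world."))
```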

LoRA Fine-tuning

LoRA is used for parameter-efficient training throughout: both in the pre-training stage (LLaMA → Chinese LLaMA) and in the instruction fine-tuning stage (Chinese LLaMA → Chinese Alpaca).

Experiments

7B

Pre-Training

Phase 1: The parameters of the transformer blocks are frozen and only the embeddings are trained, adapting the newly added Chinese word vectors while minimizing disturbance to the original model.
Phase 2: LoRA weights (adapters) are added to the attention mechanism, and the embeddings, LM head, and newly added LoRA parameters are trained.
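
A minimal sketch of a phase-2 setup with the peft library; the module names follow the Hugging Face LLaMA implementation, and the rank/alpha/dropout values and checkpoint name are illustrative rather than the paper's hyperparameters:

```python
# LoRA adapters on the attention projections, with embeddings and LM head
# kept fully trainable (phase 2 of pre-training, as described above).
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained("chinese-llama-7b")  # hypothetical checkpoint

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    modules_to_save=["embed_tokens", "lm_head"],              # train embeddings and LM head fully
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```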

Instruction-Tuning

After obtaining the pre-trained model, LoRA is again used for efficient fine-tuning, with the number of trainable parameters increased by also adding LoRA adapters to the MLP layers. Roughly 2M data points, including crawled SFT data, are used to tune the 7B model.
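
Under the same assumptions as the previous sketch, the instruction-tuning configuration might simply widen the set of target modules to cover the MLP projections:

```python
from peft import LoraConfig

# LoRA adapters now also cover the MLP projections, increasing the number of
# trainable parameters (values remain illustrative).
sft_lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```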

13B

Pre-Training

The pre-training process for the 13B model is basically the same as for the 7B model, except that stage 1 is skipped: LoRA is applied directly to the attention and MLP layers during training, while the embeddings and LM head are set as trainable.

Instruction-Tuning

The LoRA settings and trainable parameters remain the same as in the pre-training phase. An additional 1M crawled self-instruct data points are used for fine-tuning the 13B model, resulting in a total of about 3M instruction data points for the 13B model.

Hyperparameters:
[Table: hyperparameter settings for pre-training and instruction fine-tuning]

Origin: blog.csdn.net/qq_18555105/article/details/130263062