LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读


目录

相关文章

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读

LLMs:在单机CPU+Windows系统上实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署且实现模型推理全流程步骤的图文教程(非常详细)

《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读

ABSTRACT

1、INTRODUCTION

2、Chinese LLaMA

3、Chinese Alpaca

4、Parameter Efficient Fine-Tuning With LoRA使用LoRA进行参数高效微调

5、Experimental Setups实验设置

5.1、Experimental Setups For Pre-Training And Fine-Tuning预训练和微调的实验设置

5.1.1、7B Version版本

5.1.2、13B Version版本

5.2、Experimental Setups For Decoding解码的实验设置

5.3、Deployment On CPU在CPU上部署

5.4、Evaluation And Task Design评估和任务设计

6、Results结果

7、CONCLUSION结论

LIMITATIONS限制


相关文章

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读_一个处女座的程序猿的博客-CSDN博客

LLMs:在单机CPU+Windows系统上实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署且实现模型推理全流程步骤的图文教程(非常详细)

https://yunyaniu.blog.csdn.net/article/details/131016046

《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca》翻译与解读

地址

论文:https://arxiv.org/abs/2304.08177

GitHub地址
GitHub - ymcui/Chinese-LLaMA-Alpaca: 中文LLaMA&Alpaca大语言模型+本地CPU/GPU训练部署 (Chinese LLaMA & Alpaca LLMs)

作者

Yiming Cui∗ [email protected]

Ziqing Yang∗ [email protected]

Xin Yao [email protected]

时间

2023年4月17日

ABSTRACT

Large Language Models (LLMs), such as ChatGPT and GPT-4, have revolutionized natural language processing research and demonstrated potential in Artificial General Intelligence (AGI). However, the expensive training and deployment of LLMs present challenges to transparent and open academic research. To address these issues, this project open-sources the Chinese LLaMA and Alpaca large models, emphasizing instruction fine-tuning. We expand the original LLaMA's Chinese vocabulary by adding 20K Chinese tokens, increasing encoding efficiency and enhancing basic semantic understanding. By incorporating secondary pre-training using Chinese data and fine-tuning with Chinese instruction data, we substantially improve the models' comprehension and execution of instructions. Our pilot study serves as a foundation for researchers adapting LLaMA and Alpaca models to other languages. Resources are made publicly available through GitHub, fostering open research in the Chinese NLP community and beyond. GitHub repository: this https URL

大型语言模型(LLM),如ChatGPT和GPT-4,已经彻底改变了自然语言处理研究,并在通用人工智能(AGI)方面展示了潜力。然而,LLM高昂的训练和部署成本对透明、开放的学术研究构成了挑战。为了解决这些问题,本项目开源了中文LLaMA和Alpaca大模型,并着重强调指令微调。我们在原始LLaMA的词汇表中增加了2万个中文标记,提高了编码效率,增强了基本的语义理解能力。通过结合中文数据的二次预训练和中文指令数据的微调,我们大幅提升了模型对指令的理解和执行能力。我们的试点研究为研究人员将LLaMA和Alpaca模型适配到其他语言奠定了基础。相关资源已通过GitHub公开发布,以促进中文NLP社区及更广泛领域的开放研究。

1、INTRODUCTION

The field of natural language processing (NLP) has undergone a transformative paradigm shift with the advent of Large Language Models (LLMs). These models, characterized by their vast size and extensive training data, have demonstrated remarkable capabilities in understanding and generating human-like text. Unlike pre-trained language models for text understanding, such as BERT (Devlin et al., 2019), the GPT series (Radford et al., 2018) focuses on text generation abilities, making them a more suitable testbed for creativity than their counterparts. As the latest LLMs in the GPT family, ChatGPT and GPT-4 have attracted significant attention and emerged as leading examples in this rapidly evolving domain.

ChatGPT (OpenAI, 2022), built on the GPT-3.5 (Ouyang et al., 2022) architecture, is an advanced conversational AI model that can engage in context-aware, human-like interactions. Its success has paved the way for the development of GPT-4 (OpenAI, 2023), a more sophisticated LLM, which has demonstrated even greater potential in natural language understanding, generation, and various NLP tasks. Both models have opened up new avenues of research and applications, fueling interest in exploring the capabilities of Artificial General Intelligence (AGI). These LLMs have not only shown impressive performance in multiple benchmarks but have also exhibited a capacity for few-shot learning and adapting to novel tasks. As a result, they have significantly contributed to the expansion of NLP research, inspiring researchers and industry professionals alike to explore and leverage their potential in a wide range of applications, from sentiment analysis and machine translation to question-answering systems and beyond.

自然语言处理(NLP)领域随着大型语言模型(LLM)的出现经历了一次变革性的范式转变。这些模型以其庞大的规模和丰富的训练数据为特征,在理解和生成类人文本方面展示出了非凡的能力。与用于文本理解的预训练语言模型(如BERT)不同,GPT系列侧重于文本生成能力,使其比同类模型更适合作为创造力的试验平台。作为GPT家族中最新的LLM,ChatGPT和GPT-4吸引了广泛的关注,并成为这个快速发展领域中的领先示例。

ChatGPT(OpenAI, 2022)基于GPT-3.5(Ouyang et al., 2022)架构构建,是一种先进的对话型AI模型,可以进行上下文感知的类人对话。其成功为GPT-4(OpenAI, 2023)的开发铺平了道路,后者是一个更复杂的LLM,展示了在自然语言理解、生成和各种NLP任务方面更大的潜力。这两个模型开辟了研究和应用领域的新途径,引发了对探索人工通用智能(AGI)能力的兴趣。这些LLM不仅在多个基准测试中表现出色,而且还展示了少样本学习和适应新任务的能力。因此,它们对于NLP研究的拓展做出了重要贡献,激发了研究人员和行业专业人士对在情感分析、机器翻译、问答系统等各种应用中发掘和利用它们潜力的兴趣。

Despite the remarkable advancements brought about by LLMs, these models come with certain limitations that hinder transparent and open research. One of the most notable concerns is their proprietary nature, which restricts access to the models and hampers the ability of the broader research community to build upon their successes. Additionally, the immense computational resources required for training and deploying these models pose a challenge for researchers with limited resources, further exacerbating the accessibility problem.

In response to these limitations, the NLP research community has turned to open-source alternatives to foster greater transparency and collaboration. Two such examples are LLaMA (Touvron et al., 2023) and Alpaca (Taori et al., 2023), where the Alpaca model is further finetuned on LLaMA with instruction data. These open-source LLMs have been designed to facilitate academic research and accelerate progress in the field of NLP. By open-sourcing these models, the NLP community aims to create an environment that encourages further advancements in model development, fine-tuning, and evaluation, ultimately leading to more robust and capable LLMs that can be utilized in a wide range of applications.

Although LLaMA and Alpaca have made significant strides in the world of NLP, they possess inherent limitations when it comes to natively supporting Chinese language tasks. The original models contain only a few hundred Chinese tokens in their vocabularies, significantly hampering their efficiency in encoding and decoding Chinese text. Drawing from our previous work on Chinese BERT series (Cui et al., 2021) and Chinese minority-oriented multilingual pre-trained models (Yang et al., 2022), in this technical report, we propose Chinese LLaMA and Alpaca with enhanced abilities in Chinese understanding and generation. We extend the original LLaMA’s vocabulary with an additional 20K Chinese tokens, substantially improving its ability to process and generate Chinese text. To ensure efficient training and deployment of the Chinese LLaMA and Alpaca models, we adopt the Low-Rank Adaptation (LoRA) approach (Hu et al., 2021), which allows us to train and fine-tune the models without incurring excessive computational costs. Our pilot study in enhancing the Chinese understanding and generation capabilities of LLaMA and Alpaca models can serve as a foundation for researchers seeking to adapt these models to other languages. By demonstrating the feasibility and effectiveness of our approach, we provide insights and methodologies that can be applied to extend the vocabularies and improve the performance of LLaMA and Alpaca models in different languages.

尽管LLM带来了显著的进展,但这些模型存在一些限制,阻碍了透明和开放的研究。其中最显著的问题之一是它们的专有性质,限制了对模型的访问,并阻碍了广大研究社区在其成功基础上的进一步开发。此外,训练和部署这些模型所需的巨大计算资源对于资源有限的研究人员构成了挑战,进一步加剧了可访问性问题。

针对这些限制,NLP研究社区转向开源替代方案,以促进更大的透明度与合作。LLaMA(Touvron et al., 2023)和Alpaca(Taori et al., 2023)就是其中两个例子,其中Alpaca模型是在LLaMA基础上用指令数据进一步微调得到的。这些开源LLM旨在促进学术研究,加速NLP领域的进展。通过开源这些模型,NLP社区希望营造一个鼓励模型开发、微调和评估不断进步的环境,最终得到更强大、更有能力、可用于各种应用的LLM。

尽管LLaMA和Alpaca在NLP领域取得了重要进展,但它们在本地支持中文任务方面存在固有的局限性。原始模型的词汇表中只包含了几百个中文标记,严重影响了它们在编码和解码中文文本方面的效率。借鉴我们之前在中文BERT系列(Cui et al., 2021)和面向中国少数民族的多语言预训练模型(Yang et al., 2022)上的工作,本技术报告中,我们提出了具有增强中文理解和生成能力的中文LLaMA和Alpaca。我们通过额外添加2万个中文标记来扩展原始LLaMA的词汇表,大大提高了处理和生成中文文本的能力。为了确保高效训练和部署中文LLaMA和Alpaca模型,我们采用了低秩自适应(Low-Rank Adaptation,LoRA)方法(Hu et al., 2021),它允许我们在不产生过多计算成本的情况下进行模型的训练和微调。我们在增强LLaMA和Alpaca模型的中文理解和生成能力方面的初步研究可以为希望将这些模型适应其他语言的研究人员提供基础。通过展示我们方法的可行性和有效性,我们提供了可以应用于扩展LLaMA和Alpaca模型不同语言词汇表和提高性能的见解和方法。

In summary, the contributions of this technical report are as follows:

>> We enhance the Chinese encoding and decoding efficiency and improve LLaMA’s Chinese understanding ability by extending the original LLaMA’s vocabulary with an additional 20,000 Chinese tokens.

>> We adopt the Low-Rank Adaptation (LoRA) approach for the efficient training and deployment of the Chinese LLaMA and Alpaca models, enabling researchers to work with these models without incurring excessive computational costs.

>> We evaluate the performance of the Chinese Alpaca 7B and 13B models on a variety of natural language understanding (NLU) and natural language generation (NLG) tasks, demonstrating significant improvements over the original LLaMA counterparts in the context of Chinese language tasks.

>> We make the resources and findings of our study publicly available, fostering further research and collaboration within the NLP community and encouraging the adaptation of LLaMA and Alpaca models to other languages.

总结起来,本技术报告的贡献如下:

>> 我们通过在原始LLaMA词汇表中增加2万个中文标记,提高了中文编码和解码的效率,并改进了LLaMA在中文理解方面的能力。

>> 我们采用低秩自适应(LoRA)方法对中文LLaMA和Alpaca模型进行高效的训练和部署,使研究人员能够在不产生过多计算成本的情况下使用这些模型

>> 我们评估了中文Alpaca 7B和13B模型在各种自然语言理解(NLU)和自然语言生成(NLG)任务上的性能,在中文语言任务的背景下,相比原始LLaMA模型,取得了显著的改进。

>> 我们公开提供了我们研究的资源和发现,促进了NLP社区内的进一步研究和合作,并鼓励将LLaMA和Alpaca模型应用到其他语言中。

2、Chinese LLaMA

LLaMA (Touvron et al., 2023) is a decoder-only, foundational large language model based on the transformer architecture (Vaswani et al., 2017). Similar to other transformer-based LLMs, LLaMA comprises an embedding layer, multiple transformer blocks, and an LM head layer. It also incorporates various improvements, such as Pre-normalization (Zhang & Sennrich, 2019), SwiGLU activation (Shazeer, 2020), and Rotary Embeddings (Su et al., 2021). The total number of parameters in LLaMA ranges from 7B to 65B. Experiments demonstrate that LLaMA achieves competitive performance compared to other LLMs, like GPT-3, while maintaining a smaller model size.

LLaMA has been pre-trained on 1T to 1.4T tokens from publicly available corpora, with the majority of the data in English and only a small fraction in other languages using Latin or Cyrillic scripts. As a result, LLaMA’s ability to understand and generate Chinese is limited. To address this, we propose pre-training the LLaMA model on Chinese corpora to enhance its fundamental Chinese understanding and generation capabilities.

LLaMA(Touvron等人,2023年)是一种基于Transformer架构(Vaswani等人,2017年)的仅解码器的基础大型语言模型。与其他基于Transformer的LLM类似,LLaMA包括嵌入层、多个Transformer块和LM头部层。它还融入了各种改进,如Pre-normalization(Zhang & Sennrich,2019年)、SwiGLU激活(Shazeer,2020年)和Rotary Embeddings(Su等人,2021年)。LLaMA的总参数量在7B到65B之间。实验表明,LLaMA在保持更小模型规模的同时,取得了与其他LLM(如GPT-3)相当的竞争性能。

LLaMA已经在公开可用的语料库中对1T到1.4T个标记进行了预训练,其中大部分数据是英语,只有很小一部分是其他语言,使用拉丁字母或西里尔字母脚本。因此,LLaMA对于理解和生成中文的能力有限。为了解决这个问题,我们建议在中文语料库上对LLaMA模型进行预训练,以增强其对中文的基本理解和生成能力。
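To make the architectural ingredients above concrete, the following is a minimal PyTorch sketch (not the authors' code) of LLaMA-style pre-normalization (RMSNorm) and a SwiGLU feed-forward sub-layer; the class names and layer sizes are illustrative assumptions, and rotary embeddings and attention are omitted for brevity.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Pre-normalization used by LLaMA-style models (simplified)."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP with three weight matrices (gate, up, down); cf. the "MLP" note in Table 2."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# A pre-normalized sub-layer call: x + FFN(RMSNorm(x))
x = torch.randn(1, 16, 4096)
y = x + SwiGLUFeedForward(4096, 11008)(RMSNorm(4096)(x))
```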

Directly pre-training LLaMA on Chinese corpora faces several challenges. Firstly, there are less than one thousand Chinese characters in the original LLaMA tokenizer vocabulary. Although the LLaMA tokenizer supports all Chinese characters by falling back to bytes, this fallback strategy significantly increases sequence length and slows down the processing efficiency on Chinese texts. Moreover, byte tokens are not exclusively designed for representing Chinese characters, as they are also used to represent other UTF-8 tokens, making it difficult for byte tokens to learn the semantic meaning of Chinese characters.

To address these issues, we propose to extend the LLaMA tokenizer with additional Chinese tokens and adapt the model for the new tokenizer (Yang et al., 2022):

  1. To enhance the tokenizer’s support for Chinese text, we first train a Chinese tokenizer with SentencePiece (Kudo & Richardson, 2018) on the Chinese corpus, using a vocabulary size of 20,000. We then merge the Chinese tokenizer into the original LLaMA tokenizer by combining their vocabularies. Ultimately, we obtain a merged tokenizer, which we call the Chinese LLaMA tokenizer, with a vocabulary size of 49,953.
  2. To adapt the model for the Chinese LLaMA tokenizer, we resize the word embeddings and language model head from shape V × H to V′ × H, where V = 32,000 represents the original vocabulary size, and V′ = 49,953 is the vocabulary size of the Chinese LLaMA tokenizer. The new rows are appended to the end of the original embedding matrices, ensuring that the embeddings of the tokens in the original vocabulary remain unaffected.

直接在中文语料库上对LLaMA进行预训练面临一些挑战。首先,原始LLaMA分词器的词汇表中只包含不到一千个中文字符。虽然LLaMA分词器可以通过回退到字节来支持所有中文字符,但这种回退策略会显著增加序列长度,并降低处理中文文本的效率。此外,字节标记并不是专门为表示中文字符而设计的,它们也用于表示其他UTF-8标记,这使得字节标记难以学习到中文字符的语义含义。

为了解决这些问题,我们提议通过添加额外的中文标记来扩展LLaMA分词器,并使模型适应新的分词器(Yang等人,2022年):

  1. 为增强分词器对中文文本的支持,我们首先使用SentencePiece(Kudo&Richardson,2018年)在中文语料库上训练一个中文分词器,使用词汇表大小为20,000。然后,我们将中文分词器与原始LLaMA分词器合并,结合它们的词汇表。最终,我们获得了一个合并的分词器,称为Chinese LLaMA分词器,其词汇表大小为49,953。
  2. 为了使模型适应Chinese LLaMA分词器,我们将词嵌入和语言模型头部的形状从V×H调整为V′×H,其中V = 32,000表示原始词汇表大小,V′ = 49,953表示Chinese LLaMA分词器的词汇表大小。新的行被追加到原始嵌入矩阵的末尾,以确保原始词汇表中标记的嵌入不受影响。
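A minimal sketch of the two steps above, assuming the sentencepiece and Hugging Face transformers libraries; file paths and model names are placeholders, and the released project merges the SentencePiece models directly rather than calling add_tokens, so treat this as an approximation of the described procedure.

```python
import sentencepiece as spm
from transformers import LlamaForCausalLM, LlamaTokenizer

# Step 1: train a 20K-piece Chinese tokenizer on a Chinese corpus (placeholder path),
# then merge its pieces into the original LLaMA vocabulary, skipping duplicates.
spm.SentencePieceTrainer.train(
    input="chinese_corpus.txt", model_prefix="chinese_sp", vocab_size=20000
)

llama_tok = LlamaTokenizer.from_pretrained("original-llama-7b")  # 32,000 tokens
chinese_sp = spm.SentencePieceProcessor(model_file="chinese_sp.model")
new_pieces = [chinese_sp.id_to_piece(i) for i in range(chinese_sp.get_piece_size())]
added = llama_tok.add_tokens([p for p in new_pieces if p not in llama_tok.get_vocab()])
print(f"added {added} pieces; merged vocabulary size: {len(llama_tok)}")  # ~49,953

# Step 2: resize the word embeddings and LM head from V x H to V' x H.
# New rows are appended at the end, so the original embeddings stay untouched.
model = LlamaForCausalLM.from_pretrained("original-llama-7b")
model.resize_token_embeddings(len(llama_tok))
```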

Our preliminary experiments show that the number of tokens generated by the Chinese-LLaMA tokenizer is roughly half of those generated by the original LLaMA tokenizer. Table 1 shows an example comparison between the original LLaMA tokenizer and our Chinese LLaMA tokenizer. As we can see, using the Chinese LLaMA tokenizer significantly reduces the encoding length compared to the original. Given a fixed context length, the model can accommodate about twice as much information, and the generation speed is two times faster compared to the original LLaMA tokenizer. This demonstrates the effectiveness of our proposed approach in enhancing the Chinese understanding and generation capabilities of the LLaMA model.

After completing the aforementioned adaptation steps, we pre-train the Chinese-LLaMA model using the Chinese-LLaMA tokenizer on the standard Causal Language Modeling (CLM) task. Given an input token sequence x = (x0, x1, x2, . . .), the model is trained to predict the next token in an autoregressive manner. The objective is to minimize the following negative log likelihood:

我们的初步实验结果显示,Chinese LLaMA分词器生成的标记数量大约是原始LLaMA分词器生成的标记数量的一半。表1展示了原始LLaMA分词器和我们的Chinese LLaMA分词器之间的示例对比。从中可以看出,使用Chinese LLaMA分词器显著减少了编码长度,相比原始分词器,给定固定的上下文长度,模型可以容纳大约两倍的信息量,并且生成速度比原始LLaMA分词器快两倍。这证明了我们提出的方法在增强LLaMA模型的中文理解和生成能力方面的有效性。

在完成上述适应步骤后,我们使用Chinese LLaMA分词器,在标准的因果语言建模(Causal Language Modeling,CLM)任务上对Chinese LLaMA模型进行预训练。给定输入标记序列x =(x0,x1,x2,...),模型以自回归的方式预测下一个标记,目标是最小化以下负对数似然:
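The equation itself does not survive in this excerpt; the negative log-likelihood it refers to is the standard causal LM objective, which (reconstructed in our notation, as an assumption about the omitted formula) reads:

$$\mathcal{L}_{\mathrm{CLM}}(\Theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[-\sum_{i} \log p\big(x_i \mid x_0, x_1, \ldots, x_{i-1};\, \Theta\big)\Big]$$

where $\mathcal{D}$ is the Chinese pre-training corpus and $\Theta$ denotes the trainable parameters.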

3、Chinese Alpaca

After obtaining the pre-trained Chinese LLaMA model, we follow the approach used in Stanford Alpaca (Taori et al., 2023) to apply self-instructed fine-tuning to train the instruction-following model.

Each example consists of an instruction and an output. We input the instruction into the model and prompt the model to generate the output auto-regressively. This process is similar to the causal language modeling task. We adopt the following prompt template from Stanford Alpaca for self-instructed fine-tuning, which is also utilized during inference:

A key difference between our approach and Stanford Alpaca is that we exclusively use the prompt template designed for examples without an input field, whereas Stanford Alpaca employs two templates for examples with and without an input field separately. If the example contains a non-empty input field, we concatenate the instruction and input with an “\n” to form the new instruction. Note that there is an additional padding token for the Alpaca model, resulting in a vocabulary size of 49,954.

在获得预训练的Chinese LLaMA模型之后,我们按照Stanford Alpaca(Taori等人,2023年)中使用的方法,采用自我指导的微调方法训练指令跟随模型。

每个示例包括一条指令和一个输出。我们将指令输入模型,并提示模型自回归地生成输出。这个过程类似于因果语言建模任务。我们采用了Stanford Alpaca中的以下提示模板来进行自我指导的微调,在推理过程中也使用该模板:

我们的方法与Stanford Alpaca的一个关键区别在于,我们专门使用为没有输入字段的示例设计的提示模板,而Stanford Alpaca分别使用了针对具有和不具有输入字段的示例的两个模板。如果示例包含非空的输入字段,我们将指令和输入用“\n”连接起来形成新的指令。请注意,Alpaca模型还有一个额外的填充标记,导致词汇表大小为49,954。
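The prompt template itself is not reproduced in this excerpt. As an illustration, the sketch below uses the well-known Stanford Alpaca "no input" template (its exact wording should be checked against the Alpaca repository) together with the "\n" concatenation rule described above; build_prompt is a hypothetical helper, not code from the project.

```python
# Stanford Alpaca "no input" template (assumed wording).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: "
)

def build_prompt(instruction: str, input_text: str = "") -> str:
    # If the example has a non-empty input field, fold it into the
    # instruction with a newline, as described above.
    if input_text:
        instruction = f"{instruction}\n{input_text}"
    return PROMPT_TEMPLATE.format(instruction=instruction)

print(build_prompt("把下面的句子翻译成英文", "今天天气很好。"))
```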

4、Parameter Efficient Fine-Tuning With LoRA使用LoRA进行参数高效微调

Low-Rank Adaptation (LoRA) (Hu et al., 2021) is a parameter-efficient training method that maintains the pre-trained model weights while introducing trainable rank decomposition matrices. This approach significantly reduces the number of trainable parameters. The general formulation of LoRA is represented in the following equation, where r is the pre-determined rank, d is the hidden size, and A and B are the decomposed trainable matrices:

h = W_0 x + \Delta W x = W_0 x + B A x, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times d} \qquad (3)

低秩适应(LoRA)(Hu等,2021年)是一种参数高效的训练方法,它在保持预训练模型权重的同时引入可训练的秩分解矩阵。这种方法显著减少了可训练参数的数量。LoRA的一般公式如下所示,其中r是预定的秩,d是隐藏大小,A和B是分解的可训练矩阵:

h = W_0 x + \Delta W x = W_0 x + B A x, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times d} \qquad (3)

To achieve parameter-efficient training while adhering to a tight budget, we apply LoRA to the Chinese-LLaMA/Alpaca models in all our experiments, including both pre-training and fine-tuning stages. We primarily incorporate LoRA adapters into the weights of the attention module and, in some cases, additional MLP layers. For further details, please refer to the next section and Table 2.

为了在预算限制下实现参数高效的训练,我们将LoRA应用于所有实验中的Chinese LLaMA/Alpaca模型,包括预训练微调阶段。我们主要将LoRA适配器应用于注意力模块的权重,有时还包括其他MLP层。更多详细信息,请参见下一节和表2。
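A minimal PyTorch sketch of Eq. (3): the base weight W0 stays frozen and only the low-rank factors A and B are trained. The rank matches the r = 8 used in Table 2, while the scaling factor and initialization follow common LoRA practice and are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W0 plus a trainable low-rank update B @ A (Eq. 3)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W0 is frozen
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d x r, zero-init so the update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + B A x (scaled)
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example: wrap the query projection of one attention module (the "QKVO" weights in Table 2).
q_proj = nn.Linear(4096, 4096, bias=False)
h = LoRALinear(q_proj, r=8)(torch.randn(1, 16, 4096))
```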

5、Experimental Setups实验设置

5.1、Experimental Setups For Pre-Training And Fine-Tuning预训练和微调的实验设置

5.1.1、7B Version版本

Pre-training We initialize the Chinese-LLaMA model with the original LLaMA weights and pre-train the model on general Chinese corpora, consistent with the corpora used in Chinese BERT-wwm (Cui et al., 2021), MacBERT (Cui et al., 2020), LERT (Cui et al., 2022), and others, resulting in a 20GB text corpus. The pre-training process consists of two stages:

Stage 1: We fix the parameters of the transformer encoders within the model and only train the embeddings, adapting the newly added Chinese word vectors while minimizing the disturbance to the original model.

Stage 2: We add LoRA weights (adapters) to the attention mechanisms and train the embeddings, LM heads, and newly added LoRA parameters.

预训练:我们使用原始LLaMA模型的权重初始化Chinese LLaMA模型,并在与中文BERT-wwm(Cui等,2021年)、MacBERT(Cui等,2020年)、LERT(Cui等,2022年)等使用的语料库相一致的通用中文语料库上对模型进行预训练,得到一个20GB的文本语料库。预训练过程分为两个阶段:

阶段1:我们固定模型内部的Transformer编码器的参数,只训练嵌入层,适应新增的中文词向量,同时尽量减小对原始模型的干扰

阶段2:我们添加LoRA权重(适配器)到注意力机制中,并训练嵌入层、语言模型头部和新增的LoRA参数。
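A hedged sketch of how the two pre-training stages above could be expressed with the Hugging Face peft library; the module names (embed_tokens, lm_head, q_proj, ...) are the usual LLaMA names in transformers, the lora_alpha value is an assumption (the report only states the rank), and the actual project scripts may differ.

```python
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("chinese-llama-7b-resized")  # placeholder path

# Stage 1: freeze everything except the (resized) embeddings and the LM head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("model.embed_tokens") or name.startswith("lm_head")

# Stage 2: add LoRA adapters on the attention projections (QKVO) and keep the
# embeddings and LM head trainable alongside them.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=32,  # assumption; Table 2 only specifies the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # roughly the ~6% reported in Table 2 for PT Stage 2
```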

Instruction Fine-tuning After obtaining the pre-trained model, we fine-tune it according to Section 3. We also use LoRA for efficient fine-tuning, increasing the number of trainable parameters by adding LoRA adapters to the MLP layers. We utilize approximately 2M data points, including translation (Xu, 2019), pCLUE, Stanford Alpaca, and crawled SFT data for tuning the 7B model.

For the crawled data, we employ the self-instruction (Wang et al., 2022) method for automatically obtaining data from ChatGPT (gpt-3.5-turbo API), as used in Taori et al. (2023). Templates and code details are available on GitHub.

The hyperparameters are listed in Table 2. Detailed information about the fine-tuning data is provided in Table 3.

指令微调:在获得预训练模型后,我们根据第3节对其进行微调。我们同样使用LoRA进行高效微调,通过在MLP层中添加LoRA适配器来增加可训练参数的数量。我们使用了约200万条数据来微调7B模型,包括翻译数据(Xu,2019年)、pCLUE、Stanford Alpaca以及爬取的SFT数据。对于爬取的数据,我们采用自指令(self-instruction)方法(Wang等,2022年)从ChatGPT(gpt-3.5-turbo API)自动获取数据,与Taori等人(2023年)的做法相同。模板和代码细节可以在GitHub上找到。

具体的超参数请参见表2。有关微调数据的详细信息请参见表3。

5.1.2、13B Version版本

Pre-training The pre-training process for the 13B model is largely the same as that of the 7B model, with the exception that we skip stage 1 in the pre-training. We directly apply LoRA to attentions and MLPs for training while setting the embeddings and LM head as trainable.

Instruction Fine-tuning The LoRA settings and trainable parameters remain the same as in the pre-training stage. We use an additional 1M crawled self-instructed data points for the 13B model fine-tuning, resulting in a total data size of 3M for the 13B model.

预训练:13B模型的预训练过程与7B模型基本相同,唯一的区别是我们跳过了预训练的第一阶段。我们直接对注意力机制和MLP进行LoRA训练,同时将嵌入层和语言模型头部设置为可训练

指令微调:LoRA的设置和可训练参数与预训练阶段相同。我们使用额外的1M个自我指导的爬取数据点对13B模型进行微调,从而使13B模型的总数据量达到3M。

The hyperparameters are listed in Table 2.

Table 2: Training recipes for LLaMA (pre-training stages) and Alpaca (instruction SFT stage) 7B and 13B. PT: pre-training. SFT: supervised fine-tuning. QKVO: four matrices (represents query, key, value, and output) in each attention module. MLP: three matrices in each MLP layer.

具体的超参数请参见表2。

表2:LLaMA(预训练阶段)和Alpaca(指令SFT阶段)7B和13B的训练配置。PT:预训练。SFT:监督微调。QKVO:每个注意力模块中的四个矩阵(表示查询、键、值和输出)。MLP:每个MLP层中的三个矩阵。

7B Settings            | PT Stage 1        | PT Stage 2        | Instruction SFT
Batch size             | 1024              | 1024              | 512
Peak learning rate     | 2e-4              | 1e-4              | 1e-4
Training steps         | 3K                | 6K                | 6-10K
Max length             | 512               | 512               | 512
Trainable parameters   | 2.97%             | 6.06%             | 6.22%
LoRA rank              | -                 | 8                 | 8
LoRA weights           | -                 | QKVO              | QKVO, MLP
Training device        | 8 × A100          | 16 × A100         | 16 × A100
Distributed training   | DeepSpeed ZeRO-2  | DeepSpeed ZeRO-2  | DeepSpeed ZeRO-2

13B Settings           | PT                | Instruction SFT
Batch size             | 2304              | 1152
Peak learning rate     | 2e-4              | 1e-4
Training steps         | 7K                | 5.5K
Max length             | 512               | 512
Trainable parameters   | 4.10%             | 4.10%
LoRA rank              | 8                 | 8
LoRA weights           | QKVO, MLP         | QKVO, MLP
Training device        | 48 × A100         | 48 × A100
Distributed training   | DeepSpeed ZeRO-2  | DeepSpeed ZeRO-2

5.2、Experimental Setups For Decoding解码的实验设置

The decoding process of LLMs plays a critical role in determining the quality and diversity of the generated text. In our experiments, we use the following decoding hyperparameters:

LLM的解码过程在确定生成文本的质量和多样性方面起着关键作用。在我们的实验中,我们使用以下解码超参数:

>> Context size: We set the context size to 2048, which determines the maximum number of tokens that the model can consider simultaneously when generating text.

>> Maximum sequence length: We limit the generated sequence length to 512 tokens to ensure that the outputs remain focused and relevant to the input prompt.

>> Temperature: We set the temperature to 0.2, which controls the randomness of the sampling process. Lower values make the model generate more focused and deterministic outputs, while higher values increase diversity at the cost of coherence.

>> Top-k sampling: We use Top-k sampling with k = 40, meaning that the model selects its next token from the top 40 most probable tokens at each step, adding an element of randomness and diversity to the generated text.

>> Top-p sampling: We also employ Top-p sampling with p = 0.9, which further enhances diversity by considering a dynamic set of tokens that collectively account for 90% of the probability mass.

>> Repetition penalty: To discourage the model from generating repetitive text, we apply a repetition penalty with a factor of 1.3, penalizing tokens that have already been selected.

Note that these values may not be optimal for each testing scenario. We did not perform further tuning on these hyperparameters for each task to maintain a balanced view.

>> 上下文大小:我们将上下文大小设置为2048,这决定了模型在生成文本时同时考虑的最大标记数。

>> 最大序列长度:我们将生成的序列长度限制为512个标记,以确保输出保持专注和与输入提示相关。

>> 温度:我们将温度设置为0.2,控制抽样过程的随机性。较低的值使模型生成更专注和确定性的输出,而较高的值增加了多样性,但可能降低连贯性。

>> Top-k抽样:我们使用k=40的Top-k抽样,意味着模型在每个步骤从最有可能的前40个标记中选择下一个标记,从而为生成的文本添加一定的随机性和多样性

>> Top-p抽样:我们还使用p=0.9的Top-p抽样,通过考虑动态的标记集合,这些标记总体上占据了90%的概率质量,进一步增强了多样性

>> 重复惩罚:为了防止模型生成重复的文本,我们应用了一个重复惩罚因子为1.3的机制,对已经被选中的标记进行惩罚。

请注意,这些值可能不适用于每个测试场景。我们没有针对每个任务进一步调整这些超参数,以保持平衡的观点。
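For reference, a minimal sketch of wiring these decoding hyperparameters into the transformers generate() API; the model path and the example prompt are placeholders, not artifacts from the project.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("chinese-alpaca-7b")      # placeholder path
model = LlamaForCausalLM.from_pretrained("chinese-alpaca-7b", torch_dtype=torch.float16)

inputs = tok("请介绍一下大型语言模型。", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,      # maximum generated sequence length
    do_sample=True,
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    repetition_penalty=1.3,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

The 2048-token context size is a property of the model itself rather than a sampling argument, so it does not appear in the call.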

5.3、Deployment On CPU在CPU上部署

Deploying large language models on personal computers, particularly on CPUs, has historically been challenging due to their immense computational requirements. However, with the help of many community efforts, such as llama.cpp (Gerganov, 2023), users can efficiently quantize LLMs into 4-bit forms, significantly reducing memory usage and computational demands, making it easier to deploy LLMs on personal computers. This also enables quicker interactions with the models and facilitates local data processing.

在个人计算机上部署大型语言模型,特别是在CPU上,由于其巨大的计算需求,过去一直是一项挑战。然而,借助众多社区的努力,例如llama.cpp(Gerganov,2023年),用户可以将LLM高效地量化为4比特形式,显著减少内存使用和计算需求,从而更容易在个人计算机上部署LLM。这也使得与模型的交互更加迅速,并便于本地数据处理。

Quantizing LLMs and deploying them on personal computers offer several benefits. Firstly, it helps users protect their data privacy by ensuring that sensitive information remains within their local environment, rather than being transmitted to external servers. Secondly, it democratizes access to LLMs by making them more accessible to users with limited computational resources. Lastly, it promotes the development of new applications and research directions that take advantage of local LLM deployments. Overall, the ability to deploy LLMs on personal computers using llama.cpp (or similar) paves the way for a more versatile and privacy-conscious utilization of LLMs in various domains.

In the following sections, we will use the 4-bit round-to-nearest (RTN) (Yao et al., 2022; Dettmers et al., 2022) quantized Chinese Alpaca for evaluation, which is more realistic from a user perspective rather than a research-oriented view. As a kind reminder, 4-bit quantized models generally perform worse than FP16 or FP32 models.

量化LLM并在个人计算机上部署具有多个好处。首先,它有助于用户通过确保敏感信息保留在本地环境中而不是传输到外部服务器来保护数据隐私。其次,它使具有有限计算资源的用户更容易访问LLM,从而使其更加民主化。最后,它促进了利用本地LLM部署的新应用程序和研究方向的发展。总体而言,使用llama.cpp(或类似工具)在个人计算机上部署LLM为各个领域中更多样化和注重隐私的LLM利用铺平了道路。

在接下来的几节中,我们将使用4比特舍入到最近值(round-to-nearest,RTN)(Yao等,2022年;Dettmers等,2022年)量化的中文Alpaca进行评估,这从用户的角度来看更加符合实际,而不是面向研究的视角。需要提醒的是,4比特量化模型的性能通常低于FP16或FP32模型。
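As an illustration of CPU-side usage, the sketch below assumes the llama-cpp-python binding over llama.cpp and a 4-bit quantized model file produced by llama.cpp's conversion tools; the file name is a placeholder and the parameter names follow that binding's documented call signature, not the project's own deployment scripts.

```python
from llama_cpp import Llama

# Load a 4-bit quantized Chinese Alpaca model file (placeholder name).
llm = Llama(model_path="chinese-alpaca-7b-q4_0.bin", n_ctx=2048)

output = llm(
    "请写一首关于春天的诗。",
    max_tokens=512,
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    repeat_penalty=1.3,
)
print(output["choices"][0]["text"])
```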

5.4、Evaluation And Task Design评估和任务设计

Evaluating the performance of text generation tasks can be challenging due to the significant variety in their form, unlike natural language understanding tasks (such as text classification and extractive machine reading comprehension). Following previous work that utilizes GPT-4 as a scoring method, we also adopt GPT-4 to provide an overall score (on a 10-point scale) for each sample, which is more efficient than human evaluation. However, GPT-4 may not always provide accurate scores, so we perform manual checks on its ratings and adjust them if necessary. The manual checks ensure that the scores are consistent and reflect the true performance of the models being evaluated. We use the following prompt template for scoring the outputs of the systems:

由于文本生成任务的形式存在显著的多样性,评估其性能可能具有挑战性,这一点不同于自然语言理解任务(如文本分类和抽取式机器阅读理解)。参照先前利用GPT-4进行评分的工作,我们也采用GPT-4为每个样本提供一个综合得分(按10分制),这比人工评估更高效。然而,GPT-4并不总是能够提供准确的评分,因此我们对其评分进行人工检查,并在必要时进行调整。人工检查确保得分的一致性,并反映所评估模型的真实性能。我们使用以下提示模板对系统输出进行评分:

By employing GPT-4 as a scoring method in conjunction with manual checks, we establish a reliable evaluation framework that effectively measures the performance of our Chinese Alpaca models on a range of natural language understanding and generation tasks.

Our evaluation set is designed to provide a comprehensive assessment of the Chinese Alpaca models across a wide range of natural language understanding and generation tasks. The set comprises 160 samples, covering 10 distinct tasks, including Question Answering, Reasoning, Literature, Entertainment, Translation, Multi-turn Dialogue, Coding, and Ethics, among others. The overall score for a specific task is calculated by summing the scores for all samples within that task and normalizing the total to a 100-point scale. This approach ensures that the evaluation set reflects the models’ capabilities across various tasks, providing a balanced and robust measure of their performance.

通过将GPT-4作为评分方法与人工检查相结合,我们建立了一个可靠的评估框架,有效地衡量了我们的中文Alpaca模型在各种自然语言理解和生成任务上的性能。

我们的评估集旨在全面评估中文Alpaca模型在各种自然语言理解和生成任务中的性能。该集合包含160个样本,涵盖了10个不同的任务,包括问答、推理、文学、娱乐、翻译、多轮对话、编码和伦理等。特定任务的总体得分是通过对该任务中所有样本的得分求和并将总和归一化到100分制来计算的。这种方法确保评估集反映了模型在各种任务上的能力,提供了一个平衡和稳健的性能度量。
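As a small illustration of the scoring arithmetic described above (per-sample GPT-4 ratings on a 10-point scale, summed per task and rescaled to 100), assuming integer scores from 0 to 10:

```python
def task_score(sample_scores: list[int]) -> float:
    """Normalize a task's GPT-4 ratings (0-10 each) to a 100-point scale."""
    max_total = 10 * len(sample_scores)
    return 100 * sum(sample_scores) / max_total

# Example: a 10-sample task where every sample received 8/10 maps to 80/100.
print(task_score([8] * 10))  # 80.0
```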

6、Results结果

In this section, we present and analyze the results obtained from our experiments with 4-bit quantized Chinese Alpaca-7B and Alpaca-13B models, as shown in Table 4. The evaluation is based on GPT-4 rated results across ten distinct NLP tasks, encompassing a total of 160 samples. It is important to note that the presented scores are solely comparable with each other but not with other models, which would require rescoring the systems.

The performance of both Chinese Alpaca-7B and Alpaca-13B models demonstrates significant improvements over their original LLaMA counterparts. The Chinese Alpaca-13B model consistently outperforms the 7B variant, highlighting the benefits of increased model capacity.

For Question Answering tasks, the Chinese Alpaca-13B achieves a score of 77, compared to 53 for the 7B model. Similar improvements can be observed in Open-ended QA, with scores of 73 and 64 for the 13B and 7B models, respectively. Numerical Reasoning shows a more considerable improvement, with the 13B model scoring 50 compared to 23 for the 7B model.

在本节中,我们展示和分析了我们对4位量化的中文Alpaca-7B和Alpaca-13B模型进行实验的结果,如表4所示。评估基于GPT-4对十个不同的自然语言处理任务中的总共160个样本的评分结果。需要注意的是,所呈现的得分仅可相互比较,而不可与其他模型进行比较,这将需要重新对系统进行评分。

中文Alpaca-7B和Alpaca-13B模型的性能显示出显著的改进,超过了它们的原始LLaMA对应模型。中文Alpaca-13B模型始终优于7B变体,突显了增加模型容量的好处。

在问答任务中,中文Alpaca-13B的得分为77,而7B模型为53。在开放式问答中也可以观察到类似的改进,13B和7B模型的得分分别为73和64。在数字推理中,13B模型的得分为50,而7B模型的得分为23,显示出更为显著的改进。

In the domains of Poetry, Literature, Philosophy, Music, Sports, and Entertainment, the 13B model continues to outperform the 7B model, with scores of 54 and 65 against 31 and 36, respectively. The performance gap remains significant for tasks involving Letters and Articles, Translation, and Multi-turn Dialogue, with the 13B model consistently achieving higher scores. Interestingly, we observe that even though we did not use any multi-turn dialogue data for tuning systems, Chinese Alpaca still has the ability to track conversation history and follow user instructions in a consecutive manner.

Coding tasks exhibit a noticeable improvement, with the Chinese Alpaca-13B scoring 49 compared to 27 for the 7B model. The most striking performance difference can be observed in the Ethics task, where the 13B model achieves a perfect score of 100, in contrast to the 7B model’s score of 50, indicating superior performance in rejecting any unethical user inputs.

在诗歌、文学、哲学、音乐、体育和娱乐领域,13B模型继续优于7B模型,分别得分54和65,而7B模型得分为31和36。对于涉及信函和文章、翻译以及多轮对话的任务,性能差距仍然显著,13B模型始终获得更高的得分。有趣的是,我们观察到,即使我们没有使用任何多轮对话数据来调整系统,中文Alpaca仍然具有跟踪对话历史和按照用户指令连贯进行的能力。

编码任务显示出明显的改进,中文Alpaca-13B得分为49,而7B模型得分为27。最引人注目的性能差异出现在伦理任务中,13B模型获得完美的100分,而7B模型得分为50,表明在拒绝任何不道德的用户输入方面具有更优越的性能。

In summary, the experimental results demonstrate that both Chinese Alpaca-7B and Alpaca-13B models exhibit significant improvements over their original LLaMA counterparts, with the 13B model consistently outperforming the 7B model across all tasks. This underscores the effectiveness of our approach in enhancing the Chinese understanding and generation capabilities of the LLaMA and Alpaca models.

We provide some cases in Table 5, 6, and 7. For full comparisons and samples, please refer to our GitHub repository.

总之,实验结果表明,中文Alpaca-7B和Alpaca-13B模型在所有任务中都显示出显著的改进,13B模型始终优于7B模型。这凸显了我们的方法在增强LLaMA和Alpaca模型的中文理解和生成能力方面的有效性。

我们在表5、表6和表7中提供了一些案例。有关完整的比较和样本,请参阅我们的GitHub存储库。

Table 4: GPT-4 rated results for 4-bit quantized Chinese Alpaca-7B and Alpaca-13B. Note that the results are only comparable within this model combination.

表4:4位量化中文Alpaca-7B和Alpaca-13B的GPT-4评分结果。请注意,这些结果仅在该模型组合内可比较。

Task                           | Samples # | Chinese-Alpaca-7B | Chinese-Alpaca-13B
Question Answering             | 20        | 53                | 77
Open-ended QA                  | 20        | 64                | 73
Numerical Reasoning            | 20        | 23                | 50
Poetry, Literature, Philosophy | 20        | 31                | 54
Music, Sports, Entertainment   | 20        | 36                | 65
Letters and Articles Writing   | 15        | 65                | 78
Translation                    | 15        | 63                | 78
Multi-turn Dialogue            | 10        | 80                | 83
Coding                         | 10        | 27                | 49
Ethics                         | 10        | 50                | 100
Total                          | 160       | 49                | 71

7、CONCLUSION结论

In this technical report, we have presented an approach to enhance the Chinese understanding and generation capabilities of the LLaMA model. Acknowledging the limitations of the original LLaMA’s Chinese vocabulary, we expanded it by incorporating 20K additional Chinese tokens, significantly increasing its encoding efficiency for the Chinese language. Building on the Chinese LLaMA, we employed supervised fine-tuning with instruction data, resulting in the development of the Chinese Alpaca models, which exhibit improved instruction-following capabilities.

To evaluate our models effectively, we annotated 160 samples across 10 distinct task types and utilized GPT-4 for evaluation. Our experiments demonstrated that the proposed models significantly outperform the original LLaMA in Chinese understanding and generation tasks, with the 13B version consistently achieving greater improvements compared to the 7B variant.

在本技术报告中,我们提出了一种增强LLaMA模型中文理解和生成能力的方法。鉴于原始LLaMA模型在中文词汇方面的局限性,我们通过添加2万个额外的中文标记来扩展其词汇表,显著提高了对中文语言的编码效率。在中文LLaMA的基础上,我们采用了带指导数据的有监督微调方法,开发了中文Alpaca模型,其具有改进的指令跟随能力。

为了有效评估我们的模型,我们对10个不同任务类型的160个样本进行了注释,并利用GPT-4进行评估。我们的实验结果表明,所提出的模型在中文理解和生成任务中明显优于原始LLaMA模型,13B版本相比于7B版本持续取得更大的改进。

Looking ahead, we plan to explore Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Instructed Feedback (RLAIF) to further align the models’ output with human preferences. Moreover, we intend to adopt more advanced and effective quantization methods, such as GPTQ (Frantar et al., 2022), among others. Additionally, we aim to investigate alternative methods to LoRA for more efficient and effective pre-training and fine-tuning of large language models, ultimately enhancing their performance and applicability across various tasks within the Chinese NLP community.

展望未来,我们计划探索人类反馈强化学习(RLHF)或AI指导反馈强化学习(RLAIF)等方法,进一步使模型的输出与人类偏好相一致。此外,我们打算采用更先进、更有效的量化方法,例如GPTQ(Frantar et al., 2022)等。另外,我们还打算研究替代LoRA的更高效、更有效的大型语言模型预训练和微调方法,最终提高这些模型在中文NLP社区内各种任务中的性能和适用性。

LIMITATIONS限制

While this project has successfully enhanced the Chinese understanding and generation capabilities of the LLaMA and Alpaca models, several limitations must be acknowledged:

虽然本项目成功增强了LLaMA和Alpaca模型的中文理解和生成能力,但仍需注意以下几个限制:

>> Harmful and unpredictable content: Our results demonstrate that the 13B version has a better ability to reject unethical queries than the 7B version. However, these models may still generate content that is harmful or misaligned with human preferences and values. This issue may arise from biases present in the training data or the models’ inability to discern appropriate outputs in certain contexts.

>> Insufficient training: Due to constraints in computing power and data availability, the training of the models may not be sufficient for optimal performance. As a result, there is still room for improvement in the Chinese understanding capabilities of the models.

>> 有害和不可预测的内容:我们的结果表明,13B版本比7B版本更能拒绝不道德的查询。然而,这些模型仍可能生成有害或与人类偏好和价值观不一致的内容。这个问题可能源于训练数据中存在的偏见,或者模型在某些情境下无法判断适当的输出。

>> 训练不足:由于计算能力和数据可用性的限制,模型的训练可能不足以实现最佳性能。因此,模型在中文理解能力方面仍有改进的空间。
