LLaMa2

This post introduces the LLaMa 2 model from Meta AI. Like OPT and LLaMa, it is a fully open-source large language model, with parameter counts ranging from 7B to 70B. The fine-tuned version, called Llama 2-Chat, is optimized for conversational use cases. Llama 2-Chat outperforms open-source chat models on most benchmarks and is tuned for helpfulness and safety. Notably, the LLaMa 2 paper gives a complete description of its fine-tuning method and its safety-improvement methods, which will help the community reproduce and build more large language models.

LLaMa 2: Open Source Fine-Tuned Large Language Model for Chat

Paper name: LLAMA 2: Open Foundation and Fine-Tuned Chat Models

Paper address:

https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Code link:

https://github.com/facebookresearch/llama

Existing open-source large language models are still missing one step

Large Language Models (LLMs) show great promise as AI assistants. They excel at complex reasoning tasks that require expert domain knowledge, as well as specialized domains such as programming and creative writing. They can also interact with humans through a chat interface, which has led to widespread public adoption.

Training a large language model is also complicated. An LLM is first pre-trained on a huge corpus with a self-supervised objective, and then aligned with human preferences by fine-tuning with Reinforcement Learning from Human Feedback (RLHF). For a detailed introduction to this method, see:

Explain in detail the "magic" that makes ChatGPT obedient! InstructGPT: Training Language Models to Follow Human Instructions

Although this recipe looks simple, in practice it requires labeling a large amount of supervised data and ranking model outputs, and the training cost is extremely high, which limits LLM development to a small number of players.

On the other hand, there are open-source LLMs that can be used directly, such as BLOOM [1] and LLaMa [2], whose performance is comparable to GPT-3 [3]. However, these models are not suitable replacements for closed-source LLMs such as ChatGPT, BARD, and Claude, because the closed-source models have gone through instruction fine-tuning and are therefore better aligned with human preferences, which greatly improves their usability and safety. This step is expensive in compute and human annotation, and is usually opaque and not easily reproducible, which limits progress in the community and in LLM research.

In other words, the open-source LLMs are missing exactly this last step: instruction fine-tuning to align with human preferences.

What LLaMa 2 does

LLaMa 2 actually refers to two families of models: LLaMa 2, the pre-trained-only models, and LLaMa 2-CHAT, the pre-trained models further fine-tuned on human instructions. On a range of helpfulness and safety benchmarks, the Llama 2-Chat models outperform existing open-source models and perform comparably to some closed-source models.

The authors also provide a comprehensive description of LLaMa 2's fine-tuning methods and of the methods used to improve safety, and hope that this openness will enable the community to reproduce more LLMs and continue to improve their safety.

LLaMa 2 releases the following models for research and commercial use:

  1. LLaMa 2 pre-trained models: available in three sizes, 7B, 13B, and 70B; an upgraded version of LLaMa.

  2. LLaMa 2-CHAT fine-tuned models: available in three sizes, 7B, 13B, and 70B; fine-tuned versions of LLaMa 2 intended mainly for dialogue.


LLaMa 2 pre-trained model

LLaMa 2 pre-training follows the autoregressive paradigm of GPT-3; the main changes compared with LLaMa are shown in Figure 1 below.

Figure 1: Major changes in LLaMa 2 compared to LLaMa

The parameter scales of LLaMa 2 and LLaMa are essentially the same; the context length of LLaMa 2 is increased from LLaMa's 2k tokens to 4k tokens, and the training data grows from 1.4T tokens to 2.0T tokens.

The 2.0T tokens of pre-training data all come from a mix of openly available sources, and the author team made an effort to remove data from certain sites known to contain a large amount of personal information about private individuals.

The tokenizer is the same as LLaMa's: it is based on SentencePieceProcessor [4] and uses the byte-pair encoding (BPE) algorithm.

The corresponding code from the LLaMa repository is shown below; it uses the sentencepiece library.

import os
from logging import getLogger
from typing import List

from sentencepiece import SentencePieceProcessor

logger = getLogger()


class Tokenizer:
    def __init__(self, model_path: str):
        # reload tokenizer
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        logger.info(f"Reloaded SentencePiece model from {model_path}")

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        return self.sp_model.decode(t)
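
For reference, a minimal usage sketch (the path "tokenizer.model" below is a placeholder for wherever the SentencePiece model file is stored):

tokenizer = Tokenizer(model_path="tokenizer.model")  # placeholder path
ids = tokenizer.encode("Hello, LLaMa 2!", bos=True, eos=False)
print(ids)                     # list of token IDs
print(tokenizer.decode(ids))   # back to the original string
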
LLaMa 2 Model Architecture

Context Length:

The context window of LLaMa 2 is extended from 2048 tokens to 4096 tokens. A longer context window lets the model process more information, which is particularly useful for chat applications with long histories, various summarization tasks, and understanding long documents.

Pre-normalization [Inspired by GPT3, consistent with LLaMa]:

To improve training stability, LLaMa normalizes the input of each Transformer sub-layer instead of normalizing the output, using the RMSNorm [5] normalization function.
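
As a reference, here is a minimal sketch of RMSNorm in plain PyTorch (the open-source LLaMa code has essentially the same shape; treat this as an illustration rather than the exact repository code):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm as used in LLaMa: scale by the root mean square of the input,
    with a learned per-dimension weight (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed over the last dimension
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
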

SwiGLU activation function [inspired by PaLM, consistent with LLaMa]:

LLaMa replaces the ReLU non-linearity with the SwiGLU activation function [6] to improve performance, and the feed-forward hidden dimension is reduced from 4d to (2/3)·4d to keep the parameter count comparable.
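
For illustration, a simplified sketch of the SwiGLU feed-forward block, using plain nn.Linear layers instead of the repository's model-parallel layers (the rounding constant multiple_of=256 mirrors the open-source code, but treat the exact constants as assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """SwiGLU feed-forward block: w2( SiLU(w1(x)) * w3(x) ).
    hidden_dim is reduced to roughly 2/3 of 4*dim to keep the parameter
    count comparable to a standard 4*dim ReLU MLP."""
    def __init__(self, dim: int, hidden_dim: int, multiple_of: int = 256):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # round up to a multiple for efficient tensor shapes
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
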

Rotary Embeddings [inspired by GPTNeo, consistent with LLaMa]:

LLaMa removes absolute position encodings and instead uses Rotary Positional Embeddings (RoPE) [7]. RoPE was proposed by Su Jianlin; the principle is slightly involved, and interested readers can refer to his original paper and blog post: https://spaces.ac.cn/archives/8265
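
A condensed sketch of the rotary-embedding helpers, adapted from the style of the open-source LLaMa code (simplified; the real code handles device placement and broadcasting more carefully):

import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
    """Precompute the complex rotation factors e^{i * m * theta_k} for RoPE."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: dim // 2].float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    freqs = torch.outer(t, freqs)                       # (end, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)   # complex64

def apply_rotary_emb(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor):
    """Rotate query/key pairs in the complex plane according to their position."""
    # (bsz, seqlen, n_heads, head_dim) -> complex (bsz, seqlen, n_heads, head_dim/2)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = freqs_cis.view(1, xq_.shape[1], 1, xq_.shape[-1])
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
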

Grouped query attention mechanism (Grouped-Query Attention, GQA)

The main architectural differences between LLaMa 2 and LLaMa 1 are the longer context length and grouped-query attention (GQA [8]).

Here it helps to recall the standard practice in autoregressive decoding: the key (K) and value (V) tensors of previously generated tokens are cached so that attention does not have to recompute them. However, as the context window or batch size grows, the memory cost of this KV cache (cache_k, cache_v) in a multi-head attention (MHA) model grows significantly, and for larger models the KV cache size becomes a bottleneck.
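
To get a feel for the numbers, here is a back-of-the-envelope calculation. The 70B-style configuration below (80 layers, head dimension 128, 64 query heads, 8 KV heads under GQA) is used purely for illustration:

def kv_cache_bytes(batch, seq_len, n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    # 2 tensors (K and V), each of shape (batch, seq_len, n_kv_heads, head_dim), per layer, fp16
    return 2 * batch * seq_len * n_kv_heads * head_dim * n_layers * bytes_per_elem

mha = kv_cache_bytes(batch=16, seq_len=4096, n_kv_heads=64, head_dim=128, n_layers=80)
gqa = kv_cache_bytes(batch=16, seq_len=4096, n_kv_heads=8,  head_dim=128, n_layers=80)
print(f"MHA cache: {mha / 2**30:.0f} GiB, GQA cache: {gqa / 2**30:.0f} GiB")  # 160 GiB vs 20 GiB
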

Previous work [9][10] showed that the key and value projections can be shared across multiple heads without a large drop in performance. Figure 2 shows the ablation results for the attention architecture: the GQA variant performs comparably to the MHA baseline on most evaluation tasks and outperforms the MQA variant on average.

Figure 2: Ablation experiment results of the Attention architecture

import math
from typing import Optional

import torch
import torch.nn.functional as F
from torch import nn

import fairscale.nn.model_parallel.initialize as fs_init
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear

# ModelArgs and apply_rotary_emb are defined elsewhere in the repository's model.py.


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )


class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        model_parallel_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(keys, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)
        values = repeat_kv(values, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)

        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)

Note that the cache_k and cache_v buffers in LLaMa 2 have a different shape compared to V1.

In LLaMa 1 they are allocated as:

self.cache_k = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
).cuda()
self.cache_v = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
).cuda()

i.e., with shape (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim).

In LLaMa 2 they are allocated as:

self.cache_k = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_kv_heads, self.head_dim)
).cuda()
self.cache_v = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_kv_heads, self.head_dim)
).cuda()

i.e., with shape (args.max_batch_size, args.max_seq_len, self.n_local_kv_heads, self.head_dim).

Note that one uses n_local_heads while the other uses n_local_kv_heads: with GQA, the cache only needs to hold the smaller number of KV heads.

Optimization of LLaMa 2

Figure 3: LLaMa 2 training loss curve

Evaluation of LLaMa 2 pretrained models

The various benchmarks evaluated can be roughly classified into the following categories:

Code:  Reports the average pass@1 score on HumanEval and MBPP.

Commonsense Reasoning:  Reports 7-shot scores on CommonsenseQA, 0-shot scores on PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA.

World Knowledge:  Reports 5-shot scores on NaturalQuestions and TriviaQA.

Reading Comprehension:  Reports 0-shot scores on SQuAD, QuAC and BoolQ.

MATH:  Reports 8-shot scores on GSM8K and 4-shot scores on MATH.

Popular Aggregated Benchmarks:  Reports 5-shot scores on MMLU, 3-shot scores on Big Bench Hard (BBH), and 3-5-shot scores on AGI Eval, averaged.

Figure 4: Evaluation results of the LLaMa 2 pre-trained model

Figure 4 shows the evaluation results of the LLaMa 2 pre-training model. It can be seen that the LLaMa 2 model outperforms the LLaMa 1 model. In particular, LLaMa 2 70B improved the MMLU and BBH results by approximately 5 and 8 points, respectively, compared to LLaMa 1 65B. The LLaMa 2-7B and 30B models outperform correspondingly sized MPT models on all categories except the code benchmark. For the Falcon model, LLaMa 2-7B and 34B outperform the Falcon-7B and 40B models in all benchmark categories. Furthermore, the LLaMa 2-70B model outperforms all open source models.

Figure 5: Evaluation results of the LLaMa 2 pre-trained model compared to the closed-source model

In addition to open-source models, the authors also compared LLaMa 2-70B with closed-source models, as shown in Figure 5. LLaMa 2-70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on the coding benchmarks. LLaMa 2-70B is comparable to or better than PaLM (540B) on almost all benchmarks. There is still a large gap between Llama 2-70B and GPT-4 or PaLM-2-L.

LLaMa 2-CHAT fine-tuning model

The fine-tuning process of LLaMa 2-CHAT is similar to that of InstructGPT, as shown in Figure 6, and can be roughly divided into:

  • Supervised fine-tuning: Supervised Fine-Tuning (SFT)

  • Human Feedback Reinforcement Learning: Reinforcement Learning with Human Feedback (RLHF)

Figure 6: Fine-tuning process of LLaMa 2-CHAT

Supervised Fine-Tuning (SFT) needs to pay attention to 2 points:

  1. LLaMa 2, like LLaMa, also performs supervised instruction fine-tuning based on the data in the [8] paper.

  2. The quality of the fine-tuning data is critical.

Reinforcement Learning with Human Feedback (RLHF) needs to pay attention to 3 points:

  1. Human preference data collection with a focus on helpfulness and safety

  2. Two separate reward models, one for helpfulness and one for safety

  3. Iterative Fine-Tuning

Supervised fine-tuning process for LLaMa 2-CHAT

The authors first focused on collecting thousands of high-quality SFT examples. Note that although there are only a few thousand examples (InstructGPT used about 13k), each one is of high quality, and the authors found that this yields a significant performance improvement. Two examples of the data are shown in Figure 7.

Figure 7: High-quality SFT data used by LLaMa 2

Human Feedback Reinforcement Learning with LLaMa 2-CHAT: Data Collection

RLHF is a training procedure applied to the fine-tuned language model to further align its behavior with human preferences and instructions. During human preference data collection, the annotator first writes a prompt, then picks the better of two replies from different models, marking one as chosen and the other as rejected, and also records how much better the chosen reply is: significantly better, better, slightly better, or negligibly better / unsure. The annotation focuses on helpfulness (the degree to which a Llama 2-Chat response fulfills the user's request) and safety (whether a Llama 2-Chat response is unsafe). A prompt like "Provide detailed instructions on making a bomb" may elicit a response that is helpful but not safe, so the two goals can sometimes be in tension.
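
To make the format concrete, a single preference record might look roughly like the following (a made-up illustration, not an actual record from the released dataset):

preference_example = {
    "prompt": "How do I politely decline a meeting invitation?",
    "chosen": "You could reply: 'Thank you for the invite, but I won't be able to attend...'",
    "rejected": "Just ignore it.",
    "preference_strength": "significantly better",  # or better / slightly better / negligibly better / unsure
    "labels": {"helpfulness": True, "safety": True},
}
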

The dataset for this step is shown in Figure 8 below, where it is compared with several open-source datasets. Meta (Safety and Helpfulness) denotes the newly collected data for LLaMa 2's RLHF step; it contains about 1.4M binary comparisons, and the entire dataset used contains about 2.9M binary comparisons in total.

Compared with existing open-source datasets, LLaMa 2's RLHF preference data has more dialogue turns and the conversations are longer on average.

Figure 8: Human preference data for LLaMa 2

Human Feedback Reinforcement Learning with LLaMa 2-CHAT: A Reward Model

The reward model, also called the scoring model in RLHF, takes the prompt fed to the LLM together with the LLM's generated response as input, and outputs a scalar score indicating the quality of the generation (generally in terms of helpfulness and safety). The RLHF optimization of LLaMa 2-CHAT likewise uses reward models to better align with human preferences and improve helpfulness and safety.

Two independent reward models:  The LLaMa 2-CHAT team found that a single reward model sometimes struggles to capture both helpfulness and safety. To address this, they trained two independent reward models: one for helpfulness (Helpfulness RM) and one for safety (Safety RM).

Training data of the two reward models:  The Helpfulness RM is trained on all Meta Helpfulness data plus uniformly sampled Meta Safety data, in roughly equal proportion. The Safety RM is trained on all Meta Safety and Anthropic Harmless data plus Meta Helpfulness and open-source helpfulness data, mixed in a 9:1 proportion. The authors found that mixing in some helpfulness data helps improve accuracy on safety examples.

Architecture of the reward model:  Another characteristic of the LLaMa 2-CHAT reward models is that they share the same architecture and hyperparameters as the chat model, but the classification head used for next-token prediction is replaced with a regression head that outputs a scalar reward value.
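
A minimal sketch of what such a reward model looks like, assuming a generic transformer backbone that returns per-token hidden states (the backbone interface here is an assumption for illustration, not the official code):

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Same backbone as the chat model, but the next-token classification
    head is replaced by a scalar regression head."""
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                        # pretrained transformer (hypothetical interface)
        self.reward_head = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(tokens)                  # assumed shape: (batch, seq_len, hidden_dim)
        last_hidden = hidden[:, -1, :]                  # read the score from the final token position
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,) scalar rewards
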

Objective function for reward model training: the binary ranking loss from InstructGPT, with an added margin term:
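
Reconstructed here in LaTeX from the paper (the margin m(r) is LLaMa 2's addition to the InstructGPT loss; notation may differ slightly from the original):

$$\mathcal{L}_{\text{ranking}} = -\log\!\left(\sigma\!\left(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\right)\right)$$

where $r_\theta(x, y)$ is the scalar reward for prompt $x$ and completion $y$, $y_c$ is the chosen response, $y_r$ is the rejected response, and $m(r)$ is a discrete margin that grows with the annotated preference strength.
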

Figure 9: The values of the margin term added to the ranking loss

Reward Model Experimental Results 1: Test Set Experimental Results

To evaluate how well a reward model is trained, we should look at its scoring accuracy. The authors held out 1,000 examples from each dataset as a test set; the results are shown in Figure 10 below. SteamSHP-XL, Open Assistant, and GPT4 in the upper rows are baselines, while Safety RM and Helpfulness RM in the lower rows are the two reward models from this paper.

The results show that the Helpfulness RM performs best on the Meta Helpfulness dataset, and similarly the Safety RM performs best on the Meta Safety dataset. Overall, the LLaMa 2-CHAT reward models outperform all baselines, including GPT-4; interestingly, GPT-4 outperforms the other baselines despite not being trained directly for reward modeling.

Figure 10: Experimental results on the test set of the reward model

Another result is shown in Figure 11 below, where the test set is broken down by how much better the chosen response is than the rejected one. For example, on the Meta Safety test set, the Safety RM reaches 94.3% accuracy on pairs labeled "Significantly Better", but only 55.3% accuracy on pairs labeled "Negligibly Better / Unsure". This matches intuition: pairs where the chosen response is clearly better than the rejected one are easier for the scoring model to learn, while pairs with only a slight or uncertain preference are harder.

Figure 11: Experimental results on the test set for a finer-grained reward model

Reward Model Experiment Results 2: Scaling Reward Model and Training Data Results

Figure 12 below shows the effect of scaling the reward model and its training data. More data and larger models generally improve reward-model accuracy. The authors also note that reward-model accuracy is one of the most important proxies for the final performance of Llama 2-CHAT, so improvements in the reward model translate directly into improvements in Llama 2-CHAT.

Figure 12: Results of scaling the reward model and training data

Human Feedback Reinforcement Learning with LLaMa 2-CHAT: Iterative Fine-tuning

After supervised fine-tuning (SFT) as described in Section 1.7 (the resulting model is denoted the SFT Policy), LLaMa 2-CHAT combines the reward models introduced in Sections 1.8 and 1.9 to perform the final iterative fine-tuning.

This iterative process went through five versions, RLHF-V1 to RLHF-V5, which can be summarized as follows (see also Hao Xiaotian: Technical details of RLHF in [LLM] Meta Llama-2):

1) SFT Policy → sample model outputs, annotate and update the reward model → Rejection Sampling fine-tuning → RLHF-V1
2) RLHF-V1 → sample model outputs, annotate and update the reward model → Rejection Sampling fine-tuning → RLHF-V2
3) RLHF-V2 → sample model outputs, annotate and update the reward model → Rejection Sampling fine-tuning → RLHF-V3
4) RLHF-V3 → sample model outputs, annotate and update the reward model → Rejection Sampling fine-tuning → RLHF-V4
5) RLHF-V4 → sample model outputs, annotate and update the reward model → Rejection Sampling fine-tuning → RLHF-V5 (w/o PPO) → Proximal Policy Optimization (PPO) → RLHF-V5 (w/ PPO)

This involves two optimization algorithms: Rejection Sampling fine-tuning and PPO.

Rejection Sampling fine-tuning

In rejection sampling fine-tuning, several outputs are sampled from the model for each prompt, the reward model scores them, the best candidate is selected, and the model is then fine-tuned on these selected outputs.

Figure 13: Average maximum and average median reward-model score on the training set for different numbers of samples N

Figure 14: (Left) With Llama 2-Chat-SFT, the maximum RM score among N samples at various temperatures T, averaged over the training set. (Right) The same for Llama 2-Chat-RLHF
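
A rough sketch of the best-of-N selection step (the policy.generate and reward_model interfaces are assumptions for illustration, not the official API):

import torch

@torch.no_grad()
def best_of_n(policy, reward_model, prompt_tokens, n_samples=8, temperature=1.0):
    # Sample N candidate answers from the current policy...
    candidates = [policy.generate(prompt_tokens, temperature=temperature)
                  for _ in range(n_samples)]
    # ...score each with the reward model and keep the highest-scoring one.
    scores = [float(reward_model(prompt_tokens, c)) for c in candidates]
    best = max(range(n_samples), key=lambda i: scores[i])
    return candidates[best]

# The selected (prompt, best answer) pairs are then used as new SFT-style
# targets to fine-tune the model for the next RLHF iteration.
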

Proximal Policy Optimization (PPO)

PPO is also used to optimize the language model; it treats the reward model's output as an estimate of the true reward function (human preference) and optimizes the following objective:
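
Reconstructed from the paper (notation may differ slightly; treat this as a paraphrase rather than an exact copy): the policy $\pi$ is trained to maximize

$$\arg\max_{\pi} \; \mathbb{E}_{p \sim \mathcal{D},\, g \sim \pi}\left[ R(g \mid p) \right]$$

where the final reward combines the reward-model score (a piecewise combination of the safety and helpfulness reward models) with a KL penalty that keeps the policy close to the initial SFT policy $\pi_0$:

$$R(g \mid p) = \tilde{R}_c(g \mid p) - \beta\, D_{KL}\!\left(\pi_\theta(g \mid p)\,\|\,\pi_0(g \mid p)\right)$$
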

LLaMa 2-CHAT model evaluation

Evaluating LLMs is a challenging problem. Manual evaluation is more accurate but expensive and complex. To save cost and speed up iteration, LLaMa 2-CHAT is evaluated with a combination of reward-model-based evaluation and human evaluation. At each iteration from RLHF-V1 to V5, the best-performing model is first selected based on the latest reward model's scores, and the main model versions are then validated with human evaluation.

Model-Based Evaluation

To measure how consistent the reward model's scores are with human judgment, the authors created a prompt set and asked three annotators to rate answer quality on a 7-point scale. Figure 15 below shows the relationship between the reward model's scores and the annotators' 7-point ratings; the shaded area represents ±1 standard deviation. The horizontal axis is the annotators' 7-point rating and the vertical axis is the reward model's score. The two are roughly positively correlated: even though the reward model is trained with a binary ranking loss, it is still well calibrated with human preference annotations.

Figure 15: The relationship between the scores of the reward model and the scores of the three annotators on a 7-point scale, the shaded area represents ±1 standard deviation

Figure 16 below shows the evaluation of LLaMa 2's SFT model and the different RLHF versions in terms of Safety and Helpfulness, i.e., how the LLaMa 2-CHAT model's win rate against ChatGPT evolves over successive fine-tuning iterations. The prompt set contains 1,586 Safety prompts and 584 Helpfulness prompts. The judge on the left is LLaMa 2's own reward model, which may favor LLaMa 2; the judge on the right is GPT-4, which should be more neutral. In this evaluation, from RLHF-V3 onward the model outperforms ChatGPT on both axes (both the Safety and Helpfulness win rates exceed 50%).

Figure 16: Evaluation results of SFT of LLaMa 2 and different RLHF versions in terms of Safety and Helpfulness (left): The judge is the reward model of LLaMa 2, which may favor LLaMa 2. (Right): The judge is GPT-4, which should be more neutral

Human Evaluation

Human evaluation is often considered the gold standard for judging natural language generation models, including dialogue models. To assess the quality of the major model releases, the authors asked human evaluators to rate them on helpfulness and safety.

Compared models: the open-source models are Falcon-40B, MPT-7B, and Vicuna-13B; the closed-source models are ChatGPT (gpt-3.5-turbo-0301) and PaLM (chat-bison-001). Figure 17 below shows the final number of prompts used for human evaluation of each model.

Figure 17: Final number of prompts for human evaluation per model

Figure 18 shows the evaluation results of the LLaMa 2-CHAT models on about 4,000 helpfulness prompts. The LLaMa 2-CHAT models outperform the open-source models on both single-turn and multi-turn prompts. The LLaMa 2-CHAT 7B model beats MPT-7B-CHAT on 60% of the prompts. The LLaMa 2-CHAT 34B model has an overall win rate of over 75% against the similarly sized Vicuna-33B and Falcon-40B models. The largest LLaMa 2-CHAT model is also competitive with ChatGPT: the 70B model has a 36% win rate and a 31.5% tie rate against ChatGPT.

Figure 18: Evaluation results of the LLaMa 2-CHAT model under about 4,000 Helpfulness Prompts

Limitations of human evaluation:

  • The test set includes only 4k prompts and does not cover all real-world uses of these models.

  • The diversity of prompts may affect the results; the test set does not contain any coding- or reasoning-related prompts.

  • Only the final generation of a multi-turn conversation is evaluated; a more interesting evaluation would ask the model to complete a task and rate the overall experience over multiple interactions.

  • For generative models, human evaluation is inherently subjective and noisy; results may differ across different prompts or instructions.
