[LLM Series: LLaMA] LLaMA: Open and Efficient Foundation Language Models

Paper: "LLaMA: Open and Efficient Foundation Language Models"
Paper link: https://arxiv.org/pdf/2302.13971.pdf
GitHub link: https://github.com/facebookresearch/llama/tree/main
Hugging Face link: https://huggingface.co/decapoda-research/llama-7b-hf

1 Model Introduction

LLaMA is a collection of foundation language models released by Meta AI, available in four parameter scales: 7B, 13B, 33B, and 65B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks with roughly 1/10 of the parameters, and LLaMA-65B is competitive with the best models in the industry, Chinchilla-70B and PaLM-540B.

Main contributions:

  • Open-sources a series of language models that compete with SOTA models
  • LLaMA-13B outperforms GPT-3 at roughly one-tenth of its size
  • LLaMA-65B is comparable in strength to Chinchilla-70B and PaLM-540B
  • Shows that state-of-the-art performance can be largely reproduced using only public datasets (about 86% of the effect)

2 Research Background

Given a fixed compute budget, the best model is not necessarily the largest one: a smaller model trained on more data can achieve better performance. Hoffmann et al.'s work (Chinchilla) aims to decide how to scale the dataset and the model size, but it ignores the cost of inference. In this paper, given a target performance level, the preferred model is therefore not the fastest to train but the fastest at inference. The resulting models, called LLaMA, range from 7B to 65B parameters and are comparable to the best LLMs available today.

LLaMA-13B outperforms GPT-3 on most benchmarks while being roughly ten times smaller. The Meta team believes this model can help democratize the use and study of LLMs because it can run on a single GPU. At the higher end of the scale, the 65B-parameter model is competitive with the current best LLMs (such as Chinchilla-70B or PaLM-540B). Another advantage of LLaMA is that it is trained only on publicly available datasets.

3 Training methods

The training approach is similar to that of Brown et al. (GPT-3) and is inspired by Hoffmann et al. (the Chinchilla scaling laws). Models are trained with a standard optimizer. The paper "Scaling Laws for Neural Language Models" will be covered separately in a later post; it models the relationship between model performance and the non-embedding parameter count N, the dataset size D, and the compute budget C. Its main findings:

  • Performance depends mainly on model scale and only weakly on model architecture
  • Performance has a fairly tight power-law relationship with each of the three factors above

Empirically, the bigger the model, the better: small models cannot match the capabilities of large ones, and the exact model architecture matters less (although a lot of work focuses on improving the architecture itself). The conclusion of that paper emphasizes that a bigger model matters more than more data.
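
Schematically, and as a hedged summary of the functional form reported in that paper (with $X_c$ and $\alpha_X$ standing for fitted constants), the loss follows a power law in each factor when the others are not the bottleneck:

$$L(X) \approx \left(\frac{X_c}{X}\right)^{\alpha_X}, \qquad X \in \{N, D, C\}.$$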

3.1 Pre-training data

The training dataset is a mixture of several sources, as shown in Table 1 of the paper, covering different domains. In most cases, data sources that have already been used to train other LLMs are reused, restricted to data that is publicly available and compatible with open sourcing. Per Table 1, the resulting mix and the sampling proportions in the training set are roughly: English CommonCrawl 67.0%, C4 15.0%, GitHub 4.5%, Wikipedia 4.5%, Books (Gutenberg and Books3) 4.5%, ArXiv 2.5%, and StackExchange 2.0%.

3.2 Model structure

The overall architecture is still the Transformer decoder from the paper Attention Is All You Need. On top of that, LLaMA makes the following three improvements.

  • [GPT-3] RMSNorm (Root Mean Square Layer Normalization) is used to normalize the inputs; see the paper Root Mean Square Layer Normalization.
    $$\bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})}\, g_i, \quad \text{where } \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}.$$

To improve training stability, normalization is applied at the input of each Transformer sub-layer instead of at the output, and the normalization used is RMSNorm.
The implementation method in the LLaMA source code is:

import torch
from torch import nn


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Learnable per-dimension gain g_i, initialized to 1.
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        # Divide by the root mean square over the last dimension (no mean subtraction, no bias).
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        # Normalize in float32 for stability, then cast back to the input dtype.
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
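
A quick usage sketch (the shapes below are illustrative, not taken from the paper):

norm = RMSNorm(dim=4096)
x = torch.randn(2, 16, 4096)   # (batch, seq_len, dim)
y = norm(x)                    # same shape; each hidden vector is rescaled by its RMS and the learned gains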

  • [PaLM] The SwiGLU activation function replaces ReLU in the feed-forward network; see the paper GLU Variants Improve Transformer. The LLaMA implementation is:
import torch.nn.functional as F
from torch import nn
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
    ):
        super().__init__()
        # SwiGLU uses three weight matrices instead of two, so the hidden width is
        # scaled by 2/3 to keep the parameter count comparable, then rounded up to a
        # multiple of `multiple_of` (e.g. dim=4096 -> 4*dim=16384 -> 2/3*16384 ≈ 10922,
        # which rounds up to 11008 when multiple_of=256).
        hidden_dim = int(2 * hidden_dim / 3)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        # SwiGLU: w2(SiLU(w1 x) * (w3 x))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
  • [GPTNeo] Rotary Position Embeddings (RoPE) are used instead of absolute position embeddings; see the paper RoFormer: Enhanced Transformer with Rotary Position Embedding. A minimal sketch is shown below.
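
The following is a simplified sketch of rotary embeddings in the style of the repo's precompute_freqs_cis / apply_rotary_emb helpers; it is written as an illustration, not copied verbatim from the source:

import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One frequency per pair of channels, combined with every position m
    # into complex rotations e^{i * m * theta_j}.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)  # (end, dim // 2), complex

def apply_rotary_emb(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor):
    # xq, xk: (batch, seq_len, n_heads, head_dim); channels are paired up as complex numbers.
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs = freqs_cis.view(1, xq_.shape[1], 1, xq_.shape[-1])
    xq_out = torch.view_as_real(xq_ * freqs).flatten(3)  # rotate each channel pair
    xk_out = torch.view_as_real(xk_ * freqs).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)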

3.3 Optimizer

The AdamW optimizer is used with a cosine learning-rate schedule such that the final learning rate equals 10% of the maximum learning rate. Weight decay is set to 0.1 and gradient clipping to 1.0, with 2,000 warmup steps; the learning rate and batch size vary with model size (see Table 2 of the paper).
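
A minimal PyTorch sketch of this optimizer setup, assuming a generic training loop; `model`, `total_steps`, and the peak learning rate below are illustrative placeholders, not values taken from the paper:

import math
import torch

max_lr, total_steps, warmup_steps = 3.0e-4, 100_000, 2_000  # illustrative values

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 2,000 steps, then cosine decay down to 10% of max_lr.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()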

3.4 Efficient implementation

  • The authors apply several optimizations to improve training speed. First, an efficient implementation of causal multi-head attention is used to reduce memory usage and runtime; the implementation is available in the xformers library (a usage sketch follows the link below).

https://github.com/facebookresearch/xformers
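
As an illustration only (the exact integration in the training code is not shown in the paper), xformers exposes a memory-efficient attention kernel that can be called roughly like this:

import torch
import xformers.ops as xops

# q, k, v: (batch, seq_len, n_heads, head_dim); shapes here are illustrative
q = k = v = torch.randn(1, 128, 32, 128, device="cuda", dtype=torch.float16)

# Causal (lower-triangular) masking without materializing the full attention matrix.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())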

  • To further improve training efficiency, the amount of activations recomputed during the backward pass is reduced via checkpointing: computationally expensive activations, such as the outputs of the linear layers, are saved. This is achieved by manually implementing the backward function of the Transformer layers, rather than relying on PyTorch autograd.

This is the idea behind gradient (activation) checkpointing: a strategy that trades compute time (recomputing some activations during the backward pass) for memory (not having to store all of them).
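
For illustration, the generic PyTorch utility achieves a similar time-for-memory trade-off; note that this is not the manual backward implementation used in LLaMA:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    for block in blocks:
        # Intermediate activations inside `block` are not stored; they are
        # recomputed during the backward pass, saving memory at the cost of compute.
        x = checkpoint(block, x)
    return x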

  • In addition, the computation of activations is overlapped with the communication between GPUs over the network (due to all_reduce operations) as much as possible.

  • When training the 65B-parameter model, the code processes about 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM each. This means training over a dataset of 1.4T tokens takes approximately 21 days.
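
As a quick sanity check of that figure:

$$\frac{1.4\times10^{12}\ \text{tokens}}{380\ \text{tokens/s/GPU} \times 2048\ \text{GPUs}} \approx 1.8\times10^{6}\ \text{s} \approx 21\ \text{days}.$$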

4 Experimental results

The authors mainly compare results in the zero-shot and few-shot settings.

4.1 Common Sense Reasoning


It can be observed that the results of LLaMA-13B and GPT-3 (175B) are actually very close.

4.2 Closed-book QA



The LLaMA models outperform the PaLM-540B model on these closed-book QA benchmarks.

4.3 Reading Comprehension


It can be seen that LLaMA holds its own against the 540B-parameter PaLM.

4.4 Mathematical reasoning

4.5 Code generation

4.6 Massive Multitask Language Understanding


It can be observed that LLaMA-65B lags behind Chinchilla-70B and PaLM-540B by a few percentage points on average across most domains. One possible explanation is the limited amount of books and academic papers in the pre-training data, namely ArXiv, Gutenberg, and Books3, totaling only 177GB, whereas those models were trained on up to 2TB of books. The large volume of books used by Gopher, Chinchilla, and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark while being comparable on other benchmarks.

4.7 Evolution of performance during training


During training, we track the model's performance on several question answering and commonsense benchmarks and report them in Figure 2. On most benchmarks, performance improves steadily and correlates with the model's training perplexity (see Figure 1). The exceptions are SIQA and WinoGrande. Most notably, on SIQA we observe a lot of variance in performance, which may indicate that this benchmark is not reliable. On WinoGrande, performance is uncorrelated with training perplexity: LLaMA-33B and LLaMA-65B perform similarly during training.

5 Instruction tuning


The results of the instruction-tuned model LLaMA-I on MMLU are compared with existing moderate-scale instruction-tuned models. Although the instruction tuning used here is simple, LLaMA-I (65B) reaches 68.9% on MMLU, outperforming those existing moderate-scale instruction-tuned models; however, it is still far from the state of the art: GPT code-davinci-002 scores 77.4 on MMLU.

6 Model Code

https://github.com/facebookresearch/llama/blob/main/llama/model.py

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim, hidden_dim=4 * args.dim, multiple_of=args.multiple_of
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        # Pre-normalization: RMSNorm is applied to each sub-layer's input,
        # and the sub-layer output is added back through a residual connection.
        h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out

Instead of normalizing the output of each sub-layer, the authors normalize its input. In the original Transformer diagram, the Add & Norm block sits after each sub-layer's output; LLaMA moves the Norm operation in front of the sub-layer so that it is applied to the input (pre-normalization).

class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = ParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        # Rotary frequencies for the positions covered by this forward pass.
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            # Causal mask: each position may only attend to itself and earlier positions.
            mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h[:, -1, :])  # only compute logits for the last position
        return output.float()
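
For context, a simplified decoding loop in the style of the repo's generation code might look like the sketch below; `tokenizer`, `prompt`, `max_len`, and the greedy argmax sampling are illustrative assumptions, not taken from the snippet above:

import torch

tokens = tokenizer.encode(prompt)  # hypothetical tokenizer and prompt
prev_pos = 0
for cur_pos in range(len(tokens), max_len):
    # Only the tokens not yet processed are fed in; start_pos lets the attention
    # layers reuse cached keys/values from earlier steps.
    logits = model(torch.tensor([tokens[prev_pos:cur_pos]]), prev_pos)
    next_token = int(torch.argmax(logits, dim=-1))
    tokens.append(next_token)
    prev_pos = cur_pos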

7 Conclusion of the paper

In this paper, the authors present a series of openly released language models that are competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10 times smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.

Unlike previous work, this paper demonstrates that state-of-the-art performance can be achieved without proprietary datasets, using only publicly available data for training. The authors hope that releasing these models to the research community will accelerate the development of large language models and help efforts to improve their robustness and mitigate known problems such as toxicity and bias.

Furthermore, like Chung et al., the authors observe that fine-tuning these models on instructions yields promising results, and they plan to investigate this further in future work.

Finally, the authors plan to release larger models trained on larger pre-training corpora in the future, as they have seen continual performance improvements when scaling up the corpus.

Original post: blog.csdn.net/yanqianglifei/article/details/130682579