Paper Reading_LLaMA

Paper information

name_en: LLaMA: Open and Efficient Foundation Language Models
name_ch: LLaMA: Open and Efficient Foundation Language Models
paper_addr: https://arxiv.org/abs/2302.13971
doi: https://doi.org/10.48550/arXiv.2302.13971
date_read: 2023-03-25
date_publish: 2023-02-27
tags: ['Deep Learning','Natural Language Processing']
author: Hugo Touvron, Meta AI
citation: 7
code: https://github.com/facebookresearch/llama

1 Feedback

An open-source project with modest but real wins: train on more tokens and use fewer model parameters. The smaller models can run on a single GPU, and the 65B model is competitive with PaLM. The main techniques are adjustments to the model structure and acceleration of training and inference.

2 Summary

The paper demonstrates that state-of-the-art models can be trained using only publicly available datasets, without resorting to proprietary and inaccessible data. Models ranging from 7B to 65B parameters are trained on trillions of tokens. LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with the best current models.

3 Introduction

Large models perform well in few-shot settings, mainly because of their parameter count. This paper focuses on finding the right combination of training data volume and parameter count so that models remain fast at inference time.

4 Methods

4.1 Pre-training data

4.2 Model structure

The model is based on the Transformer architecture, with the following main differences from the original design (these are techniques adopted from other models published roughly 2019-2021); a rough sketch of the three components follows the list:

  • Pre-normalization:
    RMSNorm is used to normalize the input of each transformer sub-layer, rather than normalizing the output, which improves training stability.
  • SwiGLU activation function:
    SwiGLU is used in place of the ReLU activation function.
  • Rotary positional embeddings:
    Absolute positional embeddings are removed, and rotary positional embeddings (RoPE) are added at each layer of the network.
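Below is a minimal PyTorch sketch of these three components. It is not the official LLaMA code; class names, the RoPE pairing convention, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: RMS layer norm applied to the *input* of each sub-layer."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal RMS of the last dimension; unlike LayerNorm, no mean subtraction.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating instead of a plain ReLU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(W_gate x) * (W_up x), projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings (RoPE) to a (batch, seq, heads, head_dim) tensor.
    The split-halves pairing used here is one common convention; implementations differ."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)         # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs      # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each pair of dimensions by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```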

4.3 Optimization

Models range from 7B to 65B parameters; the table of model sizes and training hyperparameters is not reproduced here.

4.4 Efficient implementation

The implementation uses an efficient causal multi-head attention operator that reduces memory usage and computation. To further improve training efficiency, the amount of activations recomputed during the backward pass with checkpointing is reduced by implementing the backward function of the transformer layers manually, instead of relying on PyTorch autograd. Memory usage is reduced further through model and sequence parallelism. In addition, the computation of activations and the network communication between GPUs are overlapped as much as possible.
Training the 65B-parameter model runs on 2048 A100 GPUs with 80GB of RAM; training on the dataset of 1.4T tokens takes about 21 days.
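The paper goes further than stock checkpointing by hand-writing the backward of the transformer layers; as a simpler sketch of the same memory-for-compute trade, standard PyTorch activation checkpointing looks roughly like this (the wrapper class and the usage comment are illustrative, not the released code):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Wraps a transformer block so its activations are recomputed during the
    backward pass instead of being stored, trading extra compute for memory."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False selects the non-reentrant checkpoint implementation
        # (available in recent PyTorch versions).
        return checkpoint(self.block, x, use_reentrant=False)

# Usage sketch: wrap each layer of a decoder stack, e.g.
# layers = nn.ModuleList(CheckpointedBlock(TransformerBlock(...)) for _ in range(n_layers))
```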

5 Main experiments

Zero-shot and few-shot tasks are evaluated. The reading-comprehension results (screenshot omitted here) show little difference between the large and small models on this kind of task.

The following benchmarks were evaluated; the screenshots are not reproduced here. Overall, the 65B model is comparable to PaLM-540B, and on many evaluations it is even better:

  • Standard commonsense reasoning (8 benchmarks)
  • Closed-book question answering (2)
  • Reading comprehension (1)
  • Mathematical reasoning (2): Google's Minerva model does better here because it is specifically trained on mathematical content
  • Code generation (2)
  • Massive Multitask Language Understanding (MMLU): multiple-choice questions covering a wide range of knowledge areas, including humanities, STEM, and social sciences. PaLM is significantly better on this benchmark, probably because it was trained on a larger corpus.

The training curves also show that the more tokens a model is trained on, the better its performance becomes.

6 Instruction fine-tuning

An instruction-tuned model, LLaMA-I, was obtained through fine-tuning; it is compared with other models on MMLU (multiple-choice questions covering 57 subjects), and the comparison table is omitted here.

Origin: blog.csdn.net/xieyan0811/article/details/130043145