A brief interpretation of the open-source large language model paper "LLaMA: Open and Efficient Foundation Language Models"


1. Introduction

LLaMA is a collection of foundation language models released by Meta AI, available in four parameter scales: 7B, 13B, 33B, and 65B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite having roughly a tenth of the parameters, and LLaMA-65B is competitive with the best models in the industry, Chinchilla-70B and PaLM-540B. The code is open source and easy to find on GitHub, as is the original paper.

2. Abstract and Introduction

One issue with large language models is that a bigger model does not automatically mean better performance: for a given budget, a smaller model trained on more data can end up performing better. The authors found that the performance of a 7B model keeps improving even past 1T training tokens. The goal of LLaMA is therefore to get the best possible performance out of smaller models by training them on more tokens.

In addition, all of LLaMA's training data comes from publicly available sources on the Internet. The paper then introduces the model and its training details.

3. Method

3.1 Dataset

The training data mixture is shown in the table below; all of it comes from open sources on the Internet (English CommonCrawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange). The paper describes each source in more detail, and the dataset seems to be available on Hugging Face.

[Table from the paper: pre-training data sources, sampling proportions, and disk sizes]
3.2 Model Architecture

The overall architecture is still the Transformer decoder, following the paper "Attention Is All You Need". On top of that, LLaMA makes three further improvements to the Transformer architecture:

  1. RMSNorm (Root Mean Square Layer Normalization) is used to normalize the input of each Transformer sub-layer; see the paper "Root Mean Square Layer Normalization". (A minimal PyTorch sketch of all three changes follows this list.)
  2. The SwiGLU activation function is used in the feed-forward layers, as in PaLM; see the paper "GLU Variants Improve Transformer".
  3. Rotary embeddings (RoPE) are used for position encoding; see the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding".
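
Below is a minimal PyTorch sketch of these three components. The class names, sizes, and the channel-pairing convention in the RoPE helper are illustrative choices of mine, not Meta's released implementation.

```python
# Minimal sketches of the three Transformer tweaks; names and sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features,
    with a learned gain, but no mean subtraction and no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation: silu(x W1) * (x W3), then W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding: rotate channel pairs of x
    (shape ..., seq_len, head_dim) by a position-dependent angle."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Tiny smoke test with made-up sizes.
h = torch.randn(2, 16, 512)                 # (batch, seq_len, dim)
h = RMSNorm(512)(h)                         # pre-normalize the sub-layer input
h = SwiGLUFeedForward(512, 1376)(h)         # SwiGLU feed-forward
q = apply_rope(torch.randn(2, 8, 16, 64))   # (batch, heads, seq_len, head_dim)
print(h.shape, q.shape)
```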

3.3 Optimizer

Training uses the AdamW optimizer (see the paper "Decoupled Weight Decay Regularization") with a cosine learning-rate schedule. The table below lists the training hyperparameters for each model size.

[Table from the paper: model dimension, number of heads and layers, learning rate, batch size, and training tokens for each LLaMA size]
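
As an illustration of that setup, here is a hedged PyTorch sketch of an AdamW training step with cosine learning-rate decay and gradient clipping. The model, data, peak learning rate, and step counts are placeholders of mine; the real per-model values are in the table above.

```python
# Hypothetical training-loop skeleton mirroring the optimizer description.
import torch

model = torch.nn.Linear(4096, 4096)                  # stand-in for the real network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)

for step in range(10):                               # tiny dummy loop
    x = torch.randn(8, 4096)
    loss = model(x).pow(2).mean()                    # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```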

3.4 Other Efficiency Improvements

  1. An efficient implementation of causal multi-head attention is used to reduce memory usage and runtime. It builds on the xformers library; the idea is to not store the attention weights and to not compute the key/query scores that are masked out by causality. (A small stand-in example follows this list.)
  2. The amount of activations recomputed during the backward pass is reduced by manually implementing the backward function of the Transformer layers instead of relying on PyTorch's autograd, i.e., expensive activations are saved rather than recomputed. Model and sequence parallelism are also used to improve training speed. Both improvements are described in the paper "Reducing Activation Recomputation in Large Transformer Models".
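
The paper's implementation uses xformers; as a stand-in, the sketch below uses PyTorch's fused scaled_dot_product_attention (available since PyTorch 2.0), which illustrates the same idea of never materializing the full attention-weight matrix and skipping masked positions. Shapes here are made up.

```python
# Illustrative only: a fused causal attention call, not the authors' xformers code.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# is_causal=True applies the lower-triangular mask inside the kernel,
# so the full seq_len x seq_len weight matrix is never stored.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```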

When training the LLaMA-65B model, the authors report a throughput of about 380 tokens per second per GPU on 2048 A100 GPUs with 80 GB of memory each, so training on the 1.4T-token dataset takes about 21 days.
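
A quick back-of-the-envelope check (mine, not from the paper) confirms those numbers are consistent:

```python
# Sanity check: 380 tokens/s per GPU across 2048 GPUs, 1.4T tokens total.
tokens_per_second = 380 * 2048                    # ~778k tokens/s cluster-wide
total_tokens = 1.4e12
days = total_tokens / tokens_per_second / 86400   # seconds -> days
print(round(days, 1))                             # ~20.8, consistent with "about 21 days"
```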

4. Main Results (Experiments)

The experimental results (shown as figures and tables in the paper) cover:

  1. Zero-shot common-sense reasoning tasks
  2. Closed-book question answering (Natural Questions and TriviaQA)
  3. Reading comprehension
  4. Mathematical reasoning
  5. Code generation
  6. Massive multitask language understanding (MMLU)
  7. How performance on these benchmarks evolves over the course of training

6. Bias, Toxicity, and Misinformation

Although the model does somewhat better than its predecessors on these evaluations, its scores are still low in absolute terms, and like GPT-3 and other models it still makes things up (hallucinates).

Relatively unimportant; skipping it.

7. How Much Carbon Is Emitted by Training the LLaMA Models

The authors also look at LLaMA from this angle: since the pre-trained models are released, others will not need to redo the training, which should save future carbon emissions. Heh.

8. Future Work

The authors plan to follow the human-instruction approach of InstructGPT for further work, and to scale up to larger models trained on more data.

Origin blog.csdn.net/a1920993165/article/details/130044242