Interpretation of an open-source large language model: LLaMA (paper: LLaMA: Open and Efficient Foundation Language Models)
1. Introduction
LLaMA is a collection of foundation language models released by Meta AI in four parameter scales: 7B, 13B, 33B, and 65B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks with roughly 1/10 of the parameters. LLaMA-65B is also competitive with the best models in the industry, Chinchilla-70B and PaLM-540B. The open-source code is easy to find on GitHub, and the original paper is publicly available as well.
2. Abstract and Introduction
One issue with large language models is that bigger is not automatically better: for a fixed inference budget, a smaller model trained on more data can achieve better performance than a larger one. The authors found that the performance of a 7B model was still improving even after 1T tokens. The goal of LLaMA is therefore to get the best possible performance out of smaller models by training them on more tokens.
In addition, all of LLaMA's training data comes from publicly available sources on the Internet. The paper describes the models and their training details.
3. Method
3.1 Dataset
The dataset mix is shown in the figure below; all of it comes from open sources on the Internet. The paper describes each source in more detail, and the dataset appears to be available on Hugging Face.
3.2 Model Architecture
The overall architecture is still the Transformer decoder from the paper Attention Is All You Need. The following are three improvements over the original Transformer architecture:

- Use RMSNorm (Root Mean Square Layer Normalization) to normalize the input of each sub-layer; see the paper Root Mean Square Layer Normalization.
- Use the SwiGLU activation function, as in PaLM; see the paper GLU Variants Improve Transformer.
- Use rotary position embeddings (RoPE) instead of absolute position embeddings; see the paper RoFormer: Enhanced Transformer with Rotary Position Embedding.
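The three changes above can be sketched in plain NumPy. This is an illustrative reimplementation, not the paper's code; the shapes, the `eps` value, and the two-halves RoPE layout are assumptions for clarity.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square of the features.
    Unlike LayerNorm, there is no mean subtraction and no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swiglu(x, W, V):
    """SwiGLU feed-forward gate: Swish(x @ W) * (x @ V)."""
    a = x @ W
    return (a * sigmoid(a)) * (x @ V)

def rotary(x, pos, base=10000.0):
    """Rotary embedding: rotate feature pairs by a position-dependent
    angle, so dot products depend only on relative position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)
```

Note that `rotary` applied at position 0 leaves the vector unchanged, which is the expected behavior of a relative position encoding.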
3.3 Optimizer
LLaMA is trained with the AdamW optimizer; see the paper Decoupled Weight Decay Regularization. The following table lists some of the training hyperparameters.
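The "decoupled" part of AdamW can be illustrated with a single hand-rolled update step. This is a sketch, not the training code; the hyperparameter values shown are common choices for LLM training, not necessarily the exact ones from the paper's table.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a parameter tensor.
    t is the 1-based step count, m and v are the running moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weights directly, instead of
    # adding wd * param to the gradient (the classic Adam + L2 trick).
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + weight_decay * param)
    return param, m, v
```

Because the decay term bypasses the adaptive scaling by `sqrt(v_hat)`, weights shrink even when their gradient is zero, which is what distinguishes AdamW from Adam with L2 regularization.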
3.4 Other Efficiency Improvements
- Use an efficient implementation of causal multi-head attention to speed up training. This borrows from the xformers library; the idea is to avoid storing the attention weights and to skip computing the scores that are masked out by causality.
- The backward pass of the Transformer layers was implemented manually instead of relying on PyTorch autograd, so that expensive activations can be saved during the forward pass rather than recomputed in the backward pass. Model and sequence parallelism are also used to speed up training. Both improvements are described in the paper Reducing Activation Recomputation in Large Transformer Models.
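For contrast with the first point, here is a naive causal-attention reference in NumPy that *does* materialize the full (seq, seq) weight matrix. The memory-efficient xformers kernels produce the same output without ever storing this matrix; this sketch is an illustration of the computation being optimized, not the xformers implementation.

```python
import numpy as np

def naive_causal_attention(q, k, v):
    """Single-head causal attention, materializing the full weight
    matrix. q, k, v have shape (seq_len, head_dim)."""
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions (strictly above the diagonal).
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Position 0 can only attend to itself, so the first output row equals `v[0]` exactly, which is a handy sanity check for any causal-attention implementation.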
When training the LLaMA-65B model, the authors report a processing speed of about 380 tokens per second per GPU on 2048 A100 GPUs with 80GB of memory each. At that rate, training on a dataset of 1.4T tokens takes about 21 days.
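These reported numbers are self-consistent, which a quick back-of-the-envelope calculation confirms:

```python
tokens_per_sec_per_gpu = 380
gpus = 2048
total_tokens = 1.4e12

cluster_throughput = tokens_per_sec_per_gpu * gpus  # tokens/sec overall
days = total_tokens / cluster_throughput / 86400    # 86400 sec per day
print(round(days, 1))  # 20.8 -- matching the ~21 days in the paper
```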
4. Main conclusions (experimental results)
The paper reports experimental results across several task families (see the figures and tables in the original paper):

- Zero-shot commonsense reasoning
- Closed-book question answering (Natural Questions)
- Reading comprehension
- Mathematical reasoning
- Code generation
- Massive multitask language understanding
- How performance on these benchmarks evolves over the course of training
6. The model can generate harmful and false content
Although the model does somewhat better on these benchmarks than prior models, its scores are still low, and like GPT-3 and other models it still hallucinates, fabricating plausible-sounding falsehoods.
Relatively unimportant, skip it.
7. Carbon emissions from training the LLaMA models
The author's framing is that releasing LLaMA helps the environment: since the pretrained models are public, others do not have to spend the same compute to re-train them, haha.
8. Future Work
The authors plan to explore instruction tuning with human feedback, following the approach of InstructGPT, and to scale up to larger models and more data.