In the previous article, we gave a brief overview of the Llama model. In this article, we will discuss in detail what you need to know about Llama 2.
01-How good is the performance of Llama 2?
As Meta's newly released state-of-the-art open-source large language model, Llama 2 is the continuation and upgrade of Llama. The Llama 2 family includes the Llama 2 pre-trained model and the Llama 2-chat fine-tuned model, each available in 7B, 13B and 70B parameter versions, covering different application scenarios.
1.1 Training data
Llama 2's pre-training corpus is 40% larger than Llama's, growing to 2 trillion tokens, and the text sources in the training data are more diverse. In addition, the corresponding fine-tuned models were trained on more than 1 million human-annotated examples.
Figure 1: Overview of Llama 2 model [1]
1.2 Model evaluation
In terms of evaluation, Llama 2 outperforms the first-generation Llama and existing open-source large models on many benchmarks, including reasoning, programming, dialogue ability and knowledge tests.
Figure 2: Llama 2 scores on different benchmarks.
Although Llama 2-70B performs close to GPT-3.5 on reasoning tasks, its overall performance still cannot match the closed-source models OpenAI's GPT-4 and Google's PaLM-2-L; it lags especially far behind both on programming benchmarks.
Figure 3: Scores of Llama 2, GPT and PaLM on different benchmark tests
02-Unlock the model structure of Llama 2
2.1 Llama 2 model architecture
Llama 2 is very similar to the first-generation model in terms of pre-training settings and model architecture.
As shown in the figure, the Llama-series models all use an autoregressive Transformer, i.e. a decoder-only architecture. The two generations of models are consistent in the following respects:
Pre-normalization: the input of each Transformer sub-layer is normalized using the RMSNorm normalization function
SwiGLU activation function: the SwiGLU activation replaces the ReLU activation of the original Transformer in the feed-forward network (FFN) to improve performance
Rotary Positional Embeddings (RoPE): RoPE encodes both relative and absolute position information, improving the model's generalization ability
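To make the first of these components concrete, here is a minimal NumPy sketch of RMSNorm (an illustrative toy, not Llama 2's actual implementation): unlike LayerNorm, it divides by the root mean square of the activations without subtracting the mean.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root mean square of the last dimension
    # (no mean subtraction, unlike LayerNorm), then apply a learned scale.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one token, hidden size 4
w = np.ones(4)                         # learned gain, initialized to 1
y = rms_norm(x, w)
print(y)
```

After normalization the activations have (approximately) unit root mean square, which stabilizes training at a lower cost than full LayerNorm.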
2.2 Llama 2 training highlights
In addition to the increase in training data mentioned above, two highlights of the Llama 2 training process deserve our attention. First, the expanded context length improves the model's comprehension; second, the grouped-query attention mechanism improves the model's inference speed.
Context window expansion
Llama 2 doubles Llama's context length, from 2048 tokens to 4096 tokens. A longer context window enables more chat use cases, thereby improving the model's comprehension.
Grouped-Query attention
In the attention implementation, the Llama 2 34B and 70B models adopt Grouped-Query Attention (GQA); see Figure 5 and Figure 6.
Figure 6: Llama 2 using GQA[2]
Autoregressive decoding speeds up attention computation by caching the key (K) and value (V) pairs of the previous tokens in the sequence. However, as the batch size and context window grow, the memory cost of multi-head attention (MHA) increases significantly.
Figure 7: "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" [3]
The advantage of GQA is that it groups the query heads and shares K and V within each group, so that the K and V projections can be shared across multiple heads. This significantly reduces compute and memory requirements and improves inference speed.
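The KV sharing described above can be sketched in a few lines of NumPy (a toy illustration with made-up head counts and no causal mask, not the actual Llama 2 code):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d) with n_kv_heads < n_heads.
    # Each group of n_heads // n_kv_heads query heads shares one KV head,
    # so the KV cache is n_heads / n_kv_heads times smaller than in MHA.
    n_heads, seq, d = q.shape
    group = n_heads // k.shape[0]
    k = np.repeat(k, group, axis=0)  # expand shared KV heads to all query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))   # 8 query heads
k = rng.standard_normal((2, 5, 16))   # only 2 KV heads are cached
v = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

With 8 query heads and 2 KV heads, the KV cache shrinks by a factor of 4 while the output shape matches standard multi-head attention.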
2.3 Llama 2-chat fine-tuning process
Meta is committed to training reward models on preference data and then using reinforcement learning to optimize them to improve the quality of generation.
SFT + RLHF (RS and PPO)
Similar to InstructGPT, the fine-tuning process of the Llama 2-chat dialogue model is divided into:
Supervised fine-tuning (SFT) of the Llama 2 base model obtained from self-supervised pre-training
Reinforcement learning with human feedback (RLHF): Rejection Sampling + Proximal Policy Optimization
RLHF uses two optimization algorithms: Rejection Sampling fine-tuning (RS) and Proximal Policy Optimization (PPO). The principle of rejection sampling is to sample K outputs from the model, score them with the best reward model available at the moment, and select the output with the highest reward. The gradient is then updated on that sample, and in the reinforcement learning stage RS is combined with PPO for further optimization.
Figure 8: Llama 2-chat fine-tuning process [1]
Meta iterated five RLHF versions in total, from V1 to V5, but only the latest V5 version has been released. The V5 iteration steps are shown in the figure below.
Figure 9: RLHF-V5 iteration process
Quality Is All You Need
Meta trained two independent reward models on user preference data, a Helpfulness RM and a Safety RM, to optimize helpfulness and safety respectively. Regarding SFT, the official Llama 2 paper [2] emphasizes that even a small amount of high-quality SFT data can significantly improve the quality of the results ("Quality Is All You Need"). The paper is also the first to point out that RLHF fundamentally raises the upper limit of large model performance.
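As a rough illustration of how such a reward model is trained on preference pairs, here is the binary ranking loss commonly used for this purpose (the Llama 2 paper uses this form with an optional margin term; the scores below are made-up numbers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_ranking_loss(r_chosen, r_rejected, margin=0.0):
    # Preference-pair ranking loss: push the reward of the human-preferred
    # (chosen) response above that of the rejected one. A positive margin
    # demands a larger gap when the preference is strongly expressed.
    return -np.log(sigmoid(r_chosen - margin - r_rejected))

print(reward_ranking_loss(2.0, 0.5))  # small loss (~0.20): ranking is correct
print(reward_ranking_loss(0.5, 2.0))  # large loss (~1.70): ranking is wrong
```

Minimizing this loss over many preference pairs is what turns raw human comparisons into a scalar reward signal for RS and PPO.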
Figure 10: "Quality Is All You Need" emphasized in the Llama 2 paper [2]
In summary, the most important takeaway from the Llama 2 training process is that the reward model is not only the key to RLHF, but also the key to the performance of the entire large model; and data quality, in turn, is the key to the reward model. [4]
03-Llama 2 practice on UCloud UK8S
3.1 Environment preparation
Download the model
Clone the Llama 2 model from Hugging Face [5]: https://huggingface.co/meta-llama. This article uses the Llama 2-chat-7b model.
Install the WebUI tool
oobabooga's open-source text-generation-webui [6] is a visualization toolkit for large models. The installation steps are as follows:
a. Go to the text-generation-webui GitHub repository
(https://github.com/oobabooga/text-generation-webui)
b. Choose either the one-click installer or manual installation
c. Place the Llama 2 model files into the text-generation-webui/models directory. The file structure is as follows:
3.2 Build the image
Follow the instructions of the UHub container image registry
(https://docs.ucloud.cn/uhub/guide):
1. First, create an image repository on UHub
2. Second, build the image on a cloud host and tag it:
docker image build -t {tag name} .
docker tag {local image name} uhub.service.ucloud.cn/{repository}/{image}:{tag}
3. Finally, push the image from the cloud host to UHub:
docker push uhub.service.ucloud.cn/{repository}/{image}:{tag}
3.3 Configure UK8S cluster
1. Create a UFS file system and mount it
(https://docs.ucloud.cn/ufs/ufs_guide/create)
2. Create the UK8S container cloud
Refer to the documentation (https://docs.ucloud.cn/uk8s/). When creating a cluster, the Node configuration can follow the figure below:
After the cluster is created, click the "Details" button and copy the "External Network Credentials" into the ~/.kube/config file. The kubectl command-line tool also needs to be installed and configured
(https://docs.ucloud.cn/uk8s/manageviakubectl/connectviakubectl)
3. Use UFS in UK8S
Use the created UFS file system as the shared storage of the UK8S cluster.
Create the PV and PVC according to the "Using UFS in UK8S" documentation
(https://docs.ucloud.cn/uk8s/volume/ufs)
a. Create the Pod: write the configuration file ufspod.yml (note that Kubernetes resource and image names must be lowercase and contain no spaces, so we use llama2 rather than "Llama 2"):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama2
spec:
  selector:
    matchLabels:
      app: llama2
  replicas: 1
  template:
    metadata:
      labels:
        app: llama2
    spec:
      containers:
      - name: llama2
        image: uhub.service.ucloud.cn/llama2/llama2-test:v1  # replace with your repository
        volumeMounts:
        - mountPath: "/app/models"
          name: mypd
        ports:
        - containerPort: 7861
      volumes:
      - name: mypd
        persistentVolumeClaim:
          claimName: ufsclaim
Apply the configuration file:
kubectl apply -f ufspod.yml
# deployment.apps/llama2 created
b. Enter the Pod
Query the Pod name:
kubectl get pods
Start a Bash Shell inside the Pod:
kubectl exec -it {pod_name} -- /bin/bash
c. Run server.py for online inference
python server.py --model Llama-2-7b-chat-hf --listen
At this point, we can have a conversation with Llama 2 on the web. In addition, UCloud has launched a Llama 2 GPU cloud host image, available out of the box, to help AI developers quickly build Llama 2 inference and fine-tuning environments. For details, see the "Llama 2 Model Rapid Deployment" document (https://docs.ucloud.cn/gpu/practice/LLaMA2).
In this issue, we introduced what you need to know about Llama 2. Thanks to its small size and open-source nature, the Llama family enjoys high popularity and a strong reputation in the AI community. It is foreseeable that more customized fine-tuned models and related services based on Llama 2 will emerge in the short term. In the next article, we will focus on deploying and running "LangChain + large model + vector database" inference in the cloud, so stay tuned!
[References]
[1] Llama 2 official announcement: https://ai.meta.com/llama/
[2] Llama 2 official paper: https://huggingface.co/papers/2307.09288
[3] "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" by Google Research:https://arxiv.org/pdf/2305.13245.pdf
[4] "Llama 2: an incredible open LLM" by Nathan Lambert: https://www.interconnects.ai/p/llama-2-from-meta
[5] Llama 2 models: https://huggingface.co/meta-llama
[6] Text generation web UI GitHub: https://github.com/oobabooga/text-generation-webui