Large Model Technology Practice (2) | What You Need to Know About Llama 2

In the previous article, we gave a brief overview of the Llama model family. In this article, we discuss in detail what you need to know about Llama 2.

01-How good is the performance of Llama 2?


As Meta's newly released state-of-the-art open-source large language model, Llama 2 is the continuation and upgrade of the original Llama. The Llama 2 family includes the Llama 2 pre-trained models and the Llama 2-chat fine-tuned models, each available in 7B, 13B and 70B parameter versions, covering a range of application scenarios.

1.1 Training data

Llama 2's pre-training corpus is 40% larger than Llama's, growing to 2 trillion tokens, and the text sources in the training data are more diverse. In addition, the corresponding Llama 2-chat fine-tuned models were trained on more than 1 million human-annotated examples.
Figure 1: Overview of the Llama 2 models [1]


1.2 Model evaluation

In terms of evaluation, Llama 2 outperforms the first-generation Llama and other existing open-source large models on many benchmarks, including reasoning, coding, dialogue ability and knowledge tests.
Figure 2: Llama 2 scores on different benchmarks

Although Llama 2-70B comes close to GPT-3.5 on reasoning tasks, it still cannot compete with closed-source models such as OpenAI's GPT-4 and Google's PaLM-2-L in overall performance, and it lags far behind both on coding benchmarks in particular.
Figure 3: Scores of Llama 2, GPT and PaLM on different benchmarks


02-Unlocking the model structure of Llama 2


2.1 Llama 2 model architecture

Llama 2 is very similar to the first-generation model in terms of pre-training settings and model architecture.
The Llama series models all use an autoregressive Transformer architecture, i.e., a decoder-only Transformer. The two generations of models are consistent in the following respects (a minimal code sketch of these components follows the list):

Pre-normalization: the input to each Transformer sub-layer is normalized, using the RMSNorm normalization function

SwiGLU activation function: the feed-forward network (FFN) replaces the standard Transformer's ReLU activation with SwiGLU to improve performance

Rotary Positional Embeddings (RoPE): RoPE encodes absolute positions while preserving relative position information, improving the model's generalization ability
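To make the first two components concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block, written from the published formulas rather than from Meta's reference implementation; the module and dimension names are our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: x * rsqrt(mean(x^2)) * g, with no mean-centering or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.weight

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU gating: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

In a pre-normalization layout, RMSNorm is applied to the input of each attention and FFN sub-layer rather than to its output.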


2.2 Llama 2 training highlights

Beyond the increase in training data discussed above, two aspects of Llama 2's training process deserve our attention. First, the expanded context length improves the model's comprehension; second, the grouped-query attention mechanism improves the model's inference speed.



Context window expansion

Llama 2 doubles Llama's context length, from 2048 tokens to 4096 tokens. A longer context window supports more chat use cases, such as longer conversation histories, and thereby improves the model's comprehension.



Grouped-Query Attention

In their attention implementation, the larger Llama 2 models (34B and 70B) adopt the grouped-query attention mechanism (Grouped-Query Attention, GQA); see Figures 5 and 6.
Figure 6: Llama 2 using GQA [2]

Autoregressive decoding speeds up attention computation by caching the key (K) and value (V) pairs of previously generated tokens in the sequence. However, as the batch size and context window grow, the memory cost of this KV cache under multi-head attention (Multi-Head Attention, MHA) increases significantly.
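As a back-of-the-envelope illustration of this cost, the sketch below computes the KV-cache size in fp16; the layer and head dimensions match the published 70B configuration, while the batch size is our own illustrative choice.

def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Memory for the cached K and V tensors across all layers (fp16 by default)."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# 70B-scale dimensions: 80 layers, 64 attention heads, head_dim 128.
mha = kv_cache_bytes(batch=16, seq_len=4096, n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(batch=16, seq_len=4096, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**30:.0f} GiB; GQA with 8 KV heads: {gqa / 2**30:.0f} GiB")
# MHA: 160 GiB; GQA with 8 KV heads: 20 GiB

Sharing KV heads, as GQA does, divides this cache by the group size.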
Figure 7: From "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" [3]

The advantage of GQA is that it partitions the query heads into groups and shares K and V within each group, so that a single K/V head serves multiple query heads. This significantly reduces compute and memory requirements and improves inference speed.
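A minimal sketch of the grouped-query idea (simplified: no KV cache, no RoPE, and our own tensor layout rather than Meta's code):

import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    seq, head_dim = q.shape[2], q.shape[3]
    # Repeat each K/V head once per query head in its group.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (b, n_q_heads, seq, seq)
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # (b, n_q_heads, seq, head_dim)

# MHA is the special case n_kv_heads == n_q_heads; MQA is n_kv_heads == 1.
out = grouped_query_attention(torch.randn(1, 8, 16, 64),   # 8 query heads
                              torch.randn(1, 2, 16, 64),   # 2 KV heads -> groups of 4
                              torch.randn(1, 2, 16, 64))

Only the K/V projections shrink; at decode time the cache stores just the n_kv_heads shared heads, which is where the memory saving comes from.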


2.3 Llama 2-chat fine-tuning process


Meta's approach is to train reward models on human preference data and then optimize the dialogue model against them with reinforcement learning, improving the quality of generation.

SFT + RLHF with RS and PPO

Similar to InstructGPT, the fine-tuning of the Llama 2-chat dialogue model is divided into two stages:

supervised fine-tuning (SFT) of the Llama 2 base model obtained from self-supervised pre-training

reinforcement learning with human feedback (RLHF): rejection sampling + proximal policy optimization

RLHF uses two optimization algorithms: rejection-sampling fine-tuning (Rejection Sampling, RS) and proximal policy optimization (Proximal Policy Optimization, PPO). The principle of rejection sampling is to draw K outputs from the model, score them with the best reward model available at that moment, and select the output with the highest reward for the gradient update; in the final reinforcement-learning stage, rejection sampling is combined with PPO (a sketch of the sampling step follows below).
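A schematic of the rejection-sampling step as described above (a sketch under stated assumptions: generate and reward_model are hypothetical callables standing in for the policy and reward model, not Meta's actual interfaces):

def rejection_sample(prompt: str, generate, reward_model, k: int = 8) -> str:
    """Sample K candidate answers, score each with the current best reward
    model, and keep the highest-scoring answer as a fine-tuning target."""
    candidates = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(k), key=lambda i: scores[i])
    return candidates[best_index]

# The selected (prompt, best answer) pairs serve as new fine-tuning data;
# in the final stage this is combined with PPO updates.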








Figure 8: Llama 2-chat fine-tuning process [1]

Meta iterated through five RLHF versions in total, from V1 to V5, but only released the latest, V5. The steps of the V5 iteration are shown in the figure below.

Figure 9: RLHF-V5 iteration process

Quality Is All You Need
Meta uses two independent reward models, a Helpfulness RM and a Safety RM, each trained on human preference data, to optimize for helpfulness and safety respectively. On the SFT side, the official Llama 2 paper [2] emphasizes that a small amount of high-quality SFT data is enough to significantly improve the quality of the results (Quality Is All You Need). The paper is also the first to argue that "RLHF fundamentally raises the upper limit of large-model performance."
Figure 10: "Quality Is All You Need", as emphasized in the Llama 2 paper [2]

In summary, the most important lesson from the Llama 2 training process is that the reward model is key not only to RLHF but to the overall performance of the model, and that data quality in turn is key to the reward model. [4]

03-Llama 2 practice on UCloud UK8S


3.1 Download the model

Clone the Llama 2 model from HuggingFace [5]: https://huggingface.co/meta-llama. This article uses the Llama-2-7b-chat-hf model.
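Alternatively, the repository can be fetched with the huggingface_hub Python library; this sketch assumes the package is installed and that your account has been granted access to the gated meta-llama models (e.g. after huggingface-cli login).

from huggingface_hub import snapshot_download

# Download the full model repository into the webui's models directory.
# The local_dir path matches the layout used in the next step and is our
# assumption, not a requirement of the library.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="text-generation-webui/models/Llama-2-7b-chat-hf",
)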
Install the WebUI tool

We use oobabooga's open-source text-generation-webui [6], a visualization toolkit for large models. The installation steps are as follows:



a. Go to the text-generation-webui GitHub repository
(https://github.com/oobabooga/text-generation-webui)

b. Choose either the one-click installer or manual installation

c. Put the Llama 2 model files into the text-generation-webui/models directory

3.2 Build the image

Build the image following the instructions for the Uhub container image registry:
(https://docs.ucloud.cn/uhub/guide)



1. First, create an image repository on Uhub

2. Next, build the image on a cloud host and tag it:

docker image build -t {image name} .
docker tag {local image name} uhub.service.ucloud.cn/{created repository}/{image}:{tag}

3. Finally, push the image from the cloud host to Uhub:

docker push uhub.service.ucloud.cn/{created repository}/{image}:{tag}

3.3 Configure UK8S cluster



1. Create a UFS file system and mount it
(https://docs.ucloud.cn/ufs/ufs_guide/create)

2. Create the UK8S container cloud

Reference: https://docs.ucloud.cn/uk8s/. When creating the cluster, the Node configuration can follow the figure below:

After the cluster is created, click the "Details" button and copy the "External Network Credentials" into the ~/.kube/config file. The kubectl command-line tool also needs to be installed and configured:
(https://docs.ucloud.cn/uk8s/manageviakubectl/connectviakubectl?id=Install and configure kubectl)

3. Use UFS in UK8S

Use the UFS file system created above as shared storage for the UK8S cluster, creating the PV and PVC according to the "Using UFS in UK8S" documentation:
(https://docs.ucloud.cn/uk8s/volume/ufs?id=Use ufs in uk8s)


a. Create the Pod: write the configuration file ufspod.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama2
spec:
  selector:
    matchLabels:
      app: llama2
  replicas: 1
  template:
    metadata:
      labels:
        app: llama2
    spec:
      containers:
      - name: llama2
        image: uhub.service.ucloud.cn/llama2/llama2-test:v1  # replace with your repository
        volumeMounts:
        - mountPath: "/app/models"
          name: mypd
        ports:
        - containerPort: 7861
      volumes:
      - name: mypd
        persistentVolumeClaim:
          claimName: ufsclaim


Apply the configuration file:

kubectl apply -f ufspod.yml
# deployment.apps/llama2 created



b. Enter the Pod

Query the Pod name:

kubectl get pods

Start a Bash shell inside the Pod:

kubectl exec -it {pod_name} -- /bin/bash

c. Run server.py for online inference:

python server.py --model Llama-2-7b-chat-hf --listen
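To sanity-check the weights outside the web UI, here is a minimal inference sketch with the Hugging Face transformers library (assumes transformers, torch and accelerate are installed; the model path matches the models directory used above, and the prompt is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "models/Llama-2-7b-chat-hf", device_map="auto"  # spread across available GPUs
)

prompt = "[INST] What is Kubernetes? [/INST]"  # Llama 2 chat prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))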



At this point, we can chat with Llama 2 through the web interface.

In addition, UCloud has released a Llama 2 GPU cloud host image that works out of the box, helping AI developers quickly set up Llama 2 inference and fine-tuning environments. For details, see the "Llama 2 Model Rapid Deployment" document:
(https://docs.ucloud.cn/gpu/practice/LLaMA2?id=llama2-Rapid Model Deployment)

In this issue, we covered what you need to know about Llama 2. Thanks to their relatively small size and open licensing, the Llama models enjoy high popularity and a strong reputation in the AI community, and it is foreseeable that more customized fine-tuned models and related services built on Llama 2 will emerge in the short term. In the next article, we will focus on deploying and running "LangChain + large model + vector database" inference in the cloud, so stay tuned.

[References]

[1] Llama 2 official announcement: https://ai.meta.com/llama/

[2] Llama 2 official paper: https://huggingface.co/papers/2307.09288
[3] "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" by Google Research:https://arxiv.org/pdf/2305.13245.pdf

[4] "Llama 2: an incredible open LLM" by Nathan Lambert: https://www.interconnects.ai/p/llama-2-from-meta

[5] Llama 2 models: https://huggingface.co/meta-llama

[6] Text generation web UI GitHub: https://github.com/oobabooga/text-generation-webui
