Paper Reading_ChatGLM

Article information

name_en: GLM-130B: An Open Bilingual Pre-trained Model
name_ch: GLM-130B: Open Bilingual Pre-training model
Paper_addr: https://arxiv.org/abs/2210.02414
DOI: 10.48550/arXiv.2210.02414
date_read: 2023-03-23
date_publish: 2023-01-01
tags: ['Deep Learning','Natural Language Processing']
author: Aohan Zeng
code: https://github.com/THUDM/GLM-130B/
citation: 4

After reading

In November 2022, Stanford University's large model research center conducted a comprehensive evaluation of 30 mainstream large models worldwide, and GLM-130B was the only model from Asia included. On accuracy and harmfulness metrics, GLM-130B is close to or on par with GPT-3 175B (davinci).
The biggest advantage of ChatGLM is that it is open source and optimized for Chinese. In particular, you can run a lightweight INT4-quantized version as a service on your own machine, and in my tests it answers general questions reasonably well. The environment setup steps are given at the end of this article.

Summary

GLM-130B is a Chinese-English bilingual pre-trained large language model with 130 billion parameters, trained on about 400B tokens.
Its architecture combines ideas from GPT and BERT. On English benchmarks it outperforms GPT-3; on Chinese benchmarks it outperforms ERNIE TITAN 3.0, which has 260B parameters. With quantization, inference can run on 4×RTX 3090 (24 GB) or 8×RTX 2080 Ti (11 GB) GPUs.

Introduction

The paper builds on the General Language Model (GLM); the main techniques are bidirectional attention and an autoregressive blank-infilling objective. An embedding gradient shrink strategy significantly improves the training stability of GLM-130B.

Method

Structure

GLM architecture

Unlike GPT, PaLM and other models that use only the Transformer decoder, GLM-130B uses the bidirectional General Language Model (GLM) as its backbone. The model structure is detailed in the paper "GLM: General Language Model Pretraining with Autoregressive Blank Infilling" (2022).

GLM is a Transformer-based language model trained with an autoregressive blank-infilling objective. Briefly, for a text sequence x = [x1, ..., xn], text spans {s1, ..., sm} are sampled from it, where each si is a span of consecutive tokens; each si is replaced with a single mask token, and the model is asked to restore the spans autoregressively. Unlike GPT-style models, GLM applies bidirectional (unmasked) attention over the corrupted context, and it mixes two kinds of mask tokens to support both understanding and generation:

[MASK]: short blanks within a sentence, whose total length amounts to a certain fraction of the input
[gMASK]: a long blank of random length at the end of a sentence, with the prefix provided as context

In theory, the blank-infilling objective with bidirectional attention can exploit context more effectively than GPT-style unidirectional models: when [MASK] is used, GLM-130B behaves like BERT and T5; when [gMASK] is used, it behaves like a PrefixLM.
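As a rough illustration of how a blank-infilling training example might look, here is a toy Python sketch; the helper function and the exact special tokens ([sop], [eop]) are simplifications of the GLM convention, not the authors' actual data pipeline.

# Toy sketch of GLM-style blank infilling (hypothetical helper, not the official pipeline).
def build_infilling_example(tokens, span, mode="[MASK]"):
    """tokens: list of input tokens; span: (start, end) indices of the masked span."""
    start, end = span
    masked_span = tokens[start:end]
    if mode == "[MASK]":
        # Short in-sentence blank: text on both sides stays visible as bidirectional context.
        corrupted = tokens[:start] + ["[MASK]"] + tokens[end:]
    else:
        # [gMASK]: everything after the prefix is blanked; only the prefix is context.
        corrupted = tokens[:start] + ["[gMASK]"]
    # The model attends bidirectionally over `corrupted` and must autoregressively
    # generate the masked span, delimited here by [sop] ... [eop].
    target = ["[sop]"] + masked_span + ["[eop]"]
    return corrupted + target

print(build_infilling_example(["GLM", "is", "a", "bilingual", "language", "model"], span=(3, 4)))
# ['GLM', 'is', 'a', '[MASK]', 'language', 'model', '[sop]', 'bilingual', '[eop]']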

GLM-130B achieves 80.2% accuracy on zero-shot LAMBADA, outperforming both GPT-3 and PaLM 540B (Figure 2 of the paper).

Normalization method

Normalization helps improve the stability of model training. This paper uses the DeepNorm method proposed in 2022 (see the paper "DeepNet: Scaling Transformers to 1,000 Layers" for details). The formula is:

DeepNorm(x) = LayerNorm(α · x + Network(x)),  with α = (2N)^(1/2)

where N is the number of layers. This method effectively improves training stability.
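A minimal PyTorch-style sketch of the DeepNorm residual connection described above (the sublayer is a stand-in for an attention or FFN block; DeepNorm also rescales sublayer weight initialization, which is omitted here):

import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Residual connection with DeepNorm: LayerNorm(alpha * x + sublayer(x))."""
    def __init__(self, hidden_size, sublayer, num_layers):
        super().__init__()
        self.sublayer = sublayer                 # e.g. an attention or FFN block
        self.alpha = (2 * num_layers) ** 0.5     # alpha = (2N)^(1/2)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

# Example: wrap a toy feed-forward sublayer, assuming a 70-layer stack.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = DeepNormResidual(512, ffn, num_layers=70)
out = block(torch.randn(2, 16, 512))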

Position Encoding and Feedforward Networks

GLM-130B adopts rotary position embedding (RoPE) for positional encoding, and the FFN uses a GLU variant with the GeLU activation.
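For reference, a compact, generic sketch of rotary position embedding applied to a query tensor (not the exact GLM-130B implementation):

import torch

def rotary_embedding(x, base=10000):
    """Apply rotary position embedding to x of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(128, 8, 64)      # (seq_len, heads, head_dim)
q_rot = rotary_embedding(q)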

Training settings

The GLM-130B pre-training objective includes not only self-supervised GLM autoregressive blank infilling but also multi-task learning on a small fraction of the tokens, to improve performance on downstream zero-shot tasks.

Self-supervised blank infilling (95%)

Both [MASK] and [gMASK] are used, with each training sequence using one of them. Specifically, [MASK] masks consecutive spans of tokens in 30% of the training sequences for blank infilling. For the other 70% of the sequences, a prefix of each sequence is kept as context and [gMASK] masks the remainder.
The pre-training data includes 1.2T of English text, 1.0T of the Chinese WudaoCorpora, and 250 GB of Chinese corpora crawled from the web (including online forums, encyclopedias and QA), giving a balanced English-Chinese composition.
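A toy sketch of the 30%/70% split between the two mask modes when assembling training sequences (illustrative only; the real data pipeline is more involved):

import random

def sample_mask_mode():
    # ~30% of sequences: short-span [MASK] infilling within the text;
    # ~70% of sequences: keep a random-length prefix as context and [gMASK] the rest.
    return "[MASK]" if random.random() < 0.3 else "[gMASK]"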

Multi-task instruction pre-training (MIP, 5%)

During pre-training, a variety of instruction-prompted datasets covering language understanding, generation, and information extraction are added to train the model.

Parallel training and model configuration

Training ran for 60 days on a cluster of 96 DGX-A100 servers (8 × 40 GB GPUs each). Pipeline model parallelism is combined with tensor parallelism and data parallelism, forming a 3D parallelism strategy.
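As a rough illustration of how the three parallelism dimensions multiply out to the total GPU count, here is a small accounting sketch; the tensor/pipeline degrees below are hypothetical placeholders, not the paper's exact configuration:

# Hypothetical 3D-parallelism accounting (illustrative degrees, not the exact GLM-130B setup).
total_gpus = 96 * 8              # 96 DGX-A100 servers x 8 GPUs each = 768 GPUs
tensor_parallel = 4              # split each layer's weight matrices across 4 GPUs
pipeline_parallel = 8            # split the layer stack into 8 pipeline stages
data_parallel = total_gpus // (tensor_parallel * pipeline_parallel)
print(data_parallel)             # 24-way data parallelism under these assumptions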

Stability of model training

A balance must be struck between precision and stability. Low-precision floating-point formats improve computational efficiency, but they are prone to overflow and underflow errors, which can crash training.

Mixed precision

FP16 is used for the forward and backward passes, while FP32 is used for the optimizer states and master weights, which reduces GPU memory usage and improves training efficiency.
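A generic PyTorch mixed-precision training loop is sketched below to illustrate the idea; it uses the standard torch.cuda.amp API as a stand-in, whereas the paper's setup keeps FP32 master weights in the optimizer (closer to Apex-style O2):

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()                    # dynamic loss scaling guards against FP16 underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with autocast():                     # run eligible ops in FP16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()        # scale the loss before backward
    scaler.step(optimizer)               # unscale grads, skip the step on inf/nan
    scaler.update()
    optimizer.zero_grad(set_to_none=True)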

Embedding layer gradient shrinkage

Experiments show that the gradient norm can serve as an informative indicator of training crashes. Specifically, training crashes typically lag behind a "spike" in the gradient norm by a few training steps. Shrinking the gradient of the embedding layer is found to suppress these loss spikes, thereby stabilizing GLM-130B training.
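The idea can be expressed very compactly: rescale the embedding output so that the forward value is unchanged, but only a fraction alpha of the gradient flows back into the embedding table. A sketch of this (alpha is a small shrink factor, e.g. 0.1):

import torch

def shrink_embedding_gradient(emb_output, alpha=0.1):
    # Forward value is unchanged: alpha * x + (1 - alpha) * x.detach() == x,
    # but only the alpha * x term contributes to the backward pass.
    return emb_output * alpha + emb_output.detach() * (1 - alpha)

# Usage sketch: apply right after the word-embedding lookup.
embedding = torch.nn.Embedding(50000, 1024)
tokens = torch.randint(0, 50000, (4, 128))
hidden = shrink_embedding_gradient(embedding(tokens), alpha=0.1)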

Model inference on RTX 2080 Ti

The focus is on quantizing the model weights while keeping FP16 precision for activations. Quantized weights are dynamically converted back to FP16 at runtime, which introduces a small computational overhead but greatly reduces the GPU memory needed to store the model weights. The paper successfully applies INT4 weight quantization to GLM-130B, and the quantized model has been released for download.
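For intuition, here is a simplified sketch of symmetric per-row weight quantization and runtime dequantization to FP16 (the released GLM-130B kernels are more sophisticated; INT4 values are stored in an int8 container here for simplicity):

import torch

def quantize_weight(w, bits=4):
    # Symmetric per-row quantization of a weight matrix (scales computed in FP32).
    w = w.float()
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output row
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_weight(q, scale):
    # Convert stored low-bit weights back to FP16 at inference time.
    return (q.float() * scale).half()

w = torch.randn(4096, 4096)
q, scale = quantize_weight(w)
w_fp16 = dequantize_weight(q, scale)
print((w - w_fp16.float()).abs().mean())   # small quantization error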

Experiments

Comparison with English models:

Comparison with Chinese models:

Hands-on: environment setup

ChatGLM-6B is a Chinese-English bilingual dialogue language model with 6.2 billion parameters; after quantization, the code plus model takes up only a few GB.

Download code and models

Due to my machine's limited resources, I downloaded the INT4 model, which takes up about 5 GB of disk space and about 5 GB of GPU memory when running.

$ git clone https://github.com/THUDM/ChatGLM-6B
$ git clone https://huggingface.co/THUDM/chatglm-6b-int4/

From https://huggingface.co/THUDM/chatglm-6b-int4/tree/main, download the two large files ice_text.model and pytorch_model.bin, and use them to replace the corresponding files in the cloned repository.
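Once the environment below is in place, you can sanity-check the downloaded weights from a Python shell before wiring up the web demo; this follows the usage pattern from the ChatGLM-6B README, with the local path assumed to match the directory layout above:

from transformers import AutoTokenizer, AutoModel

# Load the INT4 model from the local clone (path is an assumption; adjust to your layout).
tokenizer = AutoTokenizer.from_pretrained("../chatglm-6b-int4/", trust_remote_code=True)
model = AutoModel.from_pretrained("../chatglm-6b-int4/", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)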

Download the runtime environment image

If you use docker to start, the recommended image is:

$ docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

Start the container

$ nvidia-docker run -e "LC_ALL=zh_CN.UTF-8" -e "LANGUAGE=zh_CN.UTF-8" -e "LANG=zh_CN.UTF-8" -p 7860:7860 --rm -v /exports:/workspace/exports -it pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash

Adjust the code

Modify web_demo.py

  • Change the model name to point to the local model directory:
tokenizer = AutoTokenizer.from_pretrained("../chatglm-6b-int4/", trust_remote_code=True)
model = AutoModel.from_pretrained("../chatglm-6b-int4/", trust_remote_code=True).half().cuda()
  • Set server_name to 0.0.0.0 so that the service can be accessed from outside the Docker container:
demo.queue().launch(share=False, inbrowser=True, server_name="0.0.0.0")

Run the service

Run inside the container:

$ cd ChatGLM-6B/
$ pip install -r requirements.txt
$ python web_demo.py

After starting the service, you can access it from the host's browser on port 7860.

Personally, I find the response speed quite fast and the quality of the answers decent.

Related Links

  • ChatGLM project address
  • Model introduction
  • Download

Origin blog.csdn.net/xieyan0811/article/details/129778822