Language model: Application of GPT and HuggingFace

This article is shared from the Huawei Cloud Community post "Do you know the underlying principles of large language models? Large Language Model Underlying Architecture, Part 2: GPT Implementation", by Hua Shanghua_Lancer.

Influenced by the computer-vision paradigm of pre-training a model on ImageNet, so that it fully learns how to extract features from massive numbers of images before being fine-tuned for the target task, methods based on pre-trained language models have gradually become mainstream in natural language processing. The dynamic word vector model represented by ELMo opened the door to language model pre-training. Since then, the emergence of large-scale Transformer-based pre-trained language models represented by GPT and BERT has brought natural language processing into the new era of the pre-train/fine-tune paradigm.

By exploiting rich training corpora, self-supervised pre-training tasks, and deep neural network structures such as the Transformer, pre-trained language models acquire general and powerful natural language representations and effectively learn lexical, syntactic, and semantic information. When applying a pre-trained model to a downstream task, there is no need to know many task details or to design a task-specific network structure; it is enough to "fine-tune" the pre-trained model, that is, to continue training it with supervision on the annotated data of the specific task, which yields significant performance improvements.

Generative Pre-Training (GPT), proposed by OpenAI in 2018, is a typical generative pre-trained language model. The GPT model structure is shown in Figure 1.1. It is a unidirectional language model composed of stacked Transformer layers and is mainly divided into three parts: the input layer, the encoding layer, and the output layer.
Next, I will focus on GPT's unsupervised pre-training, supervised fine-tuning on downstream tasks, and pre-trained language model practice based on HuggingFace.

1. Unsupervised pre-training

GPT uses a generative pre-training method. "Unidirectional" means that the model processes the text sequence in a single direction, from left to right (or from right to left), so the Transformer structure and decoding strategy it adopts ensure that each position of the input text can only depend on information from earlier positions.
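As a quick illustration (my own sketch in PyTorch, not part of the original article), the following snippet builds the kind of lower-triangular "causal" mask that enforces this left-to-right constraint in a GPT-style decoder:

import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
# Attention scores at disallowed (future) positions are set to -inf before the softmax
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = torch.softmax(scores, dim=-1)  # row i puts zero weight on positions > i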
Given a text sequence w = w_1 w_2 ... w_n, GPT first maps it to dense vectors in the input layer:

v_i^{(0)} = e_{w_i} + e_{p_i}

where e_{w_i} is the word embedding of word w_i, e_{p_i} is the position embedding of the i-th position, and v_i^{(0)} is the output of the i-th position after the model's input layer (layer 0). The input layer of GPT differs from the neural network language models introduced earlier in that it must add a position embedding: the Transformer structure itself cannot perceive position, so additional positional information has to be supplied at the input layer.

Figure 1.1 GPT pre-trained language model structure

After input-layer encoding, the model obtains the representation sequence v = v_1 ... v_n, which is then fed into the encoding layer. The encoding layer consists of L Transformer blocks. Under the self-attention mechanism, each representation vector at every layer incorporates information from the representation vectors of preceding positions, so every vector carries rich contextual information, and after multiple layers of encoding GPT obtains a hierarchical, compositional representation of each word. The computation can be expressed as:

h^{(l)} = Transformer-Block(h^{(l-1)}),  l ∈ {1, 2, ..., L},  with h^{(0)} = v

where h^{(L)} ∈ R^{n×d} denotes the representation sequence of the L-th layer, n is the sequence length, d is the hidden dimension of the model, and L is the total number of layers. The output layer of the GPT model predicts the conditional probability at each position based on the last layer's representation h^{(L)}. The computation can be expressed as:

P(w_{i+1} | w_1, ..., w_i) = Softmax(W^e h_i^{(L)} + b^{out})

where W^e ∈ R^{|V|×d} is the word embedding matrix and |V| is the vocabulary size. The unidirectional language model reads the text sequence w in its natural order and optimizes the maximum likelihood of w under the conventional language model objective, so that it learns to predict the next word accurately from the preceding history:

L^{PT}(w) = Σ_{i=1}^{n-1} log P(w_{i+1} | w_1, ..., w_i; θ)
where θ denotes the model parameters. Based on the Markov assumption, it is also possible to condition only on a fixed window of the most recent words during training. During pre-training, stochastic gradient descent with backpropagation is usually used to minimize the corresponding negative log-likelihood.
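To make the objective concrete, here is a minimal sketch (my own illustration with random stand-in tensors, not code from the article) of the autoregressive negative log-likelihood in PyTorch:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
# Stand-ins for the model's output logits (W^e h_i^(L) + b^out) and the input token ids w
logits = torch.randn(1, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (1, seq_len))

# Predict token i+1 from positions <= i: drop the last logit and the first target
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)
# Mean negative log-likelihood, i.e. the (averaged) negative of L^PT(w)
loss = F.cross_entropy(shift_logits, shift_labels)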

2. Supervised fine-tuning of downstream tasks

Through unsupervised language model pre-training, the GPT model acquires a certain general semantic representation ability. The purpose of downstream task fine-tuning (Downstream Task Fine-tuning) is to adapt this general representation to the characteristics of a specific downstream task. Downstream tasks usually require training on labeled data. Let D denote such a dataset, where each sample consists of an input text sequence x = x_1 x_2 ... x_n of length n and the corresponding label y.

First, the text sequence x is fed into the GPT model to obtain the hidden state h_n^{(L)} of the last word at the last layer. On this basis, the label prediction is produced by a fully connected layer followed by a Softmax:

P(y | x_1, ..., x_n) = Softmax(h_n^{(L)} W^y)

where W^y ∈ R^{d×k} are the parameters of the fully connected layer and k is the number of labels.

The downstream task is fine-tuned by optimizing the following objective over the entire annotated dataset D:

L^{FT}(D) = Σ_{(x, y) ∈ D} log P(y | x_1, ..., x_n)

During downstream fine-tuning, optimizing only the task objective can easily make the model forget the general semantic knowledge learned in the pre-training stage, losing its versatility and generalization ability; this is known as catastrophic forgetting (Catastrophic Forgetting). Therefore, a mixture of the pre-training loss and the downstream fine-tuning loss is often adopted to alleviate this problem. In practical applications, the following objective is usually used for downstream fine-tuning:

L(D) = L^{FT}(D) + λ L^{PT}(D)

The value of λ lies in [0, 1] and is used to adjust the weight of the pre-training task loss.
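The following small sketch (my own illustration with made-up tensors and dimensions; none of these names come from the article) shows how the downstream classification loss and the language-model loss can be combined with the weight λ in PyTorch:

import torch
import torch.nn.functional as F

d, k, vocab_size, n = 768, 3, 100, 8
h_last = torch.randn(1, d)                   # h_n^(L): last-layer state of the last token
W_y = torch.randn(d, k, requires_grad=True)  # parameters of the fully connected layer
label = torch.tensor([1])                    # downstream task label y

# Downstream loss: cross-entropy over Softmax(h_n^(L) W^y)
loss_ft = F.cross_entropy(h_last @ W_y, label)

# Pre-training loss: next-word prediction over the same input sequence
lm_logits = torch.randn(1, n, vocab_size)    # stand-in for the LM head output
tokens = torch.randint(0, vocab_size, (1, n))
loss_pt = F.cross_entropy(lm_logits[:, :-1, :].reshape(-1, vocab_size),
                          tokens[:, 1:].reshape(-1))

lam = 0.5                                    # lambda in [0, 1]
loss = loss_ft + lam * loss_pt               # L(D) = L^FT(D) + lambda * L^PT(D)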

3. Practice of pre-training language model based on HuggingFace

HuggingFace is an open-source natural language processing software ecosystem. Its goal is to make natural language processing technology more accessible to developers and researchers by providing a comprehensive set of tools, libraries, and models. One of HuggingFace's best-known contributions is the Transformers library, on which researchers can quickly deploy trained models and implement new network structures. In addition, HuggingFace provides the Datasets library, which makes it very convenient to download the benchmark datasets most commonly used in natural language processing research. In this section, we take BERT as an example to introduce how to build and use a BERT model based on HuggingFace.

3.1. Data collection preparation

Commonly used large-scale datasets for pre-training language models can be downloaded and loaded directly with the Datasets library. For example, the BookCorpus corpus and the English Wikipedia corpus can be obtained with the following code:

from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20230601.en", split="train")
# Only keep the 'text' column
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])
# Divide the data set into 90% for training and 10% for testing
d = dataset.train_test_split(test_size=0.1)

Next, save the training and test data in local files respectively.

def dataset_to_text(dataset, output_filename="data.txt"):
    """Utility function to save dataset text to disk,
    useful for using the texts to train the tokenizer
    (as the tokenizer accepts files)"""
    with open(output_filename, "w") as f:
        for t in dataset["text"]:
            print(t, file=f)
# save the training set to train.txt
dataset_to_text(d["train"], "train.txt")
# save the testing set to test.txt
dataset_to_text(d["test"], "test.txt")

3.2. Training Tokenizer

As mentioned earlier, BERT uses WordPiece tokenization, which decides whether to split a complete word into several subword tokens based on word frequencies in the training corpus. Therefore, the tokenizer needs to be trained first. This can be done with the BertWordPieceTokenizer class from the tokenizers library. The code is as follows:

import os
import json
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

special_tokens = [
    "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<S>", "<T>"
]
# if you want to train the tokenizer on both sets
# files = ["train.txt", "test.txt"]
# training the tokenizer on the training set
files = ["train.txt"]
# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522
# maximum sequence length, lowering will result in faster training (when increasing batch size)
max_length = 512
# whether to truncate
truncate_longer_samples = False
# initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()
# train the tokenizer
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
# enable truncation up to the maximum 512 tokens
tokenizer.enable_truncation(max_length=max_length)
model_path = "pretrained-bert"
# make the directory if not already there
if not os.path.isdir(model_path):
    os.mkdir(model_path)
# save the tokenizer
tokenizer.save_model(model_path)
# dumping some of the tokenizer config to config file,
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(model_path, "config.json"), "w") as f:
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
        "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f)
# when the tokenizer is trained and configured, load it as BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(model_path)

3.3. Preprocessing corpus collection

Before starting model training, the pre-training corpus needs to be processed with the trained tokenizer. When truncation is enabled, documents longer than 512 tokens are truncated directly. The data processing code is as follows:

def encode_with_truncation(examples):
    """Mapping function to tokenize the sentences passed with truncation"""
    return tokenizer(examples["text"], truncation=True, padding="max_length",
        max_length=max_length, return_special_tokens_mask=True)
def encode_without_truncation(examples):
    """Mapping function to tokenize the sentences passed without truncation"""
    return tokenizer(examples["text"], return_special_tokens_mask=True)
# the encode function will depend on the truncate_longer_samples variable
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation
# tokenizing the train dataset
train_dataset = d["train"].map(encode, batched=True)
# tokenizing the testing dataset
test_dataset = d["test"].map(encode, batched=True)
if truncate_longer_samples:
    # remove other columns and set input_ids and attention_mask as PyTorch tensors
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    # remove other columns, and keep them as Python lists
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

The Boolean variable truncate_longer_samples controls which encode() callback is used to tokenize the dataset. If it is set to True, sentences that exceed the maximum sequence length (max_length) are truncated; otherwise they are not. When truncate_longer_samples is set to False, the untruncated samples need to be concatenated and packed into fixed-length blocks, as in the sketch below.
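The grouping step itself is not shown above. One possible implementation (a sketch following the common HuggingFace Datasets pattern, not necessarily the exact code omitted from the article; it reuses the max_length, truncate_longer_samples, train_dataset, and test_dataset variables defined earlier) packs the tokenized columns into blocks of exactly max_length tokens:

from itertools import chain

# tokenizer output columns that should be packed into fixed-length blocks
packed_columns = ["input_ids", "attention_mask", "special_tokens_mask"]

def group_texts(examples):
    # concatenate each column across the batch, then cut it into max_length chunks
    concatenated = {k: list(chain(*examples[k])) for k in packed_columns}
    total_length = (len(concatenated["input_ids"]) // max_length) * max_length  # drop the remainder
    return {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated.items()
    }

if not truncate_longer_samples:
    train_dataset = train_dataset.map(group_texts, batched=True,
                                      remove_columns=train_dataset.column_names)
    test_dataset = test_dataset.map(group_texts, batched=True,
                                    remove_columns=test_dataset.column_names)

After this step, every training example is a block of exactly max_length token ids, which matches the max_position_embeddings used when the model is configured in the next subsection.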

3.4. Model training

After the pre-training corpus has been processed, model training can begin. The code is as follows:

from transformers import (
    BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# initialize the model with the config
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)
# initialize the data collator, randomly masking 20% (default is 15%) of the tokens
# for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
training_args = TrainingArguments(
    output_dir=model_path, # output directory to where save model checkpoint
    evaluation_strategy="steps", # evaluate each `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=10, # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # the training batch size, put it as high as your GPU memory fits
    gradient_accumulation_steps=8, # accumulating the gradients before updating the weights
    per_device_eval_batch_size=64, # evaluation batch size
    logging_steps=1000, # evaluate, log and save model checkpoints every 1000 step
    save_steps=1000,
    # load_best_model_at_end=True, # whether to load the best model (in terms of loss)
    # at the end of training
    # save_total_limit=3, # whether you don't have much space so you
    # let only 3 model weights saved in the disk
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# train the model
trainer.train()

After training starts, output similar to the following is produced:

[10135/79670 18:53:08 < 129:35:53, 0.15 it/s, Epoch 1.27/10]
Step Training Loss Validation Loss
1000 6.904000 6.558231
2000 6.498800 6.401168
3000 6.362600 6.277831
4000 6.251000 6.172856
5000 6.155800 6.071129
6000 6.052800 5.942584
7000 5.834900 5.546123
8000 5.537200 5.248503
9000 5.272700 4.934949
10000 4.915900 4.549236

3.5. Model usage

The trained model can then be used according to the needs of different applications, for example to fill in masked tokens:

from transformers import pipeline

# load the model checkpoint
model = BertForMaskedLM.from_pretrained(os.path.join(model_path, "checkpoint-10000"))
# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_path)
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# perform predictions
examples = [
"Today's most trending hashtags on [MASK] is Donald Trump",
"The [MASK] was cloudy yesterday, but today it's rainy.",
]
for example in examples:
    for prediction in fill_mask(example):
        print(f"{prediction['sequence']}, confidence: {prediction['score']}")
    print("="*50)

You can get the following output:

today's most trending hashtags on twitter is donald trump, confidence: 0.1027069091796875
today's most trending hashtags on monday is donald trump, confidence: 0.09271949529647827
today's most trending hashtags on tuesday is donald trump, confidence: 0.08099588006734848
today's most trending hashtags on facebook is donald trump, confidence: 0.04266013577580452
today's most trending hashtags on wednesday is donald trump, confidence: 0.04120611026883125
==================================================
the weather was cloudy yesterday, but today it's rainy., confidence: 0.04445931687951088
the day was cloudy yesterday, but today it's rainy., confidence: 0.037249673157930374
the morning was cloudy yesterday, but today it's rainy., confidence: 0.023775646463036537
the weekend was cloudy yesterday, but today it's rainy., confidence: 0.022554103285074234
the storm was cloudy yesterday, but today it's rainy., confidence: 0.019406016916036606
==================================================

This article has covered in detail GPT's unsupervised pre-training, supervised fine-tuning on downstream tasks, and pre-trained language model practice based on HuggingFace. In the next article, I will introduce large language model network structures, attention mechanism optimizations, and related practice.

