Reproducing the GPT-1 model with TensorFlow

The ChatGPT model launched by OpenAI shows the potential of general artificial intelligence, so I went back to the original GPT papers for research. OpenAI's 2018 paper "Improving Language Understanding by Generative Pre-Training" proposed the first version of GPT, and I reproduced it with TensorFlow based on this paper.

Dataset download

GPT is pre-trained on the BookCorpus dataset, and a related dataset is available on the huggingface.co website. The following code downloads the dataset and displays the first record:

from datasets import load_dataset
dataset = load_dataset("bookcorpusopen", split="train")
dataset[0]

This dataset includes a total of 17,868 books; each record has two fields, title and text, and we will train on the text field. According to the GPT paper, the text is tokenized with BPE. There is an article on Hugging Face that explains the principle and training details of BPE: Byte-Pair Encoding tokenization - Hugging Face NLP Course. Here I directly use Hugging Face's pretrained tokenizer for the openai-gpt model.

from transformers import OpenAIGPTTokenizer
# GPT-1 trains on sequences of 512 tokens; each saved record holds block_size+1 = 513 ids
# so that a 512-token input and a 512-token label (shifted by one) can be sliced from it.
block_size = 512

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def tokenize_function(examples):
    # Tokenize each book, truncate its length to a multiple of block_size+1,
    # and split it into consecutive records of block_size+1 token ids.
    token_ids = [tokenizer(text) for text in examples["text"]]
    total_length = [len(t["input_ids"]) for t in token_ids]
    total_length = [(l//(block_size+1))*(block_size+1) for l in total_length]
    result = []
    for i in range(len(total_length)):
        result.extend([token_ids[i]["input_ids"][j:j+block_size+1] for j in range(0, total_length[i], block_size+1)])
    return {"token_ids": result}

ds_test = dataset.select(range(10000))  # use the first 10,000 books for this experiment

tokenized_datasets = ds_test.map(
    tokenize_function, batched=True, num_proc=8, remove_columns=["title", "text"], batch_size=100
)

tokenized_datasets.save_to_disk("data/boocorpusopen_10000_512tokens")

In the above code, I convert the text of each book into token ids with the tokenizer and save every 513 token ids as one data record. The GPT paper trains on sequences of 512 tokens, so during training the first 512 of these 513 tokens are used as the input and tokens 2-513 as the labels. Finally, the processed dataset is saved locally.
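As a quick illustration of this slicing (using a hypothetical record, just to show the one-token offset between input and labels):

# A saved record holds block_size+1 = 513 token ids; the first 512 are the
# training input and the last 512 (shifted by one position) are the labels.
record = list(range(513))               # hypothetical token ids of one record
inputs, labels = record[:-1], record[1:]
assert len(inputs) == 512 and len(labels) == 512
assert inputs[1:] == labels[:-1]        # labels are the inputs shifted left by one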

Because I am going to train a TensorFlow model, I need to convert this dataset into a TensorFlow dataset. We could call tokenized_datasets.to_tf_dataset directly, but I found that reading data after this conversion is very slow. Therefore, I first convert the dataset to the TFRecord file format, which makes reading much faster. The following code saves every 100,000 records as one tfrecord file; each file is about 100 MB.

import tensorflow as tf
from tqdm import tqdm

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def serialize_example(token_ids):
    feature = {
        'token_ids': _int64_feature(token_ids)
    }

    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

records_num = 100000
count = 0
ds = tokenized_datasets  # the tokenized dataset from above (or reload it with load_from_disk)
writer = None
for record in tqdm(ds):
    if count%records_num == 0:
        writer = tf.io.TFRecordWriter("bookcorpus_"+str(count//records_num)+".tfrecords")
    writer.write(serialize_example(record['token_ids']))
    count += 1
    if count%records_num == 0:
        writer.close()
if writer:
    writer.close()

Then we can read the data back.
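The _parse_function used by the pipeline below is not shown in this excerpt; a minimal version, assuming each TFRecord example stores block_size+1 int64 token ids under the key 'token_ids' (matching serialize_example above), could be:

def _parse_function(example_proto):
    # Each example holds one fixed-length sequence of block_size+1 token ids.
    feature_description = {
        'token_ids': tf.io.FixedLenFeature([block_size+1], tf.int64)
    }
    return tf.io.parse_single_example(example_proto, feature_description)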

import os

data_dir = "/data/datasets/bookcorpus_tf/"
filenames = [data_dir+f for f in os.listdir(data_dir)]
tf_ds = tf.data.TFRecordDataset(filenames)
tf_ds = tf_ds\
    .map(_parse_function, num_parallel_calls=tf.data.AUTOTUNE)\
    .shuffle(buffer_size=batch_size*100)\
    .batch(batch_size)\
    .prefetch(tf.data.AUTOTUNE)

We can check a batch of the data by taking one record out and decoding it with the tokenizer:

data = next(iter(tf_ds))
tokenizer.decode(data['token_ids'][0])

The result is as follows:

"i was sitting to the left of alex, and tinker was to his right, with lilla sitting to the right of her. tinker was leaning over to alex, chatting away, when her hand suddenly gently slid along his thigh, and onto his crotch. oops! now that was a surprise. unexpected, to say the least. right out of the blue. alex looked at me in desperation, and i initially laughed. we hadn't really thought about any sexual side to the situation, we had just considered the girls to be friends. plus, honestly, while they were very nice people, they weren't at all our cup of tea, so to speak. besides, what exactly was the situation down there, in the lady garden? had operations been done? would we end up comparing who had the bigger penis? we didn't know, and we didn't want to find out. it was time to get out of there. we both went into time - to - go mode. \n'you know, we got ta get up early in the morning, so we better hit the road.'i yelled to all and sundry. \n alex was already on his feet, yelling out something similar as well. we started heading to the door. the girls were calling after us, but i couldn't hear what they were saying, over the music, and the blood pumping in my head. i just waved back at them. \n it was with some relief that we found ourselves back out in the industrial wasteland. \n'fuck, what a surprise!'alex said as we ran off in the darkness.'i didn't see that coming. i hope they won't be upset with us. i thought they had sex for money, anyway.'\n'maybe on their night off they like a bit of young cock? fucked if i know. you should have seen the look on your face, man!'\n'shit, i don't want to think about it.'\n'don't worry, i won't be letting you forget this one.'\n we were pretty relieved, and happy to be out of that situation, and pretty much laughed about it all the way home. i would be pulling alex's leg over that one for a long time to come. mind you, probably lilla hadn't been far away from making a move on me, if it had all gone successfully with tinker and alex. sometimes it all comes down to who is the person closest to the door, or, as in that case, who is in the"

Now the dataset is ready for us.

Build a GPT model

According to the paper, GPT only uses the Decoder part of the Transformer: the Encoder relates tokens by looking at the whole context in both directions, but text generation can only condition on the preceding tokens, so only the Decoder can be used. The model in the paper stacks 12 Decoder layers, and each layer contains 12 attention heads (see the architecture figure in the paper).
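For reference, these are the hyperparameter values from the paper. The later code uses names such as num_layers, d_model, num_heads, dff, dropout_rate, vocab_size and batch_size without defining them in this excerpt, so treat the following as an assumed configuration:

# Assumed configuration, following the values reported in the GPT-1 paper.
num_layers = 12              # number of decoder layers
d_model = 768                # embedding / hidden dimension
num_heads = 12               # attention heads per layer
dff = 3072                   # feed-forward inner dimension (4 * d_model)
dropout_rate = 0.1           # dropout rate
batch_size = 64              # minibatch size used in the paper
vocab_size = len(tokenizer)  # ~40,000 BPE merges in the paper; 40,478 for the openai-gpt tokenizer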

For an explanation of the Transformer model, you can read my earlier blog post on implementing a Transformer translator with TensorFlow on my CSDN blog.

First, we define multi-head attention, with the following code:

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.
    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
    output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
 
    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
 
    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
 
    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
 
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
 
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
 
        assert d_model % self.num_heads == 0
 
        self.depth = d_model // self.num_heads
 
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
 
        self.dense = tf.keras.layers.Dense(d_model)
 
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
 
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
 
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
 
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
 
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
 
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
 
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
 
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
 
        return output, attention_weights

Next is the feed-forward layer, with the following code:

def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])
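One small difference from the paper: GPT-1 uses the GELU activation in the feed-forward network rather than ReLU. A variant closer to the paper, keeping the same interface, would be:

def point_wise_feed_forward_network_gelu(d_model, dff):
    # Same two-layer feed-forward block, but with the GELU activation used in the GPT paper.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='gelu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)                  # (batch_size, seq_len, d_model)
    ])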

Define a decoder layer to combine the above two layers:

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self,*, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)
 
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
 
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
 
    def call(self, x, training, look_ahead_mask):
        attn, attn_weights_block = self.mha(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn = self.dropout1(attn, training=training)
        out1 = self.layernorm1(attn + x)
 
        ffn_output = self.ffn(out1)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(ffn_output + out1)  # (batch_size, target_seq_len, d_model)
 
        return out2, attn_weights_block

Finally, a GPT model is defined, which includes 12 Decoder layers.

class Decoder(tf.keras.layers.Layer):
    def __init__(self,*, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super(Decoder, self).__init__()
 
        self.d_model = d_model
        self.num_layers = num_layers
 
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = tf.reshape(tf.range(target_vocab_size-block_size, target_vocab_size), shape=[1, -1])
 
        self.dec_layers = [
            DecoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)
            for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)
 
    def call(self, x, training, look_ahead_mask):
        #seq_len = tf.shape(x)[1]
        attention_weights = {}
 
        x = self.embedding(x)  # (batch_size, block_size, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.embedding(self.pos_encoding)
 
        x = self.dropout(x, training=training)
 
        for i in range(self.num_layers):
            x, block1 = self.dec_layers[i](x, training, look_ahead_mask)
 
            attention_weights[f'decoder_layer{i+1}_block1'] = block1
 
        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights


target_vocab_size = vocab_size + block_size
 
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    target_vocab_size=target_vocab_size,
    rate=dropout_rate)

To explain the above code: each token of the input sequence is first mapped to a 768-dimensional embedding vector. Position information then has to be added to these vectors. According to the paper, GPT does not use the sinusoidal position encoding of the original Transformer but learned position embeddings instead. For example, if the vocabulary contains 40,000 words (40,000 tokens) and the input sequence is 512 tokens long, we assign the extra token ids 40,000-40,511 to the 512 positions of the sequence, look up the embedding of each position id from the same embedding table, and add it to the token embedding at that position, so that the input carries the position information of every token.
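The Transformer class instantiated above is not included in this excerpt. A minimal sketch, assuming it simply runs the Decoder behind a causal (look-ahead) mask and projects the output back to the vocabulary, might look like this (create_look_ahead_mask and the untied final dense layer are assumptions; the paper actually ties the output projection to the token embedding matrix):

def create_look_ahead_mask(size):
    # Upper-triangular mask: position i cannot attend to positions j > i.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

class Transformer(tf.keras.Model):
    def __init__(self, *, num_layers, d_model, num_heads, dff, target_vocab_size, rate=0.1):
        super(Transformer, self).__init__()
        self.decoder = Decoder(num_layers=num_layers, d_model=d_model, num_heads=num_heads,
                               dff=dff, target_vocab_size=target_vocab_size, rate=rate)
        # The paper ties this projection to the token embedding matrix;
        # a separate dense layer keeps the sketch simple.
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, x, training):
        look_ahead_mask = create_look_ahead_mask(tf.shape(x)[1])
        dec_output, attention_weights = self.decoder(x, training, look_ahead_mask)
        logits = self.final_layer(dec_output)  # (batch_size, seq_len, target_vocab_size)
        return logits, attention_weights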

Training the model

To train the model, we first need to define a loss function to compute the model's loss value.

Because we predict the next token from the preceding tokens, we compute the sparse categorical cross-entropy of the predictions, as in the following code:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
 
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
 
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
 
    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

train_loss = tf.keras.metrics.Mean(name='train_loss')

To understand the model's predictive performance during training, we also define an accuracy metric:

def accuracy_function(real, pred):
    accuracies = tf.equal(real, tf.argmax(pred, axis=2))
 
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.math.logical_and(mask, accuracies)
 
    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)

train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

According to the paper, the model is optimized with Adam. The learning rate is increased linearly from 0 to 0.00025 over the first 2,000 batches and then decayed to 0 with a cosine schedule over 100 epochs. Newer versions of TensorFlow provide a CosineDecay schedule with built-in warmup that can be called directly:

epoch_steps = 1680000//batch_size
epochs = 100
decay_steps = epoch_steps*epochs
initial_learning_rate = 0
warmup_steps = 2000
target_learning_rate = 0.00025
lr_warmup_decayed_fn = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps, warmup_target=target_learning_rate,
    warmup_steps=warmup_steps
)
 
optimizer = tf.keras.optimizers.Adam(lr_warmup_decayed_fn, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

During training we want to save intermediate results, so we define a checkpoint for this:

checkpoint_path = './checkpoints/train'
 
# Define the two trackable objects that need to be saved
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
 
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
 
# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

Define a training step function that computes the loss and calls the optimizer, as in the following code:

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64)
]
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, training = True)
        loss = loss_function(tar, predictions)
 
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
 
    train_loss(loss)
    train_accuracy(accuracy_function(tar, predictions))

Finally, we can train the model. During training we print the loss value and prediction accuracy every 10 batches, and save a checkpoint with the checkpoint manager every 5 epochs.

import time

for epoch in range(epochs):
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()

    for (batch, inputs) in enumerate(tf_ds):
        # inp: the first 512 tokens of each record, tar: tokens 2-513 (shifted by one)
        token_ids = inputs['token_ids']
        try:
            train_step(token_ids[..., :-1], token_ids[..., 1:])
        except ValueError:
            print(inputs)
            break

        if batch % 10 == 0:
            print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
            if batch == 100:
                break  # stops each epoch after 100 batches; remove this for full training

    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')

    print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')

Training results

My local GPU is a single 2080 card with only 12GB of memory. According to the paper, GPT-1 was trained on 8 P600 GPUs for 30 days. I don't have that many GPU resources, so I only trained for a few epochs; even so, the loss keeps decreasing during training and the accuracy of predicting the next token keeps increasing. Later I will train for more epochs and then fine-tune this pre-trained model with supervised learning for different NLP tasks (such as text classification and question answering).


Original post: blog.csdn.net/gzroy/article/details/131609303