Multi-round Dialogue (2020): DialoGPT [a generative multi-round dialogue model]

1. Corpus Introduction

Here is a brief introduction to the LCCC (Large-scale Cleaned Chinese Conversation) dataset; for details you can go to its GitHub page, and the download link is also given above. LCCC comes in two versions, base and large: the base version is mainly derived from Weibo conversations, while the large version additionally integrates other open-source dialogue corpora. According to the authors, LCCC has gone through a strict cleaning process, so the overall quality looks quite good.

To simplify the task, all samples are processed into two-person dialogues. Here are a few examples:

A: Let's go back and buy some rabbit heads after Chinese New Year and have a good hot pot
B: I haven't seen any delicious rabbit heads in Taiyuan
A: I saw an authentic one at Hongqiao the other day, I'll bring you one
B: I love you the most
A: That's a must

A: Well, I'll wait! Are you in Shanghai now? The wind in Shanghai seems even stronger than in Nanjing, so don't go out
B: Yes, I'm at home, I'm fine. You be careful too!

A: I went around last year and ran into my former physical education teacher, and we took a picture together
B: Haha, I went to look for my English teacher from my first year of high school, but I couldn't find her; she just happened to be out of school~
A: You really went looking for memories too
B: Haha, I haven't been back since I graduated, I want to go and see it
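For readers who want to poke at the raw data, here is a minimal loading sketch, not the official loader: it assumes each sample is a list of utterance strings with the two speakers simply alternating (the layout used in this article), and the toy sample just mirrors the examples above; a real LCCC file would be parsed with json.load.

# Minimal sketch: a dialogue is a list of utterance strings, speakers alternate A, B, A, B, ...
dialogues = [
    [
        "Let's buy some rabbit heads after Chinese New Year and have hot pot",
        "I haven't seen any delicious rabbit heads in Taiyuan",
        "I saw an authentic one at Hongqiao, I'll bring you one",
        "I love you the most",
        "That's a must",
    ],
]

def print_dialogue(turns):
    """Print one two-person dialogue with alternating speaker labels."""
    for i, utterance in enumerate(turns):
        print(("A: " if i % 2 == 0 else "B: ") + utterance)

print_dialogue(dialogues[0])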

2. Model Design

Now that we know what the data looks like, the next step is to design the model. Obviously, what we need is to train a model that predicts what to say next. Since the corpus contains multi-round dialogues, we also want the model to support multiple rounds of dialogue. The simplest way to take the dialogue history into account is to concatenate all of the history up to the current sentence into a single piece of text and feed it to the model as input.

Given some input and predicting an output, the natural formulation is a Seq2Seq model. Using Seq2Seq directly is actually not a big problem, but standard Seq2Seq is usually applied to inputs and outputs with a relatively fixed form, for example input lengths that are concentrated within a certain range and do not vary too much. With multi-round dialogue, however, we do not know in advance how many rounds of history there will be, so in principle the input length is unbounded. Seq2Seq also suffers from low training efficiency: each training sample teaches the model only one reply, so a dialogue containing n replies has to be split into n samples.
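As a toy illustration of that inefficiency, here is how a Seq2Seq setup would have to slice one dialogue into (history, reply) pairs; the joining string and the helper name are mine, purely for illustration.

# Sketch of the Seq2Seq splitting described above: every reply in a dialogue
# becomes its own (history, reply) training pair, so n replies -> n samples.
def split_for_seq2seq(turns):
    samples = []
    for i in range(1, len(turns)):
        history = " [SEP] ".join(turns[:i])  # everything said before the reply
        samples.append((history, turns[i]))  # the i-th turn is the training target
    return samples

turns = ["hello", "hi, how are you", "fine, and you?", "pretty good"]
for history, reply in split_for_seq2seq(turns):
    print(history, "->", reply)
# 4 turns yield 3 samples; the language model below trains on all of them in one pass.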

Therefore, we want a model whose input length can vary quite freely and which can be trained on an entire multi-round dialogue at once. A suitable choice for this requirement is a unidirectional (left-to-right) language model, i.e. a GPT-style LM. The approach is as follows:

As shown in the figure, we choose the currently mainstream Transformer architecture and follow BERT's conventional input format, joining the dialogue turns with [SEP] and then training a left-to-right (unidirectional) language model on the whole sequence. To distinguish the speaking roles, different speakers are given different segment ids. In addition, since both BERT and GPT use absolute position encodings, there is an upper limit on the text length they can handle, while the number of dialogue rounds is in principle unlimited; we therefore use NEZHA, which uses relative position encoding, as the base architecture, and initialize the model with NEZHA's pre-trained weights.
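The following is a minimal sketch of that input construction, not the author's actual preprocessing code: the token ids of the turns are joined with [SEP] after a leading [CLS], and the segment id alternates 0/1 with the speaker. The helper name, the toy tokenizer and the ids 101/102 (the usual BERT [CLS]/[SEP] ids) are my assumptions.

# Sketch of the input layout: [CLS] turn1 [SEP] turn2 [SEP] ... with segment ids
# alternating by speaker (0 for speaker A, 1 for speaker B).
def build_dialogue_inputs(turns, tokenize, cls_id=101, sep_id=102):
    token_ids, segment_ids = [cls_id], [0]
    for i, turn in enumerate(turns):
        ids = tokenize(turn) + [sep_id]      # each turn ends with [SEP]
        token_ids += ids
        segment_ids += [i % 2] * len(ids)    # the role is encoded in the segment id
    return token_ids, segment_ids

# Toy character-level "tokenizer", just to make the sketch runnable:
fake_tokenize = lambda text: [ord(c) for c in text]
tokens, segments = build_dialogue_inputs(["hello", "hi there", "how are you"], fake_tokenize)
print(tokens)
print(segments)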

To put it bluntly, we simply add a lower-triangular attention mask to NEZHA to turn it into a language model. For background, see "From Language Model to Seq2Seq: Transformer Is Like a Play, It All Depends on the Mask" [1].
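In bert4keras this amounts to a single switch when building the model. The sketch below follows the same idea as the author's repo but is not a verbatim copy of it; the config/checkpoint paths are placeholders, and application='lm' is what adds the causal mask.

from bert4keras.models import build_transformer_model

# Minimal sketch: load NEZHA and turn it into a left-to-right language model.
config_path = "nezha_base/bert_config.json"   # placeholder paths
checkpoint_path = "nezha_base/model.ckpt"

model = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,  # initialize from NEZHA's pre-trained weights
    model="nezha",                    # relative position encoding, no hard length limit
    application="lm",                 # adds the lower-triangular attention mask
)
model.summary()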

3. Training Details

First, here are the author's reference implementation and trained model:

https://github.com/bojone/nezha_gpt_dialog

The code is implemented with bert4keras, and the base architecture is NEZHA Base. To keep the training "affordable", the author used only a single card (a 22G RTX) with gradient accumulation; readers with less GPU memory can reduce the batch_size. The open-sourced model was trained with the Adam optimizer with a learning rate of 0.01 and a weight decay rate of 0.01 for 10 million steps (more than 20 days), with a per-step batch_size of 16 and 16 gradient-accumulation steps, which is equivalent to training for 625,000 steps with a batch_size of 256. The final loss is about 2.83 (the average cross entropy per token).

Let's test our training results.

>>> print(chatbot.response([u'I eat hot pot every day recently']))
hot pot is delicious
>>> print(chatbot.response([u'I eat hot pot every day recently']))
me too
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Hahahahaha I have eaten hot pot twice in the past week
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Me too, I eat too much every time
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Eat hot pot without gaining weight
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Hot pot is good
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Do you have this kind of hot pot over there?
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Is the hot pot delicious? I want to eat it too
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Eat it, I will be happy after eating
>>> print(chatbot.response([u'I eat hot pot every day recently']))
I also want to eat hot pot every day!
>>> print(chatbot.response([u'I eat hot pot every day recently']))
Hot pot won't last long.
>>> print(chatbot.response([u'How is the weather?']))
Okay, not hot, a bit cold
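The chatbot object above comes from the author's repo. As a rough sketch of what a response-style function does (this is not the repo's actual ChatBot class), one can feed the concatenated history into the language model and sample the reply token by token until [SEP]:

import numpy as np

# Rough sketch of reply generation with the trained LM (not the author's exact code):
# append sampled tokens to the history until [SEP] is produced or max_len is reached.
def sample_response(model, token_ids, segment_ids, sep_id=102, max_len=32, topk=5):
    reply_segment = 1 - segment_ids[-1]       # the reply comes from the other speaker
    for _ in range(max_len):
        probs = model.predict([np.array([token_ids]), np.array([segment_ids])])[0, -1]
        top_ids = probs.argsort()[-topk:]     # top-k random sampling for diversity
        p = probs[top_ids] / probs[top_ids].sum()
        next_id = int(np.random.choice(top_ids, p=p))
        if next_id == sep_id:                 # [SEP] marks the end of the reply
            break
        token_ids.append(next_id)
        segment_ids.append(reply_segment)
    return token_ids                          # decode back to text with the tokenizer

Sampling (rather than greedy decoding) is why the same prompt above produces a different reply each time.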

4. Comparative Analysis

CDial-GPT has also open-sourced its own pre-trained model, and the author has converted it into a format that bert4keras can load (CDial-GPT-tf [2]), so readers can test and compare it as well. In terms of training, CDial-GPT uses a PyTorch implementation with GPT Base as the base architecture, trained on 4 × 2080Ti with a total batch_size of 32 and 64 gradient-accumulation steps. The paper says it was trained for 30 epochs, about 21 million steps in total (twice as many as the author's), which is roughly equivalent to training for 330,000 steps with a batch_size of 2048.
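For the record, here is the back-of-the-envelope arithmetic behind those equivalences, using only the numbers quoted above (the CDial-GPT step count is rounded as in the text):

# Effective batch size and step count for both training setups.
setups = {
    "this article": dict(batch=16, accum=16, steps=10_000_000),
    "CDial-GPT": dict(batch=32, accum=64, steps=21_000_000),
}
for name, cfg in setups.items():
    effective_batch = cfg["batch"] * cfg["accum"]
    effective_steps = cfg["steps"] // cfg["accum"]
    print(f"{name}: batch_size {effective_batch} for about {effective_steps:,} steps")
# this article: batch_size 256 for about 625,000 steps
# CDial-GPT:    batch_size 2048 for about 328,125 steps (~330,000)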

In terms of input design, CDial-GPT is also different, as shown below:

As shown in the figure, the main difference between CDial-GPT and our earlier design is the way the dialogue turns are spliced together. We joined the turns directly with [SEP], whereas CDial-GPT joins them with the role markers [speaker1] and [speaker2] (abbreviated S1 and S2 in the figure) and appends a single [SEP] to mark the end of the reply. Because the format of the part being predicted then differs from the format of the history, only one reply can be trained at a time, and a multi-round dialogue has to be split into multiple samples, which in theory increases the training cost (training one multi-round dialogue takes multiple steps).
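Below is a sketch of that CDial-GPT-style layout as I read it from the figure (the official preprocessing code may differ in details such as whether a leading [CLS] is used); it shows why only the final reply can serve as the training target.

# Sketch of the CDial-GPT-style input: role tokens join the turns, and a single
# trailing [SEP] marks the end of the one reply being trained.
def build_cdial_sample(history, reply):
    tokens = ["[CLS]"]
    for i, turn in enumerate(history):
        role = "[speaker1]" if i % 2 == 0 else "[speaker2]"
        tokens += [role] + turn.split()
    reply_role = "[speaker1]" if len(history) % 2 == 0 else "[speaker2]"
    tokens += [reply_role] + reply.split() + ["[SEP]"]
    return tokens

# An n-reply dialogue therefore has to be expanded into n such samples.
print(build_cdial_sample(["hello", "hi , how are you"], "fine , thanks"))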

As for the effect, in informal personal testing there is no significant difference between the two; interested readers can compare and test for themselves.

5. Article Summary

This article shared a dialogue-model practice: based on the open-source LCCC chat corpus, a language model (GPT) is used to generatively model multi-round dialogue, yielding a fairly general chit-chat model. Finally, this approach was compared with CDial-GPT's own open-source model.

References

[1] "From language model to Seq2Seq: Transformer is like a play, all depends on Mask": https://kexue.fm/archives/6933 [2] CDial-GPT-tf: https://github.com/bojone/CDial -GPT-tf
