Research on Open Source Chinese Dialogue Pre-training Model
This article summarizes the open-source large-scale Chinese pre-trained open-domain dialogue generation models available as of April 18, 2023.
| Model | $n_{params}$ | $n_{layers}$ | $d_{model}$ | $d_{ff}$ | $n_{heads}$ | $d_{head}$ | $n_{positions}$ |
|---|---|---|---|---|---|---|---|
| EVA2.0-Base | 300M | 12 | 768 | 3072 | 12 | 64 | 512 |
| EVA2.0-Large | 970M | 24 | 1024 | 4096 | 16 | 64 | 512 |
| EVA2.0-xLarge | 2.8B | 24 | 2048 | 5120 | 32 | 64 | 512 |
| CDialGPT (LCCC-base) | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| CDialGPT2 (LCCC-base) | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| CDialGPT (LCCC-large) | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| GPT2-chitchat | 88M | 12 | 768 | 3072 | 12 | 64 | 300 |
| dialogue-bart-base-chinese | | 6 | 768 | 3072 | 12 | 64 | 512 |
| dialogue-bart-large-chinese | | 12 | 1024 | 4096 | 16 | 64 | 512 |
1. CDial-GPT
Main work
- Paper: A Large-Scale Chinese Short-Text Conversation Dataset
- Open source address: https://github.com/scutcyr/CDial-GPT
- Main contributions: (1) Released LCCC, a large-scale, high-quality Chinese dialogue corpus built with strict filtering and cleaning; the base version contains 6.8 million dialogues and the large version 12 million (12M) dialogues. (2) Released several large-scale pre-trained dialogue models (CDialGPT), first pre-trained on a Chinese novel dataset and then trained on LCCC.
LCCC dataset
Data Cleansing Strategy
Rule-based cleaning

- Remove platform tags such as "Reply to @***" and emotes such as [dog];
- Remove URLs in the text;
- Split dialogues of more than 30 rounds into multiple dialogues of fewer than 30 rounds;
- Keep only one copy of phrases or words repeated more than 6 times within a sentence;
- Delete conversations whose replies are too long or too short;
- Remove advertisements (following "A dataset for research on short-text conversations", EMNLP 2013);
- Delete the dialogue if 90% of the trigrams in the reply are high-frequency trigrams;
- Delete the conversation if the reply is a generic reply of some specific form;
- Delete conversations whose reply is identical to the post;
- Remove conversations containing dirty words, sensitive words, dialects, special terms such as levofloxacin, names, titles or unknown abbreviations, special symbols and emoticons, platform marks such as advertisements, pictures, and video-related words.
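Several of the rule-based filters above can be sketched with simple regular expressions. This is a minimal illustration, not the authors' released cleaning pipeline; the patterns and thresholds are assumptions based on the rules listed:

```python
import re

MAX_ROUNDS = 30          # dialogues longer than this get split (rule above)
MAX_PHRASE_REPEATS = 6   # phrases repeated more than this are collapsed

def clean_utterance(text: str) -> str:
    """Apply a few of the rule-based filters to a single utterance."""
    # Remove platform tags such as "Reply to @user" and bracketed emotes like [dog]
    text = re.sub(r"Reply to @\S+|@\S+|\[[^\]]{1,10}\]", "", text)
    # Remove URLs in the text
    text = re.sub(r"https?://\S+", "", text)
    # Keep only one copy of a phrase repeated more than MAX_PHRASE_REPEATS times
    text = re.sub(r"(.+?)\1{%d,}" % MAX_PHRASE_REPEATS, r"\1", text)
    return text.strip()

def split_dialogue(utterances: list[str]) -> list[list[str]]:
    """Split a dialogue longer than MAX_ROUNDS into chunks within the limit."""
    return [utterances[i:i + MAX_ROUNDS]
            for i in range(0, len(utterances), MAX_ROUNDS)]
```

Rules that require corpus statistics (high-frequency trigrams) or lexicons (sensitive words) would need additional resources on top of this sketch.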
Classifier-based cleaning

(1) Manually label 100,000 dialogues to train a BERT classifier that identifies whether a dialogue is noise:

- the text is not fluent or contains severe typos;
- the reply is incomplete;
- the content is time-sensitive;
- the reply mentions festivals, places, genders, times, etc. that do not appear in the post;
- the reply is not relevant to the context.

(2) Manually label 10,000 utterances to train a BERT classifier that recognizes replies depending on external contextual knowledge beyond the text, which are difficult for people to understand.
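Once trained, such a classifier is applied as a score-and-threshold filter over the corpus. A minimal sketch of that filtering step, where the `noise_score` stub stands in for the fine-tuned BERT classifier (the stub and threshold are illustrative assumptions, not the paper's code):

```python
def filter_by_classifier(dialogues, noise_score, threshold=0.5):
    """Keep only dialogues the classifier scores below the noise threshold.

    `noise_score` maps a dialogue (list of utterances) to a probability
    of being noise; in the paper this would be the fine-tuned BERT.
    """
    return [d for d in dialogues if noise_score(d) < threshold]

# Toy stand-in scorer: flags replies that copy the post verbatim
# (one of the rule-based criteria, reused here for illustration).
def toy_noise_score(dialogue):
    post, reply = dialogue[0], dialogue[-1]
    return 1.0 if reply == post else 0.0

dialogues = [["Hello", "Hi, how are you?"], ["Nice day", "Nice day"]]
kept = filter_by_classifier(dialogues, toy_noise_score)
```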
Model
Input representation

All historical utterances are concatenated into one long text sequence. The input is the sum of three embeddings: word embedding, speaker embedding, and position embedding. Word embeddings and position embeddings are learned in the pre-training phase, and speaker embeddings are learned in the post-training (fine-tuning) phase. Speaker embeddings indicate the different speakers; following BERT, [CLS] and [SEP] mark the beginning and end of a sentence.
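The concatenation described above can be sketched as building parallel id streams for one packed sequence. This is a hypothetical token-level illustration (whitespace tokenization, two alternating speakers), not the repository's actual preprocessing code:

```python
def build_dialogue_input(history):
    """Concatenate dialogue history into one sequence with three id streams.

    Returns (tokens, speaker_ids, position_ids). The three streams are
    what the word, speaker, and position embeddings are looked up from.
    """
    tokens, speaker_ids = ["[CLS]"], [0]
    for turn, utterance in enumerate(history):
        spk = turn % 2  # speaker embedding index alternates each turn
        for tok in utterance.split():
            tokens.append(tok)
            speaker_ids.append(spk)
        tokens.append("[SEP]")  # [SEP] marks the end of each sentence
        speaker_ids.append(spk)
    position_ids = list(range(len(tokens)))
    return tokens, speaker_ids, position_ids
```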
Training

Following DialoGPT, the models are based on a Chinese GPT (GPT-Novel) and trained on LCCC. For multi-turn dialogues, every utterance from the second to the last is used as a response to its dialogue history.

- GPT (Novel): 12-layer GPT, 104M parameters;
- CDialGPT (LCCC-base): 12-layer GPT, 104M parameters;
- CDialGPT2 (LCCC-base): 12-layer GPT2, 104M parameters;
- CDialGPT (LCCC-large): 12-layer GPT, 104M parameters.
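The multi-turn setup described above, where each utterance from the second onward serves as a response to its preceding history, can be sketched as expanding one dialogue into training pairs (a simplified illustration of the data layout, not the repository's code):

```python
def make_training_pairs(dialogue):
    """Expand one multi-turn dialogue into (history, response) pairs.

    Every utterance from the second one onward is treated as a response
    to all utterances that precede it.
    """
    return [(dialogue[:i], dialogue[i]) for i in range(1, len(dialogue))]
```

A dialogue of n utterances thus yields n-1 training pairs.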
2. GPT2-chitchat
Based on GPT2-Chinese and DialoGPT.
Open source address: https://github.com/yangjianxin1/GPT2-chitchat.
A GPT2 model is trained on a Chinese chat corpus, with mutual information added to the project following the idea of Microsoft's DialoGPT. Two models are trained: a Dialogue Model and an MMI Model (maximum mutual information scoring function). The Dialogue Model first generates multiple candidate responses, and the MMI Model then selects the candidate with the smallest loss as the final response.
For details: https://zhuanlan.zhihu.com/p/101151633
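The two-stage decoding described above reduces to a rerank step. A minimal sketch, where `backward_loss` is a hypothetical stand-in for the MMI Model's loss of reconstructing the dialogue history from a candidate response:

```python
def mmi_rerank(candidates, backward_loss):
    """Select the candidate response with the smallest backward (MMI) loss.

    `backward_loss` stands in for the MMI Model: lower loss means the
    history is easier to predict from the response, i.e. higher mutual
    information between response and context.
    """
    return min(candidates, key=backward_loss)

# Toy usage with precomputed losses (illustrative values only).
losses = {"ok": 2.3, "I don't know": 5.1, "Sure, sounds great": 1.7}
best = mmi_rerank(list(losses), losses.get)
```

The reranker penalizes bland, history-agnostic replies such as "I don't know", which is the motivation for MMI decoding in DialoGPT.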
3. EVA1.0
- Paper: EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
- Main contributions: proposed EVA, a large-scale Chinese pre-trained dialogue model with 2.8B parameters; constructed WDC-Dialogue, a large-scale Chinese dialogue dataset built with strict cleaning and filtering, containing 1.4B dialogues.
- Open source project address: https://github.com/thu-coai/EVA
WDC-Dialogue dataset
data collection
Interaction data on social media can be divided into three categories:

- Repost: forwarding a blog post forms a dialogue between the reposter and the original poster;
- Comment: comments and replies under a post;
- Q&A: online question answering, with many platforms such as Zhihu.
data cleaning
Refer to LCCC's data cleaning strategies and methods.
Model
Tokenization

Traditional character-level Chinese tokenization tends to lose the meaning carried by Chinese words and phrases, so a Chinese sub-word vocabulary containing both Chinese characters and words is constructed with the unigram language model from SentencePiece. The vocabulary contains 30,000 tokens in total.
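To illustrate why a mixed character/word vocabulary helps, here is a greedy longest-match segmenter over such a vocabulary. This is only a didactic sketch; the real EVA tokenizer uses SentencePiece's unigram language model, not greedy matching:

```python
def segment(text, vocab):
    """Greedy longest-match segmentation over a mixed word/character vocabulary.

    Multi-character words in `vocab` are emitted as single tokens; anything
    else falls back to single characters, as in a char-level tokenizer.
    """
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate piece first (capped at 4 chars here)
        for j in range(min(len(text), i + 4), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:  # single chars always allowed
                out.append(piece)
                i = j
                break
    return out
```

With words like 你好 ("hello") in the vocabulary, the segmenter keeps them whole instead of splitting them into characters that each lose the word's meaning.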
Pre-training details

- Encoder-decoder architecture;
- For n utterances, the first n-1 utterances are encoded to generate the n-th utterance;
- The maximum encoding and decoding lengths are both set to 128;
- To address the efficiency bottleneck caused by padding short utterances, a new data sampling strategy is proposed: multiple context-response pairs are packed into one sample, and a new attention mask is introduced to distinguish them and ensure they do not interfere with each other;
- EVA uses the same relative position encoding as T5.
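The packing strategy above relies on a block-diagonal attention mask so that tokens from different packed pairs cannot attend to each other. A simplified sketch of such a mask (encoder-style; it ignores the additional causal masking a decoder would need, and is not EVA's actual implementation):

```python
def packed_attention_mask(lengths):
    """Block-diagonal attention mask for packed context-response pairs.

    `lengths` gives the token length of each pair packed into one sample.
    mask[i][j] == 1 means token i may attend to token j; tokens may only
    attend within their own pair, so the pairs do not interfere.
    """
    total = sum(lengths)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for n in lengths:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = 1
        start += n
    return mask
```

Packing this way fills the 128-token window with real tokens instead of padding, which is the efficiency gain the paper describes.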
Experiments

Experimental comparison shows that EVA generates better responses than CDialGPT, mainly in the informativeness of the generated results; CDialGPT tends more toward generic replies.
4. EVA2.0
- Paper: EVA2.0: Investigating Open-domain Chinese Dialogue Systems with Large-scale Pre-training
- Open source project address: https://github.com/thu-coai/EVA
This is currently the open-source Chinese dialogue pre-training model with the largest number of parameters and the best performance. Compared with EVA1.0, stricter data cleaning and filtering were performed. The paper describes how to construct a large-scale Chinese open-domain dialogue system, conducts rigorous experiments on the factors that affect training results, such as model layer settings, pre-training methods, and decoding strategies, and discusses the consistency, knowledge, and safety issues that still exist in dialogue systems.
5. dialogue-bart-chinese
HIT-TMG's open-source BART-based Chinese dialogue models, trained on four corpora: Chinese Persona Chat (CPC), LCCC, Emotional STC (ESTC), and KdConv.
Details can be found at:
https://huggingface.co/HIT-TMG/dialogue-bart-base-chinese
https://huggingface.co/HIT-TMG/dialogue-bart-large-chinese