[Study Notes] Summary of Research on Open Source Chinese Dialogue Pre-training Model


These notes summarize the open-source large-scale Chinese pre-trained open-domain dialogue generation models available as of April 18, 2023.

| Model | n_params | n_layers | d_model | d_ff | n_heads | d_head | n_positions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EVA2.0_base | 300M | 12 | 768 | 3072 | 12 | 64 | 512 |
| EVA2.0_large | 970M | 24 | 1024 | 4096 | 16 | 64 | 512 |
| EVA2.0_xLarge | 2.8B | 24 | 2048 | 5120 | 32 | 64 | 512 |
| CDialGPT_LCCC-base | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| CDialGPT2_LCCC-base | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| CDialGPT_LCCC-large | 104M | 12 | 768 | 3072 | 12 | 64 | 513 |
| GPT2-chitchat | 88M | 12 | 768 | 3072 | 12 | 64 | 300 |
| dialogue-bart-base-chinese | - | 6 | 768 | 3072 | 12 | 64 | 512 |
| dialogue-bart-large-chinese | - | 12 | 1024 | 4096 | 16 | 64 | 512 |

1. CDial-GPT

Main work

  • Paper: A Large-Scale Chinese Short-Text Conversation Dataset
  • Open source address: https://github.com/scutcyr/CDial-GPT
  • Main contributions: (1) Released LCCC, a large-scale, high-quality Chinese dialogue corpus built with strict filtering and cleaning; the base version contains 6.8 million dialogues and the large version contains 12 million (12M) dialogues. (2) Released several large-scale pre-trained dialogue models (CDialGPT), which are first pre-trained on a Chinese novel dataset and then trained on LCCC.

LCCC dataset

Data Cleansing Strategy

Rule-based cleaning (a code sketch follows this list)
  • Remove platform tags such as "Reply to @***" and bracketed emoticon tags like [dog];

  • Remove URLs in text;

  • Split dialogues with more than 30 turns into multiple dialogues of fewer than 30 turns;

  • Keep only one copy of any word or phrase repeated more than 6 times within a sentence;

  • Delete conversations with replies that are too long or too short;

  • Remove advertisements, following "A dataset for research on short-text conversations" (EMNLP 2013);

  • Delete the dialogue if 90% of the trigrams in the reply are high-frequency trigrams;

  • Delete the conversation if the reply is a generic response that matches certain fixed patterns;

  • Delete conversations whose reply is identical to the post;

  • Remove conversations containing dirty words, sensitive words, dialects, special terms such as levofloxacin, names, titles or unknown abbreviations, special symbols and emoticons, or platform markers such as advertisement-, picture-, and video-related words.
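To make a few of these rules concrete, below is a minimal Python sketch of rule-based filters, assuming simple regex heuristics. The thresholds (30 turns, more than 6 repetitions, 90% high-frequency trigrams) follow the list above; everything else (function names, exact patterns) is illustrative and not the LCCC authors' code.

```python
import re

MAX_TURNS = 30          # split dialogues longer than this
MAX_PHRASE_REPEAT = 6   # collapse phrases repeated more than this

def strip_platform_tags(text: str) -> str:
    """Remove reply tags like 'Reply to @user', bracketed emoticons like [dog], and URLs."""
    text = re.sub(r"回复\s*@\S+[:：]?", "", text)   # 'Reply to @user:' pattern
    text = re.sub(r"\[[^\]]{1,8}\]", "", text)       # [dog]-style emoticon tags
    text = re.sub(r"https?://\S+", "", text)          # URLs
    return text.strip()

def collapse_repeats(text: str) -> str:
    """Keep only one copy of a word/phrase repeated more than MAX_PHRASE_REPEAT times."""
    return re.sub(r"(.+?)\1{%d,}" % MAX_PHRASE_REPEAT, r"\1", text)

def split_long_dialogue(dialogue: list[str]) -> list[list[str]]:
    """Split a dialogue with more than MAX_TURNS turns into chunks of at most MAX_TURNS."""
    return [dialogue[i:i + MAX_TURNS] for i in range(0, len(dialogue), MAX_TURNS)]

def mostly_generic(reply: str, frequent_trigrams: set[str], ratio: float = 0.9) -> bool:
    """Flag a reply whose character trigrams are at least `ratio` high-frequency trigrams."""
    trigrams = [reply[i:i + 3] for i in range(len(reply) - 2)]
    if not trigrams:
        return True
    return sum(t in frequent_trigrams for t in trigrams) / len(trigrams) >= ratio
```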

Classifier-based cleaning

(1) Manually label 100,000 dialogues and train a BERT classifier to identify whether a dialogue is noise, covering cases where:

  • the text is not fluent or is severely misspelled;
  • the reply is incomplete;
  • the content is time-sensitive;
  • the reply mentions festivals, places, gender, time, etc. that do not appear in the post;
  • the reply is not relevant to the context.

(2) Manually label 10,000 utterances and train a BERT classifier to recognize dialogues that depend on external context or knowledge beyond the text itself, which makes them difficult to understand.
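A minimal fine-tuning sketch for this kind of BERT noise classifier, assuming the HuggingFace transformers Trainer API and the bert-base-chinese checkpoint; the toy examples, label scheme, and hyperparameters are illustrative only and not from the paper.

```python
# Hypothetical fine-tuning sketch for a dialogue-noise classifier (not the authors' code).
from datasets import Dataset
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Illustrative labeled data: 1 = noisy (disfluent, incomplete, irrelevant, ...), 0 = clean.
examples = {
    "text": ["今天 天气 不错 [SEP] 是 啊 ， 适合 出去 走走", "你 好 [SEP] 转发 抽奖 ！ ！"],
    "label": [0, 1],
}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="noise-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    tokenizer=tokenizer,  # default collator pads batches dynamically
)
trainer.train()
```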


Model

Input representation

All historical utterances are concatenated into one long text sequence. The input is the sum of three embeddings: word embedding, speaker embedding, and position embedding. Word and position embeddings are learned in the pre-training phase, while speaker embeddings are learned in the post-training (fine-tuning) phase. Speaker embeddings indicate the different speakers; following BERT, [CLS] and [SEP] are used to mark the beginning and end of sentences.
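As a rough illustration of this input representation, the sketch below sums word, position, and speaker embeddings for a toy flattened history; the sizes and token ids are placeholders, and the actual CDial-GPT implementation may differ.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (d_model and n_positions roughly follow the table above).
vocab_size, n_positions, d_model = 13000, 513, 768
n_speakers = 2  # e.g. [speaker1], [speaker2]

word_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(n_positions, d_model)
speaker_emb = nn.Embedding(n_speakers, d_model)

# A toy flattened history "[CLS] u1 [SEP] u2 [SEP] reply [SEP]" as made-up token ids.
token_ids = torch.tensor([[101, 5, 6, 102, 7, 8, 102, 9, 10, 102]])
# Alternating speaker ids: speaker1 for u1, speaker2 for u2, speaker1 for the reply.
speaker_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 0, 0, 0]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# Input representation = sum of the three embeddings, fed to the transformer.
inputs_embeds = word_emb(token_ids) + pos_emb(position_ids) + speaker_emb(speaker_ids)
print(inputs_embeds.shape)  # torch.Size([1, 10, 768])
```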

Training

Following DialoGPT, the model is initialized from a Chinese GPT (GPT-Novel, pre-trained on Chinese novels) and then trained on LCCC. For multi-turn dialogues, every utterance from the second to the last is treated as a response to its preceding dialogue history.
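A small sketch of how a multi-turn dialogue could be unrolled into (history, response) training pairs in this way; the helper name is hypothetical.

```python
def unroll_dialogue(dialogue: list[str]) -> list[tuple[list[str], str]]:
    """Turn one multi-turn dialogue into (history, response) training pairs:
    every utterance from the second to the last is a response to what precedes it."""
    return [(dialogue[:i], dialogue[i]) for i in range(1, len(dialogue))]

# Example: a 3-turn dialogue yields 2 training pairs.
for history, response in unroll_dialogue(["你好", "你好，最近怎么样？", "挺好的，你呢？"]):
    print(history, "->", response)
```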

GPT (Novel): 12 layers of GPT, 104M parameters;

CDialGPT (LCCC-base): 12 layers of GPT, 104M parameters;

CDialGPT2(LCCC-base): 12 layers of GPT2, 104M parameters;

CDialGPT(LCCC-large): 12 layers of GPT, 104M parameters;

2. GPT2-chitchat

Builds on GPT2-Chinese and DialoGPT.

Open source address: https://github.com/yangjianxin1/GPT2-chitchat.

A GPT-2 model trained on Chinese chitchat corpora. Following the idea of Microsoft's DialoGPT, the project adds maximum mutual information and trains two models: a Dialogue Model and an MMI Model (a maximum mutual information scoring function). The Dialogue Model first generates multiple candidate responses, and the MMI Model then selects the candidate with the smallest loss as the final response.
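Below is a hedged sketch of the candidate-reranking step, assuming GPT-2 causal LMs loaded with transformers; the paths, tokenizer, and exact reversed-input format are placeholders, so consult the repository or the linked article for the real wiring.

```python
# Hypothetical MMI reranking sketch; paths and tokenizer are placeholders,
# not the actual GPT2-chitchat artifacts.
import torch
from transformers import BertTokenizerFast, GPT2LMHeadModel

tokenizer = BertTokenizerFast.from_pretrained("path/to/chitchat-vocab")  # placeholder
mmi_model = GPT2LMHeadModel.from_pretrained("path/to/mmi-model")         # placeholder
mmi_model.eval()

def mmi_rerank(history: list[str], candidates: list[str]) -> str:
    """Pick the candidate with the smallest loss under the MMI model,
    which is trained on dialogues whose utterance order has been reversed."""
    best, best_loss = None, float("inf")
    for cand in candidates:
        # Candidate first, then the history from newest to oldest.
        text = "[SEP]".join([cand] + history[::-1])
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = mmi_model(ids, labels=ids).loss.item()
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

# `candidates` would come from sampling the Dialogue Model several times.
```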

For details: https://zhuanlan.zhihu.com/p/101151633

3. EVA1.0

  • Paper: EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
  • Main contributions: proposed EVA, a large-scale Chinese pre-trained dialogue model with 2.8B parameters, and constructed WDC-Dialogue, a large-scale Chinese dialogue dataset built with strict cleaning and filtering, containing 1.4B dialogues.
  • Open source project address: https://github.com/thu-coai/EVA

WDC-Dialogue dataset

Data collection

Interaction data on social media can be divided into three categories:

  • Repost: forwarding a post with a comment forms a dialogue with the original poster;
  • Comment: comments and replies;
  • Q&A: online question answering on platforms such as Zhihu.

Data cleaning

Follows the data cleaning strategies and methods used for LCCC.

Model

Tokenization

Traditional character-level tokenization of Chinese tends to lose the information carried by Chinese words and phrases, so a Chinese subword vocabulary containing both characters and words was constructed with the unigram language model of SentencePiece. It contains 30,000 tokens in total.
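A short sketch of building such a unigram-LM subword vocabulary with the sentencepiece library; the corpus path and all options other than vocab_size=30000 and model_type="unigram" are assumptions, not EVA's actual settings.

```python
import sentencepiece as spm

# Train a unigram-LM subword model over raw Chinese dialogue text
# (corpus path is a placeholder; vocab_size follows the 30,000-token description above).
spm.SentencePieceTrainer.train(
    input="wdc_dialogue_raw.txt",
    model_prefix="eva_unigram",
    vocab_size=30000,
    model_type="unigram",
    character_coverage=0.9995,  # keep nearly all Chinese characters in the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="eva_unigram.model")
print(sp.encode("今天天气怎么样？", out_type=str))  # mixed character/word subword pieces
```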

Pre-training details

  • Encoder-Decoder type architecture;

  • For a dialogue of n utterances, the first n-1 utterances are encoded to generate the n-th utterance.

  • The maximum encoding and decoding length is set to 128.

  • To address the efficiency bottleneck caused by heavy padding of short utterances, a new data sampling strategy packs multiple context-response pairs into one sample and introduces a new attention mask to distinguish them and ensure they do not interfere with each other (see the sketch after this list).

  • EVA uses the same relative position encoding as T5.
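The sketch below illustrates the packing idea from the sampling-strategy bullet above: several short, already tokenized context-response pairs are placed into one fixed-length sample, and a block-diagonal attention mask keeps them from attending to each other. It is a simplified, encoder-side illustration, not EVA's actual data pipeline.

```python
import torch

def pack_samples(pairs: list[list[int]], max_len: int = 128, pad_id: int = 0):
    """Pack several short tokenized context-response pairs into one sequence of
    length max_len, with a block-diagonal mask so tokens only attend within their pair."""
    ids = torch.full((max_len,), pad_id, dtype=torch.long)
    mask = torch.zeros(max_len, max_len, dtype=torch.bool)
    cursor = 0
    for pair in pairs:
        end = cursor + len(pair)
        if end > max_len:
            break  # remaining pairs would go into the next packed sample
        ids[cursor:end] = torch.tensor(pair)
        mask[cursor:end, cursor:end] = True  # attention restricted to this pair
        cursor = end
    return ids, mask

# Three short pairs packed into a single 128-token sample.
ids, mask = pack_samples([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
print(mask[:9, :9].int())  # block-diagonal pattern, one block per pair
```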

Experiments

Experimental comparisons show that EVA generates better responses than CDialGPT, mainly in the informativeness of the generated results; CDialGPT is more inclined to produce generic replies.

4. EVA2.0

  • Paper: EVA2.0: Investigating Open-domain Chinese Dialogue Systems with Large-scale Pre-training
  • Open source project address: https://github.com/thu-coai/EVA

This is currently the open-source Chinese dialogue pre-training model with the largest number of parameters and the best performance. Compared with EVA1.0, it applies stricter data cleaning and filtering. The paper describes how to build a large-scale Chinese open-domain dialogue system, conducts rigorous experiments on the factors that affect training results, such as model layer settings, pre-training methods, and decoding strategies, and discusses the consistency, knowledge, and safety issues that still exist in dialogue systems.

5. dialogue-bart-chinese

HIT-TMG's open-source BART-based Chinese dialogue models, trained on four corpora: Chinese Persona Chat (CPC), LCCC, Emotional STC (ESTC), and KdConv.

Details can be found at:

https://huggingface.co/HIT-TMG/dialogue-bart-base-chinese

https://huggingface.co/HIT-TMG/dialogue-bart-large-chinese
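A rough usage sketch with the transformers Auto classes is given below; the way the dialogue history is joined here is an assumption, so check the model cards above for the exact expected input format.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the base model from the HuggingFace Hub (see the model cards above
# for the exact input format; the history formatting below is an assumption).
name = "HIT-TMG/dialogue-bart-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

history = ["今天天气怎么样？", "挺好的，要不要出去走走？"]
inputs = tokenizer(tokenizer.sep_token.join(history), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```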


Source: blog.csdn.net/m0_47779101/article/details/129960024