dialogbot: an out-of-the-box dialogue bot solution covering question-answering dialogue, task-oriented dialogue, and chitchat dialogue scenarios, providing a full range of dialogue interaction experiences.

The human-computer dialogue system has long been an important direction in AI; the Turing test itself uses dialogue to judge whether a machine is intelligent. So how do you build a human-computer dialogue system, i.e. a dialogue bot?

  • Dialogue systems have evolved through three generations:

    1. Rule-based dialogue systems: in vertical domains, template matching can map questions to their corresponding answers. The advantage is that the internal logic is transparent and easy to analyze and debug; the disadvantage is a heavy dependence on expert intervention and a lack of flexibility and scalability.
    2. Statistical dialogue systems: built on a partially observable Markov decision process (POMDP), the system first performs Bayesian inference on the question, maintains the dialogue state at each turn, and then selects a dialogue policy according to that state to generate a natural-language response. This essentially established the modern dialogue-system framework and removed the heavy dependence on experts; the disadvantage is that the models are hard to maintain and scalability is limited.
    3. Deep dialogue systems: largely inherit the statistical dialogue-system framework, but implement each module with a deep neural network. The powerful representation ability of deep models greatly improves language classification and generation; the disadvantage is that a large amount of labeled data is required to train the models effectively.
  • Dialogue systems fall into three categories:

    • Question-answering dialogue: mostly one question, one answer; the user asks a question and the system returns the correct answer by analyzing the question and searching a knowledge base, e.g. web search.
    • Task-oriented dialogue: a multi-turn dialogue driven by a task; the machine determines the user's goal through understanding, proactive inquiry, and clarification, then searches the knowledge base and returns results that fulfill the user's need, e.g. a bot that sells movie tickets.
    • Chitchat dialogue: the goal is to generate interesting and informative natural responses that keep the human-machine conversation going, e.g. the Xiaodu smart speaker.

1. Question-and-answer dialogue (Search Dialogue Bot)

1.1 Local Search Question Answering

Compute the similarity between the user's question and the questions in a local Q&A database, select the most similar question, and return its corresponding answer.

Sentence similarity can be calculated with the following methods (a minimal retrieval sketch follows the list):

  • TFIDF
  • BM25
  • OneHot
  • Query Vector
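
To make the retrieval approach concrete, below is a minimal sketch of TF-IDF-based local search QA built on scikit-learn; the toy question/answer lists and the answer() helper are hypothetical, and dialogbot's internal implementation may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy Q&A database.
questions = ['姚明多高?', '姚明在哪出生?']
answers = ['226cm', '上海']

# Character n-grams avoid needing a Chinese word segmenter here.
vectorizer = TfidfVectorizer(analyzer='char')
question_matrix = vectorizer.fit_transform(questions)

def answer(query):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, question_matrix)[0]
    return answers[scores.argmax()]  # answer of the most similar question

print(answer('姚明多高呀?'))  # -> 226cm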

1.2 Web Search Question Answering

Retrieve answers from the summaries of Baidu and Bing search results:

  • Baidu search, including the Baidu knowledge graph, Baidu poetry, Baidu perpetual calendar, Baidu calculator, and Baidu Zhidao (Q&A)
  • Microsoft Bing search, including the Bing knowledge graph and the Bing dictionary

1.3 Task Oriented Dialogue Bot

  • End-to-End Memory Networks (MemN2N)
  • bAbI dataset

1.4 Generative Dialogue Bot

  • GPT2 Model
  • Sequence to Sequence Model (seq2seq)
  • Taobao dataset

2. Demo

Official Demo: https://www.mulanai.com/product/dialogbot/

The project requires transformers 4.4.2+, torch 1.6.0+ and Python 3.6+.
Then, simply run:

pip3 install torch # conda install pytorch
pip3 install -U dialogbot

or

pip3 install torch # conda install pytorch
git clone https://github.com/shibing624/dialogbot.git
cd dialogbot
python3 setup.py install
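
After installing, a quick import check (an optional sanity check, not part of the official docs) confirms the package is available:

python3 -c "import dialogbot"  # exits silently if the installation succeeded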

3. Application scenarios

3.1 Question-and-answer dialogue (Search Bot)

example: examples/bot_demo

from dialogbot import Bot

bot = Bot()
response = bot.answer('姚明多高呀?')  # "How tall is Yao Ming?"
print(response)

output:

query: "姚明多高呀?"
answer: "226cm"

3.2 Task Bot

example: examples/taskbot_demo.py

3.3 Chat Dialogue (Generative Bot)

3.3.1 GPT2 model usage

A chitchat dialogue model trained with the GPT2 generative model.

The model has been released on Hugging Face models: shibing624/gpt2-dialogbot-base-chinese

example: examples/genbot_demo.py

from dialogbot import GPTBot

bot = GPTBot()
r = bot.answer('亲 你吃了吗?', use_history=False)  # single-turn reply, no dialogue history
print('gpt2', r)

output:

query: "亲 你吃了吗?"
answer: "吃了"

3.3.2 GPT2 model fine-tuning

  • Data preprocessing
    Create a data folder under the project root directory, name the raw training corpus train.txt, and place it in that directory. In train.txt, each utterance occupies one line and dialogues are separated by a blank line, as follows:
真想找你一起去看电影
突然很想你
我也很想你

想看你的美照
亲我一口就给你看
我亲两口
讨厌人家拿小拳拳捶你胸口

今天好点了吗?
一天比一天严重
吃药不管用,去打一针。别拖着

Run preprocess.py to tokenize the dialogue corpus in data/train.txt and serialize it to data/train.pkl. The serialized object in train.pkl is of type List[List]; each inner list records the tokens of one dialogue in the corpus.

cd dialogbot/gpt/
python preprocess.py --train_path data/train.txt --save_path data/train.pkl
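
Below is a minimal sketch of what this preprocessing step produces, assuming a BERT-style Chinese tokenizer such as the vocabulary shipped with the released model; the actual preprocess.py may differ in detail.

import pickle
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('shibing624/gpt2-dialogbot-base-chinese')

with open('data/train.txt', encoding='utf-8') as f:
    dialogues = f.read().split('\n\n')  # a blank line separates dialogues

data = []
for dialogue in dialogues:
    token_ids = [tokenizer.cls_token_id]
    for utterance in dialogue.strip().split('\n'):
        token_ids += tokenizer.encode(utterance, add_special_tokens=False)
        token_ids.append(tokenizer.sep_token_id)  # [SEP] ends each turn
    data.append(token_ids)

with open('data/train.pkl', 'wb') as f:
    pickle.dump(data, f)  # List[List[int]]: one list of token ids per dialogue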
  • Train the model
    Run train.py to train the model autoregressively on the preprocessed data; the trained model is saved to the model folder in the root directory.

During training, early stopping can be enabled with the patience parameter. With patience=n, if the model's loss on the validation set does not decrease for n consecutive epochs, training stops early; with patience=0, early stopping is disabled.

Early stopping is disabled by default in the code, because in practice the model chosen by early stopping does not necessarily generate better responses.

python train.py --epochs 40 --batch_size 8 --device 0,1 --train_path data/train.pkl
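
To make the patience logic concrete, here is a minimal, self-contained sketch of patience-based early stopping; the dummy per-epoch losses stand in for real validation losses, and train.py's actual implementation may differ.

patience = 3            # patience=0 would disable early stopping
best_loss = float('inf')
no_improve = 0
val_losses = [2.9, 2.5, 2.3, 2.3, 2.4, 2.35]  # dummy validation losses

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, no_improve = val_loss, 0   # improvement: reset the counter
    else:
        no_improve += 1
        if patience > 0 and no_improve >= patience:
            print(f'early stop at epoch {epoch}')  # no decrease for `patience` epochs
            break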

For more training parameters, see the argument descriptions in the set_args() function in train.py.

3.3.3 Prediction Model (Human-Computer Interaction)

Run interact.py to chat with the trained model interactively. Enter q to end the conversation; the chat log is saved to the sample.txt file.

python interact.py --no_cuda --model_dir path_to_your_model

When running interact.py, you can tune the generated responses by adjusting parameters such as topk, topp, repetition_penalty, and max_history_len; for more parameters, see the argument descriptions in the set_args() function of interact.py.
To generate on a GPU, do not pass the --no_cuda flag, and use --device gpu_id to specify which GPU to use.
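
As a rough illustration of what these decoding parameters do, here is a sketch that maps them onto the generate() API of transformers using the released Hugging Face model; interact.py implements its own sampling loop, so the details may differ.

from transformers import BertTokenizerFast, GPT2LMHeadModel

tokenizer = BertTokenizerFast.from_pretrained('shibing624/gpt2-dialogbot-base-chinese')
model = GPT2LMHeadModel.from_pretrained('shibing624/gpt2-dialogbot-base-chinese')

input_ids = tokenizer.encode('你好', return_tensors='pt')
output = model.generate(
    input_ids,
    max_length=64,
    do_sample=True,
    top_k=8,                 # sample only from the 8 most likely next tokens
    top_p=0.9,               # nucleus sampling: smallest token set covering 90% probability
    repetition_penalty=1.2,  # values > 1.0 discourage repeating earlier tokens
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))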

  • Q&A examples
  • sample1
    user: 在干嘛
    chatbot: 睡觉
    user: 这么早就睡觉啦
    chatbot: 嗯啊你也早点睡啊
    user: 我再玩一会儿手机
    chatbot: 早点休息晚安
    user: 拜拜啦
    chatbot: 晚安晚安
    user: 多喝热水
    chatbot: 哈哈哈
    user: 多喝热水
    chatbot: 好的咯
  • sample2
    user: 在吗
    chatbot: 在
    user: 我不开心
    chatbot: 怎么啦
    user: 最近好累啊
    chatbot: 是挺累的
    user: 作业好多啊
    chatbot: 还好还好
    user: 我觉得我快要挂科了
    chatbot: 我也是啊,但是我觉得我快熬出头了
    user: 不说了,我还有好多作业没写
    chatbot: 我也是啊好多啊
    user: 那先这样,下次再聊,再见了
    chatbot: 好哒

4. Dataset

  • Chatbot corpus sharing

    | Chinese chatbot corpus | Dataset address | Corpus description |
    |------------------------|-----------------|---------------------|
    | Common Chinese chatbot corpora | | Includes the Xiaohuangji ("little yellow chicken") corpus, Douban corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, PTT gossip corpus, Qingyun corpus, etc. |
    | 500,000 Chinese chitchat corpus | | 500,000 multi-turn dialogues, original corpus and pre-processed data |
    | 1,000,000 Chinese chitchat corpus | Baidu Netdisk (extraction code: s908) or Google Drive | 1,000,000 multi-turn dialogues, original corpus and pre-processed data |

A sample of the Chinese chitchat corpus is as follows:

谢谢你所做的一切
你开心就好
开心
嗯因为你的心里只有学习
某某某,还有你
这个某某某用的好

你们宿舍都是这么厉害的人吗
眼睛特别搞笑这土也不好捏但就是觉得挺可爱
特别可爱啊

今天好点了吗?
一天比一天严重
吃药不管用,去打一针。别拖着
  • Model sharing

    | Model | Shared address | Model description |
    |-------|----------------|--------------------|
    | model_epoch40_50w | shibing624/gpt2-dialogbot-base-chinese, Baidu Netdisk (extraction code: taqh), or GoogleDrive | Trained for 40 epochs on 500,000 multi-turn dialogues; loss dropped to about 2.0 |
  • Reference
  • Wen T H, Vandyke D, Mrksic N, et al. A Network-based End-to-End Trainable Task-oriented Dialogue System. 2016.
  • How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
  • Bordes A, Boureau Y, Weston J. Learning End-to-End Goal-Oriented Dialog. 2016.
  • Zhao T, Eskenazi M. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. arXiv preprint arXiv:1606.02560, 2016.
  • Kulkarni T D, Narasimhan K R, Saeedi A, et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. arXiv preprint arXiv:1604.06057, 2016.
  • BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems
  • Deep Reinforcement Learning with Double Q-Learning
  • Deep Attention Recurrent Q-Network
  • SimpleDS: A Simple Deep Reinforcement Learning Dialogue System
  • Deep Reinforcement Learning with a Natural Language Action Space
  • Integrating User and Agent Models: A Deep Task-Oriented Dialogue System
  • The Curious Case of Neural Text Degeneration
  • DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
  • vyraun/chatbot-MemN2N-tensorflow
  • huggingface/transformers
  • Morizeyao/GPT2-Chinese
  • yangjianxin1/GPT2-chitchat

Reference link: https://github.com/shibing624/dialogbot

If you cannot access GitHub, you can also download the related materials for free from https://download.csdn.net/download/sinat_39620217/88205596
