A Chatbot Trained on QQ Chat History

Personal blog: http://www.chenjianqu.com/

Original link: http://www.chenjianqu.com/show-39.html

This article describes how to use the Keras framework and a seq2seq model to train a chatbot, another "you", on my own QQ chat history.

    NLP makes me happy! Lately, while working on Unity, I have also been reading up on NLP. Having just learned seq2seq, I thought I would use my past year of QQ chat logs to build a simple chatbot. I found that most of the material online implements seq2seq at the character level, while word-level implementations in Keras are hard to find, so I had to figure it out myself. Now that it finally works, I am sharing it here.

 

Seq2seq

    Seq2Seq (Sequence to Sequence) is a method that generates another sequence from a given sequence. For example, in a human-machine dialogue scenario, you feed "Are you free tomorrow?" into the seq2seq model, and it generates "Yes, what's up?".

[Figure: b.png]

    The main application scenarios of Seq2Seq are: ① machine translation (currently the best-known example is Google Translate, which is built entirely on Seq2Seq plus the Attention mechanism); ② chatbots (assistants such as Xiao Ai and Microsoft XiaoIce also use Seq2Seq technology, though not exclusively); ③ automatic text summarization (Toutiao and others use this technique); ④ automatic image caption generation; ⑤ machine poetry writing, code completion, commit message generation, style rewriting, and so on.

 

 

Principle

    The main idea of Seq2Seq is to map the input sequence to an intermediate semantic vector with a deep neural network, and then decode that intermediate semantic vector into the output sequence. It therefore consists of two parts: encoding the input (the Encoder) and decoding the output (the Decoder). The encoder and decoder are usually RNNs, typically LSTM or GRU; for details see RNN, LSTM, GRU principles and implementation.

[Figure: c_20190725110138_910.png]

 

    If the source sentence is X = (a, b, c, d, e, f) and the target output is Y = (P, Q, R, S, T), then a basic seq2seq model looks like the figure below.

[Figure: i.png (basic seq2seq structure)]

    On the left is the encoder, which is responsible for encoding the input (which may be of variable length) into a fixed-size vector. There are many choices for this part: an RNN structure such as GRU or LSTM, CNN + pooling, or pure Attention as Google does. In theory, this fixed-size vector contains all the information of the input sentence.

    The decoder is responsible for decoding the vector produced by the encoder into the output we want. Unlike the encoder, the decoder in the figure is emphasized as "one-way recursive", because the decoding process is recursive. The specific procedure is:

1. Every output sequence starts with a generic <start> token and ends with an <end> token; these two tokens are also treated as ordinary words;
2. Feed <start> into the decoder to obtain a hidden vector, mix that vector with the encoder output, and feed the result into a classifier; the classifier should output P;
3. Feed P into the decoder to get a new hidden vector, mix it with the encoder output again, and feed it into the classifier; the classifier should output Q;
4. Recurse in this way until the classifier outputs <end>.

    This is the decoding process of a basic seq2seq model: during decoding, the result of each step is fed into the next step, until <end> is output.
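    To make the recursive procedure concrete, here is a minimal greedy-decoding sketch (my own pseudo-code, not from the original post). encode(), decode_step() and classify() are placeholder functions standing in for the encoder, the decoder step and the classifier described above.

def greedy_decode(input_sentence, max_len=32):
    context = encode(input_sentence)           #placeholder: encode the input into the fixed-size vector
    token, state, output = '<start>', None, []
    while len(output) < max_len:
        state = decode_step(token, state)      #placeholder: one recursive decoder step
        token = classify(state, context)       #placeholder: mix with the encoder output and classify
        if token == '<end>':
            break
        output.append(token)
    return output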

 

 

 

 

Now let's build the chatbot

Neural Network Model

    The structure of the neural network model during training is as follows:

[Figure: chatbot_qq_model.png (training model structure)]

Details of the structure

[Figure: graph_run=.png (detailed structure)]

    Since this is a word-level generation model, if the input data were one-hot encoded, the computation and memory required by the LSTM would be enormous. Therefore the dimensionality of the one-hot input has to be reduced, which means adding a word embedding layer in front of the LSTM layer. The word vectors I use are the ones mentioned in my earlier post "TextCNN - binary classification of Chinese hotel reviews".
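    As a rough illustration (my own sketch, not from the original post): with a 15000-word vocabulary, feeding one-hot vectors would mean a (batch, 32, 15000) input tensor, while an embedding layer takes plain integer indices and hands the LSTM dense 300-dimensional vectors instead.

from keras.layers import Input, Embedding
#with one-hot input the LSTM would see (batch, 32, 15000) floats;
#with an embedding layer the input is just (batch, 32) integer indices
word_ids = Input(shape=(32,))
dense_vectors = Embedding(input_dim=15000, output_dim=300)(word_ids)  #-> (batch, 32, 300)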

    Training uses "teacher forcing", which is different from the prediction process, so the prediction model has to be separated from the training model. The training and prediction models share the same layers, but their model structures are different.
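    A quick illustration of teacher forcing (my own example): for a target reply "P Q R", the decoder is fed the ground-truth tokens behind a start marker, and is trained to predict the same tokens one step ahead so that the sequence ends with the end marker.

target_reply   = ['P', 'Q', 'R']
decoder_input  = ['<start>', 'P', 'Q', 'R']    #ground-truth tokens fed in during training
decoder_target = ['P', 'Q', 'R', '<end>']      #what the decoder must predict, one step ahead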

The first sub-model of the prediction model is the encoder:

[Figure: chatbot_qq_encoder_model.png (encoder sub-model)]

The second sub-model is decoder_embedding:

[Figure: chatbot_qq_emb_model.png (decoder_embedding sub-model)]

The third sub-model is the decoder:

[Figure: chatbot_qq_decoder_model.png (decoder sub-model)]

In theory, the second and third sub-models should be merged into one, but I ran into some problems when I tried to merge them, so I keep them separate here; a sketch of a possible merged decoder is shown below.
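For reference, here is a sketch of how the second and third sub-models could in principle be merged into a single prediction decoder. This is my own attempt, not the author's working code: it reuses the decoder layers defined in the model-building section below, and it uses flat (EMBEDDING_DIM,) state inputs instead of the 3D state inputs used later in this post.

from keras.models import Model
from keras.layers import Input

#assumes decoder_eb, decoder_lstm, decoder_dense_1, decoder_dense and EMBEDDING_DIM
#are already defined as in the model-building code below
merged_inputs  = Input(shape=(None,))
merged_state_h = Input(shape=(EMBEDDING_DIM,))
merged_state_c = Input(shape=(EMBEDDING_DIM,))
merged_emb = decoder_eb(merged_inputs)                      #embed the previous token
merged_out, h, c = decoder_lstm(merged_emb, initial_state=[merged_state_h, merged_state_c])
merged_out = decoder_dense(decoder_dense_1(merged_out))     #classify over the vocabulary
merged_decoder = Model([merged_inputs, merged_state_h, merged_state_c], [merged_out, h, c])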

 

 

Obtaining and Cleaning the Dataset

    This time I use my QQ chat history as the training set. QQ itself provides a chat-history export function. The steps are: open the Message Manager, click "Export all messages" in the upper right corner, and save the result as a .txt file.

[Figure: d.png (QQ Message Manager export)]

This gives us the raw data we want.

    Next comes data cleaning. The code is as follows:

import re
data=[]
#read the exported chat log
file = open("D:/NLP/dataset/对话数据集/聊天记录.txt",encoding='utf-8')
last_name=''
name=''
multiLine=''
flag=0
count=0
keyNoise=['的口令红包','邀请加入','申请加入','点击查看','撤回了','(无)','对方已成功接收']
for line in file:
    #skip empty lines
    line=line.strip().replace('\n','')
    if(len(line)==0):
        continue
    #filter out noise lines
    if(len(line)>4 and (line[:4]=='消息记录' or line[:4]=='消息分组' or line[:4]=='消息对象' or line[:4]=='===='  
      or line[:4]=='http' or line[:6]=='[QQ红包]' or line[:3]=='管理员')):
        continue
    continueflag=False
    for s in keyNoise:
        if(s in line):
            continueflag=True
    if(continueflag):
        continue
    #join consecutive lines from the same chat partner
    if(line[:4]=='2018' or line[:4]=='2019'):
        name=line.split(' ')[-1]
        if(name==last_name):
            flag=1
        else:
            flag=0
            last_name=name
            #print(name)
        continue
    if(flag==1):
        multiLine+=(' '+line)
        continue
    else:
        temp=line
        line=multiLine.replace('\n','')
        multiLine=temp
        
    if(name=='轨迹'):#tag my own messages with "CJQ"
        multiLine='CJQ'+temp
    else:#tag the friend's messages with "FRI"
        multiLine='FRI'+temp
    #remove @mentions from the message
    obj=re.findall( r'(@\S*\s)',line)
    for s in obj:
        line=line.replace(s,'')
    #remove image and emoji placeholders
    line=line.replace('[图片]','')
    line=line.replace('[表情]','')
    line=line.strip()
    #skip lines that contain only the 3-character speaker tag
    if(len(line)==3):
        continue
    data.append(line)
    count+=1
    if(count==30678):#I only extract the first 30678 lines here
        break
print(count)
#write out the cleaned data
with open('data.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(data))

 After processing the raw chat data, we get data that looks like the following.

FRI你在寝室吗
CJQ不在 等会回去
FRI你回到了告诉我吧。我来拿
CJQok 我回来了
FRI太晚了。我都要睡了。明天再来找你拿吧
CJQ卧槽我给你吧 我现在拿给你?
FRI别了吧。太麻烦你了

 Then split this dialogue set into two files; the code is as follows:

input_text=[]
target_text=[]
for i in range(len(data)-1):
    if(data[i][:3]=='FRI' and data[i+1][:3]=='CJQ'):
        input_text.append(data[i][3:].strip())
        target_text.append(data[i+1][3:].strip())
        
with open('cjq.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(target_text))
with open('fri.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(input_text))

This gives us two files: one with what my friend said (fri.txt) and one with my replies (cjq.txt).

 

Vectorizing the Dataset

    Use Keras's text preprocessing utilities to map the text to integer sequences.

import jieba
from keras.models import Model
from keras.layers import Input, LSTM, Dense,Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

MAX_WORDS=15000 #use the 15000 most frequent words
SEN_LEN=32 #maximum length of each chat message

#segment each sentence with jieba and join the words with spaces
inputTextList=[' '.join([w for w in jieba.cut(text)]) for text in input_text]
targetTextList=[' '.join([w for w in jieba.cut(text)]) for text in target_text]
tokenizer=Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts=inputTextList+targetTextList)

#map the texts to integer sequences
input_sequences=tokenizer.texts_to_sequences(texts=inputTextList)
target_sequence=tokenizer.texts_to_sequences(texts=targetTextList)
word_index=tokenizer.word_index

#add two escape characters to the dictionary as start ('\t') and end ('\n') markers
word_index['\t']=len(word_index)+1
word_index['\n']=len(word_index)+1
reverse_word_index = dict([(i, t) for t, i in word_index.items()])#reversed dictionary, used to recover words from indices
print(len(input_sequences))
print(len(target_sequence))
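
The prediction section at the end of this post reloads the vocabulary from 'word_index_chatbot.txt', but the original post does not show the step that saves it. Here is a minimal sketch (my addition), written in the str()/eval() format that the loading code expects:

#save the word dictionary so the prediction script can reload it later (not shown in the original post)
with open('word_index_chatbot.txt','w',encoding='utf-8') as f:
    f.write(str(word_index))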

 Since the vocabulary of the dataset is very large, there would certainly not be enough memory to train the model on all the data at once (each decoder target is a one-hot vector over the whole vocabulary), so a data generator has to be defined.

#data generator: yields ([encoder_input, decoder_input], decoder_target) one batch at a time
def train_gen(input_seq,target_seq, m, batch_size=64):
    input_seq=np.array(input_seq)
    target_seq=np.array(target_seq)
    permutation = np.random.permutation(input_seq.shape[0])
    shuffled_inputs = input_seq[permutation]
    shuffled_targets = target_seq[permutation]
    num_batches = int(m/batch_size)
    
    while 1:
        for i in range(num_batches):
            input_seq_batch=shuffled_inputs[i*batch_size:(i+1)*batch_size]
            target_seq_batch=shuffled_targets[i*batch_size:(i+1)*batch_size]
            
            encoder_x = np.zeros((batch_size, SEN_LEN),dtype='float32')
            decoder_x = np.zeros((batch_size, SEN_LEN),dtype='float32')
            decoder_y = np.zeros((batch_size, SEN_LEN, MAX_WORDS),dtype='float32')
            
            for i, (in_t, tar_t) in enumerate(zip(input_seq_batch, target_seq_batch)):
                lentext=len(tar_t[:SEN_LEN])
                lentext_in=len(in_t[:SEN_LEN])
                if(lentext==0 or lentext_in==0):
                    continue
                for j, w_index in enumerate(in_t[:SEN_LEN]):
                    encoder_x[i,j]=w_index
                
                for j, w_index in enumerate(tar_t[:SEN_LEN]):
                    if(j==0):#start marker
                        decoder_x[i,0]=word_index['\t']
                        decoder_y[i,0, w_index] = 1.
                    elif(j==lentext-1):#last token within the length limit
                        decoder_x[i,j]=tar_t[j-1]
                        if(lentext>=SEN_LEN):
                            decoder_y[i, j,word_index['\n']] = 1.
                        else:
                            decoder_y[i, j, w_index] = 1.
                    else:
                        decoder_x[i,j]=tar_t[j-1]
                        decoder_y[i, j, w_index] = 1.#the decoder target sequence is one time step ahead of the decoder input
                if(lentext<SEN_LEN):#append the end marker to pad out the length
                    decoder_x[i,lentext]=tar_t[lentext-1]
                    decoder_y[i,lentext,word_index['\n']] = 1.
            yield [encoder_x,decoder_x],decoder_y
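
A quick way to sanity-check the generator (my own snippet, not in the original post) is to pull one batch and look at the shapes:

gen = train_gen(input_sequences, target_sequence, len(input_sequences), batch_size=16)
[enc_x, dec_x], dec_y = next(gen)
print(enc_x.shape, dec_x.shape, dec_y.shape)  #(16, 32) (16, 32) (16, 32, 15000)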

Loading the Word Vector Matrix

    I use 300-dimensional Chinese word vectors. The code below loads the word vector file and builds the embedding matrix.

#parse the word vector file
embeddings_index={}
f=open(r'D:\NLP\wordvector\sgns.zhihu.word\sgns.zhihu.word',encoding='utf-8')
for line in f:
    values=line.split()
    word=values[0]#the first field is the word
    coefs=np.asarray(values[1:],dtype='float32')#the rest are the vector coefficients
    embeddings_index[word]=coefs
f.close()
#prepare the word vector matrix
EMBEDDING_DIM=300#length of the word vectors
embedding_matrix=np.zeros((MAX_WORDS,EMBEDDING_DIM))
for word,i in word_index.items():
    if(i>=MAX_WORDS):#only the MAX_WORDS most frequent words fit in the matrix
        continue
    word_vector=embeddings_index.get(word)
    if(word_vector is not None):#out-of-vocabulary words keep the all-zero initial vector
        embedding_matrix[i]=word_vector
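
As an optional check (my own addition, not in the original post), you can count how many vocabulary words actually received a pretrained vector; the remaining words keep the all-zero initialisation:

covered = sum(1 for w, i in word_index.items() if i < MAX_WORDS and w in embeddings_index)
print('words with pretrained vectors:', covered, '/', min(len(word_index), MAX_WORDS))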

Building the Neural Network Model

    The network structure is as described above. The code below builds the model, loads the parameters of the word embedding layers, and compiles the model.

from keras.models import Model
from keras.layers import Input, LSTM, Dense,Embedding
from keras.utils import plot_model
#encoder
encoder_inputs = Input(shape=(None,))
encoder_eb = Embedding(MAX_WORDS, EMBEDDING_DIM)
encoder_eb_outputs = encoder_eb(encoder_inputs)#embed the input text
encoder_lstm=LSTM(EMBEDDING_DIM,return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_eb_outputs)
encoder_states = [state_h, state_c]
#decoder
decoder_inputs = Input(shape=(None,))
decoder_eb = Embedding(MAX_WORDS, EMBEDDING_DIM)
decoder_eb_outputs=decoder_eb(decoder_inputs)
decoder_lstm = LSTM(EMBEDDING_DIM, return_sequences=True,return_state=True)
decoder_outputs,_,_=decoder_lstm(decoder_eb_outputs, initial_state=encoder_states)

decoder_dense_1 = Dense(int(MAX_WORDS/8), activation='relu')
decoder_outputs_1 = decoder_dense_1(decoder_outputs)
decoder_dense = Dense(MAX_WORDS, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs_1)

#connect the encoder and decoder into the training model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.summary()
plot_model(model,to_file='chatbot_qq_model.png',show_shapes=True)

#load the word embedding matrix into the two embedding layers and freeze them
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable=False
model.layers[3].set_weights([embedding_matrix])
model.layers[3].trainable=False
#compile the model (accuracy is tracked so the callbacks can monitor 'acc')
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

Model Training

    Train with the generator for 50 epochs. Limited by GPU memory, I can only set batch_size to 16 here; a larger batch might work a bit better and would also train faster. I did not set up a validation set either, mainly out of laziness.

import keras

callbacks_list=[
    keras.callbacks.EarlyStopping(
        monitor='acc',
        patience=10,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='chatbot1_model_checkpoint.h5',
        monitor='loss',
        save_best_only=True
    ),
    keras.callbacks.TensorBoard(
        log_dir='chatbot1_log'
    )
]

model.fit_generator(train_gen(input_sequences,target_sequence,
                    len(input_sequences), 
                    batch_size=16),
                    steps_per_epoch=1000,
                    callbacks=callbacks_list,
                    epochs=50
                   )
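
    If prediction is run in a new session (my own note, not from the original post), the weights saved by the ModelCheckpoint callback above can be restored before building the prediction sub-models:

model.load_weights('chatbot1_model_checkpoint.h5')  #restore the weights saved during training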

The training results are as follows:

[Figure: g.png (training results)]

    As you can see, 50 epochs of training brought the training accuracy to 0.25; with more epochs it could reach higher accuracy.

 

Building the Prediction Models

    Build the prediction models as described earlier.

encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()
plot_model(encoder_model,to_file='chatbot_qq_encoder_model.png',show_shapes=True)

decoder_inputs = Input(shape=(None,))
#reuse the embedding layer trained above (a fresh Embedding here would not contain the loaded word vectors)
decoder_eb_outputs=decoder_eb(decoder_inputs)
emb_model = Model(decoder_inputs, decoder_eb_outputs)
emb_model.summary()
plot_model(emb_model,to_file='chatbot_qq_emb_model.png',show_shapes=True)

decoder_eb_input = Input(shape=(None,EMBEDDING_DIM))

decoder_state_input_h = Input(shape=(None,EMBEDDING_DIM))
decoder_state_input_c = Input(shape=(None,EMBEDDING_DIM))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_eb_input, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

decoder_outputs_1 = decoder_dense_1(decoder_outputs)
decoder_outputs = decoder_dense(decoder_outputs_1)

decoder_model = Model([decoder_eb_input] + decoder_states_inputs, [decoder_outputs]+decoder_states)
decoder_model.summary()
plot_model(decoder_model,to_file='chatbot_qq_decoder_model.png',show_shapes=True)

Prediction

    Load the word dictionary saved earlier.

f = open('word_index_chatbot.txt','r',encoding='utf-8')
dictStr = f.read()
f.close()
tk = eval(dictStr)
rtk = dict([(i, t) for t, i in tk.items()])

    Feed in a chat message and generate a reply.

text='早点睡吧你'
#convert the input text into a sequence of word indices
input_texts=[w for w in jieba.cut(text)]
input_sequences=[tk[w] for w in input_texts if w in tk]
x = np.zeros((1, SEN_LEN),dtype='float32')
y = np.zeros((1, 1),dtype='float32')
y[0,0]=tk['\t']
for j, w_index in enumerate(input_sequences[:SEN_LEN]):
    x[0,j]=w_index
result=''

#sequence prediction
states_value = encoder_model.predict(x)#encode the input sequence
#reshape the states to match the decoder's 3D state inputs
h,c=states_value
h_3=np.zeros((1,1, 300),dtype='float32')
c_3=np.zeros((1,1, 300),dtype='float32')
h_3[0]=h
c_3[0]=c
states_value=[h_3,c_3]
i=0
stop_condition = False
while not stop_condition:
    embeded_vector=emb_model.predict(y)
    output_tokens, h, c = decoder_model.predict([embeded_vector] + states_value)#decode one step

    #convert the output distribution into a word
    index = np.argmax(output_tokens[0, -1, :])
    word = rtk[index]
    result += word
    #stop decoding when the end marker is produced or the reply reaches the maximum length
    if (word=='\n' or i>=SEN_LEN):
        stop_condition = True
    #update the states
    states_value = [h, c]
    i+=1
    y[0,0]=index
print(result)

The prediction result:

[Figure: h.png (prediction result)]

The results are honestly only so-so, mainly because the dataset is too small. I suppose that is my own fault for not chatting more.

 

 

References

[1] Seq2Seq in NLP. https://blog.csdn.net/qq_32241189/article/details/81591456. 2018-08-12

[2] Su Jianlin. Fun with seq2seq in Keras: generating titles automatically. https://spaces.ac.cn/archives/5861/comment-page-1. 2018-09-01
