How to implement English-to-Chinese machine translation in Keras with seq2seq + LSTM

Original link: https://blog.csdn.net/qq_44635691/article/details/106919244

The model translates from English to Chinese. To better illustrate the architecture, the diagram below is borrowed from another author (note that Embeddings are not used here):

Complete code for this article: GitHub

Table of Contents

1. Processing the text data

    1. Get the sentences before and after translation

    2. Create character-index and index-character dictionaries

    3. One-Hot encode the Chinese and English sentences

2. Build the model

3. The decoder predicts each character

4. Train the model

5. Demo


 



The model consists of two parts: an encoder and a decoder. Each part is an LSTM network; the encoder takes the source sentence as input and outputs state vectors, while the decoder takes the target sentence (prefixed with a start symbol) as input and outputs the target sentence.

The specific steps are:

1. The encoder encodes the input sequence into state vectors.

2. The decoder starts predicting from the first character.

3. Feed the state vectors (state_h, state_c) to the decoder together with the one-hot encoding of the previously predicted character (the first state vectors come from the encoder; after that, as each character of the target sequence is predicted, the state vectors come from the decoder's own prediction step).

4. Use argmax to find the position of the most likely next character, then look up the corresponding character in the dictionary.

5. Append the character from the previous step to the target sequence.

6. Repeat until the specified end character is predicted; a short sketch of this loop follows the list.
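
As a rough illustration, the whole loop looks like this (the full implementation appears in Code 3.1 below; encoder_model, decoder_model and reverse_target_char_index are built later in this article, and one_hot() is a hypothetical helper that returns a (1, 1, num_decoder_tokens) array with a single 1 at the given character's index):

states = encoder_model.predict(input_seq)             # step 1: encode the input
target_seq = one_hot('\t')                            # step 2: start character
decoded = ''
while True:
    probs, h, c = decoder_model.predict([target_seq] + states)    # step 3
    char = reverse_target_char_index[np.argmax(probs[0, -1, :])]  # step 4
    decoded += char                                               # step 5
    if char == '\n':                                              # step 6
        break
    target_seq = one_hot(char)   # feed the predicted character back in
    states = [h, c]              # reuse the decoder's own states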

1. Processing the text data

This step splits the raw data into source and target sentences, builds character dictionaries, and finally One-Hot encodes both the source and target sentences so the data is ready for the model.

1. Get the sentences before and after translation

First look at the style of the original data:
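
The raw file (data/cmn.txt) is tab-separated, with one pair per line, English on the left and Chinese on the right. A few lines look roughly like this (an illustrative sample, not the exact file contents):

Hi.	嗨。
Run.	你跑吧。
Wait!	等等！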

Next, import the required libraries:

Code 1.1.1

import pandas as pd
import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.optimizers import Adam, SGD
from keras.models import Model, load_model
from keras.utils import plot_model
from keras.models import Sequential
 
# Network parameters
NUM_SAMPLES = 3000   # number of training samples
batch_size = 64      # number of samples per training batch
epochs = 100         # number of training epochs
latent_dim = 256     # number of LSTM units

Read the file with pandas; we only need the first two columns:

Code 1.1.2

data_path = 'data/cmn.txt'
df = pd.read_table(data_path, header=None).iloc[:NUM_SAMPLES, 0:2]
# Add column names
df.columns = ['inputs', 'targets']
# Prepend '\t' to each Chinese sentence as the start marker and append '\n' as the end marker
df['targets'] = df['targets'].apply(lambda x: '\t' + x + '\n')
The data then looks like this:

Code 1.1.3

# Get the English and Chinese sentences as separate lists
input_texts = df.inputs.values.tolist()
target_texts = df.targets.values.tolist()
 
# Determine the character set of each language. Taking sum() of df.unique()
# concatenates all the unique sentences into one long string.
input_characters = sorted(list(set(df.inputs.unique().sum())))
target_characters = sorted(list(set(df.targets.unique().sum())))
 
# Number of distinct English characters
num_encoder_tokens = len(input_characters)
# Number of distinct Chinese characters
num_decoder_tokens = len(target_characters)
# Maximum input length
INPUT_LENGTH = max([len(txt) for txt in input_texts])
# Maximum output length
OUTPUT_LENGTH = max([len(txt) for txt in target_texts])
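
A quick sanity check can confirm the vocabulary sizes and sequence lengths (the exact numbers depend on the 3,000-sample slice, so none are quoted here):

print('Distinct English characters:', num_encoder_tokens)
print('Distinct Chinese characters:', num_decoder_tokens)
print('Max input length:', INPUT_LENGTH)
print('Max output length:', OUTPUT_LENGTH)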
2. Create character-index and index-character dictionaries

Code 1.2.1

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
 
reverse_input_char_index = dict([(i, char) for i, char in enumerate(input_characters)])
reverse_target_char_index = dict([(i, char) for i, char in enumerate(target_characters)])
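
Each forward dictionary and its reverse counterpart are exact inverses; for example (hypothetical usage):

i = target_token_index['\t']                  # character -> index
assert reverse_target_char_index[i] == '\t'   # index -> character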
3. One-Hot encode the Chinese and English sentences

Code 1.3.1

# Each sample must be converted into the 3-D input that the LSTM expects:
# [n_samples, timestamp, one-hot feature]
encoder_input_data = np.zeros((NUM_SAMPLES, INPUT_LENGTH, num_encoder_tokens))
decoder_input_data = np.zeros((NUM_SAMPLES, OUTPUT_LENGTH, num_decoder_tokens))
decoder_target_data = np.zeros((NUM_SAMPLES, OUTPUT_LENGTH, num_decoder_tokens))
 
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
 
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
 
        if t > 0:
            # decoder_target_data omits the start character and runs one step ahead of decoder_input_data
            decoder_target_data[i, t-1, target_token_index[char]] = 1.0
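
To make the one-step shift concrete, here is how decoder input and target line up for a made-up target sentence '\t你好\n' (the standard teacher-forcing setup):

# Illustrative alignment for target_text = '\t你好\n' (hypothetical example):
# timestep t            0      1      2      3
# decoder_input_data   '\t'   '你'   '好'   '\n'
# decoder_target_data  '你'   '好'   '\n'   (all zeros)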
2. Build the model

Code 2.1

 
def create_model():
    # Define the encoder input
    encoder_inputs = Input(shape=(None, num_encoder_tokens))
    # Define the LSTM layer; latent_dim is the number of units in each gate of the LSTM cell.
    # The final states h and c are only returned when return_state=True.
    encoder = LSTM(latent_dim, return_state=True)
    # Run the encoder to get its outputs (which we do not actually need) plus the states state_h and state_c
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    # Discard encoder_outputs; we only need the encoder's states
    encoder_state = [state_h, state_c]
    
    # Define the decoder input
    decoder_inputs = Input(shape=(None, num_decoder_tokens))
    decoder_lstm = LSTM(latent_dim, return_state=True, return_sequences=True)
    # Use the states output by the encoder as the decoder's initial state
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_state)
    # Add a fully connected layer
    decoder_dense = Dense(num_decoder_tokens, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    
    # Define the full model
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
The diagram of the full model:



At each timestep the decoder has three inputs: the two state vectors state_h and state_c (coming from the encoder at the first step), and the One-Hot encoded Chinese sequence.

Code 2.2

    # (Still inside create_model().)
    # Define the encoder inference model, which outputs encoder_state
    encoder_model = Model(encoder_inputs, encoder_state)
    
    decoder_state_input_h = Input(shape=(latent_dim,))
    decoder_state_input_c = Input(shape=(latent_dim,))
    decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    # Get the decoder outputs along with the intermediate states
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_state_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_state_inputs, [decoder_outputs] + decoder_states)
    
    # Use distinct file names so the three plots do not overwrite each other
    plot_model(model=model, to_file='model.png', show_shapes=True)
    plot_model(model=encoder_model, to_file='encoder_model.png', show_shapes=True)
    plot_model(model=decoder_model, to_file='decoder_model.png', show_shapes=True)
    return model, encoder_model, decoder_model

The encoder model diagram:

The decoder model diagram:

3. The decoder predicts each character

First, the encoder generates the state vectors states_value from the input sequence. These are combined with the One-Hot encoding of the start character "\t" and passed to the decoder's input layer to predict the position of the next character, sampled_token_index. The newly predicted character is then One-Hot encoded into target_seq, and the state vectors produced while predicting it become the new state vectors.

This process loops in a while loop until the end character "\n" is predicted, at which point the loop ends and the translated sentence is returned. The figure shows intuitively that the decoder generates the translated sequence step by step; note that the blue line points to target_seq, which is continuously being filled in.

Code 3.1

def decode_sequence(input_seq, encoder_model, decoder_model):
    # Encode the input sequence into state vectors
    states_value = encoder_model.predict(input_seq)
 
    # Generate an empty sequence of size 1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Set the content of this empty sequence to the start character
    target_seq[0, 0, target_token_index['\t']] = 1.
 
    # Recover the characters one at a time.
    # For simplicity, assume batch_size = 1.
    stop_condition = False
    decoded_sentence = ''
 
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # output_tokens holds the probability of each character appearing next
 
        # Sample the next character: sampled_token_index is the dictionary position
        # with the highest probability for the next character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
 
        # Exit condition: '\n' was generated, or the maximum sequence length was exceeded
        if sampled_char == '\n' or len(decoded_sentence) > INPUT_LENGTH:
            stop_condition = True
 
        # Update target_seq
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
 
        # Update the intermediate states
        states_value = [h, c]
 
    return decoded_sentence
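
A minimal usage sketch, assuming the models have already been trained and reusing the arrays built in Code 1.3.1:

# Translate the first training sample (illustrative usage)
input_seq = encoder_input_data[0:1]   # shape (1, INPUT_LENGTH, num_encoder_tokens)
decoded = decode_sequence(input_seq, encoder_model, decoder_model)
print(input_texts[0], '->', decoded)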
4. Train the model

model, encoder_model, decoder_model = create_model()
# Compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Train the model
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save the trained models so they can be reused for testing later
model.save('s2s.h5')
encoder_model.save('encoder_model.h5')
decoder_model.save('decoder_model.h5')
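
The demo in the next section calls train() and test(), which are not defined anywhere in the code above. A minimal sketch of what they might look like, assuming train() simply wraps the training code from this section and test() loads the saved models and translates one sample chosen by index (both function names and their behavior are assumptions):

def train():
    # Build, compile, fit and save the models, exactly as in Section 4 above
    model, encoder_model, decoder_model = create_model()
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              batch_size=batch_size, epochs=epochs, validation_split=0.2)
    model.save('s2s.h5')
    encoder_model.save('encoder_model.h5')
    decoder_model.save('decoder_model.h5')
 
def test():
    # Load the saved inference models and translate one training sample by index
    encoder_model = load_model('encoder_model.h5')
    decoder_model = load_model('decoder_model.h5')
    k = int(input('Sample index to translate (0-%d): ' % (NUM_SAMPLES - 1)))
    input_seq = encoder_input_data[k:k+1]
    decoded = decode_sequence(input_seq, encoder_model, decoder_model)
    print(input_texts[k], '->', decoded)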
5. Demo

if __name__ == '__main__':
    intro = input("select train model or test model:")
    if intro == "train":
        print("Training mode...........")
        train()
    else:
        print("Testing mode.........")
        while True:
            test()

 

 

Only 3,000 training pairs are used, most of them fairly short phrases or single words, so the results are not great, but compared to my own poor English they are not bad.

References:

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

https://towardsdatascience.com/neural-machine-translation-using-seq2seq-with-keras-c23540453c74

 
