Machine translation based on recurrent neural networks (English to Chinese)

Machine translation based on the seq2seq model


Preface

The origin of machine translation can be traced back to the 1950s. At that time, machine translation relied mainly on linguistics, analyzing syntax, semantics, pragmatics, and so on. Later, researchers began to apply statistical models to machine translation, generating translations from the analysis of existing text corpora. With the rise of deep learning, neural networks are now used in machine translation and have achieved fruitful results in just a few years. The basic idea behind current machine translation comes from the end-to-end encoder-decoder structure proposed by Nal Kalchbrenner and Phil Blunsom in 2013. In 2014, Sutskever et al. developed a method called sequence-to-sequence (seq2seq) learning. Google provided a concrete implementation of this model in the tutorials of its deep learning framework TensorFlow and achieved good results.

1. What is the seq2seq model?

The seq2seq model is used when the length of the output is not fixed, which is typically the case in machine translation tasks. When a Chinese sentence is translated into English, the English sentence may be shorter or longer than the Chinese one, so the output length is uncertain. As shown in the figure below, the Chinese input has length 4 while the English output has length 2.
[Figure: seq2seq translation example with a Chinese input of length 4 and an English output of length 2]
The figures and explanatory text are based on a Seq2Seq model overview.

The seq2seq model is, as its name says, a sequence-to-sequence model. As the figure above shows, the model consists of two parts. If that is not clear enough, take a look at the following figure:
[Figure: simplified view of the seq2seq encoder-decoder structure]
Doesn't it look simpler this way? The seq2seq model consists of an Encoder and a Decoder, both of which are recurrent neural network models.
The basic principle is this: the Encoder first encodes the input sequence into a context vector, which is usually the Encoder's last hidden state. The Decoder takes this context vector as its input, decodes it, and outputs the corresponding target sequence.
Since the Encoder-Decoder structure places no constraint on the lengths of the input and output sequences, it has a wide range of applications, such as machine translation, text summarization, reading comprehension, and speech recognition.
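To make the idea concrete, here is a tiny toy sketch (plain NumPy, not the Keras model built later) in which a stand-in for an RNN step compresses a length-4 input into a single context vector and a decoder then emits a length-2 output from it:

import numpy as np

def encode(input_seq):
    # Compress the whole input sequence into one context vector
    context = np.zeros(4)
    for token in input_seq:
        context = np.tanh(context + token)   # stand-in for an RNN step
    return context

def decode(context, num_steps):
    # Generate the output sequence one step at a time from the context vector
    outputs, state = [], context
    for _ in range(num_steps):
        state = np.tanh(state)               # stand-in for an RNN step
        outputs.append(int(state.argmax()))  # pick a "token" from the state
    return outputs

print(decode(encode([np.random.rand(4) for _ in range(4)]), 2))  # length-2 output for a length-4 input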

2. Machine translation in practice

1. Introduction to data sets and data preprocessing

First, let's introduce the data set. It is a txt file with more than 20,000 entries; each line contains an English sentence and its corresponding Chinese translation, separated by a \t character, as shown in the figure below:
[Figure: sample lines from cmn.txt; each line contains an English sentence, a tab, and its Chinese translation]
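So each line splits cleanly on the tab character. As a quick illustration (using a pair that also appears in the results at the end of this post):

# Each line of cmn.txt holds "English sentence<TAB>Chinese translation"
line = "It's me.\t是我。"
english, chinese = line.split('\t')[:2]
print(english)   # It's me.
print(chinese)   # 是我。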
Next, we preprocess the data. The following function does the preprocessing; it lives in its own file (data_prepare.py), which will be imported directly by the main script later on.

import numpy as np


def getdata():
    with open('cmn.txt', 'r', encoding='utf-8') as f:
        data = f.read()
    data = data.split('\n')
    data = data[:100]   # only the first 100 sentence pairs are used in this demo

    # Split each line into the English sentence and the Chinese sentence;
    # the Chinese side gets '\t' as a start token and '\n' as an end token
    en_data = [line.split('\t')[0] for line in data]
    ch_data = ['\t' + line.split('\t')[1] + '\n' for line in data]

    # Build the English character dictionary
    en_vocab = set(''.join(en_data))
    id2en = list(en_vocab)
    en2id = {c: i for i, c in enumerate(id2en)}

    # Build the Chinese character dictionary
    ch_vocab = set(''.join(ch_data))
    id2ch = list(ch_vocab)
    ch2id = {c: i for i, c in enumerate(id2ch)}

    # Convert sentences into index sequences; the decoder target is shifted by one step
    en_num_data = [[en2id[en] for en in line] for line in en_data]
    ch_num_data = [[ch2id[ch] for ch in line] for line in ch_data]
    de_num_data = [[ch2id[ch] for ch in line][1:] for line in ch_data]

    # Maximum sequence lengths on the input and output sides
    max_encoder_seq_length = max([len(txt) for txt in en_num_data])
    max_decoder_seq_length = max([len(txt) for txt in ch_num_data])

    # One-hot encode the data
    encoder_input_data = np.zeros((len(en_num_data), max_encoder_seq_length, len(en2id)), dtype='float32')
    decoder_input_data = np.zeros((len(ch_num_data), max_decoder_seq_length, len(ch2id)), dtype='float32')
    decoder_target_data = np.zeros((len(ch_num_data), max_decoder_seq_length, len(ch2id)), dtype='float32')

    for i in range(len(ch_num_data)):
        for t, j in enumerate(en_num_data[i]):
            encoder_input_data[i, t, j] = 1.
        for t, j in enumerate(ch_num_data[i]):
            decoder_input_data[i, t, j] = 1.
        for t, j in enumerate(de_num_data[i]):
            decoder_target_data[i, t, j] = 1.

    return encoder_input_data, decoder_input_data, decoder_target_data, ch2id, id2ch, en_data

This function is not that complicated, so let's go through it briefly. It first reads the file and separates the English sentences from the Chinese ones, then builds an English dictionary and a Chinese dictionary (here, individual characters are the vocabulary units), and finally converts the English and Chinese sentences into index sequences represented with one-hot encoding.
At this point, data preprocessing is completed.
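If you want to sanity-check the preprocessing, you can call the function and print the tensor shapes; the exact numbers depend on the first 100 lines of cmn.txt:

from data_prepare import getdata

encoder_input_data, decoder_input_data, decoder_target_data, ch2id, id2ch, en_data = getdata()
print(encoder_input_data.shape)   # (number of pairs, longest English sentence, English vocab size)
print(decoder_input_data.shape)   # (number of pairs, longest Chinese sequence, Chinese vocab size)
print(len(ch2id) == len(id2ch))   # True: the two Chinese lookups mirror each other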

2. Build the model

Encoder model

First, build the Encoder part of the model. For this we need to consider three aspects:

What does the input to the Encoder model look like?
What kind of RNN unit does the Encoder model use?
Which part of the model serves as input to the Decoder?
The model's input is defined with Input; the last dimension of its shape must equal the dictionary size:

encoder_inputs = Input(shape=(None, EN_VOCAB_SIZE))

The model uses an LSTM unit whose dimensionality is set to HIDDEN_SIZE. The return_sequences argument determines whether the output of every time step is returned, and return_state controls whether the final hidden states are also returned.

encoder_LSTM = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True, name='encoder')

The Encoder's final hidden state and cell state, encoder_state_h and encoder_state_c, serve as the initial state of the Decoder:

encoder_h, encoder_state_h, encoder_state_c = encoder_LSTM(encoder_inputs)
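Putting the three snippets together (with the import and the constants they rely on, which match the full script at the end), the Encoder part of the training model looks like this:

from keras.layers import Input, LSTM

EN_VOCAB_SIZE = 47    # size of the English character dictionary
HIDDEN_SIZE = 256

encoder_inputs = Input(shape=(None, EN_VOCAB_SIZE))
encoder_LSTM = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True, name='encoder')
# encoder_h holds the output at every time step; encoder_state_h / encoder_state_c
# are the final hidden and cell states that will initialize the Decoder.
encoder_h, encoder_state_h, encoder_state_c = encoder_LSTM(encoder_inputs)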

Decoder model

To build the Decoder part, we need to consider three aspects:

What does the input to the Decoder model look like?
What kind of RNN unit does the Decoder model use?
What is the structure of the model's output?

The first two points are handled much like the Encoder. For the Decoder's output, we add a fully connected layer with a softmax activation that maps the output vector onto the target-language dictionary.

decoder = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True, name='decoder')
decoder_h, _, _ = decoder(decoder_inputs, initial_state=[encoder_state_h, encoder_state_c])
decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax',name='dense')
decoder_outputs = decoder_dense(decoder_h)

The Decoder part can also use the attention mechanism:
[Figure: seq2seq decoder with an attention mechanism]

This approach uses an attention layer to combine the per-step outputs of the Encoder with decoder_inputs to form the decoder's input. It is equivalent to encoding the input into a different context vector c for each time step of the sequence; during decoding, each step uses its own c, which makes the result more accurate. It is not used here, though.
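For reference only, a minimal sketch of such an attention layer with the same functional API could look like the following (Luong-style dot-product attention over the Encoder's per-step outputs; this is one possible implementation sketched as an assumption, not part of the model trained below):

from keras.layers import Activation, Dense, concatenate, dot

# encoder_h: (batch, t_enc, HIDDEN_SIZE), decoder_h: (batch, t_dec, HIDDEN_SIZE)
attention_scores = dot([decoder_h, encoder_h], axes=[2, 2])     # (batch, t_dec, t_enc)
attention_weights = Activation('softmax')(attention_scores)     # weights over encoder steps
context = dot([attention_weights, encoder_h], axes=[2, 1])      # (batch, t_dec, HIDDEN_SIZE)
decoder_combined = concatenate([context, decoder_h])            # per-step context + decoder output
attention_outputs = Dense(CH_VOCAB_SIZE, activation='softmax', name='attention_dense')(decoder_combined)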

3. Model training

When training the model, we use the Model class to wrap the Encoder and Decoder together, choose the optimizer with the optimizer argument, and the loss function with the loss argument.

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
opt = Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

When training the model, just input the previously preprocessed data into the model and set the corresponding parameters.

model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_split=0.2)

Here x is the input data and y is the target data. batch_size is the number of sequences processed in each batch, epochs is the number of passes over the training data, and validation_split is the fraction of the data held out for validation.
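As a side note, model.fit returns a History object, so if you want to inspect the loss curves afterwards you can simply keep a reference to it:

history = model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.2)
# history.history is a dict of per-epoch metrics
print(history.history['loss'][-1], history.history['val_loss'][-1])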

4. Build an inference model and organize the output

Build an inference model

The inference model is also built in two parts. The Encoder part has exactly the same structure as during training, so we only need to wrap the existing encoder:

# Encoder inference model
encoder_model = Model(encoder_inputs, [encoder_state_h, encoder_state_c])

For the Decoder part, the output of each step needs to be fed back as the input of the next step:
[Figure: inference-time decoding, where each predicted token is fed back as the next Decoder input]

The Decoder therefore needs to be partially redesigned.
First, determine the inputs and outputs of the Decoder part. The size of its state inputs must match the size of the states the Encoder outputs.

decoder_state_input_h = Input(shape=(HIDDEN_SIZE,))
decoder_state_input_c = Input(shape=(HIDDEN_SIZE,))

Since we need the output of each step as the input of the next step, we need to save the hidden state and output vector of the Decoder for later use.

decoder_h, state_h, state_c = decoder(decoder_inputs, initial_state=[decoder_state_input_h, decoder_state_input_c])
decoder_outputs = decoder_dense(decoder_h)

Finally, encapsulate the Decoder part:

decoder_model = Model([decoder_inputs, decoder_state_input_h, decoder_state_input_c], [decoder_outputs, state_h, state_c])

Organize the output

The output of the Decoder is a probability vector in which each element gives the probability of the character at the corresponding position in the dictionary. Usually, the element with the highest probability is taken as the prediction:

output_tokens, h, c= decoder_model.predict([target_seq, h, c])
sampled_token_index = np.argmax(output_tokens[0, -1, :])

After obtaining the Encoder's output, we need a loop in which the Decoder feeds each step's output back in as the next step's input, and stops when the termination symbol is produced or the maximum output length is exceeded.

while True:
    output_tokens, h, c = decoder_model.predict([target_seq, h, c])
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    outputs.append(sampled_token_index)
    target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
    target_seq[0, 0, sampled_token_index] = 1
    if sampled_token_index == ch2id['\n'] or len(outputs) > 20:
        break
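The same loop can be wrapped into a small helper for reuse; this simply repackages the code above and assumes encoder_model, decoder_model, ch2id, id2ch, CH_VOCAB_SIZE and numpy (as np) are already in scope:

def decode_sequence(one_hot_input, max_len=20):
    # one_hot_input: a (1, time steps, EN_VOCAB_SIZE) one-hot encoded English sentence
    h, c = encoder_model.predict(one_hot_input)
    target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
    target_seq[0, 0, ch2id['\t']] = 1          # start-of-sequence token
    outputs = []
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq, h, c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        outputs.append(sampled_token_index)
        target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
        target_seq[0, 0, sampled_token_index] = 1
        if sampled_token_index == ch2id['\n'] or len(outputs) > max_len:
            break
    return ''.join(id2ch[i] for i in outputs)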

Results display

Let’s take a look at the final code file first:

import data_prepare
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding,concatenate,TimeDistributed,RepeatVector,Bidirectional
from keras.optimizers import Adam
import numpy as np
EN_VOCAB_SIZE = 47
CH_VOCAB_SIZE = 147
HIDDEN_SIZE = 256
LEARNING_RATE = 0.003
BATCH_SIZE = 100
EPOCHS = 100
encoder_input_data, decoder_input_data,decoder_target_data,ch2id,id2ch,en_data = data_prepare.getdata()
# ==============encoder=============
encoder_inputs = Input(shape=(None, EN_VOCAB_SIZE))
encoder_h, encoder_state_h, encoder_state_c = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True,name='encoder')(encoder_inputs)
# ==============decoder=============
decoder_inputs = Input(shape=(None, CH_VOCAB_SIZE))
decoder = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True,name='decoder')
decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax',name='dense')
decoder_h, _, _ = decoder(decoder_inputs, initial_state=[encoder_state_h, encoder_state_c])
decoder_outputs = decoder_dense(decoder_h)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
opt = Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_split=0.2,verbose=0)

##################
# The encoder model is the same as in training, so it can be wrapped directly
encoder_model = Model(encoder_inputs, [encoder_state_h, encoder_state_c])


# The decoder part needs to be redesigned
decoder_state_input_h = Input(shape=(HIDDEN_SIZE,))
decoder_state_input_c = Input(shape=(HIDDEN_SIZE,))
decoder_h, state_h, state_c = decoder(decoder_inputs, initial_state=[decoder_state_input_h, decoder_state_input_c])
decoder_outputs = decoder_dense(decoder_h)
decoder_model = Model([decoder_inputs, decoder_state_input_h, decoder_state_input_c], [decoder_outputs, state_h, state_c])
##################



for k in range(50, 100):
    test_data = encoder_input_data[k:k + 1]
# Run prediction on test_data
    
    h, c = encoder_model.predict(test_data)
    target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
    target_seq[0, 0, ch2id['\t']] = 1
    outputs = []
    while True:
        output_tokens, h, c= decoder_model.predict([target_seq, h, c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        outputs.append(sampled_token_index)
        target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
        target_seq[0, 0, sampled_token_index] = 1
        if sampled_token_index == ch2id['\n'] or len(outputs) > 20: break
    print(en_data[k])
    print(''.join([id2ch[i] for i in outputs]))

I ran this code in VS Code, and the output is:

It's me.
是我。
Join us.
加入我们吧。
Keep it.
留着吧。
Kiss me.
吻我。
Perfect!
完美!
See you.
再见!
Shut up!
閉嘴!
Skip it.
不管它。
Take it.
拿走吧。
Wake up!
醒醒!
Wash up.
去清洗一下。
We know.
我们知道。
Welcome.
欢迎。
Who won?
谁赢了?
Why not?
为什么不?
You run.
你跑。
Back off.
往后退点。
Be still.
静静的,别动。
Beats me.
我一无所知。
Cuff him.
把他铐上。
Drive on.
往前开。
Get away!
滾!
Get away!
滾!
Get down!
趴下!
Get lost!
滾!
Get real.
醒醒吧。
Good job!
干的好!
Good job!
干的好!
Grab Tom.
抓住汤姆。
Grab him.
抓住他。
Have fun.
抓住他。
He tries.
他跑。
Humor me.
抱抱汤姆。
Hurry up.
他跑。
Hurry up.
他跑。
I forgot.
我同意。
I resign.
我退出。
I'll pay.
我迷失了。
I'm busy.
我沒事。
I'm cold.
我老了。
I'm fine.
我沒事。
I'm full.
我沒事。
I'm sick.
我老了。
I'm sick.
我老了。
I'm tall.
我老了。
Leave me.
他跑。
Let's go!
留着吧。
Let's go!
留着吧。
Let's go!
留着吧。
Look out!
找到汤姆。

Summary

Looking at the results above, most of the translations are fairly accurate, for example "It's me." and "Join us." Of course, some are poor, especially in the second half of the list. In particular, some English sentences are always translated into something about Tom, which is amusing; looking at the data set, Tom indeed appears in a large proportion of it, and that may be why some sentences are inexplicably translated with Tom. Another amusing case is "I'm cold.", which should of course mean "I am cold", but is translated as 我老了 ("I'm old"); on second thought, remove the c from "cold" and you do get "old". Overall, the model performs reasonably well.

This post is based on the machine translation training project on the EduCoder platform, written up with some of my own understanding.

Original post: blog.csdn.net/qq_44725872/article/details/112906393