Implementing English-Chinese Translation with Seq2Seq

1. Introduction

1.1 Deep NLP

NLP (Natural Language Processing) sits at the intersection of computer science, artificial intelligence, and linguistics. It is mainly concerned with enabling computers to process or understand natural language, for example in machine translation and question-answering systems. Because understanding and using language is so complex, NLP is generally considered a hard problem. In recent years, with the rise of deep learning (Deep Learning, DL), people have kept applying DL to NLP, a combination known as Deep NLP, and have achieved many breakthroughs. Among these is the Seq2Seq model.

1.2 Origins of the Seq2Seq Model

The Seq2Seq (Sequence to Sequence) model, also called the encoder-decoder (Encoder-Decoder) model, is based on two papers published in 2014:

  • Sequence to Sequence Learning with Neural Networks, by Sutskever et al.
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, by Cho et al.

Sutskever et al. observed that Deep Neural Networks (DNNs) require fixed-length inputs and outputs and therefore cannot handle sequences of unknown or variable length, yet many important problems are naturally expressed as sequences of unknown length. To handle such sequence problems, a new approach was needed, and so they proposed the innovative Seq2Seq model. Let us take a look at what this model actually is.

2. The Exploration Behind the Seq2Seq Model

Why call it an innovation? Because Sutskever et al. went through three rounds of modeling before settling on the Seq2Seq model, and the resulting design is very clever. Let us first retrace their exploration. A language model (Language Model, LM) uses conditional probability to compute the next word given the words that precede it; this is the basis of Seq2Seq's predictions. Because the elements of a sequence are linked by context, much as the words of a sentence are, and a language model supplies exactly this conditional probability, the authors first chose the RNN-LM (Recurrent Neural Network Language Model).
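
In formula form, a language model factorizes the probability of a word sequence into conditional next-word probabilities (a standard formulation, stated here only for reference):

P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
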
rnn.png
The figure above shows a simple RNN unit. An RNN feeds the result of the previous step back in as a condition for the current input, which makes it suitable for modeling context-dependent sequences of arbitrary length.
But there is a problem: the input and output sequences must be aligned in advance, and it is not clear how to apply an RNN when the input and output sequences have different lengths and no simple one-to-one correspondence. To solve this alignment problem, the authors proposed a theoretically possible solution: use two RNNs. One RNN maps the input to a fixed-length vector, and the other RNN predicts the output sequence from that vector.
double RNN.png
Why only theoretically possible? Sutskever's doctoral thesis, TRAINING RECURRENT NEURAL NETWORKS, pointed out that training RNNs is very difficult. Because an RNN's output at each time step depends on all of the inputs before it, back-propagation through a long input sequence easily runs into the vanishing-gradient problem. To work around the difficulty of training RNNs, the authors used the LSTM (Long Short-Term Memory) network instead.
lstm0.png
The figure shows the internal cell structure of an LSTM. The LSTM was proposed to solve the vanishing-gradient problem of RNNs; its innovation is the forget gate, which lets the LSTM selectively forget earlier parts of the input sequence that are irrelevant, instead of depending on the entire input. After these three attempts, once the LSTM was adopted, a simple Seq2Seq model was in place.
seq2seq1.png
As the figure shows, a simple Seq2Seq model has three parts: the Encoder-LSTM, the Decoder-LSTM, and the Context. The input sequence is ABC. The Encoder-LSTM processes the input sequence and returns the hidden state of its last cell after the entire input has been consumed; this is also called the context (Context, C). The Decoder-LSTM then predicts the next character of the target sequence step by step from this hidden state, finally producing the output sequence WXYZ. It is worth mentioning that Sutskever et al. designed their Seq2Seq model specifically for their task, and they also reversed the input sequence, which lets the model handle long sentences better and improves accuracy.
seq2seq1.png
The figure shows the model Sutskever et al. actually designed, and they highlight three points. First, it uses two LSTMs, one for encoding and one for decoding; this is the outcome of the exploration and argument above. Second, it uses a deep LSTM (4 layers); compared with a shallow network, each additional layer reduces perplexity by nearly 10%. Third, the input sequence is reversed, which improves the LSTM's ability to handle long sequences.
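
As a side note, the input-reversal trick can be applied as a simple preprocessing step. The sketch below is illustrative only and is not part of the tutorial code in Section 3, which keeps the input in its original order; input_texts refers to the list of English sentences built in Section 3.2.

# Hypothetical preprocessing step: reverse each source sentence at the
# character level, mirroring the input-reversal trick from the paper.
# The tutorial code below does NOT apply this step.
input_texts = [text[::-1] for text in input_texts]
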

3. English-Chinese Translation

Now it is time to get our hands dirty. Having understood the Seq2Seq model above, let's build a simple English-to-Chinese translation model.

3.1 Dataset

We use an English-Chinese dataset from the website manythings, which has been uploaded to the Mo platform. Each line of the dataset has the format: English sentence + tab + Chinese sentence.
image.png
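
For reference, lines in this format look roughly like the following (illustrative examples of the expected tab-separated format, not copied verbatim from the file):

Hi.	嗨。
Wait!	等等！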

3.2 Data Processing

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'cmn.txt'

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
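
With num_samples = 10000, the printout looks roughly like the following. The token counts of 73 and 2580 are taken from the author's own comments later in this post ("1x73 tensor" and "2580x1 units"); the exact counts and the maximum sequence lengths depend on the dataset version, so the latter are left elided here.

Number of samples: 10000
Number of unique input tokens: 73
Number of unique output tokens: 2580
Max sequence length for inputs: ...
Max sequence length for outputs: ...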

3.3 Encoder-LSTM

# Map each token (character) to an index so that texts can be turned into vectors.
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

# np.zeros(shape, dtype) -- shape is a tuple; here the arrays are 3D:
# (number of samples, max sequence length, number of tokens)
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# input_texts contains all English sentences, target_texts all Chinese sentences.
# zip('ABC', 'xyz') pairs up elements: ('A','x'), ('B','y'), ('C','z').
# The aim: one-hot encode the texts into the 3D arrays defined above.
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        # one-hot: only the position of this character along the last axis is set to 1
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data is ahead by one timestep and does not
            # include the start character ('\t' at t = 0 is skipped).
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
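
To make the one-timestep offset concrete, here is a minimal sketch with a hypothetical target sentence (the sentence and the printout are illustrative only, not taken from the dataset):

# Illustrative only: a short target sentence already wrapped in the
# start ('\t') and end ('\n') markers used above.
example_target = '\t你好\n'

# Pair each decoder input character with the character the decoder must
# predict at the same timestep -- the same one-step offset as the loop above.
for t in range(len(example_target) - 1):
    print(repr(example_target[t]), '->', repr(example_target[t + 1]))
# '\t' -> '你'
# '你' -> '好'
# '好' -> '\n'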

3.4 Context (hidden state)

# Define an input sequence and process it.
# Input() produces a Keras tensor so that it can be fed into a Keras Model.
# Each timestep is a one-hot vector of length num_encoder_tokens (73 here);
# the sequence length is left as None, i.e. variable.
encoder_inputs = Input(shape=(None, num_encoder_tokens))

# units=256, return the last state in addition to the output
encoder_lstm = LSTM(latent_dim, return_state=True)

# With return_state=True, the LSTM returns (output, state_h, state_c):
# the last output, the last hidden state, and the last cell state.
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

3.5 Decoder-LSTM

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

# Obtain the decoder output sequence, using the encoder states as the initial state.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# Fully connected (Dense) layer with num_decoder_tokens units (about 2580 here) and softmax activation.
decoder_dense = Dense(num_decoder_tokens, activation='softmax')

# The softmax layer turns each decoder output vector into a probability
# distribution over all target characters.
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that turns `encoder_input_data` & `decoder_input_data`
# into `decoder_target_data`.  Model(inputs, outputs) groups layers into an
# object with training and inference features.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
# compile -> configure model for training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Train on 80% of the samples; hold out 20% for validation (validation_split=0.2).
model.fit([encoder_input_data, decoder_input_data], 
          decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('seq2seq.h5')
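
If training and inference happen in separate sessions, the saved model can be reloaded later. A minimal sketch, assuming the file 'seq2seq.h5' saved above is available:

from keras.models import load_model

# Reload the trained model saved above (the same Keras version is assumed).
model = load_model('seq2seq.h5')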

3.6 Decoding Sequences

# Define sampling (inference) models.
# The encoder model maps an input sequence to its final states (h, c).
encoder_model = Model(encoder_inputs, encoder_states)

# The decoder model takes the previous character plus the current states and
# returns the predicted next character plus the updated states.
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of the target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.
    # target_seq is the decoder's first input: the start-of-sequence tag.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token.
        # argmax returns the index of the maximum value along an axis,
        # i.e. greedily pick the most probable character.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # find char using index
        sampled_char = reverse_target_char_index[sampled_token_index]
        # and append sentence
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1): build a new one-hot vector
        # containing only the character just sampled, which becomes the
        # decoder's input at the next timestep.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update the states for the next timestep.
        states_value = [h, c]

    return decoded_sentence

3.7 Prediction

for seq_index in range(100,200):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
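
To translate a sentence that is not in the training set, it first has to be one-hot encoded the same way as the training inputs. A minimal sketch, where encode_input_text is a hypothetical helper not part of the original post (characters not seen during training are simply skipped):

# Hypothetical helper: one-hot encode a raw English string with the same
# token index used for training.
def encode_input_text(text):
    seq = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    for t, char in enumerate(text[:max_encoder_seq_length]):
        if char in input_token_index:  # skip characters unseen during training
            seq[0, t, input_token_index[char]] = 1.
    return seq

print('Decoded sentence:', decode_sequence(encode_input_text('Hello.')))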

The project has been published on the Mo platform as "Seq2Seq English-Chinese Translation"; training on a GPU is recommended.
The Mo platform also offers a very handy feature, API Doc (the second item in the development interface's right sidebar).


When writing code on the Mo platform, you can easily split the view into multiple windows: just drag a window's title bar to arrange the columns.

4. Summary and Outlook

The Seq2Seq model is a classic. It solved many important NLP problems that could not be solved before, in fields such as machine translation and speech recognition, and it is a milestone in the application of deep learning to NLP. Many follow-up improvements and optimizations build on this model, such as the attention mechanism. I believe there will be major new discoveries in the near future; let us wait and see.
Project source address (feel free to open it on a desktop browser and fork): https://momodel.cn/explore/5d38500a1afd94479891643a?type=app

5. References

Paper: Sequence to Sequence Learning with Neural Networks
Blog: Understanding LSTM Networks
Code: A ten-minute introduction to sequence-to-sequence learning in Keras

Source: https://blog.csdn.net/weixin_44015907/article/details/98394153