Introduction to Deep Learning (63) Recurrent Neural Network - Machine Translation Dataset

Foreword

The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is kept as a record for future reference.

Recurrent Neural Networks - Machine Translation Dataset

Textbook

Language models are key to natural language processing, and machine translation is the most successful benchmark for language models, because machine translation is exactly the core problem of sequence transduction models, which convert an input sequence into an output sequence. Sequence transduction models play a vital role in various modern artificial intelligence applications. To this end, this section introduces the machine translation problem and the dataset that will be used later.

Machine translation refers to the automatic translation of a sequence from one language to another. In fact, this field of research dates back to the 1940s, shortly after the invention of the digital computer, particularly to the use of computers to break language codes in World War II. Statistical methods dominated the field for decades before the rise of end-to-end learning with neural networks. Because statistical machine translation involves statistical analysis of components such as translation models and language models, neural-network-based methods are often called neural machine translation in order to distinguish the two kinds of translation models.

This book focuses on neural machine translation methods, with an emphasis on end-to-end learning. Unlike the language model problem in the section on Language Models and Datasets, where the corpus is in a single language, a machine translation dataset consists of pairs of text sequences in the source and target languages. Therefore, we need a completely different approach to preprocessing machine translation datasets, rather than reusing the language model preprocessing procedure. Next, we look at how to load the preprocessed data into mini-batches for training.

import os
import torch
from d2l import torch as d2l

1 Download and preprocess the dataset

First, download an "English-French" dataset consisting of bilingual sentence pairs from the Tatoeba project. Each line in the dataset is a tab-delimited text sequence pair consisting of an English text sequence and its French translation. Note that each text sequence can be a sentence or a paragraph containing multiple sentences. In this English-to-French machine translation problem, English is the source language and French is the target language.

d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

def read_data_nmt():
    """载入“英语-法语”数据集"""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r',
             encoding='utf-8') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])

output

Go.         Va !
Hi.         Salut !
Run!        Cours !
Run!        Courez !
Who?        Qui ?
Wow!        Ça alors !

After downloading the dataset, the raw text data needs to go through several preprocessing steps. For example, we replace non-breaking spaces with ordinary spaces, convert uppercase letters to lowercase, and insert spaces between words and punctuation marks.

def preprocess_nmt(text):
    """Preprocess the "English-French" dataset"""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking spaces with ordinary spaces
    # Convert uppercase letters to lowercase
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space between words and punctuation marks
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])

output

go .        va !
hi .        salut !
run !       cours !
run !       courez !
who ?       qui ?
wow !       ça alors !

2 Tokenization

Unlike the character-level tokenization in the Language Models and Datasets section, in machine translation we prefer word-level tokenization (state-of-the-art models may use more advanced tokenization techniques). The following tokenize_nmt function tokenizes the first num_examples text sequence pairs, where each token is either a word or a punctuation mark. The function returns two lists of token lists, source and target: source[i] is the list of tokens of the i-th text sequence in the source language (here English), and target[i] is the list of tokens of the i-th text sequence in the target language (here French).

def tokenize_nmt(text, num_examples=None):
    """词元化“英语-法语”数据数据集"""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

source, target = tokenize_nmt(text)
source[:6], target[:6]

output:

([['go', '.'],
  ['hi', '.'],
  ['run', '!'],
  ['run', '!'],
  ['who', '?'],
  ['wow', '!']],
 [['va', '!'],
  ['salut', '!'],
  ['cours', '!'],
  ['courez', '!'],
  ['qui', '?'],
  ['ça', 'alors', '!']])

Let's plot a histogram of the number of tokens each text sequence contains. In this simple "English-French" dataset, most text sequences have fewer than 20 tokens.

def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
    """绘制列表长度对的直方图"""
    d2l.set_figsize()
    _, _, patches = d2l.plt.hist(
        [[len(l) for l in xlist], [len(l) for l in ylist]])
    d2l.plt.xlabel(xlabel)
    d2l.plt.ylabel(ylabel)
    for patch in patches[1].patches:
        patch.set_hatch('/')
    d2l.plt.legend(legend)

show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', source, target);

output
[Figure: histogram of the number of tokens per sequence for the source and target sequences]

3 Vocabulary

Since the machine translation dataset consists of language pairs, we can build two vocabularies, one for the source language and one for the target language. With word-level tokenization, the vocabulary size will be significantly larger than with character-level tokenization. To alleviate this problem, we treat low-frequency tokens that appear fewer than 2 times as the same unknown ("<unk>") token. In addition, we specify extra special tokens, such as the padding token ("<pad>") used to pad sequences to the same length in mini-batches, and the beginning-of-sequence ("<bos>") and end-of-sequence ("<eos>") tokens. These special tokens are commonly used in natural language processing tasks.

src_vocab = d2l.Vocab(source, min_freq=2,
                      reserved_tokens=['<pad>', '<bos>', '<eos>'])
len(src_vocab)
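
As a quick check added here (not in the original text), we can look up the indices of the special tokens. Assuming d2l.Vocab behaves as in recent d2l releases, "<unk>" is assigned index 0 and the reserved tokens follow in the order they were passed in:

# Indices of the unknown and reserved tokens in the source vocabulary
# (with this construction, '<unk>' is 0 and the reserved tokens follow in order)
print(src_vocab['<unk>'], src_vocab['<pad>'], src_vocab['<bos>'], src_vocab['<eos>'])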

4 Load the dataset

Recall that in a language model each sequence sample has a fixed length, whether the sample is part of one sentence or a fragment spanning multiple sentences. This fixed length is specified by the num_steps (number of time steps, i.e. number of tokens) parameter in the Language Models and Datasets section. In machine translation, each sample is a pair of source and target text sequences, and each of these text sequences may have a different length.

To improve computational efficiency, we can still process one mini-batch of text sequences at a time by means of truncation and padding. Suppose that every sequence in the same mini-batch should have the same length num_steps. If a text sequence has fewer than num_steps tokens, we keep appending the special "<pad>" token to its end until its length reaches num_steps; otherwise, we truncate the text sequence, keeping only its first num_steps tokens and discarding the rest. This way, every text sequence has the same length and can be loaded in mini-batches of the same shape.

As mentioned above, the truncate_pad function below truncates or pads text sequences.

def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a text sequence"""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])

output

[47, 4, 1, 1, 1, 1, 1, 1, 1, 1]
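
To make these indices easier to read (an illustrative addition, assuming d2l.Vocab provides a to_tokens method as in recent d2l releases), we can map them back to tokens; index 1 corresponds to the "<pad>" token here, and the exact word indices depend on the downloaded data:

# Map the padded index sequence back to tokens to verify the padding
print(src_vocab.to_tokens(
    truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])))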

Now we define a function that converts the text sequences into mini-batches for training. We append the special "<eos>" token to the end of every sequence to indicate the end of the sequence. When a model predicts by generating tokens one by one, the generated "<eos>" token signals that the output sequence is complete. In addition, we also record the length of each text sequence, excluding padding tokens from the count; some of the models introduced later will need this length information.

def build_array_nmt(lines, vocab, num_steps):
    """将机器翻译的文本序列转换成小批量"""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = torch.tensor([truncate_pad(
        l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    return array, valid_len
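
As a small illustration (added here, not part of the original text), build_array_nmt can be applied directly to the first two tokenized source sequences; the exact index values depend on the vocabulary built from the downloaded data:

# Convert the first two English sequences into a padded index array and
# their valid lengths (number of non-padding tokens, including '<eos>')
src_array_demo, src_valid_len_demo = build_array_nmt(source[:2], src_vocab, num_steps=8)
print(src_array_demo)
print(src_valid_len_demo)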

5 Training the model

Finally, we define the load_data_nmt function, which returns the data iterator as well as the vocabularies for the source language and the target language.

def load_data_nmt(batch_size, num_steps, num_examples=600):
    """返回翻译数据集的迭代器和词表"""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab

Below we read out the first mini-batch in the "English-French" dataset.

train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X.type(torch.int32))
    print('valid lengths for X:', X_valid_len)
    print('Y:', Y.type(torch.int32))
    print('valid lengths for Y:', Y_valid_len)
    break

output:

X: tensor([[  6, 143,   4,   3,   1,   1,   1,   1],
        [ 54,   5,   3,   1,   1,   1,   1,   1]], dtype=torch.int32)
valid lengths for X: tensor([4, 3])
Y: tensor([[ 6,  0,  4,  3,  1,  1,  1,  1],
        [93,  5,  3,  1,  1,  1,  1,  1]], dtype=torch.int32)
valid lengths for Y: tensor([4, 3])
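
To see which sentences these index rows correspond to (an illustrative addition; the sentences differ between runs because the iterator shuffles the data), we can map one row of each mini-batch back to tokens:

# X and Y still refer to the last mini-batch read in the loop above
print(src_vocab.to_tokens(X[0].tolist()))
print(tgt_vocab.to_tokens(Y[0].tolist()))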

6 Summary

  • Machine translation refers to the automatic translation of a sequence of text from one language to another.

  • The vocabulary size with word-level tokenization is significantly larger than with character-level tokenization. To alleviate this problem, we can treat low-frequency tokens as the same unknown token.

  • By truncating and padding text sequences, we ensure that all text sequences have the same length, so that they can be loaded in mini-batches of the same shape.

Origin: blog.csdn.net/qq_52358603/article/details/128376928