Transformer in detail (2): training Tensor2Tensor with your own data

1. Environment

Goal: train a Transformer on your own data to perform Chinese-to-English translation.

Environment (one possible way to install it is sketched below):
TensorFlow 1.14
Python 3.6.x
tensor2tensor
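
One possible setup (an assumption; the post itself does not show install commands — the later commands run t2t scripts from a local checkout of the repository):

pip install tensorflow-gpu==1.14.0   # or tensorflow==1.14.0 for CPU-only use
pip install tensor2tensor
git clone https://github.com/tensorflow/tensor2tensor.git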

2. Training Tensor2Tensor with custom data

2.1 Creating a user directory (the value of the --t2t_usr_dir parameter)

This directory mainly contains the following files:

(1). The custom problem file (e.g. myproblem.py)
(2). An __init__.py that imports the problem_name, so that the problem can be recognized by t2t-datagen and t2t-trainer and registered with t2t

As shown below, create the custom user directory and put the custom problem file and __init__.py inside it.
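
A minimal sketch of what the __init__.py might contain, assuming the problem modules are the two files used later in this post (my_t2t/translate_enzh_fc.py and my_t2t/translate_enzh_bpe.py):

# my_t2t/__init__.py -- importing the problem modules makes their
# @registry.register_problem decorators run, so t2t-datagen / t2t-trainer can find them
from . import translate_enzh_fc
from . import translate_enzh_bpe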

2.2 The custom problem file

The custom problem handles the data processing and is driven mainly by t2t_datagen.py. The main flow of t2t_datagen.py is described below:

The important pieces of t2t_datagen.py are generate_data_for_problem (line 196) and generate_data_for_registered_problem (line 198). Stepping into the latter, we find that the data is generated mainly by problem.generate_data, i.e. the generate_data method of the Problem class. Different problems implement it differently; for text-to-text problems it is implemented as follows:

class Text2TextProblem(problem.Problem):
  # ...
  def generate_data(self, data_dir, tmp_dir, task_id=-1):
    # ...
    if self.is_generate_per_split:
      for split, paths in split_paths:
        generator_utils.generate_files(self.generate_encoded_samples(data_dir, tmp_dir, split), paths)
    else:
      generator_utils.generate_files(self.generate_encoded_samples(data_dir, tmp_dir, problem.DatasetSplit.TRAIN), all_paths)
    generator_utils.shuffle_dataset(all_paths, extra_fn=self._pack_fn())

As the code shows, generate_encoded_samples produces the encoded samples, and generate_files then writes them out. The key point is how the encoded samples are produced, so step into generate_encoded_samples:

def generate_encoded_samples(self, data_dir, tmp_dir, dataset_split):
    # ...
    generator = self.generate_samples(data_dir, tmp_dir, dataset_split)
    encoder = self.get_or_create_vocab(data_dir, tmp_dir)
    return text2text_generate_encoded(generator, encoder, has_inputs=self.has_inputs)

So the encoded samples are produced by text2text_generate_encoded, where get_or_create_vocab builds the encoder and generate_samples yields the raw samples. text2text_generate_encoded looks like this:

def text2text_generate_encoded(sample_generator, vocab, targets_vocab=None, has_inputs=True):
  targets_vocab = targets_vocab or vocab
  for sample in sample_generator:
    if has_inputs:
      sample["inputs"] = vocab.encode(sample["inputs"])
      sample["inputs"].append(text_encoder.EOS_ID)    # 表示【EOS】,句子结束
    sample["targets"] = targets_vocab.encode(sample["targets"])
    sample["targets"].append(text_encoder.EOS_ID)
    yield sample

It iterates over the samples from the generator, encodes each one with the encoder, and yields the result. Let us first look at how the encoder is built by stepping into get_or_create_vocab:

  def get_or_create_vocab(self, data_dir, tmp_dir, force_get=False):
    if self.vocab_type == VocabType.CHARACTER:
      encoder = text_encoder.ByteTextEncoder()
    elif self.vocab_type == VocabType.SUBWORD:
      if force_get:
        vocab_filepath = os.path.join(data_dir, self.vocab_filename)
        encoder = text_encoder.SubwordTextEncoder(vocab_filepath)
      else:
        encoder = generator_utils.get_or_generate_vocab_inner(data_dir, self.vocab_filename, self.approx_vocab_size,
            self.generate_text_for_vocab(data_dir, tmp_dir),
            max_subtoken_length=self.max_subtoken_length,
            reserved_tokens=(text_encoder.RESERVED_TOKENS + self.additional_reserved_tokens))
    elif self.vocab_type == VocabType.TOKEN:
      vocab_filename = os.path.join(data_dir, self.vocab_filename)
      encoder = text_encoder.TokenTextEncoder(vocab_filename, replace_oov=self.oov_token)
    # ...
    return encoder

Inside this function the encoder is chosen according to the vocabulary type (note: "encoder" here means encoding text as numbers; it is not the same thing as the encoder inside the model). There is a character-level text_encoder.ByteTextEncoder, a subword-level text_encoder.SubwordTextEncoder (the tensor2tensor default), and a word-level text_encoder.TokenTextEncoder.
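
As a small illustration of the difference (a sketch written for this note, not code from the post; it assumes the tensor2tensor 1.14 text_encoder API):

from tensor2tensor.data_generators import text_encoder

# character/byte level: every byte of the string becomes an id
byte_encoder = text_encoder.ByteTextEncoder()
print(byte_encoder.encode("tea"))

# word/token level: ids come from a fixed vocabulary (here given as a list)
token_encoder = text_encoder.TokenTextEncoder(None, vocab_list=["i", "like", "tea"])
print(token_encoder.encode("i like tea"))

# the subword encoder is normally built from the corpus itself by
# get_or_generate_vocab_inner, as in the SUBWORD branch shown above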

Next, let's see how the iterator obtains the sample data. The generate_samples function is defined as follows:

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
   
    raise NotImplementedError()

It is not implemented in the base class, so let us refer to an implementation someone else has written, for example:

@registry.register_problem
class TranslateEndeWmtBpe32k(translate.TranslateProblem):
  ... 
  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    train = dataset_split == problem.DatasetSplit.TRAIN
    dataset_path = ("train.tok.clean.bpe.32000"
                    if train else "newstest2013.tok.bpe.32000")
    train_path = _get_wmt_ende_bpe_dataset(tmp_dir, dataset_path)
    token_path = os.path.join(data_dir, self.vocab_filename)
    if not tf.gfile.Exists(token_path):
      token_tmp_path = os.path.join(tmp_dir, self.vocab_filename)
      tf.gfile.Copy(token_tmp_path, token_path)
      with tf.gfile.GFile(token_path, mode="r") as f:
        vocab_data = "<pad>\n<EOS>\n" + f.read() + "UNK\n"
      with tf.gfile.GFile(token_path, mode="w") as f:
        f.write(vocab_data)
    return text_problems.text2text_txt_iterator(train_path + ".en",  train_path + ".de")

The key is the last line, text_problems.text2text_txt_iterator; stepping in, the details are:

def text2text_txt_iterator(source_txt_path, target_txt_path):
  for inputs, targets in zip(
      txt_line_iterator(source_txt_path), txt_line_iterator(target_txt_path)):
    yield {"inputs": inputs, "targets": targets}

So inside generate_samples we only need to provide the paths of the parallel corpus, and text2text_txt_iterator will yield the corresponding samples.

To summarize the whole flow: generate_samples iterates over our parallel corpus and yields samples; get_or_create_vocab builds the encoder, which encodes (digitizes) those samples; and generator_utils.generate_files writes the encoded data to files. There are of course many more details, but with this in mind the flow chart above should all fit together. What we mainly need to adapt is the generate_encoded_samples function.

2.2.1 Custom problem: using tensor2tensor's default subword tool (with your own parallel corpus)

See the complete program in my_t2t/translate_enzh_fc.py. The places that need to be changed when using it are listed below; a minimal sketch of such a problem class follows the list.

  • Paths to the training and validation data; a compressed archive must exist at this path so that the data download step is skipped (this may be revised later)
_NC_TRAIN_DATASETS = [[
    "/home/rd/temp_dir/train_test.tar.gz",     
    ["train.en.seg", "train.zh.seg"]
]]

_NC_TEST_DATASETS = [[
    "/home/rd/temp_dir/train_test.tar.gz",
    ("valid.en.seg", "valid.zh.seg")
]]
  • Specify the size of the vocabulary:
    def vocab_size(self):
        return 45000
  • Other names, such as the vocabulary file name, can also be modified; see the program for details
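
A minimal sketch of such a problem class (an assumption modeled on the flow above and on the TranslateEndeWmtBpe32k example, not the actual my_t2t/translate_enzh_fc.py; for simplicity it reads the segmented files directly from tmp_dir instead of extracting the tar.gz):

import os

from tensor2tensor.data_generators import problem, text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnzhFc(text_problems.Text2TextProblem):
  """En-Zh translation with tensor2tensor's default subword vocabulary."""

  @property
  def approx_vocab_size(self):
    return 45000                        # vocabulary size (45000, as in the post)

  @property
  def is_generate_per_split(self):
    return True                         # separate train and validation files

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    train = dataset_split == problem.DatasetSplit.TRAIN
    src = "train.en.seg" if train else "valid.en.seg"
    tgt = "train.zh.seg" if train else "valid.zh.seg"
    return text_problems.text2text_txt_iterator(
        os.path.join(tmp_dir, src), os.path.join(tmp_dir, tgt))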

2.2.2 Custom problem: using BPE for subwords (with your own vocabulary and parallel corpus)

See the complete program in my_t2t/translate_enzh_bpe.py. The places that need to be modified when using it are as follows:

  • Paths to the training and validation data; note that only a base file name is given here, while the actual parallel-corpus files carry an .en or .zh suffix and are placed in the temp_dir directory
ENZH_BPE_DATASETS = {
    "TRAIN": "corpus.train",
    "DEV": "corpus.valid"
}
  • Specify the size of the vocabulary:
    def approx_vocab_size(self):
        return 45000

When using this problem, put the training, validation, and test corpora and the vocabulary under temp_dir, and make sure the vocabulary file name matches the one expected by the program. A minimal sketch of such a problem class follows.
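
A sketch of the BPE variant (again an assumption, not the actual my_t2t/translate_enzh_bpe.py; the vocabulary file name and OOV token below are hypothetical, while the corpus base names are the ones from ENZH_BPE_DATASETS above):

import os

from tensor2tensor.data_generators import problem, text_problems
from tensor2tensor.utils import registry

ENZH_BPE_DATASETS = {"TRAIN": "corpus.train", "DEV": "corpus.valid"}


@registry.register_problem
class TranslateEnzhBpe(text_problems.Text2TextProblem):
  """En-Zh translation over BPE-processed text with an existing vocabulary."""

  @property
  def approx_vocab_size(self):
    return 45000

  @property
  def vocab_type(self):
    return text_problems.VocabType.TOKEN   # word/BPE-token level TokenTextEncoder

  @property
  def oov_token(self):
    return "UNK"                           # hypothetical out-of-vocabulary token

  @property
  def vocab_filename(self):
    return "vocab.bpe.45000"               # hypothetical; the TOKEN encoder reads it from data_dir

  @property
  def is_generate_per_split(self):
    return True

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    train = dataset_split == problem.DatasetSplit.TRAIN
    name = ENZH_BPE_DATASETS["TRAIN" if train else "DEV"]
    return text_problems.text2text_txt_iterator(
        os.path.join(tmp_dir, name + ".en"), os.path.join(tmp_dir, name + ".zh"))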

2.3 Calling t2t_datagen.py to generate formatted data

The call parameters are set as follows:

python tensor2tensor/bin/t2t_datagen.py \
--data_dir=$DATABASE/t2t_datagen/$CATEGORY \      # custom directory for the generated (formatted) training data
--tmp_dir=$DATABASE \                             # directory containing the parallel corpus
--problem=translate_enzh_fc \                     # name of the custom problem
--t2t_usr_dir=$CODE_DIR/my_t2t                    # custom user directory, i.e. the directory holding the custom problem

For the full settings, see enzh_data_gen_org.sh in the script directory.

The generated formatted data looks as follows (the original image is unavailable):
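
Roughly speaking (based on tensor2tensor's default file naming, not on the post's own listing), data_dir ends up containing sharded TFRecord files plus the vocabulary file, e.g.:

translate_enzh_fc-train-00000-of-00100
...
translate_enzh_fc-dev-00000-of-00001
vocab.translate_enzh_fc.45000.subwords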

2.4 Calling t2t_trainer.py to train on the data

Parameter settings and meanings are as follows:

python tensor2tensor/bin/t2t_trainer.py \
    --data_dir=$DATA_DIR \                 # directory containing the formatted training data
    --t2t_usr_dir=$USER_DIR \              # directory containing the problem file
    --problem=$PROBLEM \                   # the problem
    --model=$MODEL \                       # the model
    --hparams_set=$HPARAMS \               # the hyperparameter set
    --output_dir=$TRAIN_DIR \              # output directory for training files (checkpoints)
    --worker_gpu=4 \
    --train_steps=200                      # number of training steps

See enzh_train_org.sh in the script directory.
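
Typical values for the variables above (an assumption; the post's own enzh_train_org.sh may differ):

PROBLEM=translate_enzh_fc     # the custom problem registered earlier
MODEL=transformer             # model name registered in tensor2tensor
HPARAMS=transformer_base      # a standard hyperparameter set for the Transformer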

2.5 Using t2t-decoder to decode (predict)

Parameters are as follows:

python tensor2tensor/bin/t2t-decoder \  
   --t2t_usr_dir=$USER_DIR \
   --problem=$PROBLEM \
   --data_dir=$DATA_DIR \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR \
   --decode_from_file=$BASEPATH/temp_dir/test.en.seg \      # source file used for testing
   --decode_to_file=$BASEPATH/decode_res/result.txt \       # file where the test results are saved
   --checkpoint_path=$TRAIN_DIR/model.ckpt-27000             # checkpoint of an already-trained model to use for testing

2.6 Using t2t-bleu to compute the BLEU score

Here --translation is the file produced by the model and --reference is the reference translation:

 python tensor2tensor/bin/t2t-bleu --translation=$TARGET_FILE --reference=$TRANS_FILE

2.7 Using t2t-exporter to export the model

cd $CODE_DIR
python ./tensor2tensor/bin/t2t-exporter --t2t_usr_dir=$USR_DIR \
                    --problem=$PROBLEM \
                    --data_dir=$DATA_DIR \
                    --model=$MODEL \
                    --hparams_set=$HPARAMS \
                    --output_dir=$OUTPUT_DIR        # path where the exported model is saved
# Note: t2t-exporter also has two parameters:
#   --export_dir: where the exported model is stored; if not specified, it defaults to a subdirectory of output_dir
#   --checkpoint_path: which saved checkpoint to export; if not specified, the latest checkpoint in output_dir is used
# As these two parameters show, output_dir is important: it is usually the directory where checkpoints were saved during training

2.8 Deploying with Docker

#!/usr/bin/env bash
nohup docker run --gpus '"device=0"' -p 9502:8501 -p 9002:8500 --name translate_enzh_bpe -v /home/tmx/rd/org/BPE_seg/train_dir/export/Servo:/models/translate_enzh_bpe -e MODEL_NAME=translate_enzh_bpe tensorflow/serving:1.14.0-gpu &

args:
--gpus: specify which GPU(s) to use
-p: port mapping (host:container)
--name: container name
-v: mount a volume (the exported model directory into the container)
-e: set an environment variable (MODEL_NAME)
The last argument is the image (tensorflow/serving:1.14.0-gpu). A quick check that the server is up follows below.
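
Once the container is running, a quick sanity check (not from the post) is to query TensorFlow Serving's REST status endpoint, which the -p 9502:8501 option maps to host port 9502:

curl http://localhost:9502/v1/models/translate_enzh_bpe
# returns the model's version and state ("AVAILABLE") once it has loaded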

3. Notes

  • For Chinese text, word segmentation (e.g. with HanLP or jieba) must be done first; when using BPE, segment the text into words before applying BPE (a short example follows this list)
  • The custom problem class must be named in CamelCase, derived from the underscore-separated problem name; e.g. if the problem is defined as translate_enzh_sub32k, the corresponding class name is TranslateEnzhSub32k
  • For how BPE is used, see the separate BPE article
  • In older versions of tensor2tensor, the parameter above is written as --problems=$PROBLEM (plural)
  • Some images cannot be displayed in this post; you can view the post and the source code directly on GitHub
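
A tiny illustration of the segmentation step mentioned in the first note (an assumption; the post only names jieba/HanLP and does not show code; the input file name train.zh is hypothetical, while the output name matches the train.zh.seg used above):

# segment each line of a Chinese file into space-separated words with jieba
import jieba

with open("train.zh", encoding="utf-8") as fin, \
     open("train.zh.seg", "w", encoding="utf-8") as fout:
  for line in fin:
    fout.write(" ".join(jieba.cut(line.strip())) + "\n")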
