T2T Transformer Notes

Copyright notice: this is the blogger's original article and may not be reproduced without permission. https://blog.csdn.net/hellonlp/article/details/78753205


Discussion:

https://www.jiqizhixin.com/articles/2017-06-28-5

https://ricardokleinklein.github.io/2017/11/16/Attention-is-all-you-need.html


1. Differences between the Multi-GPU and Single-GPU configurations

https://github.com/tensorflow/tensor2tensor/issues/124

https://github.com/tensorflow/tensor2tensor/issues/17


2. Multi-GPU runs the same number of steps more slowly than a single GPU

https://github.com/tensorflow/tensor2tensor/issues/146

https://github.com/tensorflow/tensor2tensor/issues/390


3. The batch_size parameter

https://github.com/tensorflow/tensor2tensor/issues/17#issuecomment-310268149

https://github.com/tensorflow/tensor2tensor/issues/415#issue-273498229


4. Data processing

4.1)https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t-datagen

generate_data_for_problem(problem) 


4.2)https://github.com/tensorflow/tensor2tensor/blob/92983eaaa457ec18729b1883ba5ae4a6614bdcb5/tensor2tensor/data_generators/generator_utils.py

generate_files(generator, output_filenames, max_cases=None)

"""Generate cases from a generator and save as TFRecord files.

Generated cases are transformed to tf.Example protos and saved as TFRecords
in sharded files named output_dir/output_name-00..N-of-00..M=num_shards.

Args:
  generator: a generator yielding (string -> int/float/str list) dictionaries.
  output_filenames: List of output file paths.
  max_cases: maximum number of cases to get from the generator; if None
    (default), we use the generator until StopIteration is raised.
"""


Note:

writers[shard].write(sequence_example.SerializeToString()) serializes each example when writing the dataset.
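The sharded-output behavior described in the docstring can be sketched in plain Python. This is a simplified stand-in for what generate_files does, not t2t's actual code: the shard-naming helper and the round-robin distribution loop below are illustrative (the real function serializes each case as a tf.Example and writes it to a TFRecord writer).

```python
def sharded_names(output_name, num_shards):
    # Mimic the output_name-00000-of-00003 style naming scheme.
    return ["%s-%05d-of-%05d" % (output_name, i, num_shards)
            for i in range(num_shards)]

def distribute_cases(generator, num_shards, max_cases=None):
    # Round-robin cases across shards, stopping at max_cases if given,
    # analogous to how generate_files cycles through its writers.
    shards = [[] for _ in range(num_shards)]
    for counter, case in enumerate(generator):
        if max_cases is not None and counter >= max_cases:
            break
        shards[counter % num_shards].append(case)
    return shards
```

For example, distributing three cases over two shards puts two cases in shard 0 and one in shard 1.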


4.3)

https://github.com/tensorflow/tensor2tensor/blob/92983eaaa457ec18729b1883ba5ae4a6614bdcb5/tensor2tensor/data_generators/generator_utils.py

get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size, sources)


get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, generator)

"""Inner implementation for vocab generators. 


Args: 

data_dir: The base directory where data and vocab files are stored. If None, then do not save the vocab even if it doesn't exist. 

vocab_filename: relative filename where vocab file is stored 

vocab_size: target size of the vocabulary constructed by SubwordTextEncoder 

generator: a generator that produces tokens from the vocabulary 


Returns: A SubwordTextEncoder vocabulary object. """

vocab = text_encoder.SubwordTextEncoder.build_to_target_size( vocab_size, token_counts, 1, 1e3)
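Before build_to_target_size is called, token counts are tallied from the source corpora. A minimal sketch of that counting step (whitespace tokenization here is a simplification; t2t runs its own tokenizer over the files listed in `sources` before feeding counts to SubwordTextEncoder):

```python
from collections import Counter

def count_tokens(lines):
    # Tally how often each token appears across the corpus lines.
    # The resulting mapping plays the role of `token_counts` above.
    token_counts = Counter()
    for line in lines:
        token_counts.update(line.split())
    return token_counts
```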


4.4)https://github.com/tensorflow/tensor2tensor/blob/e3cd447aa605515753ebfc3dbf1a4d4c5ae32425/tensor2tensor/data_generators/text_encoder.py

build_to_target_size(cls, target_size, token_counts, min_val, max_val, num_iterations=4)

"""Builds a SubwordTextEncoder that has `vocab_size` near `target_size`. 


Uses simple recursive binary search to find a minimum token count that most closely matches the `target_size`. 


Args:
  target_size: Desired vocab_size to approximate.
  token_counts: A dictionary of token counts, mapping string to int.
  min_val: An integer; lower bound for the minimum token count.
  max_val: An integer; upper bound for the minimum token count.
  num_iterations: An integer; how many iterations of refinement.

Returns:
  A SubwordTextEncoder instance.

Raises:
  ValueError: If `min_val` is greater than `max_val`. """


An important concept: the minimum token count.

"""Bisection to find the right size."""

# We build iteratively. On each iteration, we segment all the words,
# then count the resulting potential subtokens, keeping the ones
# with high enough counts for our new vocabulary.
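The bisection over the minimum token count can be sketched as follows. This is a toy stand-in, not t2t's implementation: `vocab_size_for` is a hypothetical function that reports the vocabulary size produced at a given count threshold (in the real code, a SubwordTextEncoder is actually built at each candidate threshold). The key fact it relies on: raising the minimum count drops more subtokens, so vocab size shrinks as the threshold grows, which is what makes binary search applicable.

```python
def search_min_count(target_size, vocab_size_for, min_val, max_val,
                     num_iterations=4):
    # Binary-search for the minimum token count whose resulting
    # vocabulary size is closest to target_size.
    if min_val > max_val:
        raise ValueError("min_val must not exceed max_val")
    best = None
    for _ in range(num_iterations):
        mid = (min_val + max_val) // 2
        size = vocab_size_for(mid)
        if best is None or abs(size - target_size) < abs(best[1] - target_size):
            best = (mid, size)
        if size > target_size:
            min_val = mid + 1  # vocab too big: raise the threshold
        else:
            max_val = mid - 1  # vocab too small: lower the threshold
        if min_val > max_val:
            break
    return best[0]
```

With only a few iterations (t2t defaults to 4), the search settles on a threshold whose vocab size is merely near the target, which is why the docstring says "near `target_size`" rather than exactly equal.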


5. Training

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_hparams.py


6. Server

6.1 https://research.googleblog.com/2017/11/latest-innovations-in-tensorflow-serving.html

6.2 https://towardsdatascience.com/how-to-deploy-machine-learning-models-with-tensorflow-part-1-make-your-model-ready-for-serving-776a14ec3198

6.3 http://blog.csdn.net/wangjian1204/article/details/68928656

6.4  https://weiminwang.blog/2017/09/12/introductory-guide-to-tensorflow-serving/

6.5 https://github.com/tensorflow/tensor2tensor/issues/368

6.6 https://github.com/tensorflow/tensor2tensor/issues/349


1) Crash when running the big model
tensorflow.python.framework.errors_impl.InvalidArgumentError: Number of ways to split should evenly divide the split dimension, but got split_dim 0 (size = 4) and num_split 3

Caused by op u'transformer/split', defined at: ...

Reference: https://github.com/tensorflow/tensor2tensor/issues/266

Simply reducing the batch size resolved the problem.
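The error arises because data-parallel training splits each batch across the GPUs, so the size of the split dimension must be evenly divisible by the number of GPUs (here, 4 examples could not be split 3 ways). A quick sanity check one could run before training, mirroring tf.split's requirement (illustrative plain Python, not t2t code):

```python
def check_split(batch_dim, num_gpus):
    # tf.split requires the split dimension to be evenly divisible
    # by the number of pieces; raise the same kind of complaint early.
    if batch_dim % num_gpus != 0:
        raise ValueError(
            "Number of ways to split should evenly divide the split "
            "dimension, but got split_dim size %d and num_split %d"
            % (batch_dim, num_gpus))
    return batch_dim // num_gpus
```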

2) File naming

With the name newsdev2017-zhen-src.pre.bpe.zh, t2t assumes the file is a tar archive and fails with: tarfile.ReadError: file could not be opened successfully
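That error message comes from Python's standard tarfile module, which raises tarfile.ReadError when asked to open a file that is not actually a tar archive. A quick demonstration (writing a throwaway text file to a temp directory; the filename here is just an example):

```python
import os
import tarfile
import tempfile

def try_open_as_tar(path):
    # tarfile.open raises tarfile.ReadError on non-tar files --
    # the same error t2t surfaces when it misclassifies a data file.
    try:
        tarfile.open(path)
        return "ok"
    except tarfile.ReadError:
        return "not a tar file"

tmp = os.path.join(tempfile.mkdtemp(), "sample-src.pre.bpe.zh")
with open(tmp, "w") as f:
    f.write("just plain text\n")
```

Renaming the file so t2t's download/extract logic no longer treats it as an archive avoids the crash.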


