Byte Pair Encoding (BPE)

Reference papers:

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh Neural Machine Translation Systems for WMT 16. arXiv preprint arXiv:1606.02891.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving Neural Machine Translation Models with Monolingual Data. arXiv preprint arXiv:1511.06709.

Reference blog:

https://cloud.tencent.com/developer/article/1089017

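First, a look at the BPE algorithm

BPE (byte pair encoding), also known as digram coding, was designed primarily for data compression. The algorithm is an iterative process: at each step, the most frequent pair of adjacent characters in the string is replaced by a character that does not occur in the string. The details follow below. The algorithm was first described by Philip Gage in the article "A New Algorithm for Data Compression", published in the February 1994 issue of the C Users Journal.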

How the algorithm works

I find this algorithm quite simple. Here is a walkthrough.

Suppose we want to encode:

aaabdaaabac

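Here the pair aa occurs most often (we only look at the frequency of two-character pairs), so we replace aa with Z, a character that does not appear in the string: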

ZabdZabac

Z=aa

At this point ab is among the most frequent pairs (it occurs twice, as does Za), so in the same way we replace ab with Y:

ZYdZYac

Y=ab

Z=aa

Likewise, ZY is now the most frequent pair, so we replace ZY with X:

XdXac

X=ZY

Y=ab

Z=aa

Finally, every pair of adjacent characters occurs only once, so the process stops. It really is that simple.
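The walkthrough above can be sketched in a few lines of Python. This is a minimal sketch, not Gage's original implementation: the function name bpe_compress and the fresh_symbols parameter are made up for illustration, the caller is assumed to supply symbols that do not occur in the input, and the tie-breaking rule (prefer the lexicographically largest pair) is an assumption chosen only so that the result matches the steps shown above.

```python
from collections import Counter

def bpe_compress(text, fresh_symbols):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    fresh_symbols is assumed to contain only characters absent from text.
    Returns the compressed string and the replacement table in the order
    the replacements were made.
    """
    table = []
    symbols = iter(fresh_symbols)
    while True:
        # Count every adjacent two-character pair (overlapping positions included).
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        # Assumed tie-break: among equally frequent pairs, take the
        # lexicographically largest one.
        pair = max(counts, key=lambda p: (counts[p], p))
        if counts[pair] < 2:
            break  # every pair occurs once, nothing left to compress
        symbol = next(symbols, None)
        if symbol is None:
            break  # ran out of unused symbols
        text = text.replace(pair, symbol)
        table.append((symbol, pair))
    return text, table

compressed, table = bpe_compress("aaabdaaabac", "ZYX")
print(compressed)  # XdXac
print(table)       # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

With a different tie-breaking rule the second step could pick Za instead of ab. The replacement table would then differ, but the compressed string would still have five characters.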

To decode, apply the replacements in reverse order.
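Decoding can be sketched the same way. The function name bpe_decompress is hypothetical, and the table is simply the list of replacements from the example above, written in the order they were made.

```python
def bpe_decompress(text, table):
    """Undo the replacements, newest first."""
    for symbol, pair in reversed(table):
        text = text.replace(symbol, pair)
    return text

# Replacement table from the walkthrough, in the order it was built.
table = [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]
restored = bpe_decompress("XdXac", table)
print(restored)  # aaabdaaabac
```

The order matters: X has to be expanded first, because its expansion ZY contains symbols that are themselves expanded in the later steps.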

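In machine translation, this method can be considered for compressing the encoding of English words. (A Chinese character usually takes 3 bytes in UTF-8, so it can be viewed as a similar kind of encoding.) And so someone did exactly that.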
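The remark about UTF-8 is easy to check in Python. The character 字 is just an arbitrary example picked for this sketch.

```python
# Encode one common Chinese character and inspect its UTF-8 bytes.
encoded = "字".encode("utf-8")
print(len(encoded))   # 3
print(list(encoded))  # [229, 173, 151]
```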

This is how Sennrich and colleagues used byte pair encoding at WMT16:

First, each word in the training vocabulary is represented as a sequence of characters, plus an end-of-word symbol. All characters are added to the symbol vocabulary. Then, the most frequent symbol pair is identified, and all its occurrences are merged, producing a new symbol that is added to the vocabulary. The previous step is repeated until a set number of merge operations have been learned.
Nothing particularly deep here, it seems.
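The procedure in the quote can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' released code: the function name learn_bpe, the "</w>" spelling of the end-of-word symbol, and the toy word frequencies are all assumptions made for this example, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a {word: frequency} dict."""
    # Each word becomes a sequence of characters plus an end-of-word symbol.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair counted wins
        merges.append(best)
        # Merge every occurrence of the best pair into one new symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy word frequencies, for illustration only.
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Unlike the compression version, no unused character is needed here: each merge creates a new, longer symbol, and the number of merges is a fixed setting rather than something determined by the data.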

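As an aside: in translation involving Chinese, ever since BPE was introduced, all sorts of strange characters have been turning up more and more often.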
