BERT: Bidirectional Encoder Representations from Transformers


BERT, or Bidirectional Encoder Representations from Transformers, is Google's recently released NLP pre-training method: a general-purpose "language understanding" model is trained on a large text corpus (such as Wikipedia) and then applied to the downstream NLP tasks we care about (such as classification or reading comprehension). BERT improves on earlier methods because it is the first **unsupervised, deeply bidirectional** system for pre-training NLP.

Put simply, it beats the previous models, such as Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT: BERT-based models achieve state-of-the-art results on several language tasks (SQuAD, MultiNLI, and MRPC).

The core pre-training procedure of BERT:

  1. Randomly select 15% of the tokens in each sentence, mask them out, and use them as the model's prediction targets, for example:

    Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
    Labels: [MASK1] = store; [MASK2] = gallon
    
  2. To learn the relationship between sentences, two sentences are drawn from the dataset such that, 50% of the time, the second sentence is the actual next sentence of the first:

    Sentence A: the man went to the store .
    Sentence B: he bought a gallon of milk .
    Label: IsNextSentence
    
    Sentence A: the man went to the store .
    Sentence B: penguins are flightless .
    Label: NotNextSentence
    
  3. Finally, the processed sentence pairs are fed into a large Transformer model, and training optimizes both objectives above at the same time with two loss functions (a simplified sketch of this data construction follows the list).
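
A short, simplified sketch of how one such training instance might be built makes the two objectives concrete. This is only an illustration of the idea, not Google's actual data-generation code; for instance, the real implementation replaces only 80% of the selected tokens with [MASK], swaps 10% for random words, and leaves 10% unchanged:

    import random

    MASK_PROB = 0.15  # fraction of tokens selected as masked-LM targets

    def make_instance(sentence_a, sentence_b, corpus):
        """Build (tokens, mlm_labels, nsp_label) for one training example."""
        # Next-sentence prediction: keep the true next sentence 50% of the
        # time, otherwise swap in a random sentence from the corpus.
        if random.random() < 0.5:
            nsp_label = "IsNextSentence"
        else:
            sentence_b = random.choice(corpus)
            nsp_label = "NotNextSentence"

        tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

        # Masked LM: pick 15% of the ordinary tokens, replace them with
        # [MASK], and keep the original tokens as prediction targets.
        candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
        num_to_mask = max(1, int(len(candidates) * MASK_PROB))
        mlm_labels = {}
        for i in random.sample(candidates, num_to_mask):
            mlm_labels[i] = tokens[i]
            tokens[i] = "[MASK]"

        return tokens, mlm_labels, nsp_label

    corpus = [["penguins", "are", "flightless", "."]]
    a = ["the", "man", "went", "to", "the", "store", "."]
    b = ["he", "bought", "a", "gallon", "of", "milk", "."]
    print(make_instance(a, b, corpus))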

The key component is the Transformer model itself; its architecture and design ideas deserve a separate analysis later.

Pre-trained models

  • BERT-Base, Uncased:
    12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased:
    24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased:
    12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased:
    24-layer, 1024-hidden, 16-heads, 340M parameters
    (Not available yet. Needs to be re-generated.)
  • BERT-Base, Multilingual:
    102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese:
    Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
    parameters
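
The listed parameter counts can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes the standard configuration of the English models (a WordPiece vocabulary of 30,522 tokens, feed-forward size of 4 × hidden, 512 position embeddings, 2 segment types) and ignores the masked-LM and next-sentence output heads; the multilingual and Chinese models use different vocabulary sizes, so their exact counts differ somewhat:

    def bert_params(layers, hidden, vocab=30522, max_pos=512, segments=2):
        """Approximate parameter count of a BERT encoder."""
        ffn = 4 * hidden
        # token + position + segment embeddings, plus their LayerNorm
        embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden
        # self-attention: Q, K, V, output projections (+ biases) and LayerNorm
        attention = 4 * (hidden * hidden + hidden) + 2 * hidden
        # feed-forward: two dense layers (+ biases) and LayerNorm
        feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
        pooler = hidden * hidden + hidden
        return embeddings + layers * (attention + feed_forward) + pooler

    print(bert_params(12, 768))    # ~109.5M -> quoted as 110M for BERT-Base
    print(bert_params(24, 1024))   # ~335M   -> quoted as 340M for BERT-Large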

Each downloaded archive contains the following (a quick inspection sketch follows this list):

  • A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained
    weights (which is actually 3 files).
  • A vocab file (vocab.txt) to map WordPiece tokens to word IDs.
  • A config file (bert_config.json) which specifies the hyperparameters of
    the model.
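
To see what is inside one of these downloads, the config and vocabulary can be read directly. In the sketch below, the directory name is just a placeholder for wherever the archive (e.g. chinese_L-12_H-768_A-12.zip) was extracted:

    import json

    MODEL_DIR = "chinese_L-12_H-768_A-12"  # placeholder: path to the unzipped model

    # bert_config.json holds the architecture hyperparameters.
    with open(MODEL_DIR + "/bert_config.json") as f:
        config = json.load(f)
    print(config["num_hidden_layers"], config["hidden_size"], config["num_attention_heads"])

    # vocab.txt lists one WordPiece token per line; the line number is its id.
    with open(MODEL_DIR + "/vocab.txt", encoding="utf-8") as f:
        vocab = {token.rstrip("\n"): idx for idx, token in enumerate(f)}
    print(len(vocab), vocab.get("[MASK]"))

    # bert_model.ckpt.* is a TensorFlow checkpoint; with TensorFlow installed,
    # tf.train.list_variables(MODEL_DIR + "/bert_model.ckpt") lists the
    # pre-trained weight tensors and their shapes.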

For other languages, see the Multilingual README. A Chinese model has also been released.

(A separate Chinese version was trained even under tight compute constraints; the influence of Chinese is plain to see, and the rest of us still need to work harder.)

For more details, see: https://github.com/google-research/bert
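
For a quick experiment with a released checkpoint, here is a minimal usage sketch based on the PyTorch port listed in the references below (pytorch-pretrained-BERT); the exact API may differ between versions of that package, so treat this as an outline rather than a definitive recipe:

    import torch
    from pytorch_pretrained_bert import BertTokenizer, BertModel

    # Download (or load from cache) the uncased English base model.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    text = "[CLS] the man went to the store . [SEP] he bought a gallon of milk . [SEP]"
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([token_ids])

    # Forward pass: returns the hidden states of every encoder layer plus the
    # pooled [CLS] representation.
    with torch.no_grad():
        encoded_layers, pooled_output = model(tokens_tensor)
    print(len(encoded_layers), encoded_layers[-1].shape, pooled_output.shape)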

Reference

  1. GitHub (TensorFlow): https://github.com/google-research/bert
  2. PyTorch version of BERT: https://github.com/huggingface/pytorch-pretrained-BERT
  3. BERT-Base, Chinese: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
  4. Paper: https://arxiv.org/abs/1810.04805
