A 5-Minute Introduction to Google's Strongest NLP Model: BERT

BERT (Bidirectional Encoder Representations from Transformers)

On October 11, Google AI Language published the paper

BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding

The proposed BERT model broke records on 11 NLP tasks, including question answering (SQuAD v1.1) and natural language inference (MNLI), among others:

GLUE: General Language Understanding Evaluation
MNLI: Multi-Genre Natural Language Inference
SQuAD v1.1: The Stanford Question Answering Dataset
QQP: Quora Question Pairs
QNLI: Question Natural Language Inference
SST-2: The Stanford Sentiment Treebank
CoLA: The Corpus of Linguistic Acceptability
STS-B: The Semantic Textual Similarity Benchmark
MRPC: Microsoft Research Paraphrase Corpus
RTE: Recognizing Textual Entailment
WNLI: Winograd NLI
SWAG: The Situations With Adversarial Generations

Let's first take a look at BERT on the Stanford Question Answering Dataset (SQuAD) leaderboard:
https://rajpurkar.github.io/SQuAD-explorer/


What can BERT be used for?

BERT can serve as the foundation, i.e. the language model, for tasks such as question answering systems, sentiment analysis, spam filtering, named entity recognition, and document clustering.

BERT's code is open source:
https://github.com/google-research/bert
We can fine-tune it and apply it to our own objectives and tasks; fine-tuning a trained BERT model is fast and simple.

For example, for the NER problem, BERT language models have been pre-trained in over 100 languages; this is the list of the top 100 languages:
https://github.com/google-research/bert/blob/master/multilingual.md

For any of these 100 languages, as long as you have NER data, you can quickly train an NER model (a minimal sketch follows).
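As a hypothetical illustration, the sketch below loads the multilingual checkpoint for token classification using the Hugging Face transformers library rather than the google-research/bert repo linked above; the label count and the example sentence are my own assumptions, not values from the original post.

```python
# A sketch with Hugging Face `transformers` (not the google-research/bert repo).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=9)  # e.g. BIO tags for PER/ORG/LOC/MISC + O

inputs = tokenizer("Angela Merkel besuchte Paris.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits    # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)    # one (still untrained) NER tag id per token
```

The classification head here is randomly initialized; it only becomes useful after fine-tuning on labeled NER data for the language in question.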


An outline of how BERT works

BERT's key innovation is applying the Transformer's bidirectional training to language modelling;
earlier models processed a text sequence from left to right, or combined left-to-right and right-to-left training.
The experimental results show that a bidirectionally trained language model develops a deeper sense of language context than a one-directional model.
The paper introduces a new technique called Masked LM (MLM); before this technique appeared, bidirectional language model training was not possible.

BERT uses the encoder part of the Transformer.
The Transformer is an attention mechanism that learns the contextual relations between words in a text.
The original Transformer includes two separate mechanisms: an encoder that receives the text as input, and a decoder that produces the prediction for the task.
Since BERT's goal is to produce a language model, only the encoder mechanism is needed.

The Transformer encoder reads the entire text sequence at once, rather than sequentially from left to right or right to left;
this allows the model to learn each word from both its left and right context, which effectively makes it bidirectional.

The figure shows the Transformer encoder part: the input is a sequence of tokens, which are first embedded into vectors and then fed into the neural network; the output is a sequence of vectors of size H, in which each vector corresponds to the input token at the same index (a short sketch follows the figure).


Image by Rani Horev
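To make this input/output contract concrete, here is a minimal sketch, assuming the Hugging Face transformers library (not part of the original post); for bert-base, H = 768.

```python
# Tokenize a sentence and inspect the encoder output shape.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The child came home from school.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state
# One H-dimensional vector per input token, at the same index as the token.
print(hidden.shape)   # torch.Size([1, seq_len, 768])
```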

When we train a language model, one challenge is defining a prediction objective. Many models predict the next word in a sequence, e.g.
"The child came home from ___"
A bidirectional approach is limited in this kind of task. To overcome this problem, BERT uses two strategies:

1. Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token. The model then tries to predict the original masked words based on the context provided by the other, non-masked words in the sequence.

This requires:

  1. Adding a classification layer on top of the encoder output
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
  3. Computing the probability of each word in the vocabulary with softmax

BERT's loss function only considers the predictions for the masked tokens and ignores the predictions for the non-masked ones. Because of this, the model converges more slowly than a directional model, but its awareness of context is better (a toy sketch follows the figure below).


Image by Rani Horev
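The following toy sketch walks through the steps above in plain PyTorch rather than the original TensorFlow implementation. All tensors are fake; the [MASK] id (103), vocabulary size, and hidden size match bert-base-uncased but are assumptions on my part.

```python
# Toy Masked LM sketch on fake tensors.
import torch
import torch.nn.functional as F

vocab_size, hidden_size, seq_len = 30522, 768, 12
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # a pretend input sequence

mask = torch.rand(1, seq_len) < 0.15                     # choose ~15% of positions
mask[0, 3] = True                                        # ensure at least one masked position
MASK_ID = 103
masked_ids = token_ids.masked_fill(mask, MASK_ID)        # replace chosen tokens with [MASK]

# Pretend `hidden` is the encoder output for `masked_ids`.
hidden = torch.randn(1, seq_len, hidden_size)
embedding_matrix = torch.randn(vocab_size, hidden_size)  # shared with the input embeddings

# Step 2: multiply by the embedding matrix to get back to vocabulary size.
logits = hidden @ embedding_matrix.t()                   # (1, seq_len, vocab_size)
# Step 3: softmax over the vocabulary.
log_probs = F.log_softmax(logits, dim=-1)

# The loss only looks at the masked positions; unmasked predictions are ignored.
loss = F.nll_loss(log_probs[mask], token_ids[mask])
```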

2. Next Sentence Prediction (NSP)

During BERT training, the model receives pairs of sentences as input and predicts whether the second sentence is also the subsequent sentence in the original document.
During training, 50% of the input pairs are consecutive in the original document, while in the other 50% the second sentence is picked at random from the corpus and is disconnected from the first sentence.

To help the model distinguish the two sentences during training, the input is processed as follows before entering the model (a short sketch follows the list):

  1. A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is inserted at the end of each sentence.
  2. A sentence embedding indicating sentence A or sentence B is added to each token.
  3. A positional embedding is added to each token to indicate its position in the sequence.
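A quick sketch of this input format, assuming the Hugging Face tokenizer (not part of the original post): [CLS] and [SEP] are inserted automatically and the segment (token_type) ids distinguish sentence A from sentence B; the position embeddings are added inside the model itself.

```python
# Build a sentence-pair input and inspect the special tokens and segment ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The child came home from school.",
                    "He was very tired.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'child', ..., '[SEP]', 'he', 'was', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0 for sentence A tokens, 1 for sentence B tokens
```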

To predict whether the second sentence is indeed a continuation of the first, the following steps are performed:

  1. The entire input sequence is fed through the Transformer model
  2. The output of the [CLS] token is transformed into a 2×1 vector with a simple classification layer
  3. The probability of IsNextSequence is computed with softmax

When training BERT, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies (a toy sketch follows).
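As a toy illustration of the NSP head and the combined objective, here is a plain-PyTorch sketch with made-up shapes and values; `hidden` and `mlm_loss` are stand-ins, not real model outputs.

```python
# Toy NSP head and combined pre-training loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 768
nsp_head = nn.Linear(hidden_size, 2)       # simple classification layer: IsNext vs NotNext

hidden = torch.randn(1, 12, hidden_size)   # pretend encoder output (batch, seq_len, H)
cls_output = hidden[:, 0]                  # the [CLS] token sits at position 0
nsp_logits = nsp_head(cls_output)          # shape: (1, 2)
nsp_prob = F.softmax(nsp_logits, dim=-1)   # probability of IsNextSequence

is_next = torch.tensor([1])                # fake label for this sentence pair
nsp_loss = F.cross_entropy(nsp_logits, is_next)

mlm_loss = torch.tensor(2.3)               # stand-in for the Masked LM loss
total_loss = mlm_loss + nsp_loss           # the two objectives are minimized jointly
```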


How to use BERT?

BERT can be used for a wide variety of NLP tasks by adding just one layer on top of the core model. For example:

  1. For classification tasks such as sentiment analysis, simply add a classification layer on top of the Transformer output.
  2. In question answering tasks (e.g. SQuAD v1.1), the system receives a question about a text sequence and has to mark the answer within the sequence. A Q&A model can be trained with BERT by learning two extra vectors that mark the beginning and the end of the answer.
  3. In named entity recognition (NER), the system receives a text sequence and has to tag the various types of entities (person, organization, date, etc.) in the text. With BERT, the output vector of each token can be fed into a classification layer that predicts the NER label.

For fine-tuning, most of the hyperparameters can stay the same as in BERT pre-training; the paper gives specific guidance (Section 3.5) on the hyperparameters that do need tuning. A minimal sketch follows.
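Here is a minimal fine-tuning sketch for a classification task such as sentiment analysis, again assuming the Hugging Face transformers library rather than the google-research/bert repo; the learning rate, labels, and example sentences are illustrative, not the paper's recommended settings.

```python
# One training step of sentence-classification fine-tuning (illustrative only).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # adds a classification layer on top of BERT

texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)   # small learning rate, typical for fine-tuning
model.train()
outputs = model(**batch, labels=labels)          # returns the classification loss
outputs.loss.backward()
optimizer.step()
```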


Learning resources:
https://arxiv.org/pdf/1810.04805.pdf
https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-4a2a189bc155





Author: will not stop snails
Link: https://www.jianshu.com/p/d110d0c13063
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, please contact the author for authorization and cite the source.
