Summary of "A Book to Understand BERT (Practice)"

Compiled in early December 2020, dedicated to those who refuse to be ordinary

For more machine learning knowledge, please check:  https://blog.csdn.net/weixin_45316122/article/details/110865833

 

Note: this article distills the key points of "A Book to Understand BERT (Practice)"

 

 

Table of Contents

1. What is BERT?

2. BERT installation

3. Pre-trained models

4. Running Fine-Tuning

5. The BertModel class

6. Pre-training yourself

7. Performance test


 

1. What is BERT?

 

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Key point: the first unsupervised, deeply bidirectional system for pre-training NLP.

Pre-training methods can be roughly divided into context-free methods and contextual methods.

Contextual methods can be further divided into unidirectional and bidirectional ones. Context-free (bag-of-words style) models such as NNLM, Skip-Gram and GloVe are shallow, single-layer models that cannot take context into account;

LSTM- and Transformer-based models are typical deep networks that can model context.
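To make the distinction concrete, here is a small conceptual sketch (not tied to any particular library; the `encoder` argument stands in for a deep bidirectional model such as BERT). A context-free model always returns the same vector for a word such as "bank", while a contextual model's output for that word depends on the rest of the sentence:

  import numpy as np

  vocab = {"[PAD]": 0, "bank": 1, "river": 2, "money": 3}
  static_table = np.random.randn(len(vocab), 4)  # a word2vec/GloVe-style lookup table

  def context_free_vector(word):
      # "bank" gets the same vector whether the sentence is about rivers or money.
      return static_table[vocab[word]]

  def contextual_vector(token_ids, position, encoder):
      # A deep bidirectional encoder sees the whole sequence, so the vector
      # at `position` also depends on the surrounding tokens.
      return encoder(token_ids)[position]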


There are two stages to using BERT: pre-training and fine-tuning.

 

2. BERT installation

3. Pre-trained models

Google provides several pre-trained models (checkpoints).

Currently there are three groups of models: English, Chinese, and multilingual.

"Uncased" means the text is lowercased during preprocessing, while "cased" preserves the original case.

Choose the version that matches the language you are working with. For Chinese, the Whole Word Masking (WWM) models released by Harbin Institute of Technology are a good choice.
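As a quick sanity check of a downloaded checkpoint, you can load its vocabulary with the tokenization module from the google-research/bert repo. A minimal sketch, assuming the repo is on PYTHONPATH and $BERT_BASE_DIR points at the unpacked model directory (set do_lower_case=True for uncased models, False for cased ones):

  import os
  import tokenization  # from the google-research/bert repo

  vocab_file = os.path.join(os.environ["BERT_BASE_DIR"], "vocab.txt")
  tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

  tokens = tokenizer.tokenize("BERT handles out-of-vocabulary words with WordPiece.")
  ids = tokenizer.convert_tokens_to_ids(tokens)
  print(tokens)  # WordPiece tokens; unknown words are split into sub-word pieces
  print(ids)     # the corresponding vocabulary ids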

 

4. Running Fine-Tuning

In most cases there is no need to re-run pre-training. What we have to do is Fine-Tuning for the specific task at hand, so we introduce Fine-Tuning first.

Run the following command to perform Fine-Tuning:

python run_classifier.py \
	--task_name=MRPC \
	--do_train=true \
	--do_eval=true \
	--data_dir=$GLUE_DIR/MRPC \
	--vocab_file=$BERT_BASE_DIR/vocab.txt \
	--bert_config_file=$BERT_BASE_DIR/bert_config.json \
	--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
	--max_seq_length=128 \
	--train_batch_size=8 \
	--learning_rate=2e-5 \
	--num_train_epochs=3.0 \
	--output_dir=/tmp/mrpc_output/

Here is a brief explanation of the parameters; the reader can understand them in more detail in the code reading that follows.

  • task_name: the name of the task; here we Fine-Tune on the MRPC task
  • do_train: whether to train; True here
  • do_eval: whether to evaluate after training; True here
  • data_dir: the training data directory; no need to modify it once the environment variable is configured, otherwise fill in the absolute path
  • vocab_file: the BERT model vocabulary
  • bert_config_file: the BERT model configuration file
  • init_checkpoint: the checkpoint used to initialize Fine-Tuning
  • max_seq_length: the maximum length of the token sequence, 128 here
  • train_batch_size: the batch size; on an ordinary 8GB GPU the maximum batch size is about 8, and anything larger will run out of memory (OOM)
  • learning_rate: the learning rate, 2e-5 here
  • num_train_epochs: the number of training epochs, adjusted according to the task
  • output_dir: the directory where the trained model is stored

Download the pre-trained model, set the parameters above (the command assumes $BERT_BASE_DIR points to the unpacked checkpoint directory and $GLUE_DIR to the GLUE data), and run the command; that is the whole fine-tuning process.
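Under the hood, run_classifier.py turns each row of the MRPC data into an InputExample holding two sentences and a "0"/"1" label. A rough sketch (the sentence pair below is made up for illustration, not a real MRPC row):

  from run_classifier import InputExample  # assumes the google-research/bert repo is on PYTHONPATH

  example = InputExample(
      guid="train-1",
      text_a="The company said the deal was worth about $100 million.",
      text_b="The deal is valued at roughly $100 million, the company said.",
      label="1",  # "1" = paraphrase, "0" = not a paraphrase
  )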


https://zhuanlan.zhihu.com/p/112235454 How to use BERT quickly?

 

 

5. The BertModel class

  # Assumes the google-research/bert repo is on PYTHONPATH
  import tensorflow as tf
  import modeling

  # Assume the input has already been tokenized into WordPiece ids.
  # The input shape is [2, 3]: batch=2, max_seq_length=3.
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  # The first example has an actual length of 3, the second a length of 2.
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  # In the first example, the first two of the 3 tokens belong to sentence 1 and the third to sentence 2.
  # In the second example, the first token belongs to sentence 1 and the second to sentence 2 (the third is padding).
  token_type_ids = tf.constant([[0, 0, 1], [0, 1, 0]])

  # Create a BertConfig: vocabulary size 32000, Transformer hidden size 512,
  # 8 Transformer blocks, 8 attention heads per block (the hidden size must be
  # divisible by the number of heads), feed-forward hidden size 1024.
  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
		  num_hidden_layers=8, num_attention_heads=8, intermediate_size=1024)

  # Create the BertModel
  model = modeling.BertModel(config=config, is_training=True,
		  input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  # label_embeddings transforms the 512-dimensional hidden vector into logits
  label_embeddings = tf.get_variable(...)  # shape would be [hidden_size, num_labels]
  # Take the last-layer output at [CLS] as the sentence embedding (encoding)
  pooled_output = model.get_pooled_output()
  # Compute the logits
  logits = tf.matmul(pooled_output, label_embeddings)

The usage is documented in the comments just below the BertModel class in modeling.py.

transformer_model and attention_layer implement the Transformer encoder and self-attention respectively.
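Besides the pooled [CLS] vector used above, BertModel also exposes per-token outputs, which is what token-level tasks such as NER need. A minimal sketch, assuming `model` is the BertModel created above:

  # One vector per token: shape [batch_size, max_seq_length, hidden_size]
  sequence_output = model.get_sequence_output()
  # The vector at position 0 is the [CLS] token, before the pooling layer
  first_token_tensor = sequence_output[:, 0, :]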

6. Pre-training yourself

Although Google provides pre-trained models, we may still need to run pre-training ourselves, using the Masked LM and Next Sentence Prediction objectives.

If we have domain-specific data, we can also run pre-training on it, using the checkpoint provided by Google as the initial value.

 

In create_pretraining_data.py, the create_training_instances function builds the training instances.
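The input to create_pretraining_data.py is a plain text corpus with one sentence per line and a blank line between documents. A minimal sketch that writes such a corpus (the file name and sentences are just illustrative):

  # Each inner list is one document; sentences go one per line,
  # and documents are separated by a blank line.
  documents = [
      ["This is the first sentence of document one.",
       "This is the second sentence of document one."],
      ["Document two starts here.",
       "And it ends here."],
  ]
  with open("corpus.txt", "w") as f:
      for doc in documents:
          for sentence in doc:
              f.write(sentence + "\n")
          f.write("\n")

create_pretraining_data.py converts this text into TFRecord files (applying the masking and next-sentence sampling), which run_pretraining.py then consumes.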

7. Performance test

https://blog.csdn.net/c9Yv2cf9I06K2A9E/article/details/84351397   Playing with Google BERT sentence vectors and word vectors in two lines of code
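The linked article is about serving BERT sentence vectors; a popular tool for this is bert-as-service, which exposes a client/server API. If that is the tool being used, the client side looks roughly like the sketch below (assuming bert-serving-server has already been started with a pre-trained checkpoint):

  from bert_serving.client import BertClient

  bc = BertClient()  # connects to a locally running bert-serving-server
  vectors = bc.encode(["First sentence.", "Second sentence."])
  print(vectors.shape)  # one fixed-size vector per input sentence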


Essential reading: https://blog.csdn.net/jiaowoshouzi/article/details/89388794

 

Practical use of BERT: https://www.cnblogs.com/jiangxinyang/p/10241243.html (worth reading in full when time allows)

 

Tip: the following two repositories are enough for learning BERT and ALBERT:

https://github.com/jiangxinyang227/bert-for-task/tree/master/albert_task/ner_task

https://github.com/jiangxinyang227/bert-for-task    


Origin blog.csdn.net/weixin_45316122/article/details/110532563