BERT implementation methods

bert-as-service

The BERT model is an NLP pre-training technique. This article does not cover the theory behind BERT; it focuses on how to quickly use a pre-trained BERT model to generate word vectors for downstream tasks.

Google has published the TensorFlow version of the pre-trained models and code, which can be used to generate word vectors directly, but there is an easier way: calling the packaged library bert-as-service.

Use bert-as-service to generate word vectors

bert-as-service is a BERT service open-sourced by Tencent AI Lab. It lets users obtain BERT encodings by calling a service, without worrying about BERT's implementation details. bert-as-service is split into a client and a server; the service can be called from Python code or accessed over HTTP.

Installation
Install both packages with pip; the client and server can be installed on different machines:

pip install bert-serving-server # server
pip install bert-serving-client # client, independent of the server

The server requires Python >= 3.5 and TensorFlow >= 1.10.

The client runs on Python 2 or Python 3.

Download the pre-trained model

Depending on the type and scale of the NLP task, Google provides several pre-trained models to choose from:

  • BERT-Base, Chinese: Simplified and Traditional Chinese, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased: Multilingual (104 types), 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Uncased: English, case-insensitive (all lowercase), 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Cased: English, case-sensitive, 12-layer, 768-hidden, 12-heads, 110M parameters

You can also use the Harbin Institute of Technology release, which performs better on Chinese:

  • Chinese-BERT-wwm
    The list above covers several commonly used pre-trained models; more are available in the Google BERT repository.

After unzipping the downloaded .zip file, there will be five files:

  1. The TensorFlow checkpoint (bert_model.ckpt.*) contains the weights of the pre-trained model; it consists of three files
  2. The vocabulary file (vocab.txt) records the mapping between tokens and ids
  3. The configuration file (bert_config.json) records the hyperparameters of the model
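Before starting the server it can be handy to verify the unzipped directory actually contains these files. A minimal sketch (the model_dir path is just an example, not part of the original article):

```python
import os

# Hypothetical path to an unzipped BERT-Base checkpoint directory.
model_dir = "/tmp/english_L-12_H-768_A-12"

# The five files a BERT-Base .zip typically unpacks to.
expected = {"bert_config.json", "vocab.txt",
            "bert_model.ckpt.index", "bert_model.ckpt.meta",
            "bert_model.ckpt.data-00000-of-00001"}

present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
missing = expected - present
print("missing files:", missing or "none")
```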

Start the BERT service

Use the bert-serving-start command to start the service:

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=2

Here, -model_dir is the path to the pre-trained model, and -num_worker is the number of worker processes, i.e. the maximum number of concurrent requests the server can handle at the same time.

If startup succeeds, the server logs a ready message such as "all set, ready to serve request!".

Get sentence vector on the client

You can simply use the following code to obtain the vector representation of the corpus:

from bert_serving.client import BertClient
bc = BertClient()
doc_vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])

doc_vecs is a numpy.ndarray in which each row is a fixed-length sentence vector (768 dimensions for BERT-Base). The maximum sequence length can be set with the max_seq_len parameter when starting the service; sentences that are too long are truncated from the right end.
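Once the sentence vectors are back, they can be compared directly, for example with cosine similarity. A minimal NumPy sketch; a random (n, 768) array stands in here for the doc_vecs that bc.encode() would return, so the example runs without a BERT server:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for bc.encode([...]): shape (n_sentences, 768).
doc_vecs = np.random.rand(3, 768).astype(np.float32)

# Similarity between the first and second "sentence".
sim = cosine_similarity(doc_vecs[0], doc_vecs[1])
print(sim)
```

With real BERT vectors, higher cosine similarity indicates more semantically similar sentences.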

Another feature of bert-as-service is that it can encode a pair of sentences, using ||| as the separator between them, for example:


bc.encode(['First do it ||| then do it right'])

Get word vector

Set the parameter pooling_strategy to None when starting the service:

bert-serving-start -pooling_strategy NONE -model_dir /tmp/english_L-12_H-768_A-12/

The service then returns a matrix of embeddings for every token in each sentence:

bc = BertClient()
vec = bc.encode(['hey you', 'whats up?'])

vec         # [2, 25, 768]: 2 sentences, max_seq_len 25, hidden size 768
vec[0]      # [25, 768], token embeddings for `hey you`
vec[0][0]   # [768], embedding for `[CLS]`
vec[0][1]   # [768], embedding for `hey`
vec[0][2]   # [768], embedding for `you`
vec[0][3]   # [768], embedding for `[SEP]`
vec[0][4]   # [768], embedding for a padding symbol
vec[0][25]  # IndexError: out of bounds
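Since the token-level output includes [CLS], [SEP], and padding positions, downstream code usually slices out only the real word embeddings. A sketch with a random array standing in for one row of the bc.encode() output (so it runs without a server); the token count for 'hey you' follows the layout shown above:

```python
import numpy as np

# Stand-in for vec[0]: shape (max_seq_len, 768), here max_seq_len = 25.
sent_vec = np.random.rand(25, 768).astype(np.float32)

# For 'hey you' the real tokens are [CLS] hey you [SEP] at positions 0..3;
# everything after position 3 is padding.
n_real_tokens = 4

# Drop [CLS] (position 0) and [SEP] (last real position), keep the words.
word_vecs = sent_vec[1:n_real_tokens - 1]
print(word_vecs.shape)  # one 768-dim vector each for 'hey' and 'you'
```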

Invoke the BERT service remotely
You can invoke the BERT service of another machine from one machine:

# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx')  # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])

In this example, only the client needs to be installed on the local machine: pip install -U bert-serving-client

Other notes

Configuration requirements

The BERT model has relatively high memory requirements. If startup hangs at "load graph from model_dir", set num_worker to 1 or increase the machine's memory.

Processing Chinese: is word segmentation needed in advance?

When computing Chinese vectors, you can feed in the whole sentence without segmenting it first. Chinese-BERT processes the corpus character by character, so for Chinese input the output is a sequence of character vectors.

For example, when the user enters:


bc.encode(['hey you', 'whats up?', '你好么?', '我 还 可以'])

the actual input to the BERT model is:

tokens: [CLS] hey you [SEP]
input_ids: 101 13153 8357 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] what ##s up ? [SEP]
input_ids: 101 9100 8118 8644 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] 你 好 么 ? [SEP]
input_ids: 101 872 1962 720 8043 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tokens: [CLS] 我 还 可 以 [SEP]
input_ids: 101 2769 6820 1377 809 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
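The character-level behavior shown above can be mimicked by splitting a Chinese string into individual characters, which is essentially what the BERT tokenizer does for CJK text (a rough sketch; real BERT tokenization also handles punctuation and mixed ASCII runs, which are omitted here):

```python
def char_tokenize(text):
    """Rough sketch of BERT's CJK handling: each Chinese character
    becomes its own token; whitespace is discarded."""
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("我 还 可以"))  # ['我', '还', '可', '以']
```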

What does ## mean in English tokens?

When a word is not in the vocabulary, the WordPiece tokenizer splits it greedily into the longest matching subwords; every piece after the first is prefixed with ##. For example:

input = "unaffable"
tokenizer_output = ["un", "##aff", "##able"]
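This greedy longest-match-first splitting can be sketched in a few lines of Python (a simplified illustration of the WordPiece idea, not the actual BERT tokenizer code; a real vocabulary has ~30,000 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece.
    Subwords after the first are prefixed with '##'."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Shrink the window from the right until a vocab entry matches.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no subword matched at all
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```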

Reference

https://github.com/google-research/bert
https://github.com/hanxiao/bert-as-service
Several implementation methods: https://zhuanlan.zhihu.com/p/112235454
Examples: https://spaces.ac.cn/archives/6736
keras_bert and bert4keras: https://www.cnblogs.com/dogecheng/p/11617940.html

Origin blog.csdn.net/lockhou/article/details/113744260