A simple application of the BERT pre-trained model (Chinese sentence vector similarity analysis)

Table of Contents

1. A simple understanding of BERT

2. Google BERT and Chinese model download

1. Google BERT source code download

2. Download the bert-as-service framework

3. Chinese pre-training model download

3. BERT generates Chinese sentence vectors

1. Start the BERT service

2. Chinese sentence vector encoding

4. Cosine similarity calculation

5. Complete experiment code


1. A simple understanding of BERT

The Google BERT pre-trained model has been widely used in deep learning and NLP, and achieves good results on text classification tasks. Compared with traditional word embeddings such as word2vec and GloVe, BERT pre-training generally gives better results.

This article will not analyze BERT's principles or advanced applications in depth; instead, it starts from scratch and aims at a simple, beginner-level understanding and application of BERT, using the bert-as-service framework (a client-server architecture).

2. Google BERT and Chinese model download

1. Google BERT source code download

The complete source code download address of Google BERT: https://github.com/google-research/bert

The official explanation of BERT:

BERT is a method of pre-training language representations, which means we train a general-purpose "language understanding" model on a large text corpus (such as Wikipedia) and then use that model for the downstream NLP tasks we care about (such as question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training natural language processing.

The source code itself can be studied more closely later on; at the introductory stage it is easier to work with the framework.

2. Download the bert-as-service framework

pip install bert-serving-server   #server
pip install bert-serving-client   #client

3. Chinese pre-training model download

Google download address: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

Harbin Institute of Technology download link: https://pan.iflytek.com/link/A2483AD206EF85FD91569B498A3C3879 (password: 07Xj)

After decompression, the directory contains the BERT configuration file, the pre-trained model checkpoint, and the vocabulary file, roughly as shown below.
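For reference, both the Google model and the HIT whole-word-masking model typically unpack to a layout like the following (file names taken from the standard Google BERT release; the exact top-level folder name depends on which archive you downloaded):

chinese_L-12_H-768_A-12/
    bert_config.json                      # model configuration
    bert_model.ckpt.data-00000-of-00001   # pre-trained model weights
    bert_model.ckpt.index
    bert_model.ckpt.meta
    vocab.txt                             # vocabulary list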

3. BERT generates Chinese sentence vectors

1. Start the BERT service

bert-serving-start -model_dir D:\PyCharm_Project\bert-use-demo-master\chinese_bert_chinese_wwm_L-12_H-768_A-12 -max_batch_size 10 -max_seq_len 20 -num_worker 1

The model directory is the Chinese pre-trained model decompressed in the previous step; the other parameters can be adjusted as needed.

When the service starts successfully, the terminal reports that the worker is ready to serve requests.
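If the client runs on a different machine from the server, BertClient can also be pointed at the server explicitly. A minimal sketch, assuming the default bert-as-service ports (5555 for sending sentences, 5556 for receiving vectors); adjust ip and the ports to your own setup:

from bert_serving.client import BertClient

# connect to a (possibly remote) bert-serving server; these values are the assumed defaults
bc = BertClient(ip='localhost', port=5555, port_out=5556)
print(bc.encode(['你好']).shape)  # quick sanity check that the service responds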

2. Chinese sentence vector encoding

from bert_serving.client import BertClient
import numpy as np


def main():
    # connect to the BERT service started in the previous step
    bc = BertClient()
    doc_vecs = bc.encode(['今天天空很蓝,阳光明媚', '今天天气好晴朗', '现在天气如何', '自然语言处理', '机器学习任务'])

    print(doc_vecs)


if __name__ == '__main__':
    main()

The vectors obtained for the sentences are printed as:

[[ 0.9737132  -0.0289975   0.23281255 ...  0.21432212 -0.1451838
  -0.26555032]
 [ 0.57072604 -0.2532929   0.13397914 ...  0.12190636  0.35531974
  -0.2660934 ]
 [ 0.33702925 -0.27623484  0.33704653 ... -0.14090805  0.48694345
   0.13270345]
 [ 0.00974528 -0.04629223  0.48822984 ... -0.24558026  0.09809375
  -0.08697749]
 [ 0.29680184  0.13963464  0.30706868 ...  0.05395972 -0.4393276
   0.17769393]] 
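Each row above is the vector for one input sentence. With a BERT-base model (the H-768 in the model name), every sentence vector has 768 dimensions, so the five sentences form a 5 x 768 array. Continuing from the encoding snippet above, a quick check:

print(doc_vecs.shape)  # expected: (5, 768), i.e. five sentences, each a 768-dimensional vector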

4. Cosine similarity calculation

def cos_similar(sen_a_vec, sen_b_vec):
    '''
    Compute the cosine similarity of two sentence vectors
    '''
    vector_a = np.mat(sen_a_vec)
    vector_b = np.mat(sen_b_vec)
    num = float(vector_a * vector_b.T)
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    return cos
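The function above uses np.mat, which works but is a legacy NumPy interface. As a minimal alternative sketch with plain arrays, the same cosine similarity, dot(a, b) / (||a|| * ||b||), can be written as follows (the name cos_similar_np is just illustrative):

import numpy as np


def cos_similar_np(sen_a_vec, sen_b_vec):
    '''Cosine similarity of two sentence vectors, using plain NumPy arrays.'''
    a = np.asarray(sen_a_vec, dtype=np.float32)
    b = np.asarray(sen_b_vec, dtype=np.float32)
    # dot product divided by the product of the two vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))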

Experimental results:

Sentences:

'今天天空很蓝,阳光明媚', '今天天气好晴朗'

Similarity: 0.9508827722696014

Sentences:

'自然语言处理', '机器学习任务'

Similarity: 0.9187518514435784

Sentences:

'今天天空很蓝,阳光明媚', '机器学习任务'

Similarity: 0.7653104788070156

5. Complete experiment code

from bert_serving.client import BertClient
import numpy as np


def cos_similar(sen_a_vec, sen_b_vec):
    '''
    Compute the cosine similarity of two sentence vectors
    '''
    vector_a = np.mat(sen_a_vec)
    vector_b = np.mat(sen_b_vec)
    num = float(vector_a * vector_b.T)
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    return cos


def main():
    # connect to the running BERT service and encode the sentences
    bc = BertClient()
    doc_vecs = bc.encode(['今天天空很蓝,阳光明媚', '今天天气好晴朗', '现在天气如何', '自然语言处理', '机器学习任务'])

    print(doc_vecs)
    # similarity between the first and the fifth sentence
    similarity = cos_similar(doc_vecs[0], doc_vecs[4])
    print(similarity)


if __name__ == '__main__':
    main()
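As a small extension of the complete script, the sketch below compares every pair of the five sentences, which makes the semantic grouping easier to see. It assumes the BERT service started earlier is still running; the pairwise loop is just illustrative:

from itertools import combinations

from bert_serving.client import BertClient
import numpy as np


def cos_similar(sen_a_vec, sen_b_vec):
    '''Cosine similarity of two sentence vectors.'''
    a = np.asarray(sen_a_vec)
    b = np.asarray(sen_b_vec)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def main():
    sentences = ['今天天空很蓝,阳光明媚', '今天天气好晴朗', '现在天气如何', '自然语言处理', '机器学习任务']
    bc = BertClient()
    doc_vecs = bc.encode(sentences)
    # print the cosine similarity of every sentence pair
    for i, j in combinations(range(len(sentences)), 2):
        print(sentences[i], '|', sentences[j], '->', cos_similar(doc_vecs[i], doc_vecs[j]))


if __name__ == '__main__':
    main()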

 

This article briefly introduced a basic application of BERT: using the bert-as-service framework to encode Chinese sentences into sentence vectors and then comparing their semantics with cosine similarity.

As you can see, the basic use of BERT is fairly simple. This article did not dig into BERT's principles or advanced applications; it started from scratch, aimed at a beginner-level understanding and application of BERT through the bert-as-service framework (client-server architecture), and can serve as groundwork for deeper study and research later.

If you found this helpful, please like, bookmark, and follow; feel free to leave a comment with any questions, and let's exchange ideas and learn together!

My CSDN blog: https://blog.csdn.net/Charzous/article/details/113824876
