Table of Contents
1. A simple understanding of BERT
2. Google BERT and Chinese model download
  1. Google BERT source code download
  2. Download the bert-as-service framework
  3. Chinese pre-training model download
3. BERT generates Chinese sentence vectors
  1. Start the BERT service
  2. Chinese sentence vector encoding
4. Cosine similarity calculation
5. Complete experiment code
1. A simple understanding of BERT
The Google BERT pre-trained model is widely used in deep learning and NLP, and achieves strong results on text classification tasks. Compared with traditional word embeddings such as word2vec and GloVe, BERT pre-training generally gives better results.
This article does not analyze BERT's principles or advanced applications in depth. Instead, it starts from scratch, aiming at a simple understanding and application of BERT for beginners, using the bert-as-service framework (a client-server architecture).
2. Google BERT and Chinese model download
1. Google BERT source code download
The complete source code download address of Google BERT: https://github.com/google-research/bert
The official explanation of BERT:
BERT is a method of pre-training language representations: a general "language understanding" model is trained on a large text corpus (such as Wikipedia) and then used for downstream NLP tasks we care about (such as question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training natural language processing.
The source code can be studied in more depth later on; at the introductory stage it is easier to work with the framework.
2. Download the bert-as-service framework
pip install bert-serving-server #server
pip install bert-serving-client #client
3. Chinese pre-training model download
Google download address: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
Harbin Institute of Technology download link: https://pan.iflytek.com/link/A2483AD206EF85FD91569B498A3C3879 (password: 07Xj)
The directory after decompression contains the BERT configuration file, the pre-trained model checkpoint, and the vocabulary file.
3. BERT generates Chinese sentence vectors
1. Start the BERT service
bert-serving-start -model_dir D:\PyCharm_Project\bert-use-demo-master\chinese_bert_chinese_wwm_L-12_H-768_A-12 -max_batch_size 10 -max_seq_len 20 -num_worker 1
The model directory points to the Chinese pre-trained model decompressed in the previous step. The other parameters can be tuned as needed: -max_seq_len caps the token length of each sentence, -max_batch_size limits the number of sentences per batch, and -num_worker sets the number of worker processes. When the service starts successfully, the console reports that it is ready to serve requests.
2. Chinese sentence vector encoding
from bert_serving.client import BertClient


def main():
    bc = BertClient()
    doc_vecs = bc.encode(['今天天空很蓝,阳光明媚', '今天天气好晴朗', '现在天气如何', '自然语言处理', '机器学习任务'])
    print(doc_vecs)


if __name__ == '__main__':
    main()
The vector obtained for each sentence is printed as:
[[ 0.9737132 -0.0289975 0.23281255 ... 0.21432212 -0.1451838
-0.26555032]
[ 0.57072604 -0.2532929 0.13397914 ... 0.12190636 0.35531974
-0.2660934 ]
[ 0.33702925 -0.27623484 0.33704653 ... -0.14090805 0.48694345
0.13270345]
[ 0.00974528 -0.04629223 0.48822984 ... -0.24558026 0.09809375
-0.08697749]
[ 0.29680184 0.13963464 0.30706868 ... 0.05395972 -0.4393276
0.17769393]]
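Each sentence is mapped to a fixed-length vector whose dimensionality matches the model's hidden size (768 for the chinese_L-12_H-768_A-12 model), so encoding 5 sentences yields a (5, 768) NumPy array. A minimal sketch of that shape contract, using random vectors as a stand-in for a running BERT service:

```python
import numpy as np

# Stand-in for bc.encode([...]) on 5 sentences: one 768-dim vector per sentence.
doc_vecs = np.random.rand(5, 768).astype(np.float32)

print(doc_vecs.shape)     # (5, 768): 5 sentences, hidden size 768
print(doc_vecs[0].shape)  # (768,): the vector for the first sentence
```

Any downstream computation, such as the cosine similarity below, only relies on this shape, so it can be prototyped without the service running.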
4. Cosine similarity calculation
import numpy as np


def cos_similar(sen_a_vec, sen_b_vec):
    '''Compute the cosine similarity between two sentence vectors.'''
    num = float(np.dot(sen_a_vec, sen_b_vec))
    denom = np.linalg.norm(sen_a_vec) * np.linalg.norm(sen_b_vec)
    return num / denom
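As a quick sanity check of the formula cos(a, b) = a·b / (|a||b|): identical directions score 1.0, orthogonal vectors score 0.0, and a 45-degree angle gives about 0.7071. Verified on small hand-made vectors:

```python
import numpy as np


def cos_similar(sen_a_vec, sen_b_vec):
    '''Cosine similarity: dot product divided by the product of the norms.'''
    num = float(np.dot(sen_a_vec, sen_b_vec))
    denom = np.linalg.norm(sen_a_vec) * np.linalg.norm(sen_b_vec)
    return num / denom


print(cos_similar([1.0, 0.0], [1.0, 0.0]))            # 1.0 (same direction)
print(cos_similar([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
print(round(cos_similar([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071 (45 degrees)
```

The 768-dimensional BERT sentence vectors behave the same way: the closer two sentences are in meaning, the closer their cosine similarity is to 1.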
Experimental results:
Sentences: '今天天空很蓝,阳光明媚', '今天天气好晴朗'
Similarity: 0.9508827722696014
Sentences: '自然语言处理', '机器学习任务'
Similarity: 0.9187518514435784
Sentences: '今天天空很蓝,阳光明媚', '机器学习任务'
Similarity: 0.7653104788070156
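The pattern above, where same-topic pairs score higher than cross-topic pairs, can be checked for a whole batch at once by computing the pairwise cosine-similarity matrix: normalize each row to unit length, then one matrix product gives every pairwise cosine. A sketch with small hand-made vectors standing in for real BERT encodings (rows 0 and 1 deliberately point in similar directions):

```python
import numpy as np

# Toy stand-ins for sentence vectors; rows 0 and 1 are nearly parallel,
# row 2 points in a different direction.
vecs = np.array([[1.0, 0.2, 0.0],
                 [0.9, 0.3, 0.1],
                 [0.0, 0.1, 1.0]])

# Normalize each row, then the matrix product yields all pairwise cosines.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

print(np.round(sim, 3))  # diagonal is 1.0; sim[0, 1] is high, sim[0, 2] is low
```

With real data, replacing `vecs` by the (5, 768) array from `bc.encode(...)` gives the similarities of all five sentences in one step instead of calling `cos_similar` pair by pair.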
5. Complete experiment code
from bert_serving.client import BertClient
import numpy as np


def cos_similar(sen_a_vec, sen_b_vec):
    '''Compute the cosine similarity between two sentence vectors.'''
    num = float(np.dot(sen_a_vec, sen_b_vec))
    denom = np.linalg.norm(sen_a_vec) * np.linalg.norm(sen_b_vec)
    return num / denom


def main():
    bc = BertClient()
    doc_vecs = bc.encode(['今天天空很蓝,阳光明媚', '今天天气好晴朗', '现在天气如何', '自然语言处理', '机器学习任务'])
    print(doc_vecs)
    similarity = cos_similar(doc_vecs[0], doc_vecs[4])
    print(similarity)


if __name__ == '__main__':
    main()
This article briefly introduced the basic application of BERT: using the bert-as-service framework to encode Chinese sentences into sentence vectors, which can then be compared to analyze sentence semantics.
As you can see, the basic use of BERT is fairly simple. This article did not analyze BERT's principles or advanced applications in depth; instead it started from scratch, positioned as a simple understanding and application of BERT for beginners using the bert-as-service framework (client-server architecture). It can also serve as groundwork for deeper study and research later.
If you found this helpful, please like, bookmark, and follow; if you have any questions, leave a comment and let's discuss and learn together!
My CSDN blog: https://blog.csdn.net/Charzous/article/details/113824876