Several possible applications of BERT

  BERT is a model that Google released in November 2018. It represents a new approach to language pre-training: a general "language understanding" model is first trained on a large text corpus (such as Wikipedia), and the pre-trained model is then applied to whichever downstream NLP task you want to solve. On release it set the whole NLP community abuzz, achieving excellent results on 11 major NLP tasks and becoming one of the most eye-catching models in the field. Put simply, after being trained on a large amount of text (unsupervised), BERT can produce a vector for each word (or Chinese character) that carries a certain amount of semantic information; this prior knowledge is what makes BERT so effective on NLP tasks.
  In the article "Using bert-serving-server to build a BERT word-vector service (1)", the author concisely described how to use bert-serving-server to obtain word vectors for Chinese characters, which greatly lowers the barrier for ordinary practitioners who want to use BERT.
  Combining this with my recent work experience and thinking, I try to give a few possible applications of BERT, as follows:

  • Basic NLP tasks
  • Finding similar words
  • Extracting text entities
  • Entity alignment in question answering

Given the author's limited knowledge and the time constraints on writing, there are bound to be shortcomings in this article; criticism and corrections from readers are very welcome!

Basic NLP tasks

  In the six months since BERT was published, it has become an indispensable tool in deep-learning models for NLP, usually loaded as the embedding layer of a model. Due to space constraints, this article will not walk through the BERT project itself; instead, note that BERT is already widely used for basic NLP tasks, its shadow can be seen everywhere in open-source GitHub projects, and the authors of those projects provide very detailed, easy-to-use code.
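As a minimal sketch of that usage pattern (not taken from any particular project): the sentence vectors returned by bert-serving can be fed directly into any downstream classifier. The toy sentences, labels, and the choice of scikit-learn's LogisticRegression below are illustrative assumptions only; the sketch also assumes a bert-serving-server with a Chinese BERT model is running locally, as in the rest of this article.

# -*- coding:utf-8 -*-
# Sketch: use bert-serving sentence vectors as features for a downstream classifier.
from bert_serving.client import BertClient
from sklearn.linear_model import LogisticRegression

bc = BertClient(ip="127.0.0.1")

# Toy sentiment data for illustration (1 = positive, 0 = negative)
train_sentences = ["这部电影很好看", "质量太差了", "非常满意", "完全不推荐"]
train_labels = [1, 0, 1, 0]

# One batch call returns an (n, 768) array for a BERT-Base model
train_vectors = bc.encode(train_sentences)

clf = LogisticRegression()
clf.fit(train_vectors, train_labels)

print(clf.predict(bc.encode(["还不错,挺喜欢的"])))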


  Before moving on to the three specific applications below, let us first look at how the demo project used in this article is organized, as follows:

Here, bert_client_lmj.py is the script that calls the BERT word-vector service; for details, refer to the article "Using bert-serving-server to build a BERT word-vector service (1)". It assumes a bert-serving-server instance is already running locally. The complete Python code is as follows:

# -*- coding:utf-8 -*-
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity

class Encoding(object):
    def __init__(self):
        # Connect to the local bert-serving-server instance
        self.server_ip = "127.0.0.1"
        self.bert_client = BertClient(ip=self.server_ip)

    def encode(self, query):
        # Return the vector of a single piece of text, shape (1, 768) for BERT-Base
        tensor = self.bert_client.encode([query])
        return tensor

    def query_similarity(self, query_list):
        # Return the cosine similarity between the first two texts in query_list
        tensors = self.bert_client.encode(query_list)
        return cosine_similarity(tensors)[0][1]

if __name__ == "__main__":
    ec = Encoding()
    print(ec.encode("中国").shape)
    print(ec.encode("美国").shape)
    print("中国和美国的向量相似度:", ec.query_similarity(["中国", "美国"]))

Finding similar words

  Using word vectors, we can find the words in an article that are closest to a specified word. The approach: segment the article into words, compute the similarity between each word and the specified word, and finally output the words sorted by similarity. Our example article is Lao She's essay "Growing Flowers" (养花); doc.txt stores the Chinese original, which reads roughly as follows in translation:

I love flowers, so I also love growing them. I have not become an expert gardener, though, because I have no time for study and experiment. I simply treat growing flowers as one of life's pleasures; I don't care whether the blooms are large or small, good or bad, as long as they bloom I am happy. In summer my small courtyard is full of flowers and plants, and the kittens have to play up on the roof, for there is no playground left for them on the ground.
Although there are many flowers, there are no rare or exotic ones. Precious flowers are not easy to keep alive, and watching a good flower sicken and die is a sad affair. Beijing's climate is not very good for growing flowers: winters are cold, springs are windy, summers are either drought or drenching rain, and autumn, the best season, can suddenly bring frost. In such a climate I do not have the skill to keep fine southern flowers alive, so I only grow hardy varieties that can struggle for themselves.
Still, even though they can struggle for themselves, if I ignored them and left them to their own fate, most of them would die. I have to look after them every day, caring about them as I would good friends. Over time I have picked up some of the tricks: some prefer shade, so don't put them in the sun; some prefer dryness, so don't water them too much. Having learned the tricks, keeping the plants alive, and seeing them live and bloom for three to five years, how interesting that is! It is no exaggeration to say that this is knowledge, and gaining more knowledge is never a bad thing.
Don't I have trouble with my legs? Not only is walking hard for me, so is sitting for long. I don't know whether the flowers under my care are grateful to me or not, but I have to thank them. When I work, I always write a little, then go out into the yard to have a look, water this plant, move that pot, then come back inside to write a bit more, then go out again. This cycle combines mental and physical labor in a way that is good for body and mind, better than taking medicine. If a storm hits or the weather changes suddenly, the whole family has to be mobilized to rescue the flowers, and things get very tense. Hundreds of potted plants must be rushed indoors, leaving everyone with aching backs and legs and dripping with sweat. The next day, when the weather turns fine, the flowers all have to be moved out again, and once more it is aching backs, aching legs, and dripping sweat. But how interesting it is! Without labor, not even a single flower can be kept alive; isn't that the truth?
The milkman, the moment he comes through the gate, exclaims "How fragrant!", which makes the whole family proud. When the night-blooming cereus is about to open, I invite a few friends over to look at it, with the air of a candle-lit night outing, for the cereus always opens at night. When a plant divides at the root and one becomes several, I give some to friends; seeing them carry away the fruits of my labor naturally makes me especially happy.
Of course, there are sad moments too, and there was one this summer. Three hundred chrysanthemum seedlings were still in the ground (it was not yet time to move them into pots) when a rainstorm brought down my neighbor's wall, crushing more than thirty varieties, over a hundred seedlings. For days no one in the family smiled.
Joys and sorrows, laughter and tears, flowers and fruit, fragrance and color, labor required and knowledge gained: that is the pleasure of growing flowers.

With the specified word set to "开心" (happy), the complete Python code for finding the five words in the article that are closest to "开心" is as follows (find_similar_words.py):

# -*- coding:utf-8 -*-
import jieba
from bert_client_lmj import Encoding
from operator import itemgetter

# Read the article
with open('./doc.txt', 'r', encoding='utf-8') as f:
    content = f.read().replace('\n', '')

ec = Encoding()
similar_word_dict = {}

# Find the words in the article closest to '开心'
words = list(jieba.cut(content))
for word in words:
    print(word)
    if word not in similar_word_dict.keys():
        similar_word_dict[word] = ec.query_similarity([word, '开心'])

# Sort by similarity, from highest to lowest
sorted_dict = sorted(similar_word_dict.items(), key=itemgetter(1), reverse=True)

print('与%s最接近的5个词语及相似度如下:' % '开心')
for _ in sorted_dict[:5]:
    print(_)

The output is as follows:

与开心最接近的5个词语及相似度如下:
('难过', 0.9070794)
('高兴', 0.89517105)
('乐趣', 0.89260685)
('骄傲', 0.87363803)
('我爱花', 0.86954254)

Extract text entities

  In event extraction, we often need to extract certain specific elements. For example, in the following sentence,

On the morning of December 16, 2014, Pakistan local time, Tehrik-i-Taliban Pakistan militants attacked an army-run school in the northwestern city of Peshawar, killing 141 people, 132 of whom were students aged 12 to 16.

we need to extract the attacker, that is, the "terrorist organization" element.
  Direct syntactic parsing might get us part of the way, but because the way an event is described varies greatly, the parsing can become complicated and the results are not guaranteed. Here we can try BERT word vectors, which to some extent can serve as a complementary strategy to help us locate event elements. The idea is as follows:

  • Specify a template for the event element
  • Segment the sentence into words and build n-grams
  • Compute the similarity between each n-gram and the template
  • Sort the n-grams by similarity and take the one with the highest similarity

Here the event element is the terrorist organization, and the template is specified as "伊斯兰组织" (Islamic organization). The complete Python program is as follows (find_similar_entity_in_sentence.py):

# -*- coding:utf-8 -*-

import jieba
from operator import itemgetter
from bert_client_lmj import Encoding

# Build n-grams from a word sequence
def compute_ngrams(sequence, n):
    lst = list(zip(*[sequence[index:] for index in range(n)]))
    for i in range(len(lst)):
        lst[i] = ''.join(lst[i])
    return lst

# Template
template = '伊斯兰组织'
# Example sentence
doc = "巴基斯坦当地时间2014年12月16日早晨,巴基斯坦塔利班运动武装分子袭击了西北部白沙瓦市一所军人子弟学校,打死141人,其中132人为12岁至16岁的学生。"

words = list(jieba.cut(doc))
all_lst = []
for j in range(1, 5):
    all_lst.extend(compute_ngrams(words, j))

ec = Encoding()
similar_word_dict = {}

# Find the n-gram in the sentence closest to the template
for word in all_lst:
    print(word)
    if word not in similar_word_dict.keys():
        similar_word_dict[word] = ec.query_similarity([word, template])

# Sort by similarity, from highest to lowest
sorted_dict = sorted(similar_word_dict.items(), key=itemgetter(1), reverse=True)

print('与%s最接近的实体是: %s,相似度为 %s.' %(template, sorted_dict[0][0], sorted_dict[0][1]))

The output is as follows:

与伊斯兰组织最接近的实体是: 塔利班运动武装分子,相似度为 0.8953854.

It can be seen that the algorithm successfully located the terrorist organization for us: 塔利班运动武装分子 (Taliban movement militants). The result is quite good, but because the word vectors are generated in an unsupervised way, the results are not always controllable, and the algorithm runs rather slowly; both points could be improved in a real project.
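Regarding speed, one straightforward improvement is to batch all the n-grams together with the template into a single call to the word-vector service and compute every cosine similarity in one matrix operation, instead of issuing one request per n-gram. The sketch below assumes the same bert-serving setup as above; the function name most_similar_ngram is illustrative only.

# -*- coding:utf-8 -*-
# Sketch: batch all n-grams plus the template into one encode() call,
# then compute all cosine similarities in a single matrix operation.
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_ngram(ngrams, template, ip="127.0.0.1"):
    bc = BertClient(ip=ip)
    # One request: the first row is the template vector, the rest are the n-grams
    vectors = bc.encode([template] + ngrams)
    sims = cosine_similarity(vectors[:1], vectors[1:])[0]
    best = sims.argmax()
    return ngrams[best], float(sims[best])

With the all_lst and template from the script above, most_similar_ngram(all_lst, template) should return the same top-ranked entity while issuing only one request to the service.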

Entity alignment in question answering

  In intelligent question answering, we tend to store entities in a knowledge graph or a database, and entity alignment is one of the hard problems there. For example, suppose the database stores the following entities (entities.txt):

094 / Jin class
Type 052C (Luyang II class)
Liaoning / Varyag
USS Gerald R. Ford
Type 052D (Luyang III class)
Type 054A
CVN-72 / Lincoln

These entity names are quite complex: if a user queries the entity "辽宁舰" (the Liaoning ship), matching it against the stored names is difficult, and because the entities live in a database or knowledge graph, we cannot simply rewrite them. One approach is to locate the entity by keyword matching; here we implement the matching with BERT word vectors instead. Note that entities.txt stores the Chinese entity names, as the output below shows. The complete Python code is as follows (Entity_Alignment.py):

# -*- coding:utf-8 -*-
from bert_client_lmj import Encoding
from operator import itemgetter

with open('entities.txt', 'r', encoding='utf-8') as f:
    entities = [_.strip() for _ in f.readlines()]

ec = Encoding()

def entity_alignment(query):

    similar_word_dict = {}

    # Find the stored entity closest to the query
    for entity in entities:
        if entity not in similar_word_dict.keys():
            similar_word_dict[entity] = ec.query_similarity([entity, query])

    # Sort by similarity, from highest to lowest
    sorted_dict = sorted(similar_word_dict.items(), key=itemgetter(1), reverse=True)

    return sorted_dict[0]

query = '辽宁舰'
result = entity_alignment(query)
print('查询实体:%s,匹配实体:%s 。' %(query, result))

query = '林肯号'
result = entity_alignment(query)
print('查询实体:%s,匹配实体:%s 。' %(query, result))

The output is as follows:

查询实体:辽宁舰,匹配实体:('辽宁舰/瓦良格/Varyag', 0.8534695) 。
查询实体:林肯号,匹配实体:('CVN-72/林肯号/Lincoln', 0.8389378) 。

  As for query speed, it should not be a problem here, because we can encode the existing entities offline and store their word vectors; when a query comes in, we only need to call the word-vector service once for the query itself and compute its similarity against the stored vectors locally. The method does have flaws: mainly, because the word vectors are unsupervised, entity alignment is sometimes not very accurate. Still, it is worth considering as a complementary strategy.
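A minimal sketch of that offline idea follows; the file name entity_vectors.npy and the function entity_alignment_offline are illustrative assumptions, not part of the original project. All entities are encoded once and saved, and each query then needs only a single call to the word-vector service.

# -*- coding:utf-8 -*-
# Sketch: precompute entity vectors offline, then answer each query with one
# encode() call plus a local cosine-similarity computation.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_client_lmj import Encoding

ec = Encoding()

with open('entities.txt', 'r', encoding='utf-8') as f:
    entities = [_.strip() for _ in f.readlines()]

# Offline step: encode every entity once and save the matrix to disk
entity_vectors = np.vstack([ec.encode(entity) for entity in entities])
np.save('entity_vectors.npy', entity_vectors)

# Online step: one service call per query, the rest is a local matrix operation
def entity_alignment_offline(query):
    query_vector = ec.encode(query)  # shape (1, 768)
    sims = cosine_similarity(query_vector, entity_vectors)[0]
    best = sims.argmax()
    return entities[best], float(sims[best])

print(entity_alignment_offline('辽宁舰'))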

Summary

  This article has described several applications of BERT word vectors that I have thought of so far. Due to my limited ability, there are bound to be places in the article that could be improved; criticism and corrections from readers are very welcome.
  In addition, I will continue to study word-vector techniques, such as Tencent's word vectors and Baidu's word vectors. Stay tuned!

Note: interested readers can follow the author's WeChat official account, "Python crawlers and algorithms" (WeChat ID: easy_web_scrape). Welcome to follow!


Source: www.cnblogs.com/jclian91/p/10987841.html