How to perform concept disambiguation of named entities

1 Introduction

       Concept disambiguation of named entities is an important research sub-area of named entity disambiguation (English: Named Entity Disambiguation; for the concept of a named entity, see Section 3.1). What is concept disambiguation? Here is a simple example. The named entity "Dragon" has many different meanings. Some are TV series, such as the "1997 Felix Wong version of the TV series", the "1982 TVB version of the TV series", the "2003 Hu Jun version of the TV series", and a "2013 version of the TV series". Others are comics, such as the "Tencent anime comic" and the "Tony Wong comic adaptation". Although there are several TV-series meanings, they all share one concept: "TV series".


Figure 1 Different meanings of Dragon

       The task of concept disambiguation is therefore to identify which concept a named entity in a given text belongs to. Take the following three texts as examples.

Table 1 Example texts for concept disambiguation
Text | Content | Meaning | Concept
A | The Hong Kong version of Dragon is still the classic; Felix Wong really brought out Xiao Feng's temperament | 1997 Felix Wong version of the TV series | TV series
B | I am a fan of Zhang Jizhong, so of course I like Dragon | 2003 Hu Jun mainland version of the TV series | TV series
C | I love Hong Kong comics, such as "Son of Heaven Legend" and "Dragon" | Tony Wong comic adaptation | Comic

       In text A, Dragon refers to the "1997 Felix Wong version of the TV series"; in text B, to the "2003 Hu Jun version of the TV series"; and in text C, to the "Tony Wong comic adaptation". Although Dragon has different meanings in texts A and B, it belongs to the same concept category in both: the "TV series" Dragon. The task of concept disambiguation is therefore to assign Dragon in texts A and B to the "TV series" concept, and Dragon in text C to the "comic" concept.
The rest of this article describes how to perform concept disambiguation of named entities.

2 Concept disambiguation process

2.1 Acquiring all meanings of an entity

       This article uses the Baidu Encyclopedia entry for Dragon as the data source. The first step is to obtain the "Description" text and the "Properties" table for every meaning of the entity Dragon. The figure below shows the "Description" text and the "Properties" table for one of these meanings, the 1997 Felix Wong version of the TV series.


Figure 2 Content to be crawled for the 1997 Felix Wong version of the TV series

2.2 Building keyword phrases by word segmentation

       After obtaining the "Description" text and the "Properties" table for each meaning, use the jieba segmentation tool to segment the description text, "Dragon is a costume romance drama adapted from Jin Yong's novel of the same name, ......". This gives a list of words, list1. Then extract property words such as "drama, martial arts, romance, costume" and "Li Tiansheng" from the "Properties" table; these property words form list2. Merging list1 and list2 gives the keyword phrases for the meaning "1997 Felix Wong version of the TV series".
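The merge of list1 and list2 can be sketched roughly as follows. This is a minimal illustration, not the author's code: `build_keywords` is a hypothetical helper, the sample words are shortened, and dropping duplicates and empty tokens is an assumption the article does not state. In practice list1 would come from a segmenter call such as `jieba.lcut(description_text)`.

```python
def build_keywords(description_tokens, property_values):
    """Merge segmented description words (list1) with property words (list2).

    Assumption: empty tokens are dropped and duplicates are removed while
    keeping first-seen order. The article only says the lists are merged.
    """
    list1 = [t.strip() for t in description_tokens if t.strip()]
    list2 = [v.strip() for v in property_values if v.strip()]
    seen = set()
    keywords = []
    for word in list1 + list2:
        if word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Hypothetical, shortened inputs. With jieba installed, the tokens would
# come from something like: description_tokens = jieba.lcut(description_text)
description_tokens = ["Dragon", "Jin Yong", "costume", "Dragon"]
property_values = ["martial arts", "costume", "Li Tiansheng"]

keywords = build_keywords(description_tokens, property_values)
print(keywords)
```

Keeping first-seen order makes the result easy to compare with the source text; a set would work equally well if order does not matter.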

       Each meaning of Dragon is processed in the same way, which gives the following table.

Table 2 Keyword phrases for each meaning of Dragon
Meaning | Keyword phrases
1997 Felix Wong version of the TV series | ["1997", "Li Tiansheng", "Dragon", "Felix Wong", "Fan Siu Wong", "Zhang Guoqiang", "Benny", "Carman", "Lau Kam Ling", "Chiu", "Ho Mei Tin", "28", "Chen Guoliang", "Hong Kong", "Jin Yong", "martial arts", "costume", "Rain Lau", "Xiao Feng", "Murong Fu"]
2003 Hu Jun mainland version of the TV series | ["Drama", "2003", "costume", "Yu Min", "Liu Yifei", "Ju Jue Liang", "Zhou Xiaowen", "Zhao Arrow", "Jimmy Lin", "12", "11", "22", "Golden Eagle Award", "Dragon", "high-tiger", "Hu Jun", "Tao", "Chen Hao", "Zhang Jizhong", "good works"]
1982 Hong Kong version of the TV series | ["Xu Zhu", "1982", "Dragon", "Excalibur", "Felix Wong", "Huang Xingxiu", "Six Pulse", "Xiao Sheng", "Bryan Leung", "Kent", "Idy Chan", "Shek Sau", "TVB", "03", "22", "legend", "martial arts", "Hong Kong, China", "Hong Kong", "Qiao Feng"]
Tony Wong comic adaptation | ["Martial arts", "Qiao Feng", "main fact", "Tony Wong", "Dragon", "Wal-Mart", "beggars", "Xu Zhu", "family section", "heroes", "Song", "other races", "big help", "North Qiao Feng", "wife", "Kang Min", "onto the ground", "Duan Yu", "Hu Shao rights", "the international situation will"]
Tencent anime comic | ["Comics", "Dragon", "serial", "Tencent", "cartoon", "Phoenix", "entertainment", "creation"]
…… | ……

2.3 Extracting and merging concepts

       Concepts such as "TV series" and "comic" mentioned above do not come out of thin air. They are obtained by the following algorithm:

       (1) Word segmentation and POS tagging of the meaning title

       Use the jieba segmentation tool to perform word segmentation and part-of-speech (POS) tagging on the meaning title "1997 Felix Wong version of the TV series". This gives the array [['1997', 'm'], ['year', 'm'], ['Felix Wong', 'nz'], ['version', 'n'], ['TV series', 'n']], where the i-th element holds the i-th word and its part of speech.

       (2) Obtaining candidate concept words

       Keep only the nouns from the previous step, which gives ['Felix Wong', 'version', 'TV series'].

       (3) Determining the concept word

       In general, the last noun in a meaning's title usually represents the concept category that the meaning belongs to. From the previous step, the last noun is "TV series", so it is taken as the concept for this title. This gives the following table.

Table 3 Keyword phrases and concepts for each meaning of Dragon
Meaning | Keyword phrases | Concept
1997 Felix Wong version of the TV series | ["1997", "Li Tiansheng", "Dragon", "Felix Wong", "Fan Siu Wong", "Zhang Guoqiang", "Benny", "Carman", "Lau Kam Ling", "Chiu", "Ho Mei Tin", "28", "Chen Guoliang", "Hong Kong", "Jin Yong", "martial arts", "costume", "Rain Lau", "Xiao Feng", "Murong Fu"] | TV series
2003 Hu Jun mainland version of the TV series | ["Drama", "2003", "costume", "Yu Min", "Liu Yifei", "Ju Jue Liang", "Zhou Xiaowen", "Zhao Arrow", "Jimmy Lin", "12", "11", "22", "Golden Eagle Award", "Dragon", "high-tiger", "Hu Jun", "Tao", "Chen Hao", "Zhang Jizhong", "good works"] | TV series
1982 Hong Kong version of the TV series | ["Xu Zhu", "1982", "Dragon", "Excalibur", "Felix Wong", "Huang Xingxiu", "Six Pulse", "Xiao Sheng", "Bryan Leung", "Kent", "Idy Chan", "Shek Sau", "TVB", "03", "22", "legend", "martial arts", "Hong Kong, China", "Hong Kong", "Qiao Feng"] | TV series
Tony Wong comic adaptation | ["Martial arts", "Qiao Feng", "main fact", "Tony Wong", "Dragon", "Wal-Mart", "beggars", "Xu Zhu", "family section", "heroes", "Song", "other races", "big help", "North Qiao Feng", "wife", "Kang Min", "onto the ground", "Duan Yu", "Hu Shao rights", "the international situation will"] | Comic
Tencent anime comic | ["Comics", "Dragon", "serial", "Tencent", "cartoon", "Phoenix", "entertainment", "creation"] | Comic
…… | …… | ……
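Steps (1) to (3) above can be sketched as a small function. This is an illustration under stated assumptions: `candidate_nouns` and `extract_concept` are hypothetical names, and the input is an already tagged title in English rendering. With jieba installed, the (word, tag) pairs could be produced with `[(p.word, p.flag) for p in jieba.posseg.cut(title)]`. Treating every tag that starts with "n" as a noun is an assumption based on jieba's tag names (n, nr, ns, nt, nz).

```python
def candidate_nouns(tagged_title):
    """Step (2): keep only words whose POS tag marks a noun.

    Assumption: noun tags are those starting with "n" (n, nr, ns, nt, nz).
    """
    return [word for word, tag in tagged_title if tag.startswith("n")]


def extract_concept(tagged_title):
    """Step (3): take the last noun of the title as the concept word."""
    nouns = candidate_nouns(tagged_title)
    return nouns[-1] if nouns else None


# Step (1) output for the example title, written out by hand here.
tagged_title = [
    ("1997", "m"),
    ("year", "m"),
    ("Felix Wong", "nz"),
    ("version", "n"),
    ("TV series", "n"),
]

print(candidate_nouns(tagged_title))
print(extract_concept(tagged_title))
```

Returning `None` when a title contains no noun is a design choice of this sketch; the article does not say how such titles are handled.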

       From this table it is easy to see that the "1997 Felix Wong version of the TV series", the "2003 Hu Jun mainland version of the TV series", and the "1982 Hong Kong version of the TV series" all belong to the "TV series" concept, so they can be clustered into one "TV series" concept category. Similarly, the "Tony Wong comic adaptation" and the "Tencent anime comic" can be clustered into one "comic" concept category. Meanings that belong to the same concept can therefore be merged. Merging the three TV-series meanings gives the concept shown in the figure below.


Figure 3 Keyword phrases after merging concepts

2.4 Concept disambiguation
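The merge by concept can be sketched as a simple grouping. This is only an illustration: `merge_by_concept` is a hypothetical helper, the keyword lists are shortened, and removing duplicate keywords within a concept is an assumption that the article does not spell out.

```python
from collections import defaultdict


def merge_by_concept(meanings):
    """Group meanings by concept and pool their keyword phrases.

    Assumption: a keyword shared by several meanings is kept once per concept.
    """
    merged = defaultdict(list)
    for meaning in meanings:
        bucket = merged[meaning["concept"]]
        for word in meaning["keywords"]:
            if word not in bucket:
                bucket.append(word)
    return dict(merged)


# Shortened, hypothetical rows in the shape of Table 3.
meanings = [
    {"meaning": "1997 Felix Wong version of the TV series",
     "concept": "TV series", "keywords": ["1997", "Felix Wong", "Dragon"]},
    {"meaning": "2003 Hu Jun mainland version of the TV series",
     "concept": "TV series", "keywords": ["2003", "Hu Jun", "Dragon"]},
    {"meaning": "Tony Wong comic adaptation",
     "concept": "comic", "keywords": ["Tony Wong", "Dragon"]},
]

merged = merge_by_concept(meanings)
for concept, words in merged.items():
    print(concept, words)
```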

       Concept disambiguation of a text has two steps. The first step obtains the text vectors for the concepts and for the target text. The second step computes the cosine similarity between these text vectors to determine which concept the named entity in the target text belongs to (for cosine similarity, see Section 3.3).

       First, here is how the concept text vector and the target text vector are obtained. The keyword phrases for the "TV series" concept are ["1997", "Li Tiansheng", "Dragon", "Felix Wong", "Fan Siu Wong", "Zhang Guoqiang", "Benny", "Carman", ......]. Suppose the word vector for "1997" is w1, the word vector for "Li Tiansheng" is w2, the word vector for "Dragon" is w3, and so on. The text vector of the concept can then be defined as T1 = (w1 + w2 + ... + wn) / n. The target text, "The Hong Kong version of Dragon is still the classic; Felix Wong really brought out Xiao Feng's temperament", is segmented with jieba to obtain its keywords, and the same steps then give the target text vector.
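The averaging T1 = (w1 + w2 + ... + wn) / n can be sketched in plain Python as follows. The 3-dimensional vectors are made-up toy values (the article uses 200-dimensional vectors), `mean_vector` is a hypothetical helper, and skipping words that have no vector is an assumption that the article does not address.

```python
def mean_vector(words, embeddings):
    """Average the word vectors of `words` component by component.

    Assumption: words missing from `embeddings` are skipped.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the words has a word vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "1997": [1.0, 0.0, 0.0],
    "Felix Wong": [0.0, 1.0, 0.0],
}

# "Li Tiansheng" has no toy vector here, so it is skipped.
concept_vector = mean_vector(["1997", "Li Tiansheng", "Felix Wong"], embeddings)
print(concept_vector)
```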

       Computing the cosine similarities shows that the target text vector is most similar to the "TV series" concept vector, so the named entity in the target text should correspond to the "TV series" concept. This article uses an open-source set of Chinese word vectors to map text to vectors. These word vectors have 200 dimensions and cover almost all popular Chinese words and terms.
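The second step, choosing the concept with the highest cosine similarity, might look like the sketch below. The vectors are invented toy values rather than real embeddings, `disambiguate` is a hypothetical name, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention of this sketch for zero vectors
    return dot / (norm_a * norm_b)


def disambiguate(target_vector, concept_vectors):
    """Return the concept whose vector is most similar to the target."""
    return max(
        concept_vectors,
        key=lambda c: cosine_similarity(target_vector, concept_vectors[c]),
    )


# Toy vectors, invented for illustration only.
concept_vectors = {
    "TV series": [0.9, 0.1, 0.0],
    "comic": [0.1, 0.9, 0.0],
}
target_vector = [0.8, 0.3, 0.1]

best = disambiguate(target_vector, concept_vectors)
print(best)
```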

3 Terminology

3.1 Named Entity

       Named entities (English: Named Entity) include person names, place names, organization names, and other proper nouns, as well as times, quantities, currency amounts, ratios, and similar textual values. A named entity is something that can be identified by a proper noun (a name), and it generally denotes one specific individual thing. For example, the person names "Einstein" and "Newton", the place names "Beijing" and "New York", and the organization names "Good Future" and "Tsinghua University" are all named entities. Named entity processing is an important research area in NLP (English: Natural Language Processing).

3.2 Word vector

       Word vector (word embedding) is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.

3.3 Cosine similarity

       Cosine similarity measures how similar two vectors are by the cosine of the angle between them. The cosine of 0 degrees is 1, and the cosine of any other angle is no greater than 1. The cosine of the angle between two vectors in a vector space is used as a measure of the difference between two individuals, that is, a measure of the difference in direction of the two vectors.
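Written out, the standard definition is as follows. The symbols A, B, and n are notation introduced here for two vectors of the same dimension n and do not appear in the original text.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

The value is 1 when the two vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions.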

Conclusion

       Of course, this method can still produce duplicate concepts at the merging stage. For example, the Dragon entries include "1977 Hong Kong drama" and "2013 China TV drama", and the proposed method finds two "different" concepts for them, "drama" and "film", which is clearly redundant data. This can be addressed by further optimization, for example by computing concept similarity or by clustering concepts or keyword data, so that the problem no longer appears in the resulting concept data. Finally, I hope this article helps the many NLP practitioners who work with text.

Origin www.cnblogs.com/Kalafinaian/p/11407431.html