Synonym Mining in Knowledge Graphs (KG)

Foreword

In scenarios such as semantic search, recommendation, and intelligent question answering, we keep acquiring more knowledge through various data mining methods, and a new problem arises: how should new knowledge be merged with old? For example, two different Chinese expressions that both translate as "diarrhea" mean the same thing in practice, but because their surface forms differ greatly they are hard to merge during extraction and end up stored as two separate entities. In the many downstream tasks that involve entity retrieval or query expansion, recall then suffers badly. This article reviews the general approach to synonym mining and the progress of existing academic work.

Where do synonyms come from?

The two expressions for "diarrhea" mentioned above differ completely in their Chinese characters, their Pinyin, and their glyph shapes, so rule-based tools can hardly associate them. The more common practice is to prepare the data offline in batches. In a search scenario, for example, a synonym mapping table is prepared in advance, and when the input query contains a variant such as "stomachache", it is aligned to the canonical term "diarrhea". Preparing a synonym table is genuinely troublesome. The semantics of words and sentences are an abstraction on top of the text itself. A person can infer synonyms from a small amount of textual information because prior knowledge supports the inference, but this is hard for a computer (at least while knowledge graphs remain incomplete). Synonym tables are therefore often collected manually by business staff. This gives high precision, but because natural language is so flexible, recall on long-tail data is hard to guarantee, and the table ends up being patched one word at a time as bad cases come up.
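The offline mapping table described above can be sketched as a plain dictionary lookup. This is a minimal illustration only: the table entries, the English surface forms, and the function name normalize_query are hypothetical, and a real system would match on segmented tokens rather than raw substrings.

```python
# Hypothetical offline synonym table: surface form -> canonical term.
SYNONYM_TABLE = {
    "the runs": "diarrhea",
    "loose stools": "diarrhea",
    "tummy ache": "abdominal pain",
}


def normalize_query(query: str, table: dict = SYNONYM_TABLE) -> str:
    """Replace known surface forms in a query with their canonical term.

    Longer surface forms are replaced first so that a short entry cannot
    break up a longer one. This is naive substring replacement.
    """
    normalized = query.lower()
    for surface in sorted(table, key=len, reverse=True):
        normalized = normalized.replace(surface, table[surface])
    return normalized
```

Anything missing from the table passes through unchanged, which is exactly the long-tail recall problem noted above.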

Since manual enumeration of synonyms has limited capacity, where can ready-made synonym data be found? There are two lines of thought. The first is to look in existing structured data. The easiest sources are existing knowledge graph datasets such as OpenKG (http://openkg.cn/) and OwnThink (https://www.ownthink.com/), where open knowledge graph data for different domains is available.

These datasets often provide an alias-like field for each entity.
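As a rough sketch of how such alias data could be turned into a lookup, assume the dataset is available as (entity, attribute, value) triples and that the attribute holding aliases is literally named "alias". Both the triple layout and the attribute name are assumptions for illustration; actual datasets should be inspected first.

```python
from collections import defaultdict

# Hypothetical sample triples in (entity, attribute, value) form.
SAMPLE_TRIPLES = [
    ("diarrhea", "alias", "loose stools"),
    ("diarrhea", "alias", "The Runs"),
    ("diarrhea", "category", "symptom"),
    ("pyrexia", "alias", "fever"),
]


def build_alias_index(triples, alias_attribute="alias"):
    """Map each lower-cased alias to the set of entities that declare it."""
    index = defaultdict(set)
    for entity, attribute, value in triples:
        if attribute == alias_attribute:
            index[value.strip().lower()].add(entity)
    return index
```

A set is kept per alias because the same alias may point to several entities, which is one of the things the cleaning step has to resolve.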

Another source of structured data, slightly more troublesome to use, is encyclopedia pages: alias information is also easy to find on Baidu Encyclopedia and Wikipedia pages.

Although this data needs some cleaning, it has been curated by experts, so its accuracy can be relied on.

If ready-made datasets cannot meet the need, the other line of thought is to mine synonyms from unstructured data. The key to this approach is the use of context. Consider the following example:

The most common cause of diarrhea is intestinal infection, which may be viral, bacterial, or parasitic.

Think about how we ourselves would find synonyms for the word "diarrhea". Suppose the word is unfamiliar to us. A natural approach is to look at how the sentence describes the word and at the entity types of the other words in it. If another word turns out to be surrounded by similar medical knowledge about the digestive system, it very likely has a meaning similar to "diarrhea". For a computer, the method simply follows the same logic.

So at the data level, a key point when mining synonyms from unstructured data is how to build a parallel corpus, which in plain terms means two pieces of text describing the same thing. This is easy to say, but if the corpus is collected without any organization, the resulting text rarely has good coverage. We should therefore actively look for scenarios that are likely to produce parallel text, such as the health Q&A product of Dingxiang Doctor.

It contains a large amount of conversation data between doctors and patients. The two sides express themselves differently, but because the core topic of each conversation is fixed in this business scenario, a parallel corpus forms naturally.

If you do not have a similar scenario but still want to build a parallel corpus, a somewhat opportunistic option is to rely on existing search engines, for example by searching Baidu for a keyword such as "hurt my stomach".

Most existing search engines already incorporate synonym mapping, query expansion, and similar functions. Although cleaning the crawled data is relatively laborious, this is a reasonable strategy when data is scarce.

Common approaches to synonym mining

This article focuses on the common approach to mining synonyms from unstructured datasets. In general, this type of work is divided into the following steps:

1) Extract mentions from the text. A simple approach is to run word segmentation directly and select some of the resulting words for synonym mining. If new words or expressions in other languages may appear in the corpus, pattern mining, NER, noun phrase extraction, or similar methods are needed to obtain candidate words.
2) Prepare an existing synonym table as seed data.
3) Obtain features for all seeds and candidate words. Features are usually considered from two angles, local context and global context, or in plain terms local features and global features. The former focus on the word itself, commonly character-level and word-level features. The latter consider the distribution of the target word across the dataset, or the semantic features of the sentence or paragraph it appears in.
4) Choose a modeling angle according to the characteristics of your own dataset. Existing papers model the problem from different angles: using pattern features and distributional features to validate each other, improving only the pre-trained vectors of the words themselves, or focusing on the distributional difference between a candidate word and a target synonym set. These are discussed in detail in the next section.
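To make the local versus global distinction in step 3 concrete, here is a minimal sketch. The choice of character bigram overlap as the local feature and windowed co-occurrence counts as the global feature is an assumption made for illustration, as are all function names; the papers reviewed below use richer features.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with '#' at both ends."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def local_similarity(a, b, n=2):
    """Local feature: Jaccard overlap of the two words' character n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def context_vector(target, sentences, window=2):
    """Global feature: counts of words seen within `window` tokens of target."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two words with no characters in common score zero on the local feature yet can still score highly on the global one if they occur in the same contexts, which is the case the diarrhea example is about.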

How to model

In this section we have selected several recent academic works on synonym mining and summarize them briefly below.

Using patterns

When extracting mentions from text data, a very simple idea is to use templates, which we discussed in a previous article reviewing knowledge graph construction. In fact, a synonym pair can be regarded as a special kind of relation triple. Pattern-based extraction can guarantee precision, but its main problems are the limited number of templates and semantic drift. In addition, because of the special nature of the synonym relation, template-based extraction places extra requirements on the corpus, for example that it contains specific trigger words or matches a specific extraction template.

Such sentences appear frequently only in Wikipedia-style or encyclopedia-entry explanatory text. In other types of text, expressions this tidy are not easy to find.
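A trigger-word template of this kind can be sketched with a regular expression. The English trigger phrases below are assumptions chosen for illustration; they are not the trigger words used in the original article, and the pattern is deliberately naive about where the left-hand term starts.

```python
import re

# Hypothetical trigger phrases that signal a synonym relation.
TRIGGERS = ("also known as", "also called", "commonly known as")

# <term A>[,] <trigger> <term B>, where B ends at punctuation or end of text.
SYNONYM_PATTERN = re.compile(
    r"(?P<a>\w[\w ]*?),? (?:" + "|".join(TRIGGERS) + r") (?P<b>\w[\w ]*?)(?=[,.;)]|$)"
)


def extract_synonym_pairs(text):
    """Return (term, synonym) pairs matched by the trigger-word template."""
    return [
        (m.group("a").strip(), m.group("b").strip())
        for m in SYNONYM_PATTERN.finditer(text)
    ]
```

Because the left-hand group captures everything from the first word character onward, a sentence with leading words before the term would produce an over-long match. In practice the match would be intersected with NER or noun-phrase boundaries.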

Despite these flaws, patterns still have merit. A common way to strengthen this part of the design is to start from seed patterns and use bootstrapping to mine more patterns. The data extracted by the patterns is then not used directly as results, but serves as one source of new candidate words.
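The bootstrapping loop can be sketched as alternating between two steps: induce patterns from known pairs, then apply the patterns to find new pairs. This is a simplified illustration under strong assumptions (single-word terms, the pattern is just the text between the two terms, no pattern scoring); the function names are hypothetical.

```python
import re


def induce_patterns(pairs, sentences, max_gap=30):
    """Collect the text found between the two terms of each known pair."""
    patterns = set()
    for a, b in pairs:
        regex = re.compile(re.escape(a) + r"(.{1,%d}?)" % max_gap + re.escape(b))
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Find (left word, right word) pairs joined by any known pattern."""
    candidates = set()
    for infix in patterns:
        regex = re.compile(r"(\w+)" + re.escape(infix) + r"(\w+)")
        for sentence in sentences:
            candidates.update(regex.findall(sentence))
    return candidates


def bootstrap(seed_pairs, sentences, rounds=2):
    """Alternate pattern induction and pattern application for a few rounds."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        patterns |= induce_patterns(pairs, sentences)
        pairs |= apply_patterns(patterns, sentences)
    return pairs, patterns
```

Without any scoring, a generic infix such as " and " would be induced just as readily as a good one, which is the semantic drift problem. This is why the output is better treated as a candidate list than as final synonyms.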

《Multi-Distribution Characteristics Based Chinese Entity Synonym Extraction from The Web》

This work mainly uses the structured data and unstructured text of Wikipedia-like sources to extract synonyms. Its overall process can be summarized as follows.

The authors use bootstrapping to produce more soft patterns, and then use all the templates to generate more synonym candidates. Note that filtering the candidate set is the core of this type of synonym work. After obtaining the candidate set, the paper goes on to extract context features for each group of synonyms and uses the similarity between these features to build a similarity graph.

On top of this graph, the authors design a corresponding spreading activation algorithm.

The synonyms obtained by the spreading activation model are then passed through a re-ranking model to produce the final result.
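For readers unfamiliar with spreading activation, the following is a generic textbook-style sketch on a weighted similarity graph. It is not the algorithm from the paper: the decay factor, firing threshold, step limit, and graph layout are all assumptions made for illustration.

```python
def spreading_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed nodes over a weighted graph.

    graph: dict mapping node -> {neighbor: edge weight in [0, 1]}.
    A node fires once, when its activation reaches `threshold`, and passes
    activation * weight * decay to each neighbor that has not fired yet.
    """
    activation = {node: 0.0 for node in graph}
    for seed in seeds:
        activation[seed] = 1.0
    frontier, fired = set(seeds), set()
    for _ in range(max_steps):
        fired |= frontier
        next_frontier = set()
        for node in frontier:
            for neighbor, weight in graph.get(node, {}).items():
                if neighbor in fired:
                    continue
                received = activation[node] * weight * decay
                activation[neighbor] = min(1.0, activation.get(neighbor, 0.0) + received)
                if activation[neighbor] >= threshold:
                    next_frontier.add(neighbor)
        frontier = next_frontier
        if not frontier:
            break
    return activation
```

Candidates can then be ranked by their final activation, with weakly connected words ending up near zero.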

Better pre-training

Besides templates, the other natural idea is to use word vectors. Many technical articles on query expansion briefly mention using word2vec to find words similar to the current query term in order to improve recall. This is indeed a simple and effective approach, and here we look at it a little more carefully.
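The nearest-neighbor lookup behind this kind of query expansion can be sketched in a few lines. The three-dimensional vectors below are hand-made toy values, not trained embeddings, and most_similar here is a hypothetical helper written out in plain Python rather than a call into any library.

```python
import math

# Toy vectors standing in for pre-trained embeddings (values are made up).
TOY_VECTORS = {
    "diarrhea": [0.90, 0.10, 0.00],
    "loose_stools": [0.85, 0.15, 0.05],
    "fever": [0.10, 0.90, 0.20],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec)) for other, vec in vectors.items() if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]
```

With real embeddings the ranked list mixes true synonyms with merely related words, so the scores are usually treated as one feature among several.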

《Hierarchical Multi-Task Word Embedding Learning for Synonym Prediction》

The focus of this work is how to train better pre-trained word vectors for the synonym mining task. Readers familiar with Skip-gram and similar word vector training methods will know that such methods in effect use a sliding window of finite length to obtain the representation of a word. It is a mapping from a word back to its surrounding context, and the process itself has nothing to do with semantics. When the amount of data is large enough, we may happen to obtain synonyms of a word, but more often we can only say that two words are semantically "related". For example, "cold" and "fever" co-occur frequently in the medical domain, and a plain word-vector similarity search can hardly tell them apart. This paper improves on Skip-gram by attempting to add more semantically relevant information.

The authors modify the original Skip-gram model into a multi-task model with two tasks: one predicts the entity types in the sentence where the target word appears, and the other predicts the words themselves. With entity semantic information added in this way, the authors report that on their test set the model noticeably improves the semantic similarity captured by the word vectors.
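One generic way to write a two-task objective of this kind is shown below. This is an illustrative formulation and not the paper's exact loss: the weighting factor \lambda, the window size c, and the symbol e_t for the entity type associated with position t are all assumptions.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{word}} \;+\; \lambda\,\mathcal{L}_{\text{type}},
\qquad
\mathcal{L}_{\text{word}} \;=\; -\sum_{t}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\!\left(w_{t+j} \mid w_t\right),
\qquad
\mathcal{L}_{\text{type}} \;=\; -\sum_{t} \log p\!\left(e_t \mid w_t\right)
```

The first term is the usual Skip-gram objective over context words, and the second pushes words that occur with the same entity types toward similar representations.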

《SurfCon: Synonym Discovery on Privacy-Aware Clinical Data》

Besides adding semantic features, another way to strengthen pre-training is to mine more from the data itself.

Because of its special application scenario, medical electronic health records, this paper pays more attention to short entity-like query terms in the text. The authors therefore extend the character-level and word-level features and add co-occurrence features of medical entities in the electronic health records to strengthen the pre-training of the word vectors. As in the previous work, the authors add a ranking module at the end to improve the robustness of the whole system.

Making the most of seed data from knowledge bases

Besides the patterns, context features, and entity-type semantic features mentioned above, the synonym data already present in a knowledge graph is another data source worth exploiting.

《Automatic Synonym Discovery with Knowledge Bases》

This is work published at KDD 2017 by Professor Jiawei Han's team.

Similar to the work mentioned earlier, this paper also builds pre-trained vectors from word-level and sentence-level features and expands patterns with bootstrapping. The difference lies in the model learning stage: the authors collect the seed data generated by the distributional approach and the pattern-based approach, and use these seed synonyms to train a binary classifier, which further constrains the candidate words and improves precision.
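The idea of training a binary classifier on seed pairs can be sketched with a tiny logistic regression. This is not the model from the paper: the two features (a distributional similarity score and a flag for whether any pattern matched the pair), the toy training data, and the hyperparameters are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Probability that a candidate pair with features x is a synonym pair."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(x, weights, bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical seed data: [distributional similarity, pattern matched (0/1)].
SEED_FEATURES = [[0.9, 1], [0.8, 0], [0.85, 1], [0.3, 0], [0.2, 0], [0.4, 0]]
SEED_LABELS = [1, 1, 1, 0, 0, 0]  # 1 = known synonym pair, 0 = non-synonym
```

After training, predict is applied to each candidate pair and only pairs above a chosen probability are kept. This is how the classifier constrains the candidate list.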

《Mining Entity Synonyms with Efficient Neural Set Generation》

This is also a synonym mining work from Professor Jiawei Han's team, published at AAAI 2019.

This work makes further use of the synonym sets that already exist in a knowledge graph. The paper introduces the concept of an entity synonym set. Throughout the process, the authors aim to use the overall distribution of the large number of synonym clusters in the knowledge base to assess whether an entity word belongs to a given synonym cluster. The highlights of the paper are how a representation of a set is learned and how it is decided whether a new entity should be added to a set.
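The set-membership decision can be illustrated with a greedy clustering loop. This sketch is only a stand-in: the paper learns its set scorer with a neural model, whereas here the scorer is a hypothetical average over a hand-made pairwise similarity table, and the threshold is arbitrary.

```python
# Hand-made pairwise similarities (toy values); unknown pairs default to 0.1.
PAIR_SIMILARITY = {
    frozenset({"diarrhea", "loose stools"}): 0.90,
    frozenset({"fever", "pyrexia"}): 0.88,
}


def average_similarity(cluster, term, default=0.1):
    """Stand-in set scorer: mean similarity between term and cluster members."""
    scores = [PAIR_SIMILARITY.get(frozenset({m, term}), default) for m in cluster]
    return sum(scores) / len(scores)


def greedy_set_generation(terms, score_fn=average_similarity, threshold=0.5):
    """Assign each term to its best-scoring set, or open a new set.

    A term joins an existing set only if that set's score for the term
    exceeds `threshold`; otherwise it starts a singleton set.
    """
    clusters = []
    for term in terms:
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            score = score_fn(cluster, term)
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([term])
        else:
            best_cluster.append(term)
    return clusters
```

Scoring a term against the whole set instead of against single members is what distinguishes this family of methods from pairwise classification followed by clustering.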

Summary

In summary, in real industrial scenarios, synonym mining should start at the data source level: look for existing structured data that can be cleaned and used, which is probably the most economical route. Next, look for parallel corpora wherever possible, and use publicly available search engines to crawl data and build datasets. For the mining algorithm, consider combining pattern-based and context-based methods. If a relatively complete knowledge graph is available, you can also consider modeling the distribution of its existing synonym sets and applying that. Finally, the works reviewed here show that the whole mining pipeline actually consists of several smaller NLP tasks, such as named entity recognition, entity linking, and text correction. Optimizing these sub-tasks well also has no small impact on the final result.

 

Origin blog.csdn.net/xixiaoyaoww/article/details/104548758