Document Reading Notes (XI)

Notes on "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions"

 

 

I. Outline of the paper

The paper gives a comprehensive overview and analysis of the main approaches to entity linking, and discusses its applications, the evaluation of entity linking systems, and future directions. It reads as an introductory, survey-style article.

  1. The paper first summarizes why the entity linking task arose:

1) A large amount of data is produced in natural language form, but natural language is ambiguous, and named entities in particular are highly ambiguous.

2) When a new entity or fact is inserted into an existing knowledge base, the mention of that entity inevitably has to be linked to the entities already in the knowledge base.

  2. The paper then describes the entity linking task in detail. Given a knowledge base containing a set of entities E and a text collection containing a set of named entity mentions M, the task is to map each mention m ∈ M to its corresponding entity in the knowledge base. Each mention m is a span of text that potentially refers to a named entity and is identified in advance. If the knowledge base contains no entity corresponding to a mention, the mention is labeled NIL. An entity linking system usually consists of three modules:

1) Candidate entity generation

2) Candidate entity ranking

3) Unlinkable mention prediction (deciding whether the mention has no matching entity and should be labeled NIL)

  3. The beginning of the paper also describes the applications of entity linking:

1) Information extraction: the named entities and relations extracted by an information extraction system are usually ambiguous and need to be linked to a knowledge base to disambiguate them.

2) Information retrieval: semantic entity-based search needs the entity mentions in Web text to be disambiguated so that it can handle the semantics of entities and Web documents more precisely.

3) Content analysis

4) Question answering systems

5) Knowledge base population

  4. The paper also briefly introduces the commonly used knowledge bases: Wikipedia, YAGO, DBpedia, and Freebase.
  5. The paper then introduces the components of an entity linking system and the methods commonly used for each, in three parts: candidate entity generation, candidate entity ranking, and unlinkable mention prediction.
  6. Candidate entity generation: for each entity mention m ∈ M, the system must find a candidate entity set Em, where each candidate is a knowledge base entity that the mention may refer to. The main techniques in common use are:

1) Name dictionary based techniques: an offline name dictionary is built from features provided by Wikipedia. The name dictionary D is a ⟨key, value⟩ mapping in which the key column is a list of names. For a key k, the value k.value is the set of entities that the name k may refer to. D is generally constructed from the following features: entity pages (a Wikipedia page that describes a single entity), redirect pages (a page for an alternative name that redirects to the entity page), disambiguation pages (a Wikipedia page that distinguishes multiple entities sharing the same name), bold phrases in the first paragraph, and hyperlinks in articles.

2) Surface form expansion from the local document: identify other name variants of the mention, such as acronyms and aliases, in the document where it appears. Approaches are either heuristic or based on supervised learning. One heuristic for acronyms uses N-grams: after removing stop words, check whether the document contains N consecutive words whose initials match the acronym.

3) Search engine based methods: query a Web search engine with the mention to retrieve candidate entities.
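The first two techniques can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the function names, the toy stop-word list, and the toy Wikipedia-like data are all made up for illustration and do not come from the paper. A real system would extract the pages, redirects, and anchor texts from a Wikipedia dump.

```python
from collections import defaultdict

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"of", "the", "and", "for", "in", "a", "an"}


def build_name_dictionary(entity_pages, redirects, disambiguations, anchors):
    """Build D: lower-cased name -> set of entities the name may refer to."""
    d = defaultdict(set)
    for title in entity_pages:                      # an entity page names itself
        d[title.lower()].add(title)
    for alias, target in redirects.items():         # redirect: alias -> entity
        d[alias.lower()].add(target)
    for name, entities in disambiguations.items():  # one name, several entities
        d[name.lower()].update(entities)
    for anchor_text, target in anchors:             # hyperlink text -> linked entity
        d[anchor_text.lower()].add(target)
    return d


def generate_candidates(mention, name_dict):
    """Exact-match lookup; an unknown name yields an empty candidate set."""
    return set(name_dict.get(mention.lower(), set()))


def expand_acronym(acronym, document):
    """Return N consecutive non-stop-word tokens whose initials spell the acronym.

    Stop words are removed before matching, so they are also missing from
    the returned expansion. Returns None if no window matches.
    """
    tokens = [t.strip(".,;:()") for t in document.split()]
    tokens = [t for t in tokens if t and t.lower() not in STOP_WORDS]
    n = len(acronym)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if "".join(w[0] for w in window).upper() == acronym.upper():
            return " ".join(window)
    return None


# Toy data, invented for the example.
name_dict = build_name_dictionary(
    entity_pages=["Michael Jordan", "Michael I. Jordan"],
    redirects={"MJ": "Michael Jordan"},
    disambiguations={"Jordan": ["Michael Jordan", "Michael I. Jordan"]},
    anchors=[("Jordan", "Michael Jordan")],
)
print(generate_candidates("jordan", name_dict))
print(expand_acronym("ACM", "The Association for Computing Machinery (ACM) was founded in 1947."))
```

In this sketch the expanded surface form would then be looked up in the dictionary in place of the acronym itself.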

  7. Candidate entity ranking: rank the candidates in the candidate entity set Em and then select the most appropriate entity to link to.

1) Supervised ranking methods: binary classification (given a mention and a candidate entity, a binary classifier decides whether the mention refers to that candidate), learning to rank (a ranking model is built automatically from training data, and the highest-ranked candidate is selected), probabilistic methods (a document is assumed to refer largely to topically coherent entities, and this "topical coherence" is exploited to rank the candidates), and graph-based methods.

2) Unsupervised ranking methods: vector space model based methods (compute the similarity between the vector of the entity mention and the vector of each candidate entity) and information retrieval based methods (each candidate entity is indexed as a separate document, and for each mention a search query is generated from the mention and its context document).
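As a rough illustration of the vector space model idea, the sketch below represents the mention's context and each candidate's description as bag-of-words vectors and ranks the candidates by cosine similarity. The whitespace tokenizer, the function names, and the example texts are assumptions made for this sketch. Real systems typically use better tokenization and term weighting such as TF-IDF.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very naive tokenizer: lower-case and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(mention_context, candidate_descriptions):
    """Return (entity, score) pairs sorted by descending similarity."""
    query = bag_of_words(mention_context)
    scored = [(entity, cosine(query, bag_of_words(description)))
              for entity, description in candidate_descriptions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy descriptions, invented for the example.
candidates = {
    "Michael Jordan": "American basketball player who won NBA titles with the Chicago Bulls",
    "Michael I. Jordan": "American scientist and professor of machine learning and statistics",
}
ranking = rank_candidates("Jordan scored 40 points for the Bulls in the NBA finals", candidates)
print(ranking[0][0])
```

Because no training data is needed, this kind of ranking is easy to set up, at the cost of ignoring everything except surface word overlap.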

  8. Features used for candidate entity ranking:

1) Context-independent features: name string comparison (string similarity between the mention and the name of the candidate entity) and entity popularity.

2) Context-dependent features: textual context (the textual similarity between the context around the entity mention and the document associated with the candidate entity, usually with both represented as bags of words or concept vectors), and coherence between the mapping entities (a document generally refers to coherent entities from one or a few related topics, and this topical coherence can be used to link the entity mentions in the same document).
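To make the name string comparison features concrete, here is a minimal sketch that computes a few of them for a mention and a candidate entity name. The particular feature set (exact match, substring, shared words, Dice coefficient over character bigrams) and the function names are my own choice for illustration, not a list prescribed by the paper.

```python
def char_bigrams(s):
    """Set of character bigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient over character bigrams, in [0, 1]."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:  # both strings shorter than two characters
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def name_features(mention, entity_name):
    """Compare the two name strings only; the surrounding text is not used."""
    m, e = mention.lower(), entity_name.lower()
    return {
        "exact_match": float(m == e),
        "substring": float(m in e or e in m),
        "shared_words": float(len(set(m.split()) & set(e.split()))),
        "dice_bigrams": dice(m, e),
    }


features = name_features("Jordan", "Michael Jordan")
print(features)
```

Feature values like these would typically be fed, together with the context-based features, into whichever ranking method is used.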

  9. Unlinkable mention prediction:

1) NIL threshold: the top-ranked entity e_top is associated with a score s_top. If s_top is less than a NIL threshold τ, the system returns NIL for mention m and predicts m to be unlinkable.

2) Supervised learning: a classifier trained on labeled data predicts whether a mention is unlinkable.

3) Within learning to rank: NIL is added to the candidate set as an extra entity, and if NIL is ranked highest, the mention is considered unlinkable.
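The threshold rule in 1) and the NIL-as-candidate idea in 3) can be sketched as follows. This is a simplified sketch: the function names are hypothetical, and `nil_score` is a fixed number supplied by the caller, whereas a learning-to-rank model would score the NIL candidate itself.

```python
NIL = "NIL"


def link_with_nil_threshold(ranked, tau):
    """ranked: list of (entity, score) pairs sorted by descending score.

    Returns the top entity if its score reaches tau, otherwise NIL.
    """
    if not ranked:  # no candidates at all, so nothing to link to
        return NIL
    e_top, s_top = ranked[0]
    return NIL if s_top < tau else e_top


def link_with_nil_candidate(scored, nil_score):
    """Add NIL as one more candidate and return whichever scores highest."""
    candidates = list(scored) + [(NIL, nil_score)]
    return max(candidates, key=lambda pair: pair[1])[0]


ranked = [("Michael Jordan", 0.8), ("Michael I. Jordan", 0.3)]
print(link_with_nil_threshold(ranked, tau=0.5))
print(link_with_nil_threshold(ranked, tau=0.9))
```

The threshold τ would usually be tuned on held-out data rather than set by hand.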

  10. Evaluation metrics:

1) Precision: of all the entity mentions that the system links, the fraction that are linked to the correct entity.

2) Recall: of all the entity mentions that should be linked, the fraction that the system links correctly.

3) F1 measure: the harmonic mean of precision and recall.
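A small sketch of how these three metrics could be computed from gold and predicted links. The dictionary layout (mention id mapped to an entity, with "NIL" for unlinkable mentions) and the function name are assumptions for this example, not a format defined in the paper.

```python
NIL = "NIL"


def evaluate(gold, predicted):
    """gold / predicted: dict mapping mention id -> entity, or NIL."""
    linked_by_system = [m for m, e in predicted.items() if e != NIL]
    should_be_linked = [m for m, e in gold.items() if e != NIL]
    # A link is correct when the system links the mention to the gold entity.
    correct = sum(1 for m in linked_by_system if gold.get(m) == predicted[m])

    precision = correct / len(linked_by_system) if linked_by_system else 0.0
    recall = correct / len(should_be_linked) if should_be_linked else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy annotations, invented for the example.
gold = {"m1": "A", "m2": "B", "m3": NIL, "m4": "C"}
predicted = {"m1": "A", "m2": "X", "m3": "Y", "m4": NIL}
precision, recall, f1 = evaluate(gold, predicted)
print(precision, recall, f1)
```

In the toy data the system links three mentions and gets one right, and three mentions should have been linked, so precision and recall are both 1/3.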

  11. Possible future directions and open problems:

1) Most current entity linking systems focus on linking entity mentions detected in unstructured documents (such as news articles and blogs). However, entity mentions also occur in other types of data, and these also need to be linked to a knowledge base.

2) Most work on entity linking lacks an analysis of computational complexity, and usually does not evaluate the efficiency and scalability of the systems.

3) The demand for constructing and populating domain-specific knowledge bases (for example in the biomedical, entertainment, product, finance, and tourism domains) is growing, so domain-specific entity linking is also important. Domain-specific entity linking focuses on data from one particular domain, and domain-specific knowledge bases may be structured differently from general-purpose ones.

Origin www.cnblogs.com/hwx1997/p/12444167.html