Knowledge Graph Literature Review (Chapter III: Entity Recognition and Linking)

Chapter III: Entity Recognition and Linking

1. Task definition, objectives, and significance

  Entities are important information-carrying language units in text, and they are also the core units of a knowledge graph.

  Named entity recognition (NER) is the task of identifying named entities in text and classifying them into predefined categories [Chinchor & Robinson, 1997]. Common entity categories include person names, place names, organization names, dates, and so on.

  Entity linking mainly addresses the ambiguity and diversity of entity names: the task is to map each entity mention in text to the real-world entity it refers to, and it is also commonly called entity disambiguation. For example, given the sentence "Apple released the latest product iPhone X", an entity linking system needs to map the mention "Apple" in the text to the real-world company it denotes. Entity recognition and linking are core technologies for large-scale text analysis and provide an effective means of coping with information overload.

2. Research questions and challenges

  Entity analysis faces several key scientific challenges:

  1. The ambiguity and diversity of entity names.

  2. The low-resource problem. Most current entity analysis algorithms rely on supervised models, which require large amounts of annotated training data to reach practical performance. However, given the cost of data annotation, sufficient training corpora are often unavailable for different domains, different text styles (standard vs. non-standard), different languages (Chinese, English, the many smaller languages along the Belt and Road, etc.), and so on. Unsupervised and semi-supervised techniques that do not depend on large training corpora, automatic resource construction, and transfer learning are therefore the core research problems for addressing this issue.

  3. The open entity problem. Entities are both complex and open-ended. Complexity means that entity types are highly varied and form complicated hierarchies. Openness means that the set of entities is not closed: entities emerge, evolve, and disappear over time. This openness and complexity pose enormous challenges for entity analysis: openness prevents existing supervised extraction methods from adapting to open-ended knowledge; the huge number of entities makes it impossible to handle them by enumeration or manually written rules; and the performance of existing models degrades over time.

3. Technical approaches and research status

Traditional methods:

  Named entity recognition is typically cast as sequence labeling and solved with Conditional Random Fields (CRF) over hand-crafted features.
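
  A minimal sketch of this approach, assuming the third-party sklearn-crfsuite package and toy feature templates chosen purely for illustration (the actual feature set and training data are not specified in this survey):

```python
# Minimal CRF-based NER sketch (illustrative only).
# Assumes the third-party package sklearn-crfsuite is installed.
import sklearn_crfsuite

def word_features(sent, i):
    """Hand-crafted features for the i-th token of a sentence (toy template)."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Tiny toy corpus with BIO tags (for illustration only).
train_sents = [["Apple", "released", "iPhone", "X", "in", "California"]]
train_tags = [["B-ORG", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-LOC"]]

X_train = [sent2features(s) for s in train_sents]
y_train = train_tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

test_sent = ["Apple", "opened", "a", "store", "in", "Beijing"]
print(crf.predict([sent2features(test_sent)])[0])
```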

  Entity linking computes the similarity between an entity mention and candidate entities in the knowledge base, and selects as the target the entity most similar to the mention and its context.
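
  A minimal sketch of such a similarity-based linker, using a toy two-entry knowledge base and simple bag-of-words cosine similarity as a stand-in for the similarity measures used in practice (all names and descriptions are illustrative):

```python
# Toy similarity-based entity linking sketch (illustrative only).
from collections import Counter
import math

# Hypothetical mini knowledge base: entity id -> textual description.
KB = {
    "Apple_Inc.": "technology company that designs the iPhone and Mac computers",
    "Apple_(fruit)": "edible fruit produced by an apple tree",
}

def bow(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def link(mention_context):
    """Rank candidate entities by context similarity and return the best one."""
    ctx = bow(mention_context)
    scored = {eid: cosine(ctx, bow(desc)) for eid, desc in KB.items()}
    return max(scored, key=scored.get), scored

best, scores = link("Apple released the latest product iPhone X")
print(best, scores)  # the company sense should score higher than the fruit
```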

Deep learning methods:

  Entity recognition. As deep learning has become popular across many fields, more and more deep learning models have been proposed for entity recognition. There are currently two typical deep learning architectures for named entity recognition. The first is the NN-CRF architecture [Lample et al., 2016], in which a CNN/LSTM learns a vector representation for the word at each position, and a CRF layer decodes the best tag sequence over these representations. The second follows the idea of sliding-window classification: a neural network learns a representation for each n-gram in a sentence and then predicts whether that n-gram is a target entity [Xu et al., 2017].
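
  A minimal PyTorch sketch of the NN-CRF idea, showing only the BiLSTM encoder that produces per-position tag scores; the CRF layer described in [Lample et al., 2016] would decode the best tag sequence on top of these emission scores. Dimensions, vocabulary size, and tag count are toy assumptions:

```python
# BiLSTM encoder for sequence labeling (emission scores only; illustrative).
# A CRF layer would normally decode the best tag sequence over these scores.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=50, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> emissions: (batch, seq_len, num_tags)
        x = self.emb(token_ids)
        h, _ = self.lstm(x)
        return self.proj(h)

# Toy usage: 100-word vocabulary, 5 BIO tags, one sentence of 6 tokens.
model = BiLSTMTagger(vocab_size=100, num_tags=5)
tokens = torch.randint(0, 100, (1, 6))
emissions = model(tokens)
print(emissions.shape)  # torch.Size([1, 6, 5])
```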

  Entity linking. The core of entity linking is to build a unified representation and modeling framework for multiple types of multi-modal contexts and knowledge, so that different pieces of evidence can interact with one another. By mapping different types of information into the same feature space and providing efficient end-to-end training algorithms, deep representation learning offers a powerful tool for this task. Related work includes learning vector representations of evidence from multiple heterogeneous sources and learning similarities between different pieces of evidence [Ganea & Hofmann, 2017; Gupta et al., 2017; Sil et al., 2018]. Compared with traditional statistical methods, the main advantage of deep learning approaches is that training is end-to-end and relevant features need not be defined manually. Another advantage is that deep learning can learn task-specific representations that connect different modalities, different types of information, and different languages, thus achieving better entity analysis performance. Current hot topics include how to integrate knowledge (e.g., linguistic structural constraints, knowledge-base structure) into deep learning methods to guide them, how to model constraints across multiple tasks, and how to use deep learning to address the low-resource problem in entity recognition (e.g., building language-independent entity recognizers).
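
  A minimal sketch of the shared-representation idea behind these neural linkers: a mention's context and an entity's description are both encoded into the same vector space (here with mean-pooled embeddings, a deliberately simple stand-in for the richer encoders in the cited work), and candidates are ranked by dot-product similarity. All dimensions and inputs are toy assumptions:

```python
# Toy neural entity linking scorer: shared embedding space + dot product.
# Mean-pooled word embeddings stand in for the richer encoders in the literature.
import torch
import torch.nn as nn

class SharedSpaceLinker(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.ctx_proj = nn.Linear(dim, dim)   # projects mention context
        self.ent_proj = nn.Linear(dim, dim)   # projects entity description

    def encode(self, token_ids, proj):
        return proj(self.emb(token_ids).mean(dim=1))  # (batch, dim)

    def forward(self, context_ids, entity_ids):
        c = self.encode(context_ids, self.ctx_proj)
        e = self.encode(entity_ids, self.ent_proj)
        return (c * e).sum(dim=-1)  # one similarity score per (context, entity) pair

model = SharedSpaceLinker(vocab_size=1000)
context = torch.randint(0, 1000, (2, 8))      # two (context, candidate) pairs
candidates = torch.randint(0, 1000, (2, 12))  # scored in one batch
print(model(context, candidates))             # higher score = better candidate
```

  In practice such a scorer would be trained end-to-end, for example with a ranking loss that pushes the gold candidate above corrupted ones.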

Text mining methods:

  Traditional statistical methods and deep learning methods both require large training corpora and predefined target entity categories, so they cannot handle open entity analysis tasks in a big-data environment. Besides unstructured text, the Web also contains many high-quality semi-structured data sources, such as Wikipedia, web tables, lists, and search engine query logs. These structures often carry rich semantic information. Therefore, knowledge harvesting from semi-structured Web data sources, such as extracting entity knowledge from large-scale knowledge-sharing communities (e.g., Baidu Baike, Hudong Baike, Wikipedia), typically relies on text mining methods. Representative text-mining-based extraction systems include DBpedia [Auer et al., 2007], YAGO [Suchanek & Kasneci, 2008], BabelNet, NELL, and Kylin. The core of text mining methods is to construct specific rules that mine entities from specific structures (e.g., lists, infoboxes). Since the rules themselves may carry uncertainty and ambiguity, and the target structures may contain noise, text mining methods usually score and filter the extracted semantic knowledge with dedicated algorithms. In addition, structured data sources have been found to cover only a limited set of entity categories and to lack coverage of long-tail categories; entity acquisition techniques therefore often adopt a bootstrapping strategy, exploiting the redundancy of big data to extract entities of a specified type from the open Web. Representative systems of this kind include Snowball [Agichtein & Gravano, 2000] and TextRunner.
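
  A minimal sketch of the rule-based mining step, applied to a toy wiki-style infobox; the pattern and the simple length filter are placeholders for the scoring and filtering algorithms mentioned above:

```python
# Toy rule-based knowledge harvesting from a wiki-style infobox (illustrative).
import re

infobox_text = """
{{Infobox company
| name       = Apple Inc.
| industry   = Consumer electronics
| founded    = 1976
| founder    = Steve Jobs
}}
"""

# Rule: each "| attribute = value" line yields an (entity, attribute, value) triple.
PATTERN = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)

def harvest(entity, text, min_value_len=2):
    triples = []
    for attr, value in PATTERN.findall(text):
        # Simple filtering step: drop empty or obviously noisy values.
        if len(value) >= min_value_len:
            triples.append((entity, attr, value))
    return triples

for triple in harvest("Apple Inc.", infobox_text):
    print(triple)
```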

  The main problem in open entity set expansion is semantic drift, and in recent years most work has focused on this issue. Specific techniques include mutually exclusive Bootstrapping, Co-Training, and Co-Bootstrapping. Text mining methods only extract knowledge from corpora that are readily available and clearly structured, so the quality of the extracted knowledge is usually high. However, mining structured data alone cannot cover most of human semantic knowledge: first, the vast majority of knowledge in structured data sources is popular, high-frequency knowledge, with insufficient coverage of long-tail knowledge; second, existing structured data sources cover only a limited set of semantic categories, which is still far from sufficient compared with the full range of human knowledge.
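
  A minimal sketch of a plain bootstrapping loop and the kind of filtering step used to limit semantic drift (not the mutually exclusive variant named above); the corpus, seed set, pattern template, and support threshold are all toy assumptions:

```python
# Toy bootstrapping loop for open entity set expansion (illustrative).
# A support threshold stands in for the drift-control scoring used in practice.
import re
from collections import Counter

corpus = [
    "cities such as Beijing and Shanghai attract many visitors",
    "cities such as Paris and London host major events",
    "fruits such as apples and oranges are healthy",
]
seeds = {"Beijing", "Paris"}

def patterns_from_seeds(corpus, seeds):
    """Induce simple '... such as SEED' patterns from sentences containing seeds."""
    pats = set()
    for sent in corpus:
        for seed in seeds:
            if seed in sent:
                prefix = sent.split(seed)[0].strip()
                if prefix.endswith("such as"):
                    pats.add(prefix)  # e.g. "cities such as"
    return pats

def extract_candidates(corpus, pats):
    counts = Counter()
    for sent in corpus:
        for pat in pats:
            m = re.search(re.escape(pat) + r"\s+(\w+) and (\w+)", sent)
            if m:
                counts.update(m.groups())
    return counts

for _ in range(2):  # a couple of bootstrapping iterations
    pats = patterns_from_seeds(corpus, seeds)
    candidates = extract_candidates(corpus, pats)
    # Filtering step against semantic drift: keep only well-supported candidates.
    seeds |= {c for c, n in candidates.items() if n >= 1}

print(sorted(seeds))  # the fruit sentence is never matched by the city patterns
```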

  Therefore, one important direction for text mining methods is how to combine the advantages of text mining methods (which target semi-structured data and extract high-quality but low-coverage knowledge) and text extraction methods (which target unstructured text and extract high-coverage but, compared with mining methods, lower-quality knowledge), integrate knowledge from different data sources, and fuse it with existing large-scale knowledge bases [Nakashole et al., 2012].

4. Technical outlook and trends

  Surveying the current state and development trends of entity recognition and linking research, we believe its main development directions are as follows:

  1. Deep learning models that integrate prior knowledge

  In recent years, deep learning models have made considerable progress on entity recognition and linking tasks and have demonstrated considerable potential and advantages. However, the success of current deep learning models still relies on large amounts of training data, and these models lack designs targeted at the specific task. Traditional statistical models have demonstrated the effectiveness of many kinds of prior knowledge for entity recognition and linking, such as sentence structure, linguistic knowledge, task-specific constraints, and knowledge-base features. How to integrate such prior knowledge into deep learning models in a targeted way is an effective means of improving existing deep models. On the other hand, current deep models remain black boxes during entity analysis, which leads to weak interpretability and makes it difficult to build models incrementally. How to build interpretable, incremental deep learning models is a problem worth solving in the future.

  2. Entity analysis techniques for low-resource settings

  Currently, most entity analysis research focuses on building more accurate models and methods, which are generally oriented toward predefined entity types and learn model parameters from annotated training corpora. However, when building information extraction systems in real-world settings, these supervised methods tend to have the following shortcomings:

  1) existing supervised models often suffer significant performance degradation when the type of corpus changes;

  2) existing supervised models cannot analyze entities outside the target categories;

  3) improving the performance of existing supervised models relies on large-scale training corpora.

  To solve these problems, how to build entity analysis systems in low-resource settings is a core issue for putting the related technologies into practice. Related research directions include: transfer learning techniques that make full use of existing training corpora; self-learning techniques for building high-performance, lifelong-learning information extraction systems with little human intervention; incremental learning techniques that automatically reuse existing information extraction modules, so that resources can be added gradually without retraining from scratch; and unsupervised, semi-supervised, and knowledge-supervised techniques that explore effective learning beyond existing supervised learning and overcome the annotated-data bottleneck.

  3. Scalable, open-domain entity analysis

  More and more fundamental tasks and application tasks require the support of entity recognition and linking technology. This requires entity analysis techniques that can handle the challenges posed by a wide variety of contexts and achieve good performance in open environments. However, existing entity analysis systems are often built for news text and lack research on other scenarios (different text types such as microblogs, comments, and list pages; different contexts such as multi-modal contexts, short-text contexts, and database contexts). Therefore, one development direction for entity analysis is to build scalable entity analysis techniques for open domains, including:

1) Scalability in data size: information extraction systems need to efficiently handle massive amounts of data to be extracted;

2) Scalability in data source types: information extraction systems need to achieve robust performance across different types of data sources;

3) Scalability in domain: information extraction systems need to migrate easily from one domain to another;

4) Scalability in context: entity analysis systems need to handle different contexts and adapt to each specific context to improve performance.

 
