Knowledge Graphs: An Overview of Rule-Based Knowledge Extraction

Adapted from Ah Heng Xu, with some of my own understanding and comments added.

Introduction

The "knowledge" in knowledge extraction is usually clear, factual information drawn from different sources and structures, and the extraction method varies with the data source. For structured data, knowledge is acquired through D2R, and the difficulty lies in handling complex tables (nested tables, multi-column tables, foreign keys, and so on). For linked data, knowledge is acquired through graph mapping, and the challenge is data alignment. For semi-structured data, knowledge is acquired with wrappers, and the difficulty lies in generating, updating, and maintaining the wrappers automatically. This article is mainly about acquiring knowledge from text, that is, information extraction in the broad sense.

The three most important and most discussed sub-tasks of information extraction:

Entity extraction
Also known as named entity recognition. It comprises detecting entities (find) and classifying them (classify).
Relation extraction
Usually what we call triple extraction: a predicate with two arguments, such as Founding-location(IBM, New York).
Event extraction
Equivalent to extracting an n-ary relation (not covered here).
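
As a minimal sketch, a triple like the one above could be represented in code as follows. The `Triple` class and its field names are illustrative assumptions, not something the original article defines.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A binary relation: one predicate with two arguments."""
    predicate: str
    arg1: str
    arg2: str

    def __str__(self) -> str:
        return f"{self.predicate}({self.arg1}, {self.arg2})"


# The example from the text.
founding = Triple("Founding-location", "IBM", "New York")
print(founding)  # Founding-location(IBM, New York)
```

An event (an n-ary relation) would need more than two argument slots, which is one reason it is usually treated as a separate task.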

Entity extraction / named entity recognition (NER)

Entity extraction, or named entity recognition (NER), plays an important role in information extraction. It mainly extracts atomic pieces of information from text, such as person names, organization/institution names, locations, events/dates, character values, monetary amounts, and so on. The task has two key words, find and classify: find the named entities, then classify them.
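
To make "find and classify" concrete, here is a small rule-based sketch. The regular expressions, the gazetteer entries, and the label names (`DATE`, `MONEY`, `ORG`, `LOC`) are all assumptions chosen for illustration; a real system would need far broader coverage.

```python
import re

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?"),
}
# A tiny hand-made gazetteer (dictionary of known names).
GAZETTEER = {"IBM": "ORG", "New York": "LOC"}


def find_entities(text):
    """Find entity mentions and classify them.

    Returns (start, end, surface, label) tuples sorted by position.
    """
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), m.group(), label))
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), name, label))
    return sorted(found)


text = "IBM paid $3.5 for the license in New York on 2020-01-07."
for start, end, surface, label in find_entities(text):
    print(f"{surface!r} -> {label}")
```

The "find" step is the pattern or dictionary match, and the "classify" step is the label attached to each pattern or gazetteer entry.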

Main applications:

  • Named entities serve as index terms and hyperlinks
  • A preparatory step for sentiment analysis: companies and products need to be identified first so that sentiment words can be attributed to them
  • A preparatory step for relation extraction
  • QA systems: most answers are named entities

Traditional machine learning methods

Standard procedures:
Training:

1. Collect representative training documents
2. Label each token (a word or phrase, in my understanding) with its named entity class; tokens that belong to no entity are labeled O (Other)
3. Design feature extractors suited to the text and the classes
4. Train a sequence classifier to predict the labels (classes such as person, location, etc.)

Testing:

1. Collect test documents
2. Run the sequence classifier to label each token
3. Output the named entities (NE)
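
The last testing step, turning per-token labels into named entities, can be sketched as follows. This assumes a simple labeling scheme in which every token carries either a class label or `O`; the labels in the example are written by hand and stand in for real classifier output, and the function name `decode_entities` is hypothetical.

```python
def decode_entities(tokens, labels):
    """Group consecutive tokens with the same non-O label into entities."""
    entities = []
    current_tokens, current_label = [], "O"
    for token, label in zip(tokens, labels):
        if label != current_label:
            # The label changed: close the entity in progress, if any.
            if current_label != "O":
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], label
        if label != "O":
            current_tokens.append(token)
    # Close an entity that runs to the end of the sequence.
    if current_label != "O":
        entities.append((" ".join(current_tokens), current_label))
    return entities


tokens = ["IBM", "was", "founded", "in", "New", "York"]
labels = ["ORG", "O", "O", "O", "LOC", "LOC"]  # hand-written stand-in labels
print(decode_entities(tokens, labels))
# [('IBM', 'ORG'), ('New York', 'LOC')]
```

One limitation of this simple scheme is that two adjacent entities of the same class are merged into one; schemes that mark entity beginnings explicitly avoid this.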

Feature selection (features for sequence labeling)
Let's look at one of the more important features.
Word substrings
Word substrings (including prefixes and suffixes) are highly informative. For example, an NE with 'oxa' in the middle is very likely a drug, an NE with ':' in the middle is most likely a movie, and an NE ending in 'field' is often a place.
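
A minimal sketch of such a feature extractor is shown below. The function name `token_features`, the feature names, and the example words are assumptions made for illustration; the resulting dictionary is the kind of input a sequence classifier could consume.

```python
def token_features(tokens, i, max_affix=5):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "word": lower,
        "is_capitalized": word[:1].isupper(),
        "has_colon": ":" in word,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }
    # Prefixes and suffixes of length 1..max_affix.
    for n in range(1, max_affix + 1):
        if len(lower) >= n:
            feats[f"prefix{n}"] = lower[:n]
            feats[f"suffix{n}"] = lower[-n:]
    # Every character trigram inside the word.
    for j in range(len(lower) - 2):
        feats[f"has_sub={lower[j:j + 3]}"] = True
    return feats


drug = token_features(["Cotrimoxazole"], 0)
place = token_features(["in", "Wethersfield"], 1)
print("has_sub=oxa" in drug)  # True
print(place["suffix5"])       # field
```

Using all substrings produces many sparse features, so in practice the length limits would be tuned to the data.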

Entity linking and disambiguation
After entity recognition is complete, the entities need to be normalized. For example, Wanda Group, Dalian Wanda Group, and Wanda Group Co., Ltd. actually refer to the same entity and can be merged.
The main steps are as follows:

1. Entity recognition
Named entity recognition, dictionary matching

2. Candidate entity generation
Surface-name expansion, search engines, querying an entity reference table

3. Candidate entity disambiguation
Graph-based methods, probabilistic generative models, topic models, deep learning
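
The three steps above can be sketched with a toy example. The alias table, the entity IDs, and the context keywords below are all made up for illustration, and the word-overlap score is a deliberately simple stand-in for the disambiguation methods listed in step 3, not an implementation of any of them.

```python
# Hypothetical alias table: surface form -> candidate entity IDs.
ALIASES = {
    "wanda group": ["Dalian_Wanda_Group"],
    "dalian wanda group": ["Dalian_Wanda_Group"],
    "wanda group co., ltd.": ["Dalian_Wanda_Group"],
    "wanda": ["Dalian_Wanda_Group", "Wanda_(given_name)"],
}
# Hypothetical context keywords describing each entity.
ENTITY_CONTEXT = {
    "Dalian_Wanda_Group": {"company", "estate", "cinema", "dalian"},
    "Wanda_(given_name)": {"name", "she", "her", "born"},
}


def generate_candidates(mention):
    """Step 2: look the normalized mention up in the alias table."""
    return ALIASES.get(mention.lower().strip(), [])


def disambiguate(mention, context):
    """Step 3: pick the candidate sharing the most words with the context."""
    candidates = generate_candidates(mention)
    if not candidates:
        return None
    context_words = set(context.lower().split())
    return max(candidates, key=lambda e: len(ENTITY_CONTEXT[e] & context_words))


for mention in ["Wanda Group", "Dalian Wanda Group", "Wanda Group Co., Ltd."]:
    print(mention, "->", generate_candidates(mention))
print(disambiguate("Wanda", "Wanda opened a new cinema in Dalian"))
```

All three surface forms from the text resolve to the same entity ID, which is what allows them to be merged.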

Relation extraction

Relation extraction extracts the semantic relations between two or more entities from text. The main categories of methods are as follows:

Template-based (hand-written patterns), also referred to as rule-based

  • Based on trigger words / strings (patterns)
  • Based on dependency grammar (rules are built with the verb as the starting point, with constraints defined on the part of speech of the nodes and on the dependency edges)
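
Both kinds of hand-written rule can be sketched in a few lines. Everything here is an illustrative assumption: the regular expression, the function names, and the hand-written dependency parse, which stands in for real parser output and uses Stanford-style relation labels (`nsubjpass`, `prep`, `pobj`). For simplicity, "New York" is treated as a single node.

```python
import re

# Trigger-word rule: the string "was founded in" signals Founding-location.
FOUNDED_IN = re.compile(
    r"(?P<org>[A-Z]\w*(?: [A-Z]\w*)*) was founded in "
    r"(?P<loc>[A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_by_trigger(sentence):
    return [("Founding-location", m.group("org"), m.group("loc"))
            for m in FOUNDED_IN.finditer(sentence)]


# Hand-written parse of "IBM was founded in New York":
# part-of-speech tags for nodes and (head, relation, dependent) edges.
POS = {"IBM": "PROPN", "was": "AUX", "founded": "VERB",
       "in": "ADP", "New York": "PROPN"}
EDGES = [
    ("founded", "nsubjpass", "IBM"),
    ("founded", "auxpass", "was"),
    ("founded", "prep", "in"),
    ("in", "pobj", "New York"),
]


def children(edges, head, relation):
    return [dep for h, rel, dep in edges if h == head and rel == relation]


def extract_by_dependency(edges, pos):
    """Start from the verb, then constrain edges and node parts of speech."""
    triples = []
    if pos.get("founded") != "VERB":
        return triples
    for subj in children(edges, "founded", "nsubjpass"):
        for prep in children(edges, "founded", "prep"):
            if prep != "in":
                continue
            for obj in children(edges, prep, "pobj"):
                if pos.get(subj) == "PROPN" and pos.get(obj) == "PROPN":
                    triples.append(("Founding-location", subj, obj))
    return triples


print(extract_by_trigger("IBM was founded in New York in 1911."))
print(extract_by_dependency(EDGES, POS))
```

The trigger rule only fires on the exact surface string, while the dependency rule depends on the parse structure rather than word order, so it tolerates intervening words as long as the parser attaches them correctly.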
Summary
Advantages of hand-written rules:
  • Hand-written rules have high precision
  • They can be tailored to specific domains
  • They are easy to implement on small datasets and simple to build

Disadvantages:

  • Low recall
  • Templates must be built by domain experts; covering all possible patterns comprehensively is difficult, time-consuming, and laborious
  • A pattern must be defined for each relation
  • Difficult to maintain
  • Poor portability
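
The precision and recall trade-off mentioned above can be made concrete with a small calculation. The gold and extracted triples below are made-up placeholders, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted, recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up gold standard: four true relations in some corpus.
gold = {
    ("Founding-location", "OrgA", "CityA"),
    ("Founding-location", "OrgB", "CityB"),
    ("Founding-location", "OrgC", "CityC"),
    ("Founding-location", "OrgD", "CityD"),
}
# A narrow hand-written pattern finds only one of them, but correctly.
extracted = {("Founding-location", "OrgA", "CityA")}

print(precision_recall(extracted, gold))  # (1.0, 0.25)
```

Everything the rule extracts is correct (precision 1.0), but it misses three of the four relations (recall 0.25), which is the typical profile of hand-written patterns.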

Machine learning methods (not covered in detail in this article)

Supervised learning (supervised machine learning)

  • Machine Learning
  • Deep learning (Pipeline vs Joint Model)

Supervised learning - Summary
If the test set is very similar to the training set, supervised learning achieves high accuracy. However, it generalizes poorly across genres, the models are relatively brittle, and it is hard to extend to new relations. In addition, obtaining a sufficiently large training set is expensive.

Semi-supervised / unsupervised learning

  • Bootstrapping
  • Distant supervision
  • Unsupervised learning from the web

Origin blog.csdn.net/qq_39304851/article/details/103859772