Reposted from Ah Heng Xu, with some of my own understanding and comments added
Introduction
The "knowledge" in knowledge extraction is usually clear, factual information. It comes from different sources and structures, and the extraction method varies with the data source. When knowledge is acquired from structured data (D2R), the difficulty lies in handling complex tables, including nested tables, multi-column tables, foreign keys, and so on. When it is acquired from linked data (graph mapping), the challenge is data alignment. When it is acquired from semi-structured data with wrappers, the difficulty lies in generating, updating, and maintaining the wrappers automatically. This article is mainly about acquiring knowledge from text, that is, information extraction in the broad sense.
The three most important and most discussed sub-tasks of information extraction are:
Entity extraction
that is, named entity recognition, which consists of detecting entities (find) and classifying them (classify)
Relation extraction
usually what we call triple extraction: a predicate with two arguments, such as Founding-location(IBM, New York)
Event extraction
equivalent to extracting an n-ary relation (not covered here)
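As a small illustration of the triple form described above, a relation can be stored as one predicate plus two arguments, and an event as one predicate with any number of named roles. The names `Triple` and `format_triple`, and the role names in the event, are assumptions made for this sketch and do not come from the original.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A binary relation: one predicate with two arguments."""
    predicate: str
    arg1: str
    arg2: str


def format_triple(t: Triple) -> str:
    """Render a triple in the predicate(arg1, arg2) notation."""
    return f"{t.predicate}({t.arg1}, {t.arg2})"


founding = Triple("Founding-location", "IBM", "New York")
print(format_triple(founding))  # Founding-location(IBM, New York)

# An event behaves like an n-ary relation: one predicate, many roles.
# The role names below are hypothetical and only for illustration.
event = {
    "predicate": "Founding",
    "roles": {"organization": "IBM", "location": "New York"},
}
```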
Entity extraction / named entity recognition (NER)
Entity extraction, or named entity recognition (NER), plays an important role in information extraction. It mainly extracts the atomic information elements in text, such as person names, organization / institution names, locations, times / dates, numeric values, monetary amounts, and so on. The task has two key words: find and classify, that is, find the named entities and then classify them.
Main applications:
Named entities can serve as indexes and hyperlinks
A preparatory step for sentiment analysis, which needs to identify companies and products before sentiment words can be attributed to them
A preparatory step for relation extraction
QA systems, where most answers are named entities
Traditional machine learning methods
Standard procedures:
Training:
1. Collect representative training documents
2. Label each token (a word or phrase, in my understanding) with a named entity class; tokens that belong to no entity are labeled O (Other)
3. Design feature extractors suited to the text and the classes
4. Train a sequence classifier to predict the labels of the data (class, person, place, etc.)
Testing:
1. Collect test documents
2. Run the sequence classifier to label each token
3. Output the named entities (NE)
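The training and testing steps above can be sketched end to end. The original does not say which sequence classifier is used, so this sketch swaps in a greedy perceptron tagger that feeds the previous label back in as a feature. The training sentences, the feature set, and all names (`token_features`, `GreedyPerceptronTagger`, `group_entities`) are assumptions made for illustration only.

```python
from collections import defaultdict


def token_features(tokens, i, prev_label):
    """Features for token i; prev_label makes the classifier sequence-aware."""
    word = tokens[i]
    return [
        "bias",
        "word=" + word.lower(),
        "is_capitalized=" + str(word[0].isupper()),
        "suffix3=" + word[-3:].lower(),
        "prev_label=" + prev_label,
    ]


class GreedyPerceptronTagger:
    def __init__(self, labels):
        self.labels = sorted(labels)
        self.weights = defaultdict(float)  # (label, feature) -> weight

    def _best_label(self, features):
        def score(label):
            return sum(self.weights[(label, f)] for f in features)
        return max(self.labels, key=score)

    def train(self, sentences, max_epochs=200):
        for _ in range(max_epochs):
            mistakes = 0
            for tokens, gold in sentences:
                prev = "<START>"
                for i, gold_label in enumerate(gold):
                    feats = token_features(tokens, i, prev)
                    guess = self._best_label(feats)
                    if guess != gold_label:
                        mistakes += 1
                        for f in feats:
                            self.weights[(gold_label, f)] += 1.0
                            self.weights[(guess, f)] -= 1.0
                    prev = gold_label  # use the gold label while training
            if mistakes == 0:  # every training token is labeled correctly
                break

    def predict(self, tokens):
        labels, prev = [], "<START>"
        for i in range(len(tokens)):
            prev = self._best_label(token_features(tokens, i, prev))
            labels.append(prev)
        return labels


def group_entities(tokens, labels):
    """Merge consecutive tokens that share a non-O label into one entity."""
    entities, current, current_label = [], [], "O"
    for token, label in zip(tokens, labels):
        if label != current_label:
            if current_label != "O":
                entities.append((" ".join(current), current_label))
            current, current_label = [], label
        current.append(token)
    if current_label != "O":
        entities.append((" ".join(current), current_label))
    return entities


# Training steps 1 and 2: a toy labeled corpus (O = not part of any entity).
TRAIN = [
    (["Tim", "Cook", "visited", "Paris"], ["PER", "PER", "O", "LOC"]),
    (["Mary", "Smith", "lives", "in", "London"], ["PER", "PER", "O", "O", "LOC"]),
]

tagger = GreedyPerceptronTagger({"PER", "LOC", "O"})
tagger.train(TRAIN)  # training steps 3 and 4

test_tokens = ["Mary", "Smith", "visited", "Paris"]  # testing step 1
predicted = tagger.predict(test_tokens)              # testing step 2
print(group_entities(test_tokens, predicted))        # testing step 3
```

With labels of this kind (no begin/inside distinction), two adjacent entities of the same class are merged into one, which is one reason real systems often use a richer encoding.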
Feature selection (features for sequence labeling)
Let's look at one of the more important features.
Word substrings
Word substrings (including prefixes and suffixes) are very informative. For example, a named entity (NE) with 'oxa' in the middle is very likely a drug, an NE with ':' in the middle is most likely a movie, and an NE ending in 'field' is often a place.
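A minimal sketch of how such substring features might be computed for a candidate entity string. The function name `substring_features`, the feature naming scheme, and the example strings are assumptions chosen for illustration; a real system would pick affix lengths and n-gram sizes by experiment.

```python
def substring_features(entity, max_affix=5, ngram=3):
    """Return prefix, suffix, and inner-substring features of an entity."""
    text = entity.lower()
    feats = set()
    # Prefixes and suffixes up to max_affix characters.
    for n in range(1, min(max_affix, len(text)) + 1):
        feats.add("prefix=" + text[:n])
        feats.add("suffix=" + text[-n:])
    # Every character n-gram, so inner strings such as 'oxa' are captured.
    for i in range(len(text) - ngram + 1):
        feats.add("contains=" + text[i:i + ngram])
    # Punctuation inside the name can be a strong cue on its own.
    if ":" in text:
        feats.add("contains_colon")
    return feats


drug = substring_features("Cotrimoxazole")
movie = substring_features("Alien Fury: Countdown to Invasion")
place = substring_features("Wethersfield")

print("contains=oxa" in drug)     # True
print("contains_colon" in movie)  # True
print("suffix=field" in place)    # True
```

These string features would be appended to the per-token feature list that the sequence classifier consumes.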
Entity linking and disambiguation
After entity recognition is complete, the entities need to be normalized. For example, Wanda Group, Dalian Wanda Group, and Wanda Group Co., Ltd. actually refer to the same entity and can be merged.
The main steps are as follows:
1. Entity recognition
named entity recognition, dictionary matching
2. Candidate entity generation
surface-name expansion, search engines, lookup in an entity reference table
3. Candidate entity disambiguation
graph-based methods, probabilistic generative models, topic models, deep learning
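The three steps above can be sketched on a toy scale. This is only an illustration under stated assumptions: the knowledge base `KB`, the alias table `ALIASES`, and the entity IDs are invented, and the disambiguation step swaps in a simple context word overlap score instead of any of the graph, probabilistic, topic-model, or deep learning methods listed.

```python
import re

# Hypothetical knowledge base: entity ID -> name and typical context words.
KB = {
    "E_WANDA": {"name": "Dalian Wanda Group",
                "context": {"dalian", "cinema", "property", "company"}},
    "E_APPLE_INC": {"name": "Apple Inc.",
                    "context": {"iphone", "technology", "company"}},
    "E_APPLE_FRUIT": {"name": "apple (fruit)",
                      "context": {"fruit", "tree", "eat"}},
}

# Hypothetical entity reference table: surface form -> candidate entity IDs.
ALIASES = {
    "wanda group": ["E_WANDA"],
    "dalian wanda group": ["E_WANDA"],
    "wanda group co., ltd.": ["E_WANDA"],
    "apple": ["E_APPLE_INC", "E_APPLE_FRUIT"],
}


def find_mentions(text):
    """Step 1: dictionary matching, longest surface form first."""
    lowered = text.lower()
    taken = [False] * len(lowered)
    mentions = []
    for surface in sorted(ALIASES, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, lowered):
            if not any(taken[m.start():m.end()]):
                taken[m.start():m.end()] = [True] * (m.end() - m.start())
                mentions.append((m.start(), surface))
    return [surface for _, surface in sorted(mentions)]


def link_entities(text):
    """Steps 2 and 3: look up candidates, keep the best context match."""
    context = set(re.findall(r"\w+", text.lower()))
    links = []
    for surface in find_mentions(text):
        candidates = ALIASES[surface]  # step 2: candidate generation
        best = max(candidates,         # step 3: disambiguation by overlap
                   key=lambda c: len(KB[c]["context"] & context))
        links.append((surface, best))
    return links


print(link_entities("Dalian Wanda Group opened a new cinema"))
print(link_entities("She bought an Apple iPhone"))
print(link_entities("I like to eat an apple from the tree"))
```

All three Wanda surface forms resolve to the same ID, which is the normalization described above, while the two "apple" sentences resolve to different IDs based on their context words.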
Relation extraction
Relation extraction extracts the semantic relations between two or more entities from text. The main categories of methods are:
Template-based (hand-written patterns), also referred to as rule-based
- Based on trigger words / strings (patterns)
- Based on dependency syntax (rules are built with the verb as the starting point, with constraints defined on the part of speech of the nodes and on the dependency edges)
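A minimal sketch of the trigger-word variety, assuming entities can be approximated by runs of capitalized words. The relation names, trigger phrases, the `ENTITY` expression, and the function `extract_relations` are all assumptions made for illustration. The dependency-based variety needs a syntactic parser and is not shown.

```python
import re

# Crude stand-in for an entity: one or more capitalized words.
ENTITY = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"

# One hand-written pattern per relation, keyed on a trigger phrase.
PATTERNS = [
    ("Founding-location",
     re.compile(rf"(?P<arg1>{ENTITY}) was (?:founded|established) in (?P<arg2>{ENTITY})")),
    ("Acquisition",
     re.compile(rf"(?P<arg1>{ENTITY}) acquired (?P<arg2>{ENTITY})")),
]


def extract_relations(text):
    """Return (relation, arg1, arg2) for every pattern match in the text."""
    triples = []
    for relation, pattern in PATTERNS:
        for m in pattern.finditer(text):
            triples.append((relation, m.group("arg1"), m.group("arg2")))
    return triples


print(extract_relations("IBM was founded in New York."))
# [('Founding-location', 'IBM', 'New York')]
print(extract_relations("Google acquired YouTube in 2006."))
# [('Acquisition', 'Google', 'YouTube')]
print(extract_relations("IBM makes computers."))
# []
```

Each new relation needs its own pattern, and any phrasing the patterns do not anticipate is silently missed.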
Summary
Advantages of hand-written rules:
- Hand-written rules have high precision
- Can be tailored to specific domains
- Easy to implement on small datasets and simple to build
Disadvantages:
- Low recall
- Templates must be built by domain experts; covering all possible patterns is difficult and takes a lot of time and effort
- A pattern has to be defined for each relation
- Difficult to maintain
- Poor portability
Machine learning methods (not detailed in this article)
Supervised learning (supervised machine learning)
- Machine Learning
- Deep learning (Pipeline vs Joint Model)
Supervised learning - Summary
If the test set is very similar to the training set, supervised learning achieves high accuracy. However, it generalizes poorly to different genres, and the models are hard to extend to new relations. In addition, obtaining a sufficiently large training set is expensive.
Semi-supervised / unsupervised learning (semi-supervised and unsupervised)
- Bootstrapping
- Distant supervision
- Unsupervised learning from the web
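Of the approaches listed, distant supervision is the easiest to sketch: known triples from a knowledge base are matched against raw sentences, and any sentence that mentions both arguments is taken as a (possibly noisy) training example for that relation. The triples, sentences, and the function name `distant_labels` below are assumptions made for illustration.

```python
# Hypothetical seed knowledge: (relation, arg1, arg2).
KB_TRIPLES = [
    ("Founding-location", "IBM", "New York"),
]

SENTENCES = [
    "IBM was founded in New York.",
    "IBM opened a new office in New York.",  # mentions both, but a different fact
    "IBM builds mainframes.",
]


def distant_labels(sentences, kb_triples):
    """Label every sentence that contains both arguments of a known triple."""
    labeled = []
    for sentence in sentences:
        for relation, arg1, arg2 in kb_triples:
            if arg1 in sentence and arg2 in sentence:
                labeled.append((sentence, arg1, arg2, relation))
    return labeled


training_examples = distant_labels(SENTENCES, KB_TRIPLES)
for example in training_examples:
    print(example)
```

The second sentence is labeled Founding-location even though it does not express that relation. This kind of label noise is the price paid for avoiding manual annotation.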