NLP: Information Extraction



Information Extraction steps


  • Named Entity Recognition (NER)
  • Relation Extraction

Named Entity Recognition (NER)


Typical Entity Tags


  • Named entity recognition can run into ambiguity: the same word may be an entity of different types, or not an entity at all (POS tagging has the same kind of problem).
  • To resolve the ambiguity, we need context.
  • Although an HMM works for POS tagging, it is not well suited to NER:

A Hidden Markov Model (HMM) is a model for sequence data that performs well on some NLP tasks, such as part-of-speech tagging. However, an HMM has limitations that can make it perform poorly when resolving ambiguity in named entity recognition. The reasons are as follows:

  • First-order Markov property: An HMM assumes that each state in the sequence (in NER, the entity label) depends only on the previous state, so it cannot directly capture context more than one step away. In practice, a word's entity type may depend on a wider context. For example, "Apple" is a fruit in "I like to eat Apple" but a company in "I like to use Apple products". Telling the two apart takes the surrounding words ("eat", "products"), and an HMM that conditions each label only on the previous label cannot use that information directly.

  • Cannot handle complex features: An HMM usually relies on simple features, such as the current word and the previous label. NER, however, may need richer features, such as a word's part of speech, its position, and its relationship to other words in the sentence. These features cannot be used directly in an HMM.

  • Independent output assumption: An HMM assumes that, given the sequence of hidden states, the observations (in NER, the words) are independent of one another. It therefore cannot directly model word-to-word dependencies, although real text often has strong dependencies between words.

IO tagging

  • We can use IO tagging to address the problem above.

In named entity recognition (NER), IO tagging is a common tagging strategy. It uses only two labels, I (Inside) and O (Outside), with the following meanings:

  • "I": Indicates that the word is part of a named entity.
  • "O": Indicates that the word is not part of any named entity.
    For example, in the sentence "Apple is based in California", if our goal is to identify organization names (ORG) and place names (GPE), then a possible IO labeling is "I-ORG O O O I-GPE".
  • But IO tagging has a disadvantage: it cannot mark where an entity begins or ends.
  • As an example, take two organization names that appear side by side, "Microsoft Apple". IO tagging marks them "I-ORG I-ORG", so it is impossible to tell whether this is one two-word entity or two separate entities. (When another word intervenes, as in "Apple and Microsoft", the "and" is tagged O and the two entities stay separate.)
  • Similarly, a single multi-word entity such as "San Francisco" is tagged "I-GPE I-GPE", which has the same shape as the case above, so the tags alone cannot distinguish one entity from two.
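
The ambiguity described above can be made concrete with a minimal sketch. The helper `io_spans` is a hypothetical name introduced here, not part of any library; it assumes tags of the form "I-TYPE" or "O" and groups each maximal run of identical tags into one entity, which is the most that IO tags allow.

```python
def io_spans(tokens, tags):
    """Group maximal runs of identical non-O tags into (text, type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        ent_type = tag[2:] if tag != "O" else None  # "I-ORG" -> "ORG"
        # A change of type (or a switch to O) closes the open span.
        if ent_type != current_type and current:
            spans.append((" ".join(current), current_type))
            current = []
        if ent_type is not None:
            current.append(token)
        current_type = ent_type
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# One two-word entity and two adjacent one-word entities get the same tag shape,
# so the decoder has to treat both as a single span.
one_entity = io_spans(["San", "Francisco"], ["I-GPE", "I-GPE"])
two_entities = io_spans(["Microsoft", "Apple"], ["I-ORG", "I-ORG"])
separated = io_spans(["Apple", "and", "Microsoft"], ["I-ORG", "O", "I-ORG"])
print(one_entity, two_entities, separated)
```

Here "Microsoft Apple" comes back as a single ORG span even though it was meant as two entities, while the version separated by "and" is split correctly.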

IOB tagging

IOB (or BIO) tagging is another common tagging strategy. It uses three labels, B (Begin), I (Inside), and O (Outside), with the following meanings:

  • "B": Indicates that the word is the beginning of a named entity.
  • "I": Indicates that the word is part of a named entity, but not the beginning.
  • "O": Indicates that the word is not part of any named entity.
    For example, in the sentence "Apple is based in California", if our goal is to identify organization names (ORG) and place names (GPE), then a possible IOB labeling is "B-ORG O O O B-GPE".

Compared with IO tagging, IOB tagging handles adjacent named entities of the same type better. For example, in the sentence "Apple and Google are tech companies," "Apple" and "Google" are each labeled "B-ORG". Because every entity starts with its own B tag, they would be recognized as two distinct entities even if no word separated them.

  • In other words, when two adjacent words belong to the same entity, as in "San Francisco", "San" is tagged B-GPE and "Francisco" is tagged I-GPE to show that they form one entity. When they are two unrelated entities, as in "Microsoft Apple", they are tagged B-ORG and B-ORG to represent two different entities.

For example, the sentence "Steve Jobs founded Apple Inc. in 1976" is tagged as follows:

Steve - B-PER
Jobs - I-PER
founded - O
Apple - B-ORG
Inc. - I-ORG
in - O
1976 - B-TIME
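
As a minimal sketch of how IOB tags are turned back into entities, the hypothetical helper `bio_spans` below (a name introduced for illustration) opens a new span at every B tag and extends it on matching I tags. It is lenient by assumption: a stray I tag with no open entity is treated as a start.

```python
def bio_spans(tokens, tags):
    """Decode IOB tags into (text, type) entity spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")  # "B-PER" -> ("B", "-", "PER")
        # An I tag of the same type extends the entity that is currently open.
        if prefix == "I" and current and ent_type == current_type:
            current.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current:
            spans.append((" ".join(current), current_type))
            current, current_type = [], None
        # B starts a new entity; a stray I is also treated as a start (assumption).
        if prefix in ("B", "I"):
            current, current_type = [token], ent_type
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = "Steve Jobs founded Apple Inc. in 1976".split()
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-TIME"]
entities = bio_spans(tokens, tags)
adjacent = bio_spans(["Microsoft", "Apple"], ["B-ORG", "B-ORG"])
print(entities)
print(adjacent)
```

Unlike IO tagging, the adjacent pair "Microsoft Apple" now decodes into two separate ORG entities.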


Neural Networks for NER


Neural networks have been applied to named entity recognition (NER) with considerable success. A common pipeline looks like this:

  • Preprocessing: Split the input text into words or tokens. Other preprocessing steps, such as lowercasing or stemming, may also be applied.

  • Word embedding: convert each word into a dense vector, which can capture the semantic information of the word. This can be done by using a pretrained word embedding model such as Word2Vec or GloVe. Pretrained models have been trained on a large amount of text, which can capture rich lexical knowledge.

  • Sequence encoding: use a neural network model that can handle sequence data, such as recurrent neural network (RNN), long short-term memory (LSTM), or gated recurrent unit (GRU), to process word embedding sequences. Such a model can capture the sequential relationship between words.

  • Decoding: For each word, predict its label from its contextual encoding. This is usually done by adding a fully connected layer and a softmax function on top of the sequence encoder. The softmax assigns each label a score representing the probability that the word belongs to that entity class.
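
The decoding step can be sketched in plain Python. This is only an illustration under stated assumptions: the label set `LABELS` and the per-token scores in `SCORES` are made up here and stand in for the output of the fully connected layer; no real encoder is involved.

```python
import math

# Hypothetical label set and per-token scores (one row per token, one column per label).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
TOKENS = ["Steve", "Jobs", "founded", "Apple"]
SCORES = [
    [0.1, 3.0, 0.5, 0.2, 0.1],
    [0.2, 0.4, 2.8, 0.1, 0.3],
    [3.5, 0.1, 0.2, 0.1, 0.1],
    [0.3, 0.2, 0.1, 2.9, 0.6],
]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(token_scores):
    """Pick the most probable label for each token independently."""
    result = []
    for scores in token_scores:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((LABELS[best], probs[best]))
    return result


for token, (label, prob) in zip(TOKENS, decode(SCORES)):
    print(f"{token}\t{label}\t{prob:.2f}")
```

Because each token is decided independently, this kind of decoder can produce an invalid sequence such as an I tag directly after O, which is one reason a CRF layer is often placed on top of the encoder instead.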

Relation Extraction

Relation extraction is a key task in natural language processing (NLP) whose goal is to identify and extract predefined relationships between entities in text. For example, in the sentence "Barack Obama was born in Hawaii.", we can extract the relation ("Barack Obama", "born in", "Hawaii").

Rule-based


Supervised Relation Extraction


  • In this example, NER has already identified all the entities, so the task now is to pair the entities up and run a binary classifier on each pair to decide whether a relation holds between them.
  • Positive means there is a relation; negative means there is none.
  • For the pairs that are related, we then classify the type of relation.
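
The two-stage procedure above might be sketched as follows. Everything here is an assumption for illustration: the entity list is taken as given from an earlier NER step, the relation names `founder_of` and `founded_in` are invented, and the lookup table `RELATED_TYPES` is a toy stand-in for the two trained classifiers.

```python
from itertools import permutations

# Entities as (text, type), assumed to come from an earlier NER step.
ENTITIES = [("Steve Jobs", "PER"), ("Apple Inc.", "ORG"), ("1976", "TIME")]

# Toy stand-in for trained classifiers: which (head type, tail type) pairs are
# related, and with which (hypothetical) relation label.
RELATED_TYPES = {("PER", "ORG"): "founder_of", ("ORG", "TIME"): "founded_in"}


def candidate_pairs(entities):
    """All ordered pairs of distinct entities; relations are directional."""
    return list(permutations(entities, 2))


def has_relation(head, tail):
    """Stage 1: binary decision, positive or negative."""
    return (head[1], tail[1]) in RELATED_TYPES


def relation_type(head, tail):
    """Stage 2: classify the relation for a positive pair."""
    return RELATED_TYPES[(head[1], tail[1])]


def extract(entities):
    triples = []
    for head, tail in candidate_pairs(entities):
        if has_relation(head, tail):
            triples.append((head[0], relation_type(head, tail), tail[0]))
    return triples


triples = extract(ENTITIES)
print(triples)
```

In a real system both stages would be learned from annotated sentences and would look at the words around the entity pair, not only at the entity types.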

Semi-supervised


  • Supervised learning needs a large annotated data set, but annotation is very expensive, so a semi-supervised method can be used to train the model instead.

Semi-supervised relation extraction trains on a small amount of labeled data together with a large amount of unlabeled data.

A common semi-supervised relation extraction method is bootstrapping. Bootstrapping starts by training on a small set of labeled data (seed instances) and then uses the trained model to predict relations in unlabeled data. These predictions are treated as correct and added to the training set, and the model is retrained on the updated training set. This process repeats until the model converges or a preset number of iterations is reached.

For example, suppose we have the seed instance "Barack Obama was born in Hawaii". We first train a model to recognize the "was born in" relation. We then use this model to predict relations in other sentences. If it predicts a "was born in" relation in "Steve Jobs was born in San Francisco", we add that instance to the training set and retrain.
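
The loop described above can be sketched with a simplified, pattern-based variant of bootstrapping. This swaps the trained model for literal text patterns: the "model" is just the set of strings found between known entity pairs. The regular expression `NAME` is a crude stand-in for NER, and all function names are hypothetical.

```python
import re

# Crude stand-in for NER: one or more capitalized words.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def learn_patterns(corpus, pairs):
    """Collect the text that appears between each known (head, tail) pair."""
    patterns = set()
    for sentence in corpus:
        for head, tail in pairs:
            m = re.search(re.escape(head) + r"(.+?)" + re.escape(tail), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Find every entity pair connected by one of the learned patterns."""
    found = set()
    for sentence in corpus:
        for pattern in patterns:
            m = re.search(NAME + re.escape(pattern) + NAME, sentence)
            if m:
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(corpus, seeds, max_iters=5):
    pairs = set(seeds)
    for _ in range(max_iters):
        patterns = learn_patterns(corpus, pairs)           # "retrain"
        new_pairs = apply_patterns(corpus, patterns) - pairs  # "predict"
        if not new_pairs:  # nothing new was found, so stop
            break
        pairs |= new_pairs  # predictions are trusted and added
    return pairs


corpus = [
    "Barack Obama was born in Hawaii.",
    "Steve Jobs was born in San Francisco.",
    "Apple was founded in Cupertino.",
]
result = bootstrap(corpus, {("Barack Obama", "Hawaii")})
print(sorted(result))
```

Starting from the single seed, the loop learns the pattern " was born in " and picks up the Steve Jobs sentence, while the "founded in" sentence is left alone because no learned pattern matches it.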

Semantic Drift

Semantic drift is a common problem in semi-supervised learning, especially in bootstrapping, because the model updates its training data at every iteration by adding its own predictions to the training set.

Semantic drift refers to the situation when the model starts to learn one relation from the initial seed instance, and then gradually starts to learn other relations different from the initial relation. This is usually caused by model misprediction, i.e. the model mistakenly believes that an instance represents the target relationship and adds it to the training set, thereby changing the distribution of the training data.

For example, suppose our relation extraction task is to find the "was born in" relation, and our initial seed instance is "Steve Jobs was born in San Francisco". If in some iteration the model mistakenly identifies "Apple was founded in Cupertino" as a "was born in" relation and adds it to the training set, the model may start to learn the "founded in" relation instead, which is semantic drift.


Distant Supervision

Distant supervision is a weakly supervised learning method for relation extraction. The core idea of distant supervision is that if two entities have a certain relationship in a knowledge base (such as Freebase or Wikidata), then all sentences containing these two entities can be regarded as instances of this relationship.

For example, if we know that there is a "birthplace" relationship between "Barack Obama" and "Hawaii" in the knowledge base, then any sentence that mentions "Barack Obama" and "Hawaii", such as "Barack Obama was born in Hawaii" or "Hawaii is the birthplace of Barack Obama", can be regarded as an instance of the "birthplace" relationship.

With distant supervision, we can automatically build large-scale relation extraction training sets from unlabeled text and a knowledge base. However, distant supervision suffers from a significant problem, the mislabeling problem, which arises because the distant supervision assumption does not always hold. Not every sentence that contains two related entities expresses their relation, so many instances may be mislabeled. For example, the sentence "Barack Obama went on vacation in Hawaii" does not express the "birthplace" relation, but under the distant supervision assumption it may be mislabeled as an instance of it.
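
A minimal sketch of the labeling rule, and of the noise it produces, is shown below. The one-entry dictionary `KB` is a toy stand-in for a real knowledge base such as Freebase or Wikidata, and plain substring matching is assumed in place of proper entity linking.

```python
# Toy knowledge base: (head entity, tail entity) -> relation.
KB = {("Barack Obama", "Hawaii"): "birthplace"}


def distant_label(sentences, kb):
    """Label every sentence that mentions both entities of a KB pair."""
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, relation, tail))
    return labeled


sentences = [
    "Barack Obama was born in Hawaii.",
    "Hawaii is the birthplace of Barack Obama.",
    "Barack Obama went on vacation in Hawaii.",  # does not express birthplace
    "Steve Jobs was born in San Francisco.",     # pair is not in the KB
]
training_set = distant_label(sentences, KB)
for sentence, head, relation, tail in training_set:
    print(f"{relation}({head}, {tail}) <- {sentence}")
```

The vacation sentence ends up in the training set with the "birthplace" label, which is exactly the mislabeling problem, while the Steve Jobs sentence is skipped because its entity pair is missing from the knowledge base.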

Unsupervised Relation Extraction ("OpenIE")


Unsupervised relation extraction, such as OpenIE (Open Information Extraction), extracts entity relationships directly from unlabeled text without relying on labeled data.

OpenIE aims to extract a form of (subject, relation, object) triples from text. The advantage of this approach is that it can extract any kind of relationship, not just predefined relationship types. However, this also means that it may extract a large number of less useful or noisy relations.

The workflow of OpenIE usually includes the following steps:

  • Sentence segmentation and tagging: First, the text is split into sentences, which are then run through part-of-speech tagging and named entity recognition.

  • Syntactic parsing: Next, each sentence is parsed, usually with a dependency parser.

  • Relation extraction: Then, extract (subject, relation, object) triples according to the result of syntactic parsing. Typically, subject and object are named entities in a sentence, and relations are verb phrases between them.

  • Triple screening: Finally, triples can be screened or scored by some heuristic rules or learning-based methods to improve the quality of the results.

It is worth noting that the OpenIE approach is usually computationally intensive since it needs to process a large number of sentences and parse trees. Furthermore, due to its unsupervised nature, its results may contain a lot of noise, requiring subsequent processing to improve usability.

Evaluation


  • For named entity recognition (NER) and relation extraction, the commonly used evaluation metrics are precision, recall, and F1 score.

  • These metrics are defined in terms of true positives (TP), false positives (FP), and false negatives (FN).

  • For relation extraction, a prediction counts as correct only if both the entity pair and the relation are correct. For example, if the true relation is ("Barack Obama", "born in", "Hawaii") but the model predicts ("Barack Obama", "live in", "Hawaii"), the prediction is counted as wrong.
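
As a small worked sketch of these metrics, the function below compares predicted triples with gold triples by exact match, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). The function name and the second example triple are assumptions made for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted and correct
    fp = len(predicted - gold)   # predicted but wrong
    fn = len(gold - predicted)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [
    ("Barack Obama", "born in", "Hawaii"),
    ("Steve Jobs", "born in", "San Francisco"),
]
predicted = [
    ("Barack Obama", "live in", "Hawaii"),  # right entities, wrong relation
    ("Steve Jobs", "born in", "San Francisco"),
]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

The "live in" prediction counts both as a false positive and, because the gold "born in" triple was missed, as a false negative, so all three scores come out at 0.5.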

Other Information Extraction Tasks

Temporal Expression Extraction


Event Extraction



Origin blog.csdn.net/qq_42902997/article/details/131216388