[Paper Reading] GPT-NER: Named Entity Recognition via Large Language Models

Foreword

This paper, posted on arXiv on April 26, 2023, is the first I have seen that uses an LLM to solve NER tasks. In my opinion, LLMs are the optimal solution to NER problems, especially in few-shot scenarios: with their rich prior knowledge, their emergent abilities always amaze me.


Abstract

On NER, LLMs underperform supervised baselines because the two tasks are mismatched: the former is a text generation task, while the latter is a sequence labeling task. GPT-NER bridges this gap by recasting sequence labeling as a generation task. For example, given the input "Columbus is a city", the model outputs "@@Columbus## is a city", where @@ and ## mark the entity to be extracted. To address the hallucination problem of LLMs, i.e., the tendency to confidently treat NULL outputs as entities, the paper also proposes a self-verification strategy, prompting the LLM to ask itself whether an extracted entity really belongs to the target entity label. GPT-NER achieves performance comparable to fully supervised baselines on five widely used datasets and outperforms supervised models in few-shot scenarios.

1. Introduction

An LLM needs only a few simple examples to generate results for new test inputs. Under the in-context learning framework, LLMs have achieved promising results on various NLP tasks such as translation, question answering, and relation extraction. However, because text generation and sequence labeling are different tasks, LLM performance on NER falls far below supervised baselines. GPT-NER converts the NER task into a text generation task, which significantly reduces the difficulty of generating text that encodes the label information of the input sequence; experiments show this significantly improves performance.
To alleviate the tendency of LLMs to confidently treat NULL outputs as entities, a self-verification strategy is proposed. Placed after the entity extraction stage, it prompts the LLM to ask itself whether each extracted entity really belongs to the target entity label. This effectively mitigates the hallucination problem and noticeably improves performance.
Experimentally, GPT-NER achieves performance on par with full supervision. Moreover, the 4,096-token context limit caps the number of demonstrations and thus the achievable performance; with a longer-context model such as GPT-4, results would very likely improve further.
GPT-NER outperforms supervised models in low-resource, few-shot NER.

2. Related Work

2.1 Named Entity Recognition

Named entity recognition (NER) is the task of identifying key information in text and classifying it into a set of predefined categories.

2.2 Large Language Models and In-context Learning

Strategies for applying LLMs to downstream tasks fall into two categories: fine-tuning and in-context learning. The former continues training and updating parameters on downstream supervised data; the latter prompts the LLM to generate text given a few-shot demonstration. Better prompts and demonstrations improve in-context learning performance.

3. Background

3.1 NER as Sequence Labeling

A common way to solve NER is to treat it as a sequence labeling task, which can be divided into two steps: representation extraction and classification.
Representation extraction: obtain a high-dimensional representation of each token in the input sequence by feeding the sentence into an encoder such as BERT and taking the last-layer embedding $h_i \in \mathbb{R}^{m \times 1}$ as the token representation.
Classification: each token's high-dimensional vector is fed to an MLP, which produces a distribution over entity labels via softmax.
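
To make the two steps concrete, here is a minimal PyTorch sketch (assuming Hugging Face `transformers`; the encoder name and label count are placeholders, not the paper's exact configuration):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class SequenceLabelingNER(nn.Module):
    """Representation extraction (BERT) + per-token classification (MLP)."""
    def __init__(self, encoder_name="bert-base-cased", num_labels=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size          # m in h_i ∈ R^{m×1}
        self.classifier = nn.Linear(hidden, num_labels)   # the MLP head

    def forward(self, input_ids, attention_mask):
        # Last-layer embeddings serve as the token representations h_i
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return torch.softmax(self.classifier(h), dim=-1)  # label distribution per token

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = SequenceLabelingNER()
batch = tokenizer("Columbus is a city", return_tensors="pt")
probs = model(batch["input_ids"], batch["attention_mask"])  # (1, seq_len, num_labels)
```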

4. GPT-NER

GPT-NER uses an LLM to solve NER tasks. It follows the general paradigm of in-context learning and can be broken down into three steps (a minimal sketch of the whole loop follows the list):

  1. Prompt construction: build one prompt for each input sentence;
  2. Feed the constructed prompt into the LLM to obtain a generated text sequence;
  3. Convert the text sequence back into entity labels.
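
The sketch below assumes a hypothetical `call_llm` wrapper around a completion API and the `@@...##` output format introduced in Section 4.1.2 (`build_prompt` is sketched in Section 4.1):

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a completion API such as GPT-3."""
    raise NotImplementedError

def gpt_ner(sentence: str, entity_type: str, build_prompt) -> list:
    prompt = build_prompt(sentence, entity_type)   # step 1: prompt construction
    output = call_llm(prompt)                      # step 2: LLM generation
    return re.findall(r"@@(.+?)##", output)        # step 3: text sequence → entities
```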

The following describes strategies for adapting LLMs to NER tasks.

4.1 Prompt Construction

[Figure: an example GPT-NER prompt]
The figure above shows an example GPT-NER prompt, which consists of three parts.

4.1.1 Task Description

The task description can be further broken down into three parts (a hypothetical prompt rendering follows the list):

  1. The first sentence describes the task, telling the LLM to use its linguistic knowledge to produce the output;
  2. The second part specifies the category of entity to extract. For each input sentence, N prompts are constructed, one per entity type, which can be understood as N binary classification tasks; this design is forced by the token-length limit;
  3. The third part indicates where the few-shot demonstrations are placed.
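
A hypothetical `build_prompt` for the "location" type; the wording only approximates the paper's figure, and the demonstration sentences are invented for illustration:

```python
def build_prompt(sentence: str, entity_type: str) -> str:
    # One prompt per entity type, so NER becomes N binary extraction tasks
    demos = (
        "Input: Only France and Britain backed the proposal.\n"
        "Output: Only @@France## and @@Britain## backed the proposal.\n"
    )
    return (
        # parts 1 + 2: task description naming the entity category
        f"I am an excellent linguist. The task is to label "
        f"{entity_type} entities in the given sentence.\n"
        f"Below are some examples:\n"
        f"{demos}"                           # part 3: few-shot demonstrations
        f"Input: {sentence}\nOutput:"        # the current input sentence (Section 4.1.3)
    )

print(build_prompt("Columbus is a city", "location"))
```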

4.1.2 Few-shot Demonstration

The format of each tagged sentence needs to satisfy:

  1. It contains the label information of every word, so it can easily be converted into a sequence of entity types;
  2. It can be reliably generated by the LLM.

For example, suppose the sentence "Columbus is a city" should generate the label sequence "LOC O O O". Condition 1 is easy to meet, but generating such a sequence requires the LLM to learn the alignment between text and labels, which increases the difficulty of the generation task; the authors found that GPT-3 struggles to generate sequences of exactly the same length as the input. To solve this problem, they designed special symbols to surround each entity, as shown below:
[Figure: the @@...## entity-marking output format]
This method significantly reduces the difficulty of text generation.
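
Converting the marked-up output back into entities and token labels is then a simple string operation. A sketch, assuming well-formed markers and naive token matching (a real implementation would need more care with sub-words and repeated tokens):

```python
import re

def parse_marked_output(output: str, entity_type: str):
    """Turn '@@Columbus## is a city' into entities and per-token labels."""
    entities = re.findall(r"@@(.+?)##", output)    # surface forms of marked entities
    plain = re.sub(r"@@(.+?)##", r"\1", output)    # sentence with markers stripped
    labels = [entity_type if any(tok in e.split() for e in entities) else "O"
              for tok in plain.split()]
    return entities, labels

print(parse_marked_output("@@Columbus## is a city", "LOC"))
# (['Columbus'], ['LOC', 'O', 'O', 'O'])
```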

4.1.3 Input Sentence

This part feeds the current input sentence into the LLM and expects the LLM to produce an output sequence according to the format defined in Section 4.1.2.

4.2 Few-shot Demonstrations Retrieval

4.2.1 Random Retrieval

The most straightforward strategy is to randomly select K samples from the training set, but this cannot guarantee that the retrieved examples are semantically close to the input.

4.2.2 kNN-based Retrieval

To address the relevance problem of random retrieval, the K nearest neighbors of the input sequence can instead be retrieved from the training set: first compute representations of all training samples, then find the input sequence's k nearest neighbors among them.
**kNN based on sentence-level representations:** use a text-similarity model to obtain sentence-level representations of the training examples and the input sequence, then rank by cosine similarity to find the kNN. The shortcoming is obvious: NER is a token-level task that cares more about local context, so the retrieved examples may not contain any entities at all.
**kNN based on entity-level embeddings:** first extract token-level representations of all training tokens using a fine-tuned NER tagging model. For a given input sequence of length N, traverse all its tokens and find the K nearest neighbors of each token, obtaining K × N retrieved tokens. Next, select the top k of these K × N tokens and use their source sentences as demonstrations. A sketch of both retrieval flavors follows:
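
This sketch assumes the representations have already been computed (the embedding models themselves are not shown; `train_sent_ids[i]` is a hypothetical mapping from the i-th training token to its source sentence):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def sentence_knn(query_vec, train_vecs, k):
    # Sentence-level: one vector per training sentence, rank by cosine similarity
    sims = np.array([cosine(query_vec, v) for v in train_vecs])
    return np.argsort(sims)[-k:][::-1]

def entity_level_knn(query_token_vecs, train_token_vecs, train_sent_ids, k):
    # Token-level: k neighbors per query token → K×N candidates overall,
    # then keep the top-k tokens and return their source sentences
    scored = []
    for q in query_token_vecs:
        sims = np.array([cosine(q, t) for t in train_token_vecs])
        for idx in np.argsort(sims)[-k:]:
            scored.append((sims[idx], train_sent_ids[idx]))
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:k]]
```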

4.3 Self-verification

LLMs suffer from hallucination, i.e., over-prediction, as shown below:
[Figure: an over-prediction example]
Here "Hendrix" is identified as a location, which is clearly wrong. To address this, the authors propose a self-verification strategy: given an entity extracted by the LLM, the LLM is asked to further verify whether that extraction is correct, answering yes or no.
[Figure: the self-verification prompt]
Again, a few demonstrations are needed to improve the accuracy of the self-verifier, as indicated by the yellow box in the figure above.
Example selection: we need to choose few-shot demonstrations for self-verification. Since the core of self-verification is asking whether an extracted entity is of a specific entity type, we should select training examples that contain extracted entities.
Entity-level embeddings are therefore chosen for the kNN demonstration search instead of sentence-level representations (a sketch of the verification prompt follows the list):

  1. First, extract entity-level representations of all training tokens with a fine-tuned NER model;
  2. Use the same model to extract the representation of the queried entity word;
  3. Finally, use that representation to select k examples from the datastore as few-shot demonstrations.
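
A sketch of the verification step; the prompt wording approximates the paper's figure rather than reproducing it:

```python
def build_verification_prompt(sentence, entity, entity_type, demos):
    # Yes/no question about one extracted entity, preceded by few-shot demos
    return (
        f"The task is to verify whether the word is a {entity_type} entity "
        f"extracted from the given sentence.\n"
        f"{demos}\n"
        f"The input sentence: {sentence}\n"
        f"Is the word \"{entity}\" in the input sentence a {entity_type} entity? "
        f"Please answer with yes or no.\n"
    )

def self_verify(extracted, sentence, entity_type, demos, call_llm):
    # Keep only the entities the verifier confirms
    return [e for e in extracted
            if call_llm(build_verification_prompt(sentence, e, entity_type, demos))
               .strip().lower().startswith("yes")]
```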

5. Experiments

Experiments are conducted with GPT-3.

5.1 Results on the Full Training Set

5.1.1 Results on Flat NER

In flat NER, entities cannot overlap. Experiments are carried out on CoNLL2003 and OntoNotes 5.0. The former contains four named entity types: Location, Organization, Person, and Miscellaneous. The latter contains 18 named entity types: 11 entity types (e.g., Person, Organization) and 7 value types (e.g., Date, Percent).
[Table: flat NER results on CoNLL2003 and OntoNotes 5.0]
Main results. The tables above show results on the partial and full test sets of flat NER, respectively. The observations are as follows:

  1. kNN retrieval is crucial for NER tasks;
  2. Token-level embedding significantly improves performance;
  3. Adding self-validation further improves performance;
  4. The LLM-based system achieves performance comparable to the supervised baselines.

5.1.2 Results on Nested NER

In nested NER, entities within a sentence may overlap. The authors conduct experiments on three widely used nested NER datasets: ACE2004, ACE2005, and GENIA. The first two contain seven entity types each and are split 8:1:1 into training, validation, and test sets. GENIA is a nested NER dataset from the field of molecular biology with five entity types.
[Table: nested NER results on ACE2004, ACE2005, and GENIA]
Main results. The results are shown in the table above, and it is observed that:

  1. kNN retrieval is crucial for NER tasks;
  2. Token-level embedding significantly improves performance;
  3. Adding self-validation further improves performance.

The gap to SOTA is larger than on flat NER because:

  1. Nested NER contains more similar entities;
  2. The annotation guidelines for the three nested NER datasets are more complex and less straightforward.

5.2 Results on Low-resource Scenario

NER experiments in low-resource scenarios are conducted on CoNLL2003. To imitate low-resource settings, random subsets of the training set are drawn: 8, 100, and 10K training sentences. The 8-sentence setting guarantees that each entity type has one positive and one negative example.

5.2.1 Results

[Figure: low-resource results on CoNLL2003]
The results are shown in the figure above, with the following observations:

  1. When the training set is small, supervised models perform far worse than GPT-3;
  2. As training data grows, kNN retrieval improves faster than random retrieval;
  3. Once the data reaches 10%, supervised performance improves significantly with more training data, while GPT-3's results improve only marginally.

6. Ablation Study

6.1 Varying the Format of LLM Output

Compare the following two output formats:
[Figures: the BMES and Entity+Position output formats]
BMES: directly output a BMES tag (begin, middle, end, single) or O for each token;
Entity+Position: ask the LLM to output each entity together with its position in the sentence.
For a like-for-like comparison, the authors run experiments with 32 few-shot demonstrations on a 100-sample CoNLL2003 subset, using identical settings for all three output formats. The results are 92.68 (the @@## strategy), 29.75 (BMES), and 38.73 (Entity+Position). The likely explanation: BMES requires the LLM to learn token-to-label alignment and to output a string of exactly the input's length, which is hard; with the Entity+Position strategy, the LLM confuses position indices, producing wrong entity locations. A hypothetical side-by-side rendering follows:
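
For intuition only; the tag spellings are illustrative, and the paper's figures show the exact formats:

```python
formats = {
    "@@## marking":    "@@Columbus## is a city",  # F1 92.68
    "BMES":            "S-LOC O O O",             # F1 29.75; one tag per token
    "Entity+Position": "(Columbus, LOC, 0)",      # F1 38.73; index often garbled
}
```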

6.2 The Number of Few-shot Demonstrations

[Figure: F1 vs. the number of few-shot demonstrations k]
As the graph above shows, all three curves keep rising as k increases, meaning that if more demonstrations were allowed, performance would still improve.
An interesting phenomenon: when the number of demonstrations is small, the kNN strategy is inferior to random retrieval, probably because kNN tends to select demonstrations very similar to the input sentence; if the input sentence contains no entities, most of the retrieved demonstrations contain no entities either. An example:
[Figure: a kNN-retrieval failure case for an entity-free input sentence]

7. Conclusion

This paper proposes GPT-NER to adapt LLMs to NER tasks. The authors design prompts that lead the LLM to generate entity-tagged text; in the demonstration part, kNN retrieval over token-level embeddings helps the LLM produce better output; and a self-verification strategy is proposed to alleviate the LLM's hallucination problem. The final model is comparable to the supervised baselines and has a significant advantage in low-resource scenarios.

Reading Summary

This paper, posted on arXiv only on April 26, 2023, is the first I have seen that uses an LLM to solve NER tasks. It makes me marvel at the unlimited potential of LLMs, and this is only the baseline effect of GPT-3; if it were replaced with the current GPT-4, the results would be hard to imagine. In my opinion, LLMs are the best solution for NER tasks: such a complex sequence labeling problem really takes magic to defeat magic. The contrastive learning and meta-learning methods I have seen before may truly be no match for LLMs. Of course, I will continue to investigate.

Origin: blog.csdn.net/HERODING23/article/details/130476395