Using BERT for named entity recognition tasks

Named entity recognition (NER) is a common NLP task.

Simply put, the goal is to identify the various named entities that appear in a sentence, such as names of people, places, and organizations.

For example, for the following sentence:

小明对小红说:"你听说过安利吗?"

Its NER extraction results are as follows:

[{'entity': 'person',
  'word': '小明',
  'start': 0,
  'end': 2},
 {'entity': 'person',
  'word': '小红',
  'start': 3,
  'end': 5},
 {'entity': 'organization',
  'word': '安利',
  'start': 12,
  'end': 14}]

In essence, NER is a token classification task: every token in the text is assigned a class.

Tokens that do not belong to any named entity are usually labeled 'O'.

Note that because a named entity can span several consecutive tokens, we also mark whether a token is the beginning of an entity; otherwise two adjacent entities of the same category could not be told apart.

For example, consider the following sentence:

我爱北京天安门

If we do not mark whether a token begins a named entity, we might get a token classification result like this:

我(O) 爱(O) 北(Loc) 京(Loc) 天(Loc) 安(Loc) 门(Loc)

During post-processing we would then join consecutive tokens of the same category and obtain a single location entity, 'Beijing Tiananmen'.

However, 'Beijing' and 'Tiananmen' are two different location entities, and it is more reasonable to keep them apart. We can therefore classify the tokens like this:

我(O) 爱(O) 北(B-Loc) 京(I-Loc) 天(B-Loc) 安(I-Loc) 门(I-Loc)

We use B-Loc to mark the first token of a location entity and I-Loc to mark an inside token (middle or end) of a location entity.

During post-processing we then join a B-Loc token with the I-Loc tokens that follow it into one entity, which correctly yields 'Beijing' and 'Tiananmen' as two different locations.

The advantage of marking whether a token starts an entity is that consecutive entities of the same category can be separated; the disadvantage is that the number of classes almost doubles (from n+1 to 2n+1).

Consecutive entities of the same category are not very common in practice, but to be safe we still mark whether a token starts an entity.
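
As a concrete illustration of this post-processing step, here is a minimal sketch (not part of the original notebook) that merges a character-level B-/I- tag sequence back into entity spans:

def merge_bio_tags(chars, tags):
    # merge a B-/I-/O tag sequence into entity spans
    entities, current = [], None
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag.startswith('B-'):  # a new entity starts here
            if current:
                entities.append(current)
            current = {'entity': tag[2:], 'word': ch, 'start': i, 'end': i + 1}
        elif tag.startswith('I-') and current and tag[2:] == current['entity']:
            current['word'] += ch  # continue the current entity
            current['end'] = i + 1
        else:  # 'O' (or an inconsistent tag) closes the current entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

text = "我爱北京天安门"
tags = ['O', 'O', 'B-Loc', 'I-Loc', 'B-Loc', 'I-Loc', 'I-Loc']
print(merge_bio_tags(text, tags))
# [{'entity': 'Loc', 'word': '北京', 'start': 2, 'end': 4},
#  {'entity': 'Loc', 'word': '天安门', 'start': 4, 'end': 7}]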

1. Prepare data

Reply with the keyword torchkeras in the background of the 'algorithm food house' public account to get the notebook code of this article and the download link of the dataset (CLUENER).

import numpy as np 
import pandas as pd 

from transformers import BertTokenizer
from torch.utils.data import DataLoader,Dataset 
from transformers import DataCollatorForTokenClassification
import datasets

1. Data loading

datadir = "./data/cluener_public/"

train_path = datadir+"train.json"
val_path = datadir+"dev.json"

dftrain = pd.read_json(train_path,lines=True)
dfval = pd.read_json(val_path,lines=True)

entities = ['address','book','company','game','government','movie',
              'name','organization','position','scene']

label_names = ['O']+['B-'+x for x in entities]+['I-'+x for x in entities]

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
text = dftrain["text"][43]
label = dftrain["label"][43]

print(text)
print(label)
世上或许有两个人并不那么喜欢LewisCarroll的原著小说《爱丽斯梦游奇境》(
{'book': {'《爱丽斯梦游奇境》': [[31, 39]]}, 
'name': {'LewisCarroll': [[14, 25]]}}
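
The label field maps each entity type to the entity strings and their character spans in the text. A quick sanity check (my own addition, assuming the CLUENER spans are inclusive on both ends, which is also what get_char_label below assumes):

# verify that each labelled span really covers the entity string
for tp, dic in label.items():
    for word, spans in dic.items():
        for start, end in spans:
            assert text[start:end + 1] == word  # spans are [start, end] inclusive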

2. Tokenization

from transformers import BertTokenizer
model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
tokenized_input = tokenizer(text)
print(tokenized_input["input_ids"])
[101, 686, 677, 2772,..., 518, 113, 102]
# recover the token string corresponding to each id
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
for t in tokens:
    print(t)
[CLS]
...
[UNK]
...
(
[SEP]
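
bert-base-chinese tokenizes Chinese text character by character, while strings it cannot match against its vocabulary (such as the English name above) collapse into a single [UNK]. A quick illustration, just as a sanity check:

print(tokenizer.tokenize("我爱北京天安门"))
# ['我', '爱', '北', '京', '天', '安', '门']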

3. Label alignment

As you can see, the number of tokens after tokenization differs from the number of characters in the original text, mainly for the following reasons. First, BERT adds special tokens such as [CLS] and [SEP].

Second, an English word may be split into subword tokens, so one token can cover several characters.

In addition, pieces that are not in the vocabulary are mapped to [UNK] (as happens to the English name in this example), which also affects the alignment.

Therefore, assigning the correct label to each token is not a trivial task.

We proceed in two steps. The first step is to convert the original dict-style label into a character-level char_label.

The second step is to align char_label to a token-level token_label.

# convert the dict-style label into a character-level char_label
def get_char_label(text,label):
    char_label = ['O' for x in text]
    for tp,dic in label.items():
        for word,idxs in dic.items():
            for idx_start,idx_end in idxs: # an entity may occur at more than one position
                char_label[idx_start] = 'B-'+tp
                char_label[idx_start+1:idx_end+1] = ['I-'+tp for x in range(idx_start+1,idx_end+1)]
    return char_label
char_label = get_char_label(text,label)
for char,char_tp in zip(text,char_label):
    print(char+'\t'+char_tp)
世	O
上	O
或	O
许	O
有	O
两	O
个	O
人	O
并	O
不	O
那	O
么	O
喜	O
欢	O
L	B-name
e	I-name
w	I-name
i	I-name
s	I-name
C	I-name
a	I-name
r	I-name
r	I-name
o	I-name
l	I-name
l	I-name
的	O
原	O
著	O
小	O
说	O
《	B-book
爱	I-book
丽	I-book
斯	I-book
梦	I-book
游	I-book
奇	I-book
境	I-book
》	I-book
(	O
def get_token_label(text, char_label, tokenizer):
    tokenized_input = tokenizer(text)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
    
    iter_tokens = iter(tokens)
    iter_char_label = iter(char_label)  
    iter_text = iter(text.lower()) 

    token_labels = []

    t = next(iter_tokens)
    char = next(iter_text)
    char_tp = next(iter_char_label)

    while True:
        # single-character tokens (e.g. Chinese characters): assign the label of the corresponding character
        if len(t)==1:
            assert t==char
            token_labels.append(char_tp)   
            try:
                char = next(iter_text)
                char_tp = next(iter_char_label)
            except StopIteration:
                pass  

        # special tokens added by the tokenizer, such as [CLS] and [SEP] (but not [UNK])
        elif t in tokenizer.special_tokens_map.values() and t!='[UNK]':
            token_labels.append('O')              


        elif t=='[UNK]':
            token_labels.append(char_tp) 
            # re-align the character iterator with the next token
            try:
                t = next(iter_tokens)
            except StopIteration:
                break 

            if t not in tokenizer.special_tokens_map.values():
                while char!=t[0]:
                    try:
                        char = next(iter_text)
                        char_tp = next(iter_char_label)
                    except StopIteration:
                        pass    
            continue

        # other tokens longer than one character, such as English (sub)words
        else:
            t_label = char_tp
            t = t.replace('##','') # strip the '##' prefix introduced by subword tokenization
            for c in t:
                assert c==char or char not in tokenizer.vocab
                if t_label!='O':
                    t_label=char_tp
                try:
                    char = next(iter_text)
                    char_tp = next(iter_char_label)
                except StopIteration:
                    pass    
            token_labels.append(t_label) 

        try:
            t = next(iter_tokens)
        except StopIteration:
            break  
            
    assert len(token_labels)==len(tokens)
    return token_labels
token_labels = get_token_label(text,char_label,tokenizer)
for t,t_label in zip(tokens,token_labels):
    print(t,'\t',t_label)
[CLS] 	 O
世 	 O
上 	 O
或 	 O
许 	 O
有 	 O
两 	 O
个 	 O
人 	 O
并 	 O
不 	 O
那 	 O
么 	 O
喜 	 O
欢 	 O
[UNK] 	 B-name
的 	 O
原 	 O
著 	 O
小 	 O
说 	 O
《 	 B-book
爱 	 I-book
丽 	 I-book
斯 	 I-book
梦 	 I-book
游 	 I-book
奇 	 I-book
境 	 I-book
》 	 I-book
( 	 O
[SEP] 	 O

4. Build the pipeline

dftrain.head()


def make_sample(text,label,tokenizer):
    sample = tokenizer(text)
    char_label = get_char_label(text,label)
    token_label = get_token_label(text,char_label,tokenizer)
    sample['labels'] = [label2id[x] for x in token_label]
    return sample
from tqdm import tqdm 
train_samples = [make_sample(text,label,tokenizer) for text,label in 
                 tqdm(list(zip(dftrain['text'],dftrain['label'])))]
val_samples = [make_sample(text,label,tokenizer) for text,label in 
                 tqdm(list(zip(dfval['text'],dfval['label'])))]
100%|██████████| 10748/10748 [00:06<00:00, 1717.47it/s]
100%|██████████| 10748/10748 [00:06<00:00, 1711.10it/s]
ds_train = datasets.Dataset.from_list(train_samples)
ds_val = datasets.Dataset.from_list(val_samples)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
dl_train = DataLoader(ds_train,batch_size=8,collate_fn=data_collator)
dl_val = DataLoader(ds_val,batch_size=8,collate_fn=data_collator)
for batch in dl_train:
    break
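
DataCollatorForTokenClassification pads input_ids and attention_mask to the longest sequence in the batch and pads labels with -100, which the loss function ignores. A quick look at one batch (the exact shapes and values depend on the data):

print({k: v.shape for k, v in batch.items()})
print(batch['labels'][0])  # padded positions are -100 and are ignored by the loss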

2. Define the model

from transformers import BertForTokenClassification

net = BertForTokenClassification.from_pretrained(
    model_name,
    id2label=id2label,
    label2id=label2id,
)

# freeze the parameters of the BERT base model
for para in net.bert.parameters():
    para.requires_grad_(False)

print(net.config.num_labels) 

# trial run of the model on one batch
out = net(**batch)
print(out.loss) 
print(out.logits.shape)
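
Since the BERT backbone is frozen, only the token-classification head stays trainable. A small check (not in the original notebook) to confirm this:

n_total = sum(p.numel() for p in net.parameters())
n_trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"trainable params: {n_trainable} / {n_total}")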


3. Train the model

import torch 
from torchkeras import KerasModel 

# We need to modify StepRunner to adapt it to the transformers data format

class StepRunner:
    def __init__(self, net, loss_fn, accelerator, stage = "train", metrics_dict = None, 
                 optimizer = None, lr_scheduler = None
                 ):
        self.net,self.loss_fn,self.metrics_dict,self.stage = net,loss_fn,metrics_dict,stage
        self.optimizer,self.lr_scheduler = optimizer,lr_scheduler
        self.accelerator = accelerator
        if self.stage=='train':
            self.net.train() 
        else:
            self.net.eval()
    
    def __call__(self, batch):
        
        out = self.net(**batch)
        
        #loss
        loss= out.loss
        
        #preds
        preds =(out.logits).argmax(axis=2) 
    
        #backward()
        if self.optimizer is not None and self.stage=="train":
            self.accelerator.backward(loss)
            self.optimizer.step()
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()
            self.optimizer.zero_grad()
        
        all_loss = self.accelerator.gather(loss).sum()
        
        labels = batch['labels']
        
        #precision & recall
        
        precision =  (((preds>0)&(preds==labels)).sum())/(
            torch.maximum((preds>0).sum(),torch.tensor(1.0).to(preds.device)))
        recall =  (((labels>0)&(preds==labels)).sum())/(
            torch.maximum((labels>0).sum(),torch.tensor(1.0).to(labels.device)))
    
        
        all_precision = self.accelerator.gather(precision).mean()
        all_recall = self.accelerator.gather(recall).mean()
        
        f1 = 2*all_precision*all_recall/torch.maximum(
            all_recall+all_precision,torch.tensor(1.0).to(labels.device))
        
        #losses
        step_losses = {self.stage+"_loss":all_loss.item(), 
                       self.stage+'_precision':all_precision.item(),
                       self.stage+'_recall':all_recall.item(),
                       self.stage+'_f1':f1.item()
                      }
        
        #metrics
        step_metrics = {}
        
        if self.stage=="train":
            if self.optimizer is not None:
                step_metrics['lr'] = self.optimizer.state_dict()['param_groups'][0]['lr']
            else:
                step_metrics['lr'] = 0.0
        return step_losses,step_metrics
    
KerasModel.StepRunner = StepRunner
optimizer = torch.optim.AdamW(net.parameters(), lr=3e-5)

keras_model = KerasModel(net,
                   loss_fn=None,
                   optimizer = optimizer
                   )
keras_model.fit(
    train_data = dl_train,
    val_data= dl_val,
    ckpt_path='bert_ner.pt',
    epochs=50,
    patience=5,
    monitor="val_f1", 
    mode="max",
    plot = True,
    wandb = False,
    quiet = True
)
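
The best weights are saved to ckpt_path during training. Assuming a recent torchkeras version, which stores a plain state_dict there (check your version), they can be reloaded manually like this:

net.load_state_dict(torch.load('bert_ner.pt', map_location='cpu'))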

(training curves plot)

4. Evaluate the model

from torchmetrics import Accuracy
acc = Accuracy(task='multiclass',num_classes=21)
acc = keras_model.accelerator.prepare(acc)

dl_test = keras_model.accelerator.prepare(dl_val)
net = keras_model.accelerator.prepare(net)
from tqdm import tqdm 
for batch in tqdm(dl_test):
    with torch.no_grad():
        outputs = net(**batch)
        
    labels = batch['labels']
    labels[labels<0]=0
    #preds
    preds =(outputs.logits).argmax(axis=2) 
    acc.update(preds,labels)
acc.compute()  # this accuracy also counts the 'O' class, so it overestimates performance
tensor(0.9178, device='cuda:0')
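
Since label id 0 corresponds to 'O', one way to reduce this overestimation is to ignore that class when computing accuracy. A sketch (not from the original post) using the ignore_index argument of torchmetrics:

acc_entity = Accuracy(task='multiclass', num_classes=21, ignore_index=0)
acc_entity = keras_model.accelerator.prepare(acc_entity)
for batch in dl_test:
    with torch.no_grad():
        outputs = net(**batch)
    labels = batch['labels'].clone()
    labels[labels < 0] = 0  # map padding (-100) to 'O'; both are then ignored via ignore_index
    preds = outputs.logits.argmax(dim=-1)
    acc_entity.update(preds, labels)
print(acc_entity.compute())  # accuracy on entity tokens only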

5. Use the model

We can use a pipeline to string the whole prediction process together.

Note that we use the built-in 'simple' aggregation_strategy here, which automatically merges the tokens that belong to the same entity.

from transformers import pipeline
recognizer = pipeline("token-classification", 
                      model=net, tokenizer=tokenizer, aggregation_strategy='simple')
net.to('cpu');
recognizer('小明对小红说,“你听说过安利吗?”')
[{'entity_group': 'name',
  'score': 0.6913842,
  'word': '小 明',
  'start': None,
  'end': None},
 {'entity_group': 'name',
  'score': 0.58951116,
  'word': '小 红',
  'start': None,
  'end': None},
 {'entity_group': 'name',
  'score': 0.74060774,
  'word': '安 利',
  'start': None,
  'end': None}]

6. Save the model

After saving the model and tokenizer, we can load them with a pipeline and make batch predictions.

net.save_pretrained("ner_bert")
tokenizer.save_pretrained("ner_bert")
('ner_bert/tokenizer_config.json',
 'ner_bert/special_tokens_map.json',
 'ner_bert/vocab.txt',
 'ner_bert/added_tokens.json')
recognizer = pipeline("token-classification", 
                      model="ner_bert",
                      aggregation_strategy='simple')
recognizer('小明对小红说,“你听说过安利吗?”')
[{'entity_group': 'name',
  'score': 0.6913842,
  'word': '小 明',
  'start': 0,
  'end': 2},
 {'entity_group': 'name',
  'score': 0.58951116,
  'word': '小 红',
  'start': 3,
  'end': 5},
 {'entity_group': 'name',
  'score': 0.74060774,
  'word': '安 利',
  'start': 12,
  'end': 14}]
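
The pipeline also accepts a list of sentences, so the batch prediction mentioned above is simply (a minimal sketch):

texts = ['小明对小红说,“你听说过安利吗?”', '我爱北京天安门']
results = recognizer(texts)  # one list of entity dicts per input text
for t, ents in zip(texts, results):
    print(t, ents)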

Reply with the keyword torchkeras in the background of the public account to get the notebook, the source dataset, and more interesting training examples~



Origin: blog.csdn.net/Python_Ai_Road/article/details/131388753