Understanding and Reproducing the First-Place Sentiment Analysis Solution of the 2022 Sohu Campus NLP Algorithm Competition

Table of contents

1. Understanding the competition and the scheme

Defects of the baseline

The first-place scheme

Data dimension changes

2. Code implementation

First-place scheme code

SWA - stochastic weight averaging

Baseline code

3. Results

The first-place scheme:

a. AdamW + SWA

b. SGD + SWA

Baseline scheme


On Zhihu I came across the write-up of the first-place sentiment analysis solution in the 2022 Sohu Campus NLP Algorithm Competition. The scheme struck me as simple and elegant, with a hint of prompt learning (though strictly speaking it is not prompt learning), and its results are very good. The authors shared fairly detailed ideas along with code based on pytorch-lightning, but some details are not spelled out clearly and the code is not easy to follow, so this post gives a clearer explanation and shares a more concise, plain-PyTorch implementation.

1. Understanding the competition and the scheme

The task of this competition is entity-oriented sentiment polarity and intensity analysis of text. Polarity and intensity are divided into five classes: extremely positive, positive, neutral, negative, and extremely negative. For each given entity, contestants must analyze the sentiment that the text expresses toward that entity.

The data look like this:

{"id": 7410, "content": "With such an amazing Nets, can fans and experts have high expectations for him in the playoffs? Therefore, throughout the season, everyone's predictions are reasonable. This year's season The Eastern Conference finals of the playoffs should still be the same as last year, and the Nets and Bucks are still expected to meet as scheduled. Today's Nets Big Three are not the ultimate, but they are also easy to "cook the Ding Jie Niu", the way for the Bulls should still be For a long time, the NBA is still the stage where the superstar speaks!", "entity": {"Nets": 1, "Playoffs": 0}}
{"id": 88679, "content": "2014.09 Member of the Standing Committee of the Hainan Provincial Party Committee, Secretary of the Danzhou Municipal Party Committee, and Deputy Secretary of the Yangpu Economic Development Zone Working Committee 2014.10 Member of the Standing Committee of the Hainan Provincial Party Committee and Secretary of the Sanya Municipal Party Committee 2016.11 Member of the Standing Committee of the Hainan Provincial Party Committee and Secretary of the Haikou Municipal Party Committee Inspected in September 2019.", "entity": {"Secretary of the Municipal Party Committee": 0, "Hainan Provincial Party Committee": 0}}

For the content text and the given entities in the data above, the task is to analyze the sentiment the content expresses toward each entity separately. This is clearly a classification task, and the solution that first flashed through my mind was exactly the baseline they provided:

[CLS]content[SEP]entity_0[SEP]

[CLS]content[SEP]entity_1[SEP]

[CLS]content[SEP]entity_2[SEP]

......

[CLS]content[SEP]entity_n[SEP]

After the content is spliced with each entity as above, the pair is fed into the BERT model to extract a sentence vector, which then goes through a classifier; that completes the task. This scheme was indeed used in the competition, but reportedly the results were not very satisfactory. A minimal sketch of this pairing is shown below; after that, let's look at the first-place scheme, starting with the problems of the baseline.
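For reference, here is a minimal sketch of my own (not the official baseline code, which splices the "[SEP]" string by hand further below) of how such content/entity pairs can be built with the Hugging Face tokenizer, which adds [CLS] and [SEP] automatically when given a sentence pair:

from transformers import BertTokenizer

# minimal sketch of the baseline input construction; the model path is the one used later in this post
tokenizer = BertTokenizer.from_pretrained("./pretrained_models/chinese-bert-wwm-ext")

sample = {"content": "content text ...", "entity": {"entity_0": 1, "entity_1": 0}}

pairs = []
for entity, polarity in sample["entity"].items():
    # tokenizer(text, text_pair) yields [CLS] content [SEP] entity [SEP]
    encoded = tokenizer(sample["content"], entity, truncation=True, max_length=512)
    pairs.append((encoded["input_ids"], encoded["attention_mask"], polarity + 2))  # shift labels -2..2 to 0..4

Each entity produces its own training example, which is exactly the duplication discussed next.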

Defects of the baseline

As illustrated by the figure in the competition authors' solution write-up (the figure is not reproduced here):

Because the number of entities differs from sample to sample, a splicing scheme like the baseline has several drawbacks. First, the model sees each content text a different number of times, which may shift the training distribution and affect the final result. Second, every sample is duplicated once per entity, so the training data grows considerably and training becomes inefficient. Third, the choice of sentence vector introduces some error: the baseline takes either the CLS embedding or a mean pooling over all token embeddings, and that choice also influences the result. Finally, splicing each entity separately weakens the connections between entities in the same text, which hurts the result as well.

The first-place scheme

As shown in the figure from the authors' write-up (not reproduced here), the entities in each sample are joined together with [MASK] tokens and the whole string is concatenated to the content text with [SEP]. This turns each sample into a single classification input, instead of repeating the sample once per entity as the baseline does. It also sidesteps the choice of sentence vector: the embedding at each [MASK] position is used directly as the classification embedding for the corresponding entity. Introducing [MASK] gives the scheme a flavor of prompt learning, and the authors report that it works better. Strictly speaking, though, it is not prompt learning: there is no need to predict the concrete token at [MASK] and map it to a class, i.e. no answer-space mapping (Verbalizer) has to be built, only a prompt template (Template).
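The template construction itself takes only a few lines. The sketch below is a simplified illustration of the idea (not the competition code; the full DataReader later in this post also handles truncation and padding):

from transformers import BertTokenizer

# illustrative sketch of the first-place input template: content[SEP]entity_0[MASK]entity_1[MASK]...
tokenizer = BertTokenizer.from_pretrained("./pretrained_models/chinese-bert-wwm-ext")

sample = {"content": "content text ...", "entity": {"entity_0": 1, "entity_1": -2}}

prompt = "".join(k + "[MASK]" for k in sample["entity"])   # entity_0[MASK]entity_1[MASK]
text = sample["content"] + "[SEP]" + prompt                # one input per sample, however many entities it has
encoded = tokenizer(text)                                  # the tokenizer adds [CLS] ... [SEP] around the whole string

# one [MASK] position per entity; their embeddings become the classification embeddings
mask_positions = [i for i, tok in enumerate(encoded["input_ids"]) if tok == tokenizer.mask_token_id]
labels = [v + 2 for v in sample["entity"].values()]        # shift -2..2 to class ids 0..4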

Overall, this scheme is indeed more elegant, and of course more effective; it feels refreshing at first sight. Of course, anyone who has read enough prompt-learning papers could probably come up with something similar. Below, the matrix dimension changes involved in the implementation are spelled out more explicitly, which makes the whole scheme easier to understand.

Data dimension changes

A batch of data:

[CLS]content_0[SEP]entity_0_0[MASK]entity_0_1[MASK]entity_0_2[MASK][SEP]

[CLS]content_1[SEP]entity_1_0[MASK]entity_1_1[MASK][SEP]

[CLS]content_2[SEP]entity_2_0[MASK][SEP]

[CLS]content_3[SEP]entity_3_0[MASK]entity_3_1[MASK][SEP]

[CLS]content_4[SEP]entity_4_0[MASK]entity_4_1[MASK]entity_4_2[MASK][SEP]

......

[CLS]content_(batch_size-1)[SEP]entity_(batch_size-1)_0[MASK]entity_(batch_size-1)_1[MASK][SEP]

After tokenization, tokens are mapped to their ids in the vocabulary. For each sample we need to record input_ids, attention_mask, mask_tokens, entity_count and label. The corresponding shapes are as follows:

input_ids:[batch,seq_length]

[

[101,******,102,**,103,**,103,102,0,0,0,0,0],

[101,******,102,**,103,**,103,102],

[101,******,102,**,103,102,0,0,0,0,0],

......

[101,******,102,**,103,**,103,**,103,**,103,0,0]

]

attention_mask:[batch,seq_length]

[

[1,******,1,**,1,**,1,1,0,0,0,0,0],

[1,******,1,**,1,**,1,1],

[1,******,1,**,1,1,0,0,0,0,0],

......

[1,******,1,**,1,**,1,**,1,**,1,0,0]

]

mask_tokens:[batch,seq_length]

[

[0,******,0,**,1,**,1,0,0,0,0,0,0],

[0,******,0,**,1,**,1,0],

[0,******,0,**,1,0,0,0,0,0,0],

......

[0,******,0,**,1,**,1,**,1,**,1,0,0]

]

Labels are kept as a list of lists (one sub-list per sample):

[

[-2,2],

[1,2],

[-2],

......

[2, -2, 0, -1]

]

If there are m entities in the batch in total, the labels are flattened into a vector of shape [m]:

[-2,2,1,2,...,2,-2,0,-1]

Passing input_ids and attention_mask through BERT yields the following (shapes noted in the comments):

# m is the total number of entities in the batch
is_masked = inputs['is_masked'].bool()
inputs = {k: v for k, v in inputs.items() if k in ["input_ids", "attention_mask"]}
outputs = self.bert(**inputs,return_dict=True, output_hidden_states=True)
# [batch, seq_length, 768]
outputs = outputs.last_hidden_state
# [m,768]
masked_outputs = outputs[is_masked]
# [m,5]
logits = self.classifier(masked_outputs)
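The key step is the boolean-mask indexing (outputs[is_masked]) above: indexing a [batch, seq_length, 768] tensor with a [batch, seq_length] boolean mask flattens out exactly the m masked positions, in row-major order, which matches the order in which the labels were flattened. A tiny dummy-tensor check of the shapes (values are illustrative):

import torch

batch, seq_length, hidden = 2, 6, 768
outputs = torch.randn(batch, seq_length, hidden)        # stand-in for last_hidden_state

# stand-in mask_tokens: sample 0 has two entities ([MASK] at positions 2 and 4), sample 1 has one
is_masked = torch.tensor([[0, 0, 1, 0, 1, 0],
                          [0, 0, 0, 1, 0, 0]]).bool()

masked_outputs = outputs[is_masked]                     # shape [3, 768], i.e. m = 3 entities in the batch
labels = torch.tensor([1, 4, 0])                        # flattened labels, also length m, aligned in order

print(masked_outputs.shape)                             # torch.Size([3, 768])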

2. Code implementation

First-place scheme code

The authors' code is based on pytorch-lightning, which I find rather heavily encapsulated and not easy to follow, so I re-implemented a version in plain PyTorch:

Model code

from transformers import BertPreTrainedModel,BertModel
import torch.nn as nn

class SentiClassifyBertPrompt(BertPreTrainedModel):
    def __init__(self,config):
        super(SentiClassifyBertPrompt,self).__init__(config)
        self.bert = BertModel(config=config)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.LayerNorm(config.hidden_size),
            nn.LeakyReLU(),
            nn.Dropout(p=config.dropout),
            nn.Linear(config.hidden_size, config.output_dim),
        )

    def forward(self,inputs):
        # m is the total number of entities in the batch
        is_masked = inputs['is_masked'].bool()
        inputs = {k: v for k, v in inputs.items() if k in ["input_ids", "attention_mask"]}
        outputs = self.bert(**inputs,return_dict=True, output_hidden_states=True)
        # [batch, seq_length, 768]
        outputs = outputs.last_hidden_state
        # [m,768]
        masked_outputs = outputs[is_masked]
        # [m,5]
        logits = self.classifier(masked_outputs)
        return logits

Data loading code

import torch
from torch.utils.data import Dataset
from tqdm import tqdm
import json
class DataReader(Dataset):
    def __init__(self,file_path,tokenizer,max_langth):
        self.file_path = file_path
        self.tokenizer = tokenizer
        self.max_length = max_langth
        self.data_list = self.texts_tokeniztion()
        self.allLength = len(self.data_list)

    def texts_tokeniztion(self):
        with open(self.file_path,'r',encoding='utf-8') as f:
            lines = f.readlines()

        res = []
        for line in tqdm(lines,desc='texts tokenization'):
            line_dic = json.loads(line.strip('\n'))
            content = line_dic['content']
            entity = line_dic['entity']
            prompt_length = 0
            prompts = ""
            label = []
            en_count = len(entity)
            for k,v in entity.items():
                prompt_length += len(k) + 1
                # map the -2..2 labels to integers 0-4
                label.append(v+2)
                prompts += k +"[MASK]"
            # truncate content so that content + [SEP] + prompts fits within max_length
            content = content[0:self.max_length-prompt_length-1-10]
            text = content + "[SEP]" + prompts
            input_ids,attention_mask,masks = self.text2ids(text)

            input_ids = torch.tensor(input_ids,dtype=torch.long)
            attention_mask = torch.tensor(attention_mask,dtype=torch.long)
            masks = torch.tensor(masks, dtype=torch.long)
            # record how many entities each sample has, to make batched inference easier
            en_count = torch.tensor(en_count,dtype=torch.long)

            temp = []
            temp.append(input_ids)
            temp.append(attention_mask)
            temp.append(masks)
            temp.append(label)
            temp.append(en_count)
            res.append(temp)


        return res

    def text2ids(self,text):
        inputs = self.tokenizer(text)
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        masks = [ int(id==self.tokenizer.mask_token_id)  for id in input_ids]
        return input_ids, attention_mask, masks


    def __getitem__(self, item):
        input_ids = self.data_list[item][0]
        attention_mask = self.data_list[item][1]
        masks = self.data_list[item][2]
        label = self.data_list[item][3]
        en_count = self.data_list[item][4]
        return input_ids, attention_mask, masks, label, en_count

    def __len__(self):
        return self.allLength
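The en_count field is not used during training; as the comment in texts_tokeniztion says, it records how many entities each sample contains so that batched inference is easier. A possible way to use it (my own sketch, not part of the author's code) is to split the flat per-entity predictions back into per-sample groups:

import torch

def split_predictions(logits: torch.Tensor, en_count: torch.Tensor):
    # logits: [m, 5] for all entities in the batch; en_count: [batch] entity counts per sample
    preds = torch.argmax(logits, dim=1) - 2               # back to the original -2..2 polarity scale
    # torch.split with a list of section sizes cuts the flat predictions sample by sample
    return [p.tolist() for p in torch.split(preds, en_count.tolist())]

# example: a batch of 3 samples with 2, 1 and 3 entities respectively (m = 6)
logits = torch.randn(6, 5)
en_count = torch.tensor([2, 1, 3])
print(split_predictions(logits, en_count))                # e.g. [[-1, 2], [0], [1, -2, 0]]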

Model training code

from data_reader.reader import DataReader
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer,BertConfig
from torch.optim import AdamW
from model import SentiClassifyBertPrompt
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.nn.utils.rnn import pad_sequence
from log.log import  Logger
from tqdm import tqdm
import torch.nn.functional as F
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "1"

def collate_fn(batch):
    input_ids, attention_mask, masks, label, en_count = zip(*batch)
    input_ids = pad_sequence(input_ids,batch_first=True,padding_value=0)
    attention_mask = pad_sequence(attention_mask,batch_first=True,padding_value=0)
    masks = pad_sequence(masks, batch_first=True, padding_value=0)
    labels = []
    for ele in label:
        labels.extend(ele)
    labels = torch.tensor(labels,dtype=torch.long)
    en_count = torch.stack(en_count,dim=0)
    return input_ids, attention_mask, masks, labels, en_count


def dev_validation(dev_loader,device,model):
    total_correct = 0
    total = 0
    model.eval()
    with torch.no_grad():
        for step, batch in enumerate(tqdm(dev_loader, desc="dev_validation")):
            batch = [t.to(device) for t in batch]
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "is_masked": batch[2]}
            label = batch[3]
            logits = model(inputs)
            preds = torch.argmax(logits,dim=1)

            correct = (preds==label).sum()
            total_correct += correct
            total += label.size()[0]

    acc = total_correct/total
    return acc

def set_seed(seed = 1):
    torch.cuda.manual_seed_all(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True



if __name__ == '__main__':
    set_seed()
    log_level = 10
    log_path = "logs/train_bert_prompt_AdamW_swa.log"
    logger = Logger(log_name='train_bert_prompt', log_level=log_level, log_path=log_path).logger

    pretrain_model_path = "./pretrained_models/chinese-bert-wwm-ext"
    batch_size = 16
    epochs = 10
    tokenizer = BertTokenizer.from_pretrained(pretrain_model_path)
    config = BertConfig.from_pretrained(pretrain_model_path)
    config.dropout = 0.2
    config.output_dim = 5
    config.batch_size = batch_size
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentiClassifyBertPrompt.from_pretrained(config=config,pretrained_model_name_or_path = pretrain_model_path)
    model.to(device)
    optimizer = AdamW(params=model.parameters(),lr=1e-6)

    # stochastic weight averaging (SWA) for better generalization
    swa_model = AveragedModel(model=model,device=device)
    # SWA learning-rate scheduler
    swa_scheduler = SWALR(optimizer, swa_lr=1e-6)

    train_dataset = DataReader(tokenizer=tokenizer, max_langth=512, file_path='./data/train_split.txt')
    train_loader = DataLoader(dataset=train_dataset, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)

    dev_dataset = DataReader(tokenizer=tokenizer, max_langth=512, file_path='./data/dev_split.txt')
    dev_loader = DataLoader(dataset=dev_dataset, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)



    for epoch in range(epochs):
        model.train()
        for step,batch in enumerate(tqdm(train_loader,desc="training")):
            batch = [ t.to(device) for t in batch]
            inputs = {"input_ids":batch[0],"attention_mask":batch[1],"is_masked":batch[2]}
            label = batch[3]
            logits = model(inputs)
            loss = F.cross_entropy(logits,label)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        swa_model.update_parameters(model)
        swa_scheduler.step()

        acc = dev_validation(dev_loader,device,model)
        swa_acc = dev_validation(dev_loader,device,swa_model)
        logger.info('Epoch %d acc is %.6f'%(epoch,acc))
        logger.info('Epoch %d swa_acc is %.6f' % (epoch, swa_acc))


The project directory is as follows
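The original post shows the directory as a screenshot. Reconstructed from the imports and file paths in the code above, the layout is roughly the following (the name of the training script is my assumption):

data/
    train_split.txt
    dev_split.txt
data_reader/
    reader.py              # DataReader
log/
    log.py                 # Logger
logs/                      # training logs
pretrained_models/
    chinese-bert-wwm-ext/
model.py                   # SentiClassifyBertPrompt
train.py                   # training entry point (assumed name)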

SWA - stochastic weight averaging

The training code above uses a trick I had not seen or used before: SWA (stochastic weight averaging). The core idea is that the model kept at the end of training is not the checkpoint that performed best on the validation set, but an average of the weights from all training epochs; this tends to give better generalization. There is no need to implement the weight averaging by hand, since PyTorch already provides a standard API for it. Whether it actually helps here needs to be verified experimentally (some people report that SGD + SWA works well).

    ......
    optimizer = AdamW(params=model.parameters(),lr=1e-6)
    # stochastic weight averaging (SWA) for better generalization
    swa_model = AveragedModel(model=model,device=device)
    # SWA learning-rate scheduler
    swa_scheduler = SWALR(optimizer, swa_lr=1e-6)
    for epoch in range(epochs):
        model.train()
        for step,batch in enumerate(tqdm(train_loader,desc="training")):
            ......
            # normal training step
            logits = model(inputs)
            loss = F.cross_entropy(logits,label)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # after each epoch, update the averaged parameters in swa_model
        swa_model.update_parameters(model)
        # step the SWA learning-rate scheduler
        swa_scheduler.step()
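One practical note of my own (not from the original post): AveragedModel keeps the averaged network in its .module attribute, so after training the SWA weights can be saved and reloaded like any normal state dict; and since BERT uses LayerNorm rather than BatchNorm, the usual torch.optim.swa_utils.update_bn step is not needed here.

import torch

# after the training loop: persist the SWA-averaged weights (sketch; the path is illustrative)
torch.save(swa_model.module.state_dict(), "checkpoints/bert_prompt_swa.pt")

# reload into a freshly built model for inference
model.load_state_dict(torch.load("checkpoints/bert_prompt_swa.pt", map_location=device))
model.eval()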

Baseline code

To get a rough comparison, I also ran the baseline scheme. The code is as follows:

import torch
from torch.utils.data import Dataset
from tqdm import tqdm
import json
from transformers import BertPreTrainedModel,BertModel
import torch.nn as nn
class SentiClassifyBert(BertPreTrainedModel):
    def __init__(self,config):
        super(SentiClassifyBert,self).__init__(config)
        self.bert = BertModel(config=config)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.LayerNorm(config.hidden_size),
            nn.LeakyReLU(),
            nn.Dropout(p=config.dropout),
            nn.Linear(config.hidden_size, config.output_dim),
        )

    def forward(self,inputs):
        inputs = {k: v for k, v in inputs.items() if k in ["input_ids", "attention_mask"]}
        outputs = self.bert(**inputs,return_dict=True, output_hidden_states=True)
        outputs = outputs.last_hidden_state
        # take the CLS embedding: [batch, 768] (avoids squeeze(), which breaks when batch size is 1)
        cls_output = outputs[:, 0, :]
        logits = self.classifier(cls_output)
        return logits

class DataReader(Dataset):
    def __init__(self,file_path,tokenizer,max_langth):
        self.file_path = file_path
        self.tokenizer = tokenizer
        self.max_length = max_langth
        self.data_list = self.texts_tokeniztion()
        self.allLength = len(self.data_list)

    def texts_tokeniztion(self):
        with open(self.file_path,'r',encoding='utf-8') as f:
            lines = f.readlines()

        res = []
        for line in tqdm(lines,desc='texts tokenization'):
            line_dic = json.loads(line.strip('\n'))
            content = line_dic['content']
            entity = line_dic['entity']
            for k,v in entity.items():
                # truncate so that content + [SEP] + entity fits within max_length;
                # use a local variable so repeated truncation does not keep shortening content
                truncated = content[0:self.max_length - len(k) - 1 - 10]
                text = truncated + "[SEP]" + k
                input_ids, attention_mask, masks = self.text2ids(text)

                input_ids = torch.tensor(input_ids, dtype=torch.long)
                attention_mask = torch.tensor(attention_mask, dtype=torch.long)
                label = torch.tensor(v+2, dtype=torch.long)
                temp = []
                temp.append(input_ids)
                temp.append(attention_mask)
                temp.append(label)
                res.append(temp)

        return res

    def text2ids(self,text):
        inputs = self.tokenizer(text)
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        masks = [ int(id==self.tokenizer.mask_token_id)  for id in input_ids]
        return input_ids, attention_mask, masks


    def __getitem__(self, item):
        input_ids = self.data_list[item][0]
        attention_mask = self.data_list[item][1]
        label = self.data_list[item][2]
        return input_ids, attention_mask, label

    def __len__(self):
        # DataLoader with shuffle=True needs the dataset length
        return self.allLength



from data_reader.reader import DataReader
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer,BertConfig
from torch.optim import AdamW,SGD
from model import SentiClassifyBert
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.nn.utils.rnn import pad_sequence
from log.log import  Logger
from tqdm import tqdm
import torch.nn.functional as F
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

def collate_fn(batch):
    input_ids, attention_mask,  label = zip(*batch)
    input_ids = pad_sequence(input_ids,batch_first=True,padding_value=0)
    attention_mask = pad_sequence(attention_mask,batch_first=True,padding_value=0)
    label = torch.stack(label,dim=0)
    return input_ids, attention_mask, label


def dev_validation(dev_loader,device,model):
    total_correct = 0
    total = 0
    model.eval()
    with torch.no_grad():
        for step, batch in enumerate(tqdm(dev_loader, desc="dev_validation")):
            batch = [t.to(device) for t in batch]
            inputs = {"input_ids": batch[0], "attention_mask": batch[1]}
            label = batch[2]
            logits = model(inputs)
            preds = torch.argmax(logits,dim=1)

            correct = (preds==label).sum()
            total_correct += correct
            total += label.size()[0]

    acc = total_correct/total
    return acc

def set_seed(seed = 1):
    torch.cuda.manual_seed_all(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True



if __name__ == '__main__':
    set_seed()
    log_level = 10
    log_path = "logs/train_bert_adamW_swa_20220718.log"
    logger = Logger(log_name='train_bert', log_level=log_level, log_path=log_path).logger

    pretrain_model_path = "./pretrained_models/chinese-bert-wwm-ext"
    batch_size = 16
    epochs = 20
    tokenizer = BertTokenizer.from_pretrained(pretrain_model_path)
    config = BertConfig.from_pretrained(pretrain_model_path)
    config.dropout = 0.2
    config.output_dim = 5
    config.batch_size = batch_size
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentiClassifyBert.from_pretrained(config=config,pretrained_model_name_or_path = pretrain_model_path)
    model.to(device)
    optimizer = AdamW(params=model.parameters(),lr=1e-6)
    # optimizer = SGD(params=model.parameters(), lr=1e-5,momentum=0.9)

    # stochastic weight averaging (SWA) for better generalization
    swa_model = AveragedModel(model=model,device=device)
    # SWA learning-rate scheduler
    swa_scheduler = SWALR(optimizer, swa_lr=1e-6)

    train_dataset = DataReader(tokenizer=tokenizer, max_langth=512, file_path='./data/train_split.txt')
    train_loader = DataLoader(dataset=train_dataset, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)

    dev_dataset = DataReader(tokenizer=tokenizer, max_langth=512, file_path='./data/dev_split.txt')
    dev_loader = DataLoader(dataset=dev_dataset, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)



    for epoch in range(epochs):
        model.train()
        for step,batch in enumerate(tqdm(train_loader,desc="training")):
            batch = [ t.to(device) for t in batch]
            inputs = {"input_ids":batch[0],"attention_mask":batch[1]}
            label = batch[2]
            logits = model(inputs)
            loss = F.cross_entropy(logits,label)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        swa_model.update_parameters(model)
        swa_scheduler.step()

        acc = dev_validation(dev_loader,device,model)
        swa_acc = dev_validation(dev_loader,device,swa_model)
        logger.info('Epoch %d acc is %.6f'%(epoch,acc))
        logger.info('Epoch %d swa_acc is %.6f' % (epoch, swa_acc))


For training, the roughly 90,000-sample training set was split so that about 10,000 samples serve as a validation set, chinese-bert-wwm-ext was used as the pre-trained model, and training ran for 20 epochs. The SGD and AdamW optimizers were compared, as were the baseline and the first-place scheme. Whether SWA actually helps cannot be concluded here, since there is no test set.

3. Results

The first-place scheme:

a. AdamW + SWA

Using the AdamW optimizer, accuracy on the validation set reached 0.929579 within 20 epochs; the SWA model reached 0.928673. How it performs on the test set is less clear.

b. SGD + SWA

SGD converges noticeably more slowly: it took 19 epochs to reach its peak accuracy, and that peak (about 89.7%) is still below AdamW's; it also looks as if it had not fully converged, so it simply takes too long. An adaptive optimizer like AdamW seems more suitable for people like me who are not very good at tuning optimizer hyperparameters.

Baseline scheme

By comparison, the baseline performs somewhat worse, so the first-place scheme is indeed effective. Two points matter most. First, samples are not repeatedly duplicated, which avoids shifting the data distribution, and the model may also learn the relationships between entities more directly. Second, the choice of sentence vector is more appropriate: instead of the CLS embedding or mean pooling, the embedding at each [MASK] position is used, which is more precise. This is essentially what prompt learning changes, applied here: the gap between pre-training and fine-tuning is smaller, the extracted embeddings are more accurate, and the overall results are better.

The scheme is elegant and well worth learning from!

Reference articles

2022 Sohu Campus NLP Algorithm Competition: First-Place Sentiment Analysis Scheme Sharing

2022 Sohu Campus Sentiment Analysis Algorithm Competition


Original post: blog.csdn.net/HUSTHY/article/details/125809156