[Presentation] HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification

Paper information

Paper title: HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification
Paper address: https://arxiv.org/abs/2204.13413
Field of study: NLP, text classification, prompt learning, hierarchical multi-label text classification
Proposed model: HPT (Hierarchy-aware Prompt Tuning)
Source: EMNLP 2022
Source code: https://github.com/wzh9969/HPT

Reading summary

  This is an early paper applying Prompt learning to hierarchical multi-label text classification. The idea is to unfold the hierarchical labels layer by layer into the pattern (template), and then modify the loss function to fit the multi-label classification task.

[0] Summary

  When describing the innovations and ideas, the author mainly focuses on one word: gap.

  1. There is a large gap between the conventional approach of fine-tuning a pre-trained language model (PLM) for hierarchical multi-label text classification and the masked language modeling (MLM) pre-training task. Therefore, the classification task should be recast into MLM form to close this gap.

  2. The MLM task is a multi-class (single-label) classification task, while hierarchical multi-label classification is a multi-label classification task. The two use different loss functions, which is another gap.

  The author's idea is to bridge these two gaps.

[1] Introduction

  Hierarchical text classification (HTC) is a multi-label text classification problem in which the classification results correspond to one or more paths in the label hierarchy.

[Figure 1: (a) the conventional fine-tuning paradigm for HTC with a structure encoder; (b) the proposed prompt-tuning paradigm of HPT]
  Figure (a) above shows the model structure of the traditional PLM-based solution for HTC. The text is fed directly into the PLM for encoding, and the hierarchical label information is modeled as a tree or graph structure (usually with a GAT, i.e., a graph attention network).

[Note] Since text encoder models are somewhat saturated, many master's theses now explore the idea of "label correlation", which is usually related to graphs or GATs.

  Despite the success of fine-tuning PLMs, the advent of GPT-3 popularized Prompt Learning. The idea is to use prompts to bridge the gap between the downstream task and the pre-training task, so as to fully tap the potential of the PLM.

  When using Prompt Learning (i.e., the MLM task) to solve HTC tasks, two problems arise:

  1. The labels of the masked language model (MLM) are flat; how can HTC's hierarchical labels be carried over to the MLM task?

  2. The MLM task is a multi-class (single-label) classification task, while hierarchical classification is a multi-label classification task; how should the loss function be set?

  As shown in figure (b) above, the author injects hierarchical information into the MLM task through a soft pattern to solve problem 1, and uses a zero-bounded multi-label cross-entropy loss to solve problem 2.

[2] Related work

[2.1] HTC

  According to how they use the label hierarchy, existing HTC work can be divided into local methods and global methods: local methods build a classifier for each node or level, while global methods build a single classifier for the entire label graph.

  Early global approaches ignored the label hierarchy and treated the problem as flat multi-label classification. Later, some studies attempted to incorporate label structure through meta-learning, reinforcement learning, and attention modules. While these methods can capture hierarchical information, directly encoding the overall label structure with a structure encoder further improves performance.

[2.2] Prompt tuning

  This subsection mainly introduces hard prompts and soft prompts.

[4] Method

  Two points are introduced: how to carry the tree-structured labels over to the MLM task, and how to design the loss for multi-label text classification.
[Figure 2: overall HPT architecture — part 1: the prompt template; part 2: the hierarchy-aware prompt module; part 3: the MLM loss; part 4: the ZMLCE classification loss]

[Note] As shown in the figure above:
   Part 1 is the template: after the regular text input, several special virtual tokens are appended (a sketch of this layout follows this note);
   Part 2 is the hierarchy-aware prompt module. Each layer $i$ of the label tree has a template token $t_i$ injected into the template, and $t_i$ is followed by a prediction embedding $e_{[PRED]}$. After the MLM head, the [PRED] token following $t_i$ only predicts labels of the corresponding layer, which the paper calls the Hierarchy Constraint;
   Part 3 is the MLM loss, which many prompt-based text classification papers also call the auxiliary loss. It is essentially the standard BERT loss computed after randomly masking 15% of the input tokens. It is independent of the classification loss, and the two do not interfere with each other;
   Part 4 is the loss function for hierarchical classification. Instead of the conventional multi-label classification loss (BCELoss), the paper proposes the Zero-bounded Multi-label Cross Entropy Loss.
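To make the template layout concrete, here is a minimal sketch (not the official implementation) of assembling the prompt sequence with a Hugging Face tokenizer; the `build_hpt_template` helper and the literal token strings `[V1]`…`[VL]` / `[PRED]` are illustrative assumptions.

```python
from transformers import AutoTokenizer

def build_hpt_template(text: str, depth: int, tokenizer) -> list:
    """Append one virtual layer token [Vi] and one [PRED] slot per hierarchy layer.

    Resulting layout: [CLS] text [SEP] [V1] [PRED] ... [V<depth>] [PRED] [SEP]
    """
    tokens = [tokenizer.cls_token] + tokenizer.tokenize(text) + [tokenizer.sep_token]
    for layer in range(1, depth + 1):
        tokens += [f"[V{layer}]", "[PRED]"]
    tokens.append(tokenizer.sep_token)
    return tokens

# The virtual tokens must first be registered as special tokens so the tokenizer
# does not split them (the PLM's embedding matrix is resized accordingly).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
depth = 3
tokenizer.add_special_tokens(
    {"additional_special_tokens": [f"[V{l}]" for l in range(1, depth + 1)] + ["[PRED]"]}
)
print(build_hpt_template("a news article about basketball", depth, tokenizer))
```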

[4.1] Hierarchy-aware prompts

  Hierarchy Constraints

  To incorporate the label hierarchy, hierarchy-aware prompts are proposed. Since the label hierarchy is a tree structure, the template is constructed according to the depth of the hierarchy.

  Given a predefined label hierarchy H = (Y, E) with depth L and input text x, the template is [CLS] x [SEP] [V1] [PRED] [V2] [PRED] … [VL] [PRED] [SEP]. The special [PRED] tokens are used for multi-label prediction.

  For the verbalizer, a learnable virtual label word vi is created for each label yi, and its embedding is initialized by averaging the embeddings of its corresponding label words. As shown in the green part of Figure 2, instead of predicting all labels in one slot, labels are divided into groups according to their layers, and each [PRED] token is constrained to predict only the labels of one layer. To this end, each template word [Vi] is followed by a [PRED] token for prediction at layer i.
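As a rough illustration of this constraint, below is a sketch of masking the [PRED] logits so that the token at layer i can only select layer-i label words; the tensor names and the -inf masking trick are my assumptions, not code from the paper's repository.

```python
import torch

# label_layer[k] = layer index (1-based) of the k-th virtual label word
label_layer = torch.tensor([1, 1, 2, 2, 2, 3])      # toy hierarchy: 6 labels over 3 layers
num_layers = int(label_layer.max())

# pred_logits: scores of each [PRED] token over all virtual label words,
# taken from the MLM head; shape (batch, num_layers, num_labels).
pred_logits = torch.randn(2, num_layers, label_layer.numel())

# Hierarchy constraint: the i-th [PRED] token may only predict layer-i labels.
layer_mask = label_layer.unsqueeze(0) == torch.arange(1, num_layers + 1).unsqueeze(1)  # (L, |Y|)
masked_logits = pred_logits.masked_fill(~layer_mask.unsqueeze(0), float("-inf"))

# Multi-label decision per layer: keep every label whose score exceeds 0 (see Section 4.2).
predictions = masked_logits > 0
```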

  Hierarchy Injection

  Hierarchical constraints only introduce the depth of labels, but lack their connectivity and hierarchical structure. In order to solve the connectivity problem of label level, the article proposes level injection .

  As shown in the blue part of part 2, the author uses a GAT to encode label information and injects the hierarchical information into the input template.
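A rough sketch of this idea, using torch_geometric's GATConv as the graph encoder: label embeddings are propagated over the parent-child edges of the label tree, then the labels of each layer are pooled to form the embedding of that layer's template token. The averaging-based pooling is my assumption for illustration, not necessarily the exact aggregation used in the paper.

```python
import torch
from torch_geometric.nn import GATConv

hidden, num_labels, num_layers = 768, 6, 3
label_layer = torch.tensor([0, 0, 1, 1, 1, 2])        # layer index (0-based) of each label
edge_index = torch.tensor([[0, 0, 1, 4],              # parent -> child edges of a toy label tree
                           [2, 3, 4, 5]])

label_emb = torch.nn.Embedding(num_labels, hidden)    # virtual label word embeddings
gat = GATConv(hidden, hidden, heads=1)

# Encode label connectivity with graph attention.
node_feat = gat(label_emb.weight, edge_index)         # (num_labels, hidden)

# Hierarchy injection: pool the labels of layer i into a single vector per layer.
layer_tokens = torch.stack(
    [node_feat[label_layer == l].mean(dim=0) for l in range(num_layers)]
)                                                     # (num_layers, hidden)
# These vectors then serve as the embeddings of [V1] ... [VL] in the input template.
```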

[4.2] Zero-bounded multi-label cross-entropy loss

  First, the MLM task is a multi-class, single-label classification task trained with cross-entropy loss, which does not apply to our hierarchical multi-label classification.

  Conventional hierarchical multi-label classification uses a binary cross-entropy loss (BCELoss), but the authors argue that BCE ignores correlations between labels. To bridge the multi-label vs. multi-class gap, the paper instead requires the scores of all target labels to be greater than the scores of all non-target labels, rather than scoring each label independently.

  In the paper, the authors propose two loss functions: the multi-label cross-entropy loss (MLCE) and the zero-bounded multi-label cross-entropy loss (ZMLCE).

  The multi-label cross-entropy loss is as follows:

[Equation image: multi-label cross-entropy (MLCE) loss]

  where $N^p$ denotes the set of target labels, $N^n$ the set of non-target labels, and $y_i$ the predicted score for each label. The goal of the loss function is to maximize the gap between the scores of target labels and the scores of non-target labels. However, this loss alone gives no usable decision threshold at inference time, since the split between target and non-target labels is unknown.
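From this description (every target score should exceed every non-target score), the loss presumably takes the standard pairwise multi-label cross-entropy form; the following is a reconstruction from the text above, not copied from the paper:

$$
\mathcal{L}_{\mathrm{MLCE}} = \log\Bigl(1 + \sum_{i \in N^p}\sum_{j \in N^n} e^{\,y_j - y_i}\Bigr)
$$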

  To solve this problem, an anchor label with a constant score of 0 is introduced: target labels should score above 0 and non-target labels below 0. This yields the zero-bounded multi-label cross-entropy loss:

[Equation image: zero-bounded multi-label cross-entropy (ZMLCE) loss]

  It consists of two parts: the first pushes the scores of target labels above zero, and the second pushes the scores of non-target labels below zero. This makes inference straightforward: every label whose score exceeds 0 is predicted. Note that this loss covers only one layer; the per-layer losses are summed at the end.
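Correspondingly, the per-layer zero-bounded loss presumably has the form below (again a reconstruction from the description, with $y$ denoting label scores):

$$
\mathcal{L}_{\mathrm{ZMLCE}} = \log\Bigl(1 + \sum_{i \in N^p} e^{-y_i}\Bigr) + \log\Bigl(1 + \sum_{j \in N^n} e^{\,y_j}\Bigr)
$$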

[Equation image: total training loss]
  The total loss of the model is the sum of the per-layer ZMLCE losses plus the MLM loss.
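A minimal PyTorch sketch of this training objective under the assumptions above (raw per-layer label scores, zero anchor, auxiliary MLM loss); the function names and tensor layout are illustrative, not taken from the official repo:

```python
import torch

def zmlce_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Zero-bounded multi-label cross-entropy for one layer.

    scores:  (batch, num_labels) raw label scores from this layer's [PRED] token
    targets: (batch, num_labels) binary ground-truth label indicators
    Pushes target-label scores above 0 and non-target scores below 0.
    """
    pos = torch.where(targets.bool(), -scores, torch.full_like(scores, float("-inf")))
    neg = torch.where(targets.bool(), torch.full_like(scores, float("-inf")), scores)
    # log(1 + sum(exp(.))) computed stably by appending a zero logit before logsumexp
    zero = torch.zeros(scores.size(0), 1, device=scores.device)
    pos_term = torch.logsumexp(torch.cat([pos, zero], dim=-1), dim=-1)
    neg_term = torch.logsumexp(torch.cat([neg, zero], dim=-1), dim=-1)
    return (pos_term + neg_term).mean()

def total_loss(per_layer_scores, per_layer_targets, mlm_loss):
    """Sum of per-layer ZMLCE losses plus the auxiliary MLM loss."""
    cls_loss = sum(zmlce_loss(s, t) for s, t in zip(per_layer_scores, per_layer_targets))
    return cls_loss + mlm_loss
```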


Origin blog.csdn.net/qq_43592352/article/details/130916951