Paper information

| name | content |
|---|---|
| paper title | HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification |
| paper address | https://arxiv.org/abs/2204.13413 |
| field of study | NLP, text classification, prompt learning, hierarchical multi-label text classification |
| proposed model | HPT (Hierarchy-aware Prompt Tuning) |
| source | EMNLP 2022 |
| source code | https://github.com/wzh9969/HPT |
Read summary
This is an early paper applying prompt tuning to hierarchical multi-label text classification. The idea is to encode the hierarchical labels layer by layer into the prompt pattern, and then to modify the loss function to fit the multi-label classification task.
[0] Summary
When describing the innovations, the author keeps returning to the word "gap":
1. There is a large gap between the conventional approach, fine-tuning a pre-trained language model (PLM) for hierarchical multi-label text classification, and the masked language model (MLM) pre-training task. The classification task should therefore be recast in MLM form to close this gap.
2. The MLM task is a multi-class (single-label) classification task, while hierarchical multi-label classification is a multi-label task. The two use different loss functions, which is a second gap.
The author's idea is to bridge these two gaps.
[1] Introduction
Hierarchical text classification (HTC) is a multi-label text classification problem in which the classification result corresponds to one or more paths in the label hierarchy.
Figure (a) above shows the traditional PLM-based model structure for solving HTC: the text is fed directly into the PLM for encoding, while the hierarchical label information is modeled as a tree or graph structure (usually with a GAT, i.e., a graph attention network).
[Note] Since text encoding models are somewhat saturated, many master's theses now explore the idea of "label correlation", which relates to graphs or GATs.
Despite the success of fine-tuning, the advent of GPT-3 brought Prompt Learning to prominence. Prompts are therefore used to bridge the gap between the downstream task and the pre-training task, so as to fully tap the potential of the PLM.
When using Prompt Learning (that is, the MLM task) to solve HTC tasks, two problems arise:
1. The label space of the masked language model (MLM) is flat; how can the hierarchical labels of HTC be carried over to the MLM task?
2. The MLM task is a multi-class classification task, while hierarchical multi-label classification is a multi-label task; how should the loss function be set?
As shown in figure (b) above, the author injects hierarchical information into the MLM task through a soft pattern to solve problem 1, and uses a zero-bounded multi-label cross-entropy loss to solve problem 2.
[2] Related work
[2.1] HTC
According to how the label hierarchy is used, existing HTC work can be divided into local methods and global methods: local methods build a classifier for each node or level, while global methods build a single classifier for the whole graph.
Early global approaches ignore the label hierarchy and treat the problem as a flat multi-label classification. Later, some studies attempted to incorporate label structures through meta-learning, reinforcement learning, and attention modules. While these methods can capture hierarchical information, directly encoding the overall label structure via a structure encoder can further improve performance.
[2.2] Prompt tuning
Mainly introduces Hard Prompt and Soft Prompt.
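To make the distinction concrete, here is a minimal sketch (names and sizes are illustrative, not from the paper's code): a hard prompt is a fixed natural-language template over vocabulary tokens, while a soft prompt is a set of learnable embeddings tuned directly in continuous space.

```python
import torch
import torch.nn as nn

# A hard prompt is a fixed natural-language template; its tokens come
# from the existing vocabulary and are not trained.
hard_prompt = "Topic: [MASK]. {text}"

class SoftPrompt(nn.Module):
    """A soft prompt: trainable virtual-token embeddings prepended to the input."""
    def __init__(self, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        # One trainable vector per virtual prompt token.
        self.embeddings = nn.Parameter(
            torch.randn(num_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the virtual tokens to every sequence in the batch.
        batch = input_embeds.size(0)
        prompt = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft = SoftPrompt(num_virtual_tokens=4, hidden_size=768)
out = soft(torch.randn(2, 10, 768))   # a batch of token embeddings
print(out.shape)                      # torch.Size([2, 14, 768])
```

Only the soft-prompt parameters need gradients here; this is what makes soft prompts tunable without hand-crafting template wording.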
[4] Method
Two points are introduced: how to carry the tree-structured labels over to the MLM task, and how to handle the loss for multi-label text classification.
[Note] As shown in the figure above:
Part 1 is the template: after the regular text input, some special pseudo tokens are appended.
Part 2 is the hierarchy-aware prompt module: each layer i of the label tree contributes a template token t_i, followed by an e_[PRED] token used for prediction. After the MLM head, the [PRED] slot following t_i predicts only the labels of the corresponding layer, which the paper calls the Hierarchy Constraint.
Part 3 is the loss of the MLM task, which many prompt-based text classification papers call an auxiliary loss. It is in fact the BERT loss computed after randomly masking 15% of the input; it is independent of the classification loss, and the two do not interfere with each other.
Part 4 is the loss function for hierarchical classification. Instead of the conventional multi-label BCELoss, the paper proposes the Zero-bounded Multi-label Cross Entropy Loss.
[4.1] Hierarchy-aware prompts
Hierarchy Constraint
To incorporate the label hierarchy, hierarchy-aware prompts are proposed. Since the label hierarchy is a tree, the template is constructed based on the depth of the tree.
Given a predefined label hierarchy Y_h = (Y, E) with depth L and input text x, the template is [CLS] x [SEP] [V1] [PRED] [V2] [PRED] ... [VL] [PRED] [SEP]. The special [PRED] tokens are slots for label prediction, each denoting a multi-label prediction.
For the verbalizer, a learnable virtual label word v_i is created for each label y_i, and its embedding is initialized by averaging the embeddings of its corresponding label words. As shown in the green part of Figure 2, instead of predicting all labels in one slot, labels are divided into groups by layer, and each [PRED] is constrained to predict only the labels of one layer. To this end, each template token [Vi] is followed by a [PRED] token for prediction at layer i.
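As a concrete illustration, the template construction above can be sketched in a few lines (the function name is hypothetical, not from the paper's repository):

```python
def build_template(text_tokens, depth):
    """Build [CLS] x [SEP] [V1] [PRED] ... [VL] [PRED] [SEP]:
    one layer token [Vi] plus one [PRED] slot per hierarchy layer."""
    tokens = ["[CLS]"] + list(text_tokens) + ["[SEP]"]
    for layer in range(1, depth + 1):
        tokens += [f"[V{layer}]", "[PRED]"]
    tokens.append("[SEP]")
    return tokens

tmpl = build_template(["a", "news", "article"], depth=2)
print(tmpl)
# ['[CLS]', 'a', 'news', 'article', '[SEP]',
#  '[V1]', '[PRED]', '[V2]', '[PRED]', '[SEP]']
```

The i-th [PRED] slot is then decoded only over the virtual label words of layer i, which is exactly the hierarchy constraint.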
Hierarchy injection
The hierarchy constraint only introduces the depth of labels, not their connectivity within the hierarchy. To address the connectivity of the label hierarchy, the paper proposes hierarchy injection.
As in the blue part of Part 2, the author uses a GAT to encode the label hierarchy and injects the resulting hierarchical information into the input template.
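A rough sketch of this idea, assuming a single simplified graph-attention layer (not the paper's exact GAT) over the label tree, followed by per-layer pooling into the prompt embeddings; the adjacency matrix is assumed to include self-loops so every node attends to at least itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelHierarchyEncoder(nn.Module):
    """One simplified graph-attention layer over the label tree; labels
    are then pooled per layer to form the layer prompt embeddings [Vi]."""
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, label_emb, adj, layer_of_label, depth):
        # label_emb: (num_labels, hidden); adj: (num_labels, num_labels)
        # 0/1 adjacency of the label tree, self-loops included.
        h = self.proj(label_emb)
        n = h.size(0)
        # Pairwise attention scores, masked by the tree's edges.
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.attn(pair).squeeze(-1)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(scores, dim=-1)
        h = alpha @ h                      # aggregate over neighbors
        # Pool the labels of each layer into one prompt embedding [Vi].
        prompts = torch.stack(
            [h[layer_of_label == l].mean(dim=0) for l in range(depth)])
        return prompts                     # (depth, hidden)
```

The resulting (depth, hidden) tensor would be injected into the template, replacing or being added to the [V1]…[VL] embeddings.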
[4.2] Zero-bounded multi-label cross-entropy loss
The MLM task is a multi-class, single-label classification task trained with cross-entropy loss, which does not apply to hierarchical multi-label classification.
Conventional hierarchical multi-label classification uses the binary cross-entropy loss (BCELoss), but the authors argue that BCE ignores the correlation between labels. To bridge this gap between multi-label and multi-class classification, the paper expects the scores of all target labels to be greater than the scores of all non-target labels, instead of scoring each label independently.
In the paper, the authors propose two loss functions: the multi-label cross-entropy loss (MLCE) and the zero-bounded multi-label cross-entropy loss (ZMLCE).
The multi-label cross-entropy loss is as follows:
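A sketch of the loss, reconstructed from the description that follows (y_i is the score of label i; N^p and N^n are the target and non-target label sets):

```latex
\mathcal{L}_{\mathrm{MLCE}}
  = \log\Bigl(1 + \sum_{i \in N^p} \sum_{j \in N^n} e^{\,y_j - y_i}\Bigr)
```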
where N^p denotes the set of target labels, N^n the set of non-target labels, and y_i the score of each label (the logit before the sigmoid). The goal of the loss is to make every target label score higher than every non-target label score. However, because the loss only compares relative scores, it provides no threshold for deciding which labels to output at inference time.
To solve this problem, an anchor label with a constant score of 0 is introduced, and target and non-target labels are expected to score above and below 0, respectively. This yields the zero-bounded multi-label cross-entropy loss:
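A sketch of the zero-bounded form, with the constant-0 anchor folded into both sums (same notation as above):

```latex
\mathcal{L}_{\mathrm{ZMLCE}}
  = \log\Bigl(1 + \sum_{i \in N^p} e^{-y_i}\Bigr)
  + \log\Bigl(1 + \sum_{j \in N^n} e^{\,y_j}\Bigr)
```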
It consists of two parts: the first pushes the scores of target labels above zero, and the second pushes the scores of non-target labels below zero, which makes inference straightforward (output the labels with positive scores). This loss covers only one layer; the losses of all layers are summed at the end.
The total loss of the model is the sum of the ZMLCE losses over all layers plus the MLM loss.
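A minimal PyTorch sketch of the per-layer ZMLCE loss, assuming raw (pre-sigmoid) scores and multi-hot targets; this is an illustration, not the repository's implementation:

```python
import torch

def zmlce_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Zero-bounded multi-label cross-entropy for one layer (sketch).

    scores:  (batch, num_labels) raw label scores, no sigmoid applied.
    targets: (batch, num_labels) multi-hot 0/1 target labels.
    Pushes target scores above 0 and non-target scores below 0.
    """
    # Keep -y_i for target labels and y_j for non-target labels;
    # mask the rest with -inf so exp() ignores them.
    pos_part = (-scores).masked_fill(targets == 0, float("-inf"))
    neg_part = scores.masked_fill(targets == 1, float("-inf"))
    # The constant-0 anchor enters as a zero column, so each term is
    # log(1 + sum exp(.)), computed stably with logsumexp.
    zeros = torch.zeros_like(scores[:, :1])
    pos_loss = torch.logsumexp(torch.cat([zeros, pos_part], dim=1), dim=1)
    neg_loss = torch.logsumexp(torch.cat([zeros, neg_part], dim=1), dim=1)
    return (pos_loss + neg_loss).mean()
```

At inference, the labels scoring above the zero anchor are output; during training, this loss would be computed once per hierarchy layer and summed, together with the MLM auxiliary loss.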