EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification

This article summarizes a paper from NAACL 2022. It introduces a data augmentation framework for text classification: building on existing data augmentation algorithms, it gives a method and criteria for screening generated samples through the classifier. The authors are from Fudan University (Shanghai) and Tongji University.

Paper: EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification - ACL Anthology

Code: zhaominyiz/EPiDA: Official Code for 'EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification' - NAACL 2022 (github.com)


1. Summary

In this paper, data augmentation is performed mainly by automatically generating data. The generated data are screened through relative entropy maximization (REM) and conditional entropy minimization (CEM), so that samples with high diversity and semantic consistency are selected and added to the classifier's training data. Experiments show that this method outperforms current SOTA methods.

2. Introduction

This section introduces the basics of DA (Data Augmentation). In CV, DA is generally performed by flipping, cropping, and tilting images; in NLP, methods such as synonym replacement, deletion, insertion, and swapping are used, but these may hurt model performance. Adversarial-training methods such as FGM and AWP can also be applied on the embedding layer as a form of DA.
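As a rough illustration of the embedding-level adversarial idea, here is a minimal FGM-style sketch (my own, not from the paper); the embedding tensor and loss are assumed placeholders:

```python
import torch

def fgm_perturb(embeddings: torch.Tensor, loss: torch.Tensor, epsilon: float = 1.0):
    """Minimal FGM-style sketch (assumed names, not the paper's code):
    shift the embedding tensor by an epsilon-scaled step along the gradient
    direction of the loss. `embeddings` must have requires_grad=True."""
    grad = torch.autograd.grad(loss, embeddings, retain_graph=True)[0]
    norm = grad.norm()
    if norm != 0:
        return embeddings + epsilon * grad / norm
    return embeddings
```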

The method of this article:


Figure 1. Pipeline of the framework. First, an existing DA method is used to generate candidate samples; SEAS then screens them using feedback from the classifier, continuously improving the diversity and quality of the retained samples and, in turn, the performance of the classifier.

3. Related work

This part reviews traditional DA methods in NLP: rule-based, interpolation-based, and model-based.

  • Rule-based: synonym replacement, deletion, swapping, insertion, misspellings, sentence reversal, substructure substitution, etc. (see the sketch after this list).
  • Interpolation-based: CV-style interpolation operations in the embedding layer or other hidden layers.
  • Model-based: use seq2seq models for back-translation, language models such as GPT for conditional generation, GANs for text generation, etc.; these are slow and resource-intensive, so they are not cost-effective for data augmentation.
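For a concrete feel of the rule-based family, here is a minimal EDA-style sketch (my own simplification, with a hypothetical `SYNONYMS` table standing in for a real thesaurus such as WordNet):

```python
import random

# Hypothetical synonym table; a real implementation would use WordNet or similar.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "bad": ["poor", "awful"]}

def eda_like_augment(sentence: str, p: float = 0.1) -> str:
    """Minimal EDA-style sketch: random synonym replacement, deletion, and swap."""
    words = sentence.split()
    # Synonym replacement with probability p per word
    words = [random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < p else w
             for w in words]
    # Random deletion (fall back to the original list if everything is dropped)
    if len(words) > 1:
        words = [w for w in words if random.random() > p] or words
    # Random swap of two positions
    if len(words) > 1 and random.random() < p:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(eda_like_augment("a good movie with a bad ending"))
```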

4. Method

The method essentially screens the samples generated by DA along two axes: 1. diversity; 2. consistency.

  1. Diversity: pass the generated sample y through the classifier trained on x and compute its loss. A larger loss means the sample adds more diversity to the original data, i.e. the relative entropy is larger.
  2. Consistency: compute the mutual information between the source data x and the generated data y through the model. The larger the mutual information, the more consistent the two are, i.e. the lower the conditional entropy.
  3. Finally, the two scores are combined as a weighted sum with a hyperparameter (a sketch is given below).
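A minimal sketch of this weighted scoring, assuming the classifier's logits on the original and augmented sample are available; the exact formulation in the paper may differ, and `alpha` is an assumed balancing hyperparameter:

```python
import torch
import torch.nn.functional as F

def epida_style_score(logits_orig: torch.Tensor, logits_aug: torch.Tensor,
                      label: int, alpha: float = 0.5) -> torch.Tensor:
    """Sketch of a REM + CEM style score (assumed form, not the paper's exact code)."""
    p_orig = F.softmax(logits_orig, dim=-1)
    p_aug = F.softmax(logits_aug, dim=-1)
    # Diversity (REM-like): relative entropy between the classifier's predictions
    # on the original and the augmented sample -- larger means more diverse.
    rem = F.kl_div(p_aug.log(), p_orig, reduction="sum")
    # Consistency (CEM-like): negative cross-entropy of the augmented sample
    # against the original label -- larger means the prediction stays consistent.
    cem = -F.cross_entropy(logits_aug.unsqueeze(0), torch.tensor([label]))
    return alpha * rem + (1 - alpha) * cem
```

Candidates with the highest combined score would then be the ones kept for training.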


5. Experiment

Experiment 1

Experiments were carried out on five classification datasets, with the F1-score of a BERT classifier as the evaluation metric. Two settings were used: 10% original data + 30% augmented data, and 40% original data + 40% augmented data; results are averaged over five repeated runs, and the perplexity (PPL) of the augmented data is also reported.


Experiment 2

To confirm that the plug-in component performs well with different classifiers and different DA algorithms, augmentation is performed with EDA, CWE, and TextAttack, and text classification is run with BERT, CNN, and XLNet classifiers.


Ablation experiment


An ablation study was performed to examine the effectiveness of different EPiDA configurations. CNN is used as the classifier and EDA as the DA algorithm, and Macro-F1 scores averaged over five runs on TREC 1% and Irony 1% are reported.

Here, DA = data augmentation, REM = relative entropy maximization, CEM = conditional entropy minimization, OA = online augmentation, PT = pre-training.

6. My thoughts

The purpose of data augmentation is to generate data with high diversity and fidelity, so that it does not degrade the model. In other words, on the input side the generated data should differ substantially from the original data, while on the output side the model's predictions should remain consistent.

The conditional entropy and relative entropy in this paper both seem to operate on the output side, and their objectives pull in opposite directions, yet their combination still yields a small improvement.

An alternative would be to measure diversity on the input side: make the difference between the original and generated data large there, e.g. require a lower cosine similarity at the embedding layer, while ensuring that the mutual information (or embedding similarity) at the deeper layers stays high. That is, use different measures to compare the generated and original data before and after they pass through the classifier. The classifier should then learn a bit better, similar to the alternating training of the discriminator in a GAN, where data generation, filtering, and learning alternate.
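As a rough sketch of this input-side check (my own idea, not from the paper), one could score how much an augmented sample diverges from the original via cosine similarity of their sentence embeddings; the random vectors below stand in for the output of an assumed encoder:

```python
import torch
import torch.nn.functional as F

def input_side_diversity(emb_orig: torch.Tensor, emb_aug: torch.Tensor) -> float:
    """Sketch: 1 - cosine similarity between sentence embeddings.
    Higher values mean the augmented sample differs more on the input side."""
    return 1.0 - F.cosine_similarity(emb_orig.unsqueeze(0), emb_aug.unsqueeze(0)).item()

# Random vectors standing in for embeddings from an assumed sentence encoder.
emb_x, emb_y = torch.randn(768), torch.randn(768)
print(input_side_diversity(emb_x, emb_y))
```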


Origin blog.csdn.net/be_humble/article/details/127302591