This article summarizes a NAACL 2022 paper that introduces a data augmentation framework for text classification. Building on existing data augmentation algorithms, it provides a method and criteria for screening generated samples with a sample filter. The authors are students at Fudan University (Shanghai) and Tongji University.
1. Summary
In this paper, data augmentation is performed by automatically generating data. The generated data are then screened via relative entropy maximization (REM) and conditional entropy minimization (CEM) to select samples with high diversity and high semantic consistency, which are added to the classifier's training set. Experiments show that this method outperforms current SOTA methods.
2. Introduction
This section introduces the basics of DA (Data Augmentation). In CV, DA is typically performed by flipping, cropping, and tilting images; in NLP, common methods include synonym replacement, word deletion, insertion, and swapping, though these can degrade model performance. Adversarial training methods such as FGM and AWP can also be applied at the embedding layer for DA.
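As a sketch of the embedding-layer adversarial idea mentioned above, here is a minimal FGM-style perturbation in NumPy. The linear logistic model and toy vectors are my own assumptions for illustration, not the paper's setup:

```python
import numpy as np

def fgm_perturb(emb, grad, epsilon=1e-2):
    """FGM: move the embedding a fixed distance epsilon along the
    normalized gradient of the loss, the direction that most
    increases the loss locally."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return emb
    return emb + epsilon * grad / norm

def logistic_input_grad(w, x, y):
    """Gradient of the logistic loss w.r.t. the input embedding x
    for a toy linear model sigmoid(w @ x) with label y."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    return (p - y) * w

# Toy example: perturb one "embedding" vector.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 0.5])
g = logistic_input_grad(w, x, y=1.0)
x_adv = fgm_perturb(x, g, epsilon=0.1)
```

By construction the perturbation has norm exactly `epsilon`, which is what makes FGM a bounded attack rather than a full gradient step.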
The method of this article:
Figure 1 shows the pipeline of the framework: an existing DA method first generates candidate samples; the screening module (SEAS) then filters them using feedback from the classifier, continuously improving the diversity and quality of the selected samples and ultimately the performance of the classifier.
3. Related work
This part introduces traditional DA methods in NLP, which fall into three families: rule-based, interpolation-based, and model-based.
- Rule-based: synonym replacement, deletion, swapping, insertion, misspelling injection, sentence reversal, substructure substitution, etc.
- Interpolation-based: perform CV-style interpolation (e.g., mixup) in the embedding layer or other hidden layers.
- Model-based: use seq2seq models for back-translation, language models such as GPT for conditional generation, GANs for text generation, etc.; these are slow and resource-intensive, so they are not cost-effective for data augmentation.
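To make the rule-based family concrete, here is a minimal EDA-style augmenter applying one random edit (synonym replacement, deletion, or swap). The tiny synonym table is a toy assumption; a real implementation would draw synonyms from WordNet:

```python
import random

# Toy synonym table -- a real EDA implementation would use WordNet.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def eda_augment(sentence, p_delete=0.1, seed=None):
    """Apply one random rule-based edit to a sentence:
    synonym replacement, random deletion, or word swap."""
    rng = random.Random(seed)
    words = sentence.split()
    op = rng.choice(["synonym", "delete", "swap"])
    if op == "synonym":
        candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
        if candidates:
            i = rng.choice(candidates)
            words[i] = rng.choice(SYNONYMS[words[i]])
    elif op == "delete" and len(words) > 1:
        words = [w for w in words if rng.random() > p_delete]
    elif op == "swap" and len(words) > 1:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

Such edits are cheap to generate in bulk, which is exactly why a downstream screening step (as in this paper) is useful: many of the outputs will be noisy or label-breaking.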
4. Method
In short, the method screens the samples generated by DA along two axes: 1. diversity; 2. consistency.
- Diversity: pass a generated sample y through the classifier trained on x and compute its loss. A large loss means the sample adds diversity relative to the original data, i.e., the relative entropy is large.
- Consistency: compute the mutual information between the source data x and the generated data y through the model. Larger mutual information means the two are more consistent, i.e., the conditional entropy is lower.
- Finally, the two scores are combined as a weighted sum, with the weight as a hyperparameter.
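The screening step above can be sketched as follows. This is my reading, not the paper's exact formulation: I use KL divergence between the classifier's output distributions as the diversity term, the probability that y's prediction agrees with x's predicted label as a simple consistency proxy, and `lam` as the hyperparameter weight:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) over discrete distributions, smoothed for stability."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def screen_score(p_x, p_y, lam=0.5):
    """Score a generated sample y against its source x.
    p_x, p_y: classifier output distributions for x and y.
    Diversity (REM proxy): KL(p_x || p_y).
    Consistency (CEM proxy): probability p_y assigns to x's argmax label."""
    diversity = kl_div(p_x, p_y)
    consistency = float(np.asarray(p_y)[int(np.argmax(p_x))])
    return lam * diversity + (1 - lam) * consistency

def select_topk(p_x, cand_dists, k=2, lam=0.5):
    """Keep the indices of the top-k candidates by screening score."""
    scores = [screen_score(p_x, p_y, lam) for p_y in cand_dists]
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:k]]
```

Note the tension the paper points out: an identical sample maximizes the consistency term but contributes zero diversity, so the weight `lam` trades one off against the other.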
5. Experiment
Experiment 1
Experiments were carried out on 5 different classification datasets, with the F1-score of a BERT classifier as the evaluation metric, under two settings: 10% original data + 30% augmented data, and 40% original data + 40% augmented data. Results are averaged over 5 repeated runs, and the perplexity (PPL) of the generated data is also reported.
Experiment 2
To confirm that the plug-in component performs well across different classifiers and different DA algorithms, DA is performed with EDA, CWE, and TextAttack, and the text classification tasks use BERT, CNN, and XLNet classifiers.
Ablation experiment
An ablation study was performed to examine the effectiveness of different EPiDA configurations, using CNN as the classifier and EDA as the DA algorithm, reporting Macro-F1 scores averaged over five repeated runs on TREC 1% and Irony 1%.
Components ablated: DA (data augmentation), REM (relative entropy maximization), CEM (conditional entropy minimization), OA (online augmentation), PT (pre-training).
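For reference, Macro-F1 (the metric reported in the ablation) is the unweighted mean of per-class F1 scores; a minimal implementation:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted mean of per-class F1 scores, so rare
    classes count as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging is the sensible choice here because low-resource subsets like TREC 1% are class-imbalanced, and a micro-averaged score would be dominated by the majority class.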
6. My idea
The purpose of data augmentation is to generate data with high diversity and high fidelity that does not degrade the model. In other words, at the input end the generated data should differ substantially from the original data, while at the output end the model's predictions should remain consistent.
The conditional entropy and relative entropy in this paper both seem to operate at the output end, and their objectives pull in opposite directions; yet combined, they still yield a slight improvement.
The difference could instead be measured at the input end: require the cosine similarity at the embedding layer to be low (large difference), while ensuring the mutual information (or embedding similarity) at deeper layers stays high, i.e., use different metrics to evaluate the similarity between generated and original data before and after the classifier. The classifier should then learn somewhat better, resembling the alternating training in a GAN, where generated data is filtered and fed back into learning.
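The input-end check proposed above could be sketched as a cosine-similarity threshold on embeddings: accept a generated sample as "diverse at the input end" only if its embedding is sufficiently far from the original's. The embedding vectors and the threshold here are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def input_end_diverse(emb_x, emb_y, max_sim=0.9):
    """Accept y as diverse at the input end if its embedding
    similarity to x is below the threshold."""
    return cosine(emb_x, emb_y) < max_sim
```

This would complement, not replace, the output-end consistency check: a sample should pass both the input-end dissimilarity filter and the output-end label-agreement filter.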