Summary and Comparison of 15 NLP Data Augmentation Methods

Methods of Data Augmentation

Data Augmentation (DA for short) refers to a class of methods for synthesizing new data from existing data. Data is, after all, the real ceiling on model performance: with more data, results improve, the model generalizes better, and robustness increases. However, because NLP operates on discrete text, CV-style operations such as cropping may change the semantics of a sentence. Augmentation must preserve data quality while adding diversity, which is why practitioners are cautious about it.

Based on the diversity of the generated samples, the author divides data augmentation into the following three categories:

  • Paraphrasing: make changes to the words, phrases, and sentence structure of a sentence while retaining its original semantics
  • Noising: add discrete or continuous noise that has little effect on semantics, while keeping the label unchanged
  • Sampling: select new samples from the current data distribution, generating more data

Paraphrasing

The author summarizes a total of six paraphrasing methods; minimal sketches of several of them follow the list.

  • Thesaurus: use external resources such as dictionaries and knowledge graphs to randomly replace non-stop words with synonyms or hypernyms. To increase diversity, other words of the same part of speech can also be substituted
  • Semantic Embeddings: use semantic vectors to replace words or phrases with similar ones (not necessarily synonyms). Since every word has an embedding, the scope of replacement is wider, whereas the previous method can only replace words covered by the resource
  • MLMs: use masked language models such as BERT to randomly mask out some tokens and generate replacements for them
  • Rules: rewrite parts of the sentence with hand-written rules such as abbreviations, verb conjugations, and negation, e.g. changing "is not" into "isn't"
  • Machine Translation: divided into two types. Back-translation translates a sentence into another language and then back again; unidirectional translation translates sentences into other languages in cross-lingual tasks
  • Model Generation: use a Seq2Seq model to generate semantically consistent sentences
    Ambiguity" mainly means that some polysemous words have different meanings in different scenarios

Noising

Humans are robust to noise when reading text, such as out-of-order words and typos. Based on this idea, noise can be added to the training data to improve the robustness of the model.


  • Swapping: in addition to swapping words within a sentence, instances or sentences can also be swapped in classification tasks
  • Deletion: randomly delete words in a sentence, or sentences in a document
  • Insertion: randomly insert synonyms into the sentence
  • Substitution: randomly replace some words with other, non-synonymous words, simulating misspelling scenarios. To avoid changing the label, use label-independent words, or words from other sentences in the training data
  • Mixup: popular in the past two years; the sentence representation and the label are fused with a certain weight, introducing continuous noise and generating samples between different labels, though interpretability is poor

Sketches of these word-level operations and of mixup follow this list.
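A combined minimal sketch of the word-level noising operations above (swap, deletion, insertion, substitution), in the spirit of EDA. Pure Python; the candidate lists stand in for real synonym or label-independent vocabularies.

```python
# Minimal sketch of word-level noising: swap, delete, insert, substitute.
import random

def random_swap(words, n=1):
    """Swap two random positions, n times."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_delete(words, p=0.1):
    """Drop each word with probability p, never returning an empty sentence."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]

def random_insert(words, candidates, n=1):
    """Insert n words drawn from a candidate list at random positions."""
    words = words[:]
    for _ in range(n):
        words.insert(random.randrange(len(words) + 1), random.choice(candidates))
    return words

def random_substitute(words, candidates, p=0.1):
    """Replace each word with a label-independent candidate with probability p."""
    return [random.choice(candidates) if random.random() < p else w for w in words]

sent = "the movie was surprisingly good".split()
print(random_swap(sent))
print(random_delete(sent))
print(random_insert(sent, ["really", "quite"]))
print(random_substitute(sent, ["banana", "tuesday"]))
```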
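And a minimal mixup sketch on precomputed sentence vectors: both the representations and the one-hot labels are interpolated with a Beta-distributed weight. The 768-d random vectors here merely stand in for real BERT outputs.

```python
# Minimal sketch of mixup on sentence representations: interpolate both the
# embeddings and the one-hot labels with a Beta-distributed weight.
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.4):
    """emb_*: sentence vectors; label_*: one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    emb = lam * emb_a + (1 - lam) * emb_b
    label = lam * label_a + (1 - lam) * label_b   # soft label between classes
    return emb, label

a, b = np.random.randn(768), np.random.randn(768)     # stand-ins for BERT vectors
ya, yb = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # two-class one-hot labels
emb, label = mixup(a, b, ya, yb)
print(label)   # e.g. [0.73, 0.27] -- a sample "between" the two labels
```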

In general, noise-based augmentation methods are easy to use, but they affect sentence structure and semantics, and their diversity is limited; they mainly serve to improve the robustness of the model.

Adversarial samples: perturb the input while keeping the label unchanged.
Dropout: also used by SimCSE, and by R-Drop, which adds continuous noise through dropout.
Feature Cut-off: for example, BERT vectors are all 768-dimensional, and some dimensions can be randomly set to 0; this also works well (a sketch follows).
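A minimal sketch of feature cut-off on a 768-d vector; the cut-off probability is our own assumption.

```python
# Minimal sketch of feature cut-off: randomly zero out a fraction of the
# dimensions of a 768-d sentence vector (as with BERT representations).
import numpy as np

def feature_cutoff(vec, p=0.1):
    mask = np.random.rand(vec.shape[-1]) >= p   # keep ~(1-p) of the dimensions
    return vec * mask

vec = np.random.randn(768)
aug = feature_cutoff(vec, p=0.1)
print((aug == 0).sum(), "dimensions zeroed")
```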

Sampling

Sampling draws new samples from the data distribution. Unlike the more general paraphrasing methods, sampling is task-dependent and needs to provide more diversity while ensuring data reliability, which makes it harder than the previous two categories of augmentation. The author organizes four methods.

Method Stacking

In practical applications, multiple methods, or a single method at different granularities, may be applied together; a minimal sketch of such composition follows.
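A minimal sketch of stacking: compose augmenters into a pipeline and apply them in sequence. The two toy augmenters mirror the noising sketch above and stand in for any of the methods in this post.

```python
# Minimal sketch of stacking augmentation methods: compose augmenters and
# apply them in sequence.
import random

def random_swap(words):
    words = words[:]
    i, j = random.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return words

def random_delete(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept or words

def stack(*augmenters):
    """Return a function that applies each augmenter in turn."""
    def apply(words):
        for aug in augmenters:
            words = aug(words)
        return words
    return apply

augment = stack(random_swap, lambda w: random_delete(w, p=0.05))
print(augment("the movie was surprisingly good".split()))
```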

Summary

Data augmentation is very down-to-earth research. Few-shot settings and domain transfer are problems every NLPer encounters, and making a fuss about the data can be more effective than other kinds of model modification. As this review also shows, data augmentation can be quite fancy without affecting online inference speed. For example, I have previously done data augmentation with T5 and ELECTRA, and both brought some improvement: low-key yet luxurious, elegant without losing grandeur.


Source: blog.csdn.net/kuxingseng123/article/details/129114960