NLP Data Augmentation Methods: EDA, Back-Translation, MixMatch, and UDA

1 Background: data augmentation and its scenarios

As AI technology has developed, better neural network models have demanded ever larger datasets. In classification tasks, if the amount of data in different classes varies greatly, the model tends to overfit, seriously hurting prediction accuracy.

Broadly speaking, supervised models still lead semi-supervised and unsupervised learning. But supervised models need large amounts of labeled data, and when the requirement reaches thousands or millions of examples or more, the high cost of manual annotation drives many people away.

How can we use limited labeled data to obtain more training data, reduce overfitting, and train a model with stronger generalization ability? Data augmentation is undoubtedly a powerful solution.

Data augmentation was first used widely in computer vision, where various techniques generate new training samples: images can be translated, rotated, compressed, color-adjusted, and so on. Although the "new" sample changes in appearance to some extent, its label stays the same. NLP data, however, is discrete, which means we cannot simply transform the input directly: replacing one word can change the meaning of the entire sentence. This article therefore focuses on text data augmentation methods and techniques for quickly expanding text data.

2 Traditional text data augmentation techniques

Existing NLP data augmentation generally follows two ideas: adding noise and back-translation, both of which create new samples directly from existing labeled data. Adding noise means creating new data from the original by replacing words, deleting words, and so on. Back-translation means translating the original data into another language and then translating it back; because languages differ in word order and logic, back-translation often produces new data that differs considerably from the original.

The paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks proposed and validated several noise-based text augmentation techniques: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD), briefly introduced below. (A minimal code sketch of all four follows the examples.)

2.1 EDA

(1) Synonym Replacement (SR): ignoring stop words, randomly select n words from the sentence, then randomly choose synonyms for them from a thesaurus and replace them.

Eg: "I really like this movie" -> "I really like this movie," the sentence still have the same meaning, it is likely to have the same label.

(2) Random Insertion (RI): ignoring stop words, randomly pick a word, then randomly pick one of its synonyms and insert it at a random position in the sentence. This process may be repeated n times.

Eg: "I really like this movie" -> "Love I really like this movie."

(3) Random Swap (RS): randomly select two words in the sentence and swap their positions. This process may be repeated n times.

Eg: "know how to evaluate the 2017 almost mountains Cup game machine learning?" -> "2017 machine learning how to evaluate the competition know almost see the mountains Cup?."

(4) Random Deletion (RD): delete each word in the sentence independently with probability p.

Eg: "I know how to evaluate the 2017 almost mountains Cup game machine learning?" -> "How to see the mountains Cup 2017 machine learning."

 

How effective are these four methods? On English data the results are impressive. After these four operations, the augmented sentences may be hard to read, but the authors found that they make the model more robust, especially on small datasets.

Each individual method also shows good results. The paper's figure (omitted here) plots the average performance gain of the EDA operations on five text classification tasks at different training set sizes. The parameter α roughly means "the percentage of words in the sentence changed by each augmentation," and the vertical axis is the model's performance gain.

We can see that α = 0.1 already achieves good improvements. The less training data there is, the more obvious the gains. Excessive augmentation actually brings limited benefit, and for the SR and RD operations it can even seriously hurt performance.

Overall, traditional text augmentation performs well in low-data regimes, but the shortcomings of the four methods cannot be ignored:

  • Synonym replacement (SR) has a subtle problem: a synonym's word vector is very similar to the original word's, so the model treats the two sentences as nearly identical; the dataset is not effectively expanded.

  • Random insertion (RI) visibly destroys the semantic structure and word order of the original sentence; ignoring stop words means the generated data contains little valuable information, and because insertion does not focus on synonyms of the sentence's key words, the diversity actually added is limited.

  • Random swap (RS) does not change the morphemes of the original sentence at all; the new sentence is essentially the same bag of words, so the generalization gain is limited.

  • Random deletion (RD) inherits both flaws: like random insertion it does not focus on key words, and like random swap its generalization gains are poor. Although random operations touch every word equally, if the deleted word happens to be the strongest classification feature, not only may the semantics change but the correctness of the label itself becomes questionable.

 

2.2 Back-translation

In this method, machine translation is used to translate Chinese text into another language and then translate it back into Chinese.

Eg: "Jay is a strength to sing Chinese music, his albums sold all over the world." -> "Jay Chou is a strength singer in the Chinese music scene, his albums are sold all over the world." -> "Jay is a great singer Chinese music industry, his albums sold around the world."

This method was used successfully in Kaggle's Toxic Comment Classification competition. Back-translation is an NLP augmentation method frequently used in machine translation; its essence is to quickly generate additional data by producing multiple round-trip translations.

 

Back-translation often increases the diversity of text data: compared with word replacement, it can sometimes change the syntactic structure while preserving the semantics. However, the generated data depends on the quality of the translation, and the results may well be inaccurate. If you use translation software APIs, you may also run into account or quota restrictions.
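
As a minimal sketch, back-translation is just a round trip through a pivot language. The translate function below is hypothetical, a stand-in for whatever MT API or local model you have access to; only the round-trip logic is the method itself.

def translate(text, src, dst):
    # Hypothetical stand-in: plug in an MT API or a local NMT model here.
    raise NotImplementedError

def back_translate(text, src="zh", pivots=("en", "fr", "de")):
    # Round-trip through each pivot language; different pivots usually
    # yield different paraphrases of the same sentence.
    results = []
    for pivot in pivots:
        intermediate = translate(text, src=src, dst=pivot)
        results.append(translate(intermediate, src=pivot, dst=src))
    return results

Using several pivot languages, as in section 4.1 below, produces several distinct paraphrases of one sentence.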

 

3 Deep-learning-based data augmentation techniques

3.1 Semi-supervised MixMatch

Semi-supervised learning was proposed to make better use of unlabeled data and reduce reliance on large labeled datasets; it has since proven to be a powerful learning paradigm.

In the MixMatch paper, the authors unify the dominant approaches to semi-supervised learning into a single new algorithm: MixMatch. It works by guessing low-entropy labels for augmented unlabeled examples and then mixing the labeled and unlabeled data with MixUp.

The authors show experimentally that MixMatch leads all previous methods by a large margin across many datasets and labeled-data scales. For example, on CIFAR-10 with only 250 labels, they reduce the error rate to a quarter of the previous best method's; on STL-10 they cut it in half.

The authors also demonstrate that MixMatch achieves a much better trade-off between accuracy and privacy protection for differentially private learning. Finally, they run ablation experiments to analyze which components of MixMatch are most critical.
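
The label-guessing core of MixMatch fits in a few lines. In this NumPy sketch, model (returning class probabilities) and augment are hypothetical stand-ins; averaging then sharpening is how MixMatch produces its low-entropy guessed labels, and mixup is the mixing step.

import numpy as np

def guess_label(model, x, augment, K=2, T=0.5):
    # Average the model's class probabilities over K random augmentations...
    p = np.mean([model(augment(x)) for _ in range(K)], axis=0)
    # ...then sharpen with temperature T to lower the entropy of the guess.
    p = p ** (1.0 / T)
    return p / p.sum()

def mixup(x1, y1, x2, y2, alpha=0.75):
    # MixUp: convex combination of two examples and their labels.
    # MixMatch takes lam = max(lam, 1 - lam) so the mix stays closer to x1.
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2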

3.2 Unsupervised Data Augmentation (UDA)

The EDA results show that traditional augmentation has some effect, but mainly on small datasets; for deep learning models that crave large amounts of training data, traditional methods remain limited. Unsupervised Data Augmentation (UDA) was proposed to open a door when large amounts of labeled data are missing.

 

Besides common data augmentation, the MixMatch algorithm has a secret weapon: the MixUp augmentation. UDA's success, in turn, comes from using data augmentation algorithms targeted at the specific task.

Compared with conventional noise such as Gaussian noise or dropout noise, using a different, task-specific augmentation method for each kind of data can produce noise that is more effective, more realistic, and more diverse. Moreover, targeted, performance-oriented augmentation strategies can learn to find the signal that is missing or most needed in the original labeled data (for example, targeted color augmentation for image data).

 

The figure (omitted here) shows UDA's training structure and objectives. Both labeled and unlabeled data are used: for labeled data, a cross-entropy loss is applied. For unlabeled data, unlike MixMatch's L2 loss, UDA uses the KL divergence between the model's predictions on the original example and on its augmented version. The targeted augmentations include back-translation, AutoAugment (for images), and TF-IDF word replacement. Back-translation here goes from English to French and back to English, and the IDF statistics are obtained from the DBPedia corpus.
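
A minimal PyTorch sketch of that training objective, assuming model returns logits: cross-entropy on the labeled batch plus a KL-divergence consistency term between predictions on an unlabeled example and its augmented version, with the clean-side prediction fixed as the target.

import torch
import torch.nn.functional as F

def uda_loss(model, x_labeled, y_labeled, x_unlabeled, x_augmented, lam=1.0):
    # Supervised part: ordinary cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)
    # Unsupervised part: KL divergence between the prediction on the clean
    # unlabeled example (detached, used as the target) and on its augmentation.
    with torch.no_grad():
        target = F.softmax(model(x_unlabeled), dim=-1)
    log_pred = F.log_softmax(model(x_augmented), dim=-1)
    unsup_loss = F.kl_div(log_pred, target, reduction="batchmean")
    return sup_loss + lam * unsup_loss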

For text, two approaches are chosen: back-translation and keyword-aware replacement. Back-translation enriches sentence patterns and structures, while the TF-IDF approach improves on EDA's random word replacement strategy: using prior knowledge from DBPedia and actual word frequencies, it identifies the keywords and then performs synonym replacement around them, avoiding the generation of useless or erroneous data.
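
A sketch of TF-IDF-based word replacement under stated assumptions: idf is a precomputed IDF table (the paper derives its statistics from DBPedia), sample_replacement is a hypothetical stub, and the fraction p of lowest-scoring words is replaced. The point is that high-TF-IDF keywords are preserved while uninformative words are swapped out.

def sample_replacement(word):
    # Hypothetical stub: UDA samples replacement words from the vocabulary
    # weighted by TF-IDF; any sensible sampler can be substituted here.
    return "<replaced>"

def tfidf_word_replacement(words, idf, p=0.3):
    # Score each word by TF-IDF within the sentence; low-scoring words carry
    # little information and get replaced, high-scoring keywords are kept.
    tf = {w: words.count(w) / len(words) for w in words}
    scores = [tf[w] * idf.get(w, 0.0) for w in words]
    cutoff = sorted(scores)[int(len(scores) * p)]
    return [sample_replacement(w) if s <= cutoff else w
            for w, s in zip(words, scores)]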

 

In addition, another important breakthrough of UDA is Training Signal Annealing (TSA), which gradually releases the supervised training signal during training.

 

When a small amount of labeled data is collected alongside a large amount of unlabeled data, the two may come from very different distributions. For example, the labeled data may all be insurance-related, while the unlabeled data are hot news stories. Because training on large amounts of unlabeled data requires a large model, and a large model easily overfits the limited supervised data, TSA gradually releases the supervised training signal.

At each training step a threshold η_t ≤ 1 is set. When the model's predicted probability for the correct class of a labeled example exceeds η_t, that example is removed from the loss function, and training continues only on the other labeled examples in the minibatch.

The figure (omitted here) shows TSA's three schedules, each suited to different data. The exp schedule suits tasks whose labels are easy to fit or whose labeled set is small, mainly because the supervised signal is released mostly at the end of training, which prevents the model from overfitting quickly. Conversely, the log schedule suits cases with large amounts of labeled data, where training is not prone to overfitting.
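
A sketch of the three schedules, using the formulation from the UDA paper: with K classes and training progress t/T, the threshold is η_t = α_t · (1 − 1/K) + 1/K, where α_t rises from 0 to 1 along a log, linear, or exp curve.

import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    # alpha rises from 0 to 1; the schedule controls how fast the
    # supervised training signal is released.
    frac = step / total_steps
    if schedule == "log":      # released quickly: large labeled sets
        alpha = 1 - math.exp(-frac * 5)
    elif schedule == "exp":    # released late: small, easy-to-overfit sets
        alpha = math.exp((frac - 1) * 5)
    else:                      # linear
        alpha = frac
    # Map alpha from [0, 1] onto [1/K, 1]: chance level up to full confidence.
    return alpha * (1 - 1 / num_classes) + 1 / num_classes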

So how well does UDA work? Published results show that this unsupervised data generation approach performs well on multiple tasks: ① on the IMDb text classification dataset, UDA with only 20 labels achieves better results than the previous best method trained on 25,000 labels; ② on standard semi-supervised benchmarks (CIFAR-10 with 4,000 labels and SVHN with 1,000 labels), UDA beats all previous methods, including MixMatch, reducing the error rate by at least 30%; ③ on large datasets such as ImageNet, with only an extra 1.3 million unlabeled images, UDA continues to improve the top-5 accuracy over previous methods.

4 Data augmentation in practice

We have written a related project, textda, that expands data using EDA and back-translation. It can be installed via pip and invoked as follows:

pip install textda
from textda.data_expansion import *


print(data_expansion('生活里的惬意,无需等到春暖花开'))

output (English glosses; the actual output is a list of Chinese variants of the input sentence):

['Inside the comfortable life, without having to wait until the spring',
 'Wait until spring of life',
 'Life does not need to be nice, wait until spring',
 'Comfortable life, without having to wait until the spring',
 'Life is comfortable, you do not need to wait until spring',
 'Without having a comfortable life, until in the spring',
 'Life comfortable, no need to wait until spring']

4.1 Back-translation with a translation tool:

Original sentence: 生活里的惬意,无需等到春暖花开 (life is pleasant, without waiting for spring)

Chinese -> English -> Chinese: living comfort, without waiting for spring flowering

Chinese -> Japanese -> Chinese: life comfort, without waiting for spring flowers

Chinese -> German -> Chinese: living comfort, without waiting for spring flowering

Chinese -> French -> Chinese: living comfort, without waiting for spring flowers

 

4.2 EDA-generated data:

(figure of EDA-generated examples omitted)

4.3 Using textda to improve imbalanced text classification

Take a three-class sentiment classification task (negative / positive / neutral) as an example:

  • Initial training data: neg 1468, pos 8214, neu 712

  • Test data: neg 1264, pos 1038, neu 708

  • Classification model: fastText text classifier

The confusion matrix (figure omitted) shows the model reaches a weighted f1 of 0.749.

Using textda, the data is expanded to neg: 7458, pos: 8214, neu: 3386.

With the classes close to balanced, the f1 value rises to 0.783, a gain of nearly 4 points.

Thus data augmentation can improve model performance on imbalanced classification tasks.
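
A sketch of the recipe above, assuming the fasttext Python package and its __label__ input format; load_samples and the file names are made up for illustration, and data_expansion is the textda call shown earlier.

import fasttext
from textda.data_expansion import data_expansion

def load_samples(path):
    # Hypothetical loader: one tab-separated "label<TAB>text" pair per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            yield label, text

# Oversample the minority classes by writing augmented variants of each
# minority-class sentence alongside the original training data.
with open("train_augmented.txt", "w", encoding="utf-8") as out:
    for label, text in load_samples("train.txt"):
        out.write(f"__label__{label} {text}\n")
        if label in ("neg", "neu"):  # the under-represented classes
            for variant in data_expansion(text):
                out.write(f"__label__{label} {variant}\n")

model = fasttext.train_supervised(input="train_augmented.txt")
print(model.test("test.txt"))  # (number of samples, precision, recall)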

5 Extending data augmentation

5.1 Other data augmentation methods

There are many other data augmentation methods, and they differ across text, audio, and image data.

(1) Audio:

  • Noise injection

  • Randomly splicing clips of the same class

  • Time-shift augmentation

  • Pitch-shift augmentation

  • Speed adjustment

  • Volume Adjustment

  • Mixing in background sounds

  • Adding white noise

  • Shifting the audio

  • Stretching the audio signal

(2) Images:

  • Vertical and horizontal flips

  • Rotation

  • Zooming (scaling in and out)

  • Cutout

  • Translation

  • Gaussian noise

  • Generative adversarial networks (GANs)

  • AutoAugment

(3) Other text data augmentation methods:

  • Syntax-tree-based replacement

  • Excerpting passages from documents

  • Generating data with seq2seq models

  • Generative adversarial networks (GANs)

  • Pre-trained language models

Whether for text, audio, or images, although the augmentation methods differ, they are similar in essence: traditional methods cut, splice, swap, rotate, or stretch the original signal, while deep-learning methods mainly use generative models to produce data similar to the original data.

 

5.2 Other methods to prevent overfitting

In deep learning, the best way to avoid overfitting is usually a sufficient amount of input data. When the data cannot meet that requirement, or when certain classes are so scarce that the model overfits them, the following methods can also help:

  • Regularization: a relatively small amount of data leads to overfitting, i.e., a small training error but a particularly large test error. Adding a regularization term to the loss function can suppress overfitting; the disadvantage is that it introduces a hyper-parameter that must be tuned by hand (see the sketch after this list).

  • Dropout: also a form of regularization, but unlike the above it works by randomly zeroing part of the neurons' outputs.

  • Unsupervised pre-training: pre-train layer by layer without supervision using RBMs or (convolutional) Auto-Encoders, then add a final layer and fine-tune with supervised classification.

  • Transfer learning: in some cases, collecting a training set is difficult or expensive, so we need high-performance learners that can be trained easily on data from other domains and still predict well on the target domain.
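
As a minimal PyTorch illustration of the first two items, weight decay acting as the L2 regularization term and a Dropout layer zeroing activations at random; all layer sizes and hyper-parameter values here are arbitrary.

import torch
import torch.nn as nn

# Toy classifier with a Dropout layer between the hidden and output layers.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes half the activations in training
    nn.Linear(64, 3),
)

# weight_decay adds an L2 penalty on the weights to the loss; its value is
# the manually tuned hyper-parameter mentioned above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)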

6 Summary and Outlook

When training machine learning or deep learning models, good data is often one of the most important factors for model performance, and data augmentation is a common method when data is insufficient.

Text augmentation methods operate at different levels, from changing words to changing sentences to changing paragraphs. To truly improve data quality, the following points matter:

(1) Augmented data should remain consistent with the original data and its semantics.

The new data must carry the same label as the original while preserving the same semantic information. Approaches that randomly remove individual words can easily change the meaning of a sentence (for example, by removing a negation word).

 

(2) Augmented data needs to be diverse.

Whether replacing words, sentence patterns, or sentence structures, the new data must improve the model's generalization ability; merely swapping individual words adds limited variety.

 

(3) Augmentation should avoid overfitting the labeled data.

When a large amount of generated data overfits a small set of labels, the model may show a high f1 score, but its predictions on real data will be far worse. Ensure diversity, but also ensure data quality.

 

(4) Augmented data that stays reasonably fluent and close to the original is more valuable and improves training efficiency.

Data generated close to the real data can safely keep its label, while very noisy generated data likely differs from the original. In sequence models especially, the fluency of the text seriously affects the model's predictions.

 

(5) Augmentation should be chosen according to what the data lacks.

When the shortage is clear, targeted augmentation finds the needed data faster: if certain keywords are in short supply, lean on synonym replacement; if sentence patterns are missing, lean on back-translation or syntax-tree transformations.

 

With small datasets, simple EDA or back-translation can already bring gains; but to train neural network models on large volumes of data, text generated by EDA or back-translation may not meet the demand.

UDA, as an unsupervised augmentation technique, applies to both small and large amounts of data; it can find smooth, targeted augmented data, and is sometimes even more effective than supervised training methods.

In summary, data augmentation is a powerful tool for quickly fixing class imbalance or data scarcity when training NLP models.

 


 


Origin blog.csdn.net/xixiaoyaoww/article/details/104688002