[Engineering Practice] Using EDA (Easy Data Augmentation) for text data augmentation

        In engineering projects, training data is often insufficient, so data augmentation techniques are needed. This article walks through using EDA for text data augmentation.

1. Introduction to EDA

        EDA is a simple but very effective text data augmentation method, introduced by Protago Labs (USA) at the EMNLP-IJCNLP 2019 conference in the paper "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks".

        The paper proposes four simple augmentation operations for improving text classification performance: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD). The authors ran comparative text classification experiments with RNN and CNN models on five datasets, splitting each training set into three sizes to measure how EDA's benefit varies with the amount of training data. The experiments show that EDA improves text classification performance, with the largest gains on small training sets.
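
        For example (a constructed illustration, not taken from the paper), starting from the sentence "the quick brown fox jumps over the lazy dog":

        SR: "the speedy brown fox jumps over the lazy dog" (quick → speedy)
        RI: "the quick speedy brown fox jumps over the lazy dog" (a synonym of "quick" inserted at a random position)
        RS: "the quick brown dog jumps over the lazy fox" ("fox" and "dog" swapped)
        RD: "the quick brown fox jumps over lazy dog" (each word dropped with probability p; here the second "the" was dropped)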

2. Enhancement method

2-1. Synonym Replacement (SR)

        1. Randomly select n words in the sentence that are not stop words, and replace each with a randomly chosen synonym.

        2. Alternatively, ignore stop words: randomly select n words in the sentence and replace each with a random synonym from a synonym dictionary. The dictionary can be built from an open-source synonym list plus a domain-specific custom vocabulary.

        Note: the implementation below uses the synonyms package to look up synonyms.
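
        The synonyms package exposes a nearby function that returns a pair of lists, (candidate words, similarity scores), for a query word; the code below keeps only the word list. A quick sanity check of the return shape (the exact neighbors shown are indicative, not guaranteed):

import synonyms

words, scores = synonyms.nearby("识别")
# words  -> nearest-neighbor words, typically starting with the query word itself
# scores -> the corresponding similarity scores, in descending order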

def synonym_replacement(words, n):
    new_words = words.copy()
    # Candidate words for replacement: unique tokens that are not stop words
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            # Replace every occurrence of the chosen word with the sampled synonym
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break

    # Re-join and re-split so that multi-token synonyms become separate tokens
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

    return new_words

def get_synonyms(word):
    # synonyms.nearby returns (candidate words, similarity scores); keep only the words
    return synonyms.nearby(word)[0]

2-2. Random Insertion (RI)

        Randomly pick a word in the sentence, randomly choose one of its synonyms, and insert the synonym at a random position in the sentence; repeat this step n times. (The paper excludes stop words when picking the word; the implementation below does not filter stop words, and simply retries until it samples a word that has synonyms.)

def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words

def add_word(new_words):
    synonyms = []
    counter = 0
    # Keep sampling words until one with at least one synonym is found,
    # giving up after 10 attempts
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words) - 1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
    random_synonym = random.choice(synonyms)
    # Insert the sampled synonym at a random position (in place)
    random_idx = random.randint(0, len(new_words) - 1)
    new_words.insert(random_idx, random_synonym)

2-3. Random Swap (RS)

        Randomly select two words in the sentence and swap their positions; repeat this process n times. (swap_word draws two random indices; if they collide, the second index is redrawn, giving up after three attempts.)

def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words


def swap_word(new_words):
    random_idx_1 = random.randint(0, len(new_words) - 1)
    random_idx_2 = random_idx_1
    counter = 0
    # Draw a second, distinct index; give up after three collisions
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words) - 1)
        counter += 1
        if counter > 3:
            return new_words
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words

2-4. Random Deletion (RD)

        Delete each word in the sentence independently with probability p. If the sentence has only one word, return it unchanged; if every word gets deleted, return one word chosen at random.

def random_deletion(words, p):
    # A one-word sentence is returned unchanged
    if len(words) == 1:
        return words

    # Keep each word with probability 1 - p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # If everything was deleted, return one word chosen at random
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]

    return new_words
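
        Putting the four operations together, here is a minimal usage sketch (assuming the stop-word list, the synonyms package, and the four functions above are already loaded; tokens come from jieba, as in the full script of section 4):

import random
import jieba

words = list(jieba.cut("今天天气很好,我们一起去公园散步"))
print(synonym_replacement(words, 2))
print(random_insertion(words, 2))
print(random_swap(words, 2))
print(random_deletion(words, 0.1))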

3. Common questions

3-1. If multiple words in a sentence are changed, does the sentence's original label remain valid?

        To verify that the data generated by EDA keeps the characteristics of the original data, the authors ran a comparative analysis on the Pro-Con dataset.

        Specific method: first, train an RNN on a dataset without EDA; then, apply EDA to the test set, amplifying each original sentence into 9 augmented sentences, and feed these into the RNN as the test set; finally, take the output vectors of the last fully connected layer and project them into two dimensions with t-SNE. In the paper's figure, the large triangles and circles are the original sentences, and the small triangles and circles are the EDA-augmented sentences. Most augmented points lie close to their originals, i.e. there is no semantic drift, so the four augmentation techniques proposed in the paper largely preserve the original labels of the text.
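
        A minimal sketch of that verification procedure, assuming a hypothetical helper penultimate_vector(sentence) that runs the trained RNN and returns the output of its last fully connected layer (this helper is not part of the original code):

import numpy as np
from sklearn.manifold import TSNE

# Stack penultimate-layer vectors for both original and EDA-augmented test sentences
vectors = np.array([penultimate_vector(s) for s in test_sentences])
# Project to 2D; original and augmented points landing in the same clusters
# indicates that augmentation did not shift the semantics
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)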

3-2. How much does each EDA method help on its own?

        To determine which of the four augmentation methods drives the performance improvement, and which contributes the most, the authors ran an ablation study, applying each augmentation method in isolation. The results from the paper are summarized below.

        The parameter α is the fraction of words changed by each augmentation method, relative to the length of the original text; the experiments used α = {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}.
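
        Concretely, each method changes n words, where n is α times the sentence length, floored and kept at least 1; this is exactly how eda_func in section 4 computes n_sr, n_ri and n_rs:

n = max(1, int(alpha * num_words))   # e.g. alpha = 0.1, num_words = 20  ->  n = 2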

        For synonym replacement (SR), performance improves noticeably when α is small but degrades as α grows, probably because replacing too many words changes the meaning of the original text;

        For random insertion (RI), performance stays relatively stable across this range of α, probably because random insertion keeps the relative order of the original words intact;

        For random swap (RS), performance improves noticeably when α ≤ 0.2 and degrades when α ≥ 0.3, because too many swaps scramble the overall word order and change the meaning of the text;

        For random deletion (RD), performance peaks when α is small but drops sharply as α grows, because deleting too many words makes the sentence hard to understand and destroys its semantics.

        The ablation experiments conclude that each method's gains are most pronounced on small datasets. Too large an α can even hurt model performance; α = 0.1 appears to be a sweet spot.

3-3. How many augmented sentences should be generated per original sentence?

        On a smaller dataset, the model easily overfits, so generating more augmented sentences helps; on a larger dataset, generating more than 4 augmented sentences per original sentence contributes little. The author therefore recommends the parameter choices shown in the table below.

 n_aug: the number of augmented sentences generated per original sentence; N_train: the size of the training set. The paper's recommended settings are:

 N_train    α       n_aug
 500        0.05    16
 2,000      0.05    8
 5,000      0.1     4
 More       0.1     4

3-4. Why does EDA improve text classification?

        1. The augmented data is similar to the original data but introduces a certain amount of noise, which helps prevent overfitting;

        2. Synonym replacement and random insertion introduce new vocabulary, allowing the model to generalize to words that appear in the test set but not in the training set.

4. EDA data augmentation code implementation

4-1. Description

       The code implementation requires jieba for word segmentation, a stop-word list (the HIT stop-word list is used by default), and the Synonyms package for synonym lookup.
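
       Assuming a standard pip environment, the dependencies can be installed with (package names as published on PyPI):

pip install jieba synonyms pandas tqdm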

4-2. Code implementation

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import random
from random import shuffle

import pandas as pd
from tqdm import tqdm
import jieba
import synonyms

random.seed(2019)

# Stop-word list; the HIT (Harbin Institute of Technology) list is used by default
f = open('/home/zhenhengdong/WORk/Classfier/Dates/stopWord.json', encoding='utf-8')
stop_words = list()
for stop_word in f.readlines():
    stop_words.append(stop_word[:-1])

# Text cleaning: a minimal implementation that strips all digits
def get_only_chars(line):
    return re.sub(r'\d', '', line)

########################################################################
# Synonym replacement
# Replace n words in the sentence with their synonyms
########################################################################

def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break

    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

    return new_words

def get_synonyms(word):
    return synonyms.nearby(word)[0]

########################################################################
# Random insertion
# Insert n words into the sentence at random positions
########################################################################
def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words


def add_word(new_words):
    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words) - 1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
    random_synonym = random.choice(synonyms)
    random_idx = random.randint(0, len(new_words) - 1)
    new_words.insert(random_idx, random_synonym)

########################################################################
# Random swap
# Randomly swap two words in the sentence n times
########################################################################

def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words


def swap_word(new_words):
    random_idx_1 = random.randint(0, len(new_words) - 1)
    random_idx_2 = random_idx_1
    counter = 0
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words) - 1)
        counter += 1
        if counter > 3:
            return new_words
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words

########################################################################
# Random deletion
# Delete each word in the sentence with probability p
########################################################################
def random_deletion(words, p):
    if len(words) == 1:
        return words

    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]

    return new_words

########################################################################
# EDA function
########################################################################
def eda_func(sentence, alpha_sr = 0.35, alpha_ri = 0.35, alpha_rs = 0.35, p_rd = 0.35, num_aug = 12):
    # Tokenize with jieba
    seg_list = jieba.cut(sentence)
    seg_list = " ".join(seg_list)
    words = list(seg_list.split())
    num_words = len(words)

    augmented_sentences = []
    num_new_per_technique = int(num_aug / 4)
    # Words to change per technique: alpha * sentence length, at least 1
    n_sr = max(1, int(alpha_sr * num_words))
    n_ri = max(1, int(alpha_ri * num_words))
    n_rs = max(1, int(alpha_rs * num_words))

    # Synonym replacement (SR)
    for _ in range(num_new_per_technique):
        a_words = synonym_replacement(words, n_sr)
        augmented_sentences.append(''.join(a_words))
    # Random insertion (RI)
    for _ in range(num_new_per_technique):
        a_words = random_insertion(words, n_ri)
        augmented_sentences.append(''.join(a_words))
    # Random swap (RS)
    for _ in range(num_new_per_technique):
        a_words = random_swap(words, n_rs)
        augmented_sentences.append(''.join(a_words))
    # Random deletion (RD)
    for _ in range(num_new_per_technique):
        a_words = random_deletion(words, p_rd)
        augmented_sentences.append(''.join(a_words))

    shuffle(augmented_sentences)

    # Trim (or subsample) to exactly num_aug sentences
    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]

    # augmented_sentences.append(seg_list)
    return augmented_sentences
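
# Example (illustrative): a quick check of eda_func on an arbitrary sentence;
# the output varies with the random state.
# print(eda_func('今天天气很好,我们一起去公园散步', num_aug = 4))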

def Data_Augmentation(item, num):
    # For every class in the given frequency bucket, generate `num` augmented
    # sentences per original sample and copy over its label columns
    augmented_sentence_dataframe = pd.DataFrame()
    for join_class in tqdm(stations_dict[item]):
        for index in range(len(new_data)):
            if new_data.loc[index].联合分类 == join_class:
                augmented_sentences = eda_func(sentence = new_data.loc[index]['内容'])[:num]
                for augmented_sentence in augmented_sentences:
                    creat_new_data = pd.DataFrame()
                    creat_new_data['内容'] = [augmented_sentence]
                    creat_new_data['反馈类型'] = [new_data.loc[index]['反馈类型']]
                    creat_new_data['一级分类'] = [new_data.loc[index]['一级分类']]
                    creat_new_data['二级分类'] = [new_data.loc[index]['二级分类']]
                    creat_new_data['联合分类'] = [new_data.loc[index]['联合分类']]
                    augmented_sentence_dataframe = pd.concat([augmented_sentence_dataframe, creat_new_data], ignore_index=True)
    print(len(augmented_sentence_dataframe))
    return augmented_sentence_dataframe

if __name__ == '__main__':
    new_data = pd.read_csv('./Temp_data.csv')
    # Bucket each class by its sample count; the key is the bucket's lower bound
    stations_dict = {}
    for index, key_values in enumerate(new_data.联合分类.value_counts().items()):
        if 1000 <= key_values[1] < 1500:
            stations_dict.setdefault('1000', []).append(key_values[0])
        elif 800 <= key_values[1] < 1000:
            stations_dict.setdefault('800', []).append(key_values[0])
        elif 600 <= key_values[1] < 800:
            stations_dict.setdefault('600', []).append(key_values[0])
        elif 500 <= key_values[1] < 600:
            stations_dict.setdefault('500', []).append(key_values[0])
        elif 400 <= key_values[1] < 500:
            stations_dict.setdefault('400', []).append(key_values[0])
        elif 300 <= key_values[1] < 400:
            stations_dict.setdefault('300', []).append(key_values[0])
        elif 0 < key_values[1] < 300:
            stations_dict.setdefault('0', []).append(key_values[0])
    # Rarer buckets get more augmented sentences per original sample
    aug_num_per_bucket = {
        '1000': 2,  # 13642
        '800': 3,   # 16503
        '600': 4,   # 23684
        '500': 6,   # 15186
        '400': 8,   # 20400
        '300': 9,   # 7137
        '0': 9,     # 3897
    }
    Temp_data = pd.DataFrame()
    for item in stations_dict:
        augmented_sentence_dataframe = Data_Augmentation(item, num = aug_num_per_bucket[item])
        Temp_data = pd.concat([Temp_data, augmented_sentence_dataframe], ignore_index=True)
    # Save the merged augmented data
    Temp_data.to_csv('./Temp_data_single_sample.csv', index = False, encoding='utf8')

