A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction

Summary

Grammatical Error Correction (GEC) is concerned with correcting grammatical errors in written text. Current state-of-the-art GEC systems, i.e., those based on statistical and neural machine translation, require large amounts of annotated training data, which can be expensive or impractical to obtain. This study compares the synthetic data generation techniques used by the two highest-scoring submissions on the restricted and low-resource tracks of the BEA-2019 grammatical error correction shared task.

1 Introduction

Grammatical Error Correction (GEC) is the task of automatically correcting grammatical errors in written text. Recently, significant progress has been made with statistical machine translation (SMT) and neural machine translation (NMT) methods, especially for English GEC. The success of these methods can be partly attributed to the availability of several large training sets.
  The recent Building Educational Applications (BEA) 2019 shared task continues the tradition of earlier GEC competitions, and all 24 participating teams adopted NMT and/or SMT methods. One of the goals of BEA-2019 was to re-evaluate the field after a long hiatus: although significant progress has been made since the last such competition, CoNLL-2014, recent GEC systems have become difficult to compare because of the lack of a standardized experimental setting, with systems trained, tuned and tested on different combinations of metrics and corpora. The BEA-2019 shared task also introduced a new data set representing different English proficiency levels and domains, as well as separate evaluation tracks, namely the "restricted", "unrestricted" and "low-resource" tracks. The unrestricted track allows the use of any resources; the restricted track limits learner corpora to a set of publicly available corpora; and the low-resource track severely restricts the use of annotated data, to encourage the development of systems that do not depend on large amounts of manually annotated data.
  The two highest-scoring systems on the restricted and low-resource tracks were UEDIN-MS and Kakao&Brain. These two systems were far ahead of the other teams on both tracks, and both used artificial data to train their NMT systems, although they generated that data in different ways. Interestingly, on the restricted track the two systems scored almost equally, while on the low-resource track the performance drop relative to the restricted track was larger for Kakao&Brain (more than 10 points) than for UEDIN-MS (about 4 points). Although the two teams used the same model architecture, Transformer-based NMT, the systems also differed, beyond data generation, in training scenarios, hyperparameter values and training corpora.
  The purpose of this article is to compare the synthetic data generation techniques used by the UEDIN-MS and Kakao&Brain systems. The method used in the UEDIN-MS system relies on confusion sets generated by a spell checker, while the Kakao&Brain method relies on learner error patterns extracted from a small number of annotated samples and on POS-based confusions. From now on, we refer to them as the Inverted Spellchecker method and the Patterns+POS method, respectively. To ensure a fair comparison of the methods, we control other variables such as model choice, hyperparameters and raw data selection. We train NMT systems and evaluate our models on two learner corpora: the W&I+LOCNESS corpus introduced in BEA-2019 and the FCE corpus. Using the automatic error-type annotation tool ERRANT, we also report per-error-type performance on both corpora.
  This paper makes the following contributions: (1) we use two data sets to make a fair comparison of the synthetic parallel-data generation methods of the two GEC systems; (2) we find that the two methods train different, complementary systems that target different error types: the Inverted Spellchecker method is good at correcting spelling errors, while the Patterns+POS method is better at correcting grammar-related errors such as noun number, subject-verb agreement and verb tense; (3) overall, the Patterns+POS method performs more strongly than the Inverted Spellchecker method across a variety of training scenarios, including synthetic data alone, synthetic data plus in-domain learner data, and synthetic data plus out-of-domain learner data; (4) adding an off-the-shelf spell checker is beneficial, and is especially useful for the Patterns+POS method.
  In the next section, we will discuss related work. Section 3 summarizes the W&I+LOCNESS and FCE learner data sets. Section 4 introduces the data synthesis method. Section 5 introduces the experiment. Section 6 analyzes the results. Section 7 summarizes the paper.

2 Related Work

(1) English GEC progress
  Early GEC methods focused on English as a second language learners and used linear machine-learning algorithms, with classifiers built for specific error types (such as articles, prepositions or noun number). Such classifiers can be trained on native English data, learner data, or a combination of the two.
  The CoNLL shared tasks on English grammatical error correction provided the first large annotated learner corpus (NUCLE) and two test sets. All of the data was produced by learners of English at the National University of Singapore (most of them native Chinese speakers). Statistical machine translation methods achieved their first success in the CoNLL-2014 competition, and since then SMT and NMT methods have obtained the best results on the CoNLL data sets. Systems are usually trained on a combination of NUCLE and the English part of the Lang-8 corpus, although the latter is known to contain noisy data because it is only partially corrected.
  (2) Minimally-Supervised and Data-Augmented GEC
  Recently, much work has gone into generating synthetic training data. These methods can be divided into those that mine other resources (such as Wikipedia revisions) and those that add noise to correct English text. Boyd (2018) augmented training data with edits extracted from the German Wikipedia revision history, using the Wiki Edits tool to extract the revisions and keeping, via classification, only the edits relevant to GEC; a multi-layer convolutional encoder-decoder neural model was used to demonstrate the contribution of the resulting edits. Mizumoto et al. (2011) extracted a corpus of Japanese learner text from Lang-8 revision logs (about one million sentences) and trained a character-based machine translation model on it.
  Another way to generate parallel data is to introduce artificial, human-like errors into well-formed text. This approach has previously proven effective within the classification framework.

3 Learner Corpus

[Table omitted: statistics of the W&I+LOCNESS and FCE learner corpora]

4 Synthetic Data Generation Methods

In this section, we describe two methods for generating parallel data for training.

4.1 Inverted Spellchecker method

The unsupervised parallel-data generation method used in the UEDIN-MS submission is characterized by the use of confusion sets extracted from a spell checker. The artificial data is then used to pre-train a Transformer sequence-to-sequence model.
  (1) Overview of the noising method
  The Inverted Spellchecker method uses the Aspell spell checker to generate a candidate list for a given word. The candidates are ranked by the weighted edit distance between the candidate and the input word, together with the distance between their phonetic equivalents. The system then takes the first 20 candidates as the confusion set of the input word.
  For each sentence, the number of words to change is determined by the word error rate of the development set. For each selected word, one of the following operations is performed: with probability 0.7 the word is replaced by a word chosen at random from its confusion set; with probability 0.1 the word is deleted; with probability 0.1 a random word is inserted; and with probability 0.1 the word is swapped with an adjacent word. In addition, the same operations are applied at the character level to 10% of the words, to introduce spelling errors. It should be emphasized that although the Inverted Spellchecker method uses confusion sets from a spell checker, the idea is to generate synthetic noisy data for training a general GEC system that corrects a wide variety of grammatical errors.
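The word-level noising procedure above can be sketched as follows. This is a minimal illustration, not the UEDIN-MS implementation: the confusion sets are a hand-written toy dictionary standing in for the top-20 Aspell candidates, the insertion vocabulary is invented, and the character-level perturbations applied to 10% of words are omitted.

```python
import random

# Toy confusion sets standing in for the top-20 Aspell candidates ranked by
# weighted edit distance and phonetic distance (illustrative values only).
CONFUSION = {
    "their": ["there", "they're"],
    "piece": ["peace", "pierce"],
    "to": ["too", "two"],
}
VOCAB = ["the", "a", "of", "to", "in"]  # source of random insertions

def noise_sentence(tokens, word_error_rate=0.15, rng=random):
    """Corrupt a clean token list: substitute (p=0.7), delete (p=0.1),
    insert (p=0.1), or swap with the next token (p=0.1)."""
    out = list(tokens)
    n_changes = max(1, round(word_error_rate * len(out)))
    for _ in range(n_changes):
        if not out:
            break
        i = rng.randrange(len(out))
        r = rng.random()
        if r < 0.7:    # substitute from the confusion set, if one exists
            out[i] = rng.choice(CONFUSION.get(out[i], [out[i]]))
        elif r < 0.8:  # delete the word
            del out[i]
        elif r < 0.9:  # insert a random word
            out.insert(i, rng.choice(VOCAB))
        elif i + 1 < len(out):  # swap with the adjacent word
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

The clean sentence plays the role of the target side of the synthetic pair, and the noised output plays the role of the source side.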
  (2) Training details
  The UEDIN-MS system generates parallel data by applying the Inverted Spellchecker method to 100 million sentences sampled from the WMT news-crawl corpus. This data is used to pre-train the Transformer models on both the restricted and low-resource tracks; the main difference between these models is the data set used for fine-tuning.
  On the restricted track, all available annotated data from FCE, Lang-8, NUCLE and W&I+LOCNESS is used for fine-tuning. On the low-resource track, a subset of the WikiEd corpus is used instead. The WikiEd corpus contains 56 million parallel sentences automatically extracted from Wikipedia revisions. The manually annotated W&I+LOCNESS training data serves as a seed corpus for selecting the 2 million sentence pairs from WikiEd that best match the target domain; these 2 million sentence pairs are then used to fine-tune the model pre-trained on synthetic data.
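The text does not spell out the similarity measure used for this seed-based selection, but the general idea of ranking candidate pairs by closeness to a seed corpus can be illustrated with a deliberately simple vocabulary-overlap score. Everything below (the scoring function, the seed vocabulary, the candidate sentences) is invented for the example and is not the actual UEDIN-MS selection method.

```python
def domain_score(sentence, seed_vocab):
    """Fraction of a sentence's tokens that occur in the seed-corpus vocabulary."""
    toks = sentence.lower().split()
    return sum(t in seed_vocab for t in toks) / max(len(toks), 1)

# Toy seed vocabulary, standing in for one built from W&I+LOCNESS.
seed_vocab = set("the students are writing essays about school life".split())

candidates = [
    "the students wrote essays about school",
    "quarterly revenue rose by three percent",
]
# Rank candidates so the most in-domain sentences come first.
ranked = sorted(candidates, key=lambda s: domain_score(s, seed_vocab), reverse=True)
```

In the real pipeline the top-ranked 2 million WikiEd sentence pairs would then be kept for fine-tuning.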

4.2 Patterns+POS method

The Kakao&Brain system generates artificial data using two noising schemes: a pattern-based method and a type-based method (POS). Like the UEDIN-MS system, it then uses the synthetic data to pre-train a Transformer model.
  (1) Overview of the noising method
  This method first uses a small number of learner samples from the W&I+LOCNESS training data to extract error patterns, i.e., edits and their frequencies. These edits are used to build a dictionary of common edits. Noisy data is then generated by applying this dictionary in reverse to grammatically correct sentences.
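A minimal sketch of this pattern-extraction step, assuming token-aligned (erroneous, corrected) sentence pairs; the real system works over W&I+LOCNESS edits, while here `difflib` and a three-sentence toy sample stand in:

```python
import difflib
from collections import Counter

def extract_edits(source_tokens, corrected_tokens):
    """Yield (corrected_span, erroneous_span) pairs from one annotated sentence."""
    sm = difflib.SequenceMatcher(a=corrected_tokens, b=source_tokens)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            yield (" ".join(corrected_tokens[i1:i2]), " ".join(source_tokens[j1:j2]))

# Count edits over a tiny annotated sample (stand-ins for learner data).
annotated = [
    ("I have went home", "I have gone home"),
    ("She have a car", "She has a car"),
    ("I have went out", "I have gone out"),
]
edit_counts = Counter()
for src, cor in annotated:
    edit_counts.update(extract_edits(src.split(), cor.split()))

# Inverting the extracted edits turns clean text into learner-like text.
patterns = {cor: err for (cor, err), _ in edit_counts.most_common()}

def apply_patterns(tokens, patterns):
    # Simplified: only single-token patterns; real edits may span phrases.
    return [patterns.get(t, t) for t in tokens]
```

Here the frequency counts would let a full implementation sample edits in proportion to how often learners actually make them.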
  For tokens in the training data that are not found in the edit-pattern dictionary, a type-based noising scenario is applied instead. In the type-based approach, noise is introduced based on part of speech (POS). Only prepositions, nouns and verbs are noised, each with probability 0.15, as follows: a noun can be replaced by its singular/plural counterpart; a verb can be replaced by one of its morphological variants; and a preposition can be replaced by another preposition.
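The type-based scenario can likewise be sketched with a tiny hand-written POS lexicon and replacement tables. A real implementation would use a POS tagger and full morphological paradigms; all names and entries below are illustrative.

```python
import random

# Toy POS lexicon and replacement tables (illustrative only).
POS = {"cat": "NOUN", "cats": "NOUN", "go": "VERB", "goes": "VERB",
       "in": "PREP", "on": "PREP", "at": "PREP"}
NOUN_SWAP = {"cat": "cats", "cats": "cat"}         # toggle singular/plural
VERB_FORMS = {"go": ["goes", "going", "went"], "goes": ["go", "going"]}
PREPS = ["in", "on", "at", "for", "to"]

def pos_noise(tokens, p=0.15, rng=random):
    """Noise only nouns, verbs and prepositions, each with probability p."""
    out = []
    for tok in tokens:
        tag = POS.get(tok)
        if tag and rng.random() < p:
            if tag == "NOUN":
                tok = NOUN_SWAP.get(tok, tok)
            elif tag == "VERB":
                tok = rng.choice(VERB_FORMS.get(tok, [tok]))
            else:  # PREP: replace with a different preposition
                tok = rng.choice([q for q in PREPS if q != tok])
        out.append(tok)
    return out
```

Because substitutions are one-for-one, the noised sentence keeps the length and word order of the clean one, which is what makes this scenario complementary to the edit-pattern dictionary.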
  (2) Training details
  The synthetic data for the Kakao&Brain system was generated by applying the Patterns+POS method to data sets from Gutenberg, Tatoeba and WikiText-103. The final pre-training set is a collection of 45 million sentence pairs, with the noising method applied multiple times to each data set (1x Gutenberg, 12x Tatoeba and 5x WikiText-103) to roughly balance the data from each source. These 45 million sentence pairs are used to pre-train the models on both the restricted and low-resource tracks; the main difference between the systems on the two tracks is the data used for further training. On the restricted track, all available annotated data from FCE, Lang-8, NUCLE and W&I+LOCNESS is used in the fine-tuning steps. On the low-resource track, fine-tuning is performed on a subset of 3,000 sentences sampled from the W&I+LOCNESS development data.
