The Chinese University of Hong Kong (Shenzhen) & Soochow University released GrammarGPT, a large language model for native Chinese grammatical error correction | It achieves SOTA performance with only about 1K samples for instruction fine-tuning!

Title: GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning
PDF: arxiv.org/pdf/2307.13…
Code: github.com/freedominte…

guide

ChatGPT is excellent at grammatical error correction, but it is closed-source and its inference cost is high. This paper aims to build a vertical-domain large model for native Chinese grammatical error correction, and proposes GrammarGPT. The core idea of GrammarGPT is to use ChatGPT to generate ungrammatical sentences by providing it with specific clues. For grammatical errors without clues, the authors collect ungrammatical sentences from publicly available websites and correct them manually. In addition, the authors adopt an error-invariant data augmentation method to further strengthen the model's ability to correct native Chinese grammatical errors.

In the end, the authors constructed about 1K parallel samples and used them for instruction fine-tuning of an open-source LLM (Phoenix, released by The Chinese University of Hong Kong, Shenzhen). The experimental results show that GrammarGPT significantly outperforms the existing SOTA model. Although it has 20 times more parameters than the SOTA baseline, it needs 1,200 times less data for instruction fine-tuning, demonstrating the potential of open-source LLMs for native Chinese grammatical error correction. GrammarGPT ranks 3rd on NLPCC2023 SharedTask1.

introduction

Grammatical Error Correction (GEC) aims to automatically correct ungrammatical sentences without changing their meaning. While past research has focused on the obvious errors made by learners of Chinese as a foreign language, more recent work has turned to the more subtle and challenging grammatical errors made by native speakers.

Table 1.

Table 1 above lists the six main types of grammatical errors commonly made by native Chinese speakers, which can be divided into two categories: errors with clues (w/) and errors without clues (w/o). These incorrect sentences read fluently and conform to the usage habits of native Chinese speakers, but they violate Chinese grammar, which makes them harder to correct.

This paper proposes GrammarGPT and investigates the potential of open-source LLMs for correcting native Chinese grammatical errors through supervised fine-tuning. A key challenge when fine-tuning LLMs for Chinese grammatical error correction (CGEC) is obtaining high-quality parallel data that contains the grammatical errors made by native speakers. Since manually annotating such data is time-consuming and costly, automatic data annotation methods need to be explored. Some recent studies have successfully combined ChatGPT-generated data with a small amount of real-world data to fine-tune domain-specific LLMs, reducing cost while achieving strong performance. Work on instruction fine-tuning for specific tasks can be divided into three categories according to the data source:

  1. Data generated by ChatGPT
  2. Manually labeled data
  3. ChatGPT and human-mixed datasets

Inspired by this research thread, the authors propose a hybrid dataset that covers the different types of grammatical errors made by native Chinese speakers. First, they propose a heuristic for grammatical errors with clues: providing the clues to ChatGPT so that it generates ungrammatical sentences. Then, for errors without clues, they collect ungrammatical sentences from public websites and correct them manually. In addition, they propose an error-invariant data augmentation method, which increases the diversity of the data by replacing named entities in the parallel data with similar entities, thereby improving the model's ability to correct native Chinese grammatical errors. Finally, they construct a parallel dataset of about 1k samples and use it for instruction fine-tuning of LLMs.

The experimental results show that GrammarGPT clearly outperforms the existing SOTA system on Chinese grammatical error correction. Although the model has 20 times more parameters than the SOTA baseline, the amount of data used for fine-tuning is 1,200 times smaller, demonstrating the potential of open-source LLMs for Chinese grammatical error correction.

The contributions of this paper are as follows:

  • This is the first work to explore open-source LLMs with instruction fine-tuning for native Chinese grammatical error correction.
  • It constructs a hybrid dataset containing both ChatGPT-generated data and manually annotated data, which effectively covers native Chinese grammatical errors and makes LLMs better at detecting them.
  • It designs an error-invariant data augmentation method that replaces named entities in the parallel data with similar entities, making the model more accurate at correcting grammatical errors.
  • Experimental results show that GrammarGPT clearly outperforms the state-of-the-art system while using only 1/1200 of the instruction fine-tuning data the SOTA system requires.

method

figure 1.

Figure 1 above shows the framework of GrammarGPT, which centers on constructing parallel data covering the six types of native Chinese grammatical errors and using it to fine-tune open-source LLMs. Manually annotated data provides high-quality samples, but its cost is a significant concern, so the paper takes a compromise approach. First, clues collected from the Internet are used to guide ChatGPT to generate ungrammatical sentences with clues; then, ungrammatical sentences without clues, collected from public websites, are annotated manually. In addition, an error-invariant augmentation technique replaces named entities in the parallel data with similar entities, further strengthening the model's ability to correct native Chinese grammatical errors. Finally, the parallel data is converted into instructions, which are used to fine-tune the LLMs.

hybrid dataset construction

figure 2.

data generated by ChatGPT

As shown in the first few rows of Table 1 above, grammatical errors with clues are easy to detect and correct by identifying the specific clues. For example, using "more than" and "about" together produces a redundant component, using "the cause" and "caused by" together produces a confused structure, and using "prompting" and "pace" together produces an improper collocation. Conversely, we can construct ungrammatical sentences by inserting these clues into grammatically correct sentences. Given ChatGPT's strong capabilities, we can guide it to generate ungrammatical sentences that meet our requirements by providing these clues, which are collected from public websites. Figure 2 above shows an example of ChatGPT generating such training data.
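The paper does not publish its exact generation prompts, so the following is only a minimal sketch of how clue-guided generation could be scripted with the `openai` Python client. The model name, prompt wording, and clue pairs are illustrative assumptions, not the authors' actual templates.

```python
# Minimal sketch of clue-guided generation of ungrammatical sentences.
# Prompt wording, model name, and clue pairs are illustrative assumptions,
# not the prompts used in the GrammarGPT paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Redundant-component clues: using both words in one sentence creates an error.
CLUE_PAIRS = [
    ("超过", "约"),        # "more than" + "about"
    ("的原因", "造成的"),  # "the cause" + "caused by"
]

def generate_ungrammatical(correct_sentence: str, clue: tuple[str, str]) -> str:
    """Ask the model to rewrite a grammatical sentence so that it contains
    both clue words, producing a typical native-speaker error."""
    prompt = (
        f"请把下面这句通顺的中文句子改写成同时包含“{clue[0]}”和“{clue[1]}”"
        f"的病句，其余内容保持不变，只输出改写后的句子。\n句子：{correct_sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

# Each (ungrammatical, correct) pair then becomes one parallel training sample.
```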

manually annotated data

The last three rows of Table 1 above show that some types of native grammatical errors are hard to identify. These ungrammatical sentences read fluently, with no obvious clues that would help us spot them. For these types of errors, the authors mainly collected ungrammatical sentences from public websites and then annotated them manually.

error-invariant data augmentation

figure 3.

To focus the model on native grammatical errors and improve its robustness, the authors designed an error-invariant augmentation method, shown in Figure 3 above. Native Chinese grammatical errors are usually subtle and rarely occur at the positions of named entities. Exploiting this, the authors replace the named entities in the parallel data with similar named entities. With this augmentation, the model can concentrate on the unchanged errors rather than on specific nouns, which improves its ability to correct subtle, easily overlooked grammatical errors.
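A minimal sketch of what error-invariant augmentation could look like in code, assuming a pre-built dictionary mapping each named entity to similar substitutes; the entity dictionary and sample sentences below are made up for illustration and are not from the paper.

```python
import random

# Hypothetical dictionary mapping named entities to similar substitutes;
# in practice this could come from an entity lexicon or embedding neighbors.
SIMILAR_ENTITIES = {
    "北京": ["上海", "广州", "深圳"],
    "华为": ["腾讯", "阿里巴巴"],
}

def error_invariant_augment(src: str, tgt: str) -> tuple[str, str]:
    """Replace the same named entity in both the ungrammatical source and the
    corrected target, so the grammatical error itself is left untouched."""
    for entity, candidates in SIMILAR_ENTITIES.items():
        if entity in src and entity in tgt:
            substitute = random.choice(candidates)
            src = src.replace(entity, substitute)
            tgt = tgt.replace(entity, substitute)
    return src, tgt

# Example: the error (redundant "超过...约") is preserved, only "北京" changes.
src = "超过约一百名来自北京的工程师参加了会议。"
tgt = "约一百名来自北京的工程师参加了会议。"
print(error_invariant_augment(src, tgt))
```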

instruction fine-tuning

Table 2.

Instruction fine-tuning has become the mainstream approach to fine-tuning LLMs: providing explicit instructions strengthens the model's ability to follow them. In this paper, the authors likewise fine-tune the LLMs with instructions. The instruction format is detailed in Table 2 above and consists of four components:

  1. Task suffix: this component guides the LLM to take on the role of an AI assistant.
  2. Task description: this outlines the specific task the LLM needs to complete.
  3. Input: this corresponds to the ungrammatical sentence used as input during fine-tuning.
  4. Output: this is the grammatical sentence, used as the desired output during fine-tuning.

Through these instructions, the LLMs can be effectively guided to learn to correct native Chinese grammatical errors, improving their performance on the Chinese grammatical error correction task.
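As a rough illustration, one parallel pair might be packed into an instruction record as sketched below. The field names and Chinese wording are assumptions for illustration; Table 2 in the paper gives the actual template.

```python
# Illustrative sketch of turning one parallel pair into an instruction-tuning
# record with the four components described above; field names and wording
# are assumptions, not the exact template from Table 2 of the paper.
import json

TASK_PREFIX = "假设你是一个AI助手。"                              # role-setting component
TASK_DESC = "请纠正下面句子中的语法错误，直接输出正确的句子。"  # task description

def build_instruction_sample(ungrammatical: str, corrected: str) -> dict:
    return {
        "instruction": TASK_PREFIX + TASK_DESC,
        "input": ungrammatical,    # ungrammatical sentence fed to the model
        "output": corrected,       # grammatical sentence used as the target
    }

sample = build_instruction_sample(
    "超过约一百名工程师参加了会议。",
    "约一百名工程师参加了会议。",
)
print(json.dumps(sample, ensure_ascii=False, indent=2))
```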

experimental results

table 3.

A total of 1,061 parallel training samples were constructed; the data statistics are shown in Table 3 above. About 35% of the data was obtained through manual annotation, while the remaining 65% was generated by ChatGPT. For evaluation, the authors used the validation set provided on the NLPCC2023 SharedTask1 website, which contains 500 parallel samples.

table 5.

As shown in Table 5 above, S2S BART trained on the 1k hybrid dataset reaches F0.5 scores of 17.57 at the word level and 18.16 at the character level, comparable to the baseline trained on 1.2 million samples from foreign language learners. We attribute this to the large gap between the grammatical errors made by foreign learners and those made by native Chinese speakers, which makes it hard to improve native Chinese grammatical error correction by relying solely on foreign-learner data. These results further highlight the effectiveness of the authors' hybrid dataset of native Chinese grammatical errors.

Moreover, GrammarGPT, fine-tuned on only about 1k samples, achieves a substantial improvement, reaching F0.5 scores of 32.56 at the word level and 35.84 at the character level, nearly double the baseline. This demonstrates the significant potential of open-source LLMs for native Chinese grammatical error correction. On the final official test set, GrammarGPT ranked 3rd.
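As a side note on the metric: F0.5 weights precision twice as heavily as recall, which is the standard choice for grammatical error correction. A minimal helper showing the computation; the precision/recall values in the example are made up purely to illustrate the formula, not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta=0.5 weights precision twice as heavily as recall,
    the standard setting for grammatical error correction."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example with made-up precision/recall values, just to show the formula.
print(round(f_beta(0.40, 0.20), 4))  # 0.3333
```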

Table 6.

The authors also ran ablation experiments on the hybrid dataset and the error-invariant augmentation method, shown in Table 6 above. The main findings are as follows:

  1. Whether or not data augmentation is applied, models trained on ChatGPT-generated data consistently outperform models trained on manually annotated data. The authors attribute this to two main factors: first, because manual annotation is expensive, the amount of manually annotated data is smaller than the ChatGPT-generated data; second, grammatical errors without clues are indeed harder to correct.

  2. The hybrid dataset shows its potential to improve native Chinese grammatical error correction, confirming that the dataset construction method effectively covers native Chinese grammatical errors.

  3. With error-invariant augmentation, models trained on the hybrid dataset show significant improvements in recall and F0.5 but only slight improvements in precision. This suggests that the augmentation enhances the model's ability to detect grammatical errors by forcing it to pay more attention to the errors in the augmented data.

conclusion

In this paper, we introduce GrammarGPT, an open-source large language model (LLM) dedicated to grammatical error correction in native Chinese. The work shows that high-quality training data, careful data construction, and the choice of what kinds of data to construct are crucial for building vertical-domain applications with LLMs.


Reprinted from: juejin.im/post/7266336495031336996