[Natural Language Processing] Implementing text error correction based on pycorrector

Text error correction aims to automatically correct spelling, grammar, punctuation, and other errors in input text, improving its accuracy, fluency, and conformance to standards. It is implemented with natural language processing techniques that analyze the text using context and language rules, locate the errors, and suggest replacements or modifications.

pycorrector is an open-source Chinese text error correction tool that supports correcting phonetically similar characters, visually similar characters, and grammatical errors in Chinese text. It is developed in Python 3 and integrates multiple models, including Kenlm, ConvSeq2Seq, BERT, MacBERT, ELECTRA, ERNIE, and Transformer, to implement text error correction. The official repository of pycorrector is: pycorrector . The pycorrector installation command is as follows:

pip install -U pycorrector

This article introduces how to call the function interfaces provided by pycorrector for text error correction. The official pycorrector repository already provides detailed usage tutorials for readers who want to go further. In addition, the article "PyCorrector text correction tool practice and code details" also systematically introduces the use of pycorrector.

# Suppress warnings in the Jupyter notebook environment
import warnings
warnings.filterwarnings("ignore")

1 pycorrector related background

Some content and pictures in this section come from: pycorrector source code interpretation .

Application background

The application background and common error types of Chinese text error correction tasks are shown in the figure below. pycorrector focuses on errors such as phonetically similar characters, visually similar characters, grammatical errors, and proper-name errors.

Data set

In text error correction tasks, the quality and quantity of the data set are often more important than the model itself, a problem shared by many real-world tasks. Because the differences between models are small, the accuracy of a text error correction model depends more on the size of its training data.

Technical ideas

General text error correction is implemented in two ways: rule-based methods and methods based on machine learning/deep learning. Since large language models now work very well, a high-level understanding of these existing methods is sufficient.

The technical ideas of rule-based Chinese text error correction are as follows:

  1. Word segmentation: First use the word segmentation tool to segment the input text and split the sentence into words.
  2. Error detection: Use rule matching to detect errors in text after word segmentation. Common errors include spelling errors, word order errors, redundant words, etc. For example, a dictionary or corpus can be used to match commonly used words, and if a word is not in the dictionary or corpus, it is considered a possible error.
  3. Error correction: Once an error is discovered, it can be corrected according to the rules. Error correction methods include spelling correction, word order adjustment, word replacement, etc. For example, edit distance or pinyin approximate matching algorithm can be used for spelling correction; language model prediction probability can be used to determine whether the word order is correct; synonym dictionary can be used for word replacement.
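The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the vocabulary `VOCAB` and its frequencies are made up, whitespace splitting stands in for a real word segmentation tool, and the helper names (`edit_distance`, `detect`, `correct_token`, `correct`) are hypothetical and not part of pycorrector.

```python
from collections import Counter

# Hypothetical toy vocabulary with made-up frequencies; a real system
# would load a large dictionary or corpus counts instead.
VOCAB = Counter({"what": 100, "how": 90, "to": 200, "it": 150,
                 "can": 80, "you": 120, "happening": 30,
                 "spelling": 25, "correct": 40})


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def detect(tokens):
    """Step 2: flag tokens that are missing from the vocabulary."""
    return [i for i, tok in enumerate(tokens) if tok not in VOCAB]


def correct_token(token, max_distance=2):
    """Step 3: pick the closest vocabulary word, preferring frequent ones."""
    distance, _, best = min(
        (edit_distance(token, word), -freq, word)
        for word, freq in VOCAB.items()
    )
    return best if distance <= max_distance else token


def correct(sentence):
    tokens = sentence.split()  # Step 1: stand-in for word segmentation
    for i in detect(tokens):
        tokens[i] = correct_token(tokens[i])
    return " ".join(tokens)


print(correct("what happending how to speling it can you gorrect it"))
```

For Chinese text, the same structure would apply with a real segmenter and with pinyin or character-shape similarity in place of plain edit distance.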

The technical ideas of text error correction based on machine learning/deep learning are as follows:

  1. Based on the Sequence-to-Sequence model: Using a sequence-to-sequence model with an encoder-decoder structure, the input error text is used as the source sequence and the target text (correct text) is used as the target sequence for training. By minimizing the error, the parameters of the model are adjusted to correct the error text.
  2. Based on the Transformer model: The Transformer model is a deep learning model with an attention mechanism, which is widely used in natural language processing tasks, such as machine translation and text generation. In text error correction tasks, the Transformer model can be used to convert incorrect text into correct text, and the loss function can be used for training and optimization.
  3. Language model-based reinforcement learning method: Use language models to generate candidate error correction results, and use reinforcement learning algorithms to evaluate and select the best error correction suggestions. This method can continuously improve error correction performance by interacting with the external environment.
  4. Based on pre-trained models, such as the GPT series or BERT. BERT is a pre-trained language model with strong context understanding. In text error correction, a BERT model can encode the input and predict the corrected token at each position, and the model is trained through self-supervised learning so that it learns to correct erroneous text.

Industrial solutions

Although large language models represented by GPT have shown potential in text error correction, they demand substantial training data and computing power. For now, the existing text error correction technologies in industry still have advantages and a wide range of application scenarios.

2 pycorrector usage instructions

2.1 Rule-based Chinese text error correction

The rule-based Chinese text error correction interface in pycorrector uses the Kenlm model by default. Specifically, pycorrector trains a Chinese n-gram language model with the Kenlm statistical language modeling tool. Combining rules with a confusion set, it can correct Chinese spelling errors quickly, but its accuracy is only average.
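To make the idea of "n-gram language model plus confusion set" concrete, here is a minimal sketch, not pycorrector's actual implementation. It assumes a tiny made-up corpus (`CORPUS`), a hypothetical character confusion set (`CONFUSION`), and a character bigram model with add-one smoothing in place of a trained Kenlm model. A candidate replacement is accepted only when it raises the sentence's log-probability.

```python
import math
from collections import Counter

# Hypothetical toy corpus; Kenlm would be trained on a very large corpus.
CORPUS = ["少先队员应该为老人让座", "我们应该帮助老人", "请为老人让座"]

unigrams, bigrams = Counter(), Counter()
for sent in CORPUS:
    chars = ["<s>"] + list(sent) + ["</s>"]
    unigrams.update(chars)
    bigrams.update(zip(chars, chars[1:]))
VOCAB_SIZE = len(unigrams)


def score(sentence):
    """Log-probability under a character bigram model with add-one smoothing."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + VOCAB_SIZE))
        for a, b in zip(chars, chars[1:])
    )


# Hypothetical confusion set: character -> similar-sounding candidates.
CONFUSION = {"因": ["应"], "坐": ["座"]}


def correct(sentence):
    chars = list(sentence)
    for i, ch in enumerate(chars):
        best, best_score = ch, score("".join(chars))
        for cand in CONFUSION.get(ch, []):
            trial = "".join(chars[:i] + [cand] + chars[i + 1:])
            trial_score = score(trial)
            if trial_score > best_score:  # keep the more probable sentence
                best, best_score = cand, trial_score
        chars[i] = best
    return "".join(chars)


print(correct("少先队员因该为老人让坐"))
```

The design choice to compare whole-sentence scores keeps the sketch short; a production system would typically score only a local window around the candidate position and apply a threshold before accepting a change.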

Text error correction

The correct function used for text error correction loads the kenlm language model file from the path ~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm. If the file does not exist, the model is downloaded automatically from the Internet. You can also download the klm model file (2.8G) manually and place it at this location.

import pycorrector
# include_symbol: whether to include punctuation, default True
# threshold: correction threshold, default 57
corrected_sent, detail = pycorrector.correct("人群穿流不息,少先队员因该为老人让坐!", include_symbol=True, threshold=60)
# corrected_sent: the corrected sentence
# detail: correction details, [wrong, right, begin_pos, end_pos]
corrected_sent, detail
('人群川流不息,少先队员应该为老人让座!',
 [('穿流不息', '川流不息', 2, 6), ('因该', '应该', 11, 13), ('坐', '座', 17, 18)])
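The detail tuples are easiest to understand by replaying them against the input. The sketch below assumes, as the sample output suggests, that begin_pos and end_pos are 0-based character offsets with an exclusive end. The helper `apply_details` is hypothetical and only uses the values printed above.

```python
def apply_details(text, details):
    """Rebuild the corrected sentence from (wrong, right, begin, end) tuples."""
    chars = list(text)
    # Work from right to left so that earlier offsets stay valid.
    for wrong, right, begin, end in sorted(details, key=lambda d: d[2], reverse=True):
        if "".join(chars[begin:end]) != wrong:
            raise ValueError("offsets do not match the text")
        chars[begin:end] = list(right)
    return "".join(chars)


sentence = "人群穿流不息,少先队员因该为老人让坐!"
detail = [("穿流不息", "川流不息", 2, 6), ("因该", "应该", 11, 13), ("坐", "座", 17, 18)]
print(apply_details(sentence, detail))
```

Replaying the details this way is also a quick sanity check when post-processing correction results, for example before highlighting the changed spans in a user interface.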

Error detection

pycorrector provides the detect function to detect and return possible language errors and error types in the input text.

import pycorrector

idx_errors = pycorrector.detect('人群穿流不息,少先队员因该为老人让坐!')
print(idx_errors)
[['穿流不息', 2, 6, 'proper'], ['因该', 11, 13, 'word'], ['坐', 17, 18, 'char']]

Correction of idioms and proper names

pycorrector provides functions specifically used to correct idioms and proper names, as shown below:

from pycorrector.proper_corrector import ProperCorrector
from pycorrector import config

m = ProperCorrector(proper_name_path=config.proper_name_path)
x = [
    '这块名表带带相传',
    '这块名表代代相传',
    '这场比赛我甘败下风',
    '这场比赛我甘拜下封',
    '早上在拼哆哆上买了点葡桃',
]

for i in x:
    print(i, ' -> ', m.proper_correct(i))
这块名表带带相传  ->  ('这块名表代代相传', [('带带相传', '代代相传', 4, 8)])
这块名表代代相传  ->  ('这块名表代代相传', [])
这场比赛我甘败下风  ->  ('这场比赛我甘拜下风', [('甘败下风', '甘拜下风', 5, 9)])
这场比赛我甘拜下封  ->  ('这场比赛我甘拜下风', [('甘拜下封', '甘拜下风', 5, 9)])
早上在拼哆哆上买了点葡桃  ->  ('早上在拼多多上买了点葡桃', [('拼哆哆', '拼多多', 3, 6)])

Custom confusion set

pycorrector lets users correct known errors by loading a custom confusion set, which in effect performs string replacement.

from pycorrector import ConfusionCorrector, Corrector

if __name__ == '__main__':
    error_sentences = [
        '买iphonex,要多少钱',  # missed error (not recalled)
        '哪里卖苹果吧?请大叔给我让坐',  # missed error (not recalled)
        '共同实际控制人萧华、霍荣铨、张旗康',  # false correction
        '上述承诺内容系本人真实意思表示',  # normal sentence
        '大家一哄而伞怎么回事',  # idiom
    ]
    m = Corrector()
    for i in error_sentences:
        print(i, ' -> ', m.detect(i), m.correct(i))

    print('*' * 42)
    
    # Custom confusion set
    custom_confusion = {
        '得事': '的事', '天地无垠': '天地无限', '交通先行': '交通限行', '苹果吧': '苹果八',
        'iphonex': 'iphoneX', '小明同学': '小茗同学', '萧华': '萧华', '张旗康': '张旗康',
        '一哄而伞': '一哄而散', 'happt': 'happen', 'shylock': 'shylock', '份额': '份额',
        '天俺门': '天安门',
    }
    m = ConfusionCorrector(custom_confusion_path_or_dict=custom_confusion)
    for i in error_sentences:
        print(i, ' -> ', m.confusion_correct(i))
买iphonex,要多少钱  ->  [['钱', 12, 13, 'char']] ('买iphonex,要多少钱', [])
哪里卖苹果吧?请大叔给我让坐  ->  [] ('哪里卖苹果吧?请大叔给我让坐', [])
共同实际控制人萧华、霍荣铨、张旗康  ->  [['霍荣铨', 10, 13, 'word'], ['张旗康', 14, 17, 'word']] ('共同实际控制人萧华、霍荣铨、张启康', [('张旗康', '张启康', 14, 17)])
上述承诺内容系本人真实意思表示  ->  [['系', 6, 7, 'char']] ('上述承诺内容系本人真实意思表示', [])
大家一哄而伞怎么回事  ->  [['一哄', 2, 4, 'word'], ['伞', 5, 6, 'char']] ('大家一哄而散怎么回事', [('伞', '散', 5, 6)])
******************************************
买iphonex,要多少钱  ->  ('买iphoneX,要多少钱', [['iphonex', 'iphoneX', 1, 8]])
哪里卖苹果吧?请大叔给我让坐  ->  ('哪里卖苹果八?请大叔给我让坐', [['苹果吧', '苹果八', 3, 6]])
共同实际控制人萧华、霍荣铨、张旗康  ->  ('共同实际控制人萧华、霍荣铨、张旗康', [['萧华', '萧华', 7, 9], ['张旗康', '张旗康', 14, 17]])
上述承诺内容系本人真实意思表示  ->  ('上述承诺内容系本人真实意思表示', [])
大家一哄而伞怎么回事  ->  ('大家一哄而散怎么回事', [['一哄而伞', '一哄而散', 2, 6]])
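Because a confusion set amounts to string replacement, its behavior can be approximated in a few lines. The function `simple_confusion_correct` below is a hypothetical stand-in, not pycorrector's implementation. It assumes each entry maps a wrong string to a replacement of the same length, so the recorded offsets remain valid after replacement.

```python
def simple_confusion_correct(text, confusion):
    """Replace every known wrong string and record [wrong, right, begin, end]."""
    details = []
    for wrong, right in confusion.items():
        start = text.find(wrong)
        while start != -1:
            details.append([wrong, right, start, start + len(wrong)])
            start = text.find(wrong, start + len(wrong))
        # Offsets above assume len(wrong) == len(right).
        text = text.replace(wrong, right)
    details.sort(key=lambda d: d[2])
    return text, details


confusion = {"iphonex": "iphoneX", "一哄而伞": "一哄而散", "萧华": "萧华", "张旗康": "张旗康"}
print(simple_confusion_correct("大家一哄而伞怎么回事", confusion))
print(simple_confusion_correct("共同实际控制人萧华、霍荣铨、张旗康", confusion))
```

Mapping a name to itself, as with 萧华 and 张旗康, records the span without changing the text, which mirrors how the confusion set above is used to mark names that should be left alone.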

Custom language model

pycorrector provides code for loading a custom language model as follows:

from pycorrector import Corrector

# Path to the custom language model
lm_path = './custom.klm'
model = Corrector(language_model_path=lm_path)

English spelling correction

pycorrector also provides English spelling correction, though its accuracy is only average.

sent = "what happending? how to speling it, can you gorrect it?"
corrected_text, details = pycorrector.en_correct(sent)
print(sent, '=>', corrected_text)
print(details)
what happending? how to speling it, can you gorrect it? => what happening? how to spelling it, can you correct it?
[('happending', 'happening', 5, 15), ('speling', 'spelling', 24, 31), ('gorrect', 'correct', 44, 51)]

pycorrector also supports a custom word-frequency dictionary to prevent false corrections. As shown below, shylock is wrongly corrected to shock by default. Passing a custom word-frequency dictionary that contains shylock avoids this false correction.

from pycorrector.en_spell import EnSpell

# Define the input sentence
sent = "what is your name? shylock?"
# Create an EnSpell instance with the default dictionary
spell = EnSpell()
corrected_text, details = spell.correct(sent)
# shylock is corrected to shock
print(sent, '=>', corrected_text, details)
print('-' * 42)

# Define a custom word-frequency dictionary that includes shylock
my_dict = {'your': 120, 'name': 2, 'is': 1, 'shock': 2, 'shylock': 1, 'what': 1}
# Create an EnSpell instance with the custom word-frequency dictionary
spell = EnSpell(word_freq_dict=my_dict)
corrected_text, details = spell.correct(sent)
print(sent, '=>', corrected_text, details)
what is your name? shylock? => what is your name? shock? [('shylock', 'shock', 19, 26)]
------------------------------------------
what is your name? shylock? => what is your name? shylock? []
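Why a word-frequency dictionary changes the result can be illustrated with the classic frequency-based spelling corrector popularized by Peter Norvig. This is a sketch of that general technique, offered on the assumption that it resembles what a dictionary-driven English corrector does. It is not taken from the EnSpell source, and the function names are hypothetical. A word found in the dictionary is kept as is; otherwise the most frequent known word within one or two edits is chosen.

```python
import string


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def correct_word(word, freq):
    """Keep known words; otherwise pick the most frequent close candidate."""
    def known(words):
        return {w for w in words if w in freq}

    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: freq.get(w, 0))


base = {"your": 120, "name": 2, "is": 1, "shock": 2, "what": 1}
custom = dict(base, shylock=1)  # same counts, plus the word to protect
print(correct_word("shylock", base), correct_word("shylock", custom))
```

In this sketch the word only needs to be present in the dictionary to be protected; its count matters only when several candidates compete.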

Simplified and Traditional Chinese interchange

pycorrector supports the interchange of simplified and traditional Chinese, as shown below:

import pycorrector

traditional_sentence = '學而時習之,不亦說乎'
simplified_sentence = pycorrector.traditional2simplified(traditional_sentence)
print(traditional_sentence, '=>', simplified_sentence)

simplified_sentence = '学而时习之,不亦说乎'
traditional_sentence = pycorrector.simplified2traditional(simplified_sentence)
print(simplified_sentence, '=>', traditional_sentence)
學而時習之,不亦說乎 => 学而时习之,不亦说乎
学而时习之,不亦说乎 => 學而時習之,不亦說乎
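Conceptually, this conversion is a character mapping. The sketch below uses a hand-written four-entry table (`T2S_TABLE`, made up for this example) with Python's built-in `str.translate`. Real converters rely on much larger tables and also handle phrases, because the mapping between simplified and traditional characters is not always one-to-one.

```python
# Hypothetical miniature mapping table covering only the sample sentence.
T2S_TABLE = {"學": "学", "時": "时", "習": "习", "說": "说"}
S2T_TABLE = {simp: trad for trad, simp in T2S_TABLE.items()}


def convert(text, table):
    """Map each character through the table; unknown characters pass through."""
    return text.translate(str.maketrans(table))


traditional = "學而時習之,不亦說乎"
simplified = convert(traditional, T2S_TABLE)
print(traditional, "=>", simplified)
print(simplified, "=>", convert(simplified, S2T_TABLE))
```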

2.2 Chinese text error correction based on deep learning

pycorrector provides multiple Chinese text error correction models based on deep learning. Generally speaking, using deep learning for Chinese text error correction can achieve better results than rule-based error correction. pycorrector evaluated various deep learning models under the SIGHAN2015 data set, which is a classic public data set for Chinese text error correction tasks, and came to the following conclusions:

  • The best Chinese spelling correction model is MacBert-CSC; the model name on Hugging Face is shibing624/macbert4csc-base-chinese.
  • The best Chinese grammar error correction model is BART-CSC; the model name on Hugging Face is shibing624/bart4csc-base-chinese.
  • The model with the most potential is Mengzi-T5-CSC; the model name on Hugging Face is shibing624/mengzi-t5-base-chinese-correction. The model structure is unchanged; fine-tuning on a Chinese error correction data set alone already achieves results close to SOTA on SIGHAN 2015.
  • The error correction model fine-tuned from ChatGLM-6B is also very effective; the model name on Hugging Face is shibing624/chatglm-6b-csc-zh-lora. The large model can not only correct errors but also polish sentences, but it is too large and its inference speed is slow.
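Comparisons like these rest on correction metrics. The following sketch shows one simplified, sentence-level way to compute precision, recall, and F1 from (source, prediction, reference) triples. The definition is an assumption made for illustration and may differ from the exact evaluation script used by pycorrector; the fourth sample's prediction is invented toy data.

```python
def sentence_level_prf(samples):
    """Sentence-level precision/recall/F1 for (source, prediction, reference)."""
    tp = fp = fn = 0
    for source, prediction, reference in samples:
        if source != reference:        # the sentence really contains an error
            if prediction == reference:
                tp += 1                # fully corrected
            else:
                fn += 1                # missed or wrongly corrected
        elif prediction != reference:  # a clean sentence was changed
            fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


samples = [
    ("人群穿流不息", "人群传流不息", "人群川流不息"),                        # wrong fix
    ("少先队员因该为老人让坐", "少先队员应该为老人让座", "少先队员应该为老人让座"),  # correct fix
    ("这块名表代代相传", "这块名表代代相传", "这块名表代代相传"),                  # left alone
    ("上述承诺内容系本人真实意思表示", "上述承诺内容是本人真实意思表示",
     "上述承诺内容系本人真实意思表示"),                                     # false correction
]
print(sentence_level_prf(samples))
```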

The code for calling the MacBert-CSC model in pycorrector for text error correction is as follows. The code will automatically load the error correction model provided by macbert4csc-base-chinese .

from pycorrector.macbert.macbert_corrector import MacBertCorrector
from pycorrector import ConfusionCorrector

if __name__ == '__main__':
    error_sentences = [
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
    ]

    m = MacBertCorrector()
    # add confusion corrector for postprocess
    confusion_dict = {
        "喝小明同学": "喝小茗同学", "老人让坐": "老人让座", "平净": "平静", "分知": "分支",
    }
    cm = ConfusionCorrector(custom_confusion_path_or_dict=confusion_dict)
    for line in error_sentences:
        correct_sent, err = m.macbert_correct(line)
        print("query:{} => {} err:{}".format(line, correct_sent, err))
        correct_sent, err = cm.confusion_correct(correct_sent)
        if err:
            print("added confusion: {} err: {}".format(correct_sent, err))
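Spelling correction models of this kind typically return a sentence of the same length as the input, so a list of errors can be derived by comparing the two strings character by character. The helper `diff_errors` below is a hypothetical sketch of that step. It assumes the same (wrong, right, begin_pos, end_pos) layout that pycorrector.correct returns earlier in this article, and it is not the library's own code.

```python
def diff_errors(source, corrected):
    """List (wrong, right, begin, end) for every character that changed."""
    if len(source) != len(corrected):
        raise ValueError("this sketch only handles same-length corrections")
    return [
        (src_ch, cor_ch, i, i + 1)
        for i, (src_ch, cor_ch) in enumerate(zip(source, corrected))
        if src_ch != cor_ch
    ]


print(diff_errors("少先队员因该为老人让坐", "少先队员应该为老人让座"))
```

Grammar correction models that insert or delete characters would need an alignment step (for example, an edit-distance backtrace) instead of this position-wise comparison.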

In addition, pycorrector also recommends using PaddleNLP for text error correction. PaddleNLP provides multiple industrial-level NLP preset models covering text error correction. For information on the installation and use of PaddleNLP, see its official repository: PaddleNLP .

from paddlenlp import Taskflow
corrector = Taskflow("text_correction")

# Single input
corrector('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。')
[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。',
  'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。',
  'errors': [{'position': 3, 'correction': {'竟': '境'}}]}]
# Batch prediction
corrector(['遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。'])
[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。',
  'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。',
  'errors': [{'position': 3, 'correction': {'竟': '境'}}]},
 {'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。',
  'target': '人生就是如此,经过磨炼才能让自己更加茁壮,才能使自己更加乐观。',
  'errors': [{'position': 10, 'correction': {'练': '炼'}},
   {'position': 18, 'correction': {'拙': '茁'}}]}]

In fact, neither rule-based Chinese text error correction nor error correction based on smaller deep learning models performs as well as large language models (such as ChatGPT). Even though the error correction model provided by PaddleNLP achieves quite high accuracy on Chinese natural language processing tasks, it may still fail on some simple cases. As shown below, "穿流不息" is incorrectly corrected to "传流不息" instead of "川流不息", and "让坐" is left uncorrected. Therefore, in practice, the model should be customized for the specific scenario. For industrial applications where computing power is not a tight constraint, open-source large language models are the preferred choice.

corrector('人群穿流不息,少先队员因该为老人让坐')
[{'source': '人群穿流不息,少先队员因该为老人让坐',
  'target': '人群传流不息,少先队员应该为老人让坐',
  'errors': [{'position': 2, 'correction': {'穿': '传'}},
   {'position': 11, 'correction': {'因': '应'}}]}]

3 Reference

Origin blog.csdn.net/LuohenYJ/article/details/133235908