BERT-Based Independent Semantic Error Correction for OCR in Practice

Abstract: In this case, we reuse the text detection and recognition models from the video subtitle recognition case and add a pre-trained BERT model for error correction.

This article is shared from the Huawei Cloud Community post "Bert's Special Tuning OCR", author: Du Fu builds a house.

This project started from an observation: when the image is blurry or the detection box is relatively long, OCR produces some misrecognitions, so I wanted to correct the recognition results. A natural idea is to use semantic information for error correction. There is in fact a good deal of work on adding semantic information during OCR training, and interested readers can look it up. To reuse the existing project as much as possible, we decided to keep the existing OCR functional units and add an independent semantic error correction module after them.

In this case, we use the text detection and recognition models from the video subtitle recognition case and add a pre-trained BERT model for error correction. The final effect is as follows:

We use the ModelBox Windows SDK for development. If you have not installed the SDK yet, refer to "ModelBox Device-Cloud Collaborative AI Development Kit (Windows) Device Registration" and "ModelBox Device-Cloud Collaborative AI Development Kit (Windows) SDK Installation" to complete device registration and SDK installation.

Skill development

The ModelBox version of this application has been made into a template and placed in HUAWEI CLOUD OBS; it can be downloaded with the solution.bat tool in the SDK. Below is the complete ModelBox development process for this application:

1) Download the template

Run .\solution.bat -l to see the currently public skill templates:

███@DESKTOP-UUVFMTP MINGW64 /d/DEMO/modelbox-win10-x64-1.5.1
$ ./solution.bat -l
start download desc.json
 3942.12KB/S, percent=100.00%
Solutions name:
mask_det_yolo3
...
doc_ocr_db_crnn_bert

The doc_ocr_db_crnn_bert entry in the result is the document recognition application template. Download it:

███@DESKTOP-UUVFMTP MINGW64 /d/DEMO/modelbox-win10-x64-1.5.1
$ ./solution.bat -s doc_ocr_db_crnn_bert
...

Among the parameters of the solution.bat tool, -l stands for list and lists the names of the currently available templates; -s stands for solution-name and downloads the template with the given name. The downloaded template resources are stored in the solution directory of the ModelBox core library.

2) Create a project

In the ModelBox SDK directory, use create.bat to create the doc_ocr project:

███@DESKTOP-UUVFMTP MINGW64 /d/DEMO/modelbox-win10-x64-1.5.1
$ ./create.bat -t server -n doc_ocr -s doc_ocr_db_crnn_bert
sdk version is modelbox-win10-x64-1.5.1
success: create doc_ocr in D:\DEMO\modelbox-win10-x64-1.5.1\workspace

Among the parameters of the create.bat tool, -t indicates the type of item to create, such as a project (server), a Python functional unit (python), or an inference functional unit (infer); -n stands for name, the name of the item to create; -s stands for solution-name and means the project is created from the template named by the following value instead of as an empty project.

A doc_ocr project is created under the workspace directory, with the following contents:

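doc_ocr
|--bin
│  |--main.bat: application entry point
│  |--mock_task.toml: input/output configuration used when the application runs locally; this application is an HTTP service
|--CMake: custom CMake functions
|--data: images, videos, text, configuration and other data the application needs at runtime
│  |--char_meta.txt: character decomposition file, used to compute glyph similarity
│  |--character_keys.txt: character set of the OCR algorithm
│  |--GB2312.ttf: Chinese font file
│  |--test_http.py: application test script
│  |--text.jpg: application test image
│  |--vocab.txt: tokenizer configuration file
|--dependence
│  |--modelbox_requirements.txt: external libraries the application depends on are defined here; this application depends on pyclipper, Shapely, pillow and other packages
|--etc
│  |--flowunit: functional units required by the application are stored here
│  │  |--cpp: compiled dynamic libraries of C++ functional units; this application has no C++ functional units
│  │  |--bert_preprocess: BERT preprocessing functional unit; a conditional functional unit that decides whether error correction is needed
│  │  |--collapse_position: collapses the error correction results of a single sentence
│  │  |--collapse_sentence: collapses the error correction results of the full text
│  │  |--det_post: text detection post-processing functional unit
│  │  |--draw_ocr: OCR result drawing functional unit
│  │  |--expand_img: expand functional unit; expands the text detection results
│  │  |--expand_position: expand functional unit; expands the BERT preprocessing results
│  │  |--match_position: matches the error correction results
│  │  |--ocr_post: OCR post-processing functional unit
|--flowunit_cpp: source code of C++ functional units; this application has no C++ functional units
|--graph: flowcharts
│  |--doc_ocr.toml: default flowchart, HTTP service
│  |--modelbox.conf: ModelBox configuration
|--hilens_data_dir: result files, logs, and performance statistics output by the application
|--model: inference functional units
│  |--bert: BERT inference functional unit
│  │  |--bert.toml: semantic inference configuration file
│  │  |--bert.onnx: semantic inference model
│  |--det: text detection inference functional unit
│  │  |--det.toml: configuration file of the text detection inference functional unit
│  │  |--det.onnx: text detection ONNX model
│  |--ocr: text recognition inference functional unit
│  │  |--ocr.toml: configuration file of the text recognition inference functional unit
│  │  |--ocr.onnx: text recognition ONNX model
|--build_project.sh: application build script
|--CMakeLists.txt
|--rpm: directory generated when packaging an rpm; holds the data the rpm package needs
|--rpm_copyothers.sh: helper script for rpm packaging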

3) View the flowchart

The flowchart of the doc_ocr project is stored in the graph directory. The default flowchart, doc_ocr.toml, has the same name as the project. Visualized, the flowchart looks like this:

In the diagram, the gray parts are preset functional units and the other colors are functional units we implemented: green for general functional units, red for inference functional units, blue for conditional functional units, and yellow for expand and collapse functional units. The image received over HTTP is decoded and preprocessed, then goes through text detection, and post-processing of the model output produces the detection boxes. A conditional functional unit then checks the result: images in which text was detected are sent to the expand functional unit, which crops the image for text recognition. The recognition results are sent to the bert preprocessing unit, which decides whether error correction is needed. If it is, semantic inference runs in parallel; if not, the result is drawn and returned directly. Frames in which no text was detected are returned directly.

4) Core logic

For the text detection and recognition parts of the core logic, refer to the introduction in "[ModelBox OCR Practical Camp] video subtitle recognition". This article focuses on the text error correction part.

First, look at the error correction preprocessing functional unit, bert_preprocess:

def process(self, data_context):
    in_feat = data_context.input("in_feat")
    out_feat = data_context.output("out_feat")
    out_bert = data_context.output("out_bert")

    for buffer_feat in in_feat:
        # Parse the OCR sentences and their confidence scores
        ocr_data = json.loads(buffer_feat.as_object())['ocr_result']
        score_data = json.loads(json.loads(buffer_feat.as_object())['result_score'])

        text_to_process = []  # sentences that will be sent to BERT
        text_to_pass = []     # (index, sentence) pairs that skip correction
        err_positions = []    # suspicious character positions, one list per sentence
        for i, (sent, p) in enumerate(zip(ocr_data, score_data)):
            if not do_correct_filter(sent, self.max_seq_length):
                text_to_pass.append((i, sent))
            else:
                err_pos = find_err_pos_by_prob(p, self.prob_threshold)
                if not err_pos:
                    text_to_pass.append((i, sent))
                else:
                    text_to_process.append(sent)
                    err_positions.append(err_pos)
        if not text_to_process:
            # Nothing to correct: forward the OCR result unchanged
            out_feat.push_back(buffer_feat)
        else:
            out_dict = []
            # Replace number matches before encoding
            texts_numfree = [self.number.sub(lambda m: self.rep[re.escape(m.group(0))], s) for s in text_to_process]
            err_positions = check_error_positions(texts_numfree, err_positions)
            if err_positions is None:
                # Fall back to checking every position
                err_positions = [range(len(d)) for d in texts_numfree]
            batch_data = BatchData(texts_numfree, err_positions, self.tokenizer, self.max_seq_length)
            input_ids, input_mask, segment_ids, masked_lm_positions = batch_data.data
            ...
    return modelbox.Status.StatusCode.STATUS_SUCCESS
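
The line that builds texts_numfree replaces every number match with a mapped string before encoding. The attributes self.number and self.rep are not shown in the article, so the following is only a minimal sketch of that substitution pattern, assuming a hypothetical one-to-one mapping from ASCII digits to Chinese numerals. A one-to-one mapping is chosen here because it keeps character indices, and therefore the error positions, aligned with the original sentence.

```python
import re

# Hypothetical mapping: the real contents of self.rep are not given in the article.
DIGIT_TO_HANZI = {
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}

# Keys are escaped so they can be joined safely into one alternation pattern.
rep = {re.escape(k): v for k, v in DIGIT_TO_HANZI.items()}
number = re.compile("|".join(rep.keys()))


def replace_digits(text):
    """Replace each matched digit with its mapped character."""
    return number.sub(lambda m: rep[re.escape(m.group(0))], text)


print(replace_digits("共3人, 第12页"))
```

Passing a function to sub lets a single compiled pattern handle all replacements in one pass over the sentence.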

The preprocessing unit screens each OCR result with the do_correct_filter function. A sentence is corrected only if it contains no Latin letters, has at least 3 Chinese characters, and fits within the maximum sequence length:

def do_correct_filter(text, max_seq_length):
    if re.search(re.compile(r'[a-zA-ZA-Za-z]'), text):
        return False
    if len(re.findall(re.compile(r'[\u4E00-\u9FA5]'), text)) < 3:
        return False
    if len(text) > max_seq_length - 2:
        return False
    return True
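
To make the three rules concrete, here is a small self-contained restatement of the filter with sample inputs. The sample sentences and the max_seq_length values are made up for illustration, and the letter pattern covers ASCII letters only, which is a simplification.

```python
import re

LATIN = re.compile(r"[a-zA-Z]")          # simplification: ASCII letters only
HANZI = re.compile(r"[\u4E00-\u9FA5]")   # common CJK unified ideographs


def do_correct_filter(text, max_seq_length):
    """Return True only for sentences that should go through correction."""
    if LATIN.search(text):
        return False                      # mixed-language text is skipped
    if len(HANZI.findall(text)) < 3:
        return False                      # too little context for BERT
    if len(text) > max_seq_length - 2:
        return False                      # leave room for two special tokens
    return True


samples = ["今天天气很好", "OCR识别结果", "你好"]
for s in samples:
    print(s, do_correct_filter(s, max_seq_length=16))
```

With these inputs only the first sentence passes; the same sentence would be rejected with max_seq_length=6, because its 6 characters exceed 6 - 2.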

The find_err_pos_by_prob function locates the characters that need error correction. Only characters whose OCR confidence is below the threshold are corrected:

def find_err_pos_by_prob(prob, prob_threshold):
    if not prob:
        return []
    err_pos = [i for i, p in enumerate(prob) if p < prob_threshold]
    return err_pos

If there are characters that need to be corrected, the sentence is encoded for semantic inference.
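
The encoding itself happens inside BatchData, which the article does not show. As a rough sketch only, assuming the usual BERT input layout ([CLS] + characters + [SEP], padded to a fixed length) and character-level tokens, the four arrays could be produced as below. The function encode_sentence and the toy vocabulary are hypothetical; the real unit uses a tokenizer built from vocab.txt.

```python
def find_err_pos_by_prob(prob, prob_threshold):
    """Same rule as in the article: indices whose confidence is below the threshold."""
    return [i for i, p in enumerate(prob) if p < prob_threshold]


def encode_sentence(text, err_positions, vocab, max_seq_length):
    """Build fixed-length inputs for one sentence (hypothetical helper)."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("sentence too long for max_seq_length")
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)      # 1 marks real tokens
    padding = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    input_mask += [0] * padding            # 0 marks padding
    segment_ids = [0] * max_seq_length     # single-sentence input
    # [CLS] occupies index 0, so character i sits at token index i + 1.
    masked_lm_positions = [pos + 1 for pos in err_positions]
    return input_ids, input_mask, segment_ids, masked_lm_positions


# Toy vocabulary and scores, made up for illustration.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "今": 4, "天": 5, "好": 6}
probs = [0.99, 0.98, 0.41]
positions = find_err_pos_by_prob(probs, 0.9)
encoded = encode_sentence("今天好", positions, TOY_VOCAB, max_seq_length=8)
print(positions, encoded)
```

The two special tokens are the reason do_correct_filter rejects sentences longer than max_seq_length - 2.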

After semantic inference, collapse_position decodes the inference result, and the match_position functional unit uses the shape_similarity function to calculate the glyph similarity between the character predicted by semantic inference and the character from the OCR result:

def shape_similarity(self, char1, char2):
    decomp1 = self.decompose_text(char1)
    decomp2 = self.decompose_text(char2)
    similarity = 0.0
    ed = edit_distance(safe_encode_string(decomp1), safe_encode_string(decomp2))
    normalized_ed = ed / max(len(decomp1), len(decomp2), 1)
    similarity = max(similarity, 1 - normalized_ed)
    return similarity
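
The helpers used by shape_similarity (decompose_text, edit_distance, safe_encode_string) are not shown in the article. The sketch below illustrates the same idea under stated assumptions: a tiny hypothetical decomposition table stands in for char_meta.txt, and a plain Levenshtein distance stands in for edit_distance. The entry for 化 is simply the corresponding sub-sequence of the 华 decomposition quoted in this article.

```python
# Hypothetical two-entry table; the real data lives in char_meta.txt.
DECOMPOSITIONS = {
    "华": "⿱⿰⿰丿丨⿻乚丿⿻一丨",
    "化": "⿰⿰丿丨⿻乚丿",
}


def decompose_text(char):
    """Return the stroke-level sequence, or the character itself if unknown."""
    return DECOMPOSITIONS.get(char, char)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def shape_similarity(char1, char2):
    """1 minus the edit distance normalized by the longer decomposition."""
    d1, d2 = decompose_text(char1), decompose_text(char2)
    ed = levenshtein(d1, d2)
    return 1 - ed / max(len(d1), len(d2), 1)


print(shape_similarity("华", "化"))
```

Here the two sequences have 11 and 7 symbols and differ by 4 edits, so the similarity is 1 - 4/11, about 0.64. Visually close characters score high, which is what makes this a useful check against semantically plausible but visually unrelated BERT predictions.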

Here, the decompose_text function encodes a single Chinese character as a stroke-level Ideographic Description Sequence (IDS), for example:

华: ⿱⿰⿰丿丨⿻乚丿⿻一丨


                 华 ⿱
              /       \
          化 ⿰         十 ⿻
         /    \        /   \
      亻 ⿰    七 ⿻    一     丨
      /  \     /  \
    丿    丨  乚    丿

After calculating the similarity between the character predicted by semantic inference and the original OCR character, the semantic inference confidence and the similarity are combined to decide whether to accept the correction:

def accept_correct(self, confidence, similarity):
    # Accept only if the combined score and each individual score pass their thresholds
    return (confidence + similarity >= self.all_conf
            and confidence >= self.confidence_conf
            and similarity >= self.similarity_conf)
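
As a worked example of this three-way check, the sketch below applies it to a few score pairs. The threshold values are hypothetical, since the article does not state the configured values of all_conf, confidence_conf, and similarity_conf; the first score pair is taken from the run log shown later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values, chosen only for illustration.
    all_conf: float = 1.5
    confidence_conf: float = 0.9
    similarity_conf: float = 0.5


def accept_correct(confidence, similarity, t=Thresholds()):
    """Accept only if the sum and both individual scores pass their thresholds."""
    return (confidence + similarity >= t.all_conf
            and confidence >= t.confidence_conf
            and similarity >= t.similarity_conf)


cases = [
    (0.99831665, 0.6470588235294117),  # confident and visually similar
    (0.99, 0.20),                      # confident but visually unrelated
    (0.92, 0.55),                      # each score passes, the sum does not
]
for conf, sim in cases:
    print(conf, sim, accept_correct(conf, sim))
```

Under these assumed thresholds only the first pair is accepted. The combined threshold matters for the third pair: both scores clear their individual bars, but together they are too weak.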

5) Third-party dependencies

This application depends on toolkits such as pyclipper, Shapely, and pillow. A ModelBox application does not require installing third-party dependencies manually: list them in dependence\modelbox_requirements.txt and they are installed automatically when the application is built.

Running the skill

Run .\bin\main.bat in the project directory to start the application. To make the error correction results easier to observe, we switch the log level to info:

███@DESKTOP-UUVFMTP MINGW64 /d/DEMO/modelbox-win10-x64-1.5.1/workspace/doc_ocr
$ ./bin/main.bat default info
...
[2022-12-27 15:20:40,043][ INFO][httpserver_sync_receive.cc:188 ] Start server at http://0.0.0.0:8083/v1/ocr_bert

Open another terminal, enter the project's data directory, and run the test_http.py script to test:

███@DESKTOP-UUVFMTP MINGW64 /d/DEMO/modelbox-win10-x64-1.5.1/workspace/doc_ocr/data
$ python test_http.py

Accepted error correction results can be observed in the skill run log:

[2022-12-27 15:22:40,700][ INFO][match_position\match_position.py:51  ] confidence: 0.99831665, similarity: 0.6470588235294117, 柜 ->

At the same time, the result images returned by the application can be found in the data directory:

 


Origin my.oschina.net/u/4526289/blog/8632062