Software Engineering Application and Practice (12): Example-Based Paddle OCR Code Analysis

2021SC@SDUSC

Table of contents

1. Review of the past

1.1 PP-OCR text recognition strategy

1.2 This article introduces the strategy

2. PP-OCR identification process

2.1 Dataset Collection

2.2 Model configuration

2.3 Training preparation

Define label input and hyperparameters

 Instantiate the model and configure the optimization strategy

2.4 Training

2.5 Pre-prediction preparation

Define the Prediction Reader as you defined the Training Reader

forecast data

 2.6 Forecast

Forecast usage code

result

Summarize


1. Review of the past

1.1 PP-OCR text recognition strategy

       The selection of strategies is mainly used to enhance the model capability and reduce the model size. The following are the nine strategies employed by the PP-OCR text recognizer:

  • Light backbone, choose to use MobileNetV3 large x0.5 to balance accuracy and efficiency;
  • Data augmentation, BDA (Base Dataaugmented) and TIA (Luo et al. 2020);
  • Cosine learning rate attenuation effectively improves the text recognition ability of the model;
  • Discrimination and analysis of feature maps, adapting to multi-language recognition, and modifying the stride of downsampled feature maps;
  • Regularization parameters, weight decay to avoid overfitting;
  • Learning rate warm-up is also effective;
  • Light head, using a fully connected layer to encode sequence features into predicted characters, reducing the size of the model;
  • The pre-trained model is trained on a large data set such as ImageNet, which can achieve faster convergence and better accuracy;
  • PACT quantization, skipping the LSTM layer;

1.2 This article introduces the strategy

       In the previous article, we re-introduced the CRNN-CTC model of the Paddle-OCR text recognizer according to the model implementation process.

      Based on the previous selection of the overall model and the specific introduction of various strategy algorithm code implementations, this article will combine the actual text recognition scene to give an overall understanding of the Paddle-OCR text recognition process, and analyze the key codes used in the process.

2. PP-OCR identification process

2.1 Dataset Collection

__init__Combine the sum method in ppocr __getitem__to realize the reading of custom data set.

code location

Detailed code implementation and analysis:

import os

import PIL.Image as Image
import numpy as np
from paddle.io import Dataset

# 图片信息配置 - 通道数、高度、宽度
IMAGE_SHAPE_C = 3
IMAGE_SHAPE_H = 30
IMAGE_SHAPE_W = 70
# 数据集图片中标签长度最大值设置 - 因图片中均为4个字符,故该处填写为4即可
LABEL_MAX_LEN = 4


class Reader(Dataset):
    def __init__(self, data_path: str, is_val: bool = False):
        """
        数据读取Reader
        :param data_path: Dataset路径
        :param is_val: 是否为验证集
        """
        super().__init__()
        self.data_path = data_path
        # 读取Label字典
        with open(os.path.join(self.data_path, "label_dict.txt"), "r", encoding="utf-8") as f:
            self.info = eval(f.read())
        # 获取文件名列表
        self.img_paths = [img_name for img_name in self.info]
        # 将数据集后1024张图片设置为验证集,当is_val为真时img_path切换为后1024张
        self.img_paths = self.img_paths[-1024:] if is_val else self.img_paths[:-1024]

    def __getitem__(self, index):
        # 获取第index个文件的文件名以及其所在路径
        file_name = self.img_paths[index]
        file_path = os.path.join(self.data_path, file_name)
        # 捕获异常 - 在发生异常时终止训练
        try:
            # 使用Pillow来读取图像数据
            img = Image.open(file_path)
            # 转为Numpy的array格式并整体除以255进行归一化
            img = np.array(img, dtype="float32").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255
        except Exception as e:
            raise Exception(file_name + "\t文件打开失败,请检查路径是否准确以及图像文件完整性,报错信息如下:\n" + str(e))
        # 读取该图像文件对应的Label字符串,并进行处理
        label = self.info[file_name]
        label = list(label)
        # 将label转化为Numpy的array格式
        label = np.array(label, dtype="int32")

        return img, label

    def __len__(self):
        # 返回每个Epoch中图片数量
        return len(self.img_paths)

 The data read by the instance

2.2 Model configuration

       The model uses the simple CRNN-CTC structure introduced in our previous article .

       The image whose input shape is CHW passes through CNN->Flatten->Linear->RNN->Linear and outputs the probability of characters corresponding to each position in the image. Considering that the CTC decoder will not be able to align correctly when faced with a different number of elements in the image and repeated adjacent elements, an additional category is added to represent the "separator" for improvement.

Code location:

 

Specific code implementation and analysis: 

import paddle

# 分类数量设置 - 因数据集中共包含0~9共10种数字+分隔符,所以是11分类任务
CLASSIFY_NUM = 11

# 定义输入层,shape中第0维使用-1则可以在预测时自由调节batch size
input_define = paddle.static.InputSpec(shape=[-1, IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W],
                                   dtype="float32",
                                   name="img")

# 定义网络结构
class Net(paddle.nn.Layer):
    def __init__(self, is_infer: bool = False):
        super().__init__()
        self.is_infer = is_infer

        # 定义一层3x3卷积+BatchNorm
        self.conv1 = paddle.nn.Conv2D(in_channels=IMAGE_SHAPE_C,
                                  out_channels=32,
                                  kernel_size=3)
        self.bn1 = paddle.nn.BatchNorm2D(32)
        # 定义一层步长为2的3x3卷积进行下采样+BatchNorm
        self.conv2 = paddle.nn.Conv2D(in_channels=32,
                                  out_channels=64,
                                  kernel_size=3,
                                  stride=2)
        self.bn2 = paddle.nn.BatchNorm2D(64)
        # 定义一层1x1卷积压缩通道数,输出通道数设置为比LABEL_MAX_LEN稍大的定值可获取更优效果,当然也可设置为LABEL_MAX_LEN
        self.conv3 = paddle.nn.Conv2D(in_channels=64,
                                  out_channels=LABEL_MAX_LEN + 4,
                                  kernel_size=1)
        # 定义全连接层,压缩并提取特征(可选)
        self.linear = paddle.nn.Linear(in_features=429,
                                   out_features=128)
        # 定义RNN层来更好提取序列特征,此处为双向LSTM输出为2 x hidden_size,可尝试换成GRU等RNN结构
        self.lstm = paddle.nn.LSTM(input_size=128,
                               hidden_size=64,
                               direction="bidirectional")
        # 定义输出层,输出大小为分类数
        self.linear2 = paddle.nn.Linear(in_features=64 * 2,
                                    out_features=CLASSIFY_NUM)

    def forward(self, ipt):
        # 卷积 + ReLU + BN
        x = self.conv1(ipt)
        x = paddle.nn.functional.relu(x)
        x = self.bn1(x)
        # 卷积 + ReLU + BN
        x = self.conv2(x)
        x = paddle.nn.functional.relu(x)
        x = self.bn2(x)
        # 卷积 + ReLU
        x = self.conv3(x)
        x = paddle.nn.functional.relu(x)
        # 将3维特征转换为2维特征 - 此处可以使用reshape代替
        x = paddle.tensor.flatten(x, 2)
        # 全连接 + ReLU
        x = self.linear(x)
        x = paddle.nn.functional.relu(x)
        # 双向LSTM - [0]代表取双向结果,[1][0]代表forward结果,[1][1]代表backward结果,详细说明可在官方文档中搜索'LSTM'
        x = self.lstm(x)[0]
        # 输出层 - Shape = (Batch Size, Max label len, Signal) 
        x = self.linear2(x)

        # 在计算损失时ctc-loss会自动进行softmax,所以在预测模式中需额外做softmax获取标签概率
        if self.is_infer:
            # 输出层 - Shape = (Batch Size, Max label len, Prob) 
            x = paddle.nn.functional.softmax(x)
            # 转换为标签
            x = paddle.argmax(x, axis=-1)
        return x

2.3 Training preparation

Define label input and hyperparameters

Code location:

Code and analysis when using:

# 数据集路径设置
DATA_PATH = "./data/OCR_Dataset"
# 训练轮数
EPOCH = 10
# 每批次数据大小
BATCH_SIZE = 16

label_define = paddle.static.InputSpec(shape=[-1, LABEL_MAX_LEN],
                                    dtype="int32",
                                    name="label")

 Define CTC Loss

We have introduced the loss function in detail in the previous article.

Code location:

Code implementation and analysis:

class CTCLoss(paddle.nn.Layer):
    def __init__(self):
        """
        定义CTCLoss
        """
        super().__init__()

    def forward(self, ipt, label):
        input_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN + 4,dtype= "int64")
        label_lengths = paddle.full(shape=[BATCH_SIZE],fill_value=LABEL_MAX_LEN,dtype= "int64")
        # 按文档要求进行转换dim顺序
        ipt = paddle.tensor.transpose(ipt, [1, 0, 2])
        # 计算loss
        loss = paddle.nn.functional.ctc_loss(ipt, label, input_lengths, label_lengths, blank=10)
        return loss

 Instantiate the model and configure the optimization strategy

We have also introduced the optimizer of the PP-OCR text recognizer one by one before.

Code location:

Code implementation and analysis:

# 定义优化器
optimizer = paddle.optimizer.Adam(learning_rate=0.0001, parameters=model.parameters())

# 为模型配置运行环境并设置该优化策略
model.prepare(optimizer=optimizer,
                loss=CTCLoss())

 

2.4 Training

The specific code of the training part:

# 执行训练
model.fit(train_data=Reader(DATA_PATH),
            eval_data=Reader(DATA_PATH, is_val=True),
            batch_size=BATCH_SIZE,
            epochs=EPOCH,
            save_dir="output/",
            save_freq=1,
            verbose=1,
            drop_last=True)

2.5 Pre-prediction preparation

Define the Prediction Reader as you defined the Training Reader

# 与训练近似,但不包含Label
class InferReader(Dataset):
    def __init__(self, dir_path=None, img_path=None):
        """
        数据读取Reader(预测)
        :param dir_path: 预测对应文件夹(二选一)
        :param img_path: 预测单张图片(二选一)
        """
        super().__init__()
        if dir_path:
            # 获取文件夹中所有图片路径
            self.img_names = [i for i in os.listdir(dir_path) if os.path.splitext(i)[1] == ".jpg"]
            self.img_paths = [os.path.join(dir_path, i) for i in self.img_names]
        elif img_path:
            self.img_names = [os.path.split(img_path)[1]]
            self.img_paths = [img_path]
        else:
            raise Exception("请指定需要预测的文件夹或对应图片路径")

    def get_names(self):
        """
        获取预测文件名顺序 
        """
        return self.img_names

    def __getitem__(self, index):
        # 获取图像路径
        file_path = self.img_paths[index]
        # 使用Pillow来读取图像数据并转成Numpy格式
        img = Image.open(file_path)
        img = np.array(img, dtype="float32").reshape((IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)) / 255
        return img

    def __len__(self):
        return len(self.img_paths)

forecast data

png

 2.6 Forecast

Forecast usage code

# 编写简易版解码器
def ctc_decode(text, blank=10):
    """
    简易CTC解码器
    :param text: 待解码数据
    :param blank: 分隔符索引值
    :return: 解码后数据
    """
    result = []
    cache_idx = -1
    for char in text:
        if char != blank and char != cache_idx:
            result.append(char)
        cache_idx = char
    return result


# 实例化推理模型
model = paddle.Model(Net(is_infer=True), inputs=input_define)
# 加载训练好的参数模型
model.load(CHECKPOINT_PATH)
# 设置运行环境
model.prepare()

# 加载预测Reader
infer_reader = InferReader(INFER_DATA_PATH)
img_names = infer_reader.get_names()
results = model.predict(infer_reader, batch_size=BATCH_SIZE)
index = 0
for text_batch in results[0]:
    for prob in text_batch:
        out = ctc_decode(prob, blank=10)
        print(f"文件名:{img_names[index]},推理结果为:{out}")
        index += 1

result

consistent with the tested data.


 

Summarize

        This article combines the actual text recognition scene to give an overall understanding of the Paddle-OCR text recognition process, and analyzes the key codes used in the process.

Guess you like

Origin blog.csdn.net/pinkray_c/article/details/122004881