Interpreting the Unknown: Breakthroughs and Practical Applications of Text Recognition Algorithms

1. Text recognition algorithm theory

  • Background introduction

Text recognition is a subtask of OCR (Optical Character Recognition); its task is to recognize the text content of a fixed region. In the two-stage OCR pipeline, it follows text detection and converts the detected image regions into text information.

Specifically, the model takes a well-localized text line as input and predicts the text content and its confidence. A visualized result is shown in the following figure:

Text recognition has many application scenarios, including document recognition, road sign recognition, license plate recognition, industrial serial number recognition, etc. Depending on the actual scenario, text recognition tasks can be divided into two categories: regular text recognition and irregular text recognition.

  • Regular text recognition: mainly printed fonts, scanned documents, etc., where the text is assumed to be roughly horizontal.

  • Irregular text recognition: often appears in natural scenes. Because of large variations in curvature, orientation, and deformation, the text is often not horizontal and suffers from bending, occlusion, blurring, and similar problems.

The figure below shows data samples from IC15 and IC13, which represent irregular text and regular text respectively. Irregular text often exhibits distortion, blur, and large font variation; it is closer to real scenes and therefore more challenging.

Therefore, the major algorithms currently strive for higher scores on irregular datasets.


Sample IC15 image (irregular text)

Sample IC13 image (regular text)

When different recognition algorithms are compared, they are usually evaluated on these two kinds of public datasets, across multiple dimensions. The commonly used English evaluation sets are categorized as follows:

1.1 Classification of Text Recognition Algorithms

Traditional text recognition methods split the task into three steps: image preprocessing, character segmentation, and character recognition. Each specific scenario needs its own hand-crafted model, which fails once the scenario changes. Faced with complex text backgrounds and scene variation, deep-learning-based methods perform better.

Most existing recognition algorithms can be expressed in a unified framework whose pipeline is divided into four stages (image transformation, feature extraction, sequence modeling, and prediction):

We have organized the mainstream algorithm categories and their representative papers in the following table:

Algorithm category | Main idea | Representative papers
Traditional algorithms | Sliding window, character extraction, dynamic programming | -
CTC | CTC-based decoding; no sequence alignment needed, faster recognition | CRNN, Rosetta
Attention | Attention-based decoding, applied to irregular text | RARE, DAN, PREN
Transformer | Transformer-based methods | SRN, NRTR, MASTER, ABINet
Correction | A rectification module learns text boundaries and rectifies the text to horizontal | RARE, ASTER, SAR
Segmentation | Segmentation-based; locate each character position, then classify | Text Scanner, Mask TextSpotter

1.1.1 Regular Text Recognition

There are two mainstream approaches to regular text recognition: algorithms based on CTC (Connectionist Temporal Classification) and Sequence2Sequence algorithms. They differ mainly in the decoding stage.

CTC-based algorithms feed the encoded sequence into a CTC layer for decoding; Sequence2Sequence-based methods feed the sequence into a recurrent neural network (RNN) for cyclic decoding. Both approaches have been verified to be effective and are the two mainstream directions.


Left: CTC-based method. Right: Sequence2Sequence-based method

CTC-based algorithm

The most typical CTC-based algorithm is CRNN (Convolutional Recurrent Neural Network) [1]. Its feature extractor uses mainstream convolutional structures such as ResNet, MobileNet, and VGG. Because text recognition inputs carry a large amount of contextual information, while convolutional kernels focus on local information and lack the ability to model long-range dependencies, a purely convolutional network has difficulty mining the contextual connections between characters. To solve this, CRNN introduces a bidirectional LSTM (Long Short-Term Memory) to enhance context modeling; experiments show that the bidirectional LSTM module effectively extracts contextual information from the image. Finally, the output feature sequence is fed into a CTC module that decodes the sequence result directly. This structure has been verified to be effective and is widely used in text recognition tasks. Rosetta [2] is a recognition network proposed by Facebook that consists of a fully convolutional model and CTC. Gao et al. [3] replaced the LSTM with stacked convolutions, achieving comparable accuracy with fewer parameters.

CRNN structure diagram
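
To make the structure concrete, here is a minimal CRNN-style network sketch in PaddlePaddle. It is an illustration under assumed layer sizes, not the paper's exact configuration: a small convolutional backbone stands in for ResNet/MobileNet/VGG, a bidirectional LSTM models context, and a linear layer emits per-timestep logits for CTC.

import paddle
import paddle.nn as nn

class MiniCRNN(nn.Layer):
    def __init__(self, num_classes, hidden_size=96):
        super().__init__()
        # Small convolutional backbone (stand-in for ResNet/MobileNet/VGG)
        self.backbone = nn.Sequential(
            nn.Conv2D(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2D(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2D([1, None]))  # collapse height; width becomes the time axis
        # Bidirectional LSTM enhances context modeling over the feature sequence
        self.rnn = nn.LSTM(64, hidden_size, direction='bidirectional')
        # Per-timestep classification; +1 class for the CTC blank symbol
        self.fc = nn.Linear(hidden_size * 2, num_classes + 1)

    def forward(self, x):                            # x: [N, 3, H, W]
        feats = self.backbone(x)                     # [N, 64, 1, W']
        seq = feats.squeeze(2).transpose([0, 2, 1])  # [N, W', 64]
        seq, _ = self.rnn(seq)                       # [N, W', 2*hidden_size]
        return self.fc(seq)                          # logits for CTC decoding

During training, these logits would be paired with paddle.nn.CTCLoss, which handles the alignment between the W' timesteps and the shorter label sequence.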

Sequence2Sequence algorithm

The Sequence2Sequence algorithm uses an encoder to compress the whole input sequence into a semantic vector, which a decoder then unrolls. During decoding, the output of the previous step is repeatedly fed back as the input of the next step, looping until a stop symbol is produced. The encoder is typically an RNN: for each input element it produces an output vector and a hidden state, reusing the hidden state for the next input until the semantic vector is obtained. The decoder is another RNN that receives the encoder's vector and emits the output sequence. Inspired by Sequence2Sequence in machine translation, Shi [4] proposed an attention-based encoder-decoder framework for text recognition; in this way the RNN learns character-level language models hidden in the training strings.

Sequence2Sequence structure diagram
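
A minimal sketch of this cyclic decoding is shown below, with assumed layer sizes and an assumed GO-symbol id; greedy decoding feeds each predicted character back in as the next input (a real implementation would also stop at an end-of-sequence symbol).

import paddle
import paddle.nn as nn

class GreedySeq2Seq(nn.Layer):
    def __init__(self, num_classes, feat_dim=64, hidden=96, go_id=0):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden)   # encodes the visual feature sequence
        self.embed = nn.Embedding(num_classes, hidden)
        self.decoder = nn.GRUCell(hidden, hidden)
        self.fc = nn.Linear(hidden, num_classes)
        self.go_id = go_id

    def forward(self, feats, max_len=25):
        # feats: [N, T, feat_dim] feature sequence from the CNN
        _, h = self.encoder(feats)                # final hidden state = semantic vector
        h = h[0]                                  # [N, hidden]
        y = paddle.full([feats.shape[0]], self.go_id, dtype='int64')
        steps = []
        for _ in range(max_len):
            h, _ = self.decoder(self.embed(y), h) # previous output is the next input
            logits = self.fc(h)
            steps.append(logits)
            y = logits.argmax(-1)                 # greedy choice of the next character
        return paddle.stack(steps, 1)             # [N, max_len, num_classes]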

The above two algorithms work very well on regular text, but due to limitations of the network design, it is difficult for such methods to handle irregular text that is bent or rotated. To address this, researchers have proposed a series of improved algorithms on top of these two families.

1.1.2 Irregular Text Recognition

  • Irregular text recognition algorithms fall into four categories: correction-based methods, Attention-based methods, segmentation-based methods, and Transformer-based methods.

Correction-based methods

Correction-based methods use some visual transformation modules to convert irregular text into regular text as much as possible, and then use conventional methods for recognition.

The RARE [4] model was the first to propose a correction scheme for irregular text. The network has two main parts: a spatial transformer network, STN (Spatial Transformer Network), and a Sequence2Sequence-based recognition network. The STN is the correction module: the irregular text image enters the STN and is warped into a horizontal image through a TPS (Thin-Plate-Spline) transformation, which can to some extent correct curved and perspective-distorted text; after correction, the image is sent to the sequence recognition network for decoding.

RARE structure diagram
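
As a simplified illustration of the rectification idea, the sketch below predicts an affine transform and resamples the image with it. RARE itself predicts TPS control points, which are strictly more expressive; the affine variant, layer sizes, and identity initialization here are assumptions made for brevity.

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class AffineSTN(nn.Layer):
    def __init__(self):
        super().__init__()
        # Localization network: predicts 6 affine parameters from the input image
        self.feat = nn.Sequential(
            nn.Conv2D(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2D(1), nn.Flatten())
        self.fc = nn.Linear(16, 6)
        # Initialize to the identity transform ("no warp") for stable training
        self.fc.weight.set_value(paddle.zeros([16, 6]))
        self.fc.bias.set_value(paddle.to_tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                             # x: [N, 3, H, W]
        theta = self.fc(self.feat(x)).reshape([-1, 2, 3])
        grid = F.affine_grid(theta, x.shape)          # sampling grid from the transform
        return F.grid_sample(x, grid)                 # resampled, rectified image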

The RARE paper showed that this method has clear advantages on irregular text datasets; in particular, on CUTE80 and SVTP it outperforms CRNN by more than 5 percentage points, demonstrating the effectiveness of the correction module. Building on this idea, [6] also combines a spatial transformer network (STN) with an attention-based sequence recognition network.

Correction-based methods transfer well. Besides Attention-based methods such as RARE, STAR-Net [5] applies the correction module to CTC-based algorithms, which is also a solid improvement over the plain CRNN.

Attention-based method

Attention-based methods focus on the correlations between parts of a sequence. The idea was first proposed in machine translation: the translation of the current word is mainly influenced by certain source words, so those decisive words are given larger weights. The same holds for text recognition: when decoding the encoded sequence, each step selects the appropriate context to generate the next state, which helps produce more accurate results.

R^2AM [7] introduced Attention to text recognition for the first time. The model first extracts encoded image features from the input image through recursive convolutional layers, then decodes the output characters with a recurrent neural network using implicitly learned character-level language statistics. During decoding, the Attention mechanism performs soft feature selection to make better use of image features; this selective processing is more consistent with human intuition.

R^2AM structure diagram
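
One step of this soft feature selection can be sketched as follows; the dot-product scoring is a simplification chosen for brevity (R^2AM learns its scoring function):

import paddle
import paddle.nn.functional as F

def soft_attention(query, feats):
    # query: [N, D] current decoder state; feats: [N, T, D] encoded image features
    scores = paddle.matmul(feats, query.unsqueeze(-1)).squeeze(-1)  # [N, T]
    weights = F.softmax(scores, axis=-1)                  # soft selection over positions
    context = (weights.unsqueeze(-1) * feats).sum(axis=1) # [N, D] weighted context
    return context, weights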

Subsequently, a large number of Attention-based algorithms have been explored and refined. For example, SAR [8] extends 1D attention to 2D attention. RARE, mentioned in the correction section, is also Attention-based. Experiments show that Attention-based methods deliver a good accuracy improvement over CTC methods.

Segmentation-based methods

Segmentation-based methods treat each character of a text line as an independent unit; recognizing a single rectified character is easier than recognizing a whole text line. They attempt to locate each character in the input image and apply a character classifier to obtain the recognition result, reducing a complex global problem to local ones, and they work fairly well on irregular text. However, these methods require character-level annotations, which are difficult to obtain. Lyu et al. [9] proposed an instance-segmentation model for word recognition whose recognition part is based on an FCN (Fully Convolutional Network). [10] considers text recognition from a two-dimensional perspective and designs a character-attention FCN; when text is curved or severely distorted, it localizes both regular and irregular text well.

Mask TextSpotter structure diagram
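
To illustrate the decoding step of segmentation-style recognizers, the toy function below turns per-pixel character maps into a character sequence by ordering character regions from left to right. Shapes and the noise threshold are illustrative assumptions, not any paper's exact procedure, and it ignores repeated instances of the same character that real methods must separate.

import numpy as np

def decode_char_segmentation(prob_maps, background=0, min_pixels=20):
    # prob_maps: [num_classes, H, W] per-pixel character probabilities
    label_map = prob_maps.argmax(axis=0)       # [H, W] per-pixel class id
    found = []
    for c in np.unique(label_map):
        if c == background:
            continue
        ys, xs = np.nonzero(label_map == c)
        if xs.size < min_pixels:               # drop tiny, noisy regions
            continue
        found.append((xs.mean(), int(c)))      # (horizontal position, class)
    found.sort()                               # read characters left to right
    return [c for _, c in found]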

Transformer-based method

With the rapid development of the Transformer, both classification and detection have verified its effectiveness on vision tasks. As mentioned in the regular text recognition section, CNNs are limited in long-range dependency modeling; the Transformer structure solves exactly this problem, attending to global information inside the feature extractor and replacing the extra context modeling module (LSTM).

Some text recognition algorithms use the Transformer Encoder together with convolutions to extract sequence features. The Encoder is a stack of blocks built from MultiHeadAttention layers and position-wise feed-forward layers. The self-attention in MultiHeadAttention uses matrix multiplication to emulate the step-by-step computation of an RNN, breaking the barrier of long-range temporal dependencies in RNNs. Other algorithms use the Transformer Decoder for decoding, obtaining stronger semantic information than a traditional RNN, with higher efficiency thanks to parallel computation.
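
A minimal sketch of such an encoder in PaddlePaddle is shown below; the layer sizes are illustrative assumptions, not any paper's settings:

import paddle
import paddle.nn as nn

# A stack of MultiHeadAttention + position-wise feed-forward blocks
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=2)

feats = paddle.randn([4, 80, 512])   # [N, seq_len, d_model] sequence from the CNN
out = encoder(feats)                 # every position attends to the whole sequence
print(out.shape)                     # [4, 80, 512]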

The SRN [11] algorithm attaches the Transformer Encoder to a ResNet50 backbone to enhance 2D visual features, and proposes a parallel attention module that uses the reading order as queries, making the computation time-independent and outputting aligned visual features for all time steps in parallel. In addition, SRN uses a Transformer Encoder as a semantic module to fuse the visual and semantic information of the image, which brings larger gains on irregular text with occlusion and blur.

NRTR [12] uses a complete Transformer structure to encode and decode input images, and only uses a few simple convolutional layers for high-level feature extraction, which verifies the effectiveness of the Transformer structure in text recognition.

NRTR structure diagram

SRACN [13] replaces the LSTM with a Transformer decoder, once again verifying the efficiency and accuracy advantages of parallel training.

1.2 Summary

This section introduced the theory and mainstream algorithms of text recognition, including CTC-based, Sequence2Sequence-based, correction-based, Attention-based, segmentation-based, and Transformer-based methods, and outlined the ideas and contributions of the classic papers. The next section is a hands-on tutorial based on the CRNN algorithm, covering the whole training process from network assembly to optimization.

2. Text recognition in practice

2.1 Data preparation

  • Prepare dataset

PaddleOCR supports two data formats:

  • lmdb: used to train with datasets stored in lmdb format (LMDBDataSet);
  • general data: used to train with datasets stored in text files (SimpleDataSet);

The default storage path of the training data is PaddleOCR/train_data, if you already have a dataset on disk, just create a soft link to the dataset directory:

#linux and mac os
ln -sf <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
#windows
mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
  • Custom dataset

The following takes the general data set as an example to introduce how to prepare the data set:

  • Training set

It is recommended to put the training images in the same folder and record their paths and labels in a txt file (rec_gt_train.txt). The contents of the txt file look like this:

Note: in the txt file, separate the image path and the image label with \t by default; using another separator will cause a training error.

" 图像文件名                 图像标注信息 "

train_data/rec/train/word_001.jpg   简单可依赖
train_data/rec/train/word_002.jpg   用科技让复杂的世界更简单
...

The final training set should have the following file structure:

|-train_data
  |-rec
    |- rec_gt_train.txt
    |- train
        |- word_001.png
        |- word_002.jpg
        |- word_003.jpg
        | ...

In addition to the single-image-per-line format above, PaddleOCR also supports training on offline-augmented data. To prevent the same sample from being drawn multiple times in one batch, image paths sharing the same label can be written on one line as a list; during training, PaddleOCR randomly selects one image from the list. The corresponding annotation file format is as follows.

["11.jpg", "12.jpg"]   简单可依赖
["21.jpg", "22.jpg", "23.jpg"]   用科技让复杂的世界更简单
3.jpg   ocr

In the example annotation file above, "11.jpg" and "12.jpg" share the same label 简单可依赖; during training, one of the images on that line is randomly selected.
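
For reference, a small script like the following can generate such a label file; get_label is a hypothetical helper you would supply to look up the ground-truth text for an image name:

import os

def write_label_file(img_dir, out_path, get_label):
    # One line per image: "<path>\t<label>", the format PaddleOCR expects
    with open(out_path, 'w', encoding='utf-8') as f:
        for name in sorted(os.listdir(img_dir)):
            if name.lower().endswith(('.jpg', '.png')):
                path = os.path.join(img_dir, name)
                f.write(f'{path}\t{get_label(name)}\n')

# e.g. write_label_file('train_data/rec/train',
#                       'train_data/rec/rec_gt_train.txt', my_get_label)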

  • Validation set

Similar to the training set, the validation set also needs a folder containing all images (test) and a rec_gt_test.txt label file. The structure of the validation set is as follows:

|-train_data
  |-rec
    |- rec_gt_test.txt
    |- test
        |- word_001.jpg
        |- word_002.jpg
        |- word_003.jpg
        | ...
  • Data download
  • ICDAR2015

If you do not have a dataset locally, you can download the ICDAR2015 dataset from the official website for a quick trial, or refer to DTRB to download the lmdb-format datasets used by that benchmark.

If you are using the icdar2015 public dataset, PaddleOCR provides a label file for training the ICDAR2015 dataset, which can be downloaded in the following ways:

# Training set labels
wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt
# Test set labels
wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt

PaddleOCR also provides a data format conversion script that converts the official ICDAR labels into the format supported by PaddleOCR. The conversion tool is ppocr/utils/gen_label.py; taking the training set as an example:

# Convert the label file downloaded from the official website to rec_gt_label.txt
python gen_label.py --mode="rec" --input_path="{path/of/origin/label}" --output_label="rec_gt_label.txt"

The data looks as follows: (a) the original image, (b) the Ground Truth text file corresponding to each image:

  • Multilingual dataset

The training set of the multilingual models consists of 1,000,000 synthetic images generated with the open-source tool text_renderer. A small number of fonts can be downloaded in the following two ways.

Finally, a dictionary ({word_dict_name}.txt) must be provided so that the model can map every character it encounters to a dictionary index during training.

Therefore, the dictionary must contain all the characters you expect to be recognized correctly. {word_dict_name}.txt must be written in the following format and saved with utf-8 encoding:

l
d
a
d
r
n

word_dict.txt contains one character per line, mapping characters to numeric indices; for example, "and" will be mapped to [2 5 1].
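
A sketch of this mapping, keeping the first occurrence of duplicated characters so that "and" indeed maps to [2, 5, 1] for the sample dictionary above ({word_dict_name}.txt is the placeholder name from the text):

char2idx = {}
with open('{word_dict_name}.txt', encoding='utf-8') as f:
    for idx, line in enumerate(f):
        char2idx.setdefault(line.rstrip('\n'), idx)  # keep the first occurrence

print([char2idx[c] for c in 'and'])   # -> [2, 5, 1] with the sample dictionary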

  • Built-in dictionaries

PaddleOCR ships with built-in dictionaries that can be used on demand.

ppocr/utils/ppocr_keys_v1.txt is a Chinese dictionary with 6623 characters

ppocr/utils/ic15_dict.txt is an English dictionary with 36 characters

ppocr/utils/dict/french_dict.txt is a French dictionary with 118 characters

ppocr/utils/dict/japan_dict.txt is a Japanese dictionary with 4399 characters

ppocr/utils/dict/korean_dict.txt is a Korean dictionary with 3636 characters

ppocr/utils/dict/german_dict.txt is a German dictionary with 131 characters

ppocr/utils/en_dict.txt is an English dictionary with 96 characters

The current multilingual models are still at the demo stage; we will continue to optimize the models and add more languages. Contributions of dictionaries and fonts in other languages are very welcome:
if you are willing, you can submit your dictionary file to dict, and we will credit you in the repo.

  • Custom dictionary

If you need a custom dictionary file, add the character_dict_path field in configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml and point it to your dictionary path.

  • Add space category

If you want to support recognition of the "space" category, set the use_space_char field in the yml file to True.

  • Data augmentation

PaddleOCR provides a variety of data augmentation methods, and data augmentation has been added to the default configuration file.

The default perturbations are: color space conversion (cvtColor), blur, jitter, Gaussian noise, random crop, perspective, color inversion (reverse), and TIA data augmentation.

During training, each perturbation is applied with a probability of 40%. For the concrete implementation, see rec_img_aug.py.

Due to OpenCV compatibility issues, the perturbation operations currently support Linux only.
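
A simplified sketch of this scheme is shown below, with stand-in implementations for three of the perturbations; the real versions live in rec_img_aug.py:

import random
import cv2
import numpy as np

def simple_rec_aug(img, prob=0.4):
    # Each perturbation fires independently with 40% probability
    if random.random() < prob:
        img = cv2.GaussianBlur(img, (5, 5), 1)          # blur
    if random.random() < prob:
        noise = np.random.normal(0, 10, img.shape)      # Gaussian noise
        img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if random.random() < prob:
        img = 255 - img                                 # color inversion (reverse)
    return img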

2.2 Start training

PaddleOCR provides training scripts, evaluation scripts and prediction scripts. This section will take the PP-OCRv3 English recognition model as an example:

  • Start training

First download the pretrained model; you can download the trained model and finetune it on the icdar2015 data:

cd PaddleOCR/
# Download the English PP-OCRv3 pretrained model
wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_train.tar
# Unpack the model parameters
cd pretrain_models
tar -xf en_PP-OCRv3_rec_train.tar && rm -rf en_PP-OCRv3_rec_train.tar

Start training:

If you installed the CPU version, set the use_gpu field in the configuration file to false.

# GPU training, supports single-GPU and multi-GPU modes
# Train on the icdar15 English data; the training log is saved automatically as train.log under "{save_model_dir}"

# Single-GPU training (long training time, not recommended)
python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy

# Multi-GPU training; specify GPU ids via the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3'  tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy

After starting the training normally, you will see the following log output:

[2022/02/22 07:58:05] root INFO: epoch: [1/800], iter: 10, lr: 0.000000, loss: 0.754281, acc: 0.000000, norm_edit_dis: 0.000008, reader_cost: 0.55541 s, batch_cost: 0.91654 s, samples: 1408, ips: 153.62133
[2022/02/22 07:58:13] root INFO: epoch: [1/800], iter: 20, lr: 0.000001, loss: 0.924677, acc: 0.000000, norm_edit_dis: 0.000008, reader_cost: 0.00236 s, batch_cost: 0.28528 s, samples: 1280, ips: 448.68599
[2022/02/22 07:58:23] root INFO: epoch: [1/800], iter: 30, lr: 0.000002, loss: 0.967231, acc: 0.000000, norm_edit_dis: 0.000008, reader_cost: 0.14527 s, batch_cost: 0.42714 s, samples: 1280, ips: 299.66507
[2022/02/22 07:58:31] root INFO: epoch: [1/800], iter: 40, lr: 0.000003, loss: 0.895318, acc: 0.000000, norm_edit_dis: 0.000008, reader_cost: 0.00173 s, batch_cost: 0.27719 s, samples: 1280, ips: 461.77252

The following information is automatically printed in the log:

field | meaning
epoch | current epoch
iter | current iteration
lr | current learning rate
loss | current loss value
acc | accuracy of the current batch
norm_edit_dis | normalized edit distance of the current batch
reader_cost | data loading time of the current batch
batch_cost | total time of the current batch
samples | number of samples in the current batch
ips | images processed per second

PaddleOCR supports alternating training and evaluation. You can modify eval_batch_step in configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml to set the evaluation frequency. By default, the model is evaluated once every 500 iterations, and the best-accuracy model is saved as output/en_PP-OCRv3_rec/best_accuracy.

If the validation set is large, evaluation will be time-consuming; it is recommended to reduce the evaluation frequency or evaluate after training.

Tip: you can use the -c parameter to select one of the many model configurations under configs/rec/ for training. For the recognition algorithms supported by PaddleOCR, refer to the list of cutting-edge algorithms:

For training on Chinese data, ch_PP-OCRv3_rec_distillation.yml is recommended. If you want to try other algorithms on Chinese datasets, refer to the following instructions to modify the configuration file:

Take ch_PP-OCRv3_rec_distillation.yml as an example:

Global:
  ...
  # Add a custom dictionary; if you change the dictionary, point the path to the new one
  character_dict_path: ppocr/utils/ppocr_keys_v1.txt
  ...
  # Recognize spaces
  use_space_char: True


Optimizer:
  ...
  # Add a learning rate decay strategy
  lr:
    name: Cosine
    learning_rate: 0.001
  ...

...

Train:
  dataset:
    # Dataset format, supports LMDBDataSet and SimpleDataSet
    name: SimpleDataSet
    # Dataset path
    data_dir: ./train_data/
    # Training set label file
    label_file_list: ["./train_data/train_list.txt"]
    transforms:
      ...
      - RecResizeImg:
          # Modify image_shape to fit long text
          image_shape: [3, 48, 320]
      ...
  loader:
    ...
    # batch size for single-GPU training
    batch_size_per_card: 256
    ...

Eval:
  dataset:
    # Dataset format, supports LMDBDataSet and SimpleDataSet
    name: SimpleDataSet
    # Dataset path
    data_dir: ./train_data
    # Validation set label file
    label_file_list: ["./train_data/val_list.txt"]
    transforms:
      ...
      - RecResizeImg:
          # Modify image_shape to fit long text
          image_shape: [3, 48, 320]
      ...
  loader:
    # batch size for single-GPU evaluation
    batch_size_per_card: 256
    ...

Note that the configuration file during prediction/evaluation must be consistent with the training.

2.3 Breakpoint training

If training is interrupted and you want to resume from the interrupted model, specify the path of the model to load via Global.checkpoints:

python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.checkpoints=./your/trained/model

Note: Global.checkpoints has higher priority than Global.pretrained_model; when both parameters are specified, the model specified by Global.checkpoints is loaded first. If the path specified by Global.checkpoints is wrong, the model specified by Global.pretrained_model is loaded.

2.4 Replace Backbone Training

PaddleOCR divides the network into four parts, respectively under ppocr/modeling . The data entering the network will pass through these four parts in sequence (transforms->backbones->necks->heads).

├── architectures # network assembly code
├── transforms    # image transformation modules
├── backbones     # feature extraction modules
├── necks         # feature enhancement modules
└── heads         # output (prediction) modules

If the Backbone you want already has a corresponding implementation in PaddleOCR, you can directly modify the parameters of the Backbone section in the yml configuration file.

If you want to use a new Backbone, an example of replacing backbones is as follows:

  1. Create a new file under the ppocr/modeling/backbones folder, such as my_backbone.py.
  2. Add relevant code in the my_backbone.py file, the sample code is as follows:
import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class MyBackbone(nn.Layer):
    def __init__(self, *args, **kwargs):
        super(MyBackbone, self).__init__()
        # your init code
        self.conv = nn.xxxx

    def forward(self, inputs):
        # your network forward
        y = self.conv(inputs)
        return y
  3. Import the added module in the ppocr/modeling/backbones/__init__.py file, then set the Backbone in the configuration file to MyBackbone to use it. The format is as follows:
Backbone:
  name: MyBackbone
  args1: args1

Note : If you want to replace other modules of the network, you can refer to the documentation .

2.5 Mixed precision training

If you want to further speed up the training, you can use automatic mixed precision training . Taking a single machine and a single card as an example, the command is as follows:

python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml \
     -o Global.pretrained_model=./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy \
     Global.use_amp=True Global.scale_loss=1024.0 Global.use_dynamic_loss_scaling=True

2.6 Distributed Training

For multi-machine multi-GPU training, set the IP addresses of the machines with the --ips parameter and the GPU ids with the --gpus parameter:

python3 -m paddle.distributed.launch --ips="xx.xx.xx.xx,xx.xx.xx.xx" --gpus '0,1,2,3' tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml \
     -o Global.pretrained_model=./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy

Note: (1) for multi-machine multi-GPU training, replace the ips value in the command above with the addresses of your machines, which must be able to ping each other; (2) the command must be launched separately on each machine (use ifconfig to check a machine's ip address); (3) for more on the performance advantages of distributed training, refer to the Distributed Training Tutorial.

2.7 Knowledge distillation training

PaddleOCR supports the training process of the text recognition model based on knowledge distillation. For more information, please refer to the knowledge distillation documentation .

2.8 Multilingual model training

PaddleOCR currently supports recognition for 80 languages (besides Chinese), and a multilingual configuration template is provided under configs/rec/multi_languages: rec_multi_language_lite_train.yml.

Grouped by language family, the languages currently supported by PaddleOCR are:

configuration file | algorithm | backbone | trans | seq | pred | language
rec_chinese_cht_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Traditional Chinese
rec_en_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | English (case sensitive)
rec_french_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | French
rec_ger_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | German
rec_japan_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Japanese
rec_korean_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Korean
rec_latin_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Latin alphabet
rec_arabic_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Arabic alphabet
rec_cyrillic_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Cyrillic alphabet
rec_devanagari_lite_train.yml | CRNN | Mobilenet_v3 small 0.5 | None | BiLSTM | ctc | Devanagari alphabet

For more supported languages, please refer to: Multilingual Model

If you want to fine-tune based on an existing model, refer to the following instructions to modify the configuration file:

Take rec_french_lite_train.yml as an example:

Global:
  ...
  # Add a custom dictionary; if you change the dictionary, point the path to the new one
  character_dict_path: ./ppocr/utils/dict/french_dict.txt
  ...
  # Recognize spaces
  use_space_char: True

...

Train:
  dataset:
    # Dataset format, supports LMDBDataSet and SimpleDataSet
    name: SimpleDataSet
    # Dataset path
    data_dir: ./train_data/
    # Training set label file
    label_file_list: ["./train_data/french_train.txt"]
    ...

Eval:
  dataset:
    # Dataset format, supports LMDBDataSet and SimpleDataSet
    name: SimpleDataSet
    # Dataset path
    data_dir: ./train_data
    # Validation set label file
    label_file_list: ["./train_data/french_val.txt"]
    ...

2.9 Other training environments

  • Windows GPU/CPU
    The Windows platform differs slightly from Linux:
    Windows supports only single-GPU training and prediction; to specify the GPU for training, set CUDA_VISIBLE_DEVICES=0.
    On Windows, DataLoader supports only single-process mode, so num_workers must be set to 0.

  • macOS does not support GPU mode; set use_gpu to False in the configuration file. The rest of the training, evaluation, and prediction commands are exactly the same as for Linux GPU.

  • Linux DCU: on DCU devices, set the environment variable export HIP_VISIBLE_DEVICES=0,1,2,3; the rest of the training, evaluation, and prediction commands are exactly the same as for Linux GPU.

2.10 Model fine-tuning

In practice, it is recommended to load the official pretrained model and fine-tune it on your own dataset. For fine-tuning the recognition model, refer to the Model Fine-tuning Tutorial.

2.11 Model Evaluation and Prediction

  • Index evaluation

During training, model parameters are saved under the Global.save_model_dir directory by default. To evaluate metrics, set Global.checkpoints to point to the saved parameter file. The evaluation dataset can be changed by modifying the label_file_list setting under Eval in configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml.

# GPU evaluation; Global.checkpoints points to the weights to be evaluated
python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.checkpoints={path/to/weights}/best_accuracy
  • Test recognition effect

Using a model trained with PaddleOCR, you can quickly run prediction with the following script.

The default prediction image is stored in infer_img, and the trained weights are loaded via -o Global.pretrained_model:

According to the save_model_dir and save_epoch_step fields in the configuration file, the following files are saved:

output/rec/
├── best_accuracy.pdopt  
├── best_accuracy.pdparams  
├── best_accuracy.states  
├── config.yml  
├── iter_epoch_3.pdopt  
├── iter_epoch_3.pdparams  
├── iter_epoch_3.states  
├── latest.pdopt  
├── latest.pdparams  
├── latest.states  
└── train.log

Among them, best_accuracy.* is the best model on the evaluation set; iter_epoch_x.* are the models saved at save_epoch_step intervals; latest.* is the model from the last epoch.

# Predict the English result
python3 tools/infer_rec.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model={path/to/weights}/best_accuracy  Global.infer_img=doc/imgs_words/en/word_1.png

Predicted image:

Get predictions for an input image:

infer_img: doc/imgs_words/en/word_1.png
        result: ('joint', 0.9998967)
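
For intuition, a (text, confidence) pair like the one above can be produced by greedy CTC decoding: take the best class per timestep, collapse repeats, drop blanks, and average the retained probabilities. The sketch below assumes blank id 0 and a dictionary offset of 1; PaddleOCR's actual postprocessing differs in detail.

import numpy as np

def ctc_greedy_decode(probs, chars, blank=0):
    # probs: [T, num_classes] per-timestep softmax output; chars: dictionary list
    ids = probs.argmax(axis=1)
    text, confs = [], []
    prev = blank
    for t, i in enumerate(ids):
        if i != blank and i != prev:      # collapse repeats, drop blanks
            text.append(chars[i - 1])     # assumes the blank occupies index 0
            confs.append(probs[t, i])
        prev = i
    return ''.join(text), float(np.mean(confs)) if confs else 0.0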

The configuration file used for prediction must be consistent with training. If you have completed training of the Chinese model via python3 tools/train.py -c configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml,
you can use the following command to predict with the Chinese model.

# Predict the Chinese result
python3 tools/infer_rec.py -c configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml -o Global.pretrained_model={path/to/weights}/best_accuracy Global.infer_img=doc/imgs_words/ch/word_1.jpg

Predicted image:

Get predictions for an input image:

infer_img: doc/imgs_words/ch/word_1.jpg
        result: ('韩国小馆', 0.997218)
  • Model export and prediction

An inference model (saved with paddle.jit.save) is a frozen model that stores both the model structure and the model parameters in files, and is mostly used for deployment.
In contrast, the checkpoints model saved during training stores only the model parameters and is mostly used to resume training.
Compared with a checkpoints model, the inference model additionally saves the structural information of the model. It performs better for deployment and inference acceleration, is flexible and convenient, and is suitable for integration into real systems.

The method of converting the recognition model to the inference model is the same as the detection method, as follows:

# -c sets the yml configuration file of the training algorithm
# -o sets optional parameters
# Global.pretrained_model sets the path of the trained model to convert; do not add the file suffix .pdmodel, .pdopt or .pdparams.
# Global.save_inference_dir sets the directory where the converted model will be saved.

python3 tools/export_model.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=./pretrain_models/en_PP-OCRv3_rec_train/best_accuracy  Global.save_inference_dir=./inference/en_PP-OCRv3_rec/

**Note:** if you trained the model on your own dataset and adjusted the Chinese character dictionary, make sure character_dict_path in the configuration file points to your custom dictionary file.

After the conversion is successful, there are three files in the directory:

inference/en_PP-OCRv3_rec/
    ├── inference.pdiparams         # parameter file of the recognition inference model
    ├── inference.pdiparams.info    # parameter info of the recognition inference model, can be ignored
    └── inference.pdmodel           # program file of the recognition inference model
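
Under the hood, export boils down to paddle.jit.save, which traces the network with a fixed input spec and writes the solidified program plus parameters. A minimal sketch with a stand-in model (the layer and input shape here are illustrative assumptions):

import paddle
import paddle.nn as nn
from paddle.static import InputSpec

net = nn.Sequential(nn.Conv2D(3, 8, 3, padding=1))   # stand-in for a trained model
net.eval()
paddle.jit.save(
    net,
    './inference/demo/inference',                    # writes inference.pdmodel etc.
    input_spec=[InputSpec(shape=[None, 3, 48, 320], dtype='float32')])
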
  • Custom Model Inference

    If the text dictionary was modified during training, you need to specify the dictionary path via --rec_char_dict_path when predicting with the inference model. For more on the configuration and explanation of inference hyperparameters, refer to the Model Inference Hyperparameter Explanation Tutorial.

    python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words_en/word_336.png" --rec_model_dir="./your inference model" --rec_image_shape="3, 48, 320" --rec_char_dict_path="your text dict path"
    

3. FAQ

Q1: Why are predictions inconsistent after the trained model is converted to an inference model?

A: This is a common class of problems, mostly caused by a mismatch between the preprocessing and postprocessing parameters used when predicting with the trained model and those used when predicting with the inference model. Compare the preprocessing, postprocessing, and prediction settings in the configuration files used for training.

Reference link:

https://aistudio.baidu.com/education/group/info/25207

https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.7
