Text recognition based on deep learning [DB+CRNN+PyQt5 interface implementation]


Foreword

  As computer technology continues to advance, automatic target recognition based on image processing has become practically important, and text recognition is a typical example. This article describes a deep-learning text recognition method that uses the DB algorithm for text positioning and the CRNN algorithm for text recognition, both implemented with the Paddle framework. First, text is located with the DB algorithm. DB combines an adaptive threshold with a dedicated label-generation scheme, which lets it detect and locate text regions effectively. By learning to separate text from non-text areas, it produces accurate text bounding boxes that serve as input for the subsequent recognition step. Second, the characters in each located region are recognized with the CRNN algorithm.
  Paddle is chosen as the deep learning framework for its scalability and performance. On top of it, a visual interactive interface is developed with PyQt5. The interface supports batch image recognition, recognition of any region the user draws with the mouse as a rectangle, and display of the recognition results on the image.

Keywords: text recognition; DB; CRNN; deep learning; convolutional neural network; PyQt5; PaddlePaddle


Video demonstration effect

Chinese and English text recognition based on deep learning

Code download link

  The complete program files used in this blog post (test pictures, py files, OCR model weight files, debugging instructions, etc.) have been packaged and uploaded to the blogger's page on the Mianbaoduo (面包多) platform; see the blog and the video for details. The package includes instructions for software installation and debugging, and technicians are available to help with remote debugging. For the installation requirements, see 安装条件说明.txt. A screenshot of the complete file list is shown below:
[Figure: screenshot of the complete file list]

1. DB text positioning

1.1 DB overview

  Differentiable Binarization (DB) is a deep learning algorithm for text positioning proposed by Minghui Liao et al.; the paper appeared as a preprint in 2019 and was published at AAAI 2020. Its goal is to perform the binarization of text regions (deciding for every pixel whether it is text or background) in a differentiable way, so that this step can be trained and optimized end-to-end together with the rest of the network.
  Traditional text localization methods usually rely on hand-designed features and fixed thresholds for binarization, which limits their accuracy and robustness. DB replaces the hard binarization step with a differentiable continuous function, so gradients can flow back through it during end-to-end training. The block diagram of DB text positioning is shown in the figure below.
[Figure: block diagram of DB text positioning]

1.2 DB algorithm principle

1.2.1 Overall framework

  The network framework of the DB text positioning algorithm is shown in the figure below. First, the input image is fed into a feature-pyramid backbone. Second, the pyramid features are upsampled to the same scale and concatenated to produce the feature F. Then F is used to predict the probability map P and the threshold map T. After that, the approximate binary map is computed from P and T. During training, supervision is applied to the probability map, the threshold map and the approximate binary map, where the probability map and the approximate binary map share the same supervision. At inference time, bounding boxes can be obtained easily from either the approximate binary map or the probability map.
[Figure: network framework of the DB text positioning algorithm]
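
  The step that combines P and T can be illustrated with a few lines of Python. In the DB paper the approximate binary map is B = 1 / (1 + exp(-k (P - T))), with an amplification factor k of 50. The helper name `approx_binary` and the toy map values below are assumptions made for this sketch only; they do not come from the project code.

```python
import math

def approx_binary(p, t, k=50.0):
    """Differentiable binarization of one pixel: a steep sigmoid of (p - t)."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))

# Toy 2x3 probability map P and threshold map T (illustrative values only).
P = [[0.90, 0.55, 0.10],
     [0.70, 0.30, 0.05]]
T = [[0.30, 0.50, 0.30],
     [0.60, 0.40, 0.30]]

# Apply the function pixel by pixel to get the approximate binary map B.
B = [[approx_binary(p, t) for p, t in zip(p_row, t_row)]
     for p_row, t_row in zip(P, T)]

for row in B:
    print([round(b, 3) for b in row])
```

  Because the sigmoid is smooth, pixels whose probability is close to their threshold receive intermediate values and still pass gradients back, which a hard step function would not do.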

1.2.2 Feature extraction network ResNet

  ResNet (Residual Neural Network) is a deep convolutional neural network architecture proposed by Kaiming He et al. in 2015. It was designed to address vanishing gradients and network degradation in the training of deep networks, so that deeper networks can be trained and optimized more easily.
  The core idea of ResNet is to build a deep network out of residual connections. The feature extraction network consists of a series of residual blocks, each containing several convolutional layers and batch normalization layers. Inside a block, the input features pass through the convolutions and activation functions, the result is added to the block's original input via the skip connection, and the sum passes through a final activation to give the output features. This lets each block learn only the residual while the information in the original features is preserved.
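
  The skip connection can be shown without any framework. The sketch below is a toy on plain Python lists, not the Paddle block used by the project: `transform` stands in for the convolution and batch normalization stack, and all names are illustrative assumptions.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, transform):
    """Return relu(transform(x) + x): the skip connection adds the input back."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])

# Stand-in for the conv/BN stack: a transform that has learned nothing yet.
zero_transform = lambda values: [0.0 for _ in values]

x = [0.5, 1.0, 2.0]
print(residual_block(x, zero_transform))  # [0.5, 1.0, 2.0]
```

  With a transform that outputs zeros, the block reduces to the identity for non-negative inputs, which is why stacking more residual blocks does not have to degrade what earlier layers already represent.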

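# Define the ResNet network
import paddle
import paddle.nn as nn


class ResNet(nn.Layer):
    # block: residual block class, called as block(in_channels, out_channels, stride, downsample)
    # layers: number of residual blocks in each of the four stages
    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.in_channels = 64
        # Stem: 7x7 convolution, batch normalization, ReLU and 3x3 max pooling
        self.conv1 = nn.Conv2D(3, 64, kernel_size=7, stride=2, padding=3, bias_attr=False)
        self.bn1 = nn.BatchNorm2D(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
        # Four stages; stages 2 to 4 halve the resolution and double the channels
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # Classification head; 512 inputs assume a block without channel expansion
        self.avgpool = nn.AdaptiveAvgPool2D((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_channels, blocks, stride=1):
        # A 1x1 convolution on the shortcut matches shapes when the stride or channel count changes
        downsample = None
        if stride != 1 or self.in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2D(self.in_channels, out_channels, kernel_size=1, stride=stride, bias_attr=False),
                nn.BatchNorm2D(out_channels)
            )
        layers = [block(self.in_channels, out_channels, stride, downsample)]
        self.in_channels = out_channels

        for _ in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    # Forward pass (not part of the original excerpt; added so the class is usable)
    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = paddle.flatten(self.avgpool(x), start_axis=1)
        return self.fc(x)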

1.2.3 Adaptive Threshold

  In the DB algorithm, the adaptive threshold is central to text positioning. Traditional binarization methods usually use one fixed threshold to split an image into foreground (text) and background. Because text in an image can vary in brightness and contrast, a fixed threshold may produce inaccurate segmentation. The effect of the adaptive threshold is shown in the figure below.
[Figure: effect of the adaptive threshold]
  By introducing an adaptive threshold mechanism, the DB algorithm determines the binarization threshold dynamically from the local information around each pixel, which gives more accurate text positioning. The mechanism works as follows:
  Feature extraction: a CNN extracts N-dimensional features from the input image.
  Threshold estimation: a prediction branch, usually a few convolutional layers with activation functions, takes these features and outputs a threshold map that holds one threshold value per pixel.
  Binarization: the probability map is compared with the threshold map pixel by pixel to produce the binary result.
  With an adaptive threshold, the DB algorithm can locate text more precisely across regions of the image that differ in brightness and contrast.
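
  The difference between a fixed and a per-pixel threshold can be seen on a single row of pixels. This is a minimal sketch with made-up values; `binarize` and the hypothetical `threshold_map` are illustrative names, and a hard comparison is used here instead of the differentiable function applied during training.

```python
def binarize(prob_map, threshold):
    """Binarize with a scalar threshold or with a per-pixel threshold map."""
    result = []
    for y, row in enumerate(prob_map):
        out_row = []
        for x, p in enumerate(row):
            t = threshold[y][x] if isinstance(threshold, list) else threshold
            out_row.append(1 if p > t else 0)
        result.append(out_row)
    return result

# One row: bright, high-contrast text on the left, faint text on the right.
prob_map = [[0.90, 0.20, 0.45, 0.10]]
threshold_map = [[0.50, 0.50, 0.30, 0.30]]  # hypothetical predicted thresholds

print(binarize(prob_map, 0.5))            # [[1, 0, 0, 0]] fixed threshold misses the faint pixel
print(binarize(prob_map, threshold_map))  # [[1, 0, 1, 0]] per-pixel threshold keeps it
```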

1.2.4 Text area annotation generation

  In the DB algorithm, annotation generation means producing the binary segmentation mask used for training, which indicates for every pixel whether it belongs to a text area. These masks serve as supervisory signals that help the network learn correct text positioning. Annotation generation and its use in training proceed as follows:
  Data preparation: a training image dataset with text area annotations is prepared.
  Annotation transformation: for each image, the annotation of every text region is converted into a binary segmentation mask.
  Data augmentation: to increase the diversity and robustness of the training samples, augmentations such as random rotation, scaling and cropping are applied, producing transformed images together with their transformed masks.
  Segmentation mask prediction: during training, the network receives an input image and predicts a binary segmentation mask for the text regions.
  Loss calculation: the loss function compares the mask predicted by the network with the ground-truth mask.
  In this way every training sample gets a ground-truth mask that can be compared with the network's prediction, which makes supervised learning possible. By minimizing the loss, the network gradually adjusts its parameters so that its prediction approaches the ground-truth mask and text positioning becomes more accurate. The text area labeling process is shown in Figure 3-4 below:

[Figure 3-4: text area labeling process]
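
  In the DB paper, the ground-truth mask is not the annotated polygon itself but a shrunk copy of it. The shrink offset is D = A(1 - r²)/L, where A is the polygon area, L its perimeter and r the shrink ratio (0.4 in the paper). The sketch below only computes D for one box; the shrinking itself needs a polygon-clipping library and is left out. Function names are illustrative assumptions.

```python
import math

def polygon_area(points):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    n = len(points)
    return sum(
        math.hypot(points[(i + 1) % n][0] - points[i][0],
                   points[(i + 1) % n][1] - points[i][1])
        for i in range(n)
    )

def shrink_offset(points, ratio=0.4):
    """D = A * (1 - r^2) / L."""
    return polygon_area(points) * (1.0 - ratio ** 2) / polygon_perimeter(points)

box = [(0, 0), (100, 0), (100, 20), (0, 20)]  # a 100x20 text box
print(round(shrink_offset(box), 2))  # 7.0
```

  For the 100x20 box the area is 2000 and the perimeter 240, so the text region is shrunk by about 7 pixels on each side, which helps keep neighbouring text lines from merging in the mask.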

1.2.5 DB text positioning model training

  The DB text positioning model is trained as follows:

  1. Data preparation: prepare the training dataset. It should contain positive and negative samples: positive samples are annotated images that contain text, and negative samples are images that contain no text.
  2. Model design: design the network structure of the DB text positioning model, which is usually built on a deep convolutional neural network (CNN).
  3. Loss function definition: define a loss function that measures the model's error. In text localization tasks, commonly used loss functions include binary cross-entropy loss and bounding box regression loss. DB itself, as described in its paper, sums a loss on the probability map, a loss on the approximate binary map and an L1 loss on the threshold map.
  4. Model training: train the DB text positioning model with the prepared dataset and the defined loss function.
  5. Model evaluation: evaluate the model periodically during training, using metrics such as precision, recall and F1 score to measure its performance on the text positioning task.
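
  The evaluation metrics in step 5 follow directly from three counts. The sketch below assumes that predicted boxes have already been matched against the annotations (for example by an overlap criterion, which is not shown); the function name and the example numbers are illustrative.

```python
def detection_scores(num_matched, num_predicted, num_ground_truth):
    """Precision, recall and F1 from counts of matched, predicted and annotated boxes."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 80 predictions match an annotation, out of 100 predictions and 90 annotations.
p, r, f1 = detection_scores(80, 100, 90)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```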

2. CRNN text recognition

2.1 Overview of CRNN

  CRNN (Convolutional Recurrent Neural Network) is a deep learning algorithm for text recognition. It processes both the spatial features of an image and its sequence information, which makes text recognition more efficient and accurate. The overall process of CRNN is shown in the figure below, and it works in three steps:
  Feature extraction: convolutional layers extract features from the input text image.
  Recurrent sequence modeling: the extracted features are fed into a recurrent neural network (RNN) for sequence modeling.
  Sequence classification: in the last step, a fully connected layer maps the RNN output at each time step to a probability distribution over characters.
  The advantage of CRNN is that it captures both the local features of the image and the context of the sequence. The convolutional layers extract low-level image features such as edges and textures, while the recurrent neural network models the feature sequence and captures the semantic and contextual information of the text.
[Figure: overall process of CRNN]
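
  In the original CRNN design, the per-time-step distributions are turned into a string with CTC: take the most likely class at each step, merge consecutive repeats, then drop the blank symbol. The sketch below shows greedy decoding under the assumption that class 0 is the blank; the three-letter alphabet and the probabilities are made up for illustration.

```python
BLANK = 0  # assumed index of the CTC blank class

def ctc_greedy_decode(probs, alphabet):
    """probs: one probability list per time step; class i > 0 maps to alphabet[i - 1]."""
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    chars = []
    previous = BLANK
    for index in best:
        # Keep a class only if it is not blank and differs from the previous step.
        if index != BLANK and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)

alphabet = "abc"  # hypothetical character set
probs = [
    [0.1, 0.8, 0.1, 0.0],  # a
    [0.1, 0.7, 0.2, 0.0],  # a again, merged with the previous step
    [0.9, 0.1, 0.0, 0.0],  # blank
    [0.2, 0.6, 0.1, 0.1],  # a, a new character because a blank came before
    [0.0, 0.1, 0.8, 0.1],  # b
]
print(ctc_greedy_decode(probs, alphabet))  # aab
```

  The blank is what allows genuinely doubled letters to be recognized: without it, the two separate occurrences of "a" would be merged into one.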

2.2 Principle of CRNN

2.2.1 Implementation of CRNN network architecture

  The overall processing flow of CRNN is shown in the figure below.
[Figure: overall processing flow of CRNN]

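# Define the CRNN model
import paddle
import paddle.nn as nn


class CRNN(nn.Layer):
    def __init__(self, num_classes):
        super(CRNN, self).__init__()

        # Convolutional feature extractor for single-channel (grayscale) input.
        # The last two poolings halve only the height and keep the horizontal resolution.
        self.cnn = nn.Sequential(
            nn.Conv2D(1, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2),
            nn.Conv2D(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2),
            nn.Conv2D(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2D(256),
            nn.ReLU(),
            nn.Conv2D(256, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=(2, 1), stride=(2, 1)),
            nn.Conv2D(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2D(512),
            nn.ReLU(),
            nn.Conv2D(512, 512, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=(2, 1), stride=(2, 1)),
            nn.Conv2D(512, 512, kernel_size=2, stride=1),
            nn.ReLU()
        )
        # The recurrent and output layers are not shown in the original excerpt.
        # ASSUMPTION: a 2-layer bidirectional LSTM with hidden size 256.
        self.rnn = nn.LSTM(512, 256, num_layers=2, direction='bidirect')
        self.fc = nn.Linear(2 * 256, num_classes)

    # ASSUMPTION: forward pass for input of shape [N, 1, 32, W]
    def forward(self, x):
        feat = self.cnn(x)                        # [N, 512, 1, W']
        feat = paddle.squeeze(feat, axis=2)       # [N, 512, W']
        feat = paddle.transpose(feat, [0, 2, 1])  # [N, W', 512], one vector per time step
        out, _ = self.rnn(feat)                   # [N, W', 512]
        return self.fc(out)                       # [N, W', num_classes]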
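
  To see how many time steps the CNN listed above produces, the spatial size can be traced layer by layer with the standard output-size formula for convolution and pooling. The sketch assumes an input of height 32 and width 100, which is an illustrative choice and not stated in the article; only the layers that change the spatial size are listed.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel_h, kernel_w), (stride_h, stride_w), (pad_h, pad_w) for every layer that
# changes the spatial size; the 3x3 convolutions with padding 1 keep it unchanged.
LAYERS = [
    ((2, 2), (2, 2), (0, 0)),  # MaxPool2D(kernel_size=2, stride=2)
    ((2, 2), (2, 2), (0, 0)),  # MaxPool2D(kernel_size=2, stride=2)
    ((2, 1), (2, 1), (0, 0)),  # MaxPool2D(kernel_size=(2, 1), stride=(2, 1))
    ((2, 1), (2, 1), (0, 0)),  # MaxPool2D(kernel_size=(2, 1), stride=(2, 1))
    ((2, 2), (1, 1), (0, 0)),  # final Conv2D(kernel_size=2, stride=1)
]

def feature_map_size(height, width):
    for (kh, kw), (sh, sw), (ph, pw) in LAYERS:
        height = conv_out(height, kh, sh, ph)
        width = conv_out(width, kw, sw, pw)
    return height, width

print(feature_map_size(32, 100))  # (1, 24)
```

  The height collapses to 1 while 24 columns remain, so a 32x100 image becomes a sequence of 24 feature vectors with 512 channels each.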

2.2.2 CNN structure

  In the CRNN framework, a CNN (Convolutional Neural Network) extracts features from the text image. Its convolution operations filter the input image and capture local information and texture features. The overall process of the CNN in the CRNN framework is shown in the figure below.

[Figure: overall process of the CNN in the CRNN framework]
  Through CNN feature extraction, CRNN learns semantically meaningful, discriminative feature representations from the original text image and passes them on as input for sequence modeling and text recognition. The CNN parameters are usually learned through end-to-end training, so that the extracted features represent the text in the input image as well as possible.

2.2.3 RNN structure

  In the CRNN framework, an RNN (Recurrent Neural Network) performs sequence modeling on the features extracted by the CNN. An RNN captures context in sequence data and handles input sequences of variable length by iterating over time steps. The overall process of the RNN in the CRNN framework is shown in the figure below.
[Figure: overall process of the RNN in the CRNN framework]

  The RNN in the CRNN framework works as follows:
  Feature sequence: the feature map extracted by the CNN is converted into a feature sequence so that the RNN can process it step by step.
  RNN unit: the RNN unit is the basic building block of the RNN and processes sequence data through a recurrent structure.
  Hidden state transfer: the hidden state is passed from one time step to the next. It carries the context of the sequence, remembering information from earlier time steps and influencing the computation at later ones.
  Loop iteration: the RNN unit is applied once per element, so the number of iterations equals the length of the sequence.
  Sequence feature modeling: through these iterations, the context information in the sequence data is modeled.
  The RNN parameters are usually learned through end-to-end training, so that the text features in the sequence data are extracted and represented as well as possible.
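
  The first four points can be sketched on a toy feature map. The code below converts a tiny feature map into a sequence and runs a minimal recurrent unit with a scalar hidden state over it. It is a simplification for illustration: real CRNN implementations typically use bidirectional LSTM layers, and the weights here are arbitrary assumptions.

```python
import math

def map_to_sequence(feature_map):
    """feature_map[channel][column] with height 1 -> one vector per column."""
    return [list(column) for column in zip(*feature_map)]

def rnn_forward(sequence, w_in, w_hidden, bias=0.0):
    """h_t = tanh(w_in . x_t + w_hidden * h_{t-1} + bias), scalar hidden state."""
    h = 0.0
    states = []
    for x in sequence:  # one iteration per time step
        h = math.tanh(sum(w * v for w, v in zip(w_in, x)) + w_hidden * h + bias)
        states.append(h)
    return states

# Feature map with 2 channels and 3 columns (illustrative values).
feature_map = [[0.1, 0.4, 0.7],
               [0.2, 0.5, 0.8]]
sequence = map_to_sequence(feature_map)
print(sequence)  # [[0.1, 0.2], [0.4, 0.5], [0.7, 0.8]]

states = rnn_forward(sequence, w_in=[1.0, 1.0], w_hidden=0.5)
print([round(h, 3) for h in states])
```

  Setting `w_hidden` to 0 removes the dependence on earlier steps, which is a quick way to see what the hidden state contributes.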

3. Implementation and analysis of text recognition system

3.1 The overall function of the text recognition system

  The framework of the deep learning text recognition system designed in this article is shown in the figure below.
[Figure: framework of the deep learning text recognition system]
  The system provides the following functions:
  Text detection: one of the key functions of the system. It automatically detects the text regions that appear in an image.
  Text positioning: determines the exact position and bounding box of the text in the image.
  Text recognition: the core function of the system. Deep learning models analyze and recognize the text images.
  Text correction: the system can also apply text correction and post-processing to improve the accuracy of the results. Text correction fixes errors made during recognition, such as misrecognized or omitted characters. Post-processing further refines the results, for example with language-model-based error correction and segmentation of continuous text.
  Chinese and English support: deep learning text recognition systems usually support multiple languages and can handle text in different languages and character sets. They adapt to the characteristics and rules of each language and script and provide accurate recognition results.
  Batch processing: deep learning text recognition systems usually support batch processing and can process multiple text images in one run, which improves processing efficiency.
  Visualization and interactive interface: some deep learning text recognition systems also provide a visual, interactive interface in which users can upload, preview and edit text images, view the recognition results and make any necessary corrections.

3.2 System main interface design

  The deep learning text recognition system in this article is built with the PyQt5 framework. Its interface includes a menu bar, status bar, title bar, file list box, image display box, image zoom scroll bar, function buttons for text positioning and text recognition, a recognition result list box and a text detection position display box. The overall interface is shown in the figure below.

[Figure: main interface of the text recognition system]

3.3 Image data loading function

  The text recognition system supports batch image loading and recognition. The user first chooses a path that contains the image dataset and then clicks the load-directory button.
  The select-image-directory function is shown in the figures below. The system automatically loads every image file under the chosen path. When loading finishes, the file names appear in the image file list, and when the user selects an entry, that image is displayed in the image control.

[Figure: selecting the image directory]
[Figure: image file list after loading]
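
  The directory-loading step can be sketched with the standard library alone. The set of accepted extensions and the function name `list_images` are assumptions for illustration; the article does not say which formats the system accepts.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed supported formats

def list_images(directory):
    """Return the image files directly under `directory`, sorted by path."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )

# Usage: list the image files in the current directory.
for path in list_images("."):
    print(path.name)
```

  The returned names are what would populate the image file list; comparing the suffix in lower case keeps files such as `scan.PNG` from being skipped.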

3.4 Image automatic recognition function

  Combining the DB and CRNN algorithms has both advantages and disadvantages.
  The advantages are as follows. First, detection and recognition complement each other: DB is used for text detection and accurately locates and extracts the text areas in the image, while CRNN is used for text recognition and reads the text content of each detected area. Together they complete both tasks and improve overall recognition accuracy. Second, feature learning is strong: by combining convolutional and recurrent networks, CRNN learns the text features in the image effectively, which improves the accuracy and robustness of character recognition. Finally, context is exploited: CRNN models the context of sequence data through its recurrent network, which is very useful for text recognition. Feeding the text areas detected by DB into CRNN makes use of this context and improves the accuracy and semantic consistency of the recognized text.
  The disadvantages are as follows. First, complexity is high: integrating DB and CRNN requires model integration and joint training, which makes the algorithm more complex and harder to train. Second, the demand for training data is high: training requires a large amount of labeled data, with annotations for both detection and recognition, and obtaining high-quality labels is difficult. Finally, running time is longer: detection and recognition are computed one after the other, which increases the total running time, especially for large-scale image or video data.
  After the image path is selected and the system has loaded the image dataset, clicking the OCR recognition button makes the system recognize all image files in batch. The operation interface is shown in the figure below.
[Figure: batch OCR recognition interface]
  When the system has finished recognizing all images, click the OK button and the system displays all recognition results, including the text areas and the recognized text. The results are also drawn on the image one by one, which is more intuitive; the interface is shown in the figure below. When the user selects any entry in the recognition result list, the system displays the recognition result with the corresponding number.

[Figure: recognition results displayed on the image]
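
  The two-stage flow behind the OCR button can be outlined as follows. This is a sketch only: `detect` and `recognize` are hypothetical callables standing in for the DB and CRNN models, boxes are assumed to be axis-aligned (x, y, w, h) tuples, and images are plain lists of rows so that the example runs without any model weights.

```python
def recognize_image(image, detect, recognize):
    """Run detection, then recognition on every detected box of one image."""
    results = []
    for (x, y, w, h) in detect(image):
        crop = [row[x:x + w] for row in image[y:y + h]]
        results.append({"box": (x, y, w, h), "text": recognize(crop)})
    return results

def recognize_batch(images, detect, recognize):
    """images: dict mapping file name to image; returns results per file name."""
    return {name: recognize_image(image, detect, recognize)
            for name, image in images.items()}

# Stub models: fixed boxes, and a "recognizer" that just joins the cropped cells.
fake_detect = lambda image: [(0, 0, 2, 1), (1, 1, 2, 1)]
fake_recognize = lambda crop: "".join(cell for row in crop for cell in row)

image = [list("abc"), list("def")]
print(recognize_batch({"demo.png": image}, fake_detect, fake_recognize))
```

  Because recognition runs once per detected box, the total time grows with the number of text regions, which is the running-time cost mentioned above.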

3.5 Image drawing rectangle area recognition function

  Once an image is loaded, the user can draw a region with the mouse and run text recognition on that region only. After clicking the rectangle text positioning button, the user can draw a rectangle in the image area.
[Figure: drawing a rectangular region for recognition]
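
  A rectangle drawn with the mouse is defined by the press and release points, which may come in any order and may fall outside the image. One way to turn them into a crop region is sketched below; the function name and the (x, y, w, h) convention are assumptions, not taken from the project code.

```python
def normalize_rect(press, release, image_width, image_height):
    """Convert two mouse points into an (x, y, w, h) rectangle clamped to the image."""
    x1, y1 = press
    x2, y2 = release
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    left, right = max(0, left), min(image_width, right)
    top, bottom = max(0, top), min(image_height, bottom)
    return left, top, max(0, right - left), max(0, bottom - top)

# Dragging from bottom-right to top-left, partly outside a 640x480 image.
print(normalize_rect((700, 300), (200, 100), 640, 480))  # (200, 100, 440, 200)
```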

3.6 Image drawing arbitrary deformation region recognition function

  Once an image is loaded, the user can also draw an arbitrary region with the mouse and run text recognition on it. After clicking the polygon text positioning button, the user can draw a polygon in the image area. When the polygon is finished, a "to be recognized" label is displayed at its upper left corner. The operation interface is shown in the figure below.
[Figure: drawing a polygon region for recognition]
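
  For a polygon region, two small geometric helpers are usually enough: the bounding box gives the crop to pass on, and a point-in-polygon test tells which pixels inside that crop belong to the region. The sketch below uses the ray-casting test; it illustrates the idea and is not the project's actual implementation.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box of a polygon as (x, y, w, h)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)

def point_in_polygon(px, py, polygon):
    """Ray casting: count how many edges a horizontal ray from (px, py) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # the edge spans the ray's height
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

triangle = [(10, 10), (110, 10), (10, 60)]
print(bounding_box(triangle))                # (10, 10, 100, 50)
print(point_in_polygon(20, 20, triangle))    # True
print(point_in_polygon(100, 50, triangle))   # False
```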

Summary

  Text recognition based on deep learning has made remarkable progress in recent years. It consists of two key steps, text positioning and text recognition. In this article, text positioning uses the DB (Differentiable Binarization) algorithm and text recognition uses the CRNN (Convolutional Recurrent Neural Network) algorithm. The article has summarized how these methods are applied to text recognition and the principles behind implementing them with the PaddlePaddle framework.
  Text positioning is the step before text recognition, and its goal is to locate the text areas in the image accurately. With its adaptive threshold, the DB algorithm overcomes the fixed-threshold problem of traditional binarization. It uses a deep network to predict whether each pixel belongs to a text region and generates a binary segmentation mask. The method is robust and adaptable, and it locates text accurately under the brightness and contrast changes of different images. Text recognition is the process of converting images of text into machine-readable text. The CRNN algorithm combines the strengths of CNNs and RNNs and processes both the spatial features and the sequence information of an image.
  For the deep learning framework, this article uses PaddlePaddle, a framework developed in China, for model development and testing. The interface of the text recognition system is implemented with the PyQt5 library. The overall system includes text detection, text positioning, text recognition, batch image processing, and a visual interactive interface.

Origin blog.csdn.net/weixin_40280870/article/details/132250413