OCR Review (continuously updated)

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. When reproducing it, please include the original source link and this statement.
This link: https://blog.csdn.net/wsp_1138886114/article/details/100040857

OCR as a whole comprises two key sub-areas: text detection and text recognition.

Current OCR application scenarios fall into three broad categories:
1. Detection and recognition of text in many forms in natural scenes
2. Detection and recognition of handwritten text
3. Detection and recognition of text in documents (layout analysis)

A review of current OCR techniques and literature

I. Text detection

  1. CTPN (based on Faster R-CNN): a mature text-box detector with good accuracy. However, detection is slow, and there is considerable room for optimization.
    [Paper] Detecting Text in Natural Image with Connectionist Text Proposal Network

  2. TextBoxes, TextBoxes++ (based on SSD): adjusts the anchor aspect ratios to match the elongated shape of text, but small text tends to be missed.
    [Paper] TextBoxes: A Fast Text Detector with a Single Deep Neural Network
    [Paper] TextBoxes++: A Single-Shot Oriented Scene Text Detector
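The elongated anchors mentioned above can be sketched as follows. This is a minimal illustration (not code from the papers): each aspect ratio r widens the box by sqrt(r) and flattens it by 1/sqrt(r), so the area stays constant while the shape matches long horizontal text; the ratio set (1, 2, 3, 5, 7, 10) follows the values reported for TextBoxes.

```python
import math

def text_anchors(cx, cy, scale, ratios=(1, 2, 3, 5, 7, 10)):
    """Generate elongated anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Aspect ratio r gives width = scale*sqrt(r) and height = scale/sqrt(r),
    so anchor area is constant while boxes grow wider and flatter.
    """
    boxes = []
    for r in ratios:
        w = scale * math.sqrt(r)
        h = scale / math.sqrt(r)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

For example, `text_anchors(0, 0, 10)` yields six boxes of equal area whose width-to-height ratios run from 1:1 up to 10:1, which is why small or vertical text falls outside the anchor set and gets missed.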

  3. SegLink (CTPN + SSD): generally used for detecting multi-oriented text in natural scenes.
    [Paper] Detecting Oriented Text in Natural Images by Linking Segments

  4. DMPNet: detects with non-rectangular quadrilateral anchors. It estimates the overlap between rotated rectangular candidate boxes and the labeled regions with a Monte Carlo method, then recomputes the vertex coordinates of the candidates to obtain non-rectangular quadrilaterals. Intended for text detection in natural scenes.
    [Paper] Deep Matching Prior Network: Toward Tighter Multi-Oriented Text Detection
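The Monte Carlo overlap idea can be sketched in a few lines. This is an illustrative simplification, not DMPNet's actual implementation: sample uniform points over the joint bounding box, test membership in each quadrilateral by ray casting, and estimate the intersection-over-union from the hit counts.

```python
import random

def point_in_quad(x, y, quad):
    """Ray-casting test: is (x, y) inside the polygon given as vertex list?"""
    inside = False
    n = len(quad)
    for i in range(n):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray's level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def mc_overlap(quad_a, quad_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of |A ∩ B| / |A ∪ B| for two quadrilaterals."""
    rng = random.Random(seed)
    xs = [p[0] for p in quad_a + quad_b]
    ys = [p[1] for p in quad_a + quad_b]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    inter = union = 0
    for _ in range(n_samples):
        x, y = rng.uniform(x0, x1), rng.uniform(y0, y1)
        a = point_in_quad(x, y, quad_a)
        b = point_in_quad(x, y, quad_b)
        inter += a and b
        union += a or b
    return inter / union if union else 0.0
```

The advantage over an analytic polygon intersection is that the same sampling loop works unchanged for any quadrilateral shape, at the cost of estimation noise that shrinks with the sample count.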

  5. YOLO: short detection time with good accuracy, but only mediocre on small targets, which are easily missed on a large scale.
    [Paper] YOLOv3: An Incremental Improvement

  6. EAST: adopts the FCN approach for feature extraction and feature fusion, and completes the detection stage with locality-aware NMS. The simple network further improves both detection accuracy and speed. (Mostly used for natural scenes.)
    [Paper] EAST: An Efficient and Accurate Scene Text Detector
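A simplified sketch of locality-aware NMS may help here. This is not EAST's exact implementation (which merges rotated geometries row by row on the score map); it uses axis-aligned boxes for brevity, but shows the key idea: because candidates arrive roughly in reading order, neighboring overlapping boxes are first merged by score-weighted averaging in a single linear pass, and only the much smaller merged set goes through standard quadratic NMS.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def weighted_merge(a, sa, b, sb):
    """Score-weighted average of two boxes; scores accumulate."""
    s = sa + sb
    box = tuple((sa * ca + sb * cb) / s for ca, cb in zip(a, b))
    return box, s

def locality_aware_nms(boxes, scores, thresh=0.5):
    """Merge consecutive overlapping boxes in one pass, then run standard NMS."""
    merged = []
    for box, score in zip(boxes, scores):
        if merged and iou(merged[-1][0], box) > thresh:
            merged[-1] = weighted_merge(merged[-1][0], merged[-1][1], box, score)
        else:
            merged.append((box, score))
    merged.sort(key=lambda t: -t[1])       # standard NMS on the merged set
    keep = []
    for box, score in merged:
        if all(iou(box, k[0]) <= thresh for k in keep):
            keep.append((box, score))
    return keep
```

Since a dense pixel-wise detector like EAST emits thousands of near-duplicate boxes per text line, the linear merging pass is what keeps the NMS stage from dominating inference time.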

  7. Pixel-Anchor: anchor-based methods lose text when the number of anchors becomes excessive, while pixel-based methods lose long text because their receptive field is too small; Pixel-Anchor combines the advantages of both and adapts well to long lines of Chinese text in scene images. The network has two branches: a pixel-based branch that improves on EAST, and an anchor-based branch that improves on SSD. The former mainly detects medium-sized text; the latter mainly detects small text and long text lines.
    [Paper] Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks

  8. IncepText: addresses large variations in text scale, aspect ratio, and orientation by borrowing the inception module from GoogLeNet. Inside the inception structure, convolution kernels of different sizes are designed to handle text of different sizes and aspect ratios, while deformable convolution layers and deformable PSROI pooling are introduced to improve detection of text in arbitrary orientations.
    [Paper] IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

II. Text recognition

What a text recognition model needs:

  1. First, it must read the input image and extract image features, so a convolutional layer is required for reading the image and extracting features.
  2. Since text sequences have variable length, an RNN (recurrent neural network) must be introduced into the model; a bidirectional LSTM is commonly used to predict variable-length sequences.
  3. To improve the model's applicability, it is best not to require character-level segmentation of the input; training end to end directly saves a lot of segmentation and annotation work. The model should therefore introduce CTC (Connectionist Temporal Classification) to solve the alignment problem between samples and labels without segmentation.
  4. Finally, the model's output is post-processed according to certain rules to produce the correct result.
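The CTC decoding step (points 3 and 4 above) can be sketched with the simplest scheme, greedy (best-path) decoding. This is an illustrative sketch, not any particular library's API: take the argmax class at each time step, collapse consecutive repeats, then drop the blank symbol; `charset` is an assumed index-to-character table with index 0 reserved for the blank.

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: per-step argmax, collapse repeats, drop blanks.

    logits: list of per-time-step score lists, one score per class.
    charset: index -> character table; index `blank` is the CTC blank.
    """
    best_path = [max(range(len(step)), key=step.__getitem__) for step in logits]
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)
```

The blank is what lets the model emit genuinely repeated characters: the path "c, blank, c" decodes to "cc", whereas "c, c" collapses to a single "c". Real systems often replace the greedy argmax with beam search over a character lexicon for better accuracy.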

The following recognition models are described:

  1. CNN + RNN + CTC (e.g. CRNN): currently the most widely used text recognition framework. You need to build your own character lexicon (containing common Chinese characters, symbols of all kinds, etc.).
    [Paper] An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

  2. CNN (e.g. DenseNet) + CTC: little published material, mediocre results, and poor generalization. Without the RNN component, results are not as good.
    [Paper not yet found; refer to GitHub]

  3. Tesserocr (Tesseract): a widely used open-source recognition framework supporting many languages and platforms. Tesseract performs acceptably on clear, standard Chinese fonts, but results are very poor in slightly more complex situations (multiple fonts, etc.), and recognition takes a long time.

  4. RARE: mainly for recognizing deformed text images; gives good results on text recognition in natural scenes.
    [Paper] Robust Scene Text Recognition with Automatic Rectification

  5. FOTS (EAST + CRNN): an end-to-end OCR model in which the detection and recognition tasks share convolutional layers, saving computation time and learning more image features than a two-stage training scheme. It introduces RoIRotate, which extracts oriented text regions from the convolutional feature map, thereby supporting recognition of inclined text.
    [Paper] FOTS: Fast Oriented Text Spotting with a Unified Network
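The geometric core of RoIRotate can be illustrated with plain coordinate math. This is a sketch of the underlying affine mapping only (the real operator additionally samples the feature map with bilinear interpolation): a box of size w×h rotated by angle θ about its center is mapped to an axis-aligned w×h output region, so the recognizer always sees horizontal text features.

```python
import math

def roi_rotate_affine(cx, cy, w, h, theta):
    """Return a function mapping image coords inside an oriented box
    (center (cx, cy), size w x h, rotation theta) to axis-aligned
    output coords in [0, w] x [0, h]."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def transform(x, y):
        # output = R(-theta) @ (p - center) + (w/2, h/2)
        dx, dy = x - cx, y - cy
        u = cos_t * dx + sin_t * dy + w / 2
        v = -sin_t * dx + cos_t * dy + h / 2
        return u, v

    return transform

def box_corner(cx, cy, w, h, theta, sx, sy):
    """Corner of the oriented box: center + R(theta) @ (sx*w/2, sy*h/2),
    with sx, sy in {-1, +1} selecting the corner."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    px, py = sx * w / 2, sy * h / 2
    return cx + cos_t * px - sin_t * py, cy + sin_t * px + cos_t * py
```

By construction, the box's top-left corner lands at (0, 0) and its bottom-right corner at (w, h), regardless of θ; this is what makes the cropped region directly consumable by a CRNN-style recognizer.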
