Reprinted from: CSIG Document Image Analysis and Recognition Committee
This post briefly introduces the paper "Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement," accepted to IJCAI 2023. The paper observes that background regions in scene text images contribute little to downstream text recognition, and that complex backgrounds can actively interfere with a super-resolution model's reconstruction. Based on this observation, the paper proposes to explicitly model character locations in text images so that character regions receive more attention during super-resolution reconstruction. Experiments show that the proposed explicit location modeling further improves the accuracy of downstream recognition on super-resolved images, while remaining robust on hard samples with complex backgrounds.
1. Research background
Scene text recognition is an important computer vision task and has wide applications in fields such as autonomous driving and document recognition. Despite impressive progress, current scene text recognition methods still struggle with low-resolution images. Therefore, customizing super-resolution networks for scene text images has become a popular research topic.
To this end, many scene text image super-resolution (STISR) methods have been proposed in recent years, with promising results. For example, Chen et al. [1] proposed a position-aware loss function that accounts for the spatial distribution of characters. By applying character probability distributions, TPGSR [2] demonstrated the importance of language knowledge as guidance in STISR tasks. To handle spatially irregular text, Ma et al. [3] proposed the TATT model. C3-STISR [4] further improved performance by exploiting clues from three perspectives.
Despite these efforts, most current methods treat character regions and background equally in their model design, ignoring the adverse effects of complex backgrounds. Intuitively, non-character background is usually uninformative for downstream recognition, so there is no need to reconstruct its texture details. Moreover, a complex background can interfere with reconstruction in two ways. On the one hand, background may be mistaken for characters, producing erroneous reconstructions (see Figure 1(a)). On the other hand, background may prevent the model from accurately locating characters, producing poor reconstructions (see Figure 1(b)). Consequently, existing methods often suffer performance degradation on complex backgrounds, which limits their practical application.
Figure 1 Complex background brings challenges to scene text image super-resolution. (a) The "R" in "SUPER" may be incorrectly reconstructed as "B" or "S". (b) Complex background will lead to inaccurate character positioning and poor reconstruction results.
2. Brief description of method
Figure 2 Detailed block diagram
As shown in Figure 2, the model consists of a super-resolution reconstruction branch and a prior generation branch. The super-resolution branch reconstructs the high-resolution text image, while the prior generation branch contains a position enhancement module and a multi-modal alignment module, which mine character position features and semantic features to provide prior guidance for the super-resolution backbone.
The position enhancement module performs character position enhancement on the attention-map sequence $\{A_j\}_{j=1}^{L}$ produced by a text recognizer, where $L$ is the length of the character sequence and $A_j$ is the attention map corresponding to the $j$-th character (notation reconstructed from context). The meaningful attention maps are first selected, concatenated along the channel dimension, and reduced to a single channel with a Max operator; the result is denoted $A_{\max}$:

$$A_{\max} = \mathrm{Max}\big(\mathrm{Concat}(A_1, A_2, \dots, A_L)\big)$$

Then $C$ convolution kernels are applied to extract different feature patterns, and a Softmax function normalizes the result to obtain the final position attention $A_{att}$:

$$A_{att} = \mathrm{Softmax}\big(\mathrm{Conv}_{C}(A_{\max})\big)$$
The position attention is then multiplied with the instance-normalized image features to obtain the position-enhanced features:

$$F_{pos} = A_{att} \odot \mathrm{IN}(F)$$

where $F$ is the original image feature, $\odot$ denotes the Hadamard product, and IN is instance normalization.
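The aggregation-and-enhancement step above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the learned bank of $C$ convolution kernels is replaced here with a plain spatial Softmax over the max-reduced map, and all function names are our own.

```python
import numpy as np

def softmax(x, axis=None):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_norm(feat, eps=1e-5):
    # Normalize each channel of a (C, H, W) feature map over its spatial dims.
    mu = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    return (feat - mu) / np.sqrt(var + eps)

def position_enhance(attn_maps, feat):
    """attn_maps: (L, H, W) recognizer attention maps, one per character.
    feat: (C, H, W) image features. Returns position-enhanced features."""
    # Reduce the stacked per-character maps to one map with an element-wise Max.
    a_max = attn_maps.max(axis=0)                    # (H, W)
    # Stand-in for the paper's C learned conv kernels: spatial Softmax only.
    a_att = softmax(a_max.reshape(-1)).reshape(a_max.shape)
    # Hadamard product with instance-normalized image features.
    return a_att[None, :, :] * instance_norm(feat)   # (C, H, W)
```

In a real model the Softmax stand-in would be preceded by learned convolutions, but the data flow (concatenate, Max-reduce, normalize, multiply) is the same.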
Since $A_{att}$ contains pixel-level character confidence scores, its top-$K$ largest elements are selected to obtain the foreground coordinate set:

$$\mathcal{P} = \mathrm{TopK}(A_{att}, K)$$
However, directly indexing the image features with these coordinates discards the features of pixels outside the selected centers, making the model susceptible to attention drift. To fully exploit neighborhood information, a neighborhood feature weighting is used:

$$\tilde{F}(p) = \sum_{q \in \mathcal{N}(p)} w_q \, F_{pos}(q), \quad p \in \mathcal{P}$$

where $\mathcal{N}(p)$ denotes the eight-neighborhood of pixel $p$ and $w_q$ is the weight for each position in the neighborhood.
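The two steps, top-$K$ selection followed by neighborhood-weighted gathering, can be sketched as follows. This is an assumed minimal implementation; the weight layout (a 3×3 grid covering the center pixel and its eight neighbors) and the boundary handling are our own choices.

```python
import numpy as np

def topk_coords(score_map, k):
    """Return (row, col) coordinates of the k largest confidence scores."""
    flat = np.argsort(score_map.ravel())[::-1][:k]
    return np.stack(np.unravel_index(flat, score_map.shape), axis=1)

def neighborhood_weighted(feat, coords, weights):
    """feat: (C, H, W) features; coords: (K, 2) foreground pixels;
    weights: (3, 3) weights for each pixel and its 8-neighborhood.
    Gathers a weighted sum over each selected pixel's neighborhood
    instead of sampling the center pixel alone."""
    C, H, W = feat.shape
    out = np.zeros((len(coords), C))
    for i, (r, c) in enumerate(coords):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < H and 0 <= cc < W:   # skip out-of-bounds neighbors
                    out[i] += weights[dr + 1, dc + 1] * feat[:, rr, cc]
    return out  # (K, C) character-region feature tokens
```

Sampling the whole neighborhood rather than the single peak pixel is what gives the selection robustness to small localization errors (the "attention drift" mentioned above).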
The multi-modal alignment module first extracts semantic information $S$ from the text distribution through a projection operator and self-attention blocks. To ease the alignment between the visual and language modalities, a bidirectional alignment strategy consisting of two levels of progressive alignment is proposed. In the first level, image-to-text alignment uses the text features as the Query of the attention mechanism and the image features $F$ as both Key and Value, allowing each character to locate its corresponding image region:

$$T^{(n)} = \mathrm{Attn}\big(Q = T^{(n-1)},\; K = F,\; V = F\big)$$

where $T^{(0)} = S$ when $n = 1$; otherwise $T^{(n-1)}$ is the output of the previous block.
The second-level text-to-image alignment uses the image features as Query and the first-level alignment result as Key and Value, so that each element of the image modality can attend to the text features it should focus on.
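The two-level bidirectional alignment reduces to two cross-attention calls with swapped roles. The sketch below uses single-head scaled dot-product attention and one block per level; the real module stacks several blocks with learned projections, so this is an assumed simplification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Single-head scaled dot-product attention.
    query: (N, d); key, value: (M, d). Returns (N, d)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)      # (N, M) similarity
    return softmax(scores, axis=-1) @ value  # (N, d)

def bidirectional_align(text_feat, img_feat):
    """Level 1 (image-to-text): text queries attend over image features,
    so each character locates its image region. Level 2 (text-to-image):
    image queries attend over the level-1 output, so each image element
    gathers the text features it should focus on."""
    t2i = cross_attention(text_feat, img_feat, img_feat)  # (L, d)
    aligned = cross_attention(img_feat, t2i, t2i)         # (HW, d)
    return aligned
```

Note the asymmetry: the output of level 1 (not the raw text features) serves as Key/Value in level 2, which is what makes the alignment progressive rather than two independent passes.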
Finally, an adaptive fusion module injects different degrees of prior guidance into different super-resolution blocks. Specifically, given image features $F$ (either shallow features or the output of the previous super-resolution block) and the guidance $G$ generated by the prior generation branch, the adaptive fusion module first concatenates $F$ and $G$ along the channel dimension. Three parallel 1×1 convolutions then project the result into three feature spaces, denoted $F_1$, $F_2$, and $F_3$. A channel attention mechanism is applied to $F_1$, the resulting attention scores are multiplied with $F_2$ to produce channel-attended features, and $F_3$ is added to obtain the final fusion result:

$$F_{fuse} = \sigma\big(\mathrm{GDWConv}(F_1)\big) \odot F_2 + F_3$$

where GDWConv denotes global depth-wise separable convolution and $\sigma$ is the Sigmoid function.
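The fusion can be sketched as below. This is a hedged approximation: global average pooling stands in for the GDWConv channel-attention path, the weight matrices are passed in explicitly rather than learned, and the assignment of the three projections to the attention/scale/residual roles is our reconstruction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(feat, w):
    """1x1 convolution on a (C_in, H, W) map: a per-pixel linear map.
    w: (C_out, C_in)."""
    C, H, W = feat.shape
    return (w @ feat.reshape(C, -1)).reshape(w.shape[0], H, W)

def adaptive_fuse(feat, guidance, w1, w2, w3):
    """feat, guidance: (C, H, W). Concatenate along channels, project with
    three parallel 1x1 convs, derive per-channel attention scores from one
    projection (global average pooling as a stand-in for GDWConv), scale a
    second projection by those scores, and add the third as a residual."""
    x = np.concatenate([feat, guidance], axis=0)        # (2C, H, W)
    f1, f2, f3 = conv1x1(x, w1), conv1x1(x, w2), conv1x1(x, w3)
    scores = sigmoid(f1.mean(axis=(1, 2)))              # (C,) channel attention
    return scores[:, None, None] * f2 + f3              # (C, H, W)
```

Because the attention scores are recomputed at every fusion point, each super-resolution block can take a different amount of guidance, which is the "adaptive" part of the design.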
By designing a text super-resolution model enhanced with character position priors, text-specific fine-grained priors are injected into the super-resolution backbone, and character regions and non-character backgrounds are treated differently during inference. This improves the model's reconstruction quality under complex backgrounds.
3. Experimental results
Table 1 shows a quantitative comparison between the proposed method and other methods on downstream text recognition. Because the method pays more attention to character regions, its high-resolution outputs are more easily recognized by the downstream recognizer.
Table 1 Quantitative comparison on downstream recognition tasks
Figure 3 also shows the qualitative comparative experimental results.
Figure 3 Qualitative comparison results
Next come ablation experiments on the proposed modules. The first verifies the effectiveness of explicit position enhancement, i.e., the proposed position enhancement module. Using character position attention and feature selection to discard background features does not degrade performance; on the contrary, performance improves because background interference is reduced.
Table 2 Effectiveness of location enhancement module
Next, the effectiveness of the bidirectional alignment in the proposed multi-modal alignment module is verified. Experimental results show that bidirectional feature alignment outperforms both no alignment and unidirectional alignment.
Table 3 Effectiveness of multimodal alignment module
The authors also provide ablation experiments on combinations of the three proposed modules, as shown in Table 4.
Table 4 Effectiveness of different module combinations
Finally, the authors compare the downstream recognition accuracy of different models on scene text images of different character lengths, as shown in Figure 4. Thanks to the explicit position enhancement strategy, the model can strengthen character-region features on long text, alleviating the forgetting problem that arises with long character sequences.
Figure 4 Comparison of images with different character lengths
4. Summary
This paper proposes LEMMA, a location-enhanced multi-modal network, to address the challenges faced by existing STISR methods. The method focuses on character regions through explicit location enhancement: the position enhancement module uses character position attention and feature selection to extract character-region features from all pixels; the multi-modal alignment module adopts a bidirectional progressive strategy to facilitate cross-modal alignment; and the adaptive fusion module adaptively integrates the generated high-level guidance into different reconstruction blocks. Results on the TextZoom dataset and four challenging scene text recognition benchmarks demonstrate that the method further improves downstream text recognition accuracy, taking an important step toward robust scene text image super-resolution.
5. Related resources
Paper download address:
https://arxiv.org/pdf/2307.09749.pdf
Code address:
https://github.com/csguoh/LEMMA
References
[1] Jingye Chen, Haiyang Yu, Jianqi Ma, Bin Li, and Xiangyang Xue. Text gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 285–293, 2022.
[2] Jianqi Ma, Shi Guo, and Lei Zhang. Text prior guided scene text image super-resolution. IEEE Transactions on Image Processing, 32:1341–1353, 2023.
[3] Jianqi Ma, Zhetong Liang, and Lei Zhang. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5911–5920, 2022.
[4] Minyi Zhao, Miao Wang, Fan Bai, Bingjia Li, Jie Wang, and Shuigeng Zhou. C3-STISR: Scene text image super-resolution with triple clues. In Proceedings of the International Joint Conference on Artificial Intelligence, 2022.
Original author: Hang Guo, Tao Dai, Guanghao Meng, and Shu-Tao Xia
Written by: Guo Hang | Arranged by: Gao Xue
Reviewed by: Lian Zhouhui | Published by: Jin Lianwen