IJCAI 2023 | Tsinghua proposes a robust scene text image super-resolution network with explicit position enhancement

Reprinted from: CSIG Document Image Analysis and Recognition Committee

This article briefly introduces the paper "Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement", accepted at IJCAI 2023. The paper observes that background regions in scene text images contribute little to downstream text recognition, and that complex backgrounds can interfere with the reconstruction results of a super-resolution model. Based on this observation, the paper proposes to explicitly model character positions in text images so that character regions receive more attention during super-resolution reconstruction. Experiments show that the proposed explicit position modeling further improves the accuracy of downstream recognition on super-resolved images, while showing strong robustness to difficult samples.

1. Research background

Scene text recognition is an important computer vision task and has wide applications in fields such as autonomous driving and document recognition. Despite impressive progress, current scene text recognition methods still struggle with low-resolution images. Therefore, customizing super-resolution networks for scene text images has become a popular research topic.

To this end, many scene text image super-resolution (STISR) methods have been proposed in recent years and have achieved promising results. For example, Chen et al. [1] proposed a position-aware loss function that accounts for the spatial distribution of characters. By applying character probability distributions, TPGSR [2] demonstrated the importance of language knowledge as guidance in STISR. To handle spatially irregular text, Ma et al. [3] proposed the TATT model. In addition, C3-STISR [4] further improves performance by exploiting clues from three perspectives.

Despite these efforts, most current methods treat character regions and the background equally in their model design, ignoring the adverse effects of complex backgrounds. Intuitively, non-character backgrounds are usually uninformative for downstream recognition, so there is no need to reconstruct their texture details. Moreover, complex backgrounds interfere with the reconstruction process. On the one hand, background patterns may be mistaken for characters, leading to erroneous reconstructions (see Figure 1(a)). On the other hand, the background may prevent the model from accurately locating characters, leading to poor reconstruction results (see Figure 1(b)). As a result, existing methods often suffer performance degradation on complex backgrounds, which limits their practical application.

Figure 1 Complex backgrounds pose challenges for scene text image super-resolution. (a) The "R" in "SUPER" may be incorrectly reconstructed as "B" or "S". (b) A complex background leads to inaccurate character positioning and poor reconstruction results.

2. Brief description of method

Figure 2 Detailed block diagram of the proposed model

As shown in Figure 2, the model consists of a super-resolution reconstruction branch and a prior generation branch. The super-resolution branch is responsible for reconstructing high-resolution text images, while the prior generation branch comprises a position enhancement module and a multi-modal alignment module, which mine character position features and semantic features to provide prior guidance for the super-resolution backbone.

The position enhancement module performs character position enhancement on the attention map sequence produced by the text recognizer. Let $L$ be the length of the character sequence and $A_j$ the attention map corresponding to the $j$-th character. The meaningful attention maps are first selected, then concatenated along the channel dimension, and the Max operator is used to reduce the channel dimension to 1. The result of this process is denoted $A_{\max}$:

$$A_{\max} = \mathrm{Max}\big(\mathrm{Concat}(A_1, A_2, \dots, A_{L'})\big)$$

where $A_1, \dots, A_{L'}$ are the selected attention maps. Then $C$ convolution kernels are applied to extract different feature patterns, and the Softmax function is used for normalization to obtain the final attention $A_{pos}$:

$$A_{pos} = \mathrm{Softmax}\big(\mathrm{Conv}_C(A_{\max})\big)$$

The instance-normalized image features are then multiplied with this attention to obtain the position-enhanced features:

$$F_{pos} = \mathrm{IN}(F) \odot A_{pos}$$

where $F$ is the original image feature, $\odot$ denotes the Hadamard product, and $\mathrm{IN}$ is instance normalization.
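To make this step concrete, the following is a minimal PyTorch sketch of the position enhancement described above. The module name, the 3×3 kernel size, the spatial Softmax axis, and the assumption that the number of pattern kernels $C$ equals the image-feature channel count are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionEnhancement(nn.Module):
    """Sketch of: A_max = Max(Concat(A_1..A_L')), A_pos = Softmax(Conv_C(A_max)),
    F_pos = IN(F) * A_pos (Hadamard product)."""

    def __init__(self, feat_channels: int):
        super().__init__()
        # C convolution kernels extracting different patterns from A_max;
        # here C is assumed equal to the image-feature channel count so that
        # the Hadamard product below broadcasts cleanly.
        self.pattern_conv = nn.Conv2d(1, feat_channels, kernel_size=3, padding=1)
        self.instance_norm = nn.InstanceNorm2d(feat_channels)

    def forward(self, attn_maps: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # attn_maps: (B, L', H, W) selected per-character attention maps
        # img_feat:  (B, C, H, W) image features from the SR backbone
        a_max = attn_maps.max(dim=1, keepdim=True).values        # (B, 1, H, W)
        patterns = self.pattern_conv(a_max)                      # (B, C, H, W)
        # Softmax over spatial positions (an assumed normalization axis).
        b, c, h, w = patterns.shape
        a_pos = F.softmax(patterns.view(b, c, -1), dim=-1).view(b, c, h, w)
        return self.instance_norm(img_feat) * a_pos              # Hadamard product


# Example with toy shapes:
# pe = PositionEnhancement(feat_channels=64)
# f_pos = pe(torch.rand(2, 26, 16, 64), torch.randn(2, 64, 16, 64))
```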

Since $A_{pos}$ contains pixel-level character confidence scores, its top-$K$ largest elements are selected to obtain the foreground coordinate set:

$$\mathcal{P} = \mathrm{TopK}(A_{pos})$$

However, directly using $\mathcal{P}$ to index the image features discards the features of neighboring pixels around each selected center, making the model susceptible to attention drift. To fully exploit neighborhood information, neighborhood feature weighting is applied:

$$\hat{F}(p) = \sum_{q \in \mathcal{N}(p)} w_q\, F(q), \quad p \in \mathcal{P}$$

where $\mathcal{N}(p)$ denotes the set of eight neighboring pixels of $p$ and $w_q$ is the weight corresponding to each position in the neighborhood.
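Below is a rough PyTorch sketch of top-K foreground selection with neighborhood weighting, assuming a single-channel confidence map and a fixed uniform 3×3 kernel standing in for the learned weights $w_q$; the function name, shapes, and the value of K are hypothetical.

```python
import torch
import torch.nn.functional as F


def gather_foreground_features(conf_map: torch.Tensor,
                               feat: torch.Tensor,
                               k: int = 64) -> torch.Tensor:
    """conf_map: (H, W) pixel-level character confidence; feat: (C, H, W)."""
    h, w = conf_map.shape
    # Top-K most confident pixels -> foreground coordinate set P.
    _, flat_idx = conf_map.flatten().topk(k)
    ys = torch.div(flat_idx, w, rounding_mode="floor")
    xs = flat_idx % w

    # Weight each pixel's 3x3 neighborhood (its eight neighbors plus itself)
    # before indexing; a uniform kernel replaces the learned weights here.
    weights = torch.full((1, 1, 3, 3), 1.0 / 9.0)
    weighted = F.conv2d(feat.unsqueeze(1), weights, padding=1).squeeze(1)  # (C, H, W)

    # Index the neighborhood-weighted features at the foreground coordinates.
    return weighted[:, ys, xs].t()  # (K, C) foreground feature vectors


# Example with toy shapes:
# fg = gather_foreground_features(torch.rand(16, 64), torch.randn(32, 16, 64), k=50)
```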

The multi-modal alignment module first extracts semantic information $S$ from the text distribution through projection operators and self-attention blocks. To ease the alignment between the visual and language modalities, the paper proposes a bidirectional alignment strategy consisting of two levels of progressive alignment. The first level, image-to-text alignment, uses $T^{(n-1)}$ as the Query of the attention mechanism and the image features as both Key and Value. This design allows each character to find its corresponding image region:

$$T^{(n)} = \mathrm{Attention}\big(Q = T^{(n-1)},\; K = F,\; V = F\big)$$

where $T^{(n-1)} = S$ when $n = 1$, and otherwise denotes the output of the previous block.

The second level, text-to-image alignment, uses the image features as Query and the first-level alignment result as both Key and Value. This allows each element of the image modality to find the text features it should attend to.
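A minimal sketch of this two-level bidirectional alignment, built from standard multi-head attention, is shown below; the feature dimension, head count, and class name are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class BidirectionalAlignment(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Level 1: image-to-text alignment (text queries attend to image features).
        self.img2text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Level 2: text-to-image alignment (image queries attend to aligned text).
        self.text2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (B, L, dim) semantic features; img_feat: (B, H*W, dim).
        # Each character token finds its corresponding image region.
        aligned_text, _ = self.img2text(query=text_feat, key=img_feat, value=img_feat)
        # Each image position finds the text features it should focus on.
        guidance, _ = self.text2img(query=img_feat, key=aligned_text, value=aligned_text)
        return guidance  # (B, H*W, dim) prior guidance for the SR backbone


# Example with toy shapes:
# align = BidirectionalAlignment(dim=256)
# guidance = align(torch.randn(2, 26, 256), torch.randn(2, 16 * 64, 256))
```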

Finally, an adaptive fusion module introduces different degrees of prior guidance into the different super-resolution blocks. Specifically, given image features $F_{in}$, which may be shallow features or the output of the previous SR block, and the guidance $G$ generated by the prior generation branch, the adaptive fusion module first concatenates $F_{in}$ and $G$ along the channel dimension to obtain $F_{cat}$. Three parallel 1×1 convolutions then project $F_{cat}$ into three different feature spaces, denoted $F_1$, $F_2$ and $F_3$. A channel attention mechanism is applied to $F_1$, the resulting attention scores are multiplied with $F_2$ to produce channel-attended features, and these are added to $F_3$ to obtain the final fusion result:

$$F_{fuse} = \mathrm{CA}(F_1) \odot F_2 + F_3$$

where the channel attention $\mathrm{CA}$ is implemented with GDWConv, a global depthwise separable convolution.
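The adaptive fusion step can be sketched as follows. Using adaptive average pooling plus a 1×1 convolution as a stand-in for the global depthwise separable convolution (GDWConv), as well as all shapes, names, and the choice of sigmoid scores, are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, feat_channels: int, guide_channels: int):
        super().__init__()
        in_ch = feat_channels + guide_channels
        # Three parallel 1x1 convolutions projecting the concatenated features
        # into three feature spaces F1, F2, F3.
        self.proj1 = nn.Conv2d(in_ch, feat_channels, kernel_size=1)
        self.proj2 = nn.Conv2d(in_ch, feat_channels, kernel_size=1)
        self.proj3 = nn.Conv2d(in_ch, feat_channels, kernel_size=1)
        # Channel-attention branch; global pooling approximates GDWConv here.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C_f, H, W); guidance: (B, C_g, H, W).
        x = torch.cat([img_feat, guidance], dim=1)
        f1, f2, f3 = self.proj1(x), self.proj2(x), self.proj3(x)
        # Channel attention scores from F1 reweight F2; F3 is added as a residual.
        return self.channel_attn(f1) * f2 + f3
```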

By designing a text super-resolution model enhanced with character position priors, text-specific fine-grained priors are injected into the super-resolution backbone, and character regions and non-character backgrounds are treated differently during inference. This improves the model's super-resolution reconstruction under complex backgrounds.

3. Experimental results

Table 1 shows a quantitative comparison between the proposed method and other methods on downstream text recognition tasks. Because the method pays more attention to character regions, the generated high-resolution images are more easily recognized by the downstream recognizer.

Table 1 Quantitative comparison on downstream recognition tasks

Figure 3 shows the qualitative comparison results.

Figure 3 Qualitative comparison results

This is followed by ablation experiments on the proposed modules. The first verifies the effectiveness of explicit position enhancement, i.e., the proposed position enhancement module. Using character position attention and feature selection to discard background features does not degrade performance; on the contrary, it improves performance by reducing background interference.

Table 2 Effectiveness of the position enhancement module

Next, the effectiveness of the proposed multi-modal bidirectional alignment module is verified. Experimental results show that bidirectional feature alignment outperforms both no alignment and unidirectional alignment.

Table 3 Effectiveness of the multi-modal alignment module

The authors also provide ablation experiments on combinations of the three proposed modules, as shown in Table 4.

Table 4 Effectiveness of different module combinations

Finally, the authors compare the downstream recognition accuracy of different models on scene text images of different character lengths, as shown in Figure 4. Thanks to the explicit position enhancement strategy, the model can enhance character-region features for long text, alleviating the forgetting problem associated with long character sequences.

Figure 4 Comparison on text images of different character lengths

4. Summary

This paper proposes LEMMA, a position-enhanced multi-modal network that addresses the challenges faced by existing STISR methods. The method focuses more on character regions through explicit position enhancement. The position enhancement module uses character position attention and feature selection to extract character-region features from all pixels. The multi-modal alignment module adopts a bidirectional progressive strategy to facilitate cross-modal alignment. The adaptive fusion module adaptively integrates the generated high-level guidance into different reconstruction blocks. Results on the TextZoom dataset and four other challenging scene text recognition benchmarks show that the method further improves the accuracy of downstream text recognition, taking an important step toward robust scene text image super-resolution.

5. Related resources

Paper download address:

https://arxiv.org/pdf/2307.09749.pdf

Code address:

https://github.com/csguoh/LEMMA

References

[1] Jingye Chen, Haiyang Yu, Jianqi Ma, Bin Li, and Xiangyang Xue. Text gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 285–293, 2022.

[2] Jianqi Ma, Shi Guo, and Lei Zhang. Text prior guided scene text image super-resolution. IEEE Transactions on Image Processing, 32:1341–1353, 2023.

[3] Jianqi Ma, Zhetong Liang, and Lei Zhang. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5911–5920, 2022.

[4] Minyi Zhao, Miao Wang, Fan Bai, Bingjia Li, Jie Wang, and Shuigeng Zhou. C3-STISR: Scene text image super-resolution with triple clues. In Proceedings of the International Joint Conference on Artificial Intelligence, 2022.


Original author: Hang Guo, Tao Dai, Guanghao Meng, and Shu-Tao Xia

Written by: Guo Hang Arranged by: Gao Xue

Reviewer: Lian Zhouhui Publisher: Jin Lianwen 

