RestoreDet: Object detection in low-resolution images

Author: Institute of Computer Vision

Editor: 3D Vision Developer Community

Foreword

When the true degradation is unknown or differs from the assumed one, both the preprocessing module and subsequent high-level tasks such as object detection fail. Here, we propose a new framework, RestoreDet, to detect objects in degraded low-resolution (LR) images. RestoreDet exploits downsampling degradation as a transformation for self-supervised signals, learning representations that are equivariant across resolutions and other degradation conditions.

Specifically, this intrinsic visual structure is learned by encoding and decoding the degradation transformation applied to a pair of original and randomly degraded images. The framework can further employ an advanced super-resolution (SR) architecture with an arbitrary-resolution restoration decoder to reconstruct the original image from the degraded input. Representation learning and object detection are jointly optimized in an end-to-end training fashion. RestoreDet is a general framework that can be built on any mainstream object detection architecture. Extensive experiments demonstrate that the CenterNet-based instantiation achieves superior performance over existing methods under varied degradation scenarios. Code will be released soon.

Background

High-level vision tasks (e.g., image classification, object detection, and semantic segmentation) have achieved great success thanks to large-scale datasets. The images in these datasets are mainly captured by commercial cameras with high resolution and signal-to-noise ratio (SNR). After being trained and optimized on such high-quality images, these advanced vision models degrade markedly on low-resolution or low-quality inputs. To improve the performance of vision algorithms on degraded LR images, Dai et al. ["Is image super-resolution helpful for other vision tasks?"] presented the first comprehensive study advocating super-resolution (SR) algorithms for image preprocessing. Other high-level tasks, such as face recognition, face detection, image classification, and semantic segmentation, also benefit from a restoration module that helps extract more discriminative features.

New Framework Analysis

Instead of explicitly enhancing the input image with a restoration module under strict assumptions, RestoreDet leverages an intrinsic representation that is equivariant across resolutions and degradation states. Based on this encoded representation, the researchers propose RestoreDet, an end-to-end model for object detection in degraded LR images. To capture complex patterns of visual structure, the group of downsampling degradation transformations is used as a self-supervised signal. During training, a degraded LR image t(x) is generated from the original HR image x by a randomly sampled degradation transformation t. As shown in the figure above, the image pair is fed into the encoder E to obtain the latent features E(x) and E(t(x)).
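The pairing above can be sketched as follows. This is a minimal toy stand-in, not the paper's implementation: the real encoder E is the detector backbone, whereas here `E` maps an image to a latent vector from two global statistics, and t(x) is just a plain 2x downsample; the only property illustrated is that the same shared weights process both the HR and LR branch.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 2))   # toy shared encoder weights (assumed shape)

def E(img):
    """Toy stand-in for the shared-weight encoder E: maps an image to an
    8-dim latent via two global statistics. Only the weight sharing between
    the HR and LR branches reflects the paper's design."""
    stats = np.array([img.mean(), img.std()])
    return W @ stats

x = rng.random((32, 32))       # original HR image x
t_x = x[::2, ::2]              # degraded LR image t(x) (plain 2x downsample here)
z_hr, z_lr = E(x), E(t_x)      # E(x) and E(t(x)), computed with the same weights W
```

Because both latents come from the same `E`, any decoder trained on them sees features produced under identical parameters, which is what "Siamese, shared-weight" means in practice.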

To train the encoder E to learn degradation-equivariant representations, the researchers first introduce a transform decoder Dt that decodes the applied degradation transformation t from E(x) and E(t(x)). If the transformation can be reconstructed, the representation has captured, as far as possible, how the image changes under different transformations.
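A minimal sketch of that idea, under stated assumptions: here Dt is a hypothetical linear regressor over the concatenated latents, and the transformation is summarized by three scalar parameters (the paper's Dt is a learned network; its exact parameterization is not specified here).

```python
import numpy as np

def transform_decoder(z_hr, z_lr, Wt):
    """Toy Dt: regresses the degradation parameters (e.g. rate s, kernel
    width, noise level) from the concatenated latents E(x) and E(t(x))."""
    return Wt @ np.concatenate([z_hr, z_lr])

def dt_loss(pred, true_params):
    """L2 loss on the predicted transformation parameters: if Dt can recover
    t from the latents, E has encoded how the image changes under t."""
    return float(np.mean((pred - np.asarray(true_params, dtype=float)) ** 2))
```

Minimizing `dt_loss` pushes E to keep information that distinguishes one degradation from another, which is the self-supervised signal.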

To further exploit the rapidly advancing SR research, an Arbitrary Resolution Restoration Decoder (ARRD) Dr is introduced. ARRD reconstructs the original HR image x from the representations E(t(x)) of variously degraded LR images t(x). Dr supervises the encoder E to encode the detailed image structure that benefits the subsequent tasks. Based on the encoded representation E(t(x)), the object detection decoder Do then performs detection to obtain the locations and categories of objects. During inference, detection runs directly through the encoder E and the detection decoder Do, as shown in the figure above. This inference pipeline is more computationally efficient than methods based on preprocessing modules.

To cover the various degradations found in real scenes, the degraded image t(x) is generated by randomly sampling the transformation t according to a practical downsampling degradation model. As shown in the figure above, the transformation t is characterized by the downsampling rate s, the degradation kernel k, and the noise level n, following the standard degradation model t(x) = (x ⊗ k)↓s + n, where ⊗ denotes convolution and ↓s denotes downsampling by rate s.
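That degradation model can be sketched directly. This is a minimal numpy implementation of (x ⊗ k)↓s + n with an isotropic Gaussian blur kernel and additive Gaussian noise; the paper samples s, k, and n randomly per image, and the parameter names and defaults below are illustrative.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Isotropic Gaussian blur kernel k, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(x, s=2, sigma=1.0, noise_level=0.05, rng=None):
    """Apply t(x) = (x (*) k) downsampled by s, plus noise n."""
    rng = rng or np.random.default_rng(0)
    k = gaussian_kernel(sigma=sigma)
    pad = k.shape[0] // 2
    xp = np.pad(x, pad, mode="edge")
    blurred = np.zeros_like(x)
    for i in range(x.shape[0]):            # direct 2-D convolution with k
        for j in range(x.shape[1]):
            blurred[i, j] = (xp[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    lr = blurred[::s, ::s]                 # downsample by rate s
    return lr + rng.normal(0.0, noise_level, lr.shape)  # additive noise n
```

Sampling s, sigma, and noise_level from distributions before each call would reproduce the random transformation sampling described above.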

Figure (a) above shows CenterNet, an anchor-free framework; figure (b) illustrates how RestoreDet is implemented on top of CenterNet. The detailed training procedure is given in Algo.1. When training RestoreDet, the original HR image x and the degraded LR image t(x) are fed into the encoder E to obtain degradation-equivariant representations. CenterNet's encoder E is used directly, but duplicated into a shared-weight Siamese structure that receives the HR and LR images separately.
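The joint, end-to-end optimization can be summarized as a single scalar objective. The form below is an assumption: a detection loss on E(t(x)) plus weighted self-supervised terms from the transform decoder Dt and the restoration decoder ARRD Dr; the weights `lam_t` and `lam_r` are illustrative, not the paper's values.

```python
def total_loss(l_det, l_t, l_rec, lam_t=0.1, lam_r=0.1):
    """Assumed end-to-end objective: CenterNet detection loss on E(t(x)),
    plus weighted transform-decoding (Dt) and restoration (ARRD Dr) terms.
    All three terms backpropagate into the shared encoder E."""
    return l_det + lam_t * l_t + lam_r * l_rec
```

Because all three terms flow through the same encoder, minimizing this sum is what makes the detection features degradation-equivariant rather than trained on detection alone.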

Algo.1

Experimentation and Visualization

Performance comparison on MS COCO and KITTI datasets

(a)/(b): CenterNet trained on normal images and tested on the normal / degraded down4 test set. (c)/(d)/(e): CenterNet tested on degraded images restored by the individual SR algorithms RRDB/RealSR/BSRGAN. (f): detection results of our RestoreDet, with the output of ARRD Dr used as the background images.




Origin blog.csdn.net/limingmin2020/article/details/130996554