Top journal IJCV 2023! Beijing Institute of Technology & Princeton propose: instance segmentation in the dark


Author: Uno Whoiam (Source: Zhihu, authorized) | Editor: CVer public account

https://zhuanlan.zhihu.com/p/656570195


Instance Segmentation in the Dark

Affiliations: Beijing Institute of Technology & Princeton

Author: Linwei Chen · Ying Fu · Kaixuan Wei · Dezhi Zheng · Felix Heide

Paper: arxiv.org/abs/2304.14298

https://link.springer.com/article/10.1007/s11263-023-01808-8

Code: https://github.com/Linwei-Chen/LIS

TL;DR

This paper opens up a new research direction: instance segmentation under low-light conditions. It is the first work to systematically establish a training and evaluation framework for instance segmentation in low light (a promising new direction to dig into!).

This paper collects and releases a Low-light Instance Segmentation (LIS) dataset, which contains four aligned sets of images, low-light and normal-exposure, each in paired JPEG and RAW formats, with pixel-level instance annotations for 8 object categories; it supports both instance segmentation and object detection. (New dataset!)

This paper observes that RAW images have greater potential than JPEG images for achieving high instance segmentation accuracy, and the authors' analysis attributes this to the extra bit-depth information that RAW provides (RAW is all you need!).

This paper observes that under low-light conditions, image noise introduces high-frequency disturbances into the features of deep neural networks, which is a key reason why existing instance segmentation methods perform poorly in the dark (noise is the key!).

How well does it work? Built on Mask R-CNN with a ResNet-50 backbone, the proposed framework holds up well even against Segment Anything, a large model trained on massive amounts of data.

Abstract

Existing instance segmentation techniques are mainly designed for high-quality images under normal lighting, and their performance drops significantly in extremely low-light environments. In this work, we delve into instance segmentation in low-light conditions and introduce several techniques that substantially improve low-light inference accuracy. The proposed method is motivated by the observation that noise in low-light images introduces high-frequency disturbances into the feature maps of neural networks, thereby significantly degrading performance. To suppress this "feature noise", we propose a novel learning method built on an adaptive weighted downsampling layer, a smooth-oriented convolutional block, and disturbance suppression learning. These components effectively reduce feature noise during downsampling and convolution operations, enabling the model to learn disturbance-invariant features. Additionally, we find that high-bit-depth RAW images preserve richer scene information in low light than typical camera sRGB outputs, which motivates the use of RAW-input algorithms. Our analysis shows that high bit depth is crucial for low-light instance segmentation. To mitigate the scarcity of annotated RAW datasets, we leverage a low-light RAW synthetic pipeline to generate realistic low-light data. Furthermore, to facilitate further research in this direction, we capture a real-world low-light instance segmentation dataset containing more than two thousand paired low-light/normal-light images with instance-level pixel-wise annotations. Notably, without any image preprocessing, we achieve satisfactory instance segmentation performance in very low light (4% AP higher than state-of-the-art competitors), while opening up new opportunities for future research. Our code and dataset are publicly available to the community (https://github.com/Linwei-Chen/LIS).

1. Observation and motivation


Two key observations:

a. Feature map degradation under low light. For clean normal-light images, the instance segmentation network clearly captures low-level features (e.g., edges) in shallow layers and high-level features (i.e., semantic responses) in deep layers. For noisy low-light images, however, shallow features are corrupted and full of noise, while deep features show weaker semantic responses to objects.

b. Comparison between the camera's sRGB output and the RAW image in the dark. Due to the significantly reduced signal-to-noise ratio, the 8-bit camera output loses much of the scene information; for example, the seat backrest structure is barely discernible in the camera output, while it remains recognizable in the RAW image (zoom in for better detail).


2. Challenges and methods

The overall approach is summarized in the following subsections.


2.1 Low-Light RAW Synthetic Pipeline

Challenge: training a segmentation model requires massive data with instance-level annotations, and no such low-light dataset currently exists. Collecting and labeling a large-scale low-light dataset is expensive, and existing RAW data is also very scarce.

Solution: design a pipeline that converts sRGB images into noisy low-light RAW images, so that existing instance segmentation datasets can be used, at essentially zero cost, to train an instance segmentation model on RAW input under low-light conditions.

The low-light RAW synthetic pipeline consists of two steps: unprocessing and noise injection.

Unprocessing. Collecting large-scale RAW image datasets is expensive and time-consuming, so the authors leverage existing sRGB image datasets (Everingham et al., 2010; Lin et al., 2014). An sRGB image is obtained from RAW data through a series of transformations in the camera's image signal processor (ISP), such as tone mapping, gamma correction, color correction, white balance, and demosaicing. With the unprocessing operation (Brooks et al., 2019), these transformations can be inverted to recover RAW images, creating a RAW dataset at zero cost.

Noise injection. After obtaining clean RAW images via unprocessing, noise is injected into them to simulate real noisy low-light images. To produce more accurate results for real, complex noise, the authors adopt a recently proposed physics-based noise model (Wei et al., 2020, 2021) instead of the widely used Poissonian-Gaussian noise model (i.e., the heteroscedastic Gaussian model (Foi et al., 2008)). The model accurately characterizes real noise structure by accounting for many noise sources, including photon shot noise, read noise, banding pattern noise, and quantization noise. (A toy sketch follows.)

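To make the two steps concrete, here is a minimal toy sketch of such a pipeline. It is not the paper's implementation: the function name `synthesize_low_light_raw`, the gamma-only unprocessing, the Gaussian read noise (the physics-based model uses a heavier-tailed Tukey-lambda distribution), and all constants are illustrative assumptions.

```python
import torch

def synthesize_low_light_raw(srgb, darkening=50.0, K=0.5):
    """Toy sketch of the two-step pipeline (not the paper's code):
    unprocess sRGB to linear RAW-like data, darken it, then inject
    physics-inspired noise. `srgb` is a (C, H, W) tensor in [0, 1]."""
    # Step 1: unprocessing (heavily simplified). The real operation
    # (Brooks et al., 2019) also inverts tone mapping, color correction,
    # and white balance, and re-mosaics; here we only invert gamma.
    linear = srgb.clamp(min=1e-8) ** 2.2

    # Simulate a short exposure by dividing by a low-light factor.
    dark = linear / darkening

    # Step 2: noise injection (simplified physics-based model).
    # Photon shot noise: Poisson in photo-electrons, scaled back by
    # an assumed overall system gain K.
    shot = torch.poisson(dark / K) * K
    # Read noise: zero-mean Gaussian stand-in for the heavier-tailed
    # distribution used by the physics-based model.
    read = torch.randn_like(dark) * 2e-3
    # Banding (row) noise: one offset shared along each image row.
    rows = torch.randn(dark.shape[-2], 1) * 1e-3
    # Quantization noise: round to a finite ADC bit depth (14-bit assumed).
    noisy = shot + read + rows
    return (torch.round(noisy * (2**14 - 1)) / (2**14 - 1)).clamp(0, 1)
```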

2.2 Adaptive Weighted Downsampling Layer

Challenge: how to suppress the "noise disturbance" that image noise introduces into features.


A simple observation: downsampling an image with a low-pass filter reduces the noise level, thanks to the smoothness prior of natural images.

Considering that existing networks already contain multiple downsampling layers, why not take full advantage of them? Experiments show that simply inserting a mean filter before downsampling yields an almost free performance gain in low-light instance segmentation (a minimal sketch follows).

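As a concrete illustration, here is a minimal sketch of that baseline, a fixed box blur before stride-2 decimation; the class name `BlurDown` is ours, not from the paper's code.

```python
import torch.nn as nn

class BlurDown(nn.Module):
    """Fixed 3x3 mean filter followed by stride-2 subsampling: a
    drop-in replacement for plain strided downsampling that
    attenuates high-frequency (noise) content before decimation."""
    def __init__(self):
        super().__init__()
        # AvgPool2d with stride 1 acts as the box low-pass filter...
        self.blur = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return self.blur(x)[..., ::2, ::2]  # ...then decimate by 2
```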

Although effective, a fixed filter such as the mean filter cannot adapt to the features and may therefore erase detail. To address this, the authors propose the Adaptive Weighted Downsampling (AWD) layer, which predicts a low-pass filter over the features channel by channel and location by location: it strengthens the low-pass filtering in noisy regions while weakening it in detailed regions to preserve details.


Looking at the released source code, the FC predictor is implemented as a depthwise convolution, with equivalent effect. The formulas are omitted here; see the paper if you are interested. A sketch of the idea follows.
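Here is a minimal sketch of the idea as we read the description: predict softmax-normalized k×k weights per channel and per output location with a depthwise layer, then apply them as a spatially varying weighted average during downsampling. The class name, shapes, and hyperparameters are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AWD(nn.Module):
    """Sketch of adaptive weighted downsampling: predict per-location,
    per-channel kxk weights, softmax-normalize them so each output
    pixel is a convex (low-pass) combination of its kxk neighborhood,
    then downsample by the given stride."""
    def __init__(self, channels, k=3, stride=2):
        super().__init__()
        self.k, self.stride = k, stride
        # Depthwise predictor (the paper's FC is reportedly equivalent):
        # kxk weights per channel at each strided output location.
        self.predict = nn.Conv2d(channels, channels * k * k, k,
                                 stride=stride, padding=k // 2,
                                 groups=channels)

    def forward(self, x):
        b, c, h, w = x.shape
        w_map = self.predict(x)                      # (b, c*k*k, hp, wp)
        hp, wp = w_map.shape[-2:]
        w_map = w_map.view(b, c, self.k * self.k, hp, wp)
        w_map = F.softmax(w_map, dim=2)              # convex weights
        # Gather the matching kxk neighborhoods at the same stride.
        patches = F.unfold(x, self.k, padding=self.k // 2,
                           stride=self.stride)       # (b, c*k*k, hp*wp)
        patches = patches.view(b, c, self.k * self.k, hp, wp)
        return (w_map * patches).sum(dim=2)          # weighted average
```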

2.3 Smooth-Oriented Convolutional Block

To further reduce the disturbance that high-frequency image noise causes in features, the authors also apply re-parameterization: a set of smoothing convolution kernels is trained in parallel during training and fused into the original kernels at inference, making the convolution more robust to noise. Notably, this adds no parameters or computation at inference time, a free lunch! (See the sketch below.)

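A minimal sketch of this kind of re-parameterization, not the paper's exact block (which may also include per-branch normalization): a plain 3×3 branch is trained alongside a branch whose kernel is smoothed by a fixed box filter, and since both branches are linear with the same kernel size, they collapse into a single 3×3 convolution at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothRepConv(nn.Module):
    """Re-parameterization sketch: a standard 3x3 kernel is trained in
    parallel with a kernel biased toward low-pass behavior; at
    inference the two are summed into one 3x3 conv, so robustness
    comes at zero extra cost."""
    def __init__(self, cin, cout):
        super().__init__()
        self.w_main = nn.Parameter(torch.randn(cout, cin, 3, 3) * 0.05)
        self.w_smooth = nn.Parameter(torch.randn(cout, cin, 3, 3) * 0.05)
        self.fused = None  # set by fuse()

    def _smoothed(self):
        # Smooth each 3x3 kernel with a 3x3 box filter (padding keeps
        # the kernel 3x3), making this branch act as a low-pass conv.
        box = torch.full((1, 1, 3, 3), 1 / 9.0,
                         device=self.w_smooth.device)
        k = self.w_smooth.view(-1, 1, 3, 3)
        return F.conv2d(k, box, padding=1).view_as(self.w_smooth)

    @torch.no_grad()
    def fuse(self):
        # Both branches are linear with the same kernel size, so the
        # fusion is just a kernel sum.
        self.fused = self.w_main + self._smoothed()

    def forward(self, x):
        w = self.fused if self.fused is not None \
            else self.w_main + self._smoothed()
        return F.conv2d(x, w, padding=1)
```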

2.4 Disturbance Suppression Learning

The authors also adjust the training procedure: the model learns from clean and noisy images simultaneously, with a constraint that pulls the features of noisy inputs toward those of clean inputs, somewhat like knowledge distillation, but without a teacher. This improves robustness not only in low light but also on normally lit images, which matches the practical scenario: one model that handles both day and night. A sketch of the objective follows.

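A minimal sketch of such an objective, assuming the loss combines task supervision on both inputs with a feature-consistency term; `model.backbone` and `model.task_loss` are hypothetical hooks, and the MSE consistency term and weight `lam` are our guesses, not the paper's exact formulation.

```python
import torch.nn.functional as F

def disturbance_suppression_step(model, clean, noisy, targets, lam=1.0):
    """One training step: supervise the task on both clean and noisy
    inputs, and pull noisy-input features toward clean-input features
    (self-distillation without a separate teacher)."""
    f_clean = model.backbone(clean)
    f_noisy = model.backbone(noisy)
    # Feature consistency: suppress noise-induced disturbance. Clean
    # features are detached so they serve as the target.
    l_dsl = F.mse_loss(f_noisy, f_clean.detach())
    # Task supervision under both lighting conditions.
    l_task = (model.task_loss(f_clean, targets)
              + model.task_loss(f_noisy, targets))
    return l_task + lam * l_dsl
```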

3. The LIS dataset


The dataset was captured with a Canon EOS 5D Mark IV and has the following characteristics:

- Paired samples. In the LIS dataset, we provide images in sRGB-JPEG (typical camera output) and RAW formats, with each format including paired short-exposure low-light and corresponding long-exposure normal-light images. We refer to these four types of images as sRGB-dark, sRGB-normal, RAW-dark, and RAW-normal. To ensure pixel-level alignment, the camera was mounted on a sturdy tripod and controlled remotely via a mobile app to avoid vibration.

- Diverse scenes. The LIS dataset consists of 2230 image pairs collected in various scenes, both indoor and outdoor. To diversify the low-light conditions, long-exposure reference images were taken at a range of ISO levels (e.g., 800, 1600, 3200, 6400), and the exposure time was deliberately reduced by a range of low-light factors (e.g., 10, 20, 30, 40, 50, 100) to take short-exposure images simulating very low-light conditions.

- Instance-level pixel-wise labels. For each pair of images, precise instance-level pixel-wise labels annotate instances of the 8 most common object categories in daily life (bicycle, car, motorcycle, bus, bottle, chair, dining table, tv). Note that LIS contains images captured in different scenes (indoor and outdoor) and under different illumination conditions. As shown in Fig. 7 of the paper, object occlusion and densely distributed objects make LIS challenging beyond the low light itself.


4. Experimental results

4.1 Ablation


4.2 Main results

The paper evaluates Mask R-CNN, PointRend, Mask2Former, and Faster R-CNN with ResNet-50, Swin Transformer, and ConvNeXt backbones, demonstrating effectiveness on two tasks: instance segmentation and object detection.


4.3 Visualization results


5. Inspiration

Detection and segmentation are hot, mainstream research directions, with new papers appearing one after another. If you want to explore such a red ocean, the best way is to create a blue ocean of your own. This paper finds a new application scenario for instance segmentation, builds a framework for training and evaluating instance segmentation under low-light conditions, and derives a series of simple but effective methods from observations of key phenomena. Observation is the foundation of research: good observations let people discover interesting phenomena and problems, leading to new questions and opening up new paths.
