Why is your phone's rear camera getting uglier? This ECCV 2022 paper tells you

This article was first published on the WeChat public account CVHub and may not be reproduced on other platforms in any form. It is for learning and communication only; violators will be held accountable!

Paper: https://arxiv.org/pdf/2210.02785.pdf

Preamble

This article touches on a wide range of 3D sensing work, covering monocular, binocular, and ToF sensors, and spanning computational photography, ToF depth measurement, binocular depth estimation, multi-view depth estimation, offline camera calibration, online camera calibration, NeRF, and more. It can fairly be called a tour de force of 3D vision.

Background

Here is a brief introduction to the two 3D imaging technologies mentioned in this article: binocular stereo vision and ToF.

  • Binocular stereo vision: mimics human vision using two cameras separated by a fixed baseline. Images captured by the two cameras are used for visual feature extraction and matching, yielding a disparity map between the camera views. Disparity is inversely proportional to depth, so a depth map can be recovered directly (the standard relations are sketched right after this list)
  • Time of Flight (ToF): as the name implies, the sensor emits infrared laser light and records the time it takes for the emitted light to hit an object's surface and return, i.e. the time of flight of the light; the distance from the camera to the surface is then computed from the speed of light
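
For reference, the standard relations behind these two techniques (textbook formulas, not specific to the paper) are:

$$Z_{stereo} = \frac{f\,B}{d}, \qquad Z_{ToF} = \frac{c\,\Delta t}{2},$$

where $f$ is the focal length, $B$ the baseline between the two cameras, $d$ the disparity, $c$ the speed of light, and $\Delta t$ the measured round-trip time.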

The official account will follow up with a series of articles on 3D vision, introducing in detail the implementation of 3D vision technical solutions and their applications. You are welcome to follow along.

Overview

(a) Multimodal camera model. (b) RGB reference image. (c) ToF depth. (d) Depth of our method.

High-precision depth information is crucial for computational photography. Some smartphones carry a time-of-flight (ToF) sensor to measure depth, which together with several RGB cameras forms a multi-modal camera system. This paper discusses how to fuse RGB binocular depth with ToF depth in such a multi-modal camera system. This fusion helps overcome the limitations of consumer-grade ToF sensors, such as low resolution and the low power of the active laser light source.
The paper first proposes an online calibration method based on dense 2D/3D matching, which estimates the intrinsics, extrinsics, and distortion parameters of the RGB sensor with optical image stabilization from a single snapshot. It then fuses the binocular depth and the ToF depth through a correlation volume. To train the fusion network, a deep-learning-based neural reconstruction method is used to build a depth training dataset of real scenes. For evaluation, a test dataset is captured with a commercial high-power depth camera, and the proposed method is shown to achieve higher accuracy than existing approaches.

Background

ToF depth information brings significant improvement to mobile phone background blur

Advances in computational photography have enabled many new applications, such as 3D reconstruction, view synthesis, and augmented reality. For example, back-projecting high-resolution RGB images into three-dimensional space to place virtual objects yields convincing background blur effects. Capturing high-precision per-pixel depth is critical to these algorithms, so smartphones now ship with camera systems containing multiple sensors, including lenses with different focal lengths, ultra-wide-angle lenses, monochrome lenses, and time-of-flight (ToF) sensors. Phase-based ToF sensors provide depth information by measuring the time of flight of active infrared light with a gated infrared sensor. Two challenges currently stand in the way of obtaining high-precision per-pixel depth on mobile phones:

  • ToF sensors have orders of magnitude lower resolution than their companion RGB cameras. The resolution of smartphone RGB cameras has increased significantly in recent years, typically 12 to 64 megapixels, while ToF sensors are usually only 0.5 to 3 megapixels. It is therefore natural to fuse the ToF depth with the binocular depth computed from two RGB cameras to improve depth resolution. Fusion also helps overcome the low signal-to-noise ratio of the ToF signal caused by the weak, battery-powered laser light source. However, to fuse information from two or more modalities, we need to know precisely the relative poses of all sensors and lenses in the camera system. This introduces the second problem.

The image above demonstrates the impact of the OIS system on photography

  • With increasing RGB spatial resolution, smartphones are now generally equipped with optical image stabilization (OIS): when the gyroscope inside the OIS module detects external vibration, a stabilizer built around a floating lens compensates for the motion of the camera body, offsetting the image shift caused by camera shake so that imaging remains stable in shaky conditions. While the floating lens is compensating for motion, its pose currently cannot be measured or read out electronically. Therefore, in a camera system with an OIS lens, the floating lens must be calibrated online for every exposure in order for any fusion strategy to work.

This paper proposes a floating fusion algorithm that provides high-precision depth estimation from a camera system consisting of an OIS RGB lens, a wide-angle RGB camera, and a ToF sensor. Using ToF measurements and dense optical-flow matching between the RGB camera pair, an online calibration method builds 2D/3D correspondences for each snapshot and recovers the camera intrinsics, extrinsics, and lens distortion parameters at metric scale rather than only up to scale. This makes it suitable for dynamic camera conditions. Then, to fuse the multi-modal sensor information, a correlation volume integrating ToF and binocular RGB information is built, and a neural network predicts disparity from it to obtain depth. However, there are few large multi-modal datasets for training such a network, and synthetic data is expensive to create and suffers a domain gap with real data. We therefore capture real-world scenes from multiple views and use ToF depth data to supervise and optimize a Neural Radiance Field (NeRF). The resulting depth maps have lower noise and better detail than the depth sensor, providing high-quality training data. For validation, the paper builds a test dataset with a Kinect Azure and shows that the proposed method outperforms both traditional RGB-D imaging methods and other data-driven methods.

Method

We use an off-the-shelf Samsung Galaxy S20+ smartphone. The main camera is a 12-megapixel wide-angle RGB lens with optical image stabilization (OIS); the secondary cameras include a 12-megapixel 120° ultra-wide-angle RGB lens and a 48-megapixel telephoto lens; in addition there is a 3-megapixel ToF sensor with an infrared emitter.

In the camera system, the positions of the ultra-wide-angle lens and the ToF module are fixed, so an offline checkerboard-based calibration can be used to obtain their intrinsics $K_{UW}$ and $K_{ToF}$, extrinsics $[R|t]_{UW}$ and $[R|t]_{ToF}$, and lens distortion parameters. We used a checkerboard calibration target whose pattern has similar absorbance for both the ToF (infrared) and RGB (visible) spectra. However, the camera shakes between snapshots, and the floating main camera's OIS introduces lens movement in the x and y directions for stabilization and in the z direction for focus; the z motion also changes the lens distortion. The camera additionally tilts (pitch/yaw rotation) under gravity. As a result, the intrinsics of the floating main camera change with every snapshot, as does its relative pose to the other cameras. Therefore, we must estimate at every snapshot a new intrinsic matrix $K_{FM}$, a new extrinsic matrix $[R|t]_{FM}$, and three radial plus two tangential distortion coefficients $\{k_1, k_2, k_3, p_1, p_2\}_{FM}$. To address this challenge, the paper proposes a method for estimating these parameters at metric scale (rather than only up to scale).

ToF depth estimation

ToF phase decoding

The ToF system modulates its infrared light source with 20 MHz and 100 MHz square waves, alternating between the two frequencies. For each frequency, the sensor uses four-bucket sampling to obtain four samples $Q_{f,\theta}$ with phase delays of 0°, 90°, 180° and 270°. With two frequencies, we therefore get eight raw ToF measurements per snapshot (panel a above).

For each modulation frequency, we first compute the phase difference $\phi_f$ between the emitted and received signals (the measured, or raw, phase), and from this phase difference compute the distance $d_f$ from each pixel to the camera:
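
The equation images of the original post are not reproduced here; the standard four-bucket relations consistent with the quantities above (up to the sign convention of the sampling buckets) are:

$$\phi_f = \arctan\!\frac{Q_{f,270^\circ} - Q_{f,90^\circ}}{Q_{f,0^\circ} - Q_{f,180^\circ}}, \qquad A_f = \tfrac{1}{2}\sqrt{(Q_{f,0^\circ} - Q_{f,180^\circ})^2 + (Q_{f,90^\circ} - Q_{f,270^\circ})^2}, \qquad d_f = \frac{c}{4\pi f}\,(\phi_f + 2\pi k),$$

where $c$ is the speed of light, $f$ the modulation frequency, $A_f$ the signal amplitude used later for the confidence map, and $k$ the unknown integer wrap count discussed next.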

In a ToF system, the extracted phase is wrapped to one period; it is not the true phase, so the phase must be unwrapped to obtain real measurements, i.e. the integer $k$ in the formula above must be solved for. Meanwhile, the ToF depth error is inversely proportional to the modulation frequency: ToF cameras with higher modulation frequencies have lower depth noise and smaller depth errors, but a shorter wrapping range, meaning a shorter unambiguous measuring distance (100 MHz ≈ 1.5 m, 20 MHz ≈ 7.5 m). Since the infrared laser of the mobile ToF is low power, the signal returning from objects is weak, making phase estimation quite unreliable at the lower frequency. We thus have a coarse but unwrapped depth map (b above) and a finer but wrapped one (c above). We therefore use the 20 MHz phase to unwrap the 100 MHz phase and estimate a more accurate depth map.
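
As an illustration, here is a minimal numpy sketch of this dual-frequency unwrapping. The paper's exact rule is not shown in this repost; the sketch uses the textbook choice of the wrap count $k$ that brings the fine-frequency depth closest to the coarse one:

```python
import numpy as np

C_LIGHT = 2.998e8  # speed of light in m/s


def unwrap_100mhz(d_20, d_100_wrapped, f_hi=100e6):
    """Dual-frequency ToF unwrapping sketch (assumed form, not the paper's exact rule).

    The coarse-but-unambiguous 20 MHz depth picks the integer wrap count k that
    brings the fine-but-wrapped 100 MHz depth closest to it.
    """
    wrap_range = C_LIGHT / (2.0 * f_hi)           # ~1.5 m unambiguous range at 100 MHz
    k = np.round((d_20 - d_100_wrapped) / wrap_range)
    return d_100_wrapped + k * wrap_range         # unwrapped, lower-noise 100 MHz depth
```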

ToF depth confidence

From the estimated depth, we assign a confidence map $\omega$ based on the signal amplitude $A_f$, the consistency between $d_{20MHz}$ and $d_{100MHz}$, and local depth variation. First, we assign lower scores where the signal is weaker:

We also take into account the difference between the distances estimated at the two frequencies:

Since ToF is less reliable at depth discontinuities, we consider regions with larger depth gradients to be less reliable:

Ultimately, the full confidence score is:

where $\sigma_A = 20$, $\sigma_d = 0.05$, and $\sigma_\nabla = 20$.
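
For intuition, here is a hedged Python sketch of how such a per-pixel confidence map could be composed. The paper's exact equations are not reproduced in this repost, so Gaussian-style terms are assumed purely for illustration; only the three ingredients (amplitude, cross-frequency consistency, depth gradients) and the $\sigma$ values come from the text above:

```python
import numpy as np


def tof_confidence(amp_100, d_20, d_100, sigma_a=20.0, sigma_d=0.05, sigma_grad=20.0):
    """Illustrative ToF confidence map (assumed functional forms, not the paper's).

    Combines (i) signal amplitude, (ii) agreement between the two frequency
    depths, and (iii) local depth gradients, as described in the article.
    """
    # (i) weak amplitude -> low confidence
    w_amp = 1.0 - np.exp(-(amp_100 / sigma_a) ** 2)
    # (ii) disagreement between 20 MHz and 100 MHz depths -> low confidence
    w_cons = np.exp(-((d_20 - d_100) / sigma_d) ** 2)
    # (iii) large local depth gradients (discontinuities) -> low confidence
    gy, gx = np.gradient(d_100)
    w_grad = np.exp(-(np.hypot(gx, gy) / sigma_grad) ** 2)
    return w_amp * w_cons * w_grad
```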

Online Calibration

In order to calibrate the floating main camera, enough correspondences must be found between known 3D world points and their 2D projections. Even though the ToF sensor and the main camera do not share a spectral band, a way must be found to map ToF 3D points into the main camera. Our overall strategy is to exploit the known, fixed relationship between the ToF and the ultra-wide-angle camera, together with the 2D color correspondences given by optical flow between the ultra-wide-angle and the floating main camera. In this way, 2D coordinates in the main camera can be mapped to 2D coordinates in the ultra-wide-angle camera and then to 3D coordinates from the ToF camera. Although ToF can be noisy, it still provides enough points to reliably calibrate the intrinsics, extrinsics, and lens distortion parameters of the main camera.

From ToF to Ultrawide

The first step is to project the ToF depth $d_{ToF}$ into the ultra-wide-angle camera (a minimal projection sketch follows this list):

  1. Convert the ToF depth map into a ToF point cloud

  2. Transform the ToF point cloud into the ultra-wide-angle camera coordinate system to obtain an ultra-wide-angle point cloud

  3. Project the ultra-wide-angle point cloud onto the ultra-wide-angle image, finally obtaining the correspondence between each ToF depth point and its (sub-pixel) ultra-wide-angle RGB pixel coordinates
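
A hedged numpy sketch of these three steps, assuming simple pinhole models, z-depth maps, and no lens distortion (the real pipeline also applies the offline-calibrated distortion parameters):

```python
import numpy as np


def tof_to_ultrawide(depth_tof, K_tof, K_uw, R_tof_to_uw, t_tof_to_uw):
    """Project every ToF depth pixel into the ultra-wide-angle (UW) image.

    depth_tof: (H, W) z-depth map; K_tof, K_uw: 3x3 intrinsics;
    R_tof_to_uw, t_tof_to_uw: rigid transform ToF -> UW from offline calibration.
    Returns the UW-frame point cloud and its sub-pixel UW image coordinates.
    """
    h, w = depth_tof.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    # 1) back-project ToF pixels to a 3D point cloud in ToF coordinates
    pix_h = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix_h @ np.linalg.inv(K_tof).T
    pts_tof = rays * depth_tof.reshape(-1, 1)
    # 2) transform the point cloud into the UW camera frame
    pts_uw = pts_tof @ R_tof_to_uw.T + t_tof_to_uw
    # 3) project onto the UW image (sub-pixel coordinates)
    proj = pts_uw @ K_uw.T
    uv_uw = proj[:, :2] / proj[:, 2:3]
    return pts_uw, uv_uw
```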

From Ultrawide to Floating Main

The next step is to precisely calibrate the ultra-wide-angle camera against the floating main camera. Since the two cameras sit close to each other, their viewpoints are similar, so sparse scale- and rotation-invariant feature matching is unnecessary. Instead, this paper uses dense optical flow to find correspondences. To run optical flow, the ultra-wide-angle image is first undistorted and then epipolar-rectified (Rectify) using the offline calibration. Since the floating main camera's intrinsics and extrinsics keep changing, this rectification is not exact, but because optical flow is designed for small, unconstrained image-to-image correspondences, it still finds useful matches. As output we get a 2D matching vector field $F_{UW \rightarrow FM}$ between the two cameras.
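
A hedged stand-in for this step: the paper relies on a dense optical-flow estimator; classical Farneback flow from OpenCV is used here only to illustrate the role the flow field plays.

```python
import cv2


def dense_flow_uw_to_fm(img_uw_rect, img_fm_rect):
    """Dense 2D matching field F_{UW->FM} between the rectified image pair.

    flow[v, u] = (du, dv) tells where UW pixel (u, v) lands in the FM image.
    """
    g_uw = cv2.cvtColor(img_uw_rect, cv2.COLOR_BGR2GRAY)
    g_fm = cv2.cvtColor(img_fm_rect, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(g_uw, g_fm, None, 0.5, 5, 21, 3, 7, 1.5, 0)
```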

From ToF to Floating Main

After obtaining the ToF-to-ultra-wide correspondences and the ultra-wide-to-floating-main correspondences, we can now find, for each ToF depth point, the corresponding pixel in the floating main camera:
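
The corresponding equation image is not reproduced here; it is presumably just the composition of the ToF-to-ultra-wide projection with the flow field:

$$[u_{FM}, v_{FM}] = [u_{UW}, v_{UW}] + F_{UW \rightarrow FM}([u_{UW}, v_{UW}])$$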

After getting the 2D points of the floating main camera and the 3D points in the ultra-wide-angle camera space, the floating camera can be calibrated.

  1. Outliers are removed with RANSAC. Because the binocular system formed by the floating main camera and the ultra-wide-angle lens has a small baseline, RANSAC helps reject noise from occlusions, incorrect optical-flow estimates, and ToF errors
  2. The extrinsics $[R|t]_{FM \rightarrow UW}$ between the two RGB cameras, together with the main camera intrinsics $K_{FM}$ and its distortion coefficients, are solved with a PnP algorithm refined by LM (Levenberg-Marquardt) optimization (a hedged OpenCV sketch follows this list)
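
The sketch below illustrates these two steps with standard OpenCV primitives: RANSAC-PnP for outlier rejection, then a single-view `calibrateCamera` refinement (which runs LM internally) to recover intrinsics, distortion, and pose. It is an assumed stand-in, not the paper's own solver:

```python
import cv2
import numpy as np


def calibrate_floating_main(pts3d_uw, pts2d_fm, image_size, K_init):
    """Per-snapshot calibration of the floating main (FM) camera from 3D/2D matches.

    pts3d_uw: (N, 3) ToF points in UW camera space; pts2d_fm: (N, 2) FM pixels;
    image_size: (width, height); K_init: rough initial FM intrinsics.
    """
    pts3d = np.asarray(pts3d_uw, dtype=np.float64)
    pts2d = np.asarray(pts2d_fm, dtype=np.float64)
    dist0 = np.zeros(5)
    # 1) robustly reject bad correspondences (occlusions, bad flow, ToF noise)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K_init, dist0, reprojectionError=3.0)
    inl = inliers.ravel()
    # 2) refine intrinsics, distortion {k1,k2,p1,p2,k3} and extrinsics from this
    #    single view, starting from K_init (Levenberg-Marquardt inside OpenCV)
    rms, K_fm, dist_fm, rvecs, tvecs = cv2.calibrateCamera(
        [pts3d[inl].astype(np.float32)], [pts2d[inl].astype(np.float32)],
        image_size, K_init.copy(), dist0.copy(),
        flags=cv2.CALIB_USE_INTRINSIC_GUESS)
    return K_fm, dist_fm, rvecs[0], tvecs[0]
```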

In addition, although this work calibrates the floating RGB camera against the ToF by using the fixed-pose ultra-wide-angle RGB lens and ToF module as a bridge, a line of prior work has shown that directly calibrating the floating RGB camera against the ToF is also feasible: even though the two spectral domains (RGB/IR) differ, effective feature matching can still be performed. So the method would still work without a second, wide-angle lens.

Binocular/ToF depth fusion

Depth fusion baseline

Based on the work above, we now have the calibrated binocular RGB image pair and the ToF depth image; the next step is to fuse them into an accurate, high-resolution depth map.

Stereo Matching

Create a correlation volume $C_c$ from the two RGB images of the floating main camera and the ultra-wide-angle camera. A coordinate $[u, v, u']$ in $C_c$ encodes the correlation between pixel $[u, v]$ in the reference image and pixel $[u', v]$ in the other image at a certain disparity. The shape of the correlation volume is therefore (width × height × width), since disparity is distributed along the horizontal image direction. Specifically, we construct a 3D correlation volume by extracting 256-dimensional image features from each view with the RAFT-Stereo feature encoder and computing the dot product of the two feature maps. Note that feature extraction downsamples the image by a factor of four; the full-resolution disparity map is recovered through RAFT-Stereo's convex upsampling. For ease of evaluation, the fixed ultra-wide-angle camera serves as the reference image.
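
A minimal numpy sketch of the correlation-volume construction, assuming (C, H, W) feature maps (e.g. the 256-dimensional encoder outputs at 1/4 resolution); the normalization by $\sqrt{C}$ is an optional scaling, not necessarily the paper's:

```python
import numpy as np


def correlation_volume(feat_ref, feat_other):
    """Build the 3D correlation volume C_c from two (C, H, W) feature maps.

    C_c[v, u, u'] is the dot product between the reference feature at (u, v)
    and the other view's feature at (u', v), i.e. all candidate disparities
    along the same image row.
    """
    c, h, w = feat_ref.shape
    corr = np.einsum('chu,chw->huw', feat_ref, feat_other) / np.sqrt(c)
    return corr  # shape (H, W, W)
```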

Correlation Volume Refinement

Using the method above, we obtain for each ToF 3D point $P_{ToF}$ its 2D coordinates $[u_{UW}, v_{UW}]$ in the ultra-wide-angle camera and $[u_{FM}, v_{FM}]$ in the floating main camera, together with the confidence $\omega$. These ToF points are exploited by adding correlation at the cells of the correlation volume where they fall. Specifically, a ToF depth point corresponds to position $[u_{UW}, v_{UW}, u_{FM}]$ in the correlation volume. However, since the computed coordinates are at sub-pixel precision, the position corresponding to a ToF depth point has eight integer neighbors. The paper uses linear weights to inject the ToF depth points into the correlation volume computed by RAFT-Stereo; for one of the neighbor cells, the update is:

$\lfloor u \rfloor$ denotes rounding down and $\lceil u \rceil$ denotes rounding up. The same is done for all eight integer neighbors. If several ToF points affect the same cell of the correlation volume, their contributions accumulate. $\tau$ is a scalar parameter optimized during network training. The disparity is then estimated from the updated correlation volume. In this way, binocular information and ToF measurements are fused effectively and robustly to obtain more accurate depth.
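
A hedged Python sketch of this injection step; the exact additive term in the paper is not reproduced in this repost, so a confidence- and trilinearly-weighted boost $\tau$ is assumed:

```python
from itertools import product

import numpy as np


def inject_tof_into_corr(corr, uv_uw, u_fm, conf, tau=1.0):
    """Add ToF evidence into the correlation volume at sub-pixel positions.

    corr: (H, W, W) volume indexed as [v_UW, u_UW, u_FM];
    uv_uw: iterable of (u_UW, v_UW); u_fm: matching u_FM values; conf: omega.
    """
    H, W, _ = corr.shape
    for (u, v), uf, w in zip(uv_uw, u_fm, conf):
        base = np.array([v, u, uf])
        lo = np.floor(base).astype(int)
        frac = base - lo
        for offs in product((0, 1), repeat=3):   # the 8 integer neighbors
            idx = lo + np.array(offs)
            if np.any(idx < 0) or np.any(idx >= np.array([H, W, W])):
                continue
            # trilinear weight of this neighbor; contributions accumulate
            weight = np.prod(np.where(np.array(offs) == 1, frac, 1.0 - frac))
            corr[tuple(idx)] += tau * w * weight
    return corr
```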

Dataset generation

Background

Learning-based binocular/ToF fusion methods require training data. To this end, we optimize a Neural Radiance Field (NeRF) from multi-view RGB images. The optimized field lets us query density and color at every point in the scene, which in turn lets us render an estimated depth map: concretely, the color term $c_i$ in the NeRF volume rendering formula below is replaced by the depth of the sample point before rendering.
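
The formula image is not reproduced in this repost; the standard NeRF volume rendering equation referenced here, with per-sample colors $c_i$, densities $\sigma_i$, and intervals $\delta_i$, is

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, c_i, \qquad T_i = \exp\!\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),$$

and replacing $c_i$ with the sample depth $t_i$ gives the rendered depth $\hat{D}(\mathbf{r}) = \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\, t_i$.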

Design Choices and Our Approach

This paper uses mip-NeRF as the baseline for multi-view depth, which handles the resolution difference between the ToF and RGB cameras quite naturally. For depth supervision, a simple scheme is used, similar in spirit to VideoNeRF's use of inverse depth maps from multi-view 3D reconstruction: a loss term over depth samples is added to the optimization:
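
The loss image is not reproduced here; an $\omega$-weighted penalty on the rendered depth, of roughly the following form, is assumed (the exact norm, and whether depth or inverse depth is penalized, may differ in the paper):

$$\mathcal{L}_{depth} = \sum_{\mathbf{r}} \omega(\mathbf{r})\,\bigl\lVert \hat{D}(\mathbf{r}) - d_{ToF}(\mathbf{r}) \bigr\rVert^2$$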

Here $\omega$ is the ToF confidence defined above. In addition to this supervision, we use the 6D rotation representation in place of common 3D rotation parameterizations such as rotation matrices, Euler angles, and quaternions, and refine the extrinsics and intrinsics during optimization. Our training and validation datasets contain 8 scenes of roughly 100 snapshots each, with varying depth ranges, backgrounds, objects, and materials.

Experiments

Evaluation data

To evaluate our method, we built a real-world dataset with ground-truth depth using a Kinect Azure. Because this depth camera has a higher power budget, its noise is lower and its depth quality is much better than the phone's ToF module. After fixing the phone and the Azure Kinect on a joint mount, the phone's ToF module and ultra-wide-angle camera are each calibrated against the Kinect depth camera. Once calibrated, the Kinect depth map can be re-projected into the ultra-wide-angle camera and the phone's ToF camera for comparison. We captured 4 scenes for a total of 200 snapshots.

In addition to this RGB-D dataset, the paper also calibrates the floating main camera with the traditional offline checkerboard method to provide ground truth for the online-calibration evaluation on the four scenes.

Training data generation

Multiview depth estimation

The figure above shows that our method effectively preserves the fine structure of objects, and the table confirms that it effectively fuses ToF measurements with multi-view data. Although TöRF is designed to handle ToF input, it performs worse than our method when the ToF and RGB resolutions differ. Comparing CVD and RCVD, accuracy is observed to drop as the image size increases.

Final RGB-D Imaging

Calibration

The figure above compares against other calibration methods. Gao et al. calibrate two GoPro cameras and one Kinect RGB-D camera; in our comparison, the RGB-D camera is replaced by the ultra-wide-angle RGB plus ToF module and the GoPro cameras by the floating main camera. The calibration accuracy of Gao et al. is much lower than ours, even though their method uses the state-of-the-art feature matcher DFM. Furthermore, only our method is able to calibrate the camera intrinsics and distortion parameters.

Stereo/ToF Fusion

Fusion evaluation

We use the real-world RGB-D dataset to evaluate the proposed fusion method. In the first row, other RGB/ToF fusion methods are evaluated; since their stereo matching is not robust to noise and imaging artifacts, their fusion results are quite inaccurate. In the second row, we replace their less robust stereo matching with the state-of-the-art RAFT-Stereo. Because other calibration methods are less accurate, we apply our calibration to all of the above methods, which underscores the importance of our per-snapshot calibration. The results show that our fusion method outperforms existing methods. We also test whether OIS can simply be ignored: the main camera is calibrated offline with a checkerboard while the phone is stationary, and the phone is then moved to capture our test scenes. The results in the third row of Table 3 above show a significant drop in depth accuracy, so online calibration is both necessary and effective.

 ToF/stereo fusion results.

The illustration above shows that our method preserves edges and holes better while also remaining robust.

Dependence on equipment

For devices with short baselines such as smartphones, the method should generalize, since the short baseline permits accurate optical-flow estimation between the two RGB cameras. Furthermore, we show fusion results on datasets from different hardware: a ZED stereo camera paired with a Microsoft Kinect v2 ToF depth camera, and the LTTM5 setup with two calibrated BASLER scA1000 RGB cameras and a MESA SR4000 ToF camera.

Limitations

While our method is suitable for indoor environments, the reliance on ToF and binocular stereo prevents its application in certain scenarios. First, the ToF module cannot accurately estimate depth at large distances due to its low power. Second, ToF depth estimation is not reliable under strong infrared ambient light (such as direct sunlight). Since our calibration relies directly on ToF measurements, it becomes inaccurate if the ToF depth cannot be estimated. Furthermore, some materials, especially translucent or specular ones, are challenging for both ToF and binocular depth estimation and cannot be resolved by our fusion method.

Summary

Optical image stabilization lenses are common nowadays, but they create pose-estimation problems when fusing information from multiple sensors in a camera system, which limits our ability to estimate high-quality depth maps from a single snapshot. Our approach is designed for consumer-grade devices and targets indoor environments, enabling efficient calibration and robust sensor fusion. Since the method uses only one snapshot and does not rely on camera motion for pose estimation, acquisition is fast and works for dynamic scenes. Evaluated on real-world inputs, our method obtains more accurate depth maps than state-of-the-art ToF and binocular fusion methods.

Closing remarks

If you are also interested in the full stack of artificial intelligence and computer vision, we strongly recommend following the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, multi-domain, in-depth interpretations of cutting-edge papers and mature industrial solutions every day! Feel free to add the editor's WeChat: cv_huber, and let's discuss more interesting topics together!

Source: blog.csdn.net/CVHub/article/details/129741011