DEMM: Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection

1. Background

  • Monocular 3D perception methods, including detection and tracking, tend to perform poorly compared to lidar-based techniques. Through a systematic analysis, the accuracy of per-object depth estimation was found to be the main factor limiting their performance.

  • Among the properties estimated from the image, including rotation, size, depth, and amodal center, the per-object depth, i.e. the depth of the object's 3D center, was found to matter most for 3D detection accuracy.

  • Recent work on monocular 3D detection has mainly focused either on learning directly from raw RGB images, or on exploiting pseudo-lidar representations extracted from predicted dense depth maps.

  • These two representations may be complementary for estimating per-object depth, and learning from either one alone may be suboptimal:

    • RGB images encode appearance, texture, and 2D geometry, but contain no direct 3D information, so it is hard to learn an accurate mapping from RGB features to depth without fitting irrelevant cues.
    • Pseudo-lidar representations directly model the 3D structure of objects through the estimated dense depth map, which makes per-object depth easier to learn; however, the estimated dense depth map is usually noisy, typically with at least 8% average relative error (at 40 m, 8% is a 3.2 m error, comparable to a car's length).

2. Overall framework

[Figure: overview of the multi-stage fusion framework]

  • The figure above gives an overview of the multi-stage fusion framework for per-object depth estimation: first, 2D object detection and cross-frame tracking are performed to build a tracklet for each object; then, a pseudo-lidar representation of the object is constructed for each frame, together with RGB image features of the current frame; ego-motion compensation is applied to all pseudo-lidar patches of a tracklet so that they lie in the same coordinate system; finally, the current-frame RGB features and the temporally fused pseudo-lidar features are combined to produce a depth estimate for each object.
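
As a rough illustration of the final fusion step only, not the paper's actual architecture, the toy PyTorch module below concatenates a per-object RGB feature with a temporally pooled pseudo-lidar feature and regresses a scalar depth; the feature dimensions and the mean-pooling temporal fusion are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class FusionDepthHead(nn.Module):
    """Toy per-object depth head: fuse an RGB feature with a temporally pooled
    pseudo-lidar feature and regress a scalar depth (illustrative sketch only)."""

    def __init__(self, rgb_dim: int = 256, pl_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(rgb_dim + pl_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, rgb_feat: torch.Tensor, pl_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feat: (N, rgb_dim) current-frame RGB features, one row per object
        # pl_feats: (T, N, pl_dim) pseudo-lidar features of the same objects over
        #           T frames, already ego-motion compensated
        pl_fused = pl_feats.mean(dim=0)              # simple temporal fusion (assumption)
        return self.mlp(torch.cat([rgb_feat, pl_fused], dim=-1)).squeeze(-1)

# e.g. depths for 8 objects observed over 3 frames:
# FusionDepthHead()(torch.randn(8, 256), torch.randn(3, 8, 256))  -> shape (8,)
```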

  • The extraction process of the pseudo lidar representation consists of three steps:
    • (1) Dense depth estimation for each image
    • (2) Lift the predicted dense depth to pseudo-lidar
      • Unproject each pixel of the depth map into a 3D point using the camera model (see the sketch after this list)
    • (3) Extract pseudo lidar representation with neural network
      • The pseudo-lidar patch $P_t$ of object $b_t$ at timestamp $t$ is cropped according to the 2D bounding box, where $P_t$ is the set of pseudo-lidar points inside box $b_t$.
      • A separate feature encoder $F_p$ extracts the pseudo-lidar feature $P_L$ of object $b_t$.
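
A minimal NumPy sketch of step (2) and of the cropping in step (3): the standard pinhole unprojection that lifts a depth map to a pseudo-lidar point cloud, and the 2D-box crop that yields the per-object patch $P_t$. The function names and interfaces are made up for illustration; the feature encoder $F_p$ (e.g. PatchNet) is not shown.

```python
import numpy as np

def depth_to_pseudo_lidar(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift a dense depth map (H, W) into a pseudo-lidar point cloud (H*W, 3)
    in camera coordinates using the pinhole model with intrinsics K (3x3)."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]   # x = (u - cx) * z / fx
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]   # y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def crop_pseudo_lidar_patch(points: np.ndarray, image_size, box) -> np.ndarray:
    """Keep the points whose source pixels fall inside a 2D box (x1, y1, x2, y2);
    this corresponds to the per-object pseudo-lidar patch P_t of box b_t."""
    H, W = image_size
    idx = np.arange(H * W)
    u, v = idx % W, idx // W                      # pixel coords of each point (row-major)
    x1, y1, x2, y2 = box
    mask = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return points[mask]
```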

3. Temporal fusion with ego-motion compensation based on the pseudo-lidar representation

  • Motivation
    • A simple approach would be to directly fuse image features across frames; however, this may not be ideal, because RGB features couple camera ego-motion and object motion together, and it is difficult to learn motion and temporal consistency from 2D image sequences.
    • For effective temporal fusion of depth estimates, camera motion must be compensated so that features from different frames lie in the same coordinate system. Fortunately, the ego-motion of the camera can easily be compensated in 3D space with a pseudo-lidar representation, hence the proposed temporal fusion with ego-motion compensation based on the pseudo-lidar representation.
    • My guess: the ego vehicle's motion is estimated from the pseudo point clouds of consecutive frames; after compensating for this motion, the pseudo-lidar patches of the other vehicles (obtained via tracking) are cropped from each frame and then fused with the RGB features; see the sketch below.
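
A generic SE(3) change-of-frame sketch of the ego-motion compensation described above, assuming camera-to-world poses for the two timestamps are available; how the poses are obtained (odometry, GPS/INS, or estimation from the point clouds as guessed above) is outside this sketch and may differ from the paper.

```python
import numpy as np

def compensate_ego_motion(points_t: np.ndarray,
                          pose_t: np.ndarray,
                          pose_cur: np.ndarray) -> np.ndarray:
    """Transform a pseudo-lidar patch from the camera frame at time t into the
    camera frame of the current timestamp.
    points_t : (N, 3) points in the camera frame at time t
    pose_t, pose_cur : (4, 4) camera-to-world poses of the two frames."""
    T = np.linalg.inv(pose_cur) @ pose_t                        # cam_cur <- world <- cam_t
    pts_h = np.hstack([points_t, np.ones((len(points_t), 1))])  # homogeneous coordinates
    return (pts_h @ T.T)[:, :3]
```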

4. Network details

  • RGB feature extraction uses CenterNet and CenterTrack:
    • Objects as points
    • Tracking objects as points
  • Pseudo-lidar feature extraction uses PatchNet
  • 2D detections are linked across frames to form tracklets, using a Kalman-filter-based tracker (a minimal association sketch follows this list):
    • Simple online and realtime tracking
    • The authors note that a more advanced tracker might bring further improvements:
      • Fairmot: On the fairness of detection and re-identification in multiple object tracking
      • Soda: Multi-object tracking with soft data association
      • Towards real-time multi-object tracking
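
As a rough illustration of the SORT-style association step only (the per-track Kalman filter prediction and track management are omitted), a minimal IoU plus Hungarian matching sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate(track_boxes, det_boxes, iou_thresh=0.3):
    """Hungarian matching on IoU between track boxes predicted for the current
    frame and new detections; returns matched (track_idx, det_idx) pairs."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_thresh]
```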

Source: blog.csdn.net/qq_35759272/article/details/132567900