Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather

Abstract:

The fusion of multimodal sensor streams, such as camera, lidar, and radar measurements, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. While existing methods exploit redundant information in good environmental conditions, they fail in adverse weather where the sensory streams can be asymmetrically distorted. These rare "edge-case" scenarios are not represented in available datasets, and existing fusion architectures are not designed to handle them. To address this challenge we present a novel multimodal dataset acquired in over 10,000 km of driving in northern Europe. Although this dataset is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar, and gated NIR sensors, it does not facilitate training as extreme weather is rare. To this end, we present a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. Departing from proposal-level fusion, we propose a single-shot model that adaptively fuses features, driven by measurement entropy. We validate the proposed method, trained on clean data, on our extensive validation dataset. Code and data are available at https://github.com/princeton-computational-imaging/SeeingThroughFog.

1. Introduction

Object detection is a fundamental computer vision problem in autonomous robots, including self-driving vehicles and autonomous drones. Such applications require 2D or 3D bounding boxes of scene objects in challenging real-world scenarios, including complex cluttered scenes, highly varying illumination, and adverse weather conditions. The most promising autonomous vehicle systems rely on redundant inputs from multiple sensor modalities [58, 6, 73], including camera, lidar, radar, and emerging sensors such as FIR [29]. A growing body of work on object detection using convolutional neural networks has enabled accurate 2D and 3D box estimation from such multimodal data, typically relying on camera and lidar data [64, 11, 56, 71, 66, 42, 35]. While these existing methods, and the autonomous systems that perform decision making on their outputs, perform well under normal imaging conditions, they fail in adverse weather and imaging conditions. This is because existing training datasets are biased towards clear weather conditions, and detector architectures are designed to rely only on the redundant information in the undistorted sensory streams. However, they are not designed for harsh scenarios that distort the sensor streams asymmetrically, see Figure 1. Extreme weather conditions are statistically rare. For example, thick fog is observable only during 0.01% of typical driving in North America, and even in foggy regions, dense fog with visibility below 50 m occurs only up to 15 times a year [61]. Figure 2 shows the distribution of real driving data acquired over four weeks in Sweden, covering 10,000 km driven in winter conditions. The naturally biased distribution validates that harsh weather scenarios are only rarely, or not at all, represented in available datasets [65, 19, 58]. Unfortunately, domain adaptation methods [44, 28, 41] also do not offer an ad-hoc solution as they require target samples, and adverse weather-distorted data are underrepresented in general. Moreover, existing methods are limited to image data and do not extend to multisensor data such as lidar point clouds.

Existing fusion methods have been proposed mostly for lidar-camera setups [64, 11, 42, 35, 12], as a result of the limited sensor inputs in existing training datasets [65, 19, 58]. These methods do not only struggle with sensor distortions in adverse weather due to the bias of the training data. Either they perform late fusion through filtering after independently processing the individual sensor streams [12], or they fuse proposals [35] or high-level feature vectors [64]. The network architecture of these approaches is designed with the assumption that the data streams are consistent and redundant, i.e. an object appearing in one sensory stream also appears in the other. However, in harsh weather conditions, such as fog, rain, snow, or extreme lighting conditions, including low-light or low-reflectance objects, multimodal sensor configurations can fail asymmetrically. For example, conventional RGB cameras provide unreliable noisy measurements in low-light scene areas, while scanning lidar sensors provide reliable depth using active illumination. In rain and snow, small particles affect the color image and lidar depth estimates equally through backscatter. Conversely, in foggy or snowy conditions, state-of-the-art pulsed lidar systems are restricted to less than 20 m range due to backscatter, see Figure 3. While relying on lidar measurements might be a solution for night driving, it is not for adverse weather conditions.

In this work, we propose a multimodal fusion method for object detection in adverse weather, including fog, snow, and harsh rain, without having large annotated training datasets available for these scenarios. Specifically, we handle asymmetric measurement corruptions in camera, lidar, radar, and gated NIR sensor streams by departing from existing proposal-level fusion methods: we propose an adaptive single-shot deep fusion architecture which exchanges features in intertwined feature extractor blocks. This deep early fusion is steered by measured entropy. The proposed adaptive fusion allows us to learn models that generalize across scenarios. To validate our approach, we address the bias in existing datasets by introducing a novel multimodal dataset acquired over three months in northern Europe. This dataset is the first large multimodal driving dataset in adverse weather, with 100k labels for lidar, camera, radar, gated NIR sensor, and FIR sensor. Although the weather bias still prohibits training, this data allows us to validate that the proposed method generalizes robustly to unseen weather conditions with asymmetric sensor corruptions, while being trained on clean data. Specifically, we make the following contributions:

• We introduce a multimodal adverse weather dataset covering camera, lidar, radar, gated NIR, and FIR sensor data. The dataset contains rare scenarios, such as heavy fog, heavy snow, and severe rain, captured during more than 10,000 km of driving in northern Europe.

• We propose a deep multimodal fusion network which departs from proposal-level fusion and instead fuses adaptively, driven by measurement entropy.

• We assess the model on the proposed dataset, validating that it generalizes to unseen asymmetric distortions. The approach outperforms state-of-the-art fusion methods by more than 8% AP in hard scenarios independent of weather, including light fog, dense fog, snow, and clear conditions, and it runs in real time.

2. Related Work

Detection in Adverse Weather Conditions Over the last decade, seminal work on automotive datasets [5, 14, 19, 16, 65, 9] has provided a fertile ground for automotive object detection [11, 8, 64, 35, 40, 20], depth estimation [18, 39, 21], lane detection [26], traffic-light detection [32], road scene segmentation [5, 2], and end-to-end driving models [4, 65]. Although existing datasets fuel this research area, they are biased towards good weather conditions due to geographic location [65] and captured season [19], and thus lack the severe distortions introduced by rare fog, severe snow, and rain. A number of recent works explore camera-only approaches in such adverse conditions [51, 7, 1]. However, these datasets are very small, with less than 100 captured images [51], and limited to camera-only vision tasks. In contrast, existing autonomous driving applications rely on multimodal sensor stacks, including camera, radar, lidar, and emerging sensors such as gated NIR imaging [22, 23], and have to be evaluated on thousands of hours of driving. In this work, we fill this gap and introduce a large-scale evaluation set in order to develop a fusion model for such multimodal inputs that is robust to unseen distortions.

Data Preprocessing in Adverse Weather A large body of work explores methods for the removal of sensor distortions before processing. Especially fog and haze removal from conventional intensity image data has been explored extensively [67, 70, 33, 53, 36, 7, 37, 46]. Fog results in a distance-dependent loss in contrast and color. Fog removal has not only been suggested for display applications [25], it has also been proposed as preprocessing to improve the performance of downstream semantic tasks [51]. Existing fog and haze removal methods rely on scene priors on the latent clear image and depth to solve the ill-posed recovery. These priors are either hand-crafted [25] and used for depth and transmission estimation separately, or they are learned jointly as part of trainable end-to-end models [37, 31, 72]. Existing methods for fog and visibility estimation [57, 59] have been proposed for camera driver-assistance systems. Image restoration approaches have also been applied to deraining [10] or deblurring [36].

Domain Adaptation Another line of research tackles the shift of unlabeled data distributions by domain adaptation [60, 28, 50, 27, 69, 62]. Such methods could be applied to adapt clear labeled scenes to demanding adverse weather scenes [28] or through the adaptation of feature representations [60]. Unfortunately, both of these approaches struggle to generalize because, in contrast to existing domain transfer settings, weather-distorted data in general, not only labeled data, is underrepresented. Moreover, existing methods do not handle multimodal data.

Multisensor Fusion Multisensor feeds in autonomous vehicles are typically fused to exploit varying cues in the measurements [43], simplify path planning [15], allow for redundancy in the presence of distortions [47], or solve joint vision tasks, such as 3D object detection [64]. Existing sensing systems for fully autonomous driving include lidar, camera, and radar sensors. As large automotive datasets [65, 19, 58] cover limited sensory inputs, existing fusion methods have been proposed mostly for lidar-camera setups [64, 55, 11, 35, 42]. Methods such as AVOD [35] and MV3D [11] incorporate multiple views from camera and lidar to detect objects. They rely on the fusion of pooled regions of interest and hence perform late feature fusion following popular region proposal architectures [49]. In a different line of research, Qi et al. [48] and Xu et al. [64] propose pipeline models that require a valid detection output for the camera image and a 3D feature vector extracted from the lidar point cloud. Kim et al. [34] propose a gating mechanism for camera-lidar fusion. In all existing methods, the sensor streams are processed separately in the feature extraction stage, and we show that this prohibits learning redundancies and, in fact, performs worse than a single sensor stream in the presence of asymmetric measurement distortions.

3. Multimodal Adverse Weather Dataset

To assess object detection in adverse weather, we have acquired a large-scale automotive dataset providing 2D and 3D detection bounding boxes for multimodal data with a fine classification of weather, illumination, and scene type in rare adverse weather situations. Table 1 compares our dataset to recent large-scale automotive datasets, such as the Waymo [58], NuScenes [6], KITTI [19] and BDD [68] datasets. In contrast to [6] and [68], our dataset contains experimental data not only in light weather conditions but also in heavy snow, rain, and fog. A detailed description of the annotation procedures and label specifications is given in the supplemental material. With this cross-weather annotation of multimodal sensor data and broad geographical sampling, it is the only existing dataset that allows for the assessment of our multimodal fusion approach. In the future, we envision researchers developing and evaluating multimodal fusion methods in weather conditions not covered in existing datasets.

In Figure 2, we plot the weather distribution of the proposed dataset. The statistics were obtained by manually annotating all synchronized frames at a frame rate of 0.1 Hz. We guided human annotators to distinguish light from dense fog when the visibility fell below 1 km [45] and 100 m, respectively. If fog occurred together with precipitation, the scenes were labeled as either snowy or rainy depending on the environmental road conditions. For our experiments, we combined snow and rainy conditions. Note that the statistics validate the rarity of scenes in heavy adverse weather, which is in agreement with [61] and demonstrates the difficulty and critical nature of obtaining such data in the assessment of truly self-driving vehicles, i.e. without the interaction of remote operators outside of geo-fenced areas. We found that extreme adverse weather conditions occur only locally and change very quickly.

The individual weather conditions result in asymmetrical perturbations of the various sensor technologies, leading to asymmetric degradation, i.e. instead of all sensor outputs being affected uniformly by a deteriorating environmental condition, some sensors degrade more than others, see Figure 3. For example, conventional passive cameras perform well in daytime conditions, but their performance degrades in night-time conditions or challenging illumination settings such as low sun illumination. Meanwhile, active scanning sensors such as lidar and radar are less affected by ambient light changes due to active illumination and a narrow bandpass on the detector side. On the other hand, active lidar sensors are highly degraded by scattering media such as fog, snow, or rain, limiting the maximal perceivable distance to 25 m at fog visibilities below 50 m, see Figure 3. Millimeter-wave radar does not scatter strongly in fog [24], but currently provides only low azimuthal resolution. Recent gated imagers have shown robust perception in adverse weather [23] and provide high spatial resolution, but lack color information compared to standard imagers. With these sensor-specific weaknesses and strengths, multimodal data can be crucial for robust detection methods.

3.1. Multimodal Sensor Setup

For acquisition we have equipped a test vehicle with sensors covering the visible, mm-wave, NIR, and FIR bands, see Figure 2. We measure intensity, depth, and weather condition.

Stereo Camera As visible-wavelength RGB cameras, we use a stereo pair of two front-facing high-dynamic-range automotive RCCB cameras, consisting of two On-Semi AR0230 imagers with a resolution of 1920×1024, a baseline of 20.3 cm and 12 bit quantization. The cameras run at 30 Hz and are synchronized for stereo imaging. Using Lensagon B5M8018C optics with a focal length of 8 mm, a field of view of 39.6°×21.7° is obtained.
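
The stated field of view is consistent with the pinhole relation FoV = 2·arctan(d / 2f) applied to the active sensor area. The short check below assumes the AR0230's 3.0 µm pixel pitch, which is not given in the text.

```python
import math

def fov_deg(pixels: int, pixel_pitch_m: float, focal_length_m: float) -> float:
    """Pinhole field of view: 2 * arctan(sensor_extent / (2 * focal_length))."""
    extent = pixels * pixel_pitch_m
    return math.degrees(2.0 * math.atan(extent / (2.0 * focal_length_m)))

# Assumed 3.0 um pixel pitch for the AR0230; 8 mm focal length from the text.
print(fov_deg(1920, 3.0e-6, 8.0e-3))  # ~39.6 deg horizontal
print(fov_deg(1024, 3.0e-6, 8.0e-3))  # ~21.7 deg vertical
```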

Gated Camera We capture gated images in the NIR band at 808 nm using a BrightwayVision BrightEye camera operating at 120 Hz with a resolution of 1280×720 and a bit depth of 10 bit. The camera provides a similar field of view as the stereo camera with 31.1°×17.8°. Gated imagers rely on time-synchronized camera and flood-lit flash laser sources [30]. The laser emits a variable narrow pulse, and the camera captures the laser echo after an adjustable delay. This enables a significant reduction of backscatter from particles in adverse weather [3]. Furthermore, the high imager speed enables capturing multiple overlapping slices with different range-intensity profiles, encoding extractable depth information in between multiple slices [23]. Following [23], we capture 3 broad slices for depth estimation and additionally 3-4 narrow slices together with their passive correspondence at a system sampling rate of 10 Hz.
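
The delay-and-gate principle maps exposure timing to a range slice. The following is a simplified rectangular-pulse model with illustrative timings, not the parameters used for this dataset.

```python
C = 299_792_458.0  # speed of light in m/s

def gate_range(delay_s: float, gate_s: float, pulse_s: float) -> tuple[float, float]:
    """Approximate range slice seen by a gated imager with a rectangular laser pulse:
    returns (closest, farthest) distance that contributes to the gated exposure."""
    r_min = C * delay_s / 2.0                       # echoes returning before the gate opens are rejected
    r_max = C * (delay_s + gate_s + pulse_s) / 2.0  # latest echo still overlapping the gate
    return r_min, r_max

# Example: 200 ns delay, 200 ns gate, 100 ns pulse -> roughly a 30-75 m slice.
print(gate_range(200e-9, 200e-9, 100e-9))
```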

Radar For radar sensing, we use a proprietary frequency-modulated continuous wave (FMCW) radar at 77 GHz with 1° angular resolution and distances up to 200 m. The radar provides position-velocity detections at 15 Hz.

Lidar On the roof of the car, we mount two laser scanners from Velodyne, namely the HDL64 S3D and the VLP32C. Both operate at 903 nm and can provide dual returns (strongest and last) at 10 Hz. While the Velodyne HDL64 S3D provides 64 equally distributed scanning lines with an angular resolution of 0.4°, the Velodyne VLP32C offers 32 non-linearly distributed scanning lines. The HDL64 S3D and VLP32C scanners achieve ranges of 100 m and 120 m, respectively.

FIR Camera Thermal images are captured with an Axis Q1922 FIR camera at 30 Hz. The camera offers a resolution of 640×480 with a pixel pitch of 17 µm and a noise equivalent temperature difference (NETD) < 100 mK.

Environmental Sensors We measure environmental information with an Airmar WX150 weather station that provides temperature, wind speed, and humidity, and a proprietary road friction sensor. All sensors are time-synchronized and ego-motion corrected using a proprietary inertial measurement unit (IMU). The system provides a sampling rate of 10 Hz.

3.2. Recordings

Real-world Recordings All experimental data has been captured during two test drives in February and December 2019 in Germany, Sweden, Denmark, and Finland, for two weeks each, covering a distance of 10,000 km under different weather and illumination conditions. A total of 1.4 million frames were collected at a frame rate of 10 Hz. Every 100th frame was manually labeled to balance scene-type coverage. The resulting annotations contain 5.5k clear weather frames, 1k captures in dense fog, 1k captures in light fog, and 4k captures in snow/rain. Given the extensive capture effort, this demonstrates that training data in harsh conditions is rare. We tackle this problem by training only on clear data and testing on adverse data. The train and test regions do not have any geographic overlap. Instead of partitioning by frame, we partition our dataset based on independent recordings (5-60 min in length) from different locations. These recordings originate from 18 different major cities illustrated in Figure 2 and several smaller cities along the route.
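
A minimal sketch of the recording-level split described above. The frame records and the "recording_id" key are hypothetical; the point is only that frames from one recording never end up in both train and test.

```python
from collections import defaultdict

def split_by_recording(frames, test_recordings):
    """Partition labeled frames by their source recording (not by frame),
    so train and test never share a recording or location."""
    by_rec = defaultdict(list)
    for frame in frames:                      # frame: dict with a "recording_id" key (assumed)
        by_rec[frame["recording_id"]].append(frame)
    train, test = [], []
    for rec_id, rec_frames in by_rec.items():
        (test if rec_id in test_recordings else train).extend(rec_frames)
    return train, test
```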

Controlled Condition Recordings To collect image and range data under controlled conditions, we also provide measurements acquired in a fog chamber. Details on the fog chamber setup can be found in [17, 13]. We have captured 35k frames at a frame rate of 10 Hz and labeled a subset of 1.5k frames under two different illumination conditions (day/night) and three fog densities with meteorological visibilities V of 30 m, 40 m and 50 m. Details are given in the supplemental material, where we also compare to a simulated dataset using the forward model from [51].

4. Adaptive Deep Fusion

In this section, we describe the proposed adaptive deep fusion architecture that allows for multimodal fusion in the presence of unseen asymmetric sensor distortions. We devise our architecture under the real-time processing constraints required for self-driving vehicles and autonomous drones. Specifically, we propose an efficient single-shot fusion architecture.

4.1. Adaptive Multimodal Single-Shot Fusion

The proposed network architecture is shown in Figure 4. It consists of multiple single-shot detection branches, each analyzing one sensor modality.

Data Representation The camera branch uses conventional three-plane RGB inputs, while for the lidar and radar branches we depart from recent bird's-eye-view (BeV) projection [35] schemes or raw point-cloud representations [64]. BeV projection or point-cloud inputs do not allow for deep early fusion as the feature representations in the early layers are inherently different from the camera features. Hence, existing BeV fusion methods can only fuse features in a lifted space, after matching region proposals, but not earlier. Figure 4 visualizes the proposed input data encoding, which aids deep multimodal fusion. Instead of using a naive depth-only input encoding, we provide depth, height, and pulse intensity as input to the lidar network. For the radar network, we assume that the radar scans in a 2D plane orthogonal to the image plane and parallel to the horizontal image dimension. Hence, we consider radar invariant along the vertical image axis and replicate the scan along the vertical axis. Gated images are transformed into the image plane of the RGB camera using a homography mapping, see supplemental material. The proposed input encoding allows for a position- and intensity-dependent fusion with pixel-wise correspondences between the different streams. We encode missing measurement samples with zero values.
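
A minimal sketch of this image-plane encoding under simplified assumptions: the lidar points are already expressed in camera coordinates, a 3×3 intrinsic matrix K projects them, and pixels without a return stay zero. The function and argument names are illustrative, not the authors' code.

```python
import numpy as np

def encode_lidar(points_xyz, intensity, K, height=1024, width=1920):
    """Project lidar returns into the camera image plane and build a
    3-channel (depth, height, intensity) image; empty pixels remain zero."""
    enc = np.zeros((3, height, width), dtype=np.float32)
    z = points_xyz[:, 2]
    valid = z > 0.0                                   # keep points in front of the camera
    pts, inten = points_xyz[valid], intensity[valid]
    uvw = (K @ pts.T).T                               # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, pts, inten = u[inside], v[inside], pts[inside], inten[inside]
    enc[0, v, u] = pts[:, 2]                          # depth channel
    enc[1, v, u] = pts[:, 1]                          # height channel (camera-frame y, sign convention assumed)
    enc[2, v, u] = inten                              # pulse intensity channel
    return enc
```

A radar scan could be encoded analogously and then tiled along the vertical image axis, since the text treats radar as invariant in that direction.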

Feature Extraction As the feature extraction stack in each stream, we use a modified VGG [54] backbone. Similar to [35, 11], we reduce the number of channels by half and cut the network at the conv4 layer. Inspired by [40, 38], we use six feature layers from conv4-10 as input to the SSD detection layers. The feature maps decrease in size, implementing a feature pyramid for detections at different scales. As shown in Figure 4, the activations of different feature extraction stacks are exchanged. To steer the fusion towards the most reliable information, we provide the sensor entropy to each feature exchange block. We first convolve the entropy, apply a sigmoid, multiply with the concatenated input features from all sensors, and finally concatenate the input entropy. The folding of the entropy and application of the sigmoid generates a multiplication matrix in the interval [0, 1]. This scales the concatenated features for each sensor individually based on the available information. Regions with low entropy can be attenuated, while entropy-rich regions can be amplified in the feature extraction. Doing so allows us to adaptively fuse features in the feature extraction stack itself, which we motivate in depth in the next section.
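
The exchange block described above can be sketched roughly as follows in PyTorch. Kernel size, channel counts, and the exact placement inside the VGG stack are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class EntropyFusionBlock(nn.Module):
    """Sketch of an entropy-steered feature exchange block: convolve the per-sensor
    entropy maps, squash to [0, 1] with a sigmoid, scale the concatenated features,
    and append the entropy again so later blocks can reuse it."""
    def __init__(self, feat_channels: int, num_sensors: int):
        super().__init__()
        self.entropy_conv = nn.Conv2d(num_sensors, num_sensors * feat_channels,
                                      kernel_size=3, padding=1)

    def forward(self, features: list, entropy: torch.Tensor) -> torch.Tensor:
        # features: one (B, C, H, W) tensor per sensor; entropy: (B, num_sensors, H, W)
        fused = torch.cat(features, dim=1)                # (B, num_sensors * C, H, W)
        gate = torch.sigmoid(self.entropy_conv(entropy))  # per-pixel weights in [0, 1]
        fused = fused * gate                              # attenuate low-entropy regions
        return torch.cat([fused, entropy], dim=1)         # re-attach entropy for the next block
```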

4.2. Entropy-steered Fusion

To steer the deep fusion towards redundant and reliable information, we introduce an entropy channel in each sensor stream, instead of directly inferring the adverse weather type and strength as in [57, 59]. We estimate the local measurement entropy, calculated for each 8-bit binarized stream I with pixel values i ∈ [0, 255] in the proposed image-space data representation. Each stream is split into patches of size M×N = 16 px × 16 px, resulting in a w×h = 1920 px × 1024 px entropy map. The multimodal entropy maps for two different scenarios are shown in Figure 5: the left scenario shows a scene containing a vehicle, cyclist, and pedestrians in a controlled fog chamber. The passive RGB camera and lidar suffer from backscatter and attenuation with decreasing fog visibilities, while the gated camera suppresses backscatter through gating. Radar measurements are also not substantially degraded in fog. The right scenario in Figure 5 shows a static outdoor scene under varying ambient lighting. In this scenario, active lidar and radar are not affected by changes in ambient illumination. For the gated camera, the ambient illumination disappears, leaving only the actively illuminated areas, while the passive RGB camera degenerates with decreasing ambient light.
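
Local Shannon entropy over 16×16 patches can be computed as below. This is a straightforward reading of the description (one histogram per patch, broadcast back to full resolution), not the authors' exact implementation.

```python
import numpy as np

def local_entropy(stream_8bit: np.ndarray, patch: int = 16) -> np.ndarray:
    """Per-patch Shannon entropy of an 8-bit single-channel image, broadcast back
    to a full-resolution entropy map (one value repeated over each 16x16 patch)."""
    h, w = stream_8bit.shape
    ent = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = stream_8bit[y:y + patch, x:x + patch]
            counts = np.bincount(tile.ravel(), minlength=256).astype(np.float64)
            p = counts / counts.sum()
            p = p[p > 0]                                  # ignore empty histogram bins
            ent[y:y + patch, x:x + patch] = -(p * np.log2(p)).sum()
    return ent
```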

The steering process is learned purely on clean weather data, which contains different illumination settings, from daytime to night-time conditions. No real adverse weather patterns are presented during training. Further, we drop sensor streams randomly with probability 0.5 and set their entropy to a constant zero value.
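
A hedged sketch of that sensor-dropout augmentation. It assumes the encoded inputs and entropy maps are stored per sensor in dictionaries, which is an illustrative choice rather than the paper's implementation.

```python
import random

def drop_sensors(inputs: dict, entropies: dict, p: float = 0.5):
    """Randomly zero out whole sensor streams during training and set the
    corresponding entropy map to zero, so the fusion learns to ignore them."""
    for name in list(inputs):
        if random.random() < p:
            inputs[name] = inputs[name] * 0
            entropies[name] = entropies[name] * 0
    return inputs, entropies
```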

4.3. Loss Functions and Training Details

The number of anchor boxes in different feature layers and their sizes play an important role during training and are given in the supplemental material. In total, each anchor box with class label y_i and probability p_i is trained using the cross-entropy loss with softmax. The loss is split up for positive and negative anchor boxes with a matching threshold of 0.5. For each positive anchor box, the bounding box coordinates x are regressed using a Huber loss H(x). The total number of negative anchors is restricted to 5× the number of positive examples using hard example mining [40, 52]. All networks are trained from scratch with a constant learning rate and an L2 weight decay of 0.0005.
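
The equations themselves did not survive extraction; for reference, the standard forms consistent with the description (softmax cross-entropy over anchor classifications and a Huber penalty on each box-regression residual) are given below. The threshold δ is a generic parameter, not a value stated in the text.

```latex
% Softmax cross-entropy over anchor classifications (standard form)
L_{\mathrm{cls}} = -\sum_{i} y_i \log p_i ,\qquad
p_i = \frac{\exp(s_i)}{\sum_{j}\exp(s_j)}

% Huber (smooth-L1) penalty on a box-regression residual x (standard form)
H(x) =
\begin{cases}
\tfrac{1}{2}x^{2}, & |x| \le \delta,\\
\delta\left(|x| - \tfrac{1}{2}\delta\right), & |x| > \delta.
\end{cases}
```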

5. Assessment

In this section, we validate the proposed fusion model on unseen experimental test data. We compare the method against existing detectors for single sensory inputs and fusion methods, as well as domain adaptation methods. Due to the weather bias of training data acquisition, we only use the clear weather portion of the proposed dataset for training. We assess the detection performance using our novel multimodal weather dataset as a test set, see the supplemental material for test and training split details.

We validate the proposed approach, which we dub Deep Entropy Fusion, in Table 2 on real adverse weather data. We report Average Precision (AP) for three different difficulty levels (easy, moderate, hard) and evaluate on cars following the KITTI evaluation framework [19] at various fog densities, snow disturbances, and clear weather conditions. We compare the proposed model against recent state-of-the-art lidar-camera fusion models, including AVOD-FPN [35], Frustum PointNets [48], and variants of the proposed method with alternative fusion or sensory inputs. As baseline variants, we implement two fusion and four single-sensor detectors. In particular, we compare against late fusion with image, lidar, gated, and radar features concatenated just before the bounding-box regression (Fusion SSD), and early fusion by concatenating all sensory data at the very beginning of one feature extraction stack (Concat SSD). The Fusion SSD network shares the same structure as the proposed model, but without the feature exchange and the adaptive fusion layer. Moreover, we compare the proposed model against an identical SSD branch with a single sensory input (Image-only SSD, Gated-only SSD, Lidar-only SSD, Radar-only SSD). All models were trained with identical hyper-parameters and anchors.

Evaluated on adverse weather scenarios, the detection performance decreases for all methods. Note that assessment metrics can increase simultaneously as the scene complexity changes between the weather splits. For example, when fewer vehicles participate in road traffic or the distance between vehicles increases in icy conditions, fewer vehicles are occluded. While the performance for image and gated data is almost steady, it decreases substantially for lidar data while it increases for radar data. The decrease in lidar performance can be explained by the strong backscatter, see the Supplemental Material. As a maximum of 100 measurement targets limits the performance of the radar input, the reported improvements result from simpler scenes.

Overall, the large reduction in lidar performance in foggy conditions causes the lidar-only detection rate to drop by 45.38% AP. Furthermore, it also has a strong impact on the camera-lidar fusion models AVOD, Concat SSD and Fusion SSD. Learned redundancies no longer hold, and these methods even fall below image-only methods.

Two-stage methods, such as Frustum PointNet [48], drop quickly. However, they asymptotically achieve higher results compared to AVOD, because the statistical priors learned for the first stage are based on the Image-only SSD, which limits their performance to image-domain priors. AVOD is limited by several assumptions that hold for clear weather, such as the importance sampling of boxes filled with lidar data during training, and achieves the lowest fusion performance overall. Moreover, as the fog density increases, the proposed adaptive fusion model outperforms all other methods. Especially under severe distortions, the proposed adaptive fusion layer results in significant margins over the model without it (Deep Fusion). Overall the proposed method outperforms all baseline approaches. In dense fog, it improves by a margin of 9.69% compared to the next-best feature-fusion variant.

For completeness, we also compare the proposed model to recent domain adaptation methods. First, we adapt our Image-only SSD features from clear weather to adverse weather following [60]. Second, we investigate style transfer from clear weather to adverse weather utilizing [28] and generate adverse weather training samples from clear weather input. Note that these methods have an unfair advantage over all other compared approaches as they have seen adverse weather scenarios sampled from our validation set. Note that domain adaptation methods cannot be directly applied as they need target images from a specific domain. Therefore, they also do not offer a solution for rare edge cases with limited data. Furthermore, [28] does not model distortions, including fog or snow, see the experiments in the Supplemental Material. We note that synthetic data augmentation following [51] or image-to-image reconstruction methods that remove adverse weather effects [63] do not affect the reported margins of the proposed multimodal deep entropy fusion.

6. Conclusion and Future Work

In this paper we address a critical problem in autonomous driving: multi-sensor fusion in scenarios where annotated data is sparse and difficult to obtain due to natural weather bias. To assess multimodal fusion in adverse weather, we introduce a novel adverse weather dataset covering camera, lidar, radar, gated NIR, and FIR sensor data. The dataset contains rare scenarios, such as heavy fog, heavy snow, and severe rain, captured during more than 10,000 km of driving in northern Europe. We propose a real-time deep multimodal fusion network which departs from proposal-level fusion and instead fuses adaptively, driven by measurement entropy. Exciting directions for future research include the development of end-to-end models enabling failure detection, and adaptive sensor control such as noise-level or power-level control in lidar sensors.

Origin blog.csdn.net/WakingStone/article/details/129490228