【Image Fusion】Dif-Fusion: Infrared/Visible Image Fusion Method Based on Diffusion Model


Summary

Color plays an important role in human visual perception and reflects the spectral characteristics of objects. However, existing infrared and visible image fusion methods rarely explore how to directly process multi-spectral/multi-channel data and achieve high color fidelity. This paper proposes a diffusion model that learns the distribution of the multi-channel input data, which improves multi-source information aggregation and color fidelity. Specifically, instead of converting multi-channel images to single-channel data as existing fusion methods do, we construct the multi-channel data distribution with a denoising network through forward and reverse diffusion processes in the latent space. We then exploit the denoising network to extract multi-channel diffusion features containing both visible and infrared information. Finally, the multi-channel diffusion features are fed into a multi-channel fusion module to directly generate three-channel fused images. To preserve texture and intensity information, we propose a multi-channel gradient loss and a multi-channel intensity loss, and a new evaluation metric is introduced to quantify color fidelity.

Source code: https://github.com/GeoVectorMatrix/DifFusion
Paper address: https://arxiv.org/pdf/2301.08072.pdf


I. Introduction

Due to the theoretical and technical limitations of optical imaging hardware, an image acquired by a single sensor or a single shooting setting captures only part of the scene information. Therefore, fusing images from different sensors or different capture settings helps to enrich image information. Among various image fusion tasks, infrared and visible image fusion is one of the most widely used. Infrared sensors capture the thermal radiation of objects, but are susceptible to noise and struggle to capture texture information. In contrast, visible images usually contain rich structure and texture information but are vulnerable to illumination changes and occlusion. This complementarity makes it possible to generate fused images containing both thermal targets and texture details. Infrared and visible image fusion has been widely used in military applications, target detection and tracking, pedestrian re-identification, semantic segmentation, and other fields.

In the past decades, many image fusion techniques have been proposed, including traditional methods and deep learning-based methods. Traditional infrared and visible image fusion algorithms can generally be divided into the following categories: methods based on sparse representation, multi-scale transformation, subspace representation, saliency detection, and hybrid methods. These algorithms can meet the needs of specific scenarios, but two problems remain: 1) existing traditional methods usually represent the features of both images in the same way and rarely consider the distinct characteristics of infrared and visible images; 2) manually designed activity-level measurements and fusion rules cannot meet the needs of complex scenes.

Fusion algorithms based on deep neural networks are generally divided into three categories: methods based on autoencoders (AE), methods based on convolutional neural networks (CNN), and methods based on generative adversarial networks (GAN). Although image fusion is essentially an image generation task, existing infrared and visible image fusion methods lack in-depth exploration of generative models. Existing generative approaches are mainly GAN-based, including FusionGAN and GANMcC. However, because these methods impose additional constraints on the generator, they cannot establish the distribution of infrared and visible images in the latent space. Some problems remain:
1. Existing methods mainly focus on preserving the thermal targets in infrared images and the background texture structure in visible images, but not on how to preserve the color information of visible images. However, color reflects the spectrum of an object, is crucial in digital images, and its role in understanding visual scenes has been studied extensively. As shown in Figure 1(c), a current method (U2Fusion) does not effectively exploit multi-spectral information and performs poorly in preserving the color information of visible images, which negatively impacts human perception.

[Figure 1]

Illustration of color fidelity. The red, yellow, and green dashed circles denote the color differences between the visible and fused images of walls, road surfaces, and vegetation, respectively. Compared with existing methods, DifFusion has higher color fidelity.

How to extract multi-channel complementary information from the input data has also not been well studied. Existing methods usually convert visible images stored in three channels (i.e., RGB) from RGB space to YCbCr space and use only the Y channel for fusion. After the single-channel fused image is generated, it must be converted back into a three-channel image through post-processing. Since not all channels participate in the fusion, it is difficult to construct multi-channel distributions and extract multi-channel complementary information, resulting in color distortion.

This paper proposes a diffusion model-based fusion method for infrared and visible images, namely Dif-Fusion.
First, we directly take as input multi-channel data consisting of a three-channel visible image and a single-channel infrared image, and construct the multi-channel distribution in the latent space through a diffusion process. The diffusion process is a Markov process divided into a forward process and a reverse process. In the forward process, Gaussian noise is gradually added to the multi-channel input data; in the reverse process, the noise added in the forward process is removed over multiple time steps. The multi-channel distribution is constructed by training the denoising network in the reverse process to estimate the noise added in the forward process.

Second, we extract multi-channel diffusion features from the denoising network, which contain both infrared and visible information.

Third, the multi-channel diffusion features are fed into a multi-channel fusion module to directly generate three-channel fused images. A multi-channel gradient loss L_MCG and a multi-channel intensity loss L_MCI are proposed to preserve the texture and intensity information of the three-channel fused images. The method establishes the distribution of the multi-channel input data with a diffusion model and extracts multi-channel complementary information to achieve high color fidelity.

As shown in Figure 1(d), the fused image has higher color fidelity and is better suited to human visual perception. For evaluating fusion results, in addition to existing metrics that quantify intensity and gradient fidelity, a metric that quantifies color fidelity is introduced. With the proposed Dif-Fusion, infrared and visible images can simply be fed into the model without any color space transformation.
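To make the three-step pipeline above concrete, here is a minimal toy sketch of the data flow. The `ToyDenoiser` and `ToyFusionHead` modules, the tensor sizes, and the single noise level are illustrative stand-ins, not the authors' implementation; the real components are in the linked repository.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the trained denoising U-Net and the fusion head.
class ToyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(4, 64, 3, padding=1)   # produces "diffusion features"
        self.head = nn.Conv2d(64, 4, 3, padding=1)   # predicts the added noise

    def forward(self, x_t, t):
        feats = torch.relu(self.body(x_t))
        return self.head(feats), feats

class ToyFusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, feats):
        return torch.tanh(self.proj(feats))          # 3-channel fused image

ir, vis = torch.rand(1, 1, 128, 128), torch.rand(1, 3, 128, 128)
x0 = torch.cat([vis, ir], dim=1)                     # step 1: 4-channel multi-spectral input

denoiser, fusion = ToyDenoiser(), ToyFusionHead()
alpha_bar_t = torch.tensor(0.9)                      # cumulative noise schedule value at step t
x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * torch.randn_like(x0)

_, diffusion_feats = denoiser(x_t, t=torch.tensor([5]))   # step 2: multi-channel diffusion features
fused = fusion(diffusion_feats)                           # step 3: three-channel fused image
print(fused.shape)                                        # torch.Size([1, 3, 128, 128])
```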

2. Related work

1. Image fusion of infrared and visible light

Traditional infrared and visible light image fusion algorithms can generally be divided into five categories: sparse representation, multi-scale transformation, subspace representation, saliency detection, and hybrid methods. The main idea of sparse representation theory is that an image signal can be represented as a linear combination of the smallest possible number of atoms or transformation primitives from an over-complete dictionary. Over-completeness means that the number of atoms in the dictionary is greater than the dimensionality of the signal. In image fusion, sparse representation typically learns an over-complete dictionary from a set of training images, thereby capturing inherently data-driven image representations. Over-complete dictionaries contain rich underlying atoms, allowing more meaningful and stable source image representations. Multi-scale transformation decomposes the original image into sub-images of different scales. It is similar to the human visual process, which helps the fused image achieve a good visual effect. Subspace representation-based methods project high-dimensional features into low-dimensional subspaces, which helps capture the inherent structure of the original input image. Furthermore, data processing in low-dimensional subspaces saves time and memory compared to high-dimensional spaces. Commonly used subspace-based methods include Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Nonnegative Matrix Factorization (NMF).

Saliency detection models simulate human behavior and capture the most salient regions/objects in an image or scene. In recent years, infrared and visible image fusion methods based on saliency detection can mainly be divided into two categories: weight calculation and salient object extraction. Researchers have also explored hybrid methods; commonly used combinations include multi-scale transformation with sparse representation and multi-scale transformation with saliency detection.

Owing to the strong feature learning and nonlinear fitting abilities of neural networks, researchers have explored data-driven infrared and visible image fusion methods based on deep learning. These methods mainly include:

1. Most AE-based fusion methods use an encoder to extract features from the source images and a decoder to reconstruct the image. DenseFuse is a typical AE-based method. Its encoding network combines convolutional layers, a fusion layer, and dense blocks, in which the output of each layer is connected to every subsequent layer. The fused image is then reconstructed by the decoder. To improve the feature extraction ability of the encoder, NestFuse was proposed; from a multi-scale perspective, it can preserve a large amount of information from the densely connected input data. SEDRFuse is a symmetric encoder-decoder with a residual network. In its fusion stage, the trained extractor extracts intermediate features and compensation features, and the two attention maps obtained from the intermediate features are multiplied with the intermediate features for fusion.

2. CNN-based fusion methods. A typical method is PMGI [53], a fast unified image fusion network based on proportional maintenance of gradient and intensity. It introduces a path-wise transfer block to exchange information between different paths, which pre-fuses gradient and intensity information and thereby enhances the information to be fused. To adaptively determine the proportion of preserved gradient information and retain a more complete texture structure, SDNet [27] was proposed. For gradient fidelity, it determines the optimization target of the gradient distribution according to texture richness and guides the fused image to contain more texture details through an adaptive decision block. To provide spatial guidance for the integration of multi-source information, STDFusionNet employs a salient-object mask to assist the fusion task [4]. To combine fusion with high-level vision tasks, a fusion method assisted by high-level semantic tasks was proposed [55]. In addition, some methods study how illumination conditions affect image fusion.

3. GAN-based fusion methods use GANs to estimate the probability distribution in an unsupervised manner. FusionGAN sets up an adversarial game between a generator and a discriminator: the generator produces the fused image, while the discriminator forces the fused image to contain more details of the visible image. To address the problem that the discriminator only distinguishes visible images, a dual-discriminator conditional generative adversarial network (DDcGAN) was proposed, which uses two discriminators to identify structural differences between the fused image and each source image. To help the generator focus on the foreground objects of the infrared image and the background details of the visible image, AttentionFGAN integrates a multi-scale attention mechanism into both the generator and the discriminator.
To balance the information between infrared and visible images, a generative adversarial network with multi-classification constraints (GANMcC) was proposed. However, the generators of the aforementioned generative fusion methods add gradient-fidelity and intensity-fidelity constraints during training, and therefore cannot construct the distribution of infrared and visible images in the latent space. Meanwhile, existing methods usually convert three-channel visible images into single-channel images, which makes it difficult to fully utilize multi-spectral information and achieve high color fidelity.

2. Diffusion model (see the blogger's previous blog)

Diffusion models have emerged as a powerful family of deep generative models with record-breaking performance in many domains, including image generation, image inpainting, image super-resolution, and image-to-image translation. Moreover, the feature representations learned by diffusion models are also very useful in discriminative tasks, including image classification, image segmentation, and object detection. A diffusion model is a deep generative model with two processes, a forward process and a reverse process. In the forward process, the input data are gradually perturbed over multiple time steps by adding Gaussian noise. In the reverse process, the model restores the original input data over multiple reverse time steps by reducing the difference between the added noise and the predicted noise.

Due to the high quality and diversity of the samples they generate, diffusion models are widely used for generation and have broken the long-standing dominance of GANs in image generation. The fusion of infrared and visible images can also be viewed as an image generation task. This paper explores an efficient way to leverage diffusion models to achieve state-of-the-art fusion results.

3. Method

This section describes in detail the diffusion-based image fusion framework for multi-modal data. The main idea of the method is shown in Figure 2. Paired visible and infrared images are concatenated along the channel dimension, providing the multi-channel input to the diffusion model. In the forward process P(I_t | I_{t−1}), Gaussian noise is gradually added to the multi-channel data until it is close to pure noise. The reverse process Q(I_{t−1} | I_t) then tries to predict and remove the added noise with the help of the denoising network. Diffusion features are extracted from the diffusion model and fed into the proposed multi-channel fusion module. A color fused image is produced directly by the framework under the guidance of the proposed multi-channel losses.
[Figure 2]
I_0 and I_t denote the multi-channel input and the multi-channel data at time step t of the forward diffusion process. P(·|·) and Q(·|·) denote the forward and reverse diffusion processes. L_MCG and L_MCI represent the multi-channel gradient loss and the multi-channel intensity loss, respectively. In the following, we first discuss how the diffusion model learns the multi-channel distribution and generates new image pairs. Next, a multi-source information aggregation method based on the diffusion model is presented in detail. Finally, we introduce the multi-channel intensity loss and multi-channel gradient loss that guide the training of the fusion network.

1. Joint diffusion of infrared and visible light images

Given a pair of registered infrared and visible images I_ir ∈ R^{H×W×1} and I_vis ∈ R^{H×W×3}, where H and W denote height and width, respectively, we aim to learn the joint latent structure of the multi-channel data. The 1-channel infrared image and the 3-channel visible image are concatenated to form a 4-channel image, denoted I ∈ R^{H×W×4}. We adopt the diffusion process of the Denoising Diffusion Probabilistic Model (DDPM) to construct the distribution of the multi-channel data. The forward diffusion process gradually adds noise to the multi-channel image over T time steps, and the reverse process gradually removes the noise over T time steps. Training the diffusion model with the forward and reverse processes learns the joint latent structure of the infrared and visible images by modeling the diffusion of the 4-channel image in the latent space.

  1. Forward diffusion process: Inspired by non-equilibrium thermodynamics, the forward diffusion process can be viewed as a Markov chain that progressively adds Gaussian noise to the data over T time steps. At time step t, the noisy multi-channel image can be expressed as:
    I_t = √(α_t) · I_{t−1} + √(1 − α_t) · γ,   γ ∼ Z
    where Z is the standard normal distribution, I_t and I_{t−1} denote the 4-channel noisy images after t and t−1 noise-adding steps, respectively, γ ∈ R^{H×W×4} is the Gaussian noise, and α_t controls the variance of the Gaussian noise added at time step t. Given the original input I_0 ∈ R^{H×W×4}, applying this relation recursively yields:
    I_t = √(ᾱ_t) · I_0 + √(1 − ᾱ_t) · γ,   where ᾱ_t = ∏_{i=1}^{t} α_i
    In the forward diffusion process, given the time step t, the variance schedule α_1, ..., α_t, and the sampled noise, the noisy multi-channel sample at time step t can therefore be computed directly from this equation (see the sketch after this list).

  2. Reverse diffusion process: In the reverse diffusion process, a series of small denoising operations is performed by a neural network to recover the original multi-channel image. At each time step of the reverse process, a denoising operation is applied to the noisy multi-channel image I_t to obtain the previous image I_{t−1}. The conditional probability distribution of I_{t−1} can be expressed as:
    Q(I_{t−1} | I_t) = N(I_{t−1}; μ_θ(I_t, t), σ_t² · I)

where σ_t² is the variance of the conditional distribution Q(I_{t−1} | I_t), which can be expressed as:
σ_t² = β_t
where β_t = 1 − α_t. The mean μ_θ of the conditional distribution Q(I_{t−1} | I_t) can be expressed as:

μ_θ(I_t, t) = (1 / √(α_t)) · ( I_t − (β_t / √(1 − ᾱ_t)) · ε_θ(I_t, t) )
where ε_θ(·,·) is the denoising network whose inputs are the time step t and the noisy multi-channel image I_t.

  3. Loss function: First, we sample a pair of registered visible and infrared images (I_ir, I_vis) from the training set and form the multi-channel image I_0. Then we sample the noise γ from a standard normal distribution. Finally, we sample a time step t ∼ U({1, ..., T}) from a uniform distribution. After the above sampling, the loss function of the diffusion model can be expressed as:
    L_diff = E_{I_0, γ, t} [ ‖ γ − ε_θ( √(ᾱ_t) · I_0 + √(1 − ᾱ_t) · γ, t ) ‖ ]
    Structure of the denoising network: To predict the noise added during the forward diffusion process, the denoising network ε_θ(·,·) adopts a U-Net structure. The SR3 backbone consists of a contracting path, an expansive path, and a diffusion head. The contracting and expansive paths each consist of 5 convolutional layers, and the diffusion head consists of a single convolutional layer that outputs the predicted noise.
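A minimal sketch of the forward noising step and the training objective above, assuming a hypothetical linear variance schedule and a toy convolution in place of the SR3-style U-Net (an L2 form of the loss is shown for concreteness; the real network is also conditioned on the time step t):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
# Hypothetical linear variance schedule; the paper's exact schedule may differ.
betas = torch.linspace(1e-4, 2e-2, T)             # beta_t = 1 - alpha_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)         # cumulative product of alpha_1..alpha_t

# Toy stand-in for the denoising network epsilon_theta(I_t, t).
eps_theta = nn.Conv2d(4, 4, 3, padding=1)
optimizer = torch.optim.Adam(eps_theta.parameters(), lr=1e-4)

def q_sample(x0: torch.Tensor, t: int, gamma: torch.Tensor) -> torch.Tensor:
    """Forward diffusion: sample the noisy 4-channel image I_t directly from I_0."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * gamma

def diffusion_loss(x0: torch.Tensor) -> torch.Tensor:
    """Regress the added noise gamma from the noisy multi-channel image."""
    t = torch.randint(0, T, (1,)).item()          # t ~ U({1, ..., T})
    gamma = torch.randn_like(x0)                  # gamma ~ N(0, I)
    x_t = q_sample(x0, t, gamma)
    return F.mse_loss(eps_theta(x_t), gamma)      # || gamma - eps_theta(I_t, t) ||^2

x0 = torch.rand(4, 4, 160, 160)                   # batch of 4-channel (VIS + IR) images
loss = diffusion_loss(x0)
loss.backward()
optimizer.step()
```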

Figure 3 shows paired visible and infrared images generated by our trained diffusion model. The generated image pairs visually resemble real visible and infrared images, and the targets highlighted in the corresponding infrared images also appear plausible. These results demonstrate that diffusion models are powerful tools for constructing multi-channel data distributions.
[Figure 3]
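For completeness, here is a sketch of how such image pairs can be generated by ancestral sampling with a trained denoiser, again with a toy stand-in network and a hypothetical schedule, and with σ_t² = β_t as in the equations above:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)             # hypothetical schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

eps_theta = nn.Conv2d(4, 4, 3, padding=1)         # toy stand-in for the trained denoiser

@torch.no_grad()
def sample_pair(shape=(1, 4, 64, 64)) -> torch.Tensor:
    """Ancestral sampling: start from pure noise and denoise over T reverse steps."""
    x_t = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_theta(x_t)                                              # predicted noise
        mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        sigma = betas[t].sqrt()                                           # sigma_t^2 = beta_t
        x_t = mean if t == 0 else mean + sigma * torch.randn_like(x_t)
    return x_t                                     # 4 channels: visible (RGB) + infrared

generated = sample_pair()
vis, ir = generated[:, :3], generated[:, 3:]       # split the generated pair
```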

2. Fusion of multi-channel diffusion features

After training the denoising network, we exploit it to extract multi-channel diffusion features. In the image fusion training stage, we use two losses (the multi-channel gradient loss and the multi-channel intensity loss) for training. With the multi-channel losses, a three-channel fused image can be generated directly, without any color space transformation.

  1. Multi-channel diffusion features.

The expansive path of the SR3 backbone contains 5 convolutional stages, whose output feature maps have sizes W/16 × H/16, W/8 × H/8, W/4 × H/4, W/2 × H/2, and W × H. The multi-channel diffusion features from these 5 stages of the denoising network are fused by the multi-channel fusion module: the features of the five stages are summed and fed into the fusion head to generate the fused image I_f ∈ R^{H×W×3}. Specifically, a 3×3 convolutional layer maps the high-dimensional fused features to a 3-channel output, with leaky ReLU and Tanh as activation functions. The structure of the denoising network and the multi-channel fusion module is shown in Figure 4 (a sketch is given after the figure).
[Figure 4]
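A sketch of the multi-channel fusion module described above; the 3×3 convolution, leaky ReLU, and Tanh come from the text, while the uniform 64-channel features and the bilinear upsampling before summation are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List

class MultiChannelFusionHead(nn.Module):
    """Sum multi-stage diffusion features and map them to a 3-channel fused image."""

    def __init__(self, feat_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 3, kernel_size=3, padding=1)

    def forward(self, stage_features: List[torch.Tensor]) -> torch.Tensor:
        h, w = stage_features[-1].shape[-2:]       # full-resolution stage (H x W)
        # Upsample every stage to full resolution and sum them.
        summed = sum(F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                     for f in stage_features)
        fused = self.conv(F.leaky_relu(summed))
        return torch.tanh(fused)                   # 3-channel fused image I_f

# Illustrative decoder features at W/16, W/8, W/4, W/2, and full resolution.
feats = [torch.rand(1, 64, 160 // s, 160 // s) for s in (16, 8, 4, 2, 1)]
fusion_head = MultiChannelFusionHead(feat_channels=64)
I_f = fusion_head(feats)                           # shape: (1, 3, 160, 160)
```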

  2. Loss function for the fusion process.

Existing gradient losses are designed for single-channel fused images. To preserve the texture information of the visible image and maintain gradients while directly generating a three-channel fused image, we extend the existing gradient loss and propose a multi-channel gradient loss L_MCG:

L_MCG = (1 / (H·W)) · Σ_{k=1}^{3} ‖ |∇I_f^k| − max(|∇I_ir|, |∇I_vis^k|) ‖_1
Here ∇ denotes the gradient operator, I_f^1, I_f^2, and I_f^3 denote the three channels (i.e., red, green, blue) of the fused image, and I_vis^1, I_vis^2, and I_vis^3 denote the three channels of the input visible image I_vis. Thermal radiation is typically characterized by pixel intensity, so we apply an intensity loss to give the fused image an intensity distribution similar to those of the infrared and visible images. However, like the gradient loss, the existing intensity loss is designed for single-channel fused images. We therefore extend it to a multi-channel intensity loss L_MCI, which can be formulated as:

L_MCI = (1 / (H·W)) · Σ_{k=1}^{3} ‖ I_f^k − max(I_ir, I_vis^k) ‖_1

Existing fusion methods usually preserve color information through color space conversion. To avoid this and make full use of the diffusion features, this paper directly generates three-channel fused images with the multi-channel gradient and intensity losses (a sketch of both losses is given below).
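A sketch of the two multi-channel losses; the per-channel max aggregation of the infrared and visible terms and the Sobel-style gradient operator are assumptions consistent with the description above, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def sobel_gradient(x: torch.Tensor) -> torch.Tensor:
    """Gradient-magnitude operator applied per channel (illustrative choice of operator)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(x, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx.abs() + gy.abs()

def multi_channel_losses(I_f, I_vis, I_ir):
    """Multi-channel gradient loss (texture) and intensity loss, computed per RGB channel."""
    grad_f, grad_vis, grad_ir = sobel_gradient(I_f), sobel_gradient(I_vis), sobel_gradient(I_ir)
    l_mcg = sum(F.l1_loss(grad_f[:, k], torch.max(grad_ir[:, 0], grad_vis[:, k]))
                for k in range(3))
    l_mci = sum(F.l1_loss(I_f[:, k], torch.max(I_ir[:, 0], I_vis[:, k]))
                for k in range(3))
    return l_mcg, l_mci

I_f = torch.rand(1, 3, 160, 160)
I_vis = torch.rand(1, 3, 160, 160)
I_ir = torch.rand(1, 1, 160, 160)
l_mcg, l_mci = multi_channel_losses(I_f, I_vis, I_ir)
total = l_mci + l_mcg        # the weighting between the two terms is not specified here
```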


4. Experiment

1. Experimental setup

Datasets: Color visible and infrared image pairs from the MSRS, RoadScene, and M3FD datasets are used to evaluate the proposed framework. We compare against six state-of-the-art algorithms: FusionGAN, SDDGAN, GANMcC, SDNet, U2Fusion, and TarDAL. SDNet and U2Fusion are CNN-based fusion methods, while FusionGAN, SDDGAN, GANMcC, and TarDAL are based on generative models and their variants. For the compared methods, fused images are generated with the publicly available code and pretrained models. To produce color results for visual analysis and quantitative evaluation, the single-channel fusion results of the compared methods are converted to color images in post-processing.

Evaluation metrics: Six statistical metrics are used in the quantitative evaluation, five of which are mutual information (MI), visual information fidelity (VIF), spatial frequency (SF), Qabf, and standard deviation (SD). MI evaluates how well the information from the original image pair is aggregated in the fused image. VIF evaluates the fidelity of the information presented in the fused image. SF measures the spatial-frequency information contained in the fused image. Qabf quantifies how much edge information of the source images is preserved. SD mainly evaluates the contrast of the fused image.
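As an illustration, the two reference-free metrics (SD and SF) can be computed as follows; these are the standard textbook definitions and may differ slightly from the evaluation toolbox used in the paper:

```python
import numpy as np

def standard_deviation(fused: np.ndarray) -> float:
    """SD: contrast of the fused image (grayscale float array)."""
    return float(fused.std())

def spatial_frequency(fused: np.ndarray) -> float:
    """SF: combines row and column gradient energy of the fused image."""
    rf = np.sqrt(np.mean(np.diff(fused, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(fused, axis=0) ** 2))   # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

fused = np.random.rand(256, 256) * 255.0
print(standard_deviation(fused), spatial_frequency(fused))
```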

In addition, we introduce Delta E, a color-difference metric defined in the CIELAB space that is considered more consistent with human perception, to quantify the color distortion between the fused image and the original visible image. Delta E is a color distance measure. Because human perception is non-uniform, the eye is more sensitive to some colors than to others, so a Euclidean distance measured directly in color space does not match human perception. Delta E addresses these problems with corrections for neutral colors, lightness, chroma, hue, and hue rotation.

Note that while the other metrics require the source images, SF and SD can be computed directly on the fused image. A lower Delta E value indicates smaller color distortion and thus better fusion quality; for the other five metrics, higher values indicate better fusion results.
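A sketch of computing the mean Delta E (CIEDE2000) between a fused image and the original visible image using scikit-image; this mirrors the idea of the metric, though the paper's exact implementation may differ:

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def mean_delta_e(fused_rgb: np.ndarray, visible_rgb: np.ndarray) -> float:
    """Average CIEDE2000 color difference over all pixels (inputs in [0, 1], H x W x 3)."""
    lab_fused = rgb2lab(fused_rgb)
    lab_vis = rgb2lab(visible_rgb)
    return float(deltaE_ciede2000(lab_vis, lab_fused).mean())

fused = np.random.rand(256, 256, 3)
visible = np.random.rand(256, 256, 3)
print(mean_delta_e(fused, visible))     # lower means less color distortion
```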

Training details: Training is performed on MSRS (1083 training pairs and 361 test pairs of visible and infrared images), with 160×160 patches randomly cropped. We extract the diffusion features generated at three time steps (e.g., 5, 50, 100) to form the multi-channel diffusion features. The fusion module is trained with the Adam optimizer, a learning rate of 0.0001, a batch size of 24, and 300 epochs. Experimental hardware: a workstation with an NVIDIA RTX 3090 GPU and a 3.80 GHz Intel Core i7-10700K CPU.
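A sketch of the fusion-stage training setup; the crop size, optimizer, learning rate, batch size, and epoch count come from the text, while the dataset tensors, the stand-in fusion module, and the placeholder loss are purely illustrative:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: registered 160x160 visible/infrared crops (hypothetical tensors).
vis = torch.rand(48, 3, 160, 160)
ir = torch.rand(48, 1, 160, 160)
loader = DataLoader(TensorDataset(vis, ir), batch_size=24, shuffle=True)

fusion_module = nn.Conv2d(4, 3, 3, padding=1)   # toy stand-in for the fusion network
optimizer = torch.optim.Adam(fusion_module.parameters(), lr=1e-4)
timesteps = [5, 50, 100]                        # time steps at which diffusion features are taken

for epoch in range(300):
    for vis_b, ir_b in loader:
        x0 = torch.cat([vis_b, ir_b], dim=1)    # 4-channel input
        fused = torch.tanh(fusion_module(x0))   # the real model would fuse diffusion features here
        loss = (fused - vis_b).abs().mean()     # placeholder loss; the paper uses L_MCI + L_MCG
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```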

2. Fusion performance analysis

  1. Qualitative results:

The MSRS dataset contains daytime and nighttime scenes, from which two pairs of images are selected to compare the different models. Infrared images highlight objects with strong thermal radiation in daytime scenes, while visible images contain rich texture and color information. A good fused image should emphasize the important objects of the infrared image while maintaining the fine-grained texture and color information of the visible image.

The infrared image in Figure 5 highlights three pedestrians, which are preserved in the fused images generated by all methods. However, only our results and TarDAL's are visually close to the original visible image. The fused images obtained by other methods (e.g., FusionGAN, GANMcC) are visually darker and show large color distortions; for example, green trees turn black in the fused images produced by SDDGAN, U2Fusion, and SDNet. The red boxes in Fig. 5 zoom in on the windows under the eaves to demonstrate the benefits of our method in detail preservation. Only the results of FusionGAN, SDDGAN, TarDAL, and our method show that the area under the eaves in the infrared image is slightly brighter than its surroundings, but only our method clearly preserves the contours and arrangement of the windows. Furthermore, the foreground (green plants) and background (the wall under the windows) can easily be distinguished in our fused image.
[Figure 5]
Another pair of daytime images is shown in Figure 6. In the infrared image, the two cyclists and the distant pedestrian stand out, while the white markings on the green area are only visible in the visible image. Some methods (e.g., FusionGAN, GANMcC) have difficulty showing these features clearly. U2Fusion and SDNet can display the structural information in the red box, while SDDGAN and TarDAL emphasize the structural information in the green box. However, only our method preserves the key features in both rectangles simultaneously. This analysis reflects the advantages of our method in color preservation, learning complementary information, and feature details.

Figure 7 highlights a red and a green region: the red box contains two pedestrians as well as the zebra crossing visible in the middle and lower part of the visible image, and the green box contains a signboard whose text is clear in the visible image but completely black in the infrared image. All methods emphasize the two pedestrians in the red box to varying degrees. However, in the original infrared image the body parts covered by clothing are not as bright as the other areas; SDDGAN and TarDAL ignore this difference and render the whole body equally bright, losing structural information. In addition, many methods (e.g., FusionGAN, GANMcC, U2Fusion, SDNet) ignore the zebra-crossing information of the visible image. Our approach avoids both problems. Moreover, compared with the other methods, ours better preserves the information of the visible image in the green box, including brightness, color, and sharpness.

[Figure 7]
The second pair of nighttime images, in Fig. 8, demonstrates a complex lighting scenario. In the infrared image, the highlighted objects are pedestrians, the medium-brightness area is the inside of the window, and the weaker-brightness area is the irregular wall on the right. The visible image contains regions with rich color and texture, such as the white car and the road on the left, and the bright window area. We expect the fused image to contain the key information at the different brightness levels of the infrared image, while preserving the authenticity of the colors and textures of the visible image. It is difficult to distinguish the wall surface structure in the enlarged regions of FusionGAN, GANMcC, U2Fusion, and TarDAL. Although the results of SDDGAN and SDNet contain this structure, they are somewhat blurred or polluted by noise. Only our method produces a fused image that is close to the original infrared image in both sharpness and brightness. The images produced by FusionGAN, SDNet, GANMcC, and U2Fusion all show color distortions; for example, the white vehicle in the green box appears green. In summary, by extracting complementary information from the multi-channel data, our method maintains both the color fidelity of the visible image and the faint information of the infrared image.

  2. Quantitative results

We quantitatively compare the proposed method with the six state-of-the-art methods. Figure 9 shows the quantitative results of the six statistical metrics on the MSRS dataset. Our method shows significant advantages in five metrics, namely MI, VIF, Qabf, SD, and Delta E. The highest MI indicates that our method transfers the most information from the multi-channel source images to the fused image. The best VIF shows that Dif-Fusion produces fused images that better match the human visual system. Dif-Fusion achieves the best Qabf and thus preserves more edge information. In addition, our method has the best SD, indicating that our fused images have the highest contrast. Furthermore, since the diffusion model exploits multi-channel complementary information, our method significantly outperforms the compared methods on the color fidelity metric (Delta E). On the SF metric, our method is only slightly inferior to SDDGAN and TarDAL.
[Figure 9]

3. Generalization experiment

  1. Qualitative results

To evaluate generalization performance, the model trained on the MSRS dataset is tested on the RoadScene and M3FD datasets. We select one example from each dataset for analysis. Figure 10 shows an example from the RoadScene dataset. The visible image mostly consists of roads, trees, vehicles, and sky, while the infrared image highlights the underside of the car and parts of the road. Although the bright areas of the infrared image are preserved to some extent in the images fused by the various methods, the colors of the sky and trees change significantly in the images fused by FusionGAN, GANMcC, SDDGAN, and SDNet. The images produced by U2Fusion and TarDAL show less color distortion, but the output is blurry and lacks important structural information (such as tree crowns).
In contrast, our fused image effectively preserves the salient information of the infrared image while maintaining the color and texture of regions such as the sky and trees in the visible image. In the red rectangle, the rear of a van is enlarged. In the fused images created by FusionGAN, GANMcC, SDDGAN, SDNet, U2Fusion, and TarDAL, the silhouettes of the carriage and wheels are chaotic and cluttered; only our result preserves the color and structural details of these regions in the visible image. This demonstrates the method's ability to extract complementary information, as well as its advantages in texture and color preservation.

[Figure 10]

An underground garage scene is selected from the M3FD dataset for qualitative analysis, as shown in Figure 11, where the pipe structure and the background wall are highlighted in the infrared image. First, the fused images produced by TarDAL and our method are very similar to the original visible image in overall perception. Due to an improper combination of complementary information, FusionGAN, SDNet, and U2Fusion replace the brightness of the pillar in the visible image with that of the infrared image, so the pillar on the right side of the image becomes too dark. GANMcC, SDDGAN, and TarDAL partially alleviate this problem: in their fused images the pillar preserves some texture and color information of the original visible image, but not as effectively as the proposed method. In addition, SDDGAN and TarDAL suffer from the same problem as in Figure 7, that is, the brightness is enhanced too much, causing the loss of wall structure information. The reflective marker in the corner of the original visible image is outlined with a red rectangle and enlarged; only our fused image preserves the marker's color and structural details while maintaining its brightness.

The above analysis results show that the method has strong generalization ability. It can mine complementary information from multimodal data in different scenes with good texture and color preservation.

  2. Quantitative results

We select 25 image pairs from the two datasets other than MSRS to quantitatively evaluate the generalization performance of the method. Tables I and II show the quantitative results of the six statistical metrics for the six state-of-the-art methods and ours on the M3FD and RoadScene datasets. As shown in Table I, Dif-Fusion ranks first in the six metrics on the M3FD dataset. The results show that the proposed method generates fused images with rich texture details, the highest contrast, and the best visual quality. According to Table II, on the RoadScene dataset Dif-Fusion outperforms the compared methods in VIF, Qabf, and Delta E. In addition, Dif-Fusion ranks first in Delta E on both the M3FD and RoadScene datasets, which means the method improves color fidelity while ensuring a large amount of information.

  3. Ablation experiment

The framework improves color preservation and visual quality by exploiting multi-channel complementary information. To verify the contribution of the diffusion model, we remove the diffusion process while keeping the original network structure. The results of the ablation studies are summarized in Table III. On the MSRS dataset, removing the diffusion process degrades our method on five metrics (MI, VIF, SF, Qabf, and Delta E). On both the M3FD and RoadScene datasets, removing the diffusion process degrades performance on all six metrics. Notably, on the M3FD dataset, color fidelity drops significantly after removing the diffusion process, indicating that constructing the multi-channel distribution and extracting multi-channel complementary information play a very important role in color preservation.
[Table III]

Summary

In this paper, a diffusion model-based fusion method for infrared and visible images is proposed to extract multi-channel complementary information and effectively preserve color and visual quality. On the one hand, the distribution of the multi-channel input data in the latent space is constructed with forward and reverse diffusion processes: the distribution is established by training a denoising network in the reverse process to predict the Gaussian noise added in the forward process. On the other hand, a method for directly generating three-channel fused images is proposed, with a multi-channel gradient loss and a multi-channel intensity loss to preserve the gradient and intensity of the three-channel images. For fused image evaluation, in addition to existing texture and intensity fidelity metrics, we introduce Delta E to quantify color fidelity. Overall, we investigate a framework for extracting multi-channel complementary information based on diffusion models and directly generating color fused images from multi-modal inputs.


Reprinted from: blog.csdn.net/qq_45752541/article/details/130614373