Detailed explanation of the paper - "Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement"

Paper address: "Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement"
Code address: https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement

Abstract

Low-light image enhancement (LLIE) studies how to improve illumination and produce normal-light images. Most existing methods enhance low-light images in a globally uniform manner without considering the semantic information of different regions. Without semantic priors, a network can easily deviate from the original color of a region.
To address this issue, we propose a novel semantic-aware knowledge-guided framework (SKF) that helps low-light enhancement models learn the rich and diverse priors encapsulated in a semantic segmentation model.
We focus on integrating semantic knowledge from three key aspects:
a semantic-aware embedding module that integrates semantic priors in the feature representation space;
a semantic-guided color histogram (SCH) loss to maintain color consistency across various instances;
a semantic-guided adversarial loss to generate more natural textures via semantic priors.
Our SKF is attractive in serving as a general framework for LLIE tasks. Extensive experiments show that models using SKF significantly outperform baselines on multiple datasets, and our SKF generalizes well to different models and scenarios.

1. Introduction

In the real world, low-light imaging is fairly common due to unavoidable environmental or technical constraints such as insufficient lighting and limited exposure time. Low-light images not only have poor visibility to human perception, but are also unsuitable for subsequent multimedia computing and downstream vision tasks designed for high-quality images [4, 9, 36]. Low-light image enhancement (LLIE) therefore aims to reveal hidden details in low-light images and avoid performance degradation in subsequent vision tasks. Mainstream traditional LLIE methods include histogram equalization-based methods [2] and Retinex model-based methods [18].

Recently, many deep learning-based LLIE methods have been proposed, including end-to-end frameworks [5, 7, 34, 45, 46, 48] and Retinex-based frameworks [29, 41, 43, 44, 49, 53, 54]. Deep LLIE methods benefit from their ability to model the mapping between low-light and high-quality images, and usually achieve better results than traditional methods. However, existing methods usually perform a global, uniform enhancement of low-light images without considering the semantic information of different regions, which is key to enhancement. As shown in Figure 1(a), a network that does not exploit semantic priors can easily deviate from the original hue of a region [22].

Furthermore, prior studies have demonstrated the importance of combining semantic priors with low-light enhancement. [8] utilize semantic maps as priors and integrate them into the feature representation space, thus improving image quality. [58] do not rely on optimizing intermediate features, but adopt a new loss to ensure the semantic consistency of the enhanced image. These methods successfully combine semantic priors with the LLIE task, demonstrating the superiority of semantic constraints and guidance.
However, their approach fails to take full advantage of the knowledge that semantic segmentation networks can provide, limiting the performance gain of semantic priors. Furthermore, the interaction between segmentation and augmentation is method-specific, limiting the possibility of incorporating semantic guidance into the LLIE task.
Therefore, we consider two questions:
1. How do we obtain diverse and usable semantic knowledge?
2. How does semantic knowledge contribute to the improvement of image quality for LLIE tasks?

We try to answer the first question. First, a semantic segmentation network pre-trained on a large-scale dataset is introduced as a semantic knowledge bank (SKB). It can provide richer and more diverse semantic priors to improve the capability of the enhancement network. Second, according to previous studies [8, 19, 58], the available priors provided by the SKB mainly consist of intermediate features and semantic maps. Once training of an LLIE model begins, the SKB generates the above semantic priors and guides the enhancement process. These priors can not only refine image features using techniques such as affinity matrices, spatial feature transformations [40], and attention mechanisms, but also explicitly integrate region information into the LLIE task [26] to guide the design of the objective function.

Then we try to answer the second question. Based on this, we design a series of new methods to integrate semantic knowledge into LLIE tasks, and build a new semantic-aware knowledge-guided framework (SKF). First, we use a high-resolution network [38] (HRNet) pre-trained on the PASCAL-Context dataset [35] as the aforementioned SKB.
To take advantage of the intermediate features, we develop a semantic-aware embedding (SE) module. It computes the similarity between reference and target features and performs cross-modal interaction between heterogeneous representations. In this way, we quantify the semantic awareness of image features as a form of attention and embed semantic consistency into the enhancement network.

Second, some methods [20, 55] propose to optimize image enhancement using color histograms to maintain the color consistency of images, instead of simply enhancing brightness globally. However, the color histogram is still a global statistical feature and cannot guarantee local consistency. Therefore, we propose a semantic-guided color histogram (SCH) loss to improve color consistency. Here, we intend to utilize local geometric information derived from scene semantics together with global color information derived from content. While preserving the original color of the enhanced image, spatial information is also added to the color histogram to achieve more detailed color restoration.

Third, existing loss functions do not match human perception and cannot capture the intrinsic signal structure of images, resulting in poor visual quality. To improve visual quality, EnlightenGAN [16] adopts global and local image-content consistency, with local patches selected randomly. However, the discriminator does not know which regions are likely to be "fake". Therefore, we propose a semantic-guided adversarial loss. Specifically, by exploiting the segmentation map to identify fake regions, the discriminator's ability is improved, which further improves image quality.

The main contributions of our work are as follows:

  • We propose a semantic-aware knowledge-guided framework (SKF) to improve the performance of existing methods by jointly maintaining color consistency and improving image quality.
  • To make full use of the semantic priors provided by the semantic knowledge bank (SKB), we propose three key techniques: the semantic-aware embedding (SE) module, the semantic-guided color histogram (SCH) loss, and the semantic-guided adversarial (SA) loss.
  • We conduct experiments on the LOL/LOL-v2 datasets and on unpaired datasets. Experimental results show that our SKF brings significant improvements on the LLIE task, verifying its effectiveness.

2. Related Work

2.1. Low-light Image Enhancement

Traditional methods
Traditional low-light enhancement methods include histogram equalization-based methods [2] and Retinex model-based methods [18]. The former improve low-light images by extending the dynamic range. The latter decompose low-light images into reflectance and illumination maps and treat the reflectance component as the enhanced image. Such model-based approaches require explicit priors to fit the data well, but designing suitable priors for various scenarios is difficult.
Learning-based methods
Recent deep learning-based methods have shown promising results [15, 29, 43, 44, 53, 54, 56]. Existing designs can be further divided into Retinex-based methods and end-to-end methods. Retinex-based methods utilize deep networks to decompose and enhance images. Wei et al. proposed a Retinex-based two-stage method called Retinex-Net [43]. Inspired by Retinex-Net, Zhang et al. proposed two refined methods called KinD [54] and KinD++ [53]. Recently, Wu et al. [44] proposed a new Retinex-based deep unfolding network, which further integrates the advantages of model-based and learning-based methods.
Compared with Retinex-based methods, end-to-end methods directly learn the enhanced result [5-7, 27, 32, 34, 37, 41, 45, 46, 51, 57, 59]. Lore et al. [30] proposed a deep autoencoder named Low-Light Net (LLNet), the first such attempt. Subsequently, various end-to-end methods were proposed. Physics-based concepts such as Laplacian pyramids [27], local parametric filters [34], Lagrangian multipliers [57], the De-Bayer-Filter [5], normalizing flow [41], and wavelet transform [7] were introduced to improve model interpretability and obtain visually pleasing results. In [16, 17, 48], adversarial learning is introduced to capture visual properties. In [11], light enhancement is creatively formulated as an image-specific curve estimation task using zero-reference learning. [20, 47, 55] utilize 3D lookup tables and color histograms to maintain color consistency. However, existing designs focus on optimizing the enhancement process while ignoring the semantic information of different regions. In contrast, we design an SKF with three key techniques to explore the potential of semantic priors, leading to visually pleasing enhancement results.

2.2. Semantic-Guided Methods

In recent years, semantic-guided methods have demonstrated the reliability of semantic priors. These methods can be divided into two categories: loss-level semantic guidance methods and feature-level semantic guidance methods.
Loss-level semantic-guided methods
To exploit semantic priors, some studies use a semantic-awareness loss as an additional objective function for the original vision task. In image denoising [28], image super-resolution [1], and low-light image enhancement [58], researchers directly use the semantic segmentation loss as an additional constraint to guide the training process. In addition, Liang et al. [26] better preserve image details through a semantic brightness consistency loss.

Feature-level semantic-guided methods.
Compared with loss-level semantic-guided methods, feature-level semantic-guided methods focus on extracting intermediate features from semantic segmentation networks and introduce semantic priors into the feature representation space, where they are combined with image features. There are similar works in image restoration [23], image parsing [24], image super-resolution [40], low-light image enhancement [8], and depth estimation [10, 19].

Existing semantic-guided methods are limited due to insufficient interaction between semantic priors and the original task. Therefore, we propose a semantic-aware framework to fully utilize semantic information at both loss level and feature level, including two semantically guided losses and a semantically aware embedding module. Specifically, our SKF is attractive as a general framework compared to semantically guided methods [8, 26, 58] in the LLIE task.

3. Method

3.1 Motivation and Overview

Light enhancement is the process of making underexposed images look better by adjusting lighting, removing noise, and restoring lost detail. Semantic priors can provide rich information for improving enhancement performance. Specifically, semantic priors can help reformulate an existing LLIE (low-light image enhancement) method as a region-aware enhancement framework. In particular, such a model can simply smooth away noise in homogeneous regions like the sky, while treating detail-rich regions like indoor scenes more carefully. Furthermore, semantic priors help the enhanced image preserve color consistency. A network that cannot access semantic priors can easily deviate from the original hue of a region [22]. However, existing low-light enhancement methods ignore the importance of semantic information, and their capabilities are thus limited.
This paper proposes a new semantic-aware knowledge-guided framework (SKF), which jointly optimizes image features, maintains regional color consistency, and improves image quality. As shown in Figure 2, the SKB provides semantic priors, which are integrated into the LLIE task through three key components: the SE module, the SCH loss, and the SA loss.

  • Problem definition of semantic-aware LLIE

Given a low-light image $I_l \in \mathbb{R}^{W\times H\times 3}$ with width $W$ and height $H$, and combining it with semantic segmentation, the LLIE process can be modeled as two functions. The first is:

$$M=\mathbf{F}_{\text{segment}}\left(I_l ; \theta_s\right),$$

where $M$ denotes the semantic priors, including the segmentation result and intermediate features at multiple scales.

$\mathbf{F}_{\text{segment}}$ represents the pre-trained semantic segmentation network, which serves as the SKB (semantic knowledge bank); its parameters $\theta_s$ are frozen during the training phase. Then $M$ is used as an additional input:

$$\widehat{I_h}=\mathbf{F}_{\text{enhance}}\left(I_l, M ; \theta_e\right)$$

where $\widehat{I_h} \in \mathbb{R}^{W\times H\times 3}$ is the enhanced result and $\mathbf{F}_{\text{enhance}}$ denotes the enhancement network. During the training phase, under the guidance of $M$, $\theta_e$ is updated by minimizing the objective function while $\theta_s$ is kept fixed:

$$\widehat{\theta_e}=\operatorname{argmin} \mathcal{L}\left(\widehat{I_h}, I_h, M\right)$$

where $I_h\in \mathbb{R}^{W\times H\times 3}$ is the ground truth and $\mathcal{L}\left(\widehat{I_h}, I_h, M\right)$ is the objective function of semantic-aware LLIE.
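
To make this formulation concrete, the following is a minimal PyTorch sketch of one training step under the definition above; `seg_net`, `enhance_net`, and `objective` are illustrative stand-ins for $\mathbf{F}_{\text{segment}}$, $\mathbf{F}_{\text{enhance}}$, and $\mathcal{L}$, not names from the released code.

```python
import torch

def train_step(I_l, I_h, seg_net, enhance_net, objective, optimizer):
    """One semantic-aware LLIE training step (sketch, hypothetical names)."""
    with torch.no_grad():              # theta_s is frozen: the SKB is not updated
        M = seg_net(I_l)               # semantic priors: segmentation map + multi-scale features
    I_h_hat = enhance_net(I_l, M)      # enhancement guided by the priors M
    loss = objective(I_h_hat, I_h, M)  # semantic-aware objective L(I_h_hat, I_h, M)
    optimizer.zero_grad()
    loss.backward()                    # only theta_e receives gradients
    optimizer.step()
    return I_h_hat, loss
```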

3.2 Semantic-Aware Embedding Module

Another challenge that requires special consideration when refining image features with semantic priors is the difference between the two sources. To solve this problem, we propose the SE module to refine image feature maps, as shown in Fig. 3. The SE (semantic-aware embedding) module acts as a bridge between the segmentation network and the enhancement network (see Figure 2), establishing a connection between two heterogeneous tasks.
In our framework, we choose **HRNet [38]** as the SKB, with some task-specific modifications, due to its excellent performance. In addition to the semantic map, we also use the output features before the representation head as multi-scale semantic priors.

For further illustration, three SE modules are shown in Fig. 2. We adopt semantic/image features $F_s^b / F_i^b$ at three spatial resolutions $(H/2^{4-b}, W/2^{4-b})$, $b = 0, 1, 2$, where $H$ and $W$ are the height and width of the input image. The SE module performs pixel-wise interaction between $F_s^b$ and $F_i^b$ and produces the final refined feature map $F_o^b$. The details of the learning process are described below.

The SE module calculates semantic awareness of the image features through cross-modal similarity, and generates a semantic-aware map.

We first apply a convolutional layer to transform $F_s^b$ and $F_i^b$ into the same dimension. Next, inspired by Restormer [50], we employ a transposed-attention mechanism to compute the attention map at a lower computational cost. The semantic-aware attention map is thus described as follows:

$$A^b=\operatorname{Softmax}\left(W_k\left(F_i^b\right) \times W_q\left(F_s^b\right) / \sqrt{C}\right),$$

where $W_k(\cdot)$ and $W_q(\cdot)$ are convolutional layers, LN denotes layer normalization, and $C$ is the number of feature channels. Here, $A^b\in \mathbb{R}^{C\times C}$ is the semantic-aware attention map, representing the interrelationship between $F_i^b$ and $F_s^b$. We then use $A^b$ to refine the image features $F_i^b$ as follows:

$$F_o^b=\mathrm{FN}\left(W_v\left(F_i^b\right) \times A^b+F_i^b\right)$$

where FN is a feed-forward network and $F_o^b$ is the final refined feature map of the $b$-th SE module, which is used as the input of the $(b+1)$-th decoder layer of the enhancement network.
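
Below is a minimal PyTorch sketch of such a transposed-attention interaction between image features and semantic features, assuming both have already been projected to the same channel number; layer normalization and the exact design of the feed-forward network FN are simplified, so this illustrates the mechanism rather than the paper's exact module. The $C\times C$ attention map keeps the cost independent of spatial resolution, which is the main appeal of transposed attention.

```python
import torch
import torch.nn as nn

class SemanticAwareEmbedding(nn.Module):
    """Sketch of an SE-style block: channel-wise (transposed) attention
    between image features F_i and semantic features F_s."""

    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)  # query from semantic features
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)  # key from image features
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)  # value from image features
        self.ffn = nn.Sequential(                                # simplified stand-in for FN
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, f_i, f_s):
        b, c, h, w = f_i.shape
        q = self.w_q(f_s).flatten(2)                                    # (B, C, HW)
        k = self.w_k(f_i).flatten(2)                                    # (B, C, HW)
        v = self.w_v(f_i).flatten(2)                                    # (B, C, HW)
        attn = torch.softmax(k @ q.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, C, C) map A^b
        out = (attn @ v).view(b, c, h, w)                               # semantic-aware refinement
        return self.ffn(out + f_i)                                      # refined feature F_o^b
```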

3.3 Semantic-Guided Color Histogram Loss

Color histograms carry vital image statistics and are helpful for learning color representation. DCC-Net [55] uses a PCE module with an affinity matrix to match the color histogram and content at the feature level, thus maintaining the color consistency of the enhanced image. However, the color histogram describes a global statistic that removes the differences in color features between different instances. Therefore, we propose an intuitive way to achieve local color adjustment, the Semantic-Guided Color Histogram (SCH) loss, as shown in Figure 2. It focuses on adjusting the color histogram of each instance, thus preserving more detailed color information.

Using the semantic map, the enhanced result is first split into image patches with different instance labels; each patch contains the pixels of one label. The patch generation process is defined as follows:

$$P=\left\{P^0, P^1, \ldots, P^{\text{class}}\right\}, \quad P^c=I_{\text{out}} \odot I_{\text{seg}}^c$$

where $\odot$ is the element-wise product, $I_{\text{out}}$ is the enhanced result, $I_{\text{seg}}^c$ is the $c$-th channel of the one-hot semantic map, $P^c\in \mathbb{R}^{W\times H\times 3}$ is the $c$-th image patch, and $P$ is the group containing all patches.
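
As a small illustration of this step, the sketch below builds the per-class patches from a predicted label map; the function and argument names are hypothetical, and boundary-pixel handling is omitted.

```python
import torch
import torch.nn.functional as F

def generate_patches(i_out, seg_map, num_classes):
    """Split the enhanced image into per-class patches P^c = I_out * I_seg^c (sketch).

    i_out:   enhanced result, shape (3, H, W)
    seg_map: predicted class labels, shape (H, W), dtype long
    """
    one_hot = F.one_hot(seg_map, num_classes).permute(2, 0, 1).float()  # (num_classes, H, W)
    patches = []
    for c in range(num_classes):
        mask = one_hot[c]
        if mask.sum() > 0:                # keep only classes that actually appear
            patches.append(i_out * mask)  # broadcast over the 3 color channels
    return patches
```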

Due to the discrete nature of the color histogram, inspired by kernel density estimation [3], we use an approximate differentiable version for model training. Since pixels close to instance boundaries are prone to segmentation errors, they are excluded: we refine the patch group $P$ into $P'$, reducing the impact of misclassification. For the R channel $P'_c(R)$ of the $c$-th image patch, the estimation process is defined as follows:

$$x_{ij}^h=x_j-\frac{i-0.5}{255}, \quad x_{ij}^l=x_j-\frac{i+0.5}{255},$$

where $x_j$ denotes the $j$-th pixel of $P'_c(R)$ and $i\in[0,255]$ is the pixel intensity. $x_{ij}^h$ and $x_{ij}^l$ represent the higher and lower anchors respectively, which are the key variables for estimating the histogram, as shown below:

$$H_i^c=\sum_j\left(\operatorname{Sigmoid}\left(\alpha \cdot x_{ij}^h\right)-\operatorname{Sigmoid}\left(\alpha \cdot x_{ij}^l\right)\right), \qquad H^c=\left\{i, H_i^c\right\}_{i=0}^{255}$$

where $H^c$ denotes the differentiable histogram of $P'_c(R)$ and $H_i^c$ is the estimated number of pixels with intensity value $i$. $\alpha$ is a scale factor; for better estimation we set it to 400 in our experiments. The difference between the two $\operatorname{Sigmoid}(\cdot)$ terms represents the contribution of $x_j$ to the pixel count of intensity value $i$: when $x_j$ exactly matches intensity $i$, the difference is 1, i.e., $x_j$ adds 1 to $H_i^c$.

Finally, we use an $\ell_1$ loss to constrain the estimated differentiable histogram. Therefore, the SCH loss can be described as:

$$\mathcal{L}_{SCH}=\sum_c\left\|H^c\left(\hat{I}_h\right)-H^c\left(I_h\right)\right\|_1,$$

where $\hat{I}_h$ and $I_h$ denote the output and the ground truth respectively, and $H^c(\cdot)$ represents the histogram estimation process.
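
The following is a single-channel sketch of the differentiable histogram and the resulting SCH loss, assuming pixel values normalized to [0, 1]; the refinement of $P$ into $P'$ (excluding boundary pixels) is omitted, and all names are illustrative.

```python
import torch

def soft_histogram(values, bins=256, alpha=400.0):
    """Differentiable histogram of a flat tensor of pixel values in [0, 1]:
    each pixel contributes via the difference of two sigmoids (sketch)."""
    i = torch.arange(bins, dtype=values.dtype, device=values.device)        # intensity levels 0..255
    x = values.reshape(-1, 1)                                               # (N, 1)
    x_high = x - (i - 0.5) / 255.0                                          # higher anchor x_ij^h
    x_low = x - (i + 0.5) / 255.0                                           # lower anchor x_ij^l
    contrib = torch.sigmoid(alpha * x_high) - torch.sigmoid(alpha * x_low)  # (N, 256)
    return contrib.sum(dim=0)                                               # H_i for every intensity i

def sch_loss(pred_patches, gt_patches):
    """L1 distance between differentiable histograms of matching patches (sketch)."""
    loss = 0.0
    for p, g in zip(pred_patches, gt_patches):       # e.g. patches from generate_patches above
        for ch in range(p.shape[0]):                 # per-color-channel histogram
            loss = loss + torch.abs(soft_histogram(p[ch]) - soft_histogram(g[ch])).sum()
    return loss
```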

3.4 Semantic-Guided Adversarial Loss

In the image inpainting task, global and local discriminators are used to encourage more realistic results [14, 25]. EnlightenGAN [16] also adopts this idea, but its local patches are selected randomly rather than focusing on fake regions. Therefore, we introduce semantic information to guide the discriminator to focus on useful regions. To do so, we use the segmentation map $I_{\text{seg}}$ and the image patches $P'$ mentioned in Section 3.3 to further refine the global and local adversarial loss functions. Finally, we propose the semantic-guided adversarial (SA) loss.

For the local adversarial loss, we first take the refined patch group $P'$ of the output $I_{\text{out}}$ as the candidate fake patches. We then compare the discriminator's scores on the image patches in $P'$; the worst-scoring patch is the one most likely to be "fake" and is selected to update the parameters of the discriminator and generator. In this way, the discriminator is likely to find the target fake region $x_f \sim p_{\text{fake}}$, while the real patch $x_r \sim p_{\text{real}}$ is still randomly cropped from a real image each time. The local adversarial loss function is defined as:

$$\mathcal{L}_{\text{local}}=\min_G \max_D \; \mathbb{E}_{x_r \sim p_{\text{real}}} \operatorname{MSE}\left(D\left(x_r\right), 0\right)+\mathbb{E}_{x_f \sim p_{\text{fake}}} \operatorname{MSE}\left(D\left(x_f\right), 1\right),$$

$$x_f=P^t, \quad D\left(P^t\right)=\min\left(D\left(P^0\right), \ldots, D\left(P^{\text{class}}\right)\right)$$

where $\operatorname{MSE}(\cdot)$ is the mean squared error and $P^t$ is the target fake patch.
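
A minimal sketch of how the target fake patch could be selected is given below; it assumes the discriminator returns a higher score for more realistic inputs, and the names are hypothetical.

```python
import torch

def select_target_fake_patch(patches, discriminator):
    """Pick the candidate patch that the discriminator rates as least realistic,
    i.e. x_f = P^t with D(P^t) = min(D(P^0), ..., D(P^class)) (sketch)."""
    scores = torch.stack([discriminator(p.unsqueeze(0)).mean() for p in patches])
    return patches[torch.argmin(scores)]
```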

For the global adversarial loss, we adopt a simple design to achieve semantic-aware guidance in identifying fake samples. We concatenate $I_{\text{out}}$ with $I'_{\text{seg}}$, the segmentation output feature before the Softmax, to form a new $x_f$, while $x_r$ is an image randomly sampled from the real distribution. Finally, the global adversarial loss function is defined as:

$$\mathcal{L}_{\text{global}}=\min_G \max_D \; \mathbb{E}_{x_r \sim p_{\text{real}}} \operatorname{MSE}\left(D\left(x_r\right), 0\right)+\mathbb{E}_{x_f \sim p_{\text{fake}}} \operatorname{MSE}\left(D\left(x_f, I_{\text{seg}}^{\prime}\right), 1\right)$$
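
For illustration, forming the global fake sample could look like the sketch below, assuming $I'_{\text{seg}}$ is a per-class logit map with the same spatial size as the enhanced output; the name and shapes are illustrative.

```python
import torch

def global_fake_input(i_out, seg_logits):
    """Concatenate the enhanced image with the pre-Softmax segmentation
    features I'_seg along the channel dimension to form x_f (sketch)."""
    return torch.cat([i_out, seg_logits], dim=1)   # (B, 3 + num_classes, H, W)
```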

Therefore, the SA loss can be defined as:

$$\mathcal{L}_{SA}=\mathcal{L}_{\text{global}}+\mathcal{L}_{\text{local}}$$

We define the original loss function of the enhancement network as $\mathcal{L}_{\text{recon}}$; depending on the original settings of the chosen method, it can be an $\ell_1$ loss, MSE loss, SSIM loss, etc., or a combination of them. Therefore, the overall loss function of our SKF can be expressed as:

$$\mathcal{L}_{\text{all}}=\mathcal{L}_{\text{recon}}+\lambda_{SCH} \mathcal{L}_{SCH}+\lambda_{SA} \mathcal{L}_{SA}$$

where the $\lambda$ terms are the weights used to balance these losses.

4. Experiments

4.1 Experimental Settings

  • Datasets
    We evaluate the proposed framework on several datasets covering different scenarios, including LOL [43], LOL-v2 [49], MEF [31], LIME [12], NPE [39] and DICM [21]. The LOL dataset [43] is a real-capture dataset containing 485 low/normal-light image pairs for training and 15 for testing. The LOL-v2 dataset [49] refers to the real part of LOL-v2, which is larger and more diverse than LOL, including 689 low/normal-light pairs for training and 100 pairs for testing. MEF (17 images), LIME (10 images), NPE (85 images) and DICM (64 images) are real datasets containing unpaired images.

  • Metrics
    To evaluate the performance of different LLIE methods with and without SKF, we use both full-reference and no-reference image quality metrics. For the LOL/LOL-v2 datasets, we use Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [42], Learned Perceptual Image Patch Similarity (LPIPS) [52], and the Natural Image Quality Evaluator (NIQE) [33]. For the MEF, LIME, NPE and DICM datasets without paired data, only NIQE is used because there is no ground truth.

  • Compared Methods
    To verify the effectiveness of our design, we compare our method with a series of SOTA LLIE methods, including LIME [13], RetinexNet [43], KinD [54], DRBN [48], KinD++ [53], Zero-DCE [11], ISSR [8], EnlightenGAN [16], MIRNet [51], HWMNet [7], SNR-LLIE-Net [46], and LLFlow [41]. To truly demonstrate the superiority of our method, we select several methods as baseline networks: the most representative methods RetinexNet, KinD and KinD++, and the latest methods HWMNet, SNR-LLIE-Net and LLFlow. Accordingly, our methods are labeled RetinexNet-SKF, KinD-SKF, DRBN-SKF, KinD++-SKF, HWMNet-SKF, SNR-LLIE-Net-SKF, LLFlow-S-SKF and LLFlow-L-SKF (the small and large versions of LLFlow, respectively).

  • Implementation Details
    We conduct experiments on NVIDIA 3090 and NVIDIA A100 GPUs, based on the code released by each baseline network with the same training settings. Among them, only the last subnet of RetinexNet-SKF, KinD-SKF and KinD++-SKF is trained with the SCH loss and SA loss, and the other subnets are trained with the original loss functions. In addition, we do not apply the SA loss to LLFlow because its output during the training phase is not the enhanced image. Furthermore, SE modules are placed in the decoders of all baseline networks.

4.2 Quantitative Evaluation

  • Quantitative results on LOL and LOL-v2 datasets
    The evaluation results are shown in Table 1. We can observe that our SKF achieves consistent and significant performance gains over each baseline method. Specifically, our SKF provides an average PSNR improvement of 1.750 dB/1.611 dB on the LOL/LOL-v2 datasets, respectively, by introducing the ability to suppress noise and artifacts and maintain color consistency. Notably, our LLFlow-L-SKF achieves PSNR values of 26.798 dB/28.451 dB on the LOL/LOL-v2 datasets, establishing a new SOTA. Similar behavior is observed for SSIM: on the LOL/LOL-v2 datasets, our SKF yields an average SSIM gain of 0.041/0.037, which shows that our SKF helps baseline methods recover brightness and contrast and preserve detailed structural information. Furthermore, the large gains in LPIPS and NIQE provided by our SKF indicate that the results better match human perception thanks to the introduced semantic priors.
  • Quantitative results on MEF, LIME, NPE and DICM datasets.
    The evaluation results on the MEF, LIME, NPE and DICM datasets are shown in Table 2. Overall, every method using SKF achieves better NIQE results than its baseline on all 6 datasets, except for 3 worse cases of DRBN-SKF and HWMNet-SKF. RetinexNet-SKF performs best on the MEF dataset with a NIQE value of 3.632, while KinD++-SKF performs best on the other 5 datasets. Notably, our SKF yields an average NIQE gain of 0.519 across all methods and datasets. The results show that the methods obtain more natural textures and perform better on low-light images.

4.3 Qualitative Evaluation

Qualitative evaluations on the LOL and LIME datasets are shown in Figure 4 and Figure 5, respectively. As can be seen from Figure 4, our SKF improves the enhancement capabilities of the baseline methods and generates images of more satisfactory perceptual quality. Specifically, the RetinexNet results are unrealistic due to obvious chromatic aberration and severe noise, which can be mitigated by our SKF. Compared with the results of KinD and KinD++, KinD-SKF and KinD++-SKF resolve the problems of inconsistent lighting and strange white artifacts. For the other methods, our SKF achieves more consistent color and more natural detail restoration for tables, walls and clothing.

We further show the visual enhancement results on the LIME dataset in Fig. 5. It can be seen that our SKF method suppresses the unnatural halo around the luminaire and restores natural colors and details. Therefore, our method using SKF produces more pleasing visual results compared to the baseline, which supports the excellent performance of our method in quantitative evaluation. More visualization results are provided in the supplementary material.

4.4. Ablation Study

We conduct ablation studies on the LOL dataset to demonstrate the effectiveness of our SKF in several ways.

  • SCH loss, SA loss and SE module
    The results are shown in Table 3. We conduct experiments on KinD++-SKF, DRBN-SKF and HWMNet-SKF.
    After adding the SCH loss and the SE module, PSNR improves by an average of 0.243 dB and 0.841 dB over the baseline, respectively. Applying the SCH loss and the SE module together further improves the baseline methods, yielding an average gain of 1.741 dB. This verifies that more beneficial semantic-aware priors are integrated into the enhancement process. Although adding the SA loss leads to a slight drop in some full-reference metrics, NIQE achieves an average gain of 0.292 in all cases. Thus, each component refines the baseline method with semantic-aware knowledge, and the overall framework leads to significant performance gains. In addition, the results in Fig. 6 show that the model using the SCH loss and SE module can maintain color consistency and details, while the SA loss reduces fake regions by producing more natural textures.

  • Semantic-guided losses
    Table 4 lists the results for different loss settings. For the SCH loss, w/o S and w/ S denote computing the global histogram and the semantically guided histogram, respectively. For the SA loss, w/o SA, w/o S and w/ S denote no adversarial loss, the classic global and local adversarial losses (as in EnlightenGAN [16]), and our SA loss, respectively. First, HWMNet-SKF with the SCH loss performs better, with an average PSNR improvement of 0.512 dB, which shows that the SCH loss is effective at maintaining color consistency. Furthermore, adding the classic adversarial loss yields an average NIQE gain of 0.271, which can be attributed to the discriminator's ability to improve visual quality. Finally, our SA loss achieves a gain of 0.411 over the baseline on NIQE, demonstrating that semantic priors help to discover fake regions, resulting in more natural images.

  • Superiority of semantic priors
    We choose HWMNet-SKF, LLFlow-S-SKF and LLFlow-L-SKF to investigate whether the performance improvement comes from the semantic priors provided by our SKF or simply from the extra parameters of the SE module. As shown in Table 5, Baseline, Large and w/ SKF represent the original model, the original model with more layers or channels, and the original model with our SKF, respectively. Our method achieves a significant improvement in PSNR with an average margin of 1.272 dB. Thus, we demonstrate the superiority of semantic priors rather than additional parameters.

5. Conclusion

This paper proposes a novel semantic-aware image enhancement framework named SKF. SKF introduces semantic priors into the enhancement network to preserve color consistency and visual details through the SE module, the SCH loss and the SA loss. The SE module allows image features to perceive rich spatial information through semantic feature representations. The SCH loss provides effective semantic-aware region constraints to maintain color consistency. The SA loss combines global and local adversarial losses with semantic priors to find fake regions and produce natural results. Extensive experiments show that our SKF achieves superior performance on all six baseline methods, and LLFlow-L-SKF outperforms all competitors. However, the improvement is limited when dealing with unknown categories, and there is further potential in improving the SKB's ability to recognize unknown instances. In addition, we will also explore the potential of SKF in other low-level vision tasks.
