Multiframe-to-Multiframe Network for Video Denoising


Summary

Existing methods: multiple adjacent frames are used to restore a single clean frame. The results are good, but because frames are denoised sequentially without temporal considerations, the video may flicker.

This paper: a multiframe-to-multiframe (MM) denoising model is proposed to recover multiple clean frames from consecutive noisy frames. With the proposed training strategy, the denoised video is optimized in both the temporal and spatial dimensions, thereby maintaining temporal consistency.

MMNet architecture: adopts a spatio-temporal convolution mechanism that considers both inter-frame similarity and single-frame characteristics, and benefits from parallelism.

INTRODUCTION

1. Denoising is widely used. The basic noise degradation model is z = r + η. Unlike static images, video possesses both spatial information and rich temporal redundancy.

Due to these characteristics, video denoising faces two challenges: (1) the volume of video data is large; (2) the content of each frame varies continuously along the temporal dimension.

2. Traditional methods rely on handcrafted priors to model clean videos; the classic examples are patch-based methods (the paper analyzes their three-step strategy). Current CNN-based methods take an intermediate reference frame and its adjacent auxiliary frames as input and output the denoised frame corresponding to the reference frame, but they have disadvantages: (1) they cannot directly optimize quality along the temporal dimension, since they restore clean sequences frame by frame, which may cause visual flickering; (2) they are not efficient enough, because they must process multiple frames to recover only one.


4. Comparing the differences and advantages of MM versus SS/MS

Compared with SS: MM makes full use of the temporal redundancy of the sequence, improving spatial quality.

Compared with MS: MS is widely used in video denoising and burst image denoising, but this paper proposes outputting multiple consecutive frames instead of a single frame, which allows the denoising results to be optimized along the temporal dimension.

Owing to the inherent parallelism of the scheme, MM is more efficient. Moreover, unlike other video processing works [39]-[41] that adopt multiple-frame-in/multiple-frame-out training schemes, the proposed method focuses on reconstructing multiple frames.

5. MMNet module and its function

The proposed MMNet consists of an interframe denoising module, an intraframe denoising module, and a merging module.

Interframe denoising: this module explores inter-frame similarity by extracting features along both the spatial and temporal dimensions, which helps exploit the temporal redundancy within consecutive frames.

Intraframe denoising: this module focuses on refining single-frame characteristics by extracting spatial features, which helps improve the spatial representation of each individual frame.

Subsequently, the merging module aggregates the two sets of features and estimates the clean sequence.


Paper excerpt: the interframe denoising module explores the interframe similarity by extracting the features from both the spatial and temporal dimensions, which helps to capitalize on the temporal redundancy within consecutive frames. In contrast, the intraframe denoising module focuses on refining the single-frame characteristics by extracting spatial features, which helps to improve the spatial representation of each individual frame. Subsequently, the merging module aggregates both sets of features and estimates the clean sequence. MMNet recovers video in parallel and does not need to compute optical flow at the inference stage, which considerably improves the denoising efficiency.

RELATED WORK

(1) Image denoising

Image denoising, like the SS scheme, has been quite popular in recent years, with CNN-based methods especially prominent; the paper briefly introduces two representative models. Although extended to image denoising, these CNN-based models have limited performance because they use simple networks for feature extraction.

To extract more representative features, most subsequent work has been devoted to developing better CNN architectures: DnCNN and its improved variants, encoder-decoder architectures, explicit noise modeling, and so on. However, these methods do not consider the temporal redundancy between consecutive frames, so their performance on video is limited.

(2) Video denoising

The high correlation between adjacent video frames provides rich temporal redundancy, which helps improve denoising quality. To take full advantage of this redundancy, existing video denoising methods tend to adopt MS denoising schemes. For example, [12] extended the idea of BM3D to video denoising (VBM3D) by using an inductive procedure to search for similar blocks in a data-adaptive spatio-temporal subdomain of the video sequence. Building on this approach, VBM4D [15] uses motion-compensated 3D patches to overcome VBM3D's main problem of being unable to distinguish temporal from spatial similarity among grouped patches. In [13], a related extension of this family, BM4D, achieves video denoising by stacking mutually similar rectangular 3D patches into a 4D array and then removing the noise in the transform domain. Furthermore, to mitigate motion artifacts, Buades et al. [16] proposed a denoising algorithm that combines patch-based denoising with motion estimation. These methods greatly improve the final video quality; however, most of them rely on handcrafted priors and are inefficient at processing video data, leaving considerable room for further improvement in both denoising quality and efficiency.

Recently, with the advancement of deep learning, related techniques have been applied to many video processing tasks, such as semantic segmentation [51], event summarization [52]-[54], and sign language recognition [55]. To improve denoising performance, many CNN-based video denoising methods have been proposed. The earliest attempt [33] devised a recurrent architecture for video denoising; however, this method cannot effectively exploit self-similarity, so its denoising performance cannot compete with patch-based methods. To solve this problem, more research has tended to employ CNN models that learn a direct mapping from input to output. In particular, the methods proposed in [23]-[25] use cascaded 2D CNNs to perform spatial and temporal denoising separately, which enables them to achieve state-of-the-art denoising performance. Furthermore, video non-local denoising networks combining non-local patch-based methods with CNN models were proposed in [26], [27]. In [29], Xue et al. integrated motion estimation and image processing steps into video processing models, leveraging a task-oriented pipeline in a self-supervised manner. In [31], Xu et al. developed a 3D deformable kernel for video denoising and proposed a spatio-temporal pixel aggregation network to efficiently sample pixels in spatio-temporal space. Compared with patch-based methods, CNN-based methods achieve huge improvements in denoising quality. However, they recover clean sequences frame by frame, which prevents them from optimizing the denoising results in the temporal dimension. Furthermore, to recover a clean sequence they must process each noisy frame multiple times, which limits their denoising efficiency. In contrast, the method proposed in this paper adopts an MM denoising scheme that directly recovers short sequences, which enables the denoising model to optimize the denoising results in both the spatial and temporal dimensions while achieving more competitive denoising efficiency.

METHODOLOGY

MM Denoising Scheme

A noisy video $z: X \times T \rightarrow \mathbb{R}$ is defined as

$$z(x, y, t) = r(x, y, t) + \eta(x, y, t), \quad x, y \in X,\; t \in T$$

where $r$ denotes the clean video, $\eta$ the additive noise, and $(x, y, t)$ the 3D spatio-temporal coordinates, with $X$ the spatial domain and $T$ the temporal domain.

Given the observed noisy sequence $z(X, T)$, the denoising model $\mathrm{D}(\cdot)$ with network parameters $\Theta$ produces the denoised sequence $\tilde{r}$.

SS model: $$\tilde{r}(X, t) = \mathrm{D}(z(X, t); \Theta)$$

MS model: $$\tilde{r}(X, t) = \mathrm{D}\left(z\left(X, t_{\{-n, n\}}\right); \Theta\right)$$

Here, $z(X, t)$ denotes a noisy frame from $z(X, T)$, $n$ denotes the number of adjacent frames on each side, and $\tilde{r}(X, t)$ denotes the clean frame recovered at time $t$.

Clearly, the temporal consistency of the denoising results of SS or MS denoising models cannot be optimized in the temporal dimension, since these models recover sequences in a frame-by-frame manner. Moreover, SS denoising models simply cannot exploit temporal redundancy to improve the spatial quality of denoising results.


MM model (this paper): $$\tilde{r}\left(X, t_{\{-\hat{n}, \hat{n}\}}\right) = \mathrm{D}\left(z\left(X, t_{\{-n, n\}}\right); \Theta\right)$$

Here $\tilde{r}\left(X, t_{\{-\hat{n}, \hat{n}\}}\right)$ is the denoised sequence; the total number of frames restored by the model is $2\hat{n}+1$ (with $\hat{n} < n$). Compared with the SS and MS denoising schemes, the proposed scheme recovers multiple consecutive frames simultaneously, which enables it to optimize the denoising results in both the spatial and temporal dimensions. Therefore, a hybrid loss consisting of a spatial loss and a temporal loss is proposed to train the denoising model.
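As a concrete illustration of the three schemes' input/output contracts, the following minimal sketch uses PyTorch-style tensor shapes; all names and sizes here are illustrative assumptions, not the paper's code.

```python
import torch

B, C, H, W = 4, 3, 64, 64
n, n_hat = 3, 1  # n adjacent input frames per side; n_hat recovered frames per side

noisy = torch.randn(B, 2 * n + 1, C, H, W)  # z(X, t_{-n..n}): 7 noisy frames

ss_out = (B, 1, C, H, W)               # SS: one frame in, one frame out per step
ms_out = (B, 1, C, H, W)               # MS: 2n+1 frames in, only the central frame out
mm_out = (B, 2 * n_hat + 1, C, H, W)   # MM: 2n+1 frames in, 2*n_hat+1 frames out

# Because MM emits several consecutive frames at once, a temporal loss can be
# applied across them, which is impossible for the per-frame SS/MS outputs.
print(mm_out)  # (4, 3, 3, 64, 64)
```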

The Architecture of MMNet

1) Interframe Denoising Module


To take full advantage of the temporal redundancy in video, the interframe denoising module is implemented with an encoder-decoder architecture. As shown in Fig. 3, the downsampling and upsampling operations are implemented with the pixel-shuffle strategy [56], which transforms the extracted features between the spatial and channel dimensions. Furthermore, a spatio-temporal convolution operation [57] extracts spatio-temporal features from both the spatial and temporal dimensions by convolving 3D kernels over 3D rectangular patches of the consecutive input frames. Specifically, convolution kernels of sizes 1×3×3, 3×1×1, and 3×3×3 are used to extract spatial features, extract temporal features, and aggregate the spatio-temporal information, respectively; the number of output feature maps is set to 64. In addition, LeakyReLU nonlinearity [58] and batch normalization [59] are used to facilitate model training. In this way, the module effectively exploits the inter-frame similarity within the consecutive input frames.
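Below is a hedged sketch of one spatio-temporal block built from these factorized kernels; the layer ordering, channel widths, and the omitted pixel-shuffle encoder-decoder wiring are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            # spatial features: 1x3x3 kernel over (time, height, width)
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
            # temporal features: 3x1x1 kernel
            nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
            # spatio-temporal aggregation: 3x3x3 kernel
            nn.Conv3d(out_ch, out_ch, kernel_size=(3, 3, 3), padding=1),
            nn.BatchNorm3d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):  # x: (batch, channels, frames, H, W)
        return self.body(x)

feats = SpatioTemporalBlock(3, 64)(torch.randn(2, 3, 7, 64, 64))
print(feats.shape)  # torch.Size([2, 64, 7, 64, 64])
```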

2) Intraframe Denoising Module

Reason: the method exploits the temporal redundancy within consecutive frames with the help of spatio-temporal features. However, when the denoising model uses only spatio-temporal features to represent each frame, the spatial representation of each individual frame may be affected by object motion, so the denoising results may suffer from motion artifacts. Therefore, the features of each individual frame must be refined to improve their spatial representation.

In this work, an intraframe denoising module is proposed to explore single-frame characteristics. Its backbone operates only on the spatial dimension of each input frame, which helps avoid the impact of object motion on the per-frame spatial representation. Essentially, single-frame features are complementary to spatio-temporal features and help generate a more accurate spatial representation for each individual frame.
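A minimal sketch of this idea, assuming a small stack of 2D convolutions applied to every frame independently so that motion cannot leak between frames (the real module's depth and width are not specified here):

```python
import torch
import torch.nn as nn

class IntraframeBlock(nn.Module):
    def __init__(self, in_ch=3, mid_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):  # x: (batch, frames, channels, H, W)
        b, t, c, h, w = x.shape
        y = self.body(x.reshape(b * t, c, h, w))  # each frame processed alone
        return y.reshape(b, t, -1, h, w)

out = IntraframeBlock()(torch.randn(2, 7, 3, 64, 64))
print(out.shape)  # torch.Size([2, 7, 64, 64, 64])
```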

3) Merging Module

The merging module is used to recover consecutive frames by aggregating the extracted spatiotemporal and spatial features.

(1) Concatenate the spatio-temporal and spatial features extracted by the previous two modules.

(2) A spatio-temporal convolution operation integrates the concatenated features to generate a residual noise map.

(3) Adopt the residual learning strategy [60] to restore the denoised result.

This module uses a simple architecture, since the features have already been fully extracted by the interframe and intraframe denoising modules; a minimal sketch follows.
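Under the three-step description above, a hedged sketch of the merging module could look like this; the fusion depth and channel counts are assumptions, and the residual is subtracted from the noisy input as the residual-learning step:

```python
import torch
import torch.nn as nn

class MergingModule(nn.Module):
    def __init__(self, feat_ch=64, out_ch=3):
        super().__init__()
        # (2) fuse concatenated features into a per-frame residual noise map
        self.fuse = nn.Conv3d(2 * feat_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, inter_feat, intra_feat, noisy):
        # inter_feat/intra_feat: (batch, 64, frames, H, W); noisy: (batch, 3, frames, H, W)
        x = torch.cat([inter_feat, intra_feat], dim=1)  # (1) concatenate
        residual = self.fuse(x)                         # (2) residual noise map
        return noisy - residual                         # (3) residual learning

clean = MergingModule()(torch.randn(2, 64, 7, 64, 64),
                        torch.randn(2, 64, 7, 64, 64),
                        torch.randn(2, 3, 7, 64, 64))
print(clean.shape)  # torch.Size([2, 3, 7, 64, 64])
```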

MM Training

To optimize the denoising model, a hybrid loss is proposed for network training. The hybrid loss includes a temporal loss, since temporal consistency is an important perceptual factor for video.

Hybrid loss function: $$\ell_{\text{hybrid}} = \ell_{\text{spatial}}(r, \tilde{r}) + \lambda\, \ell_{\text{temporal}}(r, \tilde{r})$$

Here $\ell_{\text{spatial}}$, $\ell_{\text{temporal}}$, and $\ell_{\text{hybrid}}$ denote the spatial, temporal, and hybrid losses, respectively. The parameter $\lambda$ balances the temporal and spatial losses. The spatial loss is similar to the loss functions used in MS and SS schemes; the temporal loss forces object motion and illumination changes in the denoised sequence to be temporally consistent with the original sequence.

Spatial loss: a spatial loss ensures that the content of each denoised frame is as close as possible to the ground truth. In general, loss functions commonly used in MS or SS denoising methods can serve as the spatial loss, including the mean-squared loss [25], total-variation loss [61], and perceptual loss [62]. In this paper, for simplicity, the mean-squared loss is used:

$$\ell_{\text{spatial}}(r, \tilde{r}) = \frac{1}{2B} \sum_{i=1}^{B} \sum_{t=-\hat{n}}^{\hat{n}} \left\|\tilde{r}_{i}(X, t) - r_{i}(X, t)\right\|^{2} = \frac{1}{2B} \sum_{i=1}^{B} \sum_{t=-\hat{n}}^{\hat{n}} \left\|\mathrm{D}\left(z_{i}(X, t); \Theta\right) - r_{i}(X, t)\right\|^{2}$$
Here $B$ denotes the batch size of training pairs and $\hat{n}$ the number of adjacent frames on each side of the reference frame. With the spatial loss, the denoising model removes most of the noise in the input sequence. However, the spatial loss evaluates each denoised frame independently, so it cannot optimize the denoised sequence along the temporal dimension.
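A direct transcription of this spatial loss, assuming (batch, frames, channels, H, W) tensors where `denoised` holds the $2\hat{n}+1$ recovered frames and `clean` the matching ground truth:

```python
import torch

def spatial_loss(denoised: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    # sum of squared errors over all recovered frames and pixels, scaled by 1/(2B)
    return ((denoised - clean) ** 2).sum() / (2 * denoised.shape[0])
```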

Temporal loss: to achieve better temporal consistency in the denoised video, a temporal loss is proposed that optimizes the recovered consecutive frames by forcing the motion and intensity changes of the denoised video to be temporally consistent with those of the original video. Specifically, the forward optical flow is first computed as

$$f_o(X, t) = \mathrm{F}(r(X, t), r(X, t-1))$$

where $\mathrm{F}$ is the optical flow estimation function.

Then, the denoised frames and ground-truth frames are warped according to

$$r_{w}^{\prime}(X, t) = \mathrm{W}\left(r^{\prime}(X, t-1), f_o(X, t)\right)$$

where $r^{\prime}(X, t-1)$ denotes either a ground-truth frame or a denoised frame, and $\mathrm{W}$ is the flow-compensated warping function from frame $t-1$ to frame $t$ [40].

$r_w$ denotes the warped ground-truth frame and $\tilde{r}_w$ the warped denoised frame. Because only the perceptual quality of the current frame relative to the warped frame matters, and because, under motion, some pixels of $r^{\prime}$ may not appear in $r_{w}^{\prime}$, the temporal loss is computed with a mask $m$:

$$\ell_{\text{temporal}}(r, \tilde{r}) = \frac{1}{B} \sum_{i=1}^{B} \sum_{t=-\hat{n}+1}^{\hat{n}} \left\| m_{i}(X, t) \odot \left(\left(\tilde{r}_{i}(X, t) - \tilde{r}_{wi}(X, t)\right) - \left(r_{i}(X, t) - r_{wi}(X, t)\right)\right) \right\|$$
where $m(X, t) \in [0, 1]$ is the mask computed from the optical flow; it is 0 in occlusion and motion-boundary areas and 1 elsewhere, and $\odot$ denotes element-wise multiplication. Although optical flow computation is time-consuming, it is needed only during training; at the inference stage no optical flow information is required, which contributes to the model's competitive denoising efficiency.
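A hedged sketch of this temporal loss follows. The optical-flow estimator F is left external (any off-the-shelf flow network would do), the warp W is implemented with bilinear grid sampling, and the norm is taken as an L1 sum here; the paper's exact choices come from [40] and may differ.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W), i.e. the t-1 frame, to align with
    frame t, using the forward flow f_o (B,2,H,W) computed from t to t-1."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)   # (2,H,W) pixel coords
    coords = grid.unsqueeze(0) + flow                       # sample locations
    # normalize to [-1, 1] for grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

def temporal_loss(denoised, clean, flows, masks):
    """denoised/clean: (B,T,C,H,W); flows (B,T,2,H,W) and masks (B,T,1,H,W)
    are precomputed on the clean video (mask = 0 at occlusion/motion boundaries)."""
    b, t = denoised.shape[:2]
    loss = denoised.new_zeros(())
    for k in range(1, t):  # corresponds to t = -n_hat+1 .. n_hat in the paper
        r_w = warp(clean[:, k - 1], flows[:, k])        # warped ground truth
        rt_w = warp(denoised[:, k - 1], flows[:, k])    # warped denoised frame
        diff = (denoised[:, k] - rt_w) - (clean[:, k] - r_w)
        loss = loss + (masks[:, k] * diff).abs().sum()  # occluded pixels masked out
    return loss / b
```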

EXPERIMENTAL RESULTS

Experimental Settings

(1) Weight initialization: Kaiming uniform initialization [64].

(2) ADAM optimizer, 35 epochs, batch size 10, initial learning rate 0.0004 halved every 6 epochs, patch size 128, 7 input frames, and loss hyperparameter λ = 0.02 (see Section V for this setting); a sketch of this schedule follows the list.

(3) Training dataset (synthetic): the Vimeo-90K dataset with AWGN of σ ∈ [0, 55] added.

(4) Test dataset (real): the Captured Raw Video Denoising (CRVD) dataset [30].

Note that the proposed method is suitable for standard RGB (sRGB) video denoising; however, the CRVD dataset contains videos in RAW format. Therefore, we follow the technique used in [30] and use a pretrained image signal processor (ISP) model [66] to generate a real sRGB video dataset.

(5) Evaluation criteria: PSNR and SSIM; in addition, following FastDVDnet [25], the spatio-temporal reduced-reference entropic differences (ST-RRED) index [67] is used to measure temporal distortion and evaluate temporal consistency.
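A sketch of the optimizer schedule from item (2), assuming a standard PyTorch setup; `model` is a stand-in for MMNet, which is not reproduced here:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder for MMNet
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
# halve the learning rate every 6 epochs, per item (2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.5)

for epoch in range(35):
    # each batch: 10 clips of 7 frames cropped to 128x128 patches, with AWGN
    # of sigma drawn from [0, 55] (8-bit scale) added on the fly, per item (3);
    # loss = l_spatial + 0.02 * l_temporal, per item (2)
    scheduler.step()
```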

Comparison Results for Gaussian Noise

Mainly a comparison against SS and MS methods.


Compared with DnCNN, MMNet's superior performance is attributable to its interframe denoising module, which fully exploits the spatial and temporal redundancy between consecutive frames.

Compared with MS denoising methods, MMNet's performance stems from its architecture design and the proposed MM training scheme: it fully considers inter-frame similarity and single-frame characteristics, exploiting temporal redundancy and reducing artifacts.

In addition, the table shows that the temporal loss design improves the results (compare the last two rows, with and without the temporal loss).

Figure 4 of the paper: note the sharpness of the face and background shown in the green boxes. The proposed method recovers sharper faces and produces fewer artifacts in the background.

Comparison Results for Real-World Noise


Why these values differ: the performance of different methods differs between real noise and Gaussian noise. These differences may arise because the data distribution of the CRVD test set differs from that of the Vimeo-90K test set, and real noise is more complex than Gaussian noise. Furthermore, the learning ability and generality of the denoising models differ between real and Gaussian noise. Figure 6 of the paper shows visual comparison results for these methods.

Generality Evaluation Using Other Types of Noise

Comparison under other noise types, namely Poisson noise and speckle noise (all models are retrained with the corresponding noise).


Temporal Consistency

Temporal consistency is an important factor for visual quality. ST-RRED is used as an objective metric for evaluation; the results are shown in Table 1, where smaller ST-RRED values are better.

The temporal quality of MS denoising methods relies mainly on learning residuals from consecutive input frames, but without explicit supervision the learned residuals may be inaccurate. In contrast, the proposed MMNet not only learns the residual from the input but also optimizes the output along the temporal dimension. Furthermore, MMNet uses the intraframe denoising module to refine the features of each individual frame and improve their spatial representation; thus, its advantage on this metric is more pronounced.


In Figure 8, DnCNN cannot exploit temporal redundancy to recover texture details, resulting in over-smoothed grass and temporally inconsistent grass texture.

The state-of-the-art methods DVDnet and FastDVDnet are able to recover some details in grass areas, but they produce motion artifacts that cause temporal inconsistencies in detail textures. In contrast, with the help of temporal loss, the proposed MMNet recovers fine details and maintains high temporal consistency.

Runtime


This dramatic improvement can be attributed to the underlying parallelism and MMNet's ability to handle motion implicitly.

DISCUSSION AND ANALYSIS

Ablation Study

Results when one of the modules is removed:

Removing the interframe module: the model can no longer use spatio-temporal redundancy to recover fine details.

Removing the intraframe module: when only spatio-temporal features represent each frame, the per-frame spatial representation may be affected by object motion, leading to motion artifacts. Therefore, both the interframe and intraframe denoising modules help improve denoising quality.


Discussion of the Number of Frames

Design: the training set is still the Vimeo-90K dataset. Considering GPU memory constraints, the frame counts discussed are 1, 3, 5, and 7.

Two aspects are considered: the number of input frames and the number of output frames.

1) Discussion of the Number of Input Frames


2) Discussion of the Number of Output Frames

Simultaneously recovering multiple frames enables the proposed method to optimize the denoising results in spatial and temporal dimensions, however, it also leads to asymmetric utilization of temporal information since some recovered frames will not be the central frame of the input sequence.


An example uses 7 frames as input and recovers 3 frames (indicated by blue, green, and red boxes). The green-box frame uses the previous three and following three frames symmetrically, while the blue-box frame uses the previous two and following four frames asymmetrically; likewise, the red-box frame uses the previous four and following two frames asymmetrically.

To analyze the impact of the number of output frames, experiments are conducted with the number of input frames fixed at 7 while the number of recovered output frames varies.


When the number of output frames is greater than 1, no substantial improvement or drop occurs, indicating that the impact of asymmetric temporal information utilization is very limited. In addition, methods that restore multiple frames (i.e., 3, 5, and 7) achieve more competitive performance compared to methods that restore only one frame, demonstrating the effectiveness of optimizing denoising results from both spatial and temporal dimensions.

According to this analysis, combined with Fig. 10(b), the runtime is lowest when the number of output frames is 7, so 7 is chosen.

Analysis of the Hyperparameter λ

Parameter setting in the loss function

The hyperparameter λ is important for optimizing the denoising results. A sensitivity analysis is performed on the Vimeo-90K test set with λ values ranging from 0 to 1, at noise levels 25 and 45. As shown in Fig. 12, the proposed MMNet significantly improves denoising quality when λ is between 0.01 and 0.1, with the best results when λ is set to 0.02. Therefore, λ is set to 0.02 to train the proposed MMNet.


CONCLUSIONS

MMNet achieves state-of-the-art denoising quality. In addition, it recovers video frames in parallel and does not need to compute optical flow at the inference stage, resulting in very competitive denoising efficiency. Extensive comparisons on synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method.

Limitations and future work:

(1) MMNet requires paired training data, so when an application cannot effectively obtain paired data, the model cannot easily be fine-tuned.

(2) MMNet handles object motion implicitly, so to some extent, its ability to handle motion depends on the training data. These aspects are the main limitations of the proposed method.

(3) In future work, we plan to implement the MM denoising scheme in a self-supervised manner and improve the robustness of MMNet to different motion levels.

