[Paper Translation] Multi-Modal/Multi-Scale Convolutional Neural Network Based In-Loop Filter Design for Next Generation Video Codec

MULTI-MODAL/MULTI-SCALE CONVOLUTIONAL NEURAL NETWORK BASED IN-LOOP FILTER DESIGN FOR NEXT GENERATION VIDEO CODEC

Jihong Kang, Sungjei Kim, Kyoung Mu Lee
Department of ECE, ASRI, Seoul National University, Seoul, Korea
Korea Electronics Technology Institute, Seongnam-si, Korea


Abstract
This paper proposes a novel in-loop filter design for video compression. Our goal is to replace the conventional deblocking filter and sample adaptive offset (SAO) of the HEVC standard with a multi-modal/multi-scale convolutional neural network (MMS-net). The proposed CNN consists of two sub-networks operating at different scales. First, a downscaled version of the input image is restored by the coarse-scale network; its output is then concatenated with the original input and fed to the fine-scale network. In addition, to improve restoration performance, the proposed architecture exploits information from the coding process. Specifically, by feeding coding tree unit (CTU) partition information to the CNN as additional input, the network is guided to reduce blocking artifacts in the reconstructed image. In our experiments under the "All Intra-Main" configuration, the proposed method reduces BD-rate by 4.55% and 8.5% compared with a previous neural-network-based method [1] and the HEVC reference software HM 16.7 [2], respectively.

Index Terms — loop filter, HEVC, CNN, video compression, image restoration

1. Introduction
Compression of video and images inherently introduces distortion into the frame content. Compression technology has progressed by reducing the number of bits needed to store the compressed data while maintaining content quality. Among these techniques, the deblocking filter [3] and the sample adaptive offset (SAO) [4] of the HEVC standard [5] play an important role in removing visual artifacts such as blocking, ringing, and blurring. As its name suggests, the deblocking filter is responsible for removing the blocking artifacts caused by quantization in block-based processing. SAO, newly introduced in the HEVC standard, compensates for the remaining artifacts using additional offset values. These two filters restore not only the subjective quality of the reconstructed image but also its objective quality, and because the reconstructed image serves as a reference for predicting other frames in the sequence, they also help improve the compression ratio.

In recent years, the successful application of convolutional neural networks (CNNs) to image classification has been extended to many other research areas. In image restoration tasks such as super-resolution [6, 7, 8], deblurring [9, 10], and denoising [11], CNNs have outperformed conventional non-learning methods. From a machine learning perspective, a CNN accomplishes an image restoration task by using a large amount of training data to learn a mapping function from the distorted image to the undistorted image. Although the characteristics of the input distortion differ, the learning procedure is very similar across image restoration tasks: typically, the distorted image and the undistorted target image are fed to the CNN as input and target, and the CNN learns to restore the target image by removing the distortion from the input. The same procedure can be applied to learn the in-loop filtering operation in video coding.

Some studies have begun to use CNNs in place of the loop filters. Park and Kim [12] proposed IFCNN as a substitute for SAO. IFCNN has three convolutional layers with a skip connection [7] between input and output. They demonstrated the feasibility of replacing a conventional loop filter with a CNN by reducing BD-rate by 1.6-2.8%. Dai et al. [1] proposed VRCNN as an alternative to the deblocking filter and SAO. VRCNN has four convolutional layers, and each layer has kernels of two different sizes. VRCNN reported encouraging results of 4.6% bit-rate reduction on average. However, the CNN architectures of [1, 12] are too shallow and do not reflect recent advances in CNNs, such as batch normalization [13] and residual networks [14, 15]; moreover, these works simply apply a CNN to the problem without considering the rich information contained in the compressed video.

An encoded video contains a large amount of information that directly or indirectly affects the distortion of a frame, such as the coding parameters (CPs), and we can exploit this information for restoration. For example, the hierarchical coding tree unit (CTU) partition information accurately locates the boundaries where blocking artifacts are likely to appear. Feeding this information to the CNN as input may help the network detect and restore the artifacts accurately.

This paper proposes a multi-modal/multi-scale CNN (MMS-net) architecture designed as an in-loop filter for HEVC. Our model is a fully convolutional neural network and works end to end. The multi-scale structure effectively improves restoration performance through a coarse-to-fine restoration process. In addition, the CTU information of the encoded video guides the network to correctly detect and remove blocking artifacts. To better handle the CTU information, we propose a method of converting it into matrix form and a preprocessing network that matches features from the different modalities.

2. Multi-Modal/Multi-Scale Convolutional Neural Network

2.1. Formatting Coding Parameters
The blocking effect is mainly caused by the block-based compression scheme. Thus, given the block partition information, the network can easily locate and eliminate blocking artifacts. In the HEVC standard, the coding tree unit (CTU) is the basic processing unit of compression. A CTU is recursively divided into coding units (CUs) in a quadtree structure, and each CU is composed of prediction units (PUs) and transform units (TUs). For simplicity, we use only the CU and TU information of each frame as the CP input to the CNN.

To provide the CU and TU information to the CNN, we need to convert it into a suitable form. The CU and TU information should give the position and size of each unit in the input image. In our implementation, we create matrices of the input image size: for each unit, the outermost pixel positions of the corresponding CU or TU are assigned the value "2", and the non-boundary areas of each matrix are set to "1". Thus, two matrices are generated for each frame. An example of an encoded CU matrix is shown in Figure 2.
[Figure 2: Example of an encoded CU partition matrix (boundary pixels set to 2, interior pixels set to 1)]
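As an illustration of this formatting step, the following minimal numpy sketch builds one such matrix from a list of block rectangles. The `build_cp_matrix` helper and its rectangle format are hypothetical conveniences, not from the paper:

```python
import numpy as np

def build_cp_matrix(height, width, blocks):
    """Build a coding-parameter (CP) matrix for one frame.

    `blocks` is a list of (y, x, h, w) rectangles describing the CU (or TU)
    partitions. Interior pixels get the value 1 and the outermost pixels of
    each block get the value 2, as described in Section 2.1.
    """
    cp = np.ones((height, width), dtype=np.float32)
    for y, x, h, w in blocks:
        cp[y, x:x + w] = 2.0          # top edge
        cp[y + h - 1, x:x + w] = 2.0  # bottom edge
        cp[y:y + h, x] = 2.0          # left edge
        cp[y:y + h, x + w - 1] = 2.0  # right edge
    return cp

# Example: a 16x16 frame split into one 16x8 CU and two 8x8 CUs.
cu_matrix = build_cp_matrix(16, 16, [(0, 0, 16, 8), (0, 8, 8, 8), (8, 8, 8, 8)])
```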
2.2. Multi-Modal Adaptation Network
The simplest way to feed the encoded CPs into the network is to concatenate the CP matrices with the input image along the channel axis. However, providing the CP information to the network in this way may hinder rather than improve the restoration process, because the two inputs come from different modalities. To handle multi-modal information efficiently in a single network, we add a simple preprocessing network (called the adaptation network) that converts the CP information into the spatial feature space of the image. A convolutional layer followed by a ReLU layer projects the CP matrices onto a single-channel feature map. This CP feature map is then multiplied element-wise with the input image, and the result is concatenated to the input image as an additional channel. The structure of the adaptation network is shown in the upper-left corner of Figure 3.
[Figure 3: Overall MMS-net architecture, with the CP adaptation network in the upper-left corner]
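A minimal PyTorch sketch of this adaptation step (the paper's implementation is in Caffe; the 3×3 kernel size and the class name here are assumptions, as the exact layer configuration is given only in Figure 3):

```python
import torch
import torch.nn as nn

class CPAdaptationNet(nn.Module):
    """Project the CP matrices to a single-channel feature map, gate the
    input image with it element-wise, and append the gated image as an
    extra input channel (Section 2.2)."""

    def __init__(self, num_cp_channels=2):  # one CU matrix and one TU matrix
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(num_cp_channels, 1, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, image, cp):
        # image: (N, 1, H, W) luma; cp: (N, 2, H, W) CU/TU matrices
        cp_feature = self.project(cp)            # (N, 1, H, W)
        gated = image * cp_feature               # element-wise multiplication
        return torch.cat([image, gated], dim=1)  # (N, 2, H, W)

# Usage with random stand-in tensors:
net = CPAdaptationNet()
out = net(torch.randn(1, 1, 144, 176), torch.randn(1, 2, 144, 176))
```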
2.3. Model Structure
In recent years, multi-scale architectures have been widely used in image restoration research. Multi-scale image restoration can be seen as a hierarchical process in a multi-scale image space, where small details of the restored image are preserved at the finer scales and long-range correlations are preserved at the coarser scales. For example, Nah et al. [9] create a three-level Gaussian pyramid from the blurred image and use a CNN at each corresponding scale for deblurring. This coarse-to-fine restoration concept has proven useful for sharpening severely blurred images. Similarly, our proposed model consists of consecutive sub-networks at two different scales (K = 2). The coarse-scale network restores the distorted image from an input downscaled by half, while the fine-scale network takes both the image reconstructed by the coarse-scale network and the original distorted image as input and outputs the final restored image. In the coarse-scale network, we do not resize the input image outside the network; instead, downsampling is performed by a stride-2 convolutional layer and upsampling by a deconvolution layer. Embedding the interpolation in the network simplifies the handling of the whole system.
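The coarse-to-fine wiring can be sketched as follows, assuming PyTorch. `TwoScaleFilter` is a hypothetical name, and the toy stand-in sub-networks and channel counts are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TwoScaleFilter(nn.Module):
    """Coarse-to-fine wiring of the two sub-networks (K = 2) in Section 2.3.
    Any nn.Module pair with matching channel counts can be plugged in; the
    real sub-networks are SRResNet-style stacks of residual blocks."""

    def __init__(self, coarse, fine):
        super().__init__()
        self.coarse = coarse  # downsamples via stride-2 conv, upsamples via deconv
        self.fine = fine      # takes [distorted, coarse output] as 2 channels

    def forward(self, x):
        r1 = self.coarse(x)                        # coarse-scale restoration
        r2 = self.fine(torch.cat([x, r1], dim=1))  # fine-scale restoration
        return r1, r2                              # both outputs feed the loss

# Toy stand-ins to show the data flow (not the real sub-networks):
net = TwoScaleFilter(
    coarse=nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                         nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1)),
    fine=nn.Conv2d(2, 1, 3, padding=1),
)
r1, r2 = net(torch.randn(1, 1, 144, 176))
```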

The sub-network at each scale is a modified version of SRResNet [8]. The basic building unit of the network is the residual block [15]. Each residual block contains two consecutive sub-modules, each consisting of a batch normalization layer, a ReLU layer, and a convolutional layer. A skip connection between the input and output of each residual block propagates the input signal directly to the output, so the sub-modules with convolutional layers learn residual features on top of the input signal. Similarly, a global skip connection [7] between the input and output of each sub-network guides the network to generate the residual of the restored image over the distorted image. We use M residual blocks for each scale path. All convolutional layers in the network use 3×3 kernels, except the first convolutional layer of each scale's sub-network, which uses a 7×7 kernel. Padding is added at the boundary of each convolutional layer so that the input and output dimensions are the same. Figure 3 shows the number of kernels and the stride of each convolutional layer in the overall architecture.
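A sketch of one such residual block in PyTorch, following the BN → ReLU → conv ordering described above; the channel width of 64 is an assumption (the actual kernel counts are given in Figure 3):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block of the scale sub-networks: two consecutive
    (BN -> ReLU -> 3x3 conv) sub-modules plus a skip connection, so the
    convolutional path learns a residual on top of its input."""

    def __init__(self, channels=64):  # channel width is an assumption
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection

# A scale path stacks M such blocks, e.g. M = 15:
path = nn.Sequential(*[ResidualBlock(64) for _ in range(15)])
y = path(torch.randn(1, 64, 36, 44))
```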

Note that the proposed model has no fully connected layers; it is therefore a fully convolutional neural network [16] that can generate an output image of the same size as the input.

2.4. Multi-Scale Loss
The loss of each sub-network is computed separately using the MSE criterion between the sub-network output and the ground-truth (GT) image. The total loss is then defined as follows:

$$L = \sum_{k=1}^{K} \frac{1}{w_k h_k} \left\| R_k - G_k \right\|_2^2$$

where $R_k$, $G_k$, and $K$ denote the output of the $k$-th sub-network, the GT image at the $k$-th scale, and the number of sub-networks (scales), respectively. The loss at each scale is normalized by the width $w_k$ and height $h_k$ of the input image at that scale. Note that the loss of the $k$-th sub-network is back-propagated through the network and accumulated together with the loss of the $(k-1)$-th scale in the proposed architecture.
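A possible implementation of this loss, assuming PyTorch and bilinear downscaling of the GT image to each sub-network's resolution (the paper does not state the resampling method):

```python
import torch
import torch.nn.functional as F

def multiscale_loss(outputs, gt):
    """Total loss of Section 2.4: per-scale MSE between each sub-network
    output R_k and the GT image resized to that scale. Taking the mean over
    pixels implements the 1/(w_k * h_k) normalization."""
    total = 0.0
    for r_k in outputs:  # [R_1, ..., R_K], coarse to fine
        g_k = F.interpolate(gt, size=r_k.shape[-2:],
                            mode='bilinear', align_corners=False)
        total = total + F.mse_loss(r_k, g_k)
    return total

# e.g. total = multiscale_loss([r1, r2], gt) for the K = 2 model
```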

2.5. Training
For the training dataset, we use 28 high-definition sequences in YCbCr 4:2:0 color format from Xiph.org Video Test Media [17]. Due to limited GPU memory, we crop the sequences to 176×144 size. For each cropped sequence, the input training images are captured right before the loop filter during HEVC encoding (using the HM 16.7 software [2]), and the original sequences are used as the GT images. The network is implemented in the Caffe [18] framework. We run 3×10^5 training iterations with the ADAM [19] optimizer. The learning rate is adaptively adjusted from 1×10^-3 down to 5×10^-6 until the loss converges.
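A rough PyTorch transcription of this optimization setup (the paper uses Caffe; the exponential decay schedule and the stand-in model below are assumptions, as only the endpoints 1×10^-3 and 5×10^-6 and the 3×10^5 ADAM iterations come from the text):

```python
import torch

# Stand-in model: a single conv layer in place of the full MMS-net, just to
# keep the snippet self-contained.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999983)
# 1e-3 * 0.999983**300000 is roughly 6e-6, near the reported final rate.

for step in range(300_000):
    x = torch.randn(8, 1, 144, 176)   # stand-in for pre-loop-filter patches
    gt = torch.randn(8, 1, 144, 176)  # stand-in for the original (GT) frames
    loss = torch.nn.functional.mse_loss(model(x), gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```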

3. Experimental Results

3.1. Test Conditions
In the experiments, tests are conducted on the test sequences of the JCT-VC common test conditions [20]. Our model is trained only on the Y channel of the frames, but we also apply the model to restore the U and V channels. The encoder configuration is set to the "All Intra-Main" configuration. Four different models are trained on training images generated by the HEVC encoder with the corresponding QP settings of QP22, QP27, QP32, and QP37.

3.2. Network Ablation Study
To study the effect of each component of the model on performance, we conduct several controlled experiments with variable components. In this experiment, our method is evaluated on all sequences in Class D [20].

First, we compare single-scale and multi-scale networks. Table 1 shows that the multi-scale network improves performance by 0.08-0.14 dB. In addition, the supplementary coding parameters CU and TU also improve the PSNR of the output image. Concatenating the CP matrices along the channel axis of the input image is useful for the 5-residual-block network but worse for the 15-residual-block network. However, with the CP adaptation network (pre-network), a gain of 0.08 dB is obtained for both the 5- and 15-residual-block networks. We can observe that the CP information helps localize the compression artifacts, but it requires the preprocessing network to correlate the different modalities. The 5-residual-block network with CPs performs slightly better than the 15-residual-block network without CPs, so the computation can be reduced to one third by sacrificing a small performance gain.
[Table 1: Ablation results (PSNR and number of model parameters)]
Table 1 also includes the PSNR and the number of model parameters of previous works [1, 7]. Although VDSR has more parameters than VRCNN, it fails to find a better optimum and performs worse than VRCNN. Our method, which utilizes the residual network structure and batch normalization layers, can alleviate the problem of converging to a poor local minimum.

3.3. Comparison with State-of-the-Art Methods
We measure the BD-rate of MMS-net against the existing state-of-the-art network-based methods. For a fair comparison, we trained VDSR [7] and VRCNN [1] on our training dataset until these models converged. Due to GPU memory limitations, we tested the models only on the Class C and Class D sequences. As shown in Table 2, MMS-net (M = 15) obtains superior performance, reducing the BD-rate of the Y channel by 8.5% on average relative to the HEVC reference. Regarding model generalization, it is worth noting that the model was trained on 176×144 images but works equally well on images of larger resolution. In addition, the model was trained only on the Y channel, yet it further reduces the BD-rate on the U and V channels as well.
[Table 2: BD-rate comparison with VDSR, VRCNN, and the HEVC reference]
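For reference, BD-rate figures like these are computed with the standard Bjøntegaard metric over the four QP rate points; a common numpy sketch of that metric (not the authors' script):

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta rate between two RD curves, each given as four
    (bitrate, PSNR) points (e.g. QP 22/27/32/37): fit a cubic of log-rate
    vs. PSNR and average the horizontal gap over the overlapping PSNR range.
    Negative values mean the test codec saves bits at equal quality."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_diff = (np.polyval(int_test, hi) - np.polyval(int_test, lo)
                - np.polyval(int_ref, hi) + np.polyval(int_ref, lo)) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0  # percent rate change

# Example call with made-up RD points (rate in kbps, PSNR in dB):
# bd_rate([1000, 1600, 2500, 4000], [32.1, 34.0, 35.8, 37.5],
#         [ 950, 1500, 2350, 3800], [32.3, 34.2, 36.0, 37.7])
```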
For the subjective quality comparison, some example images are shown in the figure below. The results show that MMS-net effectively deblocks the images while preserving the main content and sharp edges.
[Figure: example images for the subjective quality comparison]
4. Conclusion
In this paper, we proposed a novel multi-modal/multi-scale convolutional neural network (MMS-net) architecture to replace the existing HEVC loop filters. By utilizing the CTU information of the compressed bitstream, MMS-net successfully removes blocking artifacts. Furthermore, the coarse-to-fine approach of the multi-scale network proved beneficial for restoring quantized images. In future work, we will explore how to use other coding information, such as QP, motion vectors, and intra prediction modes, for image restoration.
