End-to-End Video Coding: DVC

This article is based on the CVPR 2019 paper "DVC: An End-to-end Deep Video Compression Framework".

Official open source code address: https://github.com/GuoLusjtu/DVC

DVC is an end-to-end video coding model. Earlier DNN-based video coding methods typically used a DNN to replace only a single module of the coding pipeline, so the overall framework could not be trained end-to-end.

DVC replaces all modules of the traditional block-based coding framework with neural networks. Figure 1(a) shows the traditional video coding framework, and Figure 1(b) shows the DVC framework.

Figure 1 The traditional video coding framework (a) and the DVC framework (b)

symbol definition

Let $\mathcal{V} = \{x_1, x_2, \ldots, x_t, \ldots\}$ denote the video sequence, where $x_t$ is the $t$-th frame, $\bar{x}_t$ is the corresponding predicted frame, and $\hat{x}_t$ is the reconstructed/decoded frame. $r_t = x_t - \bar{x}_t$ denotes the residual, and $\hat{r}_t$ is the reconstructed/decoded value of the residual. $v_t$ is the motion vector and $\hat{v}_t$ is the corresponding reconstruction. Since transformation and quantization are also performed during encoding, the transform of $r_t$ is denoted $y_t$, and the transform of $v_t$ is denoted $m_t$.

DVC architecture

Motion Estimation and Compression

A CNN is used for optical flow estimation, and the result is taken as the motion information $v_t$. The motion information is then compressed by the MV encoder-decoder network; these components are the Optical Flow Net, MV Encoder Net, and MV Decoder Net in Figure 1(b).

motion compensation

Based on the optical flow obtained above, the motion compensation network computes the predicted frame $\bar{x}_t$.

Transformation, Quantization and Inverse Transformation

Different from the traditional DCT and DST transforms, a residual encoder-decoder network is used here as a nonlinear transform. The nonlinear transform of the residual $r_t$ is $y_t$, and $y_t$ is quantized to $\hat{y}_t$. The reconstructed residual $\hat{r}_t$ is then obtained from $\hat{y}_t$ through the residual decoder network.

entropy coding

The quantized motion information $\hat{m}_t$ and residual $\hat{y}_t$ must be encoded into a bitstream. To estimate the number of bits, the distributions of $\hat{m}_t$ and $\hat{y}_t$ are obtained with the Bit Rate Estimation Net.

frame reconstruction

The frame reconstruction process is the same as in traditional coding: the reconstructed frame is obtained by adding the reconstructed residual $\hat{r}_t$ to the predicted frame $\bar{x}_t$.
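The components above fit together into a single per-frame coding step. The following is a minimal, schematic Python sketch of that data flow; all module names (flow_net, mv_encoder, mv_decoder, motion_compensation, res_encoder, res_decoder) and the quantize helper are illustrative placeholders, not names from the paper or the official code.

```python
def code_frame(x_t, x_hat_prev, nets, quantize):
    """One DVC coding step (schematic): motion estimation, MV coding,
    motion compensation, residual coding, frame reconstruction."""
    v_t = nets.flow_net(x_hat_prev, x_t)                 # motion estimation (optical flow)
    m_t = nets.mv_encoder(v_t)                           # MV analysis transform
    m_hat = quantize(m_t)                                # quantized MV latents (entropy coded)
    v_hat = nets.mv_decoder(m_hat)                       # reconstructed motion
    x_bar = nets.motion_compensation(x_hat_prev, v_hat)  # predicted frame
    r_t = x_t - x_bar                                    # residual
    y_t = nets.res_encoder(r_t)                          # residual analysis transform
    y_hat = quantize(y_t)                                # quantized residual latents (entropy coded)
    r_hat = nets.res_decoder(y_hat)                      # reconstructed residual
    x_hat = x_bar + r_hat                                # reconstructed frame
    return x_hat, m_hat, y_hat
```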

MV codec network

Figure 2 MV codec network

Figure 2 shows the MV encoder-decoder network. Conv(3,128,2) denotes a convolution with a 3x3 kernel, 128 output channels, and stride 2. GDN/IGDN are nonlinear transform functions.

If the input optical flow $v_t$ has size M x N x 2, the output $m_t$ of the MV encoder network has size M/16 x N/16 x 128, and $m_t$ is quantized to $\hat{m}_t$. The MV decoder network decodes $\hat{m}_t$ into $\hat{v}_t$. In addition, $\hat{m}_t$ is also used in the entropy coding process.
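As a rough illustration of the Conv(3,128,2) + GDN structure described above, here is a PyTorch-style sketch of the MV encoder (the official implementation is in TensorFlow). SimpleGDN is a simplified stand-in for the GDN of Ballé et al.; the decoder mirrors this structure with transposed convolutions and IGDN, ending in 2 output channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGDN(nn.Module):
    """Simplified GDN: x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2).
    The real layer constrains beta/gamma to stay positive; omitted here for brevity."""
    def __init__(self, channels, inverse=False):
        super().__init__()
        self.inverse = inverse
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):
        # the channel-mixing sum is a 1x1 convolution over x^2
        weight = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = torch.sqrt(F.conv2d(x * x, weight, self.beta).clamp(min=1e-9))
        return x * norm if self.inverse else x / norm

class MVEncoder(nn.Module):
    """Four Conv(3,128,2) stages with GDN in between: an M x N x 2 flow map
    is reduced to M/16 x N/16 x 128."""
    def __init__(self, channels=128):
        super().__init__()
        layers, in_ch = [], 2          # optical flow has two channels (dx, dy)
        for i in range(4):
            layers.append(nn.Conv2d(in_ch, channels, kernel_size=3, stride=2, padding=1))
            if i < 3:                  # GDN between convolution stages
                layers.append(SimpleGDN(channels))
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, v_t):
        return self.net(v_t)           # m_t
```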

motion compensation network

Given the reconstructed frame $\hat{x}_{t-1}$ of the previous frame and $\hat{v}_t$, the motion compensation network generates the predicted frame $\bar{x}_t$ of the current frame, as shown in Figure 3.

Figure 3 Motion Compensation Network

Motion compensation here is performed at the pixel level, so it provides more accurate temporal information and avoids the blocking artifacts of traditional block-based motion compensation; as a result, no loop filtering is needed. Please refer to the paper for the network details.
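The core of pixel-level motion compensation is a backward warp of the previous reconstructed frame with the decoded flow; in the paper the warped frame is then refined by a CNN (together with the previous reconstruction and the flow) to produce $\bar{x}_t$. Below is a minimal warping sketch using PyTorch's grid_sample; the flow channel order (dx, dy) in pixel units is an assumption.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp the previous reconstructed frame (B, C, H, W) with optical flow (B, 2, H, W)."""
    _, _, h, w = frame.shape
    # base sampling grid in pixel coordinates (x first, then y)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)    # (2, H, W)
    coords = base.unsqueeze(0) + flow                               # shifted sample positions
    # normalize to [-1, 1], the coordinate convention expected by grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```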

Residual codec network

The residual information is encoded by the residual encoder-decoder network in Figure 1. This network is highly nonlinear and, compared with the traditional DCT, can exploit the power of nonlinear transforms more fully.

training strategy

loss function

 \begin{equation*} \lambda D+R=\lambda d\left( x_{t} ,\widehat{x_{t}}\right) +\left( H\left(\widehat{m_{t}}\right) +H\left(\widehat{y_{t}}\right)\right) \ \ ( 1) \end{equation*}

The goal of training is to reduce the bit rate while reducing the distortion. The function d(·) measures distortion (MSE is used here), and H(·) denotes the estimated bit rate. As shown in Figure 1, the reconstructed frame, the original frame, and the estimated bit rates are all fed into the loss function.
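A minimal sketch of this rate-distortion objective follows; normalizing the estimated bits to bits per pixel is an assumption here, not something stated in eq. (1).

```python
import torch.nn.functional as F

def rd_loss(x_t, x_hat, bits_mv, bits_res, lmbda, num_pixels):
    """lambda * D + R as in eq. (1): MSE distortion plus estimated bits H(m_hat) + H(y_hat)."""
    distortion = F.mse_loss(x_hat, x_t)              # d(x_t, x_hat_t) with MSE
    rate = (bits_mv + bits_res) / num_pixels         # estimated rate in bits per pixel (assumed)
    return lmbda * distortion + rate
```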

Quantization

Both the residual and the motion vector need to be quantized before entropy coding can be performed, but quantization itself is not differentiable, so during the training phase the paper replaces quantization with additive uniform noise.

  \begin{equation*} \widehat{y_{t}} =y_{t} +\alpha \ \ ( 2) \end{equation*}

where $\alpha$ is the uniform noise.

In the inference stage, the rounding operation is used directly:

  \begin{equation*} \widehat{y_{t}} =round( y_{t}) \ \ ( 3) \end{equation*}
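The two behaviours can be combined in one helper; the noise range [-0.5, 0.5] (matching a rounding bin of width 1) is the usual choice but is an assumption here.

```python
import torch

def quantize(y, training):
    """Training: additive uniform noise as a differentiable proxy (eq. 2).
    Inference: hard rounding (eq. 3)."""
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)
```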

bit rate estimation

In order to balance rate and distortion, the bit rates of the residual and the motion vector must be estimated during encoding. Estimating the bit rate requires the entropy of the data, i.e., its probability distribution, which the paper obtains with a CNN (the Bit Rate Estimation Net).
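In practice, the bit estimate is the negative log-likelihood of the quantized symbols under the learned distribution. The sketch below assumes a hypothetical callable cdf that returns the model's cumulative distribution, so the probability of an integer bin is CDF(v + 0.5) - CDF(v - 0.5); this mirrors common learned-entropy-model practice rather than the paper's exact network.

```python
import torch

def estimate_bits(y_hat, cdf):
    """Estimate the bit cost of quantized latents from a learned probability model."""
    prob = cdf(y_hat + 0.5) - cdf(y_hat - 0.5)   # probability mass of each integer bin
    prob = torch.clamp(prob, min=1e-9)           # numerical safety
    return torch.sum(-torch.log2(prob))          # total estimated bits
```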

Cache History Frames

Motion estimation and motion compensation require a reference frame, and the reference frame is the reconstruction of the previous frame produced by the network itself: frame t needs the reconstruction of frame t-1, frame t-1 needs the reconstruction of frame t-2, and so on. Keeping the entire chain in GPU memory is impossible when t is large. The paper therefore proposes an online update strategy, updating one frame per iteration.
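One way to read this strategy is sketched below: only the most recent reconstruction is cached as the reference, and it is detached from the graph so activations and gradients from earlier frames do not have to be stored. The model interface and the reuse of the rd_loss helper from the loss-function section are assumptions for illustration, not the paper's exact procedure.

```python
def train_step(x_t, x_hat_prev, model, quantize, optimizer, lmbda):
    """One online-update iteration: code the current frame against the cached reference."""
    x_hat, bits_mv, bits_res = model(x_t, x_hat_prev, quantize)   # assumed model outputs
    num_pixels = x_t.shape[-2] * x_t.shape[-1]
    loss = rd_loss(x_t, x_hat, bits_mv, bits_res, lmbda, num_pixels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return x_hat.detach()      # cache only this reconstruction as the next frame's reference
```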

experiment settings

Dataset:

The paper uses the Vimeo-90K dataset for training, which contains 89,800 video sequences. Validation is performed on the UVG dataset and the HEVC standard test sequences.

Evaluation metrics:

PSNR and MS-SSIM are used to evaluate distortion, and bpp (bits per pixel) is used to measure the bit rate.

Implementation details:

Four models were trained with four values of lambda (256, 512, 1024, 2048). Each model was trained with the Adam optimizer; the initial learning rate was 0.0001, beta1 was 0.9, and beta2 was 0.999. When the loss plateaus, the learning rate is divided by 10. The mini-batch size is 4 and the training crops are 256x256. Training used the TensorFlow framework and took 7 days on two Titan X GPUs.
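A PyTorch-style sketch of this schedule (the paper itself used TensorFlow; `model` is assumed to be the assembled DVC network):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# divide the learning rate by 10 when the loss stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
# after each validation pass: scheduler.step(validation_loss)
```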

Experimental results

Figure 4 Part of the experimental results

Figure 4 shows part of the experimental results. The paper's method outperforms H.264 in both PSNR and MS-SSIM on most datasets, and its MS-SSIM performance is comparable to that of H.265. The paper uses MSE as the distortion measure during training; quality would improve further if MS-SSIM were used instead.

Summary

The paper replaces each part of the traditional video coding framework with a DNN model to achieve end-to-end coding, so the whole pipeline can be trained jointly. For details of each module, please refer to the paper and the open source implementation: https://github.com/GuoLusjtu/DVC
