DVC: An End-to-end Deep Video Compression Framework(CVPR 2019) - Video Compression Paper Reading

Learning based optical flow estimation is utilized to obtain the motion information and reconstruct the current frames. Then employ two auto-encoder style neural networks to compress the corresponding motion and residual information.

Traditional video compression framework将输入帧\(x_t\)分为一些相同大小的块(如\(8 \times 8\)), 算法在编码器端的编码过程如下:

运动估计: 估计当前帧\(x_t\)和上一帧的重构\(\hat{x}_{t-1}\)之间的运动, 得到每一个块对应的运动向量(motion vector, MV) \(v_t\)
运动补偿: 根据1中得到的运动向量\(v_t\), 将上一帧的重构的相应像素移动得到预测帧\(\overline{x}_{t}\), 原始帧\(x_t\)和预测帧\(\overline{x}_{t}\)之间的残差\(r_t=x_t-\overline{x}_{t}\)
变换(Transform)和数字化(Quantization): 将2中得到的残差\(r_t\)数字化为\(\hat{y}_{t}\), 在数字化时, 用一个线性变换(如DCT)来得到更好的压缩性能
逆变换: 3中的数字化结果\(\hat{y}_{t}\)通过逆变换得到重构残差\(\hat{r}_{t}\)
熵编码: 将1中的运动向量\(v_t\)和3中的数字化结果\(\hat{y}_{t}\)都通过熵编码方法编码为二进制, 并送给解码器
帧重构: 将2中的预测帧\(\overline{x}_{t}\)和4中的重构残差\(\hat{r}_{t}\)相加得到重构帧\(\hat{x}_{t}=\hat{r}_{t}+\overline{x}_{t}\), 重构帧会在\(t+1\)帧的第1步中用到

而对解码器来说, 基于在5中编码器输出的比特流, 在2中进行运动补偿, 在4中进行逆数字化, 最后在6中进行帧重构就得到了重构帧\(\hat{x}_{t}\)

Traditional Video Compression Framework:

Deep learning based end-to-end video compression framework和traditional video compression framework是一一对应的:

N1. 运动估计和压缩: 用一个CNN来估计光流, 将其视为运动信息\(v_t\). 并且并不对原始的光流值进行编码, 而是用一个MV Encoder-Decoder来对光流值进行压缩和解码, 其中数字化的运动特征表示为\(\hat{m}_{t}\), 然后再将\(\hat{m}_{t}\)通过MV Decoder进行解码得到重构运动信息\(\hat{v}_{t}\). 细节在MV Encoder and Decoder Network

N2. 运动补偿: 通过一个运动补偿网络, 输入N1中得到的光流, 输出预测帧\(\overline{x}_{t}\). 细节在下面

N3-N4. 变换, 数字化和逆变换: 不用3中的线性变换, 而是用一个高度非线性的残差Encoder-Decoder网络, 将残差\(r_t\)非线性的映射到表示\(y_t\). 然后将\(y_t\)数字化得到\(\hat{y}_t\), \(\hat{y}_t\)喂到残差解码网络中得到重构残差\(\hat{r}_t\). 数字化方法用的是End-to-end optimized image compression中的方法. 细节在下面

N5. 熵编码: 在测试时, 将N1中得到的数字化的运动表示\(\hat{m}_t\)和N3中得到的残差表示\(\hat{y}_t\)编码为比特流, 并送给解码器. 在训练时, 为了估计我们方法中的比特成本, 用CNN(下图中的bit rate estimation net)来获得每个符号在\(\hat{m}_t\)和\(\hat{y}_t\)中的概率分布. 细节在下面

N6. 帧重构: 和6相同

End-to-end video compression framework:

MV Encoder and Decoder Network:

用一个auto-encoder style network来压缩光流, End-to-end optimized image compression提出的首先用于图像压缩任务的方法

MV compression network:

Motion Compensation Network:

输入上一帧的重构\(\hat{x}_{t-1}\)和运动向量\(\hat{v}_t\), 输出预测帧\(\overline{x}_t\), 并且希望它和当前帧\(x_t\)尽可能接近

将上一帧的重构\(\hat{x}_{t-1}\)基于运动信息\(\hat{v}_t\)扭曲(warp)得到当前帧. 但是这样得到的帧仍然有瑕疵
为了去掉瑕疵, 联合扭曲后的帧\(w(\hat{x}_{t-1})\), 参考帧\(\hat{x}_{t-1}\)和运动向量\(\hat{v}_t\)来作为输入，喂到另一个CNN来得到调整后的预测帧\(\overline{x}_t\)

这是种像素级的运动补偿方法

Residual Encoder and Decoder Network:

原始帧\(x_t\)和预测帧\(\overline{x}_t\)之间的残差信息\(r_t\)通过residual encoder network编码

这篇论文用了Variational image compression with a scale hyperprior中的highly non-linear neural network来将残差变换成对应的潜变量表示

Training Strategy:

Goal: 1. minimize the number of bits used for encoding the video 2. reduce the distortion between the original input frame \(x_t\) and the reconstructed frame \(\hat{x}_t\)

Rate-distortion optimization problem:

\[\lambda D+R=\lambda d(x_t,\hat{x}_t)+(H(\hat{m}_t)+H(\hat{y}_t)) \]

\(d(x_t,\hat{x}_t)\)表示\(x_t\)和\(\hat{x}_t\)之间的失真, 实现中用MSE

\(H(.)\)表示编码特征用的位数, residual representation \(\hat{y}_t\)和motion representation \(\hat{m}_t\)都需要编码成比特流

\(\lambda\)是拉格朗日乘子

Quantization:

数字化操作不可微怎么办?

用End-to-end optimized image compression中的方法, 在训练阶段加入均匀噪声代替数字化运算(Q: ???)

比如\(y_t\)的数字化，训练时的数字化表示\(\hat{y}_t\)用加均匀噪声来近似，即\(\hat{y}_t=y_t+\eta\)，然后在推理阶段，直接用舍入操作，即\(\hat{y}_t=round(y_t)\)

Bit Rate Estimation:

\(H(\hat{m}_t)\)和\(H(\hat{y}_t)\)怎么求？

比特率的正确度量是对应潜变量表示的熵，因此用Variational image compression with a scale hyperprior中的CNN来估计\(\hat{m}_t\)和\(\hat{y}_t\)的概率分布

Buffering Previous Frames:

每次迭代的reconstructed frame会被保存在buffer中以便后面一帧拿来用

Code: https://github.com/GuoLusjtu/DVC

DVC: An End-to-end Deep Video Compression Framework(CVPR 2019) - Video Compression Paper Reading

猜你喜欢