Scale-space flow for end-to-end optimized video compression(CVPR 2020) - Video Compression Paper Reading

展示了一个能更好地处理常见错误情况的广义翘曲算子(generalized warping operator) (Q: 这是什么东东?)。
提出尺度空间流，这是光流的直观概括，它添加了尺度参数以允许网络更好地对不确定性进行建模

双线性翘曲(bilinear warping)通过2通道位移场\((f_x,f_y)\)来直接从2D源图像采样
这篇论文对双线性翘曲做了扩展，通过3通道位移+标度场(displacements+scale field)\((g_x,g_y,g_z)\)来对三维标度空间体积进行三线性采样
(Q: 位移场怎么定义怎么求? 标度场怎么定义怎么求? 三线性采样具体怎么采样?)

Bilinear Warping:
Given an image \(x\) shape \(H \times W\) and a flow field \(f=(f_x,f_y)\)

\[x':=Bilinear-Warp(x,f) \\ s.t. x'[x,y]=x[x+f_x[x,y],y+f_y[x,y]] \]

We refer to the flow channels \(f_x,f_y \in \mathbb{R}^{H \times W}\) as the x- and y- displacement fields of the flow \(f\).

Construct a fixed-resolution scale-space volume \(X=[x,x*G(\sigma_0),x*G(2\sigma_0),...,x*G(2^{M-1} \sigma_0)]\)
\(x*G(\sigma)\)表示\(x\)经过尺度为\(\sigma\)的Gaussian kernel卷积后得到的结果
\(X\)为一堆逐渐模糊的\(x\)的叠加, 尺寸为\(H \times W \times (M+1)\)
通过三线性插值(trilinear interpolation)，可以在连续坐标\((x,y,z)\)下采样

Scale-space Warping:
Define a scale-space flow field as a 3-channel field \(g:=(g_x,g_y,g_z)\)
scale-space warp of image \(x\) is:

\[x':=Scale-Space-Warp(x,g) \\ s.t. x'[x,y]=X[x+g_x[x,y],y+g_y[x,y],g_z[x,y]] \]

第三维\(g_z\)称为scale-space flow \(g\)的scale field.

We set \(M=5\) in all of our experiments.

Trilinear interpolation:
In-between two levels \(i<=z<i+1\), with corresponding Gaussian kernel sizes \(\sigma_a\) and \(\sigma_b\)

\[\sigma=\sqrt{(z-i)\sigma_a^2 + (1-z+i)\sigma_b^2} \]

So when given a desired effective kernel size \(0<\sigma<2^{M-1}\sigma_0\), we can easily solve for the corresponding value:

\[z=i+\frac{\sigma_b^2 -\sigma^2}{\sigma_b^2 - \sigma_a^2} \]

Compression Model:
Given a sequence of frames \(x_0,...,x_N\)

将第一帧(I)编码为潜变量\(z_0\), \(z_0\)量化为一个整数\([z_0]\), 再得到重构\(\hat{x}_0\)
对于当前给定的帧(P), 用单个神经网络来联合估计和编码量化the quantized scale-space warp latents \([w_i]\), 从中解码得到scale-space flow \(g_i\)
对前面的重构\(\hat{x}_{i-1}\)进行scale-space warp, 获得当前帧的估计\(\overline{x}_i\)
但是\(\overline{x}_i\)是不完美的，另一个分支会对残差\(r_i=x_i-\overline{x}_i\)进行编码, 得到潜变量\([v_i]\)，并解码得到\(\hat{r}_i\), 最终的重建为\(\hat{x}_i=\overline{x}_i+\hat{r}_i\)

总共有三个潜变量:

image latent \(z_0\)
motion compensation latents \(w_i\)
residual latents \(v_i\)

对每个潜变量, 用一个单独的超先验(hyperprior)来建模相应的密度

Optimize the whole system for the total rate-distortion loss unrolled over N frames:

\[\sum_{i=1}^{N-1} d(x_i,\hat{x}_i)+\lambda [H(z_0)+\sum_{i=1}^{N-1}H(v_i)+H(w_i)] \]

\(H(.)\)表示各个潜变量的熵估计(entropy estimate)，包括超先验提取的边缘信息，\(d\)表示失真度量如MSE或MS-SSIM.

Overview of this end-to-end optimized low-latency compression system:

Scale-space flow for end-to-end optimized video compression(CVPR 2020) - Video Compression Paper Reading

猜你喜欢