Paper notes (video): Video Compression through Image Interpolation

2. Related Work

Image Compression

Progressively encode the image using a recurrent neural network
Allow for variable compression rates with a single model
Use fully convolutional networks to handle arbitrary image sizes
Bottleneck: contains spatially redundant activations
Entropy coding further compresses this redundant information
Learning the binary representation is inherently non-differentiable

Stochastic binarization, backpropagating the derivative of the expectation
Soft assignment to approximate quantization
Replace the quantization by adding uniform noise
All of these allow gradients to flow through the discretization
This paper uses stochastic binarization

Video Compression

Two simple ideas:

Decompose each frame into blocks of pixels, known as macroblocks

Divide frames into image (I) frames and referencing (P or B) frames

I-frames directly compress video frames using image compression.

Most of the savings in video codecs come from referencing frames.

P-frames borrow color values from preceding frames. They store a motion estimate and a highly compressible difference image for each macroblock.

B-frames additionally allow bidirectional referencing, as long as there are no circular references.

A hierarchical way

I-frames form the top of the hierarchy.

In each consecutive level, P- or B-frames reference decoded frames at higher levels.

The authors observe that referencing (P or B) frames are a special case of image interpolation.

Image interpolation and extrapolation

Image interpolation
hallucinate an unseen frame between two reference frames; most methods build on an encoder-decoder network architecture to move pixels through time

a spatially-varying convolution kernel
a flow field
combine two predictions
Image extrapolation
predicts a future video from a few frames, or a still image
The authors extend image interpolation and incorporate a few compressible bits of side information to reconstruct the original video.

3. Preliminary

$I^{(t)}\in \mathbb{R}^{W\times H\times 3}$: a series of frames for $t\in\{0,1,\dots\}$

Goal: compress each frame $I^{(t)}$ into a binary code $b^{(t)}\in\{0,1\}^{N_t}$

An encoder E and a decoder D have two competing aims:

Minimize the total bitrate $\sum_{t}N_t$

Reconstruct the original video as faithfully as possible, measured by $l(\hat{I},I) = ||\hat{I} - I||_1$

Image compression

The simplest encoders and decoders process each image independently.

$E_I: I^{(t)}\to b^{(t)}$

$D_I: b^{(t)}\to \hat I^{(t)}$

Model of Toderici et al.

encodes and reconstructs an image progressively over K iterations. At each iteration, the model encodes a residual $r_k$ between the previously coded image and the original frame:

$r_0:=I$

$b_k := E_I(r_{k-1},g_{k-1}), \quad r_k := r_{k-1}-D_I(b_k,h_{k-1}), \quad \text{for } k = 1,2,\dots$

$g_k$ and $h_k$ are latent Conv-LSTM states that are updated at each iteration. All iterations share the same network architecture and parameters, forming a recurrent structure.

The training objective minimizes the distortion at all the steps: $\sum_{k=1}^{K}\|r_k\|_1$

The reconstruction is $\hat I_K = \sum_{k=1}^{K}D_I(b_k,h_{k-1})$, so the bitrate can be varied through the choice of K.
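The recursion above can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: `toy_encode` and `toy_decode` are hypothetical stand-ins for the learned Conv-LSTM networks $E_I$ and $D_I$ (a sign quantizer with a fixed, halving step size), and the latent states $g_k$, $h_k$ are dropped entirely.

```python
import numpy as np

def toy_encode(residual):
    """Hypothetical stand-in for E_I: keep only the sign of each value (1 bit each)."""
    return np.where(residual >= 0, 1, -1).astype(np.int8)

def toy_decode(bits, step):
    """Hypothetical stand-in for D_I: map each bit to +/- a per-iteration step size."""
    return bits.astype(np.float64) * step

def progressive_code(image, num_iters):
    """Run r_0 := I, b_k := E(r_{k-1}), r_k := r_{k-1} - D(b_k) for k = 1..K."""
    residual = image.astype(np.float64)        # r_0 := I
    reconstruction = np.zeros_like(residual)   # running sum of decoder outputs
    codes = []
    for k in range(1, num_iters + 1):
        step = 0.5 ** k                        # assumed fixed step schedule
        bits = toy_encode(residual)            # b_k
        decoded = toy_decode(bits, step)       # D(b_k)
        residual = residual - decoded          # r_k
        reconstruction += decoded              # partial reconstruction after k steps
        codes.append(bits)
    return codes, reconstruction, residual

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(4, 4, 3))   # toy frame with values in [-1, 1]
codes, recon, final_residual = progressive_code(image, num_iters=6)
```

With inputs in $[-1, 1]$, this toy quantizer leaves every residual entry bounded by the last step size, so stopping after more iterations gives a finer reconstruction at the cost of more bits. The learned model has no such closed-form guarantee; it is trained to reduce $\|r_k\|_1$ at every step.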

Both the encoder and the decoder consist of 4 Conv-LSTMs with stride 2.

Bottleneck: a binary feature map with L channels and 16 times smaller spatial resolution in both width and height.
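As a worked example of what this bottleneck costs, the number of bits per iteration is $(W/16)\cdot(H/16)\cdot L$. The sketch below just evaluates that formula; the frame size and the value of L are hypothetical numbers chosen for illustration, not settings reported in the paper.

```python
def bottleneck_bits(width, height, channels, iterations=1, downscale=16):
    """Bits emitted by a binary (H/16) x (W/16) x L bottleneck over some iterations."""
    if width % downscale or height % downscale:
        raise ValueError("width and height are assumed to be multiples of the downscale factor")
    per_iteration = (width // downscale) * (height // downscale) * channels
    return per_iteration * iterations

# Hypothetical example values: a 320x240 frame and L = 32 channels.
bits = bottleneck_bits(width=320, height=240, channels=32, iterations=1)
bits_per_pixel = bits / (320 * 240)
```

For these assumed values each iteration emits 20 × 15 × 32 = 9600 bits, i.e. L / 256 = 0.125 bits per pixel before entropy coding.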

Solution to the non-differentiable binary bottleneck: Toderici et al. use stochastic binarization
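A minimal sketch of stochastic binarization, assuming the pre-binarization activations lie in $[-1, 1]$ (for example after a tanh): each value x is mapped to +1 with probability $(1 + x)/2$ and to -1 otherwise, so the expected output equals x. The function names are hypothetical, and the backward pass is shown only as the straight-through identity that follows from differentiating that expectation.

```python
import numpy as np

def stochastic_binarize(x, rng):
    """Sample b in {-1, +1} with P(b = +1) = (1 + x) / 2, so that E[b] = x."""
    x = np.asarray(x, dtype=np.float64)
    if np.any(np.abs(x) > 1.0):
        raise ValueError("inputs are assumed to lie in [-1, 1], e.g. after tanh")
    prob_plus = (1.0 + x) / 2.0
    return np.where(rng.random(x.shape) < prob_plus, 1.0, -1.0)

def binarize_backward(grad_output):
    """Straight-through gradient: d E[b] / dx = 1, so gradients pass unchanged."""
    return grad_output

rng = np.random.default_rng(0)
activations = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
samples = np.stack([stochastic_binarize(activations, rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)   # should be close to `activations`
```

Averaged over many draws the binary codes reproduce the real-valued activations, which is what makes backpropagating through the expectation a reasonable surrogate for the non-differentiable sampling step.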

Video compression

Process I-frames using an image encoder $E_I$ and decoder $D_I$

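P-frames store a block motion estimate $T \in \mathbb{R}^{W\times H\times 2}$, similar to an optical flow field, and a residual image $R$ capturing the appearance changes not explained by motion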

The original color frame is then recovered by
$I_i^{(t)} = I_{i-T_i^{(t)}}^{(t-1)} + R_i^{(t)}$
for every pixel i in the image.
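The P-frame recovery rule can be illustrated with a small NumPy sketch. The assumptions here are mine: motion is stored per pixel as integer (row, col) offsets in an (H, W, 2) array, out-of-range source positions are clamped to the border, and the residual is kept uncompressed. Real codecs use block-wise motion and a lossy residual.

```python
import numpy as np

def motion_compensate(prev_frame, motion):
    """Return the frame whose pixel i is prev_frame[i - T_i] (integer T, border-clamped)."""
    height, width = prev_frame.shape[:2]
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    src_rows = np.clip(rows - motion[..., 0], 0, height - 1)
    src_cols = np.clip(cols - motion[..., 1], 0, width - 1)
    return prev_frame[src_rows, src_cols]

def encode_p_frame(prev_frame, cur_frame, motion):
    """Residual R = I^(t) - warp(I^(t-1), T): what motion alone cannot explain."""
    return cur_frame - motion_compensate(prev_frame, motion)

def decode_p_frame(prev_frame, motion, residual):
    """Recover I^(t)_i = I^(t-1)_{i - T_i} + R_i."""
    return motion_compensate(prev_frame, motion) + residual

rng = np.random.default_rng(0)
prev_frame = rng.integers(0, 256, size=(8, 8, 3)).astype(np.int16)
cur_frame = np.roll(prev_frame, shift=(1, 2), axis=(0, 1))   # content moves down 1, right 2
cur_frame[0, 0] += 5                                          # a change motion cannot explain
motion = np.zeros((8, 8, 2), dtype=np.int64)
motion[..., 0], motion[..., 1] = 1, 2                         # one shared (row, col) offset

residual = encode_p_frame(prev_frame, cur_frame, motion)
restored = decode_p_frame(prev_frame, motion, residual)
```

Away from the border the residual is all zeros, which is why the difference image is highly compressible when the motion estimate is good.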

The compression is uniquely defined by a block structure and motion estimate T.

The authors augment an image interpolation network with motion information and add a compressible bottleneck layer.

4. Video Compression through Interpolation

Choose every n-th frame as an I-frame.

The remaining n − 1 frames are interpolated; these are called R-frames.

Basic interpolation network $\to$ a hierarchical interpolation (reduce the bitrate)

4.1 Interpolation network

A context network $C: I \to \{f^{(1)}, f^{(2)}, \dots\}$ to extract a series of feature maps $f^{(l)}$ of various spatial resolutions

Let $f := \{f^{(1)}, f^{(2)}, \dots\}$ be the collection of all context features

use the upconvolutional feature maps of a U-net architecture with increasing spatial resolution

C and D are trained jointly.

Motion compensated interpolation

Tried both optical flow and block motion estimation.

Use the motion information to warp each context feature map: $\check f_i^{(l)} = f_{i-T_i}^{(l)}$ at every spatial location i.

scale the motion estimation with the resolution of the feature map

use bilinear interpolation for fractional locations

Drawback: only produces content seen in either reference image. Variations beyond motion, such as changes in lighting, deformation, occlusion, etc., are not captured by this model.
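The warping step $\check f_i^{(l)} = f_{i-T_i}^{(l)}$ with scaled motion and bilinear sampling can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the (h, w, 2) motion layout with (row, col) offsets, the `scale` argument, and the border clamping are all assumptions made for the example.

```python
import numpy as np

def warp_features(features, motion, scale):
    """Backward-warp an (h, w, c) feature map: out[i] = features[i - scale * T_i].

    `motion` is an (h, w, 2) array of (row, col) offsets measured in full-resolution
    pixels and already resampled to the feature map's grid; `scale` is the ratio
    feature resolution / image resolution (e.g. 0.5 for a 2x smaller map).
    """
    h, w = features.shape[:2]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_r = np.clip(rows - scale * motion[..., 0], 0, h - 1)
    src_c = np.clip(cols - scale * motion[..., 1], 0, w - 1)

    r0 = np.floor(src_r).astype(int)
    c0 = np.floor(src_c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    wr = (src_r - r0)[..., None]   # fractional part along rows
    wc = (src_c - c0)[..., None]   # fractional part along columns

    # Bilinear blend of the four neighbouring feature vectors.
    top = (1 - wc) * features[r0, c0] + wc * features[r0, c1]
    bottom = (1 - wc) * features[r1, c0] + wc * features[r1, c1]
    return (1 - wr) * top + wr * bottom

# Toy check: a horizontal ramp (value = column index) on a 2x smaller feature map.
features = np.tile(np.arange(6, dtype=np.float64)[None, :, None], (4, 1, 1))
motion = np.zeros((4, 6, 2))
motion[..., 1] = 1.0                      # 1 full-resolution pixel to the right
warped = warp_features(features, motion, scale=0.5)
```

A one-pixel motion at full resolution becomes a half-pixel shift on the 2× smaller map, which is where the bilinear weights come in. The output is always a blend of existing feature values, which matches the drawback noted above: nothing unseen in the reference can appear.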

Residual motion compensated interpolation

jointly train an encoder $E_R$, context model C and interpolation network $D_R$

The encoder sees the same information as the interpolation network, which allows it to compress just the missing information, and avoid a redundant encoding.

$r_0 := I$

$b_k := E_R(r_{k-1}, \check f_1,\check f_2,g_{k-1})$

$r_k := r_{k-1} - D_R(b_k,\check f_1,\check f_2,h_{k-1})$ for k = 1,2,…

Allows for learning a variable rate compression
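Unrolling the recursion shows where the variable rate comes from. As a sketch, write $d_k := D_R(b_k,\check f_1,\check f_2,h_{k-1})$ (a shorthand introduced here for brevity, not notation from the paper) and, by analogy with the image model, take the reconstruction after K iterations to be the sum of the decoder outputs:

$\hat I_K := \sum_{k=1}^{K} d_k$

$r_K = r_{K-1} - d_K = r_0 - \sum_{k=1}^{K} d_k = I - \hat I_K$

So the code can be truncated after any K: the codes $b_1,\dots,b_K$ decode to $\hat I_K$ with error $\|r_K\|_1$, and the choice of K trades bitrate against distortion.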

4.2 Hierarchical interpolation

maximizing the number of temporally close interpolations
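One simple way to get many temporally close interpolations is recursive midpoint splitting between two I-frames. The sketch below is an assumption made for illustration: these notes do not record the paper's exact group-of-pictures size or level layout, and `hierarchical_schedule` is a hypothetical helper.

```python
def hierarchical_schedule(left, right, level=1):
    """List (frame, left_ref, right_ref, level) tuples by recursive midpoint splitting.

    `left` and `right` are indices of already decoded frames; each call interpolates
    the frame halfway between them and then recurses into both halves.
    """
    if right - left < 2:
        return []
    mid = (left + right) // 2
    schedule = [(mid, left, right, level)]
    schedule += hierarchical_schedule(left, mid, level + 1)
    schedule += hierarchical_schedule(mid, right, level + 1)
    return schedule

# Hypothetical group of pictures: I-frames at indices 0 and 8, seven frames in between.
schedule = hierarchical_schedule(0, 8)
max_distance = {
    lvl: max(r - l for _, l, r, v in schedule if v == lvl) for lvl in (1, 2, 3)
}
```

In this example only one frame is interpolated across the full gap of 8 frames, two across a gap of 4, and four across a gap of 2, so most interpolations use nearby references. The list is also a valid decoding order, since every reference is either an I-frame or appears earlier in the schedule.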

4.3 Implementation

Architecture:

Toderici Image compression model
U-net context model

reduce the number of channels of all filters by half

remove the final output layer and take the feature maps at the resolutions that are 2×, 4×, 8× smaller than the original input image.

Conditional encoder and decoder

fuse the U-net features with the individual Conv-LSTM layers

Entropy coding

Motion compression