AV1 Fast Transform Block Partitioning Based on Machine Learning

The previous article, "AV1 Fast Transform Mode Selection Based on Machine Learning", explained how AV1 uses machine learning to select an appropriate transform mode for each transform block. This article explains how AV1 uses machine learning to decide how a transform block should be partitioned.

Unlike VP9, AV1 does not force a fixed transform unit size; instead, it allows a coding block to be recursively partitioned into transform units. Square sizes from 4×4 up to 64×64 are supported, and rectangular sizes with 2:1/1:2 and 4:1/1:4 aspect ratios are also available.
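For reference, the resulting set of transform sizes mirrors the TX_SIZE enum in libaom's av1/common/enums.h:

/* Transform sizes supported by AV1 (mirrors the TX_SIZE enum in
 * libaom's av1/common/enums.h): squares plus 2:1/1:2 and 4:1/1:4
 * rectangles. */
typedef enum {
  TX_4X4, TX_8X8, TX_16X16, TX_32X32, TX_64X64,  // squares
  TX_4X8, TX_8X4, TX_8X16, TX_16X8,              // 1:2 / 2:1
  TX_16X32, TX_32X16, TX_32X64, TX_64X32,        // 1:2 / 2:1
  TX_4X16, TX_16X4, TX_8X32, TX_32X8,            // 1:4 / 4:1
  TX_16X64, TX_64X16,                            // 1:4 / 4:1
  TX_SIZES_ALL,
} TX_SIZE;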

[Figure: examples of recursively partitioning a coding block into transform units]

As shown in the figure above, a coding block can be partitioned into smaller TUs (transform units) in many ways, and each sub-TU can be partitioned further. To find the best partitioning, the encoder must compute, for every TU, the RD cost of both splitting and not splitting it, and select the option with the smallest RD cost.
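Conceptually, the exhaustive search works like the sketch below. The helper names (rd_cost_of_tu, can_split, split_overhead_cost, num_sub_blocks, sub_block) are hypothetical; the real logic lives in libaom's transform search code.

double best_tx_partition_rd(TxBlock *blk) {
  const double no_split = rd_cost_of_tu(blk);  // cost as a single TU
  if (!can_split(blk)) return no_split;
  double split = split_overhead_cost(blk);  // signaling cost of the split
  for (int i = 0; i < num_sub_blocks(blk); ++i)
    split += best_tx_partition_rd(sub_block(blk, i));  // recurse into sub-TUs
  return no_split < split ? no_split : split;  // keep the cheaper option
}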

To accelerate this search with ML, the problem can be recast as a binary classification task. The goal is to build an ML model that predicts, for each transform block, whether it should be split.

The input to the ML model is a set of features computed from the pixels of the residual block. The features are means and standard deviations at two levels. The first level is the mean and standard deviation of the entire residual block. The second level is the mean and standard deviation of each 1/4 sub-block (for square blocks and 1:4 rectangular blocks) or each 1/2 sub-block (for 1:2 rectangular blocks). In addition, the standard deviation of the second-level means and the mean of the second-level standard deviations are computed. That makes 12 features for square blocks and 1:4 rectangular blocks, and 8 features for 1:2 rectangular blocks. (Note: the AV1 code actually uses 8 features for 1:4 rectangular blocks.)
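As a concrete illustration, here is a minimal sketch of this feature extraction for a square block (the function names and the quadrant layout are assumptions for this sketch; libaom's actual feature code differs in detail):

#include <math.h>
#include <stdint.h>

// Mean and standard deviation of a w x h region of the residual.
static void mean_stddev(const int16_t *res, int stride, int w, int h,
                        float *mean, float *sd) {
  double sum = 0.0, sum_sq = 0.0;
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const double v = res[r * stride + c];
      sum += v;
      sum_sq += v * v;
    }
  }
  const double n = (double)w * h;
  const double m = sum / n;
  const double var = sum_sq / n - m * m;
  *mean = (float)m;
  *sd = (float)sqrt(var > 0.0 ? var : 0.0);
}

// Mean and standard deviation of a small float array (used for the
// cross-level statistics).
static void mean_stddev_f(const float *v, int n, float *mean, float *sd) {
  double sum = 0.0, sum_sq = 0.0;
  for (int i = 0; i < n; ++i) {
    sum += v[i];
    sum_sq += (double)v[i] * v[i];
  }
  const double m = sum / n;
  const double var = sum_sq / n - m * m;
  *mean = (float)m;
  *sd = (float)sqrt(var > 0.0 ? var : 0.0);
}

// 12 features for a bs x bs square residual block: whole-block mean/std,
// mean/std of each quadrant, plus the std of the quadrant means and the
// mean of the quadrant stds.
void tx_split_features_square(const int16_t *res, int stride, int bs,
                              float feat[12]) {
  float means[4], sds[4], tmp;
  mean_stddev(res, stride, bs, bs, &feat[0], &feat[1]);  // level 1
  const int hb = bs / 2;
  for (int i = 0; i < 4; ++i) {  // level 2: the four quadrants
    const int16_t *sub = res + (i / 2) * hb * stride + (i % 2) * hb;
    mean_stddev(sub, stride, hb, hb, &means[i], &sds[i]);
    feat[2 + 2 * i] = means[i];
    feat[3 + 2 * i] = sds[i];
  }
  mean_stddev_f(means, 4, &tmp, &feat[10]);  // std of the quadrant means
  mean_stddev_f(sds, 4, &feat[11], &tmp);    // mean of the quadrant stds
}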

To make the predictions more accurate, a separate model is built for each block size. All models share the structure shown in the figure below:

[Figure: common structure of the transform-split prediction models]

Each model is implemented as a fully connected neural network, with ReLU as the activation function in the hidden layer.
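A network of this shape reduces to two matrix-vector products. Below is a minimal sketch of the inference (the function name and the weight layout, one hidden unit per row, are assumptions for this sketch; libaom performs the equivalent computation in av1_nn_predict):

// Minimal forward pass for a one-hidden-layer fully connected network
// with ReLU activation.
float tx_split_nn_forward(const float *feat, int n_in,
                          const float *w0, const float *b0, int n_hidden,
                          const float *w1, float b1) {
  float hidden[64];  // assumes n_hidden <= 64
  for (int j = 0; j < n_hidden; ++j) {
    float acc = b0[j];
    for (int i = 0; i < n_in; ++i) acc += w0[j * n_in + i] * feat[i];
    hidden[j] = acc > 0.0f ? acc : 0.0f;  // ReLU
  }
  float out = b1;
  for (int j = 0; j < n_hidden; ++j) out += w1[j] * hidden[j];
  return out;  // raw score; the threshold decides split vs. no split
}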

[Figure: how the model is used during encoding]

The model is used as shown in the figure above. Features are extracted from each prediction residual block and fed to the model. The partition mode is then chosen from the model's output: if the model judges that splitting is unlikely to be optimal, the block is not split; if it judges that splitting is likely to be optimal, the block is split. A predefined threshold controls the final decision (in AV1 the default threshold is 4000).
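Putting the pieces together, the decision step might look like the following sketch (the function name is illustrative, and the x10000 score scaling is an assumption chosen to match the integer threshold quoted above; libaom's exact handling may differ):

// Illustrative decision step: extract features, run the network, and
// compare the scaled score against the tuning threshold.
int predict_tx_split(const int16_t *residual, int stride, int bs,
                     const float *w0, const float *b0, int n_hidden,
                     const float *w1, float b1, int split_thresh) {
  float feat[12];
  tx_split_features_square(residual, stride, bs, feat);
  const float score =
      tx_split_nn_forward(feat, 12, w0, b0, n_hidden, w1, b1);
  // A larger score means more confidence that splitting wins.
  return (int)(score * 10000) > split_thresh;
}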

Specific model structures

Two concrete model definitions from the AV1 code are shown below:

1. 4x8 block model

As described above, the 4x8 model takes 8 input features, so the input layer has 8 units. It has a single hidden layer with 16 units. The model parameters are as follows:

// Weights from the input layer to the hidden layer
static const float av1_tx_split_nn_weights_4x8_layer0[8 * 16] = {
  0.068650f,  -0.732073f, -0.040361f, 0.322550f,  -0.021123f, 0.212518f,
  -0.350546f, 0.435987f,  -0.111756f, -0.401568f, 0.069548f,  -0.313000f,
  0.073918f,  -0.373805f, -0.775810f, -0.124753f, 0.181094f,  -0.602641f,
  -0.026219f, -0.350112f, 0.020599f,  -0.311752f, -0.476482f, -0.669465f,
  -0.310921f, 0.348869f,  -0.115984f, 0.154250f,  0.200485f,  -0.016689f,
  0.020392f,  0.413810f,  0.634064f,  -0.627530f, 0.399178f,  -0.012284f,
  0.472030f,  0.091087f,  -0.706100f, -0.447944f, -0.274226f, 0.445656f,
  0.309339f,  0.505522f,  0.038496f,  -0.152809f, 0.408684f,  -0.068151f,
  0.271612f,  0.353233f,  -0.150365f, 0.075212f,  -0.035096f, 0.346615f,
  0.124382f,  0.477072f,  0.216288f,  0.070548f,  -0.106362f, 0.681613f,
  -0.145502f, -0.218631f, -0.099248f, -0.001983f, -0.196819f, -0.969045f,
  0.063009f,  -0.123053f, 0.104875f,  -0.137581f, -0.282933f, -0.003624f,
  -0.315659f, -0.333523f, -0.503000f, -0.100063f, -0.536711f, -0.059978f,
  -0.670248f, -0.353762f, 0.181109f,  0.289715f,  -0.071206f, 0.261141f,
  0.052796f,  -0.114554f, -0.139214f, -0.261380f, 0.075984f,  -0.647925f,
  -0.099528f, -0.677814f, 0.015712f,  -0.389385f, -0.095622f, -0.165117f,
  -0.109454f, -0.175240f, -0.393914f, 0.212330f,  0.037822f,  0.248280f,
  0.180197f,  0.110493f,  -0.525727f, -0.092329f, -0.524029f, -0.407364f,
  -0.542373f, -0.435626f, -0.912194f, 0.062794f,  0.160433f,  0.741485f,
  -0.103659f, -0.119327f, -0.055275f, 0.334358f,  0.014713f,  0.046327f,
  0.831114f,  -0.576682f, 0.354369f,  -0.082088f, 0.452331f,  0.039730f,
  -0.792429f, -0.385862f,
};
// Biases from the input layer to the hidden layer
static const float av1_tx_split_nn_bias_4x8_layer0[16] = {
  0.238621f,  2.186830f,  1.383035f,  -0.867139f, 1.257119f, -0.351571f,
  -0.240650f, -0.971692f, 2.744843f,  1.116991f,  0.139062f, -0.165332f,
  0.262171f,  -1.598153f, -1.427340f, -1.602306f,
};
// Weights from the hidden layer to the output layer
static const float av1_tx_split_nn_weights_4x8_layer1[16] = {
  -0.367134f, 1.373058f, -0.897039f, -0.326819f, -0.734030f, -0.290413f,
  -0.501249f, 0.505321f, -0.537692f, -0.767893f, 0.268697f,  0.278987f,
  0.085082f,  0.614986f, 0.847904f,  0.637578f,
};
// Bias from the hidden layer to the output layer
static const float av1_tx_split_nn_bias_4x8_layer1[1] = {
  0.20586078f,
};
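For example, these 4x8 tables can be exercised with the forward-pass sketch from earlier (illustrative only; assumes the weight layout of one hidden unit per row of 8 inputs):

// Score an 8-feature vector with the 4x8 model tables above.
float feat[8] = { 0 };  // fill with the 8 features described earlier
const float score = tx_split_nn_forward(
    feat, /*n_in=*/8,
    av1_tx_split_nn_weights_4x8_layer0, av1_tx_split_nn_bias_4x8_layer0,
    /*n_hidden=*/16,
    av1_tx_split_nn_weights_4x8_layer1, av1_tx_split_nn_bias_4x8_layer1[0]);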

2. 8x8 block model

As described above, the 8x8 model takes 12 input features, so the input layer has 12 units. It has a single hidden layer with 12 units. The model parameters are as follows:

// Weights from the input layer to the hidden layer
static const float av1_tx_split_nn_weights_8x8_layer0[144] = {
  0.177983f,  -0.938386f, -0.074460f, -0.221843f, -0.073182f, -0.295155f,
  -0.098202f, -0.279510f, 0.001054f,  -0.119319f, -1.835282f, -0.581507f,
  -1.222222f, -1.049006f, -0.807508f, -0.454252f, -0.774879f, -0.180607f,
  -0.886976f, -0.231971f, -0.824677f, -0.351872f, -1.323819f, 0.235378f,
  0.015331f,  -0.341818f, 0.145549f,  -0.348362f, 0.147647f,  -0.323400f,
  0.047558f,  -0.553025f, -0.295485f, -0.330368f, -0.530605f, -0.407516f,
  0.447740f,  0.782381f,  -0.179164f, -0.584675f, -0.052645f, 0.038656f,
  -0.096783f, 0.038342f,  -0.170762f, -0.405844f, -0.552665f, -0.509866f,
  0.757204f,  -1.296465f, 0.631015f,  0.009265f,  0.646192f,  0.044523f,
  0.653161f,  0.033820f,  0.849639f,  -0.068555f, -1.036085f, -0.511652f,
  0.104693f,  -1.458690f, 0.286051f,  -0.089800f, 0.381564f,  -0.302640f,
  0.304465f,  -0.268706f, 0.432603f,  -0.117914f, -2.070031f, -0.565696f,
  -0.073027f, -1.783570f, -0.318144f, -0.320990f, -0.343966f, -0.140996f,
  -0.322977f, -0.232147f, -0.373210f, -0.158266f, -1.922305f, -0.634373f,
  0.101894f,  -0.221847f, 0.018412f,  -0.423887f, -0.266684f, -0.444930f,
  -0.196237f, 0.106638f,  -0.065834f, -0.538401f, -0.280772f, -0.620348f,
  1.089957f,  -0.799928f, 0.504112f,  -0.165763f, 0.578741f,  -0.172653f,
  0.547316f,  -0.143484f, 0.717220f,  -0.297190f, -1.237854f, -0.074819f,
  -0.977304f, -0.484092f, -0.646427f, -0.451443f, -0.612126f, -0.224475f,
  -0.731608f, -0.257077f, -0.665857f, -0.346742f, -1.216372f, 0.227267f,
  0.231249f,  -1.693073f, -0.035899f, 0.380845f,  -0.058476f, 0.409405f,
  -0.066679f, 0.406731f,  -0.068501f, 0.396748f,  0.639462f,  0.150834f,
  -0.418659f, -1.421931f, 0.101889f,  0.083573f,  0.129746f,  0.134460f,
  0.081185f,  0.127420f,  0.083664f,  0.051096f,  1.361688f,  0.386093f,
};
// Biases from the input layer to the hidden layer
static const float av1_tx_split_nn_bias_8x8_layer0[12] = {
  4.280443f, 2.218902f, -0.256953f, 3.161431f,  2.082548f, 2.506052f,
  2.563224f, 1.421976f, -1.627813f, -1.436085f, 2.297265f, 1.500469f,
};
// Weights from the hidden layer to the output layer
static const float av1_tx_split_nn_weights_8x8_layer1[12] = {
  1.178833f,  -0.428527f, -0.078737f, 0.381434f, -0.466895f, -0.901745f,
  -0.766968f, -0.356663f, 0.450146f,  0.509370f, -0.356604f, -0.443506f,
};
// Bias from the hidden layer to the output layer
static const float av1_tx_split_nn_bias_8x8_layer1[1] = {
  -0.156294f,
};

Training set construction

The models shipped in AV1 are pre-trained. To construct the training data set, sequences of different content types and resolutions were selected and encoded with different QPs; during encoding, the input features (the two-level means and standard deviations) and the corresponding labels (derived from the RD costs of splitting versus not splitting) were recorded.
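Schematically, each training example therefore pairs the feature vector with a binary label from the RD comparison (hypothetical helper names):

// Label construction for one training example: records whether splitting
// this transform block actually won the RD comparison during encoding.
const double rd_no_split = rd_cost_of_tu(blk);       // coded as one TU
const double rd_split = rd_cost_of_best_split(blk);  // best split result
const int label = rd_split < rd_no_split;            // 1 = split is optimal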

