Paper Reading Notes 【CVPR 2020】 IR-Net: Forward and Backward Information Retention for Accurate Binary Neural Networks

Paper title: Forward and Backward Information Retention for Accurate Binary Neural Networks

Paper link: https://arxiv.org/abs/1909.10788

Code: https://github.com/htqin/IR-Net

Effectiveness and deployment of IR-Net

  The authors ran experiments on two benchmark datasets, CIFAR-10 and ImageNet (ILSVRC12). The experimental results show that on both datasets IR-Net is more competitive than existing state-of-the-art methods.

  To further verify the efficiency of IR-Net when deployed on real mobile devices, the authors also implemented IR-Net on a Raspberry Pi 3B (1.2 GHz 64-bit quad-core ARM Cortex-A53) and tested its real speed in practical applications.

  As can be seen from the table, IR-Net's inference is much faster and its model size is greatly reduced, while the shift operations introduced by IR-Net add almost no extra inference time or storage overhead.

Current status of binary networks and motivation

  Binarizing weights and activations is an effective method for compressing deep neural networks, and inference can be accelerated with bitwise operations. In practical use, binary neural networks have attracted wide attention thanks to their small storage footprint and efficient inference. Although many binarization methods improve model accuracy by minimizing the quantization error in the forward propagation, a significant performance gap still exists between binarized models and their full-precision counterparts.

  The main reason for the performance degradation of binary neural networks is their limited representational capacity and the discreteness of the two values, which causes severe information loss in both the forward and backward propagation. In the forward propagation, when activations and weights are restricted to two values, the diversity of the model drops sharply, and diversity has been shown [54] to be critical for the high accuracy of neural networks.

  Diversity means that the forward propagation carries enough information; correspondingly, in the backward propagation, accurate gradients provide correct information for optimization. However, during the training of a binary network, the discrete binarization often brings inaccurate gradients and wrong optimization directions. The authors therefore attribute the accuracy loss of binary networks to the information lost in both the forward and backward propagation through quantization.

  Therefore, the authors propose the Information Retention Network (IR-Net):

  1. In the forward propagation, a balanced, standardized quantization method called Libra Parameter Binarization (Libra-PB) is introduced, which minimizes the information loss in the forward propagation by maximizing the information entropy of the quantized parameters and minimizing the quantization error;
  2. In the backward propagation, an Error Decay Estimator (EDE) is used to compute the gradients, which minimizes the information loss by approximating the sign function more and more accurately, guaranteeing sufficient updates at the start of training and accurate gradients at the end of training.

  As shown below, the first figure depicts how Libra-PB changes the weight distribution in the forward propagation; the second figure depicts the actual training process using Libra-PB and EDE; the third figure depicts how EDE minimizes the information loss by approximating the sign function.

IR-Net

Preliminaries

  a. Conventional calculation

  In a deep neural network, the main operation can be expressed as:

where w denotes the weight vector and a denotes the input activation vector.
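Written out, this is roughly (⊗ denotes the convolution / inner-product operation):

$$ z = \mathbf{w} \otimes \mathbf{a} $$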

  b. Binarization process

  Both the weights w and the inputs a need to be binarized. In the following, Qx(x) denotes the binarization of either the weights w or the inputs a, i.e. Qw(w) or Qa(a), and each α is a scalar coefficient.

  The goal of a binary network is to represent the floating-point weights and the activation outputs of each layer with 1 bit. In general, the quantization can be expressed as:

where x denotes the floating-point parameters, including the floating-point weights w and activation outputs a; Bx ∈ {-1, +1} denotes the binarized values, including the binarized weights Bw and binarized activation outputs Ba; and α denotes the binarization scalars, including the weight scalar αw and the activation scalar αa. Bx is usually obtained with the sign function:
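Spelled out, the two formulas referenced above are, in the usual formulation:

$$ Q_x(x) = \alpha B_x, \qquad B_x = \operatorname{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases} $$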

  c. Binarized network operation

  Once the binarized weights Qw(w) and the binarized inputs Qa(a) have been obtained, the conventional computation can be carried out with binary operations in the binarized network, as follows:

where the operation ⊙ consists of XNOR and bitcount operations.
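That is, roughly $z = \alpha_w \alpha_a (B_w \odot B_a)$, where the inner product of two {-1, +1} vectors is realized with XNOR and popcount. A minimal Python sketch of this equivalence (not the authors' code; the bit encoding is an assumption for illustration):

```python
# Inner product of two {-1, +1} vectors via XNOR + popcount on their bit encodings.
def binary_dot(bw_bits: int, ba_bits: int, n: int) -> int:
    """bw_bits / ba_bits encode {-1, +1} vectors as bit masks (bit = 1 -> +1, bit = 0 -> -1)."""
    xnor = ~(bw_bits ^ ba_bits) & ((1 << n) - 1)   # 1 where the two signs agree
    matches = bin(xnor).count("1")                 # popcount
    return 2 * matches - n                         # agreements minus disagreements

# Example: bw = [+1, -1, +1, +1], ba = [+1, +1, -1, +1]  ->  dot product = 0
assert binary_dot(0b1101, 0b1011, 4) == 0
```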

  Using the sign function for quantization brings a problem in the backward propagation: the derivative of the sign function is zero almost everywhere, which is incompatible with backpropagation, since the exact gradient with respect to the original values before discretization (pre-activations or weights) would be zero. Therefore, the straight-through estimator (STE) [5] is generally used to train binary models, propagating the gradient through an Identity or Hardtanh function.
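As a concrete illustration, a minimal PyTorch-style sketch of such an STE (an assumed implementation, not the authors' code):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass; Hardtanh-style straight-through gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Clip/Hardtanh STE: pass the gradient only where |x| <= 1
        return grad_output * (x.abs() <= 1).float()

binarize = BinarizeSTE.apply  # usage: b = binarize(w)
```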

Forward propagation: Libra-PB (Libra Parameter Binarization)

  The traditional quantization-error-minimization objective function

  The quantization in the forward propagation brings information loss. In many binary convolutional neural network quantizers, minimizing the quantization error is taken as the optimization objective:

where x denotes the full-precision parameters, Qx(x) denotes the quantized parameters, and J(Qx(x)) denotes the quantization loss.
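In the usual form, this objective is:

$$ \min \; \mathcal{J}(Q_x(x)) = \min \; \| x - Q_x(x) \|^2 $$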

  For a binary model, the representational ability of its parameters is limited to two values, so the information carried by the neurons is easily lost. Moreover, the solution space of a binary neural network is also very different from that of a full-precision network. Therefore, if the network does not retain information, minimizing the quantization error alone is insufficient to guarantee a good binarized network, and it is difficult to obtain one.

  Therefore, Libra-PB combines the quantization error with the information loss in the forward computation, minimizing the information loss while retaining information.

  Libra-PB objective function

  Consider a random variable b ∈ {-1, +1} following a Bernoulli distribution; here b is in fact the quantized value Qw(w) or Qa(a). Its probability distribution can be expressed as follows:
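In the usual Bernoulli form:

$$ P(b) = \begin{cases} p, & b = +1 \\ 1 - p, & b = -1 \end{cases} $$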

  The binarization process is essentially determined by the full-precision weights and inputs to be binarized. The authors hope that a balanced binarized weight distribution can be obtained through re-balancing, that is, the information entropy of Qx(x) should be as large as possible: the larger the entropy, the greater the uncertainty and the more balanced the distribution.

  To obtain the information entropy of the binarization result Qx(x), it is in fact enough to compute the entropy of Bx (which the network obtains via the binarization formula above). It can thus be expressed as follows:
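That is, the entropy of the Bernoulli variable above:

$$ \mathcal{H}(Q_x(x)) = \mathcal{H}(b) = -p \ln p - (1 - p) \ln(1 - p) $$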

  The entropy reaches its maximum when p = 0.5, which means that the quantized values are evenly distributed.

  The Libra-PB objective function is defined as:
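This combines the two terms above, roughly as (λ weights the entropy term):

$$ \min \; \mathcal{J}(Q_x(x)) - \lambda \, \mathcal{H}(Q_x(x)) $$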

  In addition, in order to make training more stable and to avoid the negative effects caused by the weights and their gradients, the balanced weights are further standardized:
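A reconstruction of this standardization step (with $\bar{w}$ denoting the mean of the weights):

$$ \hat{w} = w - \bar{w}, \qquad \hat{w}_{std} = \frac{\hat{w}}{\sigma(\hat{w})} $$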

where σ(·) denotes the standard deviation.

  The standardized weight ŵ_std has two characteristics:

  (1) zero mean, which maximizes the information entropy of the binarized weights.

  (2) unit norm, which makes the full-precision weights involved in binarization more dispersed.

  As can be seen from the figure, when the weights are converted from full precision to binary, the weights binarized with Libra-PB have higher information entropy and a more balanced distribution than those obtained with conventional binarization.

  Thus, finally, Libra parameter binarization for the forward propagation can be expressed as follows:
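A reconstruction of this final formula, where B_w is the sign of the standardized weights and s is the integer shift exponent defined below:

$$ Q_w(w) = B_w \,{\ll\gg}\, s, \qquad B_w = \operatorname{sign}(\hat{w}_{std}) $$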

  The main arithmetic operations of IR-Net can then be expressed as:

where << and >> denote left and right shift operations, and s can be calculated by the following expression:
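A reconstruction of this shift exponent, followed by a short sketch of the whole Libra-PB weight binarization (assumed PyTorch code, not the official implementation):

$$ s = \operatorname{round}\!\left(\log_2 \frac{\|\hat{w}_{std}\|_1}{n}\right) $$

```python
import torch

def libra_pb(w: torch.Tensor):
    """Sketch of Libra-PB weight binarization as described above (a reconstruction)."""
    # Balance and standardize: zero mean, unit standard deviation
    w_hat = w - w.mean()
    w_std = w_hat / w_hat.std()
    # Binarize
    b_w = torch.where(w_std >= 0, torch.ones_like(w_std), -torch.ones_like(w_std))
    # Integer power-of-two scalar; at inference time 2**s is applied with bit shifts
    s = torch.round(torch.log2(w_std.abs().mean()))
    return b_w, s  # the quantized weight is b_w * 2**s
```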

Backward propagation: EDE (Error Decay Estimator)

  Since binarization is discontinuous, approximating the gradient is unavoidable in the backward propagation. The approximation therefore cannot accurately model the effect of quantization, which causes a huge loss of information. The gradient can be approximated as:

where L(w) denotes the loss function and g(w) denotes the approximation of the sign function. Two approximations are commonly used for g(w):
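That is, roughly:

$$ \frac{\partial L(w)}{\partial w} = \frac{\partial L(w)}{\partial Q_w(w)} \cdot \frac{\partial Q_w(w)}{\partial w} \approx \frac{\partial L(w)}{\partial Q_w(w)} \cdot \frac{\partial g(w)}{\partial w} $$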

  1. Identity: y = x

  The Identity function simply passes the gradient information of the output value to the input value, completely ignoring the effect of binarization. As shown by the shaded area in Fig. 3(a), the gradient error is large and accumulates during the backward propagation. Stochastic gradient descent needs to retain correct gradient information to avoid unstable training, so the noise caused by the Identity function cannot simply be ignored.

  2. Clip: y = Hardtanh(x)

  The Clip function takes the truncating property of binarization into account and reduces the gradient error. However, it can only pass gradient information inside the truncation interval. As can be seen in Fig. 3(b), for parameters outside [-1, +1] the gradient is clipped to 0, which means that once a value leaves this interval, it can never be updated again. This property greatly limits the updating ability of backpropagation, which is exactly why ReLU has proven to be a better activation function than Tanh. Thus, in practice, the Clip approximation increases the difficulty of optimization and reduces accuracy. Guaranteeing sufficient update possibilities is critical, especially at the beginning of training.
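For reference, the two approximate derivatives that replace the (almost everywhere zero) derivative of the sign function are:

$$ \frac{\partial\, \text{Identity}(x)}{\partial x} = 1, \qquad \frac{\partial\, \text{Clip}(x)}{\partial x} = \mathbf{1}_{|x| \le 1} $$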

  The Identity function loses the gradient information caused by quantization, while the Clip function loses the gradient information outside the truncation interval. There is thus a contradiction between the two kinds of gradient information loss.

  In order to retain the information derived from the loss function in the backward propagation, EDE introduces a two-stage, progressive method for approximating the gradient.

  Stage one: retain the updating ability of the backpropagation algorithm. The derivative of the gradient estimation function is kept at a value close to 1, and the clipping value is then gradually reduced from a large number down to 1. With this rule, the approximation function evolves from the Identity function towards the Clip function, which guarantees sufficient updates early in training.

  Stage two: keep the gradients accurate for parameters around zero. The clipping value is kept at 1, and the derivative curve gradually evolves into the shape of a step function. With this rule, the approximation function evolves from the Clip function towards the sign function, which guarantees consistency between the forward and backward propagation.

  The shapes of EDE at the various stages are shown in Fig. 3(c). With this design, EDE reduces the gap between the forward binarization function and the backward approximation function, and all parameters can receive reasonable updates.
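A rough sketch of an EDE-style estimator whose backward shape evolves from near-identity to sign-like over training (the tanh form and the temperature schedule below are assumptions for illustration, not the exact constants of the paper):

```python
import torch

def ede_grad(x: torch.Tensor, epoch: int, total_epochs: int,
             t_min: float = 0.1, t_max: float = 10.0) -> torch.Tensor:
    """Backward-pass derivative of an evolving soft-sign g(x) = k * tanh(t * x)."""
    progress = epoch / max(total_epochs - 1, 1)
    t = t_min * (t_max / t_min) ** progress   # temperature grows as training progresses
    k = max(1.0 / t, 1.0)                     # keeps the early derivative close to 1 (identity-like)
    return k * t * (1.0 - torch.tanh(t * x) ** 2)

# Usage inside a custom autograd backward:
#   grad_input = grad_output * ede_grad(saved_input, epoch, total_epochs)
```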

  


Original post: https://www.cnblogs.com/monologuesmw/p/12621335.html