## I. Question 1: How does TensorFlow do quantization and dequantization?

Details
According to the blog post https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/ *(key reference!)*, TensorFlow quantizes values before they go into a layer and dequantizes them after the layer has processed them. It quantizes by rescaling values into the range 0 to 255, so it has to keep the “min” and “max” in order to dequantize later.

I would like to ask:

1. How are the “min” and “max” in the outputs of a “quantization” op determined? If we simply take the minimum and maximum values and map them to 0 and 255, the convolution will overflow or underflow.
2. How are the “min” and “max” in the outputs of a “convolution” op determined? Both weights and activations are quantized, so there are two sets of “min” and “max”. How does a convolution op combine them into a single set?

TensorFlow uses, among other libraries, gemmlowp for low-precision matrix multiplication. Although the inputs are 8-bit values, intermediate results are held as 32-bit values, which are converted back to 8 bits before the result is returned.

To avoid overflow, we internally accumulate results on more than 8 bits, and at the end we keep only some significant 8 bits.
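
The accumulation step can be illustrated with a minimal pure-Python sketch. It is not gemmlowp code: the helper names `dot_uint8` and `to_uint8` are hypothetical, and the final rescaling to 8 bits is a simple linear mapping assumed here for illustration only.

```python
def dot_uint8(a, b):
    """Dot product of two uint8 vectors with a wide accumulator."""
    assert all(0 <= v <= 255 for v in list(a) + list(b))
    acc = 0  # stands in for an int32 accumulator; Python ints never overflow
    for x, y in zip(a, b):
        acc += x * y
    return acc


def to_uint8(acc, acc_max):
    """Rescale an accumulator from [0, acc_max] to 0..255 (assumed scheme)."""
    return max(0, min(255, round(acc * 255 / acc_max)))


a = [200, 250, 255, 128]
b = [255, 240, 255, 64]
acc = dot_uint8(a, b)          # far larger than any 8-bit value
acc_max = len(a) * 255 * 255   # largest value this accumulator can reach
out = to_uint8(acc, acc_max)   # back to a single 8-bit value
print(acc, out)
```

A single product of two 8-bit values already needs up to 16 bits, which is why the sum cannot be kept in 8 bits while it is being built up.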

## II. How to Quantize Neural Networks with TensorFlow (blog post), official guide

### 2.1 Quantizing an existing model and testing it

Code location: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/python/quantize_graph.py

```shell
tar xzf /tmp/inceptionv3.tgz -C /tmp/
bazel build tensorflow/contrib/quantization/tools:quantize_graph
bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
  --input=/tmp/classify_image_graph_def.pb \
  --output_node_names="softmax" --output=/tmp/quantized_graph.pb \
  --mode=eightbit
```

This will produce a new model that runs the same operations as the original, but with eight bit calculations internally, and all weights quantized as well. If you look at the file size, you’ll see it’s about a quarter of the original (23MB versus 91MB). You can still run this model using exactly the same inputs and outputs though, and you should get equivalent results. Here’s an example:

```shell
bazel build tensorflow/examples/label_image:label_image
bazel-bin/tensorflow/examples/label_image/label_image \
  --input_graph=/tmp/quantized_graph.pb \
  --input_width=299 \
  --input_height=299 \
  --mean_value=128 \
  --std_value=128 \
  --input_layer_name="Mul:0" \
  --output_layer_name="softmax:0"
```

### 2.2 How are quantized tensors represented? Using gemmlowp

### 2.3.1 Quantization implementation code (important!)

### 2.3.2 Principles: the low-precision paradigm in gemmlowp, and how it’s implemented (gemmlowp)

“Low-precision” means that the input and output matrix entries are integers on at most 8 bits. The scalar type is uint8_t.
gemmlowp is flexible enough to support multiple low-precision paradigms, i.e. multiple ways of attaching a meaning to 8-bit values so that a computation can rely on an 8-bit GEMM provided by gemmlowp.

#### Building a quantization paradigm from first principles

1. Quantization as an affine map.
2. Domain-specific constraint: the real value 0 must be exactly representable.
3. The final form of the quantization equation
4. Quantizing a matrix multiplication
5. Implementation of quantized matrix multiplication
6. How this is implemented in gemmlowp
7. How this differs from the older legacy gemmlowp quantization paradigm
8. Example code illustrating the new quantization paradigm
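
The first four items in this outline can be sketched in a few lines of pure Python. The sketch assumes the affine form `real_value = scale * (quantized_value - zero_point)` described in the gemmlowp quantization documentation; the function names are hypothetical and this is not gemmlowp's own code.

```python
def choose_quant_params(rmin, rmax, qmin=0, qmax=255):
    """Pick (scale, zero_point) so that real 0.0 is exactly representable."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # the range must contain 0
    if rmin == rmax:
        return 1.0, qmin  # degenerate all-zero range (assumed fallback)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(r, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, zero_point + int(round(r / scale))))


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


def quantized_dot(lhs_q, rhs_q, lhs_params, rhs_params):
    """Integer accumulation of (q - zero_point) products, scaled once at the end."""
    (lhs_scale, lhs_zp), (rhs_scale, rhs_zp) = lhs_params, rhs_params
    acc = sum((a - lhs_zp) * (b - rhs_zp) for a, b in zip(lhs_q, rhs_q))
    return lhs_scale * rhs_scale * acc


lhs, rhs = [0.5, -1.0, 3.0], [2.0, 1.0, 0.5]
lhs_params = choose_quant_params(min(lhs), max(lhs))
rhs_params = choose_quant_params(min(rhs), max(rhs))
lhs_q = [quantize(v, *lhs_params) for v in lhs]
rhs_q = [quantize(v, *rhs_params) for v in rhs]
approx = quantized_dot(lhs_q, rhs_q, lhs_params, rhs_params)
exact = sum(a * b for a, b in zip(lhs, rhs))
print(exact, approx)
```

Because the zero point is an integer inside the quantized range, quantizing 0.0 and dequantizing it again returns exactly 0.0, which is the domain-specific constraint in item 2.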

### 3.1 mobilenet_v1_train.py

```python
def build_model():
  """Builds graph for model to train with rewrites for quantization.

  Returns:
    g: Graph with fake quantization ops and batch norm folding suitable for
      training quantized weights.
    train_tensor: Train op for execution during training.
  """
  g = tf.Graph()
  # The argument of tf.device( is cut off in these notes; `...` marks the gap.
  with g.as_default(), tf.device(...):
    inputs, labels = imagenet_input(is_training=True)
    with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=True)):
      logits, _ = mobilenet_v1.mobilenet_v1(
          inputs,
          is_training=True,
          depth_multiplier=FLAGS.depth_multiplier,
          num_classes=FLAGS.num_classes)

    tf.losses.softmax_cross_entropy(labels, logits)

    # Call rewriter to produce graph with fake quant ops and folded batch norms
    # quant_delay delays start of quantization till quant_delay steps, allowing
    # for better model accuracy.
    if FLAGS.quantize:
      # Key call: this is where the fake-quantization rewrite happens.
      tf.contrib.quantize.create_training_graph(quant_delay=get_quant_delay())
```
```python
# Create the fake-quantization graph
def create_training_graph(input_graph=None, quant_delay=0):
  """Rewrites a training input_graph in place for simulated quantization.

  Variables added by the rewrite get added to the global variables collection.

  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.

  The default value of quant_delay is suitable for finetuning an already trained
  floating point model (recommended).
  If one wants to train a quantized model from scratch, quant_delay should be
  set to the number of steps it take the floating point model to converge.
  Quantization will be activated at this point and effectively finetune the
  model. If quant_delay is not provided when training from scratch, training can
  often fail.

  Args:
    input_graph: The tf.Graph to be transformed.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.

  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """
  # TODO(raghuramank) Need to have freeze_bn_delay be a function of batch size
  # Currently the values below are hardcoded for mobilenetV1 on imagenet
  # Please use the experimental API if you need to tune these values.
  freeze_bn_delay = None

  # Key call: the actual rewrite is delegated to _create_graph.
  _create_graph(
      input_graph=input_graph,
      is_training=True,
      quant_delay=quant_delay,
      freeze_bn_delay=freeze_bn_delay)


# Continue into _create_graph
def _create_graph(input_graph=None,
                  is_training=True,
                  weight_bits=8,
                  activation_bits=8,
                  quant_delay=None,
                  freeze_bn_delay=None,
                  scope=None):
  """Rewrites an input_graph in place for simulated quantization.

  The graph has fake quantization ops inserted to simulate the error
  introduced by quantization. Since the graph is transformed in place,
  the expected behavior of previously held references to nodes and tensors may
  change.

  Args:
    input_graph: The tf.Graph to be transformed, if None then defaults to the
      default graph.
    is_training: Whether quantizing training or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    quant_delay: Number of steps after which weights and activations are
      quantized during training.
    freeze_bn_delay: Number of steps after which moving mean and variance are
      frozen and used instead of batch statistics during training.
      freeze_bn_delay should be greater than quant_delay and should correspond
      to the number of steps when training has almost converged
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.

  Raises:
    ValueError: If elements contains an element that isn't a tf.Tensor or
      tf.Operation.
  """

  if input_graph is None:
    input_graph = ops.get_default_graph()
  with input_graph.as_default():
    fold_batch_norms.FoldBatchNorms(
        input_graph,
        freeze_batch_norm_delay=freeze_bn_delay,
        is_training=is_training)
    # Key call: inserts the fake-quantization ops (see quantize.py).
    quantize.Quantize(
        input_graph,
        is_training,
        quant_delay=quant_delay,
        weight_bits=weight_bits,
        activation_bits=activation_bits,
        scope=scope)
```
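
`_create_graph` folds batch norms before it inserts the quantization ops. The arithmetic identity behind folding at inference time can be checked with a short sketch. This is only the textbook identity for a single output channel, written with hypothetical helper names; it is not the `FoldBatchNorms` implementation, which also has to deal with batch statistics during training.

```python
import math


def linear(x, weights, bias=0.0):
    """One output channel of a fully connected layer."""
    return sum(xi * wi for xi, wi in zip(x, weights)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-3):
    """Inference-time batch norm using fixed moving statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(weights, gamma, beta, mean, var, eps=1e-3):
    """Return (folded_weights, folded_bias) equivalent to linear + batch_norm."""
    factor = gamma / math.sqrt(var + eps)
    return [w * factor for w in weights], beta - mean * factor


x = [0.5, -1.0, 2.0]
w = [0.2, 0.4, -0.1]
gamma, beta, mean, var = 1.5, 0.1, 0.3, 4.0

unfolded = batch_norm(linear(x, w), gamma, beta, mean, var)
w_fold, b_fold = fold_batch_norm(w, gamma, beta, mean, var)
folded = linear(x, w_fold, b_fold)
print(unfolded, folded)
```

Folding matters for quantization because the weights that are actually used at inference are the folded ones, so those are the values whose range needs to be quantized.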

### 3.3 quantize.py

```python
def Quantize(graph,
             is_training,
             weight_bits=8,
             activation_bits=8,
             ema_decay=0.999,
             quant_delay=None,
             vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
             scope=None):
  """Currently we quantize the following tensors:

  * Conv/MatMul: Quantize the weights if it matches.
  * Activation: Quantize the output if it matches.
  * Bypass/Post-activation Bypass: Quantize both input and output
    if it matches.

  Args:
    graph: Graph to modify.
    is_training: Whether quantizing training graph or eval graph.
    weight_bits: Number of bits to use for quantizing weights.
    activation_bits: Number of bits to use for quantizing activations.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    scope: The scope to be transformed. If it's not None, only the ops which
      are in this scope will be transformed.

  Raises:
    ValueError: When quantization fails.
  """
  # ……
  for layer_match in _FindLayersToQuantize(graph):  # key call
    # ……
    _InsertQuantOp(  # key call
        context,  # restored to match the signature of _InsertQuantOp
        'act_quant',
        layer_match.activation_op,
        consumer_ops,
        is_training,
        moving_avg=True,
        ema_decay=ema_decay,
        quant_delay=quant_delay,
        vars_collection=vars_collection,
        bits=activation_bits,
        init_min=0.0,
        producer_scope=scope)


# Continue into _FindLayersToQuantize
def _FindLayersToQuantize(graph):
  """Matches layers in graph to quantize.

  The following patterns get matched. Nodes surrounded by [] will be
  optionally matched:

          weight|folded_weight
                /
         conv|fc
            |
    [post_conv_correction]
            |
            |
         [bypass]
            |
        activation
            |
   [post_activation_bypass]

  Match replacements:
    If weight|folded_weight is found, FakeQuant is added afterwards.
    If bypass is found, FakeQuant is added before and after.
    If activation is found, FakeQuant is added afterwards.
    If post_activation_bypass is found, FakeQuant is added afterwards.

  Args:
    graph: Graph to perform match on.

  Returns:
    list of _LayerMatches.
  """


# Next
def _InsertQuantOp(context,
                   name,
                   producer,
                   consumers,
                   is_training,
                   moving_avg=True,
                   init_min=-6.0,
                   init_max=6.0,
                   bits=8,
                   ema_decay=0.999,
                   quant_delay=None,
                   vars_collection=ops.GraphKeys.GLOBAL_VARIABLES,
                   narrow_range=False,
                   producer_scope=None,
                   consumer_scope=None):
  """Inserts a quant op between a producer op and (multiple) consumer ops.

  Args:
    context: Context where producer and consumer operations are nested.
    name: Name for the new quantization op within the context.
    producer: Producer operation of the pairs where quantization will be
      inserted.
    consumers: Consumer operations of the pairs.
    is_training: Whether quantizing training graph or eval graph.
    moving_avg: Specifies whether to use exponential moving average or just
      the last value seen.
    init_min: Starting minimum value for the new quantization op.
    init_max: Starting maximum value for the new quantization op.
    bits: Number of bits to use for quantization, must be between 2 and 8.
    ema_decay: (Optional) Float, EMA decay parameter.  EMA is used to update
      quantization intervals for quantizing activations (see here about EMA:
      https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average).
    quant_delay: (Optional, default None) Int, count of global steps for which
      to delay quantization.  This helps weights stabilize at the start of
      training.
    vars_collection: (Optional) Collection where to store the variables for
      quantization interval ends.
    narrow_range: Whether to use the narrow quantization range
      [1; 2^bits - 1] or wide range [0; 2^bits - 1].
    producer_scope: The restriction of producer scope. If not None, the new op
      will be inserted only when the producer is in this scope.
    consumer_scope: The restriction of consumer scope. If not None, the new op
      will be inserted only when all the consumers are in this scope.

  Raises:
    ValueError: When producer operation is not directly connected to the
      consumer operation.
  """
  # Next (body excerpt: `inputs` and `name_prefix` are set up in code that is
  # not shown in these notes).

  # Range variables: track [min, max] with an EMA or with the last value seen.
  if moving_avg:
    quant = (
        quant_ops.MovingAvgQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            ema_decay=ema_decay,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))
  else:
    quant = (
        quant_ops.LastValueQuantize(
            inputs,
            init_min=init_min,
            init_max=init_max,
            is_training=is_training,
            num_bits=bits,
            narrow_range=narrow_range,
            vars_collection=vars_collection,
            name_prefix=name_prefix))

  # Activation of quantization: pass inputs through until quant_delay steps.
  if quant_delay and quant_delay > 0:
    activate_quant = math_ops.greater_equal(
        common.CreateOrGetQuantizationStep(),
        quant_delay,
        name=name_prefix + '/activate_quant')
    quant = control_flow_ops.cond(
        activate_quant,
        lambda: quant,
        lambda: inputs,
        name=name_prefix + '/delayed_quant')

  # Consumer ops: reroute them to read the quantized tensor.
  if consumers:
    tensors_modified_count = graph_editor.reroute_ts(
        [quant], [inputs], can_modify=consumers)
    # Some operations can have multiple output tensors going to the same
    # consumer. Since consumers is a set, we need to ensure that
    # tensors_modified_count is greater than or equal to the length of the set
    # of consumers.
    if tensors_modified_count < len(consumers):
      raise ValueError('No inputs quantized for ops: [%s]' % ', '.join(
          [consumer.name for consumer in consumers]))
```
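
The two range-tracking options (`moving_avg=True` versus last value seen) and the `quant_delay` switch can be mimicked in plain Python. This is a simplified sketch with hypothetical names (`RangeTracker`, `quantization_active`); it does not reproduce `quant_ops.MovingAvgQuantize` or `quant_ops.LastValueQuantize`, which operate on graph variables and may adjust the range further.

```python
class RangeTracker:
    """Tracks a [min, max] quantization range across training batches."""

    def __init__(self, init_min=-6.0, init_max=6.0, ema_decay=0.999,
                 moving_avg=True):
        self.min, self.max = init_min, init_max
        self.ema_decay = ema_decay
        self.moving_avg = moving_avg

    def update(self, batch):
        batch_min, batch_max = min(batch), max(batch)
        if self.moving_avg:
            # Exponential moving average of the per-batch extremes.
            d = self.ema_decay
            self.min = d * self.min + (1.0 - d) * batch_min
            self.max = d * self.max + (1.0 - d) * batch_max
        else:
            # Just remember the last batch's extremes.
            self.min, self.max = batch_min, batch_max
        return self.min, self.max


def quantization_active(global_step, quant_delay):
    """Same decision as the quant_delay branch: off until quant_delay steps."""
    if quant_delay and quant_delay > 0:
        return global_step >= quant_delay
    return True


ema = RangeTracker(ema_decay=0.9)
last = RangeTracker(moving_avg=False)
print(ema.update([-1.0, 2.0]), last.update([-1.0, 2.0]))
```

With a decay close to 1 the EMA range moves slowly, so a single outlier batch has little effect on the quantization interval, whereas the last-value variant follows every batch exactly.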

## IV. Other Sources

#### 4. tf.quantize

Defined in tensorflow/python/ops/array_ops.py.

```python
tf.quantize(
    input,
    min_range,
    max_range,
    T,
    mode='MIN_COMBINED',
    round_mode='HALF_AWAY_FROM_ZERO',
    name=None
)
```
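
For the default `MIN_COMBINED` mode, the TensorFlow documentation gives the mapping `out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)`. The sketch below applies that formula in pure Python for an unsigned 8-bit type only. The function names are hypothetical, and the clamping and round-half-up step are assumptions of this sketch rather than a statement about the kernel.

```python
import math


def quantize_min_combined(values, min_range, max_range, num_bits=8):
    """MIN_COMBINED-style mapping of floats onto 0..2^num_bits - 1."""
    levels = 2 ** num_bits - 1  # range(T) for an unsigned 8-bit type is 255
    out = []
    for v in values:
        v = max(min_range, min(max_range, v))  # clamp into [min, max]
        scaled = (v - min_range) * levels / (max_range - min_range)
        out.append(int(math.floor(scaled + 0.5)))  # assumed rounding rule
    return out


def dequantize_min_combined(quantized, min_range, max_range, num_bits=8):
    """Inverse mapping; this is why min_range and max_range must be kept."""
    levels = 2 ** num_bits - 1
    return [min_range + q * (max_range - min_range) / levels for q in quantized]


q = quantize_min_combined([0.0, 3.0, 6.0, 9.0], 0.0, 6.0)
r = dequantize_min_combined(q, 0.0, 6.0)
print(q, r)
```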

#### 5. Fixed Point Quantization

##### 5.1 Quantization training with TensorFlow

TensorFlow can train models with quantization in the loop. Because training requires small gradient adjustments, floating point values are still used. To keep models as floating point while adding the quantization error in the training loop, fake quantization nodes simulate the effect of quantization in the forward and backward passes.
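
What a fake quantization node does can be sketched in pure Python: the forward pass rounds a float to the nearest representable level and returns a float again, and the backward pass lets the gradient through only inside the range. The function names are hypothetical, and the sketch leaves out details of TensorFlow's real fake-quant ops, such as adjusting the range so that zero is exactly representable.

```python
def fake_quant(x, range_min, range_max, num_bits=8):
    """Forward pass: quantize to 2^num_bits levels, then dequantize."""
    levels = 2 ** num_bits - 1
    scale = (range_max - range_min) / levels
    clamped = max(range_min, min(range_max, x))
    q = round((clamped - range_min) / scale)  # integer level index
    return range_min + q * scale              # back to a float value


def fake_quant_grad(x, upstream_grad, range_min, range_max):
    """Backward pass: pass the gradient through inside the range, else zero."""
    return upstream_grad if range_min <= x <= range_max else 0.0


scale = 6.0 / 255
for value in (0.0, 1.0, 5.99, 10.0):
    print(value, fake_quant(value, 0.0, 6.0), fake_quant_grad(value, 1.0, 0.0, 6.0))
```

The weights themselves stay in floating point, so small gradient updates are not lost, while the loss still sees the rounding error that 8-bit inference would introduce.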

Since it’s difficult to add these fake quantization operations to all the required locations in the model, there’s a function available that rewrites the training graph. To create a fake quantized training graph:

```python
# Build forward pass of model.
loss = tf.losses.get_total_loss()

# Call the training rewrite which rewrites the graph in-place with
# FakeQuantization nodes and folds batchnorm for training. It is
# often needed to fine tune a floating point model for quantization
# with this training tool. When training from scratch, quant_delay
# can be used to activate quantization after training to converge
# with the float graph, effectively fine-tuning the model.
tf.contrib.quantize.create_training_graph(quant_delay=2000000)

# Call backward pass optimizer as usual.
optimizer.minimize(loss)
```

The rewritten eval graph is non-trivially different from the training graph since the quantization ops affect the batch normalization step. Because of this, we’ve added a separate rewrite for the eval graph:

```python
# Build eval model
logits = tf.nn.softmax_cross_entropy_with_logits(...)

# Call the eval rewrite which rewrites the graph in-place with
# FakeQuantization nodes and fold batchnorm for eval.
tf.contrib.quantize.create_eval_graph()

# Save the checkpoint and eval graph proto to disk for freezing
# and providing to TFLite.
with open(eval_graph_file, 'w') as f:
  f.write(str(g.as_graph_def()))
saver = tf.train.Saver()
saver.save(sess, checkpoint_name)
```

Methods to rewrite the training and eval graphs are an active area of research and experimentation. Although rewrites and quantized training might not work or improve performance for all models, we are working to generalize these techniques.

##### 5.2 Generating fully quantized models

The previously demonstrated after-rewrite eval graph only simulates quantization. To generate real fixed point computations from a trained quantization model, convert it to a fixed point kernel. TensorFlow Lite supports this conversion from the graph resulting from create_eval_graph.

First, create a frozen graph that will be the input for the TensorFlow Lite toolchain:

```shell
bazel build tensorflow/python/tools:freeze_graph && \
  bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=eval_graph_def.pb \
  --input_checkpoint=checkpoint \
  --output_graph=frozen_eval_graph.pb --output_node_names=outputs
```

Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully quantized TensorFlow Lite model:

```shell
bazel build tensorflow/contrib/lite/toco:toco && \
  ./bazel-bin/tensorflow/contrib/lite/toco/toco \
  --input_file=frozen_eval_graph.pb \
  --output_file=tflite_model.tflite \
  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
  --inference_type=QUANTIZED_UINT8 \
  --input_shape="1,224,224,3" \
  --input_array=input \
  --output_array=outputs \
  --std_value=127.5 --mean_value=127.5
```
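
The `--mean_value` and `--std_value` flags describe how the uint8 input relates to the real values the model was trained on. Assuming the relation given in the TOCO command-line documentation, `real_value = (quantized_value - mean_value) / std_value`, the settings above map 0..255 onto [-1, 1]. The helper names in this sketch are hypothetical.

```python
def input_to_real(q, mean_value=127.5, std_value=127.5):
    """Real value represented by a uint8 input value q."""
    return (q - mean_value) / std_value


def real_to_input(r, mean_value=127.5, std_value=127.5):
    """uint8 input value closest to the real value r, clamped to 0..255."""
    return max(0, min(255, round(r * std_value + mean_value)))


print(input_to_real(0), input_to_real(255))
print(real_to_input(-1.0), real_to_input(1.0))
```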

See the documentation for tf.contrib.quantize and TensorFlow Lite.

#### 6. quantize.cc

```cpp
const MinMax& GetOrComputeMinMax(Model* model, const string& array_name) {
  auto& array = model->GetArray(array_name);
  // Normally we should have a MinMax recorded on this Array,
  // so we just use it.
  if (array.minmax != nullptr) {
    return *array.minmax;
  }

  // We don't have a MinMax. That's bad news: we need
  // the graph to provide MinMax info for all arrays in order
  // for inference to reproduce faithfully the same quantization
  // error as the training process had.
  //
  // But we still want to support a fallback for constant arrays,
  // just using the plain min and max computed from array elements.
  // We should hopefully never rely on that in production, as that
  // will not give very good accuracy as that typically won't be
  // exactly what the training process used. But it will be useful
  // to allow easily trying out quantization even if the graph
  // lacks some minmax information.
  if (array.buffer != nullptr) {
    LOG(WARNING)
        << "Constant array " << array_name
        << " lacks MinMax information. To make up for that, we will now compute"
        << " the MinMax from actual array elements. That will result in"
        << " quantization parameters that probably do not match whichever "
           "arithmetic"
        << " was used during training, and thus will probably be a cause of "
           "poor"
        << " inference accuracy.";
    CHECK(array.buffer->type == ArrayDataType::kFloat);
    const auto& data = array.GetBuffer<ArrayDataType::kFloat>().data;
    // We always want [min, max] to contain 0.
    float min = 0.f;
    float max = 0.f;
    for (auto val : data) {
      min = std::min(min, val);
      max = std::max(max, val);
    }
    if (min == 0.f && max == 0.f) {
      // Prevent downstream anger from quantized math that expects min and max
      // to not be equal.
      max = 1.f;
    }
    auto& minmax = array.GetOrCreateMinMax();
    minmax.min = min;
    minmax.max = max;
    return minmax;
  }

  LOG(FATAL) << "Array " << array_name
             << " does not have MinMax information, "
                "and is not a constant array. Cannot "
                "proceed with quantization.";
}
```
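
The fallback branch for constant arrays is small enough to restate in Python. This is a direct transcription of the loop above under a hypothetical name, `compute_min_max`, and it shows the two guarantees the C++ code provides: the range always contains 0, and min and max are never equal.

```python
def compute_min_max(data):
    """Fallback [min, max] for a constant float array, as in the C++ above."""
    lo, hi = 0.0, 0.0  # starting at 0 forces the range to contain 0
    for value in data:
        lo = min(lo, value)
        hi = max(hi, value)
    if lo == 0.0 and hi == 0.0:
        hi = 1.0  # avoid a zero-width range
    return lo, hi


print(compute_min_max([0.5, 2.0]))
print(compute_min_max([-3.0, -1.0]))
print(compute_min_max([0.0, 0.0]))
```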