TensorFlow model quantization example

1. Overview

  Model quantization is probably the easiest model compression technique to implement, and it is essentially the route you have to take when deploying models on mobile devices. It comes in two basic flavors: post-training quantization and quantization-aware training. Both PyTorch and TensorFlow provide corresponding interfaces.

  The most common min-max quantization can be summarized by the formula:

    $r = S (q - Z)$

  In the formula above, q is the quantized value, r is the original floating-point value, S is a floating-point scaling factor, and Z is the quantized value (of the same type as q) that represents r = 0, i.e. the zero point. From:

    $\frac{q - q_{min}}{q_{max} - q_{min}} = \frac{r - r_{min}}{r_{max} - r_{min}}$

  we can derive S and Z:

    $S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$

    $Z = q_{min} - \frac{r_{min}}{S}$
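  As a quick sanity check, these formulas can be written out directly. Below is a minimal sketch in plain NumPy, where the float range [-1.0, 2.0] and the uint8 range [0, 255] are illustrative assumptions rather than values taken from the experiments below.

import numpy as np

# illustrative ranges only: a float tensor spanning [-1.0, 2.0] quantized to uint8
r_min, r_max = -1.0, 2.0
q_min, q_max = 0, 255

S = (r_max - r_min) / (q_max - q_min)   # scale
Z = int(round(q_min - r_min / S))       # zero point: the q that represents r = 0

def quantize(r):
    # q = r / S + Z, clipped to the uint8 range
    q = np.round(r / S + Z)
    return np.clip(q, q_min, q_max).astype(np.uint8)

def dequantize(q):
    # r = S * (q - Z)
    return S * (q.astype(np.float32) - Z)

r = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
q = quantize(r)
print(q)               # e.g. [  0  85 128 255]
print(dequantize(q))   # approximately recovers the original floats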

2. Experiments

  The experiments test both quantization approaches on a LeNet model in TensorFlow; the code is on GitHub: https://github.com/jiangxinyang227/model_quantization .

  Post-training quantization

  Post-training quantization is particularly simple to implement in TensorFlow. Here we use a model saved in the SavedModel format as the input for quantization and convert it to TFLite; we call this the v1 version.

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
converter.optimizations = ["DEFAULT"]
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

  However, in practice the TFLite model produced this way was not reduced to 1/4 of the original size, which is strange; I have not yet determined the cause. On top of this we add one more line of code and call this version v2:

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
converter.optimizations = ["DEFAULT"]     # used when saving both the v1 and v2 versions
converter.post_training_quantize = True   # added when saving the v2 version
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

  The size of this model is indeed reduced to 1/4 of the original.

  We also convert the model directly to TFLite without any quantization; this version is called v3:

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

  Obviously, converting directly to TFLite does not compress the model at all. Now let's look at inference speed; the inference code is also on GitHub, and the results are as follows:

  [Results table: model size and inference speed for the original checkpoint and the v1/v2/v3 TFLite models]

  In the results above, "checkpoint" means loading the checkpoint directly and running inference on the CPU. Only the v2 model is compressed to 1/4 of the original size, yet its inference speed is worse than the v1 and v3 versions, while all of the TFLite models are significantly faster than the checkpoint. I guess the reasons may be:

    1) The TFLite interpreter itself speeds up inference on TFLite models.

    2) Although the model is quantized, the benefit is limited: post-training quantization essentially converts the int values back to float for the actual computation, so the intermediate quantize and dequantize operations inevitably take some time (see the rough sketch below).
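
  To make this concrete, here is a rough NumPy illustration of the idea (my own sketch, not the actual TFLite kernels): with weight-only post-training quantization the weights are stored as int8, but they are dequantized back to float before the float matmul, so the conversion work is still paid at inference time.

import numpy as np

# illustrative weight-only quantization of a dense layer (784 -> 10)
w_float = np.random.randn(784, 10).astype(np.float32)

# store the weights as int8 plus (scale, zero_point), using the same
# min-max formulas as above with the int8 range [-128, 127]
w_min, w_max = float(w_float.min()), float(w_float.max())
scale = (w_max - w_min) / 255.0
zero_point = round(-128 - w_min / scale)
w_int8 = np.clip(np.round(w_float / scale + zero_point), -128, 127).astype(np.int8)

def dense_with_quantized_weights(x):
    # dequantize first, then do the usual float matmul:
    # the quantize/dequantize steps are extra work on top of the float compute
    w_dequant = scale * (w_int8.astype(np.float32) - zero_point)
    return x @ w_dequant

x = np.random.rand(1, 784).astype(np.float32)
print(dense_with_quantized_weights(x).shape)   # (1, 10)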

  Quantization-aware training

  Introducing quantization into training is considerably more involved. During training, tf.contrib.quantize.create_training_graph() has to be inserted after the loss is computed and before the optimizer is defined, as follows:

self.loss = slim.losses.softmax_cross_entropy(self.train_digits, self.input_labels)

# get the current default graph for the subsequent quantization rewrite
self.g = tf.get_default_graph()

if self.is_train:
    # after the loss and before the optimizer: automatically selects the
    # appropriate weights and activations and inserts fake-quantization ops
    tf.contrib.quantize.create_training_graph(self.g, 80000)
    self.lr = cfg.LEARNING_RATE
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

  After training, the model is saved as a checkpoint file that contains the fake-quantization information. The variables inside are still of float type; we need to convert it into a model file that contains only int values. The specific steps are as follows:

  1. Freeze the model into a pb file, using tf.contrib.quantize.create_eval_graph() to convert it into inference mode

import tensorflow as tf
from tensorflow.python.framework import graph_util

# Lenet and cfg come from the project code on GitHub
with tf.Session() as sess:
    le_net = Lenet(False)
    # do not import the training graph here; rebuild the graph by instantiating
    # Lenet again, then restore the trained parameters into this new graph
    saver = tf.train.Saver()
    saver.restore(sess, cfg.PARAMETER_FILE)

    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ['predictions'])
    tf.io.write_graph(
        frozen_graph_def,
        "pb_model",
        "freeze_eval_graph.pb",
        as_text=False)

  Note the comment above: the graph used by the saver must not be imported from the training graph with something like tf.train.import_meta_graph. Instead, a new inference graph is built by instantiating the Lenet class again, and the trained variables are then restored into this freshly built graph.
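
  For reference, here is a minimal end-to-end sketch of this step, with a toy one-layer network standing in for the project's Lenet class (the layer, tensor names, and checkpoint path are illustrative assumptions; in the repo the create_eval_graph() call presumably lives inside the Lenet class when it is built in inference mode with Lenet(False)):

import tensorflow as tf
from tensorflow.python.framework import graph_util

# toy stand-in for the Lenet inference graph (illustrative only)
g = tf.Graph()
with g.as_default():
    inputs = tf.placeholder(tf.float32, [1, 784], name="inputs")
    logits = tf.layers.dense(inputs, 10)
    predictions = tf.argmax(logits, axis=1, name="predictions")

    # rewrite the freshly built inference graph so that it matches the
    # fake-quantized training graph; call this before creating the Saver
    tf.contrib.quantize.create_eval_graph(input_graph=g)

    saver = tf.train.Saver()

with tf.Session(graph=g) as sess:
    saver.restore(sess, "path/to/checkpoint")   # illustrative checkpoint path
    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, g.as_graph_def(), ["predictions"])
    tf.io.write_graph(frozen_graph_def, "pb_model", "freeze_eval_graph.pb",
                      as_text=False)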

  2. Convert to a TFLite file

import tensorflow as tf

path_to_frozen_graphdef_pb = 'pb_model/freeze_eval_graph.pb'
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(path_to_frozen_graphdef_pb,
                                                              ["inputs"],
                                                              ["predictions"])

converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {"inputs": (0., 1.)}
converter.allow_custom_ops = True
converter.default_ranges_stats = (0, 255)
converter.post_training_quantize = True
tflite_model = converter.convert()
open("tflite_model/eval_graph.tflite", "wb").write(tflite_model)

  Note that:

  1) ["inputs"] and ["predictions"] are the input and output nodes of the frozen pb.

  2) quantized_input_stats specifies the mean and standard deviation of the input. According to the TensorFlow Lite documentation, mean is the integer value between 0 and 255 that maps to the floating-point value 0.0f, and std_dev = 255 / (float_max - float_min). Here, however, I found that simply using 0 and 1 also gives good results (see the small example after these notes).

  3) default_ranges_stats is the range of the quantized values, where 255 is 2^8 - 1.
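
  As a small worked example of the documented formula (not the (0., 1.) values actually used above), assuming the MNIST pixel range [0.0, 1.0] as the float range:

# documented formula for quantized_input_stats with an assumed float range [0.0, 1.0]
float_min, float_max = 0.0, 1.0

std_dev = 255 / (float_max - float_min)   # 255.0
mean = -float_min * std_dev               # 0.0: the uint8 value that maps to 0.0f

# with these stats, a uint8 input q is interpreted as (q - mean) / std_dev,
# so 0 -> 0.0 and 255 -> 1.0
print(mean, std_dev)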

  3. Run inference with the TFLite model

import time
import tensorflow as tf
import numpy as np
import tensorflow.examples.tutorials.mnist.input_data as input_data


mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
labels = [label.index(1) for label in mnist.test.labels.tolist()]
images = mnist.test.images

"""
for prediction, the input needs to be normalized to a standard normal distribution
"""
means = np.mean(images, axis=1).reshape([10000, 1])
std = np.std(images, axis=1, ddof=1).reshape([10000, 1])
images = (images - means) / std
"""
the input values must be converted to uint8
"""
images = np.array(images, dtype="uint8")

interpreter = tf.contrib.lite.Interpreter(model_path="tflite_model/eval_graph.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

start_time = time.time()


predictions = []
for image in images:
    interpreter.set_tensor(input_details[0]['index'], [image])
    interpreter.invoke()
    score = interpreter.get_tensor(output_details[0]['index'])[0][0]
    predictions.append(score)

correct = 0
for prediction, label in zip(predictions, labels):
    if prediction == label:
        correct += 1
end_time = time.time()
print((end_time - start_time) / len(labels) * 1000)   # average inference time per image, in ms
print(correct / len(labels))                          # accuracy

  Two more things to note:

  1) The input is normalized to a standard normal distribution; I believe this should be kept consistent with the quantized_input_stats set earlier.

  2) The input type must be converted to uint8, otherwise an error is raised (see the check below).
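
  A quick way to confirm what the converted model actually expects, reusing the interpreter and input_details from the code above:

# the input details of the TFLite model report the expected dtype, shape and
# the (scale, zero_point) quantization parameters
print(input_details[0]['dtype'])          # e.g. <class 'numpy.uint8'>
print(input_details[0]['shape'])          # e.g. [  1 784]
print(input_details[0]['quantization'])   # (scale, zero_point)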

  4. Performance comparison

  [Results table: model size, accuracy, and inference speed of the quantization-aware-training model compared with the original checkpoint and the post-training-quantized models]

  The model size is reduced to 1/4 of the original, as expected; accuracy drops by about 2%, which is acceptable; and inference is roughly 3 times faster.

  Comparing further with the post-training quantization results above: the model is the same size as v2, its accuracy is about 2% lower than v2, and its inference speed differs from v2 by about 0.02 ms. Personally I think the reasons may be as follows:

  1) First, LeNet is probably a sufficiently large model for a dataset like MNIST, so post-training quantization loses almost no accuracy; as a result, quantization-aware training shows no advantage here and even comes out slightly behind.

  2) The inference speed of the quantization-aware-training model is faster than v2 (note: this is not a fluke; I ran the test many times and the inference time basically stabilized, with an average difference of about 0.02 ms), but the speedup is small, and it is still slower than v1 and v3. In a convolutional network the computation cost is dominated by the convolutions, and the convolutions here are small, so quantization has little visible effect on speed; in addition, the quantization ops themselves introduce some overhead, and v2 also carries dequantization ops, which makes it a bit more time-consuming. Finally, the hardware may simply lack special support for int8 computation.

  In summary, the above is just a walk-through of the complete TensorFlow quantization workflow. Because the chosen network is quite simple, the gap is not as obvious as it would be on models such as Inception-v3 or MobileNet. What is clear is that TFLite really does speed up inference.


Original post: www.cnblogs.com/jiangxinyang/p/12056209.html