Deep learning model compression and accelerated model inference

Introduction

When deploying a machine learning model into a production environment, there are often requirements to meet that were not considered during the prototype stage. For example, a model used in production will have to handle a large number of requests from different users, so you will want to optimize for lower latency and/or higher throughput.

  • Latency: the time it takes for a single task to complete, like the time it takes to load a web page after clicking a link. It is the waiting time between starting a request and seeing its result.

  • Throughput: the number of requests the system can handle within a given period of time (a quick way to measure both is sketched below).
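To make these two notions concrete, here is a minimal sketch of how one might time a Keras model and derive both numbers; the tiny model and the batch size are made up purely for illustration.

import time
import numpy as np
import tensorflow as tf

# A tiny throwaway model, just to have something to time
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

batch = np.random.rand(64, 28, 28).astype(np.float32)

start = time.perf_counter()
model.predict(batch, verbose=0)
elapsed = time.perf_counter() - start

print(f"Latency for one batch of 64: {elapsed * 1000:.1f} ms")
print(f"Throughput: {len(batch) / elapsed:.0f} samples/second")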

This means that machine learning models must be very fast at making predictions. There are various techniques for increasing inference speed, and this article introduces some of the most important ones.

Model compression

Some techniques aim to make models smaller and are therefore called model compression techniques, while others focus on making models faster during inference and thus fall under the domain of model optimization. In practice, making a model smaller often also speeds up inference, so the line between these two areas of research is very blurry.

Low-rank decomposition

This is the first method we will look at, and it is being studied extensively: many papers on it have been published recently.

The basic idea is to replace the matrices of a neural network (the matrices representing the weights of its layers) with lower-rank approximations (although a more correct term is tensor, since we often deal with arrays of more than two dimensions). In this way we reduce the number of network parameters and therefore increase inference speed.

A trivial example is replacing 3x3 convolutions with 1x1 convolutions in a CNN. This technique is used in architectures such as SqueezeNet.
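To make the idea concrete, here is a minimal sketch, not tied to any particular network, of approximating a single weight matrix with a truncated SVD; the matrix size and the rank r are arbitrary choices for illustration.

import numpy as np

# A hypothetical 512x512 weight matrix of a fully connected layer
W = np.random.randn(512, 512).astype(np.float32)

# Truncated SVD: keep only the top-r singular components
r = 32
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # shape (512, r)
B = Vt[:r, :]          # shape (r, 512)

# The layer y = x @ W is approximated by two smaller layers: y ≈ (x @ A) @ B
print("original parameters:", W.size)            # 262144
print("low-rank parameters:", A.size + B.size)   # 32768

In a real network, the factorized layers are usually fine-tuned afterwards to recover the accuracy lost by the approximation.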

Recently, similar ideas have been applied for other purposes, such as fine-tuning large language models with limited resources. When adapting a pre-trained model to a downstream task with full fine-tuning, all of its parameters still need to be updated, which can be very expensive.

Therefore, the idea of the method called "Low-Rank Adaptation of Large Language Models" (or LoRA) is to keep the original weights frozen and represent their updates as the product of two much smaller low-rank matrices. This way, only these new matrices need to be trained to adapt the pre-trained model to different downstream tasks.

[Figure: Matrix factorization in LoRA]

Now, let’s see how to fine-tune a model with LoRA using Hugging Face’s PEFT library. Suppose we want to fine-tune bigscience/mt0-large with LoRA. First, we need to install and import what we need.

!pip install peft
!pip install transformers

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

The next step is to create the configuration that LoRA will use during fine-tuning.

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

We then instantiate the model using the base model from the Transformers library and the configuration object we created for LoRA.

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
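If everything is set up correctly, print_trainable_parameters() should report that only a tiny fraction of the model’s parameters (well under 1% with r=8 on a model of this size) are actually trainable, which is exactly what makes LoRA fine-tuning so much cheaper than full fine-tuning.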

Knowledge distillation

This is another approach that allows us to put "small" models into production. The idea is to have a large model called the teacher and a smaller model called the student, and to use the teacher's knowledge to teach the student how to make predictions. This way, we only need to deploy the student into the production environment.


A classic example of this approach is DistilBERT, which is a student model of BERT. DistilBERT is 40% smaller than BERT, retains 97% of its language understanding capabilities, and is 60% faster at inference. One drawback of this approach is that you still need a large teacher model to train the student, and you may not have the resources to train a teacher of that size yourself.

Let’s look at a simple example of how to do knowledge distillation in Python. A key concept to understand is the Kullback–Leibler divergence, a mathematical measure of the difference between two probability distributions. In our case, we want to measure the difference between the predictions of the two models, so the training loss will be based on this concept.
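As a quick illustration (with made-up numbers), the KL divergence between two small probability distributions can be computed directly with Keras:

import tensorflow as tf

p = tf.constant([[0.7, 0.2, 0.1]])  # e.g. the teacher's predicted distribution
q = tf.constant([[0.5, 0.3, 0.2]])  # e.g. the student's predicted distribution

# KL(p || q): 0 when the distributions match, larger the more they differ
print(tf.keras.losses.KLDivergence()(p, q).numpy())

With that in place, here is the full distillation example on MNIST.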

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np


# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)


# Define the teacher model (a larger model)
teacher_model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])


teacher_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])


# Train the teacher model
teacher_model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)


# Define the student model (a smaller model)
student_model = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])


student_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])


# Knowledge distillation step: transfer knowledge from the teacher to the student.
# The student is trained to match the teacher's predicted probabilities ("soft targets"),
# and the loss is the KL divergence between the two (softened) distributions.
teacher_predictions = teacher_model.predict(train_images, batch_size=64)


def distillation_loss(y_teacher, y_student):
    temperature = 2.0  # softens both distributions; adjust as needed
    # The models output probabilities, so go back to log space before applying the temperature
    soft_teacher = tf.nn.softmax(tf.math.log(y_teacher + 1e-8) / temperature, axis=1)
    soft_student = tf.nn.softmax(tf.math.log(y_student + 1e-8) / temperature, axis=1)
    return tf.keras.losses.KLDivergence()(soft_teacher, soft_student)


# Re-compile the student with the distillation loss and train it on the teacher's soft targets
# (Keras fit() does not accept a loss argument, so the loss must be set at compile time)
student_model.compile(optimizer='adam', loss=distillation_loss, metrics=['accuracy'])
student_model.fit(train_images, teacher_predictions, epochs=10, batch_size=64,
                  validation_split=0.2)


# Evaluate the student model
test_loss, test_acc = student_model.evaluate(test_images, test_labels)
print(f'Test accuracy: {test_acc * 100:.2f}%')

Pruning

Pruning is a model compression method that I studied in my graduate thesis; in fact, I have previously published an article on how to implement it in Julia: Iterative Pruning Method for Artificial Neural Networks in Julia.

Pruning was originally introduced to address overfitting in decision trees, where it reduces the depth of the tree by cutting branches. The concept was later applied to neural networks, where edges and/or nodes are removed (depending on whether unstructured or structured pruning is performed).


If you remove an entire node from the network (structured pruning), the matrix representing the layer becomes smaller, so the model becomes smaller and therefore faster. If instead we remove a single edge (unstructured pruning), the size of the matrix stays the same, but a zero is placed at the position of the removed edge, producing a very sparse matrix. In unstructured pruning, therefore, the advantage is not so much speed as memory, since a sparse matrix can be stored more compactly than a dense one.

But which nodes or edges should we prune? Usually the least important ones, for example those with the smallest weight magnitude or the smallest effect on the loss. Two papers worth studying on this question are "Optimal Brain Damage" and "Optimal Brain Surgeon and general network pruning".
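As a minimal sketch of the unstructured case, here is magnitude pruning applied by hand to a single (randomly generated) weight matrix: the half of the weights with the smallest absolute value are simply set to zero.

import numpy as np

# A hypothetical weight matrix
W = np.random.randn(128, 64).astype(np.float32)

# Zero out the 50% of weights with the smallest absolute value
threshold = np.percentile(np.abs(W), 50)
mask = (np.abs(W) >= threshold).astype(np.float32)
W_pruned = W * mask

print("sparsity:", 1.0 - mask.mean())  # roughly 0.5

Libraries such as TensorFlow Model Optimization automate this masking and gradually increase the sparsity during training.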

Let’s look at a Python script on how to implement pruning in a simple MNIST model.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow_model_optimization.sparsity import keras as sparsity
import numpy as np


# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)


# Create a simple neural network model
def create_model():
    model = Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model


# Create and compile the original model
model = create_model()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


# Train the original model
model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)


# Prune the model
# Specify the pruning parameters
pruning_params = {
    'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
                                                 final_sparsity=0.90,
                                                 begin_step=0,
                                                 end_step=2000,
                                                 frequency=100)
}


# Create a pruned model
pruned_model = sparsity.prune_low_magnitude(create_model(), **pruning_params)


# Compile the pruned model
pruned_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])


# Train the pruned model (fine-tuning); the UpdatePruningStep callback is required
# so that the pruning masks are updated during training
pruned_model.fit(train_images, train_labels, epochs=2, batch_size=64, validation_split=0.2,
                 callbacks=[sparsity.UpdatePruningStep()])


# Strip the pruning wrappers; the weights remain sparse, but the model can be used like any Keras model
final_model = sparsity.strip_pruning(pruned_model)


# Re-compile and evaluate the final pruned model
final_model.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
test_loss, test_acc = final_model.evaluate(test_images, test_labels)
print(f'Test accuracy after pruning: {test_acc * 100:.2f}%')

Quantization

I don't think it's wrong to say that quantization is probably the most widely used compression technique out there. Again, the basic idea is simple. Usually, we use 32-bit floating point numbers to represent the parameters of neural networks. But what if we use lower precision values? We can use 16 bits, 8 bits, 4 bits, or even 1 bit and have a binary network!

What does that mean? By using lower-precision numbers, the model becomes lighter and smaller, but it also loses precision, giving more approximate results than the original model. This technique is often used when deploying on edge devices such as smartphones, because it significantly reduces the size of the network. Many frameworks make it easy to apply quantization, such as TensorFlow Lite, PyTorch, or TensorRT.
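As a rough illustration of what 8-bit quantization does to a tensor (the numbers here are made up), each float is mapped to an integer through a scale and a zero point:

import numpy as np

# A few float32 values, e.g. weights of a layer
x = np.array([-1.2, -0.3, 0.0, 0.7, 2.5], dtype=np.float32)

# Affine quantization to int8: x ≈ (q - zero_point) * scale
scale = (x.max() - x.min()) / 255.0
zero_point = np.round(-x.min() / scale) - 128

q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
x_dequantized = (q.astype(np.float32) - zero_point) * scale

print(q)               # the int8 representation
print(x_dequantized)   # an approximation of the original values

In practice you never do this by hand; frameworks such as TensorFlow Lite pick the scale and zero point for you.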

Quantization can be applied during training (quantization-aware training, where the network is trained knowing its parameters will be constrained to a limited set of values) or after training (post-training quantization, where the trained parameter values are rounded to lower precision). Let's take another quick look at how to apply quantization in Python.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np


# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Preprocess the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)


# Create a simple neural network model
def create_model():
    model = Sequential([
        Flatten(input_shape=(28, 28, 1)),
        Dense(128, activation='relu'),
        Dropout(0.2),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(10, activation='softmax')
    ])
    return model


# Create and compile the original model
model = create_model()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


# Train the original model
model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2)


# Post-training quantization: convert the model to TensorFlow Lite with 8-bit weight quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()


# Save the quantized model to a file
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_model)


# Load the quantized model for inference
interpreter = tf.lite.Interpreter(model_path='quantized_model.tflite')
interpreter.allocate_tensors()


# Evaluate the quantized model, one image at a time
input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']
test_loss, test_acc = 0.0, 0.0
for i in range(len(test_images)):
    input_data = np.array([test_images[i]], dtype=np.float32)
    interpreter.set_tensor(input_index, input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_index)[0]
    test_loss += tf.keras.losses.categorical_crossentropy(test_labels[i], output_data).numpy()
    test_acc += np.argmax(test_labels[i]) == np.argmax(output_data)


test_loss /= len(test_images)
test_acc /= len(test_images)


print(f'Test accuracy after quantization: {test_acc * 100:.2f}%')

Conclusion

In this article, we explored several model compression methods for speeding up the inference phase, which can be a critical requirement for models in production. In particular, we covered low-rank decomposition, knowledge distillation, pruning, and quantization, explained the basic ideas, and showed simple implementations in Python. Model compression is also useful for deploying models on hardware with limited resources (RAM, GPU, etc.), such as smartphones.

