Introduction to Baidu's open source AI computing platform Paddle-Lite

Author: Zen and the Art of Computer Programming

1. Introduction

  Natural language processing (NLP), computer vision (CV), graphics, and related fields have made major breakthroughs through continuous innovation. With hardware advancing just as quickly, many well-known companies at home and abroad are investing in AI chip research and development. In recent years, companies such as Google, Microsoft, Apple, and Amazon have launched self-developed AI chip products, including Google's Coral Edge TPU, Apple's Neural Engine, and Amazon's Inferentia. Recently, Baidu announced the launch of the PaddlePaddle Foundation, which aims to create a unified AI ecosystem through open, shared, and collaborative innovation covering AI chips, computing-power platforms, training technology, application tools, service platforms, and more. This article introduces Paddle-Lite, Baidu's open source end-side AI inference engine built on the PaddlePaddle open source framework. It is Baidu's first open source end-side computing platform for AI workloads such as vision, audio, natural language, and reinforcement learning, enabling rapid deployment and efficient execution of deep learning models.

2. Background introduction

  With the rapid development of AI technology, organizations in every industry are deploying AI to solve practical problems. Yu Kailong, founder of Baidu's natural language processing department and one of the company's AI pioneers, once described his view of AI: "AI is like a subconscious for machines, the ability to think for themselves, learn for themselves, and adapt to their environment. Its emergence allows us to process massive data more efficiently and more cheaply, and in some scenarios it can take over work that used to be done by people." Although AI has enormous potential, putting it into practice still requires continuous cultivation and refinement to improve model quality. Therefore, how to deploy AI models quickly and shorten the time from training to launch has become a focus for every enterprise.

  Huawei Technologies Co., Ltd. has long been committed to accelerating the development of the artificial intelligence industry. Adhering to a spirit of "foresight, innovation, and investment", it established the Huawei Open Source Software Center, a leading open source organization dedicated to open-sourcing AI-related technologies. It currently maintains more than ten open source projects covering data processing, AI algorithms, machine learning frameworks, tooling, and other directions. In its AI research, Huawei has upheld the values of openness and sharing and actively participates in building open source communities. Going forward, Huawei's Open Source Software Center plans to help enterprises with industrial development and promote the rapid development of AI technology through AI communities, technical exchanges, and technology transfer. In addition, Baidu maintains close cooperative relationships with Huawei and other major manufacturers to jointly promote innovation and progress in AI theory, technology, and industry.

  Baidu's self-developed AI chip effort is led mainly by the Paddle-Lite team, which uses the PaddlePaddle open source framework for model conversion, optimization, and deployment. The project is open source and available for developers to use. Paddle-Lite is a lightweight, flexible, and easily extensible AI computing library: a set of hardware-accelerated inference libraries optimized for ARM CPUs and other mobile and embedded hardware and tightly integrated with the PaddlePaddle framework. Compared with commercial alternatives, Paddle-Lite offers significant advantages in functionality, performance, and power consumption. It can be used in embedded devices, mobile terminals, servers, and other scenarios to help customers move quickly from massive data collection to real-time applications.

3. Explanation of core concepts and terminology

3.1 PaddlePaddle

  PaddlePaddle is Baidu's open source AI computing platform. It is a modular deep learning framework that supports a variety of hardware and holds a leading position in the deep learning field. It supports model training and prediction, multiple programming languages such as Python, C++, and Java, distributed multi-card training, and hyperparameter search. It offers rich pre-trained models, a strong ecosystem, and complete documentation, examples, and tutorials, and is used by a growing number of companies, institutions, and individuals.
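
As a quick illustration, here is a minimal sketch (assuming a PaddlePaddle 2.x installation) that creates a tensor and passes it through a single linear layer:

import paddle
import paddle.nn as nn

# create a 2x4 input tensor filled with random values
x = paddle.randn([2, 4])

# a single fully connected layer mapping 4 features to 3 outputs
linear = nn.Linear(4, 3)

# forward pass; y has shape [2, 3]
y = linear(x)
print(y.shape)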

3.2 ARM CPU architecture

  ARM is a family of CPU architectures whose profiles include Cortex-A (application processors), Cortex-R (real-time processors), and Cortex-M (microcontrollers). These cores are commonly paired with accelerators for image processing, video processing, machine learning, and other tasks, and offer a good balance of computing power, power efficiency, and reliability. Currently, Baidu's AI chip described here uses the ARM Cortex-M7 core.

3.3 FPGA chip

  FPGA (Field Programmable Gate Array) is a programmable digital integrated circuit based on a gate-array structure; it is highly integrated, flexible, and reprogrammable after manufacturing, which makes it well suited to accelerating AI workloads. The accelerator named in connection with Baidu's AI chip here is the MYRIAD X.

3.4 MicroTVM

  MicroTVM is the microcontroller-oriented extension of the Apache TVM deep learning compiler stack. It helps developers compile, automatically tune, and run lightweight, customized models on edge devices, reducing model running time and improving inference performance. In Baidu's AI chip toolchain, MicroTVM is used to optimize models automatically, often reaching results that are difficult to obtain with manual tuning.

4. Explanation of core algorithm principles, specific operating steps and mathematical formulas

  PaddlePaddle is an open source framework, and Paddle-Lite is its open source end-side AI inference engine. The PaddlePaddle framework is introduced in detail below.

PaddlePaddle Overview

  First, an overview of PaddlePaddle. PaddlePaddle is a modular deep learning framework that supports multiple kinds of hardware, performs model training and prediction, supports multiple programming languages such as Python, C++, and Java, and supports distributed multi-card training and hyperparameter search. Its characteristics are as follows:

  1. Modularity: The PaddlePaddle framework is a modular deep learning framework. Different types of networks can be assembled together according to needs. Users only need to focus on the components they want.

  2. Supports a variety of hardware: The PaddlePaddle framework supports a variety of hardware, including CPUs, GPUs, FPGAs, and other accelerators. Based on the installed hardware resources, it can automatically select a suitable kernel configuration and schedule the model accordingly (see the device-selection snippet after this list).

  3. Deep learning framework: Based on its design concept and technology, the PaddlePaddle framework can express very complex neural network models and, through flexible parameter combinations, achieve high accuracy.

  4. Scalability: The PaddlePaddle framework can be easily expanded, including custom operators, custom layers, custom data reading methods, custom optimization algorithms, etc.

  5. Documentation and tutorials: The PaddlePaddle framework provides detailed documentation and examples, as well as a large number of tutorials to help developers quickly master its API and usage.
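
To make points 1 and 2 concrete, here is a minimal sketch (again assuming PaddlePaddle 2.x) that selects an execution device and assembles a small network from standard building blocks:

import paddle
import paddle.nn as nn

# pick the execution device; use "gpu:0" instead if a CUDA build and GPU are available
paddle.device.set_device("cpu")
print(paddle.device.get_device())

# assemble a network from standard modules
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# run a dummy batch of 28x28 "images" through the network
x = paddle.randn([8, 1, 28, 28])
print(net(x).shape)  # [8, 10]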

How to use the PaddlePaddle framework

Install

  First, download the latest version of the PaddlePaddle framework and install it as needed. We recommend installing it with pip:

pip install paddlepaddle

PaddlePaddle can be used in two ways. The first is to import PaddlePaddle directly in your code; the second is through a command line tool such as PaddleRec.
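
For the first approach, you can verify the installation and the import directly from Python:

import paddle

# print the installed version and run PaddlePaddle's built-in installation check
print(paddle.__version__)
paddle.utils.run_check()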

Command line tool PaddleRec

  PaddleRec is Baidu's open source high-performance, flexible model training and evaluation tool. It helps developers quickly implement model training, evaluation, and deployment. With PaddleRec, users only need to specify the model type, dataset path, hyperparameters, number of training epochs, and other information to start the training and evaluation process. Its features are as follows:

  1. High performance: PaddleRec can make full use of cluster resources, applying strategies such as hyperparameter search, parallel training, and efficient data loading to speed up the model training process.

  2. Flexible interfaces: PaddleRec provides a wealth of model interfaces, including the native API of the deep learning framework and higher-level model interfaces such as PaddleX and PaddleDetection.

  3. Complete ecosystem: PaddleRec provides a complete ecosystem, including a model library, a dataset library, and an evaluation metric library, and offers one-stop services for model development, debugging, and deployment.

Usage

  Below, we take a classification task as an example to show how to use the PaddlePaddle framework. Suppose we want to train an image classification model. First, prepare the data set. Then write a configuration file config.yaml with the following content:

runner:
  train_data_dir: /path/to/train_dataset
  valid_data_dir: /path/to/valid_dataset
  batch_size: 64
model:
  class: ResNet50
  num_classes: 10
  image_shape: [3, 224, 224]
optimizer:
  class: Adam
  learning_rate: 0.001
total_epochs: 120

Here, runner holds the run configuration of the training job; model describes the model structure; optimizer configures the optimizer; and total_epochs is the total number of training epochs.

Next, write the training script train.py. The sketch below uses PaddlePaddle's high-level Model API, reads config.yaml with PyYAML, and uses the CIFAR-10 dataset as a stand-in for the dataset directories in the configuration:

import paddle
import paddle.vision.transforms as T
import yaml
from paddle.vision.datasets import Cifar10
from paddle.vision.models import resnet50


def main():
    # read the training configuration from config.yaml
    with open("config.yaml") as f:
        config = yaml.safe_load(f)

    # image preprocessing: resize to 224x224, convert to tensor, normalize
    transform = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    # CIFAR-10 is used here as a stand-in for the dataset directories in config.yaml
    train_ds = Cifar10(mode="train", transform=transform)
    valid_ds = Cifar10(mode="test", transform=transform)

    # build the ResNet50 classifier described in the model section of the config
    network = resnet50(num_classes=config["model"]["num_classes"])
    model = paddle.Model(network)

    # optimizer, loss function, and evaluation metric
    optimizer = paddle.optimizer.Adam(
        learning_rate=config["optimizer"]["learning_rate"],
        parameters=model.parameters())
    model.prepare(optimizer, paddle.nn.CrossEntropyLoss(), paddle.metric.Accuracy())

    # train and evaluate
    model.fit(train_ds,
              valid_ds,
              epochs=config["total_epochs"],
              batch_size=config["runner"]["batch_size"],
              save_dir="output",
              verbose=1)


if __name__ == "__main__":
    main()

Here, paddle.Model wraps the network in a high-level training interface; prepare configures the optimizer, the loss function, and the accuracy metric; and fit runs training and evaluation while saving checkpoints to the output directory.

Finally, execute the command python train.py to start training the model.

Paddle-Lite Overview

  Paddle-Lite is Baidu's open source end-side AI inference engine. It supports a variety of hardware and can quickly deploy AI models built with deep learning frameworks. Its core components include the optimizer, the runtime, and the model loader. The composition of Paddle-Lite is introduced below.

Optimizer

  The optimizer is the component of Paddle-Lite responsible for model optimization. In the optimization phase, Paddle-Lite first analyzes the model structure and then automatically performs operations such as operator scheduling, merging, and fusion to obtain an efficient computation graph. Its features include automatic scheduling, automatic optimization, automatic parallelism, and automatic mixed precision.
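
In practice this optimization step is usually run offline with Paddle-Lite's opt tool, which converts a saved PaddlePaddle inference model into an optimized .nb file. A minimal sketch using the Python interface is shown below; the file names are placeholders, and the exact method names may vary between Paddle-Lite releases:

from paddlelite.lite import Opt

# convert and optimize a PaddlePaddle inference model for ARM targets
opt = Opt()
opt.set_model_file("inference_model/model.pdmodel")    # placeholder path
opt.set_param_file("inference_model/model.pdiparams")  # placeholder path
opt.set_valid_places("arm")          # target hardware for the optimized graph
opt.set_optimize_out("model_opt")    # produces model_opt.nb
opt.run()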

Runtime

  The runtime is the component of Paddle-Lite responsible for executing the model. In the running phase, Paddle-Lite maps the computation graph onto the corresponding hardware through driver interfaces to perform inference. Currently Paddle-Lite supports CPUs, GPUs, NPUs, and other accelerators, and supports both dynamic and static operating modes.
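
On the deployment side, the optimized .nb model can be loaded and run through the lightweight prediction API. A minimal Python sketch follows; the path is a placeholder, and the method names follow the Paddle-Lite Python demos and may differ slightly between versions:

import numpy as np
from paddlelite.lite import MobileConfig, create_paddle_predictor

# load the optimized model produced by the opt tool
config = MobileConfig()
config.set_model_from_file("model_opt.nb")  # placeholder path
predictor = create_paddle_predictor(config)

# fill the input tensor with a dummy 224x224 RGB image
input_tensor = predictor.get_input(0)
input_tensor.from_numpy(np.random.rand(1, 3, 224, 224).astype("float32"))

# run inference and read the output back as a numpy array
predictor.run()
output = predictor.get_output(0).numpy()
print(output.shape)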

Model Loader

  The model loader is the component of Paddle-Lite responsible for loading model files and performing compilation, optimization, and related operations so that the model can be executed on the device. Its features include cross-platform deployment and integrated development.

5. Specific code examples and explanations

This section walks through several typical cases to illustrate more features of the PaddlePaddle and Paddle-Lite frameworks.

Data processing and model training

Data processing

When using the PaddlePaddle framework to process image data, we can use APIs such as paddle.vision.transforms. For text data, we can use preprocessing utilities such as those provided by PaddleNLP. A sketch of both is shown below:

import paddle.vision.transforms as T
from paddle.vision.datasets import Cifar10
from paddlenlp.datasets import load_dataset

# preprocessing pipeline for image data
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

val_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# image datasets: the transform is applied as each sample is read
train_image_ds = Cifar10(mode="train", transform=train_transform)
val_image_ds = Cifar10(mode="test", transform=val_transform)

# preprocessing for text data: wrap each sentence with [CLS] / [SEP] markers
def preprocess_function(example):
    example["text"] = "[CLS]" + example["text"] + "[SEP]"
    return example

# text datasets (the ChnSentiCorp sentiment dataset from PaddleNLP is used as an example)
train_text_ds = load_dataset("chnsenticorp", splits="train").map(preprocess_function)
val_text_ds = load_dataset("chnsenticorp", splits="dev").map(preprocess_function)

Model training

When using the PaddlePaddle framework to train a classification model, we can subclass Layer (or use Sequential) to encapsulate the model structure, then set the optimizer and loss function and train the model.

import numpy as np
import paddle
import paddle.nn as nn


class SimpleNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(768, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out


net = SimpleNet()
criterion = nn.CrossEntropyLoss()
opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())

num_epoch = 10  # number of training epochs
# train_dataloader is assumed to be a paddle.io.DataLoader yielding (feature, label) batches
for epoch in range(num_epoch):
    losses = []
    for step, (x, y) in enumerate(train_dataloader):
        out = net(x)
        loss = criterion(out, y)

        loss.backward()
        opt.step()
        opt.clear_grad()
        losses.append(float(loss))

    print("Epoch {}: Loss={:.6f}".format(epoch + 1, np.mean(losses)))

Model deployment

Model compression

When deploying a model, we usually need to consider its size. To reduce model size, we can apply model compression. In the Paddle ecosystem, the two most common approaches are quantization and pruning, typically performed with PaddleSlim, with the compressed model then deployed through Paddle-Lite.

Quantization

Quantization converts floating point operations into lower-precision integer operations; its purpose is to reduce model size and speed up computation. Quantization schemes can be uniform or non-uniform. Uniform quantization divides the value range of the weights into equal-width intervals, with each interval mapped to one quantization level; non-uniform quantization uses intervals of unequal width, for example giving finer resolution where weight values are dense. A common workflow is to run post-training quantization with PaddleSlim and then convert the resulting INT8 model with Paddle-Lite's opt tool. A minimal sketch of the PaddleSlim step follows; the paths and the sample_generator are placeholders:

import paddle
from paddleslim.quant import quant_post_static

# post-training quantization needs static-graph mode and an executor
paddle.enable_static()
exe = paddle.static.Executor(paddle.CPUPlace())

# calibrate on sample data and write out an INT8 model;
# model_dir, quantize_model_path and sample_generator are placeholders
quant_post_static(
    executor=exe,
    model_dir="inference_model",
    quantize_model_path="quant_model",
    sample_generator=sample_generator,
    batch_nums=10)

Pruning

Pruning removes unimportant weights or structures from the model, judged by their learned importance, to reduce model size. Common strategies include channel pruning, feature-map pruning, and structural pruning. Channel pruning removes unimportant output channels when a convolutional layer has many of them; feature-map pruning removes unimportant feature maps from a layer's output; structural pruning removes entire unimportant layers or blocks. A minimal filter-pruning sketch using PaddleSlim's dygraph pruner follows; the network, weight names, and pruning ratios are placeholders:

import paddle
from paddleslim.dygraph import L1NormFilterPruner

# the network, the weight names and the pruning ratios below are placeholders
net = paddle.vision.models.resnet50(num_classes=10)
pruner = L1NormFilterPruner(net, [1, 3, 224, 224])

# prune 30% of the output channels of the named convolution weights
pruner.prune_vars({"conv2d_0.w_0": 0.3}, axis=0)
# after pruning, fine-tune the network and export it for deployment with Paddle-Lite

Model prediction

When deploying the model to the target device, we usually need to consider latency and memory usage. To hide latency, prediction can be made asynchronous at the application level: a prediction request is handed to a worker thread and the caller continues to accept new requests instead of blocking on the result. The C++ sketch below uses the Paddle-Lite C++ API together with OpenCV; the preprocessing is a placeholder, and the inference itself runs synchronously on a background thread:

#include <cstring>
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include <opencv2/opencv.hpp>
#include "paddle_api.h"  // Paddle-Lite C++ API

using namespace paddle::lite_api;

std::shared_ptr<PaddlePredictor> CreatePredictor(const std::string& model_buffer) {
    MobileConfig config;
    config.set_model_from_memory(model_buffer);  // optimized .nb model held in memory
    config.set_threads(1);                       // CPU threads used by one inference
    return CreatePaddlePredictor<MobileConfig>(config);
}

// Synchronous inference on one image; intended to run on a worker thread.
void RunInference(std::shared_ptr<PaddlePredictor> predictor, cv::Mat img) {
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
    img.convertTo(img, CV_32FC3, 1.0 / 255.0);   // to float32 in [0, 1]

    auto input_tensor = predictor->GetInput(0);
    input_tensor->Resize({1, 3, img.rows, img.cols});
    float* data = input_tensor->mutable_data<float>();

    // copy the HWC image into the CHW input buffer
    const int area = img.rows * img.cols;
    std::vector<cv::Mat> channels(3);
    cv::split(img, channels);
    for (int c = 0; c < 3; ++c) {
        std::memcpy(data + c * area, channels[c].data, area * sizeof(float));
    }

    predictor->Run();
    auto output_tensor = predictor->GetOutput(0);
    // do something with output_tensor->data<float>() ...
}

// Hand the prediction to a background thread so the caller is not blocked.
// Note: a single predictor must not run two inferences concurrently.
void MakeAsyncPrediction(std::shared_ptr<PaddlePredictor> predictor, const cv::Mat& img) {
    std::thread worker(RunInference, predictor, img.clone());
    worker.detach();
}
