[Deep Learning] Overview of Embedded Artificial Intelligence

1. AI embedded system

1.1 Concept

An embedded system is a computer system that is "embedded" in an application. The difference between an embedded system and a traditional PC is that it is usually equipped with a dedicated software and hardware interface for a specific application, and the requirements for computing speed, storage capacity, reliability, power consumption, and volume are significantly different from those of a general-purpose PC. We can see embedded systems everywhere in our daily life, such as smartphones, multimeters, drone control systems, telecom switches, washing machines, smart TVs, automotive control systems, medical CT equipment, etc.

1.2 Features

Generally speaking, embedded systems have the following characteristics:

  1. High reliability. For example, the embedded system controlling a telecommunications switch needs to work 24 hours a day with a reliability of 99.999% or higher;
  2. Low-latency response. For example, a vehicle anti-lock braking system must estimate the vehicle speed in real time during emergency braking, identify the tire state, and output the braking control command within a specified time;
  3. Low power consumption. For example, handheld measurement devices such as multimeters may need to run on batteries for months or even years;
  4. Small size. For example, portable devices such as mobile phones and wireless noise-cancelling headphones must fit the embedded control system into a limited volume to meet the requirements of the application scenario.

1.3 Smart Application Requirements

Traditional embedded systems are mainly used for control, that is, receiving sensor signals, analyzing them, and outputting control commands. As application requirements develop, more and more embedded systems require "artificial intelligence" and become "intelligent embedded systems". Compared with traditional "control-class" embedded systems, intelligent embedded systems are enhanced in three aspects: intelligent perception, intelligent interaction, and intelligent decision-making.

1.3.1 Intelligent Perception

Traditional embedded systems analyze and understand signals based on fixed rules, such as the signal mean and variance or their frequency-domain transforms. As applications expand, embedded systems are expected to understand more complex or changing scenes. For example, a smart camera system may need to identify whether the scene currently being shot is natural scenery, an indoor portrait, or urban buildings; an embedded system responsible for monitoring mechanical equipment must be able to recognize a variety of abnormal vibration patterns and perform fault identification and classification on them. This type of perception and recognition relies on more complex analysis and judgment models, whose parameters are usually obtained by supervised training on labeled data. In contrast, traditional hand-crafted feature selection and signal analysis algorithms have difficulty achieving such complex and variable intelligent perception.

1.3.2 Intelligent Interaction

Intelligent embedded systems require more "human-like" two-way interaction with users, such as obtaining user instructions through speech recognition and reporting execution results through speech, or judging user intentions through gesture and facial expression recognition and responding appropriately. This capability allows embedded systems to realize various "human-machine collaboration" applications. In contrast, traditional embedded systems have limited interaction methods, usually only simple buttons and displays, which restricts their application scenarios and results in low human-computer interaction efficiency.

1.3.3 Intelligent decision-making

The ability to make independent decisions is another important feature of modern intelligent embedded systems. For example, in an automatic driving system, the on-board embedded system needs to judge the current state and its trend according to vehicle speed, road obstacles, and traffic sign information, and issue driving commands within a limited time. In addition, the system needs to "adapt to the situation": when encountering an unknown state, it should be able to weigh the benefits and risks of possible actions and give an appropriate action output. Traditional embedded systems usually make decisions based on fixed and simple logic rules; although efficient and real-time, they cannot meet the flexibility and adaptability that complex application scenarios demand of embedded systems.

1.4 Challenges

It should be noted that a machine learning algorithm involves two parts: training and inference. Training needs to access massive training samples and search for optimal model parameters; it places high demands on computing speed and power consumption, is difficult to carry out in embedded systems, and currently relies mainly on GPU systems. Compared with training, the amount of computation in the inference process is much smaller. However, for embedded systems with limited resources, implementing machine learning inference algorithms still faces challenges and requires optimization at different levels.

The "smart" application requirements introduced earlier have brought challenges to the hardware and software of embedded systems, and these challenges mainly come from the computing power requirements required by these applications. For example, currently commonly used image analysis deep convolutional neural network algorithms, their underlying operations are mainly two-dimensional matrix convolution or matrix multiplication operations, and the implementation of these algorithms requires a large number of multiplication and addition operations and massive storage.
For an embedded system, real-time video AI recognition at 10 frames per second typically requires on the order of 10 × 10^9 to 150 × 10^9 multiply-add operations. Although this computing power is not difficult to achieve with existing hardware (such as a high-performance GPU card), an embedded system must simultaneously satisfy constraints on power consumption, volume, reliability, and real-time performance, so this amount of computation poses a huge challenge. In addition, the amount of parameter data that some AI algorithms rely on also puts enormous pressure on the storage of embedded systems.
A typical neural network has on the order of 5 × 10^6 to 140 × 10^6 parameters; storing the parameter values as single-precision floating-point numbers (4 bytes each) corresponds to a storage capacity of 20 MB to 560 MB. In contrast, the RAM of a traditional low-cost embedded system often does not exceed 16 MB.
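As a quick sanity check on the figures above, the conversion from parameter count to storage is simply a multiplication by 4 bytes per single-precision value:

```python
# Quick check of the storage figures quoted above, assuming parameters are
# stored as single-precision floats (4 bytes each).
for n_params in (5_000_000, 140_000_000):
    print(f"{n_params} parameters -> {n_params * 4 / 1e6:.0f} MB")
# 5000000 parameters -> 20 MB
# 140000000 parameters -> 560 MB
```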

With the deepening of machine learning research, many intelligent algorithms have met the requirements of commercial applications in terms of performance, and have gradually entered our lives. Many of these algorithms are implemented in the form of embedded systems, such as face recognition access control systems, smart speakers with voice interaction capabilities, and automatic driving systems based on machine vision. Although there are examples of the application of these machine learning algorithms in embedded systems, there are still many unsolved problems. The main problems and difficulties faced in implementing machine learning algorithms in embedded systems include the following aspects:

1.4.1 Amount of computation

Machine learning applications, especially image recognition, require operations on two-dimensional matrices or higher-dimensional tensors. The core algorithms consist of a large number of two-dimensional convolutions and matrix multiplications, and some applications also require matrix decompositions, such as eigenvalue decomposition and QR decomposition, which are computationally intensive. In addition, with the rise of deep learning, the scale of neural networks continues to grow, which puts pressure on embedded systems with limited computing power. The amount of computation here refers to the number of multiply-add operations the neural network needs in order to run inference on each image.
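To make the counting concrete, here is a minimal sketch of how the per-image multiply-accumulate (MAC) count of a single convolutional layer is computed; the layer dimensions used are illustrative assumptions, not taken from a specific network:

```python
# Count multiply-accumulate (MAC) operations of one 2D convolutional layer.
def conv2d_macs(out_h, out_w, out_ch, in_ch, kh, kw):
    # each output element is a dot product over in_ch * kh * kw input values
    return out_h * out_w * out_ch * in_ch * kh * kw

# e.g. a 3x3 convolution producing a 112x112x64 feature map from 64 input channels
print(conv2d_macs(112, 112, 64, 64, 3, 3))  # ~4.6e8 multiply-adds for one layer
```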

1.4.2 Storage size

Some machine learning algorithms are based on searching and comparing against a feature database, which requires accessing massive amounts of data in a short time for feature analysis and comparison. To meet real-time requirements, all the data that needs to be accessed is often kept in RAM, which makes it difficult to allocate the limited memory resources of an embedded system. In addition, modern deep neural networks need to access a large number of weight coefficients during computation. When the parameter scale of a neural network exceeds the available memory of the embedded processor, large amounts of data must be exchanged between on-chip RAM and relatively slow external memory within a short time to complete the calculation.

1.4.3 Power Consumption

Implementing machine learning algorithms in embedded systems often requires meeting computing power and real-time requirements at the same time. Although these requirements can be met by continuously increasing the processor frequency and adding computing hardware, the price is higher operating power consumption, which limits the use of many machine learning algorithms in scenarios powered by batteries or by green energy sources such as solar power.


2. Implementation of Machine Learning in Embedded Systems

Deep learning algorithms require a large amount of computation and storage. Currently, there are roughly two technical routes to address this: one improves the computing efficiency of embedded processors through customized hardware, and the other relies on algorithm and software optimization. Customized AI acceleration hardware can be further divided into parallel solutions based on general-purpose multiprocessors and solutions based on dedicated computing acceleration engines.

2.1 Based on customized hardware

2.1.1 Multiprocessor-based solution

For example, NVIDIA uses dedicated embedded GPUs to accelerate neural network operations. The advantage of this solution is that the function of each processor can be flexibly defined in software, and it is compatible with many different AI algorithms. However, it consumes a large amount of circuit resources and power, so it is mainly used in areas that demand high performance but are not sensitive to hardware cost and power consumption, such as autonomous driving.

2.1.2 AI embedded system solution based on computing acceleration engine

Compared with the multiprocessor-based solution, this approach replaces the processors with multiple dedicated computing engines, such as a matrix multiplication engine, a convolution engine, and a data sorting and retrieval engine. Since each computing acceleration engine has a single, clear function, it can be fully optimized to improve efficiency and reduce power consumption; for the same amount of computation, the power consumption of this architecture is usually lower than that of a multiprocessor architecture. The price paid is that the function of a hardware acceleration engine is fixed and difficult to change, losing flexibility and compatibility with different operations.

2.2 Algorithm-based optimization

This route improves the AI computing capability of embedded systems through algorithm improvement and software optimization. It makes full use of the computing characteristics of existing processors and does not require new dedicated hardware. Because traditional embedded processors are limited in computing power and storage capacity, many high-performance AI applications are still restricted. Most of the algorithms discussed here are implemented on general-purpose embedded processors, but they can also be applied to multiprocessor systems or adapted to computing acceleration engines.

In order to implement machine learning inference algorithms in embedded systems with limited resources, optimization is needed at different levels. These levels are described below, from the highest (system) level down to the hardware level.

2.2.1 System Solution Optimization

This level considers which solution to use for a specific machine learning problem: for example, should a visual image classification application be implemented with a support vector machine, a deep neural network, or a random forest? Different machine learning algorithms have their own advantages and disadvantages in terms of computing power, memory, classification accuracy, and training difficulty, which developers need to weigh.

2.2.2 Structure optimization of machine learning inference model

For a given machine learning algorithm, consider simplifying its computational structure, using methods such as approximation algorithms, model pruning, and feature dimensionality reduction to reduce complexity. In addition, for a given computation graph, redundant intermediate calculations can be eliminated through equivalent transformations of operation modules, such as fusing the parameters of a convolutional layer and the following BN layer in a neural network, as sketched below.
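A minimal sketch of the convolution/BN fusion idea, assuming weights of shape [out_ch, in_ch, kh, kw] and per-channel BN statistics (the names are illustrative):

```python
import numpy as np

# conv: y = W*x + b;  BN: z = gamma * (y - mean) / sqrt(var + eps) + beta
# Folding BN into the convolution removes one layer (and its memory traffic) at inference time.
def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale factor
    W_fused = W * scale[:, None, None, None]     # fold the scale into the conv weights
    b_fused = (b - mean) * scale + beta          # fold mean/beta into the bias
    return W_fused, b_fused
```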

2.2.3 Operator Optimization

Optimize the underlying computing modules of machine learning algorithms to reduce computational complexity. Specific approaches include approximation algorithms, such as reducing the cost of matrix multiplication through a low-rank approximate decomposition of the matrix, and transform-domain fast algorithms, such as converting a convolution into point-by-point multiplication through a frequency-domain transform.
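As an example of the low-rank idea, here is a minimal sketch that replaces y = W @ x, which costs m·n multiply-adds, with y ≈ U_r @ (V_r @ x), which costs (m + n)·r; the matrix sizes and rank are illustrative assumptions:

```python
import numpy as np

# Truncated SVD factorization of a weight matrix W (m x n) into rank-r factors.
def low_rank_factorize(W, r):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r, :]        # absorb singular values into the left factor

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
x = rng.standard_normal(256)
U_r, V_r = low_rank_factorize(W, r=32)
y_approx = U_r @ (V_r @ x)                    # ~ (256 + 256) * 32 MACs instead of 256 * 256
```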

2.2.4 Bit operation optimization

This is lower-level optimization based on the binary representation of data, such as converting multiplication by a constant into shifts, additions, and subtractions; floating-point multiplication can also be approximated by adding and subtracting the exponent and mantissa fields of the operands.
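A minimal sketch of the constant-multiplication case: 10·x = 8·x + 2·x = (x << 3) + (x << 1), which replaces a full multiplier with cheap shift and add operations:

```python
# Multiply by the constant 10 using only shifts and an addition.
def mul10(x: int) -> int:
    return (x << 3) + (x << 1)

assert all(mul10(x) == 10 * x for x in range(-1000, 1000))
```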

2.2.5 Optimized for CPU hardware features

This level optimizes for the specific CPU used by the embedded system, including using the CPU's SIMD instructions to perform data-parallel vector operations, and using wide registers to carry out several low-bit-width calculations in parallel. This part of the optimization is closely tied to the hardware characteristics.
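The "wide register, narrow lanes" idea can be illustrated in plain integer arithmetic: two unsigned 16-bit values packed into one 32-bit word are both updated by a single 32-bit addition, provided neither lane overflows 16 bits. This is only a conceptual sketch; real implementations use the CPU's SIMD intrinsics.

```python
# Pack two unsigned 16-bit lanes into one 32-bit word.
def pack16x2(hi, lo):
    return (hi << 16) | lo

x = pack16x2(1000, 2000)
y = pack16x2(3000, 4000)
s = (x + y) & 0xFFFFFFFF            # one wide add, two independent 16-bit results
assert (s >> 16, s & 0xFFFF) == (4000, 6000)
```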

Among the optimization levels given above, the efficiency gain from each level depends on the specific problem, but generally speaking, the higher the level, the greater the improvement its optimization brings.


3. Model deployment

3.1 Process from training to deployment

A trained model generally needs to be optimized before deployment, including merging network layers, quantization, and pruning. The result is the actual inference model, which is then deployed on the embedded intelligent device and executed by an inference engine that is much more streamlined than the training framework, providing services to users.
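A minimal sketch of one such optimization, symmetric post-training int8 quantization of a weight tensor; the random tensor here is only a placeholder:

```python
import numpy as np

# Symmetric per-tensor int8 quantization: 4x smaller storage, small accuracy loss.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                         # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                        # dequantize for comparison
print("max abs error:", np.abs(w - w_hat).max())
```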

3.2 Deployment platform

  • Online (server-side) deployment gives priority to accuracy: distributed large models (hundreds of billions of parameters) such as GPT-3 and Microsoft XiaoIce, and latency-insensitive services such as Baidu image recognition.
  • Offline (embedded) deployment balances accuracy, speed, and resource consumption: small models such as face recognition access control and object detection, many of them latency- and resource-sensitive, such as real-time video matting for background replacement.

3.3 Model deployment method

3.3.1 Deployment of the original training framework

Deploy with the complete training framework itself, such as TensorFlow, PyTorch, or Caffe.

In a real production environment this deployment method is rarely used, because it has the following disadvantages:

  • The entire framework needs to be installed;
  • Poor inference performance;
  • Many redundant functions;
  • Large memory usage.

3.3.2 The deployment engine of the training framework

Use the deployment engine that comes with the training framework, such as TF-Lite or PyTorch Mobile (a minimal TF-Lite sketch follows the list below).

  • More streamlined than the framework itself;
  • Only supports models trained by its own framework;
  • Supported hardware/OS is limited.
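A minimal sketch of this kind of deployment with TensorFlow Lite; the tiny Keras model below is only a placeholder standing in for a real trained network:

```python
import numpy as np
import tensorflow as tf

# Placeholder model; in practice this would be a trained network.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])

# Convert with the framework's own deployment engine (TF-Lite).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # default size/latency optimizations
tflite_model = converter.convert()                        # serialized flatbuffer model

# Run one inference with the lightweight TF-Lite interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```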

3.3.3 Manual model reconstruction

  • Write C/C++ code by hand to implement the computation graph and load the weight data;
  • Requires the implementer to fully understand the model, and is technically demanding;
  • The workload is heavy, essentially reinventing the wheel.

3.3.4 Dedicated inference engine

Use a dedicated high-performance neural network inference engine designed for mobile and embedded devices, such as NCNN, MNN, or Tengine.

  • Model files from mainstream frameworks can be loaded directly, without model conversion;
  • Depends only on C/C++ libraries, with no third-party dependencies;
  • Supports Android/Linux/RTOS/bare-metal environments;
  • Provides Python/C/app API interfaces, convenient to call from different languages.
