A detailed look at software porting design and development on NVIDIA chips for autonomous driving

Author |  Jessie

Produced | Yanzhi

Previous articles in this series

How Nvidia series chips are applied to the development of smart cars is enough to read these two articles (1)

How Nvidia series chips are applied to the development of smart cars is enough to read these two articles (2)

How NVIDIA series chips are used in autonomous driving development (1): Architecture and safety design
How NVIDIA series chips are used in autonomous driving development (2): Hardware power design

Analyzing NVIDIA's typical architecture development framework from the perspective of low-level software development (part three of the series)

As a general-purpose SoC, the NVIDIA DRIVE Orin series can be used for a wide range of perception and general computing tasks. Its large compute capacity, strong runtime performance, broad compatibility, and rich I/O interfaces reduce system development complexity. These characteristics make the Orin series especially well suited to autonomous driving systems.

Overall, the top-level SoC architecture of the Orin series is built around three processing units: the CPU, the GPU, and the hardware accelerators. The popular Orin-X is used below as a typical example to illustrate how the NVIDIA chip is exercised during software module development.

1. CPU:

The CPU complex in Orin-X includes 12 Cortex-A78 cores, which provide general-purpose high-speed compute, plus an Arm Cortex-R52 based functional safety island (FSI) that offers isolated on-chip compute resources, so no additional external CPU is needed to reach an ASIL-D functional safety level.

Features supported across the CPU complex include debug, power management, the Arm CoreLink interrupt controller, and error detection and reporting.

CPU performance also needs to be monitored: the performance monitoring unit (PMU) in each core provides six counters, each of which can count any event in the processor. Based on the PMUv3 architecture, these counters collect statistics on the processor and memory system at runtime.

2. GPU:

The NVIDIA Ampere GPU provides an advanced parallel processing architecture. Developers can program it in CUDA (the CUDA architecture is described in detail later) and use NVIDIA's tool chains, such as the Tensor Core and RT Core APIs. A deep learning inference optimizer and runtime deliver low latency and high throughput. The Ampere GPU also provides the following features to enable high-resolution, high-complexity image processing (such as real-time optical flow tracking).

  • Sparsity:

Fine-grained structured sparsity doubles throughput and reduces memory consumption.

  • Floating-point throughput:

2x FP32 (CUDA core) floating-point performance per clock cycle.

  • Cache:

The streaming multiprocessor architecture increases L1 cache bandwidth and shared-memory capacity, reducing cache-miss latency, improves asynchronous compute, and adds L2 cache compression.

3. Domain-specific accelerators:

Domain-specific accelerators (DSAs) such as the DLA and PVA are special-purpose hardware engines built for multitasking, high efficiency, and low power consumption. The computer-vision and deep-learning cluster contains two main engines: the Programmable Vision Accelerator (PVA) and the Deep Learning Accelerator (DLA). (The newer mid-range Orin N part drops the DLA.)

The PVA is the second generation of NVIDIA's vision DSP architecture: an application-specific instruction-set vector processor aimed at computer vision, ADAS, ADS, and virtual reality workloads. The PVA has key features that make it a good fit for predictive algorithms while keeping power consumption and latency very low. On Orin-X, PVA control and task monitoring are handled through an internal Cortex-R5 subsystem. A PVA cluster contains two vector processing units (VPUs), each with a vector core, an instruction cache, and three vector data memory units. Each VPU has seven instruction slots holding scalar and vector instructions and contains 384 KB of 3-ported data memory.

The DLA is a fixed-function engine used to accelerate inference in convolutional neural networks. Orin-X implements the second generation of NVIDIA's DLA architecture as a standalone block. The DLA accelerates the convolution, deconvolution, activation, pooling, local normalization, and fully connected layers of a CNN, and additionally supports structured sparsity, depthwise convolutions, and a dedicated hardware scheduler to maximize efficiency.

So how can developers use the computing power of NVIDIA's own GPU and CPU to build effective computer-vision capabilities?

GPU software architecture

Most AI algorithms used in autonomous driving have a parallel structure. Deep learning for image recognition, machine learning for decision making and reasoning, and supercomputing all require large-scale parallel computation, which is a better fit for the GPU architecture. The depth of a neural network (generally, more hidden layers give more accurate results) strongly affects its predictions, and GPUs, which excel at parallel processing, handle and optimize neural network algorithms very well. Each calculation in a neural network is independent of the others; no calculation depends on the result of any other, so all of these independent calculations can be performed in parallel on the GPU. A single convolution on a GPU is usually slower than on a CPU, but for the whole task the CPU works almost serially, processing operations one by one, so it ends up far slower than the GPU. Convolution can therefore be accelerated with parallel programming on the GPU; a minimal sketch follows.
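
To make the independence argument concrete, here is a minimal, illustrative CUDA sketch (not code from the article) of a naive 1-D convolution in which every output element is computed by its own thread; the kernel name, sizes, and launch configuration are assumptions for illustration.

// Each thread computes one output element independently, so all output
// elements can be computed in parallel on the GPU.
__global__ void conv1d_naive(const float* in, const float* kernel,
                             float* out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global output index
    if (i >= n) return;

    float acc = 0.0f;
    for (int j = 0; j < k; ++j) {                    // small serial loop per thread
        int idx = i + j - k / 2;                     // centered window
        if (idx >= 0 && idx < n)
            acc += in[idx] * kernel[j];
    }
    out[i] = acc;                                    // no dependence on other outputs
}

// Launch example: one thread per output element.
// conv1d_naive<<<(n + 255) / 256, 256>>>(d_in, d_kernel, d_out, n, k);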

NVIDIA has built a CPU + GPU + DPU product matrix to develop the data center market, leveraging the GPU's inherent advantage in AI to enter that market. To address problems such as on-chip bandwidth and system-level interconnect, NVIDIA launched the BlueField DPU and the Grace CPU to raise overall hardware performance.

In NVIDIA's GPU, a graphics processing cluster (GPC) contains one raster engine (ROP) and four texture processing clusters (TPCs), and each engine can access all of memory.

Each TPC contains two streaming multiprocessors (SMs). Each SM has 128 CUDA cores split into four separate processing blocks, each with its own instruction buffer, scheduler, and Tensor Core. Each TPC also contains a PolyMorph (geometry) engine, two texture units, and two ray tracing cores (RT Cores).

The GPC is the hardware unit dedicated to rasterization, shading, texturing, and compute; the GPU's core graphics functions are implemented inside the GPC. Within the GPC, the SM's CUDA cores perform pixel-, vertex-, and geometry-level shading computations, while the texture units perform texture filtering and load/store fetches and write data back to memory.

Special Function Units (SFUs) handle transcendental functions and graphics interpolation instructions.

Tensor Cores perform matrix multiplication to greatly accelerate deep learning inference. The RT Core unit improves ray tracing performance by accelerating traversal of the bounding volume hierarchy (BVH) and ray/scene-geometry intersection tests during ray tracing.

Finally, the multi-engine front end handles vertex fetch, tessellation, viewport transformation, attribute setup, and stream output. The SM's geometry- and pixel-level processing performance keeps it highly adaptable, which works well for user interfaces and complex application development, and the Ampere GPU is tuned to deliver this performance at low power.

The core general-purpose programming architecture: CUDA

CUDA (Compute Unified Device Architecture) is the central hub connecting NVIDIA hardware to AI workloads, and the CUDA + GPU combination has greatly accelerated the development of the AI field. Workstations, servers, and cloud instances equipped with NVIDIA GPUs use the CUDA software stack and the CUDA-X AI libraries to provide the training and inference tool chain behind the autonomous driving system's machine learning and deep learning workloads, serving numerous frameworks, cloud services, and so on. It is an indispensable part of software development across the NVIDIA chip family.

CUDA is a computing platform built specifically for NVIDIA GPUs and can generally only be used on NVIDIA GPU systems. The following discusses, from a developer's point of view, how the different software layers are developed on top of the CUDA architecture in the NVIDIA Orin series.

1. CUDA architecture and data processing analysis

The figure below is a schematic of the CUDA architecture, showing the relationship between the CPU, GPU, application, CUDA libraries, runtime, and driver.

[Figure: CUDA architecture - CPU, GPU, application, CUDA libraries, runtime, and driver]

So how does a developer use CUDA's programming language to make calls that span the CPU and GPU modules?

In terms of composition, the CUDA architecture includes three parts: the development libraries, the runtime, and the driver.

"Developer Lib" is an application development library based on CUDA technology. Examples are highly optimized general-purpose math libraries, namely cuBLAS, cuSolver, and cuFFT. Core libraries, such as Thrust and libcu++; communication libraries, such as NCCL and NVSHMEM, and other packages and frameworks on which applications can be built.

The "Runtime runtime environment" provides application development interfaces and runtime components, including the definition of basic data types and functions such as various calculations, type conversion, memory management, device access, and execution scheduling.

The "Driver driver part" is the device abstraction layer of CUDA-enabled GPU, which provides an abstract access interface for hardware devices. CUDA provides a runtime environment through this layer to achieve various functions.

Applications developed on CUDA must run on NVIDIA CUDA-enabled hardware. In the CUDA execution model, the CPU's job is to control GPU execution, schedule and assign tasks, and do some light computation, while the large volume of work that needs parallel computation is handed to the GPU.

Under the CUDA architecture, a program is divided into a host side and a device side. The host side is the part executed on the CPU, and the device side is the part executed on the graphics chip (GPU); the device-side program is also called the "kernel". Typically, the host-side program prepares its data, copies it into GPU memory, lets the GPU execute the device-side program, and then retrieves the result from GPU memory. Note that the CPU can only reach that memory over the PCI Express interface, which is comparatively slow (PCI Express x16 has a theoretical bandwidth of 4 GB/s in each direction), so such transfers should not be done frequently or efficiency drops. A minimal host-side sketch of this flow follows.
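
The following is a minimal, hypothetical host-side sketch of that flow, assuming a simple element-wise kernel (scale_kernel and the buffer names are illustrative, not from the article): allocate device memory, copy the inputs over PCI Express once, launch the kernel, and copy the result back once.

#include <cuda_runtime.h>

// Illustrative device-side "kernel": one thread per element.
__global__ void scale_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Host-side flow: one host-to-device copy, one launch, one device-to-host copy.
void run_on_gpu(const float* h_in, float* h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    // Copy input into GPU memory once (the PCIe transfer is the slow path).
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // The GPU executes the device-side program on thousands of threads.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    // Copy the result back once, instead of shuttling data repeatedly.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}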

From the above, for workloads with massive parallelism, CUDA can effectively hide memory latency and use the GPU's large number of execution units to process thousands of threads at once. Conversely, if a problem cannot be expressed as large-scale parallel work, CUDA cannot reach its best efficiency.

For this bottleneck, NVIDIA has also done a great deal to improve data access. On one hand, CUDA improves the flexibility of DRAM reads and writes, making the GPU's memory mechanism more consistent with the CPU's. On the other hand, CUDA provides on-chip shared memory so data can be shared between threads; applications can use shared memory to reduce DRAM transfers and depend less on DRAM bandwidth, as in the sketch below.
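
As a hedged sketch of how shared memory cuts DRAM traffic, the hypothetical stencil kernel below loads a tile (plus halo) from global memory once per thread block and then lets every thread in the block reuse it from on-chip shared memory; the tile size, radius, and names are assumptions for illustration.

#define TILE   256
#define RADIUS 3

__global__ void stencil_shared(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    int lidx = threadIdx.x + RADIUS;                   // index inside the shared tile

    // Each block loads its TILE elements (plus a halo on both sides) from DRAM once.
    tile[lidx] = (gidx < n) ? in[gidx] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gidx - RADIUS;
        int right = gidx + TILE;
        tile[lidx - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lidx + TILE]   = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();                                   // the whole tile is now on-chip

    if (gidx >= n) return;

    // Every read below hits shared memory instead of DRAM.
    float acc = 0.0f;
    for (int j = -RADIUS; j <= RADIUS; ++j)
        acc += tile[lidx + j];
    out[gidx] = acc / (2 * RADIUS + 1);
}

// Launch with blockDim.x == TILE, e.g.:
// stencil_shared<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);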

In addition, a CUDA program can copy its data into GPU memory once at the start, perform all computation on the GPU until the required results are ready, and only then copy them back to system memory. To let developers use GPU compute conveniently, NVIDIA continuously optimizes the CUDA libraries and driver stack. The operating system's multitasking mechanism manages concurrent access by CUDA and graphics programs to the GPU runtime, and CUDA's compute model lets developers write GPU kernels in an intuitive way.

2. CUDA programming development

The application foundation of the NVIDIA chip family is a rich, mature set of SDKs and libraries on which applications can be built. The CUDA computing architecture, through its optimized scheduling algorithms and software frameworks, gives developers fast, callable low-level primitives, so they can invoke the GPU's parallel computing resources in the quickest and most efficient way while programming, maximizing GPU compute efficiency and making the GPU's parallel capability convenient to use.

CUDA is a C-like language and is compatible with C, so ordinary developers can write and port their own code and algorithms onto it. Broadly, NVIDIA supports three different programming approaches; the following figure shows the relationship between the various languages:

[Figure: relationship between the programming languages and approaches]

Broadly, standard languages such as C++ and Fortran are used for portable parallel development, while platform-specific languages are used for performance optimization. OpenCL can also invoke GPU compute, but because it is so general its overall optimization is not as good as CUDA's, and it is at a clear disadvantage in large-scale computation. The CUDA architecture is one of the platforms on which OpenCL runs; OpenCL simply provides a programmable API on top of it, roughly the relationship of an API to the execution architecture beneath it.

Compiler directives layered on the standard and specialized languages above can bridge the gap between the two approaches by enabling incremental performance optimization, trading off performance, productivity, and code portability.

The figure below shows the performance improvement of a typical optimized CUDA implementation compared with CPU-only and plain GPU programming.

[Figure: performance of an optimized CUDA implementation vs. CPU and plain GPU]

Why can CUDA outperform both the CPU and a plain GPU approach?

In fact, these are different ways of computing the same input. Take summing 1 to 100 as an example. A CPU calls and executes the same accumulation function 100 times in sequence. A GPU with 4 cores computing in parallel runs 4 threads of this work at once, so the work per thread is a quarter of the CPU case. The CUDA approach can be seen as a further optimized GPU approach: for this kind of routine accumulation it can also use "cheap" tricks. For example, pairs taken from the two ends of the range all have the same sum (1+100, 2+99, 3+98, ..., each equal to 101), so you only need the number of such pairs to get the result directly. Judging from this toy example, the CUDA-style approach can cut the amount of work at least in half compared with brute-force GPU summation. A small sketch of the same sum follows.
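
As a small illustration of the same 1-to-100 sum (again, not the article's code), Thrust, one of the core CUDA libraries mentioned earlier, can express the parallel reduction in a few lines, while the closed form n*(n+1)/2 is what the pairing shortcut amounts to.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    const int n = 100;

    // Fill a device vector with 1, 2, ..., 100 and reduce it in parallel on the GPU.
    thrust::device_vector<int> v(n);
    thrust::sequence(v.begin(), v.end(), 1);
    int parallel_sum = thrust::reduce(v.begin(), v.end(), 0);

    // The "cheap" pairing trick (1+100, 2+99, ...) collapses to a closed form.
    int closed_form = n * (n + 1) / 2;

    printf("parallel: %d, closed form: %d\n", parallel_sum, closed_form);  // both 5050
    return 0;
}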

3. CUDA's typical math libraries

Similar to CPU programming libraries, the CUDA libraries are collections of interfaces: you only need to write host code and call the corresponding APIs to get the various functions, which saves a lot of development time. These libraries have been continuously optimized by highly skilled engineers, so we can generally trust them to deliver good performance. Of course, relying entirely on the libraries while knowing nothing about CUDA performance optimization is not enough; some manual tuning is still needed to extract the best performance.

Commonly used libraries on CUDA include the following:

  • cuSPARSE: a linear algebra library mainly for sparse matrices and related operations

  • cuBLAS: the standard CUDA dense linear algebra library, with no operations specific to sparse matrices (a minimal usage sketch follows this list)

  • cuFFT: fast Fourier transforms

  • cuRAND: random number generation
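
As referenced in the cuBLAS item above, here is a minimal, hypothetical usage sketch of a single-precision matrix multiply (C = alpha*A*B + beta*C) with cublasSgemm; cublasSgemm itself is real cuBLAS API, but the wrapper, sizes, and device pointers are illustrative assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal cuBLAS sketch: C = alpha*A*B + beta*C for column-major matrices
// already resident in device memory (d_A is m x k, d_B is k x n, d_C is m x n).
void sgemm_example(const float* d_A, const float* d_B, float* d_C,
                   int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    // cuBLAS assumes column-major storage; leading dimensions are the row counts here.
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                d_A, m,
                d_B, k,
                &beta,
                d_C, m);

    cublasDestroy(handle);
}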

cuBLASLt

cuBLASLt exposes mixed-precision multiply operations using a new FP8 data type. These ops also support fusion of BF16 and FP16 biases, and fusion of FP16 biases with GELU activations for GEMMs with FP8 input and output data types.

In terms of performance, compared to BF16 on A100, FP8 GEMM is 3x and 4.5x faster on H100 PCIe and SXM, respectively. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiply operations.

cuBLAS 12.0 extends the API to support 64-bit integer problem sizes, leading dimensions, and vector increments. These new functions have the same API as their 32-bit integer counterparts, except that they have a _64 suffix in their names and declare the corresponding parameters as int64_t.

cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result);

The corresponding 64-bit integer version is:

cublasStatus_t cublasIsamax_64(cublasHandle_t handle, int64_t n, const float *x, int64_t incx, int64_t *result);
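
A minimal, hypothetical usage sketch of the 64-bit variant above (requires the cuBLAS shipped with CUDA 12.0 or later; the wrapper and buffer names are illustrative):

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Find the (1-based) index of the element with the largest absolute value,
// using the 64-bit integer API so n may exceed the 32-bit range.
int64_t argmax_abs(const float* d_x, int64_t n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int64_t result = 0;
    cublasIsamax_64(handle, n, d_x, /*incx=*/1, &result);

    cublasDestroy(handle);
    return result;   // 1-based index, following BLAS convention
}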

cuFFT

During program initialization, cuFFT performs a series of steps (including heuristics) to determine which kernels are used and how kernel modules are loaded. Starting with CUDA 12.0, cuFFT delivers most of its kernels in CUDA Parallel Thread Execution (PTX) assembly rather than binary form.

When a cuFFT program is initialized, the PTX code of the cuFFT kernels is loaded by the CUDA device driver at runtime and compiled further into binary code. Thanks to the new implementation, the first round of improvements enables many new accelerated kernels for the NVIDIA Maxwell, Pascal, Volta, and Turing architectures.
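
For context, the sketch below shows a typical minimal cuFFT call sequence; plan creation is the initialization step during which the kernel selection and loading described above take place (the wrapper, signal buffer, and sizes are illustrative assumptions):

#include <cufft.h>
#include <cuda_runtime.h>

// Minimal cuFFT sketch: plan creation triggers kernel selection/loading,
// execution then runs the chosen FFT kernels on the GPU.
void fft_example(cufftComplex* d_signal, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, /*batch=*/1);          // initialization + kernel loading

    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
}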

cuSPARSE

To reduce the workspace required for sparse matrix-matrix multiplication (SpGEMM), NVIDIA has released two new algorithms with lower memory usage. The first computes strict bounds on the number of intermediate products, while the second allows the computation to be partitioned into chunks. These new algorithms benefit users whose devices have less memory.

The latest INT8 support has been added to cusparseGather, cusparseScatter, and cusparseCsr2cscEx2.

Finally, the latest CUDA release also includes lazy loading, and subsequent CUDA releases have continued to enhance and extend it. From an application development perspective, opting into lazy loading requires nothing specific; existing applications can use it as-is. Lazy loading is a technique that delays the loading of kernels and CPU-side modules until the application actually needs them; the default is to eagerly load all modules when a library is first initialized. Lazy loading not only saves significant device and host memory but also shortens the end-to-end execution time of the algorithm.

Summary

This article has analyzed the overall software framework, software algorithm scheduling, and software function construction and use from the perspective of NVIDIA's core software modules: the GPU and CUDA development. The goal is to give a good grasp of the software architecture of the whole Orin series, echoing the hardware architecture covered in the previous article. Of course, from the perspective of the complete software development process, this article is only an entry-level description and is far from sufficient for refined development. In a follow-up article we will go deeper from the perspective of the operating system (DRIVE OS) and the driver/middleware modules (DriveWorks).

