GPU Acceleration 02: A super-detailed introductory tutorial on Python CUDA that you can follow even without a graphics card!

Python is currently one of the most popular programming languages and is widely used in deep learning, financial modeling, and scientific and engineering computing. As an interpreted language, it is often criticized for its slow execution speed. The Numba library, developed by the well-known Python distributor Anaconda, gives programmers CPU and GPU programming tools for Python that run dozens of times faster, or more, than native Python. Using Numba for GPU programming, you get:

  1. Python’s simple and easy-to-use syntax;
  2. Extremely fast development speed;
  3. Severalfold hardware acceleration.

To keep the ease of use and development speed of Python while still achieving parallel acceleration, this series approaches GPU programming from the Python perspective. For an introduction to Numba itself, see my other article. Even better, Numba ships with a GPU simulator: even if you do not have a GPU machine at hand, you can use the simulator to start learning GPU programming!
This is the second introductory article in the NVIDIA GPU series. It covers the basic workflow and core concepts of CUDA programming and uses Python Numba to write parallel GPU programs. To better understand the GPU hardware architecture, readers are encouraged to read the first article in the series. The series is organized as follows:

  1. GPU hardware knowledge and basic concepts: the differences between CPU and GPU, GPU architecture, and an introduction to the CUDA software stack.
  2. Introduction to GPU programming: CUDA kernel functions, the Thread, Block, and Grid concepts, and simple parallel computing with Python Numba.
  3. Advanced GPU programming: common optimization techniques.
  4. GPU programming in practice: solving complex problems with Python Numba.

First introduction to GPU programming

As the old saying goes, before the troops move, the provisions go first: before starting GPU programming, you need to clarify a few concepts and prepare the relevant tools.

CUDA is a GPU programming framework that NVIDIA provides to developers; with it, programmers can write parallel programs with relative ease. As mentioned in the first article of this series, the CPU and main memory are called the host, while the GPU and video memory (graphics-card memory) are called the device. The CPU cannot directly read video memory, and the GPU cannot directly read host memory; to exchange data, the host and the device must communicate over a bus (Bus).
[Figure: GPU and CPU architecture]

Before doing any GPU programming, confirm that the CUDA Toolkit is installed. You can check the CUDA environment variable with echo $CUDA_HOME; if the result is not empty, CUDA is installed. You can also install CUDA directly with the conda command in Anaconda:

$ conda install cudatoolkit

You can then use the nvidia-smi command to inspect the graphics cards: how many cards the machine has, the CUDA version, the processes running on each card, and so on. The machine I am using here is a Tesla V100 with 32 GB of video memory.
[Figure: output of the nvidia-smi command]

Install the Numba library:

$ conda install numba

Check whether CUDA and Numba are installed successfully:

from numba import cuda
print(cuda.gpus)

If the steps above worked, you should see a result like <Managed Device 0>…. If the machine has no GPU, or the packages above are not installed, an error is raised. When a CUDA program runs, it uses one card exclusively. If your machine has multiple GPU cards, CUDA selects card 0 by default. If you share the machine with other people, it is a good idea to agree on who uses which card. The CUDA_VISIBLE_DEVICES environment variable is generally used to select a particular card, for example to run your program on GPU card 5:

CUDA_VISIBLE_DEVICES='5' python example.py
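
You can also inspect the visible devices from inside Python. The short sketch below uses Numba's cuda.detect() helper, which prints a summary of the CUDA devices it can find (the exact output format depends on your Numba version):

from numba import cuda

# Print a summary of the CUDA devices visible to this process.
# The list is affected by the CUDA_VISIBLE_DEVICES variable shown above.
cuda.detect()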

If you do not have a GPU device at hand, Numba provides a simulator for learning and debugging; you only need to set an environment variable on the command line.

Mac/Linux:

export NUMBA_ENABLE_CUDASIM=1

Windows:

SET NUMBA_ENABLE_CUDASIM=1

Note that the simulator is only a debugging tool: running Numba on the simulator does not speed up your program and may even be slower. Moreover, a program that runs on the simulator is not guaranteed to run on a real GPU; in the end, a real GPU is still required.
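
If you prefer to enable the simulator from inside a Python script rather than from the shell, a minimal sketch is shown below. Note that the variable must be set before numba is imported, otherwise it is ignored.

import os

# Must be set before numba is imported, or the real CUDA driver is used.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

from numba import cuda

@cuda.jit
def hello():
    print("running on the CUDA simulator")

hello[1, 2]()      # 1 block, 2 threads per block
cuda.synchronize()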

With the above preparations, we can start our GPU programming journey!

The difference between GPU programs and CPU programs

The execution sequence of a traditional CPU program is as shown in the figure below:
[Figure: execution flow of a traditional CPU program]

CPU programs execute sequentially and generally involve:

  1. Initialization.
  2. CPU computation.
  3. Retrieving the result.

In CUDA programming, the CPU and main memory are called Host, and the GPU is called Device.
[Figure: execution flow of a program that uses the GPU]

When the GPU is introduced, the calculation process becomes:

  1. Initialize, and copy the necessary data to the video memory of the GPU device.
  2. The CPU calls the GPU function, launching many GPU cores to compute at the same time.
  3. The CPU and GPU compute asynchronously.
  4. Copy the GPU results back to the host to obtain the final result.

A GPU program called gpu_print.py looks like this:

from numba import cuda

def cpu_print():
    print("print by cpu.")

@cuda.jit
def gpu_print():
    # GPU kernel function
    print("print by gpu.")

def main():
    gpu_print[1, 2]()
    cuda.synchronize()
    cpu_print()

if __name__ == "__main__":
    main()

Execute this code with CUDA_VISIBLE_DEVICES='0' python gpu_print.py, and the result is:

print by gpu.
print by gpu.
print by cpu.

The differences from traditional Python CPU code are:

  • Use from numba import cuda to import the cuda module.
  • Add the @cuda.jit decorator to the GPU function to indicate that it runs on the GPU device. A GPU function is also called a kernel function.
  • When the main function calls the GPU kernel, it needs an execution configuration such as [1, 2]. This configuration tells the GPU the parallel granularity of the computation. gpu_print[1, 2]() launches 2 threads at the same time, so the gpu_print function is executed twice in parallel. How to choose the execution configuration is discussed in detail below.
  • The GPU kernel is launched asynchronously: after launching the GPU function, the CPU does not wait for it to finish before executing the next line of code. When necessary, cuda.synchronize() must be called to make the CPU wait until the GPU has finished the kernel before continuing with subsequent CPU-side work. This process is called synchronization and corresponds to the red line in the GPU execution flow chart. If cuda.synchronize() is not called, the result changes and "print by cpu." is printed first: although the GPU function is launched first, the program does not wait for it to finish but goes straight on to cpu_print, and because there is some latency in the CPU calling the GPU, cpu_print's output appears first. A short sketch illustrating this follows the list.
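
To see the asynchrony for yourself, you can drop the synchronization call in gpu_print.py. A sketch of the modified main function is shown here; without the explicit wait, the CPU line is usually printed before the GPU output (the exact order is not guaranteed).

def main():
    gpu_print[1, 2]()
    # cuda.synchronize() deliberately omitted: the kernel launch returns
    # immediately, so cpu_print() usually runs before the GPU has printed.
    cpu_print()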

Thread hierarchy

In the previous program, the kernel function was executed twice in parallel on the GPU. When doing GPU parallel programming, you must define an execution configuration that specifies how the work is parallelized. In the printing example above: should it run 2 times in parallel, or 8 times, or 200,000 times, or even 20 million times? Twenty million is far more than the number of GPU cores, so how should 20 million computations be distributed reasonably across all the cores? To answer these questions, you need to understand CUDA's thread hierarchy.
[Figure: CUDA thread hierarchy]

CUDA calls one execution of the operation defined by the kernel function a thread (Thread). Multiple threads form a block (Block), and multiple blocks form a grid (Grid). A grid can define thousands of threads, which solves the problem of executing tens of thousands of operations in parallel. For example, to make the previous program run 8 times in parallel, you can use 2 blocks with 4 threads in each block. The call becomes gpu_print[2, 4], where the first number in the square brackets is how many blocks the grid contains and the second number is how many threads each block contains.

In fact, a thread is a software concept. From the hardware perspective, a thread runs on a CUDA core, a block of threads runs on a Streaming Multiprocessor (see the first article in this series for the SM concept), and a grid of blocks runs on an entire GPU.
[Figure: mapping of threads, blocks, and grids onto GPU hardware]
CUDA provides a set of built-in variables that record the sizes and index positions of threads and blocks. Take a [2, 4] configuration as an example: blockDim.x is the block size, 4, meaning each block has 4 threads; threadIdx.x is an index from 0 to blockDim.x - 1 (i.e., 3) recording which thread within the block this is; gridDim.x is the grid size, 2, meaning the grid has 2 blocks; blockIdx.x is an index from 0 to gridDim.x - 1 (i.e., 1) recording which block this is.
[Figure: built-in index variables for a [2, 4] execution configuration]

A thread's position within the entire grid is: threadIdx.x + blockIdx.x * blockDim.x.
[Figure: computing the global thread index]
Using the above variables, we can enrich the previous code:

from numba import cuda

def cpu_print(N):
    for i in range(0, N):
        print(i)

@cuda.jit
def gpu_print(N):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x 
    if (idx < N):
        print(idx)

def main():
    print("gpu print:")
    gpu_print[2, 4](8)
    cuda.synchronize()
    print("cpu print:")
    cpu_print(8)

if __name__ == "__main__":
    main()

The execution result is:

gpu print:
0
1
2
3
4
5
6
7
cpu print:
0
1
2
3
4
5
6
7

The GPU function prints the current thread index in each CUDA thread, playing the same role as the for loop in the CPU function. Because the iterations of the for loop do not depend on each other (each iteration minds its own business, and the i-th iteration does not affect the j-th), such mutually independent loops are very well suited to parallel execution across CUDA threads. In practice, we generally replace the mutually independent for loops in CPU code with CUDA kernels.
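
As a side note, Numba also provides the cuda.grid() helper, which computes the same global index as the threadIdx/blockIdx expression above. A minimal sketch of the kernel rewritten with it (the name gpu_print_grid is just illustrative):

from numba import cuda

@cuda.jit
def gpu_print_grid(N):
    # cuda.grid(1) == cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    idx = cuda.grid(1)
    if idx < N:
        print(idx)

gpu_print_grid[2, 4](8)
cuda.synchronize()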

This code prints 8 numbers because the kernel parameter N = 8. What if we only want to print 5 numbers? The current execution configuration launches 2 * 4 = 8 threads, so the thread count 8 does not match the number of computations, 5. However, the if (idx < N) check in the kernel filters out the threads that have nothing to do: we only need to pass N = 5 to gpu_print. CUDA still launches 8 threads, but the threads whose index is greater than or equal to N do no work.
Note that whenever the number of threads does not match the number of computations, a guard like this must be used so that one thread's computation does not corrupt data belonging to other threads.
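
For example, keeping the same [2, 4] configuration but asking for only 5 results:

gpu_print[2, 4](5)   # 8 threads are launched, but only indices 0 to 4 print
cuda.synchronize()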

Block size setting

Different execution configurations affect the speed of a GPU program, and it usually takes several rounds of tuning to find a good one. In practice, the execution configuration [gridDim, blockDim] should be chosen as follows:

  • Blocks run on SMs. Different hardware architectures (Turing, Volta, Pascal...) have different numbers of CUDA cores, so the block size blockDim (the second parameter of the execution configuration) should be chosen for the current hardware. The number of threads in a block is preferably a multiple of 32, such as 128 or 256. Note that due to current hardware limits, the block size cannot exceed 1024.
  • The grid size gridDim (the first parameter of the execution configuration), i.e., the number of blocks in the grid, is obtained by dividing the total number of computations N by blockDim and rounding up.

For example, to launch 1000 threads in parallel we can set blockDim to 128; 1000 ÷ 128 = 7.8, which rounds up to 8. The execution configuration is then written as gpuWork[8, 128]. CUDA launches 8 * 128 = 1024 threads in total, but only the first 1000 take part in the actual computation; the extra 24 threads do nothing.
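
In code, the rounding-up step is usually written with math.ceil (or an equivalent integer expression). A small sketch for the 1000-thread example above (the kernel name gpuWork is just a placeholder):

import math

N = 1000
block_dim = 128                      # threads per block, a multiple of 32
grid_dim = math.ceil(N / block_dim)  # ceil(7.8) = 8 blocks, 8 * 128 = 1024 threads

# gpuWork[grid_dim, block_dim](...)  # hypothetical kernel launch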

These variables are easy to confuse, so to restate: blockDim is the number of threads in a block, and the largest threadIdx in a block is blockDim - 1; gridDim is the number of blocks in a grid, and the largest blockIdx in a grid is gridDim - 1.

In the discussion above, the block and grid sizes are one-dimensional. In real programs the execution configuration is often more complex: the block and grid sizes can be two- or even three-dimensional, as shown in the figure below. That topic is covered in the next article.
[Figure: Thread, Block, and Grid in multiple dimensions]

Memory allocation

As mentioned earlier, the GPU reads data directly from video memory during computation, so the data must be copied from main memory to video memory before each computation; in CUDA terms, the data must be copied from the host to the device. A convenience of CUDA is that it can copy data from the host to the device automatically, without the programmer writing the copies in code. This is very convenient and requires few changes to the original CPU code.

Let's take vector addition as an example and write the kernel function as follows:

@cuda.jit
def gpu_add(a, b, result, n):
    # a and b are the input vectors, result is the output vector
    # all vectors have n elements
    # compute this thread's global index
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    if idx < n:
        result[idx] = a[idx] + b[idx]

Initialize two vectors with 20 million elements each and pass them to the kernel function as arguments:

n = 20000000
x = np.arange(n).astype(np.int32)
y = 2 * x
gpu_result = np.zeros(n)

# CUDA execution configuration
threads_per_block = 1024
blocks_per_grid = math.ceil(n / threads_per_block)

gpu_add[blocks_per_grid, threads_per_block](x, y, gpu_result, n)

Integrate the above code, compare it with the CPU code, and verify the correctness of the result:

from numba import cuda
import numpy as np
import math
from time import time

@cuda.jit
def gpu_add(a, b, result, n):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    if idx < n:
        result[idx] = a[idx] + b[idx]

def main():
    n = 20000000
    x = np.arange(n).astype(np.int32)
    y = 2 * x

    gpu_result = np.zeros(n)
    cpu_result = np.zeros(n)

    threads_per_block = 1024
    blocks_per_grid = math.ceil(n / threads_per_block)
    start = time()
    gpu_add[blocks_per_grid, threads_per_block](x, y, gpu_result, n)
    cuda.synchronize()
    print("gpu vector add time " + str(time() - start))
    start = time()
    cpu_result = np.add(x, y)
    print("cpu vector add time " + str(time() - start))

    if (np.array_equal(cpu_result, gpu_result)):
        print("result correct")

if __name__ == "__main__":
    main()

The results show that the GPU code is more than ten times slower than the CPU code!

gpu vector add time 13.589356184005737
cpu vector add time 1.2823548316955566
result correct

Isn't the GPU supposed to be dozens or even hundreds of times faster than the CPU? The main reasons the GPU is much slower than the CPU here are:

  1. The vector-addition computation is very simple, and NumPy on the CPU has been optimized to the extreme, so the GPU's advantage does not show. The real problems we need to solve are usually far more complicated; for complex problems, well-optimized GPU code is much faster than CPU code.
  2. This code uses CUDA's default unified-memory mechanism and does not optimize data copying. Under unified memory, when the GPU accesses a piece of data and finds it is not on the device, it copies the data over from the host; after the kernel finishes, all of the data is copied back to main memory. In the code above, the two input vectors are read-only and there is no need to copy them back to main memory.
  3. This code does no pipelining. CUDA does not compute all 20 million elements at once; it generally works batch by batch in a pipeline: while one batch is being computed, the next batch is copied from main memory. Computation occupies the CUDA cores while data copies occupy the bus; they use different resources and do not compete with each other. This mechanism is called pipelining and is covered in the next article.

In reason 2, work that the programmer ought to think about is left to CUDA, which adds time overhead; the downside of CUDA's very convenient unified-memory model is therefore slower computation. To address reason 2, we can continue to optimize the program and tell the GPU explicitly which data needs to be copied to the device and which needs to be copied back to the host.

from numba import cuda
import numpy as np
import math
from time import time

@cuda.jit
def gpu_add(a, b, result, n):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    if idx < n :
        result[idx] = a[idx] + b[idx]

def main():
    n = 20000000
    x = np.arange(n).astype(np.int32)
    y = 2 * x

    # copy the input data to the device
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    # allocate space on the device for the GPU result
    gpu_result = cuda.device_array(n)
    cpu_result = np.empty(n)

    threads_per_block = 1024
    blocks_per_grid = math.ceil(n / threads_per_block)
    start = time()
    gpu_add[blocks_per_grid, threads_per_block](x_device, y_device, gpu_result, n)
    cuda.synchronize()
    print("gpu vector add time " + str(time() - start))
    start = time()
    cpu_result = np.add(x, y)
    print("cpu vector add time " + str(time() - start))

    if (np.array_equal(cpu_result, gpu_result.copy_to_host())):
        print("result correct!")

if __name__ == "__main__":
    main()

The result of running this code is:

gpu vector add time 0.19940638542175293
cpu vector add time 1.132070541381836
result correct!

This time the GPU is indeed much faster than the CPU.

Numba works best with NumPy, and NumPy data types should be used when programming. The most commonly used memory-management functions are listed below, followed by a combined sketch:

  • cuda.device_array(): allocate an empty array on the device, similar to numpy.empty()
  • cuda.to_device(): copy host data to the device
        ary = np.arange(10)
        device_ary = cuda.to_device(ary)
  • copy_to_host() (a method of the device array): copy device data back to the host
        host_ary = device_ary.copy_to_host()
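
Putting the three calls together, a minimal host-device round trip looks roughly like this (a sketch; a real program would launch a kernel where indicated):

import numpy as np
from numba import cuda

ary = np.arange(10)
device_ary = cuda.to_device(ary)       # copy host data to the device
result_device = cuda.device_array(10)  # empty device buffer (float64 by default)
# ... a kernel would normally write its output into result_device here ...
host_ary = device_ary.copy_to_host()   # copy device data back to the host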

Summary

The Python Numba library can invoke CUDA for GPU programming. The CPU side is called the host, the GPU side is called the device, and a function that runs on the GPU is called a kernel function. Calling a kernel function requires an execution configuration that tells CUDA the parallel granularity to compute with. When programming the GPU, data must be copied between the host and the device sensibly.
[Figure: GPU program execution process]

The basic process of CUDA programming is summarized below, followed by a compact code sketch:

  1. Initialize and copy the necessary data to the video memory of the GPU device.
  2. Call the CUDA kernel function with an execution configuration that specifies the parallel granularity.
  3. CPU and GPU compute asynchronously.
  4. Copy the GPU calculation results back to the host.
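
Condensed into code, the four steps above look roughly like the following skeleton (the names and sizes are illustrative, not a fixed API):

import math
import numpy as np
from numba import cuda

@cuda.jit
def kernel(a, b, out, n):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < n:
        out[idx] = a[idx] + b[idx]

def run():
    n = 1000
    a = np.arange(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)

    # 1. copy the inputs to the device and allocate the output there
    a_d, b_d = cuda.to_device(a), cuda.to_device(b)
    out_d = cuda.device_array(n, dtype=np.float32)

    # 2. call the kernel with an execution configuration
    threads_per_block = 128
    blocks_per_grid = math.ceil(n / threads_per_block)
    kernel[blocks_per_grid, threads_per_block](a_d, b_d, out_d, n)

    # 3. the launch is asynchronous; wait for the GPU to finish
    cuda.synchronize()

    # 4. copy the result back to the host
    return out_d.copy_to_host()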
