CUDA C Best Practices (1)

This is one of the three most useful documents in the official CUDA manual.
P.S.: translating the full text would be exhausting, so this is a condensed summary.

Foreword

The role of this document

This document exists to help developers get the best performance out of NVIDIA GPUs. It is best read in order; doing so will greatly improve your understanding of program efficiency.

Intended audience

You need to know C and have CUDA installed (install from here). It is also worth looking at the "CUDA C Programming Guide". (A recurring feature of this document is that whenever space runs short, it sends you off to the Programming Guide.)

Evaluate, Parallelize, Optimize, Deploy

[Figure: the APOD cycle]

This diagram, the APOD cycle, is the core of the whole document: first you evaluate your program and achieve an initial speedup with minimal optimization, then test and deploy it; the loop can then be run again and again, spotting further optimization opportunities, speeding things up again, and deploying the faster version.

Evaluate

For an existing project, the first step is to evaluate the application and locate the parts responsible for most of the execution time. Knowing this, developers can estimate the bottlenecks of a parallel version and how much speedup the GPU can deliver. This requires an understanding of Amdahl's and Gustafson's laws.

Parallelize

After identifying the hotspots, the developer needs to parallelize the program. You can use an existing parallelized library or add parallelization directives for the compiler. But many programs require refactoring before they can be parallelized, and CUDA makes this straightforward.

Optimize

When parallelization is complete, developers can turn to optimization. The first step is to clarify the application's requirements, then optimize and deploy in iterations; there is no need to reach top speed from the very first pass. Optimization can also be applied at different levels, from overlapping computation with data transfers down to fine-grained floating-point tuning, and the profiling tools point you toward the next optimization step.

Deploy

After optimizing, compare the actual results with the expected results and repeat the APOD cycle. Deploying the current version before going into deeper optimization has many benefits: for example, it lets users evaluate the current application, and it lowers risk because deployment becomes a gradual evolution rather than a total overhaul.

Recommendations and Best Practices

This document assigns a priority to each recommendation, to ensure that all high-priority optimizations are completed before lower-priority ones are attempted. Of course the priorities are not absolute; the document just covers the common scenarios.

1. Evaluate the application

(Blah blah, a lot of words about the importance of parallel computing.) To take advantage of modern processors, including GPUs, the most important first step is to identify the hotspots of the program and determine whether they can be parallelized.

2. Heterogeneous computing

Although the GPU was designed mainly to process images, it is also very powerful for general computation. CPUs and GPUs are not the same, and it is important to understand the differences in order to use CUDA efficiently.

2.1. Differences between host and device

  1. Thread resources

    The CPU can run only a small number of threads concurrently (a few dozen at most), while the GPU can keep tens of thousands of threads in flight.

  2. Threads

    Threads on a CPU are heavyweight entities, and context switching between them is expensive; the GPU is the opposite, because it allocates registers to its resident threads, so there is very little state to swap. Put simply, CPUs are designed to run a small number of threads with minimal latency, while GPUs are designed to maximize throughput across a very large number of threads.

  3. RAM

    Each has its own memory, and the two are connected by the PCIe bus.

This is only an initial discussion of the differences between the two; others are covered in later parts of the document. Knowing these differences helps you divide the work: try to run sequential jobs on the host and parallel jobs on the GPU.

2.2. Which parts should run on the GPU

  1. Obviously, a large dataset on which the same operation is performed on every element. This requires a lot of threads.
  2. The data should be accessed in patterns with good coherence, otherwise the speedup will be small.
  3. Data transfer between the host and the device should be minimized.
    • It is not worth transferring data for very little work. For example, passing in two N×N matrices, computing their sum, and passing the result back performs N^2 operations but transfers 3N^2 elements, so the ratio of computation to transfer is 1:3, i.e. O(1). Computing the product instead costs O(N^3) operations against O(N^2) transfers, i.e. O(N) work per element moved, which is much better; the same goes for more expensive operations such as trigonometric functions. Either way, remember that transferring data has overhead (see the sketch after this list).
    • Data should remain on the device as much as possible. Between two kernel launches, data should be kept on the device rather than copied back and forth. For example, the sum of the two matrices above may be needed by a later operation, so it should be left there, and in that case the follow-up computation should also run on the GPU even if it would run faster on the host in isolation; even a slower kernel can be a net win this way, as Chapter 9 explains in detail.
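
To make the sum example concrete, here is a minimal sketch (the kernel addKernel and the wrapper addOnDevice are my own illustration, not from the guide). The kernel does one operation per element, while the three cudaMemcpy calls move 3N^2 elements:

__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];                                 // one add per element transferred
}

// Host side (error checking omitted). If the sum feeds a later kernel,
// skip the final copy back and leave the result in d_c on the device.
void addOnDevice(const float *h_a, const float *h_b, float *h_c, int n)
{
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // N^2 elements in
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);    // N^2 elements in
    addKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // N^2 additions
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    // N^2 elements out
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}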

3. Program Analysis

3.1. Analysis

Many programs do most of their work in a small fraction of the code. Using a profiler, the developer can find these hotspots and make a list of candidates for parallelization.

3.1.1. Creating an Analysis

The most important thing is to find the functions with the longest execution time, and, when profiling, to make sure the workload is representative of real use. On Linux you can use gprof for this:

[gprof output: time spent in each function]

3.1.2. Identifying Hotspots

As the gprof output above shows, the genTimeStep() function takes almost a third of the total time, so it is the function we should optimize first. Other functions, such as calcStats() and calcSummaryData(), also take a large share of the time; parallelizing them would speed the program up too, but there is no need to do everything at once.

3.1.3. Recognize which parts can be parallelized

To get the most performance gains from CUDA, you must first find a way to parallelize your existing serial code.

3.1.3.1. Strong scaling and Amdahl's law

Both laws are covered in more detail here: Amdahl and Gustafson's Law in Parallel Computing.

Amdahl's law bounds how much your program can speed up even if the parallel part is perfect (its running time drops to zero).
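
For reference, the standard statement of the law, with P the fraction of the runtime that can be parallelized and N the number of processors:

S = 1 / ((1 - P) + P / N)

Even as N goes to infinity the maximum speedup approaches 1 / (1 - P); a program that is 75% parallelizable can never run more than 4x faster.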

3.1.3.2. Weak scaling and Gustafson's law

Gustafson's law assumes that the problem grows so that the ratio of serial to parallel execution time remains constant, reflecting the extra cost of setting up and handling larger problems. (Honestly, I do not fully understand this one.)
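
The usual formulation, with s the serial fraction of the parallel run and N the number of processors:

S = N + (1 - N) * s

so the achievable (scaled) speedup grows almost linearly with N as long as the serial fraction stays small.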

3.1.3.3. Implementing strong/weak scaling

Know which kind of scaling fits your application. For some programs the problem size is fixed, such as the force between two molecules; for others the problem size grows with the number of processors, such as a Monte Carlo simulation of a fluid, where a larger workload gives greater accuracy.

4. Parallelize the program

After identifying the hotspots, the developer needs to parallelize the program. You can use an existing parallelized library or add parallelization directives for the compiler. But many programs require refactoring before they can be parallelized, and CUDA makes this straightforward.

5. Get started

Although how best to parallelize a particular application can be complex, there are a few key steps that most of them share.

5.1. Parallel Libraries

CUDA provides several parallelized libraries, such as cuBLAS, cuFFT, and the like. It is very convenient to use these libraries when they match your requirements. Besides cuBLAS for linear algebra and cuFFT for Fourier transforms, the Thrust template library deserves special mention: it contains many commonly used parallel algorithms and data structures that can be combined to build complex algorithms, and it is great for quickly prototyping a CUDA application.
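
As a rough sketch of how little code Thrust needs (the vector size and the particular pipeline are my own illustration): fill, transform, and sort a million elements entirely on the device without writing a kernel by hand.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

int main()
{
    thrust::device_vector<float> v(1 << 20);          // one million floats on the device
    thrust::sequence(v.begin(), v.end());             // fill with 0, 1, 2, ... in parallel
    thrust::transform(v.begin(), v.end(), v.begin(),
                      thrust::negate<float>());       // element-wise negate, in parallel
    thrust::sort(v.begin(), v.end());                 // parallel sort back into ascending order
    return 0;
}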

5.2. Parallel compilers

Here the compiler itself is asked to parallelize or transform code through special directives or flags, for example the #pragma unroll directive used to unroll loops. OpenACC provides many such directives; click here to go to the official OpenACC website.
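
A small sketch of the directive style in device code (the kernel dotRows and the fixed length of 8 are my own illustration); because the trip count is known at compile time, the compiler can fully unroll the loop:

__global__ void dotRows(const float *a, const float *b, float *out)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)            // trip count is known at compile time,
        sum += a[row * 8 + k] * b[k];      // so the loop can be fully unrolled
    out[row] = sum;
}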

5.3. Parallelism in code

Besides the ready-made approaches above, programmers of course still need to write code by hand: we can rewrite the hotspots we found as parallel code ourselves. If profiling shows that many functions each take a similar amount of time, the code needs to be refactored first; keep in mind that refactoring code to expose parallelism also pays off on future architectures, so this work is worth it.

6. Get the correct answer

It is not easy to find bugs in parallel programs: there are many threads, and floating-point arithmetic, among other things, can cause unexpected differences. This chapter introduces the issues that can lead to wrong results and tells you how to address them.

6.1. Verification

6.1.1. Reference comparison

The first step is to compare the new results against reference results to make sure they match, using criteria appropriate to the algorithm. Some computations demand bit-for-bit identical results, but this is not always possible, especially where floating-point arithmetic is involved. It is worth noting that the methods used to verify numerical results extend easily to verifying performance as well: we need to be sure the results are correct while also making the code faster.

6.1.2. Unit Testing

For better testing, write kernels as a composition of many short __device__ functions rather than one big __global__ function. (Note that if a kernel does nothing to global memory, the compiler may treat that code as dead code and remove it, so make the test kernel write something observable.) In addition, if functions are declared __host__ __device__ rather than just __device__, they can also be run and tested on the CPU, which gives us more confidence in the tests.
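
A minimal sketch of that idea (the helper saxpy_elem and the test values are my own illustration): the same __host__ __device__ function is exercised both from a kernel and from plain host test code.

#include <cassert>

__host__ __device__ float saxpy_elem(float a, float x, float y)
{
    return a * x + y;                       // the logic under test lives in one place
}

__global__ void saxpyKernel(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = saxpy_elem(a, x[i], y[i]);   // writes to global memory, so the compiler
                                            // cannot discard the work as dead code
}

// Host-side unit test of the same helper; no GPU required.
void test_saxpy_elem()
{
    assert(saxpy_elem(2.0f, 3.0f, 1.0f) == 7.0f);
}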

6.2. Debugging

You can use CUDA-GDB, which I have also written about; see here for details: Debug cu programs with cuda-gdb.
Or use NVIDIA Parallel Nsight to debug: http://developer.nvidia.com/nvidia-parallel-nsight
There are also some third-party debuggers: http://developer.nvidia.com/debugging-solutions

6.3. Numerical precision

Most errors in floating-point results stem from the way floating-point numbers are computed and stored. A useful reference: floating-point precision

6.3.1. Single-Precision vs Double-Precision

Devices of compute capability 1.3 and above provide double-precision floating-point arithmetic, which gives greater precision than single precision. Keep track of which precision you are actually using when comparing results.

6.3.2. Floating-point arithmetic is not associative

This means that (A+B)+C and A+(B+C) are not necessarily equal in floating point, so be aware that reordering operands may change the result. This problem is not specific to CUDA; any parallel floating-point system can exhibit it. A small host-side illustration follows.
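
A minimal host-side sketch (the particular values are my own illustration): the small term is lost when it is added to the large one first.

#include <cstdio>

int main()
{
    float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
    printf("%f\n", (a + b) + c);   // prints 1.000000: a + b cancels exactly, then c is added
    printf("%f\n", a + (b + c));   // prints 0.000000: c is lost when rounded into b
    return 0;
}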

6.3.3. Converting double to single precision

For example:

float a;
...
a = a*1.02;

When this code is computed on the GPU it runs in single precision, but on the host 1.02 is a double-precision constant, so the computation is promoted to double precision and the results differ. The fix is to write the constant as 1.02f, making it a single-precision floating-point literal.

6.3.4. IEEE 754 Standard

All CUDA devices follow the IEEE 754 standard, apart from some special cases; the differences are listed in the Features and Technical Specifications section of the CUDA C Programming Guide.

6.3.5. x86 80-bit computing

x86 machines can also perform floating-point computation in 80-bit extended precision, which differs from plain 64-bit arithmetic. To get comparable results, try to keep the x86 host from using this mode; the precision mode is controlled with the FLDCW instruction.

7. Optimize CUDA applications

When parallelization is complete, developers can turn to optimization. The first step is to clarify the application's requirements, then optimize and deploy in iterations; there is no need to reach top speed from the very first pass. Optimization can also be applied at different levels, from overlapping computation with data transfers down to fine-grained floating-point tuning, and the profiling tools point you toward the next optimization step.

8. Performance testing

To optimize code, it is important to know how to measure performance accurately and to understand the role bandwidth plays in optimization. This chapter focuses on these two topics.

8.1. Timing

8.1.1. Using CPU timers

Going into detail about CPU timers is beyond the scope of this document, but it is important to know that the option exists. One principle must be respected: CPU and GPU work must be synchronized around the timed region, which you do by calling cudaDeviceSynchronize(); it blocks the CPU thread until the GPU has finished all outstanding work. There are also calls that synchronize the CPU with a particular stream or event, but they are not suitable for timing because streams are usually interleaved. Note that this kind of synchronization stalls the GPU pipeline, so use it as little as possible.
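
A minimal sketch of the pattern, assuming a hypothetical kernel myKernel with launch configuration grid/block and arguments d_data/n; std::chrono stands in for whatever CPU timer you prefer.

#include <chrono>
#include <cuda_runtime.h>

double timeKernelWithCpuTimer(float *d_data, int n, dim3 grid, dim3 block)
{
    cudaDeviceSynchronize();                       // drain any GPU work queued earlier
    auto t0 = std::chrono::steady_clock::now();

    myKernel<<<grid, block>>>(d_data, n);          // asynchronous launch returns immediately

    cudaDeviceSynchronize();                       // block until the kernel has finished
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();   // milliseconds
}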

8.1.2. Using CUDA GPU timers

Timing can be done using the API provided by CUDA:

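A sketch of the standard pattern (the kernel myKernel, its launch configuration grid/block, and its arguments d_data/n are placeholders):

#include <cuda_runtime.h>

float timeKernelWithEvents(float *d_data, int n, dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // enqueue 'start' in the default stream
    myKernel<<<grid, block>>>(d_data, n);       // the work being timed
    cudaEventRecord(stop, 0);                   // enqueue 'stop' after the kernel
    cudaEventSynchronize(stop);                 // wait until 'stop' has actually been reached
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}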

cudaEventRecord() places the start and stop events into the default stream; the device records a timestamp when the stream reaches each event, and cudaEventElapsedTime() returns the elapsed time between start and stop (in milliseconds).

8.2. Bandwidth

8.2.1. Calculating the theoretical bandwidth

You only need to know the memory clock frequency and the memory bus width of the GPU. For example, with a 1.85 GHz memory clock, a 384-bit wide memory interface, and double-data-rate memory, it is calculated like this:

(1.85*10^9*(384/8)*2)/10^9 = 177.6 GB/s

The reasoning: first convert GHz to Hz, 384/8 converts the bus width from bits to bytes, the ×2 accounts for the double data rate, and dividing by 10^9 converts the result to GB/s.

8.2.2. Calculate the actual bandwidth

Formula: ((Br+Bw)/10^9)/time

It is the amount of data actually transferred divided by the elapsed time. For example, for a kernel that reads and writes a 2048×2048 matrix of floats, it is calculated like this: (2048 × 2048 × 4 × 2) / 10^9 / time

where 4 is the number of bytes per element and 2 accounts for the read and the write.
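
Putting the formula to work as a quick host-side sketch; elapsed_ms is assumed to come from the event timing shown earlier, and the 2048×2048 float matrix is the example above.

#include <cstdio>

// elapsed_ms is assumed to come from cudaEventElapsedTime() as shown in section 8.1.2
void reportEffectiveBandwidth(float elapsed_ms)
{
    const double n = 2048.0;
    const double bytes = n * n * 4.0 * 2.0;        // 4 bytes per float, one read plus one write
    double seconds = elapsed_ms / 1000.0;
    double gbPerSec = (bytes / 1e9) / seconds;     // ((Br + Bw) / 10^9) / time
    printf("Effective bandwidth: %.1f GB/s\n", gbPerSec);
}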

8.2.3. Using the Visual Profiler to measure throughput

On devices of compute capability 2.0 or higher, the Visual Profiler can report throughput for the different kinds of memory, including:

  • Requested Global Load Throughput
  • Requested Global Store Throughput
  • Global Load Throughput
  • Global Store Throughput
  • DRAM Read Throughput
  • DRAM Write Throughput

Here, "Requested" refers to the data the kernel actually asked for.

In the end, both the actual and the requested throughput figures are useful. The former shows how close your code comes to the hardware's potential, and comparing the latter with the former shows how much bandwidth is wasted by poorly coalesced memory accesses. For global memory, this comparison is reported as Global Memory Load Efficiency and Global Memory Store Efficiency.
