An Even Easier Introduction to CUDA (translation)

 

This is a very simple introduction to CUDA, NVIDIA's very popular parallel computing platform and programming model. I wrote a brief introduction to CUDA back in 2013, and that article has been very popular ever since. But CUDA has become simpler, and GPUs have gotten a lot faster. Today it's time for an updated (and even easier) introduction.

CUDA C++ is just one of the ways to create massively parallel applications with CUDA. It lets you use the powerful C++ programming language to develop high-performance algorithms accelerated by thousands of parallel threads running on GPUs. Many developers have accelerated their computation- and bandwidth-hungry applications this way, including the libraries and frameworks that underpin the ongoing revolution in artificial intelligence known as deep learning.

So, you've heard about CUDA and you are interested in learning how to use it in your own applications. If you are a C or C++ programmer, this post should give you a good start. To follow along, you'll need a computer with a CUDA-capable GPU (on Windows, Mac, or Linux; any NVIDIA GPU should do), or a cloud instance with a GPU (AWS, Azure, IBM SoftLayer, and other cloud service providers have them). You'll also need the free CUDA Toolkit installed.
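A quick way to check that the toolkit is installed (assuming nvcc, the CUDA compiler driver, is already on your PATH) is to ask it for its version:

nvcc --version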

Let's get started!

Starting Simple

We'll start with a simple C++ program that adds the elements of two arrays, each containing one million elements.

#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

First, compile and run this C++ program. Put the code above in a file, save it as add.cpp, and compile it with your C++ compiler. I'm on a Mac, so I use clang++, but you can use g++ on Linux or MSVC on Windows.

g++ add.cpp -o add

Then run it:

./add
Max error: 0

As expected, it prints that there was no error in the summation, and then exits. Now I want to get this computation running (in parallel) on the many cores of a GPU. It's actually pretty easy to take the first step.

First, I just have to turn our add function into a function that the GPU can run, called a kernel in CUDA. To do this, all I have to do is add the specifier __global__ to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.

// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

These __global__ functions are known as kernels. Code that runs on the GPU is commonly called device code, while code that runs on the CPU is host code.

Memory Allocation in CUDA

To compute on the GPU, I need to allocate memory that the GPU can access. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to cudaFree().

I just need to replace the calls to new in the code above with calls to cudaMallocManaged(), and replace the calls to delete [] with calls to cudaFree().

// Allocate Unified Memory -- accessible from CPU or GPU
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  ...

  // Free memory
  cudaFree(x);
  cudaFree(y);

Finally, I need to launch the add() kernel, which invokes it on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax <<< >>>. I just have to add it to the call to add before the parameter list.

add<<<1, 1>>>(N, x, y);

Easy! I'll get into the details of what goes inside the angle brackets soon; for now all you need to know is that this line launches one GPU thread to run add().

Just one more thing: I need the CPU to wait until the kernel is done before it accesses the results (because CUDA kernel launches don't block the calling CPU thread). To do this, I simply call cudaDeviceSynchronize() before doing the final error checking on the CPU.
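As an aside (this error-checking pattern is my own addition, not part of the original post): a kernel launch doesn't return an error code directly, so cudaGetLastError() is used to report launch failures, while cudaDeviceSynchronize() returns any error that occurred while the kernel was executing. A minimal sketch:

  // Launch the kernel, then check for both launch and execution errors
  add<<<1, 1>>>(N, x, y);

  cudaError_t err = cudaGetLastError();   // error from the launch itself
  if (err != cudaSuccess)
    std::cerr << "Launch error: " << cudaGetErrorString(err) << std::endl;

  err = cudaDeviceSynchronize();          // error raised during execution
  if (err != cudaSuccess)
    std::cerr << "Kernel error: " << cudaGetErrorString(err) << std::endl;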

This is the complete code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}
CUDA files have the file extension .cu. So save this code in a file called add.cu, and compile it with nvcc, the CUDA C++ compiler.
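On the command line, that looks something like this (add_cuda is just the output name I'm choosing here, so that it matches the profiling command used below):

nvcc add.cu -o add_cuda
./add_cuda

It should print Max error: 0, just like the CPU version.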

This is only a first step, though, because as written, this kernel is only correct for a single thread, since every thread that runs it will perform the add on the whole array. Moreover, there is a race condition, since multiple parallel threads would both read and write the same locations.

Note: on Windows, you need to make sure that "Platform" is set to x64 under "Configuration Properties" in your Microsoft Visual Studio project.

Profile it!

I think the simplest way to find out how long the kernel takes to run is to run it with nvprof, the command-line GPU profiler that comes with the CUDA Toolkit. Just type nvprof ./add_cuda on the command line:

[Translator's note: I used 1G elements here, so the gap relative to the author's numbers is fairly large.]

Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  463.25ms         1  463.25ms  463.25ms  463.25ms  add(int, float*, float*)

Above is the truncated output from nvprof, showing a single call to add. It takes about half a second on an NVIDIA Tesla K80 accelerator (translator's note: with 1M elements), and about the same time on the NVIDIA GeForce GT 740M in my 3-year-old Macbook Pro.

 

Let's make it faster with parallelism.

Picking up the Threads

Now that you've run a kernel with one thread that does some computation, how do you make it parallel? The key is in CUDA's <<<1, 1>>> syntax. This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use for the launch on the GPU. There are two parameters here, but let's start by changing the second one: the number of threads in a thread block. CUDA GPUs run kernels using blocks of threads that are a multiple of 32 in size, so 256 threads is a reasonable size to choose:

add<<<1, 256>>>(N, x, y);

 

If I run the code with only this change, it will do the computation once per thread, rather than spreading the computation across the parallel threads. To do it properly, I need to modify the kernel. CUDA C++ provides keywords that let kernels get the indices of the running threads. Specifically, threadIdx.x contains the index of the current thread within its block, and blockDim.x contains the number of threads in the block. I'll just modify the loop to stride through the array with parallel threads.

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
} 
The add function hasn't changed that much. In fact, setting index to 0 and stride to 1 makes it semantically identical to the first version.

Save the file as add_block.cu, compile it, and run it in nvprof again. The results are as follows:

 

Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.7107ms         1  2.7107ms  2.7107ms  2.7107ms  add(int, float*, float*)

That's a big speedup (from 463ms down to 2.7ms), but it's not surprising since I went from 1 thread to 256 threads. The K80 is faster than my little Macbook Pro GPU (3.2ms). Let's keep going to get some more performance.

[Translator's note: mine went from 144.272s to 2.923s. Also, this is only the execution time of the add function, not of the whole program; the whole program still takes about half a minute to run.]

 

Out of the Blocks

CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, the Tesla P100 GPU, based on the Pascal GPU architecture, has 56 SMs, each capable of supporting up to 2048 active threads. To take full advantage of all these threads, I should launch the kernel with multiple thread blocks.

By now you may have guessed that the first parameter of the execution configuration specifies the number of thread blocks. Together, the blocks of parallel threads make up what is known as the grid. Since I have N elements to process and 256 threads per block, I just need to calculate the number of blocks needed to get at least N threads. I simply divide N by the block size (being careful to round up in case N is not a multiple of blockSize).

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);


I also need to update the kernel code to take into account the entire grid of thread blocks. CUDA provides gridDim.x, which contains the number of blocks in the grid, and blockIdx.x, which contains the index of the current thread block in the grid. Figure 1 illustrates the approach to indexing into a (one-dimensional) array in CUDA using blockDim.x, gridDim.x, and threadIdx.x. The idea is that each thread gets its index by computing the offset to the beginning of its block (the block index times the block size: blockIdx.x * blockDim.x) and adding the thread's index within the block (threadIdx.x). The code blockIdx.x * blockDim.x + threadIdx.x is idiomatic CUDA.

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

The updated kernel also sets stride to the total number of threads in the grid (blockDim.x * gridDim.x). This type of loop in a CUDA kernel is often called a grid-stride loop.
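The pattern isn't specific to add. As an illustration (a sketch of my own, not from the original post), here's a hypothetical saxpy kernel written with the same grid-stride loop; it computes y = a*x + y and stays correct for any grid size, even a single thread:

__global__
void saxpy(int n, float a, float *x, float *y)
{
  // Same indexing as add: offset to this thread's position, then
  // stride by the total number of threads in the grid.
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = a * x[i] + y[i];
}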

Back to our add program: save the file as add_grid.cu, compile it, and run it in nvprof. The results:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  94.015us         1  94.015us  94.015us  94.015us  add(int, float*, float*)

That's another 28x speedup, from running multiple blocks on all the SMs of the K80! We're only using one of the 2 GPUs on the K80, but each GPU has 13 SMs. Note that the GeForce in my laptop has 2 (weaker) SMs, and running the kernel takes 680us.
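If you want to know how many SMs your own GPU has, you can query it at runtime with cudaGetDeviceProperties(). A small self-contained sketch (my addition, not from the original post):

#include <cstdio>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // properties of device 0
  std::printf("%s: %d SMs, up to %d threads per SM\n",
              prop.name, prop.multiProcessorCount,
              prop.maxThreadsPerMultiProcessor);
  return 0;
}

Save it as a .cu file and compile it with nvcc like any other CUDA program.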

 

Summing Up

Here's a rundown of the performance of the three versions of the add() kernel on the Tesla K80 and the GeForce GT 750M, gathered from the runs above:

Version                      Tesla K80    GeForce (laptop)
1 CUDA thread                463ms        ~0.5s
1 block (256 threads)        2.7ms        3.2ms
Many blocks (grid-stride)    0.094ms      0.68ms

Exercises

  1. Browse the CUDA Toolkit documentation. If you haven’t installed CUDA yet, check out the Quick Start Guide and the installation guides. Then browse the Programming Guide and the Best Practices Guide. There are also tuning guides for various architectures.
  2. Experiment with printf() inside the kernel. Try printing out the values of threadIdx.x and blockIdx.x for some or all of the threads. Do they print in sequential order? Why or why not? (A minimal example appears after this list.)
  3. Print the value of threadIdx.y or threadIdx.z (or blockIdx.y) in the kernel. (Likewise for blockDim and gridDim). Why do these exist? How do you get them to take on values other than 0 (1 for the dims)?
  4. If you have access to a Pascal-based GPU, try running add_grid.cu on it. Is performance better or worse than the K80 results? Why? (Hint: read about Pascal’s Page Migration Engine and the CUDA 8 Unified Memory API.) For a detailed answer to this question, see the post Unified Memory for CUDA Beginners.
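For exercise 2, here is a minimal sketch of the kind of kernel you might write (whoami is just a name I made up):

#include <cstdio>

__global__ void whoami(void)
{
  // Each thread prints its own block and thread index.
  printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
  whoami<<<2, 4>>>();       // 2 blocks of 4 threads each
  cudaDeviceSynchronize();  // also flushes the device-side printf buffer
  return 0;
}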

 

Where To From Here?

I plan to follow up this post with more CUDA programming material, but to keep you busy for now, there is a series of older introductory posts you can continue with (which I plan to update or replace as needed in the future).

There is also a series of CUDA Fortran posts mirroring the above, starting with An Easy Introduction to CUDA Fortran.

You might also be interested in signing up for the online course on CUDA programming from Udacity and NVIDIA.

There is a wealth of other content on CUDA C++ and other GPU computing topics here on the NVIDIA Parallel Forall developer blog, so look around!

 
