"GPU high-performance programming CUDA combat" (CUDA By Example) reading notes

Foreword

This book was published in 2011, which, given how fast the field moves, already makes it an old book. Still, because it is simple and easy to follow, it remains a recommended introduction. The cover first:
cuda by example
Since the book is relatively old and everyone's goals differ, these notes cover only the material related to the basic code and skip the image-processing content.
The code for the book is available here: csdn resources

Chapters 1 and 2: Background

Chapter 1 talks about the evolution of CUDA, and Chapter 2 covers how to install it. If the installation gives you trouble, see: Install CUDA.

Chapter 3 Introduction to CUDA C

  1. output hello world

    
    #include <stdio.h>
    
    __global__ void kernel() {
      printf("hello world\n");
    }
    
    int main() {
      kernel<<<1, 1>>>();
      // wait for the kernel to finish so the device-side printf
      // output is flushed before the program exits
      cudaDeviceSynchronize();
      return 0;
    }
    

    Two differences from an ordinary C program are worth noting:

    • The function definition carries the __global__ qualifier, indicating that the function runs on the GPU.
    • Besides the regular arguments, the call adds the <<<>>> decoration. The numbers inside it are passed to the CUDA runtime; what they do is covered in the next chapter.
  2. A slightly more advanced version

    
    #include <stdio.h>
    
    __global__ void add(int a, int b, int *c) {
      *c = a + b;
    }
    
    int main() {
      int c;
      int *dev_c;
      // allocate space for one int on the GPU
      cudaMalloc((void**)&dev_c, sizeof(int));
      add<<<1, 1>>>(2, 7, dev_c);
      // copy the result back to the host
      cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
      printf("2 + 7 = %d\n", c);
      cudaFree(dev_c);
      return 0;
    }

    This involves moving memory between the GPU and the host: cudaMalloc allocates a block of GPU memory, the kernel writes its result there, and cudaMemcpy then copies the content back from the GPU to the host. It's that simple.
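
    One habit worth picking up early: check the return value of every CUDA runtime call. Below is a minimal sketch of such a check using a helper function of my own naming (the book's own samples wrap calls in a similar HANDLE_ERROR macro); it reuses the add kernel from above.

    #include <stdio.h>
    #include <stdlib.h>
    
    // hypothetical helper, not from the book: abort if a CUDA call fails
    static void check(cudaError_t err, const char *msg) {
      if (err != cudaSuccess) {
        printf("%s failed: %s\n", msg, cudaGetErrorString(err));
        exit(1);
      }
    }
    
    int main() {
      int c;
      int *dev_c;
      check(cudaMalloc((void**)&dev_c, sizeof(int)), "cudaMalloc");
      add<<<1, 1>>>(2, 7, dev_c);   // the same kernel as above
      check(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost),
            "cudaMemcpy");
      printf("2 + 7 = %d\n", c);
      cudaFree(dev_c);
      return 0;
    }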

Chapter 4 CUDA C Parallel Programming

This chapter is where the appeal of CUDA parallel programming starts to show.
Here is the code for adding two arrays element by element:

#include<stdio.h>

#define N   10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // this block handles the data at its block index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}

The part beginners find hardest to understand is the kernel function:

 __global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

The biggest difference between GPU programming and CPU programming shows up here: the array addition needs no loop. Why not? Because tid already does the work of the whole loop. Since the kernel is launched as <<<N,1>>>, tid here is the block index, and each of the N copies of the kernel is responsible for one element of the array. The 10 loop iterations are therefore split across ten parallel executions of the kernel that run at the same time, each seeing a different value of tid.
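
For comparison, here is a sketch (mine, not the book's) of the serial CPU version that the kernel replaces; the loop index plays exactly the role that blockIdx.x plays on the GPU:

// serial CPU equivalent of the add kernel: the for loop walks over the
// same indices that the GPU covers by launching N blocks in parallel
void add_cpu(int *a, int *b, int *c) {
    for (int tid = 0; tid < N; tid++) {
        c[tid] = a[tid] + b[tid];
    }
}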

Chapter 5 Thread Cooperation

GPU logical structure

This chapter introduces thread blocks and grids, that is, the meaning of the numbers inside <<<>>>. First, what is a thread block? As the name suggests, it is a block made up of threads. The logical structure of the GPU is shown in the following figure:
gpu logical structure
This figure is from the official NVIDIA documentation. There, the CTA is the thread block, the Grid is a grid made up of thread blocks, each thread block contains several warps, and the smallest unit inside a warp is the thread (the documentation calls these lanes, i.e. threads within a warp).
After this bit of background, we get to the content of the chapter, which rests on the following fact:

The hardware limits the number of thread blocks in a single launch to no more than 65535. Likewise, the hardware limits the number of threads per block with which a kernel can be launched.

Because of these limits, operating on longer arrays takes a slightly more elaborate combination of indices rather than the naive threadIdx-only approach.
The following kernel handles arrays of arbitrary length:

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Once you understand the line int tid = threadIdx.x + blockIdx.x * blockDim.x;, this chapter is essentially done. First, why .x, and are there .y and .z? There are, but they are not used here (that is, in this book). Thread blocks and grids are in fact not one-dimensional: a thread block has three dimensions and a grid has two, hence the .x. For now we can ignore the extra dimensions and treat everything as one-dimensional. Then look at the picture below:
thread grid

This is a one-dimensional grid of threads. threadIdx.x is the index of each thread within its own thread block, i.e. thread 0, thread 1 in the figure. The problem is that every block has a thread 0, so how do we get these different thread 0s to work on different positions? That is what blockIdx.x is for: it is the index of the thread block. Multiply it by blockDim.x, the number of threads per block, and add threadIdx.x, and every thread gets its own distinct, increasing index, which lets the programmer address a relatively long array. For example, with 128 threads per block, thread 5 of block 2 gets tid = 5 + 2 * 128 = 261.

But a problem remains: if the array is too large, there are not enough threads for a one-to-one mapping. That is what tid += blockDim.x * gridDim.x; is for: it lets one thread handle several indices. How? After processing the element at the current tid, the thread advances tid by the total number of threads. blockDim.x is the number of threads in a block and gridDim.x is the number of blocks in the grid, so their product is the number of all threads (this pattern is commonly called a grid-stride loop).
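
As a concrete illustration (a host-side sketch of my own, not the book's listing), this kernel can be launched with fixed, modest launch parameters no matter how large N is, because the stride loop mops up whatever is left over:

#define N (33 * 1024)   // assumed array length for this sketch

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // 128 blocks of 128 threads = 16384 threads in total; each thread
    // strides through the array, so N can be much larger than that
    add<<<128, 128>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}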

So much for the indexing side of thread cooperation. Here is another, more intuitive picture:
More intuitive grid diagram

Shared memory

Shared memory is a good thing. It can only be used inside a block, and access to it is extremely fast; apparently it is carved out of the L1 cache closest to the arithmetic units, which is why it is so quick. So we are going to put it to use.
The example here is a dot product: multiply two vectors element by element and sum the products, so that
dot product
you end up with a single number. The main idea is as follows:

  • Within each block, reduce the per-thread partial sums by repeatedly adding the second half to the first half:
    addition
  • Synchronize between reduction steps so that no thread races ahead.
  • Hand the final, low-parallelism summation over to the CPU.
    The specific code looks like this:
__global__ void dot(float *a, float *b, float *c) {
    // shared array with one slot per thread in the block
    // (threadsPerBlock is a compile-time constant, e.g. 256, defined elsewhere)
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    // store this thread's partial sum in the cache
    cache[cacheIndex] = temp;
    // synchronize: every thread must reach this point before execution continues
    __syncthreads();
    // the code below requires the thread count to be a power of 2,
    // otherwise the repeated division by 2 breaks down
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        // this sync keeps thread 0 from racing to the end and exiting
        // before the other threads have finished their additions
        __syncthreads();
        i /= 2;
    }
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

Note that the array c holds only partial results: each block writes one partial sum, so what comes back is one value per block, and the CPU then performs the final addition.
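
A rough host-side sketch of that last step (mine, assuming the per-block partial sums have already been copied back into a host array partial_c of length blocksPerGrid):

// finish the dot product on the CPU: sum the per-block partial results
// (partial_c and blocksPerGrid are assumed to come from the launch code)
float result = 0;
for (int i = 0; i < blocksPerGrid; i++) {
    result += partial_c[i];
}
printf("dot product = %f\n", result);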

Chapter 9 Atomic Operations

An atomic operation, rather like the P/V (semaphore) operations in an operating system, can only be performed by one thread at a time on a given location. The advantage, naturally, is that there are no errors from simultaneous reads and writes; the obvious disadvantage is that it increases the program's running time.

Calculate histogram

Principle: suppose the data values fall in the range [0,255]. Define an array unsigned int histo[256], traverse the data array data[N], and do histo[data[i]]++ for each element; at the end histo holds the histogram. On the GPU, many threads may increment the same bin at once, which is where atomic operations come in.
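
For reference, a serial sketch of the counting principle (mine, not a listing from the book); the GPU kernel below replaces the plain ++ with atomicAdd:

// serial histogram: the data values are bytes, so 256 bins are enough
void histo_cpu(unsigned char *data, long size, unsigned int *histo) {
    for (long i = 0; i < size; i++)
        histo[data[i]]++;
}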

__global__ void histo_kernel(unsigned char *buffer, long size,
        unsigned int *histo) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

The atomicAdd here means that only one thread at a time can update a given bin; other threads have to wait their turn. However, it is extremely slow: the book reports that this version runs about four times slower than the plain CPU version. So we need something better.

An improved histogram kernel

The reason the atomic version is slow is that with a large amount of data, many threads hit the same bin at the same time, so their updates get serialized. This time we fix the block size at 256 threads (the number is not mandatory, it just matches the number of bins), give each block a temporary histogram in shared memory, and only at the end add these per-block histograms into the global one. The contention thus shrinks from all threads in the grid to the 256 threads within one block, and since there are only 256 bins, conflicts drop dramatically. See the following code:

__global__ void histo_kernel(unsigned char *buffer, long size,
        unsigned int *histo) {
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    // wait until every thread has finished initializing its bin
    __syncthreads();
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&temp[buffer[i]], 1);
        i += offset;
    }
    __syncthreads();
    // wait until every thread has finished counting, then add this
    // block's temporary histogram into the global one
    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
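
A host-side sketch of how such a kernel might be launched (my own assumptions: SIZE bytes of input already live in dev_buffer, dev_histo holds 256 zeroed bins, and 256 threads per block to match the bin count; tying the block count to the multiprocessor count is a heuristic, not a hard rule):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount * 2;   // heuristic block count

unsigned int histo[256];
histo_kernel<<<blocks, 256>>>(dev_buffer, SIZE, dev_histo);
cudaMemcpy(histo, dev_histo, 256 * sizeof(unsigned int),
           cudaMemcpyDeviceToHost);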

Chapter 10 Streams

  1. Page-locked memory
    Once allocated, this kind of memory is locked in host RAM and its physical address never changes, which speeds up transfers between host and GPU.
  2. CUDA streams
    A stream is conceptually like a thread in Java-style multithreading: you put different pieces of work into different streams so that some operations run concurrently, for example executing a kernel while a memory copy is in flight (see the sketch after this list):
    cuda stream
    The book goes on to discuss some optimization tricks for streams, but when I tried them myself they had no effect; CUDA's support for streams may have changed since then. Streams will come up again in future posts.
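
A minimal sketch of the overlap idea (mine, not the book's listing), using a made-up kernel and pinned host memory, which is required for asynchronous copies to actually overlap with execution:

#include <cuda_runtime.h>

#define N (1024 * 1024)

// hypothetical kernel just to have something to overlap with the copy
__global__ void scale(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) data[i] *= 2.0f;
}

int main(void) {
    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // pinned (page-locked) host memory, allocated with cudaHostAlloc
    float *host_a;
    cudaHostAlloc((void**)&host_a, N * sizeof(float), cudaHostAllocDefault);

    float *dev_a0, *dev_a1;
    cudaMalloc((void**)&dev_a0, N * sizeof(float));
    cudaMalloc((void**)&dev_a1, N * sizeof(float));

    // queue a copy in stream0 and a kernel in stream1; work queued in
    // different streams may execute concurrently
    cudaMemcpyAsync(dev_a0, host_a, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    scale<<<N / 256, 256, 0, stream1>>>(dev_a1);

    // wait for everything queued in both streams to finish
    cudaStreamSynchronize(stream0);
    cudaStreamSynchronize(stream1);

    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    cudaFreeHost(host_a);
    cudaFree(dev_a0);
    cudaFree(dev_a1);
    return 0;
}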

Chapter 11 Multi-GPU

From this chapter I mainly read the first section, on zero-copy memory. It is also easy to understand: a block of memory is allocated on the host side, and the GPU can access it directly without first copying it into the GPU's own video memory. How it compares with ordinary page-locked memory in performance would need an experiment to verify.
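
A minimal sketch of the mechanics (mine, not the book's code): allocate mapped, page-locked host memory, then ask the runtime for the device-side pointer that aliases it.

// zero-copy sketch: a hypothetical kernel touches host memory directly
__global__ void touch(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

int main(void) {
    const int n = 1024;
    // enable mapping of host allocations into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *host_ptr, *dev_ptr;
    // page-locked host memory that is also mapped for device access
    cudaHostAlloc((void**)&host_ptr, n * sizeof(float), cudaHostAllocMapped);
    // device-side pointer that refers to the same physical memory
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

    touch<<<(n + 255) / 256, 256>>>(dev_ptr, n);
    cudaDeviceSynchronize();   // make sure the GPU is done before the CPU reads

    cudaFreeHost(host_ptr);
    return 0;
}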

====================2017.7.30 update =========================
While reading code I noticed that there are three function qualifiers:
(1) __host__ int foo(int a){} is the same as foo(int a){} in plain C or C++: a function called and executed by the CPU.
(2) __global__ void foo(int a){} marks a kernel function, a set of parallel tasks executed by the GPU; it is called in the form foo<<<...>>>(a) or through the driver API (note that a kernel must return void). At present a global function must be called from the CPU, which hands the parallel work to the GPU's task scheduling unit; as GPU programmability improves it may become callable from the GPU as well.
(3) __device__ int foo(int a){} marks a function called from threads running on the GPU. Since Tesla-architecture GPUs allow threads to call functions, this is in practice implemented by expanding the __device__ function inline into its caller and compiling the result into the binary, rather than as a real function call.

Concretely: a function with the __device__ qualifier can only execute on the GPU, so it cannot call ordinary host functions; a __global__ function bridges the CPU and GPU sides (it is launched from the host but runs on the device), and it likewise cannot call ordinary host functions; __host__ is the default qualifier for ordinary functions, which can call other ordinary functions as usual.
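
A tiny sketch of how the three qualifiers combine in practice (my own example, not from the book):

// __device__: callable only from GPU code
__device__ float square(float x) {
    return x * x;
}

// __host__ __device__: compiled for both sides, callable from either
__host__ __device__ float twice(float x) {
    return 2.0f * x;
}

// __global__: a kernel; must return void and is launched from the host
__global__ void compute(float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = twice(square((float)i));   // device-side calls
}

// __host__ (the default): ordinary CPU code that launches the kernel
int main(void) {
    const int n = 256;
    float *dev_out;
    cudaMalloc((void**)&dev_out, n * sizeof(float));
    compute<<<1, n>>>(dev_out, n);
    cudaDeviceSynchronize();
    cudaFree(dev_out);
    return 0;
}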
