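The GPU (device) is used for heterogeneous computing, and the CPU (host) controls the GPU. This is a minimal introduction. The APIs are the CUDA Driver API, the CUDA Runtime API, and the CUDA libraries.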

hello world

// main.cu
#include <iostream>
#include <stdio.h>

void cpu_hello(){
    printf("cpu\n");
}
__global__ void cuda_hello(){
    printf("gpu\n");
}

int main() {
    cpu_hello();
    cuda_hello<<<1,10>>>();
    cudaDeviceSynchronize(); // without this, the program may exit before the GPU output is printed

    std::cout << "Hello, World!" << std::endl;
    return 0;
}

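Compiling

nvcc -arch=sm_86 -o CUDATEST main.cu -run
For RTX 30-series cards (30**), compile with sm_86; if compilation fails or the kernel does not run, try a lower value.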

# https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation
nvcc x.cu \
 -gencode arch=compute_50,code=sm_50 \
 -gencode arch=compute_60,code=sm_60 \
 -gencode arch=compute_70,code=\"compute_70,sm_70\"

This embeds binary code compatible with compute capabilities 5.0 and 6.0, plus both PTX and binary code compatible with compute capability 7.0.

global function and others

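`https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-enabled-gpus`
`__global__ void fun(){}`	Runs on the device; callable from the host (and from the device on GPUs of compute capability 5.0 or higher)	Must return void
`__device__`	Runs on the device; callable only from the device
`__host__`	Runs on the host; callable only from the host	May be omitted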

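Blocks and threads

2 blocks, 3 threads each (the kernel body runs 2*3 times)	add<<<2,3>>>(a,b,c)	Similar to MPI: use the block index or thread index to decide which part of the work each thread does
gridDim.x	Number of blocks
blockDim.x	Number of threads in one block
blockIdx.x	Index of the block
threadIdx.x	Index of the thread within its block
Index among all threads	`blockIdx.x*blockDim.x+threadIdx.x`	`blockIdx.x*3+threadIdx.x`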
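
As a small illustration of the global index formula above, the sketch below launches 2 blocks of 3 threads and has each thread store its own index. The kernel name `write_global_index` and the host helper `check_global_index` are made up for this example, and unified memory is used only to keep the host code short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread computes its index among all threads and stores it.
__global__ void write_global_index(int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 0..5 for <<<2,3>>>
    out[idx] = idx;
}

// Hypothetical host helper: returns true if every slot holds its own index.
bool check_global_index() {
    const int blocks = 2, threads = 3, n = blocks * threads;
    int *out = nullptr;
    cudaError_t err = cudaMallocManaged(&out, n * sizeof(int));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) out[i] = -1;  // sentinel: "not written yet"

    write_global_index<<<blocks, threads>>>(out);
    err = cudaDeviceSynchronize();  // wait for the kernel before reading on the host
    assert(err == cudaSuccess);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (out[i] == i);
    cudaFree(out);
    return ok;
}
```

Calling `check_global_index()` from `main` should return true when a CUDA device is available, since the six threads cover indices 0 through 5 exactly once.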

dim3

	`dim3 dimBlock(x,y); Kernel<<<2,dimBlock>>>(argv);`	`blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;`
	`dim3 dimGrid(x,y,z); Kernel<<<dimGrid,2>>>(argv);`

Index when both the grid and the block are 3D (which thread of which block):

int blockId = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y; // which block (plays the role of blockIdx.x in the 1D formula): blockIdx.* are the coefficients, gridDim.* the weights
int Idx = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x; // the trailing terms play the role of threadIdx.x in the 1D formula
Matrix addition: https://b23.tv/AWa1Pml
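
One way to convince yourself that the 3D formula gives every thread a unique flat index is to count how often each index is produced. The sketch below does that for an arbitrarily chosen 2x3x2 grid of 4x2x2 blocks; the names `mark_3d_index` and `check_3d_index` and the chosen dimensions are assumptions for illustration only.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Every thread computes its flat index with the 3D formula and bumps a counter.
__global__ void mark_3d_index(int *hits) {
    int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;
    int idx = blockId * (blockDim.x * blockDim.y * blockDim.z)
            + threadIdx.z * (blockDim.x * blockDim.y)
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    atomicAdd(&hits[idx], 1);  // atomic, so a duplicated index would show up as 2
}

// Hypothetical host helper: true if each flat index was produced exactly once.
bool check_3d_index() {
    dim3 grid(2, 3, 2);   // 12 blocks
    dim3 block(4, 2, 2);  // 16 threads per block
    const int n = 12 * 16;
    int *hits = nullptr;
    cudaError_t err = cudaMallocManaged(&hits, n * sizeof(int));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) hits[i] = 0;

    mark_3d_index<<<grid, block>>>(hits);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hits[i] == 1);
    cudaFree(hits);
    return ok;
}
```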

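Grid-stride loop

CUDA programming: flexible kernel functions with grid-stride loops
So far we have assumed that a single grid has enough threads to process the whole array in one pass, but in practice that is often not the case.
The loop stride is blockDim.x * gridDim.x, the total number of threads in the grid, so each thread processes elements i, i + stride, i + 2 * stride, and so on until it passes n.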

// Monolithic version: one thread per element, so the grid needs at least n threads.
__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Grid-stride version: works for any n, whatever the grid size.
__global__
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}
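
To see the grid-stride version handle an array larger than the grid, the sketch below launches only 2 blocks of 64 threads (128 threads) on an array of arbitrary length. The kernel is renamed `saxpy_grid_stride` to avoid clashing with the snippets above, and the helper `saxpy_mismatches` and the launch sizes are assumptions chosen for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void saxpy_grid_stride(int n, float a, const float *x, float *y) {
    // Each thread starts at its global index and jumps by the grid size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host helper: runs y = 2*x + y and returns the number of wrong elements.
int saxpy_mismatches(int n) {
    float *x = nullptr, *y = nullptr;
    cudaError_t err = cudaMallocManaged(&x, n * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMallocManaged(&y, n * sizeof(float));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 1.0f; }

    saxpy_grid_stride<<<2, 64>>>(n, 2.0f, x, y);  // 128 threads, far fewer than n
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (y[i] != 2.0f * i + 1.0f) ++bad;  // small integers are exact in float
    cudaFree(x);
    cudaFree(y);
    return bad;
}
```

With n = 1000 each thread handles seven or eight elements, and `saxpy_mismatches(1000)` should return 0.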

Error handling

How to do error checking in CUDA
cudaError_t err = cudaGetLastError(); picks up errors from __global__ kernel launches, which cannot report them through a return value.
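
A common pattern is to check `cudaGetLastError()` right after a launch (for launch errors) and the result of `cudaDeviceSynchronize()` afterwards (for errors during execution). The sketch below shows one possible shape of that pattern; `report_cuda_error`, `CUDA_REPORT`, `launch_with_bad_config` and `launch_and_wait` are hypothetical names, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Hypothetical helper: prints a message on failure and passes the error code through.
inline cudaError_t report_cuda_error(cudaError_t err, const char *what,
                                     const char *file, int line) {
    if (err != cudaSuccess)
        fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what,
                cudaGetErrorString(err));
    return err;
}
#define CUDA_REPORT(call) report_cuda_error((call), #call, __FILE__, __LINE__)

// Deliberately invalid launch: 4096 threads per block exceeds the per-block
// limit (1024 on current GPUs), so the launch itself is rejected.
cudaError_t launch_with_bad_config() {
    empty_kernel<<<1, 4096>>>();
    return CUDA_REPORT(cudaGetLastError());
}

// Valid launch: check the launch first, then wait and check execution.
cudaError_t launch_and_wait() {
    empty_kernel<<<1, 32>>>();
    cudaError_t err = CUDA_REPORT(cudaGetLastError());
    if (err != cudaSuccess) return err;
    return CUDA_REPORT(cudaDeviceSynchronize());
}
```

`cudaGetLastError()` also resets the stored error, which is why the valid launch after the failed one can still succeed.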

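Memory

Unified memory

cudaMallocManaged() allocates unified memory that both the CPU and the GPU can access. Pages migrate automatically on access, but the resulting page faults slow the program down.
int* x; cudaMallocManaged(&x, sizeof(int) * 2);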
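
A minimal sketch of the allocation above in use: the host writes two integers, a kernel increments them, and the host reads the result through the same pointer. The kernel `add_one` and the helper `unified_memory_sum` are names invented for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void add_one(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

// Hypothetical host helper: returns x[0] + x[1] after the kernel has run.
int unified_memory_sum() {
    int *x = nullptr;
    cudaError_t err = cudaMallocManaged(&x, sizeof(int) * 2);
    assert(err == cudaSuccess);
    x[0] = 10;  // written on the host, no explicit copy needed
    x[1] = 20;

    add_one<<<1, 2>>>(x, 2);
    err = cudaDeviceSynchronize();  // required before the host touches x again
    assert(err == cudaSuccess);

    int sum = x[0] + x[1];  // 11 + 21
    cudaFree(x);
    return sum;
}
```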

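Device memory

C	CUDA C	Notes
malloc	cudaMalloc	Allocates device memory
memcpy	cudaMemcpy	Synchronous; takes a direction argument
free	cudaFree	Frees device memory
	cudaMallocHost	Allocates page-locked (pinned) host memory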
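
The table above maps onto the usual allocate, copy in, compute, copy out, free sequence. The following is a sketch of that round trip; the kernel `scale_by` and the helper `device_memory_roundtrip` are illustrative names, and the array size is arbitrary.

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void scale_by(float *d, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

// Hypothetical host helper: doubles 8 values on the GPU and verifies them on the host.
bool device_memory_roundtrip() {
    const int n = 8;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(float));             // like malloc
    assert(err == cudaSuccess);
    err = cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    assert(err == cudaSuccess);

    scale_by<<<1, n>>>(d, n, 2.0f);

    // cudaMemcpy on the default stream waits for the kernel before copying back.
    err = cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    assert(err == cudaSuccess);
    cudaFree(d);                                                      // like free

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h[i] == 2.0f * i);
    return ok;
}
```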

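cudaMemPrefetchAsync (prefetch)

Unified memory allocations incur page migration, so cudaMallocManaged is usually paired with cudaMemPrefetchAsync.
The destination-device argument selects whether the data is moved to the CPU or to the GPU.
Call it before passing the pointer to a __global__ function:
int deviceId;
cudaGetDevice(&deviceId);
cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId); // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host.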

shared memory

__shared__ float sh_arr[128];
https://www.jianshu.com/p/8d17817a7488
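
As a sketch of how a `__shared__` array like the one above is typically used, the kernel below stages 128 values in shared memory, synchronizes the block, and writes them back in reverse order. It assumes a single block of exactly 128 threads; `reverse_in_block` and `check_shared_reverse` are names made up for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Assumes one block of exactly 128 threads.
__global__ void reverse_in_block(float *data) {
    __shared__ float sh_arr[128];
    int t = threadIdx.x;
    sh_arr[t] = data[t];
    __syncthreads();  // every element must be staged before any thread reads another's
    data[t] = sh_arr[blockDim.x - 1 - t];
}

// Hypothetical host helper: true if the array 0..127 comes back as 127..0.
bool check_shared_reverse() {
    const int n = 128;
    float *data = nullptr;
    cudaError_t err = cudaMallocManaged(&data, n * sizeof(float));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) data[i] = (float)i;

    reverse_in_block<<<1, n>>>(data);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (data[i] == (float)(n - 1 - i));
    cudaFree(data);
    return ok;
}
```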

constant

__constant__: read-only from device code, with global scope
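
Because device code cannot write a `__constant__` variable, the host fills it with `cudaMemcpyToSymbol` before the launch. The sketch below assumes a two-element parameter array; `k_params`, `apply_affine` and `check_constant_memory` are hypothetical names.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Read-only parameters visible to every kernel in this file.
__constant__ float k_params[2];  // {scale, offset}

__global__ void apply_affine(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = k_params[0] * x[i] + k_params[1];
}

// Hypothetical host helper: true if y[i] == 3*i + 1 for all i.
bool check_constant_memory() {
    const int n = 64;
    const float host_params[2] = {3.0f, 1.0f};
    cudaError_t err = cudaMemcpyToSymbol(k_params, host_params, sizeof(host_params));
    assert(err == cudaSuccess);

    float *x = nullptr, *y = nullptr;
    err = cudaMallocManaged(&x, n * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMallocManaged(&y, n * sizeof(float));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) { x[i] = (float)i; y[i] = 0.0f; }

    apply_affine<<<1, n>>>(x, y, n);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (y[i] == 3.0f * i + 1.0f);
    cudaFree(x);
    cudaFree(y);
    return ok;
}
```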

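SM (streaming multiprocessor)

GPU: each GPU has several SMs, and the SMs run in parallel and independently of each other.
A CUDA stream (cudaStream_t) is a separate concept: a queue of operations that the device executes in issue order.
Declare a stream: cudaStream_t s1;
Create the stream: cudaStreamCreate(&s1);
Use the stream: func_kernel<<<blocks, threads, 0, s1>>>
Destroy the stream: cudaStreamDestroy(s1);
CUDA notes: using streams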
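
Putting the four stream calls above together, the sketch below issues two independent kernels and copies on two streams and then waits for each stream. The kernel `fill_value`, the helper `run_two_streams`, and the sizes are assumptions for illustration; whether the two streams actually overlap depends on the hardware.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__global__ void fill_value(int *p, int n, int v) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        p[i] = v;
}

// Hypothetical host helper: true if both buffers hold the value written on their stream.
bool run_two_streams() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);

    cudaStream_t s1, s2;
    cudaError_t err = cudaStreamCreate(&s1);
    assert(err == cudaSuccess);
    err = cudaStreamCreate(&s2);
    assert(err == cudaSuccess);

    int *a = nullptr, *b = nullptr;
    err = cudaMalloc(&a, bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc(&b, bytes);
    assert(err == cudaSuccess);

    // Work queued on s1 and s2 is ordered within each stream, not across them.
    fill_value<<<32, 256, 0, s1>>>(a, n, 1);
    fill_value<<<32, 256, 0, s2>>>(b, n, 2);

    std::vector<int> ha(n), hb(n);
    cudaMemcpyAsync(ha.data(), a, bytes, cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(hb.data(), b, bytes, cudaMemcpyDeviceToHost, s2);

    err = cudaStreamSynchronize(s1);  // wait for everything queued on s1
    assert(err == cudaSuccess);
    err = cudaStreamSynchronize(s2);
    assert(err == cudaSuccess);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (ha[i] == 1) && (hb[i] == 2);

    cudaFree(a);
    cudaFree(b);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return ok;
}
```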

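CUDA synchronization

Atomic functions

The meaning is the same as for atomics in multithreaded CPU code: the read-modify-write happens as one indivisible step, as if a lock were taken and released automatically, so each thread has exclusive access while it updates the value.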

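// ARRAY_SIZE is assumed to be defined elsewhere; the original does not show it.
__global__ void increment_atomic(int *g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % ARRAY_SIZE;   // many threads map to the same element
    atomicAdd(&g[i], 1);  // originally g[i] = g[i] + 1; which is a data race
}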
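
To make the difference concrete, the sketch below runs the racy update and the atomic update with the same launch and sums the counters afterwards. The array size of 10, the 80x128 launch, and the names `increment_naive`, `increment_atomic_demo` and `count_increments` are assumptions for illustration; only the atomic total is deterministic.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kArraySize = 10;  // assumed value for this sketch

__global__ void increment_naive(int *g) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) % kArraySize;
    g[i] = g[i] + 1;  // data race: concurrent updates can be lost
}

__global__ void increment_atomic_demo(int *g) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) % kArraySize;
    atomicAdd(&g[i], 1);  // indivisible read-modify-write
}

// Hypothetical host helper: launches 80 * 128 = 10240 threads and
// returns the sum of all counters.
int count_increments(bool use_atomic) {
    int *g = nullptr;
    cudaError_t err = cudaMallocManaged(&g, kArraySize * sizeof(int));
    assert(err == cudaSuccess);
    for (int i = 0; i < kArraySize; ++i) g[i] = 0;

    if (use_atomic) increment_atomic_demo<<<80, 128>>>(g);
    else            increment_naive<<<80, 128>>>(g);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    int total = 0;
    for (int i = 0; i < kArraySize; ++i) total += g[i];
    cudaFree(g);
    return total;
}
```

`count_increments(true)` should return 10240, one increment per thread. `count_increments(false)` usually returns less, and the exact value varies from run to run.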

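`__syncthreads()`

Synchronizes the threads of one block: no thread continues past the call until every thread in the block has reached it.

__threadfence()
After a thread calls __threadfence(), the global and shared memory accesses it made before the call are complete, and their results are visible to all threads in the grid.

__threadfence_block()
After a thread calls __threadfence_block(), the global and shared memory accesses it made before the call are complete, and their results are visible to all threads in the block.

The key role of these two functions is to let other threads know promptly that the reads or writes to global or shared memory have finished. They only order memory accesses; unlike __syncthreads(), they do not make threads wait for each other.

cudaStreamSynchronize()/cudaEventSynchronize()
In host code, cudaDeviceSynchronize() (the replacement for the deprecated cudaThreadSynchronize()) synchronizes the CPU with the GPU.
A kernel launch returns control to the host asynchronously; calling this function ensures that all device threads have finished.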

Profilers

NVIDIA Nsight GPU trace analyzer

https://developer.nvidia.com/nvidia-visual-profiler

Reduction algorithm (∑)

The CUDA Handbook: A Comprehensive Guide to GPU Programming (Chinese edition: CUDA专家手册：GPU编程权威指南)
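
The notes do not spell out a reduction kernel, so the following is only one common shape of a sum reduction, not the version from the book: each thread accumulates a grid-stride partial sum, the block combines its partials in shared memory with a tree of halving strides, and one thread per block adds the block result to the total with atomicAdd. The names `block_sum` and `reduce_sum_0_to`, the block size of 256, and the 4-block launch are all assumptions.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // must be a power of two for the halving loop below

__global__ void block_sum(const int *in, int *total, int n) {
    __shared__ int partial[kBlock];
    int t = threadIdx.x;

    // Step 1: each thread sums its grid-stride slice of the input.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + t; i < n; i += blockDim.x * gridDim.x)
        local += in[i];
    partial[t] = local;
    __syncthreads();

    // Step 2: tree reduction inside the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();  // outside the if, so every thread reaches it
    }

    // Step 3: one atomic add per block.
    if (t == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host helper: returns 0 + 1 + ... + (n - 1) computed on the GPU.
int reduce_sum_0_to(int n) {
    int *in = nullptr, *total = nullptr;
    cudaError_t err = cudaMallocManaged(&in, n * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMallocManaged(&total, sizeof(int));
    assert(err == cudaSuccess);
    for (int i = 0; i < n; ++i) in[i] = i;
    *total = 0;

    block_sum<<<4, kBlock>>>(in, total, n);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    int result = *total;
    cudaFree(in);
    cudaFree(total);
    return result;
}
```

For n = 1000 the expected result is 999 * 1000 / 2 = 499500.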

CG

tips

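Because an SM schedules threads in warps of 32, the number of threads per block should preferably be a multiple of 32.
Variables declared with `__device__ __shared__` are placed in the shared memory of the SM. They can be shared by all threads in the block, and their lifetime ends when the block ends (the location is shown in a figure in the original post). For convenience you can simply write `__shared__`.
Reading the GPU temperature: https://github.com/jordanbonilla/Read_GPU_Temperature_CUDA
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Querying the GPU's compute capability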
Hillis Steele Scan
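
For the tip about querying the compute capability, one option is `cudaGetDeviceProperties`, whose `major` and `minor` fields give the value to use in `-arch=sm_XY`. The helper name `print_compute_capability` below is made up for this sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the compute capability of each visible GPU
// and returns how many devices were found (0 if the query fails).
int print_compute_capability() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("device %d: %s, compute capability %d.%d (nvcc -arch=sm_%d%d)\n",
               d, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return count;
}
```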

programming class

image processing

CUDA Programming (2): writing a convolution on the GPU https://www.bilibili.com/video/BV1c4411z7o2/?
A BGR-to-grayscale program, meant to give readers a rough idea of how a CUDA program is written
Image Filtering using CUDA
https://github.com/teknoman117/cuda/tree/master/imgproc_example
https://github.com/LitLeo/OpenCUDA
The CMake version of cuda_by_example
https://github.com/paramhanji/CUDA-CNN
https://github.com/cheesinglee/cuda-PHDSLAM

teaching class

https://github.com/PacktPublishing/Learn-CUDA-Programming
https://github.com/csc-training/CUDA
Learn CUDA by playing a game (note: this project is written in Python and uses numba)
Lane-line detection: enhancing yellow lane lines https://blog.csdn.net/YaoJiawei329/article/details/111032256?
CUDA programming resources
