CUDA programming

Function declaration:

  •   __global__ void KernelFunc()   Executed on: device   Callable from: host
  •   __device__ float DeviceFunc()  Executed on: device   Callable from: device
  •   __host__ float HostFunc()      Executed on: host     Callable from: host

  __global__: 

      The return value must be void

  __device__:

      Historically inlined by default (early architectures had no real device function calls); newer compilers decide on their own, and __forceinline__ / __noinline__ can override the choice.

  __global__ and __device__ functions (see the sketch below):

    1. Minimize recursion
    2. Don't use static variables
    3. Use malloc sparingly
    4. Be careful with function calls made through pointers
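
A minimal sketch tying the three qualifiers together (the function names are illustrative, not from any particular API):

__device__ float square(float x) { return x * x; }      // runs on the device, callable from device code

__host__ float squareOnHost(float x) { return x * x; }  // runs on the host, callable from host code

__global__ void squareKernel(float* out, const float* in, int n)  // kernels must return void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = square(in[i]);   // a kernel may call __device__ functions
}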
  • Vector data type:
    •   char[1-4],uchar[1-4]
    •   short[1-4],ushort[1-4]
    •   int[1-4],uint[1-4]
    •   long[1-4],ulong[1-4]
    •   longlong[1-4], ulonglong[1-4]
    •   float[1-4]
    •   double1,double2
    •   Usable in both host and device code; constructed with the function make_<type name>

int2 i2 = make_int2(1, 2);

float4 f4 = make_float4(1.0f, 2.0f, 3.0f, 4.0f);

    • Access via .x, .y, .z, and .w

int2 i2 = make_int2(1, 2);

int x = i2.x;

int y = i2.y;

  • math function
    •   Partial function list
      • sqrt,rsqrt
      • exp,log
      • sin, cos, tan, sincos
      • asin,acos,atan2
      • trunc,ceil,floor
    • Intrinsic (built-in) functions
      • Device code only
      • Faster but less accurate
      • Prefixed with __, for example (see the sketch below):
        • __expf, __logf, __sinf, __powf, ......
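
A small sketch contrasting a standard math function with its intrinsic counterpart (the kernel name is illustrative):

__global__ void sineBoth(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float accurate = sinf(in[i]);    // standard single-precision math function
        float fast     = __sinf(in[i]);  // intrinsic: faster, lower accuracy
        out[i] = accurate - fast;        // the difference exposes the precision loss
    }
}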
  • thread synchronization
    • Threads within the same block can be synchronized
    • Call __syncthreads to create a barrier
    • Each thread waits at the call site until every thread in the block has reached it; only then do all threads continue with the subsequent instructions

Mds[i] = Md[j];           // stage data into shared memory

__syncthreads();          // barrier: wait for every thread's store to complete

func(Mds[i], Mds[i+1]);   // now safe to read an element written by another thread
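
A fuller sketch of the same pattern, assuming a 1-D block whose size matches an illustrative tile size:

#define TILE 256   // illustrative; must equal the block size

__global__ void neighborSum(float* out, const float* Md)
{
    __shared__ float Mds[TILE];
    int i = threadIdx.x;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    Mds[i] = Md[j];      // each thread loads one element into shared memory
    __syncthreads();     // no thread proceeds until all loads are done

    if (i < TILE - 1)
        out[j] = Mds[i] + Mds[i + 1];   // reading the neighbor's element is now safe
}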

  • thread scheduling
    • A warp is a group of threads within a block (see the snippet below)
      • G80/GT200 - 32 threads
      • Run on the same SM
      • The basic unit of thread scheduling
      • threadIdx values are consecutive within a warp
      • An implementation detail - in theory the size could change
        • Query it through the built-in variable warpSize
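
A one-line sketch of how a thread finds its warp from the built-in warpSize (the kernel name is illustrative):

__global__ void whichWarp(int* warpIdOut)
{
    int tid = threadIdx.x;             // consecutive within the block
    warpIdOut[tid] = tid / warpSize;   // threads 0-31 -> warp 0, threads 32-63 -> warp 1, ...
}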
  • memory model
    • Registers
      • Private to each thread
      • Fast, on-chip, readable and writable
      • What happens if a kernel's register usage grows? Fewer threads can be resident on each SM
    • Register file
      • Per SM
        • Up to 768 threads
        • 8K registers
    • Local Memory
      • Physically resides in global memory
        • Scope: per thread
      • Used to store arrays of automatic variables
        • Arrays not indexed with compile-time constants cannot live in registers and end up here
    • Shared Memory
      • Per block
      • Fast, on-chip, readable and writable
      • Full-speed random access
      • Each SM holds up to 8 resident blocks and has 16KB of shared memory
    • Global Memory
      • Long latency (100s of cycles)
      • Off-chip, readable and writable
      • Random access hurts performance
      • Host can read and write
      • GT200
        • Bandwidth: 150GB/s
        • Capacity: 4GB
      • G80 - 86.4GB/s
    • Constant Memory
      • Short latency and high bandwidth when all threads access the same location; read-only for the device
      • Stored in global memory but cached
      • Host can read and write
      • Capacity: 64KB
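
A sketch showing where variables of each kind live (the names are illustrative; the index arithmetic only serves to force dynamic indexing):

__constant__ float coeff[16];    // constant memory: cached, read-only on the device

__device__ float globalScale;    // global memory: visible to every thread

__global__ void memorySpaces(float* out)
{
    int i = threadIdx.x;          // scalar automatic variable -> register
    __shared__ float tile[64];    // shared memory: one copy per block
    float scratch[8];             // automatic array with dynamic indexing -> may spill to local memory

    tile[i % 64] = coeff[i % 16] * globalScale;
    __syncthreads();
    scratch[i % 8] = tile[i % 64];
    out[i] = scratch[i % 8];
}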
  1. Global and constant variables
    • The host can access them through the following functions (see the sketch after this list):
      • cudaGetSymbolAddress()
      • cudaGetSymbolSize()
      • cudaMemcpyToSymbol()
      • cudaMemcpyFromSymbol()
    • Global and constant variables must be declared outside of any function
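
A host-side sketch of the symbol-copy calls with a __constant__ array (the variable names are illustrative):

__constant__ float coeff[16];    // declared at file scope, outside any function

int main()
{
    float hostCoeff[16] = { 1.0f, 2.0f };   // remaining entries are zero-initialized
    cudaMemcpyToSymbol(coeff, hostCoeff, sizeof(hostCoeff));   // host -> constant memory

    float check[16];
    cudaMemcpyFromSymbol(check, coeff, sizeof(check));         // constant memory -> host
    return 0;
}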
