Function declarations:
- __global__ void KernelFunc()   executes on: device   callable from: host
- __device__ float DeviceFunc()  executes on: device   callable from: device
- __host__ float HostFunc()      executes on: host     callable from: host
__global__:
- The return type must be void
__device__:
- Formerly inlined by default; newer toolchains leave the inlining decision to the compiler
Restrictions on __global__ and __device__ functions:
- Avoid recursion
- Do not use static variables
- Use malloc sparingly
- Be careful with function calls made through pointers
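The qualifiers above can be sketched in a minimal kernel. The names (scale, scaleKernel, runScale) are illustrative, not from the notes:

```cuda
// A __device__ helper called from a __global__ kernel, launched from a
// __host__ function. Sketch only; assumes d_data is a valid device pointer.
__device__ float scale(float x, float factor) {
    return x * factor;                      // callable only from device code
}

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i], factor);   // device-to-device call is allowed
}

__host__ void runScale(float *d_data, int n) {
    // The host launches the __global__ function; it cannot call scale() directly.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
}
```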
- Vector data type:
- char[1-4],uchar[1-4]
- short[1-4],ushort[1-4]
- int[1-4],uint[1-4]
- long[1-4],ulong[1-4]
- longlong[1-4], ulonglong[1-4]
- float[1-4]
- double1,double2
- Vector types
- Usable in both host and device code; constructed with the function make_<type name>
int2 i2 = make_int2(1, 2);
float4 f4 = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
- Access via .x, .y, .z, and .w
int2 i2 = make_int2(1, 2);
int x = i2.x;
int y = i2.y;
- Math functions
- Partial list:
- sqrt,rsqrt
- exp,log
- sin, cos, tan, sincos
- asin,acos,atan2
- trunc,ceil,floor
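A quick sketch of the single-precision variants of these math functions in device code; the kernel name and conversion task are illustrative:

```cuda
// Polar-to-Cartesian conversion using sincosf, which computes the sine
// and cosine of the same angle in a single call.
__global__ void polarToCartesian(const float *r, const float *theta,
                                 float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        sincosf(theta[i], &s, &c);  // s = sin(theta), c = cos(theta)
        x[i] = r[i] * c;
        y[i] = r[i] * s;
    }
}
```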
- Intrinsic (built-in) functions
- Device code only
- Faster but less accurate
- Prefixed with __; partial list: __expf, __logf, __sinf, __powf, ...
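The speed/accuracy trade-off can be seen by computing the same value both ways (kernel and buffer names are illustrative):

```cuda
// sinf() is the full-precision math library version; __sinf() is the
// hardware intrinsic: fewer cycles, but a larger maximum error.
__global__ void sines(const float *in, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(in[i]);    // accurate, slower
        fast[i]     = __sinf(in[i]);  // intrinsic, faster, less accurate
    }
}
```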
- Thread synchronization
- Threads within a block can be synchronized
- Call __syncthreads to create a barrier
- Each thread waits at the call site until every thread in the block has reached it; only then do all threads continue with the subsequent instructions
Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i+1]);
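The three-line pattern above expands to a full kernel along these lines (the names Md/Mds follow the notes; the tile size and stencil operation are assumptions):

```cuda
// Each thread loads one element into shared memory; the barrier guarantees
// all loads have finished before any thread reads a neighbour's element.
__global__ void stencil(const float *Md, float *out, int n) {
    __shared__ float Mds[256];                 // one tile per block (assumed size)
    int i = threadIdx.x;
    int g = blockIdx.x * blockDim.x + i;
    if (g < n) Mds[i] = Md[g];
    __syncthreads();                           // barrier: the whole block waits here
    if (g < n - 1 && i < blockDim.x - 1)
        out[g] = Mds[i] + Mds[i + 1];          // safely reads a neighbour's element
}
```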
- Thread scheduling
- A warp is a group of threads within a block
- G80/GT200: 32 threads per warp
- A warp runs on a single SM
- The warp is the basic unit of thread scheduling
- threadIdx values within a warp are consecutive
- Technically an implementation detail
- warpSize
- Built-in variable holding the number of threads per warp
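Because warps are warpSize consecutive threads, a thread's warp and lane indices follow directly from its flattened thread index (kernel and output names are illustrative):

```cuda
// Assumes a 1-D block for simplicity; warpSize is a built-in device variable.
__global__ void warpInfo(int *warpId, int *laneId) {
    int tid = threadIdx.x;
    warpId[tid] = tid / warpSize;   // which warp within the block
    laneId[tid] = tid % warpSize;   // position within the warp
}
```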
- Memory model
- Registers
- Private to each thread
- Fast, on-chip, readable and writable
- What happens if a kernel's register usage increases?
- Per SM (G80): 8K registers shared by up to 768 resident threads
- At full occupancy that is roughly 10 registers per thread; using more registers per thread reduces how many threads can be resident
- Local Memory
- Stored in global memory
- Scope: per thread
- Holds automatic array variables
- Used for arrays that cannot be accessed with constant indices (constant-indexed arrays can stay in registers)
- Shared Memory
- Scope: per block
- Fast, on-chip, readable and writable
- Full-speed random access
- Each SM supports up to 8 resident blocks and provides 16KB of shared memory (G80)
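Shared memory can be declared statically or sized at launch time; a minimal sketch (kernel names and the reversal task are illustrative):

```cuda
// Static: the array size is fixed at compile time.
__global__ void staticShared(float *out) {
    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];  // block-wide reversal
}

// Dynamic: the size comes from the third launch-configuration argument.
__global__ void dynamicShared(float *out) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}
// launch: dynamicShared<<<grid, block, block.x * sizeof(float)>>>(d_out);
```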
- Global Memory
- Long latency (hundreds of cycles)
- Off-chip, readable and writable
- Random access hurts performance
- Readable and writable by the host
- GT200: bandwidth 150GB/s, capacity 4GB
- G80: bandwidth 86.4GB/s
- Constant Memory
- Short latency and high bandwidth when all threads read the same location; read-only for the device
- Stored in global memory but cached
- Readable and writable by the host
- Capacity: 64KB
- Global and constant variables
- The host can access them through the following functions:
- cudaGetSymbolAddress()
- cudaGetSymbolSize()
- cudaMemcpyToSymbol()
- cudaMemcpyFromSymbol()
- __constant__ variables must be declared at file scope, outside any function
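The symbol API above fits together as follows; the variable and function names (coeffs, applyCoeffs, setup) are illustrative, not from the notes:

```cuda
#include <cuda_runtime.h>

// __constant__ variables live at file scope; the host fills them through
// the symbol API, and device code reads them through the constant cache.
__constant__ float coeffs[4];

__global__ void applyCoeffs(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[0] + coeffs[1] * in[i];  // read-only on the device
}

void setup() {
    float h_coeffs[4] = {1.0f, 2.0f, 0.0f, 0.0f};
    // Copy host data into the constant symbol.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    size_t bytes;
    void *addr;
    cudaGetSymbolSize(&bytes, coeffs);    // size of the symbol in bytes
    cudaGetSymbolAddress(&addr, coeffs);  // device address of the symbol
}
```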