NVIDIA CUDA Massively Parallel Processor Programming (8): Parallel Patterns: Histogram Computation


Introducing Atomic Operations and Privatization
In the parallel patterns covered so far, each thread computes its own output elements, so threads do not interfere with one another. In practice, however, many parallel patterns have outputs that can be modified by any thread; parallel histogram computation is one example, since every thread may update every output element. The threads must therefore be coordinated when they update the output. This post first presents a baseline approach that uses atomic operations to serialize the updates to each element; the baseline is correct but inefficient. It then presents privatization, a technique that can raise execution speed significantly.

Background

A histogram shows how frequently data values fall into consecutive numerical intervals. In its most common form, the value intervals are plotted along the horizontal axis, and the count of data items in each interval is shown as the height of a rectangle (bar) rising from that axis. For example, the letter counts of the phrase "programming massively parallel processors" are shown below:
[Figure: letter-count histogram of the phrase "programming massively parallel processors"]
Many fields rely on histograms to summarize data sets for analysis. In computer vision, for example, different kinds of objects produce different histograms: by segmenting an image into sub-regions and analyzing the histogram of each sub-region, the objects in the image can be roughly identified.

The letter histogram of a string can be computed with the following sequential code:

void sequential_histogram(char *data, int length, int *histo){
	for(int i = 0; i < length; i++){
		int alphabet_position = data[i] - 'a';
		//only count lowercase letters
		if(alphabet_position >= 0 && alphabet_position < 26){
			//group the 26 letters into bins of 4: a-d, e-h, ..., y-z
			histo[alphabet_position/4]++;
		}
	}
}
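
For reference, here is a minimal host-only driver for the sequential version; it is only a sketch, and the sample text is the phrase from the figure above:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *text = "programming massively parallel processors";
    int histo[7] = {0};  //7 bins of 4 letters each; the last bin holds only y and z
    sequential_histogram((char *)text, strlen(text), histo);
    for (int b = 0; b < 7; b++)
        printf("bin %d: %d\n", b, histo[b]);
    return 0;
}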

Since a modern CPU's L1 cache is at least a few KB, nearly all accesses to the histo array hit the cache, while accesses to data must stream in from DRAM. The performance of this code is therefore limited mainly by the accesses to data.

Using Atomic Operations

An intuitive way to compute a character histogram in parallel is to divide the input array into segments and have each thread process one segment. This method, however, runs into conflicts, as in the following situation:
[Figure: several threads attempting to update the same histogram bin in the same iteration]
In the first iteration of the loop, threads 0, 1, and 2 all try to increment the m-p bin counter, and a conflict occurs, because updating an element of the histo array takes three stages: read-modify-write. If the three threads all read the counter at the same time, update it, and write it back, the counter ends up at 1 instead of 3. This is called a race condition. The figure below shows the situation in detail:
[Figure: two possible interleavings of the read-modify-write sequences of thread 1 and thread 2]
The value of histo[x] appears in parentheses. In scenario (A), thread 1 completes its read-modify-write before thread 2 begins, so the final result is correct. In scenario (B), thread 2 reads histo[x] before thread 1 writes it and writes back after thread 1's write; thread 1's update is lost, the result is 1, and the computation is wrong.
The following figure shows a similar pair of interleavings:
[Figure: additional interleavings of the two threads' read-modify-write sequences]

We therefore need a timing constraint that prevents the interleaving shown in scenario (B) above. Such a constraint can be enforced with atomic operations. An atomic operation on a memory location means that while one read-modify-write of that location is in progress, no other read-modify-write of the same location may overlap it; the read, modify, and write form one indivisible unit, which is why the operation is called atomic. Once one read-modify-write completes, the next may proceed. Note that atomic operations do not impose any particular execution order among threads: either ordering in scenario (A) of the two figures above may still occur.
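
To make the hazard concrete, here is a conceptual sketch of the three separate steps an unsafe histo[x]++ performs (not actual generated code); any interleaving of these steps across threads can lose counts:

int old = histo[x];      //read: another thread may read the same value here
int new_val = old + 1;   //modify
histo[x] = new_val;      //write: may silently overwrite another thread's update

An atomic operation makes these three steps behave as a single indivisible step.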

Atomic addition can be performed with the following built-in function:

int atomicAdd(int *address, int val);

atomicAdd takes two parameters: address points to a 32-bit word, which may reside in global memory or in shared memory, and val is the value to add. The function adds val to the word at address, writes the result back to the same address, and returns the old value originally stored there.
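
Beyond incrementing a counter, the returned old value is useful in its own right, for example to hand out unique output positions. A minimal sketch (the kernel and pointer names here are hypothetical, not part of this post's histogram code):

__global__ void compact_positive(int *in, int *out, int *d_count, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n && in[i] > 0){
        //atomicAdd returns the value *before* the addition,
        //so every thread obtains a distinct slot in out[]
        int slot = atomicAdd(d_count, 1);
        out[slot] = in[i];
    }
}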

The full program is shown below:

#include<cuda.h>
#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 16
#define HISTO_SIZE 7

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	int section_size = (size - 1)/(blockDim.x * gridDim.x) + 1;
	int start = i*section_size;
	//each thread handles section_size consecutive elements, except possibly the last thread
	for(int k = 0; k < section_size; k++){
		//make sure the element is within the buffer
		if(start+k < size){
			int alphabet_position = buffer[start+k] - 'a';
			//only count lowercase letters
			if(alphabet_position >= 0 && alphabet_position < 26)
				//atomic addition with atomicAdd; this is the key step
				atomicAdd(&(histo[alphabet_position/4]), 1);
		}
	}
}

void histogram(unsigned char *arr, long n, unsigned int * histo){
    unsigned char *d_arr;
    unsigned int *d_histo;
    int size = n * sizeof(char);
    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, arr, size, cudaMemcpyHostToDevice);
    cudaMalloc(&d_histo, HISTO_SIZE*sizeof(int));
    //the bin counters must start from zero
    cudaMemset(d_histo, 0, HISTO_SIZE*sizeof(int));
    histo_kernel<<<ceil((float)n/(BLOCK_SIZE)), BLOCK_SIZE>>>(d_arr, n, d_histo);
    cudaMemcpy(histo, d_histo, HISTO_SIZE*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_arr);
    cudaFree(d_histo);
}

int main(){
    unsigned char arr[] = "i am happy today, because i wrote a csdn blog and get many likes";
    long n = 64;
    unsigned int *histo = (unsigned int *)malloc(HISTO_SIZE*sizeof(int));
    histogram(arr, n, histo);
    for(int i = 0; i < HISTO_SIZE; i++){
        printf("%u\n", histo[i]);
    }
    free(histo);
    return 0;
}
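
The program can be compiled and run with nvcc, e.g. nvcc histogram.cu -o histogram (assuming the source file is saved as histogram.cu).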

For the string "i am happy today, because i wrote a csdn blog and get many likes", the output is:
[Figure: program output showing the counts of the 7 bins]

Block Partitioning vs. Interleaved Partitioning

In the previous method we divided buffer[] into contiguous blocks of elements and assigned each block to one thread. Such a partition is called a block partition. On CPUs, where parallel execution usually involves only a few threads, block partitioning is usually the best strategy because each thread's sequential access pattern makes good use of cache lines: whenever an element is loaded from main memory, its neighbors are loaded into the same cache line, so the next few accesses hit the cache. And since each CPU core runs only a small number of threads, the threads hardly interfere with one another's cache usage.
When block partitioning is used on a GPU, however, the accesses to global memory cannot be coalesced, and performance becomes largely limited by global-memory bandwidth. For example, in a given iteration thread 0 reads buffer[k], thread 1 reads buffer[section_size + k], and so on: consecutive threads touch addresses that are section_size elements apart, which cannot be combined into a single DRAM transaction. We therefore change the partitioning of the buffer array so that threads in the same warp access consecutive locations, enabling memory coalescing.
[Figure: interleaved partitioning example with one block of 4 threads]
The figure above is only a small example with a single block of 4 threads. In general, all threads process the first blockDim.x * gridDim.x elements of buffer in the first iteration, the next blockDim.x * gridDim.x elements in the second iteration, and so on.

The kernel function is as follows:

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
	unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
	/*In the first iteration each thread uses its thread index as the array index;
	in every subsequent iteration the index advances by blockDim.x * gridDim.x,
	so each iteration processes blockDim.x * gridDim.x elements in total.*/
	for (unsigned int i = tid; i < size; i += blockDim.x * gridDim.x) {
		int alphabet_position = buffer[i] - 'a';
		if(alphabet_position >= 0 && alphabet_position < 26)
			atomicAdd(&(histo[alphabet_position/4]), 1);
	}
}

The interleaved-partitioning kernel is several times more efficient than the block-partitioning kernel.
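
This access pattern is what CUDA programmers commonly call a grid-stride loop; a convenient side effect is that the grid size no longer has to match the input size, since each thread simply keeps striding until it runs past the end of the buffer.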

Latency and Throughput of Atomic Operations

As the figure below shows, when atomic operations target the same memory location, only one of them can be in progress at a time. The duration of each atomic operation is approximately the latency of a memory read (the left half of the operation's time) plus the latency of a memory write (the right half). The length of this read-modify-write cycle, typically hundreds of clock cycles, is the minimum time per atomic operation, and it limits the throughput, i.e. the execution speed, of atomic operations.
[Figure: timeline of serialized atomic operations on the same location, each occupying a full read latency plus write latency]
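
A rough back-of-the-envelope estimate (the numbers are illustrative assumptions, not measurements): with a 1 GHz clock and a DRAM latency of about 500 cycles for the read and 500 cycles for the write, one atomic operation on a location takes about 1000 cycles, or 1 µs. At most about one million atomic operations per second can then complete on that location, no matter how many threads are issuing them.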

Atomic Operations in the Cache

The previous section showed that long memory latency translates into low throughput when atomic operations contend for the same location. An obvious way to raise the throughput of atomic operations is therefore to reduce the access latency of that location, and caches are the primary tool for reducing memory access latency. Recent GPUs allow atomic operations to be performed in the last-level cache, which is shared by all SMs. During an atomic operation, if the variable is found in the last-level cache, it is updated there; if not, a cache miss occurs and the variable is brought into the last-level cache, where it is updated. Since variables updated by atomic operations tend to be heavily accessed by many threads, they stay in the cache once brought in from DRAM. Because the last-level cache's access time is tens of cycles rather than hundreds, simply performing atomic operations in the last-level cache improves their throughput by roughly an order of magnitude. For many applications, however, this is still not enough.

Leveraging Shared Memory: Privatization

Since shared memory has very low access latency, moving the contended updates there raises throughput directly. A technique called privatization is commonly used to address output interference in parallel computing: the heavily contended output data structure is replicated into private copies, and each subset of threads updates only its own copy. Accesses to a private copy suffer far less contention and, when the copy lives in shared memory, far lower latency, which can greatly improve update throughput. The downside is that the private copies must be merged back into the original data structure after the computation, so the reduction in contention must be carefully weighed against the cost of merging. For this reason, in massively parallel systems privatization is usually applied per subset of threads (e.g. per block) rather than per individual thread.

The code is shown below:

#include<cuda.h>
#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 16
#define NUM_BINS 7

__global__ void histogram_privatized_kernel(unsigned char *buffer, unsigned int *histo, unsigned int num_elements){
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	extern __shared__ unsigned int histo_s[];
	/*initialize the private histogram in shared memory; each iteration
	initializes blockDim.x elements until all NUM_BINS bins are zeroed*/
	for(unsigned int binIdx = threadIdx.x; binIdx < NUM_BINS; binIdx += blockDim.x) {
		histo_s[binIdx] = 0u;
	}
	__syncthreads();
	//build the histogram as in the interleaved kernel, but accumulate into shared memory
	for (unsigned int i = tid; i < num_elements; i += blockDim.x*gridDim.x) {
		int alphabet_position = buffer[i] - 'a';
		if (alphabet_position >= 0 && alphabet_position < 26)
			atomicAdd(&(histo_s[alphabet_position/4]), 1);
	}
	__syncthreads();
	//merge this block's private histogram into the global histogram
	for(unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x){
		atomicAdd(&(histo[i]), histo_s[i]);
	}
}

void histogram(unsigned char *arr, long n, unsigned int * histo){
    unsigned char *d_arr;
    unsigned int *d_histo;
    int size = n * sizeof(char);

    cudaMalloc(&d_arr, size);
    cudaMemcpy(d_arr, arr, size, cudaMemcpyHostToDevice);
    cudaMalloc(&d_histo, NUM_BINS*sizeof(int));
    //the global bin counters must start from zero
    cudaMemset(d_histo, 0, NUM_BINS*sizeof(int));
    //the third launch parameter is the dynamic shared-memory size per block
    histogram_privatized_kernel<<<ceil((float)n/(BLOCK_SIZE)), BLOCK_SIZE, sizeof(int)*NUM_BINS>>>(d_arr, d_histo, n);
    cudaMemcpy(histo, d_histo, NUM_BINS*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_arr);
    cudaFree(d_histo);
}

int main(){
    unsigned char arr[] = "i am happy today, because i wrote a csdn blog and get many likes";
    long n = 64;
    unsigned int *histo = (unsigned int *)malloc(NUM_BINS*sizeof(int));
    histogram(arr, n, histo);
    for(int i = 0; i < NUM_BINS; i++){
        printf("%u\n", histo[i]);
    }
    free(histo);
    return 0;
}

The kernel of the privatization method has three main parts:

  1. Allocate and zero a private histogram in each block's dynamic shared memory.
  2. Build the histogram exactly as in the interleaved-partitioning kernel, but accumulate the counts into the block's shared array (see the note after this list).
  3. Merge the shared arrays of all blocks into the global histogram array.
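
Note that step 3 still needs atomicAdd on the global array: many blocks execute concurrently and merge their private copies into the same global bins, so these updates are themselves read-modify-write operations on contended locations. Each block issues only NUM_BINS such atomic adds, however, instead of one per input character.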

For the principle and usage of dynamically sized arrays in shared memory, refer to this blog post; a minimal sketch is given below.
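
In brief, the size of an extern __shared__ array is not fixed at compile time; it is supplied in bytes as the third parameter of the kernel launch configuration. A small sketch (the kernel name is illustrative):

__global__ void demo_kernel(){
    extern __shared__ unsigned int s[];  //size determined at launch time
    s[threadIdx.x] = threadIdx.x;
}

//launch with 1 block of 128 threads and 128 words of dynamic shared memory:
//demo_kernel<<<1, 128, 128 * sizeof(unsigned int)>>>();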

Aggregating Atomic Operations

In image processing, images such as a sky contain long runs of identical values, so many threads hammer the same bin; the resulting contention on atomic operations lowers the throughput of the parallel histogram computation. The remedy is to aggregate consecutive updates to the same bin into a single atomic operation.
The kernel is as follows:

__global__ void histogram_privatized_kernel(unsigned char *buffer, unsigned int *histo, unsigned int num_elements){
	unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	//initialize the private histogram in shared memory
	extern __shared__ unsigned int histo_s[];
	for(unsigned int binIdx = threadIdx.x; binIdx < NUM_BINS; binIdx += blockDim.x) {
		histo_s[binIdx] = 0u;
	}
	__syncthreads();

	//histogram with aggregation of consecutive updates to the same bin
	int prev_index = -1;
	int curr_index;
	unsigned int accumulator = 0;
	for(unsigned int i = tid; i < num_elements; i += blockDim.x*gridDim.x) {
		int alphabet_position = buffer[i] - 'a';
		if (alphabet_position >= 0 && alphabet_position < 26) {
			curr_index = alphabet_position/4;
			if(curr_index != prev_index){
				//flush the accumulated run before starting a new one
				if(accumulator > 0)
					atomicAdd(&(histo_s[prev_index]), accumulator);
				prev_index = curr_index;
				accumulator = 1;
			}else{
				accumulator++;
			}
		}
	}
	//flush the final run
	if(accumulator > 0)
		atomicAdd(&(histo_s[prev_index]), accumulator);

	__syncthreads();
	//merge this block's private histogram into the global histogram
	for(unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x){
		atomicAdd(&(histo[i]), histo_s[i]);
	}
}

The code above maintains three extra register variables: curr_index, prev_index, and accumulator. When the element a thread is processing falls into the same bin as the previous one (curr_index equals prev_index), it simply increments accumulator. Otherwise it flushes accumulator into histo_s[prev_index] with one atomicAdd, sets prev_index to curr_index, and resets accumulator to 1.

When many consecutive elements fall into the same bin, this reduces the number of atomic additions and thereby improves throughput.
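
As a small worked example: if the elements a thread visits map to the bin indices 0, 0, 0, 2 in sequence, the aggregated kernel issues one atomicAdd of 3 to bin 0 followed by one atomicAdd of 1 to bin 2, instead of four separate atomic operations. The caveat is that each thread spends extra registers and comparisons on the bookkeeping, so aggregation pays off mainly when long runs of identical bins are common.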

Original post: blog.csdn.net/weixin_45773137/article/details/125584905