[OpenCV] Memory access optimization

Random access vs. sequential access

Sequential access

void BM_ordered(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n; i++) {
			// touch each float in order; DoNotOptimize keeps the load alive
			benchmark::DoNotOptimize(a[i]);
		}
		benchmark::DoNotOptimize(a);
	}
}

BENCHMARK(BM_ordered);
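The excerpt assumes a shared array a of size n defined elsewhere. A minimal sketch of those supporting definitions (the names are from the snippets; the 512 MB size is an assumption — anything much larger than the last-level cache works):

#include <benchmark/benchmark.h>
#include <vector>

// working set of ~512 MB of floats, far larger than any cache level
constexpr size_t n = 1 << 27;
std::vector<float> a(n);

BENCHMARK_MAIN();  // Google Benchmark entry point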

Random access

void BM_random(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n; i++) {
			// hash i into a pseudo-random index, then load from it
			size_t r = randomize(i) % n;
			benchmark::DoNotOptimize(a[r]);
		}
		benchmark::DoNotOptimize(a);
	}
}

BENCHMARK(BM_random);
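The helper randomize is not defined in the excerpt; it just needs to be a cheap integer hash that decorrelates consecutive indices, declared above the benchmarks in a real build. One possible stand-in (a Wang-style hash — an assumption, not necessarily the author's exact function):

static size_t randomize(size_t i) {
	// integer hash: scrambles i into a pseudo-random value
	i = (i ^ 61) ^ (i >> 16);
	i *= 9;
	i ^= i << 4;
	i *= 0x27d4eb2d;
	i ^= i >> 15;
	return i;
}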

Result:
[figure: benchmark timings for BM_ordered vs. BM_random]

  • Random access is much less efficient than sequential access.
  • A random access touches a single float, which pulls the surrounding 64 bytes into the cache as one cache line; only 4 of those 64 bytes are used and the other 60 are discarded, wasting roughly 94% of the bandwidth.
  • Contiguous, sequential access is the ideal, but with data structures such as hash tables it is unavoidable to access random addresses produced by the hash function, and the value type may be smaller than 64 bytes, so bandwidth is wasted.

Solving random access with larger blocks (4096 bytes)

  • The fix is to enlarge the unit of each random access, e.g. to 4 KB, i.e. 64 cache lines instead of one.
  • Each random jump is then followed by 64 sequential accesses, a pattern the CPU can detect and answer with cache-line prefetching, avoiding idle stalls while waiting for data to arrive.

void BM_random_64B(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n / 16; i++) {
			// one random jump per 64-byte block (16 floats)...
			size_t r = randomize(i) % (n / 16);
			for (size_t j = 0; j < 16; j++) {
				// ...then 16 sequential loads inside the block
				benchmark::DoNotOptimize(a[r * 16 + j]);
			}
		}
		benchmark::DoNotOptimize(a);
	}
}

void BM_random_4KB(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n / 1024; i++) {
			// one random jump per 4 KB block (1024 floats = 64 cache lines)
			size_t r = randomize(i) % (n / 1024);
			for (size_t j = 0; j < 1024; j++) {
				benchmark::DoNotOptimize(a[r * 1024 + j]);
			}
		}
		benchmark::DoNotOptimize(a);
	}
}

BENCHMARK(BM_random_64B);
BENCHMARK(BM_random_4KB);

[figure: benchmark timings for BM_random_64B and BM_random_4KB]

The importance of page alignment

  • Why 4 KB? Because the operating system manages memory with paging: the program's address space is mapped page by page, and some pages may be unallocated or inaccessible. Such a page is marked unavailable, and touching it raises a page fault that traps into kernel mode.
  • To stay on the safe side, the hardware prefetcher therefore never crosses a page boundary, since doing so could trigger unnecessary page faults. That is why we choose the page size as the block size: sequential prefetching cannot continue across a page anyway, so nothing is lost by cutting the stream off there.
  • We can use _mm_malloc to request memory whose starting address is aligned to a page boundary, so that no block straddles two pages (a minimal sketch follows).
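A minimal sketch of page-aligned allocation with _mm_malloc from <xmmintrin.h> (C++17's std::aligned_alloc would work the same way; the 4096-byte page size is the usual x86 default):

#include <xmmintrin.h>  // _mm_malloc / _mm_free

int main() {
	constexpr size_t kPageSize = 4096;
	constexpr size_t n = 1 << 27;
	// the returned address starts exactly on a page boundary,
	// so every 4 KB block lies within a single page
	float *a = (float *)_mm_malloc(n * sizeof(float), kPageSize);
	// ... run the benchmarks on a ...
	_mm_free(a);  // memory from _mm_malloc must be freed with _mm_free
}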

Why is writing slower than reading?

[figure: benchmark timings for reading, writing, and reading+writing]

  • Writing appears to take about twice as long as reading.
  • Reading and writing at the same time takes about as long as writing alone.
  • It seems that writing to an array also reads it at the same time, consuming twice the bandwidth? (The benchmarks sketched below would expose this.)
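A hedged sketch of read-only and write-only benchmarks that would produce such a result, reusing the array a from above (the names BM_read/BM_write are assumptions, not the author's exact code):

void BM_read(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n; i++) {
			float x = a[i];             // pure load
			benchmark::DoNotOptimize(x);
		}
	}
}

void BM_write(benchmark::State &bm) {
	for (auto _ : bm) {
#pragma omp parallel for
		for (size_t i = 0; i < n; i++) {
			a[i] = 1;                   // pure store, yet each cache line
			benchmark::DoNotOptimize(a[i]); // still gets read from memory first
		}
	}
}

BENCHMARK(BM_read);
BENCHMARK(BM_write);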

Write granularity that is too small causes unnecessary reads

  • The smallest unit of communication between the cache and memory is the cache line: 64 bytes.
  • When the CPU writes 4 bytes, the other 60 bytes of the line are unchanged. The cache cannot know whether the CPU will touch those 60 bytes next, so it must first read the full 64 bytes from memory, modify the 4 bytes with the data the CPU supplied, and then write the line back when convenient.
  • As a result, even though the program never reads the data, the cache still reads it from memory, wasting twice the bandwidth.

Origin: blog.csdn.net/qq_30340349/article/details/131316542