Rule of Programming: C

Before doing large memory writes, consider _mm*_stream_s[i,s,d]* (the non-temporal memory store intrinsics). When data is produced and not (immediately) consumed again, the fact that an ordinary memory store first reads a full cache line and then modifies the cached data is detrimental to performance: it pushes data out of the caches that might be needed again, in favor of data that will not be used soon. This is especially true for large data structures, such as matrices, which are filled first and only used later. Before the last element of the matrix is written, its sheer size has already evicted the first elements, making caching of the writes ineffective.
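As a rough sketch (not from the original text; the function name, the alignment, and the length-is-a-multiple-of-4 assumption are mine), filling a large float buffer might use _mm_stream_ps so the stores bypass the cache instead of evicting useful data:

    #include <xmmintrin.h>   /* SSE: _mm_stream_ps, _mm_set1_ps, _mm_sfence */
    #include <stddef.h>

    /* Fill a large, 16-byte-aligned float buffer with a constant using
     * non-temporal stores, so the writes do not pull the buffer into the
     * cache and evict data that is still useful. */
    static void fill_nt(float *buf, size_t n, float value)
    {
        __m128 v = _mm_set1_ps(value);
        for (size_t i = 0; i + 4 <= n; i += 4)
            _mm_stream_ps(buf + i, v);   /* non-temporal store of 4 floats */
        _mm_sfence();                    /* order the streamed stores before later accesses */
    }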

NTA (non-temporal access) is the streaming option for reading memory while circumventing the caches: _mm*_stream_load_si (SSE4.1). It loads the target bytes into a small number of streaming-load buffers, each sized as one cache line, so subsequent accesses that hit a buffer are fast. These buffers do not affect the caches, so a streaming load never causes a cache line to be evicted; it may, however, evict the contents of another buffer filled by an earlier stream_load.
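As an illustrative sketch (the function name and the 64-byte line size are assumptions of mine, not from the original), a buffer that is read only once could be checksummed with _mm_stream_load_si128, batching four 16-byte loads so each group covers one cache-line-sized streaming buffer:

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
    #include <stddef.h>
    #include <stdint.h>

    /* XOR-checksum a 16-byte-aligned buffer whose size is a multiple of
     * 64 bytes, using streaming loads so the data passes through the
     * streaming-load buffers instead of evicting cache lines. */
    static __m128i checksum_nt(uint8_t *src, size_t bytes)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < bytes; i += 64) {
            /* four 16-byte streaming loads cover one 64-byte line/buffer */
            acc = _mm_xor_si128(acc, _mm_stream_load_si128((__m128i *)(src + i)));
            acc = _mm_xor_si128(acc, _mm_stream_load_si128((__m128i *)(src + i + 16)));
            acc = _mm_xor_si128(acc, _mm_stream_load_si128((__m128i *)(src + i + 32)));
            acc = _mm_xor_si128(acc, _mm_stream_load_si128((__m128i *)(src + i + 48)));
        }
        return acc;
    }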

Note: non-temporal reads/writes do not make the memory access itself faster; their benefit is that they keep useful data from being evicted from the caches. So an NT read or write on its own does not speed up the transfer. Consider the normal case: a write to a memory location that is not cached first reads the line into the cache and then modifies the cached copy; the dirty cache line is written back to memory only when it is later evicted. The CPU time therefore goes into loading the line into the cache and writing into the cache. With NT writes, the CPU addresses the memory location directly and sends the data to RAM. Locating the memory address costs about as much as loading the line into the cache would, but writing to RAM is far more time consuming than writing to the cache. Also note that writes have a write-combining feature: writes to consecutive locations within the same cache line can be combined into a single transfer. Finally, when using NT writes, fences (sfence) must be issued manually.
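To make the fence requirement concrete, here is a hedged sketch (the record layout and the ready flag are invented for illustration; a real implementation would use a proper atomic for the flag): a producer streams out a record with _mm_stream_si32 and must issue _mm_sfence() before publishing it, because non-temporal stores are weakly ordered and could otherwise become visible after the flag.

    #include <emmintrin.h>   /* SSE2: _mm_stream_si32; also pulls in _mm_sfence */

    /* Hypothetical record written with non-temporal stores. */
    struct record {
        int payload[16];
    };

    static void publish_nt(struct record *r, volatile int *ready_flag)
    {
        for (int i = 0; i < 16; ++i)
            _mm_stream_si32(&r->payload[i], i);  /* NT stores go straight toward RAM */

        _mm_sfence();      /* fence: make the streamed stores visible before the flag */
        *ready_flag = 1;   /* ordinary store that publishes the record */
    }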

Why can NT writes be much slower? Here is another point of view: NT writes to memory operate on whole cache lines, so writing to a single location inside a cache line without using the write-combining logic to fill the full line forces the CPU to fetch the original cache line again and merge the new data into it, which makes the whole step much slower. (I now believe this point of view is correct.)
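Correspondingly, a bulk copy that wants the benefit of write-combining should emit all four 16-byte NT stores of each 64-byte cache line back to back, so the line is filled completely and flushed as one transfer instead of a partial-line write. A minimal sketch, assuming 16-byte-aligned pointers and a size that is a multiple of 64 bytes:

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Copy src to dst with non-temporal stores, one full 64-byte cache
     * line at a time, so write-combining can flush each line in a single
     * burst rather than as a partial-line write. */
    static void copy_nt(uint8_t *dst, const uint8_t *src, size_t bytes)
    {
        for (size_t i = 0; i < bytes; i += 64) {
            __m128i a = _mm_load_si128((const __m128i *)(src + i));
            __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
            __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
            __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));
            _mm_stream_si128((__m128i *)(dst + i),      a);  /* four NT stores fill */
            _mm_stream_si128((__m128i *)(dst + i + 16), b);  /* the whole cache     */
            _mm_stream_si128((__m128i *)(dst + i + 32), c);  /* line, so no partial */
            _mm_stream_si128((__m128i *)(dst + i + 48), d);  /* line write occurs   */
        }
        _mm_sfence();   /* fences must be issued manually after NT stores */
    }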

Thus when doing streaming loads, read memory at buffer (cache line) granularity, e.g. by dispatching a sequence of stream_load calls that together cover one cache line, as in the load sketch above.

NT memory accesses are heavily optimized for sequential data access, while the caches can cover up some, but not all, of the cost of random accesses to memory.


Reposted from www.cnblogs.com/sansna/p/9082205.html