From false sharing to cache coherence: this is actually closely related to us

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work was created by Li Zhaolong. Please credit the author and indicate the copyright when reproducing it.

Introduction

Caching is obviously extremely important to us programmers, yet most of the time many people simply ignore its existence, which is really a pity. Today we will not talk about distributed caches such as Redis, memcached, and Tair, nor about logical caches inside the operating system such as the four levels of caching in the VFS [1]. We will only talk about two problems concerning the CPU cache: cache coherence and false sharing.

I suspect that most people never pay much attention to these two issues: how much can something this low-level have to do with everyday programming? The answer is: a lot. They can quietly hurt the efficiency of our code without us ever noticing, so learning about them is worthwhile. Don't worry though, as the code below will show, the problem is easy to avoid.

False sharing

[Figure: per-core private L1/L2 caches inside a multi-core CPU]

The figure above shows how caches are distributed across the cores of a CPU. Even though the cores sit in the same CPU package, each one still has its own private cache; the figure shows the L1 and L2 levels. By itself this looks fine, because more caching normally means a higher hit rate (temporal and spatial locality). The complication appears once we introduce multithreading. We all know that the threads of a process share the page tables and the mm_struct, that is, they can share almost all data; but a thread is also the unit of scheduling, which means threads may run on different cores [2] or even different CPUs, so their modifications naturally end up in different caches.

Now suppose one core has just written a cache line and another core then modifies the same line [2]. This is where cache coherence comes in. We have not discussed it yet; for now it is enough to know that its job is to guarantee that a read of an address always returns the latest value written to that address. That means the copies of the same cache line held by different cores must be kept in sync, which is clearly not a small overhead.

Now imagine two threads that keep modifying variables located on the same cache line (possibly even the same variable). What happens? The two cores' copies of that cache line are invalidated and reloaded over and over again.

Let's look at the wiki definition of false sharing [3]:

In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but those data share a cache block with data that are altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.

That is exactly the situation we described above.

Let's write some code to verify this. My machine is a 4 x Intel® Core i5-7200U CPU @ 2.50GHz. The code is very simple: declare two global variables so that they are laid out tightly next to each other in the same cache line, then have two threads modify them 100 million times each and look at the time consumed. To make sure the two threads do not run on the same core, we can set the CPU affinity. The full code is as follows:

#include <sched.h>       // sched_setaffinity, cpu_set_t
#include <thread>
#include <vector>
#include <algorithm>
#include <functional>
using namespace std;

bool SetCPUaffinity(int param){
    cpu_set_t mask;          // set of CPU cores
    CPU_ZERO(&mask);         // clear the set
    CPU_SET(param, &mask);   // add the given core to the set, see https://man7.org/linux/man-pages/man3/CPU_SET.3.html
    // with a first argument of 0, the calling thread is affected
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1){
        // failed to set the thread's CPU affinity
        return false;
        // the five possible errno values do not seem worth handling here
    } else {
        return true;
    }
}

int num0;
int num1;

void thread0(int index){
    SetCPUaffinity(index);
    int count = 100000000;  // 100 million
    while(count--){
        num0++;
    }
    return;
}

void thread1(int index){
    SetCPUaffinity(index);
    int count = 100000000;  // 100 million
    while(count--){
        num1++;
    }
    return;
}

int main(){
    vector<std::thread> pools;
    pools.push_back(thread(thread0, 0));
    pools.push_back(thread(thread1, 1));
    for_each(pools.begin(), pools.end(), std::mem_fn(&std::thread::join));
    return 0;
}
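The listing above contains no timing code. A minimal way to measure it in-process (a sketch of my own, reusing the globals, thread functions, and headers from the listing above; alternatively just run the binary under the shell's time command):

#include <chrono>
#include <iostream>

int main(){
    auto start = std::chrono::steady_clock::now();
    vector<std::thread> pools;
    pools.push_back(thread(thread0, 0));
    pools.push_back(thread(thread1, 1));
    for_each(pools.begin(), pools.end(), std::mem_fn(&std::thread::join));
    auto end = std::chrono::steady_clock::now();
    // print the wall-clock time of the two increment loops in milliseconds
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
    return 0;
}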

The execution results are as follows:
[Figure: timing of the two threads running in parallel on separate cores]

Now let's make a small modification so that the two threads run serially. Only the main function needs to change; the code is as follows:

int main(){
    vector<std::thread> pools;
    pools.emplace_back(thread0, 0);
    pools[0].join();
    pools.emplace_back(thread1, 1);
    pools[1].join();
    return 0;
}

[Figure: timing of the serial run]
Shocking: almost a four-fold difference.

Obviously, as long as the two variables do not share a cache line, the problem disappears. We can modify the code so that the variables modified by the two threads are 64-byte aligned. Of course, 64 is not a universal constant; it depends on your machine. You can run cat /proc/cpuinfo and look at cache_alignment, or directly run cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size to check it [4]:

int num0 __attribute__ ((aligned(64)));
int num1 __attribute__ ((aligned(64)));

Of course, it can also be written like this:

int num0;
char arr[60];   // padding; these bytes could also hold some flag bits
int num1;
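For what it's worth, a third variant (my own addition, assuming a C++17 compiler and a standard library that actually provides the constant) avoids hard-coding the 64: the header <new> exposes the relevant size as std::hardware_destructive_interference_size.

#include <new>   // std::hardware_destructive_interference_size (C++17)

// Each variable gets its own cache line, so writes from the two threads
// no longer invalidate each other's copy of the line.
alignas(std::hardware_destructive_interference_size) int num0;
alignas(std::hardware_destructive_interference_size) int num1;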

We can see that the time consumption drops dramatically.
[Figure: timing with 64-byte-aligned variables]
[Figure: timing with manual padding between the variables]

Next we can try pinning the two threads to the same core to see the effect: simply pass the same value when setting the CPU affinity. main looks roughly like this:

int num0;
int num1;

int main(){
    vector<std::thread> pools;
    pools.emplace_back(thread0, 1);
    pools.emplace_back(thread1, 1);
    for_each(pools.begin(), pools.end(), std::mem_fn(&std::thread::join));
    return 0;
}

[Figure: timing with both threads pinned to the same core]
You can see that the result is in line with expectations.

Cache coherence

The reason false sharing hurts is precisely that cache coherence has to synchronize the caches so that every CPU reads the latest value. The first time I got interested in this topic was when I was thinking about how atomic operations are implemented. What I learned then was that CAS is generally built on the CMPXCHG instruction (Intel x86), and underneath that sit the bus lock and cache coherence. The former asserts a LOCK# signal on the system bus; while it is asserted, other CPUs cannot use the bus, so the operation is made atomic by exclusivity. Obviously the granularity of this lock is far too coarse, because LOCK# locks the communication between the CPU and memory, preventing other processors from touching the data at any memory address for the duration of the lock. And an atomic operation does not necessarily need the LOCK# signal at all, namely:

Beginning with the P6 family processors, when the LOCK prefix is prefixed to an instruction and the memory area being accessed is cached internally in the processor, the LOCK# signal is generally not asserted. Instead, only the processor's cache is locked. Here, the processor's cache coherency mechanism ensures that the operation is carried out atomically with regards to memory. See "Effects of a Locked Operation on Internal Processor Caches" in Chapter 8 of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, for more information on locking of caches.

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the memory area being locked is cached in the processor.
For the P6 and newer processor families, if the memory area locked during the LOCK operation is cached as write-back memory in the processor performing the operation and is completely contained within a single cache line, the processor may not assert the LOCK# signal on the bus. Instead, it modifies the memory location internally and lets its cache coherence mechanism ensure that the operation is performed atomically. This is called a "cache lock". The cache coherence mechanism automatically prevents two or more processors that have cached the same memory area from modifying data in that area at the same time.

For details, please refer to [6] and [7].
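As a small illustration (my own sketch, not from the references above), here is what a CAS-based increment looks like with std::atomic; on x86 the compare_exchange typically compiles down to a lock cmpxchg, which is exactly the cache-lock path described in the quotes:

#include <atomic>

std::atomic<int> counter{0};

// Classic CAS loop: read the old value, try to install old + 1, and
// retry if another core changed the value in the meantime.
void cas_increment() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // on failure, `expected` is refreshed with the current value; retry
    }
}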

Let's continue. After memory is modified, the newest data may exist only in some cache. For every read to see the latest data, we need cache coherence. The classic protocol here is undoubtedly MESI [8]. Intel processors use MESIF [10], which evolved from MESI, while AMD uses MOESI [9]; all of them basically follow MESI's approach of solving the problem with a state machine.
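Just to pin down the vocabulary (a toy summary of my own, not a protocol implementation), the four states that give MESI its name are:

// Toy summary of the four MESI states a cache line can be in.
enum class MesiState {
    Modified,   // this core's copy is dirty and is the only valid copy
    Exclusive,  // this core's copy is clean and no other cache holds the line
    Shared,     // clean copy that may also exist in other cores' caches
    Invalid     // the line must be re-fetched before it can be used
};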

I don't want to go into the concrete implementation of any one protocol, since plenty of articles already do that. What I want to discuss, after a quick study of these protocols [10][11][12], is how cache coherence is maintained in a distributed system.

In fact, the following scenarios popped into my mind as soon as the question was raised:

  1. Client-side cache consistency in Chubby
  2. Chubby's keepalive mechanism
  3. Consistency of the chunk information cached by clients in GFS
  4. The lease mechanism between the master and chunkservers in GFS
  5. Consistency between a cache and the database behind it

Of the five scenarios above, 1, 3, and 5 are clearly cache coherence as we usually understand it, so why did I include 2 and 4?

Let's first look at the Wikipedia definition of cache coherence [13]:

In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

Clearly the definition speaks of shared resource data, which is a very broad scope. So can we regard a lease maintained between two ends as a cache? More concretely, for a distributed lock, can we treat the lock information held by both parties as a cache? In my earlier article on ChubbyGo's demonstration, security, and prospects, I analyzed the biggest problem introduced by the timeout mechanism that distributed locks use to avoid deadlock. In hindsight, the real difficulty there is keeping the view of the lock identical between the lock server and the lock holder (the client), which in essence is exactly cache coherence. I would even say that the consistency we discuss in distributed systems is nothing more than a special kind of cache coherence. Of course, that is just my personal opinion; please forgive me if I am wrong.

With that in mind, I included 2 and 4: the former is Chubby's answer to the distributed-lock timeout problem, and the latter is the lease the master maintains for each chunkserver in GFS.

Solutions for the fifth scenario can be found in [14][18], although I think it is really a distributed transaction (strong consistency) problem.

One and three are the most intuitive kind of data consistency. A scenario is described in section 2.7 of [16]. The solution is very simple: block the write request, send cache-invalidation messages to every client that caches the data being modified, and finish the write only after all of them have replied. Cache coherence is obviously satisfied, but write efficiency is obviously very low as well. Three is described in the second paragraph of section 3.2 of [15]; although very brief, it reveals an important point: instead of being explicitly invalidated, a chunk's lease can simply be allowed to expire so that it is no longer held, which greatly reduces the overhead of write operations. These two designs attack the problem at two different stages.
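A rough sketch of that invalidation-on-write idea (my own simplification, all names hypothetical):

#include <iostream>
#include <set>
#include <string>

// Hypothetical client handle; in Chubby this would be an RPC session.
struct ClientHandle {
    // Returns once the client has acknowledged dropping its cached copy.
    void InvalidateCache(const std::string& key) {
        std::cout << "client dropped its cached copy of " << key << "\n";
    }
};

struct LockServer {
    std::set<ClientHandle*> cachers;  // clients known to cache the data

    // The write blocks until every caching client has acknowledged the
    // invalidation; only then is the modification applied.
    void Write(const std::string& key, const std::string& value) {
        for (ClientHandle* c : cachers)
            c->InvalidateCache(key);                 // one blocking round trip each
        std::cout << key << " = " << value << "\n";  // apply the write
    }
};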

Two and four are both solutions built on a lease maintained between the two sides of the system. [15] does not describe it in much detail, but the description of the keepalive mechanism in section 2.8 of [16] is very clear. The essence is a timeout: once the client's lease times out, the resource (the cache) is considered invalid, and it becomes valid again when the lease is renewed. Because of clock differences between the two parties and the uncertainty of communication, the client also gets a grace period; see [16] for the details. In ChubbyGo I used a fence to avoid this problem.
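A minimal sketch of the client side of such a lease (again my own simplification, all names invented): the cached value is only trusted while the lease has not expired, and every successful keepalive pushes the expiry forward.

#include <chrono>
#include <optional>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical client-side lease over one cached value.
struct LeasedCache {
    std::string value;
    Clock::time_point lease_expiry = Clock::now();

    // Each successful keepalive reply extends the lease.
    void OnKeepAliveReply(std::chrono::seconds lease_length) {
        lease_expiry = Clock::now() + lease_length;
    }

    // The cached value may only be used while the lease is valid;
    // after expiry it must be treated as invalid and re-fetched.
    std::optional<std::string> Read() const {
        if (Clock::now() < lease_expiry) return value;
        return std::nullopt;
    }
};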

Here we can notice something very interesting: none of these distributed-system solutions adopts a cache coherence strategy like the CPU cache's, that is, an MESI-style state machine driven by message passing.

Why? I think the reason is that network communication is different from the system bus: the latency of the former is unbounded (for example, the only router on the path may be broken), while bus communication is basically reliable. On the bus, pure message passing is therefore very efficient, because essentially no time is wasted (unlike the timeouts maintained in scenarios two and four, or the several rounds of messaging in distributed transactions such as 2PC and 3PC); every step is necessary. If the same approach were used in a distributed system, an operation could block for a very long time, because a single request may be delayed indefinitely (one reply fails to arrive while all the others have succeeded), which would make the system's fault tolerance poor.

Summary

From this we can also see that the same problem has different solutions under different constraints; in distributed systems in particular, unreliable and unbounded communication delays make many problems a great deal harder.

Although I am only twenty years old, I sincerely hope that within my lifetime communication technology will truly be able to guarantee that data is delivered within a certain (and shorter) bounded delay.

References:

  1. " Let's talk about Linux IO again "
  2. " (Concept) Multiple CPUs and Multi-core CPUs and Hyper-Threading (Hyper-Threading) "
  3. https://en.wikipedia.org/wiki/False_sharing
  4. " The Impact of CPU Cache on Performance "
  5. https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu
  6. " Talk about the LOCK instruction of the CPU "
  7. " Intel LOCK Prefix Command "
  8. https://en.wikipedia.org/wiki/MESI_protocol
  9. https://en.wikipedia.org/wiki/MOESI_protocol
  10. https://www.realworldtech.com/common-system-interface/5/
  11. " "The Big Talk Processor" Cache Coherency Protocol MESI "
  12. " Talk about the Cache Coherence Protocol "
  13. Cache coherence wiki
  14. " Distributed Cache (Consistency) "
  15. Thesis "The Google File System"
  16. 论文《The Chubby lock service for loosely-coupled distributed systems》
  17. 论文《Leases: an efficient fault-tolerant mechanism for distributed file cache consistency 》
  18. " Analysis of Distributed Database and Cache Double Write Consistency Scheme "
