What is the CAP theorem that people so often ask about?

Before answering this question, let's first understand what the CAP theorem for distributed systems is.

According to the definition in Baidu Encyclopedia, the CAP theorem, also known as the CAP principle, concerns three properties of a distributed system: Consistency, Availability, and Partition tolerance. A system can guarantee at most two of the three at the same time.

First, the definition of CAP

Consistency:

"All nodes see the same data at the same time": once an update succeeds and returns to the client, every node holds exactly the same data at that moment. This is distributed consistency. Consistency problems are unavoidable in concurrent systems. From the client's perspective, consistency is about how to read the updated data under concurrent access. From the server's perspective, it is about how an update is replicated and distributed across the entire system so that all copies of the data end up consistent.

Availability:

Availability means "Reads and writes always succeed": the service is always available and responds within a normal response time. Good availability mainly means that the system serves users well, without bad experiences such as failed operations or access timeouts.

Partition Tolerance:

That is, when a distributed system encounters a node failure or a network partition, it can still provide external services that satisfy consistency or availability.

Partition tolerance lets the application appear to be a single functioning whole even though it is a distributed system. For example, if one or several machines in the system go down, the remaining machines continue to operate normally and meet the system's requirements, with no noticeable impact on users.

Second, the proof of the CAP theorem

Now let's show why all three properties cannot be satisfied at the same time.

Assume there are two servers, N1 and N2. N1 hosts application A and a copy of database V; N2 hosts application B and another copy of database V. The network between them is working, so the two servers act as two nodes of one distributed system.

When consistency is satisfied, N1 and N2 hold the same data at the start: both copies of V are in state DB0. When availability is satisfied, the user gets an immediate response whether the request goes to N1 or N2. When partition tolerance is satisfied, a failure of N1 or N2, or of the network between them, does not stop the remaining node from operating normally.

When a user sends an update through application A on N1, the copy of V on N1 changes from DB0 to DB1. Through the distributed system's data synchronization, the copy of V on N2 is also updated to DB1. A user who then sends a read request through application B on N2 gets the freshly updated data DB1.
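The normal case can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Node` class, its `peers` list, and the synchronous push inside `write` are hypothetical names and choices made for illustration, not part of any real database.

```python
class Node:
    """One server holding a copy of database V (hypothetical toy model)."""

    def __init__(self, name, value="DB0"):
        self.name = name
        self.value = value   # current state of the local copy of V
        self.peers = []      # other nodes that receive replicated updates

    def write(self, new_value):
        # Apply the update locally, then push it to every peer before
        # acknowledging the client (assumed synchronous replication).
        self.value = new_value
        for peer in self.peers:
            peer.value = new_value
        return "ok"

    def read(self):
        return self.value


n1, n2 = Node("N1"), Node("N2")
n1.peers.append(n2)
n2.peers.append(n1)

n1.write("DB1")           # update sent through application A on N1
latest = n2.read()        # read sent through application B on N2
print(latest)             # DB1
```

As long as the network is healthy, a read on N2 right after a write on N1 sees the new state.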

The above is the normal case, but the biggest problem in a distributed system is network transmission. Now suppose an extreme situation: the network between N1 and N2 is disconnected, yet the system must still tolerate this failure, that is, it must satisfy partition tolerance. Can it then satisfy consistency and availability at the same time?

Suppose the network between N1 and N2 suddenly fails, and a user then sends a data update request to N1. The data on N1 is updated from DB0 to DB1, but because the network is disconnected, the database on N2 is still DB0.

If a user now sends a read request to N2, the data has not been synchronized, so the application cannot immediately return the latest data DB1. What should the system do? There are two options. First, sacrifice consistency and return the old data DB0 to the user. Second, sacrifice availability and block until the network connection is restored and the update has been replicated, then respond with the latest data DB1.

This walkthrough is simple, but it shows that a distributed system that must tolerate partitions can choose only one of consistency and availability. In other words, a distributed system cannot satisfy all three properties at the same time, so we have to make trade-offs when building the system. How, then, do we choose a good strategy?
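The two options can be made concrete with a small Python sketch. Everything here is a hypothetical model: `Network`, `Node`, `Unavailable`, and the `mode` flag are illustrative names, and "blocking until the network recovers" is approximated by raising an exception instead of actually waiting.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk stale data."""


class Network:
    def __init__(self):
        self.connected = True   # state of the link between N1 and N2


class Node:
    def __init__(self, name, network, mode):
        self.name = name
        self.network = network
        self.mode = mode        # "CP" or "AP" (assumed policy switch)
        self.value = "DB0"
        self.peer = None

    def write(self, new_value):
        self.value = new_value
        if self.network.connected:
            self.peer.value = new_value   # replication reaches the peer
        # During a partition the peer keeps its old value.

    def read(self):
        if not self.network.connected and self.mode == "CP":
            # Cannot confirm the local copy is current: give up availability.
            raise Unavailable(f"{self.name} cannot reach its peer")
        # AP mode (or a healthy network): answer from local data.
        return self.value


def read_after_partition(mode):
    net = Network()
    n1, n2 = Node("N1", net, mode), Node("N2", net, mode)
    n1.peer, n2.peer = n2, n1
    net.connected = False   # the link between N1 and N2 fails
    n1.write("DB1")         # N1 becomes DB1, N2 stays at DB0
    try:
        return n2.read()
    except Unavailable:
        return "unavailable"


print(read_after_partition("AP"))   # DB0 (stale, but answered)
print(read_after_partition("CP"))   # unavailable (consistent, but no answer)
```

The sketch follows the scenario in the text, where N1 still accepts the write during the partition; only the read on N2 shows the difference between the two choices.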

Third, choosing a strategy

Since only two of the three CAP properties can be satisfied at once, there are three strategies to choose from:

CA without P: If P is not required (partitions are not allowed), then C (strong consistency) and A (availability) can both be guaranteed. But giving up P also means giving up the system's scalability: the nodes cannot be distributed and no additional nodes can be deployed, which contradicts the original purpose of designing a distributed system.

CP without A: If A (availability) is not required, every request must maintain strong consistency between servers, and a partition (P) can prolong synchronization indefinitely, because the service only becomes accessible again once data synchronization completes. In the event of a network failure or message loss, user experience is sacrificed: users regain access only after all the data is consistent. Many systems are in fact designed as CP, most typically distributed databases such as HBase; Redis is often listed here as well, although its replication is asynchronous by default, so it does not guarantee strong consistency out of the box. For these distributed databases, data consistency is the most basic requirement; if it cannot be met, it is better to use a relational database directly rather than spend resources deploying a distributed one.

AP without C: To be highly available and tolerate partitions, you have to give up consistency. Once a partition occurs, nodes may lose contact with each other. To stay highly available, each node can only serve requests from its local data, which leads to inconsistencies in the global data. A typical example is a flash sale for mobile phones on a certain vendor's site. The page may show that the product is in stock when you browse it, but a few seconds later, when you have selected the product and are ready to place an order, the system tells you that the order has failed and the product is sold out. The system first guarantees normal service in terms of A (availability) and makes some sacrifices in data consistency. Although this affects user experience to some extent, it does not seriously block the user's shopping flow.
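The flash-sale behavior described under AP can be imitated with a cached stock count. This is a minimal sketch with hypothetical names (`Inventory`, `ProductPage`, `cached_stock`); it stands in for a product page served from possibly stale local data and an order step that checks the authoritative count.

```python
class Inventory:
    """Authoritative stock count (assumed single source of truth)."""

    def __init__(self, stock):
        self.stock = stock

    def place_order(self):
        # An order succeeds only if real stock remains.
        if self.stock <= 0:
            return False
        self.stock -= 1
        return True


class ProductPage:
    """Serves a locally cached stock count so the page always responds."""

    def __init__(self, inventory):
        self.inventory = inventory
        self.cached_stock = inventory.stock

    def refresh(self):
        # Re-synchronize the cache with the authoritative count.
        self.cached_stock = self.inventory.stock

    def in_stock(self):
        return self.cached_stock > 0


inventory = Inventory(stock=1)
page = ProductPage(inventory)

inventory.place_order()                  # another buyer takes the last phone
page_says_in_stock = page.in_stock()     # the page still shows stock
order_ok = inventory.place_order()       # our order fails: sold out

print(page_says_in_stock, order_ok)      # True False
page.refresh()
print(page.in_stock())                   # False
```

The page never blocks, at the cost of briefly showing stock that no longer exists.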

Fourth, summary

In most large-scale Internet applications today, there are many hosts, deployments are scattered, clusters keep growing, and the number of nodes only increases. Node failures and network failures are therefore the norm, and partition tolerance has become a problem that every distributed system must face. That leaves a choice only between C and A. Traditional projects may be different. Take a bank's transfer system: when money is involved, data consistency cannot be compromised at all. C must be guaranteed, and if a network failure occurs the system would rather stop serving, so the trade-off is made between A and P.

All in all, there is no single best strategy. A good system should be architected around its business scenario, and the strategy that fits that scenario is the best one.