CAP Theory in Distributed Systems

Article directory

1. Introduction

What is CAP theory?

2. Properties

2.1 Consistency (data consistency)

2.2 Availability

2.3 Partition tolerance

3. How to choose CAP

4. Common misconceptions about CAP

5. Limitations of CAP


1. Introduction

What is CAP theory?

The CAP theory (also known as the CAP theorem) states that a distributed system can satisfy at most two of the following three properties at the same time: Consistency, Availability, and Partition tolerance. It is one of the most basic and important theories in distributed systems.


2. Properties

2.1 Consistency (data consistency)

Consistency means that the data changes together so that it stays uniform: the data stored on the different nodes of the distributed system must have the same content.

When does the data change?

Data changes if and only if the service holding it receives an update request. Update requests come in three kinds: adding, deleting, and modifying data. Together they are referred to as write requests. Therefore, data changes only when a write request is made.

What does it mean for the data to change together? Whether a change was consistent can only be checked by a read request, so on what basis does a read request judge it?

Suppose a distributed system has two nodes, and each node holds some data that needs to be changed. If, after a write request, the data on both nodes has changed and a subsequent read request reads all of the changed data, we call this a consistent data change.

However, this is not yet complete consistency, because the system cannot run normally forever. What if a problem inside the system prevents its nodes from changing consistently? In that case, read requests that expect the latest data are likely to see old data, or to get different versions of the data. To preserve the data consistency that the distributed system presents to the outside, we then choose not to return any data.
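The two-node scenario above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names Node, write, consistent_read, and InconsistentReplicas are made up for illustration, and a real system would compare versions or use a consensus protocol rather than comparing raw values.

```python
from dataclasses import dataclass, field


class InconsistentReplicas(Exception):
    """Raised when nodes disagree, so no consistent answer exists."""


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    healthy: bool = True  # False simulates an internal failure (assumption)


def write(nodes, key, value):
    """Apply a write request to every node that can still receive it."""
    for node in nodes:
        if node.healthy:
            node.data[key] = value


def consistent_read(nodes, key):
    """Return the value only if all nodes hold the same value for key."""
    values = {node.data.get(key) for node in nodes}
    if len(values) != 1:
        # Prefer returning nothing over returning old or mixed data.
        raise InconsistentReplicas(f"nodes disagree on {key!r}")
    return values.pop()


a, b = Node("A"), Node("B")
write([a, b], "stock", 10)
print(consistent_read([a, b], "stock"))  # both nodes changed together

b.healthy = False          # B misses the next update
write([a, b], "stock", 9)
try:
    consistent_read([a, b], "stock")
except InconsistentReplicas as err:
    print("no data returned:", err)
```

The second read fails on purpose: once the nodes hold different versions, the sketch gives up answering instead of exposing the inconsistency.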

Note:

        CAP theory describes the choice made in a particular state, which sets it apart from practical engineering theory. The CAP theorem is descriptive: it is based on the state of the system at a given moment and does not solve engineering problems. Because CAP is an academic theory rather than an engineering one, it leaves out many practical details.

2.2 Availability 

Availability in CAP is a requirement on results. It requires that the nodes in the system process and respond to every request they receive, whether it is a write request or a read request. Two conditions must be met:

Condition 1: The result must be returned within a reasonable time. What counts as reasonable is determined by the business: if the business requires a response within 100ms, the reasonable time is 100ms; if it requires a response within 1s, it is 1s. If the business sets the limit at 100ms and the result comes back after 1s, the system does not meet the availability requirement.

Condition 2: Every node in the system that can accept requests normally must return a result. This has two implications:

  • If a node cannot receive requests at all, for example because it is down or has crashed, but the other nodes can still receive requests normally, the system is still available. In other words, partial downtime does not affect availability.
  • If a node can receive requests normally but finds a problem with its internal data, it must still return a result, even if that result is questionable. For example, suppose the system has two nodes, one holding data from three days ago and the other holding data from two minutes ago. If a read request reaches the node with the three-day-old data, that node must return it, even though it may not be what the user wants.

        Availability only guarantees that users get some response when they read; it does not guarantee that what they read is the latest data. A node that is disconnected from the other nodes does not know whether its local data is up to date, but it still returns that data when a user request arrives.
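The two availability conditions can be expressed as a small check. The sketch below is illustrative only: StorageNode, Response, and system_is_available are hypothetical names, and latency is passed in as a number instead of being measured so that the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    value: Optional[str]  # None would mean the node returned nothing
    elapsed_ms: float     # how long the node took to answer


@dataclass
class StorageNode:
    name: str
    value: str            # local copy, possibly stale
    latency_ms: float     # simulated response time (assumption)
    up: bool = True       # False simulates downtime

    def read(self) -> Response:
        # A node that accepts requests always answers from local data,
        # even if it cannot tell whether that data is the latest.
        return Response(self.value, self.latency_ms)


def system_is_available(nodes, deadline_ms: float) -> bool:
    """Check both conditions for the nodes that can accept requests."""
    live = [n for n in nodes if n.up]  # downed nodes do not count
    if not live:
        return False
    responses = [n.read() for n in live]
    return all(r.value is not None and r.elapsed_ms <= deadline_ms
               for r in responses)


nodes = [
    StorageNode("n1", "data from three days ago", latency_ms=20),
    StorageNode("n2", "data from two minutes ago", latency_ms=30),
]
print(system_is_available(nodes, deadline_ms=100))  # True

nodes[1].up = False  # partial downtime
print(system_is_available(nodes, deadline_ms=100))  # True

nodes[0].latency_ms = 1000  # answer arrives after the business deadline
print(system_is_available(nodes, deadline_ms=100))  # False
```

Note that the node holding three-day-old data still counts as available: the check looks at whether and how fast a node answers, not at how fresh the answer is.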

2.3 Partition tolerance

A distributed storage system has many nodes, and these nodes communicate over the network. However, the network is unreliable. When communication between nodes fails, we say that the distributed storage system has a partition. Note that a partition is not necessarily caused by a network failure; it may also be caused by a machine failure.

For example, suppose a distributed storage system has two nodes, A and B. If underlying network devices between A and B, such as routers or switches, fail, A and B can no longer communicate, and we say there is a partition between A and B. If A goes down, communication between A and B also fails, and this too is called a partition between A and B.

To sum up, in a distributed system a partition occurs whenever communication between nodes fails. Partition tolerance means that when a partition occurs, the distributed storage system as a whole must keep running; it cannot stop just because of a partition.
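The A and B example can be modeled with a toy cluster in which either a failed link or a failed machine breaks communication. The Cluster class and its methods are hypothetical, and the model assumes nodes talk only over direct links with no alternative routes.

```python
class Cluster:
    """Toy model: nodes talk over direct links only (assumption)."""

    def __init__(self, names):
        self.up = {name: True for name in names}
        self.cut_links = set()  # unordered pairs whose link has failed

    def cut_link(self, a, b):
        """Simulate a router or switch failure between a and b."""
        self.cut_links.add(frozenset((a, b)))

    def crash(self, name):
        """Simulate a machine failure."""
        self.up[name] = False

    def can_communicate(self, a, b):
        link_ok = frozenset((a, b)) not in self.cut_links
        return self.up[a] and self.up[b] and link_ok

    def partitioned(self, a, b):
        # From the other node's point of view, a dead peer and a dead
        # link look the same: messages simply do not get through.
        return not self.can_communicate(a, b)


network_failure = Cluster(["A", "B"])
print(network_failure.partitioned("A", "B"))  # False
network_failure.cut_link("A", "B")
print(network_failure.partitioned("A", "B"))  # True

machine_failure = Cluster(["A", "B"])
machine_failure.crash("A")
print(machine_failure.partitioned("A", "B"))  # True
```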


3. How to choose CAP

At this point we already know that when designing a distributed system, we can only choose two of the three properties C, A, and P.

However, in a distributed system partitions are unavoidable, so P must be chosen. Without P, a single partition error would leave the entire distributed system completely unusable. For distributed systems, therefore, the only real question is how to choose between consistency and availability when a partition error occurs.

According to the choice of consistency and availability, open source distributed systems are often divided into CP systems and AP systems.

If, when a partition failure occurs, client requests block or time out but every node that does respond returns consistent data, the system is a CP system. ZooKeeper is an example.

If, when a partition failure occurs, clients can still access the system but some of the data they get is new and some is still old, the system is an AP system. Eureka is an example.
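The difference in behavior can be sketched with a single replica that is configured as either CP or AP. This is a minimal illustration, not how ZooKeeper or Eureka are implemented; Mode, Replica, and Unavailable are made-up names, and the partitioned flag is set by hand instead of being detected.

```python
from enum import Enum


class Mode(Enum):
    CP = "CP"
    AP = "AP"


class Unavailable(Exception):
    """Raised by a CP replica that cannot guarantee consistent data."""


class Replica:
    def __init__(self, mode, value=None):
        self.mode = mode
        self.value = value
        self.partitioned = False  # True while cut off from its peers

    def read(self):
        if self.partitioned and self.mode is Mode.CP:
            raise Unavailable("partitioned: refusing a possibly stale read")
        return self.value  # AP: answer from local data, maybe stale

    def write(self, value):
        if self.partitioned and self.mode is Mode.CP:
            raise Unavailable("partitioned: refusing an unreplicated write")
        self.value = value  # AP: accept locally, reconcile later


cp, ap = Replica(Mode.CP, "v1"), Replica(Mode.AP, "v1")
cp.partitioned = ap.partitioned = True

print(ap.read())  # the AP replica still answers, possibly with old data
try:
    cp.read()
except Unavailable as err:
    print("CP replica:", err)
```

While no partition exists, both replicas behave identically; the mode only matters once partitioned is True.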

When an internal problem occurs in a distributed system, we have two choices:

  • Accommodate external services
  • Let external services accommodate us

Accommodating external services means that our own problems must not disrupt the business operation of external services, so availability is given priority. Letting external services accommodate us means that we give priority to consistency, and external services must tolerate failed or delayed requests.


4. Common misconceptions about CAP

Misunderstanding 1: Distributed systems give up one of C or A because of the CAP theorem

Many people think that a distributed system can only ever have either availability or consistency, never both in full.

In fact, partitions occur with very low probability. As long as no partition has occurred, the system should provide both full data consistency and full availability; the choice between C and A only has to be made while a partition exists.

Misunderstanding 2: The choice between C and A applies to the entire distributed system and can only be made for the system as a whole

When a partition occurs, the choice between consistency and availability is actually local, not made for the entire system.

The choice may be made separately in individual subsystems, and it may even be necessary to choose between consistency and availability for a particular event or piece of data.

Running a distributed system means making this choice again and again. Different events occurring at different stages and at different times will not always call for the same choice.
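A local, per-data choice could look like the sketch below. The data kinds and their preferences are invented purely for illustration, and decide is a hypothetical helper; the point is only that the decision depends on what is being read and on whether a partition currently exists.

```python
# Hypothetical data kinds and preferences, chosen only for illustration.
PREFERENCE = {
    "account_balance": "consistency",
    "product_reviews": "availability",
}


def decide(kind: str, partitioned: bool) -> str:
    """Return 'serve' or 'reject' for a read of the given data kind."""
    if not partitioned:
        return "serve"   # no partition: consistent and available
    if PREFERENCE[kind] == "availability":
        return "serve"   # answer anyway; the data may be stale
    return "reject"      # protect consistency for this kind of data


for kind in PREFERENCE:
    for partitioned in (False, True):
        print(kind, partitioned, decide(kind, partitioned))
```

Only one of the four combinations is rejected: a read of consistency-sensitive data during a partition.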

Misunderstanding 3: The three properties of CAP are yes-or-no options, not ranges

The three properties of CAP theory are not Boolean. It is not simply a matter of consistent or inconsistent, available or unavailable, partitioned or not partitioned. Rather, each of the three properties covers a range of degrees.


5. Limitations of CAP

1. The CAP theorem does not take network latency into account. It assumes that consistency takes effect immediately, whereas in practice reaching consistency takes time. This is one reason distributed systems are often built as AP systems.

2. Consistency and availability are not a strict either-or choice; there are important differences of degree. Emphasizing consistency does not mean giving up availability completely, and when availability is emphasized, technical means are often used to ensure that the data eventually becomes consistent.

3. From an engineering point of view, CAP theory only describes a state. It tells people what state a distributed system may be in when something goes wrong. But states change, and CAP gives no direction on how to switch between states, how to repair the system, or how to recover.
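Point 2 mentions technical means for reaching eventual consistency. One simple possibility, shown here only as a sketch, is to have replicas periodically exchange their data and keep the entry with the highest version number. The sync function and the (version, value) layout are assumptions made for this example; real systems use a variety of reconciliation schemes.

```python
def sync(replica_a: dict, replica_b: dict) -> None:
    """Make both replicas hold the highest-versioned entry for every key."""
    for key in set(replica_a) | set(replica_b):
        # A missing key is treated as version 0 so any real entry wins.
        entry_a = replica_a.get(key, (0, None))
        entry_b = replica_b.get(key, (0, None))
        newest = max(entry_a, entry_b, key=lambda entry: entry[0])
        replica_a[key] = newest
        replica_b[key] = newest


# Each replica maps key -> (version, value); versions start at 1.
a = {"cart": (2, ["book", "pen"])}
b = {"cart": (1, ["book"]), "theme": (1, "dark")}

sync(a, b)   # run once the partition has healed
print(a == b)  # True
```

While the partition lasts, each replica keeps serving its own copy, which is the AP behavior; the sync step afterwards is what brings the copies back together.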

Origin blog.csdn.net/HAOMINGS/article/details/127080688