Every developer should understand database consistency

Imagine assigning a value to a variable, reading it back immediately, and finding that the write never took effect. Sounds crazy, right?

    x = 42
    assert x == 42  # throws an exception

You may run into exactly this situation when using a distributed data store with weak consistency guarantees. You may ask: "Wait, shouldn't the database solve the consistency problem for me?" Whether a read after an update reflects the new value immediately, or only after some delay, depends on the guarantees the database provides.

The consistency guarantees provided by some databases are a bit counterintuitive, but they exist for a reason: to provide high availability and high performance. Some databases, such as Azure's Cosmos DB and Cassandra, even let you choose between better performance and stronger guarantees. Either way, you need to understand the trade-offs.

 

Anatomy of a database request

Let's take a look at what happens when you send a request to the database. In an ideal world, your request would be executed instantaneously.

However, we do not live in an ideal world. Your request has to travel to the data store, be processed there, and the response has to travel back to you. All of this takes time and cannot happen in an instant.

The best guarantee a database can provide is that the request is executed at some point between its invocation and its completion. You might think this is no big deal; after all, it is what you are used to when writing single-threaded applications. For example, if you assign 1 to x and then read x, you are guaranteed to get 1, provided that no other thread writes to the same variable. But once a data store replicates state across multiple machines for high availability and scalability, all bets are off. To understand why, let's explore the trade-offs system designers must weigh when implementing reads in a simplified model of a distributed database.

Suppose we have a distributed key-value store consisting of a set of replicas. One replica is elected leader, and it is the only node that can accept writes. After the leader accepts a write, it asynchronously propagates it to the other replicas. Although all replicas receive the same updates in the same order, they receive them at different points in time.
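To make this model concrete, here is a minimal Python sketch (all class and method names are illustrative, not a real database API): a single leader accepts writes and ships them, in order, to replicas that apply them after a delay.

    import queue
    import threading
    import time

    class Replica:
        """A follower: applies updates in write order, after a delay."""
        def __init__(self, lag_seconds):
            self.data = {}
            self.lag = lag_seconds
            self._log = queue.Queue()
            threading.Thread(target=self._apply_loop, daemon=True).start()

        def replicate(self, key, value):
            self._log.put((key, value))   # updates arrive in the leader's order

        def _apply_loop(self):
            while True:
                key, value = self._log.get()
                time.sleep(self.lag)      # simulate replication delay
                self.data[key] = value

    class Leader:
        """The only node that accepts writes; fans them out asynchronously."""
        def __init__(self, replicas):
            self.data = {}
            self.replicas = replicas

        def write(self, key, value):
            self.data[key] = value        # applied locally right away
            for r in self.replicas:
                r.replicate(key, value)   # each follower applies it later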

Your task is to come up with a strategy for handling read requests: should reads go to the leader or to the other replicas? If all reads go through the leader, throughput is capped at what a single node can handle. If any replica can serve reads, throughput improves dramatically, but then two clients (observers) may see inconsistent views of the system state, because each replica lags behind the leader by a different amount.
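Continuing the sketch above, two clients pinned to different replicas can observe different states at the same moment:

    fast = Replica(lag_seconds=0.1)
    slow = Replica(lag_seconds=2.0)
    leader = Leader([fast, slow])

    leader.write("x", 42)
    time.sleep(0.5)               # long enough for the fast replica only
    print(fast.data.get("x"))     # 42
    print(slow.data.get("x"))     # None -- the write hasn't arrived yet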

Simply put, we have to trade off the consistency of the system as seen by its observers against its performance and availability. To reason about this trade-off, we need to define consistency precisely. Consistency models (https://jepsen.io/consistency) define the views of the system state that an observer may possibly experience.

 

Strong consistency

If clients' writes and reads can only be sent to the leader, then every request appears to execute atomically at a specific point in time, as if there were a single copy of the data. No matter how many replicas there are, and no matter how far each one lags behind, as long as clients query the leader directly, from their point of view there is only one copy of the data.

Since a request is not served instantaneously, and only one node serves it, the request must take effect at some point between its invocation and its completion. Another way to think about it: once a request completes, its side effects are visible to all observers.

Since other participants see each request take effect between its invocation and its completion, the real-time order of operations is preserved. This guarantee is formalized by a consistency model called linearizability, also known as strong consistency. Linearizability is the strongest consistency guarantee a system can provide for single-object requests.
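In terms of the toy model, linearizability is what you get when every request goes through the leader:

    leader.write("x", 1)
    # Both operations hit the single leader, so a read that starts after the
    # write completes is guaranteed to observe it -- as if there were only
    # one copy of the data.
    assert leader.data["x"] == 1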

But what if a client sends a read to the leader, and by the time the request arrives, that node has been deposed, yet it still believes it is the leader? If the deposed leader serves the request, the system is no longer strongly consistent. To prevent this, the presumed leader must first contact a majority of the replicas to confirm that it is still the leader. Only then may it execute the request and send the response back to the client. This noticeably increases the time required for a read.
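A sketch of that confirmation step, under the assumption that every node exposes a hypothetical current_leader() view (this is illustrative, not a real consensus implementation):

    def linearizable_read(leader, key, cluster):
        """Serve the read only if a majority still considers us the leader."""
        acks = sum(1 for node in cluster if node.current_leader() is leader)
        if acks <= len(cluster) // 2:
            raise RuntimeError("possibly deposed: refusing to serve the read")
        return leader.data.get(key)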

 

Sequential consistency 

So far, we have discussed serving reads through the leader, in order. But this approach creates a bottleneck that limits the system's throughput. On top of that, the leader must contact a majority of replicas just to serve a read. To improve read performance, we should allow the replicas to serve reads as well.

Although a replica lags behind the leader, it receives updates in the same order as the leader. If client A queries only replica 1, and client B queries only replica 2, then at any given moment the two clients may see different states, because the replicas are not perfectly synchronized.

In this consistency model, operations occur in the same order for all observers, but the model makes no real-time guarantee about when an operation's side effects become visible to each observer. This model is called sequential consistency. What distinguishes it from linearizability is precisely this missing real-time guarantee.

A simple application of this model is a producer/consumer system synchronized by a queue: a producer node writes to the queue, and a consumer reads from it. Producer and consumer see the items in the same order, but the consumer lags behind the producer.
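A minimal illustration using Python's thread-safe queue: the consumer observes items in exactly the order the producer wrote them, just later.

    import queue
    import threading

    q = queue.Queue()

    def producer():
        for i in range(5):
            q.put(i)                    # writes happen in one well-defined order

    def consumer():
        for _ in range(5):
            print("consumed", q.get())  # same order, observed with a lag

    threading.Thread(target=producer).start()
    threading.Thread(target=consumer).start()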

 

Eventual consistency

Although we managed to improve read throughput, we had to pin each client to a specific replica. What happens if that replica goes down? To improve availability, we could let the client query any replica. But in terms of consistency, this step comes at a high price. Suppose there are two replicas, 1 and 2, where replica 2 lags behind replica 1. If a client queries replica 2 right after querying replica 1, it will see an earlier state, which can be very confusing. The only guarantee the client has is that if writes to the system stop, all replicas will eventually converge to the same final state. This consistency model is called eventual consistency.
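In the toy model, letting a client hit any replica means two consecutive reads can travel backwards in time (the values shown are only illustrative):

    import random

    def read_any(replicas, key):
        # No pinning: each read may land on a different replica.
        return random.choice(replicas).data.get(key)

    leader.write("x", 43)
    time.sleep(0.5)
    print(read_any([fast, slow], "x"))  # might be 43 (fresh) or stale
    print(read_any([fast, slow], "x"))  # might be *older* than the previous read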

Building applications on top of an eventually consistent data store is hard, because its behavior differs from what you are used to when writing single-threaded applications. Subtle bugs can creep in that are difficult to debug and reproduce. Yet not every application needs linearizability, so eventual consistency has its uses. You need to choose wisely and carefully consider whether the guarantees your data store provides satisfy your application's needs. If you are counting website visits, eventual consistency is a fine choice, since it hardly matters if a read returns a slightly stale number. For a payment system, however, strong consistency is absolutely indispensable.

 

PACELC theorem

Beyond the models introduced in this article, there are many other consistency models, but the basic idea behind them is the same: the stronger the consistency guarantee, the longer the latency of individual operations, and the lower the availability of the store in the event of failure. This relationship is also known as the PACELC theorem: if there is a network partition (P) in a distributed system, one must choose between availability (A) and consistency (C); else (E), even when the system is running normally without partitions, one must choose between latency (L) and consistency (C).
