(Repost) How do we deal with the problems that network partitions bring to distributed databases?

Original address  : http://os.51cto.com/art/201307/403298.htm

2013-07-17 11:12 Maizi Mai

In OpenStack, the database is the primary source of "state" for most of the system. The database provides OpenStack with a "state" component and shifts the problem of sharing that state onto the database, so scaling OpenStack really means scaling the database itself. This article analyzes the problems that "network partitions" bring to database scaling, and how OpenStack components can avoid, solve, or accept them.

In OpenStack, the database is the primary source of "state" for most of the system. Most core projects use a traditional relational database for system data and state storage; in addition, Ceilometer uses MongoDB, and some incubator projects use Redis as a queue or state store. The database provides OpenStack with a "state" component and shifts the problem of sharing that state onto the database, so scaling OpenStack really means scaling the database itself. For example, the most troublesome part of any OpenStack HA solution is scaling the traditional relational database or whatever other data store is used. The root of the scaling problem is that these databases were not built for distribution and good scalability, and that root cause leads to the biggest nightmare of distributed systems: the "network partition".

The following sections analyze the problems that "network partitions" cause for database scaling, and how OpenStack components can avoid, solve, or accept them.

Consistency

Modern software systems are built from a set of "components" that communicate with each other over an asynchronous, unreliable network. Understanding whether a distributed system can be trusted requires analyzing the network itself, and sharing "state" is one of the most important problems.

As an example, when you publish a blog post, you might wonder what actually happens after you hit "publish":

1. It is visible to everyone from now on;

2. It is visible to your connections from now on; everyone else sees it after a delay;

3. It may be temporarily invisible even to you, but will become visible in the future;

4. It is either visible or invisible right now: an error occurred;

5. …… and so on.

Different distributed systems make trade-offs that affect consistency and durability. For example, Dynamo-like systems tune consistency through NWR values. With N=3, W=2, R=1 you get the following (a small sketch of these quorum rules follows the list):

1. You may not see an update right away

2. Updated data will survive a node failure
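To make the NWR trade-off concrete, here is a minimal toy sketch of quorum reads and writes in Python; the `Replica` class and the random choice of nodes are assumptions made purely for illustration, not part of any real Dynamo-style client.

```python
import random
import time

N, W, R = 3, 2, 1  # replicas per key, write quorum, read quorum

class Replica:
    """A toy in-memory replica; a real one would live on another node."""
    def __init__(self):
        self.store = {}

    def write(self, key, value, ts):
        self.store[key] = (ts, value)

    def read(self, key):
        return self.store.get(key, (0, None))

replicas = [Replica() for _ in range(N)]

def quorum_write(key, value):
    """A write succeeds once W replicas acknowledge it (here: just W of N)."""
    ts = time.time()
    for node in random.sample(replicas, W):
        node.write(key, value, ts)

def quorum_read(key):
    """Read R replicas and return the newest version seen."""
    versions = [node.read(key) for node in random.sample(replicas, R)]
    return max(versions)[1]

quorum_write("k", "v1")
# With N=3, W=2, R=1 we have W + R = N, not W + R > N, so this read may
# hit the one replica that missed the write and return None -- exactly
# the "you may not see an update right away" case above.
print(quorum_read("k"))
```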

If you write to something like ZooKeeper, you get a strong consistency guarantee: the write will be visible to everyone, and it will survive as long as fewer than half of the nodes fail. If you write to something like MySQL, then depending on your transaction isolation level the write will be visible to everyone, only to you, or eventually consistent.

Network Partition

Distributed systems usually assume that the network is asynchronous, meaning the network may arbitrarily duplicate, drop, delay, or reorder messages between nodes. In practice, the TCP state machine ensures that messages are delivered between nodes without loss or duplication and in order; at the socket level, however, sends and receives can still block, time out, and so on.

Detecting network failures is hard, because the only information we can get about the state of other nodes comes through the network, and latency is indistinguishable from failure. This is the fundamental problem behind network partitions: high latency can look exactly like a failure. Once a partition occurs, we have no way of knowing what happened to the other nodes: are they still alive, or have they crashed? Did they receive our message? Are they trying to respond? When the network finally recovers, we have to re-establish connections and try to reconcile the inconsistent state.

Many systems enter a special, degraded mode of operation while dealing with a partition. CAP theory tells us that under a partition we must choose between consistency and availability, but few database systems even approach the limits CAP allows; most simply lose data.

What follows is an introduction to how some distributed systems behave in the event of a network failure.

Traditional Database and 2PC

Traditional SQL databases such as MySQL and PostgreSQL provide a range of consistency levels and usually accept writes on only one primary. We can regard these databases as CP systems (in CAP terms): if a partition occurs, part of the system becomes unavailable (because ACID must be preserved).

So are traditional databases really strongly consistent? They all use a 2PC-style strategy to commit requests:

1. The client commits the request;

2. The server performs the write and then responds;

3. The client receives the response, completing the commit.


In these three steps, the possible inconsistency lies between steps 2 and 3. If the server completes the write but the client never receives the response, whether because of a timeout or a network failure, the client will assume the operation did not complete, when in fact the database has already been written. This produces inconsistent behavior: the error the client sees does not tell it whether the server wrote the data or not.

2PC is not only widely used inside traditional SQL databases; many users also implement 2PC on top of MongoDB to get multi-key transactional operations.

So how do we solve this problem? First, accept it: the probability of a network failure landing exactly between the server completing the write and the client receiving the response is low, so the affected operations are rare, and for most businesses this failure is acceptable. If, in contrast, you must have strong consistency, push it into the application: for example, make all transactional writes idempotent and re-entrant, so that when a network problem occurs you can simply retry regardless of whether the write actually completed. Finally, some databases expose a transaction ID; by tracking transaction IDs you can, after the network recovers, ask the database whether the transaction was recorded and decide whether to retry or roll it back.
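As a hedged illustration of the retry approach above, the sketch below makes a write idempotent with a client-generated transaction ID; `execute_write` and `was_applied` are hypothetical stand-ins for whatever interface the actual database exposes.

```python
import time
import uuid

def idempotent_write(db, record, retries=3):
    """Retry a write across network failures without applying it twice.

    The client generates the transaction ID, so the server can detect
    and ignore a replay of a write it has already committed.
    """
    txn_id = str(uuid.uuid4())
    for attempt in range(retries):
        try:
            db.execute_write(txn_id, record)       # hypothetical call
            return txn_id
        except TimeoutError:
            # The response was lost: the server may or may not have written.
            # Because the write carries txn_id, asking or re-sending is safe:
            # the server treats a duplicate txn_id as a no-op.
            if db.was_applied(txn_id):             # hypothetical call
                return txn_id
            time.sleep(2 ** attempt)
    raise RuntimeError("write still unconfirmed after retries")
```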

Our options in OpenStack are limited. Not all write operations in the various projects are idempotent today, but fortunately the data loss OpenStack would suffer in this rare 2PC corner case is acceptable.

Redis

Redis is often seen as a shared heap because of its easy-to-understand consistency model, and many users use Redis as a message queue, a lock service, or even as a primary database. A single Redis instance running on one server can be regarded as a CP system (in CAP terms), so consistency is its main goal.


Redis clusters are usually deployed as primary/slave: the primary node handles reads and writes, and the slave nodes are used only as backups. When the primary node fails, a slave node can be promoted to primary. However, because replication from primary to slave is asynchronous, promoting a slave loses the last 0 to N seconds of data. At that point Redis's consistency is broken, and a Redis cluster in this mode is not a CP system!

Redis has an official component called Sentinel (see Redis Sentinel), which connects Sentinel instances in a quorum-like fashion, monitors the state of the Redis cluster, and replaces a failed primary with a slave. Redis officially describes this as an HA solution: building a CP system on top of Redis Sentinel.


Now consider what Redis Sentinel does when the network partitions. The Redis cluster is split into two parts; the side holding the Sentinel majority may promote a slave to primary. If clients keep connecting to the original primary, there will be two primary nodes at the same time (the split-brain problem)! In other words, Redis Sentinel does not prevent clients from connecting to the old primary: clients already connected to it keep writing to it, while new clients write to the new primary. At this point the CP system is completely broken. And although the Redis cluster keeps running, it does not become an AP system either, because it relies on a quorum to promote slaves.


If you use Redis as a lock service, this problem becomes fatal: after a partition, two clients can acquire the same lock at the same time and both succeed. A lock service must be a strict CP system, like ZooKeeper.
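To make the lock problem concrete, here is a minimal sketch of the common single-instance Redis lock pattern using redis-py (SET NX EX plus a compare-and-delete release); it is exactly this pattern that breaks when Sentinel promotes a slave that never replicated the lock key.

```python
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(name, ttl=10):
    """Try to take the lock; the random token lets only the owner release it."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl: succeeds only if the key does not exist yet.
    if r.set(name, token, nx=True, ex=ttl):
        return token
    return None

def release_lock(name, token):
    """Release only if we still own the lock (compare-and-delete via Lua)."""
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    else
        return 0
    end
    """
    return r.eval(script, 1, name, token)

# If the primary fails right after acquire_lock() and the key was not yet
# replicated, the newly promoted primary knows nothing about the lock and
# will happily grant it to a second client -- the split-brain case above.
```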

If you use Redis as a queue, you have to accept that an item may be delivered zero, one, or two times. Most distributed queues guarantee at-most-once or at-least-once delivery; CP systems that offer exactly-once delivery pay for it with higher latency. If you insist on using Redis as a queue service, you must accept the instability the queue may exhibit after a network partition.
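For the queue case, a common mitigation is the reliable-queue pattern sketched below with redis-py, which moves an item onto a processing list before handling it; `handle` is a hypothetical business handler and, as noted above, it must tolerate duplicate deliveries after a failover.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

QUEUE = "jobs"
PROCESSING = "jobs:processing"

def handle(item):
    """Hypothetical business handler; it must be idempotent."""
    print("processing", item)

def consume_one():
    """At-least-once consumption: move the item to a processing list first."""
    item = r.rpoplpush(QUEUE, PROCESSING)  # atomic move, survives a consumer crash
    if item is None:
        return None
    handle(item)
    r.lrem(PROCESSING, 1, item)            # acknowledge only after success
    return item

# After a partition or failover the processing list may be replayed, so the
# same item can be seen more than once; conversely, items written to the old
# primary just before failover may be lost entirely.
```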

If Redis is used as a database, then it should be clear by now that a database built on Redis Sentinel hardly deserves the name.

Finally, with Redis as it stands today, the officially provided components only make it suitable as a cache. For building a distributed Redis, see WheatRedis.

MongoDB

MongoDB uses a clustering model similar to Redis: the primary node serves as the single point for write operations and replicates asynchronously to the other nodes. However, MongoDB has a built-in primary election and replication state machine, so the cluster can communicate and elect a suitable replacement after the primary fails. MongoDB also lets you require the primary to confirm that the slave nodes have written the operation to their log, or actually applied it; in other words, some performance is traded for stronger consistency when the primary fails.

So can MongoDB be considered a strict CP system? It has a problem similar to Redis. After a network partition, if the primary node ends up in the minority partition, the nodes in the majority partition will elect a new primary. While the partition lasts, both primaries accept writes at the same time (that by itself is not the problem); then, when the partition heals, the old primary in the minority partition tries to send the operations it accepted during the split-brain period to the new primary, and conflicts can occur!


So how do we face this problem? Accept it. First, conflicts of this kind can be resolved on the client side, as with 2PC; in addition, MongoDB offers WriteConcern to mitigate the problem, at a significant performance cost.
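For reference, this is roughly how a stricter write concern might be requested with PyMongo; the connection string, database, and collection names are assumptions for illustration, and requiring `w="majority"` plus journaling is what trades latency for a smaller loss window on failover.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hypothetical replica-set connection string.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Require acknowledgement from a majority of replica-set members (and the
# journal) before a write is reported as successful.
coll = client.get_database("demo").get_collection(
    "items", write_concern=WriteConcern(w="majority", j=True)
)

coll.insert_one({"_id": 1, "state": "active"})
```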

Dynamo

Dynamo is the classic paper that appeared when the traditional primary/slave model ran into trouble, and for a while a whole wave of products borrowed from Dynamo.

The systems mentioned so far are all CP-oriented, or at least try to be. Amazon designed Dynamo with a clearly AP orientation: it is partition-friendly by nature, every node is equal, and different levels of consistency and availability are chosen through NWR. The principles of Dynamo will not be elaborated here; everyone trying to understand distributed systems should be intimately familiar with the Dynamo paper. Even though Dynamo faces many problems, the thinking and design trade-offs presented in the paper are invaluable.

So what happens to Dynamo when a partition occurs? First, with the recommended NWR settings (W+R>N), the minority partition cannot accept new writes, and new objects are written in the majority partition. After the partition heals, the stale objects in the minority partition collide with the new ones. There are many conflict resolution strategies, such as the client timestamps used by Cassandra and the vector clocks used by Riak; if a conflict cannot be resolved automatically, it may simply be overwritten, or pushed up to the business code.
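As a small illustration of the Riak-style approach, the sketch below compares two vector clocks and reports a conflict when neither version dominates, which is the case that gets pushed back to the application; the plain dictionary representation is an assumption for illustration only.

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts of node -> counter).

    Returns 'a<=b', 'b<=a', 'equal', or 'conflict' when neither dominates,
    i.e. the two versions were written concurrently on both sides of a
    partition and the application must reconcile them.
    """
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"
    if b_le_a:
        return "b<=a"
    return "conflict"

# Example: writes accepted on both sides of a partition.
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # -> conflict
```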

Moreover, Dynamo itself has no way of judging whether a node is in sync; that can only be determined through a full data comparison, and that process is expensive and inflexible. The Dynamo paper therefore acknowledges that strong consistency (W+R>N) cannot truly be achieved, and nodes that have failed are only eventually consistent.

So, as before, the way to fix Dynamo's problem is to accept it. First, your data can be designed to be immutable; then your application can tolerate discarding data or reading stale data in rare cases; or you can use CRDTs to design your data structures. Regardless, Dynamo remains a great idea, and it pushed the boundaries of distributed design.
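And as a taste of the CRDT idea, here is a minimal grow-only counter (G-Counter) in Python; merging by per-node maximum is what lets two replicas that diverged during a partition reconcile automatically afterwards.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Take the per-node maximum, so merging after a healed partition
        can never lose or double-count increments."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge during a partition, then merge cleanly afterwards.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```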

BigTable

The systems mentioned above are all distribution-oriented, whether AP or CP. Bigtable, by contrast, is a CA system. Although this article has focused on the partition problem, centralized designs like Bigtable also deserve consideration: HBase and HDFS follow the same approach, avoiding the partition problem altogether and achieving very good results within a single IDC. Centralized designs will not be discussed further here, because they simply do not consider partitions.

Thoughts on Distributed Database Systems

From the analysis above we can appreciate how hard it is to build a distributed database cluster. Whether you use synchronous replication, asynchronous replication, quorums, or something else, in the face of a network partition every struggle is futile: a network error means "I don't know", not "I failed".

Building a "correct" distributed database system is usually agreed upon in several ways: 1. Accepting rare problems 2. Using open source software, distributed systems can generate a great "vortex" in "theoretically correct distributed algorithms" " and "Actual System Used". A buggy system but a correct algorithm is more acceptable than a bad design. 3. Use the problem for correct design, such as using [CRDTs](http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf) 4. The split-brain problem is the original sin of partitioning, How to resolve legacy after split-brain is the right solution

Summary

Achieving HA in OpenStack is a direction that both the OpenStack community and the various distribution vendors are working hard on, and the key lies in the HA and consistency of the data store. In this direction, the core problem to face is the "network partition". From the analysis above and the discussion of how it plays out on different kinds of databases, you can see how to avoid it, solve it, or accept it. By thinking through these issues in an OpenStack product, we can build the HA solution on a much firmer foundation.
