Data redundancy in distributed storage

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This blog post was created by Li Zhaolong; please credit the author and indicate the copyright when reproducing it.

Introduction

The same problem is understood differently at different stages of learning. Early on, before I knew this field well, my view of a problem was both shallow and one-sided, which made it easy to see the problem only through one particular solution. Take this "simple" problem, data redundancy: at the very beginning it looked to me exactly like the consistency-protocol problem, because we can build data redundancy on top of a consensus protocol. That gives strong consistency, of course, which is the important part, but it is well known that strong consistency (from the server's perspective; let's ignore network delay for the moment) badly hurts performance, so most of the time we don't do it that way. We may step back and choose a weaker consistency model instead, and that in turn changes the data redundancy strategy. This is exactly what I failed to notice in [1]: I was so busy thinking about consistency that I completely forgot that what actually exhibits the consistency is the data redundancy itself.

Secondly, this article does not aim for very general definitions the way [2] does. Instead it describes several redundancy strategies used in real systems, infers the consistency they provide, and maps them onto the concepts in [2].

Most of the material in this article comes from the papers of major companies, so correctness should not be a problem.

This article covers only data redundancy strategies and nothing else.

Consistency protocols

Strong consensus protocols may be the first hurdle most people studying distributed theory run into when entering the field: first worn down by Raft (and Multi-Raft), then mauled by Paxos, and finally discovering that there are also atomic broadcast protocols such as ZAB. Add the Byzantine Generals problem and there are countless more algorithms, such as PBFT and the blockchain-oriented PoW, PoS, and DPoS; drop the requirement of strong consistency and there are very weak protocols such as Gossip (commonly used to maintain cluster membership); relax the constraints differently and there are replication protocols such as Viewstamped Replication. I don't know much about that last one; interested readers can look at [5].

There is not much to say about data redundancy built on a consensus protocol: it is essentially the process of replicating a log between nodes.

Ceph

Data distribution in Ceph is done with the CRUSH algorithm: from the file's inode and object we can compute a pgid; feeding that pgid and the cluster map into CRUSH yields a set of OSDs, and through the map we can then find their actual storage locations.

The first OSD in the list is the Primary and the rest are Replicas. The client sends all writes for an object's PG to the primary OSD, where the objects and the PG are assigned a new version number; the write is then forwarded to the replica OSDs. When every replica has completed the write and responded to the primary, the primary applies the update and the client's write completes. This is a two-phase reply of sorts: when a replica has written the update into memory it replies with an Ack, and the primary can acknowledge the client once it has collected all Acks; when a replica has committed to disk it replies with a Commit, and the primary sends the final commit reply to the client only after it has received all Commits. Acknowledging on the in-memory Ack greatly reduces the latency the client sees.
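To make the two-phase Ack/Commit idea concrete, here is a minimal sketch, not Ceph code; all class and method names are invented for illustration, and the "network" is just synchronous calls:

```python
# Toy sketch of the Ceph-style two-phase reply: Ack after all in-memory
# writes, Commit after all on-disk writes. Names are invented, not Ceph's.

class Replica:
    def __init__(self, name):
        self.name = name
        self.memory = {}          # updates applied in memory
        self.disk = {}            # updates durably committed

    def apply_in_memory(self, oid, data, version):
        self.memory[oid] = (version, data)
        return "ack"              # corresponds to the Ack reply

    def flush_to_disk(self, oid):
        self.disk[oid] = self.memory[oid]
        return "commit"           # corresponds to the Commit reply


class Primary(Replica):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas
        self.version = 0

    def client_write(self, oid, data):
        self.version += 1                         # new version for the object/PG
        self.apply_in_memory(oid, data, self.version)
        # Phase 1: every replica buffers the write and sends Ack.
        acks = [r.apply_in_memory(oid, data, self.version) for r in self.replicas]
        if all(a == "ack" for a in acks):
            print("reply Ack to client (visible, not yet durable)")
        # Phase 2: every replica flushes and sends Commit.
        commits = [r.flush_to_disk(oid) for r in self.replicas]
        self.flush_to_disk(oid)
        if all(c == "commit" for c in commits):
            print("reply Commit to client (durable everywhere)")


primary = Primary("osd.0", [Replica("osd.1"), Replica("osd.2")])
primary.client_write("obj-42", b"hello")
```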

Clients read directly from the primary. This approach spares the client the complex synchronization and serialization across replicas. In quorum terms, W here is effectively all replicas, which means replica consistency can be reliably maintained during failure recovery. Of course, if one Replica fails, the operation cannot complete, because the primary will never collect all the replies.

The paper does not say what happens in that case. My guess is that after a timeout the OSD set can be recomputed and the data re-replicated; when the failed OSD comes back it rejoins its original placement group and catches up from the PG's recent change log, which preserves consistency.

GFS

GFS's data redundancy comes from two places: replication of the Master node's state, and replication of ChunkServer data. This brings in another problem, electing a leader. The former relies on Chubby to choose the master reliably (by acquiring a lock), while the latter has the Master designate a primary replica by granting a lease. Neither approach uses a consensus protocol, yet a unique primary can still be designated.

In fact, the paper says little about how master redundancy is implemented, but it does mention that all of the Master's operation logs and checkpoint files are replicated to multiple machines, and that a mutation to the master's state is considered committed only after its log record has been written to the master's backup nodes and to the local disk. That way, once the master goes down and a new master is elected through Chubby, the replica holding this complete, durable data can quickly be switched into the running state, since it already has all of the original master's data on disk.

Besides this, GFS has another kind of server called shadow masters. Shadow masters provide read-only access to the file system when the "primary" Master is down. They are shadows, not mirrors, so their data may lag the primary Master, usually by less than a second. Thanks to them, the primary Master can go offline without the service as a whole going offline, because the shadows can still serve reads of the existing data. For files that change rarely, or applications that tolerate a small amount of stale data, shadow masters improve read efficiency and the availability of the whole system.

Why do shadows generally lag? Shadow masters maintain Chunk information themselves by communicating with the chunkservers, but the creation and modification of replicas can only be done by the primary Master. If a mutation were first sent to the shadows and applied only after their replies, the shadows' data could end up newer than the primary's, which is intolerable. So a GFS shadow instead reads a replica of the primary's ongoing operation log and applies changes to its internal data structures in exactly the same order as the primary Master, which keeps it consistent; that synchronization takes time, which is why its data is slightly stale.

Now let's talk about ChunkServer data redundancy. The following is the write/update process:
[Figure: the GFS write flow, with control flow and data flow separated]
Every time I see this picture I want to say that the separation of control flow from data flow is simply clever.

GFS uses a lease mechanism to keep the order of mutations consistent across replicas. The Master grants a lease to one replica of a Chunk, which becomes the primary Chunk. The primary Chunk serializes all mutations to that Chunk, and all replicas apply them in that order.

Section 3.1 of the paper shows that replicating a mutation is a strongly consistent process, because an error from any replica is returned to the client. The data must first be applied successfully at the primary chunk, otherwise there is no operation order to speak of; if any Replica fails to apply the serialized operations, that failure is reported back to the primary chunk, the client request is treated as failed, and the data is left in an inconsistent state. The client retries the failed operation to handle the error.
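A rough sketch of this idea, the lease-holding primary assigns serial numbers, replicas apply in that order, and any replica failure surfaces as a failed client request. The names are invented for illustration, and the data push that GFS does separately beforehand is left out:

```python
# Toy sketch of GFS-style mutation ordering: the primary assigns serial
# numbers, replicas apply in that order, and any replica error fails the
# whole client request so the client retries. Not GFS code.

class ChunkReplica:
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.applied = []                 # mutations in primary-assigned order

    def apply(self, serial, mutation):
        if self.fail:
            raise RuntimeError(f"{self.name} failed to apply {serial}")
        self.applied.append((serial, mutation))


class PrimaryChunk(ChunkReplica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        # The data itself was already pushed to all replicas beforehand
        # (GFS separates data flow from control flow); here we only order it.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)      # must succeed at the primary first
        errors = []
        for r in self.secondaries:
            try:
                r.apply(serial, mutation)
            except RuntimeError as e:
                errors.append(str(e))
        if errors:
            # Replicas now disagree; the client is told the request failed and retries.
            return f"FAILED ({'; '.join(errors)}), client should retry"
        return "OK"


primary = PrimaryChunk("chunk-primary",
                       [ChunkReplica("replica-a"), ChunkReplica("replica-b", fail=True)])
print(primary.mutate("append 'x'"))
```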

As for why leases are used to keep multiple replicas consistent, the explanation in the paper is this: the lease mechanism is designed to minimize the management burden on the Master node. That is not all that easy to understand. My reading is that it is really about more efficient chunk management, because chunk management in GFS is a dynamic process; the master's management of chunks includes, but is not limited to, the following:

  1. The master periodically checks the current replica distribution and moves replicas around to make better use of disk space and balance load more effectively.
  2. Choosing where to place a new Chunk replica is similar to placement at creation time: balance disk utilization, limit the number of in-progress clone operations on any one chunkserver, and spread replicas across racks.
  3. When the number of valid replicas of a Chunk falls below the user-specified replication factor, the Master re-replicates it.

If every Chunk were its own Paxos/Raft group, migration (for the sake of disk utilization) would be very troublesome [9], and the per-Chunk overhead would also grow considerably, because a strong consistency protocol is itself expensive.

Of course the above are just my personal thoughts.

Dynamo

What Dynamo describes is essentially the leaderless replication part of [2]; [2] is an article I wrote after reading DDIA, and in hindsight that section of DDIA was itself written with reference to this paper [10].

Dynamo uses an optimized consistent hashing scheme for data distribution. There are two intuitive ways to add redundancy on top of it: treat each node as a master with slaves (a chain layout is also possible), or, like Dynamo, introduce leaderless writes for extreme availability.

[Figure: Dynamo's consistent-hash ring; a key in the range (A,B] is stored on nodes B, C, and D]
Besides storing each key in its own range locally, Dynamo also replicates those keys to the N-1 successor nodes clockwise on the ring. Take the example above: a key that falls in the range (A,B] is stored by nodes B, C, and D. Put another way, node D stores all keys in the ranges (A,B], (B,C], and (C,D].
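As a hedged illustration of how such a list of replica nodes can be derived from the ring, here is a plain consistent-hash sketch; it ignores Dynamo's virtual nodes and its rule of skipping unreachable nodes, and all names are mine:

```python
import hashlib
from bisect import bisect_right

# Minimal consistent-hash ring: a key belongs to its clockwise successor
# and is replicated to the next N-1 distinct nodes.

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((h(node), node) for node in nodes)

    def preference_nodes(self, key):
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, h(key)) % len(self.ring)
        prefs = []
        i = start
        while len(prefs) < self.n:
            node = self.ring[i][1]
            if node not in prefs:          # matters once virtual nodes exist
                prefs.append(node)
            i = (i + 1) % len(self.ring)
        return prefs

ring = Ring(["A", "B", "C", "D", "E", "F", "G"], n_replicas=3)
print(ring.preference_nodes("user:1234"))  # three consecutive nodes clockwise from the key
```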

The list of nodes responsible for storing a given key is called its preference list, and any storage node in Dynamo is eligible to accept reads and writes for a key from clients, which means multi-node (effectively multi-master) writes can occur. To keep things consistent, the usual configuration satisfies W + R > N. We call the node handling a read or write the coordinator, and the read and write paths look like this (a rough sketch follows the list):

  1. On receiving a write, the coordinator generates a vector clock for the new version and writes the new version locally. It then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes in the preference list. If at least W-1 of them respond, the write is considered successful.

  2. On receiving a read, the coordinator requests all existing versions of the key from the N highest-ranked reachable nodes in the preference list, waits for R responses, and returns the result to the client. If the coordinator ends up with multiple versions, it returns all versions it considers causally unrelated; the divergent versions are then reconciled, the reconciled version replaces the current ones, and it is written back.
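Here is that coordinator quorum logic sketched very roughly, assuming W and R count the coordinator itself, as in the write path above; error handling, sloppy quorum, and hinted handoff are all omitted, and every name is invented:

```python
# Rough sketch of plain N/R/W quorum coordination: the coordinator writes
# locally, then needs W-1 replica acks; reads gather R versions.

N, R, W = 3, 2, 2          # a typical Dynamo-style configuration, W + R > N

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}                      # key -> (vector_clock, value)

    def put(self, key, clock, value):
        self.store[key] = (clock, value)
        return True                          # ack

    def get(self, key):
        return self.store.get(key)

def coordinate_write(coordinator, others, key, clock, value):
    coordinator.put(key, clock, value)       # write locally first
    acks = sum(1 for n in others[:N - 1] if n.put(key, clock, value))
    return acks >= W - 1                     # success once W nodes (incl. self) hold it

def coordinate_read(coordinator, others, key):
    replies = [n.get(key) for n in [coordinator] + others[:N - 1]]
    replies = [r for r in replies if r is not None][:R]
    return replies                           # possibly several causally unrelated versions

a, b, c = Node("A"), Node("B"), Node("C")
ok = coordinate_write(a, [b, c], "k", {"A": 1}, "v1")
print(ok, coordinate_read(a, [b, c], "k"))
```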

Readers unfamiliar with vector clocks can look them up.
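For completeness, a tiny vector-clock sketch showing how two versions are compared and either ordered or flagged as concurrent; this is the generic textbook scheme, not Dynamo's exact encoding:

```python
# Generic vector clocks: each entry counts events seen from one node.
# If every counter in a is <= the counter in b, a "happened before" b;
# if neither dominates, the versions are concurrent and must be reconciled.

def descends(b, a):
    """True if clock b is equal to or newer than clock a on every node."""
    return all(b.get(node, 0) >= count for node, count in a.items())

def compare(a, b):
    if descends(b, a) and not descends(a, b):
        return "a < b"
    if descends(a, b) and not descends(b, a):
        return "b < a"
    if descends(a, b) and descends(b, a):
        return "equal"
    return "concurrent"        # siblings: the coordinator/client must reconcile

v1 = {"A": 2, "B": 1}
v2 = {"A": 2, "B": 2}
v3 = {"A": 3, "B": 1}
print(compare(v1, v2))         # a < b
print(compare(v2, v3))         # concurrent
```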

There is another problem here: how do the replicas synchronize their values with each other? Since writes go to multiple nodes, a node does not know which of its keys differ from its peers; does it have to send every key on each synchronization? Of course not. Dynamo uses Merkle trees to synchronize replicas. One drawback of consistent hashing with virtual nodes shows up here: when a new node joins, many Merkle trees are invalidated and are hard to recover, and the trees have to be rebuilt from the existing data because the key ranges have been disrupted. How to optimize that is beyond this article.
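A hedged sketch of the idea behind Merkle-tree anti-entropy: hash leaves over key ranges, hash parents over their children, and only descend into subtrees whose hashes differ. A real implementation would align leaves to the ring's key ranges; this toy assumes both replicas hold the same key set:

```python
import hashlib

# Toy Merkle tree over a sorted list of (key, value) pairs. Two replicas
# exchange hashes top-down; subtrees with equal hashes are skipped, so the
# keys that actually differ are found with little traffic.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(pairs):
    """Return a nested (hash, left, right, leaf_pairs) tuple; leaves hold <= 2 pairs."""
    if len(pairs) <= 2:
        return (h(repr(sorted(pairs)).encode()), None, None, pairs)
    mid = len(pairs) // 2
    left, right = build(pairs[:mid]), build(pairs[mid:])
    return (h(left[0] + right[0]), left, right, None)

def diff(t1, t2, out):
    """Collect the leaf ranges where the two trees disagree."""
    if t1[0] == t2[0]:
        return                              # identical subtree: skip entirely
    if t1[3] is not None or t2[3] is not None:
        out.append((t1[3], t2[3]))          # differing leaves: sync only these keys
        return
    diff(t1[1], t2[1], out)
    diff(t1[2], t2[2], out)

replica_a = sorted({"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}.items())
replica_b = sorted({"k1": "v1", "k2": "v2", "k3": "XX", "k4": "v4"}.items())
mismatches = []
diff(build(replica_a), build(replica_b), mismatches)
print(mismatches)        # only the leaf containing k3/k4 needs to be exchanged
```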

How Dynamo achieves extreme availability through such leaderless writes is not discussed in this article.

Redis

Redis's data redundancy scheme is master-slave replication, and its high-availability scheme is Sentinel, in which the sentinels act as the voters of an election. With Redis Cluster, the masters that own at least one slot serve as the voters and a slave is the candidate; when a slave wins the election it executes SLAVEOF NO ONE, revokes the slot assignments of its former master so that those slots point to itself, and then broadcasts a PONG, a Gossip packet that announces the completion of the failover.

Because of how Redis is implemented, master-slave replication really offers no consistency guarantee at all: the slave is updated only after the master's update has already succeeded, the propagation is asynchronous, and there is no notion of a quorum. The data to be propagated is first placed in the reply buffer (buf) of the slave's redisClient structure, and it is only sent after a writable event fires for that client, which is at the earliest in the next event loop iteration.
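A deliberately simplified sketch of what "asynchronous, no quorum" means here: the master answers the client as soon as its own apply succeeds and only queues the command for the replica, which drains it on a later event-loop tick. This is a behavioural abstraction, not Redis source code, and all names are invented:

```python
from collections import deque

# Abstract model of asynchronous master -> slave propagation: the client is
# acknowledged before the replica has seen anything, so a master crash in
# that window silently loses the update on failover.

class Slave:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        self.data[key] = value

class Master:
    def __init__(self, slave):
        self.data = {}
        self.slave = slave
        self.repl_buffer = deque()        # stands in for the per-client reply buffer

    def set(self, key, value):
        self.data[key] = value
        self.repl_buffer.append((key, value))
        return "+OK"                      # the client is acknowledged right here

    def event_loop_tick(self):
        # Only on a later event-loop iteration is the buffer drained to the slave.
        while self.repl_buffer:
            self.slave.replicate(*self.repl_buffer.popleft())

slave = Slave()
master = Master(slave)
print(master.set("lock:job42", "owner-1"))   # client now believes it holds the lock
print("slave sees:", slave.data)             # {} -- nothing replicated yet
master.event_loop_tick()                     # if the master died before this, the lock is lost
print("slave sees:", slave.data)
```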

This means Redis data redundancy carries no consistency guarantee, and using Redis as a distributed lock is, in theory, unsound. In theory, of course: the probability that the master crashes at exactly the moment the critical data has not yet reached a slave is very low, but with a large enough request volume anything can happen.

Within Redis Cluster, the data redundancy scheme is still master-slave replication, and the basic synchronization process is the same as above; failure detection, the failover process, and the role that initiates the election are different, but this article will not go into that.

Bigtable

Bigtable's data redundancy gives us a new idea: lean on another component for redundancy. Bigtable uses GFS to make the Tablet data and the Redo Point (commit log) redundant.
[Figure: the Bigtable tablet write path, with the memtable, SSTables on GFS, and the commit log / Redo Point]

I think this is now the more common approach: use a mature distributed storage component as a network file system, which gives the layer above a simple and powerful abstraction. One thing must be kept firmly in mind, though: it is not a distributed cache, so operations taking a few hundred milliseconds are common. It is therefore necessary to exploit temporal/spatial locality and cache this data in user space, and also to buffer writes, which is the memtable in the figure. This can indeed lose data, since the data is not yet on disk, so a redo log is kept and the Redo Point is used to rebuild a memtable.
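A minimal sketch of that write path: append to a redo (commit) log first, then update the in-memory memtable; on recovery, replay the log from the redo point to rebuild the memtable. The file layout and names here are invented, and the local filesystem stands in for GFS:

```python
import json, os

# Toy tablet write path: a write is appended to a redo/commit log (durable
# storage; GFS in Bigtable, a local file here) and then applied to the
# in-memory memtable. Recovery replays the log from the redo point.

LOG_PATH = "commit.log"      # invented name; stands in for the commit log on GFS

class Tablet:
    def __init__(self, redo_point=0):
        self.memtable = {}
        self.redo_point = redo_point          # log offset already captured in SSTables

    def write(self, key, value):
        with open(LOG_PATH, "a") as log:      # 1. durable append to the redo log
            log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.memtable[key] = value            # 2. buffered in-memory write

    def recover(self):
        self.memtable = {}
        if not os.path.exists(LOG_PATH):
            return
        with open(LOG_PATH) as log:
            for i, line in enumerate(log):
                if i < self.redo_point:       # already persisted in SSTables
                    continue
                entry = json.loads(line)
                self.memtable[entry["k"]] = entry["v"]

t = Tablet()
t.write("row1:colA", "hello")
t.write("row2:colA", "world")
t2 = Tablet(redo_point=0)                     # simulate a tablet server restart
t2.recover()
print(t2.memtable)
```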

Bigtable also optimizes commit-log writes; see the commit-log section of [11].

Because a GFS operation is expensive, plenty of caching is needed in user space. Bigtable uses two levels of cache: the Scan Cache is the first level and mainly caches the key-value pairs the tablet server obtains through the SSTable interface; the Block Cache is the second level and caches SSTable blocks read from GFS. For applications that repeatedly read the same data, the Scan Cache is very effective.
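A sketch of the two cache levels in front of GFS reads; the interfaces here are hypothetical stand-ins, not Bigtable's actual code:

```python
# Hypothetical two-level read path: a scan cache memoizes whole key-value
# pairs, a block cache memoizes raw SSTable blocks fetched from GFS.

class TabletReader:
    def __init__(self, fetch_block_from_gfs):
        self.fetch_block_from_gfs = fetch_block_from_gfs  # the expensive call
        self.scan_cache = {}     # key -> value      (level 1)
        self.block_cache = {}    # block_id -> block (level 2)

    def read(self, key):
        if key in self.scan_cache:                 # level-1 hit: repeated reads of the same data
            return self.scan_cache[key]
        block_id = hash(key) % 16                  # stand-in for an SSTable index lookup
        if block_id not in self.block_cache:       # level-2 miss: go to GFS (hundreds of ms)
            self.block_cache[block_id] = self.fetch_block_from_gfs(block_id)
        value = self.block_cache[block_id].get(key)
        self.scan_cache[key] = value
        return value

def fake_gfs_fetch(block_id):
    print(f"expensive GFS read of block {block_id}")
    return {f"key{i}": f"value{i}" for i in range(100)}   # pretend block contents

reader = TabletReader(fake_gfs_fetch)
reader.read("key7")      # triggers the GFS read
reader.read("key7")      # served from the scan cache
```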

memcache

The highest form of data redundancy is not needing data redundancy at all!

What data redundancy would an all-in-memory cache need? Even the routing information lives in the client. If the concern is data safety, then indeed no redundancy is needed; but redundancy may still appear, for the sake of efficiency.

Obviously, though, when a new cluster is brought up its hit rate is very low at first, which weakens its ability to shield the backend services; you can think of it as a special kind of cache avalanche. The Facebook team's answer [12] is Cold Cluster Warmup: clients of the new (cold) cluster are allowed to fetch data from a cluster that has been running for a long time, which dramatically shortens the time it takes the cold cluster to reach full capacity.
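A sketch of the cold-cluster-warmup idea: on a miss in the cold cluster the client falls back to the warm cluster, then populates the cold one. The paper also adds a hold-off rule so racing deletes are not undone, which is omitted here, and every name below is invented:

```python
# Toy client for "cold cluster warmup": misses in the new (cold) cluster
# fall back to a long-running (warm) cluster, and the value is then added
# to the cold cluster so later reads hit locally.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def get(self, key):
        return self.cache.get(key)

    def add(self, key, value):
        self.cache.setdefault(key, value)   # add only if absent, like memcached's "add"

def warmup_get(key, cold, warm, fetch_from_db):
    value = cold.get(key)
    if value is not None:
        return value
    value = warm.get(key)                   # fall back to the warm cluster
    if value is None:
        value = fetch_from_db(key)          # last resort: the database
        warm.add(key, value)
    cold.add(key, value)                    # warm the cold cluster as we go
    return value

warm, cold = Cluster("warm"), Cluster("cold")
warm.cache["user:1"] = "alice"
print(warmup_get("user:1", cold, warm, lambda k: "db-value"))
print(cold.cache)                           # the cold cluster now holds the key
```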

The following is the architecture the Facebook team uses for its memcache clusters:
[Figure: Facebook's multi-region memcache architecture]
I just said there is no need for redundancy, so why does it appear again? As mentioned above, for the sake of efficiency the Facebook team implemented data redundancy both across regions and within a region.

Section 5 of [12] describes it as follows:

  • We designate one region to hold the master databases and the other regions to contain read-only replicas; we rely on MySQL’s replication mechanism to keep replica databases up-to-date with their masters.

This setup exploits the advantages of multiple data centers: reads stay within their region, so latency is very low whether they hit memcache or the storage cluster.

As for the second point, replication within a region, [12] says:

We use replication to improve latency and the efficiency of the memcached server.

Obviously, when the request volume for certain keys exceeds what a single machine can serve, some optimization is needed, or we may run into cache-breakdown (hot-key) problems. Broadly speaking, there are two options:

  1. Partition on the key (by hash or by data characteristics).
  2. Full replication (optimizing reads).

The latter is data redundancy. So how do we choose between the two? I think two factors matter:

  1. The size of the hot-key set.
  2. The cost difference between requesting many keys at once and requesting a single key.

The first is easy to understand: if there is only one hot key, no amount of partitioning helps, and full replication is the way to go. I think this is also why Alibaba Cloud's Tair [13] chose a multi-replica solution to absorb hotspots. (Every time I mention Tair I have to gush a little; it makes me so happy!)

But when the hot keys are spread fairly evenly and there is no single extremely hot key, sharding the data is obviously a fine approach.

As for single-key versus multi-key cost, [12] describes the problem like this:

  • Consider a memcached server holding 100 items and capable of responding to 500k requests per second. Each request asks for 100 keys. The difference in memcached overhead for retrieving 100 keys per request instead of 1 key is small. To scale the system to process 1M requests/sec, suppose that we add a second server and split the key space equally between the two. Clients now need to split each request for 100 keys into two parallel requests for 50 keys. Consequently, both servers still have to process 1M requests per second. However, if we replicate all 100 keys to multiple servers, a client’s request for 100 keys can be sent to any replica. This reduces the load per server to 500k requests per second

CRAQ

Everything so far has been about how industrial systems implement data redundancy, and however it varies, it basically never escapes master-slave replication (leaderless replication aside); only the implementation strategy differs with the desired consistency. So is master-slave really the only shape data redundancy can take?

Of course not: chain replication is also an excellent choice, to say nothing of CRAQ, an optimization of chain replication.

Let's take a look at the performance comparison results in [14]:
[Figures: read and write throughput comparison of chain replication and CRAQ]

Because chain replication serves reads and writes at two different nodes (writes enter at the head, reads are served by the tail), the two can proceed concurrently while still guaranteeing strong consistency. With master-slave replication, strong consistency requires both reads and writes to go through the master (setups like ZooKeeper's, which only provide FIFO consistency from the client's perspective, don't count). That is why, with update ratios in the 0% to 25% range, chain replication performs better. And even where chain replication does not guarantee strong consistency, it naturally gives FIFO ordering from the client's perspective, without having to maintain something like zk's zxid.

CRAQ, the optimized form of chain replication, greatly improves read throughput while still guaranteeing strong consistency. For even higher write throughput, CRAQ also allows the consistency requirement to be relaxed to eventual consistency, which means stale data may be returned for a while (that is, before a write has been applied at every node).
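Very roughly, the trick that lets CRAQ serve reads from any node while keeping strong consistency looks like this; it is only a sketch of the clean/dirty version idea, with invented names and the head-to-tail write path omitted, so see [15] for the real protocol:

```python
# Sketch of CRAQ's read path: every node keeps possibly several versions of
# an object. If the newest version is "clean" (acknowledged by the tail) it
# can be returned immediately; if it is "dirty", the node asks the tail
# which version number is committed and returns that one.

class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}        # key -> list of (version_number, value, clean)
        self.tail = tail          # reference to the tail node (None if self is the tail)

    def latest(self, key):
        return self.versions[key][-1]

    def committed_version(self, key):
        # Only the tail knows the highest committed (clean) version.
        return max(v for v, _, clean in self.versions[key] if clean)

    def read(self, key):
        version, value, clean = self.latest(key)
        if clean or self.tail is None:
            return value                              # strong read, no extra hop
        committed = self.tail.committed_version(key)  # one version query to the tail
        for v, val, _ in self.versions[key]:
            if v == committed:
                return val

tail = CraqNode()
mid = CraqNode(tail=tail)
# A committed write (version 1) plus a write still propagating (version 2):
tail.versions["x"] = [(1, "old", True)]
mid.versions["x"] = [(1, "old", True), (2, "new", False)]
print(mid.read("x"))     # "old": version 2 is not yet acknowledged by the tail
```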

This is just a brief mention without going into detail; see [14][15] for the specifics.

Summary

Later I plan to write another article on data distribution (placement) in distributed storage, which is also a very interesting topic.

Having summarized all of this, my head immediately feels clearer. There are very few articles online that cover this material in detail yet simply, so consider this a small contribution to beginners in the field.

A summary like this is a long-running effort, and with spring recruiting going on recently this is as much of the distributed-systems material as I can write up for now. So this article is not finished; as I learn new things I will keep adding to it.

The author's knowledge is limited; please bear with any mistakes and kindly point them out.

References:

  1. " Talk about a little understanding of distributed consistency "
  2. " Data replication between nodes "
  3. " Consensus Algorithm "
  4. " Paxos Algorithm Overview and Derivation "
  5. ViewStamped replication revisited 解读
  6. Ceph: A Scalable, High-Performance Distributed File System
  7. Google File System
  8. The Chubby lock service for loosely-coupled distributed systems
  9. " Raft Algorithm: Cluster Member Change Problem "
  10. Dynamo: Amazon’s Highly Available Key-value Store
  11. Bigtable: A Distributed Storage System for Structured Data
  12. Scaling Memcache at Facebook
  13. " 2017 Double 11 Technology Revealed-Distributed Cache Service Tair's Hot Data Hashing Mechanism "
  14. Chain Replication for Supporting High Throughput and Availability
  15. Object Storage on CRAQ
