Ceph: a highly reliable, high-performance, scalable, and automated distributed file system

Introduction

Studying the GFS paper taught me what a distributed file system must do and how it can be built: storing and accessing extremely large-scale data under a given consistency model. The centralized design of GFS (and HDFS) is a simple and effective way to meet those needs, but it leaves a problem: as the system grows, the central node becomes a single point for both storage and access. Putting a more powerful machine at the center buys some headroom, but there is still an upper limit. Ceph addresses these problems. Its groundbreaking algorithms make the whole system decentralized, so it can in theory scale without limit while handling fault tolerance and load balancing automatically. This is genuinely exciting, and it is achieved mainly by a few unique techniques in Ceph, the most important being the CRUSH algorithm, dynamic metadata management, and RADOS (Reliable Autonomic Distributed Object Store).

Of course, the point of this article is to learn the design ideas of a good distributed file system from the paper. Ceph is open source, so there is a chance to read the implementation of the key parts and consolidate what I have learned, but this semester's course load is simply too heavy: attending classes and finishing assignments already take about 70% of my study time, and I also have to prepare for spring recruiting, so I really don't have the energy. Fortunately, I happened to learn that China's open-source software supply chain summer program includes a Ceph project. I can try that community next year, and even if the application fails it is still a good learning opportunity.

Overview

First of all, the opening paragraph of the paper states Ceph's design goals:

A distributed file system that provides excellent performance, reliability and scalability.

These three words basically capture Ceph's advantages. A little thought shows that achieving all three at once is hard, because they partly conflict: strong scalability means a surge in the number of machines, which inevitably affects performance; reliability requires data redundancy, which also affects performance. Good design is needed to reconcile them. I think another advantage of Ceph that cannot be ignored is ultra-large-scale storage, theoretically unlimited, because there is no single point anywhere in the system.

I mentioned earlier that Ceph is open source, but it is not the same kind of open-source project as HDFS. According to the description in [4], open-source projects generally come from three sources: a research topic done by an expert at a university, published as papers and then open-sourced; a product built by experts inside a company that, by chance, ends up open-sourced; or an expert who starts something and a crowd of people joins in to develop it in the open. This background has a great influence on a project. Ceph belongs to the first category: it was Sage Weil's doctoral research topic, and this paper grew out of his dissertation. Projects of this kind are usually very innovative in principle and technology, with a distinct character compared with similar products.

It is clear from the paper that, in Sage's eyes, the entire system is dynamic. It is precisely this perspective that drives the automated design that follows. The dynamics fall into the following aspects:

  • Data dynamics: in a large-scale storage system, reads and writes are very frequent, and the workload also dictates what consistency the data must provide.
  • Scale dynamics: obviously such a large-scale storage system is not built in its final form on day one and then left untouched. On the contrary, the system grows continuously, which means it must cope with hot spots and fragmentation.
  • Device dynamics: Ceph's design assumes that OSDs (object storage devices) fail frequently. Handling failures manually in a large storage system is almost impossible, and recovery must not affect the business, so the system must be reliable and able to recover automatically when errors occur.
  • Access dynamics: users access different data at different frequencies in different periods. On Double 11, for example, everyone gravitates toward discounted products, and items shown on the homepage normally receive far more visits. Ceph does not assume any fixed access pattern, which means it adapts to different access patterns.

Ceph's approach

The problems mentioned above are addressed by the following four techniques:

  1. Separate data and metadata
  2. Dynamic distributed metadata management
  3. Reliable, automatically distributed object storage
  4. CRUSH algorithm

Separate data and metadata

This is a very common design in distributed file systems, because a single MDS may manage one or more paths that together cover even PB-scale data. Separating data from metadata keeps too many clients from hitting the MDS: data reads go to the OSDs, which greatly reduces the MDS load. Even so, the client still has to contact the MDS, because it needs to know where the requested data lives. Here Ceph does something important: it removes the table lookup entirely, further reducing the MDS load. So how does the client know where the data is? The answer is that it simply computes the location. When the client makes a request, the MDS returns an 80-byte inode that uniquely identifies the file; from it the client can compute the number of the data object and, with a special map, obtain the data's location (see the sketch below). Ignoring other file attributes, each file then only occupies roughly [file name + inode (80 bytes)] in the MDS, plus three pieces of per-file metadata: permissions, file size, and the immutable attributes described in section 4.2 of [5]. This separation of data and metadata reduces the MDS load as far as possible.
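A minimal sketch of the "just compute it" idea, under illustrative assumptions: the object size and the naming scheme below are made up for the example (in Ceph the striping layout is carried in the inode itself), but they show why no lookup table is needed.

```python
# Sketch: locating data without asking the MDS where it lives. The object
# name is derived purely from the inode number and the byte offset, so only
# arithmetic is needed on the client side.

OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4 MiB stripe unit, for illustration

def object_name(ino: int, offset: int) -> str:
    """Map (inode number, file offset) to a storage object name."""
    objno = offset // OBJECT_SIZE          # which stripe of the file
    return f"{ino:x}.{objno:08x}"          # e.g. '1234.00000003'

if __name__ == "__main__":
    # Reading byte 13,000,000 of inode 0x1234 touches object 3 of that file.
    print(object_name(0x1234, 13_000_000))  # -> '1234.00000003'
```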

Dynamic distributed metadata management

First of all, we must be clear about what dynamic distributed metadata management is for. The short answer: reducing metadata disk I/O and controlling metadata traffic.

Reducing metadata disk I/O

To be honest, the paper's description here ([5] section 4.1) is not entirely clear. It says that the MDS can satisfy most requests from its in-memory cache (which already reduces disk I/O), but for safety metadata updates must be committed to disk. To optimize this, updates are streamed as a journal directly to the OSD cluster, so that when one MDS goes down another node can quickly rescan the journal and recover. I do have a doubt here: why not give each MDS standby nodes and run a consensus algorithm such as Paxos or Raft? That would also allow fast recovery after a crash and seems simpler; presumably it is a trade-off against MDS load. The description is clearly incomplete as well, because log-based storage generally needs log compaction, otherwise the journal fills up with expired entries.

Journaling also turns all the disk operations into sequential writes, which makes persisting the log fast.
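To make the benefit concrete, here is a toy append-only metadata journal. It only illustrates log-structured writes and linear-scan recovery; it is not the Ceph MDS journal format, and the entry fields are invented.

```python
# Toy journal: every update is appended at the tail (sequential write),
# and recovery after a crash is a single forward scan of the log.

import json

class MetadataJournal:
    def __init__(self, path: str):
        self.path = path

    def append(self, entry: dict) -> None:
        with open(self.path, "a") as f:       # append-only, sequential
            f.write(json.dumps(entry) + "\n")

    def replay(self) -> dict:
        state = {}                            # latest entry wins per inode
        with open(self.path) as f:
            for line in f:
                entry = json.loads(line)
                state[entry["inode"]] = entry
        return state

if __name__ == "__main__":
    j = MetadataJournal("mds_journal.log")
    j.append({"inode": 0x1234, "op": "rename", "to": "/home/alice/new.bin"})
    print(j.replay())
```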

Dynamic subtree partitioning and traffic control

Let's first take a look at the organization of the MDS cluster:
[Figure: how the MDS cluster partitions the file system's directory tree]
We can clearly see that each MDS node is responsible for part of the file system's directory tree. This MDS cluster design is exciting in itself, because it is a novel decentralized cluster that also supports expansion. GFS, by contrast, effectively uses a static partition (after all, it has only one metadata node), which produces hot spots under unexpected access patterns. Dynamic subtree partitioning solves this elegantly.

As shown in Figure 2, each MDS counts accesses to its metadata with counters: any operation increments the count on the affected inode and on all of its ancestors up to the root directory, giving each MDS a weighted view of the recent load distribution (see the sketch below). Periodically migrating subtrees keeps the metadata load balanced. Although metadata updates are rare, they do happen, and migration involves a distributed transaction because multiple copies must be modified. The paper mentions that **journaling the entry on both the old and the new MDS (similar to two-phase commit)** solves this, but it does not describe the algorithm in detail.
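A minimal sketch of those popularity counters, using an invented in-memory representation; the time decay and the actual migration policy of the real balancer are omitted.

```python
# Every access bumps a counter on the touched path and on all of its
# ancestors up to the root, so each subtree's weight reflects recent load.

from collections import defaultdict

counters = defaultdict(int)

def record_access(path: str) -> None:
    parts = [p for p in path.strip("/").split("/") if p]
    counters["/"] += 1                       # the root is always counted
    node = ""
    for part in parts:
        node = f"{node}/{part}"
        counters[node] += 1                  # each ancestor and the leaf

if __name__ == "__main__":
    for _ in range(3):
        record_access("/home/alice/data.bin")
    record_access("/home/bob/notes.txt")
    # '/home' now has weight 4 and '/home/alice' has 3: the hotter subtree
    # is the one the balancer would prefer to migrate to a less loaded MDS.
    print(counters["/home"], counters["/home/alice"])
```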

Metadata is also spread out when hot spots appear: directories that receive a large number of reads (for example, many opens) are replicated to multiple nodes. Each MDS returns to the client not only the inode but also the replica information for its ancestors, so the next operation can be dispatched to either the primary or a replica (the nodes are equal; they just hold the data), which resolves the hot spot.

CRUSH algorithm

Here I believe everyone's first question is: why not just hash directly? The reason is simple: the mapping CRUSH computes is one-to-many. The inode the client obtains from the MDS maps, via an intermediate layer (the placement group), to multiple OSDs. Beyond simply replicating a set of data onto a set of OSDs, CRUSH's choice of OSDs is also deliberate: it tends to prefer devices with larger capacity. This tendency is intriguing. Why not always pick the largest device? Because that would fill the largest node instantly, and if placement then shifted dynamically it would be hard to locate the OSD again on the next access. Instead, CRUSH uses a pseudo-random function to turn the PGID (derived from the inode) into a deterministic value (the same input always produces the same output), multiplies it by each OSD's weight, and the OSD with the largest result is the one we finally access. Selecting N replicas is then straightforward with a small trick on top of CRUSH; see [1] for details.
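A hedged sketch of that weighted, pseudo-random selection, in the spirit of CRUSH's "straw" buckets rather than Ceph's real code: the hash function, the weights, and the per-replica loop here are all illustrative assumptions. The point is that the same (pgid, osd, replica) input always produces the same value, so any client computes the same placement, and multiplying by the OSD weight biases the choice toward larger devices without always picking the largest one.

```python
import hashlib

def draw(pgid: int, osd_id: int, replica: int) -> float:
    # Deterministic value in [0, 1): same inputs always give the same draw.
    h = hashlib.sha256(f"{pgid}:{osd_id}:{replica}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def select_osds(pgid: int, weights: dict[int, float], n: int) -> list[int]:
    chosen = []
    for r in range(n):                        # one pass per replica rank
        candidates = {osd: draw(pgid, osd, r) * w
                      for osd, w in weights.items() if osd not in chosen}
        chosen.append(max(candidates, key=candidates.get))
    return chosen

if __name__ == "__main__":
    weights = {0: 1.0, 1: 1.0, 2: 2.0, 3: 0.5}     # OSD 2 is a larger device
    print(select_osds(pgid=42, weights=weights, n=3))  # e.g. [2, 0, 1]
```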

[Figure: mapping from file to objects, placement groups, and OSDs]

Let's describe in detail how an inode maps to multiple OSDs to achieve decentralization (a sketch follows these steps):

  1. The client sends a request to the MDS and obtains the inode from the file name.
  2. The object id derived from the inode is hashed and combined with a mask to get the pgid; the mask simply corresponds to the number of placement groups.
  3. The pgid is fed through the CRUSH algorithm to obtain several OSD ids.
  4. The cluster map turns those OSD ids into the real addresses of the storage nodes.
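Under the same illustrative assumptions as the earlier sketches (a toy hash, an invented cluster map, and a simplified stand-in for CRUSH that just ranks the known OSDs by a keyed hash instead of walking the real hierarchy), the four steps can be strung together like this:

```python
import hashlib

PG_NUM = 128                                   # assumed PG count (power of two)
CLUSTER_MAP = {0: "10.0.0.1:6800", 1: "10.0.0.2:6800", 2: "10.0.0.3:6800"}

def stable_hash(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def locate(ino: int, objno: int, n_replicas: int = 2) -> list[str]:
    # Step 2: hash the object id and mask it down to a placement group.
    pgid = stable_hash(f"{ino:x}.{objno:08x}") & (PG_NUM - 1)
    # Step 3: CRUSH-style deterministic choice of OSD ids (simplified here).
    ranked = sorted(CLUSTER_MAP,
                    key=lambda osd: stable_hash(f"{pgid}:{osd}"),
                    reverse=True)
    osds = ranked[:n_replicas]
    # Step 4: the cluster map turns OSD ids into network addresses.
    return [CLUSTER_MAP[osd] for osd in osds]

if __name__ == "__main__":
    print(locate(ino=0x1234, objno=3))
```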

Reliable, automatically distributed object storage

First, how should we understand "reliable, automatically distributed object storage"? We mentioned earlier that Ceph is, in theory, an unboundedly scalable distributed file system, which would otherwise make it very troublesome to maintain: fault detection, fault repair, cluster changes, and so on. Naturally we hope Ceph was designed from the start to handle these problems automatically. So, back to the question: what is reliable, automatically distributed object storage? My understanding is: safe storage that achieves the consistency the client asks for, automatic failure recovery, and support for scalability. None of this is easy in a decentralized storage system; next we will see how Ceph handles each of these points.

Replication

Safe storage is achieved through replication. What is interesting is that Ceph's data redundancy is not based on a consensus algorithm, which is unusual: many distributed systems, such as ZooKeeper and GFS, use consensus algorithms (ZAB, Paxos, and so on) wherever critical data is replicated. Those algorithms provide automatic fault detection and data recovery, but the price is a large amount of message traffic, and a PG would stop serving requests while the leader among its mapped OSDs is down. Ceph assumes that in a PB- or EB-scale system failures are the normal case, which is probably an important reason for not using a consensus algorithm.

As mentioned above, each inode ultimately lands in a PG, and the objects in that PG are stored on N OSDs chosen by the CRUSH algorithm. The first OSD is the primary, responsible for replying to the client. In fact, as long as the cluster map does not change, CRUSH returns the same first OSD every time; see [1] for the algorithm.

One write operation is as follows:
[Figure: the ack/commit write path between the client, the primary OSD, and the replica OSDs]
For reads, Ceph lets the client pick any of the PG's OSDs, rather than forcing every read through the leader as a consensus protocol would. That in turn means, for consistency, every write must be copied to all replicas, which looks expensive. An optimization is used here: once each replica has applied the write to memory it sends an ack to the primary; when the primary has all the acks it replies immediately, and from that moment the client's reads can see the data. When each replica has also flushed the data to disk it sends a commit to the primary, and when all commits are in, the primary notifies the client again. Careful readers will notice this is weak consistency, because a client may later fail to see a value it has already seen. I think this can be avoided on the client side: although we cannot guarantee consistency from the server's perspective, we can guarantee it from the client's perspective. For example, the client keeps a sequence number for each read. Suppose it has read data that has not yet reached disk, and before the next read an OSD goes down and its newly added replacement does not have that write; the next read could return data older than the last one, but its seq would be smaller than the last read's, so the client can easily reject the response, satisfying monotonic-read consistency. If the client never observes the down/recovery window, monotonic reads are satisfied trivially.
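A sketch of that client-side sequence-number check: the per-object version field and the "retry on another OSD" behaviour are my own assumptions for illustration, not something the paper specifies.

```python
# If the client remembers the highest version (seq) it has ever read for an
# object, it can discard any later read that returns an older version,
# giving monotonic-read consistency from the client's point of view even if
# an OSD briefly serves stale data after a failure.

class MonotonicReader:
    def __init__(self):
        self.last_seen: dict[str, int] = {}    # object name -> highest seq read

    def read(self, obj: str, replica_value: bytes, replica_seq: int) -> bytes:
        if replica_seq < self.last_seen.get(obj, -1):
            raise RuntimeError("stale replica: retry on another OSD")
        self.last_seen[obj] = replica_seq
        return replica_value

if __name__ == "__main__":
    r = MonotonicReader()
    r.read("1234.00000003", b"v2", replica_seq=2)      # fresh read, accepted
    try:
        r.read("1234.00000003", b"v1", replica_seq=1)  # older than last seen
    except RuntimeError as e:
        print(e)
```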

Of course, as the paper notes, the client can also buffer its writes until the commit arrives, which both preserves consistency and avoids losing data.

Failure detection and recovery

In a cluster running a consensus algorithm, failure detection comes for free through heartbeats and recovery is automatic. Ceph does not do this with consensus, but detection is still heartbeat-based: all OSDs serving the same PG monitor each other with heartbeats, and when an OSD appears offline (no heartbeat received from it for a long time) its peers mark it as down. The paper says RADOS distinguishes two kinds of OSD liveness: whether the OSD is reachable, and whether CRUSH assigns data to it. The first is what was just described; the second is not explained in the text. My guess is that it asks whether the node can still take more data, since the CRUSH algorithm described in [1] computes placement from a node's total capacity: if a device is full but not marked as such, the other devices with remaining capacity would not pick up the load. Perhaps CRUSH can reassign placement when it detects this state; the paper does not say.
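A minimal sketch of that peer monitoring; the grace period and data structures are invented for illustration and are not Ceph's actual parameters.

```python
# OSDs that share a PG exchange heartbeats; a peer that has not been heard
# from within the grace period is reported as suspected down.

import time

GRACE_SECONDS = 20                      # assumed heartbeat grace period

class HeartbeatMonitor:
    def __init__(self, peers: list[int]):
        self.last_heard = {osd: time.monotonic() for osd in peers}

    def on_heartbeat(self, osd: int) -> None:
        self.last_heard[osd] = time.monotonic()

    def suspected_down(self) -> list[int]:
        now = time.monotonic()
        return [osd for osd, t in self.last_heard.items()
                if now - t > GRACE_SECONDS]
```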

There is also a very important component in the paper, the monitor, which is only mentioned in passing. To summarize: its most important job is to maintain the cluster map, the mapping from OSD ids to concrete addresses, and through it the cluster membership. When an OSD goes down (failure recovery) or a new OSD joins the cluster (scalability), the monitor detects the change and sends the updated map to the OSDs. After an OSD receives the new cluster map, it reruns CRUSH and works out its relationship to the changed OSD. There seem to be only three cases:

  1. The changed OSD has nothing to do with this OSD.
  2. The changed OSD is now the primary for one of this OSD's PGs.
  3. The changed OSD is now a replica under one of this OSD's PGs.

In the first case, nothing happens.

In the second case, the OSD holding the replica PG reports its current PG version to the new primary.

In the third case, if this OSD is the primary for the PG, it collects the current (and past) PG version numbers from the replicas. If the primary itself does not hold the latest PG state, it retrieves the most recent PG change log (or, if necessary, a complete summary of the PG's contents) from the OSD that has it, so that it obtains the up-to-date PG contents; it then sends each replica the log updates it is missing (again, a full content summary if necessary), so that the primary and all replicas end up agreeing on the PG's contents.
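A sketch of this catch-up step, treating a PG's version as simply the length of its update log; the data structures are invented for illustration and are not RADOS's real peering state.

```python
# After a map change the primary collects each replica's PG version, keeps
# the newest history, and ships every lagging replica the log tail it lacks.

def peer(primary_log: list[dict], replica_versions: dict[int, int]) -> dict[int, list[dict]]:
    """Return, per replica OSD id, the log entries it still needs."""
    latest = len(primary_log)                  # version = number of log entries
    updates = {}
    for osd, version in replica_versions.items():
        if version < latest:
            updates[osd] = primary_log[version:]   # missing tail of the log
    return updates

if __name__ == "__main__":
    log = [{"op": "write", "obj": "a"}, {"op": "write", "obj": "b"}]
    print(peer(log, {1: 2, 2: 1, 3: 0}))   # osd 2 needs 1 entry, osd 3 needs 2
```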

The Ceph monitor uses Paxos.

Because the paper does not describe the monitor's implementation, a monitor that is just a small Paxos cluster could become a bottleneck, but not because of storage: one cluster-map entry per OSD does not cost much space. An OSD id is a four-byte integer; even an IPv6 address stored as a string is at most 8*4+7 = 39 bytes, and with the port and some other information, call it 128 bytes per OSD. 1 GB is then enough for about 2^23 (over eight million) machines, which is already huge. The real issue is that the monitor has to watch the entire cluster in order to keep the map up to date, and I think that is where a bottleneck could appear.
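A quick check of that estimate (the 128-byte record size is, again, just an assumed round number):

```python
entry_bytes = 128                      # generous per-OSD record in the cluster map
print((1 * 2**30) // entry_bytes)      # 8388608 entries, i.e. 2**23, fit in 1 GiB
```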

In fact, the paper only describes the big picture. Many questions remain unanswered and call for further study, for example:

Questions:

  • How is metadata modified? In particular, is there a cache on the client side?
  • What is the role of client permissions (capabilities)?
  • How does the MDS cluster route a client request? Is it similar to redirection in a Redis cluster? That would require every node to maintain the full mapping, but it would be a reasonable approach.
  • After metadata has been replicated by dynamic subtree partitioning, what happens when it is modified?

All the specific details require digging into the source code, but coursework is heavy and recruiting is coming up, so that will have to wait a while.

References:

  1. Blog post "Explain Ceph's killer technology CRUSH in detail"
  2. Blog post "Detailed Explanation of Ceph Working Principle"
  3. Blog post "In-depth analysis of the kernel client of Ceph distributed storage"
  4. Blog post "Analysis of Ceph (Part 1): Overview and Design Ideas"
  5. Paper: "Ceph: A Scalable, High-Performance Distributed File System"
