"Ceph Analysis" Series (2): Ceph's Design Ideas

A common problem when analyzing open source projects is a lack of material. The gurus who have time to write code usually have no time for, or disdain, writing documentation; what little exists is usually manuals and the like, and even the occasional design document is often unclear. In that situation, extracting the design ideas from the code in reverse is something not everyone can do.

Fortunately, Ceph is a typical open source project that originated from academic research. Although academic research was only a brief chapter in Sage's career, it left behind several academic papers for reference [1]. This gives us a rare opportunity to analyze an excellent open source systems project from a top-down perspective. This article is based on the author's reading of those papers.

3.1 Ceph's Target Application Scenario

To understand Ceph's design ideas, we must first understand the target application scenario Sage had in mind when designing Ceph. In other words, "what was this designed for?"

Ceph's initial target application scenario was a large-scale distributed storage system. "Large-scale" and "distributed" here mean a system that can hold at least petabytes of data and is composed of thousands of storage nodes.

Today, with "big data" slogans everywhere, a petabyte is no longer an exciting design goal. But note that the Ceph project originated in 2004, an era when single-core commercial processors were the mainstream and common hard disks held only a few dozen gigabytes. That is a far cry from today, when 6-core 12-thread dual-processor servers and 3 TB drives are commonplace, so this design goal should be judged against the actual conditions of the time. Of course, as mentioned earlier, Ceph's design has no theoretical upper limit, so the petabyte level is not an actual capacity ceiling.

In Sage's view, such a large-scale storage system cannot be treated as static. Its dynamic characteristics can be summarized as three kinds of "change":

  • Changes in system scale: Such large-scale storage systems can rarely predict their final size on the first day of construction; the very notion of a "final size" may not even exist. The system simply has to carry more and more data as the business develops and expands, which means its scale will naturally keep growing with it.
  • Changes in devices: In a system composed of thousands of nodes, the failure and replacement of nodes will inevitably be frequent. The system must be reliable enough that business is not affected by these routine hardware and low-level software problems, and at the same time intelligent enough to keep the cost of the associated maintenance operations low.
  • Changes in data: For a large-scale storage system serving typical Internet applications, the stored data is also likely to change very frequently. New data is continuously written, and existing data is updated, moved, or even deleted. This requirement must also be accounted for in the design.

These three "changes" are the key features of Ceph's target application scenario, and Ceph's main characteristics were proposed in response to them.

3.2 Expected Technical Characteristics for the Target Scenario

For the application scenario above, the technical characteristics Ceph set out to achieve were:

  • High reliability. "High reliability" applies first to the data stored in the system: ensuring, as far as possible, that data is never lost. It also covers the reliability of the write path, i.e., while data is being written into the Ceph storage system, no data should be lost due to unexpected events.
  • High automation. This specifically includes automatic replication of data, automatic rebalancing, automatic failure detection, and automatic failure recovery. These automation features on the one hand ensure the system's high reliability, and on the other hand keep the difficulty of operations and maintenance at a relatively low level as the system scales up.
  • High scalability. "Scalable" is used here in a broad sense: the system's size and storage capacity can grow, aggregate data-access bandwidth expands linearly as the number of nodes increases, and the rich, powerful underlying APIs can provide multiple functions and support multiple applications.

3.3 Design Ideas Derived from the Expected Characteristics

In response to the expected technical characteristics introduced in Section 3.2, Sage's design ideas for Ceph can be summarized in two points:

  • Fully exploit the computing power of the storage devices themselves. Using devices with computing capability (the simplest example being an ordinary server) as the storage nodes of a storage system was not a new idea even at the time. However, Sage observed that existing systems basically treated these nodes as simple storage; if the computing power on each node were fully exploited, the expected characteristics proposed above could be achieved. This became the core idea of Ceph's system design.
  • Remove all central points. Once a central point appears in a system, it introduces a single point of failure, and it inevitably becomes a scale and performance bottleneck as the system grows. Moreover, if the central point sits on the critical path of data access, it necessarily increases access latency. These are clearly problems that the system Sage envisioned should not have. Although in most systems-engineering practice the single-point-of-failure and performance-bottleneck problems can be alleviated by adding backups to the central point, Ceph ultimately adopted an innovative method that solves the problem more thoroughly.

3.4 Key Technical Innovations Supporting the Design Ideas

No matter how novel and elegant a design idea is, it must ultimately be realized through technical strength, and this is where Ceph shines brightest.

Ceph's core technical innovation can be summed up in eight characters: "no table lookups; just compute." Generally speaking, a large-scale distributed storage system must be able to solve two basic problems:

The first is "where should the data be written?" When a user submits data to be written, the system must quickly decide which storage location and space to allocate for it. The speed of this decision affects write latency, and, more importantly, how sensible the decision is affects the uniformity of data distribution, which in turn affects issues such as storage-device lifetime, data reliability, and data-access speed.

The second is "where was the data written before?" Handling the data-addressing problem efficiently and accurately is likewise one of a storage system's basic capabilities.

For these two problems, the common solution in traditional distributed storage systems is to introduce a dedicated server node that maintains the data structure mapping data to its storage location. When a user writes or accesses data, the client first contacts this server to perform a lookup; only after the actual storage location has been determined or found does it connect to the corresponding node for subsequent operations. Clearly, this traditional solution on the one hand tends to create a single point of failure and a performance bottleneck, and on the other hand lengthens operation latency.
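The lookup-based approach described above can be illustrated with a minimal sketch. This is a toy model for contrast only, not any real system's design; the class and method names are invented for illustration.

```python
class MetadataServer:
    """Toy central metadata server: one big location table.

    Every write must register here, and every read must ask here first,
    which makes this node a single point of failure, a scale bottleneck,
    and an extra network round trip on the critical path.
    """

    def __init__(self):
        self.location_table = {}  # object_id -> storage node

    def allocate(self, object_id, node):
        self.location_table[object_id] = node

    def lookup(self, object_id):
        return self.location_table[object_id]


meta = MetadataServer()
meta.allocate("my-object", "node-7")   # "where should I write the data?"
print(meta.lookup("my-object"))        # "where did I write it before?" -> node-7
```

Note that the table grows with the number of objects and must stay consistent across any replicas of the metadata server, which is exactly the burden Ceph set out to eliminate.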

Facing this problem, Ceph completely abandoned table-lookup-based data addressing in favor of a computation-based method. In short, any client of a Ceph storage system needs only a small amount of infrequently updated local metadata, and with a simple computation can determine an object's storage location from its ID. A comparison shows that this method sweeps away the problems of the traditional solution entirely; almost all of Ceph's excellent features are built on this data-addressing method.
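The flavor of computation-based addressing can be sketched as follows. This is a deliberately simplified illustration of the idea, not Ceph's actual CRUSH algorithm; the names (`NUM_PGS`, `object_to_pg`, `pg_to_osds`) and the hashing scheme are illustrative assumptions.

```python
import hashlib

NUM_PGS = 128  # number of placement groups (illustrative; a power of two)


def object_to_pg(object_id):
    """Hash an object ID into a placement group: a pure computation."""
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PGS


def pg_to_osds(pg, osds, replicas=3):
    """Deterministically map a placement group to `replicas` distinct nodes.

    Every client holding the same node list computes the same answer,
    so no central lookup table and no extra round trip are needed.
    """
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).digest(),
    )
    return ranked[:replicas]


cluster = [f"osd.{i}" for i in range(10)]
pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg, cluster))  # same answer on every client, every time
```

The "small amount of infrequently updated local metadata" in this sketch is just the node list; it changes only when the cluster membership changes, not when data is written.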

This concludes a fairly comprehensive, in-depth introduction to Ceph's design ideas. The following articles will introduce Ceph's system architecture, working principles and workflows, and main features in turn, and will compare and analyze Ceph and Swift in the context of OpenStack.

 



Origin blog.csdn.net/pansaky/article/details/102454317