[ Ceph ]

Foreword

In the early stage of the container project, the plan is to use Ceph as the storage backend for containers. Storage has been called the mother of virtualization; for containers, storage likewise plays a crucial role.

Ceph was chosen as the container storage for the following reasons:

  1. It makes later scale-out easy;
  2. Ceph can provide block storage, object storage and file storage at the same time; the object storage can later be used to replace the OSS object storage service.

For these reasons, Ceph distributed storage should be a good choice, and the container journey therefore starts with storage.

 

Ceph official documentation (English): https://docs.ceph.com/docs/master/start/intro/

Ceph official documentation (Chinese): http://docs.ceph.org.cn/start/intro/

 

About Ceph

Whether you want to provide Ceph object storage and/or Ceph block device services to cloud platforms, or deploy a Ceph file system, every Ceph storage cluster deployment begins with setting up the Ceph nodes, the network, and the Ceph storage cluster itself.

 

Creating a Ceph storage cluster requires at least the following services:

  1. One Ceph Monitor
  2. Two Ceph OSD daemons

 

When you run Ceph file system clients, you also need a metadata server (Metadata Server, MDS). Scenarios that require the Ceph file system are not that common, and this container project will not involve file system storage, so it is not covered in depth here.

 

 

Ceph OSDs: a Ceph OSD daemon (Ceph OSD) stores data, handles data replication, recovery, backfilling and rebalancing, and provides monitoring information to the Ceph Monitors by checking the heartbeats of other OSD daemons. When the cluster is configured to keep two copies of the data, at least two OSD daemons are needed for the cluster to reach the active + clean state (Ceph defaults to three copies, but the number of copies can be adjusted).

In other words: Ceph stores data through OSDs. By default Ceph keeps three copies, but this can be manually set to 2; at least 2 OSDs are needed for the cluster to reach a healthy, usable state, and it is the OSDs that make the data highly available.

 

Monitors: a Ceph Monitor maintains maps of the various cluster states, including the monitor map, the OSD map, the placement group (PG) map, and the CRUSH map. Ceph keeps a history (called an epoch) of every state change that occurs on the Monitors, OSDs, and PGs.

In other words: the Monitor is Ceph's supervisor. When Ceph stores data, the Monitor records the corresponding index/map information. What PG and CRUSH are will be discussed later.

 

Ceph stores client data as objects within storage pools. Using the CRUSH algorithm, Ceph calculates which placement group (PG) should contain a given object, and then further calculates which OSD daemon should store that placement group. The CRUSH algorithm enables the Ceph storage cluster to scale, rebalance, and recover dynamically.

This passage from the official documentation sums up how Ceph works, but for someone just getting started with Ceph it reads obscure and difficult. A few keywords to take away: object, CRUSH algorithm, placement group (PG), dynamic scalability.

So we need to understand how Ceph works as a whole, otherwise this question will keep following us around.

 

While researching this, I found that an article written by Pange explains it very well and is worth reading more than once: "Ceph Plain Talk: that little thing called CRUSH".

[The theory below comes from Pange's article, combined with some of my own understanding]

 

First, a question: how many steps does it take to store a piece of data into a Ceph cluster?

Ceph's answer is: two steps.

  1. Compute the PG, which the official documentation calls the placement group
  2. Compute the OSD

Since we are talking about computing, there must be an algorithm involved. Could that algorithm be the CRUSH mentioned above?

 

Computing the PG

First, one rule of Ceph must be made clear: inside Ceph, everything is an object. (Saying "object" here is consistent with the official documentation, and it is easy to understand; just as in Python, everything is an object.)

To put it plainly, everything is an object, whether it is video, photos, text or any other format:

No matter whether the data is video, text or photos, Ceph treats it uniformly as objects, because at the root of it all data is stored on disk as binary; so every piece of binary data is treated as an object, without distinguishing by format.

Since they are objects, each object should have an object name, and two objects are distinguished by their object names, much like file names distinguish two files.

The question from the very beginning now becomes: how many steps does it take to store an object into a Ceph cluster?

What we know: a Ceph cluster is a bunch of servers, or more precisely, a pile of disks, and Ceph treats each disk as an OSD.

So the question can be simplified further: how many steps does it take to store an object onto an OSD?

 

The logical layer of Ceph

To save an object, Ceph builds a logical layer, namely the pool (Pool). This is not hard to understand; like storage pooling in virtualization, the pool is used to hold objects. If we compare the Pool to a Chinese chess board, then saving an object is like placing a sesame seed somewhere on that board.

As a simple illustration:

 

 

The Pool is then divided further: a Pool is split into a number of PGs (placement groups), which are like the squares on the board; all the squares make up the whole board, which is to say all the PGs make up a Pool.

 

 

From these figures we can summarize: a file is an object, objects are stored in PGs, and many PGs make up a Pool.

Now a new question: when saving an object, how do we know which PG it should go to? Assume the Pool is named rbd and has 256 PGs in total, each PG numbered 0, 1, 2, 3, 4, ...

To solve this problem, first look at what we currently have:

  1. Distinct object names
  2. Distinct PG numbers

Here comes the first algorithm used in Ceph: HASH.

For two objects named foo and bar, we can hash their object names:

  • HASH(‘bar’) = 0x3E0A4162
  • HASH(‘foo’) = 0x7FE391A0
  • HASH(‘bar’) = 0x3E0A4162

HASH is probably the most commonly used kind of algorithm. Hashing the object name gives a string of hexadecimal output; in other words, HASH turns an object name into a string of digits. So what is the point of the first line and the third line above being identical? It means that for the same object name, the computed result is always the same, even though the value a HASH algorithm produces from an object name looks like a random number. We then take this random number modulo the total number of PGs, for example 256, and the remainder must fall onto one of the 256 PGs, i.e. between 0 and 255.

Formula: HASH('bar') % PG_num

  • 0x3E0A4162 % 0x100 ===> 0x62
  • 0x7FE391A0 % 0x100 ===> 0xA0   (0x100 = 256, the number of PGs in the pool)

By the calculation above, object bar is saved to PG number 0x62, and object foo is saved to PG number 0xA0. Object bar will always be stored in PG 0x62! Object foo will always be stored in PG 0xA0!

So far, we can sum up how an object is mapped to a PG:

Hash the object name to get a pseudo-random number, then take that number modulo the total number of PGs; the remainder must fall between 0 and PG_num - 1, and as long as the object name does not change, the object's data is always stored in the same PG.

Therefore, once the object's name is determined, the PG that stores the object's data is also determined.
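To make the hash-and-modulo step concrete, here is a minimal Python sketch. It is not Ceph code: Ceph uses its own internal hash function, and md5 below is only a stand-in, so the resulting PG numbers will not match the 0x62 / 0xA0 values in the example.

    import hashlib

    PG_NUM = 256  # the example pool "rbd" above has 256 PGs

    def name_to_pg(object_name, pg_num=PG_NUM):
        # Hash the object name and map it onto one of the pg_num PGs.
        digest = hashlib.md5(object_name.encode()).hexdigest()
        pseudo_random = int(digest, 16)   # name -> big deterministic "random" number
        return pseudo_random % pg_num     # remainder falls in 0 .. pg_num - 1

    for name in ("bar", "foo"):
        # The same name always lands in the same PG; different names usually differ.
        print(name, "->", hex(name_to_pg(name)))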

This raises a question: the object name alone determines the placement, so does the size of the object's data not matter at all?

The answer is yes: Ceph does not care about the object's actual content size or its format in any way; it only recognizes the object name.

 

One more detail about Ceph: in practice there are multiple pools, and each pool contains a number of PGs. If two pools happen to contain PGs with the same number, how does Ceph tell them apart? The answer is that Ceph numbers each pool as well. For example, the rbd pool above is given number 0, and the next pool created gets number 1. The real PG identifier in Ceph is therefore pool_id + "." + PG_id; the bar object above is saved in the PG 0.62, and the foo object is saved in the PG 0.a0. PGs in other pools may have names like 1.12f, 2.aa1, 10.aa1, and so on.
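A small continuation of the sketch above, showing how the pool ID and PG ID combine into names like 0.62 (again only illustrative; the helper names are made up, and md5 stands in for Ceph's real hash):

    import hashlib

    def name_to_pg(object_name, pg_num):
        # stand-in hash, as in the previous sketch
        return int(hashlib.md5(object_name.encode()).hexdigest(), 16) % pg_num

    def full_pg_id(pool_id, object_name, pg_num=256):
        # Join pool id and PG id into the "pool.pg" form, e.g. "0.62" or "1.12f".
        return "{}.{:x}".format(pool_id, name_to_pg(object_name, pg_num))

    print(full_pg_id(0, "bar"))   # some PG "0.xx" in pool 0
    print(full_pg_id(1, "bar"))   # the same object name in pool 1 -> "1.xx"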

 

The physical layer of Ceph

From the logical layer we now know how a file is mapped to a PG in Ceph: in short, hash the object name and take it modulo the total number of PGs; the remainder identifies the corresponding PG, and the data is stored in that placement group.

Now let's look at the physical layer of Ceph: a number of servers, each with a number of disks. Normally, Ceph treats one disk as one OSD (strictly speaking, an OSD is the program that manages a disk), so the physical layer is made up of many OSDs. Our ultimate goal is to save objects onto disks. At the logical layer objects are stored in PGs, so the task now is to open up the tunnel between PGs and OSDs. A PG is just a bundle of objects that share the same PG number; think of this bundle as a package, and now we need to distribute these many packages evenly across the OSDs. That is exactly what the CRUSH algorithm does: CRUSH computes the PG -> OSD mapping.

Adding the object -> PG step from the logical layer, we arrive at two formulas:

  • Pool_ID + HASH('object name') % PG_num --> PG_ID
  • CRUSH(PG_ID) --> OSD

Two algorithms are used here, HASH and CRUSH. What is the difference between them? Why not simply use HASH(PG_ID) to get the corresponding OSD, i.e.:

CRUSH(PG_ID) ==> replaced by ==> HASH(PG_ID) % OSD_num ==> OSD

Here is Pange's reasoning:

  1. If an OSD fails, OSD_num decreases by 1, and the remainder of PG % OSD_num changes for almost every PG; in plain terms, the disk a PG lives on changes, and the PG's data has to move from one disk to another. A good storage architecture should minimize the amount of data migrated when a disk fails, and CRUSH can do that;
  2. If multiple copies are to be saved, we want the mapping to output several OSDs; HASH can only produce one, while CRUSH can produce as many as needed;
  3. If OSDs are added, OSD_num increases, and with plain modulo the PGs would again migrate randomly between OSDs; CRUSH, on the other hand, ensures that data diffuses evenly onto the newly added machines.

 

To sum up: the HASH algorithm is only suitable for a one-to-one mapping where neither of the two values involved changes, so it is not suitable for computing the PG -> OSD mapping. That is why the CRUSH algorithm is introduced here.
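The first point above (a failed OSD forcing PGs to move) can be demonstrated with a toy experiment in plain Python, not Ceph code: map 256 PGs onto OSDs with a simple modulo, remove one OSD, and count how many PGs would have to move. Roughly 90% of them change placement.

    # Toy demonstration: with HASH(PG_ID) % OSD_num, losing one OSD
    # remaps the vast majority of PGs.
    import hashlib

    def naive_map(pg_id, osd_num):
        h = int(hashlib.md5(str(pg_id).encode()).hexdigest(), 16)
        return h % osd_num

    pg_ids = range(256)
    before = {pg: naive_map(pg, 10) for pg in pg_ids}   # 10 OSDs
    after  = {pg: naive_map(pg, 9)  for pg in pg_ids}   # one OSD gone

    moved = sum(1 for pg in pg_ids if before[pg] != after[pg])
    print("PGs that moved:", moved, "out of", len(pg_ids))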

 

CRUSH algorithm

We are not going to walk through the CRUSH source code here; instead, let's understand the CRUSH algorithm by example.

First, what do we want to achieve?

  • Map an existing PG_ID to an OSD; with this mapping, a PG can be saved onto a disk;
  • If three copies are to be saved, map one PG to three different OSDs; the three OSDs hold exactly the same PG content.

 

Next, what do we have?

  • PG_IDs that are distinct from one another;
  • If the OSDs are numbered as well, we also have mutually distinct OSD_IDs;
  • The biggest difference between OSDs is their capacity, e.g. 4 TB versus 800 GB. We turn each OSD's capacity into its weight: a 4 TB OSD has weight 4, an 800 GB OSD has weight 0.8; in other words, the weight is the capacity expressed in TB.

 

The question now becomes: how do we map a PG_ID onto OSDs that each have their own weight? Here we go straight to the straw algorithm used inside CRUSH; "straw" roughly means drawing lots, and the lot each OSD holds relates to its weight.

But we can't simply always pick the OSD with the largest capacity, otherwise all the data would pile up on that largest OSD. So before picking, the OSDs are "shuffled" a bit. Here is the CRUSH calculation directly:

  • CRUSH_HASH( PG_ID, OSD_ID, r ) ===> draw
  • ( draw & 0xffff ) * osd_weight ===> osd_straw
  • pick the highest osd_straw

In the first line, treat r as a constant for now; this line is what does the "shuffling": PG_ID, OSD_ID and r are fed into CRUSH_HASH to obtain a hexadecimal output, entirely analogous to HASH(object name) except that there are two more inputs. It must be stressed that for the same three inputs, the computed draw value is always identical.

So what is this draw for? CRUSH simply wants a random-looking number, which is the draw, and then multiplies this number by the OSD's weight. Mixing the random number with the OSD weight gives each OSD the actual length of its straw, and (with overwhelming probability) the straws all have different lengths, so it is easy to pick the longest one.

 

To put it plainly, CRUSH wants to pick an OSD at random, while also ensuring that OSDs with larger weights have a larger probability of being picked. To achieve this, before each pick every OSD multiplies its own weight by a random number, and the OSD with the largest product wins. Now let's set a small goal: pick a million times! From a macro point of view, since every OSD is multiplied by a random number, once the sample is large enough the random numbers no longer influence the outcome; what is decisive is the OSD's weight. In other words, macroscopically, the larger an OSD's weight, the larger its probability of being picked.
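Here is a minimal Python sketch of the straw mechanism just described. CRUSH_HASH is stood in for by a generic hash and the weights are made up; this only illustrates the "draw times weight, take the maximum" idea, not Ceph's actual implementation.

    import hashlib

    def crush_hash(pg_id, osd_id, r):
        # Stand-in for CRUSH_HASH(PG_ID, OSD_ID, r): same inputs -> same output.
        key = "{}-{}-{}".format(pg_id, osd_id, r).encode()
        return int(hashlib.md5(key).hexdigest(), 16)

    def straw_pick_one(pg_id, osd_weights, r=0):
        # Each OSD draws a number, scales it by its weight; the longest straw wins.
        straws = {}
        for osd_id, weight in osd_weights.items():
            draw = crush_hash(pg_id, osd_id, r) & 0xffff
            straws[osd_id] = draw * weight
        return max(straws, key=straws.get)

    # Three OSDs: two of 4 TB (weight 4) and one of 800 GB (weight 0.8).
    osds = {0: 4.0, 1: 4.0, 2: 0.8}
    print(straw_pick_one(0x62, osds))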

 

If the above is not entirely clear, don't worry. Here is a simple recap of what is done when a PG selects its OSD:

  • Take the PG_ID as an input to CRUSH_HASH;
  • CRUSH_HASH(PG_ID, OSD_ID, r) draws a random number (the key point is that this "random" number is not truly random; it comes from the HASH);
  • Every OSD multiplies its weight by the random number corresponding to its OSD_ID, giving a product;
  • Select the OSD with the largest product;
  • The PG is saved on that OSD.

 

The description above also solves the problem of mapping one PG to multiple OSDs: remember the constant r. Set r + 1, compute the random numbers again, multiply each by the OSD weights again, and again pick the OSD with the largest product. If its number differs from the previously chosen OSD, select it; if it is the same as a previous OSD, try r + 2 and draw again, and so on until we have elected the three differently numbered OSDs we need!
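The "bump r and try again" loop can be sketched in the same illustrative style (self-contained, not Ceph code; it assumes there are at least three OSDs to choose from):

    import hashlib

    def crush_hash(pg_id, osd_id, r):
        return int(hashlib.md5("{}-{}-{}".format(pg_id, osd_id, r).encode()).hexdigest(), 16)

    def straw_pick_one(pg_id, osd_weights, r):
        return max(osd_weights,
                   key=lambda o: (crush_hash(pg_id, o, r) & 0xffff) * osd_weights[o])

    def pick_n_osds(pg_id, osd_weights, n=3):
        # Keep increasing r until n distinct OSDs have been elected.
        chosen, r = [], 0
        while len(chosen) < n:
            osd = straw_pick_one(pg_id, osd_weights, r)
            if osd not in chosen:      # same OSD as before? just bump r and redraw
                chosen.append(osd)
            r += 1
        return chosen

    osds = {0: 4.0, 1: 4.0, 2: 0.8, 3: 2.0, 4: 2.0}
    print(pick_n_osds(0x62, osds))     # three distinct OSD ids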

Of course, the actual selection process is a bit more complicated, but this is the simplest way to explain what CRUSH does when selecting OSDs.

 

Applying the CRUSH algorithm

Having understood how CRUSH selects OSDs, we can easily go one step further and combine the CRUSH algorithm with a real cluster structure. Below is a tree diagram that Sage drew in his doctoral thesis:

 

 

 

Each blue bar at the bottom can be seen as a host, each gray cylinder as an OSD, each purple "cab" as a cabinet (rack), and each green "row" as a row of cabinets. The root node at the very top has no real physical meaning; you can think of it as a data center or a machine room, but it only serves as the root of the tree structure.

Selecting OSDs based on this structure, we put forward new requirements:

  • Select three OSDs in total;
  • All three OSDs must be located under the same row;
  • Each cabinet contains at most one of the selected OSDs.

If we simply used the OSD-selection method from the previous section, these requirements would most likely not be met, because the chosen OSDs would be distributed randomly across the whole cluster.

So, to satisfy such requirements, let's again look at what we have:

  • The weight of each OSD;
  • Each host can also have a weight, obtained by summing the weights of all the OSDs inside that host;
  • Each cabinet's weight is obtained by summing the weights of all its hosts, which is in fact the sum of the weights of all OSDs inside the cabinet;
  • By the same logic, each row's weight is the sum of the weights of its cabinets;
  • The root's weight is simply the sum of the weights of all OSDs.

 

So in this tree structure, every node has its own weight, and each node's weight is the sum of the weights of the nodes one level below it; the root node's weight is therefore the sum of the weights of all OSDs in the cluster. Now that we have all these weights, how do we elect the three OSDs?

Following the same method CRUSH uses to elect an OSD:

  • From all the rows under the root, CRUSH selects one row;
  • From all the cabinets under that row, CRUSH elects three cabinets;
  • From all the OSDs under each of those three cabinets, CRUSH elects one OSD.

Since each row has its own weight, CRUSH selects a row in exactly the same way it selects an OSD: multiply each row's weight by a random number and take the maximum. Then, within that row, it elects the three cabinets the same way, and then selects one OSD under each of those cabinets.

The fundamental point of doing it this way is to distribute data across all the OSDs in the cluster in proportion to their weights: if the weights of two machines are 16:32, then the amount of data distributed onto the two machines will be 1:2. At the same time, it guarantees that the three selected OSDs end up in three different cabinets.

So, combined with the diagram, the CRUSH algorithm process is:

  • take(root) ============> [root]
  • choose(1, row) ========> [row2]
  • choose(3, cabinet) =====> [cab21, cab23, cab24] under [row2]
  • choose(1, osd) ========> [osd2107, osd2313, osd2437] under the three cabinets
  • emit ================> [osd2107, osd2313, osd2437]

 

Two important concepts in the CRUSH algorithm are introduced here:

  • BUCKET / OSD: an OSD corresponds to one of our disks, while a bucket is any non-leaf node other than an OSD; the cabinet, row and root above are all buckets;
  • RULE: the selection path that CRUSH follows; one such path of choices is a rule.

 

A RULE generally consists of three steps: take -> choose N -> emit.

The take step is responsible for selecting a starting node; this is not necessarily the root, it can be any bucket.

Each choose N step selects N qualifying buckets of the type named in that choose statement, and each subsequent choose continues the selection on the results of the previous one.

emit corresponds to outputting the final result, like popping the stack.

 

Here is a simple example, our most common structure of three hosts with three OSDs each:

 

 

We select one OSD from under each of the three hosts, so that the three copies land on three different hosts. This way, even if two hosts go down, there is still one copy alive. The RULE then takes a shape such as the following (a rough code sketch follows the list):

  • take(root) ============> [default]   (default is the name of the root node)
  • choose(3, host) ========> [ceph-1, ceph-2, ceph-3]
  • choose(1, osd) ========> [osd.3, osd.1, osd.8]
  • emit()
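As a rough sketch in Python (not Ceph's actual CRUSH map language), the same straw selection can be applied level by level to this tree. The host names mirror the example above; everything else (the hash stand-in, the weights) is made up for illustration:

    import hashlib

    def crush_hash(pg_id, item, r):
        return int(hashlib.md5("{}-{}-{}".format(pg_id, item, r).encode()).hexdigest(), 16)

    def straw_choose(pg_id, children, n):
        # choose(n, <type>): pick n distinct children by straw, bumping r on repeats.
        chosen, r = [], 0
        while len(chosen) < n:
            pick = max(children,
                       key=lambda c: (crush_hash(pg_id, c, r) & 0xffff) * children[c])
            if pick not in chosen:
                chosen.append(pick)
            r += 1
        return chosen

    # root "default" -> 3 hosts -> 3 OSDs each; bucket weights are the sums
    # of their children's weights (every OSD has weight 1.0 for simplicity).
    tree = {
        "default": {"ceph-1": 3.0, "ceph-2": 3.0, "ceph-3": 3.0},
        "ceph-1": {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0},
        "ceph-2": {"osd.3": 1.0, "osd.4": 1.0, "osd.5": 1.0},
        "ceph-3": {"osd.6": 1.0, "osd.7": 1.0, "osd.8": 1.0},
    }

    def rule(pg_id):
        hosts = straw_choose(pg_id, tree["default"], n=3)              # choose(3, host)
        return [straw_choose(pg_id, tree[h], n=1)[0] for h in hosts]   # choose(1, osd)

    print(rule(0xa2))   # emit: one OSD from each of the three hosts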

A brief summary:

We draw the machine room of a production environment as a tree structure:

The bottom layer is the OSD layer, where each OSD has its own weight.

Above the OSDs are nodes such as host / rack / row / room / root, and the weight of each node is accumulated from the weights of the nodes below it.

The selection algorithm CRUSH uses at each node (straw) is the same: multiply each candidate's weight by a random number and take the maximum; the type and number of nodes to select are determined by the choose statements.

Finally, don't forget that the result of this election is the PG -> OSD mapping, for example pg 0.a2 ---> [osd.3, osd.1, osd.8] and pg 0.33 ---> [osd.0, osd.5, osd.7]. Each PG has its own mapping to OSDs, and the relationship is summed up by the formula: CRUSH(pg_id) ---> [osd.a, osd.b ... osd.n].

 

So far, we have completed both of the steps needed to save data into the cluster. Looking back at the whole process:

  • Each file has a unique object name;
  • Pool_ID + HASH(object name) % PG_num gives the PG_ID;
  • CRUSH(PG_ID) gives the combination of OSDs that the PG will be stored on;
  • The object is saved into that PG on each of those OSDs (a PG is a directory on the disk).

So: the HASH algorithm is responsible for computing the object-name-to-PG mapping, and CRUSH is responsible for computing the PG-to-OSD mapping. Let's keep this in mind.
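Putting the two steps together in one illustrative sketch (the hash is a stand-in and the weights are made up; real Ceph clients and OSDs do all of this internally):

    import hashlib

    def h(*parts):
        return int(hashlib.md5("-".join(map(str, parts)).encode()).hexdigest(), 16)

    def object_to_pg(pool_id, object_name, pg_num=256):
        # Step 1: HASH(object name) % PG_num -> PG_ID (within pool pool_id)
        return pool_id, h(object_name) % pg_num

    def pg_to_osds(pool_id, pg_id, osd_weights, copies=3):
        # Step 2: CRUSH(PG_ID) -> the set of OSDs holding this PG (straw sketch)
        chosen, r = [], 0
        while len(chosen) < copies:
            pick = max(osd_weights,
                       key=lambda o: (h(pool_id, pg_id, o, r) & 0xffff) * osd_weights[o])
            if pick not in chosen:
                chosen.append(pick)
            r += 1
        return chosen

    osds = {"osd.{}".format(i): 1.0 for i in range(9)}
    pool_id, pg_id = object_to_pg(0, "bar")
    print("{}.{:x}".format(pool_id, pg_id), "->", pg_to_osds(pool_id, pg_id, osds))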

 

Ceph architecture

With the principles above understood, the Ceph architecture diagram becomes much easier to follow.

 


Source: www.cnblogs.com/hukey/p/12588436.html