Understanding the Ceph Architecture

0. Introduction

Ceph is an open-source distributed storage system. Because it supports block storage and object storage in addition to a file system, it is a natural choice as the storage backend for cloud computing frameworks such as OpenStack or CloudStack. It can also be deployed on its own, for example as a standalone cluster providing object storage, SAN-style block storage, or NAS-style file storage. Many companies at home and abroad have shown in production that Ceph's block storage and object storage are reliable. In this article I hope to walk through the whole Ceph architecture in a simple way, based on my own understanding.

1. Architecture overview  

 

1.1 Supported interfaces

Object storage: provided by radosgw, which is compatible with the S3 interface. Files are uploaded and downloaded through a REST API.
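As a rough illustration, assuming a running radosgw instance and an S3 user already created with radosgw-admin (the bucket and file names below are made up), a standard S3 client such as s3cmd can be used:

# s3cmd mb s3://my-bucket                  # create a bucket
# s3cmd put ./photo.jpg s3://my-bucket     # upload an object
# s3cmd get s3://my-bucket/photo.jpg       # download it again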

File system: a POSIX interface (CephFS). The Ceph cluster can be mounted locally and used like a shared file system.
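For example, the kernel client can mount CephFS like any other file system. A sketch, reusing the monitor address from the cluster shown later; the user name and secret are placeholders:

# mount -t ceph 10.25.25.236:6789:/ /mnt/cephfs -o name=admin,secret=AQBx...   # secret is hypothetical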

Block storage: i.e. RBD. There are two ways to use it: the kernel rbd driver and the librbd library. Snapshots and clones are supported. It behaves like a locally attached hard disk, and its usage and purpose are the same as a hard disk's.
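A sketch of the basic RBD workflow (the pool and image names are made up; the pool must already exist):

# rbd create mypool/myimage --size 10240 --image-format 2   # 10 GB image; format 2 is needed for cloning
# rbd snap create mypool/myimage@snap1                      # take a snapshot
# rbd snap protect mypool/myimage@snap1                     # protect it so it can be cloned
# rbd clone mypool/myimage@snap1 mypool/myclone             # copy-on-write clone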

1.2 Advantages

There are many distributed storage systems; compared with the others, Ceph has the following advantages:

1.2.1 Unified Storage

Although the bottom layer of Ceph is a single distributed object store (RADOS), the object and block interfaces are built on top of it in the upper layers. So among open-source storage software, it has the potential to unify the whole field. Whether that unification will last forever, I don't know.

1.2.2 High scalability

In other words, it is easy to expand and offers huge capacity: Ceph is designed to manage thousands of servers and EB-level storage.

1.2.3 Strong reliability

Supports multiple strongly consistent replicas as well as erasure coding (EC). Replicas can be placed across hosts, racks, rooms, and data centers, so data is safe and reliable. Storage nodes are self-managing and self-healing. There is no single point of failure, and fault tolerance is strong.
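On the EC side, a hedged example (the profile and pool names are made up): with k=4, m=2 each object is split into 4 data chunks plus 2 coding chunks, so any 2 OSDs can fail without losing data, at 1.5x raw space instead of the 3x of triple replication:

# ceph osd erasure-code-profile set myprofile k=4 m=2
# ceph osd pool create ecpool 128 128 erasure myprofile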

1.2.4 High Performance

Because there are multiple copies, read and write operations can be highly parallel; in theory, the more nodes, the higher the IOPS and throughput of the whole cluster. Another point is that the Ceph client computes data placement itself and interacts with the storage devices (OSDs) directly: no metadata server sits in the I/O path for block and object storage.

Note: the above are the advantages promised by Ceph's design. Since every Ceph release has bugs, any concrete deployment must be verified by large-scale testing. Recommended versions: 0.67.0, 0.80.7, 0.94.2.

 

1.3 RADOS cluster

The RADOS cluster is the storage core of Ceph. The function of each component is briefly introduced below; later we will explain how the components work together.

(figure: logical deployment of a RADOS cluster – MONs, OSDs, MDSs)

As shown in the figure, a RADOS cluster contains the following roles: OSDs, MONs, and MDSs.

OSD: an object storage device, which can be understood as a disk plus its OSD management process. It serves client reads and writes and handles data verification between OSDs (scrub), data recovery, heartbeat detection, and so on.

MON: mainly solves the state-consistency problem of the distributed system. The monitors maintain consistent cluster maps (mon-map, osd-map, mds-map, pg-map), including state updates when OSDs are added or removed.

MDS: the metadata server, required only when the file interface is used. Note that the MDS does not store data itself; it is just a process that manages metadata. The metadata of the Ceph file system, such as inodes, really lives in the RADOS cluster (by default in the metadata pool).

The following output shows the status information of a Ceph cluster.

# ceph -s
    cluster 72d3c6b5-ea26-4af5-9a6f-7811befc6522
     health HEALTH_WARN
            clock skew detected on mon.mon1, mon.mon3
     monmap e3: 3 mons at {mon1=10.25.25.236:6789/0,mon2=10.25.25.235:6789/0,mon3=10.25.25.238:6789/0}
            election epoch 16, quorum 0,1,2 mon2,mon1,mon3
     osdmap e330: 44 osds: 44 up, 44 in
      pgmap v124351: 1024 pgs, 1 pools, 2432 GB data, 611 kobjects
            6543 GB used, 153 TB / 160 TB avail
                1024 active+clean

 

2. What is Object Storage  

My understanding, from several angles:

First, in terms of the type of data stored: object storage targets unstructured data, such as pictures, audio, video, and documents.

Second, in terms of the application scenario: data is typically written once and read many times.

Third, in terms of the interface: unlike the POSIX file interface, object storage generally uses the concept of a bucket. You are given a bucket, and you store data in it addressed by object ID; the sketch below shows the same idea at the native RADOS layer.
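A hedged illustration with the native rados tool, where objects live in pools and are addressed purely by name (the pool and object names below are made up):

# rados -p mypool put myobject ./local-file.txt   # store an object under a name
# rados -p mypool get myobject ./copy.txt         # fetch it back
# rados -p mypool ls                              # list objects in the pool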

3. What is block storage  

Typical block devices are disks and disk arrays, and data is accessed in fixed-size blocks. A single disk has very low IOPS and throughput, so the natural idea is RAID: reading and writing in parallel. Ceph's block storage solves the same problem, but in a distributed way, which gives higher performance, reliability, and scalability. The difference is that Ceph is a network block device, like a SAN. The figure below shows how Ceph "virtualizes" a disk out of the disks of many different hosts and presents it to a VM as an ordinary disk.

(figure: an RBD image striped across OSD disks on different hosts, attached to a VM as a virtual disk)
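As a sketch of the kernel RBD path (reusing the image from the earlier example; the device name depends on the machine), a mapped image behaves just like a local disk:

# rbd map mypool/myimage     # appears as e.g. /dev/rbd0
# mkfs.ext4 /dev/rbd0        # format it like any disk
# mount /dev/rbd0 /mnt/rbd   # and mount it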

4. Ceph component interaction

A digression: the IT field has always alternated between splitting apart and merging together; long divided, it must unite, and long united, it must divide. For example, a single computer is made of CPU, memory, I/O devices, buses and other components working together. When a single machine's storage capacity is insufficient, the architecture differentiates into a distributed one: the network plays the role of the bus, network devices act as the routing hub and data relay station, and the machines together present stronger computing and storage capability to the outside world. For physically (logically) separate hosts to work together, the following problems must be solved:

Connect: how to link the separate units together.

Find: the point of connecting is of course communication, and to talk to a peer you first have to find it.

Send: once you have found the host, you exchange data with it to create value.

Let's see how Ceph solves each of these problems.

4.1 RADOS: linking up

The division of labor among the RADOS roles was introduced earlier. The logical deployment diagram above shows how the components are linked together.

4.2 CRUSH: I want to find you

When the client reads or writes data, it must find the node locations where the data should live. The usual approaches are to record the mapping explicitly (as HDFS does) or to compute it with an algorithm (such as consistent hashing). Ceph uses the smarter CRUSH algorithm to map data to OSDs. First, look at how Ceph organizes the data (objects) it manages: objects are grouped into pools; a pool contains a number of placement groups (PGs); and a PG is a logical unit that manages a set of objects. Each PG is distributed across several different OSDs. As shown in the figure, a file on the client side is split into objects, and these objects are placed in units of PGs.

(figure: file → objects → PGs → OSDs)

The input to placement is the pair (pool, object name), which forms Ceph's namespace. Placement is computed in two steps (a CLI sketch follows this list):

  1. Compute the PG: hash the object name and take the result modulo the number of PGs in the current pool, i.e. pg = hash(object_name) % pg_num.

  2. Crush(pg, crushmap) then computes the set of OSDs for that PG. CRUSH is a multi-input, deterministic pseudo-random algorithm; the crushmap mainly describes the OSD hierarchy of the whole cluster, the weights, and the replica policy.
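The whole chain can be inspected with the ceph CLI. A hedged example; the pool and object names are placeholders and the output line is approximate:

# ceph osd map mypool myobject
osdmap e330 pool 'mypool' (1) object 'myobject' -> pg 1.c5034eb8 (1.b8) -> up ([4,12,30], p4) acting ([4,12,30], p4)

Reading left to right: the object name hashes into PG 1.b8, and CRUSH places that PG on OSDs 4, 12 and 30, with osd.4 as the primary.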

I will introduce the internal implementation details of the CRUSH algorithm in a later article.

4.3 Ceph read/write: data interaction

Communication between nodes in a RADOS cluster mainly includes the following aspects:

Client read and write

This mainly covers client upload, download, update, and delete operations. To maintain strong data consistency, a write completes only after all replicas of the data have been written successfully. (The client writes only to the primary OSD; the primary forwards the write to the replica OSDs.)
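The number of copies is a per-pool setting. A hedged example (the pool name is made up): keep three replicas, but keep serving I/O as long as two are available:

# ceph osd pool set mypool size 3       # each object is stored 3 times
# ceph osd pool set mypool min_size 2   # I/O continues while at least 2 replicas are up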

Communication within the cluster

This includes data synchronization between OSDs (recovery), data verification (scrub), and heartbeat checks among OSDs, as well as heartbeat checks between MONs and OSDs and state synchronization among the MONs.
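Scrubbing normally runs on a schedule, but it can also be triggered by hand. A hedged example, using a made-up PG id:

# ceph pg scrub 1.0        # compare object metadata across replicas
# ceph pg deep-scrub 1.0   # also compare object contents
# ceph pg repair 1.0       # repair inconsistencies found by a scrub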

 
