0. Introduction
Ceph is an open-source distributed file system. Because it also supports block storage and object storage, it is a natural fit as the storage backend for cloud computing frameworks such as OpenStack or CloudStack. It can also be deployed as standalone storage, for example as an object store, SAN storage, or NAS storage. Many companies at home and abroad have demonstrated that Ceph block storage and object storage are reliable. In this article I hope to analyze the overall architecture of Ceph in plain terms, based on my own understanding.
1. Architecture overview
1.1 Support interface
Object storage: radosgw, compatible with the S3 interface. Files are uploaded and downloaded through a REST API.
Filesystem: a POSIX interface. The Ceph cluster can be mounted locally and seen as a shared file system.
Block storage: i.e. RBD. There are two ways to use it: kernel rbd and librbd. Snapshots and clones are supported. It is equivalent to a hard disk attached locally, and its usage and purpose are the same as a hard disk's.
1.2 Advantages
There are many distributed file systems; compared with the others, Ceph has the following advantages:
1.2.1 Unified Storage
Although the bottom layer of Ceph is a distributed file system, the object and block interfaces are built on top of it in the upper layer. Among open-source storage software, it therefore has a chance to dominate the field. Whether that dominance will last, I don't know.
1.2.2 High scalability
In other words, it is easy to expand and has a large capacity: it can manage thousands of servers and EB-level capacity.
1.2.3 Strong reliability
Supports multiple strongly consistent replicas as well as erasure coding (EC). Replicas can be placed across hosts, racks, machine rooms, and data centers, so data is safe and reliable. Storage nodes are self-managing and self-healing. There is no single point of failure, and fault tolerance is strong.
1.2.4 High Performance
Because there are multiple copies, read and write operations can be highly parallelized. In theory, the more nodes, the higher the IOPS and throughput of the whole cluster. Another point is that the Ceph client reads and writes data by interacting directly with the storage devices (OSDs); no metadata server is required for block and object storage.
Note: the above are the advantages of Ceph's design concept. Since every Ceph version has bugs, any concrete deployment must be verified by large-scale testing. Recommended versions: 0.67.0, 0.80.7, 0.94.2.
1.3 Rados Cluster
The RADOS cluster is the storage core of Ceph. The function of each component is briefly introduced below; later we will explain how the components work together.
As shown in the figure, the RADOS cluster contains the following roles: OSDs, MONs, and MDSs.
OSD (object storage device): can be understood as a hard disk plus an OSD management process. It is responsible for serving client reads and writes, data scrubbing between OSDs, data recovery, heartbeat detection, and so on.
MON: mainly solves the state-consistency problem of the distributed system, maintaining consistent cluster maps (mon-map, osd-map, mds-map, pg-map), including state updates when OSDs are added or removed.
MDS (metadata server): must be configured only when the file-system interface is used. Note that the MDS does not store data itself; it is just a process that manages metadata. The metadata of the Ceph file system, such as inodes, really lives in the RADOS cluster (by default in the metadata pool).
The following output shows the status information of a Ceph cluster.
# ceph -s
cluster 72d3c6b5-ea26-4af5-9a6f-7811befc6522
health HEALTH_WARN
clock skew detected on mon.mon1, mon.mon3
monmap e3: 3 mons at {mon1=10.25.25.236:6789/0,mon2=10.25.25.235:6789/0,mon3=10.25.25.238:6789/0}
election epoch 16, quorum 0,1,2 mon2,mon1,mon3
osdmap e330: 44 osds: 44 up, 44 in
pgmap v124351: 1024 pgs, 1 pools, 2432 GB data, 611 kobjects
6543 GB used, 153 TB / 160 TB avail
1024 active+clean
2. What is Object Storage
My understanding can be explained from several aspects:
First: in terms of the data types stored, it means unstructured data, such as pictures, audio, video, and documents.
Second: in terms of application scenarios, it means write-once, read-many workloads.
Third: in terms of usage, unlike a POSIX file system, object storage generally uses the concept of a bucket: you are given a bucket, and you store data in it, addressed by object id.
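The bucket/object-id model above can be illustrated with a toy in-memory sketch. This is not a real object-storage API; the `Bucket`, `put`, and `get` names here are my own illustrative assumptions, showing only the flat key-to-data shape of the interface:

```python
# Toy sketch of the bucket/object model: a flat namespace of
# object id -> data, with no directories and no POSIX semantics.
class Bucket:
    def __init__(self, name):
        self.name = name
        self.objects = {}  # object id -> bytes

    def put(self, object_id, data):
        # Whole-object overwrite: fits the write-once, read-many pattern.
        self.objects[object_id] = data

    def get(self, object_id):
        return self.objects[object_id]

bucket = Bucket("photos")
bucket.put("2015/cat.jpg", b"...jpeg bytes...")  # the "path" is just an opaque id
print(bucket.get("2015/cat.jpg"))
```

Note that even though the object id looks like a path, the store does no directory traversal; it is a single key lookup in the bucket.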
3. What is block storage
Typical block-storage devices are disks or disk arrays, with data accessed in blocks. The IOPS and throughput of a single disk are very low, so it is natural to use RAID for parallel reads and writes. Ceph's block storage of course solves the same problem, but uses distribution to provide higher performance, reliability, and scalability. The difference is that Ceph is a network block device, like a SAN. The following figure shows how Ceph "virtualizes" disks distributed across different hosts and presents them to VMs as disks.
4. Ceph component interaction
An aside: the IT field has always cycled between separation and integration; long divided, it must unite, and long united, it must divide. For example, a single computer consists of CPU, memory, I/O devices, buses, and other components working together. When a single computer's storage capacity becomes insufficient, the architecture differentiates into a distributed one: the network plays the role of the bus, network devices act as routing centers and data relay stations, and the whole presents stronger computing and storage capability to the outside world as one computer. For physically (or logically) separate hosts to work together, the following problems must be solved:
Connect: how to link the separate units together.
Find: the point of connecting is communication; you first have to locate the peer.
Send: once the host is found, data must be exchanged to create value.
Let's see how ceph solves these problems.
4.1 rados - linking up
The division of labor among the roles in the RADOS cluster was introduced earlier. The logical deployment diagram shows how the components are linked together.
4.2 crush - how to find you
When a client reads or writes data, it needs to find the node where the data should live. The common approaches are to record the mapping explicitly (as HDFS does) or to compute it with an algorithm (such as consistent hashing). Ceph uses the smarter CRUSH algorithm to solve the mapping from files to OSDs. Let's first look at how Ceph organizes its data (objects). Objects in Ceph are grouped into pools; a pool contains a number of PGs (placement groups), and a PG is a logical unit that manages a set of objects. Each PG is distributed across several OSDs. As shown in the figure, a file on the client side is split into objects, and these objects are placed in units of PGs.
The input of the CRUSH algorithm is (pool, obj), which together form Ceph's namespace:
- Calculate the PG number by hashing the object name modulo the total number of PGs in the current pool.
- Crush(pg, crushmap) then calculates multiple OSDs. The CRUSH algorithm is a multi-input pseudo-random algorithm. The crushmap mainly describes the hierarchy of the cluster's OSDs, their weights, and the replica-placement policy.
Later I will introduce the internal implementation details of the crush algorithm.
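Before getting into those details, the two-step shape of the placement, (pool, obj) to PG, then PG to a set of OSDs, can be sketched in miniature. This is only a stand-in under assumed toy parameters (`PG_NUM`, `OSDS`, `REPLICAS` are made up here); real CRUSH walks the weighted crushmap hierarchy rather than ranking a flat OSD list:

```python
# Simplified sketch of Ceph's two-step placement:
#   (pool, obj) -> pg number -> [osd, osd, osd]
# NOT real CRUSH: a deterministic pseudo-random stand-in on a flat OSD list.
import hashlib

PG_NUM = 8              # PGs in the pool (assumed)
OSDS = list(range(6))   # toy cluster with OSDs 0..5 (assumed)
REPLICAS = 3

def obj_to_pg(pool, obj):
    # Step 1: hash the object name, modulo the pool's PG count.
    h = int(hashlib.md5(obj.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pool, pg):
    # Step 2: stand-in for Crush(pg, crushmap): a deterministic
    # pseudo-random ranking that picks REPLICAS distinct OSDs.
    seed = f"{pool}:{pg}"
    ranked = sorted(
        OSDS,
        key=lambda osd: hashlib.md5(f"{seed}:{osd}".encode()).hexdigest(),
    )
    return ranked[:REPLICAS]

pg = obj_to_pg("rbd", "file.part.0")
print(pg, pg_to_osds("rbd", pg))  # same input always yields the same placement
```

The key property shared with real CRUSH: any client can compute the placement locally from the object name and the (small, gossiped) cluster map, so no central metadata lookup is needed, which is exactly why block and object storage need no metadata server.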
4.3 ceph rw - data interaction
Communication between nodes in a RADOS cluster mainly includes the following aspects:
Client read and write
This mainly covers client upload, download, update, and delete operations. To maintain strong data consistency, a write operation completes only after all replicas of the data have been written successfully. (The client only ever writes to the primary.)
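The primary-copy write path described above can be sketched as follows. This is a minimal illustration under assumed names (`Osd`, `client_write` are mine, not Ceph's API); the point is only the ordering: the client talks to the primary, and the write is acknowledged only after every replica has persisted it:

```python
# Sketch of primary-copy replication: the client writes only to the
# primary OSD; the primary forwards to replicas and the write completes
# only once all copies have been persisted.
class Osd:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def persist(self, obj, data):
        self.store[obj] = data
        return True  # ack back to the primary

def client_write(obj, data, acting_set):
    primary, replicas = acting_set[0], acting_set[1:]
    primary.persist(obj, data)                        # primary writes locally
    acks = [r.persist(obj, data) for r in replicas]   # and forwards to each replica
    return all(acks)                                  # ack client only when all copies landed

osds = [Osd(i) for i in range(3)]
ok = client_write("obj1", b"hello", osds)
print(ok)  # True
```

Funneling all writes through the primary is what gives the strong consistency: replicas never diverge, because they only ever apply writes ordered by the primary.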
Read and write within the cluster
This includes data synchronization between OSDs, data scrubbing, and heartbeat checks; heartbeat checks between MONs and OSDs; state synchronization between MONs; and so on.