Ceph technology

Background

Ceph was developed by Sage Weil in 2006 as a doctoral thesis project. In recent years, as OpenStack adopted it as the default block storage backend, it gradually gained popularity and became a star project in the open source community. Ceph is a unified storage system: it supports the traditional block and file storage protocols as well as the emerging object storage protocol, which allows it to meet the needs of most current storage scenarios.

Architecture

Although Ceph was created more than 10 years ago, its design philosophy remains current, with high scalability, high reliability and high performance as its core design goals. RADOS is the core supporting component of Ceph. On top of RADOS and its interface library librados, Ceph builds three core application components: RGW object storage, RBD block storage, and CephFS file storage. More types of storage applications can be developed through RADOS and librados.

RADOS

RADOS is the abbreviation of Reliable Autonomic Distributed Object Store, the highly reliable distributed object store that underlies Ceph. RADOS implements a randomly readable and writable object store; the upper-layer object storage, block storage and file storage are all mapped onto RADOS objects. RADOS mainly involves the concepts of Collection and Object.

  • Collection: Equivalent to a "directory"; called a PG in RADOS, it is a shard of a Pool

  • Object: Equivalent to a "file", which mainly contains three types of data:

    • Data: Equivalent to "file data"

    • Attr: Equivalent to "file attributes", small KV pairs

    • OMap: Unrestricted KV pairs

Each Object has a unique ID, binary data and a set of KV pair information:

RADOS includes three types of processes:

  • OSD: The process of storing data, responsible for data replication, balancing and repair

  • Monitor: Cluster status management, using Paxos to achieve high availability of meta-information. Meta-information includes OSD Map, MDS Map and Monitor Map (the Paxos cluster's own information in Monitor).

  • Manager: maintains the detailed information of Placement Groups and Hosts, and offloads a large part of the read-only requests for this information from the Monitor to improve scalability.

Logical sharding

Ceph pools and manages all storage resources in the cluster and abstracts them into the concept of a Pool. Each Pool has a unique Pool ID and can be configured with different attributes (such as the number of replicas, the replication method, the storage media, etc.). Each Pool can be further divided into shards, each of which is called a PG (short for Placement Group). The number of PGs is usually set to a power of 2, and it can be adjusted for expansion, although in practice it is rarely changed. The storage unit in RADOS is the object; each object has a unique ObjectId and belongs to exactly one PG.

In the figure above, Ceph's data slicing and distribution are divided into two layers:

  • First layer: The mapping from Object to PG is a static logical mapping. The ObjectId and PoolId are fed into a pseudo-random hash function to obtain the corresponding PGId.

  • Second layer: The mapping from PG to OSD is a dynamic physical mapping. The PGId and the cluster topology information are fed into the CRUSH algorithm to calculate the OSD distribution of the PG's replicas.

Although RADOS uses Object as the read and write management unit, the underlying data is organized in PG. The PG mapping method is called stable_mod, and its logic is as follows:

if ((hash & (2^n - 1)) < pg_num)
	return (hash & (2^n - 1))
else
	return (hash & (2^(n-1) - 1))

2^n - 1 is the PG mask. The number of PGs should be set to a power of 2 whenever possible so that data is distributed more evenly. If the PG count is not a power of 2, i.e. 2^(n-1) < pg_num < 2^n, the stable_mod algorithm above remaps the data that hashes into [pg_num, 2^n) onto [pg_num - 2^(n-1), 2^(n-1)). The figure below shows the data distribution for n=4 and pg_num=12: PGs 4-7 hold more data than the others.
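The effect can also be reproduced with a minimal Python sketch of stable_mod (illustrative, not Ceph source code); running it with n=4 and pg_num=12 shows PGs 4-7 receiving roughly twice as many hashes as the others:

from collections import Counter

def stable_mod(h: int, pg_num: int, mask: int) -> int:
    # mask is 2^n - 1, where 2^(n-1) < pg_num <= 2^n
    if (h & mask) < pg_num:
        return h & mask
    return h & (mask >> 1)

pg_num, mask = 12, 0b1111                  # n = 4
counts = Counter(stable_mod(h, pg_num, mask) for h in range(1 << 16))
print(sorted(counts.items()))              # PGs 0-3 and 8-11 hold 4096 hashes each, PGs 4-7 hold 8192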

In addition to PG_NUM, RADOS also has PGP_NUM, and understanding the difference between the two is crucial to understanding PG splitting. PG_NUM is the number of PGs in the Pool and determines the logical placement of objects; PGP_NUM is the number of placement groups used for placement purposes and determines the physical placement of the PGs. Normally PG_NUM = PGP_NUM, but during a PG split the two values differ: PG_NUM is increased first, which shrinks the amount of data in each shard while leaving the physical placement unchanged; PGP_NUM is then increased to adjust the physical placement, so that the newly split PGs are load balanced onto new OSDs. For details, refer to the description of PG and PGP from the Ceph community:

PG = Placement Group

PGP = Placement Group for Placement purpose

pg_num = number of placement groups mapped to an OSD

When pg_num is increased for any pool, every PG of this pool splits into half, but they all remain mapped to their parent OSD.

Until this time, Ceph does not start rebalancing. Now, when you increase the pgp_num value for the same pool, PGs start to migrate from the parent to some other OSD, and cluster rebalancing starts. This is how PGP plays an important role.

Ceph’s PG split is actually divided into two stages:

  • The first stage adjusts the number of PGs (PG_NUM). The PGs on the corresponding OSD are split into several new PGs according to the additional high-order bits of the hash value; this stage affects the mapping of objects to PGs.

  • The second stage adjusts the number of PGPs (PGP_NUM) and starts load balancing the newly created PGs; this stage affects the physical placement of PGs.

PGP_NUM is introduced to make splitting smoother. If the PGId of a newly created child PG were used directly as the CRUSH input, a large number of child PGs would immediately be migrated between OSDs. Therefore, during the split, the parent PG count PGP_NUM is used as the CRUSH input, so that the child PG maps to the same OSDs as its parent and the corresponding object data can be found without any data migration. In this way, regardless of whether a PG has been split, the CRUSH algorithm can always locate it with PGP_NUM as input. This differs from traditional table splitting, where a global coordinator is responsible for the split and records its start and end points; in Ceph the entire read/write path is based on hash calculation, so this two-stage split based on the object's hash value is adopted.
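The effect of keeping PGP_NUM at the parent count during a split can be illustrated with a simplified Python sketch (function names are illustrative; the real calculation lives in Ceph's OSDMap code and additionally hashes the result with the pool id):

def stable_mod(x: int, num: int, mask: int) -> int:
    return x & mask if (x & mask) < num else x & (mask >> 1)

def placement_seed(pg_id: int, pgp_num: int) -> int:
    # fold the PG id back by pgp_num before it is fed to CRUSH
    mask = (1 << (pgp_num - 1).bit_length()) - 1
    return stable_mod(pg_id, pgp_num, mask)

# pg_num doubled from 8 to 16 while pgp_num stays at 8:
# child PG 9 (split from PG 1) keeps its parent's placement seed, so no data moves
print(placement_seed(9, pgp_num=8))    # -> 1
# once pgp_num is also raised to 16, the child gets its own seed and CRUSH replaces it
print(placement_seed(9, pgp_num=16))   # -> 9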

Physical distribution

RADOS does not maintain the physical distribution meta-information of PGs across OSDs in the Monitor/Manager; it is computed by the CRUSH algorithm instead. CRUSH, described in the paper "Controlled, Scalable, Decentralized Placement of Replicated Data", is a controllable, scalable, decentralized replica placement algorithm. CRUSH is a stable replica-distribution algorithm that implements a stable pseudo-random mapping from PG to OSD: even if one or more devices join or leave the cluster, most PG-to-OSD mappings remain unchanged, and CRUSH moves only part of the data to keep the distribution balanced. CRUSH also supports weights to control how much data is allocated to each storage device; weights can be set according to the capacity or performance of the device.

The addressing process from PG to OSD in the CRUSH algorithm can be expressed by the following function:

CRUSH(PGId, ClusterMap, PlacementRule) -> (OSDx, OSDy, OSDz)

ClusterMap defines a static, hierarchical topology of the OSD cluster. This hierarchical topology enables the CRUSH algorithm to be rack-aware, i.e. to distribute replicas across different rooms and racks to achieve high data reliability. The hierarchical Cluster Map involves the following concepts:

  • Device: The most basic storage device, that is, OSD. One OSD corresponds to a disk storage device.

  • Bucket: A container of devices, which can recursively contain multiple devices or sub-buckets. Ceph has six default Bucket types: Root, Datacenter, Room, Row, Rack, and Host, and users can define new types. Each Device has a weight (generally related to its storage capacity), and the weight of a Bucket is the sum of the weights of its children.

Placement Rule determines the rules for selecting replicas of a PG, allowing users to set the distribution of replicas in the cluster. The Placement Rule definition generally includes three parts: take to select the bucket, choose to select osd, and emit to output the result. Its definition format is as follows:

take {bucket}
choose
  choose firstn {num} type {bucket-type}
  chooseleaf firstn {num} type {bucket-type}
    If {num} == 0, choose pool-num-replicas buckets (all available).
    If {num} > 0 && < pool-num-replicas, choose that many buckets.
    If {num} < 0, it means pool-num-replicas - {num}.
emit

The choose step can be iterated multiple times. For example, to select three OSDs under different TORs in the same Row, first choose one Row, then choose three TORs under it, and finally choose one OSD under each TOR:

step take root
step choose firstn 1 type row
step choose firstn 3 type tor
step choose firstn 1 type osd
step emit

Placement Rule can also place the primary replica on SSD and the remaining replicas on HDD:

step take ssd
step chooseleaf firstn 1 type host
step emit
step take hdd
step chooseleaf firstn -1 type host
step emit
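The {num} argument of firstn can be resolved with a small helper (illustrative, not Ceph source). Interpreting a negative value as "all replicas except |num|" matches the SSD/HDD rule above, where firstn 1 picks the primary on SSD hosts and firstn -1 leaves the remaining replicas to HDD hosts:

def resolve_firstn(num: int, pool_num_replicas: int) -> int:
    if num == 0:
        return pool_num_replicas            # take all replicas
    if num > 0:
        return min(num, pool_num_replicas)  # take exactly num
    return pool_num_replicas + num          # negative: all but |num|

print(resolve_firstn(1, 3))    # -> 1, the primary replica chosen from SSD hosts
print(resolve_firstn(-1, 3))   # -> 2, the remaining replicas chosen from HDD hosts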

CRUSH supports a variety of Bucket selection algorithms, such as Uniform, List, Tree, Straw, and Straw2. The default Straw algorithm copes more easily with the addition and removal of items. A comparison of the algorithms is as follows:

The OSD selection process is driven by the CRUSH_HASH(x, r, i) function, where x is the PGId, r is the replica slot being selected, and i is the OSD id. Take the Straw algorithm as an example: it is a pseudo-random lottery in which each OSD's draw is multiplied by a value derived from the OSD's weight, and the OSD with the largest product wins. Over a massive number of pick calculations, OSDs with larger weights are therefore selected with proportionally higher probability:

foreach item in bucket:
    draw = CRUSH_HASH(PG_ID, OSD_ID, r)
    osd_straw = (draw & 0xffff) * osd_weight
    keep the largest osd_straw seen so far (high_osd_straw)
return the OSD corresponding to high_osd_straw

When the bucket selection algorithm is executed for the multiple replicas of a PG, the parameter r in the hash above is simply incremented to calculate the OSD for each replica slot. Different values of r may lead to conflicts, for example the same OSD being selected twice, or the selected OSD being offline or overloaded; in these cases a reselection is required: r is incremented again and the hash is recalculated to obtain a new value. The following shows a reselection after an osd2 conflict during replica selection for pg 1.1:
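A simplified Python sketch of this weighted draw with conflict handling follows (illustrative only: the hash is a sha256-based stand-in for CRUSH's Jenkins hash, and the flat OSD list ignores the bucket hierarchy):

import hashlib

def crush_hash(pg_id: int, osd_id: int, r: int) -> int:
    digest = hashlib.sha256(f"{pg_id}:{osd_id}:{r}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def straw_pick(pg_id: int, osds: dict, r: int) -> int:
    # every OSD draws a straw scaled by its weight; the longest straw wins
    return max(osds, key=lambda o: (crush_hash(pg_id, o, r) & 0xffff) * osds[o])

def select_replicas(pg_id: int, osds: dict, n: int, down=frozenset()) -> list:
    chosen, r = [], 0
    while len(chosen) < n:
        osd = straw_pick(pg_id, osds, r)
        r += 1                              # conflict: bump r and redraw
        if osd not in chosen and osd not in down:
            chosen.append(osd)
    return chosen

osds = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}     # osd.2 has double weight
print(select_replicas(pg_id=0x11, osds=osds, n=3, down={2}))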

The CRUSH algorithm can cause a certain degree of data imbalance in small clusters, and adding new devices inevitably triggers data migration between old and new devices. Note that even if a dedicated table were used to record the OSD mapping of every PG, that table would only be MB-sized for a fairly large cluster; many storage systems keep the physical distribution meta-information of replicas in a Master, and the client only needs to cache this meta-information, which does not hurt the read/write performance of the whole system. Moreover, changes to the cluster topology in CRUSH still cause some data to be migrated, whereas a centralized mapping table can migrate only the minimum data that actually needs to move. The logical data sharding design of CDS borrows from RADOS, but for physical data distribution it stores the replica locations of shards directly in the Master.

There is no separate capacity-based load balancing process in Ceph. Instead, adding a new OSD automatically triggers data migration based on the CRUSH algorithm so that the data distribution in the cluster rebalances itself. The basic unit of migration is the PG. A newly added OSD starts out empty; some PG replicas are repaired onto it following the repair process below, which migrates data to the new OSD.

Reading and writing process

The reading and writing of RADOS follows the two-stage scheme introduced above. The pseudo code of the specific reading and writing process is as follows:

locator = object_name
obj_hash =  hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)    # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]

After RADOS finds the PG to which the object belongs via the CRUSH calculation above, it sends the read and write requests directly to that PG's primary OSD. There are two write/read scenarios: multi-replica and EC. From the client's perspective, both reads and writes are sent to the primary OSD.

In the multi-replica scenario, the client writes directly to the primary OSD, and the primary OSD returns to the client only after all replicas (three by default) have been written. The specific process is as follows:

With FileStore in earlier versions, the OSD sends two ACKs for a write, one after writing the Journal and one after writing the Data (the current default BlueStore sends only one ACK). The specific process is as follows:

The write process in the EC scenario is similar to the multi-replica case: the client first writes to the primary OSD, the primary OSD performs EC encoding and writes the resulting data blocks and coding blocks to all OSDs, and it returns only after all OSDs have written successfully. If the write is not aligned, a Read-Modify-Write is used: the existing data is read, merged with the new update, and written back; for append writes, the data is zero-padded to the stripe width. The specific process is as follows:

Repair process

Ceph's failure-handling process is mainly divided into three major steps:

  1. Perception of cluster status: First, Ceph must be able to sense cluster failures in a timely manner through some method, determine the status of nodes in the cluster, determine which nodes have left the cluster, and provide authoritative basis for determining which copies of data are affected by the failure.

  2. Determine the data affected by the failure: Ceph calculates and determines the data that is missing from the replica based on the new cluster state.

  3. Restore affected data.

The Ceph cluster consists of two parts: the MON cluster and the OSD cluster. The MON cluster forms a decision-making group via the Paxos algorithm that jointly makes decisions and broadcasts key cluster events; "OSD node leaves" and "OSD node joins" are two such key events. OSD join and leave events are detected through the OSD heartbeat mechanism: OSDs regularly report heartbeats to the MON cluster, and OSDs within the same PG also perform failure detection among themselves. The status of OSD nodes is stored in the OSD Map on the MON cluster. The MON cluster determines whether an OSD node is online or offline based on the OSD heartbeat reports and the failure-detection reports exchanged between OSDs.

After determining that an OSD node is offline, the MON cluster updates the OSDMap, bumps its epoch, and distributes the latest OSDMap to a random OSD via the messaging mechanism. When a client (or peer OSD) processing an IO request finds that its own OSDMap version is too old, it requests the latest OSDMap from the MON. Since the other replicas of the PGs on each OSD may live on any OSD in the cluster, after a period of propagation all OSDs in the cluster eventually receive the OSDMap update. (This lazy propagation prevents the cluster from generating a burst of repair traffic when OSDs fail on a large scale.)

After receiving the OSDMap update message, the OSD scans all PGs it hosts, cleans up PGs that no longer exist (e.g. that have been deleted), and initializes the rest. If a PG on the OSD is a Primary PG, the PG performs a Peering operation. During Peering, the PG checks the consistency of its replicas based on the PGLog and tries to compute which data is missing from each replica, finally producing a complete missing-object list that serves as the basis for subsequent Recovery operations (Recovery is a background process that ensures replicas of every object in the PG exist on the Acting Set nodes). For PGs whose lost data cannot be computed from the PGLog, the whole PG must be copied via a Backfill operation. Note that before Peering completes, the PG's data is unreliable, so the PG suspends all client IO during Peering. Peering mainly consists of three steps (a simplified sketch follows the list):

  • GetInfo: The PG's primary OSD obtains the pg_info of all replica OSDs by sending messages.

  • GetLog: Based on a comparison of the pg_info obtained from each replica, an OSD holding an authoritative log is selected. If the primary OSD does not hold the authoritative log, it pulls the authoritative log from that OSD; after the pull completes, the primary also holds the authoritative log.

  • GetMissing: The primary OSD pulls the PG logs of the other OSDs and, by comparing them with the local authoritative log, computes the objects missing on each OSD as the basis for subsequent Recovery operations.
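A highly simplified, runnable sketch of the three steps above (all type and field names here are hypothetical; the real logic lives in Ceph's PG/PeeringState code and is far more involved):

from dataclasses import dataclass, field

@dataclass
class PGShard:
    osd_id: int
    last_update: int                                   # stands in for pg_info.last_update
    log: dict = field(default_factory=dict)            # object name -> version

def peer(primary: PGShard, replicas: list) -> dict:
    shards = [primary, *replicas]
    # GetInfo: the primary collects pg_info from every shard
    # GetLog: the shard with the newest last_update holds the authoritative log
    auth = max(shards, key=lambda s: s.last_update)
    auth_log = auth.log                                # pulled by the primary if it is not authoritative
    # GetMissing: diff each shard's log against the authoritative log
    return {s.osd_id: {o: v for o, v in auth_log.items() if s.log.get(o, -1) < v}
            for s in shards}

primary  = PGShard(0, last_update=10, log={"objA": 10, "objB": 9})
replica1 = PGShard(1, last_update=9,  log={"objA": 8,  "objB": 9})
print(peer(primary, [replica1]))    # replica1 is missing the newer version of objA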

After Peering completes, the PG enters the Active state and marks itself Degraded/Undersized according to the replica status. In the Degraded state, the number of log entries kept in the PGLog is expanded from the default 3,000 to 10,000, providing more records to facilitate data recovery once the replica node comes back online. After entering the Active state, the PG is available and begins to accept IO requests, and it decides based on the Peering results whether to perform Recovery and Backfill. The Primary PG copies data for specific objects according to the missing lists: for data missing on a Replica PG, the Primary pushes it via a Push operation; for data missing on the Primary PG itself, the data is pulled from a replica via a Pull operation. During recovery, the PG transfers the complete 4M object. For PGs that cannot be recovered via the PGLog, a Backfill operation performs a full copy of the data. Once the data of every replica is fully synchronized, the PG is marked Clean: the replicas are consistent and recovery is complete. For more details on Peering, please refer to the official documentation, "Ceph Principle and Implementation" and "Ceph Source Code Analysis".

Since Ceph's IO path must go through the Primary PG, IO cannot proceed normally once the OSD hosting the Primary PG goes down. If the new Primary OSD chosen by CRUSH has no data for the PG, a Backfill is triggered to restore it. To keep business IO flowing during recovery, the MON assigns a PG Temp to handle IO requests temporarily and removes it once recovery completes. For example, suppose the Acting Set of a PG is initially [0, 1, 2] with osd0 as the primary. When osd0 fails, CRUSH recalculates the Acting Set as [3, 1, 2]; osd3 is now the primary but holds no data, so a temporary PG [1, 2, 3] is requested from the Monitor with osd1 acting as the primary. After osd3 completes Backfill, the temporary PG is removed and the Acting Set changes back to [3, 1, 2].

Cache Tier

RADOS implements an automatic tiered storage mechanism based on Pools. The first tier, the Cache Tier, is a Cache Pool on high-speed SSD devices; the second tier, the Storage Tier, is a Data Pool on large-capacity, low-speed HDD devices, typically using EC mode to reduce storage cost. Data migrates automatically between the Cache Tier and the Storage Tier based on activity. A Cache Tier can improve the performance of critical or hot data while reducing storage costs.

Cache Tier supports the following modes:

  • write back: Read and write requests are sent directly to the cache pool. The data in the cache pool is automatically flushed to the data pool. When there is a miss in the cache pool, it is loaded from the data pool. It is suitable for application scenarios with large amounts of modifications.

  • read proxy: When the read object is not in the cache pool, the cache pool layer sends a request to the data pool. It is suitable for transition from write back mode to none mode.

  • Read forward: When the read object is not in the cache pool, it is returned directly to the client and then the client directly reads the data pool.

  • write proxy: When the written object is missed in the cache pool, it does not wait for the data to be loaded from the data pool to the cache pool, and directly sends the write to the data pool.

  • read only: also called write-around or read cache, read requests are sent directly to the cache pool, and write requests are sent directly to the data pool. The cache pool is generally set to a single copy, which is suitable for write-once and read-many scenarios.

BlueStore

BlueStore has been Ceph's default ObjectStore backend since the Luminous release. Before that, the backend was FileStore, built on top of a local file system; the community also developed NewStore, which never saw a final release. The issues with FileStore and NewStore are:

  • Since the underlying file system uses Journal, the data ends up being written twice, which means that Ceph sacrifices half of its disk throughput.

  • The journaling-of-journal problem, discussed in the Write Behaviors paper mentioned above: Ceph's FileStore keeps its own log, and the Linux file system underneath also has a journaling mechanism, so in effect the log is written twice.

  • For new LSM-Tree storage, such as RocksDB and LevelDB, since the data itself is organized in log form, there is actually no need to add a separate WAL.

  • Better utilization of SSD/NVM storage media. Unlike disks, flash-based storage offers higher parallelism that needs to be exploited; CPU processing speed is gradually falling behind that of storage, so multi-core parallelism must be used more effectively, and the heavy use of queues in the storage path easily causes costly concurrency contention that needs optimization. On the other hand, RocksDB has good support for SSDs, which is one reason BlueStore adopts it.

BlueStore is short for Block NewStore. Its overall architecture has three parts: BlockDevice, BlueFS and RocksDB. BlockDevice is the lowest-level block device; BlueStore abandons the local file system and reads and writes the raw block device directly with AIO. Since AIO only supports direct IO, writes to the BlockDevice go straight to disk and must be page-aligned. BlueStore implements Ceph's object store on top of RocksDB and the BlockDevice: all metadata, including the storage pools' collections, objects, omap information, and disk-space allocation records, is stored in RocksDB (a KV store), while object data is written directly to the BlockDevice, taking over the raw device without a local file system and using only one raw partition. The overall architecture of BlueStore is shown in the figure below:

The raw device is managed through the Allocator and data is saved directly to the device; RocksDB is used to store the metadata. Underneath, a BlueFS based on the raw disk is encapsulated, and BlueRocksEnv connects it to RocksDB.

BlueFS is a file system implemented specifically to support RocksDB. File reads and writes in RocksDB go through the rocksdb::EnvWrapper interface, and BlueFS implements a lightweight file system directly on the raw disk, exposed to RocksDB via BlueRocksEnv. BlueFS divides storage space into three tiers: slow (Slow) space, high-speed (DB) space and ultra-high-speed (WAL) space. BlueFS maintains three kinds of data: Superblock, Journal and Data. The Superblock stores BlueFS's global information and log information, and its position is fixed at the head of BlueFS; the Journal stores log records and is generally pre-allocated as a contiguous area, with more space allocated from the remaining space once it fills up; Data is the actual file data area, with an extent allocated from the remaining space for each write. The Superblock and Journal form a Log-Base structure: the Superblock is the Base of the meta-information and the Journal is the Log.

BlueFS is designed to be as simple as possible and is specifically designed to support RocksDB. It does not need to support the POSIX interface and only needs to adapt to RocksDB through RocksEnv. In general it has these characteristics:

  • In terms of directory structure, BlueFS only has a flat directory structure and no tree hierarchical relationship; it is used to place the db.wal/, db/, db.slow/ files of RocksDB. These files can be mounted on different hard disks, for example, db.wal/ is placed on an NVMe high-speed device; db/ containing hot SST data can be placed on an SSD; db.slow/ is placed on an HDD disk.

  • In terms of data writing, BlueFS does not support overwriting, only append-only. The block allocation granularity is coarse, about 1MB. There is a garbage collection mechanism that regularly handles wasted space.

  • Operations on metadata are recorded in the log, and the log is replayed each time it is mounted to obtain the current metadata. Metadata exists in memory and is not persisted on disk. There is no need to store things such as free block lists. When the log is too large, Compact will be rewritten.

There are three types of writing to BlueStore:

  • Writing to a newly allocated area of an Object: no Journal is needed; the data is written directly, and then the Meta recording the index position is written.

  • Writing to a new location of an existing Blob: no Journal is needed; the data is written directly, and then the Meta recording the index position is written.

  • Overwriting an existing location of an existing Blob: a Journal containing the data must be introduced for deferred writing. If the write is small, it is merged by writing it into the Journal; if the write is large, the data is split by disk block size into three parts: the unaligned head, the unaligned tail, and the aligned middle. The aligned middle can be written to a new location (ROW) and the index record in the Meta updated; the unaligned head and tail are written to the Journal, i.e. the log disk is written successfully before the data disk is updated, and the Journal entry is released once the data disk update completes.

Object storage has no Overwrite scenario, so BlueStore brings a large performance improvement for object storage. For the Journal in the Overwrite scenario, the Journal is not implemented on top of a file system but is written into RocksDB, reusing its WAL. Here, Simple Write refers to scenarios that need no WAL, such as new writes and aligned writes (COW), and Deferred Write refers to scenarios that require a WAL (RMW, Read-Modify-Write). A single IO write request at the user or OSD level may become a Simple Write, a Deferred Write, or a combination of both at the BlueStore layer.
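As a rough illustration of how one overwrite is split between the two paths, here is a Python sketch that classifies the aligned middle as a Simple Write and the unaligned head and tail as Deferred Writes (the block size is an assumption; real BlueStore also consults settings such as min_alloc_size and deferred-write size thresholds, which are ignored here):

BLOCK = 4096                                            # assumed disk block size

def classify_overwrite(offset: int, length: int) -> list:
    end = offset + length
    aligned_start = (offset + BLOCK - 1) // BLOCK * BLOCK   # round offset up
    aligned_end = end // BLOCK * BLOCK                      # round end down
    plan = []
    if aligned_start < aligned_end:
        # aligned middle: written to a new location (ROW), no WAL needed
        plan.append(("simple_write", aligned_start, aligned_end - aligned_start))
    if offset < min(aligned_start, end):
        # unaligned head: merged into the RocksDB WAL (deferred)
        plan.append(("deferred_write", offset, min(aligned_start, end) - offset))
    if offset < aligned_end < end:
        # unaligned tail: merged into the RocksDB WAL (deferred)
        plan.append(("deferred_write", aligned_end, end - aligned_end))
    return plan

print(classify_overwrite(offset=1000, length=10000))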

Simple Write

The Simple Write process first writes the data to a newly allocated block and then updates the k/v meta-information:

Deferred Write

In the Deferred Write process, the WAL is encapsulated directly in a RocksDB k/v transaction, and RocksDB commits the k/v operation by writing its log. After the log write completes, a transaction performs the RMW update of the data on disk; finally, the k/v written into RocksDB in the first step is deleted:

Simple Write + Deferred Write

When Simple Write and Deferred Write are combined, the processing flow is equivalent to combining the two, except that the k/v meta-information update of the Simple Write and the WAL write into RocksDB of the Deferred Write are merged into a single RocksDB transaction; the other stages are the same as in the two write paths above:

RGW

Object storage uses objects as data storage units, abandons the characteristics of file system metadata management, and stores all objects in a flat manner. RGW is Ceph's gateway system for object storage. It builds an HTTP proxy layer on top of RADOS to implement object storage logic.

Data model

The data model of object storage is a hierarchical relationship of users, buckets and objects:

  • User: User of the object storage application. A user can have one or more buckets.

  • Bucket: A container for object storage, a management unit for objects with the same attribute

  • Object: The basic unit of data organization and storage. An object includes data and metadata.

In order to realize the authentication, authorization and quota of the object storage system, RGW saves the user information in a RADOS object.

A bucket also corresponds to a RADOS object. The information contained in a bucket falls into two categories: user-defined metadata that is transparent to the RGW gateway, and information the RGW gateway cares about, such as the storage policy and the number of index objects. When a bucket is created, one or more index objects are created at the same time to hold the list of objects in the bucket and support the List Bucket operation; consequently the index object must be updated whenever objects are uploaded or deleted. Ceph initially had only one index object per bucket, and updating it became a bottleneck for object upload and deletion. Newer Ceph versions shard the index into multiple index objects, which greatly improves write performance; index sharding slows down List Bucket operations, which Ceph optimizes by reading the index objects in parallel and merging the results.
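The index shard that tracks a given object is typically chosen by hashing the object name modulo the shard count, roughly as in the sketch below (the hash function and the index object naming here are illustrative, not RGW's exact implementation):

import hashlib

def bucket_index_shard(bucket_id: str, object_name: str, num_shards: int) -> str:
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return f".dir.{bucket_id}.{h % num_shards}"     # one of the bucket's index objects

print(bucket_index_shard("zone1.12345.1", "photos/cat.jpg", num_shards=16))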

There are two ways to upload an object: whole upload and multipart upload (also called three-step upload). RGW limits a whole upload to at most 5GB; anything larger must be uploaded in parts. During the upload, RGW performs authentication, authorization and rate limiting. For rate limiting, each of the multiple RGW gateways keeps a QoS cache and periodically updates the local QoS state in a Read-Modify-Write manner.

Upload as a whole

For a whole upload, when the object is smaller than the RGW chunk size, the uploaded object corresponds to a single RADOS object, named after the application object name, with the application object metadata stored in this RADOS object's extended attributes. When the uploaded object is larger than the RGW chunk size, it is split into multiple chunks: a head object whose size equals the chunk size, several intermediate objects whose size equals the stripe size, and a tail object whose size is less than or equal to the stripe size. The head object is named after the application object name and is called head_obj; it stores the first rgw_max_chunk_size bytes of the application object as well as the application object's meta-information and manifest. The intermediate and tail objects hold the remaining data and are named "shadow_" + "." + "32-character random string" + "_" + "stripe number".

RGW supports multiple versions of an object. For whole uploads, the named head_obj points to the head_obj of a specific version.

Multipart upload

For multipart uploads, the RGW gateway stripes each part into multiple RADOS objects according to the stripe size. The first RADOS object of each part is named "_multipart_" + "uploaded object name" + "multipart upload Id" + "part number", and the remaining objects are named "_shadow_" + "uploaded object name" + "multipart upload Id" + "part number". When all parts have been uploaded, RGW generates an additional RADOS object to store the application object's metadata and the manifest of all parts.
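Following the naming rules above, here is a sketch of the RADOS object names produced for one part (the separators and the trailing stripe index on the shadow objects are assumptions for illustration; the exact format varies between RGW versions):

def multipart_rados_objects(obj_name: str, upload_id: str, part_num: int,
                            part_size: int, stripe_size: int) -> list:
    prefix = f"{obj_name}.{upload_id}.{part_num}"
    names = [f"_multipart_{prefix}"]                    # first stripe of the part
    stripes = -(-part_size // stripe_size)              # ceiling division
    names += [f"_shadow_{prefix}_{i}" for i in range(1, stripes)]
    return names

# a 16 MiB part striped into 4 MiB RADOS objects
print(multipart_rados_objects("video.mp4", "2~abcdef", part_num=1,
                              part_size=16 << 20, stripe_size=4 << 20))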

Multipart upload also supports multiple versions of objects. head_obj points to the multipart upload head_obj of a certain version.

RBD

RBD is one of Ceph's three major storage service components. It stands for RADOS Block Device and is currently the most stable and widely used Ceph storage interface. Upper-layer applications access RBD block devices in two ways: librbd and krbd. The architecture of RBD differs greatly from that of RGW and CephFS: RBD block devices have little meta-information and it is accessed infrequently, so there is no need for a daemon that loads meta-information into memory to accelerate access.

Metadata

The RBD block device is called an image in Ceph. The image consists of metadata and data. The metadata is stored in multiple special RADOS objects, and the data is automatically striped and stored in multiple RADOS objects. In addition to the Image's own metadata, there is also a set of special RADOS objects in the storage pool to which the Image belongs to record Image associations (such as snapshot and clone information) or additional information and other RBD management metadata.

The core metadata of an Image is stored in three objects: rbd_id.<name>, rbd_header.<id>, and rbd_object_map.<id>. rbd_id stores the mapping between name and id; rbd_header stores meta-information such as capacity, features, snapshots and striping parameters; rbd_object_map records whether each block of the Image exists. Because EC storage pools do not support omap, RBD metadata and data can be placed in different storage pools so that RBD data can use EC. Data is stored in objects prefixed with rbd_data.<pool_id>.<id>.
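For the data path, the RADOS object holding a given byte of an Image can be located by dividing the offset by the object size, as in this small illustrative sketch (4 MiB is RBD's default object size; the pool_id prefix used when data sits in a separate EC pool is omitted):

def rbd_data_object(image_id: str, offset: int, object_size: int = 4 << 20) -> str:
    # data objects are named rbd_data.<id>.<object number>, numbered by offset
    return f"rbd_data.{image_id}.{offset // object_size:016x}"

print(rbd_data_object("10a56b8b4567", offset=9 << 20))   # byte 9 MiB lives in object ...0002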

Ceph introduced the object-map in the H (Hammer) release, allowing the client to know the distribution of data objects within an Image. In scenarios such as capacity statistics, snapshots and cloning, there is no need to traverse all data objects of the Image, which greatly shortens execution time.

Snapshot

An RBD snapshot creates a read-only snapshot of an Image and only needs to save a small amount of snapshot meta-information; the underlying data IO relies entirely on the RADOS snapshot implementation. The COW process that clones data objects into snapshot objects is transparent to the RBD client: the RADOS layer decides whether to perform COW based on the SnapContext carried by the data-object IO issued by the RBD client. RBD snapshots are therefore implemented completely lazily on the OSD side.

The RADOS layer supports snapshot operations of a single RADOS object. A RADOS object consists of a head object and possibly multiple Clone objects. The OSD side uses the SnapSet structure to save the snapshot information of the object. The clone_overlap field records the overlap interval of the data content of the clone object and the head object. This field can be used to reduce data transmission between OSDs when recovering object data.

The following is an example of an Image snapshot:

Initial Image, the first 8M data is written:

Create the first snapshot Snap1 and only update snap_seq and snapshot_<snap_id> in the rbd_header metadata object:

Writing to obj0 triggers the RADOS snapshot COW: the obj0-clone1 object is generated and the new data is written to the head object:

Create a second snapshot snap2 and only update the snapshot metainformation in rbd_header:

Perform write operations on obj0, obj1, and obj2, trigger the COW of the RADOS object to generate a clone object, and write the data to the head object:

Create the third snapshot Snap3 and modify the snapshot-related metadata in rbd_header:

Write to obj1, obj2, and obj3, trigger the RADOS object COW, generate a clone object, and write new data to the head object:

Clone

An RBD clone is a writable snapshot built on top of an RBD snapshot. The clone implementation also uses a COW mechanism, but it does not rely on the RADOS object snapshot implementation; instead, cloning is implemented in the RBD client, and the RADOS layer is completely unaware of the clone relationship between Images. Cloning an Image is equivalent to creating a new Image and recording its parent snapshot in the Parent field of the Image metadata.

When RBD opens a cloned Image, it reads the Parent metadata and builds the dependency relationships between Images. When the cloned Image's data is accessed, the clone's own data object is accessed first; if it does not exist, the data object of the parent snapshot is tried. Since there may be multiple levels of clone relationships, this process may go all the way up to the top-level parent.

The RBD clone Image reading process is as follows:

RBD clone Image writing process:

CephFS

CephFS is a POSIX-compliant file system built on Ceph's distributed object store RADOS. File metadata and file data are stored in separate Pools, and both are mapped to objects in the Ceph storage cluster. Clients can mount the file system either through the kernel client or through a user-space file system (FUSE). The meta-information server MDS persists all file system metadata (directory structure, file owners, access modes, etc.) in RADOS objects while also keeping the metadata resident in memory. The MDS exists because high-frequency metadata read operations such as ls and cd can then be served without touching the OSDs; caching the metadata in dedicated memory provides much higher metadata read/write performance.

VFS

Before introducing MDS, let's first understand the basic knowledge of file systems. In the Linux operating system, VFS (Virtual FileSystem) is used to define the POSIX semantic interface that the file system needs to implement to shield the differences between different file systems.

In order to adapt to different types of file systems, VFS defines 4 basic data types:

  • SuperBlock: used to save meta-information of the file system

  • Inode: used to save meta-information (directory, file, etc.) of a file system object in the file system

  • Dentry: A link to an Inode in the file system. The link has a name attribute. The directory tree is implemented through the links between Inodes.

  • File: The operation handle of a file opened by a process in the file system, associated with a process

In order to realize file sharing, the file system introduces the concept of links, including hard links and soft links:

  • Hard link: Multiple file names point to the same Inode. A physical storage content has multiple names in the file system. Internally, the reference counting mechanism is used to realize that the Inode data is deleted only after all files with all names are deleted. In addition, in order to prevent the occurrence of directory rings, directories cannot be used to create hard links, and hard links cannot span file systems.

  • Soft link: A soft link creates a new Inode whose content is a pointer to another file path. Soft links can point across file systems. There is no reference-counting mechanism: deleting a soft-link file does not affect the source file, while deleting the source file leaves the soft link pointing at nothing.

MDS

Metadata access accounts for as much as 80% of file system operations, so the design of the MDS directly determines the performance of the CephFS file system. There are two common ways in industry to store Inode meta-information, exemplified by BSD's FFS and by C-FFS: the BSD style stores Inodes in a separate hash table indexed by Inode + DentryName, while the C-FFS style embeds the Inode into the Dentry.

Consider the VFS-level pseudocode corresponding to an ls operation in bash:

foreach item in ${readdir}
    getattr($item)

ls not only needs the names of the children of a directory, it must also query the attributes of each child, such as whether it is a directory, file or soft link. In C-FFS mode the Inode meta-information is embedded in the Dentry, so a single read returns both the name and the Inode attributes of each child. In BSD mode multiple reads are needed: one to get the child names, and then further reads to fetch the attributes of each child Inode. A C-FFS-style file system is 10-300% faster than the BSD style, but it does not support hard links well. In BSD mode, since Inodes are stored separately, a hard link only requires creating another Dentry pointing to the Inode; in C-FFS mode the default Dentry-to-Inode relationship is one-to-one, so extension mechanisms are needed to implement hard links.

The metadata of a Dentry in the MDS includes the file names and the attributes of the child Inodes; the Dentry itself is also an Inode referenced by its parent Dentry. The figure below shows the structure of MDS metadata: child Inodes are placed next to the Dentry so that meta-information can be read quickly. Each Dentry corresponds to a RADOS object in the Metadata Pool; if the child inodes under a Dentry take up particularly large storage space, exceeding 4M, the Dentry is sharded into objects such as ${inode}.00000, ${inode}.00001.

The content of each Inode in RADOS consists of two parts: the xattrs of the RADOS object hold the Inode's Parent and Layout information, and the OMap holds the child Inode information. The meta-information of a file is therefore stored in the Dentry of its parent directory.

To be compatible with hard links, where multiple Dentries point to one Inode, MDS calls the first Dentry pointing to the Inode the Primary Dentry and any subsequent Dentry pointing to it a Remote Dentry. The Inode object is saved under the directory defined by the Primary Dentry; a Remote Dentry does not store the Inode under its own directory the way the Primary Dentry does, which would otherwise cause multiple cached copies of the Inode data. MDS builds an in-memory Anchor table whose key is the Inode and whose value includes the parent and a ref count. With the Anchor table, an Inode can be traced back to the root directory, or quickly located starting from the root. The in-memory Anchor table avoids reading RADOS objects multiple times and improves meta-information read/write performance. The figure below is an example of an Anchor table, in which the Path column is fictitious for ease of understanding.

When a directory is renamed, the Inodes along the whole chain may be affected, so a transaction is needed to update them all at once, decreasing the old Ref counts and increasing the new ones, keeping the Anchor table consistent with the actual directory structure. When a Ref count reaches zero, there are no remaining link references and the entry can be deleted from the Anchor table.

MDS does not use Paxos for multi-replica high availability the way the Monitor does; instead it uses an active-standby model. The MDS keeps a Journal to guarantee meta-information consistency; this Journal is also stored in RADOS objects, and the in-memory state is just a cache of it. When meta-information is updated, the MDS persists the update to the RADOS Journal object before returning, so no data is lost when the active MDS fails.

MDS high availability has two methods: cold standby and hot standby:

  • Cold standby: The standby mds only maintains heartbeats with the Monitor as a process-level backup and does not cache metadata. When the active mds fails, the cold-standby mds needs some time to replay metadata into its cache.

  • Hot standby: In addition to process-level backup, the metadata cache is kept synchronized with the active mds at all times via the RADOS Journal objects. When the active mds fails, the hot-standby mds becomes active directly, and the switchover window is smaller.

Data synchronization between the active and standby MDS is implemented through the Journal in the underlying RADOS objects. The MDS sends regular heartbeats to the Monitor cluster; if the Monitor does not see an MDS heartbeat for a long time, it considers the MDS abnormal. The standby MDS synchronizes metadata into its local cache from the RADOS objects written by the active MDS. When the active MDS fails, the standby MDS receives the new MDSMap and decides to take over as active; when the Monitor cluster fails, the MDS receives the MonMap message and decides to stop serving. The MDS active/standby switchover process is as follows:

The processing flow of MDS is: boot -> replay -> reconnect -> rejoin -> active.

  • replay: mds restores memory from journal objects in rados, including inode table, session map, openfile table, snap table, purge queue, etc.

  • reconnect: After the client receives the mdsmap change message, it sends a reconnect message to the mds carrying caps, openfile, path and other information. After processing these requests, the mds rebuilds sessions with legitimate clients and rebuilds the caps for inodes already in its cache; caps for inodes not yet in the cache are recorded for processing in the rejoin phase.

  • rejoin: Reopen the open file and record it in the cache, and process the caps recorded in the reconnect phase.

CephFS supports LazyIO, which relaxes some POSIX semantics and improves multi-client read/write performance through Buffer Write and Read Cache; this is very helpful in HPC scenarios. Another experimental feature in the community is multiple file systems. CephFS based on active-standby MDS can already run stably in production; with multiple FS instances, cloud multi-tenant NAS products similar to CFS can be built.

Reading and writing process

All file data in CephFS is stored as RADOS objects, and CephFS clients can access RADOS directly to operate on file data; the MDS only handles metadata operations. Ceph has a dedicated system for managing clients' operating permissions on Inodes, called capabilities, or Caps for short. One of the main differences from other network file systems (such as NFS or SMB) is that the granted Caps are very fine-grained, and multiple clients may hold different Caps on the same inode.

The Caps metadata of each file can be divided into 5 parts according to content:

  • p: Pin, means that the inode is stored in memory

  • A: Auth, that is, the ability to operate mode, uid, and gid

  • L: Link, that is, the operation capability of count related to inode and dentry

  • X: Xattrs, that is, the ability to operate extended attributes

  • F: File, that is, the ability to operate file size (size), file data and mtime

Each part has up to 6 corresponding operation capabilities:

  • s: shared sharing capability, that is, the modified data can be obtained by multiple clients, one-to-many model

  • x: exclusive exclusive capability, only for this client

  • r: read has the ability to read

  • w: write has the ability to write

  • c: cache read has caching capability and can cache read data on the client side

  • b: buffer client has the ability to cache writes, that is, the written data can be cached on the local client

Here are a few examples:

  • AsLsXs: All clients can read metadata state related to local cache

  • AxLxXx: Only this client can read and change the related metadata state

  • Fs: Ability to cache and read mtime and file size locally

  • Fx: Ability to write mtime and file size locally

  • Fr: can read data from OSD synchronously

  • Fc: Data in the client can be read from the cache

  • Fw: Has the ability to synchronously write data in OSD

  • Fb: has the ability to write to the cache first, that is, first enter the objectcacher and then write to the OSD asynchronously
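As an illustration of how such strings combine the parts and capability letters listed above, here is a small decoding helper (illustrative only, not Ceph client code):

PARTS = {"A": "auth", "L": "link", "X": "xattr", "F": "file"}
OPS = {"s": "shared", "x": "exclusive", "r": "read",
       "w": "write", "c": "cache-read", "b": "buffer-write"}

def decode_caps(caps: str) -> dict:
    out, current = {}, None
    for ch in caps:
        if ch == "p":
            out["pin"] = []                 # inode pinned in MDS memory
        elif ch in PARTS:
            current = PARTS[ch]
            out[current] = []
        elif ch in OPS and current:
            out[current].append(OPS[ch])
    return out

print(decode_caps("pAsLsXsFsc"))
# -> {'pin': [], 'auth': ['shared'], 'link': ['shared'],
#     'xattr': ['shared'], 'file': ['shared', 'cache-read']}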

Caps are managed by the MDS: the client sends Caps updates and requests to the MDS, which manages them internally with a set of lock mechanisms. The MDS can grant certain Caps to a client and can also revoke them. When Caps are revoked, the client must take corresponding actions; for example, revoking Fb requires flushing the write cache to the OSDs, and revoking Fc requires discarding the read cache. The following is an example sequence diagram of a client modifying permissions:

In Ceph, if a client wants to read or write a CephFS file, it needs to hold the "file read/write" capabilities of the corresponding inode. If the client does not hold the required caps, it sends a cap message to the MDS telling the MDS what it wants, and the MDS issues the capabilities to the client when possible. Once the client holds the "file read/write" capabilities, it can access RADOS directly to read and write file data. If the file is opened by only one client, the MDS also grants the "file cache/buffer" capabilities to that client: "file cache" means reads can be satisfied from the client-side cache, and "file buffer" means writes can be buffered in the client cache.

CephFS client access example

  • Client sends open file request to MDS

  • MDS returns file node, file size, capability and stripe information

  • Client directly READ/WRITE data to OSDs (if there is no caps information, you need to request caps from MDS first)

  • MDS manages the Client's capabilities for this file

  • Client sends a close file request to MDS, releases the file's capabilities, and updates the file's detailed information.

Note that the MDS does not persist Caps into the RADOS Journal objects, so when the MDS fails and restarts, the Caps held by the clients are collected again during the reconnect phase.

Summary

Advantages:

  • The Pool→PG→Object three-level model in Ceph's RADOS design is very clever and can save a lot of meta-information storage. At the same time, PG split supports the expansion of Pool data scale.

  • The CacheTier design can adjust cost and performance very flexibly. Different Pools on the front and back ends of the Cache can use different storage media and replication methods.

  • RADOS supports EC reading and writing. In addition to WriteFull and Append, it also supports Overwrite, which reduces storage costs compared to multiple copies.

  • The object-map design in Ceph RBD is of great reference value and can significantly reduce the overhead of traversing data blocks when the Image is sparse. At the same time, the Clone implementation on the client side is also very clever. (The snapshot of CDS is saved in BOS, so it is inconvenient to learn from the design of Lazy snapshot and client clone)

Disadvantages:

  • Although Ceph's CRUSH algorithm is flexible, it often causes unexpected data migration and is not as convenient as the Master's centralized management of PG's OSD distribution.

  • Ceph writes require all replicas to be successfully written before returning. In the Peering process triggered by a fault or data migration, writing cannot occur and the user experience is not good.

References

ceph-exp-sosp19.pdf

Red_Hat_Ceph_Storage-4-Architecture_Guide-en-US.pdf

weil-thesis.pdf

20170323 bluestore.pdf
