Ceph from beginner to proficient - overview of the CRUSH map algorithm

The following pseudocode shows how an object is mapped to OSDs:

locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of OSDs
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]

def crush(pg):
    all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
    result = []
    attempt = 0
    # size is the number of copies: primary + replicas
    while len(result) < size:
        # vary the hash per attempt so a rejected pick does not repeat forever
        r = hash((pg, attempt))
        attempt += 1
        chosen = all_osds[r % len(all_osds)]
        if chosen in result:
            # an OSD can be picked only once
            continue
        result.append(chosen)
    return result

CRUSH LOOKUP

For a read or write operation, the Ceph client first contacts a Ceph monitor and obtains a copy of the cluster map. The cluster map gives the client the status and configuration information of the Ceph cluster. The data is converted into an object using the object name and the pool name/ID. The object is then hashed with the number of placement groups (PGs) to determine the PG within the Ceph pool that will ultimately hold it. The calculated PG then goes through a CRUSH lookup to determine the primary OSD where the data should be stored or retrieved. Once the exact OSD ID has been computed, the client contacts that OSD directly to store the data. All of these computations are performed by the client, so they do not affect cluster performance.
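
Since the whole lookup is client-side arithmetic, the result can also be checked against the cluster. Below is a minimal sketch, assuming a node with the ceph CLI installed and a hypothetical pool "rbd" and object "my-object"; the standard ceph osd map command reports the PG and OSD set for an object.

import json
import subprocess

# "ceph osd map <pool> <object>" reports which PG and which OSDs an object
# maps to. The pool and object names below are placeholders for this sketch.
out = subprocess.run(
    ["ceph", "osd", "map", "rbd", "my-object", "--format", "json"],
    capture_output=True, text=True, check=True,
).stdout

mapping = json.loads(out)
# The JSON typically includes the pool, the PG id, and the up/acting OSD sets.
print(json.dumps(mapping, indent=2))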

Once the data is written to the primary OSD, the node hosting the primary OSD performs a CRUSH lookup of its own, computes the locations of the secondary placement groups and OSDs, and replicates the data to them, providing high availability. Refer to the following example to understand how a CRUSH lookup maps an object to its OSDs.

First, a PG ID is obtained by applying a hash function to the object name, taking the result modulo the cluster's PG count, and combining it with the pool ID. Next, a CRUSH lookup is performed on this PG ID to obtain the primary and secondary OSDs, and finally the data is written to them.
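
As a rough illustration of the first step, the sketch below derives a PG ID in the usual <pool-id>.<pg-hex> form. It substitutes a hashlib digest for Ceph's internal rjenkins hash, and the pool ID and PG count are made-up values.

import hashlib

def pg_for_object(object_name: str, pool_id: int, pg_num: int) -> str:
    # Stand-in for Ceph's object hash (Ceph uses rjenkins internally).
    obj_hash = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    pg = obj_hash % pg_num
    # Ceph displays PG ids as "<pool-id>.<pg in hex>", e.g. "2.1f".
    return f"{pool_id}.{pg:x}"

print(pg_for_object("my-object", pool_id=2, pg_num=128))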

CRUSH Hierarchy

CRUSH is fully aware of your infrastructure and supports user-defined configuration: it maintains a nested hierarchy of all your infrastructure components. The CRUSH device list usually includes disks, nodes, racks, rows, switches, power circuits, rooms, data centers, and so on. These components are called failure domains or CRUSH buckets. The CRUSH map contains the list of available buckets, which pins each device to a specific physical location, and it also holds a set of rules that tell CRUSH how to replicate data for the different Ceph pools. Together, the buckets and rules describe your infrastructure the way CRUSH sees it.
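
To make the idea of buckets and failure domains concrete, here is a toy model in Python. It is not Ceph's map format or placement algorithm; the host names, OSD names, and the simple hash-and-retry loop are all invented for illustration, with the host acting as the failure domain.

import hashlib

# A toy CRUSH-style hierarchy: hosts contain OSDs, and replicas of a PG are
# spread across different hosts. All names below are made up.
hierarchy = {
    "node1": ["osd.0", "osd.1"],
    "node2": ["osd.2", "osd.3"],
    "node3": ["osd.4", "osd.5"],
}

def stable_hash(key: str) -> int:
    # Deterministic stand-in for CRUSH's internal hashing.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "little")

def place(pg_id: str, size: int) -> list:
    hosts = sorted(hierarchy)
    chosen = []
    attempt = 0
    while len(chosen) < size and attempt < 100:
        host = hosts[stable_hash(f"{pg_id}/{attempt}") % len(hosts)]
        attempt += 1
        if host in chosen:
            continue  # at most one replica per failure domain (host)
        chosen.append(host)
    # Pick one OSD inside each chosen host for the actual replica.
    return [hierarchy[h][stable_hash(pg_id) % len(hierarchy[h])] for h in chosen]

print(place("2.1f", size=3))  # one OSD from each of three different hosts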

CRUSH WEIGHTS

The CRUSH algorithm assigns a weight value to each device, and its goal is to approximate the uniform probability distribution of I/O requests. As a best practice, we recommend creating pools with devices of the same type and size, and assigning the same relative weights. Since this is not always practical, you can combine devices of different sizes, and use relative weights so that Ceph allocates more data to larger drives and less data to smaller drives.
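
A small sketch of what relative weights mean in practice, assuming the common convention that a CRUSH weight roughly equals the drive's capacity in TiB; the OSD names and sizes are invented.

# With CRUSH-style weights, each device's expected share of the data is
# proportional to its weight (this is not Ceph's actual placement code).
weights = {"osd.0": 1.819, "osd.1": 1.819, "osd.2": 3.638}  # two 2 TB drives, one 4 TB drive
total = sum(weights.values())
for osd, w in weights.items():
    print(f"{osd}: weight {w:.3f} -> expected share {w / total:.1%}")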

CRUSH RECOVERY

After any component within a failure domain fails, Ceph marks the affected OSD down and out and initiates recovery. By default, Ceph waits 300 seconds before marking a down OSD out and starting recovery; this value can be changed with the mon osd down out interval parameter in the Ceph cluster's configuration file. During a recovery operation, Ceph regenerates the affected data that was held on the failed node. Because CRUSH replicates data across multiple disks, those replicas are available to drive the recovery. CRUSH tries to move as little data as possible while constructing the new cluster layout, which keeps Ceph fault tolerant even when some components fail.
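
For example, the interval could be inspected and raised as below, assuming a recent Ceph release that supports the centralized ceph config command (the 600-second value is only illustrative):

import subprocess

# Inspect and adjust how long the monitors wait before marking a down OSD
# "out" (the mon_osd_down_out_interval option).
subprocess.run(["ceph", "config", "get", "mon", "mon_osd_down_out_interval"], check=True)
subprocess.run(["ceph", "config", "set", "mon", "mon_osd_down_out_interval", "600"], check=True)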

When a new host or disk is added to the Ceph cluster, CRUSH starts a rebalancing operation, during which it moves data from existing hosts/disks to the new one. Rebalancing keeps all disks evenly utilized, which improves cluster performance and keeps the cluster healthy. For example, if a Ceph cluster contains 2000 OSDs and 20 new OSDs are added, only about 1% of the data has to be moved during rebalancing, and all existing OSDs move data in parallel, so it completes quickly. However, for Ceph clusters with high utilization, it is recommended to set the weight of a newly added OSD to 0 first and then gradually increase it to the weight appropriate for its disk capacity, as sketched below. This way the new OSD puts less rebalancing load on the Ceph cluster and avoids performance degradation.
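
A rough sketch of that gradual-reweight approach, assuming a hypothetical osd.20 whose final weight should be 3.638 (a 4 TB drive expressed in TiB); ceph osd crush reweight changes an OSD's CRUSH weight:

import subprocess
import time

# Raise the new OSD's CRUSH weight in small steps so rebalancing happens
# gradually. Step size and pause are illustrative; in practice you would
# wait for the cluster to return to a healthy state between increments.
target, step = 3.638, 0.5
weight = 0.0
while weight < target:
    weight = min(weight + step, target)
    subprocess.run(["ceph", "osd", "crush", "reweight", "osd.20", f"{weight:.3f}"], check=True)
    time.sleep(600)  # let rebalancing settle before the next increment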
