Ceph PG Peering data restoration

Ceph data repair
Once a PG completes the Peering process and enters the Active state, it can serve client requests. If objects are inconsistent across the PG's replicas, they need to be repaired.

There are two types of Ceph repair processes: Recovery and Backfill.

Recovery repairs inconsistent objects based only on the missing records in the PG log.

Backfill means that the PG rescans all objects, compares them to find the missing ones, and repairs them by copying whole objects. The Backfill process is started when an OSD has been down for too long to be repaired from the PG log, or when a newly added OSD causes data migration.

After the PG completes the Peering process, it is in the Active state. If Recovery is required, a DoRecovery event is generated to trigger the recovery operation. If Backfill is required, a RequestBackfill event is generated to trigger the Backfill operation. If, during PG repair, some OSDs require Recovery and others require Backfill, the Recovery process is carried out first and the Backfill process afterwards.

This chapter introduces the implementation of Ceph's data restoration. It first covers resource reservation for data restoration, then presents the state transition diagram of restoration to give a general understanding of the whole process. Finally, the implementations of the Recovery process and the Backfill process are described in detail.

1. Resource reservation
During data restoration, resource reservation is used to limit the number of PGs being repaired on an OSD at the same time. A reservation must be made on both the primary OSD and the replica OSDs; if a reservation cannot be granted, the request blocks and waits. The maximum number of PGs that an OSD can restore at the same time is set by the configuration option osd_max_backfills, whose default value is 1.

Class AsyncReserver is used to manage resource reservations, and its template parameter <T> is the resource type to be reserved. This class implements asynchronous resource reservation. When the resource reservation is successfully completed, the registered callback function is called to notify the caller that the reservation is successful (src/common/AsyncReserver.h):

template <typename T>
class AsyncReserver {
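  Finisher *f;               // runs the callbacks once a reservation is granted
  unsigned max_allowed;      // maximum number of resources allowed; here, the number of PGs that may be repaired
  unsigned min_priority;     // minimum priority
  Mutex lock;

  // maps a priority to the list of pending reservations; pair<T, Context*> is the reserved item and its registered callback
  map<unsigned, list<pair<T, Context*> > > queues;

  // position of each item within the queues lists
  map<T, pair<unsigned, typename list<pair<T, Context*> >::iterator > > queue_pointers;

  // items whose reservation has been granted and that are currently in use
  set<T> in_progress;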
};

1.1 Resource reservation
The function request_reservation() is used to reserve resources:

/**
* Requests a reservation
*
* Note, on_reserved may be called following cancel_reservation.  Thus,
* the callback must be safe in that case.  Callback will be called
* with no locks held.  cancel_reservation must be called to release the
* reservation slot.
*/
void request_reservation(
  T item,                   ///< [in] reservation key
  Context *on_reserved,     ///< [in] callback to be called on reservation
  unsigned prio
  ) {
    Mutex::Locker l(lock);
    assert(!queue_pointers.count(item) &&
    !in_progress.count(item));
    queues[prio].push_back(make_pair(item, on_reserved));
    queue_pointers.insert(make_pair(item, make_pair(prio,--(queues[prio]).end())));
    do_queues();
}


The specific process is as follows:

1) Add the resource to be requested to the queue according to the priority, and add its corresponding location pointer in queue_pointers:

queues[prio].push_back(make_pair(item, on_reserved));
queue_pointers.insert(make_pair(item, make_pair(prio,--(queues[prio]).end())));

2) Call the function do_queues() to check all pending reservation requests in the queues, starting with the highest priority: as long as there is quota left and the priority of a request is not lower than the minimum priority, the resource is granted to it.

3) For each granted request, remove it from the queue and remove its position information from queue_pointers. Add the resource to the in_progress set, and hand the request's callback to the Finisher so that it executes the callback, which notifies the caller that the reservation succeeded.

1.2 Cancel reservation
The function cancel_reservation() is used to release the resources that are no longer used:

/**
* Cancels reservation
*
* Frees the reservation under key for use.
* Note, after cancel_reservation, the reservation_callback may or
* may not still be called. 
*/
void cancel_reservation(
  T item                   ///< [in] key for reservation to cancel
  ) {
    Mutex::Locker l(lock);
    if (queue_pointers.count(item)) {
      unsigned prio = queue_pointers[item].first;
      delete queue_pointers[item].second->second;
      queues[prio].erase(queue_pointers[item].second);
      queue_pointers.erase(item);
    } else {
      in_progress.erase(item);
    }
    do_queues();
}

The specific process is as follows:

1) If the resource is still waiting in the queue, remove it from the queue (this handles the exceptional case); otherwise, remove the resource from the in_progress set;

2) Call the do_queues() function to re-authorize the resource to other waiting requests.
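
The request and cancel paths described in this section can be modeled with a small, single-threaded sketch. MiniReserver is a hypothetical name, the Mutex is omitted, and callbacks run inline rather than through a Finisher, so this illustrates the bookkeeping only and is not the Ceph class itself:

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Simplified model of AsyncReserver<T> (assumption: no locking, inline callbacks).
template <typename T>
class MiniReserver {
  using Callback = std::function<void()>;
  using Queue = std::list<std::pair<T, Callback>>;

  unsigned max_allowed;
  unsigned min_priority;
  std::map<unsigned, Queue> queues;   // priority -> pending requests
  std::map<T, std::pair<unsigned, typename Queue::iterator>> queue_pointers;
  std::set<T> in_progress;            // granted reservations

  // Grant pending requests, highest priority first, while quota remains.
  void do_queues() {
    for (auto it = queues.rbegin();
         it != queues.rend() &&
         in_progress.size() < max_allowed &&
         it->first >= min_priority;
         ++it) {
      while (in_progress.size() < max_allowed && !it->second.empty()) {
        std::pair<T, Callback> p = it->second.front();
        queue_pointers.erase(p.first);
        it->second.pop_front();
        in_progress.insert(p.first);
        p.second();                   // notify the caller of the grant
      }
    }
  }

public:
  MiniReserver(unsigned max, unsigned min_prio)
    : max_allowed(max), min_priority(min_prio) {}

  void request_reservation(T item, Callback on_reserved, unsigned prio) {
    assert(!queue_pointers.count(item) && !in_progress.count(item));
    queues[prio].push_back(std::make_pair(item, on_reserved));
    queue_pointers.emplace(
        item, std::make_pair(prio, std::prev(queues[prio].end())));
    do_queues();
  }

  void cancel_reservation(T item) {
    auto qp = queue_pointers.find(item);
    if (qp != queue_pointers.end()) {   // still waiting: drop the request
      queues[qp->second.first].erase(qp->second.second);
      queue_pointers.erase(qp);
    } else {                            // granted: release the slot
      in_progress.erase(item);
    }
    do_queues();
  }

  bool is_granted(const T &item) const { return in_progress.count(item) > 0; }
};
```

With max_allowed set to 1, a second request waits until the first one is cancelled, and a waiting request with a higher priority is granted before one with a lower priority.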

2. Data restoration state transition diagram

Figure 11-1 shows the state transition diagram of the restoration process. When the PG enters the Active state, it enters the default sub-state Activating.

The state transition process of data repair is as follows:

**Case 1:** When entering the Activating state, if all the copies are complete and do not need to be repaired, the state transition process is as follows:

1) The Activating state receives the AllReplicasRecovered event and directly transitions to the Recovered state

2) The Recovered state receives the GoClean event, and the entire PG is transferred to the Clean state

**Case 2:** After entering the Activating state, no Recovery process is needed and only the Backfill process is required:

1) The Activating state directly receives the RequestBackfill event and enters the WaitLocalBackfillReserved state;

2) When the WaitLocalBackfillReserved state receives the LocalBackfillReserved event, it means that the local resource reservation is successful, and it is transferred to WaitRemoteBackfillReserved;

3) After all replica resources are reserved successfully, the main PG will receive the AllBackfillsReserved event, enter the Backfilling state, and start the actual data Backfill operation process;

4) The Backfilling state receives the Backfilled event, which marks the completion of the Backfill process and enters the Recovered state;

5) Exception handling: When the RemoteReservationRejected event is received in the WaitRemoteBackfillReserved or Backfilling state, the resource reservation has failed. The PG enters the NotBackfilling state and waits for another RequestBackfill event to re-initiate the Backfill process;

**Case 3:** When a PG needs both the Recovery process and the Backfill process, it completes the Recovery process first and then the Backfill process; the order matters here. The specific process is as follows:

1) Activating state: After receiving the DoRecovery event, transfer to the WaitLocalRecoveryReserved state;

2) WaitLocalRecoveryReserved state: In this state, the reservation of local resources is completed. When the LocalRecoveryReserved event is received, it marks the completion of the local resource reservation and transfers to the WaitRemoteRecoveryReserved state;

3) WaitRemoteRecoveryReserved state: In this state, the reservation of remote resources is completed. When the AllRemotesReserved event is received, the PG has completed resource reservation on all replica OSDs participating in data recovery, and it enters the Recovering state;

4) Recovering state: The actual data recovery work is done in this state. On entry, the PG is set to the PG_STATE_RECOVERING state and added to the recovery_wq work queue, which starts data recovery:

PG::RecoveryState::Recovering::Recovering(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/Recovering")
{
  context< RecoveryMachine >().log_enter(state_name);

  PG *pg = context< RecoveryMachine >().pg;
  pg->state_clear(PG_STATE_RECOVERY_WAIT);
  pg->state_set(PG_STATE_RECOVERING);
  pg->publish_stats_to_osd();
  pg->osd->queue_for_recovery(pg);
}

5) After the Recovering state completes the Recovery work, if Backfill is required, it receives the RequestBackfill event and enters the Backfill process;

6) If no Backfill is required, it directly receives the AllReplicasRecovered event and enters the Recovered state;

7) Recovered state: Reaching this state means that the data recovery work has been completed. When the GoClean event is received, the PG enters the clean state.
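
The transitions listed in the three cases can be collected into a lookup table. The sketch below is only a table-driven illustration with hypothetical helper names (recovery_transitions, run); Ceph itself implements these states with boost::statechart. The RequestBackfill transition out of Recovering is left out because the text above does not name its target state:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Transition = std::pair<std::string, std::string>;  // (state, event)

// Transitions named in Cases 1 to 3 above.
const std::map<Transition, std::string> &recovery_transitions() {
  static const std::map<Transition, std::string> table = {
    {{"Activating", "AllReplicasRecovered"}, "Recovered"},
    {{"Activating", "RequestBackfill"}, "WaitLocalBackfillReserved"},
    {{"Activating", "DoRecovery"}, "WaitLocalRecoveryReserved"},
    {{"WaitLocalBackfillReserved", "LocalBackfillReserved"},
     "WaitRemoteBackfillReserved"},
    {{"WaitRemoteBackfillReserved", "AllBackfillsReserved"}, "Backfilling"},
    {{"WaitRemoteBackfillReserved", "RemoteReservationRejected"},
     "NotBackfilling"},
    {{"Backfilling", "RemoteReservationRejected"}, "NotBackfilling"},
    {{"Backfilling", "Backfilled"}, "Recovered"},
    {{"NotBackfilling", "RequestBackfill"}, "WaitLocalBackfillReserved"},
    {{"WaitLocalRecoveryReserved", "LocalRecoveryReserved"},
     "WaitRemoteRecoveryReserved"},
    {{"WaitRemoteRecoveryReserved", "AllRemotesReserved"}, "Recovering"},
    {{"Recovering", "AllReplicasRecovered"}, "Recovered"},
    {{"Recovered", "GoClean"}, "Clean"},
  };
  return table;
}

// Feed a sequence of events; returns "" if an event is not accepted.
std::string run(std::string state, const std::vector<std::string> &events) {
  const auto &table = recovery_transitions();
  for (const auto &ev : events) {
    auto it = table.find({state, ev});
    if (it == table.end())
      return "";
    state = it->second;
  }
  return state;
}
```

For example, feeding the Case 2 events RequestBackfill, LocalBackfillReserved, AllBackfillsReserved, Backfilled and GoClean to run() starting from Activating ends in Clean.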

3. Recovery process
The basis for data restoration is the following information generated during the Peering process:

The missing object information of the primary replica is stored in the pg_missing_t structure of the pg_log class;
the missing object information of each secondary replica is stored in the pg_missing_t structure that corresponds to that OSD in peer_missing;
the location information of the missing objects is stored in the class MissingLoc.

According to the above information, you can know the missing object information of each OSD in the PG, and which OSDs currently have complete information on the missing object.

Based on the above information, the data restoration process is relatively clear:

For objects missing from the primary OSD, randomly select an OSD that holds the object and pull the data from it;
for objects missing from a replica, push the missing object data from the primary replica to that replica to complete the repair;
for snapshot objects, which are a special case, some optimizations are applied during repair.
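
The pull and push rules above can be sketched as a small planning function. All types here are simplifying assumptions (objects are strings, OSDs are ints, RepairOp and plan_repairs are hypothetical names), and the first known holder is used instead of a random one so that the result is deterministic:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

struct RepairOp {
  enum Kind { PULL, PUSH } kind;
  std::string object;
  int peer;   // OSD to pull from, or OSD to push to
};

// primary_missing: objects the primary lacks
// peer_missing:    per replica OSD, the objects that replica lacks
// missing_loc:     per object, the OSDs known to hold a complete copy
std::vector<RepairOp> plan_repairs(
    const std::set<std::string> &primary_missing,
    const std::map<int, std::set<std::string>> &peer_missing,
    const std::map<std::string, std::set<int>> &missing_loc) {
  std::vector<RepairOp> ops;
  for (const auto &obj : primary_missing) {
    auto loc = missing_loc.find(obj);
    if (loc == missing_loc.end() || loc->second.empty())
      continue;                          // unfound: cannot be repaired yet
    ops.push_back({RepairOp::PULL, obj, *loc->second.begin()});
  }
  for (const auto &peer : peer_missing)
    for (const auto &obj : peer.second)
      if (!primary_missing.count(obj))   // primary must hold the object first
        ops.push_back({RepairOp::PUSH, obj, peer.first});
  return ops;
}
```

Objects that the primary itself still lacks are not pushed in this pass, which mirrors the rule that the primary is repaired before the replicas.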


3.1 Triggering the recovery process
The primary OSD of the PG triggers and controls the entire recovery process. During repair, the missing (or inconsistent) objects on the primary OSD are repaired first, and then the missing objects on the replica OSDs. As the data recovery state transitions show, when a PG enters the Active/Recovering state it is added to the OSD's RecoveryWQ work queue. A thread of the thread pool serving recovery_wq then calls the do_recovery() function to perform the actual data recovery:

void OSD::do_recovery(PG *pg, ThreadPool::TPHandle &handle);

The function do_recovery() is executed by a thread of the thread pool of the RecoveryWQ work queue. The input parameter is the PG to be repaired, and the specific processing flow is as follows:

1) The configuration option osd_recovery_sleep sets how long the thread sleeps after each repair pass. If a value is set, the thread first sleeps for that length of time on each invocation. The default value is 0, meaning no sleep.

2) Take the recovery_wq.lock() lock to protect the recovery_wq queue and the variable recovery_ops_active. Compute max, the number of objects that may still be repaired: the maximum number of objects allowed to be repaired concurrently, osd_recovery_max_active, minus the number of objects currently being repaired, recovery_ops_active. Then call recovery_wq.unlock() to release the lock;

3) If max is less than or equal to 0, there is no quota left for repairing objects: put the PG back on the work queue recovery_wq and return. Otherwise, if max is greater than 0, call pg->lock_suspend_timeout(handle) to reset the thread timeout. Then check the state of the PG: if the PG is being deleted, or is not both peered and primary, return immediately;

4) Call the function pg->start_recovery_ops() to perform the repair. Its return value more indicates whether objects still need to be repaired, and the output parameter started is the number of objects whose repair has been started.

5) If more is 0, that is, no objects remain to be repaired, but pg->have_unfound() is not 0, there are still unfound objects (missing objects for which no OSD holding a complete copy is known yet). In that case, call the function discover_all_missing() to continue searching for the objects on the OSDs in the might_have_unfound set; the search works by sending the relevant OSDs a request for their pg_log.

6) If rctx.query_map is empty, that is, no other OSD was found from which a pg_log could be requested to look for the unfound objects, the recovery operation of the PG ends and recovery_wq._dequeue(pg) is called to remove the PG from recovery_wq;

7) The function dispatch_context() does the finishing work: it sends the query_map requests and submits the ctx.transaction transaction to the local object store.

From the above process analysis, we can see that the core function of the do_recovery() function is to calculate the max value of the object to be repaired, and then call the function start_recovery_ops() to start the repair.
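
The quota calculation in steps 2 and 3 can be sketched as follows. RecoveryBudget, Decision and decide are hypothetical names, and a std::mutex stands in for recovery_wq.lock(); only the subtraction and the requeue decision are taken from the description above:

```cpp
#include <cassert>
#include <mutex>

struct RecoveryBudget {
  std::mutex lock;                 // plays the role of recovery_wq.lock()
  int osd_recovery_max_active;     // configured limit
  int recovery_ops_active = 0;     // repairs currently in flight

  explicit RecoveryBudget(int limit) : osd_recovery_max_active(limit) {}

  // Step 2: number of objects that may still be started right now.
  int available() {
    std::lock_guard<std::mutex> g(lock);
    return osd_recovery_max_active - recovery_ops_active;
  }
  void op_started() {
    std::lock_guard<std::mutex> g(lock);
    ++recovery_ops_active;
  }
  void op_finished() {
    std::lock_guard<std::mutex> g(lock);
    assert(recovery_ops_active > 0);
    --recovery_ops_active;
  }
};

// Step 3: without quota the PG goes back on the queue.
enum class Decision { Requeue, Recover };

Decision decide(RecoveryBudget &b) {
  return b.available() <= 0 ? Decision::Requeue : Decision::Recover;
}
```

Reading the counter under the lock keeps the computed quota consistent with concurrent updates from other recovery threads.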

3.2 ReplicatedPG
The class ReplicatedPG handles the repair operations of replicated PGs. The following analyzes the implementation of the start_recovery_ops() function and the repair functions it relies on.

3.2.1 start_recovery_ops()
The function start_recovery_ops() calls recover_primary() and recover_replicas() to repair the objects on the primary and secondary replicas of the PG. After the repair is completed, if the Backfill process is still required, the corresponding event is thrown to trigger the PG state machine to start the Backfill repair process.

bool ReplicatedPG::start_recovery_ops(
  int max, ThreadPool::TPHandle &handle,
  int *ops_started);
The specific process of this function is as follows:

1) First check the OSD to make sure it is the primary OSD of the PG. If the PG is in neither the PG_STATE_RECOVERING nor the PG_STATE_BACKFILL state, exit;

2) Get the missing object from pg_log, which holds the objects missing on the primary OSD. num_missing is the number of objects missing on the primary OSD; num_unfound is the number of missing objects for which no OSD holding a correct copy has been found. If num_missing is 0, the primary OSD is not missing any object, and info.last_complete is set directly to info.last_update, the latest version;

3) If num_missing is equal to num_unfound, all objects missing on the primary OSD are unfound, so first call the function recover_replicas() to start repairing the objects on the replicas;

4) If started is 0, that is, no object repair has been started yet, call the function recover_primary() to repair the objects on the primary OSD;

5) If started is still 0 and num_unfound has changed, start recover_replicas() again to repair the copy;

6) If started is not 0, set the value of work_in_progress to true;

7) If the recovering queue is empty, that is, there is no object undergoing Recovery operation, the state is PG_STATE_BACKFILL, and backfill_targets is not empty, started is less than max, and missing.num_missing() is 0:

a) If the flag get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL) is set, the Backfill process is postponed;

b) If the flag CEPH_OSDMAP_NOREBALANCE is set and the PG is not in the degraded state, postpone the Backfill process;

c) If backfill_reserved is not set, a RequestBackfill event is thrown to the state machine to start the Backfill process;

d) Otherwise, call the function recover_backfill() to start the Backfill process

8) Finally, if the PG is in the PG_STATE_RECOVERING state and the object is successfully repaired, check: if the Backfill process is required, the RequestBackfill event is sent to the PG state machine; if the Backfill process is not required, the AllReplicasRecovered event is thrown;

9) Otherwise, the state of PG is PG_STATE_BACKFILL state, clear this state, and throw the Backfilled event;
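
The ordering in steps 3 to 6 can be sketched with callbacks standing in for the member functions. RecoveryHooks and start_ops are hypothetical names; the sketch covers only which side is repaired first, not the Backfill handling of steps 7 to 9:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical hooks standing in for the ReplicatedPG member functions.
struct RecoveryHooks {
  std::function<int(int)> recover_primary;    // returns ops started
  std::function<int(int)> recover_replicas;   // returns ops started
  std::function<unsigned()> current_unfound;  // re-reads the unfound count
};

// Returns the number of started ops and sets *work_in_progress (step 6).
int start_ops(unsigned num_missing, unsigned num_unfound, int max,
              const RecoveryHooks &h, bool *work_in_progress) {
  int started = 0;
  if (num_missing == num_unfound)        // step 3: primary has nothing to pull
    started = h.recover_replicas(max);
  if (!started)                          // step 4
    started = h.recover_primary(max);
  if (!started && num_unfound != h.current_unfound())
    started = h.recover_replicas(max);   // step 5: unfound set changed
  *work_in_progress = started != 0;
  return started;
}
```

Replicas are tried first only when every object missing on the primary is unfound, since in that case the primary has nothing it can pull yet.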

3.2.2 recover_primary()
The function recover_primary() is used to repair missing objects on the main OSD of a PG:

int ReplicatedPG::recover_primary(int max, ThreadPool::TPHandle &handle);

Its processing is as follows:

1) Call pgbackend->open_recovery_op() to obtain a PGBackend::RecoveryHandle specific to the PG type. For ReplicatedPG this is an RPGHandle, which contains two maps that store the PushOp and PullOp structures wrapping the push and pull operations:

struct RPGHandle : public PGBackend::RecoveryHandle {
  map<pg_shard_t, vector<PushOp> > pushes;
  map<pg_shard_t, vector<PullOp> > pulls;
};

2) Get the objects that have not been repaired yet.



3) Traverse each object that has not been repaired: latest is the last log entry recorded for the missing object, and soid is the missing object. If latest is not empty:

a) If the log entry is of type pg_log_entry_t::CLONE, no special processing is done here until the snapshot-related information SnapSet has been obtained successfully;

b) If the log entry is of type pg_log_entry_t::LOST_REVERT: a revert is the operation by which the administrator, when data is inconsistent, forces a rollback to a specified version from the command line; reverting_to records the version to roll back to:

If item.have is equal to latest->reverting_to, that is, the log shows that the object already holds the version being rolled back to, the ObjectContext of the object is obtained. If the object's current version obc->obs.oi.version is equal to latest->version, the rollback has already completed;
if item.have is equal to latest->reverting_to but the object's current version obc->obs.oi.version is not equal to latest->version, the rollback has not been performed yet, and the object's version number is simply set to latest->version.
Otherwise, the reverting_to version of the object has to be pulled. No special processing is done here: the code only checks which OSDs hold this version of the object and, if any do, adds them to missing_loc to record where this version is located, so that the subsequent repair can proceed.

c) If the object is in the recovering set, it is already being repaired, or its head object is being repaired: skip it and increment the skipped counter. Otherwise, call the function recover_missing() to repair it.

4) Call the function pgbackend->run_recovery_op() to send the message encapsulated by PullOp or PushOp;

The following example illustrates the repair process when the last log record type is LOST_REVERT:

Example 11-1 log repair process

The PG log records are as follows: Each unit represents a log record, which is the name, version and operation of the object, and the format of the version is (epoch, version). The gray part represents the missing log record on this OSD. The log record is copied from the authoritative log record, so the current log record is continuous and complete.

Case 1: Repair of normal cases

The list of missing objects is [obj1, obj2], and the object currently being repaired is obj1. According to the log, obj1 has been modified three times, producing versions 6, 7, and 8. The have value, the version of obj1 currently held, is 4, so the object only needs to be repaired to the last modified version, 8.

Case 2: The last operation is an operation of type LOST_REVERT

For the object obj1 to be repaired, the last operation is a LOST_REVERT type operation. The current version of this operation is 8, the previous version prior_version is 7, and the reverting_to version is 4.

In this case, the log shows that version 4 is already held. Check the actual version of object obj1, which is the version number saved in object_info:

1) If the value is 8, the last revert operation succeeded and no repair is required;

2) If the value is 4, the LOST_REVERT operation was not executed. The data content is already at version 4, so only the version in object_info needs to be changed to 8.

If the reverting_to version is not version 4 but version 6, the data of obj1 ultimately has to be repaired to the data of version 6. Here Ceph only checks whether any other OSD holds version 6 according to its missing information. If so, that OSD is added to missing_loc to record where the version is located, and the repair is carried out later.
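
The three outcomes of the LOST_REVERT case can be condensed into one classification function. This is a sketch under the assumption that versions are plain integers (Ceph uses eversion_t, an (epoch, version) pair); RevertAction and classify_lost_revert are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

enum class RevertAction {
  AlreadyDone,      // object_info already carries the revert's version
  BumpVersionOnly,  // data is correct; only the stored version is updated
  PullRevertTarget  // the reverting_to version must be fetched from a peer
};

// have:         version of the object held locally (item.have)
// reverting_to: version the LOST_REVERT entry rolls back to
// latest:       version of the LOST_REVERT log entry itself
// stored:       version recorded in the local object_info
RevertAction classify_lost_revert(uint64_t have, uint64_t reverting_to,
                                  uint64_t latest, uint64_t stored) {
  if (have != reverting_to)
    return RevertAction::PullRevertTarget;
  return stored == latest ? RevertAction::AlreadyDone
                          : RevertAction::BumpVersionOnly;
}
```

With the numbers from Case 2 (have 4, reverting_to 4, latest 8), a stored version of 8 yields AlreadyDone and a stored version of 4 yields BumpVersionOnly; a reverting_to of 6 yields PullRevertTarget.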

3.2.3 recover_missing()
The function recover_missing() handles the repair of the snap object. When repairing the snap object, the head object or snapdir object must be repaired first to obtain the SnapSet information, and then the snapshot object itself can be repaired.

int ReplicatedPG::recover_missing(
  const hobject_t &soid, eversion_t v,
  int priority,
  PGBackend::RecoveryHandle *h);
The specific implementation is as follows:

1) If the object soid is unfound, return PULL_NONE directly: unfound objects cannot be repaired for the time being;

2) If the repair is a snap object:

a) Check if the corresponding head object is missing, recursively call the function recover_missing() to repair the head object first;

b) Check if the snapdir object is missing, then recursively call the function recover_missing() to first repair the snapdir object;

3) Obtain head_obc information from the head object or snapdir object;

4) Call the function pgbackend->recover_object() to encapsulate the operation information to be repaired into the PullOp or PushOp object, and add it to the RecoveryHandle structure.

3.3 pgbackend
pgbackend encapsulates the implementation of different types of Pool. ReplicatedBackend implements the underlying functions related to the PG of the replicate type, and ECBackend implements the underlying functions related to the PG of the Erasure code type.

As the analysis in Section 3.2 shows, the recover_object() function of pgbackend is called to encapsulate the information of the object to be repaired. Only the replicated implementation is covered here.

In recover_object(), a pull operation calls the prepare_pull() function to encapsulate the request into a PullOp structure, and a push operation calls start_pushes() to encapsulate the request into a PushOp structure.

3.3.1 pull operation
The prepare_pull() function packs the operation information related to the object to be pulled into PullOp class information, as follows:

void ReplicatedBackend::prepare_pull(
  eversion_t v, //The version information of the object to be pulled
  const hobject_t& soid, //The object to be pulled
  ObjectContextRef headctx, //The ObjectContext information of the object to be pulled
  RPGHandle *h); //The RecoveryHandle in which the encapsulated operation is saved

The difficulty lies in the recovery process of the snap object. Let's first introduce the PullOp data structure.

The PullOp data structure is as follows (src/osd/osd_types.h):

struct PullOp {
  hobject_t soid; //object to be pulled

  ObjectRecoveryInfo recovery_info; //Information that needs to be repaired
  ObjectRecoveryProgress recovery_progress; //Object repair progress information
};

struct ObjectRecoveryInfo {
  hobject_t soid; //object being repaired
  eversion_t version; //version of the object being repaired
  uint64_t size; //size of the object being repaired
  object_info_t oi; //object_info of the object being repaired
  SnapSet ss; //snapshot information of the object being repaired

  //Intervals of the object that need to be copied from other OSDs to the local object
  interval_set<uint64_t> copy_subset;

  //Intervals that can be copied from local clone objects when a clone object is repaired
  map<hobject_t, interval_set<uint64_t>, hobject_t::BitwiseComparator> clone_subset;
};


struct ObjectRecoveryProgress {
  uint64_t data_recovered_to; //position up to which the data has been restored
  string omap_recovered_to; //position up to which the omap has been repaired
  bool first; //whether this is the first repair operation
  bool data_complete; //whether the data has been fully restored
  bool omap_complete; //whether the omap has been fully repaired
};

The specific process of the function prepare_pull() is as follows:

1) Obtain the pointer of the PG object by calling the function get_parent(). The parent of pgbackend is the corresponding PG object. Get missing, peer_missing, missing_loc and other information through PG;

2) Look up the soid object in the missing_loc map to obtain the set of OSDs that hold it, and save this set in the shuffle vector. Call random_shuffle to put the OSD list in random order, then select the first OSD in the vector as the source OSD from which the missing object is pulled. This step shows that, when an object on the primary OSD is repaired and the object exists on several replica OSDs, one of them is chosen at random as the source of the pull.

3) After selecting a source shard, check the peer_missing entry of that shard to make sure the object is not missing on that OSD, that is, the OSD really holds this version of the object.

4) Determine the data range of the pulled object:

a) If it is a head object, the whole object is copied: add the interval (0, -1) to copy_subset, which means copy everything, and set size to -1:

recovery_info.copy_subset.insert(0, (uint64_t)-1);
recovery_info.size = ((uint64_t)-1);
b) If the object is a snap object, ensure that one of the head object or the snapdir object must exist. If headctx is not empty, you can get the SnapSetContext object, which stores snapshot-related information. Call the function calc_clone_subsets() to calculate the range of data to be copied.

5) Set the relevant fields of PullOp and add it to RPGHandle
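
Steps 2 and 3, the random choice of a pull source, can be sketched as below. The types are simplifying assumptions (OSDs are ints, objects are strings, pick_pull_source is a hypothetical name), and std::shuffle is used in place of random_shuffle, which is no longer available in recent C++ standards:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <random>
#include <set>
#include <string>
#include <vector>

// Returns the chosen source OSD, or -1 if no holder of the object is known.
int pick_pull_source(const std::string &soid,
                     const std::map<std::string, std::set<int>> &missing_loc,
                     const std::map<int, std::set<std::string>> &peer_missing,
                     std::mt19937 &rng) {
  auto loc = missing_loc.find(soid);
  if (loc == missing_loc.end() || loc->second.empty())
    return -1;
  // Step 2: shuffle the holders and take the first one.
  std::vector<int> shuffle(loc->second.begin(), loc->second.end());
  std::shuffle(shuffle.begin(), shuffle.end(), rng);
  int source = shuffle.front();
  // Step 3: the source must not list the object as missing.
  auto pm = peer_missing.find(source);
  assert(pm == peer_missing.end() || !pm->second.count(soid));
  return source;
}
```

Choosing the source at random spreads the pull load across the replicas that hold the object.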

The function calc_clone_subsets() is used to repair snapshot objects. Before introducing it, we need to introduce the data structure of SnapSet and the overlap concept of clone object.

In the SnapSet structure, the field clone_overlap saves the overlap between the clone object and the last clone object:

struct SnapSet {
  snapid_t seq;
  bool head_exists;
  vector<snapid_t> snaps; // sequence numbers in descending order
  vector<snapid_t> clones; // sequence numbers in ascending order

  //The part that overlaps with the latest clone object caused by the write operation
  map<snapid_t, interval_set<uint64_t> > clone_overlap;

  map<snapid_t, uint64_t> clone_size;
};

Let's use an example to illustrate the concept of clone_overlap data structure.

Example 11-2 The clone_overlap data structure is shown in Figure 11-2:

snap3 is cloned from the snap2 object and modifies intervals 3 and 4. The offset and length of the range in the object are (4,8) and (8,12). Then record in clone_overlap of SnapSet:

clone_overlap[3] = {(4,8), (8,12)}

The function calc_clone_subsets() calculates the data intervals that have to be copied when a snapshot object is repaired. A snapshot object is not repaired by copying it in full. The key to the optimization is that snapshot objects share overlapping data: the overlapping part can be repaired by copying data from snapshot objects that already exist locally, and only the part that cannot be repaired from local snapshot objects has to be pulled from other replicas.

The specific implementation of the function calc_clone_subsets() is as follows:

1) First get the size of the snapshot object, and add (0, size) to data_subset:
data_subset.insert(0, size);
2) Search toward the older snapshots (oldest snap) for a clone that overlaps with the current snapshot object, until a snapshot object that is not missing is found, and add it to clone_subset. The overlapping intervals found here are the intervals that were never modified between that non-missing snapshot object and the snapshot object being repaired, so during repair the required interval data can simply be copied from the existing snapshot object.

3) Similarly, search toward the newer snapshots (newest snap) for a clone that overlaps with the current snapshot object, until an object that is not missing is found, and add it to clone_subset.

4) Remove all overlapping intervals, which is the data interval that needs to be pulled;

data_subset.subtract(cloning);
For the above algorithm, the following examples illustrate:

Example 11-3 An example of snapshot object restoration is shown in Figure 11-3:

The object to be repaired is snap4. The different lengths represent the different sizes of the clone objects, and the dark red intervals are those modified after the clone. snap2, snap3, and snap5 are all existing, non-missing objects.

The algorithm processing flow is as follows:

1) Look forward to the interval that overlaps with snap4 until the non-missing object snap2 is encountered. The overlapping intervals from snap4 to snap2 are 1, 5, and 8 intervals. Therefore, when repairing the object snap4, the data in intervals 1, 5, and 8 can be directly copied from the existing local non-missing object snap2.

2) Similarly, look backward for intervals that overlap with snap4 until the non-missing object snap5 is encountered. The overlapping intervals of snap5 and snap4 are the six intervals 1, 2, 3, 4, 7, and 8. Therefore, when repairing object snap4, intervals 1, 2, 3, 4, 7, and 8 can be copied directly from the local object snap5.

3) Remove the above-mentioned intervals that can be repaired locally, and only interval 6 of the object snap4 needs to copy data from other OSDs for repair.
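
The arithmetic of this example can be reproduced with a small sketch. It assumes that the nearest non-missing older and newer clones and their shared intervals have already been found, and it models intervals as the numbered blocks of the figure rather than as byte ranges in an interval_set. Blocks, CloneSubsets and calc_subsets are hypothetical names:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Blocks = std::set<int>;   // numbered intervals, as in the example

struct CloneSubsets {
  std::map<std::string, Blocks> clone_subset;  // local clone -> intervals copied from it
  Blocks data_subset;                          // intervals pulled from another OSD
};

// all:            every interval of the clone being repaired
// local_overlaps: for each usable local clone, the intervals it shares
//                 with the clone being repaired
CloneSubsets calc_subsets(const Blocks &all,
                          const std::map<std::string, Blocks> &local_overlaps) {
  CloneSubsets out;
  out.data_subset = all;              // start from the whole object
  Blocks cloning;                     // union of everything available locally
  for (const auto &c : local_overlaps) {
    out.clone_subset[c.first] = c.second;
    cloning.insert(c.second.begin(), c.second.end());
  }
  for (int b : cloning)               // corresponds to data_subset.subtract(cloning)
    out.data_subset.erase(b);
  return out;
}
```

Called with intervals 1 to 8, snap2 sharing {1, 5, 8} and snap5 sharing {1, 2, 3, 4, 7, 8}, the remaining data_subset contains only interval 6, matching step 3 of the example.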

3.3.2 Push operation
The function start_pushes() obtains the OSD list of actingbackfill, finds the OSD missing the object through peer_missing, and calls prep_push_to_replica() to pack the PushOp request.

The implementation process of the function prep_push_to_replica() is as follows:

void ReplicatedBackend::prep_push_to_replica(
  ObjectContextRef obc, const hobject_t& soid, pg_shard_t peer,
  PushOp *pop, bool cache_dont_need);

1) If the object to be pushed is a snap object: if the head object is missing, call prep_push() to push the head object; if the snapdir object is missing, call prep_push() to push the snapdir object;

2) If it is a snap object, call the function calc_clone_subsets() to calculate the data range of the snapshot object to be pushed;

3) If it is a head object, call the function calc_head_subsets() to calculate the range of the head object that needs to be pushed. The principle is similar to calculating the snapshot object, so it will not be described in detail here. Finally, call prep_push() to encapsulate the PushInfo information, and read the actual data to be pushed in the function build_push_op().

3.3.3 Processing repair operation
The function run_recovery_op() calls the send_pushes() and send_pulls() functions to send the requests to the relevant OSDs. This process is relatively simple.

After the primary OSD pushes an object to a replica OSD that lacks it, the replica OSD calls the function handle_push() to write the data and thereby complete the restoration of the object. Similarly, when the primary OSD sends a request to pull an object from a replica OSD in order to repair its own missing object, the replica OSD calls the function handle_pull() to respond to the request.

Push requests are handled in the function ReplicatedBackend::handle_push(), which mainly calls the submit_push_data() function to write the data.

The handle_pull() function receives a PullOp operation and returns the PushOp operation. The processing flow is as follows:

void ReplicatedBackend::handle_pull(pg_shard_t peer, PullOp &op, PushOp *reply)
{
  const hobject_t &soid = op.soid;
  struct stat st;
  int r = store->stat(ch, ghobject_t(soid), &st);
  if (r != 0) {
    get_parent()->clog_error() << get_info().pgid << " "
                   << peer << " tried to pull " << soid
                   << " but got " << cpp_strerror(-r) << "\n";
    prep_push_op_blank(soid, reply);
  } else {
    ObjectRecoveryInfo &recovery_info = op.recovery_info;
    ObjectRecoveryProgress &progress = op.recovery_progress;
    if (progress.first && recovery_info.size == ((uint64_t)-1)) {
      // Adjust size and copy_subset
      recovery_info.size = st.st_size;
      recovery_info.copy_subset.clear();
      if (st.st_size)
        recovery_info.copy_subset.insert(0, st.st_size);
      assert(recovery_info.clone_subset.empty());
    }

    r = build_push_op(recovery_info, progress, 0, reply);
    if (r < 0)
      prep_push_op_blank(soid, reply);
  }
}
1) First call the store->stat() function to verify whether the object exists; if it does not, call the function prep_push_op_blank() to return an empty reply directly;

2) If the object exists, obtain the ObjectRecoveryInfo and ObjectRecoveryProgress structures. If progress.first is true and recovery_info.size is -1, a full-copy repair is required: set recovery_info.size to the actual size of the object, clear recovery_info.copy_subset, and, if the object is not empty, insert the interval (0, st.st_size) into recovery_info.copy_subset;

3) Call the function build_push_op() to build the PushOp structure. If an error occurs, call prep_push_op_blank() to return an empty reply directly.

The function build_push_op() builds the PushOp request. The specific processing is as follows:

int ReplicatedBackend::build_push_op(const ObjectRecoveryInfo &recovery_info,
                     const ObjectRecoveryProgress &progress,
                     ObjectRecoveryProgress *out_progress,
                     PushOp *out_op,
                     object_stat_sum_t *stat,
                     bool cache_dont_need);
1) If progress.first is true, the metadata of the object must be obtained: get the omap header through store->omap_get_header() and the extended attributes of the object through store->getattrs(), then verify that oi.version equals recovery_info.version; if it does not, return -EINVAL. On success, new_progress.first is set to false.

2) The previous step obtained only the omap header, not the omap entries. This step checks progress.omap_complete (initialized to false); if it is not yet complete, iterate over the (key, value) pairs of the omap, making sure that the amount of data collected does not exceed cct->_conf->osd_recovery_max_chunk (8MB by default). Note that when this configuration parameter is smaller than the size of an object, repairing that object requires multiple push operations. To ensure the integrity and consistency of the data, the data is first copied to the temp storage space of the PG and, once the copy is complete, moved to the actual space of the PG.

3) Start copying data: check recovery_info.copy_subset, which is the interval to be copied;

4) Call the function store->fiemap() to determine the effective data range out_op->data_included, and read the corresponding data through store->read().

5) Set the relevant fields of PushOp and return.
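
To make the effect of osd_recovery_max_chunk in step 2 more concrete, the following is a minimal sketch of how one large object might be split across several pushes. The types Progress and Chunk and the functions next_chunk() and plan_pushes() are hypothetical simplifications introduced only for illustration (for instance, omap state is left out); this is not the Ceph implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified progress record: only the data part is tracked.
struct Progress {
  bool first = true;               // true until the first chunk has been built
  uint64_t data_recovered_to = 0;  // bytes already covered by earlier pushes
  bool data_complete = false;      // true once the whole object is covered
};

// One push worth of object data.
struct Chunk {
  uint64_t offset;
  uint64_t length;
};

// Build the next chunk, starting where the previous push stopped and
// carrying at most max_chunk bytes.
Chunk next_chunk(uint64_t object_size, uint64_t max_chunk, Progress& progress) {
  assert(max_chunk > 0);
  assert(progress.data_recovered_to <= object_size);
  const uint64_t off = progress.data_recovered_to;
  const uint64_t len = std::min(max_chunk, object_size - off);
  progress.first = false;
  progress.data_recovered_to = off + len;
  progress.data_complete = (progress.data_recovered_to == object_size);
  return Chunk{off, len};
}

// Plan every push needed to cover an object of object_size bytes.
std::vector<Chunk> plan_pushes(uint64_t object_size, uint64_t max_chunk) {
  Progress progress;
  std::vector<Chunk> chunks;
  do {
    chunks.push_back(next_chunk(object_size, max_chunk, progress));
  } while (!progress.data_complete);
  return chunks;
}
```

Under these assumptions, a 20MB object with the 8MB default is covered by three pushes: two of 8MB and a final one of 4MB. Because the object is incomplete between pushes, staging the data in the temp space of the PG first, as described above, keeps a half-written object out of the actual space of the PG.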

4. Backfill process
When the PG completes the recovery process, if backfill_targets is not empty, it indicates that there is an OSD that needs the Backfill process, and the Backfill task needs to be started to complete the full recovery of the PG. The following describes the data structure and specific processing process related to the Backfill process.

4.1 Related data structures
The data structure BackfillInterval is used to record the Backfill progress on each peer (src/osd/PG.h).

struct BackfillInterval {
    // A peer's backfill_interval information
    eversion_t version;  // The latest object version when scanning
    map<hobject_t,eversion_t,hobject_t::Comparator> objects;
    bool sort_bitwise;
    hobject_t begin;
    hobject_t end;
};
Its fields are described as follows:

version: the latest update version of the PG at the time the object list was scanned, usually last_update. Because the PG is in the active state at this time, write operations may be in progress. This field is used to check whether any object has been written since the last scan. If so, and the written object is in the scanned object list, the object must be updated to its latest version when the Backfill operation is performed.

objects: the list of scanned objects ready for the Backfill operation;
begin: the currently processed object;
end: the end of this scan, which is used as the start of the next scan.
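
The way begin moves as scanned objects are consumed can be sketched as follows. This is a simplified illustration under stated assumptions: SimpleBackfillInterval, the Version alias, and the use of plain strings as object names are hypothetical stand-ins for BackfillInterval, eversion_t, and hobject_t, and the helpers pop_front() and trim_to() only approximate what such an interval needs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for eversion_t: an (epoch, version) pair.
using Version = std::pair<unsigned, unsigned>;

// Hypothetical simplified interval; object names stand in for hobject_t.
struct SimpleBackfillInterval {
  Version version{0, 0};                   // PG version when the scan was made
  std::map<std::string, Version> objects;  // scanned objects, sorted by name
  std::string begin;                       // currently processed object
  std::string end;                         // where the next scan will start

  bool empty() const { return objects.empty(); }

  // Drop the first scanned object and advance begin to the next one,
  // or to end when nothing is left.
  void pop_front() {
    assert(!objects.empty());
    objects.erase(objects.begin());
    begin = objects.empty() ? end : objects.begin()->first;
  }

  // Drop every scanned object that sorts at or before soid.
  void trim_to(const std::string& soid) {
    while (!objects.empty() && objects.begin()->first <= soid)
      pop_front();
  }
};
```

Once every scanned object has been consumed, begin equals end, which is the signal that a new scan starting from end is needed.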


4.2 The specific implementation of Backfill
The function recover_backfill(), the core function of the Backfill process, controls the entire Backfill repair process. Its workflow is as follows.

1) Initial settings

In the function on_activate(), the PG attribute new_backfill is set to true, and last_backfill_started is set to the value of earliest_backfill(). The function earliest_backfill() calculates the minimum last_backfill stored in peer_info among the OSDs that need Backfill.

The map peer_backfill_info saves the BackfillInterval information corresponding to each OSD that needs Backfill. First, begin and end are initialized to peer_info.last_backfill. According to the Peering process of the PG, in the function activate(), for an OSD that requires Backfill, the last_backfill of that OSD's peer_info is set to hobject_t(), which is the MIN object.

backfills_in_flight saves the objects that are currently being backfilled, and pending_backfill_updates saves the objects that need to be deleted.

2) Set backfill_info.begin to last_backfill_started, and call the function update_range() to update the list of objects that need to be backfilled;

3) Perform the trim operation on the BackfillInterval information of each peer according to the last_backfill of its peer_info. Update the relevant fields in backfill_info according to last_backfill_started;

4) If backfill_info.begin is less than or equal to earliest_peer_backfill(), more objects need to be scanned, so backfill_info is reset. Note that the version field of backfill_info is also reset to (0,0), which causes the subsequent call to update_range() to call scan_range() and scan the objects;

5) For each OSD that needs Backfill, if its pbi.begin is smaller than backfill_info.begin, send an MOSDPGScan::OP_SCAN_GET_DIGEST message to that OSD to obtain the list of objects it currently owns;

6) After obtaining the object lists of all these OSDs, compare them with the object list of the current main OSD to carry out the repair.

7) The check object pointer is the smallest object among the OSDs that currently need Backfill:

a) If the check object is less than backfill_info.begin, the object must be deleted on every OSD that needs the Backfill operation and owns it, so it is added to the to_remove queue;

b) If the check object is greater than or equal to backfill_info.begin, examine the OSDs that own the check object: if the version is inconsistent, add the OSD to need_ver_targs; if the version is the same, add it to keep_ver_targs;

c) For OSDs whose begin object is not the check object, if pinfo.last_backfill is less than backfill_info.begin, the object is missing and the OSD is added to the missing_targs list;

d) If pinfo.last_backfill is greater than backfill_info.begin, the repair progress of that OSD has passed the progress indicated by the current main OSD, and it is added to skip_targs;

8) For the OSDs in the keep_ver_targs list, do nothing. For the OSDs in need_ver_targs and missing_targs, the object needs to be added to to_push to repair.

9) Call the function send_remove_op() to send a delete message to OSD to delete the object in to_remove;

10) Call the function prep_backfill_object_push() to package the operation into PushOp, and call the function pgbackend->run_recovery_op() to send the request. Its process is similar to the Recovery process.

11) Finally, update the last_backfill value of pg_info of each OSD with new_last_backfill. If pinfo.last_backfill is MAX, it means that the backfill operation is completed, and the message MOSDPGBackfill::OP_BACKFILL_FINISH is sent to the OSD; otherwise, MOSDPGBackfill::OP_BACKFILL_PROGRESS is sent to update the last_backfill field of pg_info on each OSD.
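
The classification in steps 7 b) to d) and step 8 can be sketched as below. This is a minimal illustration under stated assumptions, not the Ceph code: PeerState, Targets, and classify() are hypothetical names, object names are plain strings compared lexicographically in place of hobject_t, and the sketch covers only the case where the check object equals backfill_info.begin.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for eversion_t: an (epoch, version) pair.
using Version = std::pair<unsigned, unsigned>;

// Hypothetical per-peer view used only in this sketch.
struct PeerState {
  std::string begin;          // pbi.begin: first object in the peer's scan
  Version begin_ver;          // version of that object on the peer
  std::string last_backfill;  // pinfo.last_backfill
};

// OSD ids grouped by the action the check object needs on them.
struct Targets {
  std::vector<int> need_ver;  // has the object, wrong version: push
  std::vector<int> keep_ver;  // has the object, same version: nothing to do
  std::vector<int> missing;   // does not have the object yet: push
  std::vector<int> skip;      // already backfilled past this object
};

// Classify every backfill target for one check object, assuming
// check == backfill_info.begin and primary_ver is the primary's version.
Targets classify(const std::string& check, const Version& primary_ver,
                 const std::map<int, PeerState>& peers) {
  assert(!check.empty());
  Targets t;
  for (const auto& kv : peers) {
    const int osd = kv.first;
    const PeerState& p = kv.second;
    if (p.begin == check) {
      if (p.begin_ver == primary_ver)
        t.keep_ver.push_back(osd);
      else
        t.need_ver.push_back(osd);
    } else if (p.last_backfill < check) {
      t.missing.push_back(osd);
    } else {
      // The text describes the "greater than" case; anything not behind
      // the check object is treated as skip in this sketch.
      t.skip.push_back(osd);
    }
  }
  return t;
}
```

The OSDs collected in need_ver and missing are the ones whose object would then go into to_push, while keep_ver and skip require no data transfer.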

The following example illustrates the process.

Example 11-4 As shown in Figure 11-4, the PG is distributed over 5 OSDs (that is, 5 copies, chosen here to make it convenient to list the various processing cases). The object list on each row is the list of scanned objects in the BackfillInterval of the corresponding OSD. osd5 is the main OSD and holds the authoritative object list; the other OSDs are repaired according to the object list on the main OSD.

The following examples illustrate the different repair methods in step 7:

1) The current check object pointer is the minimum value of begin in the peer_backfill_info saved on the main OSD, and the check object in the figure should be an obj4 object;

2) Compare the check object with the backfill_info.begin object on the main osd5. Since check is smaller than obj5, obj4 is a redundant object, and all OSDs that own the check object must delete this object. Therefore, the obj4 objects on osd0 and osd2 are deleted, and the corresponding begin pointers are moved forward.

3) The current status of each OSD is shown in Figure 11-5. At this time the check object is obj5; compare the values of check and backfill_info.begin:

a) For osd0, osd1, and osd4, whose current begin is the check object:

* For osd0 and osd4, the check object and the backfill_info.begin object are both obj5 with version number (1,4), so they are added to the keep_ver_targs list and need no repair;

* For osd1, the version number is inconsistent, so it is added to the need_ver_targs list and needs to be repaired.

b) For osd2 and osd3, whose current begin is not the check object:

* For osd2, its last_backfill is less than backfill_info.begin, so the object obj5 is clearly missing; it is added to missing_targs to be repaired;

* For osd3, its last_backfill is greater than backfill_info.begin, which means it has already been repaired up to obj6 and obj5 should already have been restored; it is added to skip_targs and skipped.

4) After step 3 is completed, set last_backfill_started to the current value of backfill_info.begin. The backfill_info.begin pointer moves forward, as do all the begin pointers that are equal to the check object, and the above steps are repeated to continue the repair.
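
Steps 1 and 2 of this example, choosing the check object and finding the OSDs that hold a redundant copy, can be sketched as follows. The function names earliest_peer_begin() and peers_to_remove_from() are hypothetical, and object names are plain strings compared lexicographically in place of hobject_t; this is an illustration only, not the Ceph code.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// The check object: the smallest begin among all backfill targets.
// peer_begin maps an OSD id to the begin of its scanned interval.
std::string earliest_peer_begin(const std::map<int, std::string>& peer_begin) {
  assert(!peer_begin.empty());
  std::string check = peer_begin.begin()->second;
  for (const auto& kv : peer_begin)
    check = std::min(check, kv.second);
  return check;
}

// If the check object sorts before the primary's begin, the primary does
// not have it, so it is redundant on every peer whose begin points at it.
// Returns the ids of those peers; empty when nothing must be removed.
std::vector<int> peers_to_remove_from(
    const std::map<int, std::string>& peer_begin,
    const std::string& primary_begin) {
  std::vector<int> osds;
  const std::string check = earliest_peer_begin(peer_begin);
  if (check < primary_begin) {
    for (const auto& kv : peer_begin)
      if (kv.second == check)
        osds.push_back(kv.first);
  }
  return osds;
}
```

With osd0 and osd2 both pointing at obj4 while the primary's begin is obj5, as in the example, this sketch selects osd0 and osd2 for removal of obj4.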

The function update_range() calls the function scan_range() to update the list of objects repaired by BackfillInterval. At the same time, it checks the list of objects scanned last time. If any object has a write operation, it updates the repaired version of the object.

The specific implementation steps are as follows:

1) bi->version records the latest update version of the PG at the time the list of objects to be repaired was scanned, and is generally set to last_update_applied or info.last_update. At initialization, bi->version defaults to (0,0), so it is smaller than info.log_tail; in that case, update bi->version and call the function scan_range() to scan the objects.

2) If bi->version is equal to info.last_update, the PG has had no write operation since the last scan, and the function returns directly.

3) If bi->version is less than info.last_update, the PG has had write operations, and the log entries from bi->version to log_head must be checked: if an object has been updated, its latest version is used when it is repaired; if an object has been deleted, it does not need to be repaired and is removed from the repair queue.

The following example illustrates the processing of update_range():

Example 11-5 update_range processing

The log records are shown in the figure below:


1) The scanned object list of BackfillInterval: bi->begin is the object obj1(1,3), bi->end is the object obj6(1,6), and the current info.last_update is version (1,6), so bi->version is set to (1,6). Since the objects scanned this time may not all be repaired in one pass, the rest can only wait for the next repair.

2) The log record looks like this:

[Figure 9: log records]

The function recover_backfill() is entered for the second time. At this time the begin object points to obj2, which indicates that only object obj1 was repaired last time. While the repair was under way, update operations occurred on the following objects:

a) The object obj3 was written and its version was updated to (1,7). The entry for obj3 in the list of objects to be repaired, currently at version (1,5), must therefore be updated to version (1,7).

b) A delete operation occurred on the object obj4, so it does not need to be repaired and must be removed from the object list.
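
The log replay described in step 3 of update_range() and in this example can be sketched as follows. This is a simplified illustration under stated assumptions: ScanInterval, LogEntry, and apply_log() are hypothetical names, object names are plain strings in place of hobject_t, and the log is a plain vector rather than the PG log.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for eversion_t: an (epoch, version) pair.
using Version = std::pair<unsigned, unsigned>;

// Hypothetical simplified log entry.
struct LogEntry {
  Version version;   // version the PG reached with this operation
  std::string soid;  // object the operation touched
  bool is_delete;    // true for a delete, false for an update
};

// Hypothetical simplified scan interval covering [begin, end).
struct ScanInterval {
  Version version;                         // PG version at scan time
  std::map<std::string, Version> objects;  // objects waiting for repair
  std::string begin;
  std::string end;
};

// Replay the log entries newer than the scan onto the scanned list.
void apply_log(ScanInterval& bi, const std::vector<LogEntry>& log,
               const Version& last_update) {
  if (bi.version == last_update)
    return;  // no write since the last scan
  assert(bi.version < last_update);
  for (const LogEntry& e : log) {
    if (e.version <= bi.version)
      continue;  // already reflected in the scan
    if (e.soid < bi.begin || e.soid >= bi.end)
      continue;  // outside the scanned interval
    if (e.is_delete)
      bi.objects.erase(e.soid);       // deleted: nothing left to repair
    else
      bi.objects[e.soid] = e.version; // updated: repair the latest version
  }
  bi.version = last_update;
}
```

Applied to the example, an update entry for obj3 at (1,7) replaces the pending version (1,5), and a delete entry for obj4 removes it from the list, after which the interval's version is moved up to the new last_update.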

To sum up, Ceph's Backfill process scans the list of all objects of the PG on the OSD, compares them with the main OSD, repairs objects that do not exist or have inconsistent versions, and deletes redundant objects at the same time.

5. Summary
This chapter introduces the Ceph data repair process, which consists of two processes: Recovery and Backfill. Based on the missing records, the Recovery process first completes the repair of the master copy and then the repair of the slave copies. For OSDs that cannot be repaired through the log, the Backfill process scans the objects on each replica and repairs them in full. The overall data repair process in Ceph is relatively clear; the more complex parts involve the repair of snapshot objects.

This code is part of the Ceph core and is not modified lightly. The community has also proposed an optimization for repair: record the modified range of an object in the log, so that the Recovery process repairs only the modified range rather than copying the entire object, which can reduce the amount of data repaired in some cases.

Reference: https://ivanzz1001.github.io/records/post/ceph/2019/02/02/ceph-src-code-part11_1

Origin blog.csdn.net/weixin_43778179/article/details/132702203