PG recovery and backfill after Peering

After a PG completes Peering and enters the Active state, it can serve client requests. If objects are inconsistent across the PG's replicas, they need to be repaired. Ceph has two repair mechanisms: Recovery and Backfill. As the following function shows, requests that arrive before the PG is peered are queued rather than served:

void ReplicatedPG::do_request(
  OpRequestRef& op,
  ThreadPool::TPHandle &handle)
{
	assert(!op_must_wait_for_map(get_osdmap()->get_epoch(), op));
	if (can_discard_request(op)) {
		return;
	}
	if (flushes_in_progress > 0) {
		dout(20) << flushes_in_progress << " flushes_in_progress pending " << "waiting for active on " << op << dendl;
		waiting_for_peered.push_back(op);
		op->mark_delayed("waiting for peered");
		return;
	}
	
	if (!is_peered()) {
		// Delay unless PGBackend says it's ok
		if (pgbackend->can_handle_while_inactive(op)) {
			bool handled = pgbackend->handle_message(op);
			assert(handled);
			return;
		} else {
			waiting_for_peered.push_back(op);
			op->mark_delayed("waiting for peered");
			return;
		}
	}

	...
}

Note: In addition, as described for the Ceph read/write path, requests may also be blocked at other stages, such as OSD::dispatch_session_waiting().

Recovery repairs inconsistent objects using only the missing-object records derived from the PG log. Backfill has the PG rescan all of its objects, compare them to find the missing ones, and repair them by copying whole objects. Backfill is started when an OSD has been down for so long that it can no longer be repaired from the PG log, or when a newly added OSD causes data migration.
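The choice between the two mechanisms can be illustrated with a small sketch. It is not Ceph code: it assumes versions are plain integers and that a replica can be repaired from the log only if its last update is not older than the oldest entry the primary still keeps. The names RepairKind, choose_repair, and all parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model: versions are plain integers.
enum class RepairKind { None, Recovery, Backfill };

// primary_log_tail:    oldest version still covered by the primary's PG log
// primary_last_update: newest version on the primary
// peer_last_update:    newest version the replica has applied
// peer_is_new:         true if the replica holds no data for this PG yet
RepairKind choose_repair(uint64_t primary_log_tail,
                         uint64_t primary_last_update,
                         uint64_t peer_last_update,
                         bool peer_is_new) {
  assert(primary_log_tail <= primary_last_update);
  assert(peer_last_update <= primary_last_update);
  if (peer_is_new)
    return RepairKind::Backfill;   // nothing to replay the log against
  if (peer_last_update == primary_last_update)
    return RepairKind::None;       // already up to date
  if (peer_last_update < primary_log_tail)
    return RepairKind::Backfill;   // the log no longer reaches back far enough
  return RepairKind::Recovery;     // the gap is fully described by the log
}
```

With a log covering versions 100 to 200, a replica at version 150 would be repaired by Recovery in this model, while a replica at version 50 or a brand-new replica would need Backfill.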

As Chapter 10 showed, a PG is in the Active state once Peering completes. If Recovery is required, a DoRecovery event is generated to trigger it; if Backfill is required, a RequestBackfill event is generated to trigger it. If a PG has both OSDs that need Recovery and OSDs that need Backfill, Recovery is performed first and Backfill afterwards.

void ReplicatedPG::on_activate()
{
	// all clean?
	if (needs_recovery()) {
		dout(10) << "activate not all replicas are up-to-date, queueing recovery" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  DoRecovery())));
	} else if (needs_backfill()) {
		dout(10) << "activate queueing backfill" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  RequestBackfill())));
	} else {
		dout(10) << "activate all replicas clean, no recovery" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  AllReplicasRecovered())));
	}
	
	publish_stats_to_osd();
	
	if (!backfill_targets.empty()) {
		last_backfill_started = earliest_backfill();
		new_backfill = true;
		assert(!last_backfill_started.is_max());
		dout(5) << "on activate: bft=" << backfill_targets << " from " << last_backfill_started << dendl;
		for (set<pg_shard_t>::iterator i = backfill_targets.begin(); i != backfill_targets.end(); ++i) {
			dout(5) << "target shard " << *i << " from " << peer_info[*i].last_backfill << dendl;
		}
	}
	
	hit_set_setup();
	agent_setup();
}

This chapter describes how Ceph implements data repair. It first introduces the resource reservation used during repair, then presents the repair state transition diagram to give an overview of the whole process, and finally covers the implementation of the Recovery and Backfill processes in detail.

1. Resource reservation

During data repair, resource reservation is used to limit the number of PGs being repaired on an OSD at the same time. A reservation must be made on both the primary OSD and the replica OSDs; if a reservation cannot be granted immediately, the request waits in a queue. The maximum number of PGs that an OSD can repair at the same time is set by the configuration option osd_max_backfills, whose default value is 1.

The class AsyncReserver manages resource reservations; its template parameter <T> is the type of the resource being reserved. The class implements asynchronous reservation: when a reservation is granted, the registered callback is invoked to notify the caller (src/common/AsyncReserver.h):

template <typename T>
class AsyncReserver {
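	Finisher *f;               // runs the registered callback once a reservation succeeds
	unsigned max_allowed;      // maximum number of concurrent reservations; here, the number of PGs allowed to repair at once
	unsigned min_priority;     // minimum priority that can be granted
	Mutex lock;
	
	// Maps a priority to the list of pending reservations; each pair<T, Context*> holds the
	// requested resource and its registered callback (note: a larger value means a higher priority)
	map<unsigned, list<pair<T, Context*> > > queues;
	
	// Position of each pending resource within the queues lists
	map<T, pair<unsigned, typename list<pair<T, Context*> >::iterator > > queue_pointers;
	
	// Resources whose reservation succeeded and that are currently in use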
	set<T> in_progress;
};

OSDService holds the reservers used by the operations that require a reservation:

class OSDService {
public:
	// -- backfill_reservation --
	Finisher reserver_finisher;
	AsyncReserver<spg_t> local_reserver;
	AsyncReserver<spg_t> remote_reserver;

public:
	AsyncReserver<spg_t> snap_reserver;
};

OSDService::OSDService(OSD *osd) : 
  reserver_finisher(cct),
  local_reserver(&reserver_finisher, cct->_conf->osd_max_backfills, cct->_conf->osd_min_recovery_priority),
  remote_reserver(&reserver_finisher, cct->_conf->osd_max_backfills, cct->_conf->osd_min_recovery_priority),
  snap_reserver(&reserver_finisher, cct->_conf->osd_max_trimming_pgs)
{
}

1.1 Requesting a reservation

The function request_reservation() is used to reserve resources:

/**
* Requests a reservation
*
* Note, on_reserved may be called following cancel_reservation.  Thus,
* the callback must be safe in that case.  Callback will be called
* with no locks held.  cancel_reservation must be called to release the
* reservation slot.
*/
void request_reservation(
  T item,                   ///< [in] reservation key
  Context *on_reserved,     ///< [in] callback to be called on reservation
  unsigned prio
  ) {
	Mutex::Locker l(lock);
	assert(!queue_pointers.count(item) &&
	!in_progress.count(item));
	queues[prio].push_back(make_pair(item, on_reserved));
	queue_pointers.insert(make_pair(item, make_pair(prio,--(queues[prio]).end())));
	do_queues();
}

void do_queues() {
	typename map<unsigned, list<pair<T, Context*> > >::reverse_iterator it;
	for (it = queues.rbegin();
	  it != queues.rend() && in_progress.size() < max_allowed && it->first >= min_priority;
	  ++it) {
		while (in_progress.size() < max_allowed && !it->second.empty()) {
			pair<T, Context*> p = it->second.front();
			queue_pointers.erase(p.first);
			it->second.pop_front();
			f->queue(p.second);
			in_progress.insert(p.first);
		}
	}
}

The specific process is as follows:

1) Add the requested resource to the queue for its priority, and record its position in queue_pointers:

queues[prio].push_back(make_pair(item, on_reserved));
queue_pointers.insert(make_pair(item, make_pair(prio,--(queues[prio]).end())));

2) Call do_queues() to examine the pending reservation requests, starting with the highest priority: as long as slots remain and the request's priority is not lower than the minimum priority, the resource is granted to it.

3) Remove the granted request from its queue and remove its position from queue_pointers. Add the resource to the in_progress set and hand the request's callback to the Finisher, which executes it to notify the caller that the reservation succeeded.

1.2 Cancelling a reservation

The function cancel_reservation() cancels a pending reservation or releases a granted one that is no longer needed:

/**
* Cancels reservation
*
* Frees the reservation under key for use.
* Note, after cancel_reservation, the reservation_callback may or
* may not still be called. 
*/
void cancel_reservation(
  T item                   ///< [in] key for reservation to cancel
  ) {
	Mutex::Locker l(lock);
	if (queue_pointers.count(item)) {
	  unsigned prio = queue_pointers[item].first;
	  delete queue_pointers[item].second->second;
	  queues[prio].erase(queue_pointers[item].second);
	  queue_pointers.erase(item);
	} else {
	  in_progress.erase(item);
	}
	do_queues();
}

The specific process is as follows:

1) If the resource is still waiting in a queue, remove it from the queue and delete its callback (this handles the abnormal case of cancelling before the reservation was granted); otherwise, remove the resource from the in_progress set.

2) Call do_queues() to grant the freed slot to other waiting requests.
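The request and cancel paths described in this section can be exercised with a reduced model. The sketch below is an illustration only, not the Ceph class: MiniReserver is a hypothetical single-threaded stand-in that drops the Mutex and the Finisher and runs callbacks inline as std::function objects, but it keeps the same three containers and the same do_queues() loop.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <map>
#include <set>
#include <utility>

// Hypothetical simplified stand-in for AsyncReserver<T>:
// no locking, no Finisher; callbacks run inline when a slot is granted.
template <typename T>
class MiniReserver {
  using Callback = std::function<void()>;
  using Queue = std::list<std::pair<T, Callback>>;

  unsigned max_allowed;
  unsigned min_priority;
  std::map<unsigned, Queue> queues;  // priority -> waiting requests
  std::map<T, std::pair<unsigned, typename Queue::iterator>> queue_pointers;
  std::set<T> in_progress;           // granted reservations

  void do_queues() {
    // Walk priorities from highest to lowest while free slots remain.
    for (auto it = queues.rbegin();
         it != queues.rend() && in_progress.size() < max_allowed &&
         it->first >= min_priority;
         ++it) {
      while (in_progress.size() < max_allowed && !it->second.empty()) {
        auto p = it->second.front();
        queue_pointers.erase(p.first);
        it->second.pop_front();
        in_progress.insert(p.first);
        p.second();  // the real class queues this on a Finisher instead
      }
    }
  }

public:
  explicit MiniReserver(unsigned max, unsigned min_prio = 0)
    : max_allowed(max), min_priority(min_prio) {}

  void request_reservation(T item, Callback on_reserved, unsigned prio) {
    assert(!queue_pointers.count(item) && !in_progress.count(item));
    queues[prio].emplace_back(item, std::move(on_reserved));
    queue_pointers.emplace(
      item, std::make_pair(prio, std::prev(queues[prio].end())));
    do_queues();
  }

  void cancel_reservation(T item) {
    auto qp = queue_pointers.find(item);
    if (qp != queue_pointers.end()) {
      queues[qp->second.first].erase(qp->second.second);  // still waiting
      queue_pointers.erase(qp);
    } else {
      in_progress.erase(item);                            // release the slot
    }
    do_queues();
  }

  bool is_granted(const T &item) const { return in_progress.count(item) > 0; }
};
```

With max set to 1, a second and third request wait until the first is cancelled; the freed slot then goes to the waiting request with the highest priority, regardless of arrival order.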

2. Data restoration state transition diagram

Figure 11-1 shows the state transition diagram of the repair process. When the PG enters the Active state, it enters the default substate Activating:

[Figure 11-1: state transition diagram of the PG repair process]

The state transitions during data repair are as follows:

Case 1: After entering the Activating state, all replicas are complete and nothing needs to be repaired:

1) The Activating state receives the AllReplicasRecovered event and transitions directly to the Recovered state;

2) The Recovered state receives the GoClean event and the PG transitions to the Clean state.

The relevant code is as follows:

void ReplicatedPG::on_activate()
{
	...
	else {
		dout(10) << "activate all replicas clean, no recovery" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  AllReplicasRecovered())));
	}
	...
}

struct Activating : boost::statechart::state< Activating, Active >, NamedState {
typedef boost::mpl::list <
boost::statechart::transition< AllReplicasRecovered, Recovered >,
boost::statechart::transition< DoRecovery, WaitLocalRecoveryReserved >,
boost::statechart::transition< RequestBackfill, WaitLocalBackfillReserved >
> reactions;

	explicit Activating(my_context ctx);
	void exit();
};

struct Recovered : boost::statechart::state< Recovered, Active >, NamedState {
typedef boost::mpl::list<
boost::statechart::transition< GoClean, Clean >,
boost::statechart::custom_reaction< AllReplicasActivated >
> reactions;

	explicit Recovered(my_context ctx);
	void exit();
	boost::statechart::result react(const AllReplicasActivated&) {
		post_event(GoClean());
		return forward_event();
	}
};

Case 2: After entering the Activating state, no Recovery is needed and only Backfill is required:

1) The Activating state receives the RequestBackfill event and enters the WaitLocalBackfillReserved state;

2) When the WaitLocalBackfillReserved state receives the LocalBackfillReserved event, the local resource reservation has succeeded and the PG transitions to WaitRemoteBackfillReserved;

3) After resources have been reserved on all replicas, the primary PG receives the AllBackfillsReserved event, enters the Backfilling state, and starts the actual Backfill operation;

4) The Backfilling state receives the Backfilled event, which marks the completion of Backfill, and the PG enters the Recovered state;

5) Exception handling: if the RemoteReservationRejected event is received in the WaitRemoteBackfillReserved or Backfilling state, the resource reservation has failed. The PG enters the NotBackfilling state and waits for another RequestBackfill event to restart the Backfill process.

Note: The code for this case is analyzed in detail later.


Case 3: When the PG needs both Recovery and Backfill, it completes Recovery first and then Backfill; the order matters. The process is as follows:

1) Activating state: on receiving the DoRecovery event, transition to the WaitLocalRecoveryReserved state;

2) WaitLocalRecoveryReserved state: local resources are reserved in this state. Receiving the LocalRecoveryReserved event marks the completion of the local reservation, and the PG transitions to the WaitRemoteRecoveryReserved state;

3) WaitRemoteRecoveryReserved state: remote resources are reserved in this state. Receiving the AllRemotesReserved event indicates that the PG has reserved resources on all replica OSDs participating in the recovery, and the PG enters the Recovering state;

4) Recovering state: the actual data recovery is performed in this state. On entry, the PG's state is set to PG_STATE_RECOVERING and the PG is added to the recovery_wq work queue to start data recovery:

PG::RecoveryState::Recovering::Recovering(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/Recovering")
{
  context< RecoveryMachine >().log_enter(state_name);

  PG *pg = context< RecoveryMachine >().pg;
  pg->state_clear(PG_STATE_RECOVERY_WAIT);
  pg->state_set(PG_STATE_RECOVERING);
  pg->publish_stats_to_osd();
  pg->osd->queue_for_recovery(pg);
}

5) When recovery finishes in the Recovering state and Backfill is still required, the PG receives the RequestBackfill event and moves on to the Backfill process;

6) If no Backfill is required, the PG receives the AllReplicasRecovered event and transitions to the Recovered state;

7) Recovered state: reaching this state means that data repair is complete. When the GoClean event is received, the PG enters the Clean state.
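The three cases can be condensed into a transition table. The sketch below is a hypothetical model using plain enums and a std::map rather than boost::statechart, and it covers only the events named in this section. One entry is an assumption: the text says only that Recovering moves on to "the Backfill process" on RequestBackfill, so the target state chosen here (WaitLocalBackfillReserved) is illustrative.

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <utility>

enum class State {
  Activating,
  WaitLocalRecoveryReserved, WaitRemoteRecoveryReserved, Recovering,
  WaitLocalBackfillReserved, WaitRemoteBackfillReserved, Backfilling,
  NotBackfilling, Recovered, Clean
};

enum class Event {
  AllReplicasRecovered, DoRecovery, RequestBackfill,
  LocalRecoveryReserved, AllRemotesReserved,
  LocalBackfillReserved, AllBackfillsReserved, Backfilled,
  RemoteReservationRejected, GoClean
};

// Returns the state reached from `s` on event `e`. Pairs not listed in the
// table leave the state unchanged in this simplified model.
State next_state(State s, Event e) {
  using S = State;
  using E = Event;
  static const std::map<std::pair<S, E>, S> table = {
    {{S::Activating, E::AllReplicasRecovered}, S::Recovered},
    {{S::Activating, E::DoRecovery}, S::WaitLocalRecoveryReserved},
    {{S::Activating, E::RequestBackfill}, S::WaitLocalBackfillReserved},
    {{S::WaitLocalRecoveryReserved, E::LocalRecoveryReserved},
     S::WaitRemoteRecoveryReserved},
    {{S::WaitRemoteRecoveryReserved, E::AllRemotesReserved}, S::Recovering},
    {{S::Recovering, E::AllReplicasRecovered}, S::Recovered},
    // Assumption: the exact backfill entry state from Recovering is not
    // stated in the text; this target is chosen for illustration.
    {{S::Recovering, E::RequestBackfill}, S::WaitLocalBackfillReserved},
    {{S::WaitLocalBackfillReserved, E::LocalBackfillReserved},
     S::WaitRemoteBackfillReserved},
    {{S::WaitRemoteBackfillReserved, E::AllBackfillsReserved}, S::Backfilling},
    {{S::WaitRemoteBackfillReserved, E::RemoteReservationRejected},
     S::NotBackfilling},
    {{S::Backfilling, E::RemoteReservationRejected}, S::NotBackfilling},
    {{S::Backfilling, E::Backfilled}, S::Recovered},
    {{S::NotBackfilling, E::RequestBackfill}, S::WaitLocalBackfillReserved},
    {{S::Recovered, E::GoClean}, S::Clean},
  };
  auto it = table.find(std::make_pair(s, e));
  return it == table.end() ? s : it->second;
}

// Feeds a sequence of events through the table and returns the final state.
State run(State s, std::initializer_list<Event> events) {
  for (Event e : events)
    s = next_state(s, e);
  return s;
}
```

Feeding the Case 1 events (AllReplicasRecovered, GoClean) through run() starting from Activating ends in Clean; the Case 2 and Case 3 sequences can be replayed the same way.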

3. Recovery process

Data repair relies on the following information produced during Peering:

  • The objects missing on the primary are recorded in the pg_missing_t structure of the pg_log class;

  • The objects missing on each replica are recorded in the pg_missing_t structure for that OSD in peer_missing;

  • The locations of missing objects are recorded in the class MissingLoc.

From this information, the primary knows which objects each OSD in the PG is missing and which OSDs currently hold a complete copy of each missing object. The repair process then follows directly:

  • For an object missing on the primary OSD, randomly select an OSD that holds the object and pull the data from it;

  • For an object missing on a replica, push the object data from the primary to that replica;

  • For snapshot objects, some optimizations are applied during repair.
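The pull and push rules above can be sketched as a small planning function. This is a hypothetical model, not the Ceph data structures: OSDs are plain ints, objects are strings, and the three inputs stand in for the primary's missing set, peer_missing, and MissingLoc. For determinism the sketch picks the lowest-numbered source OSD instead of a random one, and it ignores the snapshot optimizations.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified repair operation.
struct RepairOp {
  enum Kind { Pull, Push } kind;
  std::string object;
  int peer;  // source OSD for Pull, destination OSD for Push
};

std::vector<RepairOp> plan_repairs(
    const std::set<std::string> &primary_missing,
    const std::map<int, std::set<std::string>> &peer_missing,
    const std::map<std::string, std::set<int>> &missing_loc) {
  std::vector<RepairOp> ops;

  // Primary first: pull each missing object from an OSD known to hold it.
  for (const auto &obj : primary_missing) {
    auto loc = missing_loc.find(obj);
    if (loc == missing_loc.end() || loc->second.empty())
      continue;  // unfound: no known source yet
    ops.push_back({RepairOp::Pull, obj, *loc->second.begin()});
  }

  // Then replicas: push every object the primary already holds.
  for (const auto &peer : peer_missing) {
    for (const auto &obj : peer.second) {
      if (primary_missing.count(obj))
        continue;  // deferred until the primary has recovered the object
      ops.push_back({RepairOp::Push, obj, peer.first});
    }
  }
  return ops;
}
```

An object that is missing on the primary and has no entry in the location map produces no operation at all in this model, which corresponds to the unfound objects discussed later.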

3.1 Trigger repair

The Recovery process is triggered by the PG's primary OSD, which controls the whole process. The missing (or inconsistent) objects on the primary OSD are repaired first, and then the missing objects on the replica OSDs.

3.1.1 Recovery trigger process

The call flow that takes a PG from the Activating state to the Recovering state is as follows:

1) The Active/Activating state generates a DoRecovery event

void ReplicatedPG::on_activate()
{
	// all clean?
	if (needs_recovery()) {
		dout(10) << "activate not all replicas are up-to-date, queueing recovery" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  DoRecovery())));
	} 
	...
}

2) Enter the WaitLocalRecoveryReserved state

struct Activating : boost::statechart::state< Activating, Active >, NamedState {
  typedef boost::mpl::list <
boost::statechart::transition< AllReplicasRecovered, Recovered >,
boost::statechart::transition< DoRecovery, WaitLocalRecoveryReserved >,
boost::statechart::transition< RequestBackfill, WaitLocalBackfillReserved >
> reactions;
	explicit Activating(my_context ctx);
	void exit();
};

As shown above, when the Activating state receives a DoRecovery event, the PG directly enters the WaitLocalRecoveryReserved state.

The WaitLocalRecoveryReserved state reserves local resources as follows:

struct WaitLocalRecoveryReserved : boost::statechart::state< WaitLocalRecoveryReserved, Active >, NamedState {
  typedef boost::mpl::list <
boost::statechart::transition< LocalRecoveryReserved, WaitRemoteRecoveryReserved >
> reactions;
	explicit WaitLocalRecoveryReserved(my_context ctx);
	void exit();
};

PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/WaitLocalRecoveryReserved")
{
	context< RecoveryMachine >().log_enter(state_name);
	PG *pg = context< RecoveryMachine >().pg;
	pg->state_set(PG_STATE_RECOVERY_WAIT);
	pg->osd->local_reserver.request_reservation(
	  pg->info.pgid,
	  new QueuePeeringEvt<LocalRecoveryReserved>(
		pg, pg->get_osdmap()->get_epoch(),
		LocalRecoveryReserved()),
	  pg->get_recovery_priority());

	pg->publish_stats_to_osd();
}

As shown above, when the local reservation succeeds, a LocalRecoveryReserved event is generated and posted to the PG's event queue. When the WaitLocalRecoveryReserved state receives this event, it transitions directly to the WaitRemoteRecoveryReserved state.

3) The WaitRemoteRecoveryReserved state reserves remote resources

struct WaitRemoteRecoveryReserved : boost::statechart::state< WaitRemoteRecoveryReserved, Active >, NamedState {
typedef boost::mpl::list <
boost::statechart::custom_reaction< RemoteRecoveryReserved >,
boost::statechart::transition< AllRemotesReserved, Recovering >
> reactions;

	set<pg_shard_t>::const_iterator remote_recovery_reservation_it;
	explicit WaitRemoteRecoveryReserved(my_context ctx);
	boost::statechart::result react(const RemoteRecoveryReserved &evt);
	void exit();
};

PG::RecoveryState::WaitRemoteRecoveryReserved::WaitRemoteRecoveryReserved(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/WaitRemoteRecoveryReserved"),
    remote_recovery_reservation_it(context< Active >().remote_shards_to_reserve_recovery.begin())
{
	context< RecoveryMachine >().log_enter(state_name);
	post_event(RemoteRecoveryReserved());
}

boost::statechart::result
PG::RecoveryState::WaitRemoteRecoveryReserved::react(const RemoteRecoveryReserved &evt) {
	PG *pg = context< RecoveryMachine >().pg;
	
	if (remote_recovery_reservation_it != context< Active >().remote_shards_to_reserve_recovery.end()) {
		assert(*remote_recovery_reservation_it != pg->pg_whoami);
		ConnectionRef con = pg->osd->get_con_osd_cluster(
		remote_recovery_reservation_it->osd, pg->get_osdmap()->get_epoch());
		if (con) {
			pg->osd->send_message_osd_cluster(
			new MRecoveryReserve(
			  MRecoveryReserve::REQUEST,
			  spg_t(pg->info.pgid.pgid, remote_recovery_reservation_it->shard),
			  pg->get_osdmap()->get_epoch()),
			  con.get());
		}
		++remote_recovery_reservation_it;
	} else {
		post_event(AllRemotesReserved());
	}
	return discard_event();
}

PG::RecoveryState::Active::Active(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active"),
    remote_shards_to_reserve_recovery(
      unique_osd_shard_set(
		context< RecoveryMachine >().pg->pg_whoami,
		context< RecoveryMachine >().pg->actingbackfill)),
    remote_shards_to_reserve_backfill(
      unique_osd_shard_set(
		context< RecoveryMachine >().pg->pg_whoami,
		context< RecoveryMachine >().pg->backfill_targets)),
    all_replicas_activated(false)
{
	...
}

The WaitRemoteRecoveryReserved constructor posts a RemoteRecoveryReserved event directly. The function WaitRemoteRecoveryReserved::react(const RemoteRecoveryReserved &) then sends an MRecoveryReserve message to the replica OSDs in remote_shards_to_reserve_recovery one at a time to reserve remote resources. Each invocation asks the next replica; when no replica is left, it posts AllRemotesReserved.
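The one-at-a-time pattern in react() can be shown with a reduced model. The class below is hypothetical: shards are plain ints, "sending MRecoveryReserve::REQUEST" just records the target, and an index into a vector replaces the set iterator kept by the real state.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of the WaitRemoteRecoveryReserved reaction.
class RemoteReservationWalk {
  std::vector<int> shards;  // replicas that must grant a reservation
  std::size_t next = 0;     // next replica to ask
  bool all_reserved = false;

public:
  std::vector<int> sent;    // replicas a REQUEST was sent to, in order

  explicit RemoteReservationWalk(std::vector<int> s) : shards(std::move(s)) {}

  // Called once on entry to the state and once for each grant that arrives.
  void on_remote_recovery_reserved() {
    assert(!all_reserved);
    if (next < shards.size()) {
      sent.push_back(shards[next]);  // ask the next replica
      ++next;
    } else {
      all_reserved = true;           // corresponds to posting AllRemotesReserved
    }
  }

  bool done() const { return all_reserved; }
};
```

For two replicas the handler runs three times: once on entry (request to the first replica), once when the first grant arrives (request to the second), and once when the second grant arrives (all reserved). At most one request is outstanding at any time.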

Note: The following describes the processing on the remote replica.

After the remote copy OSD receives the MRecoveryReserve::REQUEST message, it calls the OSD::handle_pg_recovery_reserve() function for processing:

void OSD::handle_pg_recovery_reserve(OpRequestRef op)
{
	MRecoveryReserve *m = static_cast<MRecoveryReserve*>(op->get_req());
	assert(m->get_type() == MSG_OSD_RECOVERY_RESERVE);
	
	if (!require_osd_peer(op->get_req()))
		return;
	if (!require_same_or_newer_map(op, m->query_epoch, false))
		return;
	
	PG::CephPeeringEvtRef evt;
	if (m->type == MRecoveryReserve::REQUEST) {
		evt = PG::CephPeeringEvtRef(
		  new PG::CephPeeringEvt(
			m->query_epoch,
			m->query_epoch,
			PG::RequestRecovery()));
	}
	...
}

The handle_pg_recovery_reserve() function generates a PG::RequestRecovery event. The RepNotRecovering state handles this event by transitioning directly to RepWaitRecoveryReserved. The RepWaitRecoveryReserved state makes the reservation on the replica; when the reservation succeeds, a RemoteRecoveryReserved event is generated, the success is reported to the PG primary, and the replica PG itself enters the RepRecovering state:

struct RepNotRecovering : boost::statechart::state< RepNotRecovering, ReplicaActive>, NamedState {
typedef boost::mpl::list<
boost::statechart::custom_reaction< RequestBackfillPrio >,
boost::statechart::transition< RequestRecovery, RepWaitRecoveryReserved >,
boost::statechart::transition< RecoveryDone, RepNotRecovering >  // for compat with pre-reservation peers
> reactions;
	explicit RepNotRecovering(my_context ctx);
	boost::statechart::result react(const RequestBackfillPrio &evt);
	void exit();
};

struct RepWaitRecoveryReserved : boost::statechart::state< RepWaitRecoveryReserved, ReplicaActive >, NamedState {
typedef boost::mpl::list<
boost::statechart::custom_reaction< RemoteRecoveryReserved >
> reactions;
	explicit RepWaitRecoveryReserved(my_context ctx);
	void exit();
	boost::statechart::result react(const RemoteRecoveryReserved &evt);
};

Note: The default initial substate of the ReplicaActive state is RepNotRecovering.

4) Enter the Recovering state

When all remote resources have been reserved, the PG enters the Recovering state.

struct Recovering : boost::statechart::state< Recovering, Active >, NamedState {
typedef boost::mpl::list <
boost::statechart::custom_reaction< AllReplicasRecovered >,
boost::statechart::custom_reaction< RequestBackfill >
> reactions;
	explicit Recovering(my_context ctx);
	void exit();
	void release_reservations();
	boost::statechart::result react(const AllReplicasRecovered &evt);
	boost::statechart::result react(const RequestBackfill &evt);
};

PG::RecoveryState::Recovering::Recovering(my_context ctx)
  : my_base(ctx),
    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/Recovering")
{
	context< RecoveryMachine >().log_enter(state_name);
	
	PG *pg = context< RecoveryMachine >().pg;
	pg->state_clear(PG_STATE_RECOVERY_WAIT);
	pg->state_set(PG_STATE_RECOVERING);
	pg->publish_stats_to_osd();
	pg->osd->queue_for_recovery(pg);
}

The Recovering constructor clears the PG's PG_STATE_RECOVERY_WAIT state, sets PG_STATE_RECOVERING, and then adds the PG to the recovery queue:

bool OSDService::queue_for_recovery(PG *pg)
{
	bool b = recovery_wq.queue(pg);
	if (b)
		dout(10) << "queue_for_recovery queued " << *pg << dendl;
	else
		dout(10) << "queue_for_recovery already queued " << *pg << dendl;
	return b;
}

3.1.2 OSD::do_recovery()

As the state transitions above show, when a PG enters the Active/Recovering state it is added to the OSD's RecoveryWQ work queue (recovery_wq). The thread pool serving this queue calls do_recovery() to perform the actual data recovery:

struct RecoveryWQ : public ThreadPool::WorkQueue<PG> {
	void _process(PG *pg, ThreadPool::TPHandle &handle) override {
		osd->do_recovery(pg, handle);
		pg->put("RecoveryWQ");
	}
}recovery_wq;

void OSD::do_recovery(PG *pg, ThreadPool::TPHandle &handle){
	if (g_conf->osd_recovery_sleep > 0) {
		handle.suspend_tp_timeout();
		utime_t t;
		t.set_from_double(g_conf->osd_recovery_sleep);
		t.sleep();
		handle.reset_tp_timeout();
		dout(20) << __func__ << " slept for " << t << dendl;
	}
	
	// see how many we should try to start.  note that this is a bit racy.
	recovery_wq.lock();
	int max = MIN(cct->_conf->osd_recovery_max_active - recovery_ops_active,
	cct->_conf->osd_recovery_max_single_start);
	if (max > 0) {
		dout(10) << "do_recovery can start " << max << " (" << recovery_ops_active << "/" << cct->_conf->osd_recovery_max_active
		  << " rops)" << dendl;
		recovery_ops_active += max;  // take them now, return them if we don't use them.
	} else {
		dout(10) << "do_recovery can start 0 (" << recovery_ops_active << "/" << cct->_conf->osd_recovery_max_active
		  << " rops)" << dendl;
	}
	recovery_wq.unlock();
	
	if (max <= 0) {
		dout(10) << "do_recovery raced and failed to start anything; requeuing " << *pg << dendl;
		recovery_wq.queue(pg);
		return;
	} else {
		pg->lock_suspend_timeout(handle);
		if (pg->deleting || !(pg->is_peered() && pg->is_primary())) {
			pg->unlock();
			goto out;
		}
	
		dout(10) << "do_recovery starting " << max << " " << *pg << dendl;
		#ifdef DEBUG_RECOVERY_OIDS
			dout(20) << "  active was " << recovery_oids[pg->info.pgid] << dendl;
		#endif
		
		int started = 0;
		bool more = pg->start_recovery_ops(max, handle, &started);
		dout(10) << "do_recovery started " << started << "/" << max << " on " << *pg << dendl;
		// If no recovery op is started, don't bother to manipulate the RecoveryCtx
		if (!started && (more || !pg->have_unfound())) {
			pg->unlock();
			goto out;
		}
	
		PG::RecoveryCtx rctx = create_context();
		rctx.handle = &handle;
		
		/*
		* if we couldn't start any recovery ops and things are still
		* unfound, see if we can discover more missing object locations.
		* It may be that our initial locations were bad and we errored
		* out while trying to pull.
		*/
		if (!more && pg->have_unfound()) {
			pg->discover_all_missing(*rctx.query_map);
			if (rctx.query_map->empty()) {
				dout(10) << "do_recovery  no luck, giving up on this pg for now" << dendl;
				recovery_wq.lock();
				recovery_wq._dequeue(pg);
				recovery_wq.unlock();
			}
		}
	
		pg->write_if_dirty(*rctx.transaction);
		OSDMapRef curmap = pg->get_osdmap();
		pg->unlock();
		dispatch_context(rctx, pg, curmap);
	}
	
out:
	recovery_wq.lock();
	if (max > 0) {
		assert(recovery_ops_active >= max);
		recovery_ops_active -= max;
	}
	recovery_wq._wake();
	recovery_wq.unlock();
}

The function do_recovery() is executed by a thread from the thread pool of the RecoveryWQ work queue. Its input is the PG to be repaired, and it proceeds as follows:

1) The configuration option osd_recovery_sleep sets how long the thread sleeps before each repair pass. If the value is greater than 0, the thread sleeps for that long each time it starts. The default value is 0, meaning no sleep.

2) Take the recovery_wq.lock() lock to protect the recovery_wq queue and the variable recovery_ops_active. Compute max, the number of objects that may be repaired in this pass: the smaller of osd_recovery_max_single_start and osd_recovery_max_active minus recovery_ops_active (the number of objects currently being repaired). If max is greater than 0, add it to recovery_ops_active to reserve the quota. Then call recovery_wq.unlock() to release the lock;

3) If max is less than or equal to 0, there is no quota left: put the PG back on the work queue recovery_wq and return. Otherwise, call pg->lock_suspend_timeout(handle) to lock the PG with the thread-pool timeout suspended, and check the PG's state: if the PG is being deleted, is not peered, or is not the primary, exit;

4) Call pg->start_recovery_ops() to start the repair. The boolean return value more indicates whether repair work is still in progress, and the output parameter started is the number of objects whose repair was started. If nothing was started and either more is true or there are no unfound objects, exit without creating a RecoveryCtx;

5) If more is false but pg->have_unfound() is true, some objects are unfound, that is, they are missing and no OSD holding a complete copy is currently known. Call PG::discover_all_missing() to continue searching on the OSDs in might_have_unfound. The search works by sending each of those OSDs a query for its pg_log;

Note: Unfound objects are recovered last.

6) If rctx.query_map is empty, no further OSD could be found to query, so recovery of this PG ends for now and recovery_wq._dequeue(pg) is called to remove the PG from the queue;

7) The function dispatch_context() finishes up: it sends the requests in query_map and submits the rctx.transaction transaction to the local object store.

From the analysis above, the core job of do_recovery() is to compute max, the number of objects that may be repaired, and then call start_recovery_ops() to start the repair.
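The quota handling in do_recovery() follows a take-then-return pattern, which the following sketch isolates. RecoveryQuota is a hypothetical class; its members play the roles of osd_recovery_max_active, osd_recovery_max_single_start, and recovery_ops_active, and a std::mutex stands in for recovery_wq.lock().

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the counters guarded by recovery_wq.lock().
class RecoveryQuota {
  std::mutex lock;
  const int max_active;        // role of osd_recovery_max_active
  const int max_single_start;  // role of osd_recovery_max_single_start
  int ops_active = 0;          // role of recovery_ops_active

public:
  RecoveryQuota(int active_limit, int single_start_limit)
    : max_active(active_limit), max_single_start(single_start_limit) {}

  // Computes how many ops one pass may start and reserves them up front.
  // A result <= 0 means there is no quota and nothing was reserved.
  int take() {
    std::lock_guard<std::mutex> l(lock);
    int max = std::min(max_active - ops_active, max_single_start);
    if (max > 0)
      ops_active += max;  // take them now, return them afterwards
    return max;
  }

  // Returns slots previously reserved by take().
  void give_back(int n) {
    std::lock_guard<std::mutex> l(lock);
    assert(n >= 0 && ops_active >= n);
    ops_active -= n;
  }

  int active() {
    std::lock_guard<std::mutex> l(lock);
    return ops_active;
  }
};
```

Reserving the slots before doing any work means that concurrent worker threads cannot both claim the same quota; a thread that receives a non-positive result simply requeues the PG, as step 3 describes.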

Note: When a recovery operation completes, ReplicatedPG::on_global_recover() is called back. If the PG still has data to recover, on_global_recover() calls PG::finish_recovery_op() to add the PG back to recovery_wq.

3.2 ReplicatedPG

The class ReplicatedPG handles the repair operations of replicated PGs. The following sections analyze the implementation of start_recovery_ops() and the functions it uses.

3.2.1 Function start_recovery_ops()

The function start_recovery_ops() calls recover_primary() and recover_replicas() to repair the objects missing on the PG's primary and replicas. After that repair completes, if Backfill is still required, it queues the corresponding event to make the PG state machine start the Backfill process.

Note: ReplicatedPG::start_recovery_ops() covers both recovery and backfill, and recovery takes priority. The function's return value indicates whether any recovery or backfill work was started.

class PG : DoutPrefixProvider {
protected:
	BackfillInterval backfill_info;
	map<pg_shard_t, BackfillInterval> peer_backfill_info;
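	bool backfill_reserved;                      // whether the backfill reservation has succeeded; set to true on entering the Backfilling state
	bool backfill_reserving;                     // whether a backfill reservation has been requested (note: time passes between requesting and obtaining the reservation)
};

bool ReplicatedPG::start_recovery_ops(int max, ThreadPool::TPHandle &handle,int *ops_started)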
{
	int& started = *ops_started;
	started = 0;
	bool work_in_progress = false;
	assert(is_primary());
	
	if (!state_test(PG_STATE_RECOVERING) && !state_test(PG_STATE_BACKFILL)) {
		/* TODO: I think this case is broken and will make do_recovery()
		* unhappy since we're returning false */
		dout(10) << "recovery raced and were queued twice, ignoring!" << dendl;
		return false;
	}
	
	const pg_missing_t &missing = pg_log.get_missing();
	
	int num_missing = missing.num_missing();
	int num_unfound = get_num_unfound();
	
	if (num_missing == 0) {
		info.last_complete = info.last_update;
	}

	if (num_missing == num_unfound) {
		// All of the missing objects we have are unfound.
		// Recover the replicas.
		started = recover_replicas(max, handle);
	}
	if (!started) {
		// We still have missing objects that we should grab from replicas.
		started += recover_primary(max, handle);
	}
	if (!started && num_unfound != get_num_unfound()) {
		// second chance to recovery replicas
		started = recover_replicas(max, handle);
	}
	
	if (started)
		work_in_progress = true;

	bool deferred_backfill = false;
	if (recovering.empty() && state_test(PG_STATE_BACKFILL) && !backfill_targets.empty() && started < max &&
	  missing.num_missing() == 0 && waiting_on_backfill.empty()) {

		if (get_osdmap()->test_flag(CEPH_OSDMAP_NOBACKFILL)) {
			dout(10) << "deferring backfill due to NOBACKFILL" << dendl;
			deferred_backfill = true;
		} else if (get_osdmap()->test_flag(CEPH_OSDMAP_NOREBALANCE) && !is_degraded())  {
			dout(10) << "deferring backfill due to NOREBALANCE" << dendl;
			deferred_backfill = true;
		} else if (!backfill_reserved) {
			dout(10) << "deferring backfill due to !backfill_reserved" << dendl;
			if (!backfill_reserving) {
				dout(10) << "queueing RequestBackfill" << dendl;
				backfill_reserving = true;
				queue_peering_event(
				  CephPeeringEvtRef(
					std::make_shared<CephPeeringEvt>(
					  get_osdmap()->get_epoch(),
					  get_osdmap()->get_epoch(),
					  RequestBackfill())));
			}
			deferred_backfill = true;
		} else {
			started += recover_backfill(max - started, handle, &work_in_progress);
		}
	}

	dout(10) << " started " << started << dendl;
	osd->logger->inc(l_osd_rop, started);
	
	if (!recovering.empty() || work_in_progress || recovery_ops_active > 0 || deferred_backfill)
		return work_in_progress;
	
	assert(recovering.empty());
	assert(recovery_ops_active == 0);
	
	dout(10) << __func__ << " needs_recovery: " << missing_loc.get_needs_recovery() << dendl;
	dout(10) << __func__ << " missing_loc: " << missing_loc.get_missing_locs() << dendl;
	int unfound = get_num_unfound();
	if (unfound) {
		dout(10) << " still have " << unfound << " unfound" << dendl;
		return work_in_progress;
	}

	if (missing.num_missing() > 0) {
		// this shouldn't happen!
		osd->clog->error() << info.pgid << " recovery ending with " << missing.num_missing() << ": " << missing.missing << "\n";
		return work_in_progress;
	}
	
	if (needs_recovery()) {
		// this shouldn't happen!
		// We already checked num_missing() so we must have missing replicas
		osd->clog->error() << info.pgid << " recovery ending with missing replicas\n";
		return work_in_progress;
	}
	
	if (state_test(PG_STATE_RECOVERING)) {
		state_clear(PG_STATE_RECOVERING);
		if (needs_backfill()) {
			dout(10) << "recovery done, queuing backfill" << dendl;
			queue_peering_event(
			  CephPeeringEvtRef(
				std::make_shared<CephPeeringEvt>(
				  get_osdmap()->get_epoch(),
				  get_osdmap()->get_epoch(),
				  RequestBackfill())));
		} else {
			dout(10) << "recovery done, no backfill" << dendl;
			queue_peering_event(
			  CephPeeringEvtRef(
				std::make_shared<CephPeeringEvt>(
				  get_osdmap()->get_epoch(),
				  get_osdmap()->get_epoch(),
				  AllReplicasRecovered())));
		}
	} else { // backfilling
		state_clear(PG_STATE_BACKFILL);
		dout(10) << "recovery done, backfill done" << dendl;
		queue_peering_event(
		  CephPeeringEvtRef(
			std::make_shared<CephPeeringEvt>(
			  get_osdmap()->get_epoch(),
			  get_osdmap()->get_epoch(),
			  Backfilled())));
	}
	
	return false;
}

The specific process of this function is as follows:

1) First assert that this OSD is the primary OSD of the PG. If the PG is in neither the PG_STATE_RECOVERING nor the PG_STATE_BACKFILL state, return immediately;

2) Get the missing set from pg_log; it holds the objects missing on the primary OSD. num_missing is the number of objects missing on the primary OSD; num_unfound is the number of objects in this PG that are missing and for which no correct copy has been found on any OSD. If num_missing is 0, the primary OSD is missing nothing, so info.last_complete is set directly to info.last_update;

Note: unfound objects are a subset of missing objects.

3) If num_missing equals num_unfound, every object missing on the primary OSD is unfound, so first call ReplicatedPG::recover_replicas() to repair objects on the replicas;

4) If started is 0, that is, no repair operation has been started yet, call ReplicatedPG::recover_primary() to repair objects on the primary OSD;

5) If started is still 0 and the number of unfound objects has changed, call ReplicatedPG::recover_replicas() again to repair the replicas;

6) If started is not 0, set work_in_progress to true;

7) If the recovering queue is empty (no object is undergoing Recovery), the state is PG_STATE_BACKFILL, backfill_targets is not empty, started is less than max, missing.num_missing() is 0, and waiting_on_backfill is empty:

  a) If the CEPH_OSDMAP_NOBACKFILL flag is set on the OSDMap, postpone the Backfill process;

  b) If the CEPH_OSDMAP_NOREBALANCE flag is set and the PG is not degraded, postpone the Backfill process;

  c) If backfill_reserved is not set, postpone the Backfill process; if no reservation has been requested yet (backfill_reserving is false), set backfill_reserving and queue a RequestBackfill event to the state machine to start the reservation;

  d) Otherwise, call ReplicatedPG::recover_backfill() to start the Backfill process;

8) Finally, if no repair work remains and the PG is in the PG_STATE_RECOVERING state, clear that state and check: if Backfill is required, queue a RequestBackfill event to the PG state machine; if not, queue an AllReplicasRecovered event;

9) Otherwise the PG is in the PG_STATE_BACKFILL state: clear this state and queue the Backfilled event.
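
The branching in step 7) can be condensed into a small pure function. The following is a simplified sketch, not Ceph code: the struct BackfillGate, the enum BackfillAction and the function decide_backfill() are hypothetical names introduced for illustration, and the boolean fields merely stand in for the OSDMap flags and PG members that the real function queries.

```cpp
#include <cassert>

// Hypothetical, simplified model of the inputs examined in step 7).
struct BackfillGate {
	bool nobackfill_flag;     // stands in for CEPH_OSDMAP_NOBACKFILL
	bool norebalance_flag;    // stands in for CEPH_OSDMAP_NOREBALANCE
	bool degraded;            // stands in for is_degraded()
	bool backfill_reserved;   // reservation already granted
	bool backfill_reserving;  // reservation already requested
};

enum class BackfillAction {
	DeferOnly,        // postpone backfill, queue nothing
	DeferAndRequest,  // postpone backfill and queue RequestBackfill
	RunBackfill       // call recover_backfill()
};

// Mirrors the order of the checks in steps 7a) to 7d).
BackfillAction decide_backfill(BackfillGate &g)
{
	if (g.nobackfill_flag)
		return BackfillAction::DeferOnly;
	if (g.norebalance_flag && !g.degraded)
		return BackfillAction::DeferOnly;
	if (!g.backfill_reserved) {
		if (!g.backfill_reserving) {
			g.backfill_reserving = true;  // remember that a request is in flight
			return BackfillAction::DeferAndRequest;
		}
		return BackfillAction::DeferOnly;
	}
	return BackfillAction::RunBackfill;
}
```

Calling decide_backfill() twice on a PG without a reservation returns DeferAndRequest the first time and DeferOnly the second time, which is why backfill_reserving exists: it prevents the same RequestBackfill event from being queued repeatedly while the reservation is pending.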

Next, we'll cover:

  • recover_primary() repairs missing objects on PG primary OSD

  • recover_replicas() repairs missing objects on PG replica OSD

  • recover_backfill() executes the backfill process

3.2.2 Function recover_primary()

The function recover_primary() is used to recover missing objects on the primary OSD of a PG:

class ReplicatedBackend : public PGBackend {
	struct RPGHandle : public PGBackend::RecoveryHandle {
      map<pg_shard_t, vector<PushOp> > pushes;
      map<pg_shard_t, vector<PullOp> > pulls;
	};

	/// @see PGBackend::open_recovery_op
	RPGHandle *_open_recovery_op() {
		return new RPGHandle();
	}
	PGBackend::RecoveryHandle *open_recovery_op() {
		return _open_recovery_op();
	}
};

/**
 * do one recovery op.
 * return true if done, false if nothing left to do.
 */
int ReplicatedPG::recover_primary(int max, ThreadPool::TPHandle &handle)
{
	assert(is_primary());
	
	const pg_missing_t &missing = pg_log.get_missing();
	
	dout(10) << "recover_primary recovering " << recovering.size()<< " in pg" << dendl;
	dout(10) << "recover_primary " << missing << dendl;
	dout(25) << "recover_primary " << missing.missing << dendl;
	
	// look at log!
	pg_log_entry_t *latest = 0;
	int started = 0;
	int skipped = 0;
	
	PGBackend::RecoveryHandle *h = pgbackend->open_recovery_op();
	map<version_t, hobject_t>::const_iterator p = missing.rmissing.lower_bound(pg_log.get_log().last_requested);
	while (p != missing.rmissing.end()) {
		handle.reset_tp_timeout();
		hobject_t soid;
		version_t v = p->first;
	
		if (pg_log.get_log().objects.count(p->second)) {
			latest = pg_log.get_log().objects.find(p->second)->second;
			assert(latest->is_update());
			soid = latest->soid;
		} else {
			latest = 0;
			soid = p->second;
		}

		const pg_missing_t::item& item = missing.missing.find(p->second)->second;
		++p;

		hobject_t head = soid;
		head.snap = CEPH_NOSNAP;
		
		eversion_t need = item.need;
		
		dout(10) << "recover_primary " << soid << " " << item.need << (missing.is_missing(soid) ? " (missing)":"")
		  << (missing.is_missing(head) ? " (missing head)":"") << (recovering.count(soid) ? " (recovering)":"")
		  << (recovering.count(head) ? " (recovering head)":"") << dendl;

		if (latest) {
			switch (latest->op) {
			case pg_log_entry_t::CLONE:
			/*
			* Handling for this special case removed for now, until we
			* can correctly construct an accurate SnapSet from the old
			* one.
			*/
			break;

			case pg_log_entry_t::LOST_REVERT:
			{
				if (item.have == latest->reverting_to) {
					ObjectContextRef obc = get_object_context(soid, true);
			
					if (obc->obs.oi.version == latest->version) {
						// I'm already reverting
						dout(10) << " already reverting " << soid << dendl;
					} else {
						dout(10) << " reverting " << soid << " to " << latest->prior_version << dendl;
						obc->ondisk_write_lock();
						obc->obs.oi.version = latest->version;
			
						ObjectStore::Transaction t;
						bufferlist b2;
						obc->obs.oi.encode(b2);
						assert(!pool.info.require_rollback());
						t.setattr(coll, ghobject_t(soid), OI_ATTR, b2);
						
						recover_got(soid, latest->version);
						missing_loc.add_location(soid, pg_whoami);
						
						++active_pushes;
			
						osd->store->queue_transaction(osr.get(), std::move(t),
						  new C_OSD_AppliedRecoveredObject(this, obc),
						  new C_OSD_CommittedPushedObject(
							this,
							get_osdmap()->get_epoch(),
							info.last_complete),
							new C_OSD_OndiskWriteUnlock(obc));
						continue;
					}
				} else {
					/*
					* Pull the old version of the object.  Update missing_loc here to have the location
					* of the version we want.
					*
					* This doesn't use the usual missing_loc paths, but that's okay:
					*  - if we have it locally, we hit the case above, and go from there.
					*  - if we don't, we always pass through this case during recovery and set up the location
					*    properly.
					*  - this way we don't need to mangle the missing code to be general about needing an old
					*    version...
					*/
					eversion_t alternate_need = latest->reverting_to;
					dout(10) << " need to pull prior_version " << alternate_need << " for revert " << item << dendl;
					
					for (map<pg_shard_t, pg_missing_t>::iterator p = peer_missing.begin();p != peer_missing.end(); ++p)
	      				if (p->second.is_missing(soid, need) && p->second.missing[soid].have == alternate_need) {
							missing_loc.add_location(soid, p->first);
	      				}

					dout(10) << " will pull " << alternate_need << " or " << need << " from one of " << missing_loc.get_locations(soid) << dendl;
	  			}
			}

			break;
			}
		}
   
		if (!recovering.count(soid)) {
			if (recovering.count(head)) {
				++skipped;
			} else {
				int r = recover_missing(soid, need, get_recovery_op_priority(), h);
				switch (r) {
				case PULL_YES:
					++started;
					break;
				case PULL_OTHER:
					++started;
					// intentional fall through: PULL_OTHER is also counted as skipped
				case PULL_NONE:
					++skipped;
					break;
				default:
					assert(0);
				}

				if (started >= max)
					break;
			}
		}
		
		// only advance last_requested if we haven't skipped anything
		if (!skipped)
			pg_log.set_last_requested(v);
	}
		
	pgbackend->run_recovery_op(h, get_recovery_op_priority());
	return started;
}

Its processing is as follows:

1) Call pgbackend->open_recovery_op() to obtain a PGBackend::RecoveryHandle appropriate to the PG type. For ReplicatedPG the handle is an RPGHandle, which contains two maps holding PushOp and PullOp, the structures that wrap Push and Pull operations:

struct RPGHandle : public PGBackend::RecoveryHandle {
	map<pg_shard_t, vector<PushOp> > pushes;
	map<pg_shard_t, vector<PullOp> > pulls;
};


//src/osd/osd_types.h
struct PushOp {
	hobject_t soid;
	eversion_t version;
	bufferlist data;
	interval_set<uint64_t> data_included;
	bufferlist omap_header;
	map<string, bufferlist> omap_entries;
	map<string, bufferlist> attrset;
	
	ObjectRecoveryInfo recovery_info;
	ObjectRecoveryProgress before_progress;
	ObjectRecoveryProgress after_progress;
	
	static void generate_test_instances(list<PushOp*>& o);
	void encode(bufferlist &bl) const;
	void decode(bufferlist::iterator &bl);
	ostream &print(ostream &out) const;
	void dump(Formatter *f) const;
	
	uint64_t cost(CephContext *cct) const;
};

struct PullOp {
	hobject_t soid;
	
	ObjectRecoveryInfo recovery_info;
	ObjectRecoveryProgress recovery_progress;
	
	static void generate_test_instances(list<PullOp*>& o);
	void encode(bufferlist &bl) const;
	void decode(bufferlist::iterator &bl);
	ostream &print(ostream &out) const;
	void dump(Formatter *f) const;
	
	uint64_t cost(CephContext *cct) const;
};

2) last_requested marks the position up to which repairs have already been requested; calling lower_bound() with it yields the first object that has not yet been repaired;

3) Traverse each object that has not been repaired: latest is the last log entry recorded for the missing object, and soid is the missing object. If latest is not empty:

For the operation types of pg_log_entry_t, refer to the following:

/**
 * pg_log_entry_t - single entry/event in pg log
 *
 */
struct pg_log_entry_t {
	enum {
		MODIFY = 1,       // some unspecified modification (but not *all* modifications)
		CLONE = 2,        // cloned object from head
		DELETE = 3,       // deleted object
		BACKLOG = 4,      // event invented by generate_backlog [deprecated]
		LOST_REVERT = 5,  // lost new version, revert to an older version.
		LOST_DELETE = 6,  // lost new version, revert to no object (deleted).
		LOST_MARK = 7,    // lost new version, now EIO
		PROMOTE = 8,      // promoted object from another tier
		CLEAN = 9,        // mark an object clean
	};
};

  a) If the log entry is of type pg_log_entry_t::CLONE, no special handling is done here for now, until an accurate SnapSet (the snapshot-related information) can be constructed;

  b) If the log entry is of type pg_log_entry_t::LOST_REVERT: the revert operation is used when the data is inconsistent and the administrator forces a rollback to a specified version from the command line; reverting_to records the version rolled back to:

  • If item.have equals latest->reverting_to, the log shows that the local copy is already at the rollback version, so the ObjectContext of the object is obtained. If the object's current version obc->obs.oi.version equals latest->version, the rollback has already been completed;

  • If item.have equals latest->reverting_to but obc->obs.oi.version is not equal to latest->version, the rollback has not been applied yet, and the version number of the object can simply be changed to latest->version;

  • Otherwise the reverting_to version of the object must be pulled. There is no special handling here: check whether any OSD holds this version of the object, and if so add that OSD to missing_loc to record where the version can be found, then continue with the normal repair. (Note: since each replica OSD has been checked earlier, only peer_missing needs to be checked here)

  c) If the object is already in recovering, it is being repaired and nothing more is done; if its head object is being repaired, the object is skipped and skipped is incremented; otherwise call ReplicatedPG::recover_missing() to repair it.

4) Call the function pgbackend->run_recovery_op() to send the message encapsulated by PullOp or PushOp;
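
The reason for collecting operations in a handle and sending them in step 4) is batching: operations are grouped by target shard first and dispatched afterwards. Below is a minimal sketch of that pattern, not the Ceph implementation. ShardId, SimplePullOp, SimpleHandle, queue_pull() and run_ops() are hypothetical names, and the sketch assumes exactly one message per shard, whereas the real backend may split a batch further.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for pg_shard_t and PullOp.
using ShardId = int;
struct SimplePullOp { std::string soid; };

// Same shape as RPGHandle::pulls: operations grouped by target shard.
struct SimpleHandle {
	std::map<ShardId, std::vector<SimplePullOp>> pulls;
};

// Queue one pull on the handle; nothing is sent yet.
void queue_pull(SimpleHandle &h, ShardId from, const std::string &soid)
{
	h.pulls[from].push_back(SimplePullOp{soid});
}

// "Send" everything queued so far: one (shard, op count) pair per shard.
std::vector<std::pair<ShardId, std::size_t>> run_ops(SimpleHandle &h)
{
	std::vector<std::pair<ShardId, std::size_t>> messages;
	for (const auto &kv : h.pulls)
		messages.emplace_back(kv.first, kv.second.size());
	h.pulls.clear();  // the handle is consumed once its ops are sent
	return messages;
}
```

Queuing obj1 and obj3 for shard 1 and obj2 for shard 2 yields two entries from run_ops(), the first carrying two operations and the second carrying one.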

Note: The construction of PullOp or PushOp is done in ReplicatedPG::recover_missing(), which we will introduce in detail later.
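
Steps 2) and 3) rely on two details that are easy to miss in the listing: the scan resumes at lower_bound(last_requested), and last_requested only advances while nothing has been skipped. The sketch below models just that bookkeeping under simplifying assumptions: objects are plain strings, scan_missing() and the blocked set are hypothetical, and blocked stands for any object that cannot be started right now (for example because its head is still being recovered or because it is unfound).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using version_t = std::uint64_t;

// Hypothetical, simplified scan over the version-ordered missing map.
// Returns the objects for which a repair is started.
std::vector<std::string> scan_missing(
	const std::map<version_t, std::string> &rmissing,  // version -> object
	const std::set<std::string> &blocked,
	version_t &last_requested,
	int max)
{
	std::vector<std::string> started;
	int skipped = 0;
	// Resume at the first entry whose version is >= last_requested.
	auto p = rmissing.lower_bound(last_requested);
	while (p != rmissing.end()) {
		version_t v = p->first;
		const std::string &soid = p->second;
		++p;
		if (blocked.count(soid)) {
			++skipped;
		} else {
			started.push_back(soid);
			if (static_cast<int>(started.size()) >= max)
				break;
		}
		// Only advance the resume point while nothing has been skipped,
		// so that a skipped object is revisited by the next scan.
		if (!skipped)
			last_requested = v;
	}
	return started;
}
```

With entries at versions 5, 6 and 7 and the object at version 6 blocked, the scan starts the other two objects but leaves last_requested at 5, so the next call begins early enough to reach the blocked object again.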


The following example illustrates the repair process when the last log record is of type LOST_REVERT:

Example 11-1 Log repair process

The PG log records are shown below. Each cell represents one log record and gives the name, version and operation of the object; versions are written as (epoch, version). The gray cells are the log records missing on this OSD. Because the log was copied from the authoritative log, the current log is continuous and complete.

[Figure: ceph-chapter11-2]

Case 1:  Repair of normal cases

The list of missing objects is [obj1, obj2], and the object currently being repaired is obj1. The log shows that obj1 has been modified three times, producing versions 6, 7 and 8. The locally held version (the have value) of obj1 is 4, so the object only needs to be repaired to the last modified version, 8; the intermediate versions are not repaired separately.
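
The relationship between the log and the need/have pair in Case 1 can be sketched as follows. This is an illustration only: LogEntry, MissingItem and build_missing() are hypothetical names, and versions are reduced to a single integer although Ceph uses (epoch, version) pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using version_t = std::uint64_t;

struct LogEntry { std::string soid; version_t version; };
struct MissingItem { version_t need; version_t have; };

// Hypothetical helper: for every object with log entries newer than the
// local copy, record the newest logged version (need) and the local one (have).
std::map<std::string, MissingItem> build_missing(
	const std::vector<LogEntry> &log,
	const std::map<std::string, version_t> &local_versions)
{
	std::map<std::string, MissingItem> missing;
	for (const LogEntry &e : log) {
		auto it = local_versions.find(e.soid);
		version_t have = (it == local_versions.end()) ? 0 : it->second;
		if (e.version <= have)
			continue;  // the local copy already covers this entry
		MissingItem &item = missing[e.soid];  // value-initialized to {0, 0}
		item.have = have;
		if (e.version > item.need)
			item.need = e.version;
	}
	return missing;
}
```

Feeding in the three obj1 entries (versions 6, 7 and 8) with a local version of 4 produces a single missing item with need 8 and have 4, matching the case described above.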

Case 2:  The last operation is an operation of type LOST_REVERT

[Figure: ceph-chapter11-3]

For the object obj1 to be repaired, the last operation is of type LOST_REVERT. The version of this operation is 8, its previous version prior_version is 7, and its reverting_to version is 4.

In this case the log shows that version 4 is already held locally. Check the actual version of obj1, which is the version number saved in object_info:

1) If the value is 8, the last revert operation was successful and no repair action is required;

2) If the value is 4, the LOST_REVERT operation has not been executed. The data content is already that of version 4, so only the version in object_info needs to be changed to 8.

If the reverting_to version is not version 4 but version 6, the data of obj1 must eventually be repaired to the data of version 6. Ceph's handling here is only to check whether version 6 appears as the have version in the missing records of other OSDs. If it does, that OSD is added to missing_loc to record where this version can be found, and the object is repaired later.
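
The three outcomes of Case 2 reduce to two comparisons. The sketch below follows the description above rather than reproducing the Ceph source; RevertAction and decide_revert() are hypothetical names, and versions are again simplified to single integers.

```cpp
#include <cassert>
#include <cstdint>

using version_t = std::uint64_t;

enum class RevertAction {
	AlreadyReverted,  // object_info is already at the LOST_REVERT version
	BumpVersionOnly,  // data is at reverting_to; only object_info is rewritten
	PullOldVersion    // reverting_to data must be fetched from another OSD
};

// have:           version held locally (item.have)
// reverting_to:   version the LOST_REVERT entry rolls back to
// oi_version:     version stored in the local object_info
// latest_version: version of the LOST_REVERT log entry itself
RevertAction decide_revert(version_t have, version_t reverting_to,
                           version_t oi_version, version_t latest_version)
{
	if (have != reverting_to)
		return RevertAction::PullOldVersion;
	if (oi_version == latest_version)
		return RevertAction::AlreadyReverted;
	return RevertAction::BumpVersionOnly;
}
```

With the numbers from the example, decide_revert(4, 4, 8, 8) corresponds to outcome 1), decide_revert(4, 4, 4, 8) to outcome 2), and decide_revert(4, 6, 4, 8) to the variant in which version 6 has to be located on another OSD.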

3.2.3 Function recover_missing()

The function ReplicatedPG::recover_missing() is used to recover missing objects. When repairing the snap object, the head object or snapdir object must be repaired first to obtain the SnapSet information, and then the snapshot object itself can be repaired.

/*
 * Return values:
 *  NONE  - didn't pull anything
 *  YES   - pulled what the caller wanted
 *  OTHER - needed to pull something else first (_head or _snapdir)
 */
enum { PULL_NONE, PULL_OTHER, PULL_YES };

int ReplicatedPG::recover_missing(
  const hobject_t &soid, eversion_t v,
  int priority,
  PGBackend::RecoveryHandle *h)
{
	if (missing_loc.is_unfound(soid)) {
		dout(7) << "pull " << soid << " v " << v << " but it is unfound" << dendl;
		return PULL_NONE;
	}

	// is this a snapped object?  if so, consult the snapset.. we may not need the entire object!
	ObjectContextRef obc;
	ObjectContextRef head_obc;
	if (soid.snap && soid.snap < CEPH_NOSNAP) {
		// do we have the head and/or snapdir?
		hobject_t head = soid.get_head();

		if (pg_log.get_missing().is_missing(head)) {
			if (recovering.count(head)) {
				dout(10) << " missing but already recovering head " << head << dendl;
				return PULL_NONE;
			} else {
				int r = recover_missing(head, pg_log.get_missing().missing.find(head)->second.need, priority,h);
				if (r != PULL_NONE)
					return PULL_OTHER;
				return PULL_NONE;
			}
		}

		head = soid.get_snapdir();
		if (pg_log.get_missing().is_missing(head)) {
			if (recovering.count(head)) {
				dout(10) << " missing but already recovering snapdir " << head << dendl;
				return PULL_NONE;
			} else {
				int r = recover_missing(
				head, pg_log.get_missing().missing.find(head)->second.need, priority,h);
				if (r != PULL_NONE)
					return PULL_OTHER;
				return PULL_NONE;
			}
		}

		// we must have one or the other
		head_obc = get_object_context(soid.get_head(),false,0);
		if (!head_obc)
			head_obc = get_object_context(soid.get_snapdir(),false,0);
		assert(head_obc);
	}

	start_recovery_op(soid);
	assert(!recovering.count(soid));
	recovering.insert(make_pair(soid, obc));

	pgbackend->recover_object(soid,v,head_obc,obc,h);

	return PULL_YES;
}

The specific implementation is as follows:

1) If the object soid is unfound, return PULL_NONE directly. Objects that are unfound cannot be repaired for the time being;

2) If the object being repaired is a snap object:

  a) If the corresponding head object is missing, recursively call recover_missing() to repair the head object first;

  b) If the snapdir object is missing, recursively call recover_missing() to repair the snapdir object first;

3) Obtain head_obc information from the head object or snapdir object;

4) Call the function pgbackend->recover_object() to encapsulate the operation information to be repaired into the PullOp or PushOp object, and add it to the RecoveryHandle structure.
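
The head-first ordering in steps 1) and 2) can be illustrated with a reduced model. The code below is a sketch under several assumptions, not the Ceph function: MiniPG and recover_missing_sketch() are hypothetical, objects are identified by name only, the caller states whether an object is a snapshot clone and names its head, and the snapdir branch, which follows the same pattern, is omitted.

```cpp
#include <cassert>
#include <set>
#include <string>

enum PullResult { PULL_NONE, PULL_OTHER, PULL_YES };

// Hypothetical, simplified PG state.
struct MiniPG {
	std::set<std::string> missing;     // objects missing on the primary
	std::set<std::string> unfound;     // missing objects with no known source
	std::set<std::string> recovering;  // repairs already in flight
};

// is_snap tells whether soid is a snapshot clone; head names its head object.
PullResult recover_missing_sketch(MiniPG &pg, const std::string &soid,
                                  bool is_snap, const std::string &head)
{
	if (pg.unfound.count(soid))
		return PULL_NONE;  // cannot be repaired for now
	if (is_snap && pg.missing.count(head)) {
		if (pg.recovering.count(head))
			return PULL_NONE;  // head repair is already running
		// Repair the head first; the clone is retried on a later pass.
		PullResult r = recover_missing_sketch(pg, head, false, head);
		return r != PULL_NONE ? PULL_OTHER : PULL_NONE;
	}
	pg.recovering.insert(soid);  // stands in for start_recovery_op() and recover_object()
	return PULL_YES;
}
```

When both a clone and its head are missing, the first call for the clone returns PULL_OTHER and starts the head instead; a second call while the head is still recovering returns PULL_NONE; once the head is no longer missing, the clone itself returns PULL_YES.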

Origin blog.csdn.net/weixin_43778179/article/details/132717315