PGLog processing flow in Ceph

Ceph's PGLog is maintained by the PG and records every operation on that PG; its role is similar to the undo log in a database. A PGLog normally keeps only a few thousand operation records (3000 by default, set by osd_min_pg_log_entries), but when the PG is in a degraded state it keeps more (10000 by default, set by osd_max_pg_log_entries), so that the log can be used to restore the PG's data after the failed PG comes back online. This article analyzes PGLog in terms of its format, how it is stored, and how it takes part in recovery.

Related configuration instructions:

  • osd_min_pg_log_entries: the number of PG log entries kept under normal conditions

  • osd_max_pg_log_entries: the number of PG log entries kept under abnormal (degraded) conditions; once this limit is reached, a trim is performed

1. PGLog module static class diagram

The static class diagram of the PGLog module is shown in the following figure:

[Figure: pglog-hierarchical]

2. The format of PGLog

Ceph uses versions to mark every update in a PG; each version is an (epoch, version) pair.

Here, epoch is the version of the osdmap and is incremented whenever an OSD's state changes (for example, when an OSD is added or removed). version is the sequence number of each update operation in the PG; it increases monotonically and is assigned by the Primary OSD of the PG.
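
As a rough illustration of how such versions order updates, the sketch below models an (epoch, version) pair in standalone C++. EVersion and next_version are hypothetical names made up for this example, not Ceph's own types; the comparison (epoch first, then version) is assumed to mirror how eversion_t is ordered.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Simplified stand-in for an (epoch, version) pair; not Ceph's eversion_t.
struct EVersion {
    uint32_t epoch;    // osdmap epoch at the time of the update
    uint64_t version;  // per-PG update counter assigned by the primary
};

// Order by epoch first, then by version within the same epoch.
inline bool operator<(const EVersion& a, const EVersion& b) {
    return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

inline bool operator==(const EVersion& a, const EVersion& b) {
    return a.epoch == b.epoch && a.version == b.version;
}

// Hypothetical helper: the version the primary would give to the next
// update, from the newest log entry and the current osdmap epoch.
inline EVersion next_version(const EVersion& head, uint32_t current_epoch) {
    assert(current_epoch >= head.epoch);  // epochs never go backwards
    return EVersion{current_epoch, head.version + 1};
}
```

Note that the version counter keeps increasing across epoch changes, so two updates in different epochs are still ordered by a single timeline.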

The implementation maintains PGLog through three main data structures (the relevant code is in src/osd/osd_types.h):

  • pg_log_entry_t

The structure pg_log_entry_t holds a single record of the PG log. Its definition is as follows:

struct pg_log_entry_t {
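	ObjectModDesc mod_desc;                 // state for local rollback; used for rollback in EC mode

	bufferlist snaps;                       // for clone operations: the snap list of the object
	hobject_t  soid;                        // the object operated on
	osd_reqid_t reqid;                      // unique request id (caller + tid)


	vector<pair<osd_reqid_t, version_t> > extra_reqids;


	eversion_t version;                     // version of this operation
	eversion_t prior_version;               // version of the previous operation
	eversion_t reverting_to;                // version this operation reverts to (rollback only)

	version_t user_version;                 // the user version for this entry
	utime_t     mtime;                      // the user's local time
	
	__s32      op;                          // type of operation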
	bool invalid_hash;                      // only when decoding sobject_t based entries
	bool invalid_pool;                      // only when decoding pool-less hobject based entries

	...
};
  • pg_log_t

The structure pg_log_t holds all of the PG's operation log entries in memory, together with the related control fields:

/**
 * pg_log_t - incremental log of recent pg changes.
 *
 *  serves as a recovery queue for recent changes.
 */
struct pg_log_t {
	/*
	 *   head - newest entry (update|delete)
	 *   tail - entry previous to oldest (update|delete) for which we have
	 *          complete negative information.  
	 * i.e. we can infer pg contents for any store whose last_update >= tail.
	*/
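	eversion_t head;                           // head of the log: version of the newest entry
	eversion_t tail;                           // version just before the oldest pg log entry


	eversion_t can_rollback_to;                // for EC: versions that can be rolled back locally are all greater than can_rollback_to


	// With EC, different versions of the data are kept locally. This field marks the object
	// versions in this PG that may be deleted. rollback_info_trimmed_to <= can_rollback_to
	eversion_t rollback_info_trimmed_to;         

	// the list of all log entries
	list<pg_log_entry_t> log;  // the actual log.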
	
	...
};

Note that the PG log is recorded per PG and contains the modification records of all objects in that PG. The PG clock maintains a unified timing for all OSDs.

  • pg_info_t

The structure pg_info_t is a summary of the current PG's state:

/**
 * pg_info_t - summary of PG statistics.
 *
 * some notes: 
 *  - last_complete implies we have all objects that existed as of that
 *    stamp, OR a newer object, OR have already applied a later delete.
 *  - if last_complete >= log.bottom, then we know pg contents thru log.head.
 *    otherwise, we have no idea what the pg is supposed to contain.
 */
struct pg_info_t {
	spg_t pgid;                    // the PG ID

	// Version of the most recent update in the PG; it may not yet be complete on all OSDs.
	// Operations between last_complete and last_update have completed on some OSDs but not all.
	eversion_t last_update;        
	eversion_t last_complete;      // all versions before this one are complete on all OSDs (in-memory update only)

	epoch_t last_epoch_started;    // epoch in which this PG started
	
	version_t last_user_version;   // user version of the last updated object
	
	eversion_t log_tail;           // tail version of the log
	
	// Position of the last backfill. If backfill on this OSD has not finished, objects in
	// [last_backfill, last_complete) may be in the missing state
	hobject_t last_backfill;      


	bool last_backfill_bitwise;            //true if last_backfill reflects a bitwise (vs nibblewise) sort
	
	interval_set<snapid_t> purged_snaps;   // set of snaps to be removed from the PG
	
	pg_stat_t stats;                       // PG statistics
	
	pg_history_t history;                  // epochs and related info from the most recent PG peering
	pg_hit_set_history_t hit_set;          // hit_set used by Cache Tier
};

The following is a simple diagram of the relationship between the three:

[Figure: ceph-chapter6-6]

Where:

  • last_complete: all versions before this pointer have been applied on all OSDs (this only means the in-memory update is complete);

  • last_update: the version of the most recent update in the PG, which may not yet be applied on all OSDs. Operations between last_complete and last_update have completed on some OSDs but not all;

  • log_tail: points to the oldest record of the pg log;

  • head: the newest pg log record;

  • tail: points to the entry just before the oldest pg log record;

  • log: the list holding the actual pglog records.

As the structures above show, PGLog holds only metadata about object update operations; it carries no object data, offsets, or lengths. Recovery driven by PGLog therefore works on whole objects (the default object size is 4MB).
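
To make the whole-object point concrete, here is a minimal sketch, assuming a heavily simplified log entry that carries only an object name and a version. LogEntry and objects_to_recover are hypothetical names for this example and do not correspond to Ceph functions; the real recovery path is considerably more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified (epoch, version) pair; std::pair compares lexicographically.
using EVersion = std::pair<uint32_t, uint64_t>;

// Simplified log entry: which object changed, and at which version.
// There is deliberately no offset, length, or payload.
struct LogEntry {
    std::string object;
    EVersion version;
};

// Hypothetical helper: given an authoritative log (oldest entry first) and
// the last_update of a lagging replica, return the objects that replica has
// to recover, each mapped to the newest version it needs. Each object is
// recovered as a whole, because the log has no partial-write detail.
std::map<std::string, EVersion> objects_to_recover(
    const std::vector<LogEntry>& log, const EVersion& replica_last_update) {
    std::map<std::string, EVersion> missing;
    EVersion prev{0, 0};
    for (const LogEntry& e : log) {
        assert(prev < e.version);  // the log is strictly ordered by version
        prev = e.version;
        if (e.version > replica_last_update) {
            missing[e.object] = e.version;  // later entries overwrite earlier
        }
    }
    return missing;
}
```

An object modified several times after the replica fell behind appears only once in the result, with its newest version.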

In addition, here are two more concepts:

  • The epoch is a monotonically increasing sequence number maintained by the monitors. Whenever the cluster configuration or an OSD's state (up, down, in, out) changes, it is incremented by 1. The mechanism acts like a timeline, with each change marking a point on it. The epoch described here is the cluster-wide one, tied to the OSDs. For a PG, the epoch recorded in its versions does not follow every change of the cluster epoch; it changes only when the state of the OSDs hosting that PG changes.

As shown below:

[Figure: ceph-chapter6-7]

  • Building on how epochs grow, the second important concept is the interval.

Because a PG's epoch does not advance continuously along the time axis, the period between two consecutive changes of the PG's epoch is called an interval.
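
One way to picture intervals is to group consecutive osdmap epochs by the set of OSDs serving the PG. The sketch below does just that; Interval and split_into_intervals are hypothetical names, and the input map (epoch to OSD ids) is an assumption made for illustration rather than a structure taken from the Ceph source.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One interval: a maximal run of osdmap epochs during which the set of
// OSDs serving the PG did not change.
struct Interval {
    uint32_t first;
    uint32_t last;
    std::vector<int> acting;
};

// Hypothetical helper: acting_by_epoch maps each osdmap epoch to the OSD
// ids serving the PG in that epoch. A new interval starts whenever the set
// differs from the one in the previous epoch.
std::vector<Interval> split_into_intervals(
    const std::map<uint32_t, std::vector<int>>& acting_by_epoch) {
    std::vector<Interval> out;
    for (const auto& kv : acting_by_epoch) {
        if (!out.empty() && out.back().acting == kv.second) {
            out.back().last = kv.first;  // same OSDs: extend the interval
        } else {
            out.push_back(Interval{kv.first, kv.first, kv.second});
        }
    }
    return out;
}
```

With epochs 20 to 24 where the OSD set changes only at epoch 22, the helper yields two intervals, [20, 21] and [22, 24], even though the cluster epoch advanced five times.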

3. The storage method of PGLog

Having covered the format of PGLog, let's look at how it is stored. In the Ceph implementation, a write IO is first encapsulated into a transaction, and the transaction is written to the journal. Once the journal write completes, a callback chain is triggered; after passing through several threads and callbacks, the data is written to the buffer cache, which completes the whole path of writing the journal and the local cache (the details are described in the article "OSD Read and Write Process").

In general, PGLog is encapsulated into the same transaction and written to the journal disk along with it. Later, when the local cache is written, the contents of the transaction are traversed and the PGLog-related items are written to LevelDB, which completes the OSD's update of PGLog.

Note: The Ceph version discussed here is relatively old and uses filestore; the logic is different after switching to bluestore.

3.1 PGLog updated to journal

3.1.1 Write IO serialization to transaction

The read and write path on the primary OSD is described in "OSD Read and Write Process" and is not repeated here. In ReplicatedPG::do_osd_ops(), an op of type CEPH_OSD_OP_WRITE causes the write IO to be encapsulated into a transaction: the data to be written is encoded into ObjectStore::Transaction::tbl, which is a bufferlist. The op itself is encoded first, so that later processing can dispatch on it. Note that encode here simply means serialization.

The process of this transaction is as follows:

[Figure: ceph-chapter6-8]

3.1.2 PGLog serialized to transaction
  • ReplicatedPG::do_op() creates an OpContext, the context for an object modification, and ReplicatedPG::execute_ctx() then sets the PGLog version and related fields:
eversion_t get_next_version() const {
	eversion_t at_version(get_osdmap()->get_epoch(),
		pg_log.get_head().version+1);

	assert(at_version > info.last_update);
	assert(at_version > pg_log.get_head());
	return at_version;
}

void ReplicatedPG::do_op(OpRequestRef& op)
{
	...

	OpContext *ctx = new OpContext(op, m->get_reqid(), m->ops, obc, this);

	...

	execute_ctx(ctx);

	...
}

void ReplicatedPG::execute_ctx(OpContext *ctx){
	....
	
	// version
	ctx->at_version = get_next_version();
	ctx->mtime = m->get_mtime();

	...

	int result = prepare_transaction(ctx);

	...

	RepGather *repop = new_repop(ctx, obc, rep_tid);
	
	issue_repop(repop, ctx);
	eval_repop(repop);

	....
}

void ReplicatedPG::issue_repop(RepGather *repop, OpContext *ctx)
{
	...
	Context *on_all_commit = new C_OSD_RepopCommit(this, repop);
	Context *on_all_applied = new C_OSD_RepopApplied(this, repop);
	Context *onapplied_sync = new C_OSD_OndiskWriteUnlock(
		ctx->obc,
		ctx->clone_obc,
		unlock_snapset_obc ? ctx->snapset_obc : ObjectContextRef());

	pgbackend->submit_transaction(
		soid,
		ctx->at_version,
		std::move(ctx->op_t),
		pg_trim_to,
		min_last_complete_ondisk,
		ctx->log,
		ctx->updated_hset_history,
		onapplied_sync,
		on_all_applied,
		on_all_commit,
		repop->rep_tid,
		ctx->reqid,
		ctx->op);
}

As shown above, the third parameter of submit_transaction() is ctx->op_t. By this point, prepare_transaction() has already packed the object data to be modified into that transaction.

  • ReplicatedPG::prepare_transaction() calls ReplicatedPG::finish_ctx(), which calls ctx->log.push_back(); this constructs a pg_log_entry_t and appends it to the vector ctx->log:
int ReplicatedPG::prepare_transaction(OpContext *ctx)
{
	...
	finish_ctx(ctx,
		ctx->new_obs.exists ? pg_log_entry_t::MODIFY :
		pg_log_entry_t::DELETE); 
}

void ReplicatedPG::finish_ctx(OpContext *ctx, int log_op_type, bool maintain_ssc,
			      bool scrub_ok)
{
	// append to log
	ctx->log.push_back(pg_log_entry_t(log_op_type, soid, ctx->at_version,
				ctx->obs->oi.version,
				ctx->user_at_version, ctx->reqid,
				ctx->mtime));
	...
}


/**
 * pg_log_entry_t - single entry/event in pg log
 *
 * (src/osd/osd_types.h)
 */
struct pg_log_entry_t {
	// describes state for a locally-rollbackable entry
	ObjectModDesc mod_desc;
	bufferlist snaps;                                       // only for clone entries
	hobject_t  soid;
	osd_reqid_t reqid;                                      // caller+tid to uniquely identify request
	vector<pair<osd_reqid_t, version_t> > extra_reqids;
	eversion_t version, prior_version, reverting_to;
	version_t user_version;                                 // the user version for this entry
	utime_t     mtime;                                      // this is the _user_ mtime, mind you
	
	__s32      op;
	bool invalid_hash;                                     // only when decoding sobject_t based entries
	bool invalid_pool;                                     // only when decoding pool-less hobject based entries

	pg_log_entry_t()
		: user_version(0), op(0),
		invalid_hash(false), invalid_pool(false) {

	}

	pg_log_entry_t(int _op, const hobject_t& _soid,
		const eversion_t& v, const eversion_t& pv,
		version_t uv,const osd_reqid_t& rid, const utime_t& mt)
		: soid(_soid), reqid(rid), version(v), prior_version(pv), user_version(uv),
		mtime(mt), op(_op), invalid_hash(false), invalid_pool(false){

	}
};

As shown above, ctx->at_version is passed to pg_log_entry_t.version, ctx->obs->oi.version is passed to pg_log_entry_t.prior_version, and ctx->user_at_version is passed to pg_log_entry_t.user_version.

For ctx->obs->oi.version, its value is assigned in the following function:

void ReplicatedPG::do_op(OpRequestRef& op)
{
	...

	int r = find_object_context(
		oid, &obc, can_create,
		m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
		&missing_oid);

	...

	execute_ctx(ctx);

	...
}
  • ReplicatedBackend::submit_transaction() calls parent->log_operation() to serialize PGLog into the transaction; the PGLog-related information is serialized in PG::append_log().
// issue_repop() handles replication operations
void ReplicatedPG::issue_repop(RepGather *repop, OpContext *ctx)
{
	...

	Context *on_all_commit = new C_OSD_RepopCommit(this, repop);
	Context *on_all_applied = new C_OSD_RepopApplied(this, repop);
	Context *onapplied_sync = new C_OSD_OndiskWriteUnlock(
		ctx->obc,
		ctx->clone_obc,
		unlock_snapset_obc ? ctx->snapset_obc : ObjectContextRef());

	pgbackend->submit_transaction(
		soid,
		ctx->at_version,
		std::move(ctx->op_t),
		pg_trim_to,
		min_last_complete_ondisk,
		ctx->log,
		ctx->updated_hset_history,
		onapplied_sync,
		on_all_applied,
		on_all_commit,
		repop->rep_tid,
		ctx->reqid,
		ctx->op);

	...
}


void ReplicatedBackend::submit_transaction(
  const hobject_t &soid,
  const eversion_t &at_version,
  PGTransactionUPtr &&_t,
  const eversion_t &trim_to,
  const eversion_t &trim_rollback_to,
  const vector<pg_log_entry_t> &log_entries,
  boost::optional<pg_hit_set_history_t> &hset_history,
  Context *on_local_applied_sync,
  Context *on_all_acked,
  Context *on_all_commit,
  ceph_tid_t tid,
  osd_reqid_t reqid,
  OpRequestRef orig_op)
{
	std::unique_ptr<RPGTransaction> t(
		static_cast<RPGTransaction*>(_t.release()));
	assert(t);
	ObjectStore::Transaction op_t = t->get_transaction();

	...
	parent->log_operation(
		log_entries,
		hset_history,
		trim_to,
		trim_rollback_to,
		true,
		op_t);

	....
}

//src/osd/ReplicatedPG.h
void log_operation(
    const vector<pg_log_entry_t> &logv,
    boost::optional<pg_hit_set_history_t> &hset_history,
    const eversion_t &trim_to,
    const eversion_t &trim_rollback_to,
    bool transaction_applied,
    ObjectStore::Transaction &t) {
	if (hset_history) {
		info.hit_set = *hset_history;
		dirty_info = true;
	}
	append_log(logv, trim_to, trim_rollback_to, t, transaction_applied);
}

void PG::append_log(
  const vector<pg_log_entry_t>& logv,
  eversion_t trim_to,
  eversion_t trim_rollback_to,
  ObjectStore::Transaction &t,
  bool transaction_applied)
{
	// serialize the log entries
}

Note that, as shown above, PGLog is serialized into the same Transaction as the actual object data.

  • Serialize logs in ctx->log

In ReplicatedPG::prepare_transaction(), we constructed the pg_log_entry_t object and put it in ctx->log. Then, in the following function, the information related to PGLog will be serialized into the transaction ctx->op_t:

void PG::append_log(
  const vector<pg_log_entry_t>& logv,
  eversion_t trim_to,
  eversion_t trim_rollback_to,
  ObjectStore::Transaction &t,
  bool transaction_applied)
{
	/* The primary has sent an info updating the history, but it may not
	* have arrived yet.  We want to make sure that we cannot remember this
	* write without remembering that it happened in an interval which went
	* active in epoch history.last_epoch_started.
	* 
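	* (Note: after the PG finishes peering, the Primary sends a message to update the replica
	* PG's history, but because of network delay and the like, that message may not have arrived
	* yet. Here we make sure that only writes made after the PG went active are valid; that
	* epoch is recorded in history.last_epoch_started.)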
	*/
	if (info.last_epoch_started != info.history.last_epoch_started) {
		info.history.last_epoch_started = info.last_epoch_started;
	}
	dout(10) << "append_log " << pg_log.get_log() << " " << logv << dendl;
	
	for (vector<pg_log_entry_t>::const_iterator p = logv.begin();p != logv.end();++p) {
		add_log_entry(*p);
	}

	
	// update the local pg, pg log
	dirty_info = true;
	write_if_dirty(t);
}

void PG::add_log_entry(const pg_log_entry_t& e)
{
	// raise last_complete only if we were previously up to date
	if (info.last_complete == info.last_update)
		info.last_complete = e.version;
	
	// raise last_update.
	assert(e.version > info.last_update);
	info.last_update = e.version;
	
	// raise user_version, if it increased (it may have not get bumped
	// by all logged updates)
	if (e.user_version > info.last_user_version)
		info.last_user_version = e.user_version;
	
	// log mutation
	pg_log.add(e);
	dout(10) << "add_log_entry " << e << dendl;
}

//src/osd/pglog.h
void add(const pg_log_entry_t& e) {
	mark_writeout_from(e.version);
	log.add(e);
}

In PG::append_log() above, PG::add_log_entry() is called first to add each entry to pg_log, the in-memory cache that makes later queries easy. Then write_if_dirty() is called:

void PG::write_if_dirty(ObjectStore::Transaction& t)
{
	map<string,bufferlist> km;
	if (dirty_big_info || dirty_info)
		prepare_write_info(&km);

	pg_log.write_log(t, &km, coll, pgmeta_oid, pool.info.require_rollback());
	if (!km.empty())
		t.omap_setkeys(coll, pgmeta_oid, km);
}

In PG::write_if_dirty(), dirty_info was set to true in PG::append_log(), so prepare_write_info() is always called first; it may pack the current epoch and the pg_info into km. After pg_log.write_log() has run, if km is not empty, t.omap_setkeys() packs its contents into the transaction.

Note: As shown above, besides the entries for the objects currently being updated, the keys written along with PGLog may also include the following:

// prefix pgmeta_oid keys with _ so that PGLog::read_log() can
// easily skip them

const string infover_key("_infover");
const string info_key("_info");
const string biginfo_key("_biginfo");
const string epoch_key("_epoch");

Now let's look at pg_log.write_log():

void PGLog::write_log(
  ObjectStore::Transaction& t,
  map<string,bufferlist> *km,
  const coll_t& coll, const ghobject_t &log_oid,
  bool require_rollback)
{
	if (is_dirty()) {
		dout(5) << "write_log with: "
			<< "dirty_to: " << dirty_to
			<< ", dirty_from: " << dirty_from
			<< ", dirty_divergent_priors: "
			<< (dirty_divergent_priors ? "true" : "false")
			<< ", divergent_priors: " << divergent_priors.size()
			<< ", writeout_from: " << writeout_from
			<< ", trimmed: " << trimmed
			<< dendl;

		_write_log(
			t, km, log, coll, log_oid, divergent_priors,
			dirty_to,
			dirty_from,
			writeout_from,
			trimmed,
			dirty_divergent_priors,
			!touched_log,
			require_rollback,
			(pg_log_debug ? &log_keys_debug : 0));

		undirty();
	} else {
		dout(10) << "log is not dirty" << dendl;
	}
}

void PGLog::_write_log(
  ObjectStore::Transaction& t,
  map<string,bufferlist> *km,
  pg_log_t &log,
  const coll_t& coll, const ghobject_t &log_oid,
  map<eversion_t, hobject_t> &divergent_priors,
  eversion_t dirty_to,
  eversion_t dirty_from,
  eversion_t writeout_from,
  const set<eversion_t> &trimmed,
  bool dirty_divergent_priors,
  bool touch_log,
  bool require_rollback,
  set<string> *log_keys_debug
  )
{
	...
	for (list<pg_log_entry_t>::iterator p = log.log.begin();p != log.log.end() && p->version <= dirty_to; ++p) {
		bufferlist bl(sizeof(*p) * 2);
		p->encode_with_checksum(bl);
		(*km)[p->get_key_name()].claim(bl);
	}
	
	for (list<pg_log_entry_t>::reverse_iterator p = log.log.rbegin();
		p != log.log.rend() &&(p->version >= dirty_from || p->version >= writeout_from) &&p->version >= dirty_to; ++p) {

		bufferlist bl(sizeof(*p) * 2);
		p->encode_with_checksum(bl);
		(*km)[p->get_key_name()].claim(bl);
	}

	...
}
void pg_log_entry_t::encode_with_checksum(bufferlist& bl) const
{
	bufferlist ebl(sizeof(*this)*2);
	encode(ebl);
	__u32 crc = ebl.crc32c(0);
	::encode(ebl, bl);
	::encode(crc, bl);
}

As shown above, PGLog::_write_log() puts the pg_log_entry_t data into the bufferlists held in km, and PG::write_if_dirty() finally packs these bufferlists into the transaction.

Note: The transaction where PGLog is located is the same as the transaction where the actual object data is located
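
The key/value layout can be sketched in a few lines of standalone C++. Everything here is illustrative: key_name, put_entry and check_entry are hypothetical helpers, the zero-padded key format is an assumption chosen so that string order matches version order, and the CRC routine is a plain bitwise CRC-32C with the standard seed rather than Ceph's bufferlist::crc32c(0).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Bitwise CRC-32C (Castagnoli) with the standard initial value and final
// XOR. This is a stand-in, not the seed convention Ceph itself uses.
uint32_t crc32c(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int i = 0; i < 8; ++i)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// Zero-padded key so that string order matches (epoch, version) order.
// The exact widths are an assumption made for this sketch.
std::string key_name(uint32_t epoch, uint64_t version) {
    char buf[40];
    std::snprintf(buf, sizeof(buf), "%010u.%020llu",
                  static_cast<unsigned>(epoch),
                  static_cast<unsigned long long>(version));
    return buf;
}

// Store the payload followed by its 4-byte little-endian checksum.
void put_entry(std::map<std::string, std::string>& km, uint32_t epoch,
               uint64_t version, const std::string& payload) {
    uint32_t crc = crc32c(payload);
    std::string value = payload;
    for (int i = 0; i < 4; ++i)
        value.push_back(static_cast<char>((crc >> (8 * i)) & 0xFF));
    km[key_name(epoch, version)] = value;
}

// Recompute the checksum; false means the value is corrupt or too short.
bool check_entry(const std::string& value) {
    if (value.size() < 4) return false;
    std::string payload = value.substr(0, value.size() - 4);
    uint32_t stored = 0;
    for (int i = 0; i < 4; ++i)
        stored |= static_cast<uint32_t>(static_cast<unsigned char>(
                      value[value.size() - 4 + i])) << (8 * i);
    return stored == crc32c(payload);
}
```

Because the keys sort in version order, the oldest entry is always the first key in the map, which is convenient when a range of old entries has to be removed.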

  • Complete the writing of the PGLog data and the object data
void ReplicatedBackend::submit_transaction(
  const hobject_t &soid,
  const eversion_t &at_version,
  PGTransactionUPtr &&_t,
  const eversion_t &trim_to,
  const eversion_t &trim_rollback_to,
  const vector<pg_log_entry_t> &log_entries,
  boost::optional<pg_hit_set_history_t> &hset_history,
  Context *on_local_applied_sync,
  Context *on_all_acked,
  Context *on_all_commit,
  ceph_tid_t tid,
  osd_reqid_t reqid,
  OpRequestRef orig_op)
{
	...

	op_t.register_on_applied_sync(on_local_applied_sync);
	op_t.register_on_applied(
		parent->bless_context(
			new C_OSD_OnOpApplied(this, &op)));

	op_t.register_on_commit(
		parent->bless_context(
			new C_OSD_OnOpCommit(this, &op)));
	
	vector<ObjectStore::Transaction> tls;
	tls.push_back(std::move(op_t));
	
	parent->queue_transactions(tls, op.op);
}

void ReplicatedPG::queue_transactions(vector<ObjectStore::Transaction>& tls, OpRequestRef op) {
	osd->store->queue_transactions(osr.get(), tls, 0, 0, 0, op, NULL);
}

The construction of Transaction is completed in the previous steps, and ReplicatedPG::queue_transactions() is called here to write to ObjectStore.

3.1.3 Trim Log

As mentioned earlier, the number of PGLog records is limited: 3000 by default under normal conditions (controlled by osd_min_pg_log_entries), rising to 10000 by default when the PG is degraded (controlled by osd_max_pg_log_entries). Once the limit is reached, the log is trimmed.

ReplicatedPG::execute_ctx() calls ReplicatedPG::calc_trim_to() to compute the trim point. Trimming starts from the tail of the log (tail points to the entry just before the oldest record), and the number of entries to trim is log.head - log.tail - max_entries. The trim must also respect min_last_complete_ondisk. This is the smallest last_complete across the replicas, computed on the primary OSD as the minimum of its own last_complete_ondisk and the last_complete_ondisk reported by each replica OSD (peer_last_complete_ondisk). The trim may not go past min_last_complete_ondisk, because doing so would discard pg log entries that have not yet been written to disk. As a result, the number of pglog records may at times exceed max_entries. For example:

[Figure: ceph-chapter6-9]

In ReplicatedPG::log_operation(), trim_to is pg_trim_to and trim_rollback_to is min_last_complete_ondisk. PG::append_log(), called from log_operation(), calls pg_log.trim(&handler, trim_to, info), which adds the keys to be trimmed to the set PGLog::trimmed. _write_log() then inserts the elements of trimmed into to_remove, and finally calls t.omap_rmkeys() to serialize them into the transaction's bufferlist.

void ReplicatedPG::execute_ctx(OpContext *ctx){
	calc_trim_to();
}


void ReplicatedPG::log_operation(
    const vector<pg_log_entry_t> &logv,
    boost::optional<pg_hit_set_history_t> &hset_history,
    const eversion_t &trim_to,
    const eversion_t &trim_rollback_to,
    bool transaction_applied,
	ObjectStore::Transaction &t) {

	if (hset_history) {
		info.hit_set = *hset_history;
		dirty_info = true;
	}
	append_log(logv, trim_to, trim_rollback_to, t, transaction_applied);
}

void PG::append_log(
  const vector<pg_log_entry_t>& logv,
  eversion_t trim_to,
  eversion_t trim_rollback_to,
  ObjectStore::Transaction &t,
  bool transaction_applied){

	pg_log.trim(&handler, trim_to, info);
	write_if_dirty(t);
}

void PGLog::_write_log(
  ObjectStore::Transaction& t,
  map<string,bufferlist> *km,
  pg_log_t &log,
  const coll_t& coll, const ghobject_t &log_oid,
  map<eversion_t, hobject_t> &divergent_priors,
  eversion_t dirty_to,
  eversion_t dirty_from,
  eversion_t writeout_from,
  const set<eversion_t> &trimmed,
  bool dirty_divergent_priors,
  bool touch_log,
  bool require_rollback,
  set<string> *log_keys_debug
  ){
	set<string> to_remove;
	for (set<eversion_t>::const_iterator i = trimmed.begin();
	  i != trimmed.end();
	  ++i) {
		to_remove.insert(i->get_key_name());
		if (log_keys_debug) {
			assert(log_keys_debug->count(i->get_key_name()));
			log_keys_debug->erase(i->get_key_name());
		}
	}


	...
	if (!to_remove.empty())
    	t.omap_rmkeys(coll, log_oid, to_remove);
}
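
The trim rule described in this section can be approximated as follows. This is a simplified sketch under stated assumptions, not the body of ReplicatedPG::calc_trim_to(): the log is reduced to a list of versions, and calc_trim_to and trim here are hypothetical free functions written for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <utility>

using EVersion = std::pair<uint32_t, uint64_t>;  // (epoch, version)

// Hypothetical helper: pick the version up to which the log may be trimmed.
// `log` holds entry versions, oldest first. Returns (0, 0) when nothing
// should be trimmed.
EVersion calc_trim_to(const std::list<EVersion>& log, std::size_t max_entries,
                      const EVersion& min_last_complete_ondisk) {
    EVersion trim_to{0, 0};
    if (log.size() <= max_entries) return trim_to;
    std::size_t num_to_trim = log.size() - max_entries;
    for (const EVersion& v : log) {
        if (num_to_trim == 0) break;
        // Never trim past what every replica has safely on disk.
        if (v > min_last_complete_ondisk) break;
        trim_to = v;
        --num_to_trim;
    }
    return trim_to;
}

// Drop every entry with version <= trim_to; returns how many were removed.
std::size_t trim(std::list<EVersion>& log, const EVersion& trim_to) {
    std::size_t removed = 0;
    while (!log.empty() && log.front() <= trim_to) {
        log.pop_front();
        ++removed;
    }
    return removed;
}
```

When min_last_complete_ondisk lags behind, the loop stops early and the log stays longer than max_entries, which is the situation the section describes.
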
3.1.4 PGLog written to the journal disk

PGLog reaches the journal disk through the same path as any other journal write, as follows:

  • ReplicatedBackend::submit_transaction() calls log_operation() to serialize PGLog into the transaction, then calls queue_transactions() to hand it to the later stages:
void ReplicatedBackend::submit_transaction(...)
{
	...
	parent->log_operation(
		log_entries,
		hset_history,
		trim_to,
		trim_rollback_to,
		true,
		op_t);

	...

	vector<ObjectStore::Transaction> tls;
	tls.push_back(std::move(op_t));

	parent->queue_transactions(tls, op.op);
}

Here ReplicatedPG implements the PGBackend::Listener interface:

class ReplicatedPG : public PG, public PGBackend::Listener {
};

ReplicatedPG::ReplicatedPG(OSDService *o, OSDMapRef curmap,const PGPool &_pool, spg_t p) :
	PG(o, curmap, _pool, p),
	pgbackend(
		PGBackend::build_pg_backend(
			_pool.info, curmap, this, coll_t(p), ch, o->store, cct)),
	object_contexts(o->cct, g_conf->osd_pg_object_context_cache_count),
	snapset_contexts_lock("ReplicatedPG::snapset_contexts"),
	backfills_in_flight(hobject_t::Comparator(true)),
	pending_backfill_updates(hobject_t::Comparator(true)),
	new_backfill(false),
	temp_seq(0),
	snap_trimmer_machine(this)
{ 
	missing_loc.set_backend_predicates(
		pgbackend->get_is_readable_predicate(),
		pgbackend->get_is_recoverable_predicate());

	snap_trimmer_machine.initiate();
}

So the parent->queue_transactions() called here is ReplicatedPG::queue_transactions().

  • When the call reaches FileStore::queue_transactions(), the transaction list is used to build a FileStore::Op, and the list itself is stored in FileStore::Op::tls:
void ReplicatedPG::queue_transactions(vector<ObjectStore::Transaction>& tls, OpRequestRef op) {
    osd->store->queue_transactions(osr.get(), tls, 0, 0, 0, op, NULL);
}

int ObjectStore::queue_transactions(Sequencer *osr, vector<Transaction>& tls,
		 Context *onreadable, Context *ondisk=0,
		 Context *onreadable_sync=0,
		 TrackedOpRef op = TrackedOpRef(),
		 ThreadPool::TPHandle *handle = NULL) {
	assert(!tls.empty());
	tls.back().register_on_applied(onreadable);
	tls.back().register_on_commit(ondisk);
	tls.back().register_on_applied_sync(onreadable_sync);
	return queue_transactions(osr, tls, op, handle);
}

int FileStore::queue_transactions(Sequencer *posr, vector<Transaction>& tls,
				  TrackedOpRef osd_op,
				  ThreadPool::TPHandle *handle)
{
	...
	if (journal && journal->is_writeable() && !m_filestore_journal_trailing) {
		Op *o = build_op(tls, onreadable, onreadable_sync, osd_op);
		...
	}

	...
}
  • Then traverse vector &tls in FileJournal::prepare_entry(), and encode ObjectStore::Transaction into a bufferlist (marked as tbl)
int FileJournal::prepare_entry(vector<ObjectStore::Transaction>& tls, bufferlist* tbl) {
	
}
  • Then call FileJournal::submit_entry() in the JournalingObjectStore::_op_journal_transactions() function, construct the bufferlist into write_item and put it in writeq
void JournalingObjectStore::_op_journal_transactions(
  bufferlist& tbl, uint32_t orig_len, uint64_t op,
  Context *onjournal, TrackedOpRef osd_op)
{
	if (osd_op.get())
		dout(10) << "op_journal_transactions " << op << " reqid_t "
		<< (static_cast<OpRequest *>(osd_op.get()))->get_reqid() << dendl;
	else
		dout(10) << "op_journal_transactions " << op  << dendl;
	
	if (journal && journal->is_writeable()) {
		journal->submit_entry(op, tbl, orig_len, onjournal, osd_op);
	} else if (onjournal) {
		apply_manager.add_waiter(op, onjournal);
	}
}

void FileJournal::submit_entry(uint64_t seq, bufferlist& e, uint32_t orig_len,
			       Context *oncommit, TrackedOpRef osd_op)
{
	...
}
  • Then in the FileJournal::write_thread_entry() function, the write_item will be taken from writeq and placed in another bufferlist
void FileJournal::write_thread_entry()
{
	...

	bufferlist bl;
	int r = prepare_multi_write(bl, orig_ops, orig_bytes);
	
	...
}
  • Finally, call do_write() to asynchronously write the contents of bufferlist to disk (that is, write journal)
void FileJournal::write_thread_entry()
{
	...
	#ifdef HAVE_LIBAIO
		if (aio)
			do_aio_write(bl);
		else
			do_write(bl);
	#else
		do_write(bl);
	#endif
	...
}
3.1.5 PGLog written to leveldb

Once the journal write above has completed, another thread asynchronously writes the journaled data to its real destinations: the objects, the pglog, and so on. Here we focus on persisting the pglog.

"OSD Read and Write Process" describes how FileStore::_do_op() writes data to the local cache. Writing pglog to leveldb starts from the same place, with different handling for each op type.

void FileStore::_do_op(OpSequencer *osr, ThreadPool::TPHandle &handle)
{
	if (!m_disable_wbthrottle) {
		wbthrottle.throttle();
	}
	// inject a stall?
	if (g_conf->filestore_inject_stall) {
		int orig = g_conf->filestore_inject_stall;
		dout(5) << "_do_op filestore_inject_stall " << orig << ", sleeping" << dendl;

		for (int n = 0; n < g_conf->filestore_inject_stall; n++)
			sleep(1);

		g_conf->set_val("filestore_inject_stall", "0");
		dout(5) << "_do_op done stalling" << dendl;
	}
	
	osr->apply_lock.Lock();
	Op *o = osr->peek_queue();
	apply_manager.op_apply_start(o->op);
	dout(5) << "_do_op " << o << " seq " << o->op << " " << *osr << "/" << osr->parent << " start" << dendl;

	int r = _do_transactions(o->tls, o->op, &handle);
	apply_manager.op_apply_finish(o->op);
	dout(10) << "_do_op " << o << " seq " << o->op << " r = " << r << ", finisher " << o->onreadable << " " << o->onreadable_sync << dendl;
	
	o->tls.clear();

}

int FileStore::_do_transactions(
  vector<Transaction> &tls,
  uint64_t op_seq,
  ThreadPool::TPHandle *handle)
{
}

1) OP_OMAP_SETKEYS (PGLog is written to leveldb through this op)

void FileStore::_do_transaction(
  Transaction& t, uint64_t op_seq, int trans_num,
  ThreadPool::TPHandle *handle){
	switch (op->op) {
		case Transaction::OP_OMAP_SETKEYS:
		{
			coll_t cid = i.get_cid(op->cid);
			ghobject_t oid = i.get_oid(op->oid);
			_kludge_temp_object_collection(cid, oid);

			map<string, bufferlist> aset;
			i.decode_attrset(aset);
			tracepoint(objectstore, omap_setkeys_enter, osr_name);

			r = _omap_setkeys(cid, oid, aset, spos);
			tracepoint(objectstore, omap_setkeys_exit, r);
		}
		break; 
	}
}

int FileStore::_omap_setkeys(const coll_t& cid, const ghobject_t &hoid,
			     const map<string, bufferlist> &aset,
			     const SequencerPosition &spos)
{
	...
	 r = object_map->set_keys(hoid, aset, &spos);
}

int DBObjectMap::set_keys(const ghobject_t &oid,
			  const map<string, bufferlist> &set,
			  const SequencerPosition *spos){
	t->set(user_prefix(header), set);
	return db->submit_transaction(t);
}

int LevelDBStore::submit_transaction(KeyValueDB::Transaction t)
{
	utime_t start = ceph_clock_now(g_ceph_context);
	LevelDBTransactionImpl * _t =static_cast<LevelDBTransactionImpl *>(t.get());
	leveldb::Status s = db->Write(leveldb::WriteOptions(), &(_t->bat));
	utime_t lat = ceph_clock_now(g_ceph_context) - start;


	logger->inc(l_leveldb_txns);
	logger->tinc(l_leveldb_submit_latency, lat);
	return s.ok() ? 0 : -1;
}

2) Another example: OP_OMAP_RMKEYS (this op is used when trimming the pglog)

// The preceding flow is the same as above

int DBObjectMap::rm_keys(const ghobject_t &oid,
			 const set<string> &to_clear,
			 const SequencerPosition *spos)
{
	t->rmkeys(user_prefix(header), to_clear);
	return db->submit_transaction(t);
}
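
The two omap operations above can be illustrated with a small stand-alone sketch. It is not Ceph code: ToyOmap, log_key() and keys_to_trim() are hypothetical names, and the zero-padded "epoch.version" key layout is an assumption chosen so that string order matches (epoch, version) order. The sketch writes log entries with a set_keys() call and trims old ones with rm_keys().

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <string>

// Hypothetical helper: zero-padded "epoch.version" key, so that
// lexicographic key order matches (epoch, version) order.
std::string log_key(uint32_t epoch, uint64_t version) {
    char buf[40];
    std::snprintf(buf, sizeof(buf), "%010u.%020llu",
                  static_cast<unsigned>(epoch),
                  static_cast<unsigned long long>(version));
    return buf;
}

// Toy stand-in for the key/value store behind the omap.
struct ToyOmap {
    std::map<std::string, std::string> kv;

    // Analogue of OP_OMAP_SETKEYS: write a batch of keys.
    void set_keys(const std::map<std::string, std::string>& aset) {
        for (const auto& p : aset) kv[p.first] = p.second;
    }

    // Analogue of OP_OMAP_RMKEYS: remove a batch of keys.
    void rm_keys(const std::set<std::string>& to_clear) {
        for (const auto& k : to_clear) kv.erase(k);
    }
};

// Collect every key at or below trim_to; the caller passes the
// result to rm_keys().
std::set<std::string> keys_to_trim(const ToyOmap& omap,
                                   const std::string& trim_to) {
    std::set<std::string> out;
    for (const auto& p : omap.kv) {
        if (p.first > trim_to) break;  // map is sorted; stop at first newer key
        out.insert(p.first);
    }
    return out;
}
```

Because the keys are fixed-width, trimming reduces to removing a prefix of the sorted key range.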

The advantage of encapsulating the PGLog in the transaction and writing it to disk together with the journal: if the OSD crashes after the journal write has completed, but before the data has been written to disk and the corresponding pg log has been written to leveldb, the OSD performs a journal replay when it starts up again. The complete transaction is read back from the journal and processed, so the data is written to disk and the pglog is written to leveldb.
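
The replay behaviour can be sketched with a toy model. This is only an illustration under stated assumptions, not FileStore's implementation: ToyTxn and ToyStore are hypothetical types, the journal is an in-memory vector, and a crash is simulated by skipping the apply step.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model; none of these types exist in Ceph.
struct ToyTxn {
    uint64_t seq;           // journal sequence number
    std::string oid;        // object to write
    std::string data;       // object payload
    std::string log_key;    // omap key holding the pg log entry
    std::string log_value;  // encoded pg log entry
};

struct ToyStore {
    std::vector<ToyTxn> journal;                 // stands in for the durable journal
    std::map<std::string, std::string> objects;  // stands in for the data disk
    std::map<std::string, std::string> omap;     // stands in for leveldb
    uint64_t applied_seq = 0;                    // last sequence fully applied

    // Journal first; apply only if we do not "crash" in between.
    void submit(const ToyTxn& t, bool crash_before_apply) {
        journal.push_back(t);
        if (!crash_before_apply) apply(t);
    }

    // Object data and pg log entry are applied together, as one unit.
    void apply(const ToyTxn& t) {
        objects[t.oid] = t.data;
        omap[t.log_key] = t.log_value;
        applied_seq = t.seq;
    }

    // On restart, re-apply every journaled transaction not yet applied.
    void replay() {
        for (const ToyTxn& t : journal)
            if (t.seq > applied_seq) apply(t);
    }
};
```

Since the data write and the pg log entry travel in the same journaled transaction, replay restores both or neither.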

4. How to view PGLog

The specific PGLog content can be viewed using the following tools:

4.1 The overall pglog information of a certain PG

1) Stop the running OSD, obtain its mount path, and use the following command to list its PGs

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13/ --journal-path /dev/disk/by-id/virtio-ca098220-3a26-404a-8-part1 --type filestore --op list-pgs
41.2
32.4
39.4
41.a
53.1ab
53.5a
54.31
42.d
53.1da
41.d2
41.9f
41.bd
36.0
...

Note: For the filestore type, the --journal-path option must also be specified; for the bluestore type, this option is not needed.

2) Get specific pg_log_t information

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13/ --journal-path /dev/disk/by-id/virtio-ca098220-3a26-404a-8-part1 --type filestore --pgid 53.62 --op log 
{
    "pg_log_t": {
        "head": "18042'298496",
        "tail": "9354'295445",
        "log": [
            {
                "op": "modify  ",
                "object": "53:462f48b1:::5c470d18-0d9e-4a34-8a6c-7a6d64784c3e.355583.2__multipart_487ead6c025ac8a2dce846bafd222c11_23.2~g0pDMHqf8sMPKcIj1WoP79w2gNICzeA.1:head",
                "version": "9354'295446",
                "prior_version": "9354'295445",
                "reqid": "client.955398.0:2032",
                "extra_reqids": [],
                "mtime": "2019-12-12 14:38:07.371069",
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
            {
                "op": "modify  ",
                "object": "53:462f48b1:::5c470d18-0d9e-4a34-8a6c-7a6d64784c3e.355583.2__multipart_487ead6c025ac8a2dce846bafd222c11_23.2~g0pDMHqf8sMPKcIj1WoP79w2gNICzeA.1:head",
                "version": "9354'295447",
                "prior_version": "9354'295446",
                "reqid": "client.955398.0:2033",
                "extra_reqids": [],
                "mtime": "2019-12-12 14:38:07.373229",
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": false,
                        "rollback_info_completed": false,
                        "ops": []
                    }
                }
            },
	....

Note: For some PGs, the pg log information may be empty.

3) Obtain specific pg_info_t information

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-13/ --journal-path /dev/disk/by-id/virtio-ca098220-3a26-404a-8-part1 --type filestore --pgid 53.62 --op info
{
    "pgid": "53.62",
    "last_update": "18042'298496",
    "last_complete": "18042'298496",
    "log_tail": "9354'295445",
    "last_user_version": 298496,
    "last_backfill": "MAX",
    "last_backfill_bitwise": 1,
    "purged_snaps": "[]",
    "history": {
        "epoch_created": 748,
        "last_epoch_started": 18278,
        "last_epoch_clean": 18278,
        "last_epoch_split": 0,
        "last_epoch_marked_full": 431,
        "same_up_since": 18276,
        "same_interval_since": 18277,
        "same_primary_since": 17958,
        "last_scrub": "17833'298491",
        "last_scrub_stamp": "2020-06-03 15:16:15.370988",
        "last_deep_scrub": "17833'298491",
        "last_deep_scrub_stamp": "2020-06-03 15:16:15.370988",
        "last_clean_scrub_stamp": "2020-06-03 15:16:15.370988"
    },
    "stats": {
        "version": "18017'298495",
        "reported_seq": "545855",
        "reported_epoch": "18042",
        "state": "active+clean",
        "last_fresh": "2020-06-05 19:23:43.328125",
        "last_change": "2020-06-05 19:23:43.328125",
        "last_active": "2020-06-05 19:23:43.328125",
        "last_peered": "2020-06-05 19:23:43.328125",
        "last_clean": "2020-06-05 19:23:43.328125",
        "last_became_active": "2020-06-05 19:23:43.327839",
        "last_became_peered": "2020-06-05 19:23:43.327839",
        "last_unstale": "2020-06-05 19:23:43.328125",
        "last_undegraded": "2020-06-05 19:23:43.328125",
        "last_fullsized": "2020-06-05 19:23:43.328125",
        "mapping_epoch": 18276,
        "log_start": "9354'295445",
        "ondisk_log_start": "9354'295445",
        "created": 748,
        "last_epoch_clean": 18042,
        "parent": "0.0",
        "parent_split_bits": 8,
        "last_scrub": "17833'298491",
        "last_scrub_stamp": "2020-06-03 15:16:15.370988",
        "last_deep_scrub": "17833'298491",
        "last_deep_scrub_stamp": "2020-06-03 15:16:15.370988",
        "last_clean_scrub_stamp": "2020-06-03 15:16:15.370988",
        "log_size": 3050,
        "ondisk_log_size": 3050,
        "stats_invalid": false,
        "dirty_stats_invalid": false,
        "omap_stats_invalid": false,
        "hitset_stats_invalid": false,
        "hitset_bytes_stats_invalid": false,
        "pin_stats_invalid": false,
        "stat_sum": {
            "num_bytes": 967900926,
            "num_objects": 23059,
            "num_object_clones": 0,
            "num_object_copies": 69180,
            "num_objects_missing_on_primary": 0,
            "num_objects_missing": 0,
            "num_objects_degraded": 0,
            "num_objects_misplaced": 0,
            "num_objects_unfound": 0,
            "num_objects_dirty": 23059,
            "num_whiteouts": 0,
            "num_read": 35977,
            "num_read_kb": 1905404,
            "num_write": 244022,
            "num_write_kb": 3560580,
            "num_scrub_errors": 0,
            "num_shallow_scrub_errors": 0,
            "num_deep_scrub_errors": 0,
            "num_objects_recovered": 1833,
            "num_bytes_recovered": 333589036,
            "num_keys_recovered": 0,
            "num_objects_omap": 0,
            "num_objects_hit_set_archive": 0,
            "num_bytes_hit_set_archive": 0,
            "num_flush": 0,
            "num_flush_kb": 0,
            "num_evict": 0,
            "num_evict_kb": 0,
            "num_promote": 0,
            "num_flush_mode_high": 0,
            "num_flush_mode_low": 0,
            "num_evict_mode_some": 0,
            "num_evict_mode_full": 0,
            "num_objects_pinned": 0
        },
        "up": [
            12,
            14,
            13
        ],
        "acting": [
            12,
            14,
            13
        ],
        "blocked_by": [],
        "up_primary": 12,
        "acting_primary": 12
    },
    "empty": 0,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 18278,
    "hit_set_history": {
        "current_last_update": "0'0",
        "history": []
    }
}
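
Fields such as head, tail, last_update and log_tail in the output above are printed as epoch'version. When comparing two such values by hand or in a script, the epoch is compared first and the version second. A minimal sketch of that parsing and ordering follows; ToyEversion and parse_eversion() are hypothetical names for illustration and are not part of Ceph or of ceph-objectstore-tool.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <string>
#include <tuple>

// Hypothetical mirror of an (epoch, version) pair; not Ceph's eversion_t.
struct ToyEversion {
    uint32_t epoch = 0;
    uint64_t version = 0;
};

// Parse the "epoch'version" text form, e.g. "18042'298496".
// Returns false if the separator is missing or a side has no leading digits.
// Note: std::stoul/std::stoull ignore trailing non-digit characters.
bool parse_eversion(const std::string& s, ToyEversion* out) {
    const std::size_t pos = s.find('\'');
    if (pos == std::string::npos || pos == 0 || pos + 1 == s.size())
        return false;
    try {
        out->epoch = static_cast<uint32_t>(std::stoul(s.substr(0, pos)));
        out->version = std::stoull(s.substr(pos + 1));
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// Order by epoch first, then by version.
bool operator<(const ToyEversion& a, const ToyEversion& b) {
    return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}
```

With the values shown above, the parsed log_tail 9354'295445 orders before last_update 18042'298496.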

4.2 Track the pglog of a single op

1) View the PG to which an object is mapped

Use the ceph osd map pool-name object-name-id command to view the PG that an object maps to, for example:

# ceph osd map oss-uat.rgw.buckets.data 135882fc-2865-43ab-9f71-7dd4b2095406.20037185.269__multipart_批量上传走joss文件 -003-KZyxg.docx.VLRHO5x1l3nV4-v5W4r6YA2Fkqlfwj3.107
osdmap e16540 pool 'oss-uat.rgw.buckets.data' (189) object '-003-KZyxg.docx.VLRHO5x1l3nV4-v5W4r6YA2Fkqlfwj3.107/135882fc-2865-43ab-9f71-7dd4b2095406.20037185.269__multipart_批量上传走joss文件' -> pg 189.db7b914a (189.14a) -> up ([66,9,68], p66) acting ([66,9,68], p66)

For details, please refer to "How to Locate Files in Ceph"

2) Confirm the OSD where the PG is located

# ceph pg dump pgs_brief |grep ^19|grep 19.3f

3) Using the two steps above, find an object name that falls into the specified PG, then put a file into the pool under that object name

# rados -p oss-uat.rgw.buckets.data put 135882fc-2865-43ab-9f71-7dd4b2095406.20037185.269__multipart_批量上传走joss文件 -003-KZyxg.docx.VLRHO5x1l3nV4-v5W4r6YA2Fkqlfwj3.107 test.file

4) Stop any one of the OSDs in the PG's OSD set, then view the log entries for the 135882fc-2865-43ab-9f71... object that was just written

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-66/ --journal-path /dev/disk/by-id/virtio-ca098220-3a26-404a-8-part1 --type filestore --pgid 189.14a --op log

5. How PGLog participates in recovery

Based on the PG state machine (shown in the figure below, mainly the state transitions of a PG from reset to active, including how PG state recovery is handled when the PG moves from peering to active and when the epoch changes), we can see the following.

ceph-chapter6-10

The process of restoring a PG to the active state must be considered separately for the Primary and Replicate roles, because the messages exchanged between PGs and OSDs are driven by the Primary and then distributed to the replicas.

PGLog's participation in recovery shows up mainly during peering, when Ceph builds a missing list that marks outdated data so that it can be recovered. After the faulty OSD comes back online, its PGs are marked as peering and suspend request processing.

For the Primary PG owned by the faulty OSD
  • As the party responsible for this part of the data, it needs to send a request for the PG metadata pg_info to all Replicate role nodes belonging to the PG;

  • While the faulty OSD was offline, a Replicate role node of the PG actually acted as Primary and maintained the authoritative PGLog. That node sends a response after receiving the query request from the Primary PG of the faulty OSD;

  • By comparing the metadata and PG version information sent by the Replicate PGs, the Primary PG finds that it is behind, so it merges the obtained PGLog to build the authoritative PGLog and creates a missing list to mark outdated data;

  • After the Primary PG finishes building the authoritative PGLog, it can mark itself as Active.

For the Replicate PG owned by the faulty OSD
  • After coming back online, the Replicate PG of the faulty OSD receives the query request from the Primary PG and sends its own outdated metadata and PGLog;

  • The Primary PG compares the data, finds that this PG is behind and outdated, and creates a missing list from the PGLog (note: this is actually a peer_missing list);

  • The Primary PG marks itself as Active.

The main steps involving PGLog (pg_info, pg_log) in the Peering process

1) GetInfo

The Primary OSD of the PG obtains the pg_info of each Replicate OSD by sending messages. After receiving the pg_info of a Replicate OSD, it calls PG::proc_replica_info() to process it, which in turn calls info.history.merge() to merge the history sent by that Replicate OSD. The merge rule is to keep the newest value of each field (for example, last_epoch_started and last_epoch_clean both take the latest value).

bool PG::proc_replica_info(
  pg_shard_t from, const pg_info_t &oinfo, epoch_t send_epoch)
{
	...
	unreg_next_scrub();
	if (info.history.merge(oinfo.history))
		dirty_info = true;
	reg_next_scrub();
	...
}

bool merge(const pg_history_t &other) {
	// Here, we only update the fields which cannot be calculated from the OSDmap.
	bool modified = false;
	if (epoch_created < other.epoch_created) {
		epoch_created = other.epoch_created;
		modified = true;
	}
	if (last_epoch_started < other.last_epoch_started) {
		last_epoch_started = other.last_epoch_started;
		modified = true;
	}
	if (last_epoch_clean < other.last_epoch_clean) {
		last_epoch_clean = other.last_epoch_clean;
		modified = true;
	}
	if (last_epoch_split < other.last_epoch_split) {
		last_epoch_split = other.last_epoch_split; 
		modified = true;
	}
	if (last_epoch_marked_full < other.last_epoch_marked_full) {
		last_epoch_marked_full = other.last_epoch_marked_full;
		modified = true;
	}
	if (other.last_scrub > last_scrub) {
		last_scrub = other.last_scrub;
		modified = true;
	}
	if (other.last_scrub_stamp > last_scrub_stamp) {
		last_scrub_stamp = other.last_scrub_stamp;
		modified = true;
	}
	if (other.last_deep_scrub > last_deep_scrub) {
		last_deep_scrub = other.last_deep_scrub;
		modified = true;
	}
	if (other.last_deep_scrub_stamp > last_deep_scrub_stamp) {
		last_deep_scrub_stamp = other.last_deep_scrub_stamp;
		modified = true;
	}
	if (other.last_clean_scrub_stamp > last_clean_scrub_stamp) {
		last_clean_scrub_stamp = other.last_clean_scrub_stamp;
		modified = true;
	}
	return modified;
}

2) GetLog

Based on a comparison of the pg_info from each OSD, select the OSD that holds the authoritative log (auth_log_shard). If the Primary OSD is not that OSD, it fetches the authoritative log from it.

PG::RecoveryState::GetLog::GetLog(my_context ctx)
  : my_base(ctx),
    NamedState(
      context< RecoveryMachine >().pg->cct, "Started/Primary/Peering/GetLog"),
    msg(0)
{
	...

	// adjust acting?
	if (!pg->choose_acting(auth_log_shard, false,
			&context< Peering >().history_les_bound)){
	}

	...
}

When selecting the OSD with the authoritative log, three principles are followed (in the find_best_info() function):

/**
 * find_best_info
 *
 * Returns an iterator to the best info in infos sorted by:
 *  1) Prefer newer last_update
 *  2) Prefer longer tail if it brings another info into contiguity
 *  3) Prefer current primary
 */
map<pg_shard_t, pg_info_t>::const_iterator PG::find_best_info(
  const map<pg_shard_t, pg_info_t> &infos,
  bool restrict_to_up_acting,
  bool *history_les_bound) const
{
	...
}

In other words, the pg_info_t of each OSD is compared: the OSD with the largest last_update is selected; if last_update is equal, the one with the smallest log_tail is selected; if log_tail is also equal, the current Primary OSD is selected.
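
The three tie-breaking rules can be sketched as a comparison function. This is a simplified illustration, not the body of find_best_info(): ToyVersion, ToyInfo, better_info() and pick_best() are hypothetical names, and any additional checks the real function performs are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical, simplified (epoch, version) pair.
struct ToyVersion {
    uint32_t epoch;
    uint64_t version;
};

bool operator<(const ToyVersion& a, const ToyVersion& b) {
    return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// Hypothetical subset of the fields that are compared.
struct ToyInfo {
    int osd;
    ToyVersion last_update;
    ToyVersion log_tail;
};

// Returns true if candidate a is a better authoritative-log source than b.
bool better_info(const ToyInfo& a, const ToyInfo& b, int primary_osd) {
    if (b.last_update < a.last_update) return true;   // 1) newer last_update
    if (a.last_update < b.last_update) return false;
    if (a.log_tail < b.log_tail) return true;         // 2) smaller log_tail (longer log)
    if (b.log_tail < a.log_tail) return false;
    return a.osd == primary_osd && b.osd != primary_osd;  // 3) current primary
}

// Returns the OSD id of the best candidate, or -1 for an empty input.
int pick_best(const std::vector<ToyInfo>& infos, int primary_osd) {
    if (infos.empty()) return -1;
    std::size_t best = 0;
    for (std::size_t i = 1; i < infos.size(); ++i)
        if (better_info(infos[i], infos[best], primary_osd)) best = i;
    return infos[best].osd;
}
```

When two candidates tie on all three rules and neither is the primary, this sketch simply keeps the first one it saw.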

If the Primary OSD is not the OSD with the authoritative log, it must pull the authoritative log from the OSD that has it. After receiving the authoritative log, it calls proc_master_log() to merge it into the local pg log.

While the authoritative log is merged into the local pg log, the oid and eversion of each merged pg_log_entry_t are put into the missing list. The objects in the missing list are the objects missing on the Primary OSD; they must be pulled from other OSDs later, during recovery.

void PG::proc_master_log(
  ObjectStore::Transaction& t, pg_info_t &oinfo,
  pg_log_t &olog, pg_missing_t& omissing, pg_shard_t from)
{	
	...
	merge_log(t, oinfo, olog, from);
	...
}
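
How merging an authoritative log yields a missing list can be sketched as follows. This is a toy model under stated assumptions, not merge_log() itself: ToyVersion, ToyLogEntry and build_missing() are hypothetical names, the log is assumed to be ordered oldest to newest, and delete entries and divergent local entries are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified (epoch, version) pair.
struct ToyVersion {
    uint32_t epoch;
    uint64_t version;
};

bool operator<(const ToyVersion& a, const ToyVersion& b) {
    return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// Hypothetical, simplified log entry: which object, at which version.
struct ToyLogEntry {
    std::string oid;
    ToyVersion version;
};

// Every authoritative entry newer than the local last_update describes a
// change the local copy never applied, so the object is recorded as
// missing together with the version that must be recovered.
std::map<std::string, ToyVersion> build_missing(
    const std::vector<ToyLogEntry>& auth_log,  // ordered oldest to newest
    const ToyVersion& local_last_update) {
    std::map<std::string, ToyVersion> missing;
    for (const ToyLogEntry& e : auth_log) {
        if (local_last_update < e.version)
            missing[e.oid] = e.version;  // a later entry overwrites an earlier one
    }
    return missing;
}
```

An object modified several times after the local last_update appears once in the result, with the newest version it needs.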

3)  GetMissing

Pull the pg log of each other Replicate OSD (either part of it or all of it, FULL_LOG), compare it with the local auth log, and call proc_replica_log() to process it. The objects missing on the Replicate OSD are placed in the peer_missing list for use by the subsequent recovery process.

The peer_missing list is also generated on the Primary.

Note: In fact, the peer_missing list is updated in PG::activate(). What proc_replica_log() processes is only the replica's own local missing list (that is, the missing list the replica builds from its own last_update and last_complete after it restarts). Normally this missing list is empty.
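
The per-replica bookkeeping on the Primary can be sketched in the same spirit. This is an illustration only, not the logic of PG::activate(): build_peer_missing() and the Toy types are hypothetical, and each replica is assumed to be described by nothing more than its last_update.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified (epoch, version) pair.
struct ToyVersion {
    uint32_t epoch;
    uint64_t version;
};

bool operator<(const ToyVersion& a, const ToyVersion& b) {
    return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

// Hypothetical, simplified log entry: which object, at which version.
struct ToyLogEntry {
    std::string oid;
    ToyVersion version;
};

using MissingSet = std::map<std::string, ToyVersion>;

// For each replica, record the objects whose authoritative version is
// newer than that replica's last_update.
std::map<int, MissingSet> build_peer_missing(
    const std::vector<ToyLogEntry>& auth_log,             // oldest to newest
    const std::map<int, ToyVersion>& peer_last_update) {  // osd id -> last_update
    std::map<int, MissingSet> peer_missing;
    for (const auto& peer : peer_last_update) {
        MissingSet& m = peer_missing[peer.first];  // up-to-date peers keep an empty set
        for (const ToyLogEntry& e : auth_log)
            if (peer.second < e.version) m[e.oid] = e.version;
    }
    return peer_missing;
}
```

A replica that is already at the authoritative last_update ends up with an empty set, so recovery has nothing to push to it.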


Origin blog.csdn.net/weixin_43778179/article/details/132701314