Ceph Consistency Check Tool Scrub Mechanism

This chapter introduces Ceph's consistency check tool, the Scrub mechanism. It first covers the basics of data verification, then the basic concepts of Scrub and its scheduling mechanism, and finally analyzes the source code of the Scrub implementation.

1. End-to-end data verification

Silent data corruption may occur in a storage system. Most cases are caused by an unexpected flip of a single bit of data (a bit error).

Figure 12-1 below shows the protocol stack of a general storage system. Data corruption can occur in any module of the system:

ceph-chapter12-1

  • Hardware errors, such as memory, CPU, network card, etc.;

  • Signal noise during data transmission over protocols such as SATA and FC;

  • Firmware bugs, such as RAID controllers, disk controllers, network cards, etc.;

  • Software bugs, such as operating system kernel bugs, local file system bugs, SCSI software module bugs, etc.

In traditional high-end disk arrays, end-to-end data verification is generally used to ensure data consistency. With end-to-end verification, when the client (application layer) writes data, it calculates a CRC checksum for each data block and sends the checksum together with the data block to the disk. When the disk receives the packet, it recalculates the checksum and compares it with the one it received. If they differ, an error is assumed to have occurred somewhere on the IO path and the IO operation fails; if they match, the checksum and the data are both saved on the disk. Similarly, when data is read, the client obtains the data block together with the checksum read from disk and verifies their consistency again.

Through this method, the application layer can clearly know whether the data of an IO request is consistent. If the operation was successful, then the data on disk must be correct.
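The write and read checks described above can be sketched as follows. This is a minimal illustration, not Ceph code or any array vendor's implementation: the Block structure and the helpers client_write(), disk_store() and client_read_ok() are hypothetical names, and a plain bitwise CRC-32 stands in for whatever checksum a real array would use.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
inline uint32_t crc32_ieee(const std::string &data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical unit sent down the IO path: payload plus its checksum.
struct Block {
    std::string data;
    uint32_t checksum;
};

// Client side: compute the checksum before the data leaves the application.
inline Block client_write(const std::string &data) {
    Block b;
    b.data = data;
    b.checksum = crc32_ieee(data);
    return b;
}

// Disk side: recompute and compare. Returns false (IO failure) on mismatch,
// otherwise "persists" data and checksum together into *stored.
inline bool disk_store(const Block &received, Block *stored) {
    if (crc32_ieee(received.data) != received.checksum) {
        return false;
    }
    *stored = received;
    return true;
}

// Client side on read: verify the data against the checksum read from disk.
inline bool client_read_ok(const Block &stored) {
    return crc32_ieee(stored.data) == stored.checksum;
}
```

A single flipped bit anywhere between client_write() and disk_store() makes the store fail, and a flip that happens after the block has been stored is caught by client_read_ok().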

This method can improve the integrity of data reads and writes with little or no impact on IO performance. However, it also has some disadvantages:

  • Unable to solve the problem of data corruption caused by wrong destination address;

  • An end-to-end solution requires additional parity information along the entire IO path. The current IO protocol stack involves many modules, and it is difficult to implement such verification information for each module.

Because implementing this approach would have a considerable impact on Ceph's IO performance, Ceph does not implement end-to-end data verification. Instead it implements the Ceph Scrub mechanism, which performs consistency checks by scanning in the background.

2. Scrub concept introduction

Ceph internally implements a tool for consistency checking: Ceph Scrub. Its principle is to check replica consistency by comparing the data and metadata of each replica of an object.

The advantage of this method is that data inconsistencies caused by disk corruption can be found in the background. The disadvantage is that the timing of discovery is often delayed.

Scrub is divided into two methods according to the scanned content:

  • One is called Scrub, which checks consistency only by comparing the metadata of each replica of the object. Since only metadata is checked, the amount of data read and the amount of computation are relatively small, making it a lightweight check.

  • The other is called deep-scrub, which additionally checks whether the data content of the object is consistent. This deep scan reads almost all the data on the disk and calculates crc32 checksums, so it is time-consuming and takes up more system resources.

Scrub is divided into two types according to the scanning method:

  • Online scanning: does not affect the normal business of the system;

  • Offline scanning: requires the entire system to be suspended or frozen.

Ceph's Scrub function implements online inspection, that is, the client can continue to complete read and write access without interrupting the current read and write requests of the system. The entire system will not be suspended, but the object that is being scrubbed in the background will be locked to temporarily prevent access, and it will not be unlocked to allow access until the object completes the Scrub operation.

3. Scrub scheduling

Scrub scheduling decides when a PG starts a Scrub scan. There are mainly the following ways:

  • Manually start the scan immediately;

  • Set a certain time interval in the background, and start according to the time interval. For example, the default time is to execute once a day;

  • Set the time period for startup. Generally, a time period with a relatively light system load is set to execute the Scrub operation.

3.1 Related Data Structures

The following Scrub-related data structures are defined in class OSDService (src/osd/OSD.h):

class OSDService{
	...
public:
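	Mutex sched_scrub_lock;           // lock protecting the Scrub-related variables
	int scrubs_pending;               // PGs whose resource reservation succeeded and that are waiting to scrub
	int scrubs_active;                // PGs that are currently being scrubbed
	set<ScrubJob> sched_scrub_pg;     // set of ScrubJobs for all PGs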
};

The structure ScrubJob encapsulates the parameters related to a PG Scrub task:

struct ScrubJob {

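	// PG that this Scrub job belongs to
	spg_t pgid;

	// Scheduled time of the Scrub task. If the current load is high, or the current time is
	// outside the configured Scrub time window, scheduling is delayed
	utime_t sched_time;
	
	// Upper bound on the scheduled time. Once it has passed, the Scrub must run regardless of
	// the system load and the Scrub time window
	utime_t deadline;
};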

3.2 Scrub scheduling implementation

In the OSD initialization function OSD::init(), a scheduled task is registered:

tick_timer_without_osd_lock.add_event_after(cct->_conf->osd_heartbeat_interval, new C_Tick_WithoutOSDLock(this));

The timed task triggers the timer callback function OSD::tick_without_osd_lock() every osd_heartbeat_interval period (the default is 1 second). The function call relationships of this processing are shown in Figure 12-2 below:

ceph-chapter12-2

The above functions realize the Scrub scheduling work of PG. The implementation of the two key functions OSD::sched_scrub() and PG::sched_scrub() in the process of processing will be introduced below.

3.2.1 OSD::sched_scrub() function

This function is used to control the start timing of a PG's scrub process. The specific process is as follows:

1) Call the function can_inc_scrubs_pending() to check whether there is quota left to allow the PG to start a Scrub operation. The variable scrubs_pending records the number of PGs that have completed resource reservation and are waiting to scrub, and the variable scrubs_active records the number of PGs that are being scrubbed. The sum of the two cannot exceed the value of the system configuration parameter cct->_conf->osd_max_scrubs, which sets the maximum number of PGs allowed to scrub at the same time.

2) Call the function scrub_time_permit() to check whether the current time is within the allowed time period. If cct->_conf->osd_scrub_begin_hour is less than cct->_conf->osd_scrub_end_hour, the current hour must fall within the range set by the two. If cct->_conf->osd_scrub_begin_hour is greater than or equal to cct->_conf->osd_scrub_end_hour, the window wraps past midnight: the current hour is allowed if it is at or after the begin hour or before the end hour.

3) Call the function scrub_load_below_threshold() to check whether the current system load is allowed. The function getloadavg() obtains the system load of the last 1 minute, 5 minutes, and 15 minutes:

  a) If the load in the last 1 minute is less than the setting value of cct->_conf->osd_scrub_load_threshold, it is allowed to execute;

  b) If the load in the last minute is less than the value of daily_loadavg, and the load in the last minute is less than the load in the last 15 minutes, execution is allowed;

4) Obtain the first ScrubJob in the list of jobs waiting to execute a Scrub operation. If its scrub.sched_time is later than the current time now, its time has not come yet, so skip the PG and execute the next task first.

5) Obtain the PG object corresponding to the PG, if the pgbackend of the PG supports Scrub and is in active state:

  a) If the scrub.deadline is less than the now value, that is, the deadline has passed, the Scrub operation must be started;

  b) Or both time_permit and load_is_low hold at this moment, that is, both the time window and the load allow it;

In the above two cases, the function pg->sched_scrub() is called to perform the Scrub operation.
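The checks in steps 2) to 5) can be sketched as three small functions. This is an illustrative sketch rather than the Ceph source: time_permit(), load_below_threshold(), should_scrub() and the JobTimes structure are hypothetical names, times are plain doubles instead of utime_t, and treating the hour window as half-open (begin included, end excluded) is an assumption.

```cpp
#include <cassert>

// Time window check in the spirit of scrub_time_permit().
// begin_hour / end_hour play the role of osd_scrub_begin_hour / osd_scrub_end_hour.
inline bool time_permit(int hour, int begin_hour, int end_hour) {
    assert(hour >= 0 && hour < 24);
    if (begin_hour < end_hour) {
        return hour >= begin_hour && hour < end_hour;
    }
    // begin >= end: the window wraps past midnight
    // (and covers the whole day when the two are equal).
    return hour >= begin_hour || hour < end_hour;
}

// Load check: loadavg[0], [1], [2] are the 1, 5 and 15 minute averages,
// in the order returned by getloadavg().
inline bool load_below_threshold(const double loadavg[3],
                                 double threshold,
                                 double daily_loadavg) {
    if (loadavg[0] < threshold) {
        return true;                       // case a)
    }
    // case b): below the daily average and the load is decreasing
    return loadavg[0] < daily_loadavg && loadavg[0] < loadavg[2];
}

// Hypothetical reduction of ScrubJob to its two timestamps.
struct JobTimes {
    double sched_time;
    double deadline;
};

// Final decision for one job whose PG is active and supports Scrub.
inline bool should_scrub(const JobTimes &job, double now,
                         bool time_ok, bool load_ok) {
    if (job.sched_time > now) {
        return false;                      // not due yet
    }
    if (job.deadline < now) {
        return true;                       // deadline passed: must scrub
    }
    return time_ok && load_ok;             // otherwise both must allow it
}
```

With begin_hour = 22 and end_hour = 4, for example, time_permit() accepts 23:00 and 02:00 but rejects 12:00, and a job past its deadline is scrubbed even when both time_ok and load_ok are false.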

3.2.2 PG::sched_scrub() function

This function implements the setting of relevant parameters when executing the Scrub task, and completes the reservation of the required resources. Its processing is as follows:

1) First check the status of the PG: this OSD must be the primary, the PG must be in the active and clean state, and no Scrub operation may be in progress;

2) Set the value of deep_scrub_interval: if the value is not set in the options of the Pool where the PG is located, set it to the value of the system configuration parameter cct->_conf->osd_deep_scrub_interval;

3) Check whether to start deep_scrub: if the current time is greater than the sum of info.history.last_deep_scrub_stamp and deep_scrub_interval, start the deep_scrub operation;

4) If the value of scrubber.must_scrub is true, the scrub was started manually by the user. If the value is false, the system automatically starts the deep_scrub operation with a certain probability. The specific implementation is: generate a random number, and if it is less than cct->_conf->osd_deep_scrub_randomize_ratio, start the deep_scrub operation.

5) Make the final decision on whether to start deep_scrub: if either step 3) or step 4) selected it, start the deep_scrub operation;

6) If the osdmap or pool has a mark that does not support deep_scrub, set time_for_deep to false and do not start the deep_scrub operation;

7) If there is a mark that does not support Scrub in osdmap or pool, and the deep_scrub operation is not started, return and exit;

8) If cct->_conf->osd_scrub_auto_repair is set for automatic repair, and pgbackend also supports it, and it is a deep_scrub operation, the following judgment process is performed:

  a) If the user sets must_repair, or must_scrub, or must_deep_scrub, it means that this Scrub operation is triggered by the user, and the system respects the user's choice, and will not automatically set the value of scrubber.auto_repair to true;

  b) Otherwise, the system sets the value of scrubber.auto_repair to true to automatically repair.

9) The Scrub process is similar to the Recovery process, both need to consume a lot of system resources, and need to reserve resources on the OSD where the PG is located. If the value of scrubber.reserved is false and the resource reservation has not been completed, the following operations need to be performed:

  a) Add the primary itself to scrubber.reserved_peers;

  b) Call the function scrub_reserve_replicas() to send a CEPH_OSD_OP_SCRUB_RESERVE message to the replica OSDs to reserve resources;

  c) If scrubber.reserved_peers.size() is equal to acting.size(), all replica OSDs have reserved their resources successfully, and the PG is set to the PG_STATE_DEEP_SCRUB state. Call the function queue_scrub() to add the PG to the work queue op_wq, which triggers the Scrub task to start executing;
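The deep-scrub decision for automatically scheduled scrubs (steps 3, 4, 5 and 6) and the reservation check in step 9c can be sketched as below. This is a simplified illustration, not the Ceph implementation: DeepScrubInputs, time_for_deep() and all_reserved() are hypothetical names, times are plain doubles, the random draw is passed in so the function stays deterministic, and user-requested scrubs are deliberately left out of the model.

```cpp
#include <cstddef>
#include <set>

// Hypothetical bundle of the values the decision depends on.
struct DeepScrubInputs {
    double now;
    double last_deep_scrub_stamp;  // info.history.last_deep_scrub_stamp
    double deep_scrub_interval;    // pool option or osd_deep_scrub_interval
    double random_draw;            // uniform value in [0, 1), supplied by the caller
    double randomize_ratio;        // osd_deep_scrub_randomize_ratio
    bool   nodeep_scrub_flag;      // osdmap or pool forbids deep scrub
};

// Decide whether an automatically scheduled scrub should be a deep scrub.
inline bool time_for_deep(const DeepScrubInputs &in) {
    // Step 3: the deep scrub interval has elapsed.
    bool overdue = in.now > in.last_deep_scrub_stamp + in.deep_scrub_interval;
    // Step 4: random promotion to deep scrub with the configured probability.
    bool coin_flip = in.random_draw < in.randomize_ratio;
    // Step 5: either condition is enough.
    bool deep = overdue || coin_flip;
    // Step 6: a "no deep scrub" flag overrides the decision.
    if (in.nodeep_scrub_flag) {
        deep = false;
    }
    return deep;
}

// Step 9c: the reservation is complete once every acting shard
// (the primary included) is present in reserved_peers.
inline bool all_reserved(const std::set<int> &reserved_peers,
                         std::size_t acting_size) {
    return reserved_peers.size() == acting_size;
}
```

Passing the random value in as a parameter is a design choice made for this sketch only; it makes the probabilistic branch easy to exercise in a test.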

4. Execution of Scrub

The specific execution process of Scrub is roughly as follows: the verification of metadata and data is completed by comparing the metadata and data of the copies on each OSD of the object. Its core processing flow is controlled and completed in the function PG::chunky_scrub().

4.1 Related Data Structures

There are two main data structures related to the Scrub operation. One is the Scrubber control structure, which serves as the context of a Scrub operation and controls the operation process of a PG. The other is ScrubMap, which saves the metadata and data digest information of each replica of the objects that need to be compared.

1) Scrubber

The structure Scrubber is used to control the Scrub process of a PG (src/osd/PG.h):

// -- scrub --
struct Scrubber {
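	// metadata
	set<pg_shard_t> reserved_peers;                // shards whose resources have been reserved
	bool reserved, reserve_failed;                 // whether resources are reserved; whether the reservation failed
	epoch_t epoch_start;                           // epoch at which the Scrub operation started
	
	// common to both scrubs
	bool active;                                   // whether the Scrub has started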
	
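	/*
	 * When the PG has a snap_trim operation and the Scrubber is found to be active, a Scrub
	 * operation is in progress, so the snap_trim operation is paused and queue_snap_trim is
	 * set to true. When the PG finishes its Scrub task, if queue_snap_trim is true, the PG is
	 * added to the corresponding work queue to continue the snap_trim operation
	 */
	bool queue_snap_trim;
	int waiting_on;                                // number of replicas being waited on
	set<pg_shard_t> waiting_on_whom;               // replicas being waited on
	int shallow_errors;                            // number of errors found by the shallow scan
	int deep_errors;                               // number of errors found by the deep scan
	int fixed;                                     // number of objects repaired

	ScrubMap primary_scrubmap;                     // ScrubMap of the primary
	map<pg_shard_t, ScrubMap> received_maps;       // ScrubMaps received from the replicas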
	OpRequestRef active_rep_scrub;
	utime_t scrub_reg_stamp;                       //stamp we registered for
	
	// For async sleep
	bool sleeping = false;
	bool needs_sleep = true;
	utime_t sleep_start;
	
	// flags to indicate explicitly requested scrubs (by admin)
	bool must_scrub, must_deep_scrub, must_repair;
	
	// Priority to use for scrub scheduling
	unsigned priority;
	
	// this flag indicates whether we would like to do auto-repair of the PG or not
	bool auto_repair;

	// Maps from objects with errors to missing/inconsistent peers
	map<hobject_t, set<pg_shard_t>, hobject_t::BitwiseComparator> missing;            // missing objects found by the scan
	map<hobject_t, set<pg_shard_t>, hobject_t::BitwiseComparator> inconsistent;       // inconsistent objects found by the scan
	
	/*
	 * Map from object with errors to good peers
	 * If some replicas of an object are inconsistent, authoritative records the OSDs that hold a correct copy
	 */
	map<hobject_t, list<pair<ScrubMap::object, pg_shard_t> >, hobject_t::BitwiseComparator> authoritative;
	
	// Cleaned map pending snap metadata scrub
	ScrubMap cleaned_meta_map;
	
	// digest updates which we are waiting on
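	int num_digest_updates_pending;                // number of objects waiting for a digest update
	
	// chunky scrub
	hobject_t start, end;                          // start and end of the list of objects being scanned
	eversion_t subset_last_update;                 // latest version number within the scanned object list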
	
	// chunky scrub state
	enum State {
		INACTIVE,
		NEW_CHUNK,
		WAIT_PUSHES,
		WAIT_LAST_UPDATE,
		BUILD_MAP,
		WAIT_REPLICAS,
		COMPARE_MAPS,
		WAIT_DIGEST_UPDATES,
		FINISH,
	} state;
	
	std::unique_ptr<Scrub::Store> store;
	
	bool deep;                                    // whether this is a deep scan
	uint32_t seed;                                // seed for computing the crc32 checksum
	
	list<Context*> callbacks;
};

2) ScrubMap

The data structure ScrubMap saves the objects to be verified and the corresponding verification information (src/osd/osd_types.h):

struct ScrubMap {
	struct object {
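		map<string,bufferptr> attrs;           // attributes of the object
		set<snapid_t> snapcolls;               // all snap sequence numbers of the object
		uint64_t size;                         // size of the object
		__u32 omap_digest;                     // crc32c checksum of the omap
		__u32 digest;                          // crc32 checksum of the object data
		uint32_t nlinks;                       // number of snaps for a snap (clone) object
		bool negative:1;
		bool digest_present:1;                 // whether the data checksum has been computed
		bool omap_digest_present:1;            // whether an omap checksum is present
		bool read_error:1;                     // error while reading the object data
		bool stat_error:1;                     // error while calling stat to get the object metadata
		bool ec_hash_mismatch:1;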
		bool ec_size_mismatch:1;
	};

	bool bitwise;                             // ephemeral, not encoded

	// mapping from object to verify (hobject) -> verification information (object)
	map<hobject_t,object, hobject_t::ComparatorWithDefault> objects;
	eversion_t valid_through;
	eversion_t incr_since;
};

The internal class object is used to save the information that the object needs to verify, including the following five aspects:

  • The size of the object (size)

  • Object attributes (attrs)

  • The checksum of the object's omap (omap_digest)

  • Checksum of object data (digest)

  • The snapshot sequence number of all cloned objects of the object

4.2 Scrub control process

The Scrub task is executed by the OSD work queue OpWq, which calls the corresponding processing function pg->scrub(handle).

void PG::scrub(epoch_t queued, ThreadPool::TPHandle &handle){
	...
	chunky_scrub(handle);
}

The PG::scrub() function ultimately delegates to the PG::chunky_scrub() function, which controls the state transitions and the core processing of the Scrub operation.

The specific analysis process is as follows:

1) The initial state of the Scrubber is PG::Scrubber::INACTIVE, and the processing of this state is as follows:

  a) Set the value of scrubber.epoch_start to info.history.same_interval_since;

  b) Set the value of scrubber.active to true;

  c) Set the state scrubber.state to PG::Scrubber::NEW_CHUNK;

  d) According to peer_features, set the value of scrubber.seed. This seed is the initial hash value used for the crc32 calculation.

2) The processing of the PG::Scrubber::NEW_CHUNK state is as follows:

  a) Call the get_pgbackend()->objects_list_partial() function to scan a group of objects starting from the start object. The number of objects scanned at a time lies between two configuration parameters: cct->_conf->osd_scrub_chunk_min (default value 5) and cct->_conf->osd_scrub_chunk_max (default value 25);

  b) Calculate the boundary of the object range. All objects related to the same object share the same hash value, so search backward from the end of the list for an object with a different hash value and cut the chunk there. The purpose is to keep all related objects (snapshot objects, rollback objects) of an object within a single scan and verification pass;

  c) Call the function _range_available_for_scrub() to check the objects in the list. If there are blocked objects, set the value of done to true and exit the Scrub process of PG;

  d) Calculate the maximum update version number of the object from start to end according to pg_log. The latest version number is set in scrubber.subset_last_update;

  e) Call the function _request_scrub_map() to send messages to all replicas to obtain the verification information of the corresponding ScrubMap;

  f) Set the status to PG::Scrubber::WAIT_PUSHES.

3) The processing of the PG::Scrubber::WAIT_PUSHES state is as follows:

  a) If the value of active_pushes is 0, set the state to PG::Scrubber::WAIT_LAST_UPDATE and enter the next state processing;

  b) If active_pushes is not 0, the PG is performing a Recovery operation. Set the value of done to true and end directly. When chunky_scrub() is entered, the PG should be in the CLEAN state with no Recovery operation in progress, so a Recovery operation seen here may be the repair triggered by the previous chunky_scrub() run;

4) The processing of PG::Scrubber::WAIT_LAST_UPDATE state is as follows:

  a) If the value of last_update_applied is less than the value of scrubber.subset_last_update, it means that although the operation has been written to the log, it has not been applied to the object. Since the steps after the Scrub operation have object read operations, it is necessary to wait for the log application to complete. Set the value of done to true to end the Scrub process of this PG;

  b) Otherwise set the state to PG::Scrubber::BUILD_MAP.

5) The processing of PG::Scrubber::BUILD_MAP state is as follows:

  a) Call the function build_scrub_map_chunk() to construct the ScrubMap structure of the object on the main OSD;

  b) If the construction succeeds, decrement the counter scrubber.waiting_on by 1, remove the primary itself from scrubber.waiting_on_whom, and set the state to PG::Scrubber::WAIT_REPLICAS.

6) The processing of PG::Scrubber::WAIT_REPLICAS state is as follows:

  a) If scrubber.waiting_on is not zero, some replica has not yet responded to the request. Set the value of done to true, exit and wait;

  b) Otherwise, enter the PG::Scrubber::COMPARE_MAPS state;

7) The processing of PG::Scrubber::COMPARE_MAPS status is as follows:

  a) Call the function scrub_compare_maps() to compare the verification information of each copy;

  b) Update the value of the parameter scrubber.start to scrubber.end.

  c) Call the function requeue_ops() to re-add the read and write operations blocked by Scrub to the operation queue for execution;

  d) The state is set to PG::Scrubber::WAIT_DIGEST_UPDATES;

8) The processing of PG::Scrubber::WAIT_DIGEST_UPDATES state is as follows:

  a) If scrubber.num_digest_updates_pending is greater than 0, wait for the pending updates of the data digest or the omap digest to complete;

  b) If scrubber.end is less than hobject_t::get_max(), there are still objects in this PG that have not been scrubbed. Set scrubber.state to PG::Scrubber::NEW_CHUNK and add the PG to osd->scrub_wq again;

  c) Otherwise, set the state to PG::Scrubber::FINISH;

9) The processing of PG::Scrubber::FINISH state is as follows:

  a) Call the function scrub_finish() to set relevant statistical information and trigger the repair of inconsistent objects;

  b) Set the state to PG::Scrubber::INACTIVE;
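The state transitions listed above can be condensed into a single transition function. The sketch below models only the control flow: the Conditions structure, the Step result and advance() are hypothetical names, each condition is reduced to a plain flag or counter, and the done flag corresponds to the done variable that makes chunky_scrub() return and wait to be queued again.

```cpp
// States in the same order as PG::Scrubber::State.
enum class State {
    INACTIVE, NEW_CHUNK, WAIT_PUSHES, WAIT_LAST_UPDATE, BUILD_MAP,
    WAIT_REPLICAS, COMPARE_MAPS, WAIT_DIGEST_UPDATES, FINISH
};

// Hypothetical snapshot of the conditions each state checks.
struct Conditions {
    bool range_blocked;          // _range_available_for_scrub() found blocked objects
    int  active_pushes;          // non-zero while Recovery pushes are in flight
    bool log_applied;            // last_update_applied >= subset_last_update
    int  waiting_on;             // replicas that have not sent their ScrubMap yet
    int  digest_updates_pending; // num_digest_updates_pending
    bool more_chunks;            // scrubber.end < hobject_t::get_max()
};

struct Step {
    State next;  // state after this step
    bool  done;  // true: leave the loop and wait to be queued again
};

inline Step advance(State s, const Conditions &c) {
    switch (s) {
    case State::INACTIVE:
        return {State::NEW_CHUNK, false};
    case State::NEW_CHUNK:
        if (c.range_blocked) return {s, true};
        return {State::WAIT_PUSHES, false};
    case State::WAIT_PUSHES:
        if (c.active_pushes != 0) return {s, true};
        return {State::WAIT_LAST_UPDATE, false};
    case State::WAIT_LAST_UPDATE:
        if (!c.log_applied) return {s, true};
        return {State::BUILD_MAP, false};
    case State::BUILD_MAP:
        return {State::WAIT_REPLICAS, false};
    case State::WAIT_REPLICAS:
        if (c.waiting_on != 0) return {s, true};
        return {State::COMPARE_MAPS, false};
    case State::COMPARE_MAPS:
        return {State::WAIT_DIGEST_UPDATES, false};
    case State::WAIT_DIGEST_UPDATES:
        if (c.digest_updates_pending > 0) return {s, true};
        // More objects left: start the next chunk after being requeued.
        if (c.more_chunks) return {State::NEW_CHUNK, true};
        return {State::FINISH, false};
    case State::FINISH:
        return {State::INACTIVE, true};
    }
    return {State::INACTIVE, true};
}
```

Calling advance() repeatedly until done is true mirrors the loop inside chunky_scrub(): with no blocking condition and no further chunks, a run starting at INACTIVE walks through every state once and ends back at INACTIVE.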

4.3 Build ScrubMap

There are multiple function implementations for building ScrubMap, which are described below.

1. build_scrub_map_chunk() function

The function build_scrub_map_chunk() is used to build the verification information of all objects from start to end and save it in the ScrubMap structure:

int PG::build_scrub_map_chunk(
  ScrubMap &map,
  hobject_t start, hobject_t end, bool deep, uint32_t seed,
  ThreadPool::TPHandle &handle);

The processing analysis is as follows:

1) Set the value of map.valid_through to info.last_update;

2) Call the get_pgbackend()->objects_list_range() function to list all objects within the range from start to end. The ls queue stores head and snap objects, and the rollback_obs queue stores the ghobject_t objects used for rollback;

3) Call the function get_pgbackend()->be_scan_list() to scan the object and build the ScrubMap structure;

4) Call the function _scan_rollback_obs() to check the rollback object: if the generation of the object is less than the last_rollback_info_trimmed_to_applied value, delete the object;

5) Call _scan_snaps() to repair the snap information saved in SnapMapper;

2. _scan_snaps()

The function _scan_snaps() scans whether the snap information saved by the head object is consistent with the snap information of the object saved in the SnapMapper. It takes the object snap information saved by the former as the standard, and repairs the object snap information saved in SnapMapper.

void PG::_scan_snaps(ScrubMap &smap);

The specific implementation process is as follows: loop over each object in the ScrubMap and do the following operations:

1) If the value of the object's hoid.snap is less than the value of CEPH_MAXSNAP, then the object is a snap object, and the object_info_t information is obtained from o.attrs[OI_ATTR];

2) Check the snaps of oi. If oi.snaps is empty, set nlinks to 1; if oi.snaps.size() is 1, set nlinks to 2; otherwise set nlinks to 3;

3) Get oi_snaps from oi, get cur_snaps from snap_mapper, compare the two snap information, and the information of oi shall prevail:

  a) If the result of the function snap_mapper.get_snaps(hoid, &cur_snaps) is -ENOENT, add the information to snap_mapper;

  b) If the information is inconsistent, first delete the inconsistent object in snap_mapper, and then add the snap information of the object to snap_mapper.
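The repair rule in step 3) can be sketched with ordinary containers. This is an illustration only: SnapIndex is a hypothetical in-memory stand-in for SnapMapper (the real class persists its data and has a different interface), object names are plain strings, and reconcile_snaps() is an invented helper.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

typedef std::set<uint64_t> SnapSet;

// Hypothetical stand-in for SnapMapper: object name -> snaps recorded for it.
typedef std::map<std::string, SnapSet> SnapIndex;

// Make the index agree with the snaps stored in the object's own object_info
// (oi_snaps), which is treated as the source of truth.
// Returns true when the index had to be repaired.
inline bool reconcile_snaps(SnapIndex &index,
                            const std::string &obj,
                            const SnapSet &oi_snaps) {
    SnapIndex::iterator it = index.find(obj);
    if (it == index.end()) {
        // Corresponds to get_snaps() returning -ENOENT: add the entry.
        index[obj] = oi_snaps;
        return true;
    }
    if (it->second != oi_snaps) {
        // Inconsistent: drop the stale entry, then add the correct one.
        index.erase(it);
        index[obj] = oi_snaps;
        return true;
    }
    return false;  // already consistent, nothing to do
}
```

Running the helper twice in a row on the same object repairs on the first call and reports no change on the second.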

3. be_scan_list()

The function be_scan_list() is used to construct the verification information of the objects in the ScrubMap:

void PGBackend::be_scan_list(
  ScrubMap &map, const vector<hobject_t> &ls, bool deep, uint32_t seed,
  ThreadPool::TPHandle &handle);

The specific process is to scan the objects in the ls vector in a loop:

1) Call store->stat() to get the stat information of the object:

  a) If the acquisition is successful, set the value of o.size equal to st.st_size, and call store->getattrs() to save the attr information of the object in o.attrs;

  b) If the result r returned by stat is -ENOENT, skip the object directly (this object may be missing on this OSD, and it will be checked out later when comparing the results);

  c) If the result r returned by stat is -EIO, set the value of o.stat_error to true;

2) If the value of deep is true, call the function be_deep_scrub() to perform deep scanning to obtain the digest information of the object's omap and data;
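The handling of the stat result in step 1) can be sketched as follows. The sketch is not the PGBackend code: FakeStore replaces the real object store with a simple map, Entry keeps only two fields of ScrubMap::object, scan_list() is a hypothetical name, and treating every error other than -ENOENT like -EIO is an assumption made for brevity.

```cpp
#include <cerrno>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Result of a stat call: r == 0 on success, a negative errno otherwise.
struct StatResult {
    int r;
    uint64_t size;
};

// Hypothetical store: objects that are absent from the map behave like -ENOENT.
typedef std::map<std::string, StatResult> FakeStore;

// Reduced version of ScrubMap::object.
struct Entry {
    uint64_t size;
    bool stat_error;
    Entry() : size(0), stat_error(false) {}
};

typedef std::map<std::string, Entry> ScanMap;

inline ScanMap scan_list(const std::vector<std::string> &ls,
                         const FakeStore &store) {
    ScanMap out;
    for (const std::string &name : ls) {
        FakeStore::const_iterator it = store.find(name);
        int r = (it == store.end()) ? -ENOENT : it->second.r;
        if (r == -ENOENT) {
            // Missing on this OSD: skip it, the later comparison will notice.
            continue;
        }
        Entry e;
        if (r == 0) {
            e.size = it->second.size;   // stat succeeded: record the size
        } else {
            e.stat_error = true;        // -EIO (and, by assumption, other errors)
        }
        out[name] = e;
    }
    return out;
}
```

An object that is missing locally therefore leaves no entry in the map at all, while an object whose stat failed keeps an entry flagged with stat_error.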

4. be_deep_scrub()

The function be_deep_scrub() implements deep scanning of objects:

void ReplicatedBackend::be_deep_scrub(
  const hobject_t &poid,                             //深度扫描的对象
  uint32_t seed,                                     //crc32的种子
  ScrubMap::object &o,                               //保存对应的校验信息
  ThreadPool::TPHandle &handle);

The implementation process is analyzed as follows:

1) Set the initial values of the bufferhash for data and for omap to seed;

2) Call the function store->read() in a loop to read the object's data. Each read has the length of the configuration parameter cct->_conf->osd_deep_scrub_stride (512k), and the crc32 checksum is accumulated through bufferhash. If an error occurs along the way (r == -EIO), set the value of o.read_error to true. Finally, set o.digest to the computed crc32 checksum and set the value of o.digest_present to true;

3) Call the function store->omap_get_header() to obtain the header, iterate over the key-value pairs of the object's omap, calculate the digest of the header and the key-value pairs, store it in o.omap_digest, and set the value of o.omap_digest_present to true;

In summary, the metadata of the object is obtained through the function be_scan_list(), and the digests of the object's data and omap are obtained through the be_deep_scrub() function and stored in the ScrubMap structure.
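The stride-by-stride digest computation in step 2) can be illustrated with a seeded, incremental checksum. The sketch uses a plain bitwise CRC-32C written out here for self-containment; it is not Ceph's bufferhash or its optimized crc32c routine, and crc32c_update() and digest_in_strides() are hypothetical names. The object data is an in-memory string instead of repeated store->read() calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C update (Castagnoli, reflected polynomial 0x82F63B78).
// The running value is passed in and returned, so the caller can feed
// the data piece by piece.
inline uint32_t crc32c_update(uint32_t crc, const char *buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<unsigned char>(buf[i]);
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc;
}

// Hash an object in fixed-size strides, starting from the given seed.
// stride plays the role of osd_deep_scrub_stride.
inline uint32_t digest_in_strides(const std::string &object_data,
                                  uint32_t seed,
                                  std::size_t stride) {
    assert(stride > 0);
    uint32_t h = seed;
    for (std::size_t pos = 0; pos < object_data.size(); pos += stride) {
        std::size_t len = std::min(stride, object_data.size() - pos);
        h = crc32c_update(h, object_data.data() + pos, len);
    }
    return h;
}
```

Because the running value is carried from one stride to the next, the result does not depend on the stride size, which is why the primary and the replicas obtain comparable digests as long as they start from the same seed.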

4.4 Processing from replicas

When a replica receives the MOSDRepScrub message sent by the primary to obtain the verification information of the objects, it calls the function replica_scrub() to handle it.

void PG::replica_scrub(
  OpRequestRef op,
  ThreadPool::TPHandle &handle);

The specific implementation of the function replica_scrub() is as follows:

1) First ensure that scrubber.active_rep_scrub is empty, that is, no earlier request is still waiting;

2) If the value of msg->map_epoch is less than the value of info.history.same_interval_since, return directly. Here the replica simply discards the obsolete MOSDRepScrub request;

3) If the value of last_update_applied is less than the value of msg->scrub_to, the replica's applied log is behind the version that the primary wants to scrub, so it must wait until they are consistent. Save the current op in scrubber.active_rep_scrub and wait;

4) If active_pushes is greater than 0, a Recovery operation is in progress. Save the current op in scrubber.active_rep_scrub and wait in this case as well;

5) Otherwise, call the function build_scrub_map_chunk() to build the ScrubMap and send it to the primary.

When the local operation being waited for has been applied, the function ReplicatedPG::op_applied() checks the saved request: if scrubber.active_rep_scrub is not empty and the version of the applied operation is equal to msg->scrub_to, the saved op is put back into the osd->op_wq request queue so that the request can be completed.
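The order of the checks in replica_scrub() can be summarized in a small decision function. This is a schematic sketch: ReplicaState, ScrubRequest, ReplicaAction and decide() are hypothetical names, and versions are reduced to single integers although the real code compares eversion_t values.

```cpp
#include <cstdint>

// What the replica does with an incoming scrub request.
enum class ReplicaAction {
    DROP_STALE,       // request belongs to an older interval
    WAIT_FOR_LOG,     // park the op until the log has been applied
    WAIT_FOR_PUSHES,  // park the op until Recovery pushes finish
    BUILD_AND_REPLY   // build the ScrubMap chunk and send it back
};

// Hypothetical reduction of the replica-side PG state.
struct ReplicaState {
    uint32_t same_interval_since;  // info.history.same_interval_since
    uint64_t last_update_applied;  // simplified version number
    int      active_pushes;
};

// Hypothetical reduction of the MOSDRepScrub message.
struct ScrubRequest {
    uint32_t map_epoch;
    uint64_t scrub_to;             // simplified version number
};

inline ReplicaAction decide(const ReplicaState &pg, const ScrubRequest &msg) {
    if (msg.map_epoch < pg.same_interval_since) {
        return ReplicaAction::DROP_STALE;
    }
    if (pg.last_update_applied < msg.scrub_to) {
        return ReplicaAction::WAIT_FOR_LOG;
    }
    if (pg.active_pushes > 0) {
        return ReplicaAction::WAIT_FOR_PUSHES;
    }
    return ReplicaAction::BUILD_AND_REPLY;
}
```

In both waiting cases the real function stores the op in scrubber.active_rep_scrub, so the request can be requeued once the blocking condition clears.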

4.5 Copy Comparison

When both the primary and the replicas of the object have built their verification information and stored it in the corresponding ScrubMap structure, the next step is to compare the verification information of each replica to complete the consistency check. First an authoritative object is selected using the information carried by the object itself, and then the other replicas are checked by comparing them with the authoritative object. The functions used for comparison are described below.

4.5.1 scrub_compare_maps()

The function scrub_compare_maps() checks whether the information of the different replicas is consistent. The processing flow is as follows:

void PG::scrub_compare_maps();

1) First ensure that acting.size() is greater than 1, if the PG has only one OSD, it cannot be compared;

2) Put the ScrubMap corresponding to the OSD of actingbackfill into the maps;

3) Call the function be_compare_scrubmaps() to compare the objects of each copy, and save the shard where the complete copy of the object is located in the authoritative;

4) Call the _scrub() function to continue comparing the consistency of objects between snaps;

4.5.2 be_compare_scrubmaps()

The function be_compare_scrubmaps() is used to compare the consistency of each copy of the object. The specific processing process is analyzed as follows:

void PGBackend::be_compare_scrubmaps(
  const map<pg_shard_t,ScrubMap*> &maps,
  bool repair,
  map<hobject_t, set<pg_shard_t>, hobject_t::BitwiseComparator> &missing,
  map<hobject_t, set<pg_shard_t>, hobject_t::BitwiseComparator> &inconsistent,
  map<hobject_t, list<pg_shard_t>, hobject_t::BitwiseComparator> &authoritative,
  map<hobject_t, pair<uint32_t,uint32_t>, hobject_t::BitwiseComparator> &missing_digest,
  int &shallow_errors, int &deep_errors,
  Scrub::Store *store,
  const spg_t& pgid,
  const vector<int> &acting,
  ostream &errorstream);

1) First build the master set, which is the union of objects on all replica OSDs;

2) Perform the following operations on each object in the master set:

  a) Call the function be_select_auth_object() to select a copy auth with an authoritative object. If no authoritative object is selected, add 1 to the variable shallow_errors to record this error;

  b) Call the function be_compare_scrub_objects() to compare the object on each shard with the authoritative object. The comparison covers the data digest, the omap_digest and the attributes:

    • If the result is clean, the object matches the authoritative object in every comparison, and the shard is added to the auth_list list;

    • If the result is not clean, the object is added to the cur_inconsistent list, and the values of shallow_errors and deep_errors are counted accordingly;

    • If the object does not exist on the shard, it is added to the cur_missing list, and the value of shallow_errors is counted;

  c) Check all the comparison results of the object: if cur_missing is not empty, add it to the missing queue; if cur_inconsistent is not empty, add it to the inconsistent queue; if some replica of the object is incomplete, record the shards without problems in authoritative;

  d) If the data digest and the omap_digest recorded in the object_info of the authoritative object are inconsistent with the results calculated from the actually scanned data, the update mode is set to FORCE to force a repair. If object_info contains no data digest and no omap digest, the update mode is set to MAYBE.

  e) Finally check, if the update mode is FORCE, or the age of the object is greater than the value of the configuration parameter g_conf->osd_deep_scrub_update_digest_min_age, add it to the missing_digest list;
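The per-shard classification in steps b) and c) can be sketched as below. It is a simplified illustration and not the PGBackend code: ShardObject, CompareResult and compare_to_auth() are hypothetical names, shards are plain integers, the attributes are flattened into one string, and the error counters are left out.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical reduction of the per-shard verification information.
struct ShardObject {
    uint32_t data_digest;
    uint32_t omap_digest;
    std::string attrs;   // flattened stand-in for the attribute map
};

struct CompareResult {
    std::vector<int> auth_list;         // shards matching the authoritative copy
    std::vector<int> cur_inconsistent;  // shards that differ from it
    std::vector<int> cur_missing;       // shards that do not have the object
};

// all_shards: every shard expected to hold the object.
// present:    shard -> scanned information, for shards that do hold it.
inline CompareResult compare_to_auth(const ShardObject &auth,
                                     const std::vector<int> &all_shards,
                                     const std::map<int, ShardObject> &present) {
    CompareResult out;
    for (int shard : all_shards) {
        std::map<int, ShardObject>::const_iterator it = present.find(shard);
        if (it == present.end()) {
            out.cur_missing.push_back(shard);
            continue;
        }
        const ShardObject &o = it->second;
        bool clean = o.data_digest == auth.data_digest &&
                     o.omap_digest == auth.omap_digest &&
                     o.attrs == auth.attrs;
        if (clean) {
            out.auth_list.push_back(shard);
        } else {
            out.cur_inconsistent.push_back(shard);
        }
    }
    return out;
}
```

The three output lists correspond to auth_list, cur_inconsistent and cur_missing in the description above; the caller would then merge them into the missing, inconsistent and authoritative maps.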

4.5.3 be_select_auth_object()

The function be_select_auth_object() is used to select an authoritative object among the replica objects on each OSD: the auth_obj object. The principle is to verify the integrity of itself based on the redundant information it carries. The specific process is as follows:

map<pg_shard_t, ScrubMap *>::const_iterator
  PGBackend::be_select_auth_object(
  const hobject_t &obj,
  const map<pg_shard_t,ScrubMap*> &maps,
  object_info_t *auth_oi,
  map<pg_shard_t, shard_info_wrapper> &shard_map,
  inconsistent_obj_wrapper &object_error);

1) First confirm that the read_error and stat_error of the object are not set, that is, there is no error in the process of obtaining the data and metadata of the object, otherwise skip it directly;

2) Confirm that the obtained attribute OI_ATTR value is not empty, and correctly decode the data structure object_info_t, and set the current object as the auth_obj object;

3) Verify that the size value stored in object_info_t is consistent with the size value of the scanned object, if not, continue to search for a better auth_obj object;

4) If the PG is of the replicated type, verify that the data digest and omap digest stored in object_info_t are consistent with the values calculated during the scan. If they are inconsistent, continue to search for a better auth_obj object;

5) If the above are all consistent, the loop is terminated directly, and a satisfactory auth_obj object has been found;

From the above selection process, we can see that the conditions for selecting an authoritative object are as follows:

  • The two conditions in steps 1) and 2) are the baseline: the data and attributes of the object must be readable;

  • Steps 3) and 4) use the redundant information stored in object_info_t, namely the object size and the digests of the data and omap. This information is verified by comparing it with the values calculated from the data read during the scan.
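The selection loop described in steps 1) to 5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the real PGBackend code: ScannedCopy and select_auth_shard() are hypothetical names, shards are plain integers, and version comparison and error reporting are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified view of one scanned copy of an object.
struct ScannedCopy {
    bool read_error;               // reading the data failed
    bool stat_error;               // reading the metadata failed
    bool oi_decoded;               // OI_ATTR was present and decoded correctly
    uint64_t scanned_size;         // values computed while scanning
    uint32_t scanned_data_digest;
    uint32_t scanned_omap_digest;
    uint64_t oi_size;              // values recorded in object_info_t
    uint32_t oi_data_digest;
    uint32_t oi_omap_digest;
};

// Returns the shard chosen as authoritative, or -1 if no copy qualifies.
int select_auth_shard(const std::map<int, ScannedCopy>& copies, bool replicated) {
    int auth = -1;
    for (const auto& kv : copies) {
        const ScannedCopy& c = kv.second;
        if (c.read_error || c.stat_error) continue;   // step 1: unreadable copy
        if (!c.oi_decoded) continue;                  // step 2: no usable object_info_t
        auth = kv.first;                              // candidate found
        if (c.oi_size != c.scanned_size) continue;    // step 3: look for a better one
        if (replicated &&
            (c.oi_data_digest != c.scanned_data_digest ||
             c.oi_omap_digest != c.scanned_omap_digest)) {
            continue;                                 // step 4: look for a better one
        }
        break;                                        // step 5: fully consistent copy
    }
    assert(auth == -1 || copies.count(auth) == 1);
    return auth;
}
```

Note that a copy that fails only step 3) or 4) still remains the candidate if no better copy is found later, which mirrors the "continue to search for a better auth_obj" wording in the steps above.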

4.5.4 _scrub()

The function _scrub() checks the consistency between head objects and their snapshot objects:

void ReplicatedPG::_scrub(
  ScrubMap &scrubmap,
  const map<hobject_t, pair<uint32_t, uint32_t>, hobject_t::BitwiseComparator> &missing_digest);

If the pool has a cache pool layer, clone objects are allowed to be incomplete, because some objects may still exist only in the cache pool and have not yet been flushed back. Whether this is allowed is determined by the function pool.info.allow_incomplete_clones().

The actual code is fairly complicated; the following example illustrates the basic process of its implementation.

Example 12-1 An example of the implementation process of the _scrub() function is shown in Table 12-1 below

The implementation process of _scrub() is described as follows:

1) The object obj1 snap1 is a snapshot object, so it should have a corresponding head or snapdir object. Because no such head or snapdir object exists, the object is marked as an unexpected object;

2) The object obj2 head is a head object, which is an expected object. Through the head object, get the clones list of its snapset: [6,4,2,1];

3) The object obj2 snap7 is not in the clones list of the snapset of obj2, so it is an unexpected object;

4) The snapshot object snap6 of the object obj2 is in the clones list of the snapset;

5) The snapshot object snap4 of the object obj2 is in the clones list of the snapset;

6) When the obj3 head object is encountered, the snapshot objects snap2 and snap1 that were still expected for obj2 have not appeared, so they are marked as missing objects. Then get the clones list of the snapset of obj3 through its head object: [3,1];

7) The snapshot object snap3 of object obj3 is consistent with the expected object;

8) The snapshot object snap1 of object obj3 is consistent with the expected object;

9) The snapdir object of obj4 is an expected object. The clones list of its snapset is [4];

10) The scanned object list has ended, but the snapshot object snap4 of obj4 is still expected, so the object obj4 snap4 is missing;

Currently, missing and unexpected objects are only recorded in the log, without further processing.

Finally, for objects whose data is correct but whose recorded digest is wrong or absent, that is, the objects in missing_digest whose digests need to be updated, a request to update the digest is sent.
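The walk in steps 1) to 10) can be sketched as follows. This is a simplified illustration, not the ReplicatedPG code: ScanEntry and check_clones() are hypothetical, objects are assumed to arrive in the order used by the example (a head or snapdir object first, then its clones from newest to oldest), and the returned strings stand in for log messages.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical scan entry: either a head/snapdir object or one clone.
struct ScanEntry {
    std::string name;
    bool is_head;                  // true for a head or snapdir object
    uint64_t snap;                 // clone id; ignored when is_head is true
    std::vector<uint64_t> clones;  // snapset clones, newest first; head only
};

// Returns the "missing ..." and "unexpected ..." messages for a scan.
std::vector<std::string> check_clones(const std::vector<ScanEntry>& scan) {
    std::vector<std::string> log;
    std::string cur;               // object whose clones are currently expected
    bool have_head = false;
    std::deque<uint64_t> expected; // clones of `cur` not seen yet

    auto flush_missing = [&]() {
        for (uint64_t s : expected)
            log.push_back("missing " + cur + " snap" + std::to_string(s));
        expected.clear();
    };

    for (const ScanEntry& e : scan) {
        assert(e.is_head || e.clones.empty());
        if (e.is_head) {
            flush_missing();       // clones of the previous object never showed up
            cur = e.name;
            have_head = true;
            expected.assign(e.clones.begin(), e.clones.end());
            continue;
        }
        const std::string label = e.name + " snap" + std::to_string(e.snap);
        if (!have_head || e.name != cur) {
            log.push_back("unexpected " + label);  // clone without head/snapdir
            continue;
        }
        // Newer expected clones that were skipped are missing.
        while (!expected.empty() && expected.front() > e.snap) {
            log.push_back("missing " + cur + " snap" +
                          std::to_string(expected.front()));
            expected.pop_front();
        }
        if (!expected.empty() && expected.front() == e.snap)
            expected.pop_front();                  // clone is in the clones list
        else
            log.push_back("unexpected " + label);  // clone not in the clones list
    }
    flush_missing();               // end of scan: remaining clones are missing
    return log;
}
```

Fed with the ten entries of the example, the sketch reports obj1 snap1 and obj2 snap7 as unexpected, and obj2 snap2, obj2 snap1 and obj4 snap4 as missing.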

4.6 End the Scrub process

The scrub_finish() function is used to end the Scrub process, and its processing is as follows:

void PG::scrub_finish();

1) Set the status and statistical information of the relevant PG;

2) Call the function scrub_process_inconsistent() to handle the missing and inconsistent objects recorded in the scrubber. It eventually calls the repair_object() function, which only marks the object as missing in peer_missing;

3) Finally, the DoRecovery event is triggered and sent to the PG state machine to initiate the actual object recovery operation.
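The division of labor in steps 2) and 3), where repair only marks objects as missing and the recovery machinery does the real work, can be sketched as follows. All names here (PgSketch, repair_object_sketch(), scrub_finish_sketch()) are hypothetical, shards are plain integers, and queuing the event only when something was marked is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the parts of a PG used at the end of a scrub.
struct PgSketch {
    std::map<int, std::set<std::string>> peer_missing;  // shard -> missing objects
    std::vector<std::string> events;                    // events sent to the state machine
};

// Repair does not copy any data; it only records the object as missing on the bad shard.
void repair_object_sketch(PgSketch& pg, const std::string& obj, int bad_shard) {
    assert(bad_shard >= 0);
    pg.peer_missing[bad_shard].insert(obj);
}

// bad_copies maps each missing or inconsistent object to the shards holding a bad copy.
// Returns true if recovery was requested.
bool scrub_finish_sketch(PgSketch& pg,
                         const std::map<std::string, std::set<int>>& bad_copies) {
    bool marked = false;
    for (const auto& kv : bad_copies) {
        for (int shard : kv.second) {
            repair_object_sketch(pg, kv.first, shard);
            marked = true;
        }
    }
    if (marked)
        pg.events.push_back("DoRecovery");  // recovery later pulls or pushes good copies
    return marked;
}
```

Marking the object as missing reuses the normal recovery path, so no separate copy logic is needed for scrub repair.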

5. Chapter Summary

This chapter introduced the basic principles of Scrub and the scheduling mechanism of the Scrub process, and then described the specific process of building the verification information and comparing it.

Finally, the key points of Ceph's consistency check function Scrub are summarized as follows:

  • If an object between start and end, the object range checked by the scrub operation, is being written, the scrub process exits; if the scrub has already started, write operations on objects between start and end must wait until the scrub operation finishes;

  • The check compares whether the metadata (size, attrs, omap) and the data of the primary and replica copies are consistent. The authoritative object is selected according to whether the information the object stores about itself (object info) is consistent with the information read from the object. The other copies of the object are then compared against the authoritative object.
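The first key point, that writes to objects inside the range being scrubbed must wait, can be sketched as a simple range test. ScrubRange and write_blocked_by_scrub_sketch() are hypothetical names, object names are compared as plain strings, and treating the range as half-open, [start, end), is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of the chunk currently being scrubbed.
struct ScrubRange {
    bool active;        // a scrub chunk is in progress
    std::string start;  // first object of the chunk (inclusive)
    std::string end;    // end of the chunk (exclusive, by assumption)
};

// Returns true if a write to `obj` has to wait for the scrub to finish.
bool write_blocked_by_scrub_sketch(const ScrubRange& r, const std::string& obj) {
    if (!r.active) return false;
    assert(r.start <= r.end);
    return obj >= r.start && obj < r.end;
}
```

Writes to objects outside the range proceed normally, which is why scrubbing in small chunks limits its impact on client IO.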

Origin blog.csdn.net/weixin_43778179/article/details/132718238