data structure
class OpHistory {
  set<pair<utime_t, TrackedOpRef> > arrived;   // sorted by arrival time, earliest first
  set<pair<double, TrackedOpRef> > duration;   // sorted by op duration, shortest first
  Mutex ops_history_lock;     // protects the two sets above
  bool shutdown;              // set when the OSD shuts down
  uint32_t history_size;      // maximum number of historical ops retained
  uint32_t history_duration;  // longest time a historical op is retained
};
Stores information about ops that have completed within a recent time window.
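The dual-index idea above can be illustrated with a minimal, self-contained sketch. `MiniOpHistory`, `Time`, and `OpId` are illustrative stand-ins (not Ceph's `utime_t`/`TrackedOpRef`): the same op is inserted into two ordered sets so it can later be expired by age from one index or evicted by size using the other.

```cpp
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-ins for Ceph's utime_t and TrackedOpRef.
using Time = double;
using OpId = int;

// Minimal sketch of OpHistory's dual-index layout: one set ordered by
// arrival time (oldest first), one ordered by duration (shortest first).
class MiniOpHistory {
    std::set<std::pair<Time, OpId>> arrived_;
    std::set<std::pair<double, OpId>> duration_;
public:
    void insert(Time arrival, double dur, OpId op) {
        arrived_.insert({arrival, op});
        duration_.insert({dur, op});
    }
    size_t size() const { return arrived_.size(); }
    Time oldest_arrival() const { return arrived_.begin()->first; }
    double longest_duration() const { return duration_.rbegin()->first; }
};
```

Keeping both sets in lockstep is what lets `dump_ops` show the oldest ops and the slowest ops without re-sorting.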
class OpTracker {
  class RemoveOnDelete {
    OpTracker *tracker;
  };
  atomic64_t seq;  // monotonically increasing id assigned to each request, starting at 0
  struct ShardedTrackingData {
    Mutex ops_in_flight_lock_sharded;
    xlist<TrackedOp *> ops_in_flight_sharded;
  };
  vector<ShardedTrackingData*> sharded_in_flight_list;  // shard list holding in-flight TrackedOps
  uint32_t num_optracker_shards;  // number of shards; cannot be changed at runtime
  OpHistory history;              // holds completed (historical) TrackedOps
  float complaint_time;           // age threshold beyond which a TrackedOp triggers a warning
  int log_threshold;              // maximum number of warning logs emitted per check
public:
  bool tracking_enabled;          // whether op tracking is enabled
  CephContext *cct;
};
The management class for all op tracking.
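Since `num_optracker_shards` is fixed, mapping a request to its shard can be done with plain modular arithmetic. This is only a plausible sketch of the idea (the exact mapping in Ceph's source may differ); `pick_shard` is an illustrative name.

```cpp
#include <cstdint>

// Sketch: map a request's monotonically increasing seq to one of the
// fixed shards in sharded_in_flight_list. Because the shard count is
// set at construction and never changes, the mapping stays stable.
inline uint32_t pick_shard(uint64_t seq, uint32_t num_shards) {
    return static_cast<uint32_t>(seq % num_shards);
}
```

Consecutive seqs spread round-robin across shards, so no single `ops_in_flight_lock_sharded` becomes a global bottleneck.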
class TrackedOp {
  xlist<TrackedOp*>::item xitem;  // item in the OpTracker's xlist
protected:
  OpTracker *tracker;
  utime_t initiated_at;  // time the request arrived
  list<pair<utime_t, string> > events;  // event points the op has passed through, with timestamps
  mutable Mutex lock;    // protects events
  string current;        // current event
  uint64_t seq;          // seq allocated by the OpTracker
  uint32_t warn_interval_multiplier;  // rate-limits warning output for this op
};
The base class for a single tracked op.
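The core pattern in `TrackedOp` is appending `(timestamp, event)` pairs under a lock as the op moves through its stages. A minimal sketch of that pattern, with `MiniTrackedOp` as an illustrative stand-in and a plain `double` in place of `utime_t`:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <utility>

// Sketch of TrackedOp's event recording: each stage appends
// (timestamp, name) under a lock, and `current` always names the
// most recently reached stage.
class MiniTrackedOp {
    std::list<std::pair<double, std::string>> events_;
    std::string current_;
    mutable std::mutex lock_;
public:
    void mark_event(const std::string& name, double now) {
        std::lock_guard<std::mutex> g(lock_);
        events_.emplace_back(now, name);
        current_ = name;
    }
    std::string current() const {
        std::lock_guard<std::mutex> g(lock_);
        return current_;
    }
    size_t num_events() const {
        std::lock_guard<std::mutex> g(lock_);
        return events_.size();
    }
};
```

When a slow request is reported, dumping this event list shows exactly which stage the op is stuck in.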
struct OpRequest : public TrackedOp {
  int rmw_flags;  // op flags such as CEPH_OSD_RMW_FLAG_READ
private:
  Message *request;           // the logical request we are tracking
  osd_reqid_t reqid;          // the client's request id
  uint8_t hit_flag_points;    // which flags were hit (e.g. flag_reached_pg); currently unused
  uint8_t latest_flag_point;  // the most recent flag; currently unused
  utime_t dequeued_time;      // time the op was dequeued from op_shardedwq
};
A concrete tracked-op type for a single request.
key function implementation
OpTracker::RemoveOnDelete::operator()(TrackedOp *op)
Called when the last smart-pointer reference to a TrackedOp is released.
1. Mark the current TrackedOp as done.
2. Call unregister_inflight_op to remove the TrackedOp from the corresponding shard in the OpTracker's sharded_in_flight_list.
3. Add the TrackedOp to the history instance.
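The three steps above rely on `shared_ptr`'s custom-deleter mechanism: the bookkeeping runs automatically when the last reference drops. A self-contained sketch, with `DemoOp`, `DemoTracker`, and `track_op` as illustrative names (the counters stand in for the real unregister/history logic):

```cpp
#include <memory>

// Illustrative stand-ins for TrackedOp and OpTracker.
struct DemoOp {
    bool done = false;
};

struct DemoTracker {
    int in_flight = 0;
    int in_history = 0;
};

// The RemoveOnDelete pattern: a shared_ptr custom deleter that runs
// the cleanup steps when the last reference to the op is released.
struct RemoveOnDelete {
    DemoTracker* tracker;
    void operator()(DemoOp* op) const {
        op->done = true;       // 1. mark the op as done
        --tracker->in_flight;  // 2. unregister from the in-flight list
        ++tracker->in_history; // 3. hand the op over to history
        delete op;
    }
};

inline std::shared_ptr<DemoOp> track_op(DemoTracker& t) {
    ++t.in_flight;
    return std::shared_ptr<DemoOp>(new DemoOp, RemoveOnDelete{&t});
}
```

This design means no explicit "finish tracking" call is needed anywhere in the IO path; dropping the last `TrackedOpRef` is enough.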
void OpHistory::cleanup(utime_t now)
Called from OpHistory::insert and OpHistory::dump_ops, i.e. whenever a new TrackedOp is inserted or all TrackedOps are dumped.
This function walks the arrived and duration lists of OpHistory. It first removes TrackedOps that have exceeded the retention time; by default, history keeps requests from the last 600 s. It then enforces the size cap: by default, history retains at most 20 requests, evicting the oldest when over the limit.
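The two-phase trim can be sketched over a single arrival-ordered set (the real OpHistory trims the duration-ordered set in lockstep; names and the plain `double` timestamps here are illustrative):

```cpp
#include <cstddef>
#include <set>
#include <utility>

using Entry = std::pair<double, int>;  // (arrival time, op id)

// Sketch of OpHistory::cleanup: first drop entries older than
// history_duration (600 s by default), then, if still over
// history_size (20 by default), drop the oldest survivors.
inline void cleanup(std::set<Entry>& arrived, double now,
                    double history_duration, std::size_t history_size) {
    while (!arrived.empty() &&
           now - arrived.begin()->first > history_duration)
        arrived.erase(arrived.begin());
    while (arrived.size() > history_size)
        arrived.erase(arrived.begin());
}
```

Because the set is ordered by arrival time, both trims only ever touch the front of the set, so cleanup stays cheap even when called on every insert.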
bool OpTracker::check_ops_in_flight(std::vector<string> &warning_vector)
Called periodically by the OSD's tick thread to check whether any in-flight TrackedOps have been running abnormally long.
The function first walks all shards to find the oldest op (saved in oldest_op) and counts the in-flight ops (saved in total_ops_in_flight). If there are no ops, or the oldest op is younger than complaint_time, everything is normal and the function returns false. Otherwise it walks all shards again, collects the slow requests whose age exceeds complaint_time into warning_vector, and counts them; once the count exceeds log_threshold, no further warnings are appended.
A small trick here: warning_vector reserves its first slot up front, and once the scan finishes, the summary statistics are written into that first entry.
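The control flow, including the reserved-first-slot trick, can be sketched as follows. This is a simplified model (ops are just arrival times, there is no sharding, and `check_ops` is an illustrative name), not Ceph's actual implementation:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of check_ops_in_flight: return false if nothing is slow;
// otherwise reserve warning_vector[0] for a summary, append per-op
// warnings up to log_threshold, and fill in the summary at the end.
inline bool check_ops(const std::vector<double>& arrivals, double now,
                      double complaint_time, std::size_t log_threshold,
                      std::vector<std::string>& warning_vector) {
    if (arrivals.empty())
        return false;
    double oldest = arrivals.front();
    for (double a : arrivals)
        if (a < oldest) oldest = a;
    if (now - oldest < complaint_time)
        return false;                 // oldest op is still young enough
    warning_vector.push_back("");     // reserve slot 0 for the summary
    std::size_t slow = 0;
    for (double a : arrivals) {
        if (now - a < complaint_time)
            continue;
        ++slow;
        if (warning_vector.size() - 1 < log_threshold)
            warning_vector.push_back("slow request, age "
                                     + std::to_string(now - a));
    }
    warning_vector[0] = std::to_string(slow) + " slow requests";
    return true;
}
```

Reserving slot 0 avoids a second pass: the per-op lines are collected during the scan, and the total (which is only known at the end) is patched in afterwards.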
Event summary
Common events
| Event | Meaning |
| --- | --- |
| initiated | Set in the TrackedOp constructor; the initialization event. |
| reached_pg | The op has just been dequeued from the OSD's op_shardedwq queue. |
| started | Set in many places; in the normal OSD IO path it is set in do_op, after all exception checks pass, just before execute_ctx is called. |
| waiting for subops from | The primary OSD has sent the request to the replica OSDs. |
| commit_queued_for_journal_write | The request is about to enter the journal queue. |
| write_thread_in_journal_buffer | The journal data has been prepared in the buffer but not yet written. |
| journaled_completion_queued | The journal has been written and the completion callback has been queued. |
| on_commit | The commit acknowledgment in the multi-replica write path. |
| op_applied | The apply completion on the primary OSD in the multi-replica path. |
| sub_op_commit_rec | The primary OSD has processed a replica OSD's commit message. |
| commit_sent | Triggered when the commits from all replicas (e.g. all three) have returned, i.e. by the last replica's reply. |
| sub_op_applied_rec | The primary OSD has processed a replica OSD's apply message; in the normal read/write path, replicas do not send apply messages. |
| waiting for rw locks | The read/write path acquires the relevant locks in do_op; if a lock cannot be obtained, the request is parked in the ObjectContext and resumed when the lock is released. |
Event sequence of the IO process
Normal case:
Read requests while disk writes are slow: