data structure
class OpHistory {
  set<pair<utime_t, TrackedOpRef> > arrived;   // sorted by arrival time, earliest first
  set<pair<double, TrackedOpRef> > duration;   // sorted by op duration, shortest first
  Mutex ops_history_lock;     // protects the two sets above
  bool shutdown;              // set when the OSD shuts down
  uint32_t history_size;      // maximum number of historical ops retained
  uint32_t history_duration;  // longest time a historical op is retained
};
Stores information about ops that have completed within a recent time window.
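The dual-index idea above can be illustrated with a minimal, self-contained sketch. `MiniOpHistory`, `Time`, and `OpId` are illustrative stand-ins (not Ceph's `utime_t`/`TrackedOpRef`): the same op is inserted into two ordered sets so it can later be expired by age from one index or evicted by size using the other.

```cpp
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-ins for Ceph's utime_t and TrackedOpRef.
using Time = double;
using OpId = int;

// Minimal sketch of OpHistory's dual-index layout: one set ordered by
// arrival time (oldest first), one ordered by duration (shortest first).
class MiniOpHistory {
    std::set<std::pair<Time, OpId>> arrived_;
    std::set<std::pair<double, OpId>> duration_;
public:
    void insert(Time arrival, double dur, OpId op) {
        arrived_.insert({arrival, op});
        duration_.insert({dur, op});
    }
    size_t size() const { return arrived_.size(); }
    Time oldest_arrival() const { return arrived_.begin()->first; }
    double longest_duration() const { return duration_.rbegin()->first; }
};
```

Keeping both sets in lockstep is what lets `dump_ops` show the oldest ops and the slowest ops without re-sorting.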
class OpTracker {
  class RemoveOnDelete {
    OpTracker *tracker;
  };
  atomic64_t seq;  // monotonically increasing id assigned to each request, starting at 0
  struct ShardedTrackingData {
    Mutex ops_in_flight_lock_sharded;
    xlist<TrackedOp *> ops_in_flight_sharded;
  };
  vector<ShardedTrackingData*> sharded_in_flight_list;  // shard list holding in-flight TrackedOps
  uint32_t num_optracker_shards;  // number of shards; cannot be changed at runtime
  OpHistory history;              // holds completed (historical) TrackedOps
  float complaint_time;           // age threshold beyond which a TrackedOp triggers a warning
  int log_threshold;              // maximum number of warning logs emitted per check
public:
  bool tracking_enabled;          // whether op tracking is enabled
  CephContext *cct;
};
The management class for all op tracking.
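Since `num_optracker_shards` is fixed, mapping a request to its shard can be done with plain modular arithmetic. This is only a plausible sketch of the idea (the exact mapping in Ceph's source may differ); `pick_shard` is an illustrative name.

```cpp
#include <cstdint>

// Sketch: map a request's monotonically increasing seq to one of the
// fixed shards in sharded_in_flight_list. Because the shard count is
// set at construction and never changes, the mapping stays stable.
inline uint32_t pick_shard(uint64_t seq, uint32_t num_shards) {
    return static_cast<uint32_t>(seq % num_shards);
}
```

Consecutive seqs spread round-robin across shards, so no single `ops_in_flight_lock_sharded` becomes a global bottleneck.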
class TrackedOp {
  xlist<TrackedOp*>::item xitem;  // item in the OpTracker's xlist
protected:
  OpTracker *tracker;
  utime_t initiated_at;  // time the request arrived
  list<pair<utime_t, string> > events;  // event points the op has passed through, with timestamps
  mutable Mutex lock;    // protects events
  string current;        // current event
  uint64_t seq;          // seq allocated by the OpTracker
  uint32_t warn_interval_multiplier;  // rate-limits warning output for this op
};
The base class for a single tracked op.
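The core pattern in `TrackedOp` is appending `(timestamp, event)` pairs under a lock as the op moves through its stages. A minimal sketch of that pattern, with `MiniTrackedOp` as an illustrative stand-in and a plain `double` in place of `utime_t`:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <utility>

// Sketch of TrackedOp's event recording: each stage appends
// (timestamp, name) under a lock, and `current` always names the
// most recently reached stage.
class MiniTrackedOp {
    std::list<std::pair<double, std::string>> events_;
    std::string current_;
    mutable std::mutex lock_;
public:
    void mark_event(const std::string& name, double now) {
        std::lock_guard<std::mutex> g(lock_);
        events_.emplace_back(now, name);
        current_ = name;
    }
    std::string current() const {
        std::lock_guard<std::mutex> g(lock_);
        return current_;
    }
    size_t num_events() const {
        std::lock_guard<std::mutex> g(lock_);
        return events_.size();
    }
};
```

When a slow request is reported, dumping this event list shows exactly which stage the op is stuck in.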
struct OpRequest : public TrackedOp {
  int rmw_flags;  // op flags such as CEPH_OSD_RMW_FLAG_READ
private:
  Message *request;           // the logical request we are tracking
  osd_reqid_t reqid;          // the client's request id
  uint8_t hit_flag_points;    // which flags were hit (e.g. flag_reached_pg); currently unused
  uint8_t latest_flag_point;  // the most recent flag; currently unused
  utime_t dequeued_time;      // time the op was dequeued from op_shardedwq
};
A concrete tracked-op type for a single request.
key function implementation
OpTracker::RemoveOnDelete::operator()(TrackedOp *op)
Called when the last smart-pointer reference to a TrackedOp is released.
1. Mark the current TrackedOp as done.
2. Call unregister_inflight_op to remove the TrackedOp from the corresponding shard in the OpTracker's sharded_in_flight_list.
3. Add the TrackedOp to the history instance.
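The three steps above rely on `shared_ptr`'s custom-deleter mechanism: the bookkeeping runs automatically when the last reference drops. A self-contained sketch, with `DemoOp`, `DemoTracker`, and `track_op` as illustrative names (the counters stand in for the real unregister/history logic):

```cpp
#include <memory>

// Illustrative stand-ins for TrackedOp and OpTracker.
struct DemoOp {
    bool done = false;
};

struct DemoTracker {
    int in_flight = 0;
    int in_history = 0;
};

// The RemoveOnDelete pattern: a shared_ptr custom deleter that runs
// the cleanup steps when the last reference to the op is released.
struct RemoveOnDelete {
    DemoTracker* tracker;
    void operator()(DemoOp* op) const {
        op->done = true;       // 1. mark the op as done
        --tracker->in_flight;  // 2. unregister from the in-flight list
        ++tracker->in_history; // 3. hand the op over to history
        delete op;
    }
};

inline std::shared_ptr<DemoOp> track_op(DemoTracker& t) {
    ++t.in_flight;
    return std::shared_ptr<DemoOp>(new DemoOp, RemoveOnDelete{&t});
}
```

This design means no explicit "finish tracking" call is needed anywhere in the IO path; dropping the last `TrackedOpRef` is enough.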
void OpHistory::cleanup(utime_t now)
Called from OpHistory::insert and OpHistory::dump_ops, i.e. whenever a new TrackedOp is inserted or all TrackedOps are dumped.
This function walks the arrived and duration lists of OpHistory. It first removes TrackedOps that have exceeded the retention time; by default, history keeps requests from the last 600 s. It then enforces the size cap: by default, history retains at most 20 requests, evicting the oldest when over the limit.
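The two-phase trim can be sketched over a single arrival-ordered set (the real OpHistory trims the duration-ordered set in lockstep; names and the plain `double` timestamps here are illustrative):

```cpp
#include <cstddef>
#include <set>
#include <utility>

using Entry = std::pair<double, int>;  // (arrival time, op id)

// Sketch of OpHistory::cleanup: first drop entries older than
// history_duration (600 s by default), then, if still over
// history_size (20 by default), drop the oldest survivors.
inline void cleanup(std::set<Entry>& arrived, double now,
                    double history_duration, std::size_t history_size) {
    while (!arrived.empty() &&
           now - arrived.begin()->first > history_duration)
        arrived.erase(arrived.begin());
    while (arrived.size() > history_size)
        arrived.erase(arrived.begin());
}
```

Because the set is ordered by arrival time, both trims only ever touch the front of the set, so cleanup stays cheap even when called on every insert.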
bool OpTracker::check_ops_in_flight(std::vector<string> &warning_vector)
Called periodically by the OSD's tick thread to check whether any in-flight TrackedOps have been running abnormally long.
The function first walks all shards to find the oldest op (saved in oldest_op) and counts the in-flight ops (saved in total_ops_in_flight). If there are no ops, or the oldest op is younger than complaint_time, everything is normal and the function returns false. Otherwise it walks all shards again, collects the slow requests whose age exceeds complaint_time into warning_vector, and counts them; once the count exceeds log_threshold, no further warnings are appended.
A small trick here: warning_vector reserves its first slot up front, and once the scan finishes, the summary statistics are written into that first entry.
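The control flow, including the reserved-first-slot trick, can be sketched as follows. This is a simplified model (ops are just arrival times, there is no sharding, and `check_ops` is an illustrative name), not Ceph's actual implementation:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of check_ops_in_flight: return false if nothing is slow;
// otherwise reserve warning_vector[0] for a summary, append per-op
// warnings up to log_threshold, and fill in the summary at the end.
inline bool check_ops(const std::vector<double>& arrivals, double now,
                      double complaint_time, std::size_t log_threshold,
                      std::vector<std::string>& warning_vector) {
    if (arrivals.empty())
        return false;
    double oldest = arrivals.front();
    for (double a : arrivals)
        if (a < oldest) oldest = a;
    if (now - oldest < complaint_time)
        return false;                 // oldest op is still young enough
    warning_vector.push_back("");     // reserve slot 0 for the summary
    std::size_t slow = 0;
    for (double a : arrivals) {
        if (now - a < complaint_time)
            continue;
        ++slow;
        if (warning_vector.size() - 1 < log_threshold)
            warning_vector.push_back("slow request, age "
                                     + std::to_string(now - a));
    }
    warning_vector[0] = std::to_string(slow) + " slow requests";
    return true;
}
```

Reserving slot 0 avoids a second pass: the per-op lines are collected during the scan, and the total (which is only known at the end) is patched in afterwards.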
Event summary
Common events
| Event | Meaning |
| --- | --- |
| initiated | Set in the TrackedOp constructor; the initialization event. |
| reached_pg | The op has just been dequeued from the OSD's op_shardedwq queue. |
| started | Set in many places; in the normal OSD IO path it is set in do_op, after all exception checks pass, just before execute_ctx is called. |
| waiting for subops from | The primary OSD has sent the request to the replica OSDs. |
| commit_queued_for_journal_write | The request is about to enter the journal queue. |
| write_thread_in_journal_buffer | The journal data has been prepared in the buffer but not yet written. |
| journaled_completion_queued | The journal has been written and the completion callback has been queued. |
| on_commit | The commit acknowledgment in the multi-replica write path. |
| op_applied | The apply completion on the primary OSD in the multi-replica path. |
| sub_op_commit_rec | The primary OSD has processed a replica OSD's commit message. |
| commit_sent | Triggered when the commits from all replicas (e.g. all three) have returned, i.e. by the last replica's reply. |
| sub_op_applied_rec | The primary OSD has processed a replica OSD's apply message; in the normal read/write path, replicas do not send apply messages. |
| waiting for rw locks | The read/write path acquires the relevant locks in do_op; if a lock cannot be obtained, the request is parked in the ObjectContext and resumed when the lock is released. |
Event sequence of the IO process
Normal case:
Read requests while disk writes are slow: