CephFS stores file system metadata in the metadata pool through ceph-mds. In production, the metadata pool is generally placed on high-performance SSDs to speed up writing metadata to disk and loading it into memory.
This article describes how ceph-mds stores metadata in the metadata pool and how to inspect the related journal information with cephfs-journal-tool.
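class MDLog
class MDLog manages the entire journal module. It is the channel between in-memory data and the journal, and all metadata operations reach the journal through this class.
// Main members
int num_events; // total number of events
inodeno_t ino; // inode of the journal in the metadata pool, 0x200 + MDS rank by default
Journaler *journaler;
Thread replay_thread; // when the MDS is a hot standby, replay_thread replays metadata from the metadata pool into memory
Thread _recovery_thread;
Thread submit_thread; // submission thread: submits events and flushes them to the journal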
map<uint64_t,LogSegment*> segments;
map<uint64_t,list<PendingEvent> > pending_events;
Members of class MDLog
class Journaler (under the src/osdc/ directory) is mainly responsible for reading, writing, and flushing the journal.
With a single MDS, the journal grows in the metadata pool starting from object 200.00000000, and 200.00000000 itself holds the journal header.
class Journaler contains the class Header, which records the key positions of the entire journal.
class Header {
public:
  uint64_t trimmed_pos;  // position up to which the journal has been trimmed
  uint64_t expire_pos;   // position up to which the journal has expired
  uint64_t unused_field; // unused field
  uint64_t write_pos;    // position where the journal is currently being written
  string magic;
  file_layout_t layout; //< The mapping from byte stream offsets
                        //  to RADOS objects
  stream_format_t stream_format; //< The encoding of LogEvents
                                 //  within the journal byte stream
  Header(const char *m="") :
    trimmed_pos(0), expire_pos(0), unused_field(0), write_pos(0), magic(m),
    stream_format(-1) {
  }
  // ...
};
The actual content of the Header can be dumped with cephfs-journal-tool, for example with cephfs-journal-tool --rank=cephfs:0 header get.
Verifying write_pos:
In the test cluster, the largest object whose name starts with 200. is 200.000000da. The default object size in this cluster is 4 MB, and write_pos is currently 917417563.
917417563 / 1024 / 1024 / 4 = 218.7293918132782. The integer part is 218 = 0xda, which matches the index of the last object.
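The arithmetic above can be written as a small helper. This is only a sketch: journal_object_name is a hypothetical function, not a Ceph API, and it assumes that object names have the form "<inode in hex>.<8-digit hex index>" (as in 200.000000da) and that object N covers bytes N * object_size up to (N + 1) * object_size.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a Ceph API): map a journal byte offset to the
// name of the RADOS object that holds it, assuming object N covers bytes
// [N * object_size, (N + 1) * object_size).
std::string journal_object_name(uint64_t ino, uint64_t pos, uint64_t object_size) {
  assert(object_size > 0);
  uint64_t index = pos / object_size;  // integer division drops the fraction
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%llx.%08llx",
                static_cast<unsigned long long>(ino),
                static_cast<unsigned long long>(index));
  return std::string(buf);
}
```

Calling journal_object_name(0x200, 917417563, 4 << 20) reproduces the object name observed in the test cluster.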
Besides the Header class, Journaler itself keeps the following positions to control reads and writes:
// writer
uint64_t prezeroing_pos;
uint64_t prezero_pos; ///< we zero journal space ahead of write_pos to
// avoid problems with tail probing
uint64_t write_pos; ///< logical write position, where next entry
// will go
uint64_t flush_pos; ///< where we will flush. if
/// write_pos>flush_pos, we're buffering writes.
uint64_t safe_pos; ///< what has been committed safely to disk.
uint64_t next_safe_pos; /// start position of the first entry that isn't
/// being fully flushed. If we don't flush any
// partial entry, it's equal to flush_pos.
bufferlist write_buf; ///< write buffer. flush_pos +
/// write_buf.length() == write_pos.
// reader
uint64_t read_pos; // logical read position, where next entry starts.
uint64_t requested_pos; // what we've requested from OSD.
uint64_t received_pos; // what we've received from OSD.
// read buffer. unused_field + read_buf.length() == prefetch_pos.
bufferlist read_buf;
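To make the relationship between the writer positions concrete, here is a toy model that only tracks offsets and performs no I/O. ToyJournalWriter and its methods are hypothetical names invented for this sketch; the only things taken from the comments above are the invariant flush_pos + write_buf.length() == write_pos and the idea that safe_pos trails flush_pos.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy model of the writer-side positions. Not Ceph code.
class ToyJournalWriter {
public:
  uint64_t write_pos = 0;  // where the next entry will be appended
  uint64_t flush_pos = 0;  // everything before this has been sent out
  uint64_t safe_pos = 0;   // everything before this is acknowledged as durable
  std::string write_buf;   // bytes appended but not yet flushed

  // Append an entry to the in-memory buffer.
  void append(const std::string& entry) {
    write_buf += entry;
    write_pos += entry.size();
    check();
  }

  // Hand the buffered bytes to the (imaginary) object store.
  void flush() {
    flush_pos += write_buf.size();
    write_buf.clear();
    check();
  }

  // The store acknowledges that bytes up to pos are durable.
  void ack(uint64_t pos) {
    assert(pos <= flush_pos);
    if (pos > safe_pos) safe_pos = pos;
    check();
  }

private:
  // Invariants described in the member comments above.
  void check() const {
    assert(flush_pos + write_buf.size() == write_pos);
    assert(safe_pos <= flush_pos);
  }
};
```

While write_pos > flush_pos the writer is buffering; after flush() the two are equal, and safe_pos catches up only once the write is acknowledged.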
The main member functions are as follows
void _flush(C_OnFinisher *onsafe);
void _do_flush(unsigned amount=0);
void _finish_flush(int r, uint64_t start, ceph::real_time stamp);
Put simply, the Journaler class writes serialized in-memory data to the corresponding objects in the metadata pool and provides an interface for reading those objects. So what data is serialized?
Go back to submit_thread in the MDLog class, which calls journaler->flush() to flush the serialized data into the objects.
map<uint64_t,list<PendingEvent> >::iterator it = pending_events.begin();
if (it == pending_events.end()) {
  submit_cond.Wait(submit_mutex);
  continue;
}
As the code above shows, submit_thread iterates over pending_events. map<uint64_t, list<PendingEvent> > pending_events is the collection of submitted events, and it is filled in the _submit_entry function:
void MDLog::_submit_entry(LogEvent *le, MDSLogContextBase *c)
{
  // ... (ls is the current LogSegment; its lookup is elided in this excerpt)
  pending_events[ls->seq].push_back(PendingEvent(le, c));
}
_submit_entry is called by submit_mdlog_entry in journal_and_reply, and journal_and_reply is called at the end of the handlers for metadata requests.
So when the MDS processes a client request, the operation is packaged as a LogEvent, submitted to the pending_events collection through submit_mdlog_entry in journal_and_reply, and then flushed to the metadata pool by submit_thread. The LogEvent class is described next.
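The queue-then-flush flow described above can be sketched in a few lines. ToyPendingQueue is a hypothetical stand-in, not the MDLog implementation: the real submit_thread sleeps on a condition variable when the map is empty, whereas here a single drain() call models one wake-up of that thread, and a vector of strings stands in for the journal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for MDLog's pending_events handling.
struct ToyPendingQueue {
  // segment sequence number -> events waiting to be journaled
  std::map<uint64_t, std::list<std::string>> pending_events;
  std::vector<std::string> journal;  // stands in for the journal byte stream

  // Models _submit_entry: queue an event under its segment.
  void submit_entry(uint64_t segment_seq, const std::string& event) {
    pending_events[segment_seq].push_back(event);
  }

  // Models one pass of the submit thread: write out every queued event,
  // lowest segment first, and return how many were written.
  std::size_t drain() {
    std::size_t written = 0;
    while (!pending_events.empty()) {
      auto it = pending_events.begin();
      for (const std::string& ev : it->second) {
        journal.push_back(ev);
        ++written;
      }
      pending_events.erase(it);
    }
    return written;
  }
};
```

Because std::map is ordered by key, events of older segments always reach the journal before events of newer ones, regardless of submission order.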
class LogEvent is the base class of all journal events. The concrete member functions are implemented by subclasses that inherit from it, and EventType indicates the type of the subclass.
class LogEvent {
public:
  typedef __u32 EventType;
  // ...
};
std::string LogEvent::get_type_str() const
{
switch(_type) {
case EVENT_SUBTREEMAP: return "SUBTREEMAP";
case EVENT_SUBTREEMAP_TEST: return "SUBTREEMAP_TEST";
case EVENT_EXPORT: return "EXPORT";
case EVENT_IMPORTSTART: return "IMPORTSTART";
case EVENT_IMPORTFINISH: return "IMPORTFINISH";
case EVENT_FRAGMENT: return "FRAGMENT";
case EVENT_RESETJOURNAL: return "RESETJOURNAL";
case EVENT_SESSION: return "SESSION";
case EVENT_SESSIONS_OLD: return "SESSIONS_OLD";
case EVENT_SESSIONS: return "SESSIONS";
case EVENT_UPDATE: return "UPDATE";
case EVENT_SLAVEUPDATE: return "SLAVEUPDATE";
case EVENT_OPEN: return "OPEN";
case EVENT_COMMITTED: return "COMMITTED";
case EVENT_TABLECLIENT: return "TABLECLIENT";
case EVENT_TABLESERVER: return "TABLESERVER";
case EVENT_NOOP: return "NOOP";
default:
generic_dout(0) << "get_type_str: unknown type " << _type << dendl;
return "UNKNOWN";
}
}
Take EVENT_UPDATE as an example. This event represents a metadata update and corresponds to class EUpdate:
class EUpdate : public LogEvent {
public:
EMetaBlob metablob;
string type;
bufferlist client_map;
version_t cmapv;
metareqid_t reqid;
bool had_slaves;
EUpdate() : LogEvent(EVENT_UPDATE), cmapv(0), had_slaves(false) { }
EUpdate(MDLog *mdlog, const char *s) :
LogEvent(EVENT_UPDATE), metablob(mdlog),
type(s), cmapv(0), had_slaves(false) { }
void print(ostream& out) const override {
if (type.length())
out << "EUpdate " << type << " ";
out << metablob;
}
EMetaBlob *get_metablob() override { return &metablob; }
void encode(bufferlist& bl, uint64_t features) const override;
void decode(bufferlist::iterator& bl) override;
void dump(Formatter *f) const override;
static void generate_test_instances(list<EUpdate*>& ls);
void update_segment() override;
void replay(MDSRank *mds) override;
EMetaBlob const *get_metablob() const {return &metablob;}
};
An EVENT_UPDATE event is generated whenever a client operation updates metadata:
handle_client_openc ---------> EUpdate *le = new EUpdate(mdlog, "openc")
handle_client_setattr ---------> EUpdate *le = new EUpdate(mdlog, "setattr")
handle_client_setlayout ---------> EUpdate *le = new EUpdate(mdlog, "setlayout")
.
.
.
handle_client_mknod -------------> EUpdate *le = new EUpdate(mdlog, "mknod")
The member variable EMetaBlob metablob of class EUpdate is the part that actually gets serialized. It contains all the metadata touched by the inode operation, including the statistics of the inode's parent directory, the inode's own metadata, and the client information for the operation.
EMetaBlob is serialized through encode_with_header in the LogEvent class. EUpdate is a subclass of LogEvent, and the serialization proceeds as follows:
void encode_with_header(bufferlist& bl, uint64_t features) {
  ::encode(EVENT_NEW_ENCODING, bl);
  ENCODE_START(1, 1, bl)
  ::encode(_type, bl);
  encode(bl, features);
  ENCODE_FINISH(bl);
}
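The pattern in encode_with_header (the base class writes a type tag, then the subclass writes its own payload) can be illustrated with a miniature example. Everything here is hypothetical: ToyEvent, ToyUpdate, put_u32, and get_u32 are invented for the sketch, and the byte layout is not Ceph's wire format, since there is no version/length envelope like the one ENCODE_START and ENCODE_FINISH provide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Append a 32-bit value in little-endian byte order.
void put_u32(Buffer& bl, uint32_t v) {
  for (int i = 0; i < 4; ++i) bl.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Read a 32-bit little-endian value at off and advance off.
uint32_t get_u32(const Buffer& bl, std::size_t& off) {
  assert(off + 4 <= bl.size());
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= static_cast<uint32_t>(bl[off + i]) << (8 * i);
  off += 4;
  return v;
}

struct ToyEvent {
  uint32_t type;
  explicit ToyEvent(uint32_t t) : type(t) {}
  virtual ~ToyEvent() = default;
  virtual void encode(Buffer& bl) const = 0;  // subclass payload

  // The base class writes the tag; the subclass fills in the body.
  void encode_with_header(Buffer& bl) const {
    put_u32(bl, type);
    encode(bl);
  }
};

constexpr uint32_t TOY_EVENT_UPDATE = 1;  // illustrative value only

struct ToyUpdate : ToyEvent {
  std::string op;  // e.g. "mknod"
  explicit ToyUpdate(std::string s) : ToyEvent(TOY_EVENT_UPDATE), op(std::move(s)) {}
  void encode(Buffer& bl) const override {
    put_u32(bl, static_cast<uint32_t>(op.size()));
    bl.insert(bl.end(), op.begin(), op.end());
  }
};
```

A reader of such a buffer first decodes the tag and then knows which subclass should decode the remaining bytes, which is the reason the type is written ahead of the payload.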
EMetaBlob serialization
void EMetaBlob::encode(bufferlist& bl, uint64_t features) const
{
ENCODE_START(8, 5, bl);
::encode(lump_order, bl);
::encode(lump_map, bl, features);
::encode(roots, bl, features);
::encode(table_tids, bl);
::encode(opened_ino, bl);
::encode(allocated_ino, bl);
::encode(used_preallocated_ino, bl);
::encode(preallocated_inos, bl);
::encode(client_name, bl);
::encode(inotablev, bl);
::encode(sessionmapv, bl);
::encode(truncate_start, bl);
::encode(truncate_finish, bl);
::encode(destroyed_inodes, bl);
::encode(client_reqs, bl);
::encode(renamed_dirino, bl);
::encode(renamed_dir_frags, bl);
{
// make MDSRank use v6 format happy
int64_t i = -1;
bool b = false;
::encode(i, bl);
::encode(b, bl);
}
::encode(client_flushes, bl);
ENCODE_FINISH(bl);
}
Verification method:
touch /mnt/cephfs/yyn/testtouch
cephfs-journal-tool --rank=cephfs:0 event get --type=UPDATE binary --path /opt/test
ceph-dencoder import /opt/test/0x36aee5de_UPDATE.bin type EUpdate decode dump_json
Take mknod as an example
// prepare finisher
mdr->ls = mdlog->get_current_segment();
EUpdate *le = new EUpdate(mdlog, "mknod");
mdlog->start_entry(le);
le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
journal_allocated_inos(mdr, &le->metablob);
mdcache->predirty_journal_parents(mdr, &le->metablob, newi, dn->get_dir(),
                                  PREDIRTY_PRIMARY|PREDIRTY_DIR, 1);
le->metablob.add_primary_dentry(dn, newi, true, true, true);
journal_and_reply(mdr, newi, dn, le, new C_MDS_mknod_finish(this, mdr, dn, newi));
In mknod, the required metablob information is filled in by add_client_req, predirty_journal_parents, and add_primary_dentry, and journal_and_reply then serializes it and persists it to the metadata pool.
3. The two parts above covered how the MDS receives a request, serializes the metadata, and flushes it into the journal objects in the metadata pool. How does the data in the journal objects get written back to the metadata of each directory?
Each directory has an object in the metadata pool named after the directory's inode number. That object stores the information for all files in the directory and the statistics of the directory itself.
The metablob in the journal is only a temporary copy; it must eventually be flushed back to the object of the corresponding directory.
In ceph-mds, dirty inode information is tracked by class LogSegment.
class LogSegment {
public:
const log_segment_seq_t seq;
uint64_t offset, end;
int num_events;
// dirty items
elist<CDir*> dirty_dirfrags, new_dirfrags;
elist<CInode*> dirty_inodes;
elist<CDentry*> dirty_dentries;
elist<CInode*> open_files;
elist<CInode*> dirty_parent_inodes;
elist<CInode*> dirty_dirfrag_dir;
elist<CInode*> dirty_dirfrag_nest;
elist<CInode*> dirty_dirfrag_dirfragtree;
elist<MDSlaveUpdate*> slave_updates;
set<CInode*> truncating_inodes;
map<int, ceph::unordered_set<version_t> > pending_commit_tids; // mdstable
set<metareqid_t> uncommitted_masters;
set<dirfrag_t> uncommitted_fragments;
// client request ids
map<int, ceph_tid_t> last_client_tids;
// potentially dirty sessions
std::set<entity_name_t> touched_sessions;
// table version
version_t inotablev;
version_t sessionmapv;
map<int,version_t> tablev;
}
A LogSegment holds the set of metadata modified during its num_events events, for example:
elist<CDir*> dirty_dirfrags: the directories that were modified
elist<CInode*> dirty_inodes: the inodes that were modified
elist<CInode*> open_files: the inodes that are open
By default a single LogSegment covers the modifications made by 1024 LogEvents, which is controlled by the mds_log_events_per_segment parameter. All LogSegments are stored in map<uint64_t, LogSegment*> segments in the MDLog class.
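The rollover from one segment to the next can be modeled as follows. ToyLog and ToySegment are hypothetical types for illustration only. The sketch assumes that a new segment starts once the current one has recorded events_per_segment events and that a segment is keyed by the sequence number of its first event; the real MDLog does considerably more work when it starts a segment.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct ToySegment {
  uint64_t seq = 0;    // key of the segment
  int num_events = 0;  // events recorded in this segment so far
};

// Hypothetical model of segment rollover. Not Ceph code.
class ToyLog {
public:
  explicit ToyLog(int events_per_segment) : events_per_segment_(events_per_segment) {
    assert(events_per_segment > 0);
  }

  // Record one event, opening a new segment when the current one is full.
  void submit_event() {
    if (segments.empty() ||
        segments.rbegin()->second.num_events >= events_per_segment_) {
      ToySegment s;
      s.seq = next_event_seq_;
      segments[s.seq] = s;
    }
    segments.rbegin()->second.num_events++;
    ++next_event_seq_;
  }

  // Convenience wrapper: record n events.
  void submit_events(int n) {
    for (int i = 0; i < n; ++i) submit_event();
  }

  std::map<uint64_t, ToySegment> segments;

private:
  int events_per_segment_;
  uint64_t next_event_seq_ = 1;
};
```

With 1024 events per segment, submitting 2500 events yields three segments: two full ones and a third holding the remaining 452 events.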
Again take mknod as an example:
journal_and_reply(mdr, newi, dn, le, new C_MDS_mknod_finish(this, mdr, dn, newi));
After the metadata is written to the journal, the C_MDS_mknod_finish callback runs and pushes the modified inode onto the LogSegment's dirty_inodes and dirty_parent_inodes lists.
newi->mark_dirty(newi->inode.version + 1, mdr->ls);
newi->_mark_dirty_parent(mdr->ls, true);
void CInode::_mark_dirty(LogSegment *ls)
{
  if (!state_test(STATE_DIRTY)) {
    state_set(STATE_DIRTY);
    get(PIN_DIRTY);
    assert(ls);
  }
  // move myself to this segment's dirty list
  if (ls)
    ls->dirty_inodes.push_back(&item_dirty);
}
When trim_mdlog is invoked from mdsrank.cc, the segments in MDLog are traversed and try_expire is called on them, which flushes the dirty data into the objects of the corresponding directories through LogSegment.
map<uint64_t,LogSegment*>::iterator p = segments.begin();
while (p != segments.end() &&
       p->first < last_seq &&
       p->second->end < safe_pos) { // next segment should have been started
  LogSegment *ls = p->second;
  ++p;
  try_expire(ls, CEPH_MSG_PRIO_DEFAULT);
}
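The selection condition of the trim loop above can be isolated in a small function. This is a sketch under assumptions: ToySegmentRange and expirable_segments are hypothetical names, last_seq is taken to be the sequence number of the newest segment, and the function only reports which segments would be handed to try_expire without expiring anything.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct ToySegmentRange {
  uint64_t offset;  // journal offset where the segment starts
  uint64_t end;     // journal offset where the segment ends
};

// Hypothetical helper: return the seqs of segments that are candidates for
// expiry, i.e. older than last_seq and ending before safe_pos. Iteration
// stops at the first segment that fails the test, as in the loop above.
std::vector<uint64_t> expirable_segments(
    const std::map<uint64_t, ToySegmentRange>& segments,
    uint64_t last_seq, uint64_t safe_pos) {
  std::vector<uint64_t> out;
  for (auto p = segments.begin();
       p != segments.end() && p->first < last_seq && p->second.end < safe_pos;
       ++p) {
    out.push_back(p->first);
  }
  return out;
}
```

A segment whose end has not yet been passed by safe_pos is never selected, so only data that is already durable in the journal is considered for write-back.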
void LogSegment::try_to_expire(MDSRank *mds, MDSGatherBuilder &gather_bld, int op_prio)
{
set<CDir*> commit;
dout(6) << "LogSegment(" << seq << "/" << offset << ").try_to_expire" << dendl;
assert(g_conf->mds_kill_journal_expire_at != 1);
// commit dirs
for (elist<CDir*>::iterator p = new_dirfrags.begin(); !p.end(); ++p) {
dout(20) << " new_dirfrag " << **p << dendl;
assert((*p)->is_auth());
commit.insert(*p);
}
for (elist<CDir*>::iterator p = dirty_dirfrags.begin(); !p.end(); ++p) {
dout(20) << " dirty_dirfrag " << **p << dendl;
assert((*p)->is_auth());
commit.insert(*p);
}
for (elist<CDentry*>::iterator p = dirty_dentries.begin(); !p.end(); ++p) {
dout(20) << " dirty_dentry " << **p << dendl;
assert((*p)->is_auth());
commit.insert((*p)->get_dir());
}
for (elist<CInode*>::iterator p = dirty_inodes.begin(); !p.end(); ++p) {
dout(20) << " dirty_inode " << **p << dendl;
assert((*p)->is_auth());
if ((*p)->is_base()) {
(*p)->store(gather_bld.new_sub());
} else
commit.insert((*p)->get_parent_dn()->get_dir());
}
if (!commit.empty()) {
for (set<CDir*>::iterator p = commit.begin();
p != commit.end();
++p) {
CDir *dir = *p;
assert(dir->is_auth());
if (dir->can_auth_pin()) {
dout(15) << "try_to_expire committing " << *dir << dendl;
dir->commit(0, gather_bld.new_sub(), false, op_prio);
} else {
dout(15) << "try_to_expire waiting for unfreeze on " << *dir << dendl;
dir->add_waiter(CDir::WAIT_UNFREEZE, gather_bld.new_sub());
}
}
}
// master ops with possibly uncommitted slaves
for (set<metareqid_t>::iterator p = uncommitted_masters.begin();
p != uncommitted_masters.end();
++p) {
dout(10) << "try_to_expire waiting for slaves to ack commit on " << *p << dendl;
mds->mdcache->wait_for_uncommitted_master(*p, gather_bld.new_sub());
}
// uncommitted fragments
for (set<dirfrag_t>::iterator p = uncommitted_fragments.begin();
p != uncommitted_fragments.end();
++p) {
dout(10) << "try_to_expire waiting for uncommitted fragment " << *p << dendl;
mds->mdcache->wait_for_uncommitted_fragment(*p, gather_bld.new_sub());
}
// nudge scatterlocks
for (elist<CInode*>::iterator p = dirty_dirfrag_dir.begin(); !p.end(); ++p) {
CInode *in = *p;
dout(10) << "try_to_expire waiting for dirlock flush on " << *in << dendl;
mds->locker->scatter_nudge(&in->filelock, gather_bld.new_sub());
}
for (elist<CInode*>::iterator p = dirty_dirfrag_dirfragtree.begin(); !p.end(); ++p) {
CInode *in = *p;
dout(10) << "try_to_expire waiting for dirfragtreelock flush on " << *in << dendl;
mds->locker->scatter_nudge(&in->dirfragtreelock, gather_bld.new_sub());
}
for (elist<CInode*>::iterator p = dirty_dirfrag_nest.begin(); !p.end(); ++p) {
CInode *in = *p;
dout(10) << "try_to_expire waiting for nest flush on " << *in << dendl;
mds->locker->scatter_nudge(&in->nestlock, gather_bld.new_sub());
}
assert(g_conf->mds_kill_journal_expire_at != 2);
// open files and snap inodes
if (!open_files.empty()) {
assert(!mds->mdlog->is_capped()); // hmm FIXME
EOpen *le = 0;
LogSegment *ls = mds->mdlog->get_current_segment();
assert(ls != this);
elist<CInode*>::iterator p = open_files.begin(member_offset(CInode, item_open_file));
while (!p.end()) {
CInode *in = *p;
++p;
if (in->last == CEPH_NOSNAP && in->is_auth() &&
!in->is_ambiguous_auth() && in->is_any_caps()) {
if (in->is_any_caps_wanted()) {
dout(20) << "try_to_expire requeueing open file " << *in << dendl;
if (!le) {
le = new EOpen(mds->mdlog);
mds->mdlog->start_entry(le);
}
le->add_clean_inode(in);
ls->open_files.push_back(&in->item_open_file);
} else {
// drop inodes that aren't wanted
dout(20) << "try_to_expire not requeueing and delisting unwanted file " << *in << dendl;
in->item_open_file.remove_myself();
}
} else if (in->last != CEPH_NOSNAP && !in->client_snap_caps.empty()) {
// journal snap inodes that need flush. This simplify the mds failover hanlding
dout(20) << "try_to_expire requeueing snap needflush inode " << *in << dendl;
if (!le) {
le = new EOpen(mds->mdlog);
mds->mdlog->start_entry(le);
}
le->add_clean_inode(in);
ls->open_files.push_back(&in->item_open_file);
} else {
/*
* we can get a capless inode here if we replay an open file, the client fails to
* reconnect it, but does REPLAY an open request (that adds it to the logseg). AFAICS
* it's ok for the client to replay an open on a file it doesn't have in it's cache
* anymore.
*
* this makes the mds less sensitive to strict open_file consistency, although it does
* make it easier to miss subtle problems.
*/
dout(20) << "try_to_expire not requeueing and delisting capless file " << *in << dendl;
in->item_open_file.remove_myself();
}
}
if (le) {
mds->mdlog->submit_entry(le);
mds->mdlog->wait_for_safe(gather_bld.new_sub());
dout(10) << "try_to_expire waiting for open files to rejournal" << dendl;
}
}
assert(g_conf->mds_kill_journal_expire_at != 3);
// backtraces to be stored/updated
for (elist<CInode*>::iterator p = dirty_parent_inodes.begin(); !p.end(); ++p) {
CInode *in = *p;
assert(in->is_auth());
if (in->can_auth_pin()) {
dout(15) << "try_to_expire waiting for storing backtrace on " << *in << dendl;
in->store_backtrace(gather_bld.new_sub(), op_prio);
} else {
dout(15) << "try_to_expire waiting for unfreeze on " << *in << dendl;
in->add_waiter(CInode::WAIT_UNFREEZE, gather_bld.new_sub());
}
}
assert(g_conf->mds_kill_journal_expire_at != 4);
// slave updates
for (elist<MDSlaveUpdate*>::iterator p = slave_updates.begin(member_offset(MDSlaveUpdate,
item));
!p.end(); ++p) {
MDSlaveUpdate *su = *p;
dout(10) << "try_to_expire waiting on slave update " << su << dendl;
assert(su->waiter == 0);
su->waiter = gather_bld.new_sub();
}
// idalloc
if (inotablev > mds->inotable->get_committed_version()) {
dout(10) << "try_to_expire saving inotable table, need " << inotablev
<< ", committed is " << mds->inotable->get_committed_version()
<< " (" << mds->inotable->get_committing_version() << ")"
<< dendl;
mds->inotable->save(gather_bld.new_sub(), inotablev);
}
// sessionmap
if (sessionmapv > mds->sessionmap.get_committed()) {
dout(10) << "try_to_expire saving sessionmap, need " << sessionmapv
<< ", committed is " << mds->sessionmap.get_committed()
<< " (" << mds->sessionmap.get_committing() << ")"
<< dendl;
mds->sessionmap.save(gather_bld.new_sub(), sessionmapv);
}
// updates to sessions for completed_requests
mds->sessionmap.save_if_dirty(touched_sessions, &gather_bld);
touched_sessions.clear();
// pending commit atids
for (map<int, ceph::unordered_set<version_t> >::iterator p = pending_commit_tids.begin();
p != pending_commit_tids.end();
++p) {
MDSTableClient *client = mds->get_table_client(p->first);
assert(client);
for (ceph::unordered_set<version_t>::iterator q = p->second.begin();
q != p->second.end();
++q) {
dout(10) << "try_to_expire " << get_mdstable_name(p->first) << " transaction " << *q
<< " pending commit (not yet acked), waiting" << dendl;
assert(!client->has_committed(*q));
client->wait_for_ack(*q, gather_bld.new_sub());
}
}
// table servers
for (map<int, version_t>::iterator p = tablev.begin();
p != tablev.end();
++p) {
MDSTableServer *server = mds->get_table_server(p->first);
assert(server);
if (p->second > server->get_committed_version()) {
dout(10) << "try_to_expire waiting for " << get_mdstable_name(p->first)
<< " to save, need " << p->second << dendl;
server->save(gather_bld.new_sub());
}
}
}
In try_to_expire, dir->commit(0, gather_bld.new_sub(), false, op_prio) calls the commit method of the CDir class, which writes the updated inode information into the omap of the object for the directory's inode.
At the same time, in->store_backtrace(gather_bld.new_sub(), op_prio) stores the backtraces, mds->inotable->save(gather_bld.new_sub(), inotablev) updates the mds0_inotable object, and mds->sessionmap.save(gather_bld.new_sub(), sessionmapv) saves the session map in the metadata pool.
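The first part of try_to_expire gathers every directory that has to be written back and collects them in a set, so that each directory is committed once. A stripped-down sketch of that step follows. ToyDirtyItem and dirs_to_commit are hypothetical, directories are represented by plain strings instead of CDir pointers, and the special handling of base inodes is left out.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for CDentry/CInode: each dirty item
// only remembers the directory it lives in.
struct ToyDirtyItem {
  std::string name;
  std::string parent_dir;
};

// Gather every directory that must be committed before the segment can
// expire. A std::set de-duplicates, so a directory with many dirty entries
// is written back once.
std::set<std::string> dirs_to_commit(const std::vector<std::string>& dirty_dirfrags,
                                     const std::vector<ToyDirtyItem>& dirty_dentries,
                                     const std::vector<ToyDirtyItem>& dirty_inodes) {
  std::set<std::string> commit(dirty_dirfrags.begin(), dirty_dirfrags.end());
  for (const ToyDirtyItem& dn : dirty_dentries) commit.insert(dn.parent_dir);
  for (const ToyDirtyItem& in : dirty_inodes) commit.insert(in.parent_dir);
  return commit;
}
```

In the real function each element of the set is then either committed right away or, if the directory is frozen, queued to be retried after it unfreezes.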
Original: https://www.csdn.net/tags/NtzaggysOTI0NjMtYmxvZwO0O0OO0O0O.html