[ceph] Interpretation of ceph-mds journal module


CephFS stores file system metadata in the metadata pool through ceph-mds. In a real production environment, high-performance SSDs are generally recommended for the metadata pool to speed up persisting metadata and loading it back into memory.
This article introduces how ceph-mds stores metadata in the metadata pool, and how to view the related journal information with cephfs-journal-tool.
The MDLog class

class MDLog manages the whole journal module. It is the channel between in-memory metadata and the journal: all metadata operations are associated with the journal through this class.

//main members
int num_events; //total number of events
inodeno_t ino; //journal inode in the metadata pool, 0x200 + mds rank by default
Journaler *journaler;
Thread replay_thread; //when the mds is a hot standby, replay_thread replays metadata from the metadata pool into memory
Thread _recovery_thread;
Thread submit_thread; //submission thread: commits pending transactions and flushes them into the journal
map<uint64_t,LogSegment*> segments;  
map<uint64_t,list<PendingEvent> > pending_events; 
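To make the ino member concrete, here is a minimal standalone sketch (assuming MDS_INO_LOG_OFFSET = 0x200 as in the Ceph source, and the usual <ino-hex>.<index> RADOS object naming) of how the journal object names seen below derive from the mds rank:

#include <cstdint>
#include <cstdio>

int main() {
  // In the Ceph source, MDS_INO_LOG_OFFSET is 2 * MAX_MDS = 0x200, and
  // MDLog uses ino = MDS_INO_LOG_OFFSET + mds rank for the journal inode.
  const uint64_t MDS_INO_LOG_OFFSET = 0x200;
  for (uint64_t rank = 0; rank < 3; ++rank) {
    uint64_t ino = MDS_INO_LOG_OFFSET + rank;
    // The RADOS objects backing a striped inode are named <ino-hex>.<index>,
    // so rank 0 writes its journal to 200.00000000, 200.00000001, ...
    std::printf("rank %llu -> %llx.%08x, %llx.%08x, ...\n",
                (unsigned long long)rank,
                (unsigned long long)ino, 0u,
                (unsigned long long)ino, 1u);
  }
  return 0;
}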

Members of the classes used by MDLog

class Journaler (under src/osdc/) is mainly responsible for managing journal reads, writes, and flushes.
With a single mds, the journal keeps growing from object 200.00000000 in the metadata pool, where 200.00000000 is the journal header object.
Journaler contains the Header class, which records the key positions of the whole journal:

 class Header {
    public:
    uint64_t trimmed_pos;//position the journal has been trimmed to
    uint64_t expire_pos;//position the journal has been expired to
    uint64_t unused_field;//unused
    uint64_t write_pos; //current journal write position
    string magic;
    file_layout_t layout; //< The mapping from byte stream offsets
			     //  to RADOS objects
    stream_format_t stream_format; //< The encoding of LogEvents
				   //  within the journal byte stream

    Header(const char *m="") :
      trimmed_pos(0), expire_pos(0), unused_field(0), write_pos(0), magic(m),
      stream_format(-1) {
    }

The actual content of the Header can be dumped (for example with cephfs-journal-tool header get).
Verifying write_pos: in the test cluster, the largest object whose name starts with 200. is 200.000000da. The default object size in this cluster is 4 MB and write_pos is currently 917417563; 917417563 / 1024 / 1024 / 4 ≈ 218.73, and the last object's suffix 0xda = 218, so write_pos indeed falls in that object.
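As a quick sanity check of that arithmetic, a tiny standalone snippet (object size and write_pos taken from the cluster above):

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t object_size = 4ULL * 1024 * 1024; // 4 MiB journal objects
  const uint64_t write_pos   = 917417563;          // from the Header dump
  uint64_t idx = write_pos / object_size;          // object holding write_pos
  std::printf("write_pos %llu -> object index %llu (hex 0x%llx)\n",
              (unsigned long long)write_pos,
              (unsigned long long)idx,
              (unsigned long long)idx);
  // prints: write_pos 917417563 -> object index 218 (hex 0xda)
  return 0;
}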
Besides the Header class, the Journaler class itself also maintains read and write positions that control reading and writing, as follows:

 // writer
  uint64_t prezeroing_pos;
  uint64_t prezero_pos; ///< we zero journal space ahead of write_pos to
			//   avoid problems with tail probing
  uint64_t write_pos; ///< logical write position, where next entry
		      //   will go
  uint64_t flush_pos; ///< where we will flush. if
		      ///  write_pos>flush_pos, we're buffering writes.
  uint64_t safe_pos; ///< what has been committed safely to disk.

  uint64_t next_safe_pos; /// start position of the first entry that isn't
			  /// being fully flushed. If we don't flush any
			  // partial entry, it's equal to flush_pos.

  bufferlist write_buf; ///< write buffer.  flush_pos +
			///  write_buf.length() == write_pos.
 // reader
  uint64_t read_pos;      // logical read position, where next entry starts.
  uint64_t requested_pos; // what we've requested from OSD.
  uint64_t received_pos;  // what we've received from OSD.
  // read buffer.  unused_field + read_buf.length() == prefetch_pos.
  bufferlist read_buf;
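The comments above encode invariants between these positions (for example flush_pos + write_buf.length() == write_pos). A minimal sketch, an assumed simplification rather than the real Journaler, makes the write-side relationships concrete:

#include <cassert>
#include <cstdint>
#include <string>

struct MiniJournaler {
  uint64_t write_pos = 0;  // logical end of everything appended
  uint64_t flush_pos = 0;  // everything before this has been sent to RADOS
  uint64_t safe_pos  = 0;  // everything before this is acked as durable
  std::string write_buf;   // stands in for the bufferlist

  void append_entry(const std::string& payload) {
    write_buf += payload;  // buffered only; nothing hits the OSDs yet
    write_pos += payload.size();
    assert(flush_pos + write_buf.length() == write_pos); // documented invariant
  }
  void flush() {           // pretend the OSD write has been issued...
    flush_pos += write_buf.size();
    write_buf.clear();
  }
  void on_write_ack() {    // ...and later acknowledged as safe
    safe_pos = flush_pos;
  }
};

int main() {
  MiniJournaler j;
  j.append_entry("event-1");
  j.append_entry("event-2");
  j.flush();
  j.on_write_ack();
  assert(j.safe_pos == j.write_pos); // fully durable once acked
  return 0;
}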

The main member functions are as follows

void _flush(C_OnFinisher *onsafe);
void _do_flush(unsigned amount=0);
void _finish_flush(int r, uint64_t start, ceph::real_time stamp);

To put it simply, the Journaler class is responsible for writing serialized in-memory data into the corresponding objects in the metadata pool, and provides an interface to read those objects back. So what data is being serialized?
Go back to the submit_thread thread in the MDLog class, which calls journaler->flush() to flush the serialized data into the objects:

 // in MDLog::_submit_thread(): wait until there is a pending event
 map<uint64_t,list<PendingEvent> >::iterator it = pending_events.begin();
    if (it == pending_events.end()) {
      submit_cond.Wait(submit_mutex);
      continue;
    }

As the code above shows, submit_thread traverses pending_events. map<uint64_t, list<PendingEvent> > pending_events is the collection of pending submission events; it is filled in by the _submit_entry function (shown here simplified):

// simplified; ls is the current LogSegment
void MDLog::_submit_entry(LogEvent *le, MDSLogContextBase *c)
{
    pending_events[ls->seq].push_back(PendingEvent(le, c));
}
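Putting the two pieces together: request handlers act as producers and submit_thread as the consumer. A self-contained sketch of that pattern (an assumed simplification; the real thread batches per segment and ends with journaler->flush()):

#include <condition_variable>
#include <cstdio>
#include <list>
#include <map>
#include <mutex>
#include <thread>

std::mutex submit_mutex;
std::condition_variable submit_cond;
std::map<uint64_t, std::list<int>> pending_events; // seq -> events (int stands in for LogEvent*)
bool stopping = false;

void _submit_entry(uint64_t seq, int event) {        // producer side
  std::lock_guard<std::mutex> l(submit_mutex);
  pending_events[seq].push_back(event);
  submit_cond.notify_one();
}

void submit_thread() {                               // consumer side
  std::unique_lock<std::mutex> l(submit_mutex);
  while (!stopping) {
    if (pending_events.empty()) { submit_cond.wait(l); continue; }
    auto it = pending_events.begin();
    std::printf("flushing %zu event(s) of segment %llu\n",
                it->second.size(), (unsigned long long)it->first);
    pending_events.erase(it);                        // journaler->flush() here
  }
}

int main() {
  std::thread t(submit_thread);
  _submit_entry(1, 42);
  _submit_entry(1, 43);
  { std::lock_guard<std::mutex> l(submit_mutex); stopping = true; }
  submit_cond.notify_one();
  t.join();
  return 0;
}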

The _submit_entry function is called by submit_mdlog_entry inside journal_and_reply, and journal_and_reply is called once a metadata request has finished processing.
So when the mds handles a client request, the request is broken into LogEvent events, submitted to the pending_events collection through submit_mdlog_entry in journal_and_reply, and then flushed to the metadata pool by the submit_thread thread. The LogEvent class is described below.
class LogEvent is the base class of all operation-log events; the concrete member functions are implemented by subclasses that inherit from it, and EventType indicates the type of the subclass:

class LogEvent {
public:
 typedef __u32 EventType;
}

std::string LogEvent::get_type_str() const
{
  switch(_type) {
  case EVENT_SUBTREEMAP: return "SUBTREEMAP";
  case EVENT_SUBTREEMAP_TEST: return "SUBTREEMAP_TEST";
  case EVENT_EXPORT: return "EXPORT";
  case EVENT_IMPORTSTART: return "IMPORTSTART";
  case EVENT_IMPORTFINISH: return "IMPORTFINISH";
  case EVENT_FRAGMENT: return "FRAGMENT";
  case EVENT_RESETJOURNAL: return "RESETJOURNAL";
  case EVENT_SESSION: return "SESSION";
  case EVENT_SESSIONS_OLD: return "SESSIONS_OLD";
  case EVENT_SESSIONS: return "SESSIONS";
  case EVENT_UPDATE: return "UPDATE";
  case EVENT_SLAVEUPDATE: return "SLAVEUPDATE";
  case EVENT_OPEN: return "OPEN";
  case EVENT_COMMITTED: return "COMMITTED";
  case EVENT_TABLECLIENT: return "TABLECLIENT";
  case EVENT_TABLESERVER: return "TABLESERVER";
  case EVENT_NOOP: return "NOOP";

  default:
    generic_dout(0) << "get_type_str: unknown type " << _type << dendl;
    return "UNKNOWN";
  }
}

Let's take EVENT_UPDATE as an example. This event represents a metadata update and corresponds to class EUpdate:

class EUpdate : public LogEvent {
public:
  EMetaBlob metablob;
  string type;
  bufferlist client_map;
  version_t cmapv;
  metareqid_t reqid;
  bool had_slaves;

  EUpdate() : LogEvent(EVENT_UPDATE), cmapv(0), had_slaves(false) { }
  EUpdate(MDLog *mdlog, const char *s) : 
    LogEvent(EVENT_UPDATE), metablob(mdlog),
    type(s), cmapv(0), had_slaves(false) { }
  
  void print(ostream& out) const override {
    if (type.length())
      out << "EUpdate " << type << " ";
    out << metablob;
  }

  EMetaBlob *get_metablob() override { return &metablob; }

  void encode(bufferlist& bl, uint64_t features) const override;
  void decode(bufferlist::iterator& bl) override;
  void dump(Formatter *f) const override;
  static void generate_test_instances(list<EUpdate*>& ls);

  void update_segment() override;
  void replay(MDSRank *mds) override;
  EMetaBlob const *get_metablob() const {return &metablob;}
};

An EVENT_UPDATE event is generated whenever a client operation updates metadata, for example:

handle_client_openc   ---------> EUpdate *le = new EUpdate(mdlog, "openc")
handle_client_setattr ---------> EUpdate *le = new EUpdate(mdlog, "setattr")
handle_client_setlayout ---------> EUpdate *le = new EUpdate(mdlog, "setlayout")
.
.
.
handle_client_mknod   -------------> EUpdate *le = new EUpdate(mdlog, "mknod")

The EUpdate member variable EMetaBlob metablob is the concrete class that gets serialized. It contains all the metadata of the inode operation: the statistics of the inode's parent directory, the inode's own metadata, and the information of the client performing the operation.
Serialization happens through encode_with_header in the LogEvent class (EUpdate is a subclass of LogEvent); the specific method is as follows:

void encode_with_header(bufferlist& bl, uint64_t features) {
    ::encode(EVENT_NEW_ENCODING, bl);
    ENCODE_START(1, 1, bl)
    ::encode(_type, bl);
    encode(bl, features);
    ENCODE_FINISH(bl);
  }
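Conceptually, the type tag written ahead of the payload is what lets replay dispatch to the right subclass decoder. A simplified, self-contained illustration of that framing idea (real Ceph uses bufferlist plus the ENCODE_START/ENCODE_FINISH versioning macros, and the EVENT_UPDATE value below is only a stand-in):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

using EventType = uint32_t;
const EventType EVENT_UPDATE = 20; // stand-in value, for illustration only

// Write the type tag first, then the subclass payload -- the same framing
// idea as encode_with_header above, minus versioning and bufferlist.
std::string encode_with_type(EventType t, const std::string& payload) {
  std::string out(sizeof(t), '\0');
  std::memcpy(&out[0], &t, sizeof(t));
  return out + payload;
}

int main() {
  std::string bl = encode_with_type(EVENT_UPDATE, "serialized EMetaBlob bytes");
  EventType t;
  std::memcpy(&t, bl.data(), sizeof(t)); // replay reads the tag first...
  assert(t == EVENT_UPDATE);             // ...then decodes the matching
  return 0;                              // subclass (here, EUpdate)
}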

EMetaBlob's own serialization:

void EMetaBlob::encode(bufferlist& bl, uint64_t features) const
{
  ENCODE_START(8, 5, bl);
  ::encode(lump_order, bl);
  ::encode(lump_map, bl, features);
  ::encode(roots, bl, features);
  ::encode(table_tids, bl);
  ::encode(opened_ino, bl);
  ::encode(allocated_ino, bl);
  ::encode(used_preallocated_ino, bl);
  ::encode(preallocated_inos, bl);
  ::encode(client_name, bl);
  ::encode(inotablev, bl);
  ::encode(sessionmapv, bl);
  ::encode(truncate_start, bl);
  ::encode(truncate_finish, bl);
  ::encode(destroyed_inodes, bl);
  ::encode(client_reqs, bl);
  ::encode(renamed_dirino, bl);
  ::encode(renamed_dir_frags, bl);
  {
    // make MDSRank use v6 format happy
    int64_t i = -1;
    bool b = false;
    ::encode(i, bl);
    ::encode(b, bl);
  }
  ::encode(client_flushes, bl);
  ENCODE_FINISH(bl);
}

Verification method:

touch /mnt/cephfs/yyn/testtouch
cephfs-journal-tool --rank=cephfs:0 event get --type=UPDATE binary --path /opt/test
ceph-dencoder import /opt/test/0x36aee5de_UPDATE.bin type EUpdate decode dump_json

Take mknod as an example

// prepare finisher
  mdr->ls = mdlog->get_current_segment();
  EUpdate *le = new EUpdate(mdlog, "mknod");
  mdlog->start_entry(le);
  le->metablob.add_client_req(req->get_reqid(), req->get_oldest_client_tid());
  journal_allocated_inos(mdr, &le->metablob);
  
  mdcache->predirty_journal_parents(mdr, &le->metablob, newi, dn->get_dir(),
				    PREDIRTY_PRIMARY|PREDIRTY_DIR, 1);
  le->metablob.add_primary_dentry(dn, newi, true, true, true);
  journal_and_reply(mdr, newi, dn, le, new C_MDS_mknod_finish(this, mdr, dn, newi));

In mknod, the required metablob information is filled in by add_client_req, predirty_journal_parents, and add_primary_dentry, and then serialized and persisted into the metadata pool by journal_and_reply.

3. The two points above analyzed how mds receives a request, serializes the metadata, and flushes it into the journal objects in the metadata pool. How does the data in the journal objects then get written back to the metadata of each directory?
Each directory in the metadata pool has an object named after the directory's inode number. That object stores all the file information in the directory and the statistics of the directory itself.
The metablob in the journal is only saved temporarily; it must eventually be flushed back to the object of the corresponding directory.
In ceph-mds, the dirty inode information is tracked through the LogSegment class.

class LogSegment {
 public:
  const log_segment_seq_t seq;
  uint64_t offset, end;
  int num_events;

  // dirty items
  elist<CDir*>    dirty_dirfrags, new_dirfrags;
  elist<CInode*>  dirty_inodes;
  elist<CDentry*> dirty_dentries;

  elist<CInode*>  open_files;
  elist<CInode*>  dirty_parent_inodes;
  elist<CInode*>  dirty_dirfrag_dir;
  elist<CInode*>  dirty_dirfrag_nest;
  elist<CInode*>  dirty_dirfrag_dirfragtree;

  elist<MDSlaveUpdate*> slave_updates;
  
  set<CInode*> truncating_inodes;

  map<int, ceph::unordered_set<version_t> > pending_commit_tids;  // mdstable
  set<metareqid_t> uncommitted_masters;
  set<dirfrag_t> uncommitted_fragments;

  // client request ids
  map<int, ceph_tid_t> last_client_tids;

  // potentially dirty sessions
  std::set<entity_name_t> touched_sessions;

  // table version
  version_t inotablev;
  version_t sessionmapv;
  map<int,version_t> tablev;
};

The LogSegment class saves the collections of metadata that were modified during its num_events events, for example:
elist<CDir*> dirty_dirfrags — which directory fragments were modified
elist<CInode*> dirty_inodes — which inodes were modified
elist<CInode*> open_files — which inodes are open

By default, a single LogSegment records the inode modifications of 1024 LogEvents, controlled by the mds_log_events_per_segment parameter. All LogSegments are stored in map<uint64_t, LogSegment*> segments in the MDLog class.
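A rough sketch of that grouping (an assumed simplification; the real rollover logic in MDLog also considers subtree maps and other conditions):

#include <cstdint>
#include <cstdio>
#include <map>

struct Segment { uint64_t first_seq; int num_events = 0; };

int main() {
  const int events_per_segment = 1024;   // mds_log_events_per_segment default
  std::map<uint64_t, Segment*> segments; // mirrors MDLog::segments
  Segment* cur = nullptr;
  for (uint64_t seq = 1; seq <= 3000; ++seq) {
    if (!cur || cur->num_events >= events_per_segment) {
      cur = new Segment{seq};            // start a new segment at this event
      segments[seq] = cur;
    }
    cur->num_events++;                   // the event lands in the current segment
  }
  std::printf("3000 events -> %zu segments\n", segments.size()); // prints 3
  for (auto& kv : segments) delete kv.second;
  return 0;
}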

Or take mknod as an example

journal_and_reply(mdr, newi, dn, le, new C_MDS_mknod_finish(this, mdr, dn, newi));

When the metadata has been written to the journal, C_MDS_mknod_finish is called back; it pushes the modified inode information onto the LogSegment's dirty_inodes and dirty_parent_inodes lists:

newi->mark_dirty(newi->inode.version + 1, mdr->ls);
newi->_mark_dirty_parent(mdr->ls, true);

void CInode::_mark_dirty(LogSegment *ls)
{
  if (!state_test(STATE_DIRTY)) {
    state_set(STATE_DIRTY);
    get(PIN_DIRTY);
    assert(ls);
  }
  
  // move myself to this segment's dirty list
  if (ls) 
    ls->dirty_inodes.push_back(&item_dirty);
}
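Note that dirty_inodes.push_back(&item_dirty) links in an embedded list item rather than the inode itself: elist is an intrusive list. A small sketch of that idea (a hand-rolled simplification, not Ceph's actual elist):

#include <cassert>

// Each object embeds an Item; pushing the item onto a list costs no
// allocation, and the object can delist itself in O(1).
struct Item {
  Item* prev = nullptr;
  Item* next = nullptr;
  void remove_myself() {
    if (prev) prev->next = next;
    if (next) next->prev = prev;
    prev = next = nullptr;
  }
};

struct List {
  Item head; // sentinel node
  List() { head.prev = head.next = &head; }
  void push_back(Item* i) {
    i->prev = head.prev; i->next = &head;
    head.prev->next = i; head.prev = i;
  }
  bool empty() const { return head.next == &head; }
};

struct MiniInode {
  Item item_dirty; // embedded, like CInode::item_dirty
};

int main() {
  List dirty_inodes;                      // like LogSegment::dirty_inodes
  MiniInode in;
  dirty_inodes.push_back(&in.item_dirty); // what _mark_dirty does
  assert(!dirty_inodes.empty());
  in.item_dirty.remove_myself();          // the inode can delist itself
  assert(dirty_inodes.empty());
  return 0;
}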

trim_mdlog in mdsrank.cc triggers trimming: it traverses all segments in MDLog and calls try_expire on each, and through LogSegment the dirty data is flushed into the objects of the corresponding directories.

map<uint64_t,LogSegment*>::iterator p = segments.begin();
while (p != segments.end() &&
       p->first < last_seq &&
       p->second->end < safe_pos) { // next segment should have been started
  LogSegment *ls = p->second;
  ++p;
  try_expire(ls, CEPH_MSG_PRIO_DEFAULT);
}

void LogSegment::try_to_expire(MDSRank *mds, MDSGatherBuilder &gather_bld, int op_prio)
{
  set<CDir*> commit;

  dout(6) << "LogSegment(" << seq << "/" << offset << ").try_to_expire" << dendl;

  assert(g_conf->mds_kill_journal_expire_at != 1);

  // commit dirs
  for (elist<CDir*>::iterator p = new_dirfrags.begin(); !p.end(); ++p) {
    dout(20) << " new_dirfrag " << **p << dendl;
    assert((*p)->is_auth());
    commit.insert(*p);
  }
  for (elist<CDir*>::iterator p = dirty_dirfrags.begin(); !p.end(); ++p) {
    dout(20) << " dirty_dirfrag " << **p << dendl;
    assert((*p)->is_auth());
    commit.insert(*p);
  }
  for (elist<CDentry*>::iterator p = dirty_dentries.begin(); !p.end(); ++p) {
    dout(20) << " dirty_dentry " << **p << dendl;
    assert((*p)->is_auth());
    commit.insert((*p)->get_dir());
  }
  for (elist<CInode*>::iterator p = dirty_inodes.begin(); !p.end(); ++p) {
    dout(20) << " dirty_inode " << **p << dendl;
    assert((*p)->is_auth());
    if ((*p)->is_base()) {
      (*p)->store(gather_bld.new_sub());
    } else
      commit.insert((*p)->get_parent_dn()->get_dir());
  }

  if (!commit.empty()) {
    for (set<CDir*>::iterator p = commit.begin();
	 p != commit.end();
	 ++p) {
      CDir *dir = *p;
      assert(dir->is_auth());
      if (dir->can_auth_pin()) {
	dout(15) << "try_to_expire committing " << *dir << dendl;
	dir->commit(0, gather_bld.new_sub(), false, op_prio);
      } else {
	dout(15) << "try_to_expire waiting for unfreeze on " << *dir << dendl;
	dir->add_waiter(CDir::WAIT_UNFREEZE, gather_bld.new_sub());
      }
    }
  }

  // master ops with possibly uncommitted slaves
  for (set<metareqid_t>::iterator p = uncommitted_masters.begin();
       p != uncommitted_masters.end();
       ++p) {
    dout(10) << "try_to_expire waiting for slaves to ack commit on " << *p << dendl;
    mds->mdcache->wait_for_uncommitted_master(*p, gather_bld.new_sub());
  }

  // uncommitted fragments
  for (set<dirfrag_t>::iterator p = uncommitted_fragments.begin();
       p != uncommitted_fragments.end();
       ++p) {
    dout(10) << "try_to_expire waiting for uncommitted fragment " << *p << dendl;
    mds->mdcache->wait_for_uncommitted_fragment(*p, gather_bld.new_sub());
  }

  // nudge scatterlocks
  for (elist<CInode*>::iterator p = dirty_dirfrag_dir.begin(); !p.end(); ++p) {
    CInode *in = *p;
    dout(10) << "try_to_expire waiting for dirlock flush on " << *in << dendl;
    mds->locker->scatter_nudge(&in->filelock, gather_bld.new_sub());
  }
  for (elist<CInode*>::iterator p = dirty_dirfrag_dirfragtree.begin(); !p.end(); ++p) {
    CInode *in = *p;
    dout(10) << "try_to_expire waiting for dirfragtreelock flush on " << *in << dendl;
    mds->locker->scatter_nudge(&in->dirfragtreelock, gather_bld.new_sub());
  }
  for (elist<CInode*>::iterator p = dirty_dirfrag_nest.begin(); !p.end(); ++p) {
    CInode *in = *p;
    dout(10) << "try_to_expire waiting for nest flush on " << *in << dendl;
    mds->locker->scatter_nudge(&in->nestlock, gather_bld.new_sub());
  }

  assert(g_conf->mds_kill_journal_expire_at != 2);

  // open files and snap inodes 
  if (!open_files.empty()) {
    assert(!mds->mdlog->is_capped()); // hmm FIXME
    EOpen *le = 0;
    LogSegment *ls = mds->mdlog->get_current_segment();
    assert(ls != this);
    elist<CInode*>::iterator p = open_files.begin(member_offset(CInode, item_open_file));
    while (!p.end()) {
      CInode *in = *p;
      ++p;
      if (in->last == CEPH_NOSNAP && in->is_auth() &&
	  !in->is_ambiguous_auth() && in->is_any_caps()) {
	if (in->is_any_caps_wanted()) {
	  dout(20) << "try_to_expire requeueing open file " << *in << dendl;
	  if (!le) {
	    le = new EOpen(mds->mdlog);
	    mds->mdlog->start_entry(le);
	  }
	  le->add_clean_inode(in);
	  ls->open_files.push_back(&in->item_open_file);
	} else {
	  // drop inodes that aren't wanted
	  dout(20) << "try_to_expire not requeueing and delisting unwanted file " << *in << dendl;
	  in->item_open_file.remove_myself();
	}
      } else if (in->last != CEPH_NOSNAP && !in->client_snap_caps.empty()) {
	// journal snap inodes that need flush. This simplifies the mds failover handling
	dout(20) << "try_to_expire requeueing snap needflush inode " << *in << dendl;
	if (!le) {
	  le = new EOpen(mds->mdlog);
	  mds->mdlog->start_entry(le);
	}
	le->add_clean_inode(in);
	ls->open_files.push_back(&in->item_open_file);
      } else {
	/*
	 * we can get a capless inode here if we replay an open file, the client fails to
	 * reconnect it, but does REPLAY an open request (that adds it to the logseg).  AFAICS
	 * it's ok for the client to replay an open on a file it doesn't have in its cache
	 * anymore.
	 *
	 * this makes the mds less sensitive to strict open_file consistency, although it does
	 * make it easier to miss subtle problems.
	 */
	dout(20) << "try_to_expire not requeueing and delisting capless file " << *in << dendl;
	in->item_open_file.remove_myself();
      }
    }
    if (le) {
      mds->mdlog->submit_entry(le);
      mds->mdlog->wait_for_safe(gather_bld.new_sub());
      dout(10) << "try_to_expire waiting for open files to rejournal" << dendl;
    }
  }

  assert(g_conf->mds_kill_journal_expire_at != 3);

  // backtraces to be stored/updated
  for (elist<CInode*>::iterator p = dirty_parent_inodes.begin(); !p.end(); ++p) {
    CInode *in = *p;
    assert(in->is_auth());
    if (in->can_auth_pin()) {
      dout(15) << "try_to_expire waiting for storing backtrace on " << *in << dendl;
      in->store_backtrace(gather_bld.new_sub(), op_prio);
    } else {
      dout(15) << "try_to_expire waiting for unfreeze on " << *in << dendl;
      in->add_waiter(CInode::WAIT_UNFREEZE, gather_bld.new_sub());
    }
  }

  assert(g_conf->mds_kill_journal_expire_at != 4);

  // slave updates
  for (elist<MDSlaveUpdate*>::iterator p = slave_updates.begin(member_offset(MDSlaveUpdate,
									     item));
       !p.end(); ++p) {
    MDSlaveUpdate *su = *p;
    dout(10) << "try_to_expire waiting on slave update " << su << dendl;
    assert(su->waiter == 0);
    su->waiter = gather_bld.new_sub();
  }

  // idalloc
  if (inotablev > mds->inotable->get_committed_version()) {
    dout(10) << "try_to_expire saving inotable table, need " << inotablev
	      << ", committed is " << mds->inotable->get_committed_version()
	      << " (" << mds->inotable->get_committing_version() << ")"
	      << dendl;
    mds->inotable->save(gather_bld.new_sub(), inotablev);
  }

  // sessionmap
  if (sessionmapv > mds->sessionmap.get_committed()) {
    dout(10) << "try_to_expire saving sessionmap, need " << sessionmapv 
	      << ", committed is " << mds->sessionmap.get_committed()
	      << " (" << mds->sessionmap.get_committing() << ")"
	      << dendl;
    mds->sessionmap.save(gather_bld.new_sub(), sessionmapv);
  }

  // updates to sessions for completed_requests
  mds->sessionmap.save_if_dirty(touched_sessions, &gather_bld);
  touched_sessions.clear();

  // pending commit atids
  for (map<int, ceph::unordered_set<version_t> >::iterator p = pending_commit_tids.begin();
       p != pending_commit_tids.end();
       ++p) {
    MDSTableClient *client = mds->get_table_client(p->first);
    assert(client);
    for (ceph::unordered_set<version_t>::iterator q = p->second.begin();
	 q != p->second.end();
	 ++q) {
      dout(10) << "try_to_expire " << get_mdstable_name(p->first) << " transaction " << *q 
	       << " pending commit (not yet acked), waiting" << dendl;
      assert(!client->has_committed(*q));
      client->wait_for_ack(*q, gather_bld.new_sub());
    }
  }
  
  // table servers
  for (map<int, version_t>::iterator p = tablev.begin();
       p != tablev.end();
       ++p) {
    MDSTableServer *server = mds->get_table_server(p->first);
    assert(server);
    if (p->second > server->get_committed_version()) {
      dout(10) << "try_to_expire waiting for " << get_mdstable_name(p->first) 
	       << " to save, need " << p->second << dendl;
      server->save(gather_bld.new_sub());
    }
  }

}

In the try_to_expire function, the modified directory information is committed through dir->commit(0, gather_bld.new_sub(), false, op_prio), which calls the commit method of the CDir class and writes the updates into the omap of the object belonging to the directory's inode.
At the same time, in->store_backtrace(gather_bld.new_sub(), op_prio) stores the updated backtraces, mds->inotable->save(gather_bld.new_sub(), inotablev) saves the inode table into the corresponding mds0_inotable object, and mds->sessionmap.save(gather_bld.new_sub(), sessionmapv) saves the session map into the metadata pool.
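To visualize what CDir::commit produces, here is a toy model of a directory object's omap (an assumption-level simplification: the directory fragment object keeps one omap entry per dentry, with keys of the form <name>_head for head dentries and the encoded dentry/inode as the value):

#include <cstdio>
#include <map>
#include <string>

int main() {
  // omap of directory object <dir-ino-hex>.<frag>: dentry key -> encoded value
  std::map<std::string, std::string> omap;
  // committing two dirty dentries would upsert keys like these:
  omap["testtouch_head"] = "<encoded primary dentry + inode>";
  omap["subdir_head"]    = "<encoded dentry>";
  for (const auto& kv : omap)
    std::printf("omap[%s] = %s\n", kv.first.c_str(), kv.second.c_str());
  return 0;
}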

Original: https://www.csdn.net/tags/NtzaggysOTI0NjMtYmxvZwO0O0OO0O0O.html 

Origin: blog.csdn.net/bandaoyu/article/details/124213938