[ceph] CEPH single-active MDS master-slave switching process | REPLAY

The code in this article is based on the Ceph Nautilus release.

MDS key concepts

To understand the MDS switching process, you first need to clarify some basic concepts.

MDSMAP

  • Contains the state of every mds in the whole Ceph cluster: the number of filesystems, their names, the state of each mds, the data pool(s), the metadata pool, and so on.
  • Contains the current MDS map epoch, i.e. when the map was created, and when it was last changed. It also contains the pool used to store metadata, a list of metadata servers, and which metadata servers are up and in. To view the MDS map, execute ceph fs dump.

RANK

  • A rank defines how the metadata load is divided among multiple mds. Each mds holds at most one rank, each rank corresponds to a directory subtree, and rank ids start from 0.
  • Ranks define how the metadata workload is shared among multiple Metadata Server (MDS) daemons. The number of ranks is the maximum number of MDS daemons that can be active at the same time. Each MDS daemon handles a subset of the Ceph File System metadata assigned to that rank.
  • Each MDS daemon starts without a rank; the monitor assigns one to it. An MDS daemon can hold only one rank at a time, and a daemon only loses its rank when it is stopped.

Note one special case here: an mds in the standby-replay state also holds a rank, and its rank id is the same as that of the active mds it follows, as shown in the following figure.

MDS JOURNAL

  • The CephFS journal records metadata events
  • The journal is stored as events in the RADOS metadata pool
  • For every metadata operation, the journal is written first and the actual operation is performed afterwards (a write-ahead sketch is shown below)
  • Each active mds maintains its own journal
  • The journal is split across multiple objects
  • The mds trims journal entries that are no longer needed
  • To list journal events: cephfs-journal-tool --rank=<fs>:<rank> event get list

(cephfs-journal-tool — Ceph Documentation; Chinese translation: https://www.bookstack.cn/read/ceph-10-zh/8a5dc6d87f084b2a.md)
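To make the write-ahead idea concrete, here is a minimal, self-contained toy in C++. It is not Ceph code and all names in it are invented for illustration: every metadata change is appended to a journal before the in-memory state is touched, so a replacement MDS can rebuild that state purely by replaying the journal.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy metadata event; the real MDS encodes EUpdate/ESession/... events into
// journal objects in the metadata pool.
struct Event {
  std::string op;    // e.g. "mkdir", "unlink"
  std::string path;
};

struct ToyMDS {
  std::vector<Event> journal;              // stands in for the rank's journal in RADOS
  std::map<std::string, bool> namespace_;  // stands in for the in-memory cache (CInode/CDentry)

  // Write-ahead rule: append the event first, only then mutate the cache.
  void apply(const Event& e) {
    journal.push_back(e);                     // 1. journal the event
    namespace_[e.path] = (e.op != "unlink");  // 2. perform the actual metadata change
  }

  // What a failed-over MDS does in up:replay: rebuild the cache from the journal.
  static ToyMDS replay(const std::vector<Event>& journal) {
    ToyMDS fresh;
    for (const auto& e : journal)
      fresh.apply(e);
    return fresh;
  }
};

int main() {
  ToyMDS active;
  active.apply({"mkdir", "/a"});
  active.apply({"mkdir", "/a/b"});
  active.apply({"unlink", "/a/b"});

  // A standby taking over only needs the journal to reach the same state.
  ToyMDS takeover = ToyMDS::replay(active.journal);
  std::cout << "events replayed: " << takeover.journal.size() << "\n";
}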

(The original article shows an example journal event dump here, with much of its content folded.)

CAPS

Caps (capabilities) are the distributed locking mechanism implemented by CephFS; see https://docs.ceph.com/docs/master/cephfs/client-auth/

MDS state machine

To understand mds switching, you first need to be clear about which states an mds can be in and which state transitions are allowed. The mds state machine is described in detail in the official documentation: https://docs.ceph.com/docs/master/cephfs/mds-states/
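The states that matter for a single-active failover, and the order in which a promoted standby walks through them, can be summarized in a small sketch. This is my own simplification for illustration, not the real MDSMap::state_transition_valid table:

#include <array>
#include <iostream>
#include <string_view>

// Simplified state list for a single-active failover. The full set (creating,
// starting, resolve, stopping, ...) is documented on the mds-states page linked above.
enum class State { Standby, StandbyReplay, Replay, Reconnect, Rejoin, ClientReplay, Active };

constexpr std::array<std::string_view, 7> kNames{
    "up:standby", "up:standby-replay", "up:replay", "up:reconnect",
    "up:rejoin",  "up:clientreplay",   "up:active"};

// Forward path of a promoted standby; up:clientreplay is skipped when no client
// had requests in flight at the moment of the failover.
State next(State s, bool need_clientreplay) {
  switch (s) {
  case State::Standby:
  case State::StandbyReplay: return State::Replay;
  case State::Replay:        return State::Reconnect;
  case State::Reconnect:     return State::Rejoin;
  case State::Rejoin:        return need_clientreplay ? State::ClientReplay : State::Active;
  default:                   return State::Active;
  }
}

int main() {
  State s = State::StandbyReplay;
  while (s != State::Active) {
    std::cout << kNames[static_cast<int>(s)] << " -> ";
    s = next(s, /*need_clientreplay=*/true);
  }
  std::cout << kNames[static_cast<int>(s)] << "\n";
}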

MDS class diagram

Refer to the following figure: Note that only part of the mds architecture and core classes are listed in the figure. The actual composition is more complex and involves more classes and logic.

Cold Standby OR Hot Standby

  • Cold standby: in the default configuration, every mds other than the active one is in the standby state. Apart from keeping its heartbeat with the mon it does nothing: its cache is empty and it holds no rank.

  • Hot standby: with allow_standby_replay set to true (in Nautilus: ceph fs set <fs_name> allow_standby_replay true), each active mds gets a dedicated standby-replay mds that follows it, holds the same rank id, and continuously reads the journal from RADOS and loads it into its cache, staying as closely in sync with the active mds as possible.

Obviously, with hot standby there is a standby-replay mds whose cache is kept up to date all the time, so the switchover is faster when it happens.

Switching Process Analysis

Before analyzing the switching process, first be clear about the two core ideas behind MDS switching:

  1. The election of the new active mds is decided by the mon, and the whole switching process involves multiple rounds of interaction between the mon and the mds daemons
  2. The entire process is driven by the mdsmap. The mdsmap records the planned state of every mds in the cluster; after an mds finishes the work for its current stage, it actively asks the mon for the next stage (a conceptual sketch of this handshake follows below)
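A conceptual sketch of this map-driven handshake. All names below are invented for the sketch; the real interaction goes through MDSDaemon, the Beacon and the mon's MDSMonitor:

#include <iostream>
#include <map>
#include <queue>
#include <string>

// Conceptual model only: the mon publishes mdsmap epochs that assign a state to the
// rank; the mds performs the work for that stage, then asks for the next state through
// its beacon and waits for the next map.
struct MdsMapMsg { int epoch; std::string assigned_state; };

struct ToyMon {
  int epoch = 1;
  std::queue<MdsMapMsg> outbox;
  // The mon decides the promotion; in this toy it simply grants whatever is asked for.
  void beacon_want_state(const std::string& want) { outbox.push({++epoch, want}); }
};

struct ToyMds {
  std::string state = "up:standby";
  void handle_mds_map(const MdsMapMsg& m, ToyMon& mon) {
    state = m.assigned_state;
    std::cout << "epoch " << m.epoch << ": entering " << state << "\n";
    // ... replay_start()/reconnect_start()/rejoin_start()/active_start() run here ...
    static const std::map<std::string, std::string> next{
        {"up:replay", "up:reconnect"},
        {"up:reconnect", "up:rejoin"},
        {"up:rejoin", "up:active"}};
    auto it = next.find(state);
    if (it != next.end())
      mon.beacon_want_state(it->second);  // done with this stage, ask for the next one
  }
};

int main() {
  ToyMon mon;
  ToyMds mds;
  mon.outbox.push({mon.epoch, "up:replay"});  // the mon promotes a standby to replay
  while (!mon.outbox.empty()) {
    MdsMapMsg m = mon.outbox.front();
    mon.outbox.pop();
    mds.handle_mds_map(m, mon);
  }
}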

MDS message dispatching was introduced earlier. During the switchover, what the mon and the mds exchange is the mdsmap, so let's look at how an mdsmap is processed:

MDSDaemon::handle_mds_map:

MDSRankDispatcher::handle_mds_map:
There is a lot of logic here; only part of the code is excerpted.

void MDSRankDispatcher::handle_mds_map(
    const MMDSMap::const_ref &m,
    const MDSMap &oldmap)
{
  // I am only to be passed MDSMaps in which I hold a rank
  ceph_assert(whoami != MDS_RANK_NONE);

  // The current state is oldstate; the new state fetched from the mds map becomes state.
  // If the two differ, update last_state and incarnation (incarnation indicates which daemon currently holds the rank?)
  MDSMap::DaemonState oldstate = state;
  mds_gid_t mds_gid = mds_gid_t(monc->get_global_id());
  state = mdsmap->get_state_gid(mds_gid);
  if (state != oldstate) {
    last_state = oldstate;
    incarnation = mdsmap->get_inc_gid(mds_gid);
  }

  version_t epoch = m->get_epoch();

  // note source's map version
  // The mds cluster state is about to change and has entered a new epoch, so the recorded epochs of the other mds need updating
  if (m->get_source().is_mds() &&
      peer_mdsmap_epoch[mds_rank_t(m->get_source().num())] < epoch) {
    dout(15) << " peer " << m->get_source()
	     << " has mdsmap epoch >= " << epoch
	     << dendl;
    peer_mdsmap_epoch[mds_rank_t(m->get_source().num())] = epoch;
  }

  // Validate state transitions while I hold a rank
  // Validate the old-to-new state transition; if the transition is invalid, respawn. For which transitions are legal,
  // see the diagram at https://docs.ceph.com/docs/master/cephfs/mds-states/#mds-states
  if (!MDSMap::state_transition_valid(oldstate, state)) {
    derr << "Invalid state transition " << ceph_mds_state_name(oldstate)
      << "->" << ceph_mds_state_name(state) << dendl;
    respawn();
  }

  // mdsmap and oldmap can be discontinuous. failover might happen in the missing mdsmap.
  // the 'restart' set tracks ranks that have restarted since the old mdsmap
  set<mds_rank_t> restart;
  // replaying mds does not communicate with other ranks
  // If the new state is >= resolve, a pile of logic runs; resolve only exists with multiple active mds, so it is not covered here
  if (state >= MDSMap::STATE_RESOLVE) {
    // did someone fail?
    //   new down?
    set<mds_rank_t> olddown, down;
    oldmap.get_down_mds_set(&olddown);
    mdsmap->get_down_mds_set(&down);
    for (const auto& r : down) {
      if (oldmap.have_inst(r) && olddown.count(r) == 0) {
	messenger->mark_down_addrs(oldmap.get_addrs(r));
	handle_mds_failure(r);
      }
    }

  // did it change?
  if (oldstate != state) {
    dout(1) << "handle_mds_map state change "
	    << ceph_mds_state_name(oldstate) << " --> "
	    << ceph_mds_state_name(state) << dendl;
    beacon.set_want_state(*mdsmap, state);

    // If we were in standby-replay, there is no need to go through the big branch below; jump straight to the end
    if (oldstate == MDSMap::STATE_STANDBY_REPLAY) {
        dout(10) << "Monitor activated us! Deactivating replay loop" << dendl;
        assert (state == MDSMap::STATE_REPLAY);
    } else {
      // did i just recover?
      if ((is_active() || is_clientreplay()) &&
          (oldstate == MDSMap::STATE_CREATING ||
	   oldstate == MDSMap::STATE_REJOIN ||
	   oldstate == MDSMap::STATE_RECONNECT))
        recovery_done(oldstate);

      // Decide what happens next based on the state in the new mdsmap
      if (is_active()) {
        active_start();
      } else if (is_any_replay()) {
        // A standby mds that receives an mdsmap marking it standby-replay also takes this branch
        replay_start();
      } else if (is_resolve()) {
        resolve_start();
      } else if (is_reconnect()) {
        reconnect_start();
      } else if (is_rejoin()) {
	rejoin_start();
      } else if (is_clientreplay()) {
        clientreplay_start();
      } else if (is_creating()) {
        boot_create();
      } else if (is_starting()) {
        boot_start();
      } else if (is_stopping()) {
        ceph_assert(oldstate == MDSMap::STATE_ACTIVE);
        stopping_start();
      }
    }
  }

  // RESOLVE
  // is someone else newly resolving?
  if (state >= MDSMap::STATE_RESOLVE) {
    // recover snaptable
    if (mdsmap->get_tableserver() == whoami) {

    }

    if ((!oldmap.is_resolving() || !restart.empty()) && mdsmap->is_resolving()) {
      set<mds_rank_t> resolve;
      mdsmap->get_mds_set(resolve, MDSMap::STATE_RESOLVE);
      dout(10) << " resolve set is " << resolve << dendl;
      calc_recovery_set();
      mdcache->send_resolves();
    }
  }

  // REJOIN
  // is everybody finally rejoining?
  if (state >= MDSMap::STATE_REJOIN) {
    // did we start?
    if (!oldmap.is_rejoining() && mdsmap->is_rejoining())
      rejoin_joint_start();

    // did we finish?
    if (g_conf()->mds_dump_cache_after_rejoin &&
	oldmap.is_rejoining() && !mdsmap->is_rejoining())
      mdcache->dump_cache();      // for DEBUG only

    if (oldstate >= MDSMap::STATE_REJOIN ||
	oldstate == MDSMap::STATE_STARTING) {
      // ACTIVE|CLIENTREPLAY|REJOIN => we can discover from them.

  }

  if (oldmap.is_degraded() && !cluster_degraded && state >= MDSMap::STATE_ACTIVE) {
    dout(1) << "cluster recovered." << dendl;
    auto it = waiting_for_active_peer.find(MDS_RANK_NONE);
    if (it != waiting_for_active_peer.end()) {
      queue_waiters(it->second);
      waiting_for_active_peer.erase(it);
    }
  }

  // did someone go active?
  if (state >= MDSMap::STATE_CLIENTREPLAY &&
      oldstate >= MDSMap::STATE_CLIENTREPLAY) {
    set<mds_rank_t> oldactive, active;
    oldmap.get_mds_set_lower_bound(oldactive, MDSMap::STATE_CLIENTREPLAY);
    mdsmap->get_mds_set_lower_bound(active, MDSMap::STATE_CLIENTREPLAY);
    for (const auto& r : active) {
      if (r == whoami)
	continue; // not me
      if (!oldactive.count(r) || restart.count(r))  // newly so?
	handle_mds_recovery(r);
    }
  }

  if (is_clientreplay() || is_active() || is_stopping()) {
    // did anyone stop?
    set<mds_rank_t> oldstopped, stopped;
    oldmap.get_stopped_mds_set(oldstopped);
    mdsmap->get_stopped_mds_set(stopped);
    for (const auto& r : stopped)
      if (oldstopped.count(r) == 0) {     // newly so?
	mdcache->migrator->handle_mds_failure_or_stop(r);
	if (mdsmap->get_tableserver() == whoami)
	  snapserver->handle_mds_failure_or_stop(r);
      }
  }

  // Wake up everything waiting in waiting_for_mdsmap and remove it from the map
  {
    map<epoch_t,MDSContext::vec >::iterator p = waiting_for_mdsmap.begin();
    while (p != waiting_for_mdsmap.end() && p->first <= mdsmap->get_epoch()) {
      MDSContext::vec ls;
      ls.swap(p->second);
      waiting_for_mdsmap.erase(p++);
      // wake up the waiters in ls
      queue_waiters(ls);
    }
  }

  if (is_active()) {
    // Before going active, set OSD epoch barrier to latest (so that
    // we don't risk handing out caps to clients with old OSD maps that
    // might not include barriers from the previous incarnation of this MDS)
    set_osd_epoch_barrier(objecter->with_osdmap(
			    std::mem_fn(&OSDMap::get_epoch)));

    /* Now check if we should hint to the OSD that a read may follow */
    if (mdsmap->has_standby_replay(whoami))
      mdlog->set_write_iohint(0);
    else
      mdlog->set_write_iohint(CEPH_OSD_OP_FLAG_FADVISE_DONTNEED);
  }

  if (oldmap.get_max_mds() != mdsmap->get_max_mds()) {
    purge_queue.update_op_limit(*mdsmap);
  }

  mdcache->handle_mdsmap(*mdsmap);
}

Cold standby master-slave switchover

Hot standby master-slave switchover

As can be seen, the switching processes for cold standby and hot standby differ mainly in the two stages before the switch and in replay; the rest of the process is essentially the same.

The specific switching process

UP:REPLAY

REPLAY START

Replay is triggered by MDSRank::replay_start(), which waits for a sufficiently new osdmap and then kicks off the boot start process.

void MDSRank::replay_start()
{
  dout(1) << "replay_start" << dendl;

  if (is_standby_replay())
    standby_replaying = true;

  // explained above
  calc_recovery_set();

  // Check if we need to wait for a newer OSD map before starting
  // Kick off boot start from the first phase (boot start has 4 phases; each phase automatically triggers the next when it completes)
  Context *fin = new C_IO_Wrapper(this, new C_MDS_BootStart(this, MDS_BOOT_INITIAL));
  // Wait for an osdmap at least as new as the epoch of the last failure
  bool const ready = objecter->wait_for_map(
      mdsmap->get_last_failure_osd_epoch(),
      fin);

  // If the osdmap is already new enough we are ready to replay; call boot_start
  if (ready) {
    delete fin;
    boot_start();
  } else {
    dout(1) << " waiting for osdmap " << mdsmap->get_last_failure_osd_epoch()
	    << " (which blacklists prior instance)" << dendl;
  }
}

BOOT START

  • Happens before the standby mds does the actual replay
  • Reads the inode table, session map, purge queue, open file table and snap table from the metadata pool and loads them into the cache; creates the recovery thread and submit thread
  • Creates two new inodes in the cache, 0x01 and 0x100 + rank id, where 0x01 is the root directory inode
  • Calls into mdlog for replay, which starts a replay thread to do the actual replay (a sketch of the phased boot pattern follows below)
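The "each phase completes and then kicks off the next" pattern can be sketched with plain completion callbacks. This is illustrative only; the real MDSRank::boot_start chains Context objects, and only the MDS_BOOT_INITIAL phase name is visible in the replay_start() excerpt above:

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Conceptual sketch of the boot_start pattern: each phase loads some on-disk state and,
// when its I/O completes, schedules the next phase. Phase names are illustrative.
using Completion = std::function<void()>;

void run_phase(const std::string& name, Completion on_done) {
  std::cout << "phase: " << name << "\n";
  on_done();  // in the real MDS this fires from a Context when the reads finish
}

void boot_start(const std::vector<std::string>& phases, std::size_t step = 0) {
  if (step == phases.size()) {
    std::cout << "boot complete, handing over to the journal replay thread\n";
    return;
  }
  // The completion of one phase is what triggers the next one.
  run_phase(phases[step], [&phases, step] { boot_start(phases, step + 1); });
}

int main() {
  boot_start({"load tables (inode table, session map, purge queue, open file table, snap table)",
              "open the root (0x01) and per-rank (0x100 + rank id) inodes",
              "prepare the log for replay"});
}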

ACTUAL REPLAY

The logic of the replay thread:

while(1)
{
	1. Read one journal entry; flush if the conditions are met
	2. Decode it into a LogEvent
	3. Replay: rebuild the CInode, CDir, CDentry, etc. in memory from the journal entry, and apply the settings it describes to the dentries
}

UP:RECONNECT

BEFORE CLIENT RECONNECT

Gets the blacklist from the osdmap and notifies the non-blacklisted clients (by some mechanism, not covered here) to initiate a reconnection.

MDSRank::reconnect_start():

void MDSRank::reconnect_start()
{
  dout(1) << "reconnect_start" << dendl;

  if (last_state == MDSMap::STATE_REPLAY) {
    reopen_log();
  }

  // Drop any blacklisted clients from the SessionMap before going
  // into reconnect, so that we don't wait for them.
  // Get the blacklist from the osdmap (viewable on the command line with ceph osd blacklist ls) and compare it
  // with the sessionmap; sessions of blacklisted clients are killed and are not reconnected
  objecter->enable_blacklist_events();
  std::set<entity_addr_t> blacklist;
  epoch_t epoch = 0;
  objecter->with_osdmap([&blacklist, &epoch](const OSDMap& o) {
      o.get_blacklist(&blacklist);
      epoch = o.get_epoch();
  });
  auto killed = server->apply_blacklist(blacklist);
  dout(4) << "reconnect_start: killed " << killed << " blacklisted sessions ("
          << blacklist.size() << " blacklist entries, "
          << sessionmap.get_sessions().size() << ")" << dendl;
  if (killed) {
    set_osd_epoch_barrier(epoch);
  }

  // Reconnect the remaining legitimate clients in the sessionmap; the reconnect is ultimately initiated by the client
  server->reconnect_clients(new C_MDS_VoidFn(this, &MDSRank::reconnect_done));
  finish_contexts(g_ceph_context, waiting_for_reconnect);
}

HANDLE CLIENT REPLAY REQ

Before the switch, some clients may have had metadata requests that were not yet finished. After the switch, these clients resend them to the new mds as replay or retry requests (the wording here is imprecise), and the new mds records these clients as needing clientreplay.

void Server::dispatch(const Message::const_ref &m)
{
  ......

      // Clients that meet the condition are added to replay_queue; if replay_queue is
      // non-empty, the clientreplay stage will be needed
      if (queue_replay) {
        req->mark_queued_for_replay();
        mds->enqueue_replay(new C_MDS_RetryMessage(mds, m));
        return;
      }
    }

    ......
}

HANDLE CLIENT RECONNECT

Processes the client reconnect requests, re-establishes the sessions, and walks each client's caps:
1) if the inode a cap refers to is already in the cache, rebuild the cap directly on the cached inode;
2) if the inode is not in the cache, record the cap for later.

The relevant code is in Server::handle_client_reconnect.
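The function body is long; below is a minimal sketch of just the two cases described above, with invented names and simplified types (the real code works on Session and CInode objects):

#include <cstdint>
#include <iostream>
#include <map>
#include <set>

using inodeno_t = uint64_t;
using client_t  = int64_t;

struct ReconnectedCap { uint32_t wanted; uint32_t issued; };

// Stand-ins for the real structures: the cache's inode set and the cap_imports map
// that MDCache::process_imported_caps() drains later, in rejoin.
std::set<inodeno_t> cache_inodes;                                     // inodes already rebuilt by replay
std::map<inodeno_t, std::map<client_t, ReconnectedCap>> cap_imports;  // caps whose inode is not cached yet

void handle_one_reconnected_cap(client_t client, inodeno_t ino, ReconnectedCap cap) {
  if (cache_inodes.count(ino)) {
    // Case 1: the inode came back during replay -> rebuild the capability on the
    // cached inode right away (the real code calls in->reconnect_cap()).
    std::cout << "rebuild cap for client " << client << " on ino 0x" << std::hex << ino << std::dec << "\n";
  } else {
    // Case 2: the inode is not in the cache yet -> just record the cap; rejoin will
    // open the inode and import the cap later.
    cap_imports[ino][client] = cap;
  }
}

int main() {
  cache_inodes.insert(0x1);  // e.g. the root inode
  handle_one_reconnected_cap(4242, 0x1, {0xff, 0xff});
  handle_one_reconnected_cap(4242, 0x10000000001ULL, {0xff, 0x0f});
  std::cout << "caps deferred to rejoin: " << cap_imports.size() << "\n";
}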

UP:REJOIN

  • Opens all inodes in the open file table and records them in the open-inode map in the cache
    (the former exists to speed up switching; the latter is the authoritative record of open inodes)

  • Processes the caps recorded in the reconnect phase: using the caps, session and other information recorded then, it rebuilds the caps for those clients in the cache

  • Traverses the inodes in the cache and each inode's writable clients (the cache tracks which clients may write to an inode and the writable range); if a client is writable but holds no cap, it is recorded

bool MDCache::process_imported_caps()
{
  // Open inodes in stages, via mdcache->open_ino
  /* There are 4 states:
    enum {
      DIR_INODES = 1,
      DIRFRAGS = 2,
      FILE_INODES = 3,
      DONE = 4,
    };
  As described in this PR: https://github.com/ceph/ceph/pull/20132
      For inodes that need to open for reconnected/imported caps. First open
      directory inodes that are in open file table. then open regular inodes
      that are in open file table. finally open the rest ones.
  */
  if (!open_file_table.is_prefetched() &&
      open_file_table.prefetch_inodes()) {
    open_file_table.wait_for_prefetch(
    new MDSInternalContextWrapper(mds,
      new FunctionContext([this](int r) {
        ceph_assert(rejoin_gather.count(mds->get_nodeid()));
        process_imported_caps();
        })
      )
    );
    return true;
  }

  // While handling client reconnect requests in the reconnect phase, inodes that have caps but are not in the cache were added to cap_imports
  for (auto p = cap_imports.begin(); p != cap_imports.end(); ++p) {
    CInode *in = get_inode(p->first);
    // If the inode is present in the cache, remove it from cap_imports_missing
    if (in) {
      ceph_assert(in->is_auth());
      cap_imports_missing.erase(p->first);
      continue;
    }
    if (cap_imports_missing.count(p->first) > 0)
      continue;

    cap_imports_num_opening++;
    // Run open_ino on every inode that is in cap_imports but in neither the inode map nor cap_imports_missing
    dout(10) << "  opening missing ino " << p->first << dendl;
    open_ino(p->first, (int64_t)-1, new C_MDC_RejoinOpenInoFinish(this, p->first), false);
    if (!(cap_imports_num_opening % 1000))
      mds->heartbeat_reset();
  }

  if (cap_imports_num_opening > 0)
    return true;

  // called by rejoin_gather_finish() ?
  // When first entering rejoin, rejoin_start() adds this node to rejoin_gather, so during rejoin
  // this branch is not taken before the rejoins are sent
  if (rejoin_gather.count(mds->get_nodeid()) == 0) {
    // rejoin_client_map is filled while handling client reconnects; rejoin_session_map starts out empty
    if (!rejoin_client_map.empty() &&
    rejoin_session_map.empty()) {
      // https://github.com/ceph/ceph/commit/e5457dfbe21c79c1aeddcae8d8d013898343bb93
      // Open sessions for the rejoin imported caps
      C_MDC_RejoinSessionsOpened *finish = new C_MDC_RejoinSessionsOpened(this);
      // prepare_force_open_sessions fills finish->session_map based on rejoin_client_map
      version_t pv = mds->server->prepare_force_open_sessions(rejoin_client_map,
                                  rejoin_client_metadata_map,
                                  finish->session_map);
      ESessions *le = new ESessions(pv, std::move(rejoin_client_map),
                    std::move(rejoin_client_metadata_map));
      mds->mdlog->start_submit_entry(le, finish);
      mds->mdlog->flush();
      rejoin_client_map.clear();
      rejoin_client_metadata_map.clear();
      return true;
    }

    // process caps that were exported by slave rename
    // only relevant with multiple mds, not considered here
    for (map<inodeno_t,pair<mds_rank_t,map<client_t,Capability::Export> > >::iterator p = rejoin_slave_exports.begin();
     p != rejoin_slave_exports.end();
     ++p) {
      ......
    }
    rejoin_slave_exports.clear();
    rejoin_imported_caps.clear();

    // process cap imports
    //  ino -> client -> frommds -> capex
    // Walk the inodes in cap_imports that already exist in the cache
    for (auto p = cap_imports.begin(); p != cap_imports.end(); ) {
      CInode *in = get_inode(p->first);
      if (!in) {
    dout(10) << " still missing ino " << p->first
             << ", will try again after replayed client requests" << dendl;
    ++p;
    continue;
      }
      ceph_assert(in->is_auth());
      for (auto q = p->second.begin(); q != p->second.end(); ++q) {
    Session *session;
    {
    // look up the session for this client, if any
      auto r = rejoin_session_map.find(q->first);
      session = (r != rejoin_session_map.end() ? r->second.first : nullptr);
    }
    for (auto r = q->second.begin(); r != q->second.end(); ++r) {
      if (!session) {
        if (r->first >= 0)
          (void)rejoin_imported_caps[r->first][p->first][q->first]; // all are zero
        continue;
      }
    // Add and set the caps: one copy goes into CInode::client_caps and one into MDCache::reconnected_caps
      Capability *cap = in->reconnect_cap(q->first, r->second, session);
      add_reconnected_cap(q->first, in->ino(), r->second);
    // client id >= 0 means a valid client id; a default-constructed client_t is -2
      if (r->first >= 0) {
        if (cap->get_last_seq() == 0) // don't increase mseq if cap already exists
          cap->inc_mseq();
      // build the cap
        do_cap_import(session, in, cap, r->second.capinfo.cap_id, 0, 0, r->first, 0);

      // and store the newly built cap in rejoin_imported_caps
        Capability::Import& im = rejoin_imported_caps[r->first][p->first][q->first];
        im.cap_id = cap->get_cap_id();
        im.issue_seq = cap->get_last_seq();
        im.mseq = cap->get_mseq();
      }
    }
      }
      cap_imports.erase(p++);  // remove and move on
    }
  } 
  else {
    trim_non_auth();

    ceph_assert(rejoin_gather.count(mds->get_nodeid()));
    rejoin_gather.erase(mds->get_nodeid());
    ceph_assert(!rejoin_ack_gather.count(mds->get_nodeid()));
    // If rejoins are pending, send them again
    maybe_send_pending_rejoins();
  }
  return false;
}
void MDCache::rejoin_send_rejoins()
{
  map<mds_rank_t, MMDSCacheRejoin::ref> rejoins;


  // if i am rejoining, send a rejoin to everyone.
  // otherwise, just send to others who are rejoining.
  for (set<mds_rank_t>::iterator p = recovery_set.begin();
       p != recovery_set.end();
       ++p) {
    if (*p == mds->get_nodeid())  continue;  // nothing to myself!
    if (rejoin_sent.count(*p)) continue;     // already sent a rejoin to this node!
    if (mds->is_rejoin())
      // Normally this path is taken; rejoins records, for each mds in recovery_set, its rank number and an MMDSCacheRejoin
      rejoins[*p] = MMDSCacheRejoin::create(MMDSCacheRejoin::OP_WEAK);
    else if (mds->mdsmap->is_rejoin(*p))
      rejoins[*p] = MMDSCacheRejoin::create(MMDSCacheRejoin::OP_STRONG);
  }

  // Fill rejoins based on cap_exports; a single active mds has no cap_exports
  if (mds->is_rejoin()) {
    ......
  }
  
  
  // check all subtrees
  // Rebuild subtrees; with a single active mds there is no subtree partitioning
  for (map<CDir*, set<CDir*> >::iterator p = subtrees.begin();
       p != subtrees.end();
       ++p) {
    ......
  }
  
  // rejoin root inodes, too
  for (auto &p : rejoins) {
    if (mds->is_rejoin()) {
      // weak
      ......
  }  

  if (!mds->is_rejoin()) {
    // i am survivor.  send strong rejoin.
    // note request remote_auth_pins, xlocks

  }

UP:CLIENTREPLAY

Recovers the file inodes that were recorded at the end of rejoin.

Clientreplay mainly replays requests that the old mds had already replied to before the switch but whose journal entries had not yet been persisted (which requests can these be? the author guesses ones completed out of the client's cache). These requests are re-sent by the clients during the reconnect phase (the "handle client replay req" step described earlier), and the recorded request information sits in replay_queue. In the clientreplay stage those requests are replayed and the inodes they involve are recovered; a sketch of this queue-then-drain behaviour follows below.
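A conceptual sketch of that behaviour (names such as queue_one_replay are illustrative, not taken from the source excerpts in this article):

#include <deque>
#include <iostream>
#include <string>

// Requests flagged for replay are parked during reconnect and only processed once the
// rank enters up:clientreplay. If nothing was parked, the rank can go straight from
// up:rejoin to up:active.
struct ClientRequest { int client; std::string op; };

std::deque<ClientRequest> replay_queue;  // filled by Server::dispatch() during reconnect

void enqueue_replay(ClientRequest req) { replay_queue.push_back(std::move(req)); }

// Drain one parked request at a time, in order, the way the rank replays them.
bool queue_one_replay() {
  if (replay_queue.empty())
    return false;  // nothing left: clientreplay is done, the rank can go active
  ClientRequest req = std::move(replay_queue.front());
  replay_queue.pop_front();
  std::cout << "replaying '" << req.op << "' for client " << req.client << "\n";
  return true;
}

int main() {
  enqueue_replay({100, "create /a/f1"});
  enqueue_replay({101, "setattr /a/f2"});
  std::cout << "need up:clientreplay: " << std::boolalpha << !replay_queue.empty() << "\n";
  while (queue_one_replay()) {}
}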

Summary

  • The master-slave switching of the mds is scheduled by the mon and driven by the mdsmap, and it involves coordination between several components: mon, osd, mds, and the cephfs client.
  • An mds hot standby (standby-replay) keeps its cache as closely synchronized as possible, which speeds up the switchover; at present it appears to have no downsides.
  • Switching process:
    • up:replay: read the journal entries in RADOS that are newer than the local cache, decode and replay them, and bring the cache up to date
    • up:reconnect: notify all non-blacklisted clients to reconnect; each client initiates a reconnect request carrying its caps, open files, paths and other information. The mds processes these requests, re-establishes sessions with the legitimate clients, and rebuilds caps for inodes already in the cache; otherwise it records them for later
    • up:rejoin: re-open the open files, record them in the cache, and process the caps recorded in the reconnect phase
    • up:clientreplay (not always entered): replay the requests that the old mds had already replied to but whose journal had not yet been persisted, and recover the inodes they involve
