Ceph RBD: the asynchronous operation flow of librbd

Overview

In librbd, almost all operations are asynchronous. The following uses one piece of code as an example to analyze the operation flow.

The following code is the step in the RBD image creation process that creates the id object. The end result is an object named rbd_id.<image name> in the pool backing RBD, whose content is the id of the image.

  // Create an op object. Subsequent operations are queued into it rather than executed.
  // All queued operations are aggregated, in order, into one atomic batch,
  // which only starts executing asynchronously when aio_operate is called.
  // Once every operation has finished, the callback registered on the AioCompletion is invoked.
  librados::ObjectWriteOperation op;
  // queue a create operation in op
  op.create(true);
  // queue a set_id operation in op (the details differ from op.create, but it is essentially the same)
  cls_client::set_id(&op, m_image_id);
  // handle_create_id_object is the callback run after every operation in op has completed
  using klass = CreateRequest<I>;
  librados::AioCompletion *comp =
    create_rados_callback<klass, &klass::handle_create_id_object>(this);
  // kick off the asynchronous execution
  int r = m_ioctx.aio_operate(m_id_obj, comp, &op);
  assert(r == 0);
  comp->release();
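
For reference, the registered callback handle_create_id_object receives the aggregate return value once the whole batch has finished on the OSD. A simplified sketch (not the verbatim librbd code; the name of the follow-up step is illustrative) of what such a handler does:

  template <typename I>
  void CreateRequest<I>::handle_create_id_object(int r) {
    if (r < 0) {
      // creating rbd_id.<image name> failed; finish the whole request with the error
      complete(r);
      return;
    }
    // otherwise continue with the next step of image creation,
    // e.g. registering the image in the rbd_directory object
    add_image_to_directory();
  }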

The opening snippet mainly involves the following points:

  1. The role of the ObjectXXXXOperation objects
  2. How cls_client calls a registered cls function
  3. What AioCompletion is
  4. What happens behind aio_operate

ObjectXXXXOperation

librados::ObjectWriteOperation
As the comment on ObjectWriteOperation says, the object batches multiple write operations into a single request. The class itself contains little: it is just a collection of wrapper functions for the various operations.

  /*
   * ObjectWriteOperation : compound object write operation
   * Batch multiple object operations into a single request, to be applied
   * atomically.
   */
  class CEPH_RADOS_API ObjectWriteOperation : public ObjectOperation
  {
  protected:
    time_t *unused;
  public:
    ObjectWriteOperation() : unused(NULL) {}
    ~ObjectWriteOperation() override {}
    // the create function called in the opening snippet
    void create(bool exclusive)
    {
      ::ObjectOperation *o = &impl->o;
      o->create(exclusive);
    }
    ...
    // many more member functions, omitted
  };

librados::ObjectOperation
ObjectWriteOperation and its sibling classes all inherit from librados::ObjectOperation. The actual state lives in ObjectOperationImpl.

class CEPH_RADOS_API ObjectOperation
  {
  ...
  protected:
    ObjectOperationImpl *impl;
  };

librados::ObjectOperationImpl
This still does not contain much; the key content is in ::ObjectOperation. Keep going.

struct ObjectOperationImpl {
  ::ObjectOperation o;
  real_time rt;
  real_time *prt;

  ObjectOperationImpl() : prt(NULL) {}
};

::ObjectOperation
This is the core of the mechanism. See the comments below for details.

struct ObjectOperation {
  vector<OSDOp> ops; // the array of operations
  int flags; // flags
  int priority; // priority

  vector<bufferlist*> out_bl; // output buffer for each operation in ops
  vector<Context*> out_handler; // callback run when each operation in ops completes
  vector<int*> out_rval; // return value of each operation in ops

  // Two member functions are shown to help make sense of the fields above.
  OSDOp& add_op(int op) {
    int s = ops.size();
    ops.resize(s+1);
    ops[s].op.op = op;
    out_bl.resize(s+1);
    out_bl[s] = NULL;
    out_handler.resize(s+1);
    out_handler[s] = NULL;
    out_rval.resize(s+1);
    out_rval[s] = NULL;
    return ops[s];
  }
  // What the create call in the opening snippet ultimately does:
  // it merely appends an entry to the ops array; nothing is executed yet.
  void create(bool excl) {
    OSDOp& o = add_op(CEPH_OSD_OP_CREATE);
    o.op.flags = (excl ? CEPH_OSD_OP_FLAG_EXCL : 0);
  }
  // other member functions omitted
  ...
};

cls_client

We will not look into the registration and dispatch mechanism of the cls backend for now (it deserves a separate analysis); here we only care about how cls_client invokes a registered function.

Take the set_id call above as an example.

cls_client::set_id
As you can see, this simply encodes the input into a bufferlist and then forwards to the exec function of ObjectWriteOperation.

void set_id(librados::ObjectWriteOperation *op, const std::string id)
{
  bufferlist bl;
  encode(id, bl);
  op->exec("rbd", "set_id", bl);
}

librados::ObjectOperation::exec
As you can see, the cls_client helpers are not executed directly either; they also go through the ObjectOperation mechanism.

void librados::ObjectOperation::exec(const char *cls, const char *method, bufferlist& inbl)
{
  ::ObjectOperation *o = &impl->o;
  o->call(cls, method, inbl);
}

::ObjectOperation::call

  void call(const char *cname, const char *method, bufferlist &indata) {
    add_call(CEPH_OSD_OP_CALL, cname, method, indata, NULL, NULL, NULL);
  }

::ObjectOperation::add_call
The create function called directly on the op earlier went through add_op and appended a standalone CEPH_OSD_OP_CREATE operation. The cls-related functions are instead appended via add_call and all share CEPH_OSD_OP_CALL; individual functions are distinguished by prepending the class name and method name to indata.

  void add_call(int op, const char *cname, const char *method,
        bufferlist &indata,
        bufferlist *outbl, Context *ctx, int *prval) {
    OSDOp& osd_op = add_op(op);

    unsigned p = ops.size() - 1;
    out_handler[p] = ctx;
    out_bl[p] = outbl;
    out_rval[p] = prval;

    osd_op.op.cls.class_len = strlen(cname);
    osd_op.op.cls.method_len = strlen(method);
    osd_op.op.cls.indata_len = indata.length();
    osd_op.indata.append(cname, osd_op.op.cls.class_len);
    osd_op.indata.append(method, osd_op.op.cls.method_len);
    osd_op.indata.append(indata);
  }

AioCompletion

AioCompletion mainly encapsulates, and eventually invokes, the callback that runs after the op finishes executing.

Its creation process is as follows.
create_rados_callback

template <typename T, void(T::*MF)(int)>
librados::AioCompletion *create_rados_callback(T *obj) {
  return librados::Rados::aio_create_completion(
    obj, &detail::rados_callback<T, MF>, nullptr);
}
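
The detail::rados_callback used above is a small adapter (roughly the following, as found in librbd's utility headers) that bridges the C-style rados callback to the member function given as a template parameter:

namespace detail {

template <typename T, void (T::*MF)(int)>
void rados_callback(rados_completion_t c, void *arg) {
  T *obj = reinterpret_cast<T *>(arg);
  // fetch the aggregate return value of the op and hand it to the member function
  int r = rados_aio_get_return_value(c);
  (obj->*MF)(r);
}

} // namespace detail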

aio_create_completion

librados::AioCompletion *librados::Rados::aio_create_completion(void *cb_arg,
                                callback_t cb_complete,
                                callback_t cb_safe)
{
  AioCompletionImpl *c;
  int r = rados_aio_create_completion(cb_arg, cb_complete, cb_safe, (void**)&c);
  assert(r == 0);
  return new AioCompletion(c);
}

rados_aio_create_completion

extern "C" int rados_aio_create_completion(void *cb_arg,
                       rados_callback_t cb_complete,
                       rados_callback_t cb_safe,
                       rados_completion_t *pc)
{
  tracepoint(librados, rados_aio_create_completion_enter, cb_arg, cb_complete, cb_safe);
  librados::AioCompletionImpl *c = new librados::AioCompletionImpl;
  if (cb_complete)
    c->set_complete_callback(cb_arg, cb_complete);
  if (cb_safe)
    c->set_safe_callback(cb_arg, cb_safe);
  *pc = c;
  tracepoint(librados, rados_aio_create_completion_exit, 0, *pc);
  return 0;
}

AioCompletion

  struct CEPH_RADOS_API AioCompletion {
    AioCompletion(AioCompletionImpl *pc_) : pc(pc_) {}
    int set_complete_callback(void *cb_arg, callback_t cb);
    int set_safe_callback(void *cb_arg, callback_t cb);
    int wait_for_complete();
    int wait_for_safe();
    int wait_for_complete_and_cb();
    int wait_for_safe_and_cb();
    bool is_complete();
    bool is_safe();
    bool is_complete_and_cb();
    bool is_safe_and_cb();
    int get_return_value();
    int get_version() __attribute__ ((deprecated));
    uint64_t get_version64();
    void release();
    AioCompletionImpl *pc;
  };

AioCompletionImpl

struct librados::AioCompletionImpl {
  Mutex lock;
  Cond cond;
  int ref, rval;
  bool released;
  bool complete;
  version_t objver;
  ceph_tid_t tid;

  rados_callback_t callback_complete, callback_safe;
  void *callback_complete_arg, *callback_safe_arg;

  // for read
  bool is_read;
  bufferlist bl;
  bufferlist *blp;
  char *out_buf;

  IoCtxImpl *io;
  ceph_tid_t aio_write_seq;
  xlist<AioCompletionImpl*>::item aio_write_list_item; // constructed with this

  int set_complete_callback(void *cb_arg, rados_callback_t cb);
  int set_safe_callback(void *cb_arg, rados_callback_t cb);
  int wait_for_complete();
  int wait_for_safe();
  int is_complete();
  int is_safe();
  int wait_for_complete_and_cb();
  int wait_for_safe_and_cb();
  int is_complete_and_cb();
  int is_safe_and_cb();
  int get_return_value();
  uint64_t get_version();
  void get();
  void _get();
  void release();
  void put();
  void put_unlock();
};

aio_operate

librados::IoCtx::aio_operate
This simply forwards to the corresponding function on io_ctx_impl.

int librados::IoCtx::aio_operate(const std::string& oid, AioCompletion *c,
                 librados::ObjectWriteOperation *o)
{
  object_t obj(oid);
  return io_ctx_impl->aio_operate(obj, &o->impl->o, c->pc,
                  io_ctx_impl->snapc, 0);
}

librados::IoCtxImpl::aio_operate

int librados::IoCtxImpl::aio_operate(const object_t& oid,
                     ::ObjectOperation *o, AioCompletionImpl *c,
                     const SnapContext& snap_context, int flags,
                                     const blkin_trace_info *trace_info)
{
  FUNCTRACE(client->cct);
  OID_EVENT_TRACE(oid.name.c_str(), "RADOS_WRITE_OP_BEGIN");
  auto ut = ceph::real_clock::now();
  /* can't write to a snapshot */
  if (snap_seq != CEPH_NOSNAP)
    return -EROFS;
  // Wrap the AioCompletion in a Context object.
  // Context is Ceph's standard callback object.
  Context *oncomplete = new C_aio_Complete(c);
#if defined(WITH_LTTNG) && defined(WITH_EVENTTRACE)
  ((C_aio_Complete *) oncomplete)->oid = oid;
#endif

  c->io = this;
  // IoCtxImpl keeps a queue of all outstanding AioCompletionImpl*;
  // this call adds the AioCompletionImpl* to that queue.
  queue_aio_write(c);

  ZTracer::Trace trace;
  if (trace_info) {
    ZTracer::Trace parent_trace("", nullptr, trace_info);
    trace.init("rados operate", &objecter->trace_endpoint, &parent_trace);
  }

  trace.event("init root span");
  // Wrap the ObjectOperation in an Objecter::Op, adding object-location information.
  Objecter::Op *op = objecter->prepare_mutate_op(
    oid, oloc, *o, snap_context, ut, flags,
    oncomplete, &c->objver, osd_reqid_t(), &trace);
  // send the operation out over the network
  objecter->op_submit(op, &c->tid);
  
  trace.event("rados operate op submitted");

  return 0;
}
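
For reference, Context is Ceph's generic one-shot callback. Roughly (a simplified sketch of src/include/Context.h), it looks like this; C_aio_Complete overrides finish() to record the return value on the AioCompletionImpl and fire the callbacks registered on it:

// complete(r) runs finish(r) and then deletes the object, which is why a
// Context such as C_aio_Complete is heap-allocated and used exactly once.
class Context {
 public:
  virtual ~Context() {}
  virtual void finish(int r) = 0;
  virtual void complete(int r) {
    finish(r);
    delete this;
  }
};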

Objecter::op_submit

void Objecter::op_submit(Op *op, ceph_tid_t *ptid, int *ctx_budget)
{
  shunique_lock rl(rwlock, ceph::acquire_shared);
  ceph_tid_t tid = 0;
  if (!ptid)
    ptid = &tid;
  op->trace.event("op submit");
  // forward
  _op_submit_with_budget(op, rl, ptid, ctx_budget);
}

Objecter::_op_submit_with_budget
The budget here relates to Ceph's Throttle mechanism, which is used for flow control; we will not look into it for now.
There is still no core logic in this function; it mainly covers flow control and timeout handling.

void Objecter::_op_submit_with_budget(Op *op, shunique_lock& sul,
                      ceph_tid_t *ptid,
                      int *ctx_budget)
{
  assert(initialized);

  assert(op->ops.size() == op->out_bl.size());
  assert(op->ops.size() == op->out_rval.size());
  assert(op->ops.size() == op->out_handler.size());
  // flow-control related
  // throttle.  before we look at any state, because
  // _take_op_budget() may drop our lock while it blocks.
  if (!op->ctx_budgeted || (ctx_budget && (*ctx_budget == -1))) {
    int op_budget = _take_op_budget(op, sul);
    // take and pass out the budget for the first OP
    // in the context session
    if (ctx_budget && (*ctx_budget == -1)) {
      *ctx_budget = op_budget;
    }
  }
  // If a timeout is configured, register a timeout handler with the timer.
  // On timeout, op_cancel is called to cancel the operation.
  if (osd_timeout > timespan(0)) {
    if (op->tid == 0)
      op->tid = ++last_tid;
    auto tid = op->tid;
    op->ontimeout = timer.add_event(osd_timeout,
                    [this, tid]() {
                      op_cancel(tid, -ETIMEDOUT); });
  }
  // forward
  _op_submit(op, sul, ptid);
}

Objecter::_op_submit
This is where the target OSD is selected and the request is actually sent.

void Objecter::_op_submit(Op *op, shunique_lock& sul, ceph_tid_t *ptid)
{
  // rwlock is locked

  ldout(cct, 10) << __func__ << " op " << op << dendl;

  // pick target
  assert(op->session == NULL);
  OSDSession *s = NULL;
  // resolve pool -> pg -> osd layer by layer, finally locating one OSD
  bool check_for_latest_map = _calc_target(&op->target, nullptr)
    == RECALC_OP_TARGET_POOL_DNE;
  // establish a connection session to the OSD
  // Try to get a session, including a retry if we need to take write lock
  int r = _get_session(op->target.osd, &s, sul);
  if (r == -EAGAIN ||
      (check_for_latest_map && sul.owns_lock_shared())) {
    epoch_t orig_epoch = osdmap->get_epoch();
    sul.unlock();
    if (cct->_conf->objecter_debug_inject_relock_delay) {
      sleep(1);
    }
    sul.lock();
    if (orig_epoch != osdmap->get_epoch()) {
      // map changed; recalculate mapping
      ldout(cct, 10) << __func__ << " relock raced with osdmap, recalc target"
             << dendl;
      check_for_latest_map = _calc_target(&op->target, nullptr)
    == RECALC_OP_TARGET_POOL_DNE;
      if (s) {
    put_session(s);
    s = NULL;
    r = -EAGAIN;
      }
    }
  }
  if (r == -EAGAIN) {
    assert(s == NULL);
    r = _get_session(op->target.osd, &s, sul);
  }
  assert(r == 0);
  assert(s);  // may be homeless
  // perf count
  _send_op_account(op);

  // send?

  assert(op->target.flags & (CEPH_OSD_FLAG_READ|CEPH_OSD_FLAG_WRITE));

  if (osdmap_full_try) {
    op->target.flags |= CEPH_OSD_FLAG_FULL_TRY;
  }

  bool need_send = false;
  // based on the current state, decide whether to pause or send
  if (osdmap->get_epoch() < epoch_barrier) {
    ldout(cct, 10) << " barrier, paused " << op << " tid " << op->tid
           << dendl;
    op->target.paused = true;
    _maybe_request_map();
  } else if ((op->target.flags & CEPH_OSD_FLAG_WRITE) &&
             osdmap->test_flag(CEPH_OSDMAP_PAUSEWR)) {
    ldout(cct, 10) << " paused modify " << op << " tid " << op->tid
           << dendl;
    op->target.paused = true;
    _maybe_request_map();
  } else if ((op->target.flags & CEPH_OSD_FLAG_READ) &&
         osdmap->test_flag(CEPH_OSDMAP_PAUSERD)) {
    ldout(cct, 10) << " paused read " << op << " tid " << op->tid
           << dendl;
    op->target.paused = true;
    _maybe_request_map();
  } else if (op->respects_full() &&
         (_osdmap_full_flag() ||
          _osdmap_pool_full(op->target.base_oloc.pool))) {
    ldout(cct, 0) << " FULL, paused modify " << op << " tid "
          << op->tid << dendl;
    op->target.paused = true;
    _maybe_request_map();
  } else if (!s->is_homeless()) {
    need_send = true;
  } else {
    _maybe_request_map();
  }

  MOSDOp *m = NULL;
  if (need_send) {
    // prepare the request: create and populate MOSDOp *m
    m = _prepare_osd_op(op);
  }

  OSDSession::unique_lock sl(s->lock);
  if (op->tid == 0)
    op->tid = ++last_tid;

  ldout(cct, 10) << "_op_submit oid " << op->target.base_oid
         << " '" << op->target.base_oloc << "' '"
         << op->target.target_oloc << "' " << op->ops << " tid "
         << op->tid << " osd." << (!s->is_homeless() ? s->osd : -1)
         << dendl;

  _session_op_assign(s, op);

  if (need_send) {
    // Send the request: this eventually calls Connection::send_message in the
    // Messenger module, which puts m on the send queue and returns immediately.
    _send_op(op, m);
  }

  // Last chance to touch Op here, after giving up session lock it can
  // be freed at any time by response handler.
  ceph_tid_t tid = op->tid;
  if (check_for_latest_map) {
    _send_op_map_check(op);
  }
  if (ptid)
    *ptid = tid;
  op = NULL;

  sl.unlock();
  // release the session
  put_session(s);

  ldout(cct, 5) << num_in_flight << " in flight" << dendl;
}

The callback itself is eventually fired through the dispatch mechanism of Messenger.
librados::RadosClient::connect
The connect function completes the initialization of RadosClient. One of its steps registers the objecter with the messenger as a dispatcher.

int librados::RadosClient::connect()
{
  ...
  objecter->init();
  messenger->add_dispatcher_head(&mgrclient);
  messenger->add_dispatcher_tail(objecter);
  ...
}

Messenger::ms_deliver_dispatch

  void ms_deliver_dispatch(Message *m) {
    m->set_dispatch_stamp(ceph_clock_now());
    for (list<Dispatcher*>::iterator p = dispatchers.begin();
     p != dispatchers.end();
     ++p) {
      if ((*p)->ms_dispatch(m))
    return;
    }
    lsubdout(cct, ms, 0) << "ms_deliver_dispatch: unhandled message " << m << " " << *m << " from "
             << m->get_source_inst() << dendl;
    assert(!cct->_conf->ms_die_on_unhandled_msg);
    m->put();
  }

Objecter::ms_dispatch

bool Objecter::ms_dispatch(Message *m)
{
  ldout(cct, 10) << __func__ << " " << cct << " " << *m << dendl;
  switch (m->get_type()) {
    // these we exclusively handle
  case CEPH_MSG_OSD_OPREPLY:
    handle_osd_op_reply(static_cast<MOSDOpReply*>(m));
    return true;

  case CEPH_MSG_OSD_BACKOFF:
    handle_osd_backoff(static_cast<MOSDBackoff*>(m));
    return true;

  case CEPH_MSG_WATCH_NOTIFY:
    handle_watch_notify(static_cast<MWatchNotify*>(m));
    m->put();
    return true;

  case MSG_COMMAND_REPLY:
    if (m->get_source().type() == CEPH_ENTITY_TYPE_OSD) {
      handle_command_reply(static_cast<MCommandReply*>(m));
      return true;
    } else {
      return false;
    }

  case MSG_GETPOOLSTATSREPLY:
    handle_get_pool_stats_reply(static_cast<MGetPoolStatsReply*>(m));
    return true;

  case CEPH_MSG_POOLOP_REPLY:
    handle_pool_op_reply(static_cast<MPoolOpReply*>(m));
    return true;

  case CEPH_MSG_STATFS_REPLY:
    handle_fs_stats_reply(static_cast<MStatfsReply*>(m));
    return true;

    // these we give others a chance to inspect

    // MDS, OSD
  case CEPH_MSG_OSD_MAP:
    handle_osd_map(static_cast<MOSDMap*>(m));
    return false;
  }
  return false;
}
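
For a write like the one in this article, the reply lands in handle_osd_op_reply, which is where the Context created in aio_operate finally fires. A heavily simplified sketch (not the verbatim Objecter code, which also deals with resends, per-op output buffers and handlers, budgets and session bookkeeping):

void Objecter::handle_osd_op_reply(MOSDOpReply *m) {
  Op *op = /* ... look up the in-flight Op by m->get_tid() ... */;
  // ... copy out data, per-op return values, run per-op out_handler callbacks ...
  Context *onfinish = op->onfinish;  // the C_aio_Complete created in aio_operate
  int rc = m->get_result();
  _finish_op(op, 0);                 // drop the op from its session
  if (onfinish) {
    // C_aio_Complete::finish marks the AioCompletionImpl complete and fires the
    // callbacks registered on it (ultimately handle_create_id_object here)
    onfinish->complete(rc);
  }
  m->put();
}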

Supplement on Objecter

In librados, RadosClient is the core management class: it handles management at the cluster and pool level, and is responsible for creating and maintaining components such as the Objecter. IoCtxImpl holds the context of one pool and is responsible for operations on that single pool. The osdc module is responsible for encapsulating requests and sending them through the network module; its core class is Objecter.
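
To tie the pieces together, here is a minimal sketch of the same asynchronous pattern as seen from an application using the public librados C++ API (the pool name, object name, user id, and object content below are made up for illustration):

#include <rados/librados.hpp>
#include <cassert>
#include <string>

int main() {
  // RadosClient: connect to the cluster (user id and default ceph.conf)
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  assert(cluster.connect() == 0);

  // IoCtx / IoCtxImpl: the context of one pool
  librados::IoCtx io_ctx;
  assert(cluster.ioctx_create("rbd", io_ctx) == 0);

  // Queue two operations into one atomic batch, like the CreateRequest snippet.
  librados::ObjectWriteOperation op;
  op.create(true);                            // CEPH_OSD_OP_CREATE (exclusive)
  librados::bufferlist bl;
  bl.append(std::string("0123456789ab"));     // stand-in for the image id
  op.write_full(bl);

  // Nothing has been sent yet; aio_operate hands the batch to the Objecter.
  librados::AioCompletion *comp = librados::Rados::aio_create_completion();
  assert(io_ctx.aio_operate("rbd_id.myimage", comp, &op) == 0);

  // Block here instead of registering a callback as librbd does.
  comp->wait_for_complete();
  int r = comp->get_return_value();
  comp->release();
  cluster.shutdown();
  return r;
}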

 



Author: chnmagnus
Link: https://www.jianshu.com/p/600f5b51a190
Source: Jianshu
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please indicate the source.
