Ceph PG scrub source code analysis

Introduction to Ceph scrub

  1. Scrub scheduling
    1.1 Related data structures
    1.2 Scrub scheduling implementation
    1.2.1 OSD::sched_scrub
    1.2.2 PG::sched_scrub()
    1.3 Scrub reservation message flow
  2. Scrub implementation
    2.1 Related data structures
    2.1.1 Scrubber
    2.1.2 ScrubMap
    2.2 Scrub control flow
    2.2.1 chunky_scrub()
    2.3 Building the ScrubMap
    2.3.1 build_scrub_map_chunk
    2.3.2 PGBackend::be_scan_list
    2.3.3 ReplicatedBackend::be_deep_scrub
    2.4 Replica-side processing: building the scrubmap
    2.5 Comparing replicas
    2.5.1 scrub_compare_maps
    2.5.2 be_compare_scrubmaps
    2.5.3 be_select_auth_object
    2.6 Finishing the scrub

Introduction to Ceph scrub

Ceph uses scrub to guarantee data consistency. Scrub works per PG and proceeds in chunks: for each PG, Ceph scans the objects it contains and produces a summary of each object's metadata (object size, attributes, and so on) called a ScrubMap. The ScrubMaps of all replicas are then compared, an authoritative ("auth") copy is selected for each object, and by comparing the other copies against the auth copy the missing or damaged objects are found and repaired. A regular scrub (daily) compares object size and attributes; a deep scrub (weekly) also reads the data and compares checksums (CRC32 here) to verify data consistency. Each pass scrubs only one chunk's worth of objects (the chunk size is configurable through Ceph options); the objects being verified must not be modified during that time, so write requests to them are blocked. Once the chunk has been scrubbed, the blocked requests are requeued and processed. A scrub can be triggered manually, and it is also triggered periodically (e.g. daily) depending on configuration and system load.
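
To make the checksum part concrete, here is a minimal sketch (not actual Ceph code) of the deep-scrub digest idea: the data digest is a crc32c accumulated with Ceph's bufferhash (bufferlist/bufferhash come from include/buffer.h and are used the same way in ReplicatedBackend::be_deep_scrub, shown in section 2.3.3), and two replicas disagree when their digests differ. The function name data_digests_match is made up for illustration.

// Minimal sketch: accumulate a crc32c over each replica's object data the way
// deep-scrub does, then compare the resulting digests.
bool data_digests_match(const bufferlist& replica_a, const bufferlist& replica_b)
{
  bufferhash ha(-1), hb(-1);   // -1 is the initial crc seed used by deep-scrub
  ha << replica_a;             // operator<< feeds the buffer contents into the crc
  hb << replica_b;
  return ha.digest() == hb.digest();
}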

1. Scrub scheduling

Scrub scheduling determines when a PG starts a scrub. There are several ways to start one:

  • Manually, starting immediately;
  • Periodically in the background at a configured interval, for example once a day;
  • Within a configured time window, usually chosen so that scrubbing runs when system load is low.

Figure 1: scrub scheduling

1.1 Related data structures

class OSDService {
  // -- scrub scheduling --
  Mutex sched_scrub_lock;  // protects the scrub scheduling state below
  int scrubs_pending;      // PGs whose reservation has succeeded and that are waiting to scrub
  int scrubs_active;       // PGs currently scrubbing
};
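
scrubs_pending and scrubs_active together gate scrub concurrency against osd_max_scrubs (default 1). A simplified sketch of OSDService::inc_scrubs_pending(), based on the Luminous code with logging trimmed (can_inc_scrubs_pending() performs the same check without incrementing):

// Sketch: reserve one local scrub slot if the concurrency limit allows it.
bool OSDService::inc_scrubs_pending()
{
  Mutex::Locker l(sched_scrub_lock);
  // allow a new scrub only while pending + active stays below osd_max_scrubs
  if (scrubs_pending + scrubs_active < cct->_conf->osd_max_scrubs) {
    ++scrubs_pending;
    return true;
  }
  return false;
}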

struct ScrubJob {  // the scrub scheduling parameters of one PG
    CephContext* cct;
    /// pg to be scrubbed
    spg_t pgid;
    /// a time scheduled for scrub. but the scrub could be delayed if system
    /// load is too high or it fails to fall in the scrub hours
    utime_t sched_time;
    /// the hard upper bound of scrub time
    utime_t deadline;
};
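
ScrubJobs are kept in a set ordered by sched_time (ScrubJob defines operator< on sched_time, then pgid), so the scheduler can walk them in schedule order. A rough sketch of the helpers used by OSD::sched_scrub() below; the member name sched_scrub_pg and the exact bodies are assumptions based on the Luminous OSDService, slightly simplified:

// Sketch: walk the ScrubJob set in sched_time order.
bool OSDService::first_scrub_stamp(ScrubJob *out)
{
  Mutex::Locker l(sched_scrub_lock);
  if (sched_scrub_pg.empty())          // set<ScrubJob>, ordered by sched_time
    return false;
  *out = *sched_scrub_pg.begin();      // earliest scheduled job
  return true;
}

bool OSDService::next_scrub_stamp(const ScrubJob& next, ScrubJob *out)
{
  Mutex::Locker l(sched_scrub_lock);
  auto it = sched_scrub_pg.lower_bound(next);
  if (it == sched_scrub_pg.end())
    return false;
  ++it;                                // advance past the current job
  if (it == sched_scrub_pg.end())
    return false;
  *out = *it;
  return true;
}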

1.2 Scrub scheduling implementation

OSD::init() registers a timer event that fires OSD::tick_without_osd_lock() roughly once per second; in Luminous a small random factor is applied so that the actual interval varies about ±5% around 1s, which avoids scrub scheduling livelocks.

int OSD::init()
{
  {
    Mutex::Locker l(tick_timer_lock);
    tick_timer_without_osd_lock.add_event_after(get_tick_interval(),
                        new C_Tick_WithoutOSDLock(this));
  }

}

double OSD::get_tick_interval() const
{
  // vary +/- 5% to avoid scrub scheduling livelocks
  constexpr auto delta = 0.05;
  std::default_random_engine rng{static_cast<unsigned>(whoami)};
  return (OSD_TICK_INTERVAL *
          std::uniform_real_distribution<>{1.0 - delta, 1.0 + delta}(rng));
}

class OSD::C_Tick_WithoutOSDLock : public Context {
  OSD *osd;
  public:
  explicit C_Tick_WithoutOSDLock(OSD *o) : osd(o) {}
  void finish(int r) override {
    osd->tick_without_osd_lock();
  }
};

void OSD::tick_without_osd_lock()
{
  if (is_active()) {
    if (!scrub_random_backoff()) {
      sched_scrub();// 调度scrub
    }
 }
}
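
scrub_random_backoff() makes some ticks skip scheduling entirely, so that the OSD does not attempt to schedule scrubs on every tick. A sketch, essentially as in the Luminous code, controlled by osd_scrub_backoff_ratio (default 0.66):

// Sketch: with probability osd_scrub_backoff_ratio this tick backs off
// and sched_scrub() is not called.
bool OSD::scrub_random_backoff()
{
  bool coin_flip = (rand() / (double)RAND_MAX >=
                    cct->_conf->osd_scrub_backoff_ratio);
  if (!coin_flip) {
    dout(20) << "scrub_random_backoff lost coin flip, randomly backing off" << dendl;
    return true;   // back off: skip sched_scrub() on this tick
  }
  return false;
}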

1.2.1 OSD::sched_scrub

This function decides whether a PG scrub may be started right now:

  • Check whether the scrub concurrency limit still allows another scrub (can_inc_scrubs_pending, bounded by osd_max_scrubs);
  • If the OSD is in the middle of recovery, do not scrub (unless osd_scrub_during_recovery is enabled);
  • Check whether we are inside the allowed scrub time window (see the scrub_time_permit sketch after the code below);
  • Check whether the current system load is low enough;
  • Fetch the first ScrubJob that is waiting to be scrubbed;
  • A scrub is started only if the PG's backend supports scrub and the PG is active.
void OSD::sched_scrub()
{
  // if not permitted, fail fast
  if (!service.can_inc_scrubs_pending()) {
    return;
  }
  //osd处于recovery过程中,不会进行scrub
  if (!cct->_conf->osd_scrub_during_recovery && service.is_recovery_active()) {//osd_scrub_during_recovery=false
    dout(20) << __func__ << " not scheduling scrubs due to active recovery" << dendl;
    return;
  }

  utime_t now = ceph_clock_now();
  bool time_permit = scrub_time_permit(now);// 检查是否在scrub允许时间段内
  bool load_is_low = scrub_load_below_threshold();// 检查当前系统负载是否允许
  dout(20) << "sched_scrub load_is_low=" << (int)load_is_low << dendl;

  OSDService::ScrubJob scrub;
  if (service.first_scrub_stamp(&scrub)) {//获取第一个等待scrub的scrubjob
    do {
      dout(30) << "sched_scrub examine " << scrub.pgid << " at " << scrub.sched_time << dendl;

      if (scrub.sched_time > now) {// 还没到时间,跳过执行下一个任务
    // save ourselves some effort
    dout(10) << "sched_scrub " << scrub.pgid << " scheduled at " << scrub.sched_time
         << " > " << now << dendl;
    break;
      }

      if ((scrub.deadline >= now) && !(time_permit && load_is_low)) {
        dout(10) << __func__ << " not scheduling scrub for " << scrub.pgid << " due to "
                 << (!time_permit ? "time not permit" : "high load") << dendl;
        continue;
      }

      PG *pg = _lookup_lock_pg(scrub.pgid);
      if (!pg)
    continue;
      if (pg->get_pgbackend()->scrub_supported() && pg->is_active()) {// pg 支持scrub并且是active状态,才会进行scrub
    dout(10) << "sched_scrub scrubbing " << scrub.pgid << " at " << scrub.sched_time
         << (pg->scrubber.must_scrub ? ", explicitly requested" :
             (load_is_low ? ", load_is_low" : " deadline < now"))
         << dendl;
    if (pg->sched_scrub()) {// 执行scrub操作
      pg->unlock();
      break;
    }
      }
      pg->unlock();
    } while (service.next_scrub_stamp(scrub, &scrub));
  }
  dout(20) << "sched_scrub done" << dendl;
}
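
The time-window and load checks above are driven by osd_scrub_begin_hour/osd_scrub_end_hour and osd_scrub_load_threshold. A simplified sketch of the two checks (the real Luminous functions additionally track a daily load average and a few corner cases; localtime_r() and getloadavg() are ordinary POSIX calls):

// Sketch: is the current hour inside the configured scrub window?
bool OSD::scrub_time_permit(utime_t now)
{
  struct tm bdt;
  time_t tt = now.sec();
  localtime_r(&tt, &bdt);
  int begin = cct->_conf->osd_scrub_begin_hour;
  int end   = cct->_conf->osd_scrub_end_hour;
  if (begin < end)
    return bdt.tm_hour >= begin && bdt.tm_hour < end;
  return bdt.tm_hour >= begin || bdt.tm_hour < end;  // window wraps past midnight
}

// Sketch: is the 1-minute load average below osd_scrub_load_threshold?
bool OSD::scrub_load_below_threshold()
{
  double loadavgs[1];
  if (getloadavg(loadavgs, 1) != 1)
    return false;                    // cannot read the load: be conservative
  return loadavgs[0] < cct->_conf->osd_scrub_load_threshold;
}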

1.2.2 PG::sched_scrub()

  • The scrub is driven from the primary; the PG must be primary, active, clean and not already scrubbing, and if any of these conditions is false the scheduling attempt exits immediately;
  • Determine deep_scrub_interval; if it is not set on the pool, fall back to osd_deep_scrub_interval (7 days by default);
  • time_for_deep decides whether this run should be a deep scrub.
  • Like recovery, scrub consumes a lot of system resources, so resources must be reserved on the OSDs that hold the PG. If scrubber.reserved is false, the reservation has not been made yet and is started now:
    • set scrubber.reserved = true on the local side;
    • insert ourselves into scrubber.reserved_peers;
    • call scrub_reserve_replicas to send reservation requests to the other OSDs (a sketch of this helper is given in section 1.3).

The messages involved are MOSDScrubReserve::REQUEST, MOSDScrubReserve::GRANT, MOSDScrubReserve::REJECT and MOSDScrubReserve::RELEASE;
see "1.3 Scrub reservation message flow".

  • When scrubber.reserved_peers.size() == acting.size(), every OSD has reserved successfully; the code then decides whether this should be a deep scrub and calls PG::queue_scrub() to put the PG into op_wq, which triggers the scrub task.
bool PG::sched_scrub()
{
  bool nodeep_scrub = false;
  assert(is_locked());
  //主osd触发primary、pg active、pg clean、pg 没有在scrub过程中,这些条件任何一个为假直接退出scrub调度;
  if (!(is_primary() && is_active() && is_clean() && !is_scrubbing())) {
    return false;
  }

  double deep_scrub_interval = 0;
  pool.info.opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, &deep_scrub_interval);
  if (deep_scrub_interval <= 0) {
    deep_scrub_interval = cct->_conf->osd_deep_scrub_interval;//7_day
  }
  bool time_for_deep = ceph_clock_now() >=
    info.history.last_deep_scrub_stamp + deep_scrub_interval;

  bool deep_coin_flip = false;
  // Only add random deep scrubs when NOT user initiated scrub
  if (!scrubber.must_scrub)//must_scrub 为用户手动启动deepscrub操作
      deep_coin_flip = (rand() % 100) < cct->_conf->osd_deep_scrub_randomize_ratio * 100;//osd_deep_scrub_randomize_ratio =0.15
  //sched_scrub: time_for_deep=0 deep_coin_flip=0      
  dout(20) << __func__ << ": time_for_deep=" << time_for_deep << " deep_coin_flip=" << deep_coin_flip << dendl;

  time_for_deep = (time_for_deep || deep_coin_flip);

  //NODEEP_SCRUB so ignore time initiated deep-scrub
  if (osd->osd->get_osdmap()->test_flag(CEPH_OSDMAP_NODEEP_SCRUB) ||
      pool.info.has_flag(pg_pool_t::FLAG_NODEEP_SCRUB)) {
    time_for_deep = false;
    nodeep_scrub = true;
  }

  if (!scrubber.must_scrub) {
    assert(!scrubber.must_deep_scrub);

    //NOSCRUB so skip regular scrubs
    if ((osd->osd->get_osdmap()->test_flag(CEPH_OSDMAP_NOSCRUB) ||
     pool.info.has_flag(pg_pool_t::FLAG_NOSCRUB)) && !time_for_deep) {
      if (scrubber.reserved) {
        // cancel scrub if it is still in scheduling,
        // so pgs from other pools where scrub are still legal
        // have a chance to go ahead with scrubbing.
        clear_scrub_reserved();
        scrub_unreserve_replicas();
      }
      return false;
    }
  }

  if (cct->_conf->osd_scrub_auto_repair/*default false*/
      && get_pgbackend()->auto_repair_supported()
      && time_for_deep
      // respect the command from user, and not do auto-repair
      && !scrubber.must_repair
      && !scrubber.must_scrub
      && !scrubber.must_deep_scrub) {
    dout(20) << __func__ << ": auto repair with deep scrubbing" << dendl;
    scrubber.auto_repair = true;
  } else {
    // this happens when user issue the scrub/repair command during
    // the scheduling of the scrub/repair (e.g. request reservation)
    scrubber.auto_repair = false;
  }

  bool ret = true;
  if (!scrubber.reserved) {// 还没有完成资源预约
    assert(scrubber.reserved_peers.empty());
    if ((cct->_conf->osd_scrub_during_recovery/*false*/ || !osd->is_recovery_active()) &&
         osd->inc_scrubs_pending()) {
      dout(20) << __func__ << ": reserved locally, reserving replicas" << dendl;
      scrubber.reserved = true;
      scrubber.reserved_peers.insert(pg_whoami);
      scrub_reserve_replicas();//向其他osd发送资源预约请求
    } else {
      dout(20) << __func__ << ": failed to reserve locally" << dendl;
      ret = false;
    }
  }
  if (scrubber.reserved) {
    if (scrubber.reserve_failed) {
      dout(20) << "sched_scrub: failed, a peer declined" << dendl;
      clear_scrub_reserved();
      scrub_unreserve_replicas();
      ret = false;
    } else if (scrubber.reserved_peers.size() == acting.size()) {//所有副本预约成功
      dout(20) << "sched_scrub: success, reserved self and replicas" << dendl;
      if (time_for_deep) {
    dout(10) << "sched_scrub: scrub will be deep" << dendl;
    state_set(PG_STATE_DEEP_SCRUB);
      } else if (!scrubber.must_deep_scrub && info.stats.stats.sum.num_deep_scrub_errors) {
    if (!nodeep_scrub) {
      osd->clog->info() << "osd." << osd->whoami
                << " pg " << info.pgid
                << " Deep scrub errors, upgrading scrub to deep-scrub";
      state_set(PG_STATE_DEEP_SCRUB);
    } else if (!scrubber.must_scrub) {
      osd->clog->error() << "osd." << osd->whoami
                 << " pg " << info.pgid
                 << " Regular scrub skipped due to deep-scrub errors and nodeep-scrub set";
      clear_scrub_reserved();
      scrub_unreserve_replicas();
      return false;
    } else {
      osd->clog->error() << "osd." << osd->whoami
                 << " pg " << info.pgid
                 << " Regular scrub request, deep-scrub details will be lost";
    }
      }
      queue_scrub();//把该pg加入到工作队列op_wq 触发scrub任务执行
    } else {
      // none declined, since scrubber.reserved is set
      dout(20) << "sched_scrub: reserved " << scrubber.reserved_peers << ", waiting for replicas" << dendl;
    }
  }

  return ret;
}
 //PG::sched_scrub()-> PG::queue_scrub()->PG::requeue_scrub() -> queue_for_scrub() ->PGQueueable::RunVis::operator() ->pg->scrub(op.epoch_queued, handle)
bool PG::queue_scrub()
{
  assert(is_locked());
  if (is_scrubbing()) {
    return false;
  }
  scrubber.priority = scrubber.must_scrub ?
         cct->_conf->osd_requested_scrub_priority/*120*/ : get_scrub_priority();
  scrubber.must_scrub = false;
  state_set(PG_STATE_SCRUBBING);
  if (scrubber.must_deep_scrub) {
    state_set(PG_STATE_DEEP_SCRUB);
    scrubber.must_deep_scrub = false;
  }
  if (scrubber.must_repair || scrubber.auto_repair) {
    state_set(PG_STATE_REPAIR);
    scrubber.must_repair = false;
  }
  requeue_scrub();
  return true;
}

bool PG::requeue_scrub(bool high_priority)
{
  assert(is_locked());
  if (scrub_queued) {
    dout(10) << __func__ << ": already queued" << dendl;
    return false;
  } else {
    dout(10) << __func__ << ": queueing" << dendl;
    scrub_queued = true;
    osd->queue_for_scrub(this, high_priority);
    return true;
  }
}

  void queue_for_scrub(PG *pg, bool with_high_priority) {
    unsigned scrub_queue_priority = pg->scrubber.priority;
    if (with_high_priority && scrub_queue_priority < cct->_conf->osd_client_op_priority) {
      scrub_queue_priority = cct->_conf->osd_client_op_priority;
    }
    enqueue_back(
      pg->info.pgid,
      PGQueueable(
    PGScrub(pg->get_osdmap()->get_epoch()),
    cct->_conf->osd_scrub_cost,
    scrub_queue_priority,
    ceph_clock_now(),
    entity_inst_t(),
    pg->get_osdmap()->get_epoch()));
  }

void PGQueueable::RunVis::operator()(const PGScrub &op) {
  pg->scrub(op.epoch_queued, handle);
}

1.3 Scrub reservation message flow

Figure 2: scrub message dispatch

Figure 3: scrub reservation message flow

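The reservation itself is started by PG::scrub_reserve_replicas(), called from PG::sched_scrub() above: it sends an MOSDScrubReserve of type REQUEST to every other shard in actingbackfill, going through OSDService::send_message_osd_cluster shown below. This is a sketch simplified from the Luminous code (the pre-Luminous MOSDSubOp compatibility path is omitted):

// Sketch: ask every peer shard to reserve a scrub slot (MOSDScrubReserve::REQUEST).
void PG::scrub_reserve_replicas()
{
  for (set<pg_shard_t>::iterator i = actingbackfill.begin();
       i != actingbackfill.end();
       ++i) {
    if (*i == pg_whoami) continue;            // skip ourselves
    dout(10) << "scrub requesting reserve from osd." << *i << dendl;
    osd->send_message_osd_cluster(
      i->osd,
      new MOSDScrubReserve(spg_t(info.pgid.pgid, i->shard),
                           get_osdmap()->get_epoch(),
                           MOSDScrubReserve::REQUEST, pg_whoami),
      get_osdmap()->get_epoch());
  }
}
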
void OSDService::send_message_osd_cluster(int peer, Message *m, epoch_t from_epoch)
{
  OSDMapRef next_map = get_nextmap_reserved();
  // service map is always newer/newest
  assert(from_epoch <= next_map->get_epoch());

  if (next_map->is_down(peer) ||
      next_map->get_info(peer).up_from > from_epoch) {
    m->put();
    release_map(next_map);
    return;
  }
  const entity_inst_t& peer_inst = next_map->get_cluster_inst(peer);
  ConnectionRef peer_con = osd->cluster_messenger->get_connection(peer_inst);
  share_map_peer(peer, peer_con.get(), next_map);
  peer_con->send_message(m);//发送消息到其他节点
  release_map(next_map);
}

int AsyncConnection::send_message(Message *m)
{
  FUNCTRACE();
  lgeneric_subdout(async_msgr->cct, ms,
           1) << "-- " << async_msgr->get_myaddr() << " --> "
              << get_peer_addr() << " -- "
              << *m << " -- " << m << " con "
              << m->get_connection().get()
              << dendl;

  // optimistic think it's ok to encode(actually may broken now)
  if (!m->get_priority())
    m->set_priority(async_msgr->get_default_send_priority());

  m->get_header().src = async_msgr->get_myname();
  m->set_connection(this);

  if (m->get_type() == CEPH_MSG_OSD_OP)
    OID_EVENT_TRACE_WITH_MSG(m, "SEND_MSG_OSD_OP_BEGIN", true);
  else if (m->get_type() == CEPH_MSG_OSD_OPREPLY)
    OID_EVENT_TRACE_WITH_MSG(m, "SEND_MSG_OSD_OPREPLY_BEGIN", true);

  if (async_msgr->get_myaddr() == get_peer_addr()) { //loopback connection
    ldout(async_msgr->cct, 20) << __func__ << " " << *m << " local" << dendl;
    std::lock_guard<std::mutex> l(write_lock);
    if (can_write != WriteStatus::CLOSED) {
      dispatch_queue->local_delivery(m, m->get_priority());
    } else {
      ldout(async_msgr->cct, 10) << __func__ << " loopback connection closed."
                                 << " Drop message " << m << dendl;
      m->put();
    }
    return 0;
  }

  last_active = ceph::coarse_mono_clock::now();
  // we don't want to consider local message here, it's too lightweight which
  // may disturb users
  logger->inc(l_msgr_send_messages);

  bufferlist bl;
  uint64_t f = get_features();

  // TODO: Currently not all messages supports reencode like MOSDMap, so here
  // only let fast dispatch support messages prepare message
  bool can_fast_prepare = async_msgr->ms_can_fast_dispatch(m);
  if (can_fast_prepare)
    prepare_send_message(f, m, bl);

  std::lock_guard<std::mutex> l(write_lock);
  // "features" changes will change the payload encoding
  if (can_fast_prepare && (can_write == WriteStatus::NOWRITE || get_features() != f)) {
    // ensure the correctness of message encoding
    bl.clear();
    m->get_payload().clear();
    ldout(async_msgr->cct, 5) << __func__ << " clear encoded buffer previous "
                              << f << " != " << get_features() << dendl;
  }
  if (can_write == WriteStatus::CLOSED) {
    ldout(async_msgr->cct, 10) << __func__ << " connection closed."
                               << " Drop message " << m << dendl;
    m->put();
  } else {
    m->trace.event("async enqueueing message");
    out_q[m->get_priority()].emplace_back(std::move(bl), m);
    ldout(async_msgr->cct, 15) << __func__ << " inline write is denied, reschedule m=" << m << dendl;
    if (can_write != WriteStatus::REPLACING)
      center->dispatch_event_external(write_handler);
  }
  return 0;
}

AsyncConnection::AsyncConnection(CephContext *cct, AsyncMessenger *m, DispatchQueue *q,
                                 Worker *w)
{
  read_handler = new C_handle_read(this);
  write_handler = new C_handle_write(this);
}   

class C_handle_read : public EventCallback {
  AsyncConnectionRef conn;

 public:
  explicit C_handle_read(AsyncConnectionRef c): conn(c) {}
  void do_request(int fd_or_id) override {
    conn->process();// 调用
  }
};

class C_handle_write : public EventCallback {
  AsyncConnectionRef conn;

 public:
  explicit C_handle_write(AsyncConnectionRef c): conn(c) {}
  void do_request(int fd) override {
    conn->handle_write();// 调用
  }
};

void AsyncConnection::handle_write()
{
  ldout(async_msgr->cct, 10) << __func__ << dendl;
  ssize_t r = 0;

  write_lock.lock();
  if (can_write == WriteStatus::CANWRITE) {
    if (keepalive) {
      _append_keepalive_or_ack();
      keepalive = false;
    }

    auto start = ceph::mono_clock::now();
    bool more;
    do {
      bufferlist data;
      Message *m = _get_next_outgoing(&data);
      if (!m)
        break;

      if (!policy.lossy) {
        // put on sent list
        sent.push_back(m);
        m->get();
      }
      more = _has_next_outgoing();
      write_lock.unlock();

      // send_message or requeue messages may not encode message
      if (!data.length())
        prepare_send_message(get_features(), m, data);

      r = write_message(m, data, more);
      if (r < 0) {
        ldout(async_msgr->cct, 1) << __func__ << " send msg failed" << dendl;
        goto fail;
      }
      write_lock.lock();
      if (r > 0)
        break;
    } while (can_write == WriteStatus::CANWRITE);
    write_lock.unlock();

    uint64_t left = ack_left;
    if (left) {
      ceph_le64 s;
      s = in_seq;
      outcoming_bl.append(CEPH_MSGR_TAG_ACK);
      outcoming_bl.append((char*)&s, sizeof(s));
      ldout(async_msgr->cct, 10) << __func__ << " try send msg ack, acked " << left << " messages" << dendl;
      ack_left -= left;
      left = ack_left;
      r = _try_send(left);
    } else if (is_queued()) {
      r = _try_send();
    }

    logger->tinc(l_msgr_running_send_time, ceph::mono_clock::now() - start);
    if (r < 0) {
      ldout(async_msgr->cct, 1) << __func__ << " send msg failed" << dendl;
      goto fail;
    }
  } else {
    write_lock.unlock();
    lock.lock();
    write_lock.lock();
    if (state == STATE_STANDBY && !policy.server && is_queued()) {
      ldout(async_msgr->cct, 10) << __func__ << " policy.server is false" << dendl;
      _connect();
    } else if (cs && state != STATE_NONE && state != STATE_CONNECTING && state != STATE_CONNECTING_RE && state != STATE_CLOSED) {
      r = _try_send();// 发送消息
      if (r < 0) {
        ldout(async_msgr->cct, 1) << __func__ << " send outcoming bl failed" << dendl;
        write_lock.unlock();
        fault();
        lock.unlock();
        return ;
      }
    }
    write_lock.unlock();
    lock.unlock();
  }

  return ;

 fail:
  lock.lock();
  fault();
  lock.unlock();
}

void AsyncConnection::process()
{
......
  do {
  ......
    prev_state = state;
    switch (state) {
    ......
          case STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH:
        {
                  Message *message = decode_message(async_msgr->cct, async_msgr->crcflags, current_header, footer,
                                            front, middle, data, this);

                          if (delay_state) {
            utime_t release = message->get_recv_stamp();
            double delay_period = 0;
            if (rand() % 10000 < async_msgr->cct->_conf->ms_inject_delay_probability * 10000.0) {
              delay_period = async_msgr->cct->_conf->ms_inject_delay_max * (double)(rand() % 10000) / 10000.0;
              release += delay_period;
              ldout(async_msgr->cct, 1) << "queue_received will delay until " << release << " on "
                                        << message << " " << *message << dendl;
            }
            delay_state->queue(delay_period, release, message);
          } else if (async_msgr->ms_can_fast_dispatch(message)) {
            lock.unlock();
            dispatch_queue->fast_dispatch(message);
            recv_start_time = ceph::mono_clock::now();
            logger->tinc(l_msgr_running_fast_dispatch_time,
                         recv_start_time - fast_dispatch_time);
            lock.lock();
          } else {
            dispatch_queue->enqueue(message, message->get_priority(), conn_id);
          }
          ......
      } while (prev_state != state);
}

void DispatchQueue::fast_dispatch(Message *m)
{
  uint64_t msize = pre_dispatch(m);
  msgr->ms_fast_dispatch(m);
  post_dispatch(m, msize);
}

void OSD::ms_fast_dispatch(Message *m)
{
  FUNCTRACE();
  if (service.is_stopping()) {
    m->put();
    return;
  }
  OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);
  {
#ifdef WITH_LTTNG
    osd_reqid_t reqid = op->get_reqid();
#endif
    tracepoint(osd, ms_fast_dispatch, reqid.name._type,
        reqid.name._num, reqid.tid, reqid.inc);
  }

  if (m->trace)
    op->osd_trace.init("osd op", &trace_endpoint, &m->trace);

  // note sender epoch, min req'd epoch
  op->sent_epoch = static_cast<MOSDFastDispatchOp*>(m)->get_map_epoch();
  op->min_epoch = static_cast<MOSDFastDispatchOp*>(m)->get_min_epoch();
  assert(op->min_epoch <= op->sent_epoch); // sanity check!

  service.maybe_inject_dispatch_delay();

  if (m->get_connection()->has_features(CEPH_FEATUREMASK_RESEND_ON_SPLIT) ||
      m->get_type() != CEPH_MSG_OSD_OP) {
    // queue it directly
    enqueue_op(
      static_cast<MOSDFastDispatchOp*>(m)->get_spg(),
      op,
      static_cast<MOSDFastDispatchOp*>(m)->get_map_epoch());//入队
  } else {
    // legacy client, and this is an MOSDOp (the *only* fast dispatch
    // message that didn't have an explicit spg_t); we need to map
    // them to an spg_t while preserving delivery order.
    // 这里是兼容老版本,新版本中应该走的上边分支
    Session *session = static_cast<Session*>(m->get_connection()->get_priv());
    if (session) {
      {
    Mutex::Locker l(session->session_dispatch_lock);
    op->get();
    session->waiting_on_map.push_back(*op);
    OSDMapRef nextmap = service.get_nextmap_reserved();
    dispatch_session_waiting(session, nextmap);
    service.release_map(nextmap);
      }
      session->put();
    }
  }
  OID_EVENT_TRACE_WITH_MSG(m, "MS_FAST_DISPATCH_END", false); 
}

void OSD::enqueue_op(spg_t pg, OpRequestRef& op, epoch_t epoch)
{
  utime_t latency = ceph_clock_now() - op->get_req()->get_recv_stamp();
  dout(15) << "enqueue_op " << op << " prio " << op->get_req()->get_priority()
       << " cost " << op->get_req()->get_cost()
       << " latency " << latency
       << " epoch " << epoch
       << " " << *(op->get_req()) << dendl;
  op->osd_trace.event("enqueue op");
  op->osd_trace.keyval("priority", op->get_req()->get_priority());
  op->osd_trace.keyval("cost", op->get_req()->get_cost());
  op->mark_queued_for_pg();
  logger->tinc(l_osd_op_before_queue_op_lat, latency);
  op_shardedwq.queue(make_pair(pg, PGQueueable(op, epoch)));
}

//scrub reservation requests take this path
void PGQueueable::RunVis::operator()(const OpRequestRef &op) {
  osd->dequeue_op(pg, op, handle);
}

void PGQueueable::RunVis::operator()(const PGScrub &op) {
  pg->scrub(op.epoch_queued, handle);
}

void OSD::dequeue_op(
  PGRef pg, OpRequestRef op,
  ThreadPool::TPHandle &handle)
{
  FUNCTRACE();
  OID_EVENT_TRACE_WITH_MSG(op->get_req(), "DEQUEUE_OP_BEGIN", false);

  utime_t now = ceph_clock_now();
  op->set_dequeued_time(now);
  utime_t latency = now - op->get_req()->get_recv_stamp();
  dout(10) << "dequeue_op " << op << " prio " << op->get_req()->get_priority()
       << " cost " << op->get_req()->get_cost()
       << " latency " << latency
       << " " << *(op->get_req())
       << " pg " << *pg << dendl;

  logger->tinc(l_osd_op_before_dequeue_op_lat, latency);

  Session *session = static_cast<Session *>(
    op->get_req()->get_connection()->get_priv());
  if (session) {
    maybe_share_map(session, op, pg->get_osdmap());
    session->put();
  }

  if (pg->deleting)
    return;

  op->mark_reached_pg();
  op->osd_trace.event("dequeue_op");

  pg->do_request(op, handle);//go void PrimaryLogPG::do_request()

  // finish
  dout(10) << "dequeue_op " << op << " finish" << dendl;
  OID_EVENT_TRACE_WITH_MSG(op->get_req(), "DEQUEUE_OP_END", false);
}

void PrimaryLogPG::do_request(
  OpRequestRef& op,
  ThreadPool::TPHandle &handle)
{
......
  switch (op->get_req()->get_type()) {
  ......
    case MSG_OSD_SCRUB_RESERVE:
    {
      const MOSDScrubReserve *m =
    static_cast<const MOSDScrubReserve*>(op->get_req());
      switch (m->type) {
      case MOSDScrubReserve::REQUEST:
    handle_scrub_reserve_request(op);//处理资源预约请求
    break;
      case MOSDScrubReserve::GRANT:
    handle_scrub_reserve_grant(op, m->from);//处理资源预约请求
    break;
      case MOSDScrubReserve::REJECT:
    handle_scrub_reserve_reject(op, m->from);
    break;
      case MOSDScrubReserve::RELEASE:
    handle_scrub_reserve_release(op);
    break;
      }
    }
    break;
  ......
  }
}

void PG::handle_scrub_reserve_request(OpRequestRef op)
{
  dout(7) << __func__ << " " << *op->get_req() << dendl;
  op->mark_started();
  if (scrubber.reserved) {
    dout(10) << __func__ << " ignoring reserve request: Already reserved"
         << dendl;
    return;
  }
  if ((cct->_conf->osd_scrub_during_recovery || !osd->is_recovery_active()) &&
      osd->inc_scrubs_pending()) {
    scrubber.reserved = true;
  } else {
    dout(20) << __func__ << ": failed to reserve remotely" << dendl;
    scrubber.reserved = false;
  }
  if (op->get_req()->get_type() == MSG_OSD_SCRUB_RESERVE) {
    const MOSDScrubReserve *m =
      static_cast<const MOSDScrubReserve*>(op->get_req());
    Message *reply = new MOSDScrubReserve(
      spg_t(info.pgid.pgid, primary.shard),
      m->map_epoch,
      scrubber.reserved ? MOSDScrubReserve::GRANT : MOSDScrubReserve::REJECT,
      pg_whoami);// 这里scrub请求转换为GRANT类型。中间如果处理失败,转换为REJECT类型
    osd->send_message_osd_cluster(reply, op->get_req()->get_connection());
  } else {
    // for jewel compat only
    const MOSDSubOp *req = static_cast<const MOSDSubOp*>(op->get_req());
    assert(req->get_type() == MSG_OSD_SUBOP);
    MOSDSubOpReply *reply = new MOSDSubOpReply(
      req, pg_whoami, 0, get_osdmap()->get_epoch(), CEPH_OSD_FLAG_ACK);
    ::encode(scrubber.reserved, reply->get_data());
    osd->send_message_osd_cluster(reply, op->get_req()->get_connection());
  }
}

void PG::handle_scrub_reserve_grant(OpRequestRef op, pg_shard_t from)
{
  dout(7) << __func__ << " " << *op->get_req() << dendl;
  op->mark_started();
  if (!scrubber.reserved) {
    dout(10) << "ignoring obsolete scrub reserve reply" << dendl;
    return;
  }
  if (scrubber.reserved_peers.find(from) != scrubber.reserved_peers.end()) {
    dout(10) << " already had osd." << from << " reserved" << dendl;
  } else {
    dout(10) << " osd." << from << " scrub reserve = success" << dendl;
    scrubber.reserved_peers.insert(from);
    sched_scrub();//本osd scrub资源预约成功,返回调用 PG::sched_scrub()
  }
}
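
The REJECT case is handled symmetrically to GRANT, except that a single decline marks the whole reservation attempt as failed: sched_scrub() will then see reserve_failed, release all reservations and give up this round. A sketch of PG::handle_scrub_reserve_reject, close to the Luminous code:

void PG::handle_scrub_reserve_reject(OpRequestRef op, pg_shard_t from)
{
  dout(7) << __func__ << " " << *op->get_req() << dendl;
  op->mark_started();
  if (!scrubber.reserved) {
    dout(10) << "ignoring obsolete scrub reserve reply" << dendl;
    return;
  }
  if (scrubber.reserved_peers.find(from) != scrubber.reserved_peers.end()) {
    dout(10) << " already had osd." << from << " reserved" << dendl;
  } else {
    // one decline is enough; sched_scrub() will notice reserve_failed,
    // release the reservations and retry later
    dout(10) << " osd." << from << " scrub reserve = fail" << dendl;
    scrubber.reserve_failed = true;
    sched_scrub();
  }
}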

2. Scrub implementation

The scrub execution works roughly as follows: the metadata and data of an object's replicas on the different OSDs are compared to verify both metadata and data consistency. The core processing flow is driven by PG::chunky_scrub.

Figure 4: scrub implementation

2.1 Related data structures

Scrub mainly involves two data structures. Scrubber is the context of one scrub operation and drives the scrub of one PG; ScrubMap stores the objects to be compared together with digests of their metadata and data.

2.1.1 Scrubber

  struct Scrubber {//用来控制一个pg的scrub过程
    // metadata
    set<pg_shard_t> reserved_peers;//资源预约的shard set
    bool reserved, reserve_failed;//是否预约,预约是否失败
    epoch_t epoch_start;// 开始scrub操作的epoch

    // common to both scrubs
    bool active;//scrub 是否开始
    set<pg_shard_t> waiting_on_whom;//等待的副本
    int shallow_errors;//轻度扫描错误数
    int deep_errors;//深度扫描错误数
    int large_omap_objects = 0;
    int fixed;//已经修复对象数
    ScrubMap primary_scrubmap;//主副本的scrubmap
    ScrubMapBuilder primary_scrubmap_pos;
    epoch_t replica_scrub_start = 0;
    ScrubMap replica_scrubmap;
    ScrubMapBuilder replica_scrubmap_pos;
    map<pg_shard_t, ScrubMap> received_maps;//接收到从副本的scrubmap
    OpRequestRef active_rep_scrub;
    utime_t scrub_reg_stamp;  // stamp we registered for

    // For async sleep
    bool sleeping = false;
    bool needs_sleep = true;
    utime_t sleep_start;

    // flags to indicate explicitly requested scrubs (by admin)
    bool must_scrub, must_deep_scrub, must_repair;

    // Priority to use for scrub scheduling
    unsigned priority;

    // this flag indicates whether we would like to do auto-repair of the PG or not
    bool auto_repair;//是否自动修复

    // Maps from objects with errors to missing/inconsistent peers
    map<hobject_t, set<pg_shard_t>> missing;// 扫描出的缺失对象
    map<hobject_t, set<pg_shard_t>> inconsistent;// 扫描出的不一致对象

    // Map from object with errors to good peers
    map<hobject_t, list<pair<ScrubMap::object, pg_shard_t> >> authoritative;//如果所有副本对象中有不一致的对象,authoritative记录了正确对象所在的osd

    // digest updates which we are waiting on
    int num_digest_updates_pending;// 等待更新digest的对象数目

    // chunky scrub
    hobject_t start, end;    // [start,end)  扫描对象列表的开始和结尾
    hobject_t max_end;       // Largest end that may have been sent to replicas
    eversion_t subset_last_update;//扫描对象列表中最新的版本号

    // chunky scrub state
    enum State {
      INACTIVE,
      NEW_CHUNK,
      WAIT_PUSHES,
      WAIT_LAST_UPDATE,
      BUILD_MAP,
      BUILD_MAP_DONE,
      WAIT_REPLICAS,
      COMPARE_MAPS,
      WAIT_DIGEST_UPDATES,
      FINISH,
      BUILD_MAP_REPLICA,
    } state;

    std::unique_ptr<Scrub::Store> store;
    // deep scrub
    bool deep;// 是否为深度扫描
    int preempt_left;
    int preempt_divisor;
  } scrubber

2.1.2 ScrubMap

Stores the objects to be verified and their verification information.

/*
 * summarize pg contents for purposes of a scrub
 */
struct ScrubMap {// 保存准备校验的对象以及相应的校验信息
  struct object {
    map<string,bufferptr> attrs;//对象的属性
    uint64_t size;//对象大小
    __u32 omap_digest;         ///< omap crc32c
    __u32 digest;              ///< data crc32c
    bool negative:1;
    bool digest_present:1;//是否校验了数据的校验码标志
    bool omap_digest_present:1;// 是否有omap的校验码标志
    bool read_error:1;//读对象数据出错标志
    bool stat_error:1;//调用stat获取对象元数据出错标志
    bool ec_hash_mismatch:1;
    bool ec_size_mismatch:1;
    bool large_omap_object_found:1;
    uint64_t large_omap_object_key_count = 0;
    uint64_t large_omap_object_value_size = 0;

    object() :
      // Init invalid size so it won't match if we get a stat EIO error
      size(-1), omap_digest(0), digest(0),
      negative(false), digest_present(false), omap_digest_present(false),
      read_error(false), stat_error(false), ec_hash_mismatch(false),
      ec_size_mismatch(false), large_omap_object_found(false) {}

    void encode(bufferlist& bl) const;
    void decode(bufferlist::iterator& bl);
    void dump(Formatter *f) const;
    static void generate_test_instances(list<object*>& o);
  };
  WRITE_CLASS_ENCODER(object)

  map<hobject_t,object> objects;// 需要校验的对象 -> 对象的校验信息映射
  eversion_t valid_through;
  eversion_t incr_since;
  bool has_large_omap_object_errors:1;
  boost::optional<bool> has_builtin_csum;
}
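
To make the later comparison step concrete, here is a minimal sketch of what a per-object comparison amounts to: a regular scrub compares size and attributes, a deep scrub additionally compares the data and omap digests. The function objects_consistent is made up for illustration; the real logic lives in PGBackend::be_compare_scrubmaps / be_compare_scrub_objects (section 2.5) and records detailed error flags instead of returning a bool.

// Minimal sketch: compare one object's ScrubMap entries from two replicas.
bool objects_consistent(const ScrubMap::object& a,
                        const ScrubMap::object& b,
                        bool deep)
{
  if (a.stat_error || b.stat_error || a.read_error || b.read_error)
    return false;                              // could not even read one copy
  if (a.size != b.size)
    return false;                              // size mismatch (shallow error)
  // (the real code also compares the attrs map entry by entry)
  if (deep) {
    if (a.digest_present && b.digest_present && a.digest != b.digest)
      return false;                            // data crc32c mismatch
    if (a.omap_digest_present && b.omap_digest_present &&
        a.omap_digest != b.omap_digest)
      return false;                            // omap crc32c mismatch
  }
  return true;
}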

2.2 Scrub control flow

The scrub task is executed from the OSD's op work queue (op_shardedwq), which eventually calls pg->scrub(handle).
PG::scrub in turn calls PG::chunky_scrub(), which drives the state transitions and the core scrub processing.

 //PG::requeue_scrub() -> queue_for_scrub() ->PGQueueable::RunVis::operator() ->pg->scrub(op.epoch_queued, handle)
void PG::scrub(epoch_t queued, ThreadPool::TPHandle &handle)
{ 
......
  assert(scrub_queued);
  scrub_queued = false;
  scrubber.needs_sleep = true;

  // for the replica
  if (!is_primary() &&
      scrubber.state == PG::Scrubber::BUILD_MAP_REPLICA) {
    chunky_scrub(handle);// 副本处理state:BUILD_MAP_REPLICA
    return;
  }
//如果本osd 非主,非active,非clean,非scrubbing,满足之一,就不进行scrub
  if (!is_primary() || !is_active() || !is_clean() || !is_scrubbing()) {
    dout(10) << "scrub -- not primary or active or not clean" << dendl;
    state_clear(PG_STATE_SCRUBBING);
    state_clear(PG_STATE_REPAIR);
    state_clear(PG_STATE_DEEP_SCRUB);
    publish_stats_to_osd();
    return;
  }

  if (!scrubber.active) {
    assert(backfill_targets.empty());

    scrubber.deep = state_test(PG_STATE_DEEP_SCRUB);

    dout(10) << "starting a new chunky scrub" << dendl;
  }

  chunky_scrub(handle);// 开始处理scrub主流程
}

2.2.1 chunky_scrub()

  • 1. If a scrub is already active but epoch_start != same_interval_since, abort immediately.
  • 2. The initial Scrubber state is PG::Scrubber::INACTIVE. Handling:
    • set scrubber.epoch_start to info.history.same_interval_since;
    • set scrubber.active to true;
    • set scrubber.state to PG::Scrubber::NEW_CHUNK.
  • 3. PG::Scrubber::NEW_CHUNK is handled as follows:
    • call objects_list_partial to scan a batch of objects starting at start; the batch size lies between min and max, which are derived from osd_scrub_chunk_min (5) and osd_scrub_chunk_max (25);
    • compute the chunk boundary: an object and its related objects (snapshot objects, rollback objects) share the same hash, so scan backwards from the end of the list for the first object with a different hash and cut the chunk there. This keeps an object and all of its related objects inside one scrub pass;
    • _range_available_for_scrub checks the object range; if any object in it is blocked, set done to true and end this PG scrub pass;
    • record the newest version in the pg log that falls inside the chunk as scrubber.subset_last_update;
    • call _request_scrub_map to ask every other replica for the verification info of its scrubmap;
    • set the state to PG::Scrubber::WAIT_PUSHES.
  • 4. PG::Scrubber::WAIT_PUSHES:
    • if active_pushes is 0, move straight on to PG::Scrubber::WAIT_LAST_UPDATE; if it is not 0 the PG is still recovering, so set done to true and finish.
  • 5. PG::Scrubber::WAIT_LAST_UPDATE is handled as follows:
    • if last_update_applied < scrubber.subset_last_update, some operations have been written to the log but not yet applied to the objects; since the scrub will read those objects, it has to wait for the log to be applied, so this pass ends here;
    • otherwise set the state to PG::Scrubber::BUILD_MAP.
  • 6. PG::Scrubber::BUILD_MAP is handled as follows:
    • call build_scrub_map_chunk to really build the primary's scrubmap: the verification info of every object in [start, end) is collected and stored in the ScrubMap structure;
    • set the state to PG::Scrubber::BUILD_MAP_DONE.
  • 7. PG::Scrubber::BUILD_MAP_DONE:
    • set scrubber.state = PG::Scrubber::WAIT_REPLICAS.
  • 8. PG::Scrubber::WAIT_REPLICAS:
    • if waiting_on_whom is not empty, some OSDs have not finished building their scrubmap yet, so this pass ends here;
    • otherwise set the state to PG::Scrubber::COMPARE_MAPS.
  • 9. PG::Scrubber::COMPARE_MAPS:
    • call scrub_compare_maps to compare the replicas' verification info;
    • advance scrubber.start = scrubber.end;
    • requeue_ops requeues the ops that were blocked by the scrub (PrimaryLogPG::do_op() -> waiting_for_scrub.push_back(op); see the write-blocking sketch after the chunky_scrub code);
    • set the state to PG::Scrubber::WAIT_DIGEST_UPDATES.
  • 10. PG::Scrubber::WAIT_DIGEST_UPDATES is handled as follows:
    • if scrubber.num_digest_updates_pending is non-zero, wait for the pending data/omap digest updates;
    • if end is not max, this PG still has objects left to scrub, so set the state to NEW_CHUNK and requeue_scrub to continue;
    • otherwise set the state to PG::Scrubber::FINISH.
  • 11. PG::Scrubber::FINISH:
    • set the state to PG::Scrubber::INACTIVE and done = true; the scrub is complete.
  • 12. PG::Scrubber::BUILD_MAP_REPLICA:
    • the requests sent to the other OSDs in step 3 are handled here on the replicas: a replica calls build_scrub_map_chunk to build its scrubmap and then osd->send_message_osd_cluster(reply) to send it back to the primary OSD.
/*
 * Chunky scrub scrubs objects one chunk at a time with writes blocked for that
 * chunk.
 *
 * The object store is partitioned into chunks which end on hash boundaries. For
 * each chunk, the following logic is performed:
 *
 *  (1) Block writes on the chunk
 *  (2) Request maps from replicas
 *  (3) Wait for pushes to be applied (after recovery)
 *  (4) Wait for writes to flush on the chunk
 *  (5) Wait for maps from replicas
 *  (6) Compare / repair all scrub maps
 *  (7) Wait for digest updates to apply
 *
 * This logic is encoded in the mostly linear state machine:
 *
 *           +------------------+
 *  _________v__________        |
 * |                    |       |
 * |      INACTIVE      |       |
 * |____________________|       |
 *           |                  |
 *           |   +----------+   |
 *  _________v___v______    |   |
 * |                    |   |   |
 * |      NEW_CHUNK     |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |     WAIT_PUSHES    |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |  WAIT_LAST_UPDATE  |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |      BUILD_MAP     |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |    WAIT_REPLICAS   |   |   |
 * |____________________|   |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |    COMPARE_MAPS    |   |   |
 * |____________________|   |   |
 *           |              |   |
 *           |              |   |
 *  _________v__________    |   |
 * |                    |   |   |
 * |WAIT_DIGEST_UPDATES |   |   |
 * |____________________|   |   |
 *           |   |          |   |
 *           |   +----------+   |
 *  _________v__________        |
 * |                    |       |
 * |       FINISH       |       |
 * |____________________|       |
 *           |                  |
 *           +------------------+
 */

void PG::chunky_scrub(ThreadPool::TPHandle &handle)
{
  // check for map changes
  if (scrubber.is_chunky_scrub_active()) {// return state != INACTIVE; 开始了scrub
    if (scrubber.epoch_start != info.history.same_interval_since) {//开始了scrub,但是epoch_start != same_interval_since,直接退出
      dout(10) << "scrub  pg changed, aborting" << dendl;
      scrub_clear_state();
      scrub_unreserve_replicas();
      return;
    }
  }

  bool done = false;
  int ret;
// 这里如果是deepscrub,假如一个对象4169728,那么4169728/524288=7.9 ,需要8次才能把一个对象操作完,然后才会执行下一个object,每次都会从这里开始。 
  while (!done) {
    dout(20) << "scrub state " << Scrubber::state_string(scrubber.state)
         << " [" << scrubber.start << "," << scrubber.end << ")"
         << " max_end " << scrubber.max_end << dendl;

    switch (scrubber.state) {
      case PG::Scrubber::INACTIVE:
        dout(10) << "scrub start" << dendl;
    assert(is_primary());

        publish_stats_to_osd();
        scrubber.epoch_start = info.history.same_interval_since;
        scrubber.active = true;

    osd->inc_scrubs_active(scrubber.reserved);
    if (scrubber.reserved) {
      scrubber.reserved = false;
      scrubber.reserved_peers.clear();
    }

    {
      ObjectStore::Transaction t;
      scrubber.cleanup_store(&t);
      scrubber.store.reset(Scrub::Store::create(osd->store, &t,
                            info.pgid, coll));
      osd->store->queue_transaction(osr.get(), std::move(t), nullptr);
    }

        // Don't include temporary objects when scrubbing
        scrubber.start = info.pgid.pgid.get_hobj_start();
        scrubber.state = PG::Scrubber::NEW_CHUNK;

    {
      bool repair = state_test(PG_STATE_REPAIR);
      bool deep_scrub = state_test(PG_STATE_DEEP_SCRUB);
      const char *mode = (repair ? "repair": (deep_scrub ? "deep-scrub" : "scrub"));
      stringstream oss;
      oss << info.pgid.pgid << " " << mode << " starts" << std::endl;
      osd->clog->debug(oss);
    }

    scrubber.preempt_left = cct->_conf->get_val<uint64_t>(
      "osd_scrub_max_preemptions");
    scrubber.preempt_divisor = 1;
        break;

      case PG::Scrubber::NEW_CHUNK:
        scrubber.primary_scrubmap = ScrubMap();
        scrubber.received_maps.clear();

    // begin (possible) preemption window
    if (scrub_preempted) {
      scrubber.preempt_left--;
      scrubber.preempt_divisor *= 2;
      dout(10) << __func__ << " preempted, " << scrubber.preempt_left
           << " left" << dendl;
      scrub_preempted = false;
    }
    scrub_can_preempt = scrubber.preempt_left > 0;
    {
          int min = std::max<int64_t>(3, cct->_conf->osd_scrub_chunk_min /
                      scrubber.preempt_divisor);
      int max = std::max<int64_t>(min, cct->_conf->osd_scrub_chunk_max /
                                      scrubber.preempt_divisor);
          hobject_t start = scrubber.start;
      hobject_t candidate_end;
      vector<hobject_t> objects;
      osr->flush();
      ret = get_pgbackend()->objects_list_partial(
        start,
        min,
        max,
        &objects,
        &candidate_end);
      assert(ret >= 0);

      if (!objects.empty()) {
        hobject_t back = objects.back();
        while (candidate_end.has_snapset() &&
              candidate_end.get_head() == back.get_head()) {
          candidate_end = back;
          objects.pop_back();
          if (objects.empty()) {
        assert(0 ==
               "Somehow we got more than 2 objects which"
               "have the same head but are not clones");
          }
          back = objects.back();
        }
        if (candidate_end.has_snapset()) {
          assert(candidate_end.get_head() != back.get_head());
          candidate_end = candidate_end.get_object_boundary();
        }
      } else {
        assert(candidate_end.is_max());
      }

      if (!_range_available_for_scrub(scrubber.start, candidate_end)) {//检查列表中的对象,如果有被阻塞的对象,就退出
        // we'll be requeued by whatever made us unavailable for scrub
        dout(10) << __func__ << ": scrub blocked somewhere in range "
             << "[" << scrubber.start << ", " << candidate_end << ")"
             << dendl;
        done = true;
        break;
      }
      scrubber.end = candidate_end;
      if (scrubber.end > scrubber.max_end)
        scrubber.max_end = scrubber.end;
        }

        // walk the log to find the latest update that affects our chunk
        scrubber.subset_last_update = eversion_t();
    for (auto p = projected_log.log.rbegin();
         p != projected_log.log.rend();
         ++p) {
          if (p->soid >= scrubber.start &&
          p->soid < scrubber.end) {
            scrubber.subset_last_update = p->version;
            break;
      }
    }
    if (scrubber.subset_last_update == eversion_t()) {
      for (list<pg_log_entry_t>::const_reverse_iterator p =
         pg_log.get_log().log.rbegin();
           p != pg_log.get_log().log.rend();
           ++p) {
        if (p->soid >= scrubber.start &&
        p->soid < scrubber.end) {
          scrubber.subset_last_update = p->version;
          break;
        }
      }
    }
        // ask replicas to wait until
        // last_update_applied >= scrubber.subset_last_update and then scan
        scrubber.waiting_on_whom.insert(pg_whoami);

        // request maps from replicas
    for (set<pg_shard_t>::iterator i = actingbackfill.begin();
         i != actingbackfill.end();// 向所有副本发送消息,获取相应的scrubmap的校验信息
         ++i) {
      if (*i == pg_whoami) continue;
          _request_scrub_map(*i, scrubber.subset_last_update,
                             scrubber.start, scrubber.end, scrubber.deep,
                 scrubber.preempt_left > 0);
          scrubber.waiting_on_whom.insert(*i);//把所有的等待的副本加入到waiting_on_whom set中
        }
    dout(10) << __func__ << " waiting_on_whom " << scrubber.waiting_on_whom
         << dendl;

        scrubber.state = PG::Scrubber::WAIT_PUSHES;
        break;

      case PG::Scrubber::WAIT_PUSHES:
        if (active_pushes == 0) {
          scrubber.state = PG::Scrubber::WAIT_LAST_UPDATE;
        } else {
          dout(15) << "wait for pushes to apply" << dendl;
          done = true;
        }
        break;

      case PG::Scrubber::WAIT_LAST_UPDATE:
        if (last_update_applied < scrubber.subset_last_update) {//虽然已经把操作写入日志,但是还没有应用到对象,由于后边scrub有对象的读操作,所以需要等待日志应用完成。这里有对象没有真正写完,所以结束本地scrub。
          // will be requeued by op_applied
          dout(15) << "wait for writes to flush" << dendl;
          done = true;
      break;
    }

    scrubber.state = PG::Scrubber::BUILD_MAP;
    scrubber.primary_scrubmap_pos.reset();
        break;

      case PG::Scrubber::BUILD_MAP:
        assert(last_update_applied >= scrubber.subset_last_update);

        // build my own scrub map
    if (scrub_preempted) {
      dout(10) << __func__ << " preempted" << dendl;
      scrubber.state = PG::Scrubber::BUILD_MAP_DONE;
      break;
    }
    ret = build_scrub_map_chunk(
      scrubber.primary_scrubmap,
      scrubber.primary_scrubmap_pos,
      scrubber.start, scrubber.end,
      scrubber.deep,
      handle);// 此处开始真正构建主osd的scubmap。构建start到end所有对象校验信息,并保存在scrubmap结构中
    if (ret == -EINPROGRESS) {
      requeue_scrub();
      done = true;
      break;
    }
    scrubber.state = PG::Scrubber::BUILD_MAP_DONE;
    break;

      case PG::Scrubber::BUILD_MAP_DONE:
    if (scrubber.primary_scrubmap_pos.ret < 0) {
      dout(5) << "error: " << scrubber.primary_scrubmap_pos.ret
          << ", aborting" << dendl;
          scrub_clear_state();
          scrub_unreserve_replicas();
          return;
        }
    dout(10) << __func__ << " waiting_on_whom was "
         << scrubber.waiting_on_whom << dendl;
    assert(scrubber.waiting_on_whom.count(pg_whoami));
        scrubber.waiting_on_whom.erase(pg_whoami);

        scrubber.state = PG::Scrubber::WAIT_REPLICAS;
        break;

      case PG::Scrubber::WAIT_REPLICAS:
        if (!scrubber.waiting_on_whom.empty()) {//如果waiting_on_whom不为空,说明有部分osd没有完成scrubmap构建,结束本次scrub请求
          // will be requeued by sub_op_scrub_map
          dout(10) << "wait for replicas to build scrub map" << dendl;
          done = true;
      break;
    }
    // end (possible) preemption window
    scrub_can_preempt = false;
    if (scrub_preempted) {
      dout(10) << __func__ << " preempted, restarting chunk" << dendl;
      scrubber.state = PG::Scrubber::NEW_CHUNK;
    } else {
          scrubber.state = PG::Scrubber::COMPARE_MAPS;
        }
        break;

      case PG::Scrubber::COMPARE_MAPS:
        assert(last_update_applied >= scrubber.subset_last_update);
        assert(scrubber.waiting_on_whom.empty());

        scrub_compare_maps();//比较各副本的校验信息
    scrubber.start = scrubber.end;
    scrubber.run_callbacks();

        // requeue the writes from the chunk that just finished
        requeue_ops(waiting_for_scrub);//把scrub引起的阻塞的op重新加入队列中执行 PrimaryLogPG::do_op()-> waiting_for_scrub.push_back(op)

    scrubber.state = PG::Scrubber::WAIT_DIGEST_UPDATES;

    // fall-thru

      case PG::Scrubber::WAIT_DIGEST_UPDATES:
    if (scrubber.num_digest_updates_pending) {// 等待更新数据的digest或者omap的digest
      dout(10) << __func__ << " waiting on "
           << scrubber.num_digest_updates_pending
           << " digest updates" << dendl;
      done = true;
      break;
    }

    scrubber.preempt_left = cct->_conf->get_val<uint64_t>(
      "osd_scrub_max_preemptions");
    scrubber.preempt_divisor = 1;

    if (!(scrubber.end.is_max())) {// 本PG还有没有完成scrub操作的对象,设置状态NEW_CHUNK,继续加入requeue_scrub,进行处理
      scrubber.state = PG::Scrubber::NEW_CHUNK;
      requeue_scrub();
          done = true;
        } else {
          scrubber.state = PG::Scrubber::FINISH;
        }

    break;

      case PG::Scrubber::FINISH:
        scrub_finish();
        scrubber.state = PG::Scrubber::INACTIVE;
        done = true;

    if (!snap_trimq.empty()) {
      dout(10) << "scrub finished, requeuing snap_trimmer" << dendl;
      snap_trimmer_scrub_complete();
    }

        break;

      case PG::Scrubber::BUILD_MAP_REPLICA:
        // build my own scrub map
    if (scrub_preempted) {
      dout(10) << __func__ << " preempted" << dendl;
      ret = 0;
    } else {
      ret = build_scrub_map_chunk(
        scrubber.replica_scrubmap,
        scrubber.replica_scrubmap_pos,
        scrubber.start, scrubber.end,
        scrubber.deep,
        handle);
    }
    if (ret == -EINPROGRESS) {
      requeue_scrub();
      done = true;
      break;
    }
    // reply
    if (HAVE_FEATURE(acting_features, SERVER_LUMINOUS)) {//L版本
      MOSDRepScrubMap *reply = new MOSDRepScrubMap(
        spg_t(info.pgid.pgid, get_primary().shard),
        scrubber.replica_scrub_start,
        pg_whoami);
      reply->preempted = scrub_preempted;
      ::encode(scrubber.replica_scrubmap, reply->get_data());
      osd->send_message_osd_cluster(
        get_primary().osd, reply,
        scrubber.replica_scrub_start);
    } else {//J版本
      // for jewel compatibility
      vector<OSDOp> scrub(1);
      scrub[0].op.op = CEPH_OSD_OP_SCRUB_MAP;
      hobject_t poid;
      eversion_t v;
      osd_reqid_t reqid;
      MOSDSubOp *subop = new MOSDSubOp(
        reqid,
        pg_whoami,
        spg_t(info.pgid.pgid, get_primary().shard),
        poid,
        0,
        scrubber.replica_scrub_start,
        osd->get_tid(),
        v);
      ::encode(scrubber.replica_scrubmap, subop->get_data());
      subop->ops = scrub;
      osd->send_message_osd_cluster(
        get_primary().osd, subop,
        scrubber.replica_scrub_start);
    }
    scrub_preempted = false;
    scrub_can_preempt = false;
    scrubber.state = PG::Scrubber::INACTIVE;
    scrubber.replica_scrubmap = ScrubMap();
    scrubber.replica_scrubmap_pos = ScrubMapBuilder();
    scrubber.start = hobject_t();
    scrubber.end = hobject_t();
    scrubber.max_end = hobject_t();
    done = true;
    break;

      default:
        ceph_abort();
    }
  }
  dout(20) << "scrub final state " << Scrubber::state_string(scrubber.state)
       << " [" << scrubber.start << "," << scrubber.end << ")"
       << " max_end " << scrubber.max_end << dendl;
}   
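
The write blocking mentioned in the introduction (and the preemption handled in NEW_CHUNK above) is implemented by checking each incoming write against the current chunk range. A sketch of PG::write_blocked_by_scrub(), close to the Luminous code: a write that falls inside [scrubber.start, scrubber.end) is blocked, unless the scrub can still be preempted, in which case the write goes through and the scrub gives up the chunk instead.

bool PG::write_blocked_by_scrub(const hobject_t& soid)
{
  if (soid < scrubber.start || soid >= scrubber.end) {
    return false;                      // outside the chunk: not blocked
  }
  if (scrub_can_preempt) {
    if (!scrub_preempted) {
      dout(10) << __func__ << " " << soid << " preempted" << dendl;
      scrub_preempted = true;          // the scrub will restart this chunk
    } else {
      dout(10) << __func__ << " " << soid << " already preempted" << dendl;
    }
    return false;                      // let the write through
  }
  return true;                         // block the write until the chunk is done
}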

2.3 Building the ScrubMap

2.3.1 build_scrub_map_chunk

Builds the verification info for every object in [start, end) and stores it in a ScrubMap. In chunky_scrub both the primary and the replica OSDs call this function to build their ScrubMap (a sketch of the ScrubMapBuilder cursor it uses is given after the code).

  • 1. Call get_pgbackend()->objects_list_range to list the objects in [start, end); ls holds the head and snap objects, rollback_obs holds the objects kept for rollback.
  • 2. Call get_pgbackend()->be_scan_list(map, pos) to scan the objects and fill in the ScrubMap.
  • 3. Once a whole [start, end) range has been built, PG::_repair_oinfo_oid is called: if the oid recorded in the object_info (oi, stored in o.attrs) does not match the hoid actually found by the scan, the soid inside oi is repaired using the scanned hoid as the authoritative value.
  • 4. Call _scan_snaps to repair the snap information kept in the SnapMapper.
//build the verification info for every object in [start, end) and store it in the ScrubMap
int PG::build_scrub_map_chunk(
  ScrubMap &map,
  ScrubMapBuilder &pos,
  hobject_t start,
  hobject_t end,
  bool deep,
  ThreadPool::TPHandle &handle)
{
  dout(10) << __func__ << " [" << start << "," << end << ") "
       << " pos " << pos
       << dendl;
  //build_scrub_map_chunk [2:60000000::::head,2:7341ac06:::rbd_data.6ed5b6b8b4567.00000000000000c5:0)  pos (2/24 2:61948a30:::rbd_data.6ed5b6b8b4567.000000000000006d:head deep)

  // start
  while (pos.empty()) {//初始状态pos.empty()为空
    pos.deep = deep;
    map.valid_through = info.last_update;
    osr->flush();

    // objects
    vector<ghobject_t> rollback_obs;
    pos.ret = get_pgbackend()->objects_list_range(
      start,
      end,
      0,
      &pos.ls,
      &rollback_obs);//列出范围内的对象, ls 用来存放head和snap对象,rollback_obs存放用来回滚的对象
    if (pos.ret < 0) {
      dout(5) << "objects_list_range error: " << pos.ret << dendl;
      return pos.ret;
    }
    if (pos.ls.empty()) {
      break;
    }
    _scan_rollback_obs(rollback_obs, handle);
    pos.pos = 0;
    return -EINPROGRESS;
  }

  // scan objects
  while (!pos.done()) {
    int r = get_pgbackend()->be_scan_list(map, pos);//扫描对象,构建scrubmap结构
    if (r == -EINPROGRESS) {
      return r;
    }
  }

  // finish
  dout(20) << __func__ << " finishing" << dendl;
  assert(pos.done());
  _repair_oinfo_oid(map);
  if (!is_primary()) {
    ScrubMap for_meta_scrub;
    // In case we restarted smaller chunk, clear old data
    scrubber.cleaned_meta_map.clear_from(scrubber.start);
    scrubber.cleaned_meta_map.insert(map);
    scrubber.clean_meta_map(for_meta_scrub);
    _scan_snaps(for_meta_scrub);
  }

  dout(20) << __func__ << " done, got " << map.objects.size() << " items"
       << dendl;
  return 0;
}
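
build_scrub_map_chunk() keeps its progress in a ScrubMapBuilder (the pos argument), which is how it can return -EINPROGRESS, be requeued, and later resume where it left off. Its core fields look roughly like this (abridged sketch of the struct defined in osd_types.h, based on the Luminous headers):

// Abridged sketch of ScrubMapBuilder: the resumable cursor used while building
// one ScrubMap chunk.
struct ScrubMapBuilder {
  bool deep = false;
  vector<hobject_t> ls;        // objects of the chunk, filled by objects_list_range
  size_t pos = 0;              // index of the object currently being scanned
  int64_t data_pos = 0;        // byte offset into the current object's data
  string omap_pos;             // omap key to resume from
  int ret = 0;

  bufferhash data_hash, omap_hash;   // accumulating crc32c digests
  uint64_t omap_keys = 0;
  uint64_t omap_bytes = 0;

  bool empty() { return ls.empty(); }
  bool done() { return pos >= ls.size(); }
  bool data_done() { return data_pos < 0; }

  void next_object() {         // advance to the next object, reset per-object state
    ++pos;
    data_pos = 0;
    omap_pos.clear();
    omap_keys = 0;
    omap_bytes = 0;
  }
};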

2.3.2 PGBackend::be_scan_list

Builds the verification info of one object in the ScrubMap.

  • 1. Call store->stat to get the object's stat information from the object store into st (with BlueStore this ends up in BlueStore::stat()).
  • 2. Set o.size to st.st_size and call store->getattrs to store the object's attributes in o.attrs.
  • 3. If deep is true, call be_deep_scrub to obtain the object's omap and data digests.
  • Note: when building the scrubmap, a regular scrub collects and compares only size and attrs; a deep scrub additionally collects the omap and data digests.
  • 4. Call pos.next_object() and move on to the next object.

    int PGBackend::be_scan_list(
    ScrubMap &map,
    ScrubMapBuilder &pos)
    {
    dout(10) << __func__ << " " << pos << dendl;
    //be_scan_list (4/24 2:625a7837:::rbd_data.6ed5b6b8b4567.000000000000001c:head)
    assert(!pos.done());
    assert(pos.pos < pos.ls.size());
    hobject_t& poid = pos.ls[pos.pos];
    
    struct stat st;
    int r = store->stat(
    ch,
    ghobject_t(
      poid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
    &st,
    true);//BlueStore::stat
    if (r == 0) {
    ScrubMap::object &o = map.objects[poid];//这里是引用语法,构建scrubmap,如果是scrub,对比的只有size和attrs信息,如果是deep-scrub,对比的多了omap和data的信息;
    o.size = st.st_size;
    assert(!o.negative);
    store->getattrs(
      ch,
      ghobject_t(
    poid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      o.attrs);
    
    dout(10) << __func__ << " wds-1: o.size:" << o.size << ", o.attrs:" << o.attrs << dendl;
    
    if (pos.deep) {
      r = be_deep_scrub(poid, map, pos, o);// 这里对比omap和data的信息
    }
    dout(25) << __func__ << "  " << poid << dendl;
    dout(10) << __func__ << " wds-3: o.size:" << o.size << ", o.attrs:" << o.attrs << dendl;
    } else if (r == -ENOENT) {
    dout(25) << __func__ << "  " << poid << " got " << r
         << ", skipping" << dendl;
    } else if (r == -EIO) {
    dout(25) << __func__ << "  " << poid << " got " << r
         << ", stat_error" << dendl;
    ScrubMap::object &o = map.objects[poid];
    o.stat_error = true;
    } else {
    derr << __func__ << " got: " << cpp_strerror(r) << dendl;
    ceph_abort();
    }
    if (r == -EINPROGRESS) {//-EINPROGRESS 表示该对象还没有扫描完(例如 deep-scrub 只读了一部分数据),直接返回到上层,不会继续执行pos.next_object()
    return -EINPROGRESS;
    }
    pos.next_object();// 一个object彻底读完了,才开始下一个object
    return 0;
    }

2.3.3 ReplicatedBackend::be_deep_scrub

  • 1. Call store->read in a loop to read the object's data; each read is cct->_conf->osd_deep_scrub_stride bytes long (512KB by default).
  • 2. The data hash is accumulated object by object, 512KB at a time. For example, an object of 4169728 bytes takes 4169728/524288 ≈ 7.95, i.e. 8 reads: the log shows several 'more data, digest so far' lines and then 'done with data, digest 0x…', after which the next object is processed.
  • 3. Compute the crc of the data.
  • 4. Compute the crc of the omap header.
  • 5. Compute the crc of the omap.
int ReplicatedBackend::be_deep_scrub(
  const hobject_t &poid,
  ScrubMap &map,
  ScrubMapBuilder &pos,
  ScrubMap::object &o)
{
  dout(10) << __func__ << " " << poid << " pos " << pos << dendl;
  int r;
  uint32_t fadvise_flags = CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
                           CEPH_OSD_OP_FLAG_FADVISE_DONTNEED |
                           CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE;

  utime_t sleeptime;
  sleeptime.set_from_double(cct->_conf->osd_debug_deep_scrub_sleep);//osd_debug_deep_scrub_sleep=0
  if (sleeptime != utime_t()) {
    lgeneric_derr(cct) << __func__ << " sleeping for " << sleeptime << dendl;
    sleeptime.sleep();
  }

  assert(poid == pos.ls[pos.pos]);
  //这里是逐个对象进行data_hash值的,每次读取对象的512K字节,假如一个对象4169728,那么4169728/524288=7.9 ,需要8次 ‘more data, digest so far’ 之后才会出现‘done with data, digest 0x’,接着开始下一个object。
  if (!pos.data_done()) {
    if (pos.data_pos == 0) {
      pos.data_hash = bufferhash(-1);
    }

    bufferlist bl;
    r = store->read(
      ch,
      ghobject_t(
    poid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      pos.data_pos,
      cct->_conf->osd_deep_scrub_stride/*512k*/, bl,
      fadvise_flags);
    if (r < 0) {
      dout(20) << __func__ << "  " << poid << " got "
           << r << " on read, read_error" << dendl;
      o.read_error = true;
      return 0;
    }
    if (r > 0) {
      pos.data_hash << bl;
    }
    pos.data_pos += r;
    if (r == cct->_conf->osd_deep_scrub_stride) {
      dout(20) << __func__ << "  " << poid << " more data, digest so far 0x"
           << std::hex << pos.data_hash.digest() << std::dec << dendl;
      return -EINPROGRESS;
    }
    // done with bytes
    pos.data_pos = -1;
    o.digest = pos.data_hash.digest();//计算data的crc值
    o.digest_present = true;
    dout(20) << __func__ << "  " << poid << " done with data, digest 0x"
         << std::hex << o.digest << std::dec << dendl;
  }

  // omap header
  if (pos.omap_pos.empty()) {
    pos.omap_hash = bufferhash(-1);

    bufferlist hdrbl;
    r = store->omap_get_header(
      coll,
      ghobject_t(
    poid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard),
      &hdrbl, true);
    if (r == -EIO) {
      dout(20) << __func__ << "  " << poid << " got "
           << r << " on omap header read, read_error" << dendl;
      o.read_error = true;
      return 0;
    }
    if (r == 0 && hdrbl.length()) {
      dout(25) << "CRC header " << string(hdrbl.c_str(), hdrbl.length())
           << dendl;
      pos.omap_hash << hdrbl;//计算omap header的crc值
    }
  }

  // omap
  ObjectMap::ObjectMapIterator iter = store->get_omap_iterator(
    coll,
    ghobject_t(
      poid, ghobject_t::NO_GEN, get_parent()->whoami_shard().shard));
  assert(iter);
  if (pos.omap_pos.length()) {
    iter->lower_bound(pos.omap_pos);
  } else {
    iter->seek_to_first();
  }
  int max = g_conf->osd_deep_scrub_keys;//1024
  while (iter->status() == 0 && iter->valid()) {
    pos.omap_bytes += iter->value().length();
    ++pos.omap_keys;
    --max;
    // fixme: we can do this more efficiently.
    bufferlist bl;
    ::encode(iter->key(), bl);
    ::encode(iter->value(), bl);
    pos.omap_hash << bl;

    iter->next();

    if (iter->valid() && max == 0) {
      pos.omap_pos = iter->key();
      return -EINPROGRESS;
    }
    if (iter->status() < 0) {
      dout(25) << __func__ << "  " << poid
           << " on omap scan, db status error" << dendl;
      o.read_error = true;
      return 0;
    }
  }

  if (pos.omap_keys > cct->_conf->
    osd_deep_scrub_large_omap_object_key_threshold/*2000000*/ ||
      pos.omap_bytes > cct->_conf->
    osd_deep_scrub_large_omap_object_value_sum_threshold/*1G*/) {
    dout(25) << __func__ << " " << poid
         << " large omap object detected. Object has " << pos.omap_keys
         << " keys and size " << pos.omap_bytes << " bytes" << dendl;
    o.large_omap_object_found = true;
    o.large_omap_object_key_count = pos.omap_keys;
    o.large_omap_object_value_size = pos.omap_bytes;
    map.has_large_omap_object_errors = true;
  }

  o.omap_digest = pos.omap_hash.digest();//计算omap的crc值
  o.omap_digest_present = true;
  dout(20) << __func__ << " done with " << poid << " omap_digest "
       << std::hex << o.omap_digest << std::dec << dendl;

  // done!
  return 0;
}
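
The resumable chunking above (data_pos, the -EINPROGRESS return and the per-stride "more data" logs) is easier to see in isolation. Below is a minimal, self-contained sketch, not Ceph code: it hashes an object in fixed-size strides with a toy CRC32 and keeps reporting "more work" until the final short read, mirroring osd_deep_scrub_stride and the 4169728-byte object from the comment above. All names here are invented for illustration.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Toy incremental CRC32 (reflected, polynomial 0xEDB88320), bit by bit.
// Ceph's bufferhash uses an optimized crc32c; this is only for illustration.
static uint32_t crc32_update(uint32_t crc, const uint8_t* data, size_t len) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

struct ScanPos {            // plays the role of the scrubber's data_pos / data_hash
  size_t data_pos = 0;
  uint32_t data_hash = 0;
  bool data_done = false;
};

// One scrub step: hash at most one stride and report whether more data remains,
// the analogue of be_deep_scrub() returning -EINPROGRESS.
bool scrub_data_step(const std::vector<uint8_t>& obj, size_t stride, ScanPos& pos) {
  size_t n = std::min(stride, obj.size() - pos.data_pos);
  pos.data_hash = crc32_update(pos.data_hash, obj.data() + pos.data_pos, n);
  pos.data_pos += n;
  if (n == stride && pos.data_pos < obj.size())
    return true;            // "more data, digest so far ..."
  pos.data_done = true;     // "done with data, digest 0x..."
  return false;
}

int main() {
  std::vector<uint8_t> obj(4169728, 0xab);   // the 4169728-byte object from the log example
  const size_t stride = 524288;              // osd_deep_scrub_stride: 512 KiB
  ScanPos pos;
  int reads = 1;
  while (scrub_data_step(obj, stride, pos))
    ++reads;
  std::cout << "reads=" << reads << " digest=0x" << std::hex << pos.data_hash << "\n";
}

Run as-is it prints reads=8, i.e. ceil(4169728 / 524288) read calls before the digest becomes final.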

2.4 Replica-side processing: building the scrubmap

In 2.3 the primary's PG::chunky_scrub sends an MOSDRepScrub message (type MSG_OSD_REP_SCRUB) to the other replicas to obtain their scrubmaps.

When a replica receives the MOSDRepScrub message from the primary, PrimaryLogPG::do_request dispatches the MSG_OSD_REP_SCRUB type and starts collecting the scrubmap on the replica; this is done by PG::replica_scrub.

  • 1.Assert that no replica scrub operation is already in progress.
  • 2.Make sure the incoming scrub message is not stale.
  • 3.If the replica's applied log lags behind the version the primary wants to scrub to, wait until they catch up.
  • 4.If recovery is in progress, wait.
  • 5.Set the Scrubber::BUILD_MAP_REPLICA state.
  • 6.Call requeue_scrub to put the PG back on this OSD's queue: PG::scrub -> if (!is_primary() && scrubber.state == PG::Scrubber::BUILD_MAP_REPLICA) {chunky_scrub(handle)} -> case PG::Scrubber::BUILD_MAP_REPLICA, which then runs build_scrub_map_chunk on the replica. (A minimal sketch of the park-and-requeue pattern used in steps 3 and 4 follows this list.)
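
Steps 3 and 4 rely on parking the op in scrubber.active_rep_scrub and re-dispatching it once the precondition holds. A minimal sketch of that pattern, with invented names and not taken from Ceph, could look like this:

#include <iostream>
#include <optional>

struct RepScrubRequest { unsigned scrub_to; };

struct ReplicaState {
  unsigned last_update_applied = 0;
  int active_pushes = 0;
  std::optional<RepScrubRequest> parked;   // plays the role of scrubber.active_rep_scrub

  void handle(const RepScrubRequest& req) {
    if (last_update_applied < req.scrub_to || active_pushes > 0) {
      parked = req;                        // park it; later progress will requeue us
      std::cout << "parked request (scrub_to=" << req.scrub_to << ")\n";
      return;
    }
    std::cout << "building scrub map up to " << req.scrub_to << "\n";
  }

  // Called when log application or recovery pushes make progress,
  // the analogue of the code paths that end up calling requeue_scrub().
  void on_progress(unsigned applied, int pushes) {
    last_update_applied = applied;
    active_pushes = pushes;
    if (parked && last_update_applied >= parked->scrub_to && active_pushes == 0) {
      RepScrubRequest req = *parked;
      parked.reset();
      handle(req);
    }
  }
};

int main() {
  ReplicaState r;
  r.handle({5});        // log not applied far enough yet -> parked
  r.on_progress(5, 0);  // condition satisfied -> the parked request runs
}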
/* replica_scrub
 *
 * Wait for last_update_applied to match msg->scrub_to as above. Wait
 * for pushes to complete in case of recent recovery. Build a single
 * scrubmap of objects that are in the range [msg->start, msg->end).
 */
void PG::replica_scrub(
  OpRequestRef op,
  ThreadPool::TPHandle &handle)
{
  const MOSDRepScrub *msg = static_cast<const MOSDRepScrub *>(op->get_req());
  assert(!scrubber.active_rep_scrub);// assert no replica scrub op is already in flight
  dout(7) << "replica_scrub" << dendl;

  if (msg->map_epoch < info.history.same_interval_since) {// the incoming scrub request is stale
    dout(10) << "replica_scrub discarding old replica_scrub from "
         << msg->map_epoch << " < " << info.history.same_interval_since 
         << dendl;
    return;
  }

  assert(msg->chunky);
  if (last_update_applied < msg->scrub_to) {// the replica's applied log lags behind the version the primary scrubs to; wait until they are consistent
    dout(10) << "waiting for last_update_applied to catch up" << dendl;
    scrubber.active_rep_scrub = op;
    return;
  }

  if (active_pushes > 0) {// recovery pushes in progress; wait
    dout(10) << "waiting for active pushes to finish" << dendl;
    scrubber.active_rep_scrub = op;
    return;
  }

  scrubber.state = Scrubber::BUILD_MAP_REPLICA;
  scrubber.replica_scrub_start = msg->min_epoch;
  scrubber.start = msg->start;
  scrubber.end = msg->end;
  scrubber.max_end = msg->end;
  scrubber.deep = msg->deep;
  scrubber.epoch_start = info.history.same_interval_since;
  if (msg->priority) {
    scrubber.priority = msg->priority;
  } else {
    scrubber.priority = get_scrub_priority();
  }

  scrub_can_preempt = msg->allow_preemption;
  scrub_preempted = false;
  scrubber.replica_scrubmap_pos.reset();

  requeue_scrub(msg->high_priority);// requeue on this OSD: PG::scrub -> if (!is_primary() && scrubber.state == PG::Scrubber::BUILD_MAP_REPLICA) {chunky_scrub(handle)} -> case PG::Scrubber::BUILD_MAP_REPLICA, which then runs build_scrub_map_chunk on the replica
}

2.5 Comparing the replicas

Once the primary and the replicas have all built their check information and stored it in the corresponding ScrubMap structures, the next step is to compare that information across replicas to complete the consistency check.

First, an authoritative object is selected based on the information each copy carries about itself; then the other copies are checked against the authoritative one. The simplified sketch below illustrates the idea.
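
As a rough illustration (simplified, not Ceph code; the object names, digests and shard ids below are invented), the comparison boils down to building the union of object names across the replica maps, choosing an authoritative digest per object, and flagging shards that are missing the object or disagree:

#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <set>
#include <string>

using ShardId = int;
using ToyScrubMap = std::map<std::string, uint32_t>;   // object name -> data digest

int main() {
  // Invented example: osd.8 is the primary, osd.0 disagrees on obj2, osd.3 lost obj2.
  std::map<ShardId, ToyScrubMap> maps = {
    {8, {{"obj1", 0x1111u}, {"obj2", 0x2222u}}},
    {0, {{"obj1", 0x1111u}, {"obj2", 0xdeadu}}},
    {3, {{"obj1", 0x1111u}}},
  };

  std::set<std::string> master_set;          // union of the objects seen on any shard
  for (const auto& entry : maps)
    for (const auto& kv : entry.second)
      master_set.insert(kv.first);

  for (const auto& name : master_set) {
    // Use the primary's digest as the "auth" value when present; the real code
    // picks auth with be_select_auth_object() using object_info and dcount().
    std::optional<uint32_t> auth;
    auto pit = maps[8].find(name);
    if (pit != maps[8].end()) auth = pit->second;

    for (const auto& [shard, m] : maps) {
      auto it = m.find(name);
      if (it == m.end())
        std::cout << name << ": missing on shard " << shard << "\n";
      else if (auth && it->second != *auth)
        std::cout << name << ": inconsistent on shard " << shard << "\n";
    }
  }
}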

2.5.1 scrub_compare_maps

This function compares the data of the different replicas for consistency.

  • 1.Put the scrubmaps of the OSDs in actingbackfill into maps.
  • 2.Put the objects from all replicas into master_set.
  • 3.Call be_compare_scrubmaps to compare the copies of each object and record in authoritative the shards that hold an intact copy.
  • 4.Call _scan_snaps to check consistency between snap objects.
//compare whether the data of the different replicas is consistent
void PG::scrub_compare_maps() 
{
  dout(10) << __func__ << " has maps, analyzing" << dendl;

  // construct authoritative scrub map for type specific scrubbing
  scrubber.cleaned_meta_map.insert(scrubber.primary_scrubmap);
  map<hobject_t,
      pair<boost::optional<uint32_t>,
           boost::optional<uint32_t>>> missing_digest;

  map<pg_shard_t, ScrubMap *> maps;
  maps[pg_whoami] = &scrubber.primary_scrubmap;

  for (const auto& i : actingbackfill) {// e.g. acting set [8,0,3]; put the scrubmap of every OSD in actingbackfill into maps
    if (i == pg_whoami) continue;
    //scrub_compare_maps replica 0 has 24 items
    //scrub_compare_maps replica 3 has 24 items
    dout(2) << __func__ << " replica " << i << " has "
            << scrubber.received_maps[i].objects.size()
            << " items" << dendl;
    maps[i] = &scrubber.received_maps[i];
  }

  set<hobject_t> master_set;// union of the objects on all replicas

  // Construct master set
  for (const auto map : maps) {
    for (const auto i : map.second->objects) {// put every object from every replica into master_set
      master_set.insert(i.first);
    }
  }

  stringstream ss;
  get_pgbackend()->be_large_omap_check(maps, master_set,
                                       scrubber.large_omap_objects, ss);
  if (!ss.str().empty()) {
    osd->clog->warn(ss);
  }

  if (acting.size() > 1) {
    dout(10) << __func__ << "  comparing replica scrub maps" << dendl;

    // Map from object with errors to good peer
    map<hobject_t, list<pg_shard_t>> authoritative; // for each object, the shards holding an intact copy
    //scrub_compare_maps   osd.8 has 24 items
    dout(2) << __func__ << "   osd." << acting[0] << " has "
        << scrubber.primary_scrubmap.objects.size() << " items" << dendl;

    ss.str("");
    ss.clear();

    get_pgbackend()->be_compare_scrubmaps(// compare the copies of each object and store the shards with intact copies in authoritative
      maps,
      master_set,
      state_test(PG_STATE_REPAIR),
      scrubber.missing,
      scrubber.inconsistent,
      authoritative,
      missing_digest,
      scrubber.shallow_errors,
      scrubber.deep_errors,
      scrubber.store.get(),
      info.pgid, acting,
      ss);
    dout(2) << ss.str() << dendl;

    if (!ss.str().empty()) {
      osd->clog->error(ss);
    }

    for (map<hobject_t, list<pg_shard_t>>::iterator i = authoritative.begin();
     i != authoritative.end();
     ++i) {
      list<pair<ScrubMap::object, pg_shard_t> > good_peers;
      for (list<pg_shard_t>::const_iterator j = i->second.begin();
       j != i->second.end();
       ++j) {
    good_peers.push_back(make_pair(maps[*j]->objects[i->first], *j));//object -> osd
      }
      scrubber.authoritative.insert(
    make_pair(
      i->first,
      good_peers));// object -> (object -> osd)
    }

    for (map<hobject_t, list<pg_shard_t>>::iterator i = authoritative.begin();
     i != authoritative.end();
     ++i) {
      scrubber.cleaned_meta_map.objects.erase(i->first);
      scrubber.cleaned_meta_map.objects.insert(
    *(maps[i->second.back()]->objects.find(i->first))
    );
    }
  }

  ScrubMap for_meta_scrub;
  scrubber.clean_meta_map(for_meta_scrub);

  // ok, do the pg-type specific scrubbing
  scrub_snapshot_metadata(for_meta_scrub, missing_digest);
  // Called here on the primary can use an authoritative map if it isn't the primary
  _scan_snaps(for_meta_scrub);// check consistency between snap objects
  if (!scrubber.store->empty()) {
    if (state_test(PG_STATE_REPAIR)) {
      dout(10) << __func__ << ": discarding scrub results" << dendl;
      scrubber.store->flush(nullptr);
    } else {
      dout(10) << __func__ << ": updating scrub object" << dendl;
      ObjectStore::Transaction t;
      scrubber.store->flush(&t);
      osd->store->queue_transaction(osr.get(), std::move(t), nullptr);
    }
  }
}

2.5.2 be_compare_scrubmaps

This function compares the replicas for consistency.

  • 1.Iterate over every object in master_set and check each object's consistency in turn.
  • 2.Call be_select_auth_object to pick a shard holding an authoritative copy (auth); if no authoritative object can be selected, call set_auth_missing to mark it missing and bump deep_errors or shallow_errors as appropriate.
  • 3.Call be_compare_scrub_objects to compare each shard's copy against the auth object: data digest, omap digest and attrs.
  • 4.If the result is clean, the copy matches the authoritative object in every respect, so add the shard to auth_list.
  • 5.If the result is not clean, add the object to the cur_inconsistent list and count the problem under deep_errors or shallow_errors.
  • 6.If the object does not exist on some OSD j, add that OSD to the cur_missing list and bump shallow_errors.
  • 7.Examine the per-object result: if cur_missing is not empty, add it to the missing list; if cur_inconsistent is not empty, add it to the inconsistent list.
  • 8.If the object has incomplete copies, record the good shards in authoritative.
  • 9.If object_info lacks the data or omap digest, set update to MAYBE; if the data_digest/omap_digest recorded in auth_oi differs from the values actually computed for auth_object, set it to FORCE in repair mode and record the object in missing_digest to force recovery (see the sketch after this list).
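
Step 9's decision logic is subtle, so here is a stripped-down sketch of it (not Ceph code, data digest only; the 2-hour threshold stands in for osd_deep_scrub_update_digest_min_age):

#include <cstdint>
#include <iostream>
#include <optional>

enum class Update { NO, MAYBE, FORCE };

struct AuthInfo {                          // what object_info_t records
  std::optional<uint32_t> data_digest;
  double age_seconds = 0;                  // now - local_mtime
};
struct AuthObject {                        // what deep scrub just computed
  std::optional<uint32_t> data_digest;
};

bool should_update_digest(const AuthInfo& oi, const AuthObject& obj, bool repair) {
  Update update = Update::NO;
  if (obj.data_digest && !oi.data_digest)
    update = Update::MAYBE;                // object_info simply has no digest yet
  if (obj.data_digest && oi.data_digest &&
      *obj.data_digest != *oi.data_digest && repair)
    update = Update::FORCE;                // recorded digest != on-disk digest while repairing
  if (update == Update::NO)
    return false;
  const double min_age = 2 * 3600.0;       // stands in for osd_deep_scrub_update_digest_min_age
  return update == Update::FORCE || oi.age_seconds > min_age;
}

int main() {
  std::cout << should_update_digest({std::nullopt, 3 * 3600.0}, {0xabcdu}, false) << "\n"; // 1: old object, digest missing
  std::cout << should_update_digest({0x1111u, 0.0}, {0x2222u}, true) << "\n";              // 1: mismatch during repair
  std::cout << should_update_digest({std::nullopt, 60.0}, {0xabcdu}, false) << "\n";       // 0: too young to trust
}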
//compare the consistency of all copies of each object
void PGBackend::be_compare_scrubmaps(
  const map<pg_shard_t,ScrubMap*> &maps,
  const set<hobject_t> &master_set,
  bool repair,
  map<hobject_t, set<pg_shard_t>> &missing,
  map<hobject_t, set<pg_shard_t>> &inconsistent,
  map<hobject_t, list<pg_shard_t>> &authoritative,
  map<hobject_t, pair<boost::optional<uint32_t>,
                      boost::optional<uint32_t>>> &missing_digest,
  int &shallow_errors, int &deep_errors,
  Scrub::Store *store,
  const spg_t& pgid,
  const vector<int> &acting,
  ostream &errorstream)
{
  utime_t now = ceph_clock_now();

  // Check maps against master set and each other
  for (set<hobject_t>::const_iterator k = master_set.begin();
       k != master_set.end();
       ++k) {//遍历master_set中所有对象。进行逐个对象的一致性检查。
    object_info_t auth_oi;
    map<pg_shard_t, shard_info_wrapper> shard_map;

    inconsistent_obj_wrapper object_error{*k};

    bool digest_match;
    map<pg_shard_t, ScrubMap *>::const_iterator auth =
      be_select_auth_object(*k, maps, &auth_oi, shard_map, digest_match,
                pgid, errorstream);// pick a shard holding an authoritative copy as auth; if none is found the error is recorded (shallow_errors++)

    list<pg_shard_t> auth_list;
    set<pg_shard_t> object_errors;
    if (auth == maps.end()) {// no authoritative object was found
      object_error.set_version(0);
      object_error.set_auth_missing(*k, maps, shard_map, shallow_errors,
    deep_errors, get_parent()->whoami_shard());
      if (object_error.has_deep_errors())
    ++deep_errors;
      else if (object_error.has_shallow_errors())
    ++shallow_errors;
      store->add_object_error(k->pool, object_error);
      errorstream << pgid.pgid << " soid " << *k
          << " : failed to pick suitable object info\n";
            continue;
    }
    object_error.set_version(auth_oi.user_version);
    ScrubMap::object& auth_object = auth->second->objects[*k];// auth_object is the authoritative copy
    set<pg_shard_t> cur_missing;
    set<pg_shard_t> cur_inconsistent;
    bool fix_digest = false;

    for (auto  j = maps.cbegin(); j != maps.cend(); ++j) {
      if (j == auth)
    shard_map[auth->first].selected_oi = true;
      if (j->second->objects.count(*k)) {// the object exists on this shard
    shard_map[j->first].set_object(j->second->objects[*k]);
    // Compare
    stringstream ss;
    bool found = be_compare_scrub_objects(auth->first,
                   auth_object,
                   auth_oi,
                   j->second->objects[*k],
                   shard_map[j->first],
                   object_error,
                   ss,
                   k->has_snapset());// compare this shard's copy against the auth object: data digest, omap digest and attrs

/*
*be_compare_scrubmaps replicated shards 3 digest_match
*be_compare_scrubmaps replicated shards 3 digest_match
*be_compare_scrubmaps replicated auth shards 3 digest_match
*/
    dout(20) << __func__ << (repair ? " repair " : " ") << (parent->get_pool().is_replicated() ? "replicated " : "")
      << (j == auth ? "auth " : "") << "shards " << shard_map.size() << (digest_match ? " digest_match " : " ")
      << (shard_map[j->first].has_data_digest_mismatch_info() ? "info_mismatch " : "")
      << (shard_map[j->first].only_data_digest_mismatch_info() ? "only" : "")
      << dendl;

        if (cct->_conf->osd_distrust_data_digest/*false*/) {
      if (digest_match && parent->get_pool().is_replicated()
              && shard_map[j->first].has_data_digest_mismatch_info()) {
        fix_digest = true;
      }
      shard_map[j->first].clear_data_digest_mismatch_info();
    // If all replicas match but the object-info digest does not, repair it via the missing_digest mechanism
    } else if (repair && parent->get_pool().is_replicated() && j == auth && shard_map.size() > 1
        && digest_match && shard_map[j->first].only_data_digest_mismatch_info()
        && auth_object.digest_present) {
      // Set in missing_digests
      fix_digest = true;
      // Clear the error
      shard_map[j->first].clear_data_digest_mismatch_info();
      errorstream << pgid << " soid " << *k << " : repairing object info data_digest" << "\n";
    }
    // Some errors might have already been set in be_select_auth_object()
    if (shard_map[j->first].errors != 0) {// this shard has errors recorded
      cur_inconsistent.insert(j->first);
      if (shard_map[j->first].has_deep_errors())
        ++deep_errors;
      else
        ++shallow_errors;
      // Only true if be_compare_scrub_objects() found errors and put something
      // in ss.
      if (found)
        errorstream << pgid << " shard " << j->first << " soid " << *k
              << " : " << ss.str() << "\n";
    } else if (found) {// mismatches were found but no shard-level error set
      // Track possible shard to use as authoritative, if needed
      // There are errors, without identifying the shard
      object_errors.insert(j->first);
      errorstream << pgid << " soid " << *k << " : " << ss.str() << "\n";
    } else {// no problems: only then is the OSD added to auth_list
      // XXX: The auth shard might get here that we don't know
      // that it has the "correct" data.
      auth_list.push_back(j->first);
    }
      } else {// the object does not exist on OSD j
    cur_missing.insert(j->first);// add that OSD to the cur_missing list
    shard_map[j->first].set_missing();
    shard_map[j->first].primary = (j->first == get_parent()->whoami_shard());
    // Can't have any other errors if there is no information available
    ++shallow_errors;
    errorstream << pgid << " shard " << j->first << " " << *k << " : missing\n";
      }
      object_error.add_shard(j->first, shard_map[j->first]);
    }

    if (auth_list.empty()) {
      if (object_errors.empty()) {
        errorstream << pgid.pgid << " soid " << *k
          << " : failed to pick suitable auth object\n";
        goto out;
      }
      // Object errors exist and nothing in auth_list
      // Prefer the auth shard otherwise take first from list.
      pg_shard_t shard;
      if (object_errors.count(auth->first)) {// the auth OSD has object errors and is not in auth_list; prefer it as the fallback
    shard = auth->first;
      } else {
    shard = *(object_errors.begin());
      }
      auth_list.push_back(shard);
      object_errors.erase(shard);
    }
    // At this point auth_list is populated, so we add the object errors shards
    // as inconsistent.
    cur_inconsistent.insert(object_errors.begin(), object_errors.end());
    if (!cur_missing.empty()) {
      missing[*k] = cur_missing;// record in the missing list
    }
    if (!cur_inconsistent.empty()) {
      inconsistent[*k] = cur_inconsistent;// record in the inconsistent list
    }

    if (fix_digest) {// should the object-info digest be repaired?
      boost::optional<uint32_t> data_digest, omap_digest;
      assert(auth_object.digest_present);
      data_digest = auth_object.digest;
      if (auth_object.omap_digest_present) {
        omap_digest = auth_object.omap_digest;
      }
      missing_digest[*k] = make_pair(data_digest, omap_digest);
    }
    // Special handling of this particular type of inconsistency
    // This can over-ride a data_digest or set an omap_digest
    // when all replicas match but the object info is wrong.
    if (!cur_inconsistent.empty() || !cur_missing.empty()) {
      authoritative[*k] = auth_list;// the object has broken copies, so record the good shards in authoritative
    } else if (!fix_digest && parent->get_pool().is_replicated()) {
      enum {
    NO = 0,
    MAYBE = 1,
    FORCE = 2,
      } update = NO;
//if object_info lacks the data or omap digest, set update to MAYBE
      if (auth_object.digest_present && !auth_oi.is_data_digest()) {
    dout(20) << __func__ << " missing data digest on " << *k << dendl;
    update = MAYBE;
      }
      if (auth_object.omap_digest_present && !auth_oi.is_omap_digest()) {
    dout(20) << __func__ << " missing omap digest on " << *k << dendl;
    update = MAYBE;
      }

      // recorded digest != actual digest?
          if (auth_oi.is_data_digest() && auth_object.digest_present &&
        auth_oi.data_digest != auth_object.digest) {// the data_digest/omap_digest recorded in auth_oi differs from the values actually computed for auth_object
      assert(cct->_conf->osd_distrust_data_digest                     
             || shard_map[auth->first].has_data_digest_mismatch_info());
      errorstream << pgid << " recorded data digest 0x"               
            << std::hex << auth_oi.data_digest << " != on disk 0x"    
            << auth_object.digest << std::dec << " on " << auth_oi.soid
            << "\n";                         
      if (repair)                            
        update = FORCE;                      
          }                                  
          if (auth_oi.is_omap_digest() && auth_object.omap_digest_present &&
        auth_oi.omap_digest != auth_object.omap_digest) {             
            assert(shard_map[auth->first].has_omap_digest_mismatch_info());
      errorstream << pgid << " recorded omap digest 0x"               
            << std::hex << auth_oi.omap_digest << " != on disk 0x"    
            << auth_object.omap_digest << std::dec                    
            << " on " << auth_oi.soid << "\n";                        
      if (repair)                            
        update = FORCE;                      
          }                                  

          if (update != NO) {                
      utime_t age = now - auth_oi.local_mtime;                        
      if (update == FORCE ||                 
          age > cct->_conf->osd_deep_scrub_update_digest_min_age/*2_hr*/) {   
              boost::optional<uint32_t> data_digest, omap_digest;     
              if (auth_object.digest_present) {                       
                data_digest = auth_object.digest;                     
          dout(20) << __func__ << " will update data digest on " << *k << dendl;
              }                              
              if (auth_object.omap_digest_present) {                  
                omap_digest = auth_object.omap_digest;                
          dout(20) << __func__ << " will update omap digest on " << *k << dendl;
          }
    missing_digest[*k] = make_pair(data_digest, omap_digest);
  } else {
    dout(20) << __func__ << " missing digest but age " << age
       << " < " << cct->_conf->osd_deep_scrub_update_digest_min_age
       << " on " << *k << dendl;
  }
      }
    }              

    out:
        if (object_error.has_deep_errors())
          ++deep_errors;
        else if (object_error.has_shallow_errors())
          ++shallow_errors;
        if (object_error.errors || object_error.union_shards.errors) {
          store->add_object_error(k->pool, object_error);
        }
      }
    }

2.5.3 be_select_auth_object

This function selects an authoritative object (auth_obj) from among the copies on the replicas. The principle is that each copy is validated for completeness using the redundant information it carries about itself.

  • 1.Put the OSDs holding obj into the shards list, with the primary OSD at the head.
  • 2.Walk the shards list, judging shard by shard.
  • 3.If reading the object's data or metadata fails, stop checking that shard.
  • 4.Make sure the OI_ATTR attribute is present and decodes correctly into object_info_t (oi); verify that the size recorded in object_info_t matches the scanned object size, and if not, keep looking for a better copy.
  • 5.If a provisional auth has already been chosen (auth_version != eversion_t()) and its digest differs from the digest of the copy being examined, set digest_match = false.
  • 6.If shard_info.errors is non-zero, the copy has errors and will never be selected as auth.
  • 7.Only when shard_info.errors is 0 does a copy compete for auth. Selection then proceeds as follows (see the sketch after this list):
    • First pass (primary copy): auth_version is 0'0, so auth_version == eversion_t() holds; set auth and auth_version = oi.version.
    • Second pass (a replica): oi.version == auth_version, so that condition fails; the tie-break (dcount(oi, oi_prio) > dcount(*auth_oi, auth_prio)) prefers a copy that carries both the data and omap CRCs. If the first OSD's copy is missing the data or omap CRC, the second OSD's copy has a higher count and becomes auth.
    • Third pass (another replica): same as the second.
    • Note: if the first copy is already complete, the first OSD stays auth even when the other copies are equally complete.
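
The selection rule described above can be condensed into a few lines. The sketch below is not Ceph code; digest_count() stands in for dcount(), and the versions and OSD ids are invented:

#include <iostream>
#include <vector>

struct Candidate {
  int osd;
  unsigned version;         // object_info version, simplified to a single number
  bool has_data_digest;
  bool has_omap_digest;
  bool has_errors;
};

// Simplified stand-in for dcount(): how many digests this copy carries.
int digest_count(const Candidate& c) {
  return (c.has_data_digest ? 1 : 0) + (c.has_omap_digest ? 1 : 0);
}

// shards must be ordered primary first, as in be_select_auth_object().
int select_auth(const std::vector<Candidate>& shards) {
  int auth = -1;
  unsigned auth_version = 0;
  int auth_count = -1;
  for (const auto& c : shards) {
    if (c.has_errors)
      continue;                             // an errored copy never becomes auth
    if (auth == -1 || c.version > auth_version ||
        (c.version == auth_version && digest_count(c) > auth_count)) {
      auth = c.osd;
      auth_version = c.version;
      auth_count = digest_count(c);
    }
  }
  return auth;
}

int main() {
  // Primary osd.8 lacks the omap digest; osd.0 has both digests at the same version,
  // so osd.0 wins the tie-break. With a complete primary copy, osd.8 would stay auth.
  std::cout << select_auth({{8, 378, true, false, false},
                            {0, 378, true, true,  false},
                            {3, 378, true, true,  false}}) << "\n";   // prints 0
}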
//select an authoritative object (auth_obj) from among the replica copies
map<pg_shard_t, ScrubMap *>::const_iterator
  PGBackend::be_select_auth_object(
  const hobject_t &obj,
  const map<pg_shard_t,ScrubMap*> &maps,
  object_info_t *auth_oi,
  map<pg_shard_t, shard_info_wrapper> &shard_map,
  bool &digest_match,
  spg_t pgid,
  ostream &errorstream)
{
  eversion_t auth_version;
  bool auth_prio = false;

  // Create list of shards with primary first so it will be auth copy all
  // other things being equal.
  list<pg_shard_t> shards;
  for (map<pg_shard_t, ScrubMap *>::const_iterator j = maps.begin();
       j != maps.end();
       ++j) {
    if (j->first == get_parent()->whoami_shard())
      continue;
    shards.push_back(j->first);
  }
  shards.push_front(get_parent()->whoami_shard());// e.g. [8,0,3] with osd.8 as primary: put the OSDs holding obj into shards, primary at the head

  map<pg_shard_t, ScrubMap *>::const_iterator auth = maps.end();
  digest_match = true;
  for (auto &l : shards) {
    bool oi_prio = false;
    ostringstream shard_errorstream;
    bool error = false;
    map<pg_shard_t, ScrubMap *>::const_iterator j = maps.find(l);
    map<hobject_t, ScrubMap::object>::iterator i =
      j->second->objects.find(obj);
    if (i == j->second->objects.end()) {// the object to be checked was not found on this shard
      continue;
    }
    auto& shard_info = shard_map[j->first];// per-OSD info in the scrubmap
    if (j->first == get_parent()->whoami_shard())
      shard_info.primary = true;// this OSD is the primary
    if (i->second.read_error) {// the object to be checked hit a read_error
      shard_info.set_read_error();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had a read error";
    }
    if (i->second.ec_hash_mismatch) {
      shard_info.set_ec_hash_mismatch();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had an ec hash mismatch";
    }
    if (i->second.ec_size_mismatch) {
      shard_info.set_ec_size_mismatch();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had an ec size mismatch";
    }

    object_info_t oi;
    bufferlist bl;
    map<string, bufferptr>::iterator k;
    SnapSet ss;
    bufferlist ss_bl, hk_bl;

    if (i->second.stat_error) {
      shard_info.set_stat_error();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had a stat error";
      // With stat_error no further checking
      // We don't need to also see a missing_object_info_attr
      goto out;// reading the object's data/metadata failed: stop checking this shard
    }

    // We won't pick an auth copy if the snapset is missing or won't decode.
    if (obj.is_head() || obj.is_snapdir()) {
      k = i->second.attrs.find(SS_ATTR);
      if (k == i->second.attrs.end()) {
    shard_info.set_snapset_missing();
        if (error)
          shard_errorstream << ", ";
        error = true;
        shard_errorstream << "candidate had a missing snapset key";
      } else {
        ss_bl.push_back(k->second);
        try {
      bufferlist::iterator bliter = ss_bl.begin();
      ::decode(ss, bliter);
        } catch (...) {
      // invalid snapset, probably corrupt
      shard_info.set_snapset_corrupted();
          if (error)
            shard_errorstream << ", ";
          error = true;
          shard_errorstream << "candidate had a corrupt snapset";
        }
      }
    }

    if (parent->get_pool().is_erasure()) {
      ECUtil::HashInfo hi;
      k = i->second.attrs.find(ECUtil::get_hinfo_key());
      if (k == i->second.attrs.end()) {
    shard_info.set_hinfo_missing();
        if (error)
          shard_errorstream << ", ";
        error = true;
        shard_errorstream << "candidate had a missing hinfo key";
      } else {
    hk_bl.push_back(k->second);
        try {
      bufferlist::iterator bliter = hk_bl.begin();
      decode(hi, bliter);
        } catch (...) {
      // invalid snapset, probably corrupt
      shard_info.set_hinfo_corrupted();
          if (error)
            shard_errorstream << ", ";
          error = true;
          shard_errorstream << "candidate had a corrupt hinfo";
        }
      }
    }

    k = i->second.attrs.find(OI_ATTR);
    if (k == i->second.attrs.end()) {
      // no object info on object, probably corrupt
      shard_info.set_info_missing();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had a missing info key";
      goto out;
    }
    bl.push_back(k->second);
    try {
      bufferlist::iterator bliter = bl.begin();
      ::decode(oi, bliter);// OI_ATTR is present; decode it into object_info_t (oi)
    } catch (...) {
      // invalid object info, probably corrupt
      shard_info.set_info_corrupted();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate had a corrupt info";
      goto out;
    }

    // This is automatically corrected in PG::_repair_oinfo_oid()
    assert(oi.soid == obj);

    dout(10) << __func__ << "wds-2: l:" << l << ", i->first:" << i->first << ", i->second.size:" << i->second.size << ", oi.size:" << oi.size << dendl;

    if (i->second.size != be_get_ondisk_size(oi.size)) {// check that the size recorded in object_info_t (oi) matches the scanned size; if not, keep looking for a better copy
      shard_info.set_obj_size_info_mismatch();
      if (error)
        shard_errorstream << ", ";
      error = true;
      shard_errorstream << "candidate size " << i->second.size << " info size "
            << oi.size << " mismatch";
    }

    // digest_match will only be true if computed digests are the same
    if (auth_version != eversion_t()
        && auth->second->objects[obj].digest_present
        && i->second.digest_present
        && auth->second->objects[obj].digest != i->second.digest) {// a provisional auth is already chosen (auth_version != eversion_t()) and its digest differs from this copy's digest, so set digest_match = false
      digest_match = false;
      dout(10) << __func__ << " digest_match = false, " << obj << " data_digest 0x" << std::hex << i->second.digest
            << " != data_digest 0x" << auth->second->objects[obj].digest << std::dec
            << dendl;
    }

    // Don't use this particular shard due to previous errors
    // XXX: For now we can't pick one shard for repair and another's object info or snapset
    if (shard_info.errors)// a copy with errors is never selected as auth
      goto out;

    // XXX: Do I want replicated only?
    if (parent->get_pool().is_replicated() && cct->_conf->osd_distrust_data_digest/*false*/) {
      // This is a boost::optional<bool> so see if option set AND it has the value true
      // We give priority to a replica where the ObjectStore like BlueStore has builtin checksum
      if (j->second->has_builtin_csum && j->second->has_builtin_csum == true) {
        oi_prio = true;
      }
    }

    dout(10) << __func__ << " wds-4: auth_version:" << auth_version <<  "oi.version:" << oi.version << " j->first:" << j->first << dendl;
/*
 * First pass: auth_version is 0'0, so auth_version == eversion_t() holds; set auth and auth_version = oi.version.
 * Second pass: oi.version == auth_version, so that branch is not taken; the tie-break (dcount(oi, oi_prio) > dcount(*auth_oi, auth_prio)) prefers the copy carrying both the data and omap CRCs, so if the first OSD's copy is missing one of them, the second becomes auth.
 * Third pass: same as the second.
 * Note: if the first copy is complete, the first OSD stays auth even when the other copies are also complete.
 */
    if (auth_version == eversion_t() || oi.version > auth_version ||
        (oi.version == auth_version && dcount(oi, oi_prio) > dcount(*auth_oi, auth_prio))) {
      auth = j;
      *auth_oi = oi;
      auth_version = oi.version;
      auth_prio = oi_prio;
      dout(10) << __func__ << " wds-5: auth->first:" << auth->first << dendl;
    }

out:
    if (error)
        errorstream << pgid.pgid << " shard " << l << " soid " << obj
            << " : " << shard_errorstream.str() << "\n";
    // Keep scanning other shards
  }
  dout(10) << __func__ << ": selecting osd " << auth->first
       << " for obj " << obj
       << " with oi " << *auth_oi
       << dendl;
/*
*2020-08-28 09:53:18.636882 7f60d52e4700 10 osd.8 pg_epoch: 147 pg[2.6( v 147'409 (0'0,147'409] local-lis/les=146/147 n=37 ec=131/131 lis/c 146/146 les/c/f 147/147/0 146/146/146) [8,0,3] r=
*0 lpr=146 crt=147'409 lcod 147'408 mlcod 147'408 active+clean+scrubbing+deep] be_select_auth_object: selecting osd 8 for obj 2:62b05854:::rbd_data.6ed5b6b8b4567.000000000000006e:head with 
*oi 2:62b05854:::rbd_data.6ed5b6b8b4567.000000000000006e:head(147'378 osd.8.0:5 dirty|data_digest|omap_digest s 3866624 uv 372 dd 210c09f1 od ffffffff alloc_hint [4194304 4194304 0])
*/
  return auth;
}

2.6 Finishing the scrub

PG::scrub_finish() wraps up the scrub.

  • 1.Call scrub_process_inconsistent to repair the missing and inconsistent objects recorded by the scrubber, ultimately via repair_object; this only marks the objects as missing in peer_missing and missing. has_error becomes true only when authoritative is non-empty and repair is true (see the sketch after this list).
  • 2.Update the relevant PG state and statistics.
  • 3.If has_error is true, send a DoRecovery event to the PG state machine to kick off the actual object repair. For a plain deep-scrub, repair is 0, so no repair is triggered.
  • 4.Once the PG is healthy again, call PG::share_pg_info() to push the new pg_info to the other replicas.
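
The two decisions highlighted in steps 1 and 3 (when has_error becomes true and when the error counters may be cleared) can be summarized in a small sketch; this is not Ceph code and the field names are invented:

#include <iostream>

struct ScrubResult {
  bool repair;              // PG_STATE_REPAIR was set for this run
  int authoritative;        // number of objects with a usable good copy
  int shallow_errors;
  int deep_errors;
  int fixed;                // objects marked for repair by scrub_process_inconsistent()
};

// has_error in PG::scrub_finish(): only a repair run with authoritative entries
// ends up queuing the DoRecovery event.
bool queue_do_recovery(const ScrubResult& r) {
  return r.repair && r.authoritative > 0;
}

// Error counters are cleared only when every recorded error was fixed.
bool clear_error_counters(const ScrubResult& r) {
  return r.repair && r.fixed == r.shallow_errors + r.deep_errors;
}

int main() {
  ScrubResult deep_only{false, 0, 0, 2, 0};   // plain deep-scrub: errors are only reported
  ScrubResult repairing{true, 2, 0, 2, 2};    // repair run that fixed everything it found
  std::cout << queue_do_recovery(deep_only) << " " << clear_error_counters(deep_only) << "\n"; // 0 0
  std::cout << queue_do_recovery(repairing) << " " << clear_error_counters(repairing) << "\n"; // 1 1
}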
// the part that actually finalizes a scrub
void PG::scrub_finish() 
{
  bool repair = state_test(PG_STATE_REPAIR);
  // if the repair request comes from auto-repair and large number of errors,
  // we would like to cancel auto-repair
  if (repair && scrubber.auto_repair
      && scrubber.authoritative.size() > cct->_conf->osd_scrub_auto_repair_num_errors) {
    state_clear(PG_STATE_REPAIR);
    repair = false;
  }
  bool deep_scrub = state_test(PG_STATE_DEEP_SCRUB);
  const char *mode = (repair ? "repair": (deep_scrub ? "deep-scrub" : "scrub"));

  // type-specific finish (can tally more errors)
  _scrub_finish();

//has_error is true only when authoritative is non-empty and repair is true
  bool has_error = scrub_process_inconsistent();// repairs the missing and inconsistent objects recorded by the scrubber, ultimately via repair_object; it only marks the objects as missing in peer_missing and missing

  {
    stringstream oss;
    oss << info.pgid.pgid << " " << mode << " ";
    int total_errors = scrubber.shallow_errors + scrubber.deep_errors;
    if (total_errors)
      oss << total_errors << " errors";
    else
      oss << "ok";//2.6 deep-scrub ok
    if (!deep_scrub && info.stats.stats.sum.num_deep_scrub_errors)
      oss << " ( " << info.stats.stats.sum.num_deep_scrub_errors
          << " remaining deep scrub error details lost)";
    if (repair)
      oss << ", " << scrubber.fixed << " fixed";
    if (total_errors)
      osd->clog->error(oss);
    else
      osd->clog->debug(oss);
  }

  // finish up
  unreg_next_scrub();
  utime_t now = ceph_clock_now();
  info.history.last_scrub = info.last_update;// update the pg_info history
  info.history.last_scrub_stamp = now;
  if (scrubber.deep) {
    info.history.last_deep_scrub = info.last_update;
    info.history.last_deep_scrub_stamp = now;
  }
  // Since we don't know which errors were fixed, we can only clear them
  // when every one has been fixed.
  if (repair) {
    if (scrubber.fixed == scrubber.shallow_errors + scrubber.deep_errors) {
      assert(deep_scrub);
      scrubber.shallow_errors = scrubber.deep_errors = 0;
    } else {
      // Deep scrub in order to get corrected error counts
      scrub_after_recovery = true;
    }
  }
  if (deep_scrub) {
    if ((scrubber.shallow_errors == 0) && (scrubber.deep_errors == 0))
      info.history.last_clean_scrub_stamp = now;
    info.stats.stats.sum.num_shallow_scrub_errors = scrubber.shallow_errors;
    info.stats.stats.sum.num_deep_scrub_errors = scrubber.deep_errors;
    info.stats.stats.sum.num_large_omap_objects = scrubber.large_omap_objects;
  } else {
    info.stats.stats.sum.num_shallow_scrub_errors = scrubber.shallow_errors;
    // XXX: last_clean_scrub_stamp doesn't mean the pg is not inconsistent
    // because of deep-scrub errors
    if (scrubber.shallow_errors == 0)
      info.history.last_clean_scrub_stamp = now;
  }
  info.stats.stats.sum.num_scrub_errors = 
    info.stats.stats.sum.num_shallow_scrub_errors +
    info.stats.stats.sum.num_deep_scrub_errors;
  reg_next_scrub();

  {
    ObjectStore::Transaction t;
    dirty_info = true;
    write_if_dirty(t);
    int tr = osd->store->queue_transaction(osr.get(), std::move(t), NULL);
    assert(tr == 0);
  }

  if (has_error) {// send a DoRecovery event to the PG state machine to start the actual repair; for a plain deep-scrub has_error is false
    queue_peering_event(
      CephPeeringEvtRef(
    std::make_shared<CephPeeringEvt>(
      get_osdmap()->get_epoch(),
      get_osdmap()->get_epoch(),
      DoRecovery())));
  }

  scrub_clear_state();
  scrub_unreserve_replicas();

  if (is_active() && is_primary()) {
    share_pg_info();
  }
}

Note: based in part on the book 《Ceph源码分析》 (Ceph Source Code Analysis).


Reposted from blog.51cto.com/wendashuai/2531075