Preface

Compaction between levels is one of leveldb's core features. It is carried out by a background thread, whose main work is done in BackgroundCompaction(). That function has two tasks:

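If imm_ is not empty, write imm_ to disk, producing a new sstable file in level 0;
Pick a level, say level n, according to certain criteria, and merge files in level n with files in level n+1. This keeps level n from accumulating too many files, and in the process drops stale key-value pairs as well as those the user has deleted.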

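This article focuses on the question raised by the second task: how a level is chosen, and how the files that take part in the merge are chosen.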

Overview

BackgroundCompaction() first calls PickCompaction() to choose the level n to compact and stores the result in a Compaction object; only then does it "start to compact", as the code below shows.

-------------------------------------------------------------------------db_impl.cc
void DBImpl::BackgroundCompaction()
{
    Compaction* c;
    // other code omitted
    // Pick a level suitable for compaction and store the result in a Compaction object
    c = versions_->PickCompaction();
    // start to compact
}

First, a look at the structure of the Compaction class.

---------------------------------------------------------------------------version_set.h
// A Compaction encapsulates information about a compaction.
class Compaction {
 public:
  ~Compaction();

  // Return the level that is being compacted.  Inputs from "level"
  // and "level+1" will be merged to produce a set of "level+1" files.
  // Returns the chosen level, say n: files in level n are merged with files in level n+1.
  int level() const { return level_; }

  // Return the object that holds the edits to the descriptor done
  // by this compaction.
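  // The VersionEdit records the metadata changes made by this compaction
  // (files removed, files added, compact pointers) so that they can be applied to produce the next Version.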
  VersionEdit* edit() { return &edit_; }

  // "which" must be either 0 or 1
  /* Returns the number of files held in inputs_[0] or inputs_[1].
   * inputs_[0] and inputs_[1] hold the FileMetaData pointers of all level n and
   * level n+1 files that take part in the merge, respectively.
   */
  int num_input_files(int which) const { return inputs_[which].size(); }

  // Return the ith input file at "level()+which" ("which" must be 0 or 1).
  /* Returns the FileMetaData of the i-th file in inputs_[0] or inputs_[1]. */
  FileMetaData* input(int which, int i) const { return inputs_[which][i]; }

  // Maximum size of files to build during this compaction.
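  // During DoCompactionWork(), once the sstable being built is large enough, the current file
  // is closed and a new sstable file is started. max_output_file_size_ holds that threshold;
  // its value comes straight from options->max_file_size.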
  uint64_t MaxOutputFileSize() const { return max_output_file_size_; }

  // Is this a trivial compaction that can be implemented by just
  // moving a single input file to the next level (no merging or splitting)
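  /* True when the compaction is a plain file move: inputs_[1] is empty, inputs_[0] holds
   * exactly one file, and the total size of the overlapping files in grandparents_ does not
   * exceed the configured limit. The last condition avoids creating a single level+1 file
   * that would be expensive to merge into level+2 later.
   */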
  bool IsTrivialMove() const;

  // Add all inputs to this compaction as delete operations to *edit.
  // Records every level and level+1 file taking part in the compaction in the set
  // edit->deleted_files_, so that they can be deleted afterwards.
  void AddInputDeletions(VersionEdit* edit);

  // Returns true if the information we have available guarantees that
  // the compaction is producing data in "level+1" for which no data exists
  // in levels greater than "level+1".
  // Returns true if user_key does not exist in any level above level+1 (that is, from level n+2 upward).
  bool IsBaseLevelForKey(const Slice& user_key);

  // Returns true iff we should stop building the current output
  // before processing "internal_key".
  bool ShouldStopBefore(const Slice& internal_key);

  // Release the input version for the compaction, once the compaction
  // is successful.
  /* Once the compaction has completed successfully, ReleaseInputs() drops the reference to
   * input_version_, the Version from which the input files were picked. */
  void ReleaseInputs();

 private:
  friend class Version;
  friend class VersionSet;

  Compaction(const Options* options, int level);

  int level_;
  uint64_t max_output_file_size_;
  Version* input_version_;
  VersionEdit edit_;

  // Each compaction reads inputs from "level_" and "level_+1"
  // Two vectors holding all level and level+1 files taking part in the compaction
  std::vector<FileMetaData*> inputs_[2];      // The two sets of inputs

  // State used to check for number of overlapping grandparent files
  // (parent == level_ + 1, grandparent == level_ + 2)
  std::vector<FileMetaData*> grandparents_;
  size_t grandparent_index_;  // Index in grandparent_starts_
  bool seen_key_;             // Some output key has been seen
  int64_t overlapped_bytes_;  // Bytes of overlap between current output
                              // and grandparent files

  // State for implementing IsBaseLevelForKey

  // level_ptrs_ holds indices into input_version_->levels_: our state
  // is that we are positioned at one of the file ranges for each
  // higher level than the ones involved in this compaction (i.e. for
  // all L >= level_ + 2).
  size_t level_ptrs_[config::kNumLevels];
};
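
The trivial-move test described in the class comments can be sketched as a free function. This is only an illustration under assumptions: the parameter names are hypothetical, file sizes are passed in as plain numbers, and the overlap limit is a parameter rather than a value derived from the options as leveldb does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Compaction::IsTrivialMove().
// num_level_inputs / num_next_level_inputs play the role of
// inputs_[0].size() and inputs_[1].size().
bool IsTrivialMove(size_t num_level_inputs, size_t num_next_level_inputs,
                   const std::vector<uint64_t>& grandparent_file_sizes,
                   uint64_t max_grandparent_overlap_bytes) {
  assert(max_grandparent_overlap_bytes > 0);
  uint64_t total = 0;
  for (uint64_t size : grandparent_file_sizes) total += size;
  // Move the file only if nothing has to be merged and the moved file
  // would not overlap too much data in level+2.
  return num_level_inputs == 1 && num_next_level_inputs == 0 &&
         total <= max_grandparent_overlap_bytes;
}
```

When all three conditions hold, the single input file can simply be reassigned to level+1 without rewriting it.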

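VersionSet::PickCompaction() builds and fills in the Compaction

PickCompaction() gathers the information each compaction needs and records it in a Compaction object: the level being compacted, the metadata of every level and level+1 file taking part, and the file information for level+2.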

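Choosing the level to compact

A level can be chosen in two ways:

Level capacity: when the total size of all files in a level reaches a certain threshold, that level is chosen for compaction;
Seek-based selection: each Get issued by the user updates some bookkeeping fields in the db, and a file that has been seeked too often is chosen for compaction.

The code prefers capacity-based selection; seek-based selection is used only when no level has hit its capacity threshold.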

Compaction* VersionSet::PickCompaction()
{
  const bool size_compaction = (current_->compaction_score_ >= 1);
  const bool seek_compaction = (current_->file_to_compact_ != nullptr);

  // Size compaction takes priority
  if (size_compaction) {
    // choose the level based on capacity
  }
  else if (seek_compaction) {
    // only then consider seek-based compaction
  }
  else
  {
    // neither kind of compaction is needed, so return null
    return nullptr;
  }
  //other code
}
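
The compaction_score_ tested above is computed in Finalize(). As a rough sketch of how such a score can be derived, the code below assumes the defaults I recall from the leveldb source: level 0 is scored by file count against a trigger of 4 files, and the other levels by total bytes against a budget of 10 MB for level 1 that grows tenfold per level. LevelStat, Best and PickBestLevel are hypothetical names introduced for illustration, not the real signatures, and the constants should be checked against your leveldb version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-level summary; real leveldb walks FileMetaData lists.
struct LevelStat {
  int num_files;
  uint64_t total_bytes;
};

// Assumed default, recalled from the leveldb source.
const int kL0CompactionTrigger = 4;

// Assumed byte budget: 10 MB for levels 0 and 1, times 10 per level after that.
double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;
  while (level > 1) {
    result *= 10;
    level--;
  }
  return result;
}

struct Best {
  int level;
  double score;
};

// Scores every level except the last one (it has no level+1 to merge into)
// and returns the level with the highest score.
Best PickBestLevel(const std::vector<LevelStat>& levels) {
  assert(levels.size() >= 2);
  Best best{-1, -1.0};
  for (size_t level = 0; level + 1 < levels.size(); level++) {
    double score;
    if (level == 0) {
      // Level 0 is scored by file count rather than by bytes.
      score = levels[0].num_files / static_cast<double>(kL0CompactionTrigger);
    } else {
      score = static_cast<double>(levels[level].total_bytes) /
              MaxBytesForLevel(static_cast<int>(level));
    }
    if (score > best.score) {
      best.level = static_cast<int>(level);
      best.score = score;
    }
  }
  return best;
}
```

A score of 1 or more means the level is over budget, which is what the size_compaction flag checks.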

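Capacity-based selection

Choosing files within the level

This corresponds to the body of the first if (size_compaction) above: the file from level that takes part in the compaction is picked and put into a vector of FileMetaData*, namely inputs_[0].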

  if (size_compaction) {
    // compaction_level_ always tracks the level in the current version best suited for compaction; see Finalize in version_set.cc
    level = current_->compaction_level_;
    assert(level >= 0);
    assert(level+1 < config::kNumLevels);
    c = new Compaction(options_, level);

    // Pick the first file that comes after compact_pointer_[level]
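    /*
     * leveldb keeps an array compact_pointer_ with one entry per level.
     * compact_pointer_[level] holds the largest key covered by the previous compaction of that level,
     * so the next compaction of the level starts from the first file whose largest key is greater
     * than compact_pointer_[level].
     */
    /* Examine the files of the level one by one. A file is chosen for compaction into level+1,
     * and the loop breaks, as soon as either condition holds:
     * 1. compact_pointer_[level] is empty, meaning the level has never been compacted;
     * 2. the file's largest key is greater than compact_pointer_[level].
     */
    /**************************************************************
     * STEP 1: pick one file from the given level for compaction and put it into inputs_[0]
     **************************************************************/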
    for (size_t i = 0; i < current_->files_[level].size(); i++) {
      FileMetaData* f = current_->files_[level][i];
      // Pick the first file whose largest key is greater than compact_pointer_[level]
      if (compact_pointer_[level].empty() ||
          icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {
        //version_set.h: Each compaction reads inputs from "level_" and "level_+1"
        //std::vector<FileMetaData*> inputs_[2]; 
        c->inputs_[0].push_back(f);
        break;
      }
    }
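    // compact_pointer_[level] only moves forward, so what about smaller keys added later? They are reached by wrapping around:
    // if no file lies beyond compact_pointer_[level], go back to the first file.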
    if (c->inputs_[0].empty()) {
      // Wrap-around to the beginning of the key space
      c->inputs_[0].push_back(current_->files_[level][0]);
    }
  }

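This capacity-based selection rests on two premises:

leveldb maintains a compaction_level_ indicator recording the level best suited for compaction. It is stored in current_->compaction_level_ and updated in Finalize() in version_set.cc;
leveldb keeps one string per level holding the largest key of that level's previous compaction. These per-level strings are stored together in the string array compact_pointer_[].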

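Given these premises, each compaction of a level picks the first file whose largest key is greater than compact_pointer_[level]. This is the loop over the level's files shown earlier: a file is chosen, and the loop exits, as soon as either of the following holds:

compact_pointer_[level] is empty, meaning the level has never been compacted, so the first file is chosen directly;
the file's largest key is greater than compact_pointer_[level].

The code could hardly be simpler; a student would write it the same way.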

for (size_t i = 0; i < current_->files_[level].size(); i++) {
      FileMetaData* f = current_->files_[level][i];
      if (compact_pointer_[level].empty() ||
          icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {
        //version_set.h: Each compaction reads inputs from "level_" and "level_+1"
        //std::vector<FileMetaData*> inputs_[2]; 
        c->inputs_[0].push_back(f);
        // one file is enough
        break;
      }
 }

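There is an obvious concern here. Each compaction picks a file whose keys lie beyond the largest key of the previous compaction, so what happens if, after that compaction, every key written to this level is small? Would the files holding those small keys never be compacted? No. If all data subsequently written to the level has small keys, the files beyond compact_pointer_[level] will sooner or later be used up by compaction, and the for loop above will find no suitable file. At that point the selection wraps around and picks the first file of the level.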

// No file lies beyond compact_pointer_[level], so wrap around to the first file
if (c->inputs_[0].empty()) {
    // Wrap-around to the beginning of the key space
    c->inputs_[0].push_back(current_->files_[level][0]);
}
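
The selection loop and the wrap-around can be tried out in isolation with a small self-contained sketch. It makes simplifying assumptions: keys are plain std::string values compared bytewise instead of InternalKey values compared by icmp_, and FileMeta and PickSeedFile are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for FileMetaData: plain strings instead of InternalKey.
struct FileMeta {
  std::string smallest;
  std::string largest;
};

// Returns the index of the file that seeds inputs_[0].
// `files` must be non-empty and sorted by key range.
size_t PickSeedFile(const std::vector<FileMeta>& files,
                    const std::string& compact_pointer) {
  assert(!files.empty());
  for (size_t i = 0; i < files.size(); i++) {
    // Either the level was never compacted, or this is the first file
    // that reaches beyond the previous compaction.
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;
    }
  }
  // No file lies beyond the pointer: wrap around to the first file.
  return 0;
}
```

With files covering [a,c], [d,f] and [g,k], a pointer of "e" selects the second file, while a pointer of "k" finds nothing beyond it and wraps around to the first file.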

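Overlapping files in level 0

In principle the file selection for the level is finished at this point, but level 0 has a catch: its files usually overlap one another. Rather than compacting overlapping data in the same level over several rounds, it is better to handle all overlapping files in a single compaction. Level 0 therefore needs one more step: taking the key range [smallest, largest] of the files in inputs_[0], check every level 0 file one by one and put each file whose keys overlap that range into inputs_[0]. In effect this enlarges inputs_[0].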

// Files in level 0 may overlap each other, so pick up all overlapping ones
  if (level == 0) {
    InternalKey smallest, largest;
    GetRange(c->inputs_[0], &smallest, &largest);
    // Note that the next call will discard the file we placed in
    // c->inputs_[0] earlier and replace it with an overlapping set
    // which will include the picked file.
    current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
    assert(!c->inputs_[0].empty());
  }
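
For level 0, collecting the overlapping files is not a single pass: a newly added file may widen the range, and the widened range may then overlap files that were skipped earlier. As I recall it, GetOverlappingInputs() handles this by restarting the scan whenever the range grows. The sketch below illustrates that idea under simplifying assumptions: string keys, a hypothetical FileMeta struct, and a hypothetical function name OverlappingLevel0 that returns file indices rather than pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for FileMetaData.
struct FileMeta {
  std::string smallest;
  std::string largest;
};

// Returns the indices of all level-0 files that transitively overlap
// [begin, end]. Whenever a picked file extends the range, the range is
// widened and the scan starts over from the first file.
std::vector<size_t> OverlappingLevel0(const std::vector<FileMeta>& files,
                                      std::string begin, std::string end) {
  assert(begin <= end);
  std::vector<size_t> picked;
  for (size_t i = 0; i < files.size();) {
    const FileMeta& f = files[i++];
    if (f.largest < begin || f.smallest > end) {
      continue;  // entirely before or entirely after the range
    }
    picked.push_back(i - 1);
    if (f.smallest < begin) {
      begin = f.smallest;  // range grew to the left: restart
      picked.clear();
      i = 0;
    } else if (f.largest > end) {
      end = f.largest;  // range grew to the right: restart
      picked.clear();
      i = 0;
    }
  }
  return picked;
}
```

Starting from the range of a single seed file, the result is the whole chain of level 0 files connected to it by overlaps, which is why inputs_[0] can end up holding many files for level 0.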

That completes the work for level. The information for level+1 and level+2 remains, and all of it is handled by SetupOtherInputs().

SetupOtherInputs

Summed up in one sentence, this function selects the files in level+1 whose key range overlaps the key range of the files in inputs_[0], and puts them into inputs_[1].

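step1: determine the level+1 files that take part in the compaction

Compute the key range of inputs_[0]; call it range0;
Select every file in level+1 that overlaps range0 and put them all into inputs_[1];
Compute the key range of the union of inputs_[0] and inputs_[1]; call it range1;
Using range1, scan the files of level again and put every file that overlaps range1 into expanded0. Clearly expanded0.size may turn out larger than inputs_[0].size;
Likewise, call the range covered by all files in expanded0 range2, and use range2 to select every overlapping level+1 file into expanded1;
If expanded0.size > inputs_[0].size and expanded1.size == inputs_[1].size, use expanded0 and expanded1 in place of inputs_[0] and inputs_[1] as the final selection.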

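The last four steps (3 to 6) try to increase the number of level files in the compaction without increasing the number of level+1 files; the aim is to bring as many level files as possible into a single compaction.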
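
The expansion logic of step1 can be sketched end to end. This is a simplified illustration, not leveldb's code: keys are std::string values, level is assumed to be 1 or higher so that files within a level do not overlap, and FileMeta, Files, Range and Overlapping are hypothetical helpers. As far as I recall, the real function also refuses to expand when the combined input size would exceed a byte limit; that check is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <string>
#include <utility>
#include <vector>

struct FileMeta {
  std::string smallest, largest;
};
using Files = std::vector<FileMeta>;
using Range = std::pair<std::string, std::string>;

// Smallest and largest key covered by the files in a and b together.
Range GetRange(const Files& a, const Files& b = {}) {
  assert(!a.empty() || !b.empty());
  bool first = true;
  Range r;
  for (const Files* fs : {&a, &b}) {
    for (const FileMeta& f : *fs) {
      if (first) {
        r = {f.smallest, f.largest};
        first = false;
      } else {
        r.first = std::min(r.first, f.smallest);
        r.second = std::max(r.second, f.largest);
      }
    }
  }
  return r;
}

// All files of one level whose key range intersects r.
Files Overlapping(const Files& level_files, const Range& r) {
  Files out;
  for (const FileMeta& f : level_files) {
    if (!(f.largest < r.first || f.smallest > r.second)) out.push_back(f);
  }
  return out;
}

// Takes the seed inputs0 and returns the final {inputs0, inputs1}.
std::pair<Files, Files> SetupOtherInputs(const Files& level,
                                         const Files& next_level,
                                         Files inputs0) {
  // range0 -> inputs1
  Files inputs1 = Overlapping(next_level, GetRange(inputs0));
  if (!inputs1.empty()) {
    // range1 -> expanded0
    Files expanded0 = Overlapping(level, GetRange(inputs0, inputs1));
    if (expanded0.size() > inputs0.size()) {
      // range of expanded0 -> expanded1
      Files expanded1 = Overlapping(next_level, GetRange(expanded0));
      // Accept the expansion only if it pulls in no extra level+1 file.
      if (expanded1.size() == inputs1.size()) {
        inputs0 = expanded0;
        inputs1 = expanded1;
      }
    }
  }
  return {inputs0, inputs1};
}
```

For example, with level files [a,c], [d,f], [g,i] and seed [a,c], a level+1 holding [b,e] and [h,k] lets [d,f] join for free, whereas a level+1 holding [b,e] and [f,k] rejects the expansion because [d,f] would drag in [f,k].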

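step2: record the level+2 files whose key range overlaps inputs_[0] and inputs_[1]

This step puts every level+2 file that overlaps the combined key range of inputs_[0] and inputs_[1] into the grandparents_ container. The level+2 file information is gathered so that, when the compaction later generates level+1 files, no new file overlaps too many level+2 files in key range (which would make the later compaction of level+1 pay too much merge cost).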
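
How grandparents_ is used later can be sketched along the lines of ShouldStopBefore(), using the fields grandparent_index_, seen_key_ and overlapped_bytes_ from the class above. This is a simplified model as I understand the logic: keys are strings, the byte limit is a constructor parameter, and Grandparent and OutputSplitter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified level+2 file: only its largest key and its size matter here.
struct Grandparent {
  std::string largest;
  uint64_t file_size;
};

class OutputSplitter {
 public:
  OutputSplitter(std::vector<Grandparent> grandparents,
                 uint64_t max_overlap_bytes)
      : grandparents_(std::move(grandparents)),
        max_overlap_bytes_(max_overlap_bytes) {
    assert(max_overlap_bytes_ > 0);
  }

  // Call once per output key, in ascending key order. Returns true when
  // the current output file should be closed before `key` is written.
  bool ShouldStopBefore(const std::string& key) {
    // Skip grandparent files that end before `key`; each one skipped after
    // the first key was seen is overlapped by the current output file.
    while (index_ < grandparents_.size() &&
           key > grandparents_[index_].largest) {
      if (seen_key_) overlapped_bytes_ += grandparents_[index_].file_size;
      index_++;
    }
    seen_key_ = true;
    if (overlapped_bytes_ > max_overlap_bytes_) {
      overlapped_bytes_ = 0;  // a fresh output file starts with no overlap
      return true;
    }
    return false;
  }

 private:
  std::vector<Grandparent> grandparents_;
  uint64_t max_overlap_bytes_;
  size_t index_ = 0;
  bool seen_key_ = false;
  uint64_t overlapped_bytes_ = 0;
};
```

Cutting the output file at that point bounds how much level+2 data any single new level+1 file overlaps.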

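step3: maintain compact_pointer_[level]

Indeed, the whole procedure above started from compact_pointer_, and it is this same module that maintains it at the end. After all that work, it is time to update compact_pointer_[level] for the level.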

compact_pointer_[level] = largest.Encode().ToString();
// Records the (level, largest) pair in the VersionEdit, so that the new compact pointer is persisted along with the other changes of this compaction
c->edit_.SetCompactPointer(level, largest);
