A walkthrough of how BlueStore, BlueFS, and rocksdb relate

Tag: ceph 12.2.4

BlueStore space initialization

BlueStore disk space management

Overview

  1. The OSD mount directory lives on a regular file system, while the Slow, WAL, and DB areas are managed directly on raw block devices.
  2. Slow area: mainly stores object data and is managed by BlueStore. The segments of it that BlueStore hands over to BlueFS are tracked in the bluefs_extents structure.
  3. WAL area: managed by BlueFS alone and invisible to BlueStore; BlueFS initializes it at startup. For .log files and for the log it generates itself, BlueFS prefers WAL device space. If no WAL device exists or its space is insufficient, BlueFS falls back step by step to the DB and then the Slow device.
  4. DB area: managed by BlueFS alone and invisible to BlueStore; BlueFS initializes it at startup. For .sst files BlueFS prefers DB device space; if no DB device exists or its space is insufficient, it falls back to the Slow device.
  5. rocksdb: a KV storage engine that runs on top of a file system; BlueStore calls its external interface.
  6. To support rocksdb, BlueStore implements the small file system BlueFS, and implements BlueRocksEnv to wrap BlueFS as the underlying environment that rocksdb runs on.
  7. BlueFS: organizes directories and files in a flat, two-level (tree-like) hierarchy. Locating a file takes two lookups: first find the directory that holds the file through dir_map, then find the file in the file_map of that directory. Each file is described by a bluefs_fnode_t structure: its extents attribute lists the physical segments on disk, and its prefer_bdev attribute names the preferred block device for the file. The bdev attribute of each extent identifies the device the extent lives on, so a single file may use space on several block devices (WAL, DB, and Slow); BlueFS::_allocate sets the bdev value of each extent. On restart, BlueFS::_replay locates its entry point through the superblock, stored in the second 4K block of the DB partition, then reads the log file and replays it to restore the metadata (dir_map, file_map).
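
The fallback order in items 3 and 4 can be sketched as a small loop. This is only an illustration of the idea, not the BlueFS code: `DeviceState` and `pick_device` are hypothetical names, and the numeric device ids are an assumption chosen so that "slower" means "larger id".

```cpp
#include <array>
#include <cstdint>

// Device slots ordered from fastest to slowest (numbering is an assumption).
enum Bdev : int { BDEV_NONE = -1, BDEV_WAL = 0, BDEV_DB = 1, BDEV_SLOW = 2 };

// Hypothetical per-device state: does the device exist, and how much is free.
struct DeviceState {
  bool present;
  uint64_t free_bytes;
};

// Start at the preferred device and fall back to slower ones until one
// exists and has enough free space. Returns BDEV_NONE if nothing fits.
inline int pick_device(const std::array<DeviceState, 3>& devs,
                       int preferred, uint64_t want) {
  for (int id = preferred; id <= BDEV_SLOW; ++id) {
    if (devs[id].present && devs[id].free_bytes >= want) {
      return id;
    }
  }
  return BDEV_NONE;
}
```

With no WAL device configured, a request that prefers BDEV_WAL lands on BDEV_DB; once the DB device is too full for the request, the same call returns BDEV_SLOW.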

initialization process

[ceph_osd.cc]
int main(int argc, const char **argv)
>> OSD::mkfs(g_ceph_context, store, g_conf->osd_data, mc.monmap.fsid, whoami);

[OSD.cc]
int OSD::mkfs(CephContext *cct, ObjectStore *store, const string &dev, uuid_d fsid, 
  int whoami)
>> store->mkfs();
>> store->mount();

[BlueStore.cc]
int BlueStore::mkfs()
>> _setup_block_symlink_or_file("block", cct->_conf->bluestore_block_path,
	cct->_conf->bluestore_block_size, cct->_conf->bluestore_block_create);
>> _setup_block_symlink_or_file("block.wal", cct->_conf->bluestore_block_wal_path,
	cct->_conf->bluestore_block_wal_size, cct->_conf->bluestore_block_wal_create);
>> _setup_block_symlink_or_file("block.db", cct->_conf->bluestore_block_db_path,
	cct->_conf->bluestore_block_db_size, cct->_conf->bluestore_block_db_create);
>> _open_db(true);
>> _open_fm(true);  // initialize the FreelistManager

[BlueStore.cc]
int BlueStore::_open_db(bool create)
>> if (do_bluefs):
   >> bluefs = new BlueFS(cct);
   >> bluefs->add_block_device(BlueFS::BDEV_DB, bfn);
>> if (create): bluefs->add_block_extent(BlueFS::BDEV_DB, SUPER_RESERVED, 
                   bluefs->get_block_device_size(BlueFS::BDEV_DB) - SUPER_RESERVED);
>> if (create):
   >> bluefs->add_block_extent(bluefs_shared_bdev, start, initial);
   >> bluefs_extents.insert(start, initial);
>> if (create):
   >> bluefs->add_block_extent(BlueFS::BDEV_WAL, BDEV_LABEL_BLOCK_SIZE, 
         bluefs->get_block_device_size(BlueFS::BDEV_WAL) - BDEV_LABEL_BLOCK_SIZE);
>> if (create): bluefs->mkfs(fsid);
>> bluefs->mount();
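
The two add_block_extent calls for the DB and WAL devices both follow the same arithmetic: skip a reserved prefix and hand the rest of the device to BlueFS. A minimal sketch of that calculation follows. The constant values (8192 and 4096) are what I recall from the luminous tree and should be checked against the source; `Range` and `usable_range` are hypothetical names.

```cpp
#include <cstdint>

// Assumed values; verify against BlueStore.cc and the block device headers.
constexpr uint64_t SUPER_RESERVED = 8192;         // reserved head of the DB device
constexpr uint64_t BDEV_LABEL_BLOCK_SIZE = 4096;  // label block at the device head

struct Range {
  uint64_t offset;
  uint64_t length;
};

// Usable range given to BlueFS: everything after the reserved prefix.
constexpr Range usable_range(uint64_t device_size, uint64_t reserved) {
  return Range{reserved, device_size - reserved};
}
```

For a 10 GiB DB device this yields an extent that starts at offset 8192 and ends exactly at the end of the device. The extent taken from the shared slow device (start, initial) is computed differently and is not covered here.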

[BlueFS.cc]
int BlueFS::mkfs(uuid_d osd_uuid)
>> _init_alloc();
>> set up the superblock information
>> initialize log_file

[BlueFS.cc]
int BlueFS::mount()
>> _init_alloc();
>> _replay(false);
>> for (auto& p : file_map):
    for (auto& q : p.second->fnode.extents):
      alloc[q.bdev]->init_rm_free(q.offset, q.length);
>> log_writer = _create_writer(_get_file(1));
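
The loop in mount() walks every extent of every replayed file and removes it from the allocator's free space, so that space already owned by files is never handed out again. The toy tracker below shows the effect of that call. It is a sketch under simplifying assumptions (one std::map of free ranges, no merging of adjacent ranges), not one of the real Ceph allocators; only the method names init_add_free and init_rm_free are borrowed.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

// Toy free-space tracker: maps the start of each free range to its length.
class FreeRanges {
 public:
  void init_add_free(uint64_t off, uint64_t len) { free_[off] = len; }

  // Carve [off, off+len) out of the single free range that contains it.
  void init_rm_free(uint64_t off, uint64_t len) {
    auto it = free_.upper_bound(off);  // first range starting after off
    if (it == free_.begin()) throw std::logic_error("range is not free");
    --it;                              // range starting at or before off
    const uint64_t start = it->first;
    const uint64_t end = start + it->second;
    if (off + len > end) throw std::logic_error("range is not free");
    free_.erase(it);
    if (off > start) free_[start] = off - start;                // left remainder
    if (off + len < end) free_[off + len] = end - (off + len);  // right remainder
  }

  uint64_t total_free() const {
    uint64_t sum = 0;
    for (const auto& r : free_) sum += r.second;
    return sum;
  }

 private:
  std::map<uint64_t, uint64_t> free_;
};
```

Removing the same extent twice throws, which mirrors the intent that each on-disk extent belongs to at most one file.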

[BlueFS.cc]
int BlueFS::_replay(bool noop)
>> replay the logged transaction ops one by one
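
Replay rebuilds dir_map and file_map purely from the logged ops. The sketch below illustrates that idea together with the two-step lookup (directory first, then file). The op set is reduced to four hypothetical record types and the structs are simplified stand-ins, so treat every type here as an assumption rather than the BlueFS definitions.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct File { uint64_t ino = 0; uint64_t size = 0; };
using FileRef = std::shared_ptr<File>;

struct Dir { std::map<std::string, FileRef> file_map; };  // file name -> file

// Simplified log record; the real log has more op types and fields.
struct LogOp {
  enum Type { DIR_CREATE, FILE_UPDATE, DIR_LINK, DIR_UNLINK } type;
  std::string dir;
  std::string name;
  uint64_t ino;
  uint64_t size;
};

struct Metadata {
  std::map<std::string, Dir> dir_map;    // directory name -> directory
  std::map<uint64_t, FileRef> file_map;  // inode number -> file

  // Apply one logged op to the in-memory maps.
  void apply(const LogOp& op) {
    switch (op.type) {
      case LogOp::DIR_CREATE:
        dir_map[op.dir];
        break;
      case LogOp::FILE_UPDATE: {
        FileRef& f = file_map[op.ino];
        if (!f) f = std::make_shared<File>();
        f->ino = op.ino;
        f->size = op.size;
        break;
      }
      case LogOp::DIR_LINK:
        dir_map[op.dir].file_map[op.name] = file_map.at(op.ino);
        break;
      case LogOp::DIR_UNLINK:
        dir_map[op.dir].file_map.erase(op.name);
        break;
    }
  }

  // Two-step lookup: find the directory, then the file inside it.
  FileRef lookup(const std::string& dir, const std::string& name) const {
    auto d = dir_map.find(dir);
    if (d == dir_map.end()) return nullptr;
    auto f = d->second.file_map.find(name);
    if (f == d->second.file_map.end()) return nullptr;
    return f->second;
  }
};

// Rebuild the metadata by applying every op in log order.
inline Metadata replay(const std::vector<LogOp>& log) {
  Metadata md;
  for (const auto& op : log) md.apply(op);
  return md;
}
```

Because the maps are derived only from the log, replaying the same log always produces the same metadata, which is what makes the log sufficient for recovery after a restart.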

BlueStore calls rocksdb related interfaces

RocksDBStore implements the KeyValueDB interface, and BlueStore performs all of its rocksdb operations through RocksDBStore.

open operation

[BlueStore.cc]
int BlueStore::_open_db(bool create)
>> db = KeyValueDB::create(cct, kv_backend, fn, static_cast<void*>(env));
>> if (create): db->create_and_open(err);

[RocksDBStore.cc]
int RocksDBStore::create_and_open(ostream &out)
>> do_open(out, true);

[RocksDBStore.cc]
int RocksDBStore::do_open(ostream &out, bool create_if_missing)
>> rocksdb::DB::Open(opt, path, &db);

read operation

[BlueStore.cc]
int BlueStore::read(const coll_t& cid, const ghobject_t& oid,
  uint64_t offset, size_t length, bufferlist& bl, uint32_t op_flags)
>> read(c, oid, offset, length, bl, op_flags);

[BlueStore.cc]
int BlueStore::read(CollectionHandle &c_, const ghobject_t& oid,
  uint64_t offset, size_t length, bufferlist& bl, uint32_t op_flags)
>> OnodeRef o = c->get_onode(oid, false);
>> _do_read(c, o, offset, length, bl, op_flags);
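[Note]
Each Onode contains one ExtentMap, and each ExtentMap contains several Extents.
Each Extent manages the data in one logical range and refers to one Blob;
the Blob maps that data to disk through several pextents.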
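
The chain in the note (Onode, ExtentMap, Extent, Blob, pextent) is easier to follow as a lookup from an object offset to a disk offset. The structs below are a heavily simplified sketch with hypothetical names and fields (`OnodeSketch`, `LExtent`, `PExtent`); the real BlueStore types carry far more state, such as checksums, compression, and sharing.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct PExtent { uint64_t disk_offset; uint64_t length; };  // physical extent

struct Blob { std::vector<PExtent> pextents; };  // concatenated in blob order

struct LExtent {            // logical extent
  uint64_t logical_offset;  // where the range starts within the object
  uint64_t length;
  uint64_t blob_offset;     // where the range starts within the blob
  std::shared_ptr<Blob> blob;
};

struct OnodeSketch {
  std::map<uint64_t, LExtent> extent_map;  // keyed by logical_offset

  // Translate an object offset to a disk offset; returns false for holes.
  bool to_disk(uint64_t off, uint64_t* disk) const {
    auto it = extent_map.upper_bound(off);
    if (it == extent_map.begin()) return false;
    --it;  // extent starting at or before off
    const LExtent& e = it->second;
    if (off >= e.logical_offset + e.length) return false;
    uint64_t in_blob = e.blob_offset + (off - e.logical_offset);
    for (const PExtent& p : e.blob->pextents) {
      if (in_blob < p.length) {
        *disk = p.disk_offset + in_blob;
        return true;
      }
      in_blob -= p.length;  // skip this physical extent
    }
    return false;
  }
};
```

A blob backed by two physical extents shows why the indirection matters: logically contiguous object data can sit in separate places on disk.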

[BlueStore.cc]
BlueStore::OnodeRef BlueStore::Collection::get_onode(const ghobject_t& oid, bool create)
>> store->db->get(PREFIX_OBJ, key.c_str(), key.size(), &v);

[KeyValueDB.h]
virtual int get(const string &prefix,const char *key, size_t keylen, bufferlist *value)
>> get(prefix, string(key, keylen), value);

[KeyValueDB.h]
virtual int get(const std::string &prefix, const std::string &key, bufferlist *value)
>> get(prefix, ks, &om);

[RocksDBStore.cc]
int RocksDBStore::get(const string &prefix,
    const string &key, bufferlist *out)
>> db->Get(rocksdb::ReadOptions(), rocksdb::Slice(k), &value);

write operation

[BlueStore.cc]
void BlueStore::_kv_sync_thread()
>> db->submit_transaction_sync(synct);

[KeyValueDB.h]
virtual int submit_transaction_sync(Transaction t)
>> submit_transaction(t);

[RocksDBStore.cc]
int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>> db->Write(woptions, &_t->bat);
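
The write path collects changes in a transaction and hands the whole batch to the database in one call. The sketch below imitates that pattern with an in-memory map. `ToyKV` is a hypothetical stand-in and does not use the rocksdb API; it only illustrates that queued ops become visible together at submit time.

```cpp
#include <map>
#include <string>
#include <vector>

// In-memory stand-in for the key-value store; not the rocksdb API.
class ToyKV {
 public:
  class Transaction {
   public:
    void set(const std::string& key, const std::string& value) {
      ops_.push_back(Op{true, key, value});
    }
    void rmkey(const std::string& key) {
      ops_.push_back(Op{false, key, ""});
    }

   private:
    friend class ToyKV;
    struct Op {
      bool is_set;
      std::string key;
      std::string value;
    };
    std::vector<Op> ops_;  // queued ops, applied in order at submit
  };

  // Apply all queued ops in order. Nothing is visible before this call.
  void submit_transaction(const Transaction& t) {
    for (const auto& op : t.ops_) {
      if (op.is_set) {
        data_[op.key] = op.value;
      } else {
        data_.erase(op.key);
      }
    }
  }

  bool get(const std::string& key, std::string* out) const {
    auto it = data_.find(key);
    if (it == data_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::string, std::string> data_;
};
```

A real write batch also has to be durable and atomic across crashes, which this toy version does not attempt to model.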

remove operation

The remove path deletes the specified metadata key-value pair from the database.

[BlueStore.cc]
int BlueStore::_remove(TransContext *txc, CollectionRef& c, OnodeRef &o)
>> _do_remove(txc, c, o);

[BlueStore.cc]
int BlueStore::_do_remove(TransContext *txc,
  CollectionRef& c, OnodeRef o)
>> txc->t->rmkey(PREFIX_OBJ, o->key.c_str(), o->key.size());

[KeyValueDB.h]
virtual void rmkey(const std::string &prefix,   
  const char *k, size_t keylen)
>> rmkey(prefix, string(k, keylen));

[RocksDBStore.cc]
void RocksDBStore::RocksDBTransactionImpl::rmkey(const string &prefix, const string &k)
>> bat.Delete(combine_strings(prefix, k));
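
The last step builds the actual rocksdb key from the prefix and the object key. As far as I recall, RocksDBStore joins the two with a single zero byte; the sketch below assumes that layout, and the separator should be confirmed against RocksDBStore.cc. `split_key` is a hypothetical helper added only to show the inverse.

```cpp
#include <string>

// Join prefix and key with a zero byte (assumed separator).
inline std::string combine_strings(const std::string& prefix,
                                   const std::string& key) {
  std::string out = prefix;
  out.push_back('\0');
  out.append(key);
  return out;
}

// Inverse: split at the first zero byte. Returns false if none is present.
inline bool split_key(const std::string& in, std::string* prefix,
                      std::string* key) {
  const std::string::size_type pos = in.find('\0');
  if (pos == std::string::npos) return false;
  *prefix = in.substr(0, pos);
  *key = in.substr(pos + 1);
  return true;
}
```

Putting the prefix first keeps all keys of one kind (for example, all object metadata under PREFIX_OBJ) adjacent in rocksdb's sorted key space, so a prefix scan only touches that group.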

compact operation

Compaction can be triggered from the client command line through the OSD-related commands (asok_command).

[BlueStore.h]
void compact() override
>> db->compact();

[RocksDBStore.cc]
void RocksDBStore::compact()
>> db->CompactRange(options, nullptr, nullptr);

[references]

  1. "Ceph Design Principles and Implementation", Chapter 2, "The Peak of Performance: The New Object Storage Engine BlueStore"

Origin blog.csdn.net/weixin_43778179/article/details/132669982