ZNS Architecture Implementation: High-Performance Storage Stack Design to Solve Traditional SSD Problems




2.2 Architecture Implementation of ZNS

Let's first look at the overall storage stack for zoned storage, covering SMR HDDs and NVMe SSDs supported by zonefs:

[Figure: zoned storage stack covering SMR HDDs and ZNS NVMe SSDs]

For applications such as Ceph, backend engines like BlueStore or SeaStore manage the raw device directly, so no file system support is required. If needed, data can also be accessed through zonefs, a small file system supported by the kernel.

This small file system is deliberately simple. It exposes each zone as a file, uses LBA 0 to store the superblock, and has no complex inode/dentry metadata management. Creating, deleting, or renaming files on it is not allowed. Writes still follow the zoned-storage model: data is appended at the write pointer (WP); once the file backing a zone is full, no further writes are possible until the zone is reset, which moves the WP back to the start of the zone and makes it writable again.
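For intuition, here is a minimal user-space sketch of that model, assuming a zonefs mount at /mnt/zonefs where sequential zone files appear under seq/; the open flags (zonefs generally requires direct, append-only writes to sequential files) and the 4 KiB alignment are assumptions to check against your kernel's zonefs documentation.

// Minimal sketch: append to a zonefs sequential zone file, then reset the zone.
// Compile with g++ (add -D_GNU_SOURCE if O_DIRECT is not visible).
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  const char *zone_file = "/mnt/zonefs/seq/0";  // hypothetical mount point / zone
  int fd = open(zone_file, O_WRONLY | O_APPEND | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }

  // Block-aligned buffer for direct I/O (4 KiB block size assumed here).
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
  memset(buf, 'A', 4096);

  // Appends land at the zone's write pointer.
  if (write(fd, buf, 4096) < 0) perror("write");

  // Truncating a sequential zone file to 0 resets the underlying zone,
  // moving the write pointer back to the start of the zone.
  if (ftruncate(fd, 0) != 0) perror("ftruncate");

  free(buf);
  close(fd);
  return 0;
}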

zonefs alone is not functional enough for Rocksdb, and Rocksdb really only needs a user-space file backend. Even so, that backend has to be carefully designed: how to pick the best zone when Rocksdb schedules a write, and when to delete sst files (and reset the corresponding zone space).

Later, Hans led the design of Zenfs as a Rocksdb backend to do end-to-end request scheduling, and to explore optimal data placement, reduced write amplification (from the LSM-tree), SSD wear leveling, and shorter read tail latency.

The general structure is as follows:

[Figure: Zenfs as the Rocksdb storage backend on top of a ZNS SSD]

Next, let's take a closer look at some internal implementation features of ZNS and the detailed design and implementation of Zenfs.

2.2.1 Some PRs in the ZNS enablement process

ZNS features require kernel support, so the ZBD (zoned block device) kernel subsystem was developed to provide a generic block-layer access interface. Besides letting the kernel access ZNS devices through ZBD, it also exposes a user-space ioctl API for basic operations, including enumerating the zoned devices in the current environment, reporting existing zone information, and managing a specific zone (such as resetting it).
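As a concrete illustration of that ioctl surface, here is a minimal sketch that reports a few zones and resets one, using the BLKREPORTZONE and BLKRESETZONE ioctls from linux/blkzoned.h; the device path /dev/nvme0n1 and the zone count are placeholders, and the reset assumes the first zone is a sequential-write zone.

#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  int fd = open("/dev/nvme0n1", O_RDWR);  // hypothetical zoned device path
  if (fd < 0) { perror("open"); return 1; }

  // Report the first few zones: a header followed by per-zone descriptors.
  constexpr unsigned kZones = 4;
  size_t len = sizeof(blk_zone_report) + kZones * sizeof(blk_zone);
  auto *report = static_cast<blk_zone_report *>(calloc(1, len));
  report->sector = 0;
  report->nr_zones = kZones;
  if (ioctl(fd, BLKREPORTZONE, report) == 0) {
    for (unsigned i = 0; i < report->nr_zones; i++) {
      const blk_zone &z = report->zones[i];
      printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
             (unsigned long long)z.start, (unsigned long long)z.len,
             (unsigned long long)z.wp, (unsigned)z.cond);
    }
  }

  // Reset the first reported zone: its write pointer returns to the zone start.
  if (report->nr_zones > 0) {
    blk_zone_range range = {report->zones[0].start, report->zones[0].len};
    if (ioctl(fd, BLKRESETZONE, &range) != 0) perror("BLKRESETZONE");
  }

  free(report);
  close(fd);
  return 0;
}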

FIO has built-in support for stress testing ZBDs.

More recently, to make performance evaluation of ZNS SSDs easier, ZBD also exposes the per-zone capacity and the active zone limit.

Zenfs is designed and used as a filesystem backend for rocksdb.

Here is an overview of the corresponding number of modified lines of code:

[Figure: lines of code changed in each component to support ZNS]

Judging from the number of lines of code, it can be said to be very lightweight.

2.2.2 Design and implementation of Zenfs

2.2.2.1 Why Zenfs cares so much about Rocksdb (the LSM-tree architecture)

The ZNS community clearly attaches great importance to Rocksdb: judging by the line counts alone, the effort invested in Rocksdb far exceeds what has gone into other components.

From the principle of LSM-tree, we can see a few points:

LSM-tree writes are append-only, which maps perfectly onto the zone model of ZNS.
LSM-tree compaction also writes a batch of data sequentially and later deletes it in bulk, which matches how ZNS reclaims space (once a zone is Full, rewriting it only requires a reset). With a good configuration, SSD-internal GC and Rocksdb compaction effectively become one and the same mechanism.
The pain points of the LSM-tree on conventional SSDs are also more pronounced. On the read side, the layered structure of the LSM-tree already hurts read latency (a serious long tail), and the SSD's FTL GC makes that tail even less predictable; on the write side, the LSM-tree's prized sequential-write advantage is eroded by frequent FTL GC, causing write jitter and a drop compared to the unloaded level. Under ZNS these pain points can be largely avoided or even eliminated.
Beyond the LSM-tree's own drawbacks on SSDs, Rocksdb has some unique advantages that make it worth the ZNS community's continued investment:

Key/value storage is widely used and well suited to high-speed storage media (NVMe SSDs).
It is open source with an active community that keeps following up on new storage technologies, including io_uring, SPDK, and others.
Its pluggable storage backend design makes implementing an fs backend and porting it very easy (you can see this from how Zenfs is compiled into the Rocksdb code).
2.2.2.2 Zenfs detailed design

Let's first take a look at the overall Zenfs architecture. This figure is the one from the paper, which is more concise:

[Figure: Zenfs architecture overview, as shown in the paper]

Because it serves as the fs backend of Rocksdb and is responsible for interacting with the zoned block device, it has to implement the basic interfaces inherited from the FileSystemWrapper class.
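To make the "inherit from FileSystemWrapper" point concrete, here is a minimal sketch of a pass-through backend; MyZonedFS is a made-up name, and the method signatures follow recent Rocksdb releases, so they may need adjusting for your version (Zenfs itself implements many more of these hooks).

#include <memory>
#include <string>
#include <rocksdb/file_system.h>

// A backend that forwards everything to the default (POSIX) file system but
// intercepts writable-file creation, which is roughly where a zoned backend
// would decide file placement.
class MyZonedFS : public rocksdb::FileSystemWrapper {
 public:
  MyZonedFS() : rocksdb::FileSystemWrapper(rocksdb::FileSystem::Default()) {}

  const char* Name() const override { return "MyZonedFS"; }

  rocksdb::IOStatus NewWritableFile(
      const std::string& fname, const rocksdb::FileOptions& opts,
      std::unique_ptr<rocksdb::FSWritableFile>* result,
      rocksdb::IODebugContext* dbg) override {
    // A real zoned backend (like Zenfs) would pick a zone for `fname` here,
    // based on its lifetime hint, instead of delegating to the POSIX backend.
    return rocksdb::FileSystemWrapper::NewWritableFile(fname, opts, result, dbg);
  }
};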

The main components are as follows:

Journaling and Data.

Zenfs defines two types of zones: journal and data. The ZonedBlockDevice class in the code manages two vectors, meta_zones and io_zones, referred to below as Journal Zones and Data Zones respectively.

Journal Zones hold the file system metadata: they are used to recover a consistent file system state after a crash, to maintain the file system superblock, and to record the mapping of the WAL and data files to zones.

Data Zones mainly store data files such as ssts.

Extents.

A Rocksdb data file is mapped onto, and written through, a collection of extents: std::vector<ZoneExtent*> extents_. An extent is a variable-sized but block-aligned contiguous LBA range; it is written sequentially into a region of one zone together with information identifying the sst it belongs to. The ZoneFile data structure identifies the file and the array of extents belonging to it. One zone can hold multiple extents, but an extent never spans multiple zones.

Extent allocation and release are tracked by an in-memory data structure. Whenever a file, or the data of an extent, is persisted to disk through Fsync/Append, this in-memory structure is also persisted into the journal zone. The in-memory structure keeps track of where extents live; once all files owning extents in a zone have been deleted, that zone can be reset and reused.
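The relationships described above can be summarized with the following simplified structures; the field names loosely follow the Zenfs source but are trimmed and illustrative, not an exact copy.

#include <cstdint>
#include <string>
#include <vector>

struct Zone;  // one physical zone on the device

struct ZoneExtent {
  uint64_t start_;   // block-aligned LBA where the extent begins
  uint64_t length_;  // length in bytes; an extent never crosses a zone
  Zone*    zone_;    // the zone that holds this extent
};

// One Rocksdb file (e.g. an sst or the WAL) as seen by Zenfs.
struct ZoneFile {
  std::string              filename_;
  uint64_t                 file_size_;
  std::vector<ZoneExtent*> extents_;  // in write order; may span several zones
};

// The device keeps the two classes of zones described earlier.
struct ZonedBlockDevice {
  std::vector<Zone*> meta_zones;  // "Journal Zones": superblock + journal
  std::vector<Zone*> io_zones;    // "Data Zones": file data (ssts, WAL, ...)
};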

Superblocks.

The superblock is mainly used to initialize Zenfs, or to restore the Zenfs state from disk after a crash. It identifies the Zenfs instance on the current disk through a unique id, a magic number, and the user options. The unique identifier is a UUID, which lets users recognize the file system on a given disk even if the drive letter changes after a reboot or the device is replugged.

Journal.

The Journal's main job is to maintain the superblock and the mapping of the WAL and sst files to the extents stored in zones.
Journal data lives in the Journal Zones shown in the figure above, meta_zones in the code; these are the first three zones of the storage device (ZENFS_META_ZONES) and are never taken offline. At any moment exactly one of them is active, i.e. writable; otherwise, with all of them closed, metadata updates could not be recorded.

The active zone starts with a header containing: a sequence number (incremented each time a journal zone is initialized), the superblock data structure, and a snapshot of the current journal state. Once the header has been persisted during initialization, the remaining capacity of the zone can start receiving new metadata updates.
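An illustrative layout of what this header carries might look like the following; the field set is inferred from the description above and is not the actual on-disk format.

#include <cstdint>

struct Superblock {        // identifies this Zenfs instance
  uint32_t magic;          // magic number
  uint8_t  uuid[16];       // unique id, stable across reboots/replugging
  // ... user options given at mkfs time (e.g. aux_path, finish_threshold)
};

struct JournalZoneHeader {
  uint32_t   sequence;     // incremented each time a journal zone is initialized;
                           // the zone with the highest sequence is the active one
  Superblock superblock;
  // ... followed by a snapshot of the current journal state, then incremental
  //     journal records appended into the rest of the zone
};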

Initializing a Zenfs instance on a ZNS disk requires executing:

./plugin/zenfs/util/zenfs mkfs --zbd=$DEV --aux_path=/tmp/$AUXPATH --finish_threshold=10 --force

Note: $DEV must be the bare device name (e.g. nvme0n1), not a full path like /dev/nvme0n1; zenfs locates the device in the environment by itself, so no user-specified path is required.
1. Zenfs::MkFS resets all meta zones and creates a zenfs file system on the first meta zone, executing the following:

Write a superblock data structure, with the sequence number initialized and persisted.
Initialize an empty snapshot and persist it.
2. Zenfs::Mount recovers an existing zenfs from the disk, with the following steps:

There are three journal zones. First read the first LBA of each of them to determine each zone's sequence number; the zone with the largest seq is the current active zone (the one holding the newest, most complete metadata).
Read the header of the active zone and initialize the superblock and journal state from it.
Journal updates recorded after the header are then applied on top of the snapshot.

These steps rebuild the complete Zenfs state, after which user writes can continue to be accepted.
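A small, self-contained sketch of the first recovery step, choosing the active journal zone by sequence number, is shown below; the types and names are illustrative rather than the actual Zenfs code.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct JournalZoneHeaderView {
  bool     valid;      // header read back and passed magic/CRC checks
  uint32_t sequence;   // incremented each time a journal zone is initialized
};

// Returns the index of the active journal zone, or nullopt if none is valid.
std::optional<size_t> PickActiveJournalZone(
    const std::vector<JournalZoneHeaderView>& headers) {
  std::optional<size_t> active;
  uint32_t max_seq = 0;
  for (size_t i = 0; i < headers.size(); ++i) {
    if (headers[i].valid && headers[i].sequence >= max_seq) {
      max_seq = headers[i].sequence;
      active = i;
    }
  }
  // Steps 2-3 (not shown): read the full header of the active zone, restore
  // the superblock and snapshot, then apply the journal records recorded
  // after the snapshot to rebuild the file -> extent mappings.
  return active;
}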

During writes, an sst's data is stored as extents that are persisted into zone space, which raises the question: how do we choose a data zone to hold the current file's data? The capacity actually available in each zone keeps changing as data lands in it, and if an sst file is spread across zones, resetting a zone also has to take into account whether that file has been deleted.

Best-Effort Algorithm for Zone Selection

Zenfs uses a best-effort algorithm here to choose the zone that will store Rocksdb's sst files. Rocksdb indicates the expected lifetime of these files by setting different write_hints for the WAL and for ssts at different levels.

Which write_hint is chosen is decided by the following logic:

Env::WriteLifeTimeHint ColumnFamilyData::CalculateSSTWriteHint(int level) {
  if (initial_cf_options_.compaction_style != kCompactionStyleLevel) {
    return Env::WLTH_NOT_SET;
  }
  if (level == 0) {
    return Env::WLTH_MEDIUM;
  }
  int base_level = current_->storage_info()->base_level();

  // L1: medium, L2: long, ...
  if (level - base_level >= 2) {
    return Env::WLTH_EXTREME;
  } else if (level < base_level) {
    // There is no restriction which prevents level passed in to be smaller
    // than base_level.
    return Env::WLTH_MEDIUM;
  }
  return static_cast<Env::WriteLifeTimeHint>(level - base_level +
                            static_cast<int>(Env::WLTH_MEDIUM));
}


In other words, counting from the bottom of the tree, the sst files in the deepest two levels get the longest lifetime hint, level 0 gets WLTH_MEDIUM, and the WAL gets the shortest hint, WLTH_SHORT.
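As a quick worked example of the mapping (assuming base_level is 1, which is the common case):

// Worked example of CalculateSSTWriteHint above, assuming base_level == 1
// (i.e. L0 compacts directly into L1):
//   L0             -> WLTH_MEDIUM   (special-cased before base_level is used)
//   L1 (base)      -> WLTH_MEDIUM   (1 - 1 + WLTH_MEDIUM)
//   L2             -> WLTH_LONG     (2 - 1 + WLTH_MEDIUM)
//   L3 and deeper  -> WLTH_EXTREME  (level - base_level >= 2)
// The WAL itself is assigned WLTH_SHORT elsewhere in Rocksdb.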

Back to how Zenfs selects a zone: the general idea is to store a file in a zone whose lifetime is as close as possible to the file's life_time, so that with high probability the entire zone can be reset as a whole:

(1) For a brand-new write, allocate a new zone directly.

(2) Prefer allocating from the active zones: if a suitable zone is found, use it directly to store the current file. A zone is suitable if the lifetime of the current file is shorter than that of the oldest data in the zone; if several active zones qualify, pick the one whose lifetime is closest to the file's.

(3) If no suitable active zone is found, allocate a new zone. Allocating a new zone also means checking whether the number of active zones has reached max_nr_active_io_zones_; if it has, an active zone must be finished first before the new zone can be allocated.

The logic is as follows:

Zone *ZonedBlockDevice::AllocateZone(Env::WriteLifeTimeHint file_lifetime) {
  ...
  // Best-effort logic: scan the open data zones for the best lifetime match.
  for (const auto z : io_zones) {
    if ((!z->open_for_write_) && (z->used_capacity_ > 0) && !z->IsFull()) {
      // Compare the zone's lifetime with the file's lifetime (its write_hint).
      // The zone qualifies if the file's lifetime is the smaller of the two.
      unsigned int diff = GetLifeTimeDiff(z->lifetime_, file_lifetime);
      if (diff <= best_diff) {
        allocated_zone = z;
        best_diff = diff;
      }
    }
  }
  ...
  // If no suitable zone was found for the current file, allocate a new one.
  if (best_diff >= LIFETIME_DIFF_NOT_GOOD) {
    /* If we at the active io zone limit, finish an open zone(if available) with
     * least capacity left */
    if (active_io_zones_.load() == max_nr_active_io_zones_ &&
        finish_victim != nullptr) {
      s = finish_victim->Finish();
      if (!s.ok()) {
        Debug(logger_, "Failed finishing zone");
      }
      active_io_zones_--;
    }

    if (active_io_zones_.load() < max_nr_active_io_zones_) {
      for (const auto z : io_zones) {
        if ((!z->open_for_write_) && z->IsEmpty()) {
          z->lifetime_ = file_lifetime;
          allocated_zone = z;
          active_io_zones_++;
          new_zone = 1;
          break;
        }
      }
    }
  }
  ...
}
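For reference, the lifetime comparison used above behaves roughly as follows; this is a simplified sketch with an illustrative constant, not the exact Zenfs source.

#include <rocksdb/env.h>

static constexpr unsigned int LIFETIME_DIFF_NOT_GOOD = 100;  // illustrative value

unsigned int GetLifeTimeDiffSketch(rocksdb::Env::WriteLifeTimeHint zone_lifetime,
                                   rocksdb::Env::WriteLifeTimeHint file_lifetime) {
  // The new file should not live longer than the data already in the zone,
  // otherwise it would delay the zone's reset; the closer the two lifetimes,
  // the better the match.
  if (zone_lifetime > file_lifetime)
    return static_cast<unsigned int>(zone_lifetime) -
           static_cast<unsigned int>(file_lifetime);
  return LIFETIME_DIFF_NOT_GOOD;
}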


Once AllocateZone has returned, the extent of the current file within the allocated zone (essentially an offset and a length) can be recorded, and the actual file data is written through IOStatus Zone::Append.

Overall, Zenfs uses the best-effort algorithm to place data files according to the write_hint_ that Rocksdb assigns them, matching each file to a zone with a similar lifetime; this speeds up the reclamation of expired zones and greatly reduces space amplification on ZNS. According to the data in the paper, space amplification stays at around 10%, which, for the whole LSM-tree plus SSD stack, is already remarkable if the numbers hold.

Of course, to reproduce such results you need to tune Rocksdb's configuration. Running the script shipped with Zenfs, ./zenfs/tests/get_good_db_bench_params_for_zenfs.sh nvme2n1, produces the officially recommended configuration; in particular, it is recommended to align target_file_size with the configured zone size.
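As an illustration of the kind of alignment that script aims for, a sketch of setting the relevant Rocksdb options might look like this; the option names are standard Rocksdb ones, but the helper function and the direct-I/O choices are assumptions, so take the script's output as authoritative.

#include <cstdint>
#include <rocksdb/options.h>

rocksdb::Options MakeZnsFriendlyOptions(uint64_t zone_capacity_bytes) {
  rocksdb::Options options;
  // Keep one sst roughly the size of one zone, so deleting the sst frees a
  // whole zone and it can be reset immediately.
  options.target_file_size_base = zone_capacity_bytes;
  // Direct I/O is commonly used with zoned backends to avoid page-cache effects.
  options.use_direct_io_for_flush_and_compaction = true;
  options.use_direct_reads = true;
  return options;
}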

Zenfs also enforces an active-zone limit: as seen in AllocateZone, if the number of active zones has already reached max_nr_active_io_zones_ when a new zone is needed, a previously opened zone must be finished first, reflecting the device's limit on the number of simultaneously active zones. Zenfs has benchmarked this limit and found that write performance suffers when fewer than 6 active zones are available, while beyond about 12 active zones, adding more has little further impact on write performance.

————————————————
Copyright statement: This article was originally written by the CSDN blogger "z_stand" and is licensed under CC 4.0 BY-SA; please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/Z_Stand/article/details/120933188

