CacheLib principles explained

Introduction to CacheLib

CacheLib is a C++ library open-sourced by Facebook for accessing and managing cached data. It provides a thread-safe API that lets developers build and customize scalable, concurrent caches.

Its main features:

  1. Implements a hybrid cache across DRAM and NVM: items evicted from DRAM can be persisted to NVM, and an item stored on NVM is promoted back into DRAM when a lookup hits it, transparently to the user.
  2. Supports cache persistence, so cached data is not lost when the process restarts.
  3. Caches variable-sized objects.
  4. Provides zero-copy concurrent access.
  5. Provides a rich set of cache algorithms such as LRU, segmented LRU, FIFO, 2Q, and TTL.
  6. Enforces a hard RSS memory limit to avoid OOM.
  7. Automatically and intelligently tunes the cache to dynamically changing workloads.

Let's pick a few of the most important features above and look at them in more detail.

CacheLib hybrid cache

CacheLib supports multiple hardware media through its hybrid cache feature. To make this transparent to users, it exposes a single, unified CacheAllocator API, so users never deal with the underlying disk reads and writes. CacheLib uses SSDs as the NVM medium for the hybrid cache, and Navy is its SSD-optimized caching engine. To interact with Navy, CacheAllocator's NVM operations are wrapped in NvmCache. CacheAllocator fully delegates lookup, insertion, and deletion on Navy to NvmCache and guarantees thread safety; NvmCache handles the conversion between Items and the NVM format. CacheAllocator has no global key-level lock to synchronize DRAM and NVM, so NvmCache uses optimistic concurrency-control primitives to guarantee data correctness.

Prerequisite knowledge:

Cache features supported by CacheLib

CacheTrait is a combination of MMType, AccessType and AccessTypeLock.

MMType: the memory-management type, which controls the life cycle of cache items (add/remove/evict/recordAccess). Types include MMLru, MM2Q, and MMTinyLFU.

AccessType: the access-control type, which controls how cache items are accessed (find/insert/remove). Types include ChainedHashTable.

AccessTypeLock: the lock type of the access container, supporting multiple locking primitives, including SharedMutexBuckets and SpinBuckets.
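
To illustrate how the three types compose, here is a sketch modeled on CacheLib's trait definitions. The stub structs stand in for the real internal types, and the member names follow this article's terminology rather than the exact library source:

// Illustrative stand-ins for CacheLib's internal policy types.
struct MMLru {};
struct ChainedHashTable {};
struct SharedMutexBuckets {};

// A cache trait simply bundles the three policy types together
// (a sketch, not the library's real definition).
struct LruCacheTrait {
  using MMType = MMLru;
  using AccessType = ChainedHashTable;
  using AccessTypeLock = SharedMutexBuckets;
};

// The allocator is then parameterized by the trait, e.g.:
// using LruAllocator = CacheAllocator<LruCacheTrait>;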

RefcountWithFlags

Records the reference count and state flags of a cache item:

|--------18 bits----------|--3 bits---|------11 bits------|
┌─────────────────────────┬───────────┬───────────────────┐
│         Access Ref      │ Admin Ref │       Flags       │
└─────────────────────────┴───────────┴───────────────────┘

Access Ref: records the reference count of the cache item.

Admin Ref: records which component currently has administrative ownership of the cache item, e.g. kLinked (MMContainer), kAccessible (AccessContainer), kMoving.

Flags: indicate the current state of the cache item, e.g. kIsChainedItem, kHasChainedItem, kNvmClean, kNvmEvicted, kUnevictable_NOOP.
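
A minimal sketch of how such an 18/3/11 layout can be packed into a single 32-bit atomic. The bit positions and operations here are illustrative; the real RefcountWithFlags is considerably more involved:

#include <atomic>
#include <cstdint>

class RefcountWithFlagsSketch {
 public:
  static constexpr uint32_t kAccessRefMask = (1u << 18) - 1; // 18-bit access ref
  static constexpr uint32_t kLinked     = 1u << 18;          // admin-ref bits
  static constexpr uint32_t kAccessible = 1u << 19;
  static constexpr uint32_t kMoving     = 1u << 20;
  static constexpr uint32_t kNvmClean   = 1u << 21;          // one of the 11 flag bits

  void incRef() { bits_.fetch_add(1, std::memory_order_acq_rel); }
  void decRef() { bits_.fetch_sub(1, std::memory_order_acq_rel); }
  uint32_t accessRef() const {
    return bits_.load(std::memory_order_acquire) & kAccessRefMask;
  }
  void setFlag(uint32_t flag) { bits_.fetch_or(flag, std::memory_order_acq_rel); }
  bool hasFlag(uint32_t flag) const {
    return bits_.load(std::memory_order_acquire) & flag;
  }

 private:
  std::atomic<uint32_t> bits_{0};
};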

Hybrid cache initialization

A Cache instance is created by instantiating CacheAllocator. There are two startup modes: one without cache persistence and one with it. The persistent mode takes one extra constructor argument, SharedMemNewT or SharedMemAttachT, compared with the non-persistent mode.

SharedMemNewT attaches a shmManager to the Cache and uses POSIX shared memory to persist it. Two files are generally created: metadata and NvmCacheState (see ShmManager.h and NvmCacheState.h for details). metadata stores memory-related meta information, including shm_info, shm_cache, shm_hash_table, and shm_chained_alloc_hash_table. NvmCacheState stores NVM state information; the NVM meta information itself is already recorded in the cache file and needs no extra persistence.

SharedMemAttachT tries to attach to the persisted metadata at startup. On success the existing cache can be reused; on failure it throws an exception, so the typical usage looks like this:

try {
    // Try to attach to a cache persisted in shared memory by a previous run.
    cache = std::make_unique<Cache>(Cache::SharedMemAttach, config);
} catch (const std::exception& ex) {
    // Attach failed: destroy the partial instance before creating a new cache.
    cache.reset();
    std::cout << "Couldn't attach to cache: " << ex.what() << std::endl;
    cache = std::make_unique<Cache>(Cache::SharedMemNew, config);
    default_pool = cache->addPool("default", 1024 * 1024 * 1024);
}

Item allocation and eviction

When data is written to CacheLib, memory is first allocated in DRAM and the data is written there; once written, the item is handed over to the MMContainer (the cache-management container):

insertInMMContainer(*(handle.getInternal()));
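
For reference, a minimal sketch of this DRAM write path using CacheAllocator's public allocate/insertOrReplace API, continuing the cache and default_pool variables from the initialization snippet above:

#include <cstring>
#include <string>

std::string data{"some value"};
auto handle = cache->allocate(default_pool, "key1", data.size());
if (handle) {
  // Copy the payload into the allocated item, then publish it;
  // insertOrReplace links it into the MMContainer and AccessContainer.
  std::memcpy(handle->getMemory(), data.data(), data.size());
  cache->insertOrReplace(handle);
}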

When an Item meets the eviction policy's criteria and must be evicted, there are two paths: it is either dropped from CacheLib entirely, or evicted from DRAM and demoted to NVM.

NVM supports a customizable admission policy to ensure consistently good performance. For example, during a sudden surge of kv writes, a percentage of write requests can be rejected. This protects the NVM device; a key rejected this way is simply evicted from CacheLib and never written to NVM.

If an Item is not rejected by the admission policy, it is written to NVM; and if NVM space is full, NVM's own eviction mechanism is triggered as well.
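
As a sketch of the percentage-based rejection idea described above (purely illustrative; the class name and interface here are made up and are not CacheLib's actual admission-policy API):

#include <random>

class RandomAdmissionPolicy {
  double admitProbability_; // e.g. 0.9 admits ~90% of NVM writes
  std::mt19937 rng_{std::random_device{}()};
  std::uniform_real_distribution<double> dist_{0.0, 1.0};

 public:
  explicit RandomAdmissionPolicy(double p) : admitProbability_(p) {}

  // Returns true if this write should be admitted to NVM;
  // rejected items are simply dropped instead of written.
  bool accept() { return dist_(rng_) < admitProbability_; }
};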

This allocation path is also the core logic of insert. The counterpart of insert is insertOrReplace: insert fails when the key already exists, while insertOrReplace directly replaces the existing key. Concretely, the NVM copy is marked for deletion around the write; see the createDeleteTombStone function for details. A consequence is that data in NVM is only marked for deletion: it is not cleared immediately, but deleted asynchronously later via removeAsync.


Hybrid cache lookup

A key lookup first searches DRAM; CacheLib uses the function below for this. When the key is found, CacheLib checks whether it has expired: an expired key is marked as expired and the lookup returns not-found.

auto handle = findFastImpl(key, mode);

If the key is not in DRAM, CacheLib tries to find it on NVM. find has both synchronous and asynchronous implementations, but NVM lookups are generally asynchronous, which performs better.

The NVM find path is a bit involved. First, to avoid expensive disk lookups, the enableFastNegativeLookups option adds a check that can determine up front that an Item definitely does not exist on NVM; only when existence is uncertain does the lookup actually go to NVM. Under the hood, this is a Bloom filter.
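
A toy Bloom filter showing why these negative lookups are cheap (a sketch only; Navy's actual filter implementation differs): if any probed bit is clear, the key is definitely absent from NVM and the disk read can be skipped.

#include <array>
#include <cstdint>
#include <functional>
#include <string>

class BloomSketch {
  std::array<uint64_t, 1024> bits_{}; // 64 Kbit filter

  size_t probe(const std::string& key, uint64_t seed) const {
    return std::hash<std::string>{}(key + std::to_string(seed)) % (bits_.size() * 64);
  }

 public:
  void add(const std::string& key) {
    for (uint64_t s = 0; s < 4; ++s) {
      size_t i = probe(key, s);
      bits_[i / 64] |= (1ull << (i % 64));
    }
  }

  bool mayContain(const std::string& key) const {
    for (uint64_t s = 0; s < 4; ++s) {
      size_t i = probe(key, s);
      if (!(bits_[i / 64] & (1ull << (i % 64)))) return false; // definite miss
    }
    return true; // possibly present: fall through to the NVM lookup
  }
};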

Because memory-to-disk interaction is slow, eviction from DRAM to disk (the NVM write) takes relatively long. An asynchronous lookup path was designed for this: the entire lookup is tracked through a GetContext, and a callback notifies the user when the lookup completes.

During lookups there may be many concurrent find requests. For performance, CacheAllocator does not use real locks; it uses an optimistic mechanism to keep the data in DRAM and NVM consistent. NvmCache implements this by maintaining a PutTokens structure: each key being written to NVM holds its own putToken, and the NVM write is performed only while the putToken is valid, which controls the key's visibility.

For example, suppose thread 1 issues a find for key1 while thread 2 is executing key1's NVM write, so key1 holds a valid putToken. Thread 1's find promotes key1 under the LRU rule, so key1 no longer needs to be evicted from DRAM and no longer needs to be written to NVM; its putToken is therefore invalidated, and when thread 2 finds the putToken invalid it abandons the write. This preserves sequential consistency for NVM lookups. A putToken is created only when writing to NVM and is invalidated by operations such as find and eviction; there is no complicated state change.
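
A minimal sketch of the putToken idea (illustrative only; these are not NvmCache's actual types):

#include <atomic>

// A token is created when a key is queued for an NVM write; any
// concurrent find/eviction on that key invalidates it, causing the
// pending write to abort.
struct PutToken {
  std::atomic<bool> valid{true};
  void invalidate() { valid.store(false, std::memory_order_release); }
  bool isValid() const { return valid.load(std::memory_order_acquire); }
};

// Reader thread (find hit in DRAM):  token.invalidate();
// Writer thread, just before the write:
//   if (token.isValid()) { /* perform the NVM write */ }
//   else                 { /* abandon the write */ }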

After the NVM lookup completes, the Item is not returned to the user directly; it is first written into DRAM, following the same allocation logic. At this point the nvmClean flag is set to true to indicate that the kv in DRAM is consistent with the one in NVM: if the Item is evicted again, it does not need to be rewritten to NVM, because the key is already there. If the value for the key is later modified, nvmClean is set to false to mark the copies as inconsistent, and the stale data in NVM is cleaned up by a background asynchronous thread.


Navy Implementation Principle

Navy is an SSD-optimized caching engine. Features of Navy:

  1. Efficiently caches billions of small objects (<1KB) and millions of large objects (1KB–16MB) on SSD.
  2. Efficient point lookups.
  3. Low DRAM overhead.

Since Navy is designed for caching, the implementation chooses to sacrifice durability. And because Navy's workload is write-heavy, NVM write endurance becomes a concern, so Navy is also optimized for write endurance.

Navy overall structure

Navy provides users with an asynchronous API. It uses a Small Item Engine optimized for small objects and a Large Item Engine for large objects; each engine is designed to keep DRAM overhead low without compromising read efficiency. Underneath, Navy abstracts the block device, and both engines run on top of that block-device abstraction.


Engine

Navy implements the Small Item Engine and the Large Item Engine for different Item sizes; their concrete implementations are BigHash and BlockCache, respectively. An Item's size controls whether it is stored in the Small Item Engine or the Large Item Engine. Lookups query the Large Item Engine first and fall back to the Small Item Engine if the key is not found there. The Small Item Engine generally stores items under 1KB; the bulk of cached data typically lives in the Large Item Engine, and the Small Item Engine can be disabled via configuration. A lookup aborted by an exception is retried.
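
A sketch of that size-based routing; the 1KB threshold here is illustrative (CacheLib makes it configurable):

#include <cstddef>

enum class EngineKind { SmallItem /* BigHash */, LargeItem /* BlockCache */ };

constexpr std::size_t kSmallItemMaxSize = 1024; // ~1KB, configurable

// Route an item to an engine based on its size.
EngineKind pickEngine(std::size_t itemSize) {
  return itemSize <= kSmallItemMaxSize ? EngineKind::SmallItem
                                       : EngineKind::LargeItem;
}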


BigHash is mainly used for small objects, so I won't cover it in detail for now.

BlockCache

The concrete implementation of the Large Item Engine is BlockCache. Its structure is as follows:

[Figure: BlockCache structure]

Main read and write logic

  1. Based on the SSD configuration parameters deviceMaxWriteSize and blockCacheRegionSize (default 16MB), the Device is divided into multiple Regions, which are tracked in a CleanList.

  2. Write process:

    Select a Region; if it is not full, append the Entry to it. If the current Region is full, take a new Region from the CleanList. If the CleanList is empty, wait for eviction and GC to free a Region. Finally, update the IndexMap.

  3. Query process:

    Use the key to find the RegionId and offset in the IndexMap, locate the Entry from them, then parse and read the data (see the sketch after this list).

  4. Delete process:

    The key is only removed from the IndexMap; the Entry itself is marked for deletion and reclaimed asynchronously by GC.

  5. Eviction process:

    Eviction follows the configured policy: FIFO, LRU, or LRU2Q. Eviction operates at Region granularity; if that granularity is too coarse, reinsertion can be configured to reinsert entries based on hit rate or access frequency so they are preserved.
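
A sketch of the query path described in step 3: an in-memory index maps a key hash to (RegionId, offset), from which the on-disk Entry can be located and read. Field widths and names are illustrative, not BlockCache's actual layout:

#include <cstdint>
#include <optional>
#include <unordered_map>

struct IndexEntry {
  uint32_t regionId;
  uint32_t offset; // byte offset of the Entry within its Region
};

class IndexMapSketch {
  std::unordered_map<uint64_t, IndexEntry> map_;

 public:
  void insert(uint64_t keyHash, IndexEntry e) { map_[keyHash] = e; }

  std::optional<IndexEntry> lookup(uint64_t keyHash) const {
    auto it = map_.find(keyHash);
    if (it == map_.end()) return std::nullopt;
    return it->second; // caller reads the Entry from (regionId, offset)
  }

  // Delete: only the index entry goes away; the on-disk Entry is GC'd later.
  void remove(uint64_t keyHash) { map_.erase(keyHash); }
};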

Job scheduling

The jobs here fall into four categories:

JobType::Read: read jobs, corresponding to Navy lookups;

JobType::Write: write jobs, corresponding to Navy writes; new insertions and deletions are both implemented as insertions;

JobType::Reclaim: performs internal eviction;

JobType::Flush: performs any internal asynchronous buffered writes.

Jobs are submitted in strict order to guarantee sequential consistency of writes and queries and to avoid multiple simultaneous operations on the same key. For concurrent requests, Navy still relies on NvmCache's optimistic locking mechanism to keep the data in DRAM and NVM consistent.

JobScheduler provides two thread pools, one for reads and one for writes. Jobs are assigned to a pool based on their type, and to a thread within the pool based on the key. Each thread maintains a JobQueue, which is generally first-in first-out, but JobType::Reclaim and JobType::Flush have higher priority and are executed first.

To exploit the thread pool while preserving sequential consistency, a shard-ordering mechanism is introduced. The number of shards is set via a configuration parameter, generally in the millions. Each submitted Job is assigned to a shard by key hash, so the same key always lands in the same shard, which preserves the processing order for that key. Dispatch scans the shards starting from the first; normally only one job is taken per shard at a time, but if consecutive jobs have different, non-conflicting keys, several can be submitted to a thread's JobQueue at once. The scan wraps around at the end and keeps cycling to drain the Jobs.
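
A simplified sketch of the shard-ordering idea (the shard count and dispatch loop are reduced to their essentials; the real scheduler hands jobs to reader/writer threads):

#include <deque>
#include <functional>
#include <string>
#include <vector>

using Job = std::function<void()>;

class ShardedQueues {
  std::vector<std::deque<Job>> shards_;

 public:
  explicit ShardedQueues(size_t n) : shards_(n) {}

  // Jobs for the same key always hash to the same shard,
  // so per-key order is preserved.
  void enqueue(const std::string& key, Job job) {
    size_t shard = std::hash<std::string>{}(key) % shards_.size();
    shards_[shard].push_back(std::move(job));
  }

  // One dispatch pass: take at most one job per shard, then wrap around.
  void dispatchOnce() {
    for (auto& q : shards_) {
      if (!q.empty()) {
        Job j = std::move(q.front());
        q.pop_front();
        j(); // in Navy this would be handed to a reader/writer thread
      }
    }
  }
};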


That concludes this introduction to the principles of CacheLib's hybrid cache.

You're welcome to add me on WeChat (xiedeyantu) to discuss technical issues.
