Introduction and use of Facebook flashcache

1. Introduction

Traditional HDDs offer large capacity but relatively low performance, especially for random IO, which often becomes the system bottleneck. This is even more pronounced in virtual machine environments, because virtualization tends to turn guest IO into random IO. Compared with HDDs, SSDs offer much higher performance, especially for random IO, but at a much higher hardware cost. The industry has therefore produced several schemes that combine the large capacity of HDDs with the high performance of SSDs; the basic idea is to use the SSD as a cache for the HDD. Caching is everywhere in computing: the CPU's L1 and L2 caches, RAID card caches, the TLB, and so on. The main SSD-as-HDD-cache implementations are Linux bcache, Linux dm-cache, Facebook flashcache, btier, IBM flashcache, etc. This article introduces Facebook flashcache, which is maintained by Facebook and has not been merged into the mainline kernel.

2. Principle introduction

flashcache is built on the Linux device mapper: it creates a logical mapped device on top of the SSD and the backing HDD, and users access storage through that mapped device. flashcache manages the cache (the SSD) as a set-associative hash. As shown in the figure below, each square is a cache block (4KB by default), and each block has a piece of per-block metadata, stored on flash, that records the sector number of the backing device cached in that block and the block's state (DIRTY, VALID, INVALID). Each row is a cache set, 512 blocks by default. The total cache size, block size, and set size can all be specified when the flashcache is created; the number of sets follows from the cache size divided by the block size times the set size.
flashcache.png

When sector N is accessed, its cache set is (N / block size / set size) mod (number of sets), where the number of sets is n in the figure above. Once the set is found, flashcache looks up the corresponding block within that set. This scheme maps a range of sequential disk blocks into the same set.
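As a sanity check, the set-index formula can be reproduced with shell arithmetic. The geometry and sector number below are hypothetical example values, not taken from a real cache:

```shell
# Hypothetical geometry: 4KB blocks (8 sectors each), 512 blocks per set, 16 sets.
BLOCK_SECTORS=8
SET_SIZE=512
NUM_SETS=16

# Arbitrary example target sector N.
N=123456789

# (N / block size / set size) mod (number of sets)
SET_INDEX=$(( N / BLOCK_SECTORS / SET_SIZE % NUM_SETS ))
echo "sector $N maps to cache set $SET_INDEX"
```

Note that consecutive sectors share the same quotient until a whole set's worth of blocks has passed, which is how a run of sequential disk blocks lands in one set.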

flashcache supports three cache modes:
writethrough: a disk write keeps a copy in the cache but also writes the data to the backing disk, and does not complete until the backing-disk write completes.
writearound: a disk write bypasses the cache and goes directly to the backing disk; a disk read caches the data it fetches from the backing disk.
For both writethrough and writearound, a disk read first locates the cache set for the target sector and then looks for a matching block. On a hit, the data is read directly from the cache; on a miss, it is read from the backing disk and also placed in the cache.
writeback: a write goes to the cache first and updates the dirty bit in the metadata; the data is not synchronized to the backing disk immediately.

In addition, flashcache keeps a cache superblock on flash. It records the parameters passed at flashcache create time (so they can be reused on reload) and whether the cache was shut down cleanly; a normal shutdown is clean, while a crash or power failure is not. After a dirty block is cleaned, the dirty bit in its metadata is cleared. On a clean shutdown, the metadata of every cache block is flushed to flash, so when flashcache is reloaded after the next boot, both VALID and DIRTY blocks are still present on the SSD. After a crash or power failure, only DIRTY blocks survive the reload. A crash or power failure therefore causes no data loss; only VALID, non-dirty cached blocks are dropped.

Cache-block metadata updates may be merged: if several pending writes update the same piece of cache-block metadata, flashcache combines them into a single metadata write.

Dirty blocks are written back to the backing disk lazily, in the background. This is controlled by the dirty threshold: flashcache tries to keep the dirty percentage of each cache set below the threshold, and when a set exceeds it, flashcache picks some dirty blocks in that set according to policy and synchronizes them to the backing disk. There is also an idle-clean mechanism: a dirty block whose idle time (time since it was last read or written) exceeds a configurable value (dev.flashcache.<cachename>.fallow_delay) is cleaned; setting fallow_delay to 0 disables idle cleaning.
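Both knobs are ordinary sysctls. A sketch of adjusting them, assuming a cache named `cachedev` (substitute your own cache name; the values are illustrative):

```shell
# Keep each cache set at most 20% dirty.
sysctl -w dev.flashcache.cachedev.dirty_thresh_pct=20

# Clean dirty blocks that have been idle for 15 minutes; 0 disables idle cleaning.
sysctl -w dev.flashcache.cachedev.fallow_delay=900
```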

When dirty blocks in a set are cleaned, flashcache walks the blocks of that set, sorts them, merges contiguous ones into larger IOs, and synchronizes them to the backing disk.

One thing to note: the device mapper splits the IO it receives on block-size boundaries before submitting it to flashcache. If an IO is smaller than the block size, flashcache does not cache it. Instead, it first checks whether the cache holds any overlapping dirty data; if so, it flushes that dirty data first and then writes the IO from the device mapper to the backing disk; if not, it writes the IO directly to the backing disk. This is why flashcache shows little benefit in fio tests with random IO smaller than 4KB.

3. How to use

【Get the flashcache source】
https://github.com/facebook/flashcache
The repository has many versions; for CentOS 6.x, download flashcache-stable-v3.1.3 or flashcache-3.1.2.

【Compile and install】
make && make install

【Create the flashcache】
flashcache_create [-v] -p back|around|thru [-s cache size] [-b block size] cachedevname ssd_devname disk_devname
-v : verbose.
-p : cache mode (writeback/writethrough/writearound).
-s : cache size. Optional. If this is not specified, the entire ssd device
is used as cache. The default units is sectors. But you can specify
k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
The default units is sectors. But you can specify k as units as well.
(A 4KB blocksize is the correct choice for the vast majority of
applications. But see the section "Cache Blocksize selection" below).
-f : force create. Bypass checks (eg for ssd sectorsize).
Examples :
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB writeback cache volume with a 4KB block size on ssd
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device
created is "cachedev".
The above is taken from: https://github.com/facebook/flashcache/blob/3.1.2/doc/flashcache-sa-guide.txt

【Format and mount】
The device created by flashcache_create is an ordinary device-mapper logical device: you can use it like any disk, partition it, and create a file system on it.
mkfs.ext4 /dev/mapper/cachedev
mount -t ext4 /dev/mapper/cachedev [mount point]
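Once the mapped device exists, device-mapper's own tooling can inspect it; the exact fields in the output vary by flashcache version:

```shell
# List device-mapper devices; the flashcache volume appears alongside any LVM volumes.
dmsetup ls

# Show the flashcache target line (ssd/disk devices, block and set geometry).
dmsetup table cachedev

# Dump cache statistics: reads, writes, hits, dirty-block counts, etc.
dmsetup status cachedev
```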

【Tuning and configuration】
sysctls common for all cache modes:
dev.flashcache.<cachename>.cache_all: takes 0 (cache nothing) or 1 (cache everything); the default is cache everything. When caching everything, a process-ID blacklist can exclude the IO of specific processes from the cache; when caching nothing, a process-ID whitelist can select the processes whose IO is cached. Note that the blacklist only affects O_DIRECT IO: buffered IO is written out in the background by pdflush and kswapd, and flashcache still caches those writes, so the blacklist has no effect on them. The current white and black lists can be viewed with cat /proc/flashcache/<cachename>/flashcache_pidlists. The lists are modified with ioctl(/dev/mapper/cachedev, CMD, pidlist); the commands are as follows:
FLASHCACHEADDBLACKLIST: Add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used for cleanup if a process dies.
FLASHCACHEADDWHITELIST: Add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used for cleanup if a process dies.

dev.flashcache.<cachename>.reclaim_policy: FIFO (0) vs LRU (1). The default is FIFO, and it can be switched at runtime. (A separate sysctl, dev.flashcache.<cachename>.zero_stats, zeroes the statistics counters.)

dev.flashcache.<cachename>.io_latency_hist: Compute a histogram of IO latencies, viewable via dmsetup status, with 250us granularity. It is disabled by default because flashcache uses gettimeofday() to measure latency, and with some clock sources that overhead is too high; this feature is rarely enabled.
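If you do enable it, a sketch (cache name `cachedev` assumed):

```shell
# Enable the 250us-granularity IO latency histogram (off by default).
sysctl -w dev.flashcache.cachedev.io_latency_hist=1

# The histogram is reported as part of the device-mapper status output.
dmsetup status cachedev
```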

dev.flashcache.<cachename>.max_pids: The maximum number of pids allowed in the white/black lists.

dev.flashcache.<cachename>.do_pid_expiry: Enable/disable expiry of pids in the white/black lists.

dev.flashcache.<cachename>.pid_expiry_secs: The expiry time of pids in the white/black lists.

dev.flashcache.<cachename>.skip_seq_thresh_kb: Skip caching of large sequential IO: sequential runs larger than this many KB are not cached. When the threshold is 0 (the default), all IO is cached, sequential or random. Whether an IO is sequential is judged relative to the most recent IO.
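For example, to leave sequential runs larger than 256KB uncached (cache name `cachedev` assumed, threshold illustrative):

```shell
# Sequential IO runs larger than 256KB bypass the cache; random IO is still cached.
sysctl -w dev.flashcache.cachedev.skip_seq_thresh_kb=256

# Verify the setting.
sysctl dev.flashcache.cachedev.skip_seq_thresh_kb
```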

sysctls for writeback mode only:
dev.flashcache.<cachename>.fallow_delay: The idle-clean delay, in seconds. When a cache block has not been accessed for longer than this, cleaning of the dirty blocks in its set is triggered. The default is 900 seconds; 0 disables idle cleaning.

dev.flashcache.<cachename>.fallow_clean_speed: Throttles idle cleaning: the maximum number of dirty blocks cleaned per cache set per second. The default is 2.

dev.flashcache.<cachename>.fast_remove: Controls whether dirty blocks are synchronized to the backing disk when the cache is removed: 0 means synchronize (the default), 1 means do not. If they are not synchronized, both dirty and valid blocks remain in the cache and reappear on reload. This option enables fast cache removal.

dev.flashcache.<cachename>.dirty_thresh_pct: The dirty-block threshold. flashcache tries to keep the fraction of dirty blocks in each cache set below this percentage; once a set reaches it, some of its dirty blocks are synchronized to the backing disk. A lower value triggers more writes to the backing disk but keeps more of the cache clean.

dev.flashcache.<cachename>.stop_sync: Stop an ongoing sync operation.

dev.flashcache.<cachename>.do_sync: Trigger synchronization of all dirty blocks to the backing disk.
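Both are trigger-style sysctls: writing 1 starts (or stops) the action rather than storing a value. A sketch, with the cache name assumed to be `cachedev`:

```shell
# Kick off cleaning of every dirty block to the backing disk.
sysctl -w dev.flashcache.cachedev.do_sync=1

# Watch progress via the dirty-block counters.
dmsetup status cachedev

# Abort the sync if needed.
sysctl -w dev.flashcache.cachedev.stop_sync=1
```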

[Notes]
1) Before creating the flashcache, make sure the backing HDD is not mounted, because once the flashcache is created the previous data on the backing HDD can no longer be accessed; the newly created mapped device is formatted with a fresh file system.

2) It is best to run dmsetup remove cachedev before shutdown, which synchronizes all dirty blocks to the backing disk. For writeback mode, run flashcache_load [SSD device] at boot to restore the previous flashcache configuration and ensure no data is lost. Both can be done with an init script under /etc/init.d/; see "Using Flashcache sysVinit script" in https://github.com/facebook/flashcache/blob/3.1.2/doc/flashcache-doc.txt.
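The shutdown/boot sequence described above amounts to the following pair of steps (device names follow the earlier upstream example and are assumptions):

```shell
# At shutdown: flush dirty data and tear down the mapped device.
# (dmsetup remove triggers a full writeback unless fast_remove=1.)
sync
dmsetup remove cachedev

# At boot (writeback mode): re-attach the cache from its superblock on the SSD.
# flashcache_load reads the geometry recorded at create time.
flashcache_load /dev/sdc cachedev
```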

3) When the IO that the device mapper submits to flashcache is smaller than the cache block size, flashcache does not cache it.

4) For sequential IO, whether flashcache helps depends on the sequential performance of your HDD. If the HDD's sequential performance is better than the SSD's, configure dev.flashcache.<cachename>.skip_seq_thresh_kb=[256, 512, 1024, ...] so that qualifying sequential IO skips the cache. For details, see the description of skip_seq_thresh_kb above.

4. Test verification

Configuration: HDD partition size 16GB, SSD partition size 8GB.
flashcache creation command: flashcache_create -p back -s 8g cachedev1 /dev/sdb /dev/sda2
The overall test results are summarized in the figure below.
flashcache2.png

The following is the detail test commands and results,
【fio tests】
Random write:
raw HDD, fio -filename=/dev/sda2 -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-test-randwrite-hdd.png

flashcache, fio -filename=/dev/mapper/cachedev1 -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-randwrite-flashcache.png

Random read:
raw HDD, fio -filename=/dev/sda2 -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-randread-hdd.png

flashcache, fio -filename=/dev/mapper/cachedev1 -direct=1 -iodepth 1 -thread -rw=randread -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-randread-flashcache.png

Random mixed read/write, 50% read:
raw HDD, fio -filename=/dev/sda2 -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-randrw-hdd.png

flashcache, fio -filename=/dev/mapper/cachedev1 -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=50 -ioengine=psync -bs=4k -size=5G -numjobs=8 -runtime=300 -group_reporting -name=mytest
flashcache-randrw-flashcache.png

5. Conclusion

For random IO (random write, random read, and random mixed read/write), flashcache can increase IOPS by dozens of times and significantly reduce IO latency.
Whether flashcache helps with sequential IO depends on the sequential performance of your HDD. For example, the test environment above used a Dawning I620r-T, whose HDD reaches 46K IOPS for sequential writes while the SSD only manages about 20K IOPS, so dev.flashcache.<cachename>.skip_seq_thresh_kb=x should be configured so that sequential IO larger than x KB is not cached. The value of x can be tuned to the workload; 128, 256, or 512 are typical choices.
