Flink state backend and RocksDB tuning

Table of contents

1 What is a state backend?

2 State backend classification

MemoryStateBackend

Notice

FsStateBackend

Applicable scenarios

Important points

RocksDBStateBackend

Applicable scenarios

Important points

3 RocksDB large state tuning

Configure multiple local RocksDB directories

Enable incremental checkpoints

Block Size

Block Cache Size

Maximum open files

Cache Index And Filter Blocks

Optimize Filter For Hits

Write Buffer Size

Max Bytes For Level Base

Write Buffer Count

Min Write Buffer Number To Merge

Thread Num(Parallelism)

Write Batch Size

Compaction Style

Compression Type


1 What is a state backend?

A state backend specifies how and where a job's state is stored.
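As a minimal sketch (assuming Flink 1.13+ and a hypothetical HDFS path), both choices can be made in code when building the job:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000);                        // checkpoint every 60 s
    env.setStateBackend(new EmbeddedRocksDBStateBackend()); // how working state is kept (RocksDB on local disk)
    env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // where checkpoints are written

The same choices can also be made cluster-wide in flink-conf.yaml via state.backend and state.checkpoints.dir.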

2 State backend classification

MemoryStateBackend

  1. State data required at runtime is stored in the memory on the TaskManager JVM heap

  2. The checkpoint is saved in the memory of the JobManager process, and synchronous/asynchronous snapshots can be selected.

Notice

1) State is stored in the memory of JobManager and is limited by the memory size of JobManager.

2) Each state is limited to 5 MB by default, which can be adjusted through the MemoryStateBackend constructor (see the sketch after this list).

3) Each State cannot exceed the Akka Frame size
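For illustration only, a hedged sketch of raising this limit through the (legacy) MemoryStateBackend constructor; the 10 MB value and the env variable from the snippet above are assumptions:

    import org.apache.flink.runtime.state.memory.MemoryStateBackend;

    // maxStateSize in bytes, asynchronous snapshots enabled
    env.setStateBackend(new MemoryStateBackend(10 * 1024 * 1024, true));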

FsStateBackend

  1. State data required at runtime is stored in the memory on the TaskManager JVM heap

  2. Checkpoints are stored in the file system (HDFS)

Applicable scenarios

1) Stateful processing tasks with large states, long windows, or large key-value states

2) High availability

Important points

1) State data will first be stored in the memory of TaskManager

2) State size cannot exceed TM memory

3) TM writes State data to external storage asynchronously
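A minimal sketch of configuring the (legacy) FsStateBackend with asynchronous snapshots; the HDFS address and path are placeholders:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;

    // checkpoints go to HDFS, snapshots are taken asynchronously
    env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints", true));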

RocksDBStateBackend

  1. Uses the embedded local database RocksDB to store streaming state on local disk, so it is not limited by the TaskManager memory size.

  2. When performing a checkpoint, the State data saved in the entire RocksDB is persisted in full or incrementally to the configured file system.

  3. A small amount of checkpoint metadata is stored in JobManager memory

  4. RocksDB overcomes the limitation of state being bounded by memory and can also be persisted to a remote file system, making it better suited for production use.

Applicable scenarios

1) Stateful processing tasks with large states, long windows, or large key-value states

2) High availability

3) RocksDBStateBackend supports incremental checkpoints. Incremental checkpoints are very suitable for scenarios with very large states

Important points

1) The total State size is limited to the disk size and is not limited by memory

2) RocksDBStateBackend still needs an external file system to be configured so that state checkpoints can be saved centrally
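A minimal sketch of the (legacy) RocksDBStateBackend with incremental checkpoints and an external checkpoint directory (the HDFS path is a placeholder); on Flink 1.13+ the equivalent is EmbeddedRocksDBStateBackend(true) combined with a checkpoint storage setting:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

    // the constructor declares IOException; the second argument enables incremental checkpoints
    env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true));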

3 RocksDB large state tuning

For stream computing scenarios that need to keep very large state (far exceeding memory capacity), RocksDB is currently the only officially supported choice on the Flink platform. There are also solutions in the industry that use Redis and similar services as the state backend, but they are not mature enough and have been rejected by the community.

RocksDB is a KV database implemented on the LSM-tree principle. The LSM-tree's read amplification problem is serious, so it places relatively high demands on disk performance. It is strongly recommended to use SSDs as the RocksDB storage medium in production. However, some clusters may not be equipped with SSDs and only have ordinary mechanical hard disks. When Flink jobs are relatively large and state access is frequent, the disk IO of mechanical hard disks may become the performance bottleneck. In this case, how can the bottleneck be resolved?

Use multiple hard drives to share the load

RocksDB uses memory plus disk to store data. When the status is relatively large, the disk space will be relatively large. If there are frequent read requests to RocksDB, disk IO will become the bottleneck of the Flink task.

If a TaskManager contains 3 slots, the three parallel subtasks on that single server will read and write the disk frequently and compete with each other for the same disk IO, which inevitably reduces the throughput of all three.

Fortunately, Flink's state.backend.rocksdb.localdir parameter can specify multiple directories. Generally, big data servers will mount many hard disks. We expect that the three slots of the same TaskManager will use different hard disks to reduce resource competition. The specific configuration is as follows

  • Configure multiple local RocksDB directories

state.backend.rocksdb.localdir: /data1/flink/rocksdb,/data2/flink/rocksdb,/data3/flink/rocksdb,/data4/flink/rocksdb,/data5/flink/rocksdb,/data6/flink/rocksdb,/data7/flink/rocksdb,/data8/flink/rocksdb,/data9/flink/rocksdb,/data10/flink/rocksdb,/data11/flink/rocksdb,/data12/flink/rocksdb

Note: Be sure to configure the directory on multiple different disks. Do not configure multiple directories on a single disk. The purpose of configuring multiple directories here is to allow multiple disks to share the pressure.

During testing, disk IO usage showed that the three parallel subtasks of the large-state operator mapped to three disks: the average IO usage of these three disks stayed at about 45%, with peaks close to 100%, while the average IO usage of the other disks was much lower. This shows that when RocksDB is used as the state backend and large state is read frequently, the consumption of disk IO performance is indeed considerable.

  • Enable incremental checkpoints

state.backend.incremental enables incremental checkpoints; the default is false. Alternatively, specify new EmbeddedRocksDBStateBackend(true) in code.

  • Block Size

The default value of state.backend.rocksdb.block.blocksize is 4KB. In production it is recommended to raise it to 16~32KB. If you increase the Block Size to improve read and write performance, be sure to increase the Block Cache Size as well, so that the larger blocks actually yield better read performance. If memory is already tight, it is not recommended to increase the Block Cache Size further, otherwise there is a risk of OOM (this is more pronounced in container environments where the total amount of memory is capped); in that case it is also not advisable to keep increasing the Block Size, and you can instead appropriately reduce the Block Size to improve throughput, while paying attention to whether write performance remains acceptable.
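For instance, a hedged flink-conf.yaml sketch that raises the two together (example values only):

    state.backend.rocksdb.block.blocksize: 32kb
    state.backend.rocksdb.block.cache-size: 256m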

  • Block Cache Size

state.backend.rocksdb.block.cache-size: the whole RocksDB instance shares a single Block Cache, which caches blocks in memory when reading data. The larger this parameter, the higher the cache hit rate for reads. The default size is 8MB; when memory is plentiful, it is recommended to set it to 64~256MB.

You can enable the state.backend.rocksdb.metrics.block-cache-usage monitoring metric to observe Block Cache usage in real time and make targeted optimizations.
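For example, the metric can be switched on in flink-conf.yaml (native metrics are off by default because collecting them adds some overhead):

    state.backend.rocksdb.metrics.block-cache-usage: true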

  • Maximum open files

The parameter state.backend.rocksdb.files.open determines the maximum number of file handles RocksDB may keep open. The default value is 5000, and it is recommended to change it to -1 (unlimited). If this parameter is too small, index and filter blocks cannot stay loaded in memory (when cache_index_and_filter_blocks is set to false), causing a sharp decline in read performance.

  • Cache Index And Filter Blocks

cache_index_and_filter_blocks (blockBasedTableConfig.setCacheIndexAndFilterBlocks) defaults to false, meaning index and filter blocks are not cached in the Block Cache; they are loaded on demand and evicted when no longer used. If set to true, indexes and filters are allowed to reside in the Block Cache, which can improve access efficiency for hot data (whether a key exists and where it is can be determined without touching the disk). However, if this option is enabled, the pin_l0_filter_and_index_blocks_in_cache parameter (blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache) must also be set to true, otherwise performance jitter may occur due to operating system paging.

It should be noted that the total size of the Block Cache is limited: if indexes and filters are allowed in, there is less space left to store data blocks. It is therefore recommended to turn on these two parameters only when keys have local hotspots (some keys are accessed frequently while others are rarely accessed); for keys with a relatively random distribution, these parameters may even be counterproductive (with random keys, read performance drops significantly). See the sketch below.
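As a hedged sketch with example values only (not a recommendation for every workload), these block-level options map to the RocksDB BlockBasedTableConfig in the Java API:

    import org.rocksdb.BlockBasedTableConfig;

    BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockSize(32 * 1024)                     // Block Size: 32 KB
            .setBlockCacheSize(256L * 1024 * 1024)       // Block Cache Size: 256 MB
            .setCacheIndexAndFilterBlocks(true)          // keep index/filter blocks in the Block Cache
            .setPinL0FilterAndIndexBlocksInCache(true);  // pin L0 index/filter blocks to avoid paging jitter

The tableConfig object is attached to the ColumnFamilyOptions via setTableFormatConfig; a fuller factory sketch appears at the end of this section.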

  • Optimize Filter For Hits

If the optimize_filters_for_hits parameter (columnFamilyOptions.setOptimizeFiltersForHits) is set to true, RocksDB will not generate Bloom filters for the bottommost level. According to the documentation, this can cut Filter storage overhead by about 90%, which helps reduce memory usage. However, this parameter is only suitable for scenarios with local hotspots, or where it is certain that cache misses will rarely occur; otherwise frequent cache misses will drag down read performance.

For memory usage such as Cache and Filter, you can estimate it through Flink’s state.backend.rocksdb.metrics.estimate-table-readers-mem monitoring indicator.

  • Write Buffer Size

The default value of state.backend.rocksdb.writebuffer.size is 64MB. Generally speaking, the larger the Write Buffer, the smaller the write amplification effect, so write performance also improves.

  • Max Bytes For Level Base

state.backend.rocksdb.compaction.level.max-size-level-base: if you increase the Write Buffer Size, be sure to also increase the size threshold of the L1 level (max_bytes_for_level_base). This parameter has a very large impact. If it is too small, each level stores very few SST files and there are many levels, making lookups slow; if it is too large, each level holds too many files, compactions take a long time, and Write Stalls (write interruptions) can easily occur.

  • Write Buffer Count

Flink's state.backend.rocksdb.writebuffer.count parameter (which can also be set through columnFamilyOptions.setMaxWriteBufferNumber) controls the maximum number of MemTables allowed to be retained in memory. Once this number is exceeded, MemTables are flushed to disk as SST files.

The default value of this parameter is 2. For mechanical disks, if memory is large enough, it can be increased to about 5 to reduce the probability of Write Stalls during Flush operations.

  • Min Write Buffer Number To Merge

Flink's state.backend.rocksdb.writebuffer.number-to-merge parameter (columnFamilyOptions.setMinWriteBufferNumberToMerge) determines the minimum number of Write Buffers merged before a flush. The default value is 1. For mechanical hard disks, it can be increased appropriately to avoid write pauses caused by frequent merge operations.

In our tuning experience, setting this parameter either too small or too large often degrades performance; the optimal value usually lies somewhere in between, such as 3.

For the estimated memory occupied by MemTables, you can enable Flink's state.backend.rocksdb.metrics.cur-size-all-mem-tables metric for real-time monitoring.

  • Thread Num(Parallelism)

state.backend.rocksdb.thread.num This parameter allows users to increase the maximum number of threads for background Compaction and Flush operations. Flush operations are in the high-priority queue by default, and Compaction operations are in the low-priority queue.

The default number of background threads is 1, and mechanical hard disk users can change it to 4 or other larger values.

If all threads in the background are performing compaction operations, and there are suddenly many write requests at this time, a write stall (Write Stall) will occur. Write pauses can be discovered through logs or monitoring indicators.

  • Write Batch Size

The state.backend.rocksdb.write-batch-size parameter specifies the maximum amount of memory RocksDB uses for batch writes. The default is 2MB; if set to 0, it is adjusted automatically based on the workload. If there are no special requirements, this parameter does not need to be changed.

  • Compaction Style

The state.backend.rocksdb.compaction.style parameter (setCompactionStyle method of ColumnFamilyOptions) allows users to adjust the organization of Compaction. The default value is LEVEL (more balanced), but it can also be changed to UNIVERSAL or FIFO.

Compared with the default LEVEL mode, UNIVERSAL is a tiered compaction style that reduces write amplification, but the side effect is greater space amplification and read amplification, so it only suits scenarios with heavy writes, infrequent reads, and ample disk space.

FIFO is suitable for scenarios where RocksDB is used as a time series database because it is a first-in, first-out algorithm and can eliminate expired old data in batches.

  • Compression Type

The ColumnFamilyOptions class provides the setCompressionType method, which can specify the compression algorithm for Block.

RocksDB supports multiple compression algorithms such as no compression, Snappy, Zlib, BZip2, LZ4, LZ4HC, Xpress, and ZSTD. Note, however, that before enabling one you need to confirm that the corresponding compression library is installed on the system, otherwise it may not work properly.

If you are pursuing performance, you can turn off compression (NO_COMPRESSION); otherwise the LZ4 algorithm is recommended, followed by Snappy. After enabling compression, the verify_checksums option of ReadOptions can be turned off to improve read speed (at the risk of not detecting disk bad blocks).
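To tie the programmatic options above together, here is a hedged sketch of a RocksDBOptionsFactory (the class name and all values are illustrative, not universal recommendations) that can be plugged into the RocksDB state backend:

    import java.util.Collection;
    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.ColumnFamilyOptions;
    import org.rocksdb.CompactionStyle;
    import org.rocksdb.CompressionType;
    import org.rocksdb.DBOptions;

    public class TunedRocksDBOptionsFactory implements RocksDBOptionsFactory {

        @Override
        public DBOptions createDBOptions(DBOptions current, Collection<AutoCloseable> handlesToClose) {
            return current
                    .setMaxOpenFiles(-1)       // Maximum open files: unlimited
                    .setMaxBackgroundJobs(4);  // more background Flush/Compaction threads for slow disks
        }

        @Override
        public ColumnFamilyOptions createColumnFamilyOptions(ColumnFamilyOptions current,
                                                             Collection<AutoCloseable> handlesToClose) {
            return current
                    .setWriteBufferSize(128 * 1024 * 1024)                // Write Buffer Size: 128 MB
                    .setMaxWriteBufferNumber(5)                           // Write Buffer Count
                    .setMinWriteBufferNumberToMerge(3)                    // Min Write Buffer Number To Merge
                    .setMaxBytesForLevelBase(1024L * 1024 * 1024)         // Max Bytes For Level Base: 1 GB
                    .setOptimizeFiltersForHits(true)                      // only useful with local hotspots, see above
                    .setCompactionStyle(CompactionStyle.LEVEL)            // Compaction Style
                    .setCompressionType(CompressionType.LZ4_COMPRESSION)  // Compression Type
                    .setTableFormatConfig(new BlockBasedTableConfig()
                            .setBlockSize(32 * 1024)
                            .setBlockCacheSize(256L * 1024 * 1024)
                            .setCacheIndexAndFilterBlocks(true)
                            .setPinL0FilterAndIndexBlocksInCache(true));
        }
    }

    // Usage: attach the factory to the backend, then set it on the environment
    // (env is the StreamExecutionEnvironment from the earlier sketches)
    EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true);
    backend.setRocksDBOptions(new TunedRocksDBOptionsFactory());
    env.setStateBackend(backend);

The factory is applied after the predefined options and the flink-conf.yaml settings, so values set here generally take precedence; it is worth keeping the factory and the config file consistent.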

 
