TiKV new architecture: Partitioned Raft KV principle analysis

Author: Xu Qi

TiKV has launched a new experimental feature called "partitioned-raft-kv". It adopts a new architecture that not only significantly improves the scalability of TiDB, but also improves TiDB's write throughput and performance stability.

In a previous post, we described the massive performance and scalability gains brought by this new experimental Partitioned Raft KV feature. In this article, we explain why it provides such a big advantage.

Architecture

The following is the architecture of TiKV.

Figure 1: TiKV architecture - logical data partitioning

A TiKV cluster consists of many data partitions (also called Regions). Each Region is responsible for a specific piece of data, determined by its start and end key range. A Region has three or more replicas on different TiKV nodes, and the replicas are kept in sync through the Raft protocol. In the old Raft KV engine, each TiKV node has only one RocksDB instance that stores the data of all Regions. The partitioned-raft-kv feature introduces a new physical data layout: each Region has its own RocksDB instance.

Figure 2: Physical data layout comparison
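To make the layout difference concrete, here is a minimal Rust sketch of the two layouts. The type names are illustrative only, not TiKV's real data structures: in the old layout all Regions share one RocksDB instance, while in the partitioned layout each Region id maps to a dedicated instance.

```rust
// Conceptual sketch only: the type names are illustrative, not TiKV's real
// data structures. It contrasts the old layout (one shared RocksDB instance
// for every Region) with the partitioned layout (one instance per Region).
use std::collections::HashMap;

/// Stand-in for a RocksDB handle.
struct RocksDb;

/// A Region serves the key range [start_key, end_key).
#[allow(dead_code)]
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

/// Old Raft KV engine: all Regions share one RocksDB instance, so a Region
/// boundary exists only logically.
#[allow(dead_code)]
struct SharedLayout {
    regions: Vec<Region>,
    kv_db: RocksDb,
}

/// Partitioned Raft KV: each Region id maps to a dedicated RocksDB instance,
/// giving Regions a physical boundary on disk.
struct PartitionedLayout {
    regions: Vec<Region>,
    kv_dbs: HashMap<u64, RocksDb>,
}

fn main() {
    let regions = vec![
        Region { id: 1, start_key: b"a".to_vec(), end_key: b"m".to_vec() },
        Region { id: 2, start_key: b"m".to_vec(), end_key: b"z".to_vec() },
    ];
    // In the new layout, every Region gets its own instance.
    let kv_dbs: HashMap<u64, RocksDb> = regions.iter().map(|r| (r.id, RocksDb)).collect();
    let layout = PartitionedLayout { regions, kv_dbs };
    println!("{} regions, {} rocksdb instances", layout.regions.len(), layout.kv_dbs.len());
}
```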

Challenges of the old Raft KV engine

"Region" is a logical scale unit in TiKV. Every data access and management operation such as load balancing, scaling up and down is partitioned by Region. However, in the current architecture, it is a purely logical concept with no clear area boundaries physically. this means:

  1. When the data of a Region needs to be moved from one TiKV node to another (also called load balancing), TiKV has to scan the huge shared RocksDB instance to collect that Region's data. This causes read amplification.
  2. When several Regions have heavy write traffic and their key ranges are spread out, they are likely to trigger a large compaction in RocksDB that also includes data from other, idle Regions. This introduces read and write amplification. For example, SST11 is a 1 MB SST that contains only region1's data but covers a fairly wide key range. When it is selected to be merged into L2, SST21, SST22 and SST23 all take part in the compaction, and they contain the data of region2, 3 and 4 (the sketch after this list illustrates the effect). The larger the TiKV instance, the greater the read and write amplification.

Figure 3: Data from different Regions compacted together

  3. There is no isolation between Regions, so a few hot Regions can slow down the performance of all Regions.
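The following Rust sketch uses made-up SST names, key ranges and sizes that mirror the Figure 3 example; it is not RocksDB code, just an illustration of how one small SST with a wide key range drags several large SSTs from other Regions into the same compaction.

```rust
// Illustration with hypothetical SSTs (not RocksDB internals): a 1 MB L1 SST
// that spans a wide key range forces every overlapping L2 SST to be rewritten,
// even though those SSTs belong to idle Regions.
struct Sst {
    name: &'static str,
    start_key: &'static str,
    end_key: &'static str,
    size_mb: u64,
}

/// Returns the L2 SSTs whose key ranges overlap the SST being compacted.
fn overlapping<'a>(input: &Sst, level: &'a [Sst]) -> Vec<&'a Sst> {
    level
        .iter()
        .filter(|s| s.start_key < input.end_key && input.start_key < s.end_key)
        .collect()
}

fn main() {
    // Mirrors Figure 3: SST11 holds only region1's data but spans keys a..z.
    let sst11 = Sst { name: "SST11", start_key: "a", end_key: "z", size_mb: 1 };
    let l2 = [
        Sst { name: "SST21", start_key: "a", end_key: "h", size_mb: 64 }, // region2
        Sst { name: "SST22", start_key: "h", end_key: "p", size_mb: 64 }, // region3
        Sst { name: "SST23", start_key: "p", end_key: "z", size_mb: 64 }, // region4
    ];

    let inputs = overlapping(&sst11, &l2);
    let rewritten: u64 = sst11.size_mb + inputs.iter().map(|s| s.size_mb).sum::<u64>();
    // Pushing 1 MB of region1 data down a level rewrites ~193 MB in total.
    println!(
        "compacting {} (1 MB) rewrites {} MB across {} L2 SSTs",
        sst11.name,
        rewritten,
        inputs.len()
    );
}
```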

Therefore, with the old Raft KV engine we may encounter the following problems:

  1. Scaling out or in is very slow, because it requires multiple data scans.
  2. Because RocksDB commits writes through a single-threaded write group, write throughput is limited (see the sketch after this list).
  3. Since compaction happens from time to time, user-facing latency becomes unstable once the RocksDB instance holds a large amount of data.
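To make point 2 above concrete, here is a minimal sketch that models the shared write path as a single lock. It is not RocksDB's actual write-group implementation, only an illustration: however many Regions write concurrently, their writes are committed one at a time.

```rust
// Minimal model of the old engine's write path (not RocksDB's real code):
// a single lock plays the role of the single-threaded write group, so all
// Regions' writes are serialized through it.
use std::sync::Mutex;
use std::thread;

#[derive(Default)]
struct SharedEngine {
    // One write path shared by every Region.
    write_path: Mutex<Vec<(Vec<u8>, Vec<u8>)>>,
}

fn main() {
    let engine = SharedEngine::default();

    thread::scope(|s| {
        let engine = &engine;
        // Four "Regions" write concurrently, but each write must wait its
        // turn on the shared lock, so throughput does not scale with threads.
        for region_id in 0..4u64 {
            s.spawn(move || {
                let key = format!("r{region_id}_key").into_bytes();
                engine.write_path.lock().unwrap().push((key, b"value".to_vec()));
            });
        }
    });

    println!("entries written: {}", engine.write_path.lock().unwrap().len());
}
```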

Improvements to the Partitioned Raft KV engine

  • Each Region's data lives in a dedicated RocksDB instance, so load balancing between Regions only requires copying that instance directly instead of scanning a shared instance, avoiding read amplification.
  • Write traffic on a hotspot Region only triggers compaction in its own RocksDB instance and never involves other Regions' data, which effectively reduces read and write amplification.
  • When writing data, there is no synchronization or lock contention between writer threads, because each thread writes to a different RocksDB instance. This removes the write bottleneck. And since the per-Region RocksDB keeps no WAL of its own (the Raft log already persists the data), writing to RocksDB is essentially an in-memory operation; the sketch after this list illustrates the parallel write path.
  • Poor performance of one RocksDB instance will not affect other Regions, so Region performance is isolated at the storage level.
  • Each Region now supports a much larger capacity, 15 GB by default. Compared with the previous default Region size of 96 MB, a cluster needs roughly 160 times fewer Regions for the same data, so per-Region overhead such as heartbeats and memory usage is reduced by as much as 99%.
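For contrast with the earlier sketch of the shared write path, the sketch below (again with illustrative types, not TiKV's code) shows the partitioned write path: every writer thread only touches its own Region's instance, so there is no contention, and writes land directly in that instance's memtable.

```rust
// Minimal model of the partitioned write path (not TiKV's real code): each
// Region owns its engine, so writer threads never contend on a shared lock,
// and the write is a memtable insert with no per-instance WAL.
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Stand-in for a per-Region RocksDB instance.
#[derive(Default)]
struct RegionEngine {
    memtable: Mutex<Vec<(Vec<u8>, Vec<u8>)>>,
}

fn main() {
    // One dedicated engine per Region.
    let engines: HashMap<u64, RegionEngine> =
        (0..4u64).map(|id| (id, RegionEngine::default())).collect();

    thread::scope(|s| {
        for (region_id, engine) in &engines {
            // Each thread writes only to its own Region's engine, so its
            // lock is never contended by other Regions' writers.
            s.spawn(move || {
                let key = format!("r{region_id}_key").into_bytes();
                engine.memtable.lock().unwrap().push((key, b"value".to_vec()));
            });
        }
    });

    for (region_id, engine) in &engines {
        println!("region {}: {} entries", region_id, engine.memtable.lock().unwrap().len());
    }
}
```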

Therefore, with Partitioned Raft KV, TiDB scales out or in about 5 times faster, and its performance is generally more stable because compaction has a much smaller impact.

Scope of application

Everything looks fine, but there is one more problem. With many more RocksDB instances, the memory consumed by their memtables is much higher, which means you may need an extra 5 GB~10 GB of memory to strike a balance between memory consumption and performance. Therefore, it is generally not recommended to turn on this feature when memory resources are already very tight. However, the feature can help when your TiKV nodes have spare memory and you care about scalability and write performance.

Final thoughts

Some customers may say that the current version of TiDB is already good enough, so the new feature does not seem to matter much to them. But what if a single cluster could serve multiple workloads, each with good isolation and QoS guarantees? That is the "Resource Governance" feature in version 7.0. Partitioned Raft KV is designed to maximize hardware performance; used together with "Resource Governance", it lets our customers fully utilize their hardware resources and reduce costs by consolidating multiple workloads into one cluster.

Origin: juejin.im/post/7233717220353032229