Exclusive | RocksDB Compaction Rate Limiting: Practice and Source Code Analysis

Lead: Disk IO utilization is a metric that storage engineers care about deeply. This article describes the 58 Storage tuning team's practice with rocksdb in IO-glitch scenarios and analyzes the rate-limiting portion of rocksdb's compaction source code. Through tuning, IO glitches were effectively reduced, lowering the impact on real-time reads and writes.

Background
WTable, the distributed KV storage product developed in-house by the 58 Storage team, uses the open-source rocksdb as its storage engine. rocksdb is an LSM-Tree storage structure that performs compaction in the background.

In practice we found that when a business writes a large volume of data, many compaction operations are triggered, occupying heavy IO resources and even driving IO utilization to 100%. These IO glitches cause real-time read/write request latency to rise and requests to time out.

This article first analyzes the rate-limiting portion of the rocksdb compaction source code, then introduces WTable's tuning practice for IO-glitch scenarios.

I. Source code analysis
rocksdb's rate limiting is implemented by the RateLimiter class. RateLimiter takes five parameters:
rate_bytes_per_sec: the total rate-limit threshold for flush and compaction;
refill_period_us: the token refill period, 100ms by default;
fairness: controls how often low-priority requests (compaction) obtain tokens relative to high-priority requests (flush); the default is 10;
mode: an enum with three values (limit reads only, writes only, or all IO); by default only write requests are limited;
auto_tuned: whether auto-tuning is enabled; disabled by default.
In the generic rate limiter (auto_tuned = false), tokens are allocated by cycle; each cycle lasts refill_period_us (100ms by default), so the tokens allocatable per cycle = rate_bytes_per_sec × 100ms / 1000ms. The global variable available_bytes is initialized at each cycle to this per-cycle token value. When a request arrives, if the requested bytes are no more than available_bytes, allocation succeeds immediately and available_bytes is decreased accordingly; if the requested bytes exceed available_bytes, the request enters a wait queue. As shown below, there are two queues of requests waiting for tokens, a low-priority (low-pri) queue and a high-priority (high-pri) queue; suppose each request needs 20MB of tokens:
[Figure: low-pri and high-pri wait queues, each queued request waiting for 20MB of tokens]
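The per-cycle token arithmetic and the grant/wait decision described above can be sketched as follows. This is a simplified single-threaded model for illustration only, not rocksdb's actual implementation; the class and helper names are invented here.

```python
# Simplified model of the generic (non-auto-tuned) rate limiter's token bucket.
# Variable names mirror the description above; this is a sketch, not rocksdb code.

def tokens_per_cycle(rate_bytes_per_sec, refill_period_us=100_000):
    """Tokens granted per refill cycle: rate_bytes_per_sec * 100ms / 1000ms."""
    return rate_bytes_per_sec * refill_period_us // 1_000_000

class SimpleRateLimiter:
    def __init__(self, rate_bytes_per_sec, refill_period_us=100_000):
        self.quota = tokens_per_cycle(rate_bytes_per_sec, refill_period_us)
        self.available_bytes = self.quota  # per-cycle token budget
        self.num_drains = 0                # cycles that ended with waiters

    def request(self, nbytes):
        """Return True if granted immediately, False if the caller must wait."""
        if nbytes <= self.available_bytes:
            self.available_bytes -= nbytes
            return True
        self.num_drains += 1  # this cycle drained; the request must queue
        return False

    def refill(self):
        """Restore the full per-cycle budget at the start of a new cycle."""
        self.available_bytes = self.quota
```

For example, with rate_bytes_per_sec = 250MB the per-cycle budget is 25MB, so a first 20MB request succeeds while a second one in the same cycle must wait.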

Through competition, a unique leader is selected, and the non-leader requests enter a wait state. Only the leader records, in the num_drains variable, the number of cycles spent waiting because available_bytes was exhausted; it then waits for the current cycle to end and performs the token refill operation.

From the flow above, only the leader thread can perform the refill operation. During refill, available_bytes is first updated. Tokens are then allocated in order to the requests waiting in the two queues according to the fairness probability: the low-pri queue is served first with probability 1/fairness. Each allocation decreases available_bytes accordingly, removes the request from its queue, and wakes the corresponding waiting non-leader thread, which returns successfully; this continues until the queues are empty or available_bytes is exhausted. After the refill flow finishes, leader election re-runs: a new leader is chosen from among the head-of-queue requests that the old refill could not allocate enough tokens for, with the request at the head of the high-pri queue taking priority.

Suppose that after refill, available_bytes is updated to 50MB and token allocation starts from the high-pri queue according to the probability. The first two requests are satisfied and removed from the wait queues after allocation, leaving available_bytes at 10MB; the next request's required tokens cannot be satisfied, so the refill flow ends and the leader is re-elected to wait for the next refill, as shown below:
[Figure: after refill, two 20MB requests granted, 10MB of available_bytes remaining]
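A minimal sketch reproducing this worked example follows. The function name and plain-list queue representation are invented for illustration; the real implementation coordinates waiting threads rather than operating on lists, and uses rocksdb's internal random source for the fairness coin flip.

```python
import random

def refill_and_dispatch(available_bytes, high_pri, low_pri, fairness=10, rng=None):
    """Grant queued requests FIFO until a head request cannot be satisfied.

    With probability 1/fairness the low-pri queue is served before the
    high-pri queue, per the fairness rule described above. Returns the
    remaining budget and the list of granted request sizes.
    """
    rng = rng or random.Random(0)
    granted = []
    while high_pri or low_pri:
        use_low = low_pri and (not high_pri or rng.randrange(fairness) == 0)
        queue = low_pri if use_low else high_pri
        if queue[0] > available_bytes:
            break  # head request can't be satisfied; wait for the next refill
        granted.append(queue.pop(0))
        available_bytes -= granted[-1]
    return available_bytes, granted
```

With a 50MB budget and three 20MB high-pri requests, two are granted and 10MB remains, matching the figure above.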

The auto-tuned rate limiter builds on the generic rate limiter by dynamically adjusting the rate-limit threshold; in this mode, rate_bytes_per_sec means the upper bound of the limit. The threshold is adjusted once every 100 refill_period_us cycles, within the range [rate_bytes_per_sec/20, rate_bytes_per_sec]. The adjustment is based on the fraction of the past 100 cycles that waited because available_bytes was exhausted (num_drains delta / 100 × 100%): when this ratio is below the low watermark (50%), the threshold is divided by 1.05; when it is above the high watermark (90%), the threshold is multiplied by 1.05; between the watermarks it stays unchanged. If no cycle waited due to available_bytes exhaustion, the threshold is set directly to the lower bound, rate_bytes_per_sec/20.
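The adjustment rule above can be sketched as a pure function. This is a simplified model of the behavior described in this article; the function name and clamping details are assumptions, not rocksdb's exact code.

```python
def autotune_rate(prev_rate, max_rate, drained_pct,
                  low_watermark=50, high_watermark=90, factor=1.05):
    """Adjust the limit within [max_rate/20, max_rate] based on the share of
    the last 100 cycles that drained available_bytes (per the rule above)."""
    lower, upper = max_rate // 20, max_rate
    if drained_pct == 0:
        return lower                   # no contention at all: drop to the floor
    if drained_pct < low_watermark:
        new_rate = prev_rate / factor  # rarely drained: lower the limit
    elif drained_pct > high_watermark:
        new_rate = prev_rate * factor  # almost always drained: raise the limit
    else:
        new_rate = prev_rate           # between watermarks: unchanged
    return int(min(max(new_rate, lower), upper))
```

For instance, with an upper bound of 1000MB the limit can never rise above 1000MB or fall below the 50MB floor, whatever the drain ratio.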

II. Rate-limiting practice
1. Basic rate limiting
In early practice, WTable did not use the autotune mode of the rate limiter; only the first parameter was set, with the others left at their defaults. One cluster saw heavy offline imports; its NIC inbound traffic is shown below:
[Figure: NIC inbound traffic for the cluster with heavy offline imports]

Because of the large write volume, rocksdb generates many new SST files and must perform a great deal of compaction. Without rate limiting, IO glitches occurred frequently, as shown below:
[Figure: IO utilization spikes (glitches) without rate limiting]

For this business's write scenario, we experimented with setting rate_bytes_per_sec to 250MB/s; the IO glitches improved markedly, with IO utilization peaking around 50%, as shown below:
[Figure: IO utilization peaking around 50% with rate_bytes_per_sec = 250MB/s]

2. Auto-tuned rate limiting
The 250MB/s setting above is an experimental value specific to one cluster and does not generalize. When a business writes even more data, this fixed limit causes rocksdb write stalls, which are unacceptable for online read/write services. Setting a bespoke threshold for every cluster is impractical and not general enough, so we went on to test auto-tune mode, hoping to use a single configuration across different clusters.

When auto tune is enabled (auto_tuned = true), rate_bytes_per_sec means the upper bound of the rate limit; the official recommendation is to set it as large as possible.

At first we set rate_bytes_per_sec to 2000MB/s, but the rate limit had no visible effect. From the source analysis above, the threshold adjustment range is [rate_bytes_per_sec/20, rate_bytes_per_sec], so when rate_bytes_per_sec is too large, the lower bound of the threshold is still large and no real limiting takes place. In subsequent practice we set rate_bytes_per_sec to 1000MB/s, giving a lower bound of 50MB/s. This both enforces a real limit, solving the IO glitch problem, and dynamically adjusts the threshold to avoid write stalls. It is a fairly general configuration, usable across a variety of write scenarios and easy for operations to deploy.
Other
rocksdb has a great many parameters, and the IO glitch problem can also be approached from another angle: reducing the amount of compaction. For example, configuring larger SST files can reduce compaction to some extent, but compaction that is not timely means old data cannot be cleared promptly, causing space amplification. This trade-off must be weighed against the business scenario.

In addition, we are also experimenting with ingesting SST files directly into rocksdb to mitigate the impact of large offline imports on real-time reads and writes, so stay tuned!

References: https://github.com/facebook/rocksdb/wiki

Author: Emori Chao, senior storage engineer at 58 Group, responsible for the development and optimization of the WTable distributed KV storage system.

Source: blog.51cto.com/14621185/2455437