The time wheel (TimingWheel) and delayed operations in Kafka

Kafka-related interview questions: https://blog.csdn.net/qq_28900249/article/details/90346599

Kafka has a large number of delayed operations, such as delayed produce, delayed fetch, and delayed delete. Kafka does not use the Timer or DelayQueue that ship with the JDK to implement delays. Instead it implements its own timer (SystemTimer) based on a time wheel. The average time complexity of insertion and deletion in the JDK's Timer and DelayQueue is O(log n), which cannot meet Kafka's performance requirements, whereas a time wheel reduces both operations to O(1). The time wheel is not unique to Kafka. It also appears in components such as Netty, Akka, Quartz, and ZooKeeper.

As the figure below shows, the time wheel (TimingWheel) in Kafka is a circular queue that stores timer tasks. It is backed by an array, and each array element holds a timer task list (TimerTaskList). A TimerTaskList is a circular doubly linked list in which each item is a timer task entry (TimerTaskEntry) that wraps the actual timer task (TimerTask).

figure 1

A time wheel consists of multiple time slots, and each time slot represents the basic time span (tickMs) of the wheel. The number of slots is fixed and is given by wheelSize, so the overall time span (interval) of the wheel is tickMs × wheelSize. The time wheel also has a dial pointer (currentTime) that indicates the wheel's current time. currentTime is always an integer multiple of tickMs. It divides the wheel into an expired part and an unexpired part. The slot that currentTime points to belongs to the expired part: it has just expired, and all tasks in the TimerTaskList of that slot must be processed.

Suppose tickMs=1ms and wheelSize=20, so the interval is 20ms. Initially the pointer currentTime points to time slot 0. A task with a delay of 2ms is inserted and stored in the TimerTaskList of time slot 2. As time passes, currentTime keeps advancing. After 2ms it reaches time slot 2, and the tasks in the TimerTaskList of that slot must be expired. If a task with a delay of 8ms is inserted at this moment, it is stored in time slot 10, and currentTime will point to time slot 10 after another 8ms. What if a task with a delay of 19ms is inserted at the same moment? The new TimerTaskEntry reuses an existing TimerTaskList and is inserted into time slot 1, which has already expired (2 + 19 = 21, and 21 mod 20 = 1). In short, the overall span of the time wheel stays the same. As currentTime advances, the window that the wheel can handle moves forward with it, and it always covers the range from currentTime to currentTime + interval.
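The slot arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation, not Kafka's code: the function name slot_for and its parameters are made up for illustration, and a task is assumed to land in slot (expiration / tickMs) mod wheelSize.

```python
TICK_MS = 1      # basic time span of one slot
WHEEL_SIZE = 20  # number of slots, so interval = 20 ms


def slot_for(current_time_ms, delay_ms, tick_ms=TICK_MS, wheel_size=WHEEL_SIZE):
    """Return the slot index for a task due delay_ms after current_time_ms."""
    interval = tick_ms * wheel_size
    if not 0 <= delay_ms < interval:
        raise ValueError("delay does not fit in a single wheel of this size")
    expiration_ms = current_time_ms + delay_ms
    # Slots are reused cyclically, hence the modulo.
    return (expiration_ms // tick_ms) % wheel_size


print(slot_for(0, 2))   # 2
print(slot_for(2, 8))   # 10
print(slot_for(2, 19))  # 1
```

The last call reproduces the 19ms case: the task wraps around into slot 1, which the pointer has already passed.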

What if a task with a delay of 350ms arrives now? Should wheelSize simply be enlarged? Kafka routinely has timer tasks with delays of tens or even hundreds of thousands of milliseconds, so there is no natural limit to how far wheelSize would have to grow. Even if an upper bound were set on every task's expiration time, say 1 million milliseconds, a wheel with tickMs=1ms would need a wheelSize of 1 million slots. Such a wheel would not only take up a lot of memory but also be less efficient. For this reason Kafka introduces the concept of a hierarchical time wheel: when a task's expiration time exceeds the range covered by the current wheel, the task is passed to the next higher-level wheel.

figure 2

Referring to the figure above and reusing the previous example, the first-level time wheel has tickMs=1ms, wheelSize=20, and interval=20ms. The tickMs of the second-level wheel equals the interval of the first-level wheel, which is 20ms. The wheelSize of every level is fixed at 20, so the overall time span (interval) of the second-level wheel is 400ms. By the same reasoning, 400ms is the tickMs of the third level, and the overall time span of the third-level wheel is 8000ms.
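The per-level parameters follow mechanically from the rule that each level's tickMs is the interval of the level below it. A short sketch (the function name wheel_levels is hypothetical) that derives the three levels described above:

```python
def wheel_levels(tick_ms=1, wheel_size=20, levels=3):
    """Return (level, tickMs, interval) for each level of a hierarchical wheel."""
    result = []
    for level in range(1, levels + 1):
        interval = tick_ms * wheel_size
        result.append((level, tick_ms, interval))
        tick_ms = interval  # the next level ticks once per full turn of this one
    return result


for level, tick, interval in wheel_levels():
    print(f"level {level}: tickMs={tick}ms, interval={interval}ms")
# level 1: tickMs=1ms, interval=20ms
# level 2: tickMs=20ms, interval=400ms
# level 3: tickMs=400ms, interval=8000ms
```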

The 350ms task mentioned earlier clearly does not fit in the first-level time wheel, so it is promoted to the second-level wheel and inserted into the TimerTaskList of time slot 17 there. If a task with a delay of 450ms arrives at the same time, the second-level wheel cannot hold it either, so it is promoted to the third-level wheel and inserted into the TimerTaskList of time slot 1. Note that all tasks whose expiration time falls in the interval [400ms, 800ms), such as tasks of 446ms, 455ms, and 473ms, are placed in time slot 1 of the third-level wheel, and the TimerTaskList of that slot has an expiration of 400ms.

As time passes and this TimerTaskList expires, the task originally scheduled for 450ms still has 50ms left, so it cannot be executed yet. This is where the time wheel is downgraded: the task, now with 50ms remaining, is resubmitted to the hierarchical time wheel. The first-level wheel's span is still too small, but the second level is large enough, so the task is placed in the second-level slot that covers [40ms, 60ms). After another 40ms the task is "seen" again, but 10ms still remain and it cannot be executed immediately. It is therefore downgraded once more and added to the first-level slot that covers [10ms, 11ms). After a further 10ms the task actually expires and its expiration operation is executed.
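The promote-then-downgrade walk can be traced with a small simulation. This is a sketch under simplifying assumptions, not Kafka's implementation: the names place and trace are invented, time is measured as absolute milliseconds from 0, every level has 20 slots, and a task is resubmitted at the exact moment its bucket starts.

```python
WHEEL_SIZE = 20


def place(now_ms, expiration_ms):
    """Return (level, slot, bucket_start_ms), or None if the task is already due."""
    tick_ms, level = 1, 1
    while True:
        current = now_ms - now_ms % tick_ms  # currentTime of this level
        if expiration_ms < current + tick_ms:
            return None  # due now (this can only happen on level 1)
        if expiration_ms < current + tick_ms * WHEEL_SIZE:
            virtual_id = expiration_ms // tick_ms
            return level, virtual_id % WHEEL_SIZE, virtual_id * tick_ms
        tick_ms *= WHEEL_SIZE  # does not fit: try the next level up
        level += 1


def trace(expiration_ms):
    """List (time, level, slot) for every placement until the task fires."""
    now, steps = 0, []
    placement = place(now, expiration_ms)
    while placement is not None:
        level, slot, bucket_start = placement
        steps.append((now, level, slot))
        now = bucket_start  # the bucket expires, the task is resubmitted
        placement = place(now, expiration_ms)
    return steps


print(trace(350))  # [(0, 2, 17), (340, 1, 10)]
print(trace(450))  # [(0, 3, 1), (400, 2, 2), (440, 1, 10)]
```

For the 450ms task the trace shows the three placements from the text: third level slot 1 at time 0, then the second-level slot covering 40ms to 60ms of remaining time (absolute 440ms to 460ms) at time 400, then a first-level slot at time 440.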

Design comes from life. An ordinary clock is a three-level time wheel. The first level has tickMs=1s, wheelSize=60, interval=1min: this is the second hand. The second level has tickMs=1min, wheelSize=60, interval=1hour: this is the minute hand. The third level has tickMs=1hour, wheelSize=12, interval=12hours: this is the hour hand.

The parameters of the first-level time wheel in Kafka are the same as in the example above: tickMs=1ms, wheelSize=20, interval=20ms. The wheelSize of every level is also fixed at 20, so the tickMs and interval of each level can be calculated accordingly. Kafka takes care of a few small details when implementing TimingWheel:

When a TimingWheel is created, the current system time is used as the start time (startMs) of the first-level time wheel. The current system time here is not obtained by simply calling System.currentTimeMillis() but by calling Time.SYSTEM.hiResClockMs. This is because the precision of currentTimeMillis() depends on the operating system, and some operating systems cannot deliver millisecond-level precision. Time.SYSTEM.hiResClockMs essentially computes System.nanoTime()/1_000_000 to bring the precision down to the millisecond level. There are other ways to obtain millisecond precision, but the author does not recommend them and considers System.nanoTime()/1_000_000 the most effective method. (If you have thoughts on this, feel free to discuss them in the comments.)
Each circular doubly linked list (TimerTaskList) in a TimingWheel has a sentinel node. Sentinel nodes simplify boundary conditions. A sentinel node, also called a dummy node, is an extra node placed first in the linked list. Its value field stores nothing, and it exists only to make list operations more convenient. If a linked list has a sentinel node, the first real element of the list is the second node.
Except for the first-level time wheel, the start time (startMs) of each higher-level time wheel is set to the currentTime of the wheel one level below it at the moment the higher-level wheel is created. The currentTime of every level must be an integer multiple of tickMs. If it is not, currentTime is trimmed down to an integer multiple of tickMs so that it lines up with the expiration ranges of the slots in the wheel. The trimming rule is: currentTime = startMs - (startMs % tickMs). currentTime advances over time, but it always remains an integer multiple of tickMs. If the time at some moment is timeMs, then the wheel's currentTime at that moment is timeMs - (timeMs % tickMs), and each time the clock advances, the currentTime of every level is advanced according to this formula.
The timer in Kafka only needs to hold a reference to the first-level TimingWheel. It does not hold the higher-level wheels directly. Instead, each time wheel holds a reference (overflowWheel) to the wheel one level above it, and by following these references level by level the timer indirectly reaches every wheel.
That covers the details of the time wheel, and its implementation in other components is broadly similar. Readers may be curious about a phrase used repeatedly above, "as time passes": how does time actually advance in Kafka? By something like the JDK's scheduleAtFixedRate, advancing the time wheel once per second? That would clearly be unreasonable, and the TimingWheel would lose most of its purpose.
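Before moving on to how time advances, two of the details listed above can be illustrated briefly: the sentinel node that keeps list insertion and removal at O(1) with no special cases, and the trimming formula for currentTime. This is a Python sketch rather than Kafka's own code. The class names Entry and TaskList merely stand in for TimerTaskEntry and TimerTaskList, and the method names are chosen for illustration.

```python
class Entry:
    """A list node (stand-in for TimerTaskEntry)."""

    def __init__(self, task=None):
        self.task = task
        self.prev = self
        self.next = self


class TaskList:
    """Circular doubly linked list with a sentinel (stand-in for TimerTaskList)."""

    def __init__(self):
        self.root = Entry()  # sentinel: holds no task, root.next is the first real entry

    def add(self, entry):
        # Append before the sentinel. No empty-list special case is needed.
        tail = self.root.prev
        entry.prev, entry.next = tail, self.root
        tail.next = entry
        self.root.prev = entry

    def remove(self, entry):
        # Unlink in O(1). The sentinel guarantees both neighbours exist.
        entry.prev.next = entry.next
        entry.next.prev = entry.prev
        entry.prev = entry.next = entry

    def tasks(self):
        out, node = [], self.root.next
        while node is not self.root:
            out.append(node.task)
            node = node.next
        return out


def trim(time_ms, tick_ms):
    """Round time_ms down to an integer multiple of tick_ms."""
    return time_ms - (time_ms % tick_ms)


bucket = TaskList()
a, b, c = Entry("a"), Entry("b"), Entry("c")
for entry in (a, b, c):
    bucket.add(entry)
bucket.remove(b)
print(bucket.tasks())  # ['a', 'c']
print(trim(450, 20))   # 440
print(trim(450, 400))  # 400
```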

The timer in Kafka uses the JDK's DelayQueue to help advance the time wheel. Specifically, every TimerTaskList that is in use is added to the DelayQueue, where "in use" means that the list contains at least one TimerTaskEntry other than the sentinel node. The DelayQueue orders the lists by their expiration, so the TimerTaskList with the earliest expiration sits at the head of the queue. A dedicated thread in Kafka takes expired task lists from the DelayQueue. Amusingly, this thread is named "ExpiredOperationReaper", literally the reaper of expired operations, a name as colorful as "SkimpyOffsetMap". After the reaper thread obtains an expired TimerTaskList from the DelayQueue, it advances the time wheel to that list's expiration and processes the entries in the list: each TimerTaskEntry that is due has its expiration operation executed, and each one that still has time remaining is downgraded to a lower-level time wheel.
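The division of labour can be sketched in Python, with a heapq min-heap standing in for the JDK's DelayQueue. This is a deliberately simplified, single-level illustration and not Kafka's SystemTimer: the class SketchTimer and its methods are hypothetical, buckets are keyed by their expiration in a dict, and no thread or blocking is involved.

```python
import heapq


class SketchTimer:
    def __init__(self, tick_ms=1):
        self.tick_ms = tick_ms
        self.current_time = 0
        self.buckets = {}  # bucket expiration -> list of task names
        self.queue = []    # min-heap of bucket expirations (DelayQueue stand-in)

    def add(self, name, expiration_ms):
        bucket_exp = expiration_ms - expiration_ms % self.tick_ms
        if bucket_exp not in self.buckets:
            self.buckets[bucket_exp] = []
            # A bucket is enqueued once, when it first comes into use.
            heapq.heappush(self.queue, bucket_exp)
        # Adding a task to an existing bucket never touches the queue.
        self.buckets[bucket_exp].append(name)

    def advance(self, now_ms):
        """Pop every bucket due by now_ms, jumping the clock to each expiration."""
        fired = []
        while self.queue and self.queue[0] <= now_ms:
            bucket_exp = heapq.heappop(self.queue)
            self.current_time = bucket_exp
            fired.extend(self.buckets.pop(bucket_exp))
        return fired


timer = SketchTimer()
timer.add("a", 200)
timer.add("b", 200)
timer.add("c", 840)
print(len(timer.queue))     # 2
print(timer.advance(199))   # []
print(timer.advance(200))   # ['a', 'b']
print(timer.advance(1000))  # ['c']
```

Three tasks produce only two queue entries, and the clock moves straight from 0 to 200 to 840 without visiting the ticks in between.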

Readers may be puzzled at this point. The beginning of the article stated clearly that DelayQueue is not suitable for high-performance timing workloads such as Kafka's, so why is DelayQueue introduced here? Note that for inserting and deleting timer task entries (TimerTaskEntry), the time complexity of TimingWheel is O(1), which is much better than DelayQueue. Inserting every TimerTaskEntry directly into a DelayQueue would clearly not perform well enough. And even if several TimerTaskEntry objects were grouped into a TimerTaskList by some rule and the TimerTaskList were then inserted into the DelayQueue, how would another TimerTaskEntry be added to that TimerTaskList later? DelayQueue alone has no good answer for that kind of operation.

From this analysis it follows that the TimingWheel in Kafka is dedicated to inserting and deleting TimerTaskEntry objects, while the DelayQueue is dedicated to advancing time. Imagine that the first expired task list in the DelayQueue has an expiration of 200ms and the second has an expiration of 840ms. Obtaining the head of the DelayQueue takes only O(1) time. If the wheel were instead advanced at a fixed rate of once per millisecond, then 199 of the 200 advances performed before the first task list is obtained would be "empty advances", and another 639 empty advances would be needed before the second task list is obtained, which wastes machine resources for nothing. Using the DelayQueue as a helper trades a small amount of space for time and achieves "precise advancement". The timer in Kafka can be said to put each tool to its best use: TimingWheel handles task insertion and deletion, which it does best, and DelayQueue handles time advancement, which it does best, so the two complement each other.
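The counts of 199 and 639 can be reproduced with a small calculation. The sketch below assumes one advance per 1ms tick, which is what those counts imply. The function name empty_ticks is made up for illustration.

```python
def empty_ticks(bucket_expirations_ms, tick_ms=1):
    """Count, per bucket, the fixed-rate advances that find nothing to expire."""
    counts, previous = [], 0
    for expiration in sorted(bucket_expirations_ms):
        ticks = (expiration - previous) // tick_ms
        counts.append(ticks - 1)  # only the last advance reaches the bucket
        previous = expiration
    return counts


print(empty_ticks([200, 840]))  # [199, 639]
```

With the DelayQueue approach the same two buckets cost exactly two dequeue operations, regardless of how far apart they are.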

 

The interview question goes roughly like this: a consumer asks Kafka for messages, but Kafka has no new messages to return. How does Kafka handle this?

As shown in the figure below, both follower replicas have caught up to the latest position of the leader replica. They then send a fetch request to the leader replica, but no new message has been written to it. What should the leader replica do? It could return an empty fetch result to the follower replicas straight away, but as long as no new messages are written to the leader, the followers would keep sending fetch requests and keep receiving empty results, which wastes resources and is clearly unreasonable.

figure 3

This is where Kafka's delayed operations come in. When Kafka processes a fetch request, it first reads the log file once. If it cannot collect enough messages (fetchMinBytes, configured by the parameter fetch.min.bytes, default value 1), it creates a delayed fetch operation (DelayedFetch) to wait until enough messages are available. When the delayed fetch operation is executed, the log file is read again and the fetch result is then returned to the follower replica.
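The read, delay, read-again flow can be sketched as follows. This is a toy model and not the broker's code: the log is an in-memory bytes object, offsets are byte positions, and the function names handle_fetch and complete_delayed_fetch are hypothetical.

```python
def handle_fetch(log, fetch_offset, fetch_min_bytes=1):
    """First read: respond immediately only if enough bytes are available."""
    data = log[fetch_offset:]
    if len(data) >= fetch_min_bytes:
        return ("respond", data)
    return ("delay", b"")  # a delayed fetch operation would be created here


def complete_delayed_fetch(log, fetch_offset):
    """Second read, performed when the delayed fetch operation is executed."""
    return log[fetch_offset:]


log = b"abc"
print(handle_fetch(log, 3))            # ('delay', b'')
log += b"de"                           # new messages are appended to the leader
print(complete_delayed_fetch(log, 3))  # b'de'
```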

Delayed operations are not unique to fetching messages. Kafka has many kinds of delayed operations, such as delayed data deletion and delayed produce.

Take delayed produce as an example. If the producer client sends a message with the acks parameter set to -1, it must wait until all replicas in the ISR set have confirmed receipt of the message before it receives a successful response. Otherwise it gets a timeout exception.

figure 4

Suppose a partition has 3 replicas: leader, follower1, and follower2, all of which are in the partition's ISR set. To keep the description simple, expansion and shrinking of the ISR set are not considered here. After Kafka receives the client's produce request, it writes messages 3 and 4 to the local log file of the leader replica, as shown in the figure above.

Since the client has set acks to -1, the client can only be told that the messages were received correctly once both the follower1 and follower2 replicas have received message 3 and message 4. If, within a certain period of time, follower1 or follower2 has not fully fetched message 3 and message 4, a timeout exception must be returned to the client. The timeout of a produce request is configured by the parameter request.timeout.ms, and the default value is 30000, that is, 30s.

So who is responsible for waiting until message 3 and message 4 have been written to the follower1 and follower2 replicas and then returning the corresponding response to the client? After writing the messages to the local log file of the leader replica, Kafka creates a delayed produce operation (DelayedProduce), which handles both outcomes, the messages being written to all replicas normally or the request timing out, and returns the corresponding response to the client.

A delayed operation postpones returning the response. First, it must have a timeout (delayMs): if the intended work has not completed within this timeout, the operation is forcibly completed so that a response can be returned to the client. Second, a delayed operation differs from a timed operation. A timed operation is executed after a specific time, whereas a delayed operation can complete before its timeout is reached, so a delayed operation can be triggered by external events.

For a delayed produce operation, the external event is an increase in the HW (high watermark) of the partition that the messages were written to. In other words, as the follower replicas keep synchronizing with the leader replica, the HW keeps growing. Each time the HW grows, Kafka checks whether the delayed produce operation can be completed. If it can, the operation is executed and the response is returned to the client. If it still cannot complete within the timeout, it is forcibly completed.
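The two completion paths, the external event and the timeout, can be sketched like this. It is a Python illustration and not Kafka's DelayedProduce: the method names and result strings are chosen for the example, offsets are assumed to match the message numbers in the text (so the leader's log end offset after message 4 is 5), and the HW is taken as the smallest log end offset among the ISR replicas.

```python
class DelayedProduce:
    def __init__(self, required_offset, created_ms, delay_ms=30000):
        self.required_offset = required_offset  # HW must reach this offset
        self.deadline_ms = created_ms + delay_ms
        self.result = None                      # None means still pending

    def try_complete(self, high_watermark):
        """External event: called each time the partition's HW grows."""
        if self.result is None and high_watermark >= self.required_offset:
            self.result = "ok"
        return self.result is not None

    def force_complete(self, now_ms):
        """Timeout path: called by the timer once delay_ms has elapsed."""
        if self.result is None and now_ms >= self.deadline_ms:
            self.result = "timeout"
        return self.result is not None


def high_watermark(isr_log_end_offsets):
    # Assumption: the HW is the minimum log end offset across the ISR.
    return min(isr_log_end_offsets.values())


leos = {"leader": 5, "follower1": 3, "follower2": 3}
op = DelayedProduce(required_offset=5, created_ms=0)
print(op.try_complete(high_watermark(leos)))  # False
leos["follower1"] = 5
print(op.try_complete(high_watermark(leos)))  # False
leos["follower2"] = 5
print(op.try_complete(high_watermark(leos)))  # True
print(op.result)                              # ok
```

An operation whose followers never catch up would instead end through force_complete once 30000ms have passed, with the result "timeout".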

The delayed fetch operation described earlier works the same way: it is executed either by a timeout trigger or by an external event trigger. The timeout trigger is easy to understand: the second read of the log file is performed once the timeout expires. The external event trigger is slightly more complicated, because a fetch request can be initiated not only by a follower replica but also by a consumer client, and the external events for the two cases differ. For a delayed fetch from a follower replica, the external event is a message being appended to the local log file of the leader replica. For a delayed fetch from a consumer client, the external event can be understood simply as the growth of the HW.
 

Origin blog.csdn.net/qq_35240226/article/details/106474749