[Message queue] Design and implementation of an SSD-based Kafka application-layer cache architecture (Meituan)

The following article is reproduced from the "Meituan Technical Team" blog.

The current status of Kafka on the Meituan data platform

Kafka's excellent I/O optimizations and extensive use of asynchronous design give it higher throughput than other message queuing systems while still keeping latency low, which makes it a natural fit for the big data ecosystem.

Currently, Kafka plays the role of data buffering and distribution in the Meituan data platform. As shown in the figure below, business logs, access-layer Nginx logs, and online DB data are sent to Kafka through the data collection layer. Downstream, part of the data is consumed by users' real-time computing jobs or used by the ODS layer of the data warehouse for warehouse production, and another part flows into the company's unified log center to help engineers troubleshoot online problems.

The current scale of Kafka on Meituan Online:

  • Cluster size: 100+ clusters with 6,000+ nodes.
  • Cluster load: 60,000+ Topics and 410,000+ Partitions.
  • Message scale: 8 trillion messages processed per day, with a peak of 180 million messages per second.
  • Service scale: the downstream real-time computing platform runs 30,000+ jobs, and most of their data sources come from Kafka.

Kafka online pain point analysis & core goals

Currently, Kafka supports a large number of real-time jobs, and a single machine carries a large number of Topics and Partitions. The problem that easily arises in this scenario is that different Partitions on the same machine compete for PageCache resources and affect each other, increasing processing latency and decreasing throughput for the entire Broker.

Next, we analyze Kafka's online pain points based on the processing flow of Kafka read and write requests and on online statistics.

Principle analysis

(Figure: schematic diagram of Kafka's read and write request processing flow)

  • For Produce requests: the I/O threads on the Server side write the requested data to the operating system's PageCache and return immediately. When the number of messages reaches a certain threshold, the Kafka application itself or the operating system kernel triggers a forced flush to disk (as shown in the flowchart on the left).
  • For Consume requests: these mainly rely on the operating system's ZeroCopy mechanism. When the Kafka Broker receives a data read request, it issues a sendfile system call to the operating system. The operating system first tries to fetch the data from PageCache (as shown in the middle flowchart); if the data is not there, a page fault is triggered and the data is read from disk into a temporary buffer (as shown in the flowchart on the right), then copied directly to the network card buffer via DMA to await subsequent TCP transmission. A minimal sketch of this zero-copy path follows below.
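To make the zero-copy path concrete, here is a minimal sketch from the JVM side: on Linux, FileChannel.transferTo maps to the sendfile system call, so file data moves from PageCache to the socket without passing through user space. This is an illustrative, self-contained example rather than Kafka's actual code; the segment path and host name are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Placeholder path and address, for illustration only.
        Path segment = Path.of("/data/kafka-logs/topic-0/00000000000000000000.log");
        try (FileChannel log = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9092))) {
            long position = 0;
            long remaining = log.size();
            // transferTo delegates to sendfile(2) on Linux: data moves from
            // PageCache to the socket buffer without copying through user space.
            while (remaining > 0) {
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```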

In summary, Kafka offers good throughput and latency for individual read and write requests. When processing a write request, the data is returned immediately after being written into PageCache and is flushed to disk asynchronously in batches; this keeps latency low for most write requests, and batched sequential flushing is friendlier to the disk. When processing read requests, real-time consumption jobs can read data directly from PageCache with low request latency, while the ZeroCopy mechanism reduces switching between user mode and kernel mode during data transmission, greatly improving transfer efficiency.

But when multiple Consumers exist on the same Broker, they may suffer delays because they compete for PageCache resources. Let's take two Consumers as an example to explain in detail:

As shown in the figure above, the Producer sends data to the Broker, and PageCache caches it. When all Consumers have sufficient consumption capacity, all data is read from PageCache and the latency of every Consumer instance is low. If one Consumer then falls behind (Consumer Process 2 in the figure), according to the read request processing flow a disk read is triggered, and while the data is read from disk part of it is read ahead into PageCache. When PageCache space is insufficient, data is evicted according to the LRU strategy, and the data read by the lagging consumer replaces the real-time cached data in PageCache. When subsequent real-time consumption requests arrive, because the data in PageCache has been replaced, unexpected disk reads occur. This has two consequences:

  1. Consumers with sufficient consumption capacity lose the performance benefit of PageCache when they consume.
  2. Multiple Consumers affect each other, unexpected disk reads increase, and HDD load rises.

We conducted gradient tests on HDD performance as read and write concurrency increases, as shown in the following figure:

It can be seen that with the increase of read concurrency, the IOPS and bandwidth of the HDD will decrease significantly, which will further affect the throughput and processing delay of the entire Broker.

Online statistics

At present, across the Kafka cluster the TP99 single-machine traffic is 170MB/s, TP95 is 100MB/s, and TP50 is 50-60MB/s; each machine is allocated about 80GB of PageCache on average. Taking the TP99 traffic as a reference, under this traffic and PageCache allocation the maximum time span of data that PageCache can hold is 80*1024/170/60 = 8 min. This shows that the current Kafka service as a whole has extremely low tolerance for delayed consumption jobs: once some jobs fall behind, real-time consumption jobs may be affected.

At the same time, we counted the consumption delay distribution of online real-time jobs. Jobs in the 0-8 min delay range (real-time consumption) account for only 80%, which means that 20% of online jobs are currently in a state of delayed consumption.

Pain point analysis summary

Summarizing the above-mentioned principle analysis and online data statistics, currently online Kafka has the following problems:

  1. Real-time consumption jobs and delayed consumption jobs compete at the PageCache level, causing real-time consumption to suffer unexpected disk reads.
  2. Traditional HDD performance declines sharply as read concurrency increases.
  3. 20% of online consumption jobs are in a delayed state.

Based on the current PageCache space allocation and online cluster traffic analysis, Kafka cannot provide a stable quality-of-service guarantee for real-time consumption jobs, and this pain point urgently needs to be solved.

Expected goal

Based on the above analysis of the pain points, our goal is to ensure that real-time consumption jobs are not affected by delayed consumption jobs due to PageCache competition, so that Kafka can provide a stable quality-of-service guarantee for real-time consumption jobs.

Solution

Why choose SSD

Based on the above analysis, the current pain points can be addressed from two directions:

  1. Eliminate the PageCache competition between real-time consumption and delayed consumption, such as preventing the data read by delayed consumption jobs from being written back to PageCache, or increasing the allocation of PageCache.
  2. Add a new device between HDD and memory, which has better read and write bandwidth and IOPS than HDD.

For the first direction, because PageCache is managed by the operating system, modifying its eviction strategy is difficult to implement and would break the semantics the kernel exposes. In addition, memory is expensive and cannot be expanded without limit, so the second direction has to be considered.

SSDs are becoming increasingly mature. Compared with HDDs, an SSD's IOPS and bandwidth are an order of magnitude higher, which makes it well suited to absorbing the read traffic that spills out of PageCache in the scenario above. We also tested SSD performance, with the results shown in the following figure:

The figure shows that as read concurrency increases, the IOPS and bandwidth of the SSD do not decrease significantly. From this we conclude that the SSD can be used as a caching layer between PageCache and the HDD.

Architectural decision

After introducing the SSD as a caching layer, the key issues to solve next are data synchronization among PageCache, SSD, and HDD, and the routing of read and write requests. At the same time, the new cache architecture needs to fully match the read and write characteristics of the Kafka engine. This section introduces how the new architecture addresses these issues in its selection and design.

The Kafka engine has the following characteristics in read and write behavior:

  • The frequency with which data is consumed changes over time: the older the data, the less frequently it is consumed.
  • For each partition (Partition), only the Leader serves reads and writes.
  • For a given client, consumption is sequential, and data is not consumed repeatedly.

Two alternatives are given below, followed by our selection criteria and architectural decision.

Alternative 1: Implementation based on the kernel layer of the operating system

Currently, open source caching technologies include FlashCache, BCache, DM-Cache, OpenCAS, and others. Among them, BCache and DM-Cache are integrated into Linux, but they have kernel version requirements and are therefore constrained by the kernel version in use, so we could only choose between FlashCache and OpenCAS.

As shown in the figure below, the core design ideas of FlashCache and OpenCAS are similar; both architectures are based on the principle of data locality. The SSD and HDD are divided into fixed-size management units at the same granularity, and the SSD space is then mapped onto multiple devices at the HDD layer (either logically or physically). The access flow is similar to a CPU accessing cache and main memory: the Cache layer is accessed first, and on a Cache Miss the HDD layer is accessed; according to the locality principle, that data is then written back to the Cache layer, and if the Cache space is full some data is evicted via the LRU strategy.

FlashCache/OpenCAS provides four caching strategies: WriteThrough, WriteBack, WriteAround, WriteOnly. Since the fourth type does not do read caching, here we only look at the first three types.

Write:

  • WriteThrough: a write is written to the back-end storage at the same time as it is written to the SSD.
  • WriteBack: a write is written only to the SSD before returning; the caching layer flushes the data to the back-end storage asynchronously.
  • WriteAround: a write goes directly to the back-end storage, and the corresponding cache entry on the SSD is invalidated.

Read:

  • WriteThrough/WriteBack/WriteAround: the SSD is read first; on a miss, the back-end storage is read and the data is written back into the SSD cache (see the sketch below).

For more implementation details, please refer to the official documentation of FlashCache and OpenCAS.
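To make the shared read path of these kernel-layer caches concrete, the sketch below models it in simplified form: try the SSD cache first, and on a miss read the HDD and write the block back into the cache, evicting by LRU when full. The class and interface names are hypothetical and for illustration only; they do not come from FlashCache or OpenCAS.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of a block cache with write-back-on-miss and LRU eviction. */
public class BlockCacheModel {
    private final int capacityBlocks;
    // LinkedHashMap in access order gives us LRU behavior for free.
    private final LinkedHashMap<Long, byte[]> ssdCache;
    private final HddDevice hdd;   // hypothetical backing device

    public BlockCacheModel(int capacityBlocks, HddDevice hdd) {
        this.capacityBlocks = capacityBlocks;
        this.hdd = hdd;
        this.ssdCache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BlockCacheModel.this.capacityBlocks; // LRU eviction when full
            }
        };
    }

    public byte[] read(long blockId) {
        byte[] block = ssdCache.get(blockId);
        if (block != null) {
            return block;                   // cache hit: served from the SSD
        }
        block = hdd.readBlock(blockId);     // cache miss: read from the HDD
        ssdCache.put(blockId, block);       // write-back-on-miss -> this is what can pollute the cache
        return block;
    }

    /** Hypothetical HDD abstraction, for illustration only. */
    public interface HddDevice {
        byte[] readBlock(long blockId);
    }
}
```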

Alternative 2: Implementation inside the Kafka application

In the first alternative above, the core theoretical basis, the data locality principle, does not fully match Kafka's read and write characteristics: the write-back-on-miss behavior still introduces cache space pollution. At the same time, the LRU-based eviction strategy of that architecture conflicts with Kafka's read and write characteristics: when multiple Consumers consume concurrently, the LRU eviction strategy may mistakenly evict near-real-time data, causing performance jitter for real-time consumption jobs.

It can be seen that this alternative cannot completely solve Kafka's current pain points, so the problem has to be solved from within the application itself. The overall design idea is as follows: distribute data across different devices along the time dimension and cache the near-real-time portion on the SSD, so that when PageCache competition occurs, real-time consumption jobs read from the SSD and are not affected by delayed consumption jobs. The following figure shows the read request processing flow under the application-layer architecture:

When a consumption request arrives at the Kafka Broker, the Broker looks up the mapping it maintains between message offsets (Offset) and devices, reads the data directly from the corresponding device, and returns it. Data read from the HDD is not written back to the SSD, which prevents cache pollution. The access path is also deterministic, so there is no extra access overhead caused by Cache Misses. A rough sketch of this routing is shown below.
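The routing can be sketched roughly as follows. This is an illustrative simplification with hypothetical names, not Meituan's actual implementation: the Broker keeps a per-LogSegment state mapping keyed by base offset and chooses SSD or HDD accordingly, never writing HDD data back to the SSD.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of offset-based read routing; names are hypothetical. */
public class ReadRouter {
    public enum SegmentState { ONLY_CACHE, CACHED, WITHOUT_CACHE }

    // Maps the base offset of each LogSegment to its current state.
    private final TreeMap<Long, SegmentState> segmentStates = new TreeMap<>();

    public void markSegment(long baseOffset, SegmentState state) {
        segmentStates.put(baseOffset, state);
    }

    public byte[] read(long offset, Storage ssd, Storage hdd) {
        Map.Entry<Long, SegmentState> entry = segmentStates.floorEntry(offset);
        // If the offset falls outside the tracked range, fall back to the HDD.
        SegmentState state = (entry == null) ? SegmentState.WITHOUT_CACHE : entry.getValue();
        if (state == SegmentState.WITHOUT_CACHE) {
            // Served from HDD and deliberately NOT written back to the SSD,
            // so lagging consumers cannot pollute the cache.
            return hdd.read(offset);
        }
        return ssd.read(offset);   // ONLY_CACHE or CACHED: served from the SSD
    }

    /** Minimal device abstraction for the sketch. */
    public interface Storage { byte[] read(long offset); }
}
```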

The following table provides a more detailed comparison of different candidate solutions:

Finally, considering the fit with Kafka's read and write characteristics, the overall workload, and other factors, we adopted the application-layer implementation, since it is closer to Kafka's own read and write characteristics and solves Kafka's pain points more thoroughly.

New architecture design

Overview

Based on the above analysis of Kafka's read and write characteristics, we give the design goals of the SSD-based cache architecture at the application layer:

  • Data is distributed across different devices along the time dimension: near-real-time data sits on the SSD and is evicted to the HDD over time.
  • All data in a leader partition is written to the SSD.
  • Data read from the HDD is not written back to the SSD.

According to the above goals, we give the implementation of Kafka cache architecture based on SSD at the application layer:

A Partition in Kafka consists of several LogSegments, each containing two index files and a log message file. The LogSegments of a Partition are ordered by Offset, which roughly corresponds to time.

Following the design ideas of the previous section, we first mark different LogSegments with different states. As shown in the upper part of the figure, along the time dimension they fall into three states: OnlyCache, Cached, and WithoutCache. The transitions between the three states, and how the new architecture handles read and write operations, are shown in the lower part of the figure. A LogSegment marked OnlyCache is stored only on the SSD; a background thread periodically synchronizes Inactive LogSegments (those no longer receiving writes) to the HDD, and once synchronization completes the LogSegment is marked Cached. Finally, a background thread periodically checks the space used on the SSD; when it reaches a threshold, the oldest LogSegments (by time) are removed from the SSD and marked WithoutCache.

For write requests, data is still written to PageCache first and flushed to the SSD once the threshold condition is met. For read requests (when the data is not found in PageCache), if the state of the LogSegment containing the requested offset is Cached or OnlyCache, the data is returned from the SSD (LC2-LC1 and RC1 in the figure); if the state is WithoutCache, it is returned from the HDD (LC1 in the figure). A rough sketch of the background state management follows.
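The background state management described above can be sketched roughly as follows; the names are hypothetical and the real implementation is more involved. One periodic task synchronizes Inactive, SSD-only segments to the HDD and marks them Cached; another evicts the oldest Cached segments from the SSD once the used space exceeds a threshold, marking them WithoutCache.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch of the background LogSegment lifecycle; not Meituan's actual code. */
public class SegmentLifecycle {
    enum State { ONLY_CACHE, CACHED, WITHOUT_CACHE }

    static final class Segment {
        final long baseOffset;
        long sizeBytes;
        boolean active;            // still receiving writes
        State state = State.ONLY_CACHE;
        Segment(long baseOffset) { this.baseOffset = baseOffset; }
    }

    private final Deque<Segment> segmentsOldestFirst = new ArrayDeque<>();
    private final long ssdSpaceThresholdBytes;

    SegmentLifecycle(long ssdSpaceThresholdBytes) {
        this.ssdSpaceThresholdBytes = ssdSpaceThresholdBytes;
    }

    void addSegment(Segment s) { segmentsOldestFirst.addLast(s); }

    /** Periodic task 1: copy Inactive, SSD-only segments to the HDD and mark them Cached. */
    void syncInactiveSegmentsToHdd() {
        for (Segment s : segmentsOldestFirst) {
            if (!s.active && s.state == State.ONLY_CACHE) {
                copySsdToHdd(s);           // rate-limited in the real design
                s.state = State.CACHED;
            }
        }
    }

    /** Periodic task 2: when SSD usage exceeds the threshold, evict the oldest Cached segments. */
    void evictOldestFromSsd() {
        long used = segmentsOldestFirst.stream()
                .filter(s -> s.state != State.WITHOUT_CACHE)
                .mapToLong(s -> s.sizeBytes).sum();
        for (Segment s : segmentsOldestFirst) {
            if (used <= ssdSpaceThresholdBytes) break;
            if (s.state == State.CACHED) {         // only evict segments already on the HDD
                deleteFromSsd(s);
                s.state = State.WITHOUT_CACHE;
                used -= s.sizeBytes;
            }
        }
    }

    private void copySsdToHdd(Segment s) { /* device copy elided in this sketch */ }
    private void deleteFromSsd(Segment s) { /* SSD cleanup elided in this sketch */ }
}
```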

For data synchronization of follower replicas, whether to write to the SSD or the HDD can be decided per Topic through configuration, according to the Topic's latency and stability requirements.

Key optimization points

The sections above introduced the design outline and core ideas of the SSD-based Kafka application-layer cache architecture, including the read and write flow, internal state management, and the newly added background threads. This section introduces the key optimization points of the solution, which are closely tied to the performance of the service: LogSegment synchronization and the optimization of the append flush strategy, covered separately below.

LogSegment synchronization

LogSegment synchronization refers to the process of synchronizing data on the SSD to the HDD. The mechanism is designed with the following two key points:

  1. Synchronization method : the synchronization method determines how soon SSD data becomes visible on the HDD, which affects the timeliness of failure recovery and LogSegment cleanup.
  2. Synchronization rate limiting : the LogSegment synchronization process uses a rate-limiting mechanism to prevent normal read and write requests from being affected during synchronization.

Synchronization method

For the LogSegment synchronization method, we considered three alternatives. The following table lists the three schemes and their respective advantages and disadvantages:

In the end, we comprehensively considered the consistency maintenance cost, implementation complexity and other factors, and chose the way to synchronize the Inactive LogSegment in the background.

Synchronization rate limiting

LogSegment synchronization is essentially data transfer between devices, which generates extra read and write traffic on both devices at the same time and occupies their read and write bandwidth. Moreover, since we chose to synchronize Inactive segments, each synchronization transfers an entire segment. If the synchronization process is not throttled, it has a significant impact on overall service latency, mainly in the following two respects:

  • From the perspective of single-disk performance, since SSD performance is far higher than HDD, the HDD's write bandwidth becomes saturated during the transfer, and other read and write requests at that moment suffer latency spikes. If a delayed consumer is reading data from the HDD at the same time, or a Follower is synchronizing data to the HDD, this causes service jitter.
  • From the perspective of single-machine deployment, each machine deploys 2 SSDs and 10 HDDs, so during synchronization one SSD has to sustain the write volume of 5 HDDs. The SSD therefore also shows latency spikes during synchronization, delaying responses to normal requests.

Based on these two points, we need to add a rate-limiting mechanism to the LogSegment synchronization process. The overall principle is to synchronize as fast as possible without affecting the latency of normal read and write requests: if synchronization is too slow, SSD data cannot be cleaned up in time and the SSD eventually fills up. To allow flexible tuning, the limit is exposed as a per-Broker configuration parameter, as sketched below.
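A simple way to express such a limit is to pace the copy loop itself, sleeping whenever the transfer gets ahead of the allowed rate. The sketch below illustrates the principle only; the chunk size is a made-up value and the per-Broker limit would come from configuration, so this is not Meituan's actual implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Illustrative rate-limited SSD -> HDD copy loop; chunk size and limit are made-up values. */
public class ThrottledSegmentCopy {
    private static final int CHUNK_BYTES = 1024 * 1024;     // copy 1MB at a time
    private final long maxBytesPerSecond;                   // e.g. taken from a per-Broker config

    public ThrottledSegmentCopy(long maxBytesPerSecond) {
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    public void copy(FileChannel ssdSegment, FileChannel hddSegment)
            throws IOException, InterruptedException {
        ByteBuffer buf = ByteBuffer.allocate(CHUNK_BYTES);
        long startNanos = System.nanoTime();
        long bytesCopied = 0;
        while (ssdSegment.read(buf) > 0) {
            buf.flip();
            while (buf.hasRemaining()) {
                hddSegment.write(buf);
            }
            bytesCopied += buf.limit();
            buf.clear();
            // Simple throttle: if we are ahead of the allowed rate, sleep off the difference.
            double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
            double allowedSec = (double) bytesCopied / maxBytesPerSecond;
            if (allowedSec > elapsedSec) {
                Thread.sleep((long) ((allowedSec - elapsedSec) * 1000));
            }
        }
        hddSegment.force(true);   // make the copied segment durable on the HDD
    }
}
```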

Optimized log flushing strategy

In addition to synchronization issues, the flushing mechanism during data writing also affects the read and write latency of the service. The design of this mechanism will not only affect the performance of the new architecture, but also affect native Kafka.

The following figure shows the processing flow of a single write request:

When processing a Produce request, the Broker first determines whether the log segment needs to be rolled, based on the current LogSegment position and the data in the request; it then writes the requested data to PageCache and updates the LEO and statistics; finally, it decides from the statistics whether a flush needs to be triggered. If so, it forces a flush via fileChannel.force; otherwise the request returns directly.

In the entire flow, apart from log rolling and flushing, everything is a memory operation and does not cause performance problems. Log rolling involves the file system; Kafka provides a jitter parameter for log rolling to prevent multiple segments from rolling at the same time and putting pressure on the file system. For log flushing, the mechanism Kafka currently provides triggers a forced flush after a fixed number of messages (currently 50,000 online). This only guarantees that messages are flushed at the same frequency when incoming traffic is constant; it cannot limit the amount of data flushed to disk each time, and therefore cannot effectively bound the disk load.
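To make the flow concrete, here is a heavily simplified sketch of the append path with the message-count-based flush trigger described above. The class and method names are hypothetical, and the real code path in Kafka's log layer is considerably more involved; the 50,000-message threshold corresponds to Kafka's log.flush.interval.messages setting.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Simplified sketch of the append path with a message-count-based flush trigger. */
public class CountBasedFlushLog {
    private static final long FLUSH_INTERVAL_MESSAGES = 50_000;   // cf. log.flush.interval.messages

    private final FileChannel activeSegment;
    private long logEndOffset;          // LEO
    private long unflushedMessages;

    public CountBasedFlushLog(FileChannel activeSegment) {
        this.activeSegment = activeSegment;
    }

    public void append(ByteBuffer records, int messageCount) throws IOException {
        // (1) roll check omitted in this sketch
        // (2) write to the file; the kernel buffers this in PageCache and returns quickly
        while (records.hasRemaining()) {
            activeSegment.write(records);
        }
        // (3) update LEO and statistics
        logEndOffset += messageCount;
        unflushedMessages += messageCount;
        // (4) count-based trigger: flush every N messages, regardless of how many bytes they hold
        if (unflushedMessages >= FLUSH_INTERVAL_MESSAGES) {
            activeSegment.force(true);   // fileChannel.force -> fsync
            unflushedMessages = 0;
        }
    }
}
```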

The figure below shows the instantaneous write_bytes value of one disk during the noon peak. Because write traffic rises during the peak, the flush process produces a large number of spikes whose values approach the disk's maximum, which causes read and write request latency to jitter.

In response, we modified the flush mechanism, replacing the message-count-based trigger with a limit on the actual flush rate: for a single segment, the flush rate is limited to 2MB/s. This value takes the actual average message size online into account; if it were set too small, Topics with large individual messages would be flushed too frequently, increasing average latency under high traffic. The mechanism is currently in a small-scale gray release online. The right-hand figure shows the write_bytes metric for the same time period after the gray release: compared with the left figure, the flush rate is significantly smoother, with a maximum of only around 40MB/s.
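Expressed against the sketch above, the change amounts to replacing the count-based trigger with pacing of the explicit flushes. The sketch below is an illustrative assumption of how such a per-segment limit might look, not Meituan's actual patch; only the 2MB/s figure comes from the article.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;

/** Illustrative per-segment flush pacing: explicit flushes are capped at roughly 2MB/s. */
public class RateLimitedFlusher {
    private static final long FLUSH_CHUNK_BYTES = 2L * 1024 * 1024;       // 2MB, as quoted in the article
    private static final long MIN_FLUSH_INTERVAL_NANOS = 1_000_000_000L;  // at most one forced flush per second

    private long unflushedBytes;
    private long lastFlushNanos = System.nanoTime();

    /** Called after each append with the number of bytes just written. */
    public void maybeFlush(FileChannel segment, long bytesAppended) throws IOException {
        unflushedBytes += bytesAppended;
        long now = System.nanoTime();
        // Pace the explicit fsyncs: only force when a 2MB chunk has accumulated AND at least one
        // second has passed, so the sustained forced-flush rate stays around 2MB/s and a traffic
        // burst no longer translates into one huge write spike on the disk.
        if (unflushedBytes >= FLUSH_CHUNK_BYTES && now - lastFlushNanos >= MIN_FLUSH_INTERVAL_NANOS) {
            segment.force(true);
            unflushedBytes = 0;
            lastFlushNanos = now;
        }
    }
}
```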

The same problem exists for the new SSD cache architecture, so in the new architecture the flush operation is rate-limited in the same way.

Solution test

Test target

  • Verify that the application-layer SSD cache architecture prevents real-time jobs from being affected by delayed jobs.
  • Verify that, compared with the cache architectures based on the operating system kernel layer, the application-layer SSD architecture has lower read and write latency under different traffic levels.

Test scenario description

  • Build 4 clusters: new architecture cluster, ordinary HDD cluster, FlashCache cluster, OpenCAS cluster.
  • Each cluster has 3 nodes.
  • Fix the write traffic and compare read/write latency.
  • Delayed consumption setting: only consume data that is 10-150 minutes old relative to the current time (beyond what PageCache can hold, but within what the SSD can hold).

Test content and key indicators

  • Case 1: when there is only delayed consumption, observe the production and consumption performance of the cluster.
    • Key indicators: write latency and read latency, which reflect the cluster's read and write latency.
    • Hit-rate indicators: HDD read volume, HDD read ratio (HDD read volume / total read volume), and SSD read hit rate, which reflect the SSD cache hit rate.
  • Case 2: when there is delayed consumption, observe the performance of real-time consumption.
    • Key indicator: the SLA (quality of service) attainment of real-time jobs across 5 different time ranges.

Test results

From the perspective of single-Broker request latency:

Before the flush mechanism was optimized, the new SSD cache architecture had a clear advantage over the other solutions in all scenarios.

After the flush mechanism was optimized, the latency of the other solutions also improved. At low traffic, the gap between the new architecture and the other solutions narrowed thanks to the flush optimization; when single-node write traffic is high (more than 170MB/s), the advantage remains obvious.

From the perspective of the impact of delayed jobs on real-time jobs:

In all test scenarios, real-time jobs under the new cache architecture are not affected by delayed jobs, which is in line with expectations.

Summary and future outlook

Kafka serves as the unified data buffering and distribution layer on the Meituan data platform. To address the pain point that real-time jobs are affected by delayed jobs due to PageCache pollution and competition, we built an SSD-based application-layer cache architecture for Kafka. This article introduced the design ideas of the new architecture and compared it with other open source solutions. Compared with ordinary clusters, the new cache architecture has very obvious advantages:

  1. Reduced read and write latency : compared with ordinary clusters, the new-architecture cluster reduces read/write latency by 80%.
  2. Real-time consumption is not affected by delayed consumption : compared with ordinary clusters, the new-architecture cluster delivers stable real-time read/write performance that is unaffected by delayed consumption.

At present, this cache architecture has been validated and is in the gray release stage; it will be deployed to high-quality clusters in the future.
