Design and Implementation of an SSD-Based Kafka Application-Layer Cache Architecture

 

On the Meituan data platform, Kafka serves as the unified layer for data caching and distribution. To address the pain point of real-time jobs being affected by lagging jobs, where the two pollute each other's PageCache and compete for it, Meituan built an SSD-based application-layer cache architecture for Kafka. This article introduces the design and implementation of that architecture, including how the solution was selected, how it compares with the alternatives, and the key design considerations, and closes with a performance comparison against those alternatives.

The current status of Kafka on the Meituan data platform

Kafka's I/O optimizations and extensive use of asynchrony give it higher throughput than other message queues while keeping latency low, which makes it a good fit across the big data ecosystem.

Currently, on the Meituan data platform, Kafka acts as the data buffering and distribution layer. As shown in the figure below, business logs, access-layer Nginx logs, and online DB data are sent to Kafka through the data collection layer; downstream, the data is consumed and processed by users' real-time jobs, or used by the ODS layer for data warehouse production. Another part flows into the company's unified log center to help engineers troubleshoot online problems.

The current scale of Kafka on Meituan:

  • Cluster size: 6,000+ nodes and 100+ clusters.

  • Cluster load: 60,000+ topics and 410,000+ partitions.

  • Message scale: 8 trillion messages processed per day in total, with peak traffic of 180 million messages per second.

  • Service scale: the downstream real-time computing platform currently runs 30,000+ jobs, most of which use Kafka as their data source.

Kafka online pain point analysis & core goals

Kafka currently supports a large number of real-time jobs, and a single machine hosts a large number of topics and partitions. The problem that tends to appear in this scenario is that different partitions on the same machine compete for PageCache and interfere with each other, which increases processing latency and lowers throughput for the whole Broker.

Next, we will analyze the pain points of Kafka online based on the processing flow of Kafka read and write requests and online statistics.

Principle analysis

Schematic diagram of Kafka processing read and write process

For Produce requests: the I/O threads on the server side write the requested data into the operating system's PageCache and return immediately. When the number of accumulated messages reaches a certain threshold, the Kafka application itself or the operating system kernel triggers a forced flush to disk (as shown in the left flowchart).

For Consume requests: Kafka mainly relies on the operating system's zero-copy mechanism. When a Kafka Broker receives a read request, it issues a sendfile system call. The operating system first tries to fetch the data from PageCache (as shown in the middle flowchart); if the data is not there, a page fault is triggered and the data is read from disk into a temporary buffer (as shown in the right flowchart), and then copied directly into the network card buffer via DMA for subsequent TCP transmission.
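For reference, this zero-copy read path is what a JVM application gets through FileChannel.transferTo, which maps to sendfile on Linux. The sketch below only illustrates that call pattern; the segment file name, target address, and loop are illustrative and are not Kafka's actual code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySendExample {
    public static void main(String[] args) throws IOException {
        // "00000000000000000000.log" and the target address are illustrative placeholders.
        try (FileChannel segment = FileChannel.open(
                 Paths.get("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {

            long position = 0;
            long remaining = segment.size();
            // transferTo lets the kernel move data from PageCache to the socket buffer
            // (sendfile on Linux), avoiding a copy through user space.
            while (remaining > 0) {
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```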

To sum up, Kafka delivers good throughput and latency for individual read and write requests. When handling a write request, the data is written into PageCache and the request returns immediately, with the data flushed to disk asynchronously in batches; this keeps the latency of most write requests low, and batched sequential flushing is friendlier to the disk. When handling read requests, real-time consumer jobs can read data directly from PageCache with low request latency, and the zero-copy mechanism reduces the user-mode/kernel-mode switches during data transfer, which greatly improves transfer efficiency.

However, when multiple Consumers sit on the same Broker, they may see higher latency because they compete for PageCache. Let's use two Consumers as an example:

As shown in the figure above, the Producer sends data to the Broker and PageCache caches it. When all Consumers have sufficient consumption capacity, all data is read from PageCache and every Consumer instance has low latency. If one Consumer falls behind (Consumer Process 2 in the figure), then according to the read request flow, its reads trigger disk I/O, and part of the data is also read ahead into PageCache. When PageCache space is insufficient, data is evicted by the LRU policy, and the data read by the lagging consumer replaces the near-real-time data in PageCache. When subsequent real-time consumption requests arrive, the data they need has already been evicted, so unexpected disk reads occur. This has two consequences:

  1. Consumers that are keeping up lose the performance benefit of PageCache when they consume.

  2. Multiple Consumers interfere with each other, unexpected disk reads increase, and HDD load rises.

We ran gradient tests on HDD performance under increasing read/write concurrency, as shown in the following figure:

It can be seen that as the read concurrency increases, the IOPS and bandwidth of the HDD will decrease significantly, which will further affect the throughput and processing delay of the entire Broker.

Online statistics

At present, the cluster-level TP99 traffic of Kafka is 170MB/s, TP95 traffic is 100MB/s, and TP50 traffic is 50-60MB/s; a single machine allocates about 80GB to PageCache on average. Taking TP99 traffic as the reference, under this traffic and PageCache allocation the maximum span of data PageCache can hold is 80*1024/170/60 = 8 min, which means the current Kafka service as a whole has an extremely low tolerance for lagging consumer jobs. In this situation, once some jobs fall behind in consumption, real-time consumer jobs are likely to be affected.
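Spelled out with units, that span is simply the PageCache allocation divided by the TP99 write rate:

```latex
T_{\mathrm{cache}} \;=\; \frac{\text{PageCache size}}{\text{TP99 write rate}}
                   \;=\; \frac{80 \times 1024\ \text{MB}}{170\ \text{MB/s}}
                   \;\approx\; 482\ \text{s} \;\approx\; 8\ \text{min}
```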

We also collected the consumption-lag distribution of online real-time jobs: jobs whose lag falls within 0-8 min (real-time consumption) account for only 80%, which means 20% of online jobs are in a lagging state.

Pain point analysis summary

Summarizing the above-mentioned principle analysis and online data statistics, the current online Kafka has the following problems:

  1. Real-time consumption and lagging consumption jobs compete at the PageCache level, causing unexpected disk reads for real-time consumption.

  2. Traditional HDD performance drops sharply as read concurrency increases.

  3. 20% of online consumer jobs are in a lagging state.

Given the current PageCache space allocation and online cluster traffic, Kafka cannot provide a stable quality-of-service guarantee for real-time consumer jobs, and this pain point urgently needs to be solved.

Expected goal

Based on the pain point analysis above, our goal is to ensure that real-time consumer jobs are not affected by lagging consumer jobs through PageCache competition, so that Kafka provides real-time consumer jobs with a stable quality-of-service guarantee.

Solution

Why choose SSD

Based on the analysis above, the current pain points can be attacked from two directions:

  1. Eliminate the PageCache competition between real-time and lagging consumption, for example by not writing the data read by lagging jobs back into PageCache, or by enlarging the PageCache allocation.

  2. Insert a new device between the HDD and memory that offers better read/write bandwidth and IOPS than the HDD.

For the first direction, PageCache is managed by the operating system; modifying its eviction policy is difficult to implement and would break the semantics the kernel exposes. In addition, memory is expensive and cannot be expanded without limit, so enlarging PageCache is not viable either, which leaves the second direction.

SSDs are by now quite mature. Compared with HDDs, SSDs offer an order-of-magnitude improvement in IOPS and bandwidth, which makes them well suited to absorbing the read traffic that falls out of PageCache in the scenario above. We also tested SSD performance, with results shown in the following figure:

The figure shows that as read concurrency increases, SSD IOPS and bandwidth do not drop significantly. Based on this, we can use SSD as a cache layer between PageCache and HDD.

Architectural decision

After introducing SSD as the cache layer, the key issues to solve next are data synchronization among PageCache, SSD, and HDD, and the routing of read and write requests. In addition, the new cache architecture has to fit the characteristics of Kafka's read and write requests. This section describes how the new architecture addresses these issues in its selection and design.

Kafka's read and write behavior has the following characteristics:

  • Data consumption frequency changes over time: the older the data, the less frequently it is consumed.

  • Only the leader of each partition serves read and write requests.

  • For a given client, consumption is sequential, and data is not consumed repeatedly.

Two alternative solutions are described below, together with our selection criteria and architectural decision.

Alternative 1: Implementation based on the kernel layer of the operating system

Open source caching technologies currently include FlashCache, BCache, DM-Cache, OpenCAS, and others. BCache and DM-Cache have been merged into Linux, but using them imposes requirements on the kernel version; constrained by our kernel version, we could only choose between FlashCache and OpenCAS.

As shown in the figure below, FlashCache and OpenCAS share similar core designs, and both rest on the principle of data locality. The SSD and HDD are divided into management units of the same fixed granularity, and the SSD space is mapped onto multiple HDD-layer devices (logically or physically). The access flow is similar to a CPU accessing its cache and main memory: the cache layer is tried first, and on a cache miss the HDD layer is accessed; following the data locality principle, the data read from the HDD is then written back to the cache layer, and if the cache is full some data is evicted by the LRU policy.
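To make this access flow concrete, the sketch below models the generic cache-on-miss read path these kernel-layer caches implement (try the SSD cache, fall back to the HDD, write the block back, evict by LRU when full). It is illustrative Java pseudocode, not FlashCache or OpenCAS source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model of the kernel-layer cache read path (not FlashCache/OpenCAS source). */
public class KernelCacheModel {
    private static final int CACHE_CAPACITY_BLOCKS = 4;   // tiny capacity, for illustration only

    // A LinkedHashMap in access order behaves as a simple LRU cache.
    private final Map<Long, byte[]> ssdCache =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > CACHE_CAPACITY_BLOCKS;    // LRU eviction when the cache layer is full
            }
        };

    public byte[] read(long blockId) {
        byte[] data = ssdCache.get(blockId);   // try the SSD cache layer first
        if (data == null) {
            data = readFromHdd(blockId);       // cache miss: fall back to the HDD layer
            ssdCache.put(blockId, data);       // write the block back to the cache ("data locality")
        }
        return data;
    }

    private byte[] readFromHdd(long blockId) {
        return new byte[4096];                 // stand-in for a real disk read
    }
}
```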

FlashCache/OpenCAS provides four caching strategies: WriteThrough, WriteBack, WriteAround, WriteOnly. Since the fourth type does not do read caching, here we only look at the first three types.

Write:

  • WriteThrough: data is written to the back-end storage at the same time as it is written to the SSD.

  • WriteBack: a data write returns as soon as it is written to the SSD; the cache policy flushes it to the back-end storage asynchronously.

  • WriteAround: a data write goes directly to the back-end storage, and the corresponding SSD cache entry is invalidated.

Read:

  • WriteThrough/WriteBack/WriteAround: the SSD is read first; on a miss the back-end storage is read, and the data is then written back into the SSD cache.

For more implementation details, please refer to the official documentation of the two projects.

Alternative 2: Implementation inside the Kafka application

In the first class of alternatives above, the core theoretical basis, the data locality principle, does not fully match Kafka's read/write characteristics: the write-back-on-read behavior still pollutes the cache space. At the same time, the LRU-based eviction of those architectures also conflicts with Kafka's access pattern: when multiple Consumers consume concurrently, LRU eviction may mistakenly evict near-real-time data, causing performance jitter for real-time consumer jobs.

Clearly, the kernel-layer alternative cannot completely solve Kafka's current pain points, so the change has to be made inside the application. The overall design idea is as follows: distribute data across devices along the time dimension and cache the near-real-time portion on SSD, so that when PageCache competition occurs, real-time consumer jobs read from SSD and are not affected by lagging consumer jobs. The following figure shows how read requests are processed in the application-layer architecture:

When a consumption request reaches the Kafka Broker, the Broker looks up the device that holds the requested message offset from the offset-to-device mapping it maintains, reads the data from that device directly, and returns it; data read from the HDD is not written back to the SSD, which prevents cache pollution. The access path is also deterministic, so there is no extra access overhead caused by cache misses.

The following table provides a more detailed comparison of different candidate solutions:

In the end, considering how well each option matches Kafka's read/write characteristics, the overall workload, and other factors, we chose the application-layer implementation, because it fits Kafka's own read/write characteristics more closely and solves Kafka's pain points more thoroughly.

New architecture design

Overview

Based on the above analysis of Kafka's read and write characteristics, we give the design goals of the SSD-based cache architecture at the application layer:

  • Data is distributed across devices along the time dimension: near-real-time data sits on the SSD and is evicted to the HDD as it ages.

  • All data of a leader partition is written to the SSD.

  • Data read from the HDD is not written back to the SSD.

According to the above goals, we give the implementation of Kafka cache architecture based on SSD at the application layer:

A Kafka partition consists of several LogSegments; each LogSegment contains two index files and a log file holding the messages. The LogSegments of a partition are ordered by offset (and hence roughly by time).

Following the design idea of the previous section, we first mark LogSegments with different states. As shown in the upper part of the figure, along the time dimension a LogSegment is in one of three resident states: OnlyCache, Cached, and WithoutCache. The transitions between the three states and how the new architecture handles reads and writes are shown in the lower part of the figure. A LogSegment marked OnlyCache is stored only on the SSD; a background thread periodically synchronizes Inactive LogSegments (those with no write traffic) to the HDD, and a synchronized LogSegment is marked Cached.

Finally, the background thread periodically checks the space used on the SSD. When usage reaches a threshold, it removes the oldest LogSegments from the SSD, and those LogSegments are marked WithoutCache.

For write requests, data is still written to PageCache first and flushed to the SSD once the flush condition is met. For a read request (when the data is not found in PageCache), if the state of the LogSegment containing the requested offset is Cached or OnlyCache, the data is returned from the SSD (LC2-LC1 and RC1 in the figure); if the state is WithoutCache, it is returned from the HDD (LC1 in the figure).
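A minimal sketch of this routing decision is shown below. The three state names follow the article; the class, field, and method names are hypothetical and only illustrate the idea, not Meituan's actual patch.

```java
/** Hypothetical sketch of per-LogSegment state routing; not the actual Meituan implementation. */
public class SsdCacheRouting {

    /** The three resident states described in the article. */
    enum SegmentState { ONLY_CACHE, CACHED, WITHOUT_CACHE }

    static class LogSegmentMeta {
        final long baseOffset;
        volatile SegmentState state;
        LogSegmentMeta(long baseOffset, SegmentState state) {
            this.baseOffset = baseOffset;
            this.state = state;
        }
    }

    /** Route a read that missed PageCache to the device that holds the segment. */
    byte[] readAfterPageCacheMiss(LogSegmentMeta segment, long offset, int length) {
        switch (segment.state) {
            case ONLY_CACHE:
            case CACHED:
                return readFromSsd(segment, offset, length);   // near-real-time data lives on SSD
            case WITHOUT_CACHE:
            default:
                // Older data is served from HDD and deliberately NOT written back to SSD,
                // so lagging consumers cannot pollute the cache.
                return readFromHdd(segment, offset, length);
        }
    }

    private byte[] readFromSsd(LogSegmentMeta s, long offset, int length) { return new byte[length]; }
    private byte[] readFromHdd(LogSegmentMeta s, long offset, int length) { return new byte[length]; }
}
```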

For data synchronization of follower replicas, whether to write to SSD or HDD can be decided per topic through configuration, according to the topic's latency and stability requirements.

Key optimization points

The previous section introduced the outline and core design ideas of the SSD-based Kafka application-layer cache architecture, including the read/write flow, internal state management, and the new background thread. This section introduces the key optimization points of the scheme, which are closely tied to service performance: LogSegment synchronization and the optimization of the append flush strategy, described separately below.

LogSegment synchronization

LogSegment synchronization refers to the process of synchronizing data on the SSD to the HDD. The mechanism is designed with the following two key points:

  1. Synchronization method: determines when SSD data becomes visible on the HDD, which affects the timeliness of failure recovery and LogSegment cleanup.

  2. Synchronization rate limiting: the LogSegment synchronization process is rate limited so that normal read and write requests are not affected while synchronization is in progress.

Synchronization method

Regarding the synchronization method of LogSegment, we have given three alternatives. The following table lists the introduction of the three schemes and their respective advantages and disadvantages:

In the end, weighing factors such as the cost of maintaining consistency and implementation complexity, we chose to synchronize Inactive LogSegments in the background.

Synchronization rate limiting

LogSegment synchronization is essentially data transfer between devices, which generates extra read and write traffic on both devices at once and occupies their read/write bandwidth. Also, since we chose to synchronize Inactive LogSegments, an entire segment has to be synchronized at a time. If the synchronization process is not throttled, it has a large impact on overall service latency, mainly in two ways:

  • From the single-disk perspective, because SSD performance far exceeds HDD performance, the HDD write bandwidth gets saturated during the transfer, and other read and write requests see latency spikes. If a lagging consumer is reading from the HDD, or a Follower is synchronizing data to the HDD, at that moment, the service will jitter.

  • From the single-machine deployment perspective, a machine carries 2 SSDs and 10 HDDs, so during synchronization one SSD has to sustain the write volume of 5 HDDs. The SSD therefore also shows performance spikes during synchronization, which affects the response latency of normal requests.

Based on these two points, we added a rate limit to the LogSegment synchronization process. The overall principle is to synchronize as fast as possible without affecting the latency of normal read and write requests: if synchronization is too slow, SSD data cannot be evicted in time and the SSD eventually fills up. To allow flexible adjustment, the rate limit is exposed as a configuration parameter at the granularity of a single Broker.
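The sketch below shows one way such a throttled SSD-to-HDD segment copy can look, using a chunk-and-sleep throttle. The chunk size, rate parameter, and class names are illustrative assumptions, not the production implementation.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative throttled copy of an Inactive LogSegment from SSD to HDD (not Meituan's code). */
public class ThrottledSegmentSync {
    private static final int CHUNK_BYTES = 1 << 20;   // copy 1 MB at a time (assumed chunk size)
    private final long maxBytesPerSecond;             // per-Broker rate limit (assumed knob)

    public ThrottledSegmentSync(long maxBytesPerSecond) {
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    public void sync(Path ssdSegment, Path hddSegment) throws IOException, InterruptedException {
        try (FileChannel src = FileChannel.open(ssdSegment, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(hddSegment,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = src.size();
            while (position < size) {
                long start = System.nanoTime();
                long copied = src.transferTo(position, Math.min(CHUNK_BYTES, size - position), dst);
                position += copied;
                // Sleep long enough that the average copy rate stays under the limit,
                // leaving HDD/SSD bandwidth for normal produce and fetch requests.
                long minNanosForChunk = copied * 1_000_000_000L / maxBytesPerSecond;
                long elapsed = System.nanoTime() - start;
                if (elapsed < minNanosForChunk) {
                    Thread.sleep((minNanosForChunk - elapsed) / 1_000_000L);
                }
            }
            dst.force(true);   // make the HDD copy durable before the segment is marked Cached
        }
    }
}
```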

Optimized log flushing strategy

Besides the synchronization problem, the flush mechanism on the data write path also affects the service's read and write latency. The design of this mechanism affects not only the new architecture but also native Kafka.

The following figure shows the processing flow of a single write request:

When processing a Produce request, the Broker first decides, based on the position of the current LogSegment and the data in the request, whether the log segment needs to be rolled; it then writes the requested data into PageCache and updates the LEO and the statistics; finally, it decides from the statistics whether a flush needs to be triggered; if so, it forces a flush through fileChannel.force, otherwise the request returns directly.

In the whole flow, apart from log rolling and flushing, everything is a memory operation and causes no performance problems. Log rolling touches the file system, and Kafka already provides a jitter parameter for the roll time to prevent many segments from rolling at the same moment and putting pressure on the file system. For log flushing, the mechanism Kafka currently provides triggers a forced flush after a fixed number of messages (50,000 in our production environment). This only guarantees that messages are flushed at the same frequency when incoming traffic is constant; it cannot bound the amount of data flushed each time, and therefore cannot effectively bound the load on the disk.
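This count-based trigger corresponds to Kafka's log.flush.interval.messages setting. The sketch below is a simplified model of that behavior, not the actual Kafka Log/LogSegment code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Simplified model of Kafka's count-based flush trigger (illustrative, not the real Log class). */
public class CountBasedFlush {
    private final FileChannel logChannel;
    private final long flushIntervalMessages;   // e.g. 50,000 online, cf. log.flush.interval.messages
    private long unflushedMessages = 0;

    public CountBasedFlush(FileChannel logChannel, long flushIntervalMessages) {
        this.logChannel = logChannel;
        this.flushIntervalMessages = flushIntervalMessages;
    }

    public void append(ByteBuffer records, int messageCount) throws IOException {
        logChannel.write(records);               // lands in PageCache and returns immediately
        unflushedMessages += messageCount;
        if (unflushedMessages >= flushIntervalMessages) {
            logChannel.force(true);              // forced flush: PageCache -> disk
            unflushedMessages = 0;
        }
    }
}
```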

The figure below shows the instantaneous write_bytes of one disk during the noon peak. During the peak, as write traffic increases, the flush process produces large spikes whose values approach the maximum the disk can sustain, which makes the latency of read and write requests jitter.

To address this, we modified the flush mechanism, changing the trigger from a message count to an actual flush rate limit: for a single segment the flush rate is capped at 2MB/s. This value takes the actual average message size online into account; if it were set too small, topics with large individual messages would flush too frequently and increase average latency under high traffic. The mechanism is currently rolled out to a small portion of the fleet (gray release). The picture on the right shows the write_bytes metric for the same period after the rollout: compared with the picture on the left, the flush rate is visibly smoother, and the maximum rate is only around 40MB/s.
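The sketch below illustrates the idea of the modified trigger: flush only once the accumulated unflushed bytes fit within a 2MB/s budget. The 2MB/s figure comes from the article; the class and the bookkeeping are illustrative assumptions, not the actual patch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Illustrative rate-limited flush: cap bytes flushed per second per segment (not the actual patch). */
public class RateLimitedFlush {
    private static final long MAX_FLUSH_BYTES_PER_SECOND = 2L << 20;  // 2 MB/s per segment, per the article

    private final FileChannel logChannel;
    private long unflushedBytes = 0;
    private long lastFlushNanos = System.nanoTime();

    public RateLimitedFlush(FileChannel logChannel) {
        this.logChannel = logChannel;
    }

    public void append(ByteBuffer records) throws IOException {
        int written = logChannel.write(records);         // still returns after hitting PageCache
        unflushedBytes += written;
        long elapsedNanos = System.nanoTime() - lastFlushNanos;
        // Flush only when enough time has passed that the average flush rate stays <= 2 MB/s,
        // which smooths the write_bytes spikes seen at peak traffic.
        long allowedBytes = MAX_FLUSH_BYTES_PER_SECOND * elapsedNanos / 1_000_000_000L;
        if (unflushedBytes > 0 && unflushedBytes <= allowedBytes) {
            logChannel.force(true);
            unflushedBytes = 0;
            lastFlushNanos = System.nanoTime();
        }
    }
}
```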

The same problem exists in the new SSD cache architecture, so its flush operation is rate limited in the same way.

Solution test

Test goals

  • Verify that the application-layer SSD cache architecture prevents real-time jobs from being affected by lagging jobs.

  • Verify that, compared with the cache architecture based on the operating system kernel layer, the application-layer SSD architecture has lower read and write latency under different traffic levels.

Test scenario description

  • Build 4 clusters: new architecture cluster, ordinary HDD cluster, FlashCache cluster, OpenCAS cluster.

  • 3 nodes per cluster.

  • Write traffic is fixed, and read/write latency is compared across the clusters.

  • Lagging-consumption setting: consume only data that is 10 to 150 minutes old relative to the current time (beyond the range PageCache can hold but within the range the SSD can hold).

Test content and key indicators

  • Case 1: when there is only lagging consumption, observe the production and consumption performance of the cluster.

  • Key indicators: write latency and read latency, which reflect the cluster's read/write latency.

  • Hit-rate indicators: HDD read volume, HDD read ratio (HDD read volume / total read volume), and SSD read hit rate, which reflect the SSD cache hit rate.

  • Case 2: when there is lagging consumption, observe the performance of real-time consumption.

  • Key indicators: the SLA (service quality) attainment of real-time jobs across 5 different time ranges.

Test Results

From the perspective of single Broker request delay:

Before the flush mechanism was optimized, the new SSD cache architecture had a clear advantage over the other solutions in all scenarios.

After the flush mechanism was optimized, the other solutions also improved in latency. Under low traffic, the gap between the new architecture and the other solutions narrowed because of the flush optimization, but when per-node write traffic is high (above 170MB/s) the advantage remains obvious.

From the perspective of the impact of lagging jobs on real-time jobs:

In all scenarios covered by the test, real-time jobs in the new cache architecture are not affected by lagging jobs, which matches expectations.

Summary and future outlook

On the Meituan data platform, Kafka serves as the unified layer for data caching and distribution. To address the pain point of real-time jobs being affected by lagging jobs through mutual PageCache pollution and the resulting PageCache competition, we built our own SSD-based application-layer cache architecture for Kafka. This article described the design ideas of the new architecture and how it compares with other open-source solutions. Compared with an ordinary cluster, the new cache architecture has clear advantages:

  1. Lower read and write latency: compared with an ordinary cluster, the new architecture cluster reduces read/write latency by 80%.

  2. Real-time consumption is not affected by lagging consumption: compared with an ordinary cluster, the new architecture cluster has stable real-time read/write performance that is not affected by lagging consumption.

This cache architecture has been validated and is currently in the gray-release stage; it will be rolled out to high-priority clusters next. The code involved will also be contributed back to the Kafka community, and we welcome discussion.

About the Author

Shiji and Shilu are both Meituan data platform engineers.

----------  END  ----------

Job Offers

The real-time storage team of Meituan's basic R&D platform is responsible for developing and maintaining real-time storage engines for big data scenarios and building the related platforms, aiming to provide a unified, efficient, reliable, and easy-to-use streaming storage service. Interested candidates are welcome to join us! Resumes can be sent to: [email protected] (please mark the email subject: real-time storage).

You may also want to read

Kafka file storage mechanism those things

Meituan OCTO trillion-level data center computing engine technology analysis

The exploration and practice of Meituan's next-generation service management system OCTO2.0
