Yan Yanfei: Demystifying Kafka's High Performance and How to Optimize It


This article was first published in the cloud+ community and may not be reproduced without permission.

Good afternoon everyone. I am Yan Yanfei, a senior engineer on the CKafka team in Tencent Cloud's Infrastructure Department. Today I will first share some of the key design points behind open-source Kafka's high performance, then describe the optimizations our Tencent Cloud CKafka has made on top of community Kafka, and finally give my outlook on the future of the Kafka community.

Kafka High Performance Demystified

First, I will walk through the overall Kafka architecture so that everyone has a macro-level picture of the system. Then I will describe Kafka's storage organization and its on-disk message format in more detail. Finally, to give you an intuitive sense of Kafka's high performance, I will also present some performance numbers.

Overall architecture

We can see that the entire Kafka cluster contains only two components: the Broker and ZooKeeper.

The Broker is the core engine of the Kafka cluster. It is responsible for storing and forwarding messages and for serving clients. A Kafka cluster can easily be scaled out or shrunk simply by adding or removing Brokers. The basic unit of service that Kafka exposes is the topic, so scaling at the topic level is what gives applications the ability to scale in parallel. To achieve this, Kafka partitions each topic: different partitions are placed on different Brokers, so the capacity of more Brokers can be used and the application-level workload can be scaled horizontally.
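As a minimal illustration of topic partitioning (my own example, not from the talk), the sketch below uses the Java AdminClient to create a topic whose partitions will be spread across the Brokers; the topic name, broker address, partition count, and replication factor are arbitrary example values.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any reachable broker; the client discovers the rest of the cluster from it.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions spread across the brokers, replication factor 2 (example values).
            NewTopic topic = new NewTopic("demo-topic", 12, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```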

ZooKeeper is mainly responsible for storing cluster metadata such as configuration, Broker information, and topic information, and it also takes on part of the coordination and leader-election work; it can be thought of as the configuration management center of the Kafka cluster. At this point you may wonder: if the cluster can be scaled simply by adding or removing Brokers, but there is only one ZooKeeper ensemble, will ZooKeeper become the bottleneck that constrains the cluster's ability to scale out? Indeed, in some older versions of Kafka, both producers and consumers had to talk to ZooKeeper to pull metadata, coordinate consumer groups, and commit and store consumer offsets. The consequence was that every client communicated directly with ZooKeeper, which put heavy pressure on it, hurt the stability of the system, and ultimately limited the parallel scalability of the whole Kafka cluster. Starting from version 0.9 (inclusive), however, the Kafka team added new protocols and a coordinator module so that client production and consumption no longer require any interaction with ZooKeeper. Today ZooKeeper acts only as a configuration management center; the load on it is very small, and it will not become a bottleneck that restricts the horizontal scalability of the cluster.
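To make the post-0.9 client path concrete, here is a minimal consumer sketch (my own example, not from the talk): it bootstraps from a broker address, joins a group through the broker-side group coordinator, and commits offsets back to the broker; nothing in the client touches ZooKeeper. The topic, group, and broker names are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class NoZooKeeperConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // brokers, not zookeeper.connect
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");            // group managed by the broker-side coordinator
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
                }
                consumer.commitSync(); // offsets are stored on the broker, not in ZooKeeper
            }
        }
    }
}
```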

You can also see that producers and consumers interact directly with the Broker to produce and consume. Unlike many traditional systems, Kafka does not achieve parallel scalability by inserting a layer of proxies. Instead, its internal routing (metadata) protocol lets producers and consumers negotiate routing directly with the Brokers, so clients produce to and consume from the Brokers without any third-party proxy. This proxy-free approach shortens the data path and reduces latency, improves the stability of the whole system, and also saves a great deal of cost.

To sum up, Kafka's overall architecture has the following main advantages. First, the cluster can be scaled horizontally at the cluster level by adding or removing Brokers. Second, by partitioning topics, it achieves near-unbounded parallel scaling at the application level. Third, thanks to a well-designed protocol, producers and consumers talk directly to the back-end Brokers; eliminating the proxy layer shortens the data path, reduces latency, and greatly reduces cost.

By now everyone should have a reasonably macro-level understanding of Kafka. The overall architecture determines the upper bound of what the system can do, but the performance of the key components determines how many servers are needed to deliver a given capacity. More servers mean not only higher cost but also more operational burden and a larger impact on system stability. So next I will introduce the architecture of Kafka's core engine, the Broker.

Broker Architecture

We can see that the Broker is a typical Reactor model. It contains a network thread pool responsible for handling network requests, that is, receiving and sending on the sockets and packing and unpacking requests; completed requests are then pushed through a request queue to the core processing module, which handles the real business logic (Kafka persists all messages to disk, so this is mainly file I/O). Kafka thus uses a multi-threaded design that takes full advantage of the multi-core nature of modern machines, and the queue asynchronously decouples the network processing module from the core processing module, so network processing and file I/O proceed in parallel, which greatly improves the efficiency of the whole system.
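The following is a highly simplified sketch of this pattern (my own illustration, not Kafka's actual code): a pool of network threads parses requests and hands them to a shared request queue, and a pool of I/O threads takes requests off the queue and does the disk work.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReactorSketch {
    // Shared request queue between the network layer and the core I/O layer.
    static final BlockingQueue<Request> requestQueue = new ArrayBlockingQueue<>(500);

    record Request(int connectionId, byte[] payload) {}

    // Network thread: reads a complete packet off a socket, unpacks it, enqueues it.
    static void networkThreadLoop() throws InterruptedException {
        while (true) {
            Request req = readAndUnpackFromSocket();   // placeholder for socket + protocol handling
            requestQueue.put(req);                     // hand off; do not block on disk work here
        }
    }

    // Core I/O thread: takes requests, does validation and file I/O, then responds.
    static void ioThreadLoop() throws InterruptedException {
        while (true) {
            Request req = requestQueue.take();
            appendToLog(req.payload());                // append to the partition's active segment
            enqueueResponse(req.connectionId());       // response goes back via the owning network thread
        }
    }

    static Request readAndUnpackFromSocket() { return new Request(0, new byte[0]); }
    static void appendToLog(byte[] payload) { /* file append elided */ }
    static void enqueueResponse(int connectionId) { /* per-network-thread response queue elided */ }
}
```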

At this point everyone has a macro-level picture of the Kafka architecture. As mentioned above, Kafka persists every message to disk. So why is Kafka not afraid of the disk the way traditional message queues are, which try to keep data in cache and avoid touching the disk as much as possible? Why did Kafka choose to persist all messages? Below I will explain its storage organization and storage format, and the answers will emerge one by one.

How storage is organized

This is Kafka's current storage organization. We can see that a topic is only a logical concept and does not correspond to any physical entity. To scale a topic horizontally, Kafka partitions it; each partition appears on disk as a directory, and the data within a partition is further split into segment files. During production, Kafka can therefore quickly find the latest segment and simply append to the end of the file; production fully exploits sequential disk writes, which greatly improves production throughput. Segmented storage also makes it easy to delete expired messages: old segments can simply be removed. To make consumption convenient, Kafka applies a trick to segment naming as well: each segment is named after the offset of the first message it contains, so when consuming, a binary search over segment names quickly locates the segment that holds a given message. For fast positioning inside a segment, Kafka additionally maintains two sparse index files per data segment; a binary search over the index quickly narrows down the position of the requested message within the segment, from which consumption proceeds. From this we can see that both production and consumption in Kafka are essentially sequential reads and writes, making full use of the disk's sequential I/O capability, and that lookup during consumption is a two-level binary search whose cost depends only on the index size of a single segment and is unaffected by the total data volume of the system.
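Here is a minimal sketch of that two-level lookup (my own illustration; the class and field names are simplified stand-ins, not Kafka's internal structures): level one finds the segment whose base offset (encoded in the file name) is the greatest one not exceeding the target, and level two binary-searches the segment's sparse index for a starting file position.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class SegmentLookupSketch {
    // Sparse index entry: relative offset within the segment -> physical file position.
    record IndexEntry(long relativeOffset, long filePosition) {}

    static class Segment {
        final long baseOffset;          // encoded in the segment file name, e.g. 00000000000000368769.log
        final IndexEntry[] sparseIndex; // sorted by relativeOffset
        Segment(long baseOffset, IndexEntry[] sparseIndex) {
            this.baseOffset = baseOffset;
            this.sparseIndex = sparseIndex;
        }
    }

    // Level 1: segments keyed by base offset; floorEntry() is the binary search over segment names.
    static Segment findSegment(NavigableMap<Long, Segment> segments, long targetOffset) {
        return segments.floorEntry(targetOffset).getValue();
    }

    // Level 2: binary search in the sparse index for the greatest entry <= target,
    // giving a file position from which a short sequential scan finds the exact message.
    static long findFilePosition(Segment seg, long targetOffset) {
        long relative = targetOffset - seg.baseOffset;
        int lo = 0, hi = seg.sparseIndex.length - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (seg.sparseIndex[mid].relativeOffset() <= relative) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return seg.sparseIndex[best].filePosition();
    }

    public static void main(String[] args) {
        NavigableMap<Long, Segment> log = new TreeMap<>();
        log.put(0L, new Segment(0L, new IndexEntry[]{ new IndexEntry(0, 0) }));
        log.put(368769L, new Segment(368769L, new IndexEntry[]{ new IndexEntry(0, 0), new IndexEntry(100, 4096) }));
        Segment s = findSegment(log, 368800L);
        System.out.println("segment base=" + s.baseOffset + " pos=" + findFilePosition(s, 368800L));
    }
}
```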

The most basic unit Kafka deals with is the message. So what exactly does a Kafka message look like, and in what format is it stored on disk? Needless to say, this greatly affects system performance, so next I will introduce Kafka's message format in detail.

Message format

For ease of understanding, the message format is shown here as C-style code. Kafka messages are simple binary structures encoded in network byte order, so encoding and decoding are very efficient. The whole Kafka message header is very compact, only about 30 bytes, and it includes a CRC checksum for verifying the message. The most elegant part of Kafka's design is that the message keeps exactly the same format on the producer side, on the network, on the Broker, and finally in the file on disk, so as a message flows through the system no transcoding is needed at all, which is extremely efficient.

To improve overall throughput, Kafka produces and consumes messages in batches; a batch is simply the messages laid out one after another in memory in their binary form. To improve network and disk utilization, Kafka also supports message compression. The flow chart on the right shows the compression process in detail: the whole batch of messages is first compressed as a unit, producing a new binary string, and that string is then packed into the value field of a new wrapper message. Through this message nesting, Kafka very cleverly implements batch compression while still keeping the message format consistent.

Keeping the format consistent has the following benefits. First, in the whole message flow the producer compresses only once; after the compressed batch reaches the Broker, the Broker needs only a single decompression to validate the messages and assign offsets, after which the batch can be written to the file as-is, with no need for the Broker to perform an expensive re-compression. So even with compression enabled, the cost on the Broker side is very low. Second, because the compressed format is preserved on disk, when a fetch request arrives the Broker can send the data to the consumer still compressed, without any decompression or recompression; the consumer is responsible for decompressing. This greatly improves Broker-side performance. In this simple way Kafka achieves end-to-end compression and shifts the computational cost onto the producers and consumers.
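From the client's point of view, batching and end-to-end compression are just producer settings. The sketch below (my own example; broker address, topic, and tuning values are illustrative, not recommendations) enables both: messages accumulate into batches, the batch is compressed once on the producer, and the Broker stores and serves it still compressed.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class BatchedCompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Batching: accumulate up to 64 KB or 5 ms before sending one batched request (example values).
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);

        // Compression happens once on the producer; the broker stores and serves the batch compressed.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "payload-" + i));
            }
        } // close() flushes any remaining batches
    }
}
```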

Kafka high performance

Due to time constraints, that covers the main points of Kafka's high performance. The Kafka team has of course made many other ingenious design choices that I will not go into here, but this slide lists the key points in detail; if you are interested, you can study them on your own.

Having covered so many of Kafka's performance-related design points, what performance does it actually deliver? To give you a more intuitive sense of it, here are some benchmark results. Of course, performance numbers without the test conditions are meaningless, so the configuration comes first:

1. Test scenario: a single Broker, one topic with multiple partitions.
2. Hardware: 32 cores, 64 GB of memory, a 10-gigabit NIC, and 12 x 2 TB SATA disks configured as software RAID10.
3. Broker version 0.10.2.0, configured to flush to disk every 100,000 messages or every two seconds.
4. Load generation: the native stress-testing tool shipped with community Kafka, with 140 clients running concurrently to simulate a reasonable level of concurrency.

All of the benchmark numbers that follow use this same configuration, so I will not repeat the test conditions later.

From the table below we can see that Kafka easily reaches millions of messages per second with small messages, and even with 1 KB messages it still reaches hundreds of thousands per second, so Kafka's overall performance is very strong. At the same time, we found during the test that the Broker's CPU utilization, and especially its disk I/O utilization, was not high, which suggests that community Kafka still leaves some room for optimization. So next I will introduce some of the optimizations our CKafka has made on top of community Kafka.

Kafka performance optimization

In this part, I will first take you a bit deeper into the Broker-side architecture to identify its potential bottlenecks, and then pick out a few specific optimizations to describe in detail.

Anatomy of the current architecture

To make this easier to follow, let us trace a real request through the Broker and see how the modules interact. When a producer starts, it first establishes a connection with the Broker. On the Broker side an Accept thread listens for new connections; once a connection is established, Accept hands it off to one of the threads in the network send/receive thread pool, and the connection setup is complete. Next comes the data request: the produce request sent by the producer is handled by that network thread, and whenever the network thread has received a complete packet and unpacked it, it pushes the request onto the request queue; the real logic is then handled by the back-end core processing threads. A core I/O processing thread competes to pull a produce request from the request queue, parses the messages and creates the corresponding message objects, performs validity checks on them, and assigns offsets, after which the messages are written to the file. After the write, it checks whether the number of unflushed messages has reached the flush threshold; if so, the same thread also performs the flush. Finally the result is passed back through the response queue of the owning network thread, which packs it and returns it to the producer. At that point one produce round trip is complete.

From this flow we noticed three things. First, in the whole Kafka architecture there is only one request queue, and it is not lock-free, so the network threads and the core I/O threads contend fiercely for its lock, which may limit the concurrency of the system and hence its performance; this is the first optimization candidate. Second, as just mentioned, Kafka flushes the disk directly in the core thread, which blocks the entire core processing path and hurts overall performance; this is the second optimization candidate. Third, while producing messages a large number of message objects are created in order to validate them; so many short-lived objects put significant pressure on the JVM GC and may become a performance bottleneck. CKafka has optimized community Kafka in many other areas as well, but due to time constraints I will focus on these three.

Lock optimization

Our first round of lock optimization, as you can see from the architecture diagram, is actually very simple: we replaced the single request queue on the Broker side with a lock-free request queue, and we also replaced all of the response queues with lock-free queues. What effect did this round of optimization achieve? Comparing the results, we found that after the lock-free queue optimization, overall performance was basically the same as the community version of Kafka. According to our earlier analysis there should have been a large improvement, so why did the expected effect not materialize?
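As an illustration of the kind of change involved (a sketch under my own assumptions, not CKafka's actual code), a mutex-protected queue can be swapped for a CAS-based, lock-free one such as java.util.concurrent.ConcurrentLinkedQueue:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RequestQueues {
    // Before: a plain queue guarded by a single lock; every network thread and
    // every I/O thread contends for the same monitor on put/take.
    static final Queue<byte[]> lockedQueue = new ArrayDeque<>();

    static void putLocked(byte[] request) {
        synchronized (lockedQueue) { lockedQueue.add(request); }
    }

    static byte[] takeLocked() {
        synchronized (lockedQueue) { return lockedQueue.poll(); }
    }

    // After: a lock-free queue based on CAS; producers and consumers no longer
    // serialize on one lock (note: unbounded, so backpressure must be handled elsewhere).
    static final Queue<byte[]> lockFreeQueue = new ConcurrentLinkedQueue<>();

    static void putLockFree(byte[] request) { lockFreeQueue.offer(request); }

    static byte[] takeLockFree() { return lockFreeQueue.poll(); }
}
```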

To find out why, we did a more detailed statistical analysis of the Broker, counting the number of requests. The chart below shows that, whether on community Kafka or on our optimized version, even at millions of messages per second the number of produce requests is very small, below the 100,000 level. Now it made sense: open-source Kafka batches messages on the producer side, merging many messages into each produce request, which drastically reduces the number of requests the Broker sees; with fewer requests, the lock contention between the network threads and the core processing threads also drops, and at this request rate the lock is simply not the real bottleneck of the system. So it is not surprising that the lock-free queue did not deliver the expected gain. In that sense our first optimization was not very successful, but the road of optimization is always long and winding; one setback does not scare us, and CKafka kept moving forward.

Disk flush optimization

The second optimization is asynchronous disk flushing. For this, CKafka adds a dedicated pool of flush threads used only for flushing to disk. When a core thread finds that a flush is needed, it simply creates a flush task and pushes it through a lock-free queue to the flush threads, which perform the actual flush. This way the core processing threads are never blocked by disk flushing, which should greatly improve system performance. Did this round of optimization have an effect? Let us look at the performance comparison below.
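Here is a minimal sketch of the idea (my own illustration, not CKafka's implementation): the core I/O thread enqueues a flush task and returns immediately, while a dedicated flush thread drains the queue and performs the blocking fsync calls.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

public class AsyncFlusher {
    // Lock-free hand-off between core I/O threads and the dedicated flush thread.
    private final ConcurrentLinkedQueue<FileChannel> flushTasks = new ConcurrentLinkedQueue<>();

    // Called by a core I/O thread after appending enough unflushed messages:
    // enqueue the segment's channel and return immediately instead of calling force() inline.
    public void requestFlush(FileChannel segmentChannel) {
        flushTasks.offer(segmentChannel);
    }

    // Dedicated flush thread: drains the queue and performs the blocking fsync calls.
    public void startFlushThread() {
        Thread t = new Thread(() -> {
            while (true) {
                FileChannel ch = flushTasks.poll();
                if (ch == null) {
                    LockSupport.parkNanos(1_000_000); // nothing to flush; back off for 1 ms
                    continue;
                }
                try {
                    ch.force(false); // flush file data to disk, off the critical path
                } catch (IOException e) {
                    // A real broker would trigger log-dir failure handling here; we just report it.
                    e.printStackTrace();
                }
            }
        }, "async-flusher");
        t.setDaemon(true);
        t.start();
    }
}
```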

After the asynchronous flush optimization, throughput with small messages is 4 to 5 times that of the community version, and even with large messages it is roughly doubled (as message size and partition count grow, the improvement trends downward). We also found that during the asynchronous flush tests the I/O utilization of the whole system stayed very high, basically above 90%, so the bottleneck of the system is now disk I/O. I/O utilization above 90% means we have essentially exhausted the disks, and it also means there is not much headroom left for further throughput gains. Does that mean there is nothing left to optimize in Kafka? Not quite: optimization is not only about raising throughput, it is also about using fewer resources at the same throughput. So CKafka did not stop here; we went on to optimize GC, expecting not so much a throughput improvement as a reduction in resource utilization, and therefore cost, in comparable scenarios.

GC optimization

On the GC side, community Kafka creates a message object for every message while validating produce requests, which generates a huge number of objects. After our optimization, CKafka performs message validation directly on the binary data in the ByteBuffer, so validation creates no message objects at all; far fewer allocations means less pressure on the JVM GC, which improves system performance. Comparing the numbers before and after, the optimization clearly has an effect: afterwards, GC accounts for less than 2.5% of total time, so GC is no longer a bottleneck. On community Kafka, with more partitions and smaller messages, GC time can reach 10%, so our GC optimization saves roughly 1.5% to 7% of GC time. Likewise, at comparable throughput the CPU consumption of our system is 5% to 10% lower than the community version, so the GC optimization does effectively reduce CPU consumption. Finally, both before and after the GC optimization, I/O was already at its peak and remained the system bottleneck, so, consistent with our prediction, throughput did not increase significantly; the gains are mainly lower resource consumption and, for the client side, somewhat lower latency.
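A minimal sketch of the idea (my own, with a deliberately simplified record layout rather than Kafka's real header): walk the buffer, read each record's length and stored CRC, and checksum the body in place with java.util.zip.CRC32, never materializing a per-message object.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class ZeroObjectValidation {
    /**
     * Validates a batch laid out as repeated [length:int][crc:int][body:length bytes] records.
     * (Simplified layout for illustration; real Kafka headers carry more fields.)
     * Returns the number of records, throwing if any checksum mismatches.
     */
    public static int validate(ByteBuffer batch) {
        CRC32 crc = new CRC32();
        int count = 0;
        while (batch.remaining() >= 8) {
            int length = batch.getInt();
            long storedCrc = batch.getInt() & 0xFFFFFFFFL;
            if (length < 0 || length > batch.remaining()) {
                throw new IllegalStateException("corrupt length at record " + count);
            }

            // Checksum the body in place via a slice; no byte[] copy, no message object.
            ByteBuffer body = batch.slice();
            body.limit(length);
            crc.reset();
            crc.update(body);                          // CRC32.update(ByteBuffer) reads without copying
            if (crc.getValue() != storedCrc) {
                throw new IllegalStateException("CRC mismatch at record " + count);
            }

            batch.position(batch.position() + length); // advance past the body
            count++;
        }
        return count;
    }
}
```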

That concludes the performance-optimization part, again for reasons of time. To make the result easy to see at a glance, this slide shows the final comparison: with all the optimizations combined, throughput is 4 to 5 times that of community Kafka for small messages, and roughly double even for 1 KB messages (the improvement again declines as partition count and message size grow). We can also see that system I/O has become the bottleneck, which gives us a reference for later hardware selection: we may attach more disks to push throughput further and squeeze more out of the system, balancing CPU consumption against disk consumption, and more generally choose hardware whose CPU, disk, and network capacities are well matched so that resource utilization is maximized.

Next I will talk about some problems we found while operating CKafka and the optimizations we made for them. We also hope the Kafka community can adopt some of the key suggestions that came out of operating CKafka, so that Kafka adapts even better to production environments.

First, community Kafka consumers currently cannot fetch in a pipelined ("pipe") fashion, which causes the following problems. First, consumer performance depends heavily on the network latency to the Broker: as latency grows, consumption performance drops sharply, which ultimately limits Kafka's usage scenarios, for example it cannot serve cross-city data synchronization well. Second, replication reuses the consumer logic, so replica fetching cannot be pipelined either; replica synchronization performance is therefore low and very latency-sensitive, which makes cross-region deployment of a Kafka cluster impractical and limits deployment flexibility. Under high load it is also easy for replica fetching to fall behind production, causing ISR jitter that hurts system performance. On the second point we have already made an optimization so that replica fetching is pipelined; even if we need cross-city deployment later, replica synchronization performance will be sufficient. The first point, however, we cannot fix on our side, because CKafka is built to stay compatible with community Kafka and to let customers use the open-source SDK directly, so there is no way for us to change the client. Here we hope the community can adopt the suggestion and implement pipelined consumption in the consumer, so that consumption performance no longer depends on network latency and users face no geographic restrictions.
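To illustrate what "pipelined" fetching means here (a conceptual sketch only, not the Kafka wire protocol or CKafka's replica-fetcher code): instead of waiting for each fetch response before issuing the next request, the fetcher keeps a small window of requests in flight, so throughput is bounded by bandwidth rather than by round-trip latency.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

public class PipelinedFetchSketch {
    interface Transport {
        // Sends a fetch starting at the given offset; completes when the response arrives.
        CompletableFuture<Long> fetchAsync(long offset);
    }

    // Synchronous style (what the talk says the community fetcher effectively does):
    // one fetch per round trip, so throughput ~ batchSize / RTT.
    static void fetchSynchronously(Transport t, long startOffset, long batchSize, int rounds) {
        for (int i = 0; i < rounds; i++) {
            t.fetchAsync(startOffset + i * batchSize).join(); // wait before issuing the next request
        }
    }

    // Pipelined style: keep up to maxInFlight fetches outstanding; round-trip latency is hidden.
    static void fetchPipelined(Transport t, long startOffset, long batchSize, int rounds, int maxInFlight) {
        Queue<CompletableFuture<Long>> inFlight = new ArrayDeque<>();
        for (int i = 0; i < rounds; i++) {
            if (inFlight.size() == maxInFlight) {
                inFlight.poll().join();                        // only block when the window is full
            }
            // Speculatively request the next range; a real fetcher would reconcile with responses.
            inFlight.add(t.fetchAsync(startOffset + i * batchSize));
        }
        while (!inFlight.isEmpty()) inFlight.poll().join();
    }
}
```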

Second, for performance reasons community Kafka by design does not let lower-version consumers consume messages produced by higher-version producers, yet Kafka has by now gone through three message format versions, which makes upgrading and downgrading producers and consumers very unfriendly for businesses. In our cloud service we have already implemented message format conversion: messages of different versions can be stored mixed in the same file, and any version can be produced and any version consumed. We hope community Kafka will consider adding similar support, because in production systems compatibility is one of the most important criteria for adoption. Even with our transcoding between high and low message versions, the CPU still has headroom and is not the system bottleneck, so we hope the community can adopt this as well.

That is basically the end of my sharing. This is my personal WeChat; if you have any questions you can add me. Our CKafka team is also hiring heavily right now, so feel free to contact us if you are interested.

 

Q/A

Q: I have a question. Looking at the flush optimization just now, I noticed that CPU usage also increased several times along with the performance improvement. Is that mainly because your optimization introduced an extra memory copy?

A: No. Community Kafka is actually already quite careful about memory copies. During the whole message flow, for example when producing, a ByteBuffer is allocated at the network layer to hold the message packet, and that same ByteBuffer is passed around inside the system without being copied again; only when a message has to be stored in a different format and needs transcoding is a new message generated with one memory copy, otherwise there are no extra copies. Our asynchronous flush optimization does not introduce any additional memory copy either. The several-fold increase in CPU usage comes mainly from the increase in throughput: as we saw, small-message throughput improved 4 to 5 times, which means more network operations, more packing and unpacking, and more system I/O, and therefore higher CPU consumption. With all of that added up, it is normal for CPU usage to rise several times.

Q: I have a question about something in the slides: under heavy load, replica fetching can fail to keep up with the production rate. We have seen this too, in testing and online: replica fetching falls behind, affects other nodes, and can cause an avalanche effect. You just said you have a solution; which solution did you adopt on your side?

A: The main reason community Kafka replicas cannot keep up is that replication reuses the consumer path, and that path does not support pipelined fetching. Kafka's replica fetch is synchronous: it sends a fetch request, waits for the response, and only then sends the next one; it cannot pipeline. With this synchronous pattern, the Broker's network latency becomes the key factor in replica synchronization, and under high load the latency on the Broker side can reach seconds, so replica fetching throughput becomes insufficient. The real cause is that the number of fetch requests is too low; if we use pipelining to increase the number of in-flight fetch requests, replica synchronization performance naturally improves.

Q: Do you have any more specific solutions?

A: On our side we adopted a new protocol for replica synchronization so that replica fetch requests can be pipelined, which raises the number of in-flight requests and therefore the replica synchronization speed. The real reason community Kafka replicas fall behind is that the requests are synchronous, so high latency reduces the request rate; increasing the number of concurrent fetch requests solves the problem at its root. What we use is pipelining to raise the request count and fix the problem of replicas not keeping up. That is the core of it, and also the simplest solution.

Q: A question about the asynchronous flush optimization you just showed: the data you presented does not include latency figures. Does asynchronous flushing increase latency for the business?

A: Because there were many clients in the test, adding latency to the statistics would have made the charts cluttered, so latency is not shown here. In fact, in our test statistics latency is better with asynchronous flushing. With asynchronous flushing, no core processing step is blocked during a request: the flush task only has to be pushed onto the queue, and the response can be returned to the client directly. If instead the flush is done synchronously in the core path, it blocks the core processing thread, and each flush is quite expensive, often around 400 milliseconds, so latency becomes much larger. With asynchronous flushing, in our tests even at the peak throughput of seven or eight hundred MB/s, latency stays very good: the average over the whole test is between 15 and 30 milliseconds, whereas in the community Kafka environment it is around 200 milliseconds.

Q: A follow-up: with asynchronous flushing, if the data has not yet been flushed, you do not wait for the flush to complete before replying to the client. Will messages be lost if the machine goes down before the flush?

A: This is really a question about Kafka's application scenarios. In current community Kafka, not every message is flushed either; flushing happens at a configured message-count or time interval, so on a power failure community Kafka also cannot guarantee that no messages are lost. Kafka is generally not used in scenarios that require absolutely no message loss; it is mainly used for log collection and other scenarios with high real-time and high throughput requirements, and Kafka deliberately trades some message reliability for throughput. Our current first version of asynchronous flushing is relatively simple: we push the flush task onto the flush queue and immediately return success to the client, which can indeed lose some messages on power failure. For now we prioritize throughput, which matches Kafka's typical use cases. Later we will consider whether to support returning to the user only after the data has actually been flushed; implementing that is fairly simple, we would hold the produce request until the flush thread has really completed the flush and only then return success to the client. But in that case, as you said, latency may increase, because you have to wait for the real flush to finish. This is a trade-off between high throughput and high reliability that has to be made according to the application.

Q: Hello, I have two questions. The first is about data reliability: you mentioned that data in Kafka is not necessarily reliable, so what has Tencent done here to optimize data reliability?

A: On the hardware side, all of our storage disks use RAID10, so even if a small number of disks fail there is no risk of data loss for us. On the software side, like community Kafka, CKafka can use multiple replicas with appropriate configuration to ensure that data is not lost when a machine fails. Third, in the implementation, CKafka also flushes to disk proactively at regular time and message-count intervals, which reduces the data lost on an unexpected power failure. So compared with community Kafka, CKafka offers stronger data reliability guarantees in both hardware and software.
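For reference, the replica-and-flush configuration this answer alludes to can be expressed with standard Kafka settings (a hedged example with illustrative values, not CKafka's actual defaults): a replicated topic with a minimum in-sync replica count and periodic flush thresholds, plus a producer that waits for acknowledgement from all in-sync replicas.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ReliabilitySettings {
    public static void main(String[] args) throws Exception {
        Properties admin = new Properties();
        admin.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient client = AdminClient.create(admin)) {
            Map<String, String> topicConfig = new HashMap<>();
            topicConfig.put("min.insync.replicas", "2"); // a write needs at least 2 in-sync replicas
            topicConfig.put("flush.messages", "100000"); // flush every 100,000 messages...
            topicConfig.put("flush.ms", "2000");         // ...or every 2 seconds, whichever comes first

            NewTopic topic = new NewTopic("reliable-topic", 6, (short) 3).configs(topicConfig);
            client.createTopics(Collections.singleton(topic)).all().get();
        }

        // Producer side: wait for acknowledgement from all in-sync replicas before a send is complete.
        Properties producer = new Properties();
        producer.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        producer.put(ProducerConfig.ACKS_CONFIG, "all");
    }
}
```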

Q: My second question: you just mentioned the problem of cross-city replica synchronization. Is Tencent's deployment currently cross-city?

A: Currently CKafka is deployed within a single city and a single zone, although CKafka itself is capable of cross-city deployment. We do not offer that deployment mode at the moment, mainly because the consumer performance of the community Kafka SDK depends strongly on the network latency to the Broker. If we deployed across regions, client consumption performance could not be guaranteed: inter-region latency is often tens or even hundreds of milliseconds, which degrades consumption performance so badly that it cannot meet business needs. Of course, if the community Kafka SDK adopts the suggestion above and implements pipelined consumption, cross-region deployment will not be a problem.

Q: OK, that raises another question. Suppose something unexpected happens, for example an incident like the Tianjin explosion takes out Tencent's entire Tianjin data center. Have you considered a migration plan for that case?

A: CKafka actually has clusters deployed in each region. Users can purchase instances in different regions, which gives them nearby access on the one hand and cross-region disaster recovery on the other, ensuring availability. Users can also synchronize data between regions with the Kafka synchronization tools we provide, achieving region-level disaster recovery.

Q: For the business, if it uses such a synchronization tool, does that add extra cost? Does the business need to modify its programs to get cross-city access?

A: Business users can use open-source tools and approaches directly and get cross-region access without any changes. The community Kafka ecosystem and tooling are quite complete; if you look around the community, you can always find a tool that suits you.

For more information, please click:

kafka-high performance demystification and optimization.pdf

 

This article has been published on Tencent Cloud + Community with the author's authorization. Original link: https://cloud.tencent.com/developer/article/1114834?fromSource=waitui

 
