Kafka + eCAL architecture design for small-scale and low-cost scenarios

Kafka is a message queue that offers both durable storage and high performance; it fits many scenarios and scales well. How you tune Kafka's configuration parameters and how you design topics, partition counts, and physical placement largely determine whether the overall architecture succeeds or fails. Most articles discuss Kafka configuration from a data-center perspective, but small teams often just want Kafka as a cross-process, traceable isolator that replaces cumbersome files or custom TCP/UDP interfaces. That scenario deserves its own discussion. It is characterized by:

  • A small number of physical servers, possibly just one.
  • A single-strand optical fiber link, or even just a gigabit network.

This article first tries, through tuning, to get more out of a Kafka queue under limited resources; it then introduces a compromise architecture design that brings in eCAL, a middleware from the intelligent-vehicle world, to solve the problem.

1. Disk and network are the two most important bottlenecks

Whether you run a cluster or a single machine, disk and network are the dominant bottlenecks. No matter how many acceleration strategies Kafka employs, such as zero-copy transfers, the underlying physical devices ultimately bound the availability of the architecture.

1.1 SSDs and mechanical disk arrays

In my tests, a mechanical hard disk with enough concurrent read and write sessions still runs into a concurrency trap: frequent head seeks caused by cache misses. This is especially visible when several high-traffic topics are read and written at the same time, and worse still when the consumed data is far apart in time.

The typical symptom is very high write latency with disk utilization at 100% (visible in iostat, for example): network bandwidth is clearly not exhausted, yet writes still stall. This situation calls for careful analysis before prescribing a remedy.

  1. Check the server.properties file and pay attention to the parameter "num.io.threads":
# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

If this value has never been modified, it defaults to 8. Try reducing it to 2 and see whether the situation improves: for mechanical hard disks, limiting concurrent read/write sessions recovers part of the efficiency.

  2. If you run Kafka in a virtual machine, turn on the hypervisor's "use host cache" option for the virtual disk, which may help improve throughput.


  3. Use an SSD

If none of the above measures relieves the disk-latency problem, switch to an SSD without hesitation. Under heavy concurrent random reads and writes, a mechanical disk array is far slower than a 16 TB enterprise SSD.

1.2 Network Bandwidth

Kafka is a TCP-based message queue, so its egress bandwidth scales linearly with the number of consumer groups (xN). If the ingress of one topic-partition is 10 MB/s and 100 consumer groups subscribe to it, the egress is 1000 MB/s, which essentially forces you onto optical fiber. The number of consumer groups therefore deserves close attention. When that number must be kept as it is, the following strategies help.

  1. Enable zstd compression
    Recent versions of Kafka support zstd compression, which has lower CPU overhead than gzip. Enable zstd on the producer side; a compression ratio above 2x is usually achievable:
    /* enable zstd compression on the producer configuration */
    if (rd_kafka_conf_set(conf, "compression.type", "zstd", errstr,
                          sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        cbprintf("%s\n", errstr);
        rd_kafka_conf_destroy(conf);
        return false;
    }
    /* higher levels compress better but cost more producer CPU */
    if (rd_kafka_conf_set(conf, "compression.level", "9", errstr,
                          sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        cbprintf("%s\n", errstr);
        rd_kafka_conf_destroy(conf);
        return false;
    }

Note, however, that a high compression.level significantly increases producer CPU usage. If the producer can no longer keep up with the incoming data, lower compression.level.

  2. Dual network card split

Give one broker two network cards so the physical transmission can be spread across them. The broker can listen on two IP addresses, physically steering consumer traffic onto different NICs.
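As a sketch, the split can be expressed directly in server.properties by declaring one listener per NIC. The listener names and IP addresses below are illustrative assumptions, not values from a real deployment:

# one listener per physical network card (addresses are examples)
listeners=NIC1://192.168.1.10:9092,NIC2://192.168.2.10:9092
advertised.listeners=NIC1://192.168.1.10:9092,NIC2://192.168.2.10:9092
listener.security.protocol.map=NIC1:PLAINTEXT,NIC2:PLAINTEXT
inter.broker.listener.name=NIC1

Consumers bootstrapped against the NIC2 address will then pull their traffic through the second card.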

2. Adjustment of parallel computing strategy

When computing and storage resources are stretched thin, lower-level means may be needed to control traffic and disk overhead. For a TCP-carried message queue, consumption should be kept focused, and wasted bandwidth avoided in particular. For example, if three algorithm consumer groups share the same data, the required traffic is x3. If one piece of data needs three kinds of processing, the naive design puts the algorithms in three consumer groups (different consumers within one group see different data, which does not meet the requirement). Instead, it is recommended to use one consumer plus three processing submodules rather than having each algorithm consume from Kafka directly; the single consumer effectively performs a second level of fan-out (a sketch of this pattern follows the figures below). The difference is shown in the figure below:

[Figure: a two-partition Kafka cluster with three consumer groups; Algorithm 1 (group 1), Algorithm 2 (group 2), and Algorithm 3 (group 3) each connect directly to the cluster and each consumes the full data stream.]

The figure above simulates a dual-partition cluster. Algorithm 1, Algorithm 2, and Algorithm 3 are all consumers connected directly to the cluster, so together they pull the full data set three times. If we add managers between the algorithms and the cluster to consume the data uniformly, only one copy of the data needs to be consumed, as shown in the following figure:

[Figure: the same Kafka cluster with a single consumer group of managers; Manager 1 and Manager 2 consume the data once and fan it out locally to Algorithm 1, Algorithm 2, and Algorithm 3.]
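Below is a minimal sketch of this manager pattern, using the librdkafka consumer API alongside the producer snippet shown earlier. The handler names algo1/algo2/algo3 and the surrounding setup are illustrative assumptions, and error handling is abbreviated:

    #include <librdkafka/rdkafka.h>

    /* the three algorithm submodules fed by the manager (hypothetical names) */
    static void algo1(const void *buf, size_t len) { /* ... */ }
    static void algo2(const void *buf, size_t len) { /* ... */ }
    static void algo3(const void *buf, size_t len) { /* ... */ }

    static void manager_loop(rd_kafka_t *rk)
    {
        for (;;) {
            /* one poll feeds all three submodules: cluster egress is x1, not x3 */
            rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
            if (!msg)
                continue;              /* poll timed out, no message */
            if (!msg->err) {
                algo1(msg->payload, msg->len);
                algo2(msg->payload, msg->len);
                algo3(msg->payload, msg->len);
            }
            rd_kafka_message_destroy(msg);
        }
    }

Here rk is assumed to be a consumer handle already subscribed to the topic; the second fan-out to the algorithms then happens in-process instead of over the network.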

3. Use two message queues at the same time

In the diagram above we introduced managers to cut Kafka's egress throughput, but doing so adds significant complexity and its own pitfalls. Is there a local message queue that completes this fan-out while saving both bandwidth and disk?

A local message queue, a local Kafka instance, or a local database could all play this role. A typical high-speed local message queue is eCAL:

https://eclipse-ecal.github.io/ecal/

eCAL is a cross-platform machine-to-machine middleware for reliable, direct communication within a LAN. It was originally built for stable, reliable sensor interaction inside intelligent vehicles. Notably, its transport uses UDP multicast, whose advantage is that reception is inherently one-to-many; as long as the LAN environment is stable, UDP packet loss is rare.

With eCAL in place, the managing consumer can pull data down from Kafka once and then multicast it over the local switch to the algorithm modules.
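As a sketch of the manager's multicast side, assuming the eCAL 5 C++ API (the unit name and topic name here are illustrative), forwarding the bytes pulled from Kafka looks roughly like this:

    #include <ecal/ecal.h>

    // one-time setup in the manager process ("kafka_manager" is an example name)
    eCAL::Initialize(0, nullptr, "kafka_manager");
    eCAL::CPublisher pub("sensor_data");   // hypothetical topic name

    // inside the Kafka consume loop: forward each payload via UDP multicast
    pub.Send(msg->payload, msg->len);

    // on shutdown
    eCAL::Finalize();

Each algorithm module then creates an eCAL::CSubscriber on the same topic and receives the stream without any extra Kafka egress.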

The point needing attention is continuity: a monotonically increasing counter should be embedded in each message so that receivers can detect gaps caused by UDP packet loss.
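A minimal, transport-agnostic sketch of that counter, assuming the manager prefixes every multicast payload with the header below (the struct layout and function names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* prepended to every multicast payload by the manager */
    typedef struct {
        uint64_t seq;          /* incremented by 1 for each message */
        uint32_t payload_len;  /* number of payload bytes that follow */
    } msg_header_t;

    /* call from the receive callback; returns how many messages were lost */
    static uint64_t check_continuity(uint64_t *next_expected, const msg_header_t *h)
    {
        uint64_t lost = 0;
        if (*next_expected != 0 && h->seq > *next_expected)
            lost = h->seq - *next_expected;  /* gap => UDP drop */
        if (lost)
            fprintf(stderr, "lost %llu message(s) before seq %llu\n",
                    (unsigned long long)lost, (unsigned long long)h->seq);
        *next_expected = h->seq + 1;
        return lost;
    }

On a detected gap the receiver can, for example, re-fetch the missing range from Kafka, which still holds the authoritative copy.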

====
With GPT, I no longer feel much need to keep blogging. The pitfalls I used to step through here can now be resolved with a single question, so I no longer need a blog to consolidate knowledge.
