How to better use Kafka

Introduction

To keep Kafka stable in use, stability work should follow the lifecycle of Kafka within the business. It can be divided into three main stages: prevention in advance (preventing problems through standardized usage and development), runtime monitoring (keeping the cluster stable and detecting problems in time), and troubleshooting (having a complete emergency plan).

Prevention in advance

Prevention in advance means avoiding problems through standardized usage and development. It mainly includes best practices for the cluster, producer, and consumer sides, pre-launch testing, and temporary switch functions prepared for emergencies (such as message backlogs).

Kafka tuning principles:

1. Determine the optimization goals and state them quantitatively (common Kafka optimization goals are throughput, latency, durability, and availability).

2. After determining the goal, clarify the dimensions along which to optimize:

Universal optimization: operating system, JVM, etc.

Targeted optimization: Optimize Kafka’s TPS, processing speed, latency, etc.

Best practices on the producer side

Parameter tuning

  • Use the Java client;

  • Use kafka-producer-perf-test.sh to benchmark your environment;

  • Configure memory, CPU, and batch compression appropriately;

  • batch.size: the larger this value, the higher the throughput, but the higher the latency;

  • linger.ms: how long the producer waits to fill a batch before sending; the larger the value, the higher the throughput, but the higher the latency;

  • max.in.flight.requests.per.connection: default 5; the maximum number of unacknowledged requests the client sends on a single connection (per broker) before blocking. Values greater than 1 can affect message ordering when retries occur;

  • compression.type: compression settings, which improve throughput;

  • acks: data durability setting;

  • Avoid large messages (they occupy too much memory and slow down broker processing);

  • Broker-side adjustments: increase num.replica.fetchers to improve follower synchronization TPS, avoid broker Full GC, etc.;

  • When throughput is below the network bandwidth: add threads, increase batch.size, add more producer instances, and increase the number of partitions;

  • When acks=-1 and latency increases: increase num.replica.fetchers (the number of threads the follower uses to fetch data) to compensate;

  • For cross-data-center transmission: increase the socket buffer settings and the OS TCP buffer settings (a producer configuration sketch follows this list).
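As a reference, the following is a minimal sketch of a throughput-oriented producer configuration with the Java client; the broker address and the concrete values are assumptions and should be tuned against kafka-producer-perf-test.sh results for your environment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);  // larger batches raise throughput but add latency
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);           // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compression improves throughput
        props.put(ProducerConfig.ACKS_CONFIG, "all");             // durability; "1" or "0" trades safety for latency
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // >1 may reorder messages on retry
        return new KafkaProducer<>(props);
    }
}
```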

development practices

a. Isolate topics properly

Use different Kafka topics for different scenarios (whether some delay is tolerable, real-time messages, scheduled periodic tasks, etc.) so that other traffic does not crowd out or block the processing of real-time business messages.

b. Control message flow

If downstream consumption has a bottleneck or the cluster load is too high, implement rate control or delayed/paused sending on the producer side (or in a message gateway) to avoid sending a large number of messages in a short period of time.

c. Support message re-push (compensation)

Manually query the missing data, then resend the messages to the MQ to recover the lost data.

d. Ensure the order of messages

If you need strict ordering within a Kafka partition (that is, two related messages must be processed in strict order), set a message key so that messages of a given type are routed to the same partition of the same topic according to a specified rule (this solves most consumption-ordering problems).
However, avoid message skew within partitions (for example, routing by store ID can easily lead to unbalanced partitions).

1. Producer side: send messages with a specified key so that messages with the same key go to the same partition (see the sketch below).

2. Consumer side: consume with a single thread, or write messages into N in-memory queues, with data for the same key always going to the same queue; then have N threads, each consuming one queue.
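A minimal sketch of a keyed send with the Java client; the topic name ("order-events") and the choice of the order ID as the key are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedSendSketch {
    // Messages with the same key hash to the same partition,
    // so all events for one order stay in order within that partition.
    static void sendOrderEvent(KafkaProducer<String, String> producer,
                               String orderId, String payload) {
        producer.send(new ProducerRecord<>("order-events", orderId, payload)); // assumed topic name
    }
}
```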

e. Properly improve message sending efficiency

Batch sending: Kafka first caches messages in an in-memory double-ended queue (buffer) and sends them in batches once the volume reaches batch.size, which reduces the number of network transmissions and improves efficiency;

End-to-end message compression: a batch of messages is packed and compressed before being sent to the broker, delivered to consumers still compressed, and decompressed on the consumer side; note that frequent compression and decompression also costs some performance;

Asynchronous sending: making the producer asynchronous can improve sending efficiency, but if messages are produced too quickly it can lead to too many pending threads and insufficient memory, and ultimately to message loss;

Parallel consumption within an index partition: when a long-running task occupies and locks the index partition its message belongs to, subsequent tasks cannot be dispatched to idle clients in time. If the server enables parallel consumption within index partitions, subsequent tasks can be dispatched to other clients promptly without adjusting the number of index partitions (this only applies to messages that do not need ordering guarantees).

f. Ensure the reliability of message delivery

Producer: if data reliability requirements are high, send messages with the callback-style API, and set parameters such as acks, retries, and the replication factor to ensure that messages sent by the producer are not lost.

Broker: to obtain higher performance and throughput, Kafka stores data on disk asynchronously in batches (batch flushing). If data reliability requirements are high, this can be changed to synchronous flushing to improve message reliability.
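A hedged sketch of a reliability-oriented send with the callback API in the Java client; the exact retry count and the compensation action in the callback are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableSendSketch {
    static void configureForDurability(Properties props) {
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);  // avoid duplicates introduced by retries
    }

    static void send(KafkaProducer<String, String> producer, String topic, String key, String value) {
        producer.send(new ProducerRecord<>(topic, key, value), (metadata, exception) -> {
            if (exception != null) {
                // The message may not have been persisted: log it and trigger the business-level re-push flow.
                System.err.println("send failed, needs compensation: " + exception.getMessage());
            }
        });
    }
}
```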

Best practices on the consumer side

Parameter tuning

  • Throughput: adjust the number of partitions and the OS page cache (allocate enough memory to cache data);

  • Offsets topic (__consumer_offsets): offsets.topic.replication.factor (default 3), offsets.retention.minutes (default 1440, i.e. one day);

  • If offset commits are slow: use asynchronous or manual commits;

  • fetch.min.bytes, fetch.max.wait.ms;

  • max.poll.interval.ms: the maximum allowed interval between calls to poll(). If poll() is not called within this time, the consumer is considered dead and a rebalance is triggered;

  • max.poll.records: the maximum number of records returned by a single poll() call, default 500;

  • session.timeout.ms;

  • Consumer rebalance: check timeouts, check processing times/logic, check for GC issues;

  • Network configuration (a consumer configuration sketch follows this list).
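A minimal sketch of a throughput-oriented consumer configuration with the Java client; the broker address, group ID, and concrete values are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigSketch {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");          // assumed consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);        // batch fetches for throughput
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);       // upper bound on waiting for fetch.min.bytes
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);        // default; lower it if processing is heavy
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // default 5 minutes
        return new KafkaConsumer<>(props);
    }
}
```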

Development practices

a. Make message consumption idempotent

Message consumption idempotence is mainly implemented based on the business logic.

Take processing order messages as an example :

1. Store a unique idempotency key composed of order number + order status in Redis;

2. Before processing, check whether the key already exists in Redis; if it does, the message has already been processed and is discarded;

3. If Redis shows it has not been processed, insert the processed data into the business DB, and finally write the idempotency key into Redis;

In short, idempotence is achieved through a Redis pre-check plus a DB unique index as the final guarantee (a sketch follows).
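A minimal sketch of this pattern, assuming a Spring StringRedisTemplate and a hypothetical OrderRepository whose table has a unique index on (order_no, order_status); the names are illustrative only.

```java
import org.springframework.data.redis.core.StringRedisTemplate;

public class OrderIdempotenceSketch {
    private final StringRedisTemplate redis;  // assumed to be injected
    private final OrderRepository orderRepo;  // hypothetical DB access layer

    public OrderIdempotenceSketch(StringRedisTemplate redis, OrderRepository orderRepo) {
        this.redis = redis;
        this.orderRepo = orderRepo;
    }

    public void handle(String orderNo, String orderStatus, String payload) {
        String idempotencyKey = "order:" + orderNo + ":" + orderStatus;
        // Pre-check: if the key already exists, the message was processed before, so discard it.
        if (Boolean.TRUE.equals(redis.hasKey(idempotencyKey))) {
            return;
        }
        // Final guarantee: the table's unique index on (order_no, order_status)
        // rejects a duplicate insert even if the Redis check is bypassed.
        orderRepo.insertProcessedRecord(orderNo, orderStatus, payload);
        redis.opsForValue().set(idempotencyKey, "1");
    }

    interface OrderRepository { // hypothetical
        void insertProcessedRecord(String orderNo, String orderStatus, String payload);
    }
}
```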

b. Do a good job of consumer isolation

When the message volume is very large and real-time and offline consumers consume the same cluster, the heavy disk I/O of offline consumption directly affects the latency of the real-time business and the stability of the cluster.

According to how real-time the consumption is, consumers can be divided into two categories: real-time consumers and offline consumers.

Real-time consumers: have high requirements for data freshness. In real-time consumption scenarios, Kafka serves data to them directly from the system page cache (hot reads), with no disk pressure, which suits business scenarios such as advertising and recommendation.

Offline consumers (scheduled periodic consumers): usually consume messages from minutes or hours ago. Such messages are typically already on disk, so consuming them triggers disk I/O (cold reads), which suits periodically executed scenarios such as report and batch computation.

c. Avoid accumulation of message consumption

  • Defer processing, control the rate, and spread messages over a time window (for messages with low real-time requirements);

  • If the production rate exceeds the consumption rate, appropriately increase partitions, increase the number of consumers, and raise consumption TPS;

  • Avoid heavy consumption logic and optimize consumer TPS:

Whether there are a large number of DB operations;

Downstream/external service interface call timeout;

Whether there is a lock operation (leading to thread blocking);

Special attention needs to be paid to the logic involving message amplification in the Kafka asynchronous link.

  • If the consumption logic is heavy, adjust the relevant parameters (such as max.poll.interval.ms) so that the consumer is not removed from the group before its messages are fully processed, which would cause problems such as rebalance;

  • Ensure that the consumer side does not hang its consumption because of uncaught exceptions;

  • If you are using a consumer group, make sure rebalance does not occur frequently;

  • Multi-threaded consumption, batch pull processing.

Note: when doing batch pull processing, pay attention to the spring-kafka version. With spring-kafka 2.2.11.RELEASE or below, if you configure kafka.batchListener=true but declare the listener method parameter as a single element (not a batch List), Kafka may only consume the first message of each pulled batch.

d. Avoid rebalance problems

  • Triggering conditions:

1. Changes in the number of consumers: a new consumer joins; a consumer goes offline (fails to send heartbeats in time and is "kicked out" of the group); a consumer voluntarily leaves the group (usually because its processing time is too long);

2. The number of topics or topic partitions subscribed to by the consumer group changes;

3. The GroupCoordinator node corresponding to the consumer group changes.

  • How to avoid unnecessary rebalances (those caused by consumers going offline or voluntarily leaving the group):

1. Carefully set session.timeout.ms (which determines how long a consumer is considered alive) and heartbeat.interval.ms (which controls how often heartbeat requests are sent).

2. max.poll.interval.ms: controls the impact of the consumer's actual processing capacity on rebalance by limiting the maximum interval between two calls to poll() on the consumer side. The default is 5 minutes: if the consumer program cannot finish processing the messages returned by poll() within 5 minutes, it actively sends a "leave group" request and the coordinator starts a new round of rebalance. In practice, measure the historical processing time and use the longest observed value as a reference (a configuration sketch follows).
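A hedged sketch of these timing parameters with the Java client; the numeric values are examples only and should be derived from your measured processing times.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTuningSketch {
    static void applyRebalanceSettings(Properties props) {
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 25000);    // consumer declared dead after 25 s without heartbeats
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 5000);  // keep well below session.timeout.ms (commonly <= 1/3 of it)
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000); // allow up to 10 min between poll() calls for heavy logic
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);        // pull fewer records so each batch finishes within the interval
    }
}
```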

e. Ensure the reliability of message consumption

In practice, messages are lost on the consuming side more often than elsewhere. If consumed data must not be lost, do not use autoCommit; offsets must be committed manually.

With automatic commits, the consumer commits the received offsets at a fixed interval, and the commit is asynchronous with respect to processing. In other words, processing may fail (for example, an exception is thrown) after the offset has already been committed, and at that point the message is lost.
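A minimal manual-commit loop with the Java client (enable.auto.commit=false is assumed); the topic name is an assumption.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    static void consume(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("order-events")); // assumed topic
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // if this throws, the offset is not committed and the message is redelivered
            }
            consumer.commitSync(); // commit only after the whole batch has been processed
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // business logic; should be idempotent, since redelivery can cause duplicates
    }
}
```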

f. Ensure the order of message consumption

1. Different topics (out-of-order messages): if, say, payment and order creation use different topics, ordering can only be handled at the consumer level.

2. The same topic (out-of-order messages): a topic can have multiple partitions, each consumed by different consumers, so there is no essential difference from "different topics". (Think of a service with multiple pods: the producer sends messages in order, but once they are routed to different partitions they may be consumed out of order.)

3. The same topic and the same partition (ordered messages): Kafka messages are strictly ordered within a partition. For example, all messages for the same order are sent, in the order they were generated, to the same partition of the same topic.

For out-of-order messages :

For example: orders and payments each encapsulate their own messages, but the consumer-side business needs to consume them in the order "order message -> payment message".

Wide table (a database table that joins the indicators, dimensions, and attributes related to a business subject): when consuming a message, only the corresponding fields are updated. The data may show temporary status inconsistencies, but the status eventually becomes consistent. For example, orders, payments, and after-sales each have their own status field, so there is no need to keep their messages in order: even out-of-order messages only update their own status fields and do not affect other state;

Message compensation mechanism: Compare the message with the DB. If the data is found to be inconsistent, resend the message to the main process for processing to ensure final consistency;

MQ queue: an intermediary (such as redis queue) to maintain the order of MQ;

Business guarantee: Guarantee consumption order through business logic;

For sequential messages :

Both approaches guarantee order by binding messages to specific partitions or queues, and increase consumption capacity by adding partitions or threads.

1. Consumer single-threaded sequential consumption

The producer has already ensured that messages are ordered within the partition; one partition is consumed by exactly one consumer, which preserves the consumption order.

2. Consumer multi-threaded sequential consumption (concrete strategies appear in later chapters)

Single-threaded sequential consumption scales poorly. To improve consumer processing speed, besides horizontally expanding the number of partitions and adding consumers, you can also use multi-threaded sequential consumption.

Hash the received Kafka data and take the modulo (note: the Kafka partition already received the message via modulo routing, so here you must hash the ID first and then take the modulo) to dispatch it to different in-memory queues, then start multiple threads, each consuming the data of its own queue.

In addition, a configuration center can be used to switch the strategy and dynamically expand or shrink the thread pool.

g. Process Consumer transactions

Transaction messages can guarantee the transactional logic of certain business scenarios, so that states do not become inconsistent across systems due to network unavailability or other reasons.

When any update fails, an exception is thrown and the transaction message is neither committed nor rolled back. The message server then calls back the sender's transaction-query interface to determine the transaction status; based on the message content, the sender can re-execute the unfinished task and report the transaction status back to the message server.

 Cluster configuration best practices

Cluster configuration

Broker sizing: the number of partitions on each broker should not exceed 2,000, and partition size should be kept under control (no more than 25 GB).

Cluster sizing (determine the number of brokers from): data retention time and cluster traffic.

Cluster expansion: keep disk usage below 60% and network usage below 75%.

Cluster monitoring: keep the load balanced, ensure topic partitions are evenly distributed across all brokers, and ensure no cluster node exhausts its disk or bandwidth.

Topic evaluation

1.Partition number:

The number of partitions should be at least equal to the number of consumer threads in the largest consumer group;

For heavily used topics, set more partitions;

Keep partition size under control (about 25 GB);

Consider the application's future growth (an automatic expansion mechanism can be used);

2. Use keyed topics;

3. Partition expansion: when the data volume of a partition exceeds a threshold, expand it automatically (in practice, network traffic should also be considered).

Partition configuration

Setting multiple partitions can improve consumer concurrency to some extent, but too many partitions may cause problems such as excessive file-handle overhead, excessive memory usage on the producer side, increased end-to-end latency, reduced availability, and long failure-recovery times.

Set the number of partitions according to throughput requirements:

1. Assume the throughput of a single partition on the producer side is P;

2. the throughput of a consumer consuming a single partition is C;

3. and the required throughput is T.

4. Then the number of partitions should be at least max(T/P, T/C) (see the worked example below).
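A worked example with assumed numbers: if T = 100 MB/s, a single partition can absorb P = 20 MB/s from producers, and a consumer can drain C = 50 MB/s per partition, then T/P = 5 and T/C = 2, so at least max(5, 2) = 5 partitions are needed.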

Performance tuning

Tuning goals: high throughput, low latency.

Hierarchical tuning

From top to bottom, it is divided into application layer, framework layer, JVM layer and operating system layer. The higher the layer, the more obvious the tuning effect.

Tuning type → Suggestions

Operating system: disable atime updates when mounting the file system; choose the ext4 or XFS file system; tune the swap space; tune the page cache size.

JVM (heap settings and GC collector): set the JVM heap size to 6~8 GB; the G1 collector is recommended because it is convenient, trouble-free, and easier to tune than the CMS collector.

Broker side: keep server and client versions consistent.

Application layer: do not create Producer and Consumer instances frequently; close them promptly when they are no longer needed; use multi-threading rationally to improve performance.

Throughput (TPS) tuning

Parameter list

Broker side:

  • Increase num.replica.fetchers appropriately, but do not exceed the number of CPU cores;

  • Tune GC parameters to avoid frequent Full GC.

Producer side:

  • Increase batch.size appropriately, for example from the default 16 KB to 512 KB or 1 MB;

  • Increase linger.ms appropriately, for example to 10~100;

  • Set compression.type=lz4 or zstd;

  • Set acks=0 or 1;

  • Set retries=0;

  • If multiple threads share the same Producer instance, increase buffer.memory.

Consumer side:

  • Use multiple Consumer processes or threads to consume data in parallel;

  • Increase fetch.min.bytes, for example to 1 KB or larger.

Latency tuning

Parameter list

Broker side:

  • Set num.replica.fetchers appropriately.

Producer side:

  • Set linger.ms=0;

  • Disable compression: set compression.type=none;

  • Set acks=1.

Consumer side:

  • Set fetch.min.bytes=1.

Stability test

The stability test of Kafka mainly tests the health and high availability of Kafka instances/clusters before the business goes online.

Health check

1. Check the instance: inspect the Kafka instance object and obtain its information (such as IP, port, etc.);

2. Test availability: access producers and consumers and test the connections.

High availability testing

Single-node exception test: restart the pod hosting a leader replica or a follower replica

Steps:

1. View the replica information of the topic

2. Delete the corresponding pod

3. Run a script to check Kafka availability

Expectation: no impact on availability for either producers or consumers.

Cluster exception test: restart all pods

Steps:

1. Delete all pods

2. Run a script to check Kafka availability

Expectation: service returns to normal after all brokers are ready.

Runtime monitoring

Runtime monitoring mainly includes best practices for cluster stability configuration and Kafka monitoring, aiming to promptly discover related problems and exceptions generated by Kafka during runtime.

Cluster stability monitoring

Tencent Cloud CKafka cluster configuration

To configure a Kafka instance properly, focus on the following data:

  1. Disk capacity and peak bandwidth;

  2. Message retention time;

  3. Dynamic retention policy.

a. Disk capacity and peak bandwidth

These can be estimated from the actual message size, message-sending QPS, and similar business figures; set them as large as is practical. The exact values can be checked against instance monitoring; if disk usage reaches a high percentage within a short period, expansion is required.

Peak bandwidth = maximum production traffic * number of replicas

b. Message retention time

Even after a message is consumed, it is persisted for as long as the retention period, which occupies disk space. If the daily message volume is large, shorten the retention time appropriately.

c. Dynamic retention policy

It is recommended to enable dynamic retention: when disk capacity reaches the threshold, the oldest messages are deleted first, and at most the messages outside the guaranteed retention period are removed (an eviction strategy). This largely prevents the disk from becoming full.

However, there is no active notification when such deletion happens, so configure alarms to detect changes in disk capacity.

Self-built Kafka cluster configuration

1. Set log configuration parameters to make logs easy to manage;

2. Understand the (low) hardware requirements of kafka;

3. Make full use of Apache ZooKeeper;

4. Set up replication and redundancy the right way;

5. Pay attention to topic configuration;

6. Use parallel processing;

7. Configure and isolate Kafka with security in mind;

8. Avoid downtime by raising limits;

9. Keep network latency low;

10. Leverage effective monitoring and alerting.

Resource isolation

a. Broker-level physical isolation

If topics from different business lines share the same disk and one consumer has a problem and its consumption lags, the resulting frequent disk reads will affect writes to the topic partitions of other business lines on the same disk.

Solution: broker-level physical isolation, covering topic creation, topic migration, and the downtime-recovery process.

b. RPC queue isolation

Kafka's RPC queue lacks isolation: once one topic is processed slowly, all requests hang.

Solution: separate control flow from data flow, and isolate data flow by topic.

1. Split the call queue into multiple queues and allocate a thread pool to each queue.

2. One queue handles controller requests alone (isolating control flow), and the remaining queues are selected by hashing the topic (isolating data flows).

If one topic has a problem, only one RPC processing thread pool and one call queue are blocked, keeping the other processing paths clear.

Intelligent speed limit

The rate-limiting logic runs at the end of RPC worker-thread processing: once an RPC has been processed, the rate-limit control module performs a check.

1. If the request should be limited, set a waiting time and put it into the delayed queue; otherwise put it into the response queue.

2. Requests in the delayed queue are moved to the response queue by a delay thread once the waiting time elapses.

3. Finally, the requests in the response queue are returned to the consumer.

Kafka monitoring

White-box monitoring: the service's or system's own metrics, such as CPU load, stack information, number of connections, etc.;

Black-box monitoring: generally monitors externally visible system functions by simulating an external user; related metrics include performance and availability metrics such as message latency, error rate, and duplication rate.

Monitoring type / Feature / Details

Black-box monitoring

  • Operations: topic operations such as create, preview, view, update, delete

  • Service: whether data can be written and consumed successfully

  • System: CPU load, stack information, number of connections, etc.

White-box monitoring

  • Capacity: total storage space, used storage space, maximum partition usage, cluster resources, number of partitions, number of topics

  • Traffic: message write and consumption rates, cluster inbound and outbound network traffic

  • Latency: message write and consumption latency (average, 99th percentile, maximum), topic consumption lag (offset lag)

  • Errors: number of abnormal nodes in the cluster, number of rejected message writes, number of failed message consumptions, and errors related to the ZooKeeper dependency

Tencent Cloud CKafka alarm

For CKafka, alarms need to be configured (such alarms are generally for message backlog, availability, cluster/machine health, etc.).

a. Indicators

For example: instance health status, number of nodes, number of healthy nodes, number of problem partitions, number of production messages, number of consumption requests, JVM memory utilization, average production response time, partition consumption offset, etc.

For specific indicators, please refer to: https://cloud.tencent.com/document/product/597/54514

b. Configuration

Configuration document: https://cloud.tencent.com/document/product/597/57244

Select a monitoring instance and configure alarm content and thresholds.

Generally, alarms are configured only for the Kafka cluster used by the service itself. However, if a downstream service that depends on our messages has a consumption problem, we will not notice it; and if the consuming service is not on the same cluster, problems such as duplicate message sending are also hard to discover from within the service itself.

c. Plan

Before the business goes online, it is best to sort out the topic messages involved in your own service (upstream producers and downstream consumers) and refine the alarm configuration, so that an upstream Kafka exception or a downstream message backlog can be detected in time. In particular, prepare alarms or plans for scenarios that may produce a large burst of messages (such as batch data import or scheduled full data synchronization) to avoid service unavailability or impact on normal business messages.

Self-built alarm platform

Use the self-built alarm platform to configure alarms for exceptions in the service itself, including business exceptions thrown by the framework when using the Kafka component and exceptions thrown inside the Kafka consumption logic.

Among them, situations that may require exception escalation should be handled separately (for Spring Kafka):

1. Custom Kafka exception handler: implement the KafkaListenerErrorHandler interface, register a custom exception listener, and distinguish and rethrow business exceptions;

2. When consuming Kafka messages, set the errorHandler attribute of @KafkaListener to the defined Kafka exception handler;

3. After that, the specified business exception is thrown directly instead of being wrapped in a Spring Kafka framework exception, which would otherwise hide the concrete exception information (a sketch follows).
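A hedged Spring Kafka sketch of this pattern; the bean names, topic, group, and the BizException type are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.KafkaListenerErrorHandler;
import org.springframework.kafka.listener.ListenerExecutionFailedException;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Component;

@Component("kafkaBizErrorHandler") // bean name referenced from @KafkaListener below
public class BizKafkaErrorHandler implements KafkaListenerErrorHandler {

    @Override
    public Object handleError(Message<?> message, ListenerExecutionFailedException exception) {
        Throwable cause = exception.getCause();
        if (cause instanceof BizException) { // hypothetical business exception type
            throw (BizException) cause;      // rethrow the real business exception so the alarm platform sees it
        }
        throw exception;                     // anything else keeps the framework behavior
    }
}

@Component
class OrderListener {
    // topic and group are assumptions
    @KafkaListener(topics = "order-events", groupId = "order-service",
                   errorHandler = "kafkaBizErrorHandler")
    public void onMessage(ConsumerRecord<String, String> record) {
        // consumption logic; a BizException thrown here is unwrapped by the handler above
    }
}

class BizException extends RuntimeException { // hypothetical
    BizException(String msg) { super(msg); }
}
```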

Kafka monitoring component

At present, there is no recognized solution in the industry, and each company has its own monitoring method.

Kafka Manager: probably the best-known dedicated Kafka monitoring framework; it is an independent monitoring system.

Kafka Monitor: LinkedIn's open-source framework; it supports system testing of clusters and real-time monitoring of the test results.

Cruise Control: also open-sourced by LinkedIn; it monitors resource usage in real time and provides common operations and maintenance actions. It has no UI and only exposes a REST API.

JMX monitoring: since Kafka's monitoring metrics are all exposed via JMX, any framework that can integrate JMX, such as Zabbix or Prometheus, can be used. Some big-data platforms come with their own monitoring systems: platforms such as Cloudera's CDH naturally provide Kafka monitoring solutions.

JMXTool: a command-line tool provided by the community that can monitor JMX metrics in real time; it is not widely known. To learn its usage, run kafka-run-class.sh kafka.tools.JmxTool without arguments on the command line.

Kafka Monitor

Kafka Monitor simulates client behavior by producing and consuming data, and collects performance and availability metrics such as message latency, error rate, and duplication rate. It reveals the downstream consumption situation well and can be used to adjust message sending dynamically. (During use, pay attention to controlling sample coverage, function coverage, traffic, data isolation, and latency.)

Kafka Monitor advantages:

1. It ensures monitoring covers all partitions by starting a separate production task for each partition.

2. The produced messages contain timestamps and sequence numbers, which Kafka Monitor uses to compute message latency, loss rate, and duplication rate.

3. Traffic is controlled by setting the frequency of message production.

4. The produced messages are serialized to a configurable size (to verify the ability to process data of different sizes and to compare performance at the same message size).

5. By using a dedicated topic and producer ID to operate on the Kafka cluster, online data is not polluted and a certain degree of data isolation is achieved.

Based on Kafka Monitor's design ideas, performance monitoring and alarm metrics such as message latency, error rate, and duplication rate can be introduced according to business characteristics.
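Following the same idea (this is not the Kafka Monitor implementation), a minimal probe sketch: the producer embeds a sequence number and a send timestamp, and the consumer derives latency and loss from them; the topic name, value format, and single-partition assumption are all illustrative.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MonitorProbeSketch {
    private static final String PROBE_TOPIC = "kafka-monitor-probe"; // assumed dedicated probe topic
    private final AtomicLong sequence = new AtomicLong();
    private long expectedSeq = 0;

    void produceProbe(KafkaProducer<String, String> producer) {
        // value = "<sequence>:<send timestamp in ms>"
        String value = sequence.getAndIncrement() + ":" + System.currentTimeMillis();
        producer.send(new ProducerRecord<>(PROBE_TOPIC, value));
    }

    void consumeProbe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList(PROBE_TOPIC));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                String[] parts = record.value().split(":");
                long seq = Long.parseLong(parts[0]);
                long latencyMs = System.currentTimeMillis() - Long.parseLong(parts[1]);
                long lost = Math.max(seq - expectedSeq, 0); // a gap in sequence numbers implies lost probes (single partition assumed)
                expectedSeq = seq + 1;
                report(latencyMs, lost);
            }
        }
    }

    void report(long latencyMs, long lostCount) {
        // push the probe metrics to the monitoring/alarm platform
    }
}
```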

Troubleshooting

Nip problems in the bud, and have a complete emergency plan for problems and faults so they can be located and resolved quickly.

Kafka message accumulation emergency plan

Problem description: messages back up on the consumer side, so services that depend on them cannot perceive business changes in time, delaying some business logic and data processing and easily causing business blocking and data-consistency problems.

Plan: troubleshooting; capacity expansion and configuration upgrade strategy; topic transfer strategy; configurable multi-threaded consumption strategy.

Troubleshooting

When encountering a backlog of messages, you can locate the cause of the problem from the following perspectives:

1. Whether there is a sudden increase in data volume on the producer side.

2. Whether the processing capacity of the consumers has declined.

3. Whether the backlog occurs only in some partitions or in all partitions.

For backlogs caused by points 1 and 2 (temporary backlogs): increasing the consumption speed through partition expansion, machine and configuration upgrades, multi-threaded consumption, batch consumption, and so on can solve the problem to a certain extent.

For backlogs caused by point 3: use the topic transfer strategy.

Capacity expansion and configuration upgrade strategy

1. Check the production and sending status on the producer side (mainly whether messages are still being produced, whether there are logical defects, and whether duplicate messages are being sent);

2. Observe the consumption situation on the consumer side (estimate how fast the backlog is being cleared and whether it is trending down);

3. If it is a producer-side problem, evaluate whether it can be solved by increasing the number of partitions, adjusting the offset, deleting the topic (the impact must be evaluated), etc.;

4. Add machines and dependent resources on the consumer side to improve consumption capacity;

5. If data-consistency issues are involved, verify them through data comparison, reconciliation, and similar checks.

Configure multi-threaded consumption strategy

In short, thread-pool consumption plus a dynamic thread-pool configuration strategy: hash the received Kafka data and take the modulo (the Kafka partition already received the message via modulo routing, so you must hash the ID first and then take the modulo) to dispatch it to different queues, then start multiple threads to consume the data in the corresponding queues.

Design ideas:

1. When the application starts, initialize the sequential-consumption thread pool for the corresponding business (in the demo, the order-consumption thread pool);

2. The order listener class pulls messages and submits tasks to the corresponding queues in the thread pool;

3. The thread-pool threads process the task data in their bound queues;

4. After each thread completes a task, it increments the count of offsets to be committed;

5. In the listener class, verify that the number of offsets to be committed equals the number of records pulled; if so:

6. Commit the offset manually (turn off Kafka's automatic commit and commit only after all tasks pulled in this batch have been processed). A sketch follows this list.
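A hedged Spring Kafka sketch of this design; the topic, group, container factory name, and concurrency value are assumptions, and a batch listener container with AckMode.MANUAL and enable.auto.commit=false is assumed to be configured elsewhere.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class OrderKafkaListenerSketch {
    private static final int CONCURRENCY = 4; // could be refreshed from a config center for dynamic scaling
    private final ExecutorService[] workers = new ExecutorService[CONCURRENCY];

    public OrderKafkaListenerSketch() {
        // one single-threaded executor per slot, so records with the same key stay in order
        for (int i = 0; i < CONCURRENCY; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    @KafkaListener(topics = "order-events", groupId = "order-service", containerFactory = "batchFactory")
    public void onBatch(List<ConsumerRecord<String, String>> records, Acknowledgment ack) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(records.size());
        for (ConsumerRecord<String, String> record : records) {
            // hash the business ID (the message key), then take the modulo to pick a worker queue
            int slot = Math.floorMod(record.key().hashCode(), CONCURRENCY);
            workers[slot].execute(() -> {
                try {
                    process(record);
                } finally {
                    done.countDown(); // count this record as handled
                }
            });
        }
        done.await();       // wait until every record pulled in this batch has been processed
        ack.acknowledge();  // only now commit the offset manually
    }

    private void process(ConsumerRecord<String, String> record) {
        // business logic for one order message
    }
}
```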

In addition, the thread and pod configuration can be adjusted according to business traffic: for example, set a relatively high concurrency level during peak hours to process messages quickly, and a lower one during off-peak hours to free system resources. Here you can refer to Meituan's practice of using a configuration center to modify configuration and set thread-pool parameters dynamically, achieving dynamic expansion or contraction.

Realizes dynamic expansion and contraction :

1. Refresh the concurrency value in the OrderKafkaListener listening class through the configuration center.

2. When the concurrency value is modified through the setter, first set the stopped flag to stop the currently running thread pool.

3. After the in-flight tasks finish, create a new thread pool with the new concurrency level, achieving dynamic expansion and contraction.

In addition, a switch can be added: when it is set to true, the thread pool is interrupted during startup, so the feature can be switched off when a failure occurs.

Note: If data consistency issues are involved, verification needs to be performed through data comparison, reconciliation and other functions.

Topic transfer strategy

When the backlog occurs in all partitions, the only option is temporary expansion to consume the data faster.

Design ideas:

1. Temporarily create a new topic with 10 or 20 times the original number of partitions;

2. Then write a temporary consumer program that distributes messages: deploy it to consume the backlogged messages, do no time-consuming processing, and directly write the messages round-robin into the temporarily created 10x queues;

3. Then requisition 10 times the original number of machines to deploy consumers, each group of consumers consuming one temporary queue;

4. This approach temporarily expands the queue resources and consumer resources by 10 times and consumes messages at 10 times the normal speed.

5. After the backlog has been consumed, restore the original deployment structure and use the original consumer machines to consume messages again.

Improvements :

1. The consumer program can be written in the service;

2. Specify a "plan topic" and write the "plan topic" in advance in the service;

3. Use the strategy pattern to convert "business topic" -> "plan topic".

Note :

1. If data consistency issues are involved, verification needs to be done through data comparison, reconciliation and other functions;

2. You need a separate topic-transfer service, or you must modify the service code or write the multi-threaded logic in advance.

Kafka consumption exception causes consumption blocking

Problem description: an exception when consuming a certain message, or a time-consuming operation, reduces the consumption capacity of a single pod or even blocks it.

Solution: Set offset; switch multi-threaded consumption strategy.

Set offset

1. Adjust the offset: contact operations and maintenance to move the offset by one position, skipping the problematic message;

2. Message re-push: re-push the skipped messages or the data within the affected time range;
3. If data-consistency issues are involved, verify them through data comparison, reconciliation, and similar checks.

Switch multi-threaded consumption strategy

Refer to the "Configurable Multi-threaded Consumption Strategy" above to turn on the multi-threaded consumption switch when blocking occurs.

Note: You need to modify the code or write the multi-threading logic in advance

Kafka message loss plan

Problem description: The service did not consume Kafka messages as expected, causing business problems.

Plan: root cause analysis; message re-push.

Root cause analysis

1. Whether the producer side successfully sent the message (loss at the source)

The broker loses messages: to obtain higher performance and throughput, Kafka stores data on disk asynchronously in batches, and asynchronous flushing may cause source data to be lost;

The producer loses messages: a bug in the sending logic prevents the message from being sent successfully.

Solution: check the health of the producer side and the cluster; re-push the messages.

2. Whether it was successfully consumed

With automatic commits, the consumer commits received offsets at a fixed interval, and the commit is asynchronous with respect to processing. In other words, processing may fail (for example, an exception is thrown) after the offset has already been committed.

In addition, a bug in the consumption logic can also create the illusion of message loss.

Solution: Fix the problem and modify the consumption confirmation mechanism as appropriate.

3. Whether other services share the same consumer group

If multiple services mistakenly use the same consumer group, each consumes only part of the messages, which looks like messages being lost at a certain rate or pattern.

For example, for the Kafka message sent when a user is created, the price center and the promotion service once misused the same consumer group, so each service consumed only part of the messages and problems appeared occasionally.

Solution: modify the configuration, restart the services, and create separate consumer groups; it is also necessary to check in advance whether multiple services share the same consumer group (detection plus comparison).

Message re-push

1. Determine the affected data based on the business impact;

2. Construct kafka messages and perform message compensation;

3. If data consistency issues are involved, verification needs to be performed through data comparison, reconciliation and other functions.

For every service that sends messages externally, the producer side generally needs a relatively complete message re-push interface, and the consumer side needs to guarantee idempotent message consumption.

other

Kafka cost control

Machines, storage and networks

Machine

You need to re-evaluate your instance-type decisions: is your cluster saturated? Under what circumstances? Are there instance types that are a better fit than the one you chose when the cluster was first created? Do EBS-optimized instances with GP2/3 or IO2 volumes really give better price/performance than i3 or i3en machines (and the advantages those bring)?

Storage and Networking

Compression is not new in Kafka, and most users already know they can choose between GZIP, Snappy, and LZ4. But since KIP-110 was merged into Kafka and the Zstandard compression codec was added, significant performance improvements have become available, making it an excellent way to reduce network costs.

At the cost of slightly higher CPU usage on the producer side, you get a higher compression ratio and "squeeze" more information onto the wire.

Amplitude said in their post that after switching to Zstandard their bandwidth usage dropped by two-thirds, saving tens of thousands of dollars per month in data transfer costs on the processing pipeline alone.

Cluster

An unbalanced cluster can harm cluster performance: some brokers become more heavily loaded than others, response latencies rise, and in some cases the resources of those brokers become saturated, leading to unnecessary expansion, which in turn raises the cluster cost.

In addition, an unbalanced cluster risks a longer MTTR after a broker failure (for example, when that broker unnecessarily holds more partitions) and a higher risk of data loss (imagine a topic with replication factor 2 where one of the nodes struggles to start because it has too many segments to load at startup).

Idempotence of message consumption

Definition:

The mathematical notion of idempotence is f(f(x)) = f(x), where the function f represents the processing of a message. In plain terms, when a consumer receives a duplicate message and processes it again, it must still guarantee that the final result is consistent.

For example, bank transfer, order placement, etc., no matter how many times you try again, you must ensure that the final result is consistent.

Leverage database unique constraints

Combine multiple fields in the database into a unique constraint, which guarantees that at most one record exists in the table even after repeated operations (such as creating an order, creating a bill, creating a flow record, etc.).

In addition, any storage system that supports "INSERT IF NOT EXIST" semantics (such as Redis's SETNX) can be used to implement idempotent consumption.

Set preconditions

1. Set a precondition for the data change (a version number or updateTime);

2. Update the data if the precondition is met, otherwise refuse the update;

3. When updating the data, change the precondition at the same time (version number + 1, refresh updateTime). A sketch follows this list.
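A minimal optimistic-lock sketch of this precondition pattern using plain JDBC; the table and column names ("orders", "status", "version") are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class VersionedUpdateSketch {
    // Returns true only if the expected version matched; the version column is the precondition.
    static boolean updateOrderStatus(Connection conn, long orderId, String newStatus, int expectedVersion)
            throws SQLException {
        String sql = "UPDATE orders SET status = ?, version = version + 1 " // bump the precondition together with the data
                   + "WHERE id = ? AND version = ?";                        // refuse the update if the version has moved on
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, newStatus);
            ps.setLong(2, orderId);
            ps.setInt(3, expectedVersion);
            return ps.executeUpdate() == 1; // 0 rows updated means the message was stale or a duplicate
        }
    }
}
```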

Record and review actions

1. Record a globally unique ID for each message;

2. When consuming, first check whether this message has been consumed based on this globally unique ID;

3. If it has not been consumed, update the data and set the consumption status to "consumed".

Among these, "check the consumption status, then update the data and set the consumption status" must be executed as one atomic group of operations to guarantee atomicity.
