Why time-series data can become out of order after flowing through Kafka, and how to prevent it

I am currently doing data migration work, and I ran into a problem during the migration: data kept being lost for no apparent reason, and my logs showed no errors or exceptions. After investigation, the setup turned out to be Flink consuming from Kafka. I was processing the data by event time, which brings in the concept of watermarks: the data in Kafka was heavily out of order, badly enough that records arriving after the watermark had advanced were treated as late data and silently dropped, which explains the "lost" data. Writing the data into Kafka is not my job, but in the spirit of seeking knowledge I still wanted to get to the bottom of one question: what causes the data in Kafka to become out of order?

Introduction to Kafka: as a popular message queue, Kafka is widely used in a variety of scenarios thanks to its distributed architecture, high performance, and high reliability.

In a real deployment, however, data passing through Kafka may arrive out of order at the receiver because of configuration choices. This pushes sorting and similar work onto the downstream processing stages, causing unnecessary overhead and reducing overall processing performance.

In fact, with proper planning of Kafka's configuration and usage, you can prevent messages from becoming out of order after passing through Kafka. You only need to follow one principle: send every message stream that must stay in order to the same partition.

Kafka allows a topic to have multiple partitions to achieve distribution and load balancing; the partitions are consumed by different consumers within the same consumer group. This mechanism supports multi-threaded, distributed processing and brings high performance, but it also opens up the possibility that messages of the same stream take different paths. Without targeted planning, the architecture therefore cannot guarantee message order. As the figure below shows, when a message stream of the same topic is written to different partitions, multiple paths are created.

[Figure: a message stream of one topic written to multiple partitions travels along multiple paths]

To guarantee that the data of one message stream is consumed in strict chronological order, it must follow a single path; only then is FIFO (First In, First Out) preserved end to end.
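As a minimal sketch of this principle (the broker address, topic name, and key are assumptions for illustration): records sent with the same key are routed by the default partitioner to the same partition, which can be verified from the RecordMetadata returned by send().

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SameKeySamePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records share the key "device-42", so the default partitioner
            // hashes them to the same partition and their order is preserved.
            for (int i = 0; i < 5; i++) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("orders", "device-42", "event-" + i))
                        .get(); // block to read the assigned partition
                System.out.println("event-" + i + " -> partition " + meta.partition());
            }
        }
    }
}

Running this prints the same partition number for every record, confirming the single path.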

According to the Kafka documentation, which record is sent to which partition is the responsibility of the producer:

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

It can be seen that Kafka has already considered the need to preserve message order and provides an interface for sending records to the same partition according to a specified key. The relevant logic can be found in Kafka's source code:
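The heart of the default partitioner (a simplified sketch of the logic in org.apache.kafka.clients.producer.internals.DefaultPartitioner, not a verbatim copy) is to hash the serialized key and take the result modulo the number of partitions, so equal keys always map to the same partition:

import org.apache.kafka.common.utils.Utils;

// Simplified sketch of the default partitioning logic for a non-null key:
// hash the serialized key with murmur2 and map it onto one of the
// partitions, so records with equal keys always land on the same partition.
static int partitionFor(byte[] keyBytes, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}

Records with a null key, by contrast, are spread across partitions (round-robin in older clients, sticky batching in newer ones), so their relative order is not preserved.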

Concrete implementation

The implementation is very simple: when the producer sends data, choose a key, build a message with it, and send. In the legacy (Scala) producer API this is done through KeyedMessage; the modern Java client uses ProducerRecord with a key, as in the demo further below. Taking Java as an example, other languages offer interfaces with the same functionality in the Kafka documentation:

producer.send(new KeyedMessage<String, String>(topic, key, record));

This interface makes it very easy to bind each message flow to one partition without extra code. Users can also implement a partitioning algorithm themselves for more precise control over partition assignment, as sketched below; for details, see the article "kafka designated partition production and consumption".
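For example, a custom partitioner implements the org.apache.kafka.clients.producer.Partitioner interface and is registered via the partitioner.class producer property. The sketch below is illustrative (the class name and the null-key fallback are my own choices, not from the original article):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// A hypothetical custom partitioner: routes records by the murmur2 hash
// of their serialized key, falling back to partition 0 for null keys.
public class StreamKeyPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        if (keyBytes == null) {
            return 0; // keyless records all go to one partition
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

It is enabled on the producer with properties.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, StreamKeyPartitioner.class);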

Attached is a Kafka producer demo, implemented in Java:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        // 1. Create the configuration object describing the producer
        Properties properties = new Properties();
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "node1:9092,node2:9092,node3:9092");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class); // serializer for the record key
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class); // serializer for the record value

        // 2. Create the producer
        KafkaProducer<Integer, String> producer = new KafkaProducer<Integer, String>(properties);

        // 3. Publish a message: topic "t1", key 1, value "Hello World";
        //    records with the same key always go to the same partition
        ProducerRecord<Integer, String> record = new ProducerRecord<Integer, String>("t1", 1, "Hello World");
        producer.send(record);

        // 4. Flush buffered records and release resources
        producer.flush();
        producer.close();
    }
}
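To complete the picture, here is a minimal consumer sketch (assuming the same topic "t1"; the group id "demo-group" is a hypothetical choice): within a consumer group each partition is read by one consumer, and records inside a single partition are delivered in the order they were written.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaConsumerDemo {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "node1:9092,node2:9092,node3:9092");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group"); // hypothetical group id
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class);
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<Integer, String> consumer = new KafkaConsumer<>(properties)) {
            consumer.subscribe(Collections.singletonList("t1"));
            while (true) {
                // Records from any single partition arrive in write order
                ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<Integer, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}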

Origin blog.csdn.net/qq_44962429/article/details/106053799