How to guarantee message consumption order across partitions for a Kafka topic with multiple partitions

Strictly speaking, the problem is real: Kafka guarantees ordering only within a single partition, not across partitions.

 

The following passage comes from a blog post by Jay Kreps, one of Kafka's creators, introducing the design ideas behind Kafka.

Each partition is a totally ordered log, but there is no global ordering between partitions (other than perhaps some wall-clock time you might include in your messages). The assignment of the messages to a particular partition is controllable by the writer, with most users choosing to partition by some kind of key (e.g. user id). Partitioning allows log appends to occur without co-ordination between shards and allows the throughput of the system to scale linearly with the Kafka cluster size.

 

For scenarios where only some messages need ordering (for example, messages with the same key must be consumed in order), the producer can enforce this when it writes to Kafka: all messages with the same key are sent to the same partition.

 

Kafka's default partitioner supports this approach. Its source code is as follows (this copy includes println calls, apparently added for debugging):

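private[kafka] class DefaultPartitioner[T] extends Partitioner[T] {
  private val random = new java.util.Random

  def partition(key: T, numPartitions: Int): Int = {
    if (key == null) {
      // No key: pick a random partition, so no ordering is guaranteed.
      println("key is null")
      random.nextInt(numPartitions)
    } else {
      // Same key => same hash code => same partition.
      // Caveat: math.abs(Int.MinValue) is still negative, so a key with that
      // hash code can yield a negative index.
      println("key is " + key + " hashcode is " + key.hashCode)
      math.abs(key.hashCode) % numPartitions
    }
  }
}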
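The key-to-partition mapping can be tried out in isolation. The following is a minimal sketch, not Kafka code: `KeyPartitioning` and `partitionFor` are hypothetical names, and masking the sign bit is a swapped-in alternative to `math.abs` that avoids a negative index when the hash code is `Int.MinValue`. It maps keys with negative hash codes differently from the partitioner above, so it is not a drop-in replacement.

```scala
object KeyPartitioning {
  // Hypothetical helper, not part of Kafka: clear the sign bit instead of
  // calling math.abs, so the result is always in [0, numPartitions).
  // Assumes key is non-null.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    (key.hashCode & 0x7fffffff) % numPartitions
  }

  def main(args: Array[String]): Unit = {
    // Every message for the same user id maps to the same partition.
    for (userId <- Seq("U1", "U2", "U1", "U3", "U1"))
      println(s"$userId -> partition ${partitionFor(userId, 4)}")

    // Edge case: math.abs(Int.MinValue) is still Int.MinValue.
    println(s"math.abs variant: ${math.abs(Int.MinValue) % 3}")
    // The Int key Int.MinValue hashes to its own value.
    println(s"masked variant:   ${partitionFor(Int.MinValue, 3)}")
  }
}
```

Because the partition depends only on the key's hash code and the partition count, ordering per key holds only as long as the number of partitions does not change.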

 

 

In kafka-storm, if a single partition is consumed by a single consumer instance, the ordering problem does not arise, but parallelism is lost.

With N1 partitions and N2 consumer instances:

1) N1 < N2: some consumer instances sit idle, which wastes resources.

2) N1 > N2 (N2 > 1): each kafka-spout instance consumes a fixed set of one or more partitions, and a message is not consumed again by a different consumer.

3) N1 = N2: in practice, each consumer instance was observed to consume exactly one partition. Within a consumer group, a partition can be consumed by only one consumer instance at a time; otherwise coordination such as locking would be required. This restriction keeps consumption control simple.
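The three cases can be illustrated with a small simulation. This is only a sketch under an assumed round-robin rule (partition p goes to consumer p % N2); `PartitionAssignment` and `assign` are hypothetical names, and the actual assignment strategy used by kafka-spout or by Kafka consumer groups may differ.

```scala
object PartitionAssignment {
  // Hypothetical round-robin rule: partition p is owned by consumer p % numConsumers.
  // Returns, for each consumer index, the partitions it owns (possibly none).
  def assign(numPartitions: Int, numConsumers: Int): Map[Int, Seq[Int]] = {
    require(numPartitions >= 0 && numConsumers > 0)
    (0 until numConsumers).map { c =>
      c -> (0 until numPartitions).filter(_ % numConsumers == c)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // The three cases: N1 < N2, N1 > N2, N1 = N2.
    for ((n1, n2) <- Seq((2, 4), (5, 2), (3, 3))) {
      println(s"N1=$n1 partitions, N2=$n2 consumers")
      assign(n1, n2).toSeq.sortBy(_._1).foreach { case (c, ps) =>
        val owned = if (ps.isEmpty) "idle" else ps.mkString(", ")
        println(s"  consumer $c: $owned")
      }
    }
  }
}
```

In every case each partition has exactly one owner, which is what preserves per-partition ordering on the consuming side.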

 

 

Specific application scenarios:

Calculate how long a user stays at a given location. Each log record can be abstracted as (user ID, timestamp, location).
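As a rough illustration of the calculation, here is a minimal sketch. The `Event` and `Dwell` shapes, the millisecond timestamps, and the rule "dwell time is the gap between arriving at a location and the first later event at a different location" are all assumptions made for the example, not the original system's definitions.

```scala
object DwellTime {
  // Hypothetical record shapes: user id, epoch-millisecond timestamp, location.
  final case class Event(userId: String, timeMs: Long, location: String)
  final case class Dwell(userId: String, location: String, durationMs: Long)

  // Expects the events of ONE user in the order they were consumed.
  // A dwell record is emitted each time the location changes.
  def dwellTimes(events: Seq[Event]): Seq[Dwell] = {
    // State: (arrival event at the current location, dwell records so far).
    val (_, result) =
      events.foldLeft((Option.empty[Event], Vector.empty[Dwell])) {
        case ((None, acc), e) => (Some(e), acc)
        case ((Some(arrival), acc), e) if e.location == arrival.location =>
          (Some(arrival), acc) // still at the same location
        case ((Some(arrival), acc), e) =>
          (Some(e), acc :+ Dwell(arrival.userId, arrival.location, e.timeMs - arrival.timeMs))
      }
    result
  }

  def main(args: Array[String]): Unit = {
    val inOrder = Seq(Event("U1", 1000L, "A1"), Event("U1", 4000L, "A1"), Event("U1", 9000L, "A2"))
    println(dwellTimes(inOrder))
    // Consuming the same user's events out of order gives a negative duration.
    println(dwellTimes(Seq(Event("U1", 9000L, "A2"), Event("U1", 1000L, "A1"))))
  }
}
```

The second call shows why per-user ordering matters here: the arithmetic is only meaningful if one user's events are consumed in timestamp order.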

Application system -> log file SFTP server -> data collection layer -> Kafka -> Storm real-time data cleaning and processing layer -> Redis, HBase -> scheduled tasks, MapReduce

During integration testing no real logs were available, so the collection layer simulated inserting data into Kafka (the sending frequency in particular was simulated very roughly). The real-time processing layer then computed negative dwell times for users at some locations, for the following reasons:

 

1) The collection-layer simulation was unrealistic: the timestamps of a given user's messages inserted into Kafka were randomly generated. Still, it is worth checking whether the real log file SFTP server or collection layer can produce the same out-of-order data. If so, the problem can be handled at the business level by filtering out the invalid records.

 

2) When tuple processing fails in Storm, the tuple is replayed: kafka-storm rewinds the offset to the position of the failure, so every message after that offset is consumed again. The user's latest location, however, may already have been cached in Redis (the latest location is kept in Redis to reduce the number of HBase accesses). A replayed older message is then compared with a newer cached record, which yields a negative dwell time. Such records can be filtered out and not stored in Redis.

 

  Real data: U1 T1 A1->U1 T2 A2

 

  Failure and replay: U1 T1 A1 -> U1 T2 A2 -> both fail and are replayed -> U1 T1 A1 (negative dwell time) -> U1 T2 A2

 

Because failed tuples are replayed, the processing semantics are at-least-once. With exactly-once semantics this situation would not occur.
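The filtering mentioned in point 2 could look roughly like the sketch below. It is an illustration under assumptions: `LocationEvent` and `ReplayGuard` are hypothetical names, an in-memory map stands in for the Redis cache, and dwell time is simplified to the gap between two consecutive accepted events of the same user. Any event that is not newer than the cached one is treated as a replay and skipped, which also discards genuinely late data.

```scala
import scala.collection.mutable

// Hypothetical record shape: user id, epoch-millisecond timestamp, location.
final case class LocationEvent(userId: String, timeMs: Long, location: String)

final class ReplayGuard {
  // In-memory stand-in for the Redis cache of each user's latest event.
  private val latest = mutable.Map.empty[String, LocationEvent]

  // Returns the time since the previous accepted event, or None when the event
  // is the user's first one or is not newer than the cached one (a replay).
  def process(e: LocationEvent): Option[Long] =
    latest.get(e.userId) match {
      case Some(prev) if e.timeMs <= prev.timeMs =>
        None // replayed or stale: skip it and leave the cache untouched
      case Some(prev) =>
        latest(e.userId) = e
        Some(e.timeMs - prev.timeMs)
      case None =>
        latest(e.userId) = e
        None
    }
}

object ReplayGuardDemo {
  def main(args: Array[String]): Unit = {
    val guard = new ReplayGuard
    val t1a1 = LocationEvent("U1", 1000L, "A1")
    val t2a2 = LocationEvent("U1", 5000L, "A2")
    // Original delivery followed by a replay of both tuples.
    for (e <- Seq(t1a1, t2a2, t1a1, t2a2))
      println(s"$e -> ${guard.process(e)}")
  }
}
```

With this guard the replayed U1 T1 A1 is dropped instead of producing a negative value, so at-least-once delivery no longer corrupts the cached state.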

 

 

PS: For the underlying principles, see the introduction in "Kafka Consumption Principle".

