Kafka optimization notes

1 The role of MQ

Decoupling, asynchronous processing, and peak shaving / valley filling (smoothing traffic spikes).

2 Kafka architecture

1) Producer: the message producer, i.e. the client that sends messages to the Kafka broker.
2) Consumer: the message consumer, i.e. the client that fetches messages from the Kafka broker.
3) Consumer Group (CG): a group composed of multiple consumers. Each consumer in a group consumes data from different partitions, and a partition can only be consumed by one consumer within the group; consumer groups do not affect each other. Every consumer belongs to some consumer group, so a consumer group is the logical subscriber.
4) Broker: a Kafka server is a broker. A cluster consists of multiple brokers, and a single broker can host multiple topics.
5) Topic: can be understood as a queue; both producers and consumers work against a topic.
6) Partition: for scalability, a very large topic can be spread across multiple brokers (i.e. servers); a topic can be divided into multiple partitions, and each partition is an ordered queue.
7) Replica: to ensure that partition data is not lost and Kafka keeps working when a node in the cluster fails, Kafka provides a replication mechanism; each partition of a topic has several replicas, one leader and several followers.
8) Leader: the "master" among a partition's replicas; producers send data to the leader and consumers read data from the leader.
9) Follower: a "slave" among a partition's replicas; it synchronizes data from the leader in real time and stays consistent with it. When the leader fails, one follower becomes the new leader.

3 Kafka storage mechanism

1) Topic is a logical concept, while partition is a physical one. Each partition corresponds to a log file, which stores the data produced by producers. Data produced by a producer is continuously appended to the end of the log file, and each record carries its own offset. Each consumer in a consumer group records, in real time, the offset up to which it has consumed, so that after a failure it can resume consuming from the last position.
2) Because messages are continuously appended to the end of the log file, Kafka uses a segmentation-plus-index mechanism to avoid slow data lookup when the log file grows too large: each partition is split into multiple segments, and each segment corresponds to two files, an ".index" file and a ".log" file. These files live in a folder whose name follows the rule "topic name + partition number". For example, if topic first has three partitions, the corresponding folders are first-0, first-1, and first-2.
3) The index and log files are named after the offset of the first message in the current segment. The ".index" file stores index entries and the ".log" file stores the actual data; the metadata in the index file points to the physical position of the corresponding message in the data file.
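
As an illustration (the offsets below are made up for this example, not taken from a real cluster), the folder for partition 0 of topic first might contain two segments, the second of which starts at message offset 170410:

first-0/
  00000000000000000000.index
  00000000000000000000.log
  00000000000000170410.index
  00000000000000170410.log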

4 Producer partitioning principle

The data sent by a producer is wrapped in a ProducerRecord object. The partition is then chosen as follows (a minimal sketch of these rules follows the list):
1) If a partition is specified, that value is used directly as the partition.
2) If no partition is specified but a key is present, the hash of the key modulo the number of partitions of the topic gives the partition.
3) If neither a partition nor a key is given, a random integer is generated on the first call (and incremented on each subsequent call), and this value modulo the number of available partitions of the topic gives the partition. This is the so-called round-robin algorithm.
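
The following Scala sketch illustrates the three rules above. It is not Kafka's actual DefaultPartitioner (which uses murmur2 hashing and sticky batching in newer versions); the class and method names are made up for illustration:

import java.util.concurrent.atomic.AtomicInteger

object PartitionChooser {
  // simple counter standing in for the producer's round-robin state
  private val counter = new AtomicInteger(0)

  def choosePartition(explicitPartition: Option[Int], key: Option[String], numPartitions: Int): Int =
    (explicitPartition, key) match {
      case (Some(p), _)    => p                                                 // 1) partition given explicitly
      case (None, Some(k)) => (k.hashCode & Integer.MAX_VALUE) % numPartitions  // 2) hash(key) mod #partitions
      case (None, None)    => (counter.getAndIncrement() & Integer.MAX_VALUE) % numPartitions // 3) round robin
    }
}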

5 Does the producer lose data?

1) Replica synchronization strategy

  • Send the ack after more than half of the replicas have finished synchronizing: electing a new leader has low latency, but tolerating the failure of n nodes requires 2n+1 replicas.
  • Send the ack only after all replicas have finished synchronizing: tolerating the failure of n nodes requires only n+1 replicas, but the latency is higher. For example, to tolerate 2 failed nodes the majority scheme needs 5 replicas, while the all-replicas scheme needs only 3. Kafka adopts this second strategy.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

val properties = new Properties
properties.put("bootstrap.servers", broker_list)
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("enable.idempotence", (true: java.lang.Boolean)) // idempotence; also required for transactions
properties.put(ProducerConfig.ACKS_CONFIG, "-1") // wait for all in-sync replicas to acknowledge
var producer: KafkaProducer[String, String] = null
try {
  producer = new KafkaProducer[String, String](properties)
} catch {
  case e: Exception => e.printStackTrace()
}
producer
  • In extreme cases, also set ProducerConfig.RETRIES_CONFIG (the number of send retries), for example:
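
A one-line sketch of adding retries to the same properties object (the value 3 is only an example):

properties.put(ProducerConfig.RETRIES_CONFIG, (3: java.lang.Integer)) // example retry count for transient send failures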
2) ISR
  • The leader maintains a dynamic in-sync replica set (ISR), i.e. the set of followers that are in sync with the leader. Only after the followers in the ISR have finished synchronizing the data does the leader send the ack to the producer. If a follower has not synchronized data with the leader for a long time, it is kicked out of the ISR; the time threshold is set by the parameter replica.lag.time.max.ms. When the leader fails, a new leader is elected from the ISR. An example broker setting is sketched below.
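
A minimal sketch of how this threshold can be set in the broker's server.properties (the value is only an example, not necessarily the default for your Kafka version):

# a follower that has not caught up with the leader within this time is removed from the ISR
replica.lag.time.max.ms=10000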

6 Does Kafka lose data?

1) Set the replication.factor parameter for the topic: this value must be greater than 1, so that every partition has at least 2 replicas.
2) Set the min.insync.replicas parameter on the Kafka broker: this value must be greater than 1, which requires the leader to perceive at least one follower that is still in contact with it and not lagging behind, so that a follower is available to take over when the leader goes down. An example of creating such a topic is sketched below.
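
A minimal sketch (assuming broker_list points at the cluster; the topic name and values are examples) of creating a topic with these settings through the Kafka AdminClient API:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val adminProps = new Properties
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, broker_list)
val admin = AdminClient.create(adminProps)

// 3 partitions, replication.factor = 3, min.insync.replicas = 2
val topic = new NewTopic("first", 3, 3.toShort)
  .configs(Collections.singletonMap("min.insync.replicas", "2"))
admin.createTopics(Collections.singletonList(topic)).all().get()
admin.close()

With acks=-1 on the producer, this combination keeps data as long as at least one in-sync replica survives.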

7 Do consumers lose data?

  • Turn off automatic offset commits and commit the offsets manually after the messages have been processed, to ensure that data is not lost.
import org.apache.kafka.common.serialization.StringDeserializer

// Kafka consumer configuration
  var kafkaParam = collection.mutable.Map(
    "bootstrap.servers" -> broker_list, // addresses used to bootstrap the connection to the cluster
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    // identifies which consumer group this consumer belongs to
    "group.id" -> "gmall_group",
    // "latest" resets the offset to the latest offset when no committed offset exists
    "auto.offset.reset" -> "latest",
    // if true, this consumer's offset is committed automatically in the background,
    //   but data is easily lost if the job crashes before processing finishes
    // if false, the Kafka offset has to be maintained manually
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )
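
A minimal sketch of how this configuration is typically used with Spark Streaming (assumed names: ssc is an existing StreamingContext; topic and group values mirror those above). The offset ranges are captured per batch and persisted manually with the saveOffset helper shown next:

import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("first"), kafkaParam)
)

stream.foreachRDD { rdd =>
  // read the offset ranges of this batch before any processing
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here ...
  // persist the offsets manually only after processing has succeeded
  saveOffset("first", "gmall_group", offsetRanges)
}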

import java.util
import org.apache.spark.streaming.kafka010.OffsetRange
import redis.clients.jedis.Jedis

// Save the offset of each partition to Redis
def saveOffset(topic: String, groupId: String, offsetRanges: Array[OffsetRange]): Unit = {
    // build the Redis key under which the offsets are stored
    val offsetKey = "offset:" + topic + ":" + groupId

    // Java map holding the offset of each partition
    val offsetMap: util.HashMap[String, String] = new util.HashMap[String, String]()

    // iterate over offsetRanges and fill offsetMap
    for (offsetRange <- offsetRanges) {
      val partitionId: Int = offsetRange.partition
      val fromOffset: Long = offsetRange.fromOffset
      val untilOffset: Long = offsetRange.untilOffset

      offsetMap.put(partitionId.toString, untilOffset.toString)
      println("saving partition " + partitionId + ": " + fromOffset + "----->" + untilOffset)
    }

    val jedis: Jedis = MyRedisUtil.getJedisClient()
    jedis.hmset(offsetKey, offsetMap)
    jedis.close()
  }
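
A complementary sketch (a hypothetical getOffset helper mirroring saveOffset above, not part of the original code) for reading the saved offsets back from Redis so the stream can resume from the last processed position:

import org.apache.kafka.common.TopicPartition

// read the offsets written by saveOffset; returns an empty map if nothing was saved yet
def getOffset(topic: String, groupId: String): Map[TopicPartition, Long] = {
    val offsetKey = "offset:" + topic + ":" + groupId
    val jedis: Jedis = MyRedisUtil.getJedisClient()
    val offsetMap: util.Map[String, String] = jedis.hgetAll(offsetKey)
    jedis.close()

    import scala.collection.JavaConverters._
    offsetMap.asScala.map { case (partitionId, offset) =>
      new TopicPartition(topic, partitionId.toInt) -> offset.toLong
    }.toMap
  }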

8 Repeated consumption

  • Kafka has the concept of an offset: every message written gets an offset that represents its sequence number. After consuming data, the consumer periodically commits the offsets of the messages it has consumed, effectively saying "I have already consumed up to here; if I restart, let me continue from the offset I committed last time."
  • Newer versions of Kafka have moved offset storage from ZooKeeper to the Kafka brokers, using the internal offsets topic __consumer_offsets.
  • Producer transactions. To implement transactions that span partitions and sessions, a globally unique Transaction ID is introduced and bound to the PID obtained by the producer, so that after a producer restart the original PID can be recovered through the Transaction ID.
// enable idempotence (a prerequisite for producer transactions)
properties.put("enable.idempotence", (true: java.lang.Boolean))

Origin: blog.csdn.net/wolfjson/article/details/121323037