Kafka core principles

I. Message middleware

1. Advantages

  • Asynchronous invocation: turns synchronous calls into asynchronous ones
  • Application decoupling and scalability: provides a data-based interface layer
  • Traffic peak shaving: absorbs instantaneous high-traffic pressure
  • Recoverability
  • Ordering guarantee

2. Message middleware working modes

  • Point-to-point model: one-to-one; consumers actively pull data
  • Publish-subscribe model: one-to-many; once produced, data is pushed to all subscribers

3. Terminology in message middleware

  • Broker: the message server, providing the core services
  • Producer: the message producer
  • Consumer: the message consumer
  • Topic: the unified collection of messages in the publish-subscribe model
  • Queue: the message queue in the point-to-point model

II. Apache Kafka


Kafka is a high-throughput distributed publish-subscribe messaging system, designed for scenarios such as ultra-high-throughput real-time log collection, real-time data synchronization, and real-time data computation.

  • Fast: a single Broker can serve reads of hundreds of MB per second
  • The cluster can be scaled without downtime
  • Messages are replicated for redundancy
  • Real-time data pipelines

Written in Scala

1. Kafka installation and startup

  • Download and unzip
  • Edit the configuration file
//config/server.properties
broker.id=0
listeners=PLAINTEXT://master:9092
zookeeper.connect=master:2181,slave1:2181,slave2:2181
log.dirs (data directories), log.retention.hours (log retention period, default 168 hours)
  • Start up
    bin/kafka-server-start.sh config/server.properties
  • Verify: create a topic with bin/kafka-topics.sh, then send and receive test messages with kafka-console-producer.sh and kafka-console-consumer.sh

2. Kafka architecture

  • Broker: a server in the Kafka cluster
  • Topic: messages are maintained in topics; a topic can be regarded as a category of messages
  • Producer: publishes (produces) messages to Kafka topics
  • Consumer: subscribes to (consumes) topics and processes their messages

3. Kafka Topic

Topic

  • A topic is the category name under which messages are published
  • Publishing and subscribing always happen against a specified topic
  • The number of replicas of a topic must not exceed the number of Brokers

Partition (improves concurrency)

  • A topic contains multiple partitions; by default a message is assigned to a partition by hashing its key (a sketch of the idea follows this list)
  • Each Partition corresponds to a folder named <topic_name>-<partition_id>
  • Each Partition is an ordered log made up of segment files (LogSegments)
  • The replication strategy is per Partition, not per Topic
  • Each Partition has one Leader and 0 or more Followers
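
A minimal sketch of the key-hash idea (Kafka's real DefaultPartitioner hashes the serialized key with murmur2; the helper below is an illustrative assumption, not Kafka's implementation):

import java.util.Arrays;

// Map a serialized key to a partition index: hash it, mask the sign bit so the
// result is non-negative, then take the modulo of the partition count.
static int partitionFor(byte[] keyBytes, int numPartitions) {
    return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
}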

4. Kafka Message

header: message header, fixed length

  • offset: uniquely identifies the position of each message within its partition
  • CRC32: a crc32 checksum used to verify the message
  • magic: indicates the version of the Kafka message format protocol
  • attributes: standalone version information, or flags identifying the compression type or encoding type

body: message body

  • key: the message key, optional
  • value (bytes payload): the actual message data

(Figure: physical structure of a Kafka message)

5. Kafka Producer

The producer writes messages to the Broker

  • The Producer sends messages directly to the Leader Partition on the Broker (Followers only replicate the Leader)
  • The Producer client itself controls which partition each message is pushed to: hashing a specified key, round-robin when no key is given, a custom partitioning algorithm, etc. (see the sketch after this list)
  • Messages are pushed in batches to improve efficiency
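
As a concrete illustration of the custom-partitioning option, here is a minimal sketch of a custom partitioner; the class name com.example.MyPartitioner and its routing rule are assumptions, and it would be registered through the producer's partitioner.class setting:

import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical rule: keyless messages all go to partition 0, keyed messages
// are spread across the topic's partitions by key hash.
public class MyPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;
        }
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}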

6. Kafka Broker

Every Broker in the Kafka cluster can respond to a Producer's metadata requests, telling it:

  • which Brokers are alive
  • where a Topic's Leader Partitions are (they are distributed across multiple Brokers)

Each Broker acts as Leader for some partitions and Follower for others, which keeps the load balanced

  • The Leader handles all read and write requests
  • Followers passively replicate the Leader

7. Kafka Consumer

Consumers consume messages by subscribing to topics

  • Offset management is at the consumer-group level (group.id)
  • Each Partition can be consumed by only one Consumer within the same consumer group
  • Each Consumer can consume multiple partitions
  • Consumed data is still retained in Kafka
  • The number of consumers should generally not exceed the number of partitions

Consumption patterns (see the sketch below)

  • Queue: all consumers belong to one consumer group
  • Publish/Subscribe: every consumer belongs to a different consumer group
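
A minimal sketch of how the two patterns are selected purely through group.id (the group names are assumptions):

import java.util.Properties;
import java.util.UUID;

// Queue semantics: all consumers share one group, so each record is delivered
// to exactly one consumer in the group.
Properties queueProps = new Properties();
queueProps.put("group.id", "workers");

// Publish/subscribe semantics: each consumer uses a unique group, so every
// consumer receives every record.
Properties pubSubProps = new Properties();
pubSubProps.put("group.id", "subscriber-" + UUID.randomUUID());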

8. Kafka data flow

Replica synchronization: ISR (In-Sync Replicas)
Disaster recovery: Leader Partition failover
High concurrency

  • Read and write performance
  • Consumer groups

Load balancing
No data loss (ack mechanism)

9. The role of ZooKeeper in Kafka

Broker registration and status monitoring

  • /brokers/ids

Topic registration

  • /brokers/topics

Producer load balancing

  • Each Broker registers itself in ZooKeeper when it starts, and producers dynamically perceive changes to the Broker server list through changes to these nodes (see the sketch below)
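
A minimal sketch, assuming the org.apache.zookeeper client library is on the classpath, of watching the broker list the way clients perceive membership changes (the connection string reuses the ensemble from the earlier configuration):

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

// Connect to the same ZooKeeper ensemble the brokers register with.
ZooKeeper zk = new ZooKeeper("master:2181,slave1:2181,slave2:2181", 30000,
        (WatchedEvent event) -> System.out.println("ZK event: " + event));

// Passing true registers a watch: the watcher above fires whenever a child of
// /brokers/ids appears or vanishes, i.e. whenever a broker joins or leaves.
List<String> brokerIds = zk.getChildren("/brokers/ids", true);
System.out.println("live brokers: " + brokerIds);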

Offset maintenance

  • Early versions of Kafka stored each consumer's offsets in ZooKeeper. Because ZooKeeper's write performance is poor, since version 0.10 Kafka has maintained offsets in an internal topic of its own (__consumer_offsets).

III. Kafka API

  • Producer API
  • Consumer API
  • Streams API
  • Connector API

1. Kafka Producer API

Key classes

  • KafkaProducer
  • ProducerRecord
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.2</version>
</dependency>

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("topic1", Integer.toString(i), "dd:" + i));
producer.close(); // flush pending messages and release resources

Configuration items (parameter: description; default)

  • bootstrap.servers: the broker list of the Kafka cluster (no default)
  • acks: producer reliability setting; default 1
      acks=0: do not wait for an acknowledgement before returning
      acks=1: return once the Leader has written successfully
      acks=all: return only after the Leader and all Followers in the ISR have written successfully; all can also be written as -1
  • key.serializer: serializer for the key
  • value.serializer: serializer for the value
  • retries: number of times a failed send is retried; default 0
  • batch.size: size in bytes of unsent messages buffered per partition; default 16384
  • partitioner.class: partitioning class; you can customize it by implementing the Partitioner interface; the default partitions by hash(key) % numPartitions
  • max.block.ms: maximum time a send may block; default 60000
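
A short sketch applying several of these settings together; the values and the partitioner class name (the hypothetical MyPartitioner sketched earlier) are illustrative assumptions, not recommendations:

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all");           // wait for the Leader and all ISR followers
props.put("retries", "3");          // retry each failed batch up to 3 times
props.put("batch.size", "16384");   // per-partition buffer for unsent messages, in bytes
props.put("max.block.ms", "60000"); // longest a send may block
props.put("partitioner.class", "com.example.MyPartitioner"); // hypothetical custom partitioner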

2. Kafka Consumer API

Key classes

  • KafkaConsumer
  • ConsumerRecords

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("group.id", "testGroup1");
props.put("enable.auto.commit", "true");      // default: true
props.put("auto.commit.interval.ms", "1000"); // default: 5000
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("20190626"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("partition=%d, offset=%d, key=%s, value=%s%n",
                record.partition(), record.offset(), record.key(), record.value());
}
Configuration items (parameter: description; default)

  • bootstrap.servers: the broker list of the Kafka cluster (no default)
  • group.id: the consumer group this consumer joins; default ""
  • key.deserializer: deserializer for the key
  • value.deserializer: deserializer for the value
  • enable.auto.commit: whether offsets are committed automatically; default true
  • auto.commit.interval.ms: how often offsets are committed automatically; default 5000 (5 s)
  • auto.offset.reset: where to start when there is no committed offset; default latest
      earliest: if a partition has a committed offset, consume from it; otherwise consume from the beginning
      latest: if a partition has a committed offset, consume from it; otherwise consume only data newly produced to the partition
      none: if every partition of the topic has a committed offset, consume from after those offsets; if any partition lacks a committed offset, throw an exception

Committing offsets manually

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("group.id", "testGroup1");
props.put("enable.auto.commit", "false"); // disable auto-commit; we commit ourselves
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("20190626"));
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() > 5) {
        // process the buffered records here, e.g. save them to a database
        consumer.commitAsync(); // asynchronous, non-blocking commit
        buffer.clear();
    }
}

IV. Kafka optimization

1. Message ordering

Kafka guarantees ordering within a single partition of a topic
How to achieve global ordering for a topic:

  • Use one topic with a single partition
  • Have the producer group messages by key, e.g. (table + primary key), and write each group into one partition (see the sketch below)
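
A minimal sketch of the key-grouping idea, assuming the producer from the earlier API example and records coming from database change events; the table name, primary key, and db-changes topic are assumptions:

// All messages that share a key hash to the same partition, so all changes to
// one row stay in order relative to each other.
String table = "orders";
String pk = "42";
String key = table + ":" + pk;
producer.send(new ProducerRecord<String, String>("db-changes", key, "row " + pk + " updated"));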

2. Message replication guarantees

request.required.acks (acks in the new producer API)

  • 0: the producer never waits for an ack
  • 1: the producer waits for the Leader to write successfully before returning
  • -1/all: the producer waits until the Leader and all Followers in the ISR have written successfully before returning

min.insync.replicas

  • This setting specifies the minimum ISR size. When the producer sets acks to all or -1, a write succeeds only if at least this many replicas are in sync; if that number is not reached, the producer receives an exception (see the sketch below)
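
A minimal sketch of the durable-write setup described above. min.insync.replicas is a broker/topic-level setting rather than a producer property, so only the producer side is shown; the broker list is reused from the earlier examples:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all"); // wait for the Leader and every ISR follower
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
// With replication factor 3 and min.insync.replicas=2 on the topic, sends fail
// with NotEnoughReplicasException once fewer than 2 replicas remain in sync.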

3. Producer data loss analysis

Kafka Producer API

  • Messages accumulate in the batch buffer
  • Messages are batched per partition, and retries happen at the batch level
  • Once retries are exhausted, expired batches are discarded
  • Producer close/flush can fail
  • Producing data faster than it can be delivered raises BufferExhaustedException (see the sketch below)
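
A minimal sketch of guarding against these loss modes, reusing the producer from above: send with a callback so dropped batches are at least detected, and flush/close deliberately (the topic, key, and value are assumptions):

producer.send(new ProducerRecord<String, String>("topic1", "k", "v"),
        (metadata, exception) -> {
            if (exception != null) {
                // the batch was dropped after its retries, or buffering failed
                System.err.println("send failed: " + exception);
            }
        });
producer.flush(); // block until every buffered record has been sent
producer.close(); // also flushes; call it before the JVM exits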

Typical use cases

  • Real-time stream processing (combined with Spark Streaming)
  • Universal message bus
  • Collecting user activity data
  • Collecting operational metrics from applications, servers, or devices
  • Log aggregation (combined with ELK)
  • Commit log for distributed systems

Origin: blog.csdn.net/sun_0128/article/details/108062549