Kafka core principles

I. Message middleware

1. Advantages

  • Asynchronous invocation: turns synchronous calls into asynchronous ones
  • Application decoupling and scalability: provides a data-based interface layer
  • Traffic peak shaving: absorbs instantaneous high-traffic pressure
  • Recoverability
  • Ordering guarantee

2. Message middleware working modes

  • Point-to-point model: one-to-one; consumers actively pull data
  • Publish-subscribe model: one-to-many; once produced, data is pushed to all subscribers

3. Terminology in message middleware

  • Broker: the message server, providing the core services
  • Producer: the message producer
  • Consumer: the message consumer
  • Topic: the unified collection of messages in the publish-subscribe model
  • Queue: the message queue in the point-to-point model

II. Apache Kafka


Kafka is a high-throughput distributed publish-subscribe messaging system, designed for scenarios such as ultra-high-throughput real-time log collection, real-time data synchronization, and real-time data computation.

  • Fast: a single Broker can serve reads of hundreds of MB per second
  • The cluster can be scaled without downtime
  • Messages are replicated for redundancy
  • Real-time data pipelines

Written in Scala

1. Kafka installation and startup

  • Download and unzip
  • Edit the configuration file
//config/server.properties
broker.id=0
listeners=PLAINTEXT://master:9092
zookeeper.connect=master:2181,slave1:2181,slave2:2181
log.dirs (data directories), log.retention.hours (log retention period, default 168 hours)
  • Start up
    bin/kafka-server-start.sh config/server.properties
  • Verify: create a topic with bin/kafka-topics.sh, then send and receive test messages with kafka-console-producer.sh and kafka-console-consumer.sh

2. Kafka architecture

  • Broker: a server in the Kafka cluster
  • Topic: messages are maintained in topics; a topic can be regarded as a category of messages
  • Producer: publishes (produces) messages to Kafka topics
  • Consumer: subscribes to (consumes) topics and processes their messages

3. Kafka Topic

Topic

  • A topic is the category name under which messages are published
  • Publishing and subscribing always happen against a specified topic
  • The number of replicas of a topic must not exceed the number of Brokers

Partition (improves concurrency)

  • A topic contains multiple partitions; by default a message is assigned to a partition by hashing its key (a sketch of the idea follows this list)
  • Each Partition corresponds to a folder named <topic_name>-<partition_id>
  • Each Partition is an ordered log made up of segment files (LogSegments)
  • The replication strategy is per Partition, not per Topic
  • Each Partition has one Leader and 0 or more Followers
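
A minimal sketch of the key-hash idea (Kafka's real DefaultPartitioner hashes the serialized key with murmur2; the helper below is an illustrative assumption, not Kafka's implementation):

import java.util.Arrays;

// Map a serialized key to a partition index: hash it, mask the sign bit so the
// result is non-negative, then take the modulo of the partition count.
static int partitionFor(byte[] keyBytes, int numPartitions) {
    return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
}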

4. Kafka Message

header: message header, fixed length

  • offset: uniquely identifies the position of each message within its partition
  • CRC32: a crc32 checksum used to verify the message
  • magic: indicates the version of the Kafka message format protocol
  • attributes: standalone version information, or flags identifying the compression type or encoding type

body: message body

  • key: the message key, optional
  • value (bytes payload): the actual message data

(Figure: physical structure of a Kafka message)

5. Kafka Producer

The producer writes messages to the Broker

  • The Producer sends messages directly to the Leader Partition on the Broker (Followers only replicate the Leader)
  • The Producer client itself controls which partition each message is pushed to: hashing a specified key, round-robin when no key is given, a custom partitioning algorithm, etc. (see the sketch after this list)
  • Messages are pushed in batches to improve efficiency
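
As a concrete illustration of the custom-partitioning option, here is a minimal sketch of a custom partitioner; the class name com.example.MyPartitioner and its routing rule are assumptions, and it would be registered through the producer's partitioner.class setting:

import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical rule: keyless messages all go to partition 0, keyed messages
// are spread across the topic's partitions by key hash.
public class MyPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;
        }
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}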

6. Kafka Broker

Every Broker in the Kafka cluster can respond to a Producer's metadata requests, telling it:

  • which Brokers are alive
  • where a Topic's Leader Partitions are (they are distributed across multiple Brokers)

Each Broker acts as Leader for some partitions and Follower for others, which keeps the load balanced

  • The Leader handles all read and write requests
  • Followers passively replicate the Leader

7. Kafka Consumer

Consumers consume messages by subscribing to topics

  • Offset management is at the consumer-group level (group.id)
  • Each Partition can be consumed by only one Consumer within the same consumer group
  • Each Consumer can consume multiple partitions
  • Consumed data is still retained in Kafka
  • The number of consumers should generally not exceed the number of partitions

Consumption patterns (see the sketch below)

  • Queue: all consumers belong to one consumer group
  • Publish/Subscribe: every consumer belongs to a different consumer group
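
A minimal sketch of how the two patterns are selected purely through group.id (the group names are assumptions):

import java.util.Properties;
import java.util.UUID;

// Queue semantics: all consumers share one group, so each record is delivered
// to exactly one consumer in the group.
Properties queueProps = new Properties();
queueProps.put("group.id", "workers");

// Publish/subscribe semantics: each consumer uses a unique group, so every
// consumer receives every record.
Properties pubSubProps = new Properties();
pubSubProps.put("group.id", "subscriber-" + UUID.randomUUID());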

8. Kafka data flow

Replica synchronization: ISR (In-Sync Replicas)
Disaster recovery: Leader Partition failover
High concurrency

  • Read and write performance
  • Consumer groups

Load balancing
No data loss (ack mechanism)

9. The role of ZooKeeper in Kafka

Broker registration and status monitoring

  • /brokers/ids

Topic registration

  • /brokers/topics

Producer load balancing

  • Each Broker registers itself in ZooKeeper when it starts, and producers dynamically perceive changes to the Broker server list through changes to these nodes (see the sketch below)
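
A minimal sketch, assuming the org.apache.zookeeper client library is on the classpath, of watching the broker list the way clients perceive membership changes (the connection string reuses the ensemble from the earlier configuration):

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

// Connect to the same ZooKeeper ensemble the brokers register with.
ZooKeeper zk = new ZooKeeper("master:2181,slave1:2181,slave2:2181", 30000,
        (WatchedEvent event) -> System.out.println("ZK event: " + event));

// Passing true registers a watch: the watcher above fires whenever a child of
// /brokers/ids appears or vanishes, i.e. whenever a broker joins or leaves.
List<String> brokerIds = zk.getChildren("/brokers/ids", true);
System.out.println("live brokers: " + brokerIds);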

Offset maintenance

  • Early versions of Kafka stored each consumer's offsets in ZooKeeper. Because ZooKeeper's write performance is poor, since version 0.10 Kafka has maintained offsets in an internal topic of its own (__consumer_offsets).

III. Kafka API

  • Producer API
  • Consumer API
  • Streams API
  • Connector API

1. Kafka Producer API

Key classes

  • KafkaProducer
  • ProducerRecord
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.2</version>
</dependency>

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("topic1", Integer.toString(i), "dd:" + i));
producer.close(); // flush pending messages and release resources

Configuration items (parameter: description; default)

  • bootstrap.servers: the broker list of the Kafka cluster (no default)
  • acks: producer reliability setting; default 1
      acks=0: do not wait for an acknowledgement before returning
      acks=1: return once the Leader has written successfully
      acks=all: return only after the Leader and all Followers in the ISR have written successfully; all can also be written as -1
  • key.serializer: serializer for the key
  • value.serializer: serializer for the value
  • retries: number of times a failed send is retried; default 0
  • batch.size: size in bytes of unsent messages buffered per partition; default 16384
  • partitioner.class: partitioning class; you can customize it by implementing the Partitioner interface; the default partitions by hash(key) % numPartitions
  • max.block.ms: maximum time a send may block; default 60000
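
A short sketch applying several of these settings together; the values and the partitioner class name (the hypothetical MyPartitioner sketched earlier) are illustrative assumptions, not recommendations:

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all");           // wait for the Leader and all ISR followers
props.put("retries", "3");          // retry each failed batch up to 3 times
props.put("batch.size", "16384");   // per-partition buffer for unsent messages, in bytes
props.put("max.block.ms", "60000"); // longest a send may block
props.put("partitioner.class", "com.example.MyPartitioner"); // hypothetical custom partitioner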

2. Kafka Consumer API

Key classes

  • KafkaConsumer
  • ConsumerRecords

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("group.id", "testGroup1");
props.put("enable.auto.commit", "true");      // default: true
props.put("auto.commit.interval.ms", "1000"); // default: 5000
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("20190626"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("partition=%d, offset=%d, key=%s, value=%s%n",
                record.partition(), record.offset(), record.key(), record.value());
}
Configuration items (parameter: description; default)

  • bootstrap.servers: the broker list of the Kafka cluster (no default)
  • group.id: the consumer group this consumer joins; default ""
  • key.deserializer: deserializer for the key
  • value.deserializer: deserializer for the value
  • enable.auto.commit: whether offsets are committed automatically; default true
  • auto.commit.interval.ms: how often offsets are committed automatically; default 5000 (5 s)
  • auto.offset.reset: where to start when there is no committed offset; default latest
      earliest: if a partition has a committed offset, consume from it; otherwise consume from the beginning
      latest: if a partition has a committed offset, consume from it; otherwise consume only data newly produced to the partition
      none: if every partition of the topic has a committed offset, consume from after those offsets; if any partition lacks a committed offset, throw an exception

Committing offsets manually

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("group.id", "testGroup1");
props.put("enable.auto.commit", "false"); // disable auto-commit; we commit ourselves
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("20190626"));
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() > 5) {
        // process the buffered records here, e.g. save them to a database
        consumer.commitAsync(); // asynchronous, non-blocking commit
        buffer.clear();
    }
}

IV. Kafka optimization

1. Message ordering

Kafka guarantees ordering within a single partition of a topic
How to achieve global ordering for a topic:

  • Use one topic with a single partition
  • Have the producer group messages by key, e.g. (table + primary key), and write each group into one partition (see the sketch below)
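
A minimal sketch of the key-grouping idea, assuming the producer from the earlier API example and records coming from database change events; the table name, primary key, and db-changes topic are assumptions:

// All messages that share a key hash to the same partition, so all changes to
// one row stay in order relative to each other.
String table = "orders";
String pk = "42";
String key = table + ":" + pk;
producer.send(new ProducerRecord<String, String>("db-changes", key, "row " + pk + " updated"));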

2. Message replication guarantees

request.required.acks (acks in the new producer API)

  • 0: the producer never waits for an ack
  • 1: the producer waits for the Leader to write successfully before returning
  • -1/all: the producer waits until the Leader and all Followers in the ISR have written successfully before returning

min.insync.replicas

  • This setting specifies the minimum ISR size. When the producer sets acks to all or -1, a write succeeds only if at least this many replicas are in sync; if that number is not reached, the producer receives an exception (see the sketch below)
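
A minimal sketch of the durable-write setup described above. min.insync.replicas is a broker/topic-level setting rather than a producer property, so only the producer side is shown; the broker list is reused from the earlier examples:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

Properties props = new Properties();
props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
props.put("acks", "all"); // wait for the Leader and every ISR follower
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
// With replication factor 3 and min.insync.replicas=2 on the topic, sends fail
// with NotEnoughReplicasException once fewer than 2 replicas remain in sync.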

3. Producer data loss analysis

Kafka Producer API

  • Messages accumulate in the batch buffer
  • Messages are batched per partition, and retries happen at the batch level
  • Once retries are exhausted, expired batches are discarded
  • Producer close/flush can fail
  • Producing data faster than it can be delivered raises BufferExhaustedException (see the sketch below)
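
A minimal sketch of guarding against these loss modes, reusing the producer from above: send with a callback so dropped batches are at least detected, and flush/close deliberately (the topic, key, and value are assumptions):

producer.send(new ProducerRecord<String, String>("topic1", "k", "v"),
        (metadata, exception) -> {
            if (exception != null) {
                // the batch was dropped after its retries, or buffering failed
                System.err.println("send failed: " + exception);
            }
        });
producer.flush(); // block until every buffered record has been sent
producer.close(); // also flushes; call it before the JVM exits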

Typical use cases

  • Real-time stream processing (combined with Spark Streaming)
  • Universal message bus
  • Collecting user activity data
  • Collecting operational metrics from applications, servers, or devices
  • Log aggregation (combined with ELK)
  • Commit log for distributed systems

Origin: blog.csdn.net/sun_0128/article/details/108062549