[Kafka] Message Queue Kafka Basics

Introduction to Message Queuing

  Message Queue is often abbreviated as MQ. Taken literally, a message queue is simply a queue used to store messages, for example a queue in Java:

// 1. Create a queue that holds strings
Queue<String> stringQueue = new LinkedList<String>();
// 2. Put a message into the queue
stringQueue.offer("message");
// 3. Take the message out of the queue and print it
System.out.println(stringQueue.poll());

  The code above creates a queue, puts a message into it, and then takes the message back out, which shows that a queue can be used to store and retrieve messages. In simple terms, a message queue stores the data that needs to be transmitted in a queue.
  Message queue middleware is software (a component) used to store messages. There are many message queue products, such as Kafka, RabbitMQ, ActiveMQ, RocketMQ, and ZeroMQ.

Application Scenarios of Message Queuing

Asynchronous processing

  For example, on an e-commerce website, when a new user registers, the user's information must be saved to the database, and in addition a registration confirmation email and an SMS verification code must be sent to the user.
  Because sending the email and the SMS requires connecting to external servers, these steps take extra time. A message queue can handle them asynchronously, so the registration request itself gets a fast response.
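To make this concrete, here is a minimal sketch of how the registration flow could hand the slow work off to Kafka: the web request only publishes a "user registered" event and returns, while separate email and SMS consumers process the event later. The topic name user-register-events and the user id are made up for illustration; the broker address reuses the cluster built later in this post, and the kafka-clients dependency from the Maven section below must be on the classpath.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class RegistrationEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.211.55.8:9092"); // broker address used elsewhere in this post
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The registration handler only writes the event and returns immediately;
        // the email and SMS services consume "user-register-events" at their own pace.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-register-events", "userId-1001"));
        }
    }
}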

System decoupling


Traffic peak shaving (peak clipping)


Log processing

  Large e-commerce websites (Taobao, JD.com, Gome, Suning, ...) and apps (Douyin, Meituan, Didi, etc.) need to analyze user behavior and derive user preferences and activity from access patterns, which means collecting a large amount of user access data from their pages.

Two modes of message queue

Point-to-point mode

  The message sender produces a message and sends it to the message queue; the message receiver then takes it out of the queue and consumes it. Once a message has been consumed it is no longer stored in the queue, so a receiver cannot consume a message that has already been consumed.

Features:

  • Each message has only one receiver (consumer); once consumed, the message is no longer in the message queue.
  • There is no dependency between the sender and the receiver. After the sender sends a message, whether or not a receiver is running has no effect on the sender sending the next message.
  • After successfully receiving a message, the receiver must acknowledge it to the queue so that the queue can delete the message it just delivered.


Publish-subscribe mode

Features:

  • Each message can have multiple subscribers.
  • There is a time dependency between publishers and subscribers: a subscriber of a topic must be created before it can consume the publisher's messages.
  • In order to consume messages, a subscriber must subscribe to the topic in advance and stay running online.

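In Kafka itself, both modes are expressed through consumer groups: consumers that share one group.id split the topic's partitions between them, so each message is handled by only one of them (point-to-point behaviour), while consumers in different groups each receive the full stream of messages (publish-subscribe behaviour). A configuration sketch, assuming the consumer setup shown later in this post; only group.id changes, and the group names here are illustrative:

// Point-to-point style: start several consumers with the SAME group.id;
// each message of the topic is processed by only one consumer in the group.
props.put("group.id", "order-workers");

// Publish-subscribe style: give each application its OWN group.id;
// every group then receives every message of the topic.
// props.put("group.id", "email-service");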

Kafka Introduction and Application Scenarios

Apache Kafka is a distributed streaming platform. A distributed streaming platform should contain three key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
  • Store streams of records in a fault-tolerant, durable way
  • Process streams of records as they occur

Apache Kafka is typically used in two types of programs:

  • Build real-time data pipelines to reliably get data between systems or applications

  • Build real-time streaming applications that transform or react to streams of data

Producers: any number of applications can publish message data into the Kafka cluster.
Consumers: any number of applications can pull message data from the Kafka cluster.
Connectors: Kafka connectors can import data from databases into Kafka and export data from Kafka into databases.
Stream Processors: stream processors can pull data from Kafka, transform it, and write data back to Kafka.
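To make the "stream processor" role concrete, here is a minimal Kafka Streams sketch that reads one topic, transforms each value, and writes the result to another topic. It assumes the separate kafka-streams dependency (not included in the pom.xml shown later in this post), and the topic names input-topic and output-topic are illustrative:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "10.211.55.8:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from input-topic, upper-case every value, write to output-topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}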


Advantages of Kafka over other MQs

Characteristic | ActiveMQ | RabbitMQ | Kafka | RocketMQ
Community/Company | Apache | Mozilla Public License | Apache | Apache/Alibaba
Maturity | Mature | Mature | Mature | Fairly mature
Producer-consumer pattern | Supported | Supported | Supported | Supported
Publish-subscribe | Supported | Supported | Supported | Supported
Request-reply | Supported | Supported | - | Supported
API completeness | High | High | High | Low (static configuration)
Multi-language support | Supported, Java first | Language independent | Supported, Java first | Supported
Single-node throughput | 10,000-level (lowest) | 10,000-level | 100,000-level | 100,000-level (highest)
Message latency | - | Microsecond level | Millisecond level | -
Availability | High (master-slave) | High (master-slave) | Very high (distributed) | High
Message loss | - | Low | Theoretically none | -
Message duplication | - | Controllable | Theoretically possible | -
Transactions | Supported | Not supported | Supported | Supported
Documentation completeness | High | High | High | Medium
Quick start provided | Yes | Yes | Yes | No
Difficulty of first deployment | - | Low | Medium | High

Kafka directory structure

The Kafka version used is 2.4.1.

Directory | Description
bin | All of Kafka's executable scripts, e.g. starting the Kafka server, creating topics, the console producer and consumer, etc.
config | All of Kafka's configuration files
libs | All JAR packages required to run Kafka
logs | All of Kafka's log files; if Kafka has problems, check this directory for exception information
site-docs | Kafka's documentation

Build a Kafka cluster

The Kafka version used is 2.4.1, which was released on March 12, 2020.

Note: the Kafka package is named kafka_2.12-2.4.1 because Kafka is mainly developed in Scala, and 2.12 is the Scala version number.

Create and unpack:

sudo mkdir /export
cd /export
sudo mkdir server
sudo mkdir software
sudo chmod 777 software/
sudo chmod 777 server/
cd /export/software/
tar -xvzf kafka_2.12-2.4.1.tgz -C ../server/

Modify server.properties:

# Create the directory for Kafka data
mkdir /export/server/kafka_2.12-2.4.1/data
vim /export/server/kafka_2.12-2.4.1/config/server.properties
# Specify the broker id
broker.id=0
# Specify the location of Kafka data
log.dirs=/export/server/kafka_2.12-2.4.1/data
# Configure the three ZooKeeper nodes
zookeeper.connect=10.211.55.8:2181,10.211.55.9:2181,10.211.55.7:2181

Repeat the above steps on the other two servers, changing only broker.id so that each broker has a different id.
Configure the KAFKA_HOME environment variable:

sudo su
vim /etc/profile
export KAFKA_HOME=/export/server/kafka_2.12-2.4.1
export PATH=$PATH:${KAFKA_HOME}/bin
# The original file does not contain the following line; add it manually
export PATH

# Load the environment variables on every node
source /etc/profile

Start the server:

# Start ZooKeeper first
# Start Kafka; it must be started from the Kafka root directory
cd /export/server/kafka_2.12-2.4.1

nohup bin/kafka-server-start.sh config/server.properties &
# Test whether the Kafka cluster started successfully
bin/kafka-topics.sh --bootstrap-server 10.211.55.8:9092 --list
# No error is reported and the output is empty

Write a Kafka one-click start/stop script

To make it easy to start and shut down Kafka with a single command, you can write a shell script; running the script once quickly starts or shuts down Kafka.

Prepare a slave configuration file that records the nodes on which Kafka should be started:

# Create the /export/onekey directory
sudo mkdir /export/onekey

cd /export/onekey
sudo su
# Create the slave file
touch slave

# Write the following into slave
10.211.55.8
10.211.55.9
10.211.55.7

Write the start-kafka.sh script:

vim start-kafka.sh

cat /export/onekey/slave | while read line
do
{
 echo $line
 ssh $line "source /etc/profile;export JMX_PORT=9988;nohup ${KAFKA_HOME}/bin/kafka-server-start.sh ${KAFKA_HOME}/config/server.properties >/dev/null 2>&1 & "
 wait
}&
done

Write the stop-kafka.sh script:

vim stop-kafka.sh

cat /export/onekey/slave | while read line
do
{
 echo $line
 ssh $line "source /etc/profile;jps |grep Kafka |cut -d' ' -f1 |xargs kill -s 9"
 wait
}&
done

Configure execution permissions for start-kafka.sh and stop-kafka.sh:

chmod u+x start-kafka.sh
chmod u+x stop-kafka.sh

# Run the one-click start and shutdown scripts. Note: passwordless SSH login must be set up between the servers before running them
./start-kafka.sh
./stop-kafka.sh

# If the logs show an "Error connecting to node ubuntu2:9092" error, add the following host entries on all three servers (ubuntu2 is used as the example here; configure the other two the same way)
# sudo vim /etc/hosts
# 10.211.55.8 ubuntu1
# 10.211.55.7 ubuntu3

Kafka Basic Operations

Create a topic

Create a topic. All messages in Kafka are stored in topics; to produce messages to Kafka, you must first have a topic to write to.

# Create a topic named test
bin/kafka-topics.sh --create --bootstrap-server 10.211.55.8:9092 --topic test
# List the topics currently in Kafka
bin/kafka-topics.sh --list --bootstrap-server 10.211.55.8:9092
# "test" is printed on success

Produce messages to Kafka

Use Kafka's built-in console producer to produce some messages to the test topic.

bin/kafka-console-producer.sh --broker-list 10.211.55.8:9092 --topic test
# ">" indicates that the console is waiting for input

Consume messages from Kafka

Open another window:

# Consume the messages in the test topic
bin/kafka-console-consumer.sh --bootstrap-server 10.211.55.8:9092 --topic test --from-beginning

# The producer sends messages and the consumer receives them

Using Kafka Tools to work with Kafka


Connect to Kafka Tool with Security


Operating Kafka from Java

Add the Kafka dependencies to Maven's pom.xml:

<repositories><!-- repositories -->
    <repository>
        <id>central</id>
        <url>http://maven.aliyun.com/nexus/content/groups/public//</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>fail</checksumPolicy>
        </snapshots>
    </repository>
</repositories>

<dependencies>
    <!-- Kafka client -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.4.1</version>
    </dependency>

    <!-- utility classes -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-io</artifactId>
        <version>1.3.2</version>
    </dependency>

    <!-- SLF4J binding for Log4j -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.6</version>
    </dependency>

    <!-- Log4j -->
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.16</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.7.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>

log4j.properties: (put into the resources folder)

log4j.rootLogger=INFO,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender 
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n

Producing messages to Kafka synchronously:

  • Create a Properties configuration for connecting to Kafka
Properties props = new Properties();
// bootstrap.servers is a setting every Kafka producer and consumer must provide: it lists one or more broker addresses that the client uses to establish the initial connection to the Kafka cluster.
props.put("bootstrap.servers", "192.168.88.100:9092");
// acks sets the acknowledgement level. With "all", the producer waits until all replicas have written the message before considering the send successful; this prevents data loss but can reduce throughput.
props.put("acks", "all");
// The key serializer class. Keys and values must be serialized for network transport; StringSerializer turns a string into a byte array before it is sent to the cluster.
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// The value serializer class; StringSerializer is used here as well.
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
  • Create a producer object, KafkaProducer
  • Call send to publish messages 1-100 to the topic test and obtain the returned Future, which wraps the result
  • Call Future.get() to wait for the response
  • Close the producer

Send a message using a synchronous wait

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/**
 * Kafka producer program: creates messages and sends them to the Kafka cluster.
 * 1. Create the Properties configuration for connecting to Kafka
 * 2. Create a KafkaProducer object
 * 3. Call send to publish messages 1-100 to the topic test and obtain the returned Future
 * 4. Call Future.get() to wait for the response
 * 5. Close the producer
 */
public class KafkaProducerTest {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Create the Properties configuration for connecting to Kafka
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.xx.xx.1x8:9092");
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");

        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"xxxx\" password=\"xxxx\";");

        // Enable producer idempotence
        props.put("enable.idempotence", true);

        // Create a KafkaProducer object
        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(props);

        // Send messages 1-100 to the specified topic
        for (int i = 0; i < 100; ++i) {
            // 1. Send the message with a synchronous wait
            // Build a record with new ProducerRecord:
            // "test": the Kafka topic the record is sent to.
            // null: the record key; keys are optional, and null means no key is set.
            // i + "": the record value; the integer i is converted to a string, which is serialized to bytes and sent to the cluster.
            ProducerRecord<String, String> producerRecord = new ProducerRecord<>("test", null, i + "");
            Future<RecordMetadata> future = kafkaProducer.send(producerRecord);

            // Call Future.get() to wait for the response
            future.get();
            System.out.println("Message " + i + " written successfully!");
        }

        // Close the producer
        kafkaProducer.close();
    }
}

Producing messages asynchronously with a callback

If you want to know whether a message was produced successfully, or to perform another action after the message has been written to Kafka, it is convenient to send the message with a callback function.

  • When sending a message fails, the exception information can be printed immediately
  • When the message is sent successfully, print the Kafka topic name, partition id, and offset
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

/**
 * Kafka producer program: creates messages and sends them to the Kafka cluster.
 * 1. Create the Properties configuration for connecting to Kafka
 * 2. Create a KafkaProducer object
 * 3. Call send to publish messages 1-100 to the topic test
 * 4. Handle the result in the callback
 * 5. Close the producer
 */
public class KafkaProducerTest {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Create the Properties configuration for connecting to Kafka
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.16.4.158:9092");
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");

        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"admin\" password=\"admin\";");

        // Enable producer idempotence
        props.put("enable.idempotence", true);

        // Create a KafkaProducer object
        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<>(props);

        // Send messages 1-100 to the specified topic
        for (int i = 0; i < 100; ++i) {
            // 2. Send the message asynchronously with a callback
            ProducerRecord<String, String> producerRecord = new ProducerRecord<>("test", null, i + "");
            // An anonymous inner class implements the Callback interface; Kafka calls onCompletion when the server responds to the client.
            // metadata: the metadata of the message (which topic and partition it belongs to and its offset)
            // exception: wraps any exception raised while producing the message; null means the send succeeded, non-null means an error occurred.
            kafkaProducer.send(producerRecord, new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    // 1. Check whether the send succeeded
                    if (exception == null) {
                        // Send succeeded
                        // Topic
                        String topic = metadata.topic();
                        // Partition id
                        int partition = metadata.partition();
                        // Offset
                        long offset = metadata.offset();
                        System.out.println("topic:" + topic + " partition:" + partition + " offset:" + offset);
                    } else {
                        // Send failed
                        System.out.println("An exception occurred while producing the message!");
                        // Print the exception message
                        System.out.println(exception.getMessage());
                        // Print the stack trace
                        exception.printStackTrace();
                    }
                }
            });
        }

        // 4. Close the producer
        kafkaProducer.close();
    }
}

Consume messages from Kafka topics

From the test topic, consume all the messages and print out each record's offset, key, and value.

  • Create the Kafka consumer configuration
Properties props = new Properties();
// bootstrap.servers: the Kafka cluster address and port that both producers and consumers need in order to connect.
props.setProperty("bootstrap.servers", "node1.itcast.cn:9092");
// group.id: the unique identifier of the consumer group; all consumers with the same group id belong to the same group.
props.setProperty("group.id", "test");
// enable.auto.commit: whether the consumer should commit offsets automatically.
props.setProperty("enable.auto.commit", "true");
// auto.commit.interval.ms: how often, in milliseconds, the consumer automatically commits offsets.
props.setProperty("auto.commit.interval.ms", "1000");
// key.deserializer / value.deserializer: the Java classes used to deserialize Kafka messages. Both keys and values are strings here, so StringDeserializer is used for both.
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
  • Create a Kafka consumer
  • Subscribe to the topic to consume
  • Use a while loop to continuously pull messages from the Kafka topic
  • Print out the offset, key, and value of each record
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

/**
 * Consumer program
 * 1. Create the Kafka consumer configuration
 * 2. Create the Kafka consumer
 * 3. Subscribe to the topic to consume
 * 4. Use a while loop to continuously pull messages from the Kafka topic
 * 5. Print out each record's offset, key, and value
 */
public class KafkaConsumerTest {

    public static void main(String[] args) throws InterruptedException {
        // Create the Kafka consumer configuration
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "172.16.4.158:9092");
        props.setProperty("group.id", "test");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "1000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");

        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"xxxx\" password=\"xxxx\";");

        // Create the Kafka consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);

        // Subscribe to the topic to consume
        // Specify which topic the consumer pulls data from
        kafkaConsumer.subscribe(Arrays.asList("test"));

        // Use a while loop to continuously pull messages from the Kafka topic
        while (true) {
            // The consumer pulls a batch of records at a time
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(5));
            // Print out each record's offset, key, and value
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                // Topic
                String topic = consumerRecord.topic();
                // offset: the position of this message within its Kafka partition
                long offset = consumerRecord.offset();
                // key and value
                String key = consumerRecord.key();
                String value = consumerRecord.value();

                System.out.println("topic: " + topic + " offset:" + offset + " key:" + key + " value:" + value);
            }
        }
    }
}

Key concepts of Kafka

Broker

  A Kafka cluster usually consists of multiple brokers to provide load balancing and fault tolerance. Brokers are stateless; they rely on ZooKeeper to maintain cluster state. A single Kafka broker can handle hundreds of thousands of reads and writes per second, and each broker can store terabytes of messages without a performance impact.

ZooKeeper

  ZooKeeper is used to manage and coordinate the brokers and stores Kafka's metadata (for example, which topics, partitions, and consumers exist). The ZooKeeper service mainly notifies producers and consumers when a new broker joins the Kafka cluster or when a broker in the cluster fails.

Note: Kafka has been working to remove its ZooKeeper dependency, since maintaining two clusters is costly. The community's KIP-500 proposal replaces the ZooKeeper dependency so that Kafka manages its own metadata ("Kafka on Kafka").

Kafka Tool can also display the ZooKeeper configuration:

Producer

The producer is responsible for pushing data to topics on the broker.

Consumer

Consumers are responsible for pulling data from topics on the broker and processing the data themselves.

Consumer group

  A consumer group is a scalable and fault-tolerant consumer mechanism provided by Kafka. A consumer group can contain multiple consumers and has a unique ID (group id). The consumers in a group together consume all partitions of the subscribed topic.

Topic

  A topic is a logical concept: producers publish data to it and consumers pull data from it. Topics in Kafka must have identifiers, and these must be unique. Kafka can hold any number of topics; there is no limit. The messages in a topic are structured; generally a topic contains one category of messages. Once a producer has sent messages to a topic, those messages cannot be updated (changed).
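Besides the kafka-topics.sh script used earlier, topics can also be created from Java with the AdminClient that ships in kafka-clients. A sketch; the topic name, partition count, and replication factor below are only illustrative:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.211.55.8:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "test2" with 2 partitions and 1 replica per partition
            NewTopic topic = new NewTopic("test2", 2, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get(); // block until the broker confirms
        }
    }
}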

Partitions

  In a Kafka cluster, topics are divided into partitions. Messages of the same topic can be assigned to different partitions; the exact assignment rule depends on the partitioner.

  Kafka ships a default partitioner implementation, DefaultPartitioner, which hashes the key of the message (if it exists) and uses the hash value to decide which partition the message should go to. If the message has no key, messages are distributed across partitions in a round-robin manner.

  Besides the default partitioner, users can implement a custom partitioner to meet other needs. A custom partitioner must implement the Partitioner interface provided by Kafka and be specified in the producer configuration.
  Whether the default or a custom partitioner is used, the following rules should hold:

  • Messages with the same key are always assigned to the same partition.
  • Messages without a key are distributed across partitions randomly or in a round-robin manner.

  Note that a change in the number of partitions can also change how messages map to partitions: after the partition count of a topic changes, messages with the same key may land in a different partition than messages written before the change. Therefore, changes in partition count should be handled carefully in producer code to avoid data loss or duplication.
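As a sketch of the custom-partitioner approach described above, the class below implements Kafka's Partitioner interface and is registered through the producer's partitioner.class setting. The class name, package, and routing rule are made up for illustration, and it assumes the target topic has at least two partitions:

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

// Routes messages whose key starts with "vip" to partition 0 and everything else to partition 1.
public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        if (key != null && key.toString().startsWith("vip")) {
            return 0;
        }
        return 1;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

// Register it in the producer configuration (package name is illustrative):
// props.put("partitioner.class", "com.example.VipPartitioner");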

Replicas

  Replicas ensure that data remains available when a server fails. In Kafka, the number of replicas is generally set to be greater than 1.

Offset

  The offset records the position of the next message to be delivered to the consumer. In current Kafka versions, consumers' committed offsets are stored in an internal Kafka topic (__consumer_offsets); older versions stored them in ZooKeeper. Within a partition, messages are stored sequentially and each message has an incrementing id; this id is the offset. Offsets are only meaningful within a partition; across partitions an offset has no meaning.
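When more control over offsets is needed than enable.auto.commit=true provides, a consumer can commit offsets itself after processing each batch. A sketch under the same assumptions as the consumer example above (broker address from this post, illustrative group id, topic test):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "10.211.55.8:9092");
        props.setProperty("group.id", "manual-commit-demo");
        props.setProperty("enable.auto.commit", "false"); // take control of offset commits
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("partition:" + record.partition() + " offset:" + record.offset() + " value:" + record.value());
                }
                consumer.commitSync(); // commit the offsets of the batch just processed
            }
        }
    }
}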

Consumer group demonstration

  Kafka supports multiple consumers consuming data from a topic at the same time. Start two consumers together to consume the data of the test topic.

Modify the producer program so that it keeps producing the numbers 1-100 every 3 seconds:

// Send the numbers 1-100 to Kafka's test topic
while (true) {
    for (int i = 1; i <= 100; ++i) {
        // Note: send is an asynchronous method; it puts the data to be sent into a buffer and returns immediately
        // This makes sending messages more efficient
        producer.send(new ProducerRecord<>("test", i + ""));
    }
    Thread.sleep(3000);
}

Run two consumers at the same time:
You will find that only one of the consumers can pull messages. If you want both consumers to consume messages at the same time, you must add a partition to the test topic.

# Change the test topic to 2 partitions
bin/kafka-topics.sh --zookeeper 10.211.55.8:2181 --alter --partitions 2 --topic test

Re-run the producer and two consumer programs, and you can see that both consumers can consume Kafka Topic data.

Kafka producer idempotency

  Take HTTP as an example: one request or several identical requests return the same result (leaving aside issues such as network timeouts); in other words, performing an operation multiple times has the same effect as performing it once. If a system is not idempotent, a repeatedly submitted form can cause problems; for example, if a user clicks the submit-order button several times in the browser, multiple identical orders are created in the background.

  Kafka producer idempotency: when a producer produces a message and a retry happens, the same message may be sent multiple times. If the producer were not idempotent, the partition could end up storing several identical copies of the message.

// Enable idempotence
props.put("enable.idempotence",true);

Principle of idempotence

In order to realize the idempotence of producers, Kafka introduces the concepts of Producer ID (PID) and Sequence Number.

  • PID: each producer is assigned a unique PID when it is initialized; this PID is transparent to the user.
  • Sequence Number: each message that a producer (identified by its PID) sends to a given topic partition carries a Sequence Number that increases monotonically from 0.
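The following is only a conceptual sketch (not Kafka's actual broker code) of how the PID plus sequence number lets the broker drop duplicates caused by producer retries: it remembers the last sequence number appended for each (PID, partition) pair and rejects anything it has already seen.

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of broker-side de-duplication for idempotent producers.
public class IdempotenceSketch {
    private final Map<String, Long> lastSequence = new HashMap<>(); // key: pid + "-" + partition

    public boolean shouldAppend(long pid, int partition, long sequenceNumber) {
        String key = pid + "-" + partition;
        long last = lastSequence.getOrDefault(key, -1L);
        if (sequenceNumber <= last) {
            return false; // duplicate caused by a producer retry: discard
        }
        lastSequence.put(key, sequenceNumber);
        return true; // new message: append to the partition log
    }

    public static void main(String[] args) {
        IdempotenceSketch broker = new IdempotenceSketch();
        System.out.println(broker.shouldAppend(100L, 0, 0)); // true  - first send
        System.out.println(broker.shouldAppend(100L, 0, 0)); // false - retry of the same message
        System.out.println(broker.shouldAppend(100L, 0, 1)); // true  - next message
    }
}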

Kafka transactions

  Kafka transactions are a feature introduced in Kafka 0.11.0.0 in 2017.
  They are similar to database transactions: a Kafka transaction means that producing messages and committing consumer offsets happen in one atomic operation, so either everything succeeds or everything fails. Transaction protection matters most when a consumer and a producer work together (the consume-transform-produce pattern).

Transaction operation API:

The Producer interface defines the following five transaction-related methods:

  • initTransactions (initialize transactions): must be called first before Kafka transactions can be used.
  • beginTransaction (begin a transaction): starts a Kafka transaction.
  • sendOffsetsToTransaction (commit offsets): sends the offsets of the consumed partitions to the transaction so that they are committed together with it.
  • commitTransaction (commit the transaction): commits the transaction.
  • abortTransaction (abort the transaction): cancels the transaction.

Kafka transaction programming

Transaction-related attribute configuration

Producer:

// Configure the transactional id; enabling transactions automatically enables idempotence
props.put("transactional.id", "first-transactional");

Consumer:

// The consumer must set the isolation level
props.put("isolation.level","read_committed");
// Turn off auto-commit: with transactions, offsets must not be auto-committed (if they were committed automatically, say every second, they would not be controlled by the transaction)
props.put("enable.auto.commit", "false");

Kafka transaction programming case

Requirements: the Kafka topic [ods_user] contains some user data in the following format (name, gender, date of birth):

姓名,性别,出生日期
张三,1,1980-10-09
李四,0,1985-11-01

We need to write a program that converts the user's gender to a word (1 = male, 0 = female) and writes the converted data to the topic [dwd_user].
A transaction is required so that consuming the data, writing it to the topic, and committing the offset either all succeed together or all fail.

Start the producer console program to simulate data:

# Create topics named ods_user and dwd_user
bin/kafka-topics.sh --create --bootstrap-server 10.211.55.8:9092 --topic ods_user
bin/kafka-topics.sh --create --bootstrap-server 10.211.55.8:9092 --topic dwd_user

# Window 1: produce data to ods_user
bin/kafka-console-producer.sh --broker-list 10.211.55.8:9092 --topic ods_user

# Window 2: consume data from dwd_user
bin/kafka-console-consumer.sh --bootstrap-server 10.211.55.8:9092 --topic dwd_user --from-beginning  --isolation-level read_committed

Create the consumer code:
The createConsumer method returns a consumer subscribed to the [ods_user] topic. Note: the transaction isolation level must be configured and auto-commit turned off.

    // Create the consumer
    public static Consumer<String, String> createConsumer() {
        // 1. Create the Kafka consumer configuration
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "10.211.55.8:9092");
        props.setProperty("group.id", "ods_user");
        props.put("isolation.level","read_committed");
        props.setProperty("enable.auto.commit", "false");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // 2. Create the Kafka consumer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // 3. Subscribe to the topic to consume
        consumer.subscribe(Arrays.asList("ods_user"));

        return consumer;
    }

Write the code that creates the producer:
The createProducer method returns a producer object. Note: the transactional id must be configured; enabling transactions enables idempotence by default.

Note: when using transactions, do not use asynchronous sending.

    // Create the producer
    public static Producer<String, String> createProducer() {
        // 1. Create the producer configuration
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.211.55.8:9092");
        props.put("transactional.id", "dwd_user");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // 2. Create the producer
        Producer<String, String> producer = new KafkaProducer<>(props);
        return producer;
    }

Write code to consume and produce data:

Steps:

  • Call the previously implemented method to create consumer and producer objects.
  • The producer calls initTransactions to initialize the transaction.
  • Write a while loop, continuously pull data in the while loop, process it, and then write it to the specified topic.

In the while loop:

  • The producer starts the transaction
  • Consumer pulls messages
  • Traverse the fetched messages and perform preprocessing (convert 1 to male and 0 to female)
  • Produce messages to topic [dwd_user]
  • Commit the offset into the transaction
  • commit transaction
  • Catch the exception, and cancel the transaction if an exception occurs
    public static void main(String[] args) {
        // Call the methods implemented above to create the consumer and producer objects
        Consumer<String, String> consumer = createConsumer();
        Producer<String, String> producer = createProducer();
        // Initialize transactions
        producer.initTransactions();

        // Continuously pull data in a while loop, process it, and write it to the specified topic
        while (true) {
            try {
                // 1. Begin the transaction
                producer.beginTransaction();
                // 2. Define a Map to hold the offset for each partition
                Map<TopicPartition, OffsetAndMetadata> offsetCommits = new HashMap<>();
                // 2. Pull messages
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));

                for (ConsumerRecord<String, String> record : records) {
                    // 3. Save the offset
                    // Store the offset of the current message's partition in the HashMap, adding 1 so that the next consumption starts after this message.
                    offsetCommits.put(new TopicPartition(record.topic(), record.partition()), new OffsetAndMetadata(record.offset() + 1));

                    // 4. Perform the conversion
                    String[] fields = record.value().split(",");
                    fields[1] = fields[1].equalsIgnoreCase("1") ? "男" : "女";
                    String message = fields[0] + "," + fields[1] + "," + fields[2];

                    // 5. Produce the message to dwd_user
                    producer.send(new ProducerRecord<>("dwd_user", message));
                }

                // 6. Commit the offsets to the transaction
                producer.sendOffsetsToTransaction(offsetCommits, "ods_user");
                // 7. Commit the transaction
                producer.commitTransaction();
            }
            catch (Exception e) {
                // 8. Abort the transaction
                producer.abortTransaction();
            }
        }
    }

  Committing the offsets of the consumed messages within the producer's transaction ensures that those offsets are recorded and saved in the same transaction as the messages the producer sends to the new topic.
  If the offsets were not committed, the already-consumed messages could be consumed again the next time the consumer starts.
  Committing the offsets to the producer's transaction is therefore essential so that, on its next start, the consumer continues correctly from where it last stopped.

Test:
Successful conversion and consumption:
Simulate an exception to test the transaction:

// 3. Save the offset
offsetCommits.put(new TopicPartition(record.topic(), record.partition()),new OffsetAndMetadata(record.offset() + 1));

// 4. Perform the conversion
String[] fields = record.value().split(",");
fields[1] = fields[1].equalsIgnoreCase("1") ? "男":"女";
String message = fields[0] + "," + fields[1] + "," + fields[2];

// Simulate an exception
int i = 1/0;

// 5. Produce the message to dwd_user
producer.send(new ProducerRecord<>("dwd_user", message));

Start the program: an exception is thrown. Start it again: the exception is thrown again, and this repeats until the exception is dealt with.

The messages can still be consumed, but when an exception occurs partway through, the offset is not committed; the transaction is only committed when both consuming and producing the messages succeed.


Source: blog.csdn.net/qq_44033208/article/details/131917719