Introduction to Kafka

Foreword

This is an introductory article: it mainly covers the concepts behind Kafka, along with some basic operations and usage.

Introduction

The official documentation gives a fairly complete introduction to Kafka: Kafka Chinese Documentation - ApacheCN

Kafka is a distributed streaming platform.

As a streaming platform, it has three key capabilities:

  1. It lets you publish and subscribe to streams of records; in this respect it is similar to a message queue or enterprise messaging system.
  2. It stores streams of records durably and with fault tolerance.
  3. It lets you process streams of records as they occur.

It can be used in two broad categories of applications:

  1. Build real-time streaming data pipelines that reliably move data between systems or applications (equivalent to a message queue).
  2. Build real-time streaming applications that transform or react to streams of data (that is, stream processing; with Kafka Streams, data is transformed internally from topic to topic).

Concept and Description

Kafka runs as a cluster on one or more servers.

Kafka organizes the data it stores by topic.

Each record contains a key, a value, and a timestamp.

The structure of Kafka is as follows:

structure

Kafka producers and consumers are connected through the TCP protocol.

producer: the producer is our application (or any other program that sends data); it is responsible for sending messages to a partition of a topic.

consumer: the consumer, represented by a consumer group; a consumer group can be regarded as a single logical consumer. The consumer subscribes to a topic, Kafka delivers the topic's messages to the consumer group, and the group load-balances the messages evenly across all consumer instances in the group.

topic: the data topic, i.e. where data records are stored; different kinds of business data can be separated by topic. A topic can have one or more subscribing consumers.

partition: each topic has one or more partitions; the data received by a topic is split across its partitions according to the configured partitioning rules. When consuming from a topic, a consumer may or may not specify a partition. If no partition is specified, Kafka distributes the data of all partitions to the consumer in a round-robin fashion; if a partition is specified, only that partition's messages are delivered. The important point is that messages within the same partition are ordered.

(Official description: for each topic, the Kafka cluster maintains a partitioned log. The data within each partition is ordered and is continuously appended to a commit log file. Each record in a partition is assigned a sequential ID number that marks its order, called the offset; consumers can use this offset to read the log starting from any specified position.

The storage capacity of a single partition is limited by the host's file system, but a topic can have many partitions. In a cluster, a topic's partitions are also replicated to other nodes to maintain fault tolerance.)
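To make the offset concrete, here is a minimal Java sketch (the broker address, topic, partition number and starting offset are assumptions) that uses the kafka-clients consumer to read one partition from a chosen offset:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OffsetReadSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Read only partition 0 of test_topic, starting at offset 5 (both values are arbitrary here)
            TopicPartition partition = new TopicPartition("test_topic", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seek(partition, 5L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("offset=" + record.offset() + ", value=" + record.value());
            }
        }
    }
}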

replication: replicas. Each partition has one leader and zero or more followers; only the leader accepts writes, while the followers merely keep backups. When the leader goes down, a new leader is elected from the remaining followers.

Note: when creating a topic, the number of replicas cannot be greater than the number of brokers (i.e. Kafka nodes), because replicas are synchronized to other nodes; if it exceeds the number of nodes, the topic cannot be created.

broker: a Kafka server instance.

Cluster: within a cluster there is always one node acting as the leader, and zero or more follower nodes. Only the leader handles read and write requests; the other nodes passively synchronize data from the leader. If the leader node goes down, one of the followers is elected as the new leader.

Message system: compared with other message systems (which offer two modes, queue and publish-subscribe), in Kafka's queue mode the consumer is the consumer group. In a plain queue (as in Redis) there can be only one consumer and a message is gone once consumed. In publish-subscribe mode, Kafka, like other middleware, supports multiple subscribers; Kafka subscriptions are per topic, and the number of consumer instances in a consumer group cannot exceed the number of partitions.
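To make the two modes concrete, here is a minimal Java sketch (the broker address, topic and group names are assumptions): consumers that share a group.id split the topic's partitions between them (queue behaviour), while a consumer with a different group.id independently receives the whole stream (publish-subscribe behaviour).

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

public class GroupModesSketch {
    // Build a consumer subscribed to test_topic in the given consumer group.
    static KafkaConsumer<String, String> consumerIn(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("test_topic"));
        return consumer;
    }

    public static void main(String[] args) {
        // Queue behaviour: these two consumers share group "group-a", so the topic's
        // partitions are divided between them and each record goes to only one of the two.
        KafkaConsumer<String, String> a1 = consumerIn("group-a");
        KafkaConsumer<String, String> a2 = consumerIn("group-a");

        // Publish-subscribe behaviour: "group-b" is a separate group,
        // so this consumer independently receives every record as well.
        KafkaConsumer<String, String> b1 = consumerIn("group-b");
        // Poll a1, a2 and b1 in their own threads, as in the consumer example later in this article.
    }
}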

Stream processing: Kafka also provides stream processing. It continuously reads data from topics, processes it (aggregation, joins and other complex operations are possible), and writes the results back to topics.
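For illustration, here is a minimal Kafka Streams sketch (it assumes the separate kafka-streams dependency, which is not in the dependency list later in this article, plus hypothetical topics named input-topic and output-topic) that reads one topic, transforms the values, and writes to another:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Continuously read records from input-topic, transform each value,
        // and write the result to output-topic.
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}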

Installation

Official address: Kafka Chinese Documentation-ApacheCN

Use the official download address; the download is quite fast.

Startup

Method one

Kafka uses ZooKeeper for cluster management, so a ZooKeeper server is required. Kafka ships with a bundled ZooKeeper that can be used directly.

./bin/zookeeper-server-start.sh -daemon config/zookeeper.properties

This is slightly different from a standalone ZooKeeper installation: Kafka provides a startup script, whereas standalone ZooKeeper is controlled from its own command-line tool.

Start Kafka

./bin/kafka-server-start.sh -daemon config/server.properties

Method two

Instead of the ZooKeeper bundled with Kafka, use the ZooKeeper cluster we built earlier.

vi config/server.properties

/zookeeper

#Modify the configuration to the zookeeper cluster address

zookeeper.connect=localhost:2181,localhost:2182,localhost:2183

Start the servers

./bin/zkServer.sh start zoo.cfg

./bin/zkServer.sh start zoo2.cfg

./bin/zkServer.sh start zoo3.cfg

./bin/kafka-server-start.sh config/server.properties

Configuration

Description of configuration items: Kafka Chinese Documentation - ApacheCN

Command operations

Kafka ships with many command-line tools; we use them to create topics, send messages, and consume messages.


Create a topic

./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic test_topic --partitions 3 --replication-factor 3

--zookeeper: the ZooKeeper address to connect to

--partitions: the number of partitions for the topic

--replication-factor: the replication factor, i.e. how many copies of the data are written; this value must be <= the number of brokers (on a single stand-alone broker it can only be 1; the command above assumes a 3-broker cluster)
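If you would rather create the topic from Java than from the shell, the AdminClient in kafka-clients offers the equivalent operation; a minimal sketch, assuming the brokers listen on 192.168.17.128:9092 as elsewhere in this article:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // test_topic with 3 partitions and a replication factor of 3,
            // matching the shell command above
            NewTopic topic = new NewTopic("test_topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}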

View topic list

./bin/kafka-topics.sh --list --zookeeper localhost:2181

Send a message (start a producer)

#Long connection

./bin/kafka-console-producer.sh --broker-list 192.168.17.128:9092 --topic test_topic

If the consumer client receives the message, the send has succeeded.

Consume messages (start a consumer)

#Long connection

./bin/kafka-console-consumer.sh --bootstrap-server 192.168.17.128:9092 --topic test_topic

Send a short message from the producer client; if it is received here, everything is working.

Query topic information

./bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test_topic

(screenshot: --describe output for test_topic, listing Partition, Leader, Replicas and Isr)

Although these concepts were explained above, they are worth revisiting against this output.

**topic:** the topic name; because 3 partitions with a replication factor of 3 were specified, the output contains 3 partition rows;

**Partition:** the partition number. Each topic has one or more partitions, and each partition stores a different portion of the data (the topic's data is distributed across its partitions).

**Leader and Replicas:** replica information. Each partition's data is copied to the specified number of replicas to provide fault tolerance. Each partition has one leader and zero or more followers; only the leader handles writes, and the followers only keep backups. When a partition's leader goes down, the remaining followers elect a new leader, as shown in the output above (Replicas: 1,2,3).


If leader 1 goes offline, the remaining replicas 2 and 3 will elect one of themselves as the new leader.

So Leader here is the node holding the partition's leader replica, which is responsible for all reads and writes, while Replicas lists the nodes where the partition's copies are located.

Now kill one of the Kafka nodes:


Node 1 was successfully removed and a new leader was elected.


Note:

  1. As mentioned before, the number of replicas cannot exceed the number of Kafka nodes when a topic is created. Here the node count was sufficient at creation time; during operation, however, an unavoidable failure caused the cluster to actively drop the failed node.
  2. This command shows topic information and also serves as Kafka's way of checking cluster status. Kafka has no equivalent of ZooKeeper's zkServer.sh status command, so this command can be used to check whether any partition has gone offline (a Java alternative using the AdminClient is sketched below this list).
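Here is the Java alternative mentioned in point 2: a minimal sketch using the kafka-clients AdminClient (the broker address is an assumption), printing each partition's leader, replicas and in-sync replicas, which is enough to spot an offline partition:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("test_topic"))
                    .all().get().get("test_topic");
            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.println("partition " + partition.partition()
                        + " leader=" + partition.leader()
                        + " replicas=" + partition.replicas()
                        + " isr=" + partition.isr());
            }
        }
    }
}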

Delete a topic

This requires delete.topic.enable=true to be set in server.properties; otherwise the topic is only marked for deletion and is not actually removed.

./bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic test_topic


Cluster

The Kafka official website provides a configuration example that we can base our configuration on. Only three settings differ between the brokers in a cluster; because I am deploying all brokers on one server, they have to be distinguished:

  • log.dir
  • broker.id
  • listeners (note: configure the IP address here)

Simple server.properties configuration example

# ZooKeeper connection addresses
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183
# Default number of partitions per topic
num.partitions=8
# Default replication factor for automatically created topics
default.replication.factor=3
# Whether to create topics automatically
auto.create.topics.enable=false
# Minimum number of in-sync replicas for a write to succeed; if the written replicas fall short of this value, an error is raised
min.insync.replicas=2
delete.topic.enable=true

# Cluster-specific settings
# Log (data) directory
log.dir=./log1
# Broker ID
broker.id=1
# Listener address
listeners=PLAINTEXT://192.168.17.128:9092
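For example, a second broker on the same machine would only change those three settings; a sketch (the port number is an assumption, any free port works):

# Overrides for the second broker on the same machine
log.dir=./log2
broker.id=2
listeners=PLAINTEXT://192.168.17.128:9093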

To check whether the cluster is healthy:

./bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test_topic

Shutdown

Shutting down cleanly is different from crashing. During a normal (controlled) shutdown, for example via bin/kafka-server-stop.sh, Kafka will (see also the setting sketched after this list):

  1. Sync all of its logs to disk, which avoids the log-recovery step on restart and therefore speeds up startup;
  2. Migrate the partitions for which it is the leader to other replicas before stopping, which makes the leader switchover faster.
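As far as I know this behaviour is governed by a broker setting that is already enabled by default; shown here only as a sketch of what it would look like in server.properties:

# Enable controlled (clean) shutdown; the default is true
controlled.shutdown.enable=true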

Usage

  1. First make sure the server firewall port is open

    firewall-cmd --add-port=9092/tcp --permanent
    
  2. Add the required dependencies (the Kafka client, Jackson, and SLF4J/Logback for logging):

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.13.1</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.33</version>
    </dependency>
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-core</artifactId>
        <version>1.2.10</version>
    </dependency>
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.2.10</version>
    </dependency>
    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-access</artifactId>
        <version>1.2.3</version>
    </dependency>
    

    logback.xml

    <!-- Levels from high to low: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, ALL -->
    <!-- Output rule: based on the current ROOT level, a message is output when its level is at or above the root level -->
    <!-- Each appender below uses a filter so that its file only records messages of its own level, rather than also containing lower-level messages -->
    <!-- scan: when true, the configuration file is reloaded if it changes; default is true -->
    <!-- scanPeriod: interval for checking whether the configuration file has changed; if no time unit is given the default unit is milliseconds; effective only when scan is true; default interval is 1 minute -->
    <!-- debug: when true, logback's internal status messages are printed so its state can be observed; default is false -->
    <configuration scan="true" scanPeriod="60 seconds" debug="false">
        <!-- Dynamic log level via JMX -->
        <jmxConfigurator />
        <!-- Log file output location -->
        <!-- <property name="log_dir" value="C:/test" />-->
        <property name="log_dir" value="./logs" />
        <!-- Keep at most 30 days of log history -->
        <property name="maxHistory" value="30" />

        <!-- ConsoleAppender: log to the console -->
        <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
            <encoder>
                <pattern>
                    <!-- Log output pattern -->
                    %d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n
                </pattern>
            </encoder>
        </appender>

        <!-- ERROR-level log -->
        <!-- RollingFileAppender: writes to the given file and rolls the log over to another file when a condition is met -->
        <appender name="ERROR" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <!-- Filter: record only ERROR-level messages -->
            <!-- If the log level equals the configured level, the filter accepts or denies it according to onMatch and onMismatch -->
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <!-- Filter level -->
                <level>ERROR</level>
                <!-- Action when the filter condition matches -->
                <onMatch>ACCEPT</onMatch>
                <!-- Action when the filter condition does not match -->
                <onMismatch>DENY</onMismatch>
            </filter>
            <!-- The most common rolling policy: rolls over based on time, and both triggers and performs the rollover -->
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <!-- Output location; relative and absolute paths are allowed -->
                <fileNamePattern>
                    ${log_dir}/error/%d{yyyy-MM-dd}/error-log.log
                </fileNamePattern>
                <!-- Optional: limits the number of archived files to keep; older files beyond the limit are deleted. For example, with monthly rollover and maxHistory set to 6, only the last 6 months are kept. Note that directories created for archiving are deleted along with the old files -->
                <maxHistory>${maxHistory}</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>
                    <!-- Log output pattern -->
                    %d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n
                </pattern>
            </encoder>
        </appender>

        <!-- WARN-level log appender -->
        <appender name="WARN" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <!-- Filter: record only WARN-level messages -->
            <!-- If the log level equals the configured level, the filter accepts or denies it according to onMatch and onMismatch -->
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <!-- Filter level -->
                <level>WARN</level>
                <!-- Action when the filter condition matches -->
                <onMatch>ACCEPT</onMatch>
                <!-- Action when the filter condition does not match -->
                <onMismatch>DENY</onMismatch>
            </filter>
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <!-- Output location; relative and absolute paths are allowed -->
                <fileNamePattern>${log_dir}/warn/%d{yyyy-MM-dd}/warn-log.log</fileNamePattern>
                <maxHistory>${maxHistory}</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n</pattern>
            </encoder>
        </appender>

        <!-- INFO-level log appender -->
        <appender name="INFO" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <level>INFO</level>
                <onMatch>ACCEPT</onMatch>
                <onMismatch>DENY</onMismatch>
            </filter>
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <fileNamePattern>${log_dir}/info/%d{yyyy-MM-dd}/info-log.log</fileNamePattern>
                <maxHistory>${maxHistory}</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n</pattern>
            </encoder>
        </appender>

        <!-- DEBUG-level log appender -->
        <appender name="DEBUG" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <level>DEBUG</level>
                <onMatch>ACCEPT</onMatch>
                <onMismatch>DENY</onMismatch>
            </filter>
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <fileNamePattern>${log_dir}/debug/%d{yyyy-MM-dd}/debug-log.log</fileNamePattern>
                <maxHistory>${maxHistory}</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n</pattern>
            </encoder>
        </appender>

        <!-- TRACE-level log appender -->
        <appender name="TRACE" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <level>TRACE</level>
                <onMatch>ACCEPT</onMatch>
                <onMismatch>DENY</onMismatch>
            </filter>
            <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
                <fileNamePattern>${log_dir}/trace/%d{yyyy-MM-dd}/trace-log.log</fileNamePattern>
                <maxHistory>${maxHistory}</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger - %msg%n</pattern>
            </encoder>
        </appender>

        <!-- root logger -->
        <root>
            <!-- Log at info level and above -->
            <level value="info" />
            <!-- Console output -->
            <appender-ref ref="console" />
            <!-- File output -->
            <appender-ref ref="ERROR" />
            <appender-ref ref="INFO" />
            <appender-ref ref="WARN" />
            <appender-ref ref="DEBUG" />
            <appender-ref ref="TRACE" />
        </root>
    </configuration>
    
    

Errors encountered

  1. Connection error:
org.apache.kafka.common.network.Selector - [Consumer clientId=con, groupId=con-group] Connection with /192.168.17.128 disconnected
java.net.ConnectException: Connection refused: no further information

org.apache.kafka.clients.NetworkClient - [Consumer clientId=con, groupId=con-group] Connection to node -1 (/192.168.17.128:9092) could not be established. Broker may not be available.

The most likely cause is that the listeners configuration does not include the IP address; try changing that first. In my case the IP was indeed missing; after changing it to listeners=PLAINTEXT://192.168.17.128:9092, the consumer's error disappeared.

  2. Next is the producer's error:

Error while fetching metadata with correlation id 53 : {test_topic=UNKNOWN_TOPIC_OR_PARTITION}

It looks like an unknown-topic error. I had thought the official default for automatic topic creation was true.

(screenshot: the auto.create.topics.enable entry in the official configuration documentation)

In the final configuration I added one more line, auto.create.topics.enable=true. The complete configuration is as follows:

# ZooKeeper connection addresses
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183
# Default number of partitions per topic
num.partitions=2
# Default replication factor for automatically created topics
default.replication.factor=3
# Minimum number of in-sync replicas for a write to succeed; if the written replicas fall short of this value, an error is raised
min.insync.replicas=2
# Create topics automatically
auto.create.topics.enable=true
delete.topic.enable=true

# Cluster-specific settings
# Log (data) directory
log.dir=./log1
# Broker ID
broker.id=1
# Listener address
listeners=PLAINTEXT://192.168.17.128:9092

After that, everything works normally.

Finally, a one-click startup script:

/data/kafka/kafka_2.11-1.0.0/bin/kafka-server-start.sh -daemon /data/kafka/kafka_2.11-1.0.0/config/server.properties

/data/kafka/kafka_2.11-1.0.0/bin/kafka-server-start.sh -daemon /data/kafka/kafka_2.11-1.0.0/config/server2.properties

/data/kafka/kafka_2.11-1.0.0/bin/kafka-server-start.sh -daemon /data/kafka/kafka_2.11-1.0.0/config/server3.properties

sleep 3

echo 'Kafka startup complete'

Java connection example

Consumer code:

Several configurations are required; they correspond to the parameters of the shell commands above:

bootstrap.servers - Kafka server address

client.id - client id

key.deserializer - key deserializer configuration

value.deserializer - value deserialization configuration

group.id - consumer group id

auto.offset.reset - offset reset policy

    public static void consumer() {
        Properties properties = new Properties();
        // Required parameters
        // Kafka server address
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, IP);
        // Client (consumer) ID
        properties.put(ConsumerConfig.CLIENT_ID_CONFIG, "con");
        // Key deserializer
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Value deserializer
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group ID; to re-read everything, switch groupId to a new ID and enable auto-commit
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "con-group");
        // Offset reset policy:
        // earliest: if the partition has a committed offset, consume from it; otherwise consume from the beginning
        // latest: if the partition has a committed offset, consume from it; otherwise consume only newly produced records
        // none: if the partition has a committed offset, consume from it; if any partition has no committed offset, throw an exception
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        // Session timeout
        properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 6000);
        // Heartbeat interval
        properties.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "2000");
        // Auto-commit interval
        properties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        // Maximum number of records per poll
        properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2);
        // To consume repeatedly, set auto-commit to false so every pull starts again from the previously committed offset
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        new Thread(() -> {
            while (true) {
                try (KafkaConsumer<String, Object> con = new KafkaConsumer<>(properties)) {
                    con.subscribe(Collections.singleton(TOPIC));
                    ConsumerRecords<String, Object> poll = con.poll(Duration.ofSeconds(5000));
                    for (ConsumerRecord<String, Object> record : poll) {
                        System.out.println("topic:" + record.topic() + ",offset:" + record.offset() + ",data:" + record.value().toString());
                    }
                }
            }
        }).start();
    }

Producer code:

The producer likewise has required parameters:

bootstrap.servers

client.id

key.serializer - key serializer configuration

value.serializer - value serializer configuration

    public static void producer() {
        Properties properties = new Properties();
        // Required parameters
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, IP);
        properties.put(ProducerConfig.CLIENT_ID_CONFIG, "pro");
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> pro = new KafkaProducer<>(properties);

        new Thread(() -> {
            while (true) {
                try {
                    TimeUnit.SECONDS.sleep(3);
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                Future<RecordMetadata> result = pro.send(new ProducerRecord<>(TOPIC, "1", "dddd"), (meta, error) -> {
                    if (error != null) {
                        error.printStackTrace();
                        return;
                    }
                    System.out.println("Send callback: " + meta.toString());
                });
                System.out.println("Send result: " + result);
            }
        }).start();
    }

Remarks:

  1. Kafka automatically delivers large volumes of data to the consumer group in batches; in addition, max.poll.records (ConsumerConfig.MAX_POLL_RECORDS_CONFIG) can be set on the client to control how many records are pulled per poll;

    Because consumption goes through the consumer group, multiple clients can use the same groupId to subscribe to the same topic;

  2. If messages must be consumed in order, the producer must also make sure they are sent to a single partition; Kafka's partitioning mechanism can then be used for ordered consumption (see the producer sketch after this list);

  3. If you need to pull all the data again, switch to a new groupId, set the batch size, and set auto-commit to true:

        // Consumer group ID; to re-read everything, switch groupId to a new ID and enable auto-commit
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "con-group");
        // Maximum number of records per poll
        properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);
        // Enable auto-commit so consumed offsets are committed automatically
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
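Here is the producer sketch referred to in point 2: a minimal example (topic name, key and partition number are assumptions) showing the two ways to keep related messages on one partition so they stay ordered:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderedSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.17.128:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Option 1: use the same key; the default partitioner hashes the key,
            // so all records with key "order-42" land on the same partition.
            producer.send(new ProducerRecord<>("test_topic", "order-42", "created"));
            producer.send(new ProducerRecord<>("test_topic", "order-42", "paid"));

            // Option 2: pin the partition explicitly (partition 0 here).
            producer.send(new ProducerRecord<>("test_topic", 0, "order-42", "shipped"));
        }
    }
}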


Origin blog.csdn.net/qq_28911061/article/details/128985197