Kafka: principles, cluster setup, and usage

In a recent project, we needed to log users' operations on the business system for later audit and verification. The project uses Kafka together with ELK to collect, transport, store, analyze, and display user operation logs: business services send log messages to Kafka, Logstash consumes the data from Kafka and forwards it to Elasticsearch, and the logs are finally queried, viewed, and analyzed through the Kibana interface. Building on hands-on experience with Kafka clusters, and drawing on many blog posts and the official documentation, this article digs deeper into Kafka, both as a reference for further study and practice and to help newcomers avoid unnecessary pitfalls. Mistakes and omissions are inevitable; discussion and corrections in the comments are very welcome!

1. Introduction to Kafka

Kafka is a distributed stream processing platform and a distributed message engine (message middleware). Kafka transports messages between systems in a publish/subscribe fashion. On top of the messaging capability, Kafka Connect and Kafka Streams add data integration with other systems such as Elasticsearch and Hadoop. The message engine remains Kafka's core function, and most application scenarios build on it: system decoupling, peak shaving, buffering, and asynchronous communication.

Advantages of Kafka

   High throughput, low latency: Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds;
   Scalability: a Kafka cluster supports hot expansion;
   Durability and reliability: messages are persisted to local disk, and data replication protects against data loss;
   Fault tolerance: node failures within the cluster are tolerated (with a replication factor of n, up to n-1 nodes may fail);
   High concurrency: thousands of clients can read and write simultaneously.

Kafka is suitable for the following application scenarios:

   Log collection: a company can use Kafka to collect logs from all kinds of services and expose them to various consumers through a unified interface;
   Messaging: decoupling producers and consumers, buffering messages, and so on;
   User activity tracking: Kafka is often used to record the activities of web or app users, such as page views, searches, and clicks;
   				servers publish these events to Kafka topics, and consumers subscribe to the topics for real-time monitoring and analysis, or persist them to a database;
   Operational metrics: Kafka is also frequently used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feedback such as alerts and reports;
   Stream processing: for example with Spark Streaming or Storm.

2. Kafka principles

2.0 Topology

[Figure: Kafka cluster topology]

2.1 Related concepts

1. Producer:
	The producer of messages; a service that publishes messages to the Kafka cluster.
2. Broker:
	A server in the Kafka cluster.
3. Kafka cluster:
	A cluster composed of multiple brokers.
4. Topic:
	A logical message subject, which can be understood as a category of messages; every message published to the Kafka cluster belongs to a topic, i.e. Kafka is topic-oriented.
5. Partition:
	A partition is a physical concept: a division of a topic. Each topic can have multiple partitions, which spread the load and increase Kafka's throughput.
	Data is not duplicated across the partitions of a topic; on disk, each partition corresponds to a directory. (See the producer sketch after this list for how messages are assigned to partitions.)
6. Replication:
	A replica of a partition. Each partition has multiple replicas to keep it highly available.
	When the leader replica of a partition fails, one of the followers is elected as the new leader.
	The number of replicas cannot exceed the number of brokers:
	the leader and its followers are always on different machines, and a given broker stores at most one replica of any partition.
7. Message:
	The body of each message that is sent.
8. Consumer:
	A service that consumes messages from the Kafka cluster, i.e. the message consumer — the exit point for messages.
9. Consumer Group:
	Multiple consumers can be organized into a consumer group. By design, a given partition's data can be consumed by only one consumer within a group.
	Consumers in the same group can consume different partitions of the same topic in parallel, which also improves Kafka's throughput!
10. Zookeeper:
	The Kafka cluster relies on ZooKeeper to store cluster metadata and to ensure the availability of the system.
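
To make the topic/partition/key relationship concrete, here is a minimal producer sketch using the plain Java client. The broker addresses are the ones used later in this article; the topic name operation-log is hypothetical. Records with the same key always land in the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.65.150:9092,192.168.65.150:9093,192.168.65.150:9094");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are hashed to the same partition,
            // so ordering is preserved per key.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("operation-log", "user-42", "login");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}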

2.2 Kafka message storage

[Figure: anatomy of a topic — each partition is an ordered, append-only log with per-partition offsets]

A topic is the category name under which messages are published. For each topic, the Kafka cluster maintains a partitioned log.
As shown above, each partition of a topic is an ordered, immutable message queue to which messages are continuously appended.
Every message in a partition is assigned a sequence number called the offset, which is unique within that partition.
The Kafka cluster retains all messages until they expire, regardless of whether they have been consumed.
In fact, the only metadata a consumer holds is this offset — its position in the log.
Under normal circumstances the offset increases linearly as the consumer consumes messages,
but the offset is actually controlled by the consumer, which can reset it in order to re-read messages.
This design gives consumers great freedom: one consumer's actions do not affect how other consumers process the same log.
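
As a sketch of this "the consumer controls the offset" idea (assuming the same cluster addresses and the hypothetical operation-log topic), the plain Java consumer below rewinds its partitions to the beginning and re-reads them:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RewindDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.65.150:9092");
        props.put("group.id", "rewind-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("operation-log"));
            consumer.poll(Duration.ofSeconds(1));            // crude: join the group and get assignments
            consumer.seekToBeginning(consumer.assignment()); // reset the offset: re-read everything
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}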

2.3 Kafka producer delivery semantics

In general there are three delivery semantics:

  1. At most once — messages may be lost, but are never retransmitted;
  2. At least once — messages are never lost, but may be retransmitted;
  3. Exactly once — each message is delivered once and only once.

When a producer sends a message to a broker, once the message has been committed it will not be lost, thanks to replication. However, if the producer hits a network problem after sending the data and communication is interrupted, it cannot tell whether the message was committed. Kafka cannot determine what happened during the failure, but the producer can attach something like a primary key to each message and retry idempotently on failure, which is how exactly-once delivery can be achieved; since Kafka 0.11, this is supported natively through the idempotent producer (enable.idempotence=true) and transactions. By default, a message is delivered at least once from the producer to the broker; at-most-once behavior can be obtained by not waiting for acknowledgements and disabling retries.
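
As a sketch (settings only, not a definitive recipe), these kafka-clients producer properties map onto the three semantics:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliverySemantics {
    // At most once: fire-and-forget, no acks, no retries.
    static Properties atMostOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        p.put(ProducerConfig.RETRIES_CONFIG, 0);
        return p;
    }

    // At least once (the default behavior): wait for acks and retry on failure;
    // a retry after a lost ack can duplicate the message.
    static Properties atLeastOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return p;
    }

    // Exactly once per partition (Kafka >= 0.11): the idempotent producer
    // de-duplicates retries on the broker side.
    static Properties exactlyOnce() {
        Properties p = new Properties();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        p.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence
        return p;
    }
}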

3. Kafka cluster setup

3.0 Installation environment

OS:			ubuntu-20.04.2-live-server-amd64
JDK:			jdk-8u271-linux-x64.tar.gz
ZooKeeper:	apache-zookeeper-3.5.9-bin.tar.gz
Kafka:		kafka_2.12-2.8.0.tgz

3.1 Deploy the ZooKeeper cluster

1. Download https://www.apache.org/dyn/closer.lua/zookeeper/zookeeper-3.5.9/apache-zookeeper-3.5.9-bin.tar.gz
 	Note: use apache-zookeeper-3.5.9-bin.tar.gz, the package with "bin" in its name
2. Unpack it, then copy the sample configuration file
	tar -zvxf apache-zookeeper-3.5.9-bin.tar.gz
	cp conf/zoo_sample.cfg conf/zoo.cfg
3. Create a data directory and a myid file inside it; the file content is 1, 2, 3, etc., matching the id of the server.id entry in this node's configuration file
4. Adjust zoo.cfg
	# must be the same directory that holds the myid file from step 3
	# the directory where the snapshot is stored.
	dataDir=/home/r-0/kafka/zookeeper-cluster/zookeeper02/data
	# the port at which the clients will connect
	# (all three nodes run on the same host here, so each node needs its own clientPort: 2181, 2182, 2183)
	clientPort=2181
	# cluster node configuration
	server.1=192.168.65.150:2881:3881
	server.2=192.168.65.150:2882:3882
	server.3=192.168.65.150:2883:3883
5. Start the nodes: enter each node's bin directory in turn and run
	./zkServer.sh start
	Check the node status with
	./zkServer.sh status
	(a quick connectivity check from Java is sketched below)
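
To sanity-check the ensemble from code, here is a minimal sketch using the ZooKeeper Java client (org.apache.zookeeper:zookeeper), assuming the client ports 2181–2183 configured above:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ZkCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the whole ensemble; the client picks a live server.
        ZooKeeper zk = new ZooKeeper(
                "192.168.65.150:2181,192.168.65.150:2182,192.168.65.150:2183",
                3000, event -> System.out.println("event: " + event));
        Thread.sleep(1000); // crude wait for the session to establish
        List<String> children = zk.getChildren("/", false);
        // After Kafka starts, znodes such as brokers and controller appear here.
        System.out.println("znodes under /: " + children);
        zk.close();
    }
}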

3.2 Deploy the Kafka cluster

1. Download the package
https://www.apache.org/dyn/closer.cgi?path=/kafka/2.8.0/kafka_2.12-2.8.0.tgz

2. Unpack it
tar -zvxf kafka_2.12-2.8.0.tgz

3. Adjust config/server.properties
broker.id=0
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://192.168.65.150:9092
log.dirs=/home/r-0/kafka/kafka-cluster/kafka01/log
num.partitions=2
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=192.168.65.150:2181,192.168.65.150:2182,192.168.65.150:2183
group.initial.rebalance.delay.ms=0

4. Copy the installation to the other nodes; each broker then needs its own values for the following parameters (a unique broker.id, its own advertised listener port, and its own log directory):
broker.id=0
advertised.listeners=PLAINTEXT://192.168.65.150:9092
log.dirs=/home/r-0/kafka/kafka-cluster/kafka01/log

5. Start Kafka: in the bin directory of each node, run (a cluster verification sketch follows below)
 ./kafka-server-start.sh -daemon ../config/server.properties
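
Once all three brokers are up, the cluster can be verified and a test topic created from Java with the AdminClient; this is a sketch, and the topic name cluster-test is arbitrary:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "192.168.65.150:9092,192.168.65.150:9093,192.168.65.150:9094");
        try (AdminClient admin = AdminClient.create(props)) {
            // All three brokers should show up here.
            System.out.println("nodes: " + admin.describeCluster().nodes().get());
            // 3 partitions, replication factor 3 -- matches the broker count.
            admin.createTopics(Collections.singleton(new NewTopic("cluster-test", 3, (short) 3)))
                 .all().get();
        }
    }
}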

4. Spring Boot client integration

To use the Kafka cluster from Spring Boot, first add the spring-kafka dependency:

    compile 'org.springframework.kafka:spring-kafka:2.3.4.RELEASE'

Configuration (application.properties):

spring.kafka.bootstrap-servers=192.168.65.150:9092,192.168.65.150:9093,192.168.65.150:9094
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
spring.kafka.producer.properties.spring.json.trusted.packages=*

spring.kafka.consumer.group-id=log_admin
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=org.springframework.kafka.support.serializer.JsonDeserializer
spring.kafka.consumer.properties.spring.json.trusted.packages=*

Utility classes

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;

@Configuration
@ConditionalOnClass(KafkaTemplate.class)
public class KafkaConfiguration {
    @Autowired
    private KafkaTemplate<Object, Object> kafkaTemplate;

    // Expose the auto-configured template through the thin KafkaSender wrapper below.
    @Bean
    public KafkaSender<Object, Object> kafkaSender() {
        return new KafkaSender<>(kafkaTemplate);
    }
}

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.lang.Nullable;
import org.springframework.util.concurrent.ListenableFuture;

public class KafkaSender<K, V> {
    private final KafkaTemplate<K, V> kafkaTemplate;

    public KafkaSender(KafkaTemplate<K, V> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Send without a key; partitions are chosen by the producer's default partitioner.
    public ListenableFuture<SendResult<K, V>> sendMessage(String topic, @Nullable V message) {
        return kafkaTemplate.send(topic, message);
    }

    // Send with a key; messages with the same key go to the same partition.
    public ListenableFuture<SendResult<K, V>> sendMessage(String topic, K key, @Nullable V message) {
        return kafkaTemplate.send(topic, key, message);
    }
}
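
On the consuming side, a matching listener is straightforward. This is a sketch assuming the hypothetical operation-log topic from the earlier examples; the group id and JSON deserializer come from the configuration above:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OperationLogListener {

    // Uses the spring.kafka.consumer.* settings (group log_admin, JsonDeserializer);
    // the JSON deserializer reconstructs the producer's payload type from type headers.
    @KafkaListener(topics = "operation-log")
    public void onMessage(Object message) {
        // In the logging scenario from the introduction, this is where the
        // record would be handed off for audit processing.
        System.out.println("received: " + message);
    }
}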

