Kafka concepts | architecture | setup | common commands

An overview of Kafka

1. Why a message queue (MQ) is needed

In a high-concurrency environment, synchronous requests cannot be processed in time and tend to block. For example, a large number of requests accessing the database concurrently causes row locks and table locks; eventually too many request threads pile up, triggering "too many connections" errors and setting off an avalanche effect.
We use message queues to ease the pressure on the system by processing requests asynchronously. Message queues are often used in scenarios such as asynchronous processing, traffic peak shaving, application decoupling, and message communication.

The more common MQ middleware currently in use include ActiveMQ, RabbitMQ, RocketMQ, and Kafka.

2. The benefits of using message queues

(1) Decoupling
Decoupling allows the processing on either side to be extended or modified independently, as long as both sides abide by the same interface constraints.

(2) Recoverability
When one part of the system fails, the whole system is not affected. The message queue reduces coupling between processes, so even if a process that handles messages crashes, the messages added to the queue can still be processed after the system recovers.

(3) Buffering
Buffering helps control and optimize the speed of data flowing through the system and resolves the mismatch between the speed at which messages are produced and the speed at which they are consumed.

(4) Flexibility and peak-handling capacity
When traffic spikes sharply, the application still needs to keep working, but such burst traffic is uncommon. Keeping resources on standby at all times just to handle these peaks would be a huge waste. Using message queues lets key components withstand sudden access pressure without crashing completely under the overload.

(5) Asynchronous communication
In many cases, users do not want or need to process messages immediately. Message queues provide an asynchronous processing mechanism that allows users to put a message into a queue without processing it immediately. Put as many messages on the queue as you want, and process them when needed.

The two modes of message queues
(1) Point-to-point mode (one-to-one; consumers actively pull data, and a message is cleared once it has been received)
Message producers send messages to the message queue, and message consumers take messages out of the queue and consume them. Once a message has been consumed it is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed. The queue supports multiple consumers, but any given message can be consumed by only one consumer.

(2) Publish/subscribe mode (one-to-many, also known as the observer pattern; messages are not cleared after consumers consume the data)
The message producer (publisher) publishes a message to a topic, and multiple message consumers (subscribers) consume it. Unlike the point-to-point mode, a message published to a topic is consumed by all subscribers.
The publish/subscribe mode defines a one-to-many dependency between objects, so that whenever the state of one object (the subject) changes, all objects that depend on it (the observers) are notified and updated automatically.

3. The definition of Kafka

Kafka is a distributed message queue (MQ) based on the publish/subscribe model, mainly used for real-time processing of big data.

3.1 Introduction to Kafka

Kafka was originally developed by LinkedIn. It is a distributed message middleware system that supports partitioning and replication and relies on ZooKeeper for coordination. Its biggest feature is that it can process large amounts of data in real time to cover a wide range of scenarios, such as Hadoop-based batch processing systems, low-latency real-time systems, Spark/Flink stream-processing engines, nginx access log collection, and messaging services. It is written in Scala. LinkedIn open-sourced Kafka and contributed it to the Apache Software Foundation, where it became a top-level open-source project in 2012.

3.2 Features of Kafka

●High throughput, low latency
Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds. Each topic can be divided into multiple partitions, and a consumer group consumes the partitions in parallel, improving load balancing and consumption capacity.

●Scalability
A Kafka cluster supports hot expansion.

●Persistence and reliability
Messages are persisted to local disk, and data replication is supported to prevent data loss.

●Fault tolerance
Nodes in the cluster are allowed to fail (with a replication factor of n, up to n-1 nodes may fail).

●High concurrency
Thousands of clients can read and write at the same time.

3.3 Kafka system architecture

(1) Broker
A Kafka server is a broker. A cluster consists of multiple brokers, and one broker can hold multiple topics.

(2) Topic
A topic can be understood as a queue; producers and consumers both work against a topic.
It is similar to a table name in a database or an index in Elasticsearch (ES).
Messages of different topics are stored separately.

(3) Partition
To achieve scalability, a very large topic can be distributed across multiple brokers (i.e., servers). A topic can be divided into one or more partitions, and each partition is an ordered queue. Kafka only guarantees that records within a partition are ordered; it does not guarantee ordering across the different partitions of a topic.

Each topic has at least one partition. When the producer generates data, it selects a partition according to the allocation strategy and appends the message to the end of the queue of the chosen partition.

3.4 Partition data routing rules

1. If a partition is specified, it is used directly;
2. If no partition is specified but a key is specified (equivalent to an attribute in the message), a partition is selected by hashing the key value and taking it modulo the number of partitions;
3. If neither a partition nor a key is specified, a partition is selected by round-robin.

Within a partition, each message is assigned an auto-incrementing number called the offset, which identifies the message; the numbering starts from 0.
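
To illustrate rule 2 above, the console producer can send keyed messages; records with the same key always hash to the same partition. This is only a sketch: the topic name test and the keys below are made up, and parse.key/key.separator are properties of the console producer's line reader.

kafka-console-producer.sh --broker-list 192.168.10.40:9092 --topic test --property parse.key=true --property key.separator=:    #each input line is split into key:value; the key decides the partition

Lines typed at the prompt, such as order1:created and order1:paid, share the key order1 and therefore land in the same partition, preserving their relative order.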

Data in each partition is stored using multiple segment files.

If a topic has multiple partitions, the order of the data cannot be guaranteed when it is consumed. In scenarios that require strict message ordering (such as product flash sales or grabbing red envelopes), the number of partitions must be set to 1.
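
For example, a strictly ordered topic could be created with a single partition; this is a sketch using the same command syntax as section 5.7 below, and the topic name orders is hypothetical:

kafka-topics.sh --create --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181 --replication-factor 2 --partitions 1 --topic orders    #one partition keeps all messages in a single ordered queue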

●A broker stores topic data. If a topic has N partitions and the cluster has N brokers, each broker stores one partition of the topic.
●If a topic has N partitions and the cluster has (N+M) brokers, then N brokers each store one partition of the topic, and the remaining M brokers store no partition data for that topic.
●If a topic has N partitions and the number of brokers in the cluster is less than N, then a broker stores one or more partitions of the topic. Try to avoid this situation in a production environment, as it can easily lead to data imbalance in the Kafka cluster.

Reasons for partitioning
●It makes it easy to scale within the cluster. Each partition can be sized to fit the machine it is on, and a topic can be composed of multiple partitions, so the cluster as a whole can handle data of any size.
●It improves concurrency, because reads and writes can be performed in units of a partition.

(4) Replica
A replica is a copy of a partition. To ensure that when a node in the cluster fails the partition data on that node is not lost and Kafka can keep working, Kafka provides a replication mechanism. Each partition of a topic has several replicas: one leader and several followers.

(5) Leader
Each partition has multiple replicas, and exactly one of them is the leader. The leader is the replica currently responsible for reading and writing data.

(6) Follower
A follower follows the leader. All write requests are routed through the leader, data changes are broadcast to all followers, and the followers stay in sync with the leader. A follower is only responsible for backup; it does not serve reads or writes.
If the leader fails, a new leader is elected from among the followers.
If a follower crashes, gets stuck, or falls too far behind in synchronization, the leader removes it from the ISR list (the set of followers maintained by the leader that are in sync with it), and a new follower is created.
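
The leader, replica assignment, and ISR of each partition can be checked with the describe command used later in section 5.7; this is a sketch with a hypothetical topic name:

kafka-topics.sh --describe --zookeeper 192.168.10.40:2181 --topic test    #the output lists Leader, Replicas and Isr for every partition of the topic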

(7) Producer
The producer is the publisher of the data; this role pushes messages to a Kafka topic.
After the broker receives a message sent by the producer, it appends the message to the segment file currently being used for appending data.
A message sent by the producer is stored in a partition, and the producer can also specify which partition the data is stored in.

(8) Consumer
Consumers pull data from the broker. A consumer can consume data from multiple topics.

(9) Consumer Group (CG)
A consumer group consists of multiple consumers.
Every consumer belongs to a consumer group; in other words, a consumer group is a logical subscriber. A group name can be specified for each consumer, and a consumer with no group name belongs to the default group.
Gathering multiple consumers together to process the data of a topic improves consumption throughput.
Each consumer in a group is responsible for consuming different partitions; a partition can be consumed by only one consumer in the group, which prevents data from being read repeatedly.
Consumer groups do not affect each other.
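
As an illustration (a sketch; the group name my_group is made up), two console consumers started on different machines with the same --group value form one consumer group and split the partitions of the topic between them:

kafka-console-consumer.sh --bootstrap-server 192.168.10.40:9092 --topic test --group my_group    #run this on two machines; each instance consumes a different subset of the partitions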

(10) Offset
The offset uniquely identifies a message within a partition.
The offset determines the position from which data is read, and there are no thread-safety issues. The consumer uses the offset to determine which message to read next (that is, its consumption position).
A message is not deleted immediately after it is consumed, so multiple businesses can reuse Kafka messages.
A service can also re-read messages by modifying the offset, which is controlled by the user.
Messages are eventually deleted; the default retention period is one week (7*24 hours).
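
A group's committed offsets can be inspected, and rewound, with kafka-consumer-groups.sh. This is a sketch; the group name my_group is made up, and resetting requires that the group has no active consumers:

kafka-consumer-groups.sh --bootstrap-server 192.168.10.40:9092 --describe --group my_group    #shows the current offset, log-end offset and lag for each partition
kafka-consumer-groups.sh --bootstrap-server 192.168.10.40:9092 --group my_group --topic test --reset-offsets --to-earliest --execute    #rewind the group so it re-reads the topic from the beginning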

(11) ZooKeeper
Kafka uses ZooKeeper to store the cluster's metadata.

Since a consumer may fail during consumption (power outage, crash, and so on), after it recovers it needs to continue consuming from where it stopped before the failure. The consumer therefore has to record, in real time, which offset it has consumed, so that it can resume from there after recovery.
Before Kafka 0.9, the consumer saved its offsets in ZooKeeper by default; starting from version 0.9, the consumer saves its offsets in a built-in Kafka topic, __consumer_offsets, by default.

In other words, the role of ZooKeeper is this: when the producer pushes data to the Kafka cluster, it has to find out where the nodes of the cluster are, and it discovers them through ZooKeeper. Which piece of data the consumer consumes also relies on ZooKeeper: the offset is obtained from ZooKeeper, and the offset records where the last consumption stopped, so that consumption can continue from the next message.
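
The broker registrations that producers and consumers look up can be seen directly in ZooKeeper. As a sketch (the path /brokers/ids is where Kafka registers its brokers), the ZooKeeper shell bundled with Kafka can list them:

zookeeper-shell.sh 192.168.10.40:2181 ls /brokers/ids    #lists the ids of the brokers currently registered in ZooKeeper, e.g. [0, 1, 2]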

4. Kafka architecture

(Kafka architecture diagram)

My own summary: there are as many brokers as there are Kafka instances. Producers produce data into topics; a topic consists of multiple partitions; each partition has multiple replicas, and the replicas of a partition consist of one leader and several followers. The leader handles data reads and writes, while the followers are only responsible for replicating the data. Consumers pull data from topics to consume it.

5. Building Kafka

5.1 Environment preparation

Install Kafka on the three machines where ZooKeeper was set up previously:
192.168.10.40   zookeeper   +  kafka
192.168.10.50   zookeeper   +  kafka
192.168.10.60   zookeeper   +  kafka

5.2 Install kafka

cd /opt/
wget https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/2.7.1/kafka_2.13-2.7.1.tgz    #download the official installation package (skip if it has already been uploaded)
tar zxvf kafka_2.13-2.7.1.tgz                 #unpack
mv kafka_2.13-2.7.1 /usr/local/kafka          #move and rename


5.3 Modify the configuration file

cd /usr/local/kafka/config/
cp server.properties{,.bak}             #back up the original configuration file

vim server.properties
broker.id=0                   #line 21: the broker's globally unique ID; it must differ on every broker, so set broker.id=1 and broker.id=2 on the other machines
listeners=PLAINTEXT://192.168.10.40:9092    #line 31: the IP and port to listen on; use each broker's own IP (or keep the default)
num.network.threads=3           #line 42: number of threads the broker uses to handle network requests; normally no need to change
num.io.threads=8                #line 45: number of threads used for disk I/O; should be greater than the number of disks
socket.send.buffer.bytes=102400       #line 48: send buffer size of the socket
socket.receive.buffer.bytes=102400    #line 51: receive buffer size of the socket
socket.request.max.bytes=104857600    #line 54: maximum size of a request accepted by the socket server
log.dirs=/usr/local/kafka/logs        #line 60: path where Kafka run logs are stored, which is also where the data is stored
num.partitions=1    #line 65: default number of partitions for a topic on this broker; overridden by the value given when the topic is created
num.recovery.threads.per.data.dir=1    #line 69: number of threads used to recover and clean up the data under the data dirs
log.retention.hours=168    #line 103: maximum time a segment file (data file) is kept, in hours; the default is 7 days, after which it is deleted
log.segment.bytes=1073741824    #line 110: maximum size of a segment file, 1 GB by default; a new segment file is created when this is exceeded
zookeeper.connect=192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181    #line 123: addresses for connecting to the ZooKeeper cluster
Only lines 31, 60 and 123 need to be modified.


5.4 Edit the configuration files of the other two virtual machines

scp -r /usr/local/kafka/ 192.168.10.50:/usr/local/
scp -r /usr/local/kafka/ 192.168.10.60:/usr/local/
vim /usr/local/kafka/config/server.properties
Modify lines 21 and 31 on each machine:
broker.id=        #1 on 192.168.10.50, 2 on 192.168.10.60
listeners=        #PLAINTEXT:// followed by that machine's own IP and port 9092


5.5 Edit the environment variables on the three machines

vim /etc/profile
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin

source /etc/profile


5.6 Configure the Kafka startup script

Do this on all three virtual machines at the same time:
vim /etc/init.d/kafka

#!/bin/bash
#chkconfig:2345 22 88
#description:Kafka Service Control Script
KAFKA_HOME='/usr/local/kafka'
case $1 in
start)
	echo "---------- Kafka 启动 ------------"
	${KAFKA_HOME}/bin/kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties
;;
stop)
	echo "---------- Kafka 停止 ------------"
	${KAFKA_HOME}/bin/kafka-server-stop.sh
;;
restart)
	$0 stop
	$0 start
;;
status)
	echo "---------- Kafka 状态 ------------"
	count=$(ps -ef | grep kafka | egrep -cv "grep|$$")
	if [ "$count" -eq 0 ];then
        echo "kafka is not running"
    else
        echo "kafka is running"
    fi
;;
*)
    echo "Usage: $0 {start|stop|restart|status}"
esac

chmod +x /etc/init.d/kafka
chkconfig --add kafka


5.7 Start Kafka and test it

service kafka start
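
As an optional check (not part of the original steps), you can confirm on each machine that the broker is listening on port 9092:

netstat -lntp | grep 9092    #a kafka (java) process listening here means the broker started successfully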


Option definitions:
--zookeeper: the ZooKeeper cluster server addresses; separate multiple IPs with commas (one IP is usually enough)
--replication-factor: the number of partition replicas; 1 means a single replica, 2 is recommended
--partitions: the number of partitions
--topic: the topic name

Create a topic from any one of the machines:
kafka-topics.sh --create --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181 --replication-factor 2 --partitions 3 --topic test


View all topics on the current server:
kafka-topics.sh --list --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181


View the details of a topic (without --topic, the details of all topics are shown):
kafka-topics.sh  --describe --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181


Publish messages from 192.168.10.40:
kafka-console-producer.sh --broker-list 192.168.10.40:9092,192.168.10.50:9092,192.168.10.60:9092  --topic test
Consume messages on 192.168.10.50:
kafka-console-consumer.sh --bootstrap-server 192.168.10.40:9092,192.168.10.50:9092,192.168.10.60:9092 --topic test --from-beginning


Modify the number of partitions (the partition count can only be increased, not decreased):
kafka-topics.sh --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181 --alter --topic test --partitions 6


Delete a topic:
kafka-topics.sh --delete --zookeeper 192.168.10.40:2181,192.168.10.50:2181,192.168.10.60:2181 --topic test


Origin blog.csdn.net/m0_75015568/article/details/130081887