Kafka cluster deployment (Kafka + KRaft mode)

Kafka + KRaft mode

1. Cluster plan

Hostname	IP address	node.id	process.roles
kafka1	192.168.56.107	1	broker,controller
kafka2	192.168.56.108	2	broker,controller
kafka3	192.168.56.109	3	broker,controller

2. KRaft configuration file

vi config/kraft/server.properties

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# This configuration file is intended for use in KRaft mode, where
# Apache ZooKeeper is not present.  See config/kraft/README.md for details.
#

############################# Server Basics #############################

# The role of this server. Setting this puts us in KRaft mode
# Identifies the role(s) this node takes on; it must be set in KRaft mode
process.roles=broker,controller

# The node id associated with this instance's roles
# The node ID, tied to the node's roles. It must be unique, so it differs on every server
node.id=1

# The connect string for the controller quorum
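# Connection string for the controller quorum. Similar in purpose to the ZooKeeper connect string, but in a different format. Identical on every server
controller.quorum.voters=1@192.168.56.107:9093,2@192.168.56.108:9093,3@192.168.56.109:9093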

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = listener_name://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
## This host's IP and ports; differs on every server
listeners=PLAINTEXT://192.168.56.107:9092,CONTROLLER://192.168.56.107:9093
inter.broker.listener.name=PLAINTEXT

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
## This host's IP and port; differs on every server
advertised.listeners=PLAINTEXT://192.168.56.107:9092

# Listener, host name, and port for the controller to advertise to the brokers. If
# this server is a controller, this listener must be configured.
controller.listener.names=CONTROLLER

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads that the server uses for receiving requests from the network and sending responses to the network
num.network.threads=3

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma separated list of directories under which to store log files
# Data log directory
log.dirs=/tmp/kraft-combined-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log unless the remaining
# segments drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000
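The settings marked as differing per server (node.id, listeners, advertised.listeners) can be derived from one template instead of being edited by hand on each machine. A minimal sketch, assuming the template is the file above and that every node uses ports 9092 and 9093 as in this article; render_node_config is a hypothetical helper name, not a Kafka tool:

```shell
# Print a per-node copy of a template server.properties on stdout.
# Arguments: TEMPLATE NODE_ID NODE_IP (helper name and ports are assumptions).
render_node_config() {
  template="$1"; node_id="$2"; node_ip="$3"
  # Only the three per-node keys are rewritten; every other line passes through unchanged.
  sed -E \
    -e "s|^node\.id=.*|node.id=${node_id}|" \
    -e "s|^listeners=.*|listeners=PLAINTEXT://${node_ip}:9092,CONTROLLER://${node_ip}:9093|" \
    -e "s|^advertised\.listeners=.*|advertised.listeners=PLAINTEXT://${node_ip}:9092|" \
    "$template"
}

# Example (not run here): write the config for kafka2 to a separate file.
# render_node_config config/kraft/server.properties 2 192.168.56.108 > /tmp/server-kafka2.properties
```

controller.quorum.voters is deliberately left alone, since it has to be identical on every node.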

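process.roles:

If process.roles = broker, the server acts as a broker in KRaft mode.
If process.roles = controller, the server acts as a controller in KRaft mode.
If process.roles = broker,controller, the server acts as both a broker and a controller in KRaft mode.
If process.roles is not set, the server is assumed to be running in ZooKeeper mode.

A node that acts as both broker and controller is called a "combined" node.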
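To check which of these cases a given configuration file falls into, the rules can be written down as a small shell function. This is only an illustration of the rules listed here; kraft_mode is a hypothetical helper, and it looks only at uncommented process.roles lines:

```shell
# Report the mode selected by process.roles in a server.properties file.
kraft_mode() {
  props_file="$1"
  # Take the last uncommented process.roles line and strip all whitespace from its value.
  roles=$(grep -E '^[[:space:]]*process\.roles[[:space:]]*=' "$props_file" \
            | tail -n 1 | cut -d= -f2- | tr -d '[:space:]' || true)
  case "$roles" in
    "")                                  echo "zookeeper" ;;   # not set: ZooKeeper mode
    broker)                              echo "broker" ;;
    controller)                          echo "controller" ;;
    broker,controller|controller,broker) echo "combined" ;;
    *)                                   echo "unknown: $roles"; return 1 ;;
  esac
}

# Example (not run here):
# kraft_mode config/kraft/server.properties
```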

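controller.quorum.voters:

Every node in the system must set controller.quorum.voters.
This setting identifies which nodes are voters in the quorum. Every node that may become a controller must be listed in it. This is similar to zookeeper.connect in ZooKeeper mode, which must list all the ZooKeeper servers.
Unlike the ZooKeeper setting, however, controller.quorum.voters must also include each node's id. The format is: id1@host1:port1,id2@host2:port2.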
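Because the value has to be identical on every node, it may be easier to build it once from the cluster plan than to type it three times. A minimal sketch of the id@host:port format; build_quorum_voters and the QUORUM_PORT variable are hypothetical names, and 9093 is the controller port used in this article:

```shell
# Build a controller.quorum.voters value from "id:host" pairs.
# The controller port defaults to 9093 and can be overridden with QUORUM_PORT.
build_quorum_voters() {
  port="${QUORUM_PORT:-9093}"
  voters=""
  for pair in "$@"; do
    id="${pair%%:*}"      # text before the first colon
    host="${pair#*:}"     # text after the first colon
    # Prepend a comma separator only when the list is already non-empty.
    voters="${voters:+$voters,}${id}@${host}:${port}"
  done
  echo "$voters"
}

build_quorum_voters 1:192.168.56.107 2:192.168.56.108 3:192.168.56.109
```

With the cluster plan from this article, the call prints 1@192.168.56.107:9093,2@192.168.56.108:9093,3@192.168.56.109:9093.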

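3. Generate a unique cluster ID

In the 1.x and 2.x releases, the cluster ID was generated automatically and the data directories were formatted automatically. So why does 3.0 require these steps to be done by hand?

The community's reasoning is that automatic formatting can mask errors. On Unix, for example, if a data directory fails to mount, it may appear empty, and automatic formatting would then cause problems. This matters most for the metadata log kept by the controllers: if two of three controller nodes were able to start from an empty log, a leader could be elected with nothing in its log, and all metadata would be lost (the KRaft quorum truncates the log after the election). Once that happens, the failure is irreversible.

First, use the kafka-storage.sh tool in the bin directory to generate a unique ID for your new cluster: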

[root@localhost kafka_2.13-3.0.0]# bin/kafka-storage.sh random-uuid
PYfQjCKRQZWpOAa_SkxHNA

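4. Format the storage directories

Next, format the storage directories. For a single-node setup, run the command below on that machine. For multiple nodes, run the format command on every node so that the storage directories on each machine are formatted. Make sure every node in the cluster uses the same cluster ID.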

[root@localhost kafka_2.13-3.0.0]# bin/kafka-storage.sh format -t PYfQjCKRQZWpOAa_SkxHNA -c ./config/kraft/server.properties
Formatting /tmp/kraft-combined-logs
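Since the same command has to run on every node with the same ID, it can help to generate the commands in one place. The sketch below only prints the commands, one per host, so they can be reviewed before being run. The hostnames come from the cluster plan; SSH access and a kafka_2.13-3.0.0 install directory in the remote user's home are assumptions, and format_command is a hypothetical helper:

```shell
# The ID generated by kafka-storage.sh random-uuid; it must be reused on every node.
CLUSTER_ID="PYfQjCKRQZWpOAa_SkxHNA"

# Print the format command for a given cluster ID.
format_command() {
  echo "bin/kafka-storage.sh format -t $1 -c ./config/kraft/server.properties"
}

# Dry run: print one ssh command per host instead of executing anything.
for host in kafka1 kafka2 kafka3; do
  echo "ssh $host \"cd kafka_2.13-3.0.0 && $(format_command "$CLUSTER_ID")\""
done
```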

Note: at present you cannot switch back and forth between ZooKeeper mode and KRaft mode without reformatting the directories.

Contents of meta.properties:

[root@localhost kafka_2.13-3.0.0]# cat /tmp/kraft-combined-logs/meta.properties 
#
#Wed Oct 20 09:33:33 UTC 2021
cluster.id=PYfQjCKRQZWpOAa_SkxHNA
version=1
node.id=2
[root@localhost kafka_2.13-3.0.0]# 
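A mismatched cluster.id or a duplicated node.id is easy to miss, so it can be worth reading these values back after formatting. A small sketch, assuming the key=value layout shown above; meta_value is a hypothetical helper name:

```shell
# Print the value of one key from a meta.properties file.
# Arguments: FILE KEY
meta_value() {
  # Match the key exactly (field 1 when splitting on "=") and print everything after "key=".
  awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2); exit }' "$1"
}

# Example (not run here): compare this output across kafka1, kafka2 and kafka3.
# meta_value /tmp/kraft-combined-logs/meta.properties cluster.id
```

cluster.id should be the same on all three nodes, while node.id should be different on each.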

5. Start

Finally, start the Kafka server on each node.

bin/kafka-server-start.sh ./config/kraft/server.properties 

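Note: just as with a ZooKeeper-based cluster, in KRaft mode clients connect to port 9092 (or whichever port is configured) to perform operations such as creating and deleting topics.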
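As an illustration, creating and describing a topic against the brokers' 9092 listeners might look like the following. The topic name first-kraft matches the log directory inspected later in this article; the partition count and replication factor are illustrative choices, and the two wrapper functions are hypothetical helpers that assume the current directory is the Kafka install directory:

```shell
# All three brokers from the cluster plan, on the PLAINTEXT listener port.
BOOTSTRAP="192.168.56.107:9092,192.168.56.108:9092,192.168.56.109:9092"

# Create a topic with one partition replicated to all three brokers.
create_topic() {
  bin/kafka-topics.sh --create --topic "$1" \
    --partitions 1 --replication-factor 3 \
    --bootstrap-server "$BOOTSTRAP"
}

# Show the partition leader and replicas for a topic.
describe_topic() {
  bin/kafka-topics.sh --describe --topic "$1" --bootstrap-server "$BOOTSTRAP"
}

# Example (not run here):
# create_topic first-kraft
# describe_topic first-kraft
```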

6. Shut down the cluster

Stop Kafka on kafka1, kafka2, and kafka3 in turn:

bin/kafka-server-stop.sh

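Inspecting the log files

In KRaft mode, the data that used to live in ZooKeeper, such as broker and topic information, is all moved into an internal topic: __cluster_metadata (named @metadata in 2.8). So we need a tool for looking at the current contents.

kafka-dump-log.sh is an existing tool for inspecting the contents of a topic's files. It gained a --cluster-metadata-decoder option for viewing the metadata log. Its options are listed below: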

Option	Description	Example
--deep-iteration	If set, uses deep instead of shallow iteration.
--files <String: file1, file2, ...>	Required. The log files to read.	--files 0000009000.log
--key-decoder-class	If set, used to deserialize the keys. This class should implement the kafka.serializer.Decoder trait. Custom jar should be available in the kafka/libs directory.
--max-message-size	Size of the largest message. Default: 5242880.
--offsets-decoder	If set, log data will be parsed as offset data from the __consumer_offsets topic.
--print-data-log	If set, prints the message contents.
--transaction-log-decoder	If set, log data will be parsed as transaction metadata from the __transaction_state topic.
--value-decoder-class [String]	If set, used to deserialize the messages. This class should implement the kafka.serializer.Decoder trait. Custom jar should be available in the kafka/libs directory. (default: kafka.serializer.StringDecoder)
--verify-index-only	If set, just verify the index log without printing its content.
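The --cluster-metadata-decoder option mentioned above is used like the other decoder flags. A short sketch, assuming the metadata log sits under the log.dirs configured in this article and that the current directory is the Kafka install directory; dump_metadata_log is a hypothetical wrapper name:

```shell
# Decode a segment of the KRaft metadata log into readable records.
dump_metadata_log() {
  bin/kafka-dump-log.sh --cluster-metadata-decoder --files "$1"
}

# Example (not run here):
# dump_metadata_log /tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log
```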

Dump a log file

bin/kafka-dump-log.sh --files /tmp/kraft-combined-logs/first-kraft-0/00000000000000000000.log	

Dump a log file together with its record contents

bin/kafka-dump-log.sh --files /tmp/kraft-combined-logs/first-kraft-0/00000000000000000000.log --print-data-log

Dump an index file

bin/kafka-dump-log.sh --files /tmp/kraft-combined-logs/first-kraft-0/00000000000000000000.index

The log.index.size.max.bytes setting controls the maximum size of an index file.

Dump a timeindex file

bin/kafka-dump-log.sh --files /tmp/kraft-combined-logs/first-kraft-0/00000000000000000000.timeindex

View the metadata

bin/kafka-metadata-shell.sh  --snapshot /tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log

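Kafka Raft metadata mode

The version of Apache Kafka that does not depend on Apache ZooKeeper is known in the community as Kafka Raft metadata mode, or KRaft (pronounced "craft") for short. The mode first shipped as an early-access release in 2.8. In 3.0, the version used in this article, it is still a preview: you can try out how KRaft behaves, but it is not yet recommended for production. It was declared production-ready for new clusters in 3.3.

A Kafka cluster running in KRaft mode does not store its metadata in Apache ZooKeeper. When deploying a new cluster there is no ZooKeeper ensemble to deploy, because Kafka stores the metadata in the KRaft quorum of the controller nodes. KRaft brings a number of benefits: it can support more partitions, controller failover is faster, and it avoids the class of problems caused by the controller's cached metadata diverging from what is stored in ZooKeeper.

Note: Kafka 2.8 was the first release that could run without ZooKeeper by using Raft. ZooKeeper mode itself was only deprecated later, in 3.5, and removed in 4.0.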

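Why get rid of ZooKeeper?

Kafka is a message queue, yet it depends on ZooKeeper, a heavyweight coordination system. RabbitMQ, another message queue, has managed itself from early on.
ZooKeeper is heavy, is normally deployed with an odd number of nodes, and is awkward to scale out or in. It is also configured completely differently from Kafka, so installing and tuning Kafka means looking after a second system as well.
For Kafka to become lightweight and work out of the box, ZooKeeper had to go.
ZooKeeper and Kafka are separate storage systems, so once the number of topics and partitions grows, keeping the data in sync becomes a noticeable problem. ZooKeeper is reliable but slow, far slower than keeping the same data in Kafka's own log storage, and for a system that sells itself on speed that is a bottleneck Kafka had to route around.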

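What changes

Simpler deployment.

First, deployment becomes simpler. For systems that do not need high availability, a single process is enough to run Kafka. There is no longer any need to request ZooKeeper-friendly SSDs or to keep an eye on whether ZooKeeper has enough capacity.

Easier monitoring.

Second, because the information is centralized, monitoring data can be read from Kafka directly without a detour through ZooKeeper. That should make integration with systems such as Grafana, Kibana, and Prometheus much more straightforward.

Faster.

Most important of all is speed. Raft is easier to understand than ZooKeeper's ZAB protocol and more efficient; partition leader election becomes faster, and controller scheduling speeds up considerably.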

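Summary

Because of the complexity of managing Apache ZooKeeper, Kafka has often been regarded as heavyweight infrastructure. Projects therefore tended to start with a lighter, traditional message queue such as ActiveMQ or RabbitMQ, and migrate to Kafka once they grew.

That has now changed. KRaft mode offers a lightweight way to get started with Kafka, or to use it as an alternative to message queues such as ActiveMQ or RabbitMQ. A lightweight single-process deployment is also a better fit for edge scenarios and for lightweight hardware.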

Kafka series

Kafka (1): Kafka cluster deployment (Kafka + ZooKeeper mode)
More articles to come...

Kafka (2): Kafka cluster deployment (Kafka + KRaft mode)