Kafka Implementation Mechanism

See original: https://www.jianshu.com/p/3e54a5a39683


Message Middleware: Kafka Data Storage (Part 1)


Abstract: Message storage is critical for any message queue, so how does Kafka design its storage for efficiency?

Kafka, a distributed message queue, stores and caches messages using the file system and the operating system's page cache, abandoning a Java-heap cache and turning random writes into sequential writes; combined with the zero-copy feature, this greatly improves I/O performance. Regarding disks and file systems, many readers will know the rule of thumb about hard-disk storage: "the linear write speed of a SATA RAID-5 disk array can reach hundreds of MB/s, while its random write speed is only about 100 KB/s; linear writes are roughly a thousand times faster than random writes." The key to disk write speed therefore lies in how we use the disk. With this in mind, Kafka implements its data storage by appending to files. Because writes are sequential appends, Kafka can provide message persistence with O(1) disk data structures, and this structure maintains stable performance over time even for message stores of tens of terabytes. In the ideal case, as long as disk space is large enough, messages can be appended indefinitely. In addition, Kafka lets users configure how long persisted messages are retained, providing a more flexible way of handling messages. This article focuses on Kafka's message storage structure, how data is stored, and how a message is located by its offset.

1. Several Important Concepts in Kafka

(1) Broker: a message-middleware processing node; one Kafka node is one broker, and one or more brokers form a Kafka cluster.
(2) Topic: an abstract classification of a set of messages; for example, page-view logs and click logs can each be abstracted into a topic. Physically, messages of different topics are stored separately. Logically, the messages of one topic may be stored on one or more brokers, but users only need to specify the topic to produce or consume messages, without caring where the data actually resides.
(3) Partition: each topic is further divided into one or more partitions. Each partition corresponds to a folder on the local disk, named as the topic name, then a "-" connector, then the partition number; partition numbers run from 0 to the total number of partitions minus 1 (a small naming sketch follows this list).
(4) LogSegment: each partition is divided into multiple log segments; the log segment is the smallest slice unit of a Kafka log object. A LogSegment is a logical concept corresponding to one concrete log data file (".log") and two index files (".index" and ".timeindex", the offset index file and the message timestamp index file, respectively).
(5) Offset: each partition consists of an ordered, immutable sequence of messages that are appended to the partition sequentially. Each message has a continuous sequence number called the offset, which uniquely identifies the message within the partition (it does not indicate the message's physical location on disk).
(6) Message: the smallest basic unit stored in Kafka, i.e., one commit-log entry, composed of a fixed-length header and a variable-length body.
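
To make the naming rules in (3) and (4) concrete, here is a minimal sketch (my own illustration, not Kafka source code) that prints the expected on-disk layout of a fresh 3-partition topic; the topic name matches the one used later in this article:

// PartitionLayoutSketch.java - expected on-disk layout for a new 3-partition topic.
public class PartitionLayoutSketch {
    public static void main(String[] args) {
        String topic = "kafka-topic-01";
        long baseOffset = 0; // a fresh partition's first segment starts at offset 0
        for (int p = 0; p < 3; p++) {
            String dir = topic + "-" + p; // rule (3): <topic>-<partition number>
            // Rule (4): segment files are named by base offset, zero-padded to 20 digits.
            System.out.printf("%s/%020d.log%n", dir, baseOffset);
            System.out.printf("%s/%020d.index%n", dir, baseOffset);
            System.out.printf("%s/%020d.timeindex%n", dir, baseOffset);
        }
    }
}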

2. Kafka Log Data Storage Structure

Kafka organizes messages by topic as the basic unit, and topics are independent of one another. A topic is only a logical abstraction; physically, the message data of a topic is stored in one or more partitions, where each partition corresponds to a folder on the local disk, and each folder contains log index files (".index" and ".timeindex") and a log data file (".log"). The number of partitions can be specified when the topic is created and can also be modified afterwards. (Note: the number of partitions of a topic can only be increased, never decreased; the reason is beyond the scope of this article, but readers may want to think it over.)

It is precisely this partition design that allows Kafka to achieve high message throughput: the messages of a topic are spread across multiple partitions, and the partitions are distributed and stored on different Kafka broker nodes. Meanwhile, producers and consumers can operate in parallel with multiple threads, each thread processing the data of one partition; a consumer-side sketch of this pattern follows.
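
As an illustration of this per-partition parallelism, here is a minimal consumer-side sketch (one common pattern, assuming the kafka-clients library; the topic name and broker address are placeholders) in which each thread owns its own KafkaConsumer and is manually assigned exactly one partition:

// PerPartitionConsumers.java - one consumer thread per partition (sketch).
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PerPartitionConsumers {
    public static void main(String[] args) {
        int partitions = 3; // matches the 3-partition topic created later in this article
        for (int p = 0; p < partitions; p++) {
            final int partition = p;
            new Thread(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "10.154.0.73:9092"); // placeholder broker address
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    // Manual assignment: this thread reads exactly one partition.
                    consumer.assign(List.of(new TopicPartition("kafka-topic-01", partition)));
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> r : records) {
                            System.out.printf("partition=%d offset=%d value=%s%n",
                                    r.partition(), r.offset(), r.value());
                        }
                    }
                }
            }, "consumer-partition-" + p).start();
        }
    }
}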

[Figure: Relationship between Topic and Partition in Kafka (from the official website)]


Meanwhile, to make the cluster highly available, each partition in Kafka can be given one or more replicas, and a partition's replicas are distributed across different broker nodes. Among the replicas, one is elected as the Leader, which is responsible for all reads and writes from clients; the other replicas, as Followers, synchronize data from the Leader.


2.1 Analysis of Log File Storage for Kafka Partitions/Replicas

After setting up a Kafka cluster on three virtual machines (three Kafka broker nodes), a topic with a specified number of partitions and replicas can be created with the following command under the bin/ directory of a Kafka broker node:

./kafka-topics.sh --create --zookeeper 10.154.0.73:2181 --replication-factor 3 --partitions  3 --topic kafka-topic-01

After the topic is created, the state of its partitions and replicas can be inspected (the command lists each partition of the topic along with its replica assignment and ISR list):

./kafka-topics.sh --describe --zookeeper 10.154.0.73:2181 --topic kafka-topic-01
Topic:kafka-topic-01    PartitionCount:3    ReplicationFactor:3 Configs:
    Topic: kafka-topic-01   Partition: 0    Leader: 1   Replicas: 1,2,0 Isr: 1,2,0
    Topic: kafka-topic-01   Partition: 1    Leader: 2   Replicas: 2,0,1 Isr: 2,1,0
    Topic: kafka-topic-01   Partition: 2    Leader: 0   Replicas: 0,1,2 Isr: 1,2,0

I then implemented a simple Kafka producer demo to send messages to the Kafka brokers (a minimal sketch follows the directory listing below). After producing a large number of messages, the log directory specified by "log.dirs" in Kafka's config/server.properties contains three partition directories across the three clustered virtual machines, and each partition directory contains the correspondingly numbered log data files and log index files, as follows:

# 1. Partition directories
drwxr-x--- 2 root root 4096 Jul 26 19:35 kafka-topic-01-0
drwxr-x--- 2 root root 4096 Jul 24 20:15 kafka-topic-01-1
drwxr-x--- 2 root root 4096 Jul 24 20:15 kafka-topic-01-2

# 2. Log data files and log index files in a partition directory
-rw-r----- 1 root root 512K Jul 24 19:51 00000000000000000000.index
-rw-r----- 1 root root 1.0G Jul 24 19:51 00000000000000000000.log
-rw-r----- 1 root root 768K Jul 24 19:51 00000000000000000000.timeindex
-rw-r----- 1 root root 512K Jul 24 20:03 00000000000022372103.index
-rw-r----- 1 root root 1.0G Jul 24 20:03 00000000000022372103.log
-rw-r----- 1 root root 768K Jul 24 20:03 00000000000022372103.timeindex
-rw-r----- 1 root root 512K Jul 24 20:15 00000000000044744987.index
-rw-r----- 1 root root 1.0G Jul 24 20:15 00000000000044744987.log
-rw-r----- 1 root root 767K Jul 24 20:15 00000000000044744987.timeindex
-rw-r----- 1 root root 10M Jul 24 20:21 00000000000067117761.index
-rw-r----- 1 root root 511M Jul 24 20:21 00000000000067117761.log
-rw-r----- 1 root root 10M Jul 24 20:21 00000000000067117761.timeindex
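
The producer demo itself is not shown in the original; a minimal sketch of what it might look like follows (assuming the kafka-clients library on the classpath; the broker address and port are placeholders, and the fixed key "1" and random UUID payload mirror the values visible in the dump output later in this article):

// ProducerDemo.java - minimal producer sketch used to fill the topic with messages.
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.154.0.73:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1_000_000; i++) {
                // A fixed key of "1" and a random UUID payload, matching the
                // key/payload values seen in the DumpLogSegments output below.
                producer.send(new ProducerRecord<>("kafka-topic-01", "1", UUID.randomUUID().toString()));
            }
            producer.flush();
        }
    }
}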

As the listing shows, each partition corresponds to a physical folder named as the topic name, then a "-" connector, then the partition number, with partition numbers running from 0 to the total number of partitions minus 1. Each partition has 1 to n replicas, and a partition's replicas are distributed on different brokers of the cluster to improve availability. From the storage point of view, each replica of a partition can logically be abstracted as a Log object, i.e., a partition replica corresponds to a Log object. The figure below shows the physical distribution of leader/follower replicas across a Kafka cluster of three broker nodes:

[Figure: Physical distribution of Kafka partitions and replicas]

2.2 Storage Structure of Kafka Log Data Files and Index Files

In Kafka, each Log object can in turn be divided into multiple LogSegment files, and each LogSegment consists of one log data file and two index files (an offset index file and a message timestamp index file). The log data files of the LogSegments are all capped at the same size (configurable via "log.segment.bytes" in the broker's config/server.properties; the default is 1 GiB, i.e., 1073741824 bytes). When sequentially written messages would exceed this threshold, a new set of log data and index files is created, as sketched below.
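
Below is a minimal, self-contained simulation of this rolling behavior (my own illustration, not Kafka's internal code); the threshold and record sizes are made-up values chosen so the demo rolls quickly:

// SegmentRollSketch.java - simulates rolling to a new log segment when the
// active segment would exceed the log.segment.bytes threshold.
public class SegmentRollSketch {
    public static void main(String[] args) {
        final long segmentBytes = 1024; // default is 1073741824 (1 GiB); shrunk for the demo
        long activeSize = 0;            // bytes written to the active segment so far
        long nextOffset = 0;            // offset the next appended message will receive
        long baseOffset = 0;            // base offset of the active segment

        for (int i = 0; i < 50; i++) {
            int recordSize = 100;       // pretend every record serializes to 100 bytes
            if (activeSize + recordSize > segmentBytes) {
                // Roll: the new segment is named after the first offset it will hold.
                baseOffset = nextOffset;
                activeSize = 0;
                System.out.printf("rolled new segment %020d.log (+ .index/.timeindex)%n", baseOffset);
            }
            activeSize += recordSize;
            nextOffset++;
        }
    }
}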
Kafka encapsulates the log data file as a FileMessageSet object, and the offset index file and the message timestamp index file as OffsetIndex and TimeIndex objects, respectively. Log and LogSegment are both logical concepts: a Log is the abstraction of a replica's storage on a broker, while a LogSegment is the abstraction of each log segment under that replica's storage; only the log and index files correspond to physical storage on disk. The figure below shows the mapping between the objects in Kafka's log storage structure:

[Figure: Mapping between objects in Kafka's log storage structure]


To look further inside the ".index" offset index file, the ".timeindex" timestamp index file, and the ".log" log data file, the following command converts the binary segment index and log data files into readable text:

# 1. Run the following command to dump the contents of a log data file
./kafka-run-class.sh kafka.tools.DumpLogSegments --files /apps/svr/Kafka/kafkalogs/kafka-topic-01-0/00000000000022372103.log --print-data-log > 00000000000022372103_txt.log

# 2. The dumped log data content
Dumping /apps/svr/Kafka/kafkalogs/kafka-topic-01-0/00000000000022372103.log
Starting offset: 22372103
offset: 22372103 position: 0 CreateTime: 1532433067157 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 5d2697c5-d04a-4018-941d-881ac72ed9fd
offset: 22372104 position: 0 CreateTime: 1532433067159 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 0ecaae7d-aba5-4dd5-90df-597c8b426b47
offset: 22372105 position: 0 CreateTime: 1532433067159 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 87709dd9-596b-4cf4-80fa-d1609d1f2087
......
offset: 22372444 position: 16365 CreateTime: 1532433067166 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 8d52ec65-88cf-4afd-adf1-e940ed9a8ff9
offset: 22372445 position: 16365 CreateTime: 1532433067168 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 5f5f6646-d0f5-4ad1-a257-4e3c38c74a92
offset: 22372446 position: 16365 CreateTime: 1532433067168 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 51dd1da4-053e-4507-9ef8-68ef09d18cca
offset: 22372447 position: 16365 CreateTime: 1532433067168 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 80d50a8e-0098-4748-8171-fd22d6af3c9b
......
offset: 22372785 position: 32730 CreateTime: 1532433067174 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: db80eb79-8250-42e2-ad26-1b6cfccb5c00
offset: 22372786 position: 32730 CreateTime: 1532433067176 isvalid: true keysize: 4 valuesize: 36 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] key: 1 payload: 51d95ab0-ab0d-4530-b1d1-05eeb9a6ff00
......

# 3. Similarly, the dumped offset index content
Dumping /apps/svr/Kafka/kafkalogs/kafka-topic-01-0/00000000000022372103.index
offset: 22372444 position: 16365
offset: 22372785 position: 32730
offset: 22373467 position: 65460
offset: 22373808 position: 81825
offset: 22374149 position: 98190
offset: 22374490 position: 114555
......

# 4. The dumped timestamp index file content
Dumping /apps/svr/Kafka/kafkalogs/kafka-topic-01-0/00000000000022372103.timeindex
timestamp: 1532433067174 offset: 22372784
timestamp: 1532433067191 offset: 22373466
timestamp: 1532433067206 offset: 22373807
timestamp: 1532433067214 offset: 22374148
timestamp: 1532433067222 offset: 22374489
timestamp: 1532433067230 offset: 22374830
......

From the dumped offset index file and log data file contents above, we can see that the offset index file stores a large amount of index metadata, while the log data file stores the fields of each message structure together with the message body itself. The position field of an index entry points to the actual location of the corresponding message in the log data file (i.e., its physical offset).

The table below lists several key fields of the Kafka message structure:

Kafka message field   Description
offset                the message's offset
message size          total length of the message
CRC32                 CRC32 checksum
attributes            flags for the standalone version, compression type, or encoding type
magic                 protocol version number of the Kafka server program for this release
key length            length of the message key
key                   actual data of the message key
valuesize             length of the message's actual value data
payload               the message's actual data

1. Log data file

Kafka saves the message data sent to it by producers into log data files, each named after the segment's base offset, left-padded with zeros, with the ".log" suffix. Each message in a partition is identified by its offset within that partition. The offset is not the message's actual storage position in the partition but a logical value (Kafka records it with 8 bytes); nevertheless it uniquely determines the logical position of one message in the partition, and offsets within the same partition increase sequentially (comparable to a database's auto-increment primary key). In addition, the dumped text of the log data file above shows the content values of each field of the message body.

2. Offset index file

If a consumer had to search a 1 GiB (by default) log data file on every fetch to find the message at a given offset, efficiency would be very low, since even after locating the segment a sequential comparison would still be needed. To make message lookup efficient, Kafka's storage design builds a sparse index for every segmented log data file, which both saves space and allows messages in the log data file to be located quickly through the index. Like the data file, the offset index file is named after the segment's base offset, left-padded with zeros, with the ".index" suffix.
From the dumped offset index content above, we can see that an index entry maps an offset to the message's actual physical position in the log data file. Each index entry consists of an offset and a position, and each entry uniquely identifies one message in a partition's data file. Kafka uses sparse index storage: one index entry is created per fixed number of bytes written, and this index span can be configured via "index.interval.bytes".
With the offset index file, Kafka can quickly locate a message's actual physical position from a specified offset. Concretely: given the specified offset, binary search locates the segment index file and log data file whose range contains that offset's message; then, within the index file, binary search finds the largest indexed offset less than or equal to the specified offset, together with its position (the physical location); starting at that physical position, the segment's log data file is scanned sequentially until the message whose offset equals the specified offset is found. A runnable sketch of the index-side step follows. Below that is the mapping between a segmented log data file and its offset index file in Kafka (which also illustrates how a message is located from a starting offset).
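
Below is a minimal, self-contained sketch of the index-side half of this lookup (my own illustration with in-memory data, not Kafka's code); the sample entries are taken from the offset index dump above:

// OffsetLookupSketch.java - sparse-index lookup: binary search for the largest
// indexed offset <= the target; a sequential scan would then start at its position.
public class OffsetLookupSketch {
    // Sample entries from the dumped 00000000000022372103.index above.
    static final long[] OFFSETS   = {22372444L, 22372785L, 22373467L, 22373808L, 22374149L, 22374490L};
    static final int[]  POSITIONS = {16365, 32730, 65460, 81825, 98190, 114555};

    // Returns the byte position in the .log file at which a sequential scan
    // for `targetOffset` should begin (0 if the target precedes all entries).
    static int lookupPosition(long targetOffset) {
        int lo = 0, hi = OFFSETS.length - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (OFFSETS[mid] <= targetOffset) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return best == -1 ? 0 : POSITIONS[best];
    }

    public static void main(String[] args) {
        // A fetch for offset 22372790 starts scanning at position 32730
        // (index entry 22372785) rather than at the start of the 1 GiB file.
        System.out.println(lookupPosition(22372790L)); // prints 32730
    }
}

In Kafka itself the index file is mapped into memory, so this binary search touches memory rather than disk.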

[Figure: Mapping between the index file and the data file of a Kafka log segment]


3. Timestamp index file

In the partition directories shown in the previous section, we can also see timestamp index files with the ".timeindex" suffix. This type of timestamp-based index file was introduced in Kafka 0.10.1.1. Its naming is essentially the same as that of the corresponding log data file and offset index file, differing only in the suffix. From the dumped timestamp index content above, each index entry corresponds to an 8-byte timestamp field and a 4-byte offset field, where the timestamp records the largest timestamp seen so far in this LogSegment and the offset is the offset of the new message inserted at that point.
In addition, the timestamp type used by the timestamp index file is consistent with the time type in the log data file, and the timestamp value and offset in an index entry equal the corresponding field values in the log data file. (Note: Kafka also provides an API for accessing messages via the timestamp index.)
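
Based on the entry layout just described (an 8-byte timestamp plus a 4-byte offset), here is a minimal decoding sketch using a ByteBuffer; note that on disk the 4-byte offset is stored relative to the segment's base offset, and the dump tool prints the absolute value (the sample values mirror the timestamp index dump above):

// TimeIndexSketch.java - decodes timestamp index entries (8-byte timestamp +
// 4-byte relative offset) from a byte buffer, as described above.
import java.nio.ByteBuffer;

public class TimeIndexSketch {
    public static void main(String[] args) {
        long baseOffset = 22372103L; // base offset of the segment (from the file name)

        // Build two sample entries in memory; a real .timeindex would be read from disk.
        ByteBuffer buf = ByteBuffer.allocate(2 * 12);
        buf.putLong(1532433067174L).putInt((int) (22372784L - baseOffset));
        buf.putLong(1532433067191L).putInt((int) (22373466L - baseOffset));
        buf.flip();

        while (buf.remaining() >= 12) {
            long timestamp = buf.getLong();          // largest timestamp so far in the segment
            long offset = baseOffset + buf.getInt(); // absolute offset of the message
            System.out.printf("timestamp: %d offset: %d%n", timestamp, offset);
        }
    }
}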

3. Summary

Looking back over the article, the efficiency of Kafka's data storage design comes down to the following points:
(1) Kafka splits each partition of a topic into multiple small segment files; with these small segments it is easy to locate messages by offset and to periodically clean up and delete data files that have already been fully consumed, reducing disk usage.
(2) The log's offset index file is built with sparse index storage and mapped into memory, improving message lookup efficiency while reducing disk I/O.
(3) Kafka turns message appends into sequential writes to the log data file, which greatly improves disk I/O performance.
For anyone using Kafka, mastering its data storage mechanism is of great benefit for performance tuning and troubleshooting of large-scale Kafka clusters. Due to limited space, Kafka's log manager, log loading/recovery, and log cleanup will be covered in Part 2.
Given the limits of the author's knowledge, some of this content may be imperfectly understood; if anything is stated unreasonably, please leave a comment so we can discuss it. Further articles on the Kafka distributed message queue will follow based on my own practice and development. Stay tuned.

Origin www.cnblogs.com/bjxdd/p/12059900.html