Kafka message log files explained in detail: topic-based message classification, partitions, segments, and offset index files

1. Introduction to Kafka

Kafka is a high-throughput distributed publish/subscribe messaging system whose cluster is coordinated through ZooKeeper. It supports partitions and multiple replicas, and offers excellent load balancing, processing performance, and fault tolerance. Kafka uses a publish/subscribe model: a message producer sends messages to a Kafka server (broker), and consumers then read the messages from it. Its logical architecture is shown in the figure below:
[Figure: Kafka logical architecture]
A broker is a server node in the Kafka cluster. Each broker is an independent server that receives messages from producers and stores them in its message log; it also serves requests from consumers by sending messages back to them. The broker only stores messages and notifies the consumers registered with it; each consumer actively pulls messages from the broker according to its subscriptions and configuration.

ZooKeeper is used to manage the cluster's configuration, status, and metadata, ensuring that the distributed messaging system operates normally.

2. Message log files in Kafka

1. Classification of messages

Kafka's unit of data is called a message; a message can be thought of as a "row" or a "record" in a database. Kafka categorizes, organizes, and manages messages by topic, and topics are independent of one another. Topics are defined by the business system to distinguish message types, and producers and consumers are connected through them: messages produced by a producer are written to a topic, and the topic's messages are consumed by designated consumers or consumer groups.
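To make the producer/consumer relationship concrete, here is a minimal sketch using the third-party kafka-python client. The package (installed via `pip install kafka-python`), the broker address `localhost:9092`, and the topic name "orders" are all assumptions for illustration, not part of the article's setup.

```python
# Minimal produce/consume sketch with the kafka-python client (assumed installed).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Messages are published to a topic; Kafka routes each one to a partition.
producer.send("orders", key=b"user-42", value=b"order created")
producer.flush()

# A consumer subscribes to the same topic to read the messages back.
consumer = KafkaConsumer("orders",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```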

2. Partitioned storage of messages

Physically, messages of different topics are stored separately. The messages of a topic can be divided across multiple partitions for storage. Each partition can be understood as an independent message log that stores messages of only one topic; it is the topic's most fine-grained unit of logical storage. In Kafka, each partition corresponds to its own directory on disk, named after the topic name plus the partition number.

When a producer submits messages to a topic, Kafka assigns each message to one of the topic's partitions according to the partitioning strategy (such as range, round-robin, or sticky assignment). The messages in different partitions of the same topic do not overlap.
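The following pure-Python sketch illustrates the idea behind two common strategies. It is an illustration of the concept only, not Kafka's actual partitioner: messages with a key hash to a fixed partition, and keyless messages are spread round-robin.

```python
# Conceptual sketch of partition selection (not Kafka's real implementation).
import itertools
import zlib
from typing import Optional

NUM_PARTITIONS = 3
_counter = itertools.count()

def choose_partition(key: Optional[bytes]) -> int:
    if key is not None:
        # The same key always maps to the same partition,
        # which preserves per-key message order.
        return zlib.crc32(key) % NUM_PARTITIONS
    # Keyless messages are distributed evenly across partitions.
    return next(_counter) % NUM_PARTITIONS

print(choose_partition(b"user-42"))                 # stable for this key
print([choose_partition(None) for _ in range(4)])   # e.g. [0, 1, 2, 0]
```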

3. The message's identifier: the offset

Within a single partition, the stored messages are ordered. As each message is appended to the partition, it is assigned a unique, sequentially increasing offset, which distinguishes it from every other message in that partition. The offset is effectively the message's ID within the partition; when used in file names it is written as a 20-digit number, left-padded with zeros. Kafka guarantees that messages within a single partition are ordered, but messages in different partitions of the same topic are not ordered relative to one another.
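The 20-digit zero-padding is easy to reproduce; a tiny Python illustration:

```python
# The 20-digit, zero-padded form of an offset, as used in segment file names.
offset = 123456789
print(f"{offset:020d}")       # 00000000000123456789
print(f"{offset:020d}.log")   # 00000000000123456789.log
```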

4. Segmentation of the message log

To prevent continuous appending from making a log file too large and retrieval inefficient, a partition is divided into multiple segments. On disk, each segment consists of a message log file that stores the messages plus two index files, and each log file contains one or more messages. Each log file is named "{baseOffset}.log", where baseOffset is the offset of the first message in the file.

Within a segment, the message log is append-only. When the log file or an index file exceeds a certain size, or when the current time minus the file's creation time exceeds a specified interval (both conditions are set by configuration parameters), the log and index files are rolled and a new segment is created, named after the latest offset. The first message stored in the first segment has offset 0, so that file's name is 20 zeros.
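Here is a small sketch of this roll decision and naming rule. The thresholds are made-up numbers standing in for broker settings such as log.segment.bytes and log.roll.ms; nothing here reads real configuration.

```python
# Illustrative sketch of the segment-roll decision described above.
import time
from typing import Optional

MAX_SEGMENT_BYTES = 1024 * 1024 * 1024      # stand-in for log.segment.bytes
MAX_SEGMENT_AGE_MS = 7 * 24 * 3600 * 1000   # stand-in for log.roll.ms

def should_roll(segment_size: int, created_ms: float,
                now_ms: Optional[float] = None) -> bool:
    # Roll when the segment is too big or too old.
    now_ms = time.time() * 1000 if now_ms is None else now_ms
    return (segment_size >= MAX_SEGMENT_BYTES
            or now_ms - created_ms >= MAX_SEGMENT_AGE_MS)

def new_segment_name(next_offset: int) -> str:
    # A new segment is named after the offset of its first message.
    return f"{next_offset:020d}.log"

print(new_segment_name(0))          # 00000000000000000000.log
print(new_segment_name(123456789))  # 00000000000123456789.log
```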

5. The relationship between Topic, partition and Segment

The following diagram shows the logical relationship between topics, partitions, segments, and log files:
[Figure: relationship between Topic, partition, Segment, and log files]

In the figure above, the offset of the first message stored in Segment0 of partition0 is 0 and the offset of its last message is 123456788; the offset of the first message in the second segment is 123456789; and the offset of the first message in the last segment is xxxxxxxxxxxxxxxxxxxxx.

3. Index files for Kafka message log files

1. Kafka index files

Kafka's log files are usually very large, and messages are variable-length, so reading and searching them directly can consume considerable time and resources. To speed up reads, Kafka creates two index files for each log file: the offset index file (suffix ".index") and the timestamp index file (suffix ".timeindex"). Both are sparse indexes: not every message has a corresponding entry in the index file. This keeps the index files small enough to be cached in memory, which improves query speed.

Log and index files differ somewhat across Kafka versions, but the information they record and the mechanisms behind them are essentially the same.

2. Offset index file

The ".index" offset index file is used to establish a mapping relationship between the message offset offset and the physical address of the message stored in the log file. When the length of the written message exceeds a certain amount (specified by the parameter), the offset Shifting the index file will add an offset index entry, which includes the offset of the message and its location in the physical file.

Since each log file's name is the base offset of the messages it stores, when a consumer wants to read the message at offset x, it first obtains the sorted list of log file names in the partition and binary-searches it to find the log file containing x. It then binary-searches that segment's offset index file to locate the largest index entry whose offset is not greater than x (call that offset y), obtains the position p at which the message with offset y is stored in the log file, and finally scans the log file sequentially from p until it finds the message with offset x.
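Here is an illustrative pure-Python version of that two-level lookup, with simple in-memory lists standing in for the .log and .index files:

```python
# Two-level binary search: segment by base offset, then the sparse index,
# then a sequential scan. Data structures are illustrative stand-ins.
import bisect

base_offsets = [0, 123456789, 246913578]          # sorted segment names
# Per-segment sparse index: parallel lists of offsets and file positions.
seg_index = {0: ([0, 40, 80], [0, 4000, 8000])}
# Stand-in for the .log file: (offset, position) of every stored message.
seg_log = {0: [(o, o * 100) for o in range(120)]}

def find_message(x: int):
    # 1. Binary search over segment base offsets (the file names).
    seg = base_offsets[bisect.bisect_right(base_offsets, x) - 1]
    # 2. Binary search the sparse index: largest indexed offset <= x.
    offsets, positions = seg_index[seg]
    start_pos = positions[bisect.bisect_right(offsets, x) - 1]
    # 3. Sequential scan of the log from that position until offset x.
    for offset, pos in seg_log[seg]:
        if pos >= start_pos and offset == x:
            return seg, pos
    return None

print(find_message(57))  # (0, 5700): scanned forward from position 4000
```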

3. Timestamp index file

The ".timeindex" file stores the mapping relationship between the timestamp of the message and the offset offset of the message. It fragments the message according to the timestamp and records the timestamp of the last message in each fragment and the corresponding offset offset. Quantity for quick search of messages in chronological order. When the length of the message written by Kafka exceeds a certain amount (specified by the parameter) or the timestamp of the new message and the timestamp of the previous index entry exceed a certain length (specified by the parameter), a timestamp index entry is added to the timestamp index file.

To find log messages at a given timestamp, binary-search the timestamp index file for the largest entry whose timestamp is not greater than the target, take the offset y recorded in that entry, and then use y to query the offset index file and locate the position p in the log file where the message is stored.
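An illustrative sketch of that lookup path, again with in-memory lists standing in for the .timeindex file:

```python
# Timestamp lookup sketch: binary-search the time index for the last entry
# with timestamp <= target, then resolve the returned offset through the
# offset index (see find_message above). All data here is illustrative.
import bisect

ts_list = [1000, 2000, 3000, 4000]   # timestamps (ms) of indexed messages
off_list = [0, 35, 70, 105]          # offsets recorded alongside them

def offset_for_timestamp(target_ts: int) -> int:
    i = bisect.bisect_right(ts_list, target_ts) - 1
    if i < 0:
        return off_list[0]           # target precedes all indexed messages
    return off_list[i]

y = offset_for_timestamp(2500)
print(y)  # 35: from here the offset index locates the log position
```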

4. Index file summary

Message offsets within a Kafka partition increase monotonically. To improve performance, each partition's log is split into multiple segments; each segment holds a bounded amount of message data and has its own offset index file and timestamp index file, both of which are sparse indexes.

The offset index file stores the mapping from message offsets to physical file positions and is used to access messages by offset. The timestamp index file stores the mapping from timestamps to offsets and is used to access messages by time; in practice, a timestamp lookup must be combined with the offset index to actually reach the data.

Because the index files are sparse, in most cases the index yields only the approximate location of a message, and the log must then be read sequentially from that location until the target is found. But since offsets are assigned in increasing order and written sequentially to the log, the log is partitioned and segmented, and the index files are small enough to be cached, this indexing mechanism is highly efficient overall.

5. Summary

This article has introduced in detail the logical concepts behind Kafka's message log files, their physical storage in partitions, the segmentation of log files, and the index files, together with how they relate to one another. It has also described the structure and lookup mechanism of the offset index and timestamp index files. This should help clarify topic-based message classification, partitions, segments, offsets, and index files.
