Kafka data storage form and data cleaning

In Kafka, data exists in the form of logs

Kafka's Storage Log

How Kafka stores data on disk:

  • The data in Kafka is stored in /export/server/kafka_2.12-2.4.1/data
  • Messages are saved in folders named "topic name-partition ID"
  • Each partition folder contains the following files


| File name | Description |
| --- | --- |
| 00000000000000000000.index | Offset index file; lookups by offset go through this index file |
| 00000000000000000000.log | Log data file |
| 00000000000000000000.timeindex | Time index file |
| leader-epoch-checkpoint | Persists the LEO of each partition leader (log end offset, the offset of the next message to be written to the log file) |
  • Each log file is named after its starting offset; since the starting offset of each partition is 0, the first log file of a partition is 00000000000000000000.log
  • The default maximum size of each log file is log.segment.bytes = 1024 * 1024 * 1024, i.e. 1 GB
  • To make it easy to locate messages by offset, Kafka names each log file after its starting offset (see the listing sketch below)
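
As a rough illustration, listing one partition's folder shows exactly these files. The path and topic name below are only examples based on this article's environment, not prescriptive values:

```bash
# List the segment files of partition 0 of a topic named "test"
# (log.dirs points to /export/server/kafka_2.12-2.4.1/data in this setup; adjust as needed)
ls -lh /export/server/kafka_2.12-2.4.1/data/test-0
# Typical contents:
#   00000000000000000000.index
#   00000000000000000000.log
#   00000000000000000000.timeindex
#   leader-epoch-checkpoint
```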

log observation mode

To make observation easier, create a new topic "test_10m" whose log segment files have a maximum size of 10 MB:

bin/kafka-topics.sh --create --zookeeper node1.itcast.cn --topic test_10m --replication-factor 2 --partitions 3 --config segment.bytes=10485760

After producing data into this topic with the producer program from the earlier section, you can observe that the maximum size of each log file is 10 MB.
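
If you do not have the earlier producer program at hand, Kafka's built-in perf-test producer can generate comparable test data. The broker address and record counts below are assumptions for this environment:

```bash
# Produce roughly 100 MB of test data so that several 10 MB segments get rolled
bin/kafka-producer-perf-test.sh \
  --topic test_10m \
  --num-records 100000 \
  --record-size 1000 \
  --throughput -1 \
  --producer-props bootstrap.servers=node1.itcast.cn:9092

# Check the segment sizes of one partition afterwards
ls -lh /export/server/kafka_2.12-2.4.1/data/test_10m-0
```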

log write mode

  • New messages are always written to the last (active) log segment file
  • When that file reaches the configured size (default: 1 GB), Kafka rolls over to a new segment file (see the sketch below for changing this size per topic)
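
The roll size can also be changed on an existing topic. A minimal sketch using kafka-configs.sh, assuming the same ZooKeeper address and the test_10m topic from above:

```bash
# Set the topic-level segment size to 10 MB; new segments will roll at this size
bin/kafka-configs.sh --zookeeper node1.itcast.cn \
  --entity-type topics --entity-name test_10m \
  --alter --add-config segment.bytes=10485760
```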

Log read mode

  • To read by offset, first locate the segment that stores the data (note: the offset is the partition-wide, global offset)
  • Then convert the global partition offset into an offset relative to that segment file
  • Finally, read the message using this segment-relative offset
  • To improve query efficiency, each file maintains its offset range in memory, and a simple binary search is used during lookup (the underlying files can be inspected with the tool sketched below)
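
To see the index and segment data this lookup relies on, Kafka ships a DumpLogSegments tool. A sketch, assuming the example data directory and topic used earlier in this article:

```bash
# Dump the offset index of the first segment (offset -> file position mappings)
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /export/server/kafka_2.12-2.4.1/data/test_10m-0/00000000000000000000.index

# Dump the log segment itself, printing each record's offset and payload
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /export/server/kafka_2.12-2.4.1/data/test_10m-0/00000000000000000000.log \
  --print-data-log
```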

delete message

  • In Kafka, messages are cleaned up periodically, deleting log files one segment at a time
  • Kafka's log manager determines which files can be deleted according to Kafka's configuration

data backlog problem

Kafka consumers can consume data very quickly, but if message processing is slowed down by external I/O or network congestion, data will back up (accumulate) in Kafka. Once a backlog builds up, the timeliness of the data suffers greatly.

When a data backlog occurs in Kafka, the first step is to find its cause; checking the per-partition consumer lag is a good starting point (see the sketch below).
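
A quick way to see where the backlog is, and how large, is to describe the consumer group and look at the LAG column. The broker address and group name below are assumptions:

```bash
# Show committed offset, log end offset and lag for every partition the group consumes
bin/kafka-consumer-groups.sh \
  --bootstrap-server node1.itcast.cn:9092 \
  --describe \
  --group my-consumer-group
```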

Here are a few typical scenarios in which data backlogs occur in production:

  1. Failed to write data into MySQL

Description of the problem:
One day, the operations staff came to the developers and reported a data backlog in a partition of a very important topic; users had started to complain. Operations got nervous and quickly restarted the machine, but the restart did not help.

Problem analysis:
The code consuming this topic is fairly simple: it consumes the topic's data, makes some judgments, and performs database operations. Using Kafka Eagle, operations located the backlogged topic and found that one of its partitions had a backlog of hundreds of thousands of messages.
Finally, checking the logs showed that an error while writing data to MySQL prevented the consumer from committing the partition's offset, which is why the backlog became so severe.

  2. Consumption failures caused by network delay

Description of the problem:
A system built on Kafka had been running smoothly for two months. Then one day a data backlog appeared in a certain topic, with roughly tens of thousands of messages left unconsumed.

Problem analysis:
Checking the application log revealed a large number of consumption timeout failures. The root cause was network jitter on that day; the Kafka consumer timeout had been configured at 50 ms, and increasing it to 500 ms solved the problem.

data cleaning

Kafka's messages are stored on disk. To control the disk space they occupy, Kafka must continually clean up old messages. Each Kafka partition consists of many log files, which also makes log cleaning easier. Kafka provides two log cleaning methods:

  1. Log deletion: logs that no longer meet the retention conditions are deleted directly according to the configured policy.
  2. Log compaction: messages are consolidated by key; for messages with the same key but different values, only the latest version is kept.

In Kafka's broker or topic configuration:

| Configuration item | Value | Description |
| --- | --- | --- |
| log.cleaner.enable | true (default) | Enables automatic log cleaning |
| log.cleanup.policy | delete (default) | Delete logs |
| log.cleanup.policy | compact | Compact logs |
| log.cleanup.policy | delete,compact | Support deletion and compaction at the same time |
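
For example, compaction can be enabled per topic through the topic-level cleanup.policy property. A sketch, assuming the ZooKeeper address and topic used earlier:

```bash
# Switch the topic's cleanup policy from delete (the default) to compact
bin/kafka-configs.sh --zookeeper node1.itcast.cn \
  --entity-type topics --entity-name test_10m \
  --alter --add-config cleanup.policy=compact
```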

log deletion

Log deletion is performed periodically in units of segments (segment logs).

Scheduled log deletion task
The Kafka log manager runs a dedicated log deletion task that periodically detects and deletes log segment files that no longer meet the retention conditions. The check period is configured with the broker-side parameter log.retention.check.interval.ms; the default value is 300,000 ms, i.e. 5 minutes. There are three retention strategies for log segments:

  1. Time-Based Retention Policy
  2. Retention policy based on log size
  3. Retention policy based on log start offset


Time-Based Retention Policy

The following three configurations specify that messages in Kafka older than the configured threshold are cleaned up automatically:

  • log.retention.hours
  • log.retention.minutes
  • log.retention.ms

Their priority is log.retention.ms > log.retention.minutes > log.retention.hours. By default, the broker is configured as follows:
log.retention.hours=168
That is, the default log retention time is 168 hours, which is equivalent to 7 days.
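
Retention can also be overridden per topic. A sketch using the topic-level retention.ms property, assuming the environment used earlier in this article:

```bash
# Keep this topic's messages for 1 day (86,400,000 ms) instead of the broker default of 7 days
bin/kafka-configs.sh --zookeeper node1.itcast.cn \
  --entity-type topics --entity-name test_10m \
  --alter --add-config retention.ms=86400000
```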

When deleting log segments:

  1. Remove the segments to be deleted from the skip list of log segments maintained by the log object, so that no thread can read them anymore
  2. Add the suffix ".deleted" to the log segment files (including the segment's index files)
  3. A background scheduled task in Kafka then deletes these ".deleted" files; the delay before this task runs can be set with the file.delete.delay.ms parameter, whose default value is 60,000 ms, i.e. 1 minute

Retention policy based on log size

The log deletion task checks whether the total size of the current log exceeds the configured threshold and, if so, determines the set of log segment files that can be deleted. The threshold is configured with the broker-side parameter log.retention.bytes; its default value is -1, meaning unlimited. When the size is exceeded, the excess segments are deleted automatically.

Note: log.retention.bytes configures the total size of the log (the whole partition), not the size of a single log segment; a partition's log consists of multiple segments.

Retention policy based on log start offset

Every log segment has its own start offset; segments whose start offset is less than the partition's logStartOffset are marked for deletion.
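
logStartOffset normally only moves forward when old segments are deleted, but it can also be advanced manually with kafka-delete-records.sh, after which segments below the new start offset become eligible for deletion. A sketch, with an assumed broker address, topic and target offset:

```bash
# Advance the log start offset of test_10m partition 0 to 50000
cat > offsets.json <<'EOF'
{"partitions": [{"topic": "test_10m", "partition": 0, "offset": 50000}], "version": 1}
EOF

bin/kafka-delete-records.sh \
  --bootstrap-server node1.itcast.cn:9092 \
  --offset-json-file offsets.json
```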

log compaction

Log compaction is a way to clean up obsolete data besides the default log deletion. For each key, only the latest version of the corresponding value is kept.

  • After log compaction runs, offsets are no longer contiguous, but the segments can still be queried by offset
  • The offset of each surviving message stays the same before and after compaction; compaction produces new segment files
  • Log compaction works on message keys, so make sure every message carries a non-null key
  • With log compaction, the latest update of each key is retained, and consumers can rebuild the latest state from the compacted log (a topic-creation sketch follows this list)
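
A compacted topic is simply a topic whose cleanup.policy is compact. A minimal creation sketch, assuming the same ZooKeeper address as before and a hypothetical topic name:

```bash
# Create a compacted topic; every message must carry a non-null key
bin/kafka-topics.sh --create --zookeeper node1.itcast.cn \
  --topic user_state --replication-factor 2 --partitions 3 \
  --config cleanup.policy=compact
```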

Origin blog.csdn.net/weixin_45970271/article/details/126575666