Kafka Learning Road (2) Kafka Architecture

1. Architecture of Kafka

As shown in the figure above, a typical Kafka cluster contains several Producers (which can be page views generated by the web front-end, server logs, system CPU and memory metrics, etc.), several brokers (Kafka supports horizontal scaling; generally, the more brokers, the higher the cluster throughput), several Consumer Groups, and a Zookeeper cluster. Kafka uses Zookeeper to manage cluster configuration, elect leaders, and rebalance when a Consumer Group changes. Producers publish messages to brokers in push mode, and Consumers subscribe to and consume messages from brokers in pull mode.

2. Topics and Partition

Logically, a Topic can be regarded as a queue; every message must specify a Topic, which can be understood simply as specifying which queue the message is put into. To scale Kafka's throughput linearly, a Topic is physically divided into one or more Partitions, and each Partition corresponds to a folder on disk that stores all of that Partition's messages and index files. When creating a Topic you can specify the number of Partitions: more Partitions mean higher throughput, but also require more resources and raise the risk of unavailability. After receiving a message from a Producer, Kafka stores it in a Partition according to the configured balancing strategy. Because every message is appended to its Partition, writes are sequential disk writes, which is very efficient (it has been shown that sequential disk writes are faster than random writes to memory, and this is an important factor behind Kafka's high throughput).
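As a concrete illustration, the sketch below creates a Topic with a fixed number of Partitions using the Java AdminClient from the modern kafka-clients library (the original article predates this API); the topic name, partition count, and replication factor are arbitrary example values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 -- illustrative values only.
            NewTopic topic = new NewTopic("page-views", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}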

In traditional message queues, messages that have been consumed are usually deleted, whereas a Kafka cluster retains all messages regardless of whether they have been consumed. Of course, disk space is limited, so it is impossible (and unnecessary) to keep all data forever, and Kafka therefore provides two strategies for deleting old data: one based on time and one based on Partition file size. For example, you can configure $KAFKA_HOME/config/server.properties so that Kafka deletes data older than one week, or deletes old segments once a Partition file exceeds 1GB. The configuration is as follows:

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according to the retention policies
log.retention.check.interval.ms=300000
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false

Because the time complexity of reading a specific message in Kafka is O(1), independent of file size, deleting expired files has nothing to do with improving Kafka's performance; the choice of deletion strategy depends only on disk capacity and your specific needs. In addition, Kafka keeps a piece of metadata for each Consumer Group: the position of the message currently being consumed, i.e. the offset. The offset is controlled by the Consumer. Normally the Consumer advances the offset after consuming a message, but it can also set the offset to a smaller value and re-consume some messages. Because the offset is controlled by the Consumer, the Kafka broker is stateless: it does not need to mark which messages have been consumed, nor does it need a lock mechanism to ensure that only one Consumer in a Consumer Group consumes a given message. This, too, contributes strongly to Kafka's high throughput.
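To make the Consumer-controlled offset concrete, the sketch below rewinds a Partition with the modern Java client's seek() call; the topic name, partition number, and offset are arbitrary example values, and the new consumer API is used here instead of the old high-level API discussed in the text.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker
        props.put("group.id", "replay-group");            // example group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L); // move the offset back to re-consume older messages
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}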

3. Producer message routing

When a Producer sends a message to the broker, it chooses which Partition to store it in according to the partitioning mechanism. If the partitioning mechanism is set up properly, all messages are distributed evenly across Partitions, achieving load balancing. If a Topic corresponded to a single file, the I/O of the machine holding that file would become the Topic's performance bottleneck; with Partitions, different messages can be written in parallel to different Partitions on different brokers, which greatly improves throughput. You can set the default number of Partitions for new Topics through the num.partitions option in $KAFKA_HOME/config/server.properties, specify it as a parameter when creating a Topic, or modify it afterwards with the tools provided by Kafka.

When sending a message, you can specify a key, and the Producer uses this key together with the partitioning mechanism to decide which Partition the message should be sent to. The partitioning mechanism is specified via the Producer's partitioner.class parameter, and the configured class must implement the kafka.producer.Partitioner interface.
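The interface above belongs to the old Scala producer. As a hedged sketch of the same idea with the modern Java client, the class below implements org.apache.kafka.clients.producer.Partitioner and is wired in through the partitioner.class producer property; the hashing scheme is only an example.

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Example: route messages to a partition by hashing the key bytes.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: fall back to partition 0 in this simple sketch
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

// Wired in via the producer configuration, e.g.:
// props.put("partitioner.class", KeyHashPartitioner.class.getName());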

4. Consumer Group

When using the Consumer high-level API, a message on a Topic is consumed by only one Consumer within a given Consumer Group, but multiple Consumer Groups can each consume the same message.

This is how Kafka implements both broadcast (delivering a Topic's message to all Consumers) and unicast (delivering it to a single Consumer). A Topic can be consumed by multiple Consumer Groups: to implement broadcast, give each Consumer its own Group; to implement unicast, put all Consumers in the same Group. Consumer Groups also let Consumers be grouped freely without having to send the same messages to multiple Topics.
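A minimal sketch of this, assuming the modern Java consumer: whether a deployment behaves as unicast or broadcast is decided purely by the group.id each Consumer instance uses (the group and topic names below are arbitrary examples).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupIdExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker
        // Unicast: start every instance with the SAME group.id, e.g. "billing-service",
        //          so each message is handled by exactly one of them.
        // Broadcast: give every instance its OWN group.id, e.g. "audit-" + hostname,
        //            so each instance receives every message.
        props.put("group.id", "billing-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            consumer.poll(Duration.ofSeconds(1)); // join the group and start fetching
        }
    }
}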

In fact, one of Kafka's design philosophies is to support offline and real-time processing at the same time. Following this idea, a real-time stream processing system such as Storm can process messages online in real time, a batch processing system such as Hadoop can process them offline, and the data can simultaneously be replicated to another data center in real time; the Consumers used by these three operations simply belong to different Consumer Groups.

5. Push vs. Pull

As a messaging system, Kafka follows the traditional design: Producers push messages to the broker, and Consumers pull messages from the broker. Some logging-centric systems, such as Facebook's Scribe and Cloudera's Flume, use a push model instead. In fact, push and pull each have their own advantages and disadvantages.

The push mode has difficulty adapting to Consumers with different consumption rates because the send rate is determined by the broker. Push aims to deliver messages as fast as possible, which can easily leave the Consumer unable to keep up, typically resulting in denial of service and network congestion. Pull mode, by contrast, lets the Consumer consume messages at a rate that matches its own capacity.

For Kafka, the pull mode is more appropriate. It simplifies the broker's design and lets the Consumer control both the rate at which it consumes messages and the consumption style: in batches or one by one. The Consumer can also choose different commit strategies to achieve different delivery semantics.
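A minimal sketch of a pull-style loop with the modern Java consumer: the Consumer fetches at its own pace, and max.poll.records (a real client setting; the value here is an arbitrary example) bounds how many messages each batch may contain.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullLoopExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker
        props.put("group.id", "metrics-pipeline");        // example group
        props.put("max.poll.records", "100");             // cap the batch size per poll
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // The Consumer decides when to pull and how long to wait for data.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : batch) {
                    process(r); // handle one message at a time, or collect the whole batch first
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset());
    }
}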

6. Kafka delivery guarantee

There are several possible delivery guarantees:

At most once: messages may be lost, but are never redelivered.

At least once: messages are never lost, but may be retransmitted.

Exactly once: every message is delivered once and only once, which is what users want in many cases.

When the Producer sends a message to the broker, once the message is committed it will not be lost, thanks to replication. However, if the Producer hits a network problem after sending the data and communication is interrupted, it cannot tell whether the message was committed. Although Kafka cannot determine what happened during the network failure, the Producer can generate something like a primary key and retry idempotently on failure, thereby achieving Exactly once.
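The "primary key plus idempotent retry" idea described above is what later Kafka client versions (0.11 and above) expose directly as the idempotent producer; a minimal sketch, assuming the modern Java client and example topic and broker names:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker
        props.put("enable.idempotence", "true");          // broker de-duplicates producer retries
        props.put("acks", "all");                         // wait for the full in-sync replica set
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-1", "clicked"));
            producer.flush();
        }
    }
}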

Next is the delivery-guarantee semantics from broker to Consumer (this applies only to the Kafka Consumer high-level API). After the Consumer reads a message from the broker it can choose to commit, which saves the offset it has read in that Partition to Zookeeper; the next time the Consumer reads that Partition, it starts from the following entry. If it does not commit, the next read starts from the same position as after the previous commit. The Consumer can also be set to autocommit, i.e. it commits automatically as soon as it reads data. If we consider only this reading process, Kafka guarantees Exactly once. In practice, however, the application is not finished once the Consumer has read the data; it still needs to process it, and the order of processing and committing largely determines the delivery-guarantee semantics between broker and Consumer.
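A minimal sketch of how the ordering of processing and committing changes the semantics, again assuming the modern Java consumer with autocommit turned off (topic and group names are arbitrary examples):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitOrderExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker
        props.put("group.id", "order-matters");           // example group
        props.put("enable.auto.commit", "false");         // commit manually to control semantics
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                // At least once: process first, commit afterwards. A crash between the two
                // steps means the batch is read again and processed again after restart.
                for (ConsumerRecord<String, String> r : batch) {
                    handle(r);
                }
                consumer.commitSync();
                // At most once would be the reverse: commitSync() right after poll(), then
                // process; a crash after the commit loses the still-unprocessed messages.
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> r) {
        System.out.println(r.value());
    }
}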

Kafka guarantees At least once by default, and At most once can be achieved by setting the Producer to commit asynchronously. Exactly once requires cooperation with an external storage system; fortunately, the offsets provided by Kafka can be used for this very directly and easily.

 
