[Big Data - Kafka] Kafka Analysis (1): Kafka Background and Architecture Introduction

Kafka is a distributed messaging system developed at LinkedIn and written in Scala. It is widely used for its horizontal scalability and high throughput. A growing number of open source distributed processing platforms and systems, such as Cloudera, Apache Storm, and Spark, support integration with Kafka. InfoQ has been paying close attention to the application and development of Kafka. The "Kafka Analysis" column will analyze Kafka in depth from the aspects of architecture design, implementation, application scenarios, and performance.

Background introduction

Kafka creation background

Kafka is a messaging system originally developed at LinkedIn, where it serves as the basis for LinkedIn's Activity Stream and operational data processing pipelines. It is now used by many different types of companies for many kinds of data pipelines and messaging systems.

Activity stream data is the most common kind of data that almost any site uses when reporting on its usage. Activity data includes page views, information on content viewed, and searches. This kind of data is usually processed by first writing the activities to log files, and then periodically performing statistical analysis on those files. Operational data refers to server performance data (CPU, I/O usage, request time, service logs, etc.). There is a wide variety of statistical methods for operational data.

In recent years, processing activity and operational data has become a vital part of website software products, and it requires a somewhat more complex infrastructure to support it.

Introduction to Kafka

Kafka is a distributed, publish/subscribe based messaging system. The main design goals are as follows:

  • Message persistence with O(1) time complexity, so that constant-time access performance is maintained even for data volumes above the TB level.
  • High throughput: even on very cheap commodity machines, a single machine can support more than 100K messages per second.
  • Support for message partitioning across Kafka servers and for distributed consumption, while guaranteeing the ordered delivery of messages within each Partition.
  • Support for both offline and real-time data processing.
  • Scale out: support for online horizontal scaling.

Why use a messaging system

  • Decoupling

    At the start of a project it is extremely difficult to predict what requirements it will face in the future. A messaging system inserts an implicit, data-based interface layer between two processes, and both processes must implement this interface. This allows you to extend or modify either process independently, as long as both obey the same interface constraints.

  • Redundancy

    In some cases the process that handles the data fails, and the data is lost unless it has been persisted. Message queues avoid this risk by persisting data until it has been fully processed. In the "insert-get-delete" paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before the message is removed from the queue, which ensures that your data is kept safe until you are done using it.

  • Scalability

    Because message queues decouple your processing, it is easy to increase the rate at which messages are enqueued and processed simply by adding more processing instances. There is no need to change the code or adjust parameters. Scaling is as easy as turning up a power dial.

  • Flexibility & Peak Handling

    In the case of a surge in traffic, the application still needs to continue to function, but such bursts of traffic are not common; it is undoubtedly a huge waste to invest resources to be on standby at any time according to the standard of being able to handle such peak traffic. Using message queues enables critical components to withstand sudden access pressures without completely crashing due to sudden overloaded requests.

  • Recoverability

    When one part of the system fails, it does not affect the entire system. Message queues reduce the coupling between processes, so even if a process that handles messages dies, messages added to the queue can still be processed after the system recovers.

  • Order guarantee

    In most use cases, the order in which data is processed matters. Most message queues are inherently ordered and guarantee that data will be processed in a specific order. Kafka guarantees the ordering of messages within a Partition.

  • Buffering

    In any non-trivial system, there are components that require different processing times. For example, loading an image takes less time than applying filters to it. Message queues use a buffer layer to help tasks run as efficiently as possible, because writes to the queue can proceed as fast as possible. This buffering helps control and optimize the speed at which data flows through the system.

  • Asynchronous communication

    Many times the user does not want or need to process the message immediately. Message queues provide asynchronous processing mechanisms that allow users to put a message on the queue, but not process it immediately. Put as many messages as you want into the queue, and then process them when needed.

Common Message Queue Comparison

  • RabbitMQ

    RabbitMQ is an open source message queue written in Erlang. It supports many protocols: AMQP, XMPP, SMTP, and STOMP. Because of this, it is quite heavyweight and better suited to enterprise-level development. It implements a broker architecture, which means that messages are queued in a central queue before being sent to the client. It has good support for routing, load balancing, and data persistence.

  • Redis

    Redis is a NoSQL database based on key-value pairs, with active development and maintenance. Although it is a key-value storage system, it supports MQ functions, so it can be used as a lightweight queue service. In one test, enqueue and dequeue operations were each executed 1 million times against RabbitMQ and Redis, and the execution time was recorded every 100,000 operations. The test data came in four sizes: 128 bytes, 512 bytes, 1K, and 10K. The experiments show that for enqueueing, Redis outperforms RabbitMQ when the data is relatively small, but becomes unbearably slow once the data size exceeds 10K; for dequeueing, Redis performs very well regardless of data size, while the dequeue performance of RabbitMQ is much lower than that of Redis.

  • ZeroMQ

    ZeroMQ is known as the fastest message queuing system, especially for high-throughput scenarios. ZeroMQ can implement advanced or complex queues that RabbitMQ is not good at, but developers need to combine multiple technical frameworks themselves, and this technical complexity is a challenge to applying it successfully. ZeroMQ has a unique non-middleware model: you don't need to install and run a message server or middleware, because your application plays the server role. All you need is a reference to the ZeroMQ library, which can be installed using NuGet, and you can then send messages between applications. However, ZeroMQ only provides non-persistent queues, which means that data is lost if the process goes down. Twitter's Storm used ZeroMQ as its default data stream transport in versions earlier than 0.9.0 (since version 0.9, Storm has supported both ZeroMQ and Netty as transport modules).

  • ActiveMQ

    ActiveMQ is a sub-project under Apache. Similar to ZeroMQ, it can implement queues using both broker and peer-to-peer approaches. At the same time, similar to RabbitMQ, it can efficiently implement advanced application scenarios with a small amount of code.

  • Kafka/Jafka

    Kafka is a sub-project under Apache. It is a high-performance, cross-language, distributed publish/subscribe message queue system. Jafka was incubated on top of Kafka, as an upgraded version of Kafka. Kafka has the following characteristics: fast persistence, with message persistence at O(1) system overhead; high throughput, reaching 100,000 messages per second on an ordinary server; a fully distributed design, in which Brokers, Producers, and Consumers all natively support distribution and automatic load balancing; and support for parallel data loading into Hadoop, which makes it a feasible solution for log data and offline analysis systems like Hadoop that also have real-time processing requirements. Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism. Compared with ActiveMQ, Apache Kafka is a very lightweight messaging system; besides performing very well, it is a distributed system that works well.

Kafka Architecture

Terminology

  • Broker

    A Kafka cluster contains one or more servers, which are called brokers.

  • Topic

    Every message published to a Kafka cluster has a category called a topic. (Physically, messages of different topics are stored separately. Logically, although messages of a topic are stored on one or more brokers, users only need to specify the topic of the message to produce or consume data without caring where the data is stored.)

  • Partition

    Partition is a physical concept, and each Topic contains one or more Partitions.

  • Producer

    Message producer, a client responsible for publishing messages to the Kafka broker.

  • Consumer

    Message consumer, a client that reads messages from the Kafka broker.

  • Consumer Group

    Each Consumer belongs to a specific Consumer Group (a group name can be specified for each Consumer, if no group name is specified, it belongs to the default group).

Kafka topology

As shown in the figure above, a typical Kafka cluster contains several Producers (which can be page views generated by the web front end, server logs, system CPU and memory metrics, etc.), several brokers (Kafka supports horizontal scaling; in general, the more brokers, the higher the cluster throughput), several Consumer Groups, and a Zookeeper cluster. Kafka uses Zookeeper to manage the cluster configuration, elect leaders, and rebalance when a Consumer Group changes. Producers use push mode to publish messages to brokers, and Consumers use pull mode to subscribe to and consume messages from brokers.

Topic & Partition

A Topic can logically be regarded as a queue. Every message must specify its Topic, which can be simply understood as specifying which queue the message is put into. To let Kafka's throughput scale linearly, a Topic is physically divided into one or more Partitions. Each Partition physically corresponds to a folder, and all message and index files of that Partition are stored in this folder. If two topics, topic1 and topic2, are created with 13 and 19 partitions respectively, a total of 32 folders will be generated across the entire cluster (the cluster used in this article has 8 nodes in total, and the replication-factor of both topic1 and topic2 is 1), as shown in the figure below.

Each log file is a sequence of log entries. Each log entry contains a 4-byte integer (whose value is N+5), a 1-byte "magic value", and a 4-byte CRC checksum, followed by a message body of N bytes. Each message has a unique 64-bit offset within its Partition, which indicates the starting position of the message. Messages stored on disk have the following format:

message length : 4 bytes (value: 1+4+n)
"magic" value : 1 byte
crc : 4 bytes
payload : n bytes
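As a rough illustration of the layout listed above, the following sketch packs and unpacks one entry with the JDK's ByteBuffer and CRC32. It is not Kafka's own code: the class name LogEntrySketch, the magic value 0, and the choice to compute the checksum over the payload only are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper that mirrors the on-disk layout described in the text.
final class LogEntrySketch {
    static final byte MAGIC = 0; // assumed value, for illustration only

    // Layout: length (4 bytes, value 1 + 4 + n) | magic (1 byte) | crc (4 bytes) | payload (n bytes)
    static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + 4 + payload.length);
        buf.putInt(1 + 4 + payload.length);
        buf.put(MAGIC);
        buf.putInt(checksum(payload));
        buf.put(payload);
        return buf.array();
    }

    // Reads the payload back and verifies the stored checksum.
    static byte[] decode(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);
        int length = buf.getInt();
        buf.get(); // skip the magic byte
        int storedCrc = buf.getInt();
        byte[] payload = new byte[length - 5];
        buf.get(payload);
        if (checksum(payload) != storedCrc) {
            throw new IllegalStateException("CRC mismatch");
        }
        return payload;
    }

    // Assumption: the CRC covers the payload bytes only.
    private static int checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return (int) crc.getValue();
    }
}
```

For a 5-byte payload, the length field holds 10 and the whole entry occupies 14 bytes.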

The log of a Partition is not a single file but is divided into multiple segments. Each segment is named after the offset of its first message and has the suffix ".kafka". In addition, there is an index file that records the offset range of the log entries contained in each segment, as shown in the following figure.
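Because each segment is named after the offset of its first message, finding the segment that holds a given offset is a "greatest base offset not exceeding the target" lookup. The sketch below shows that idea with a TreeMap. The class name SegmentLookupSketch and the zero-padded 20-digit file name are assumptions for illustration, not Kafka's actual implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index from a segment's base offset to its file name.
final class SegmentLookupSketch {
    private final TreeMap<Long, String> segments = new TreeMap<>();

    // Registers a segment whose first message has the given offset.
    void addSegment(long baseOffset) {
        // Assumed naming scheme: zero-padded base offset plus the ".kafka" suffix.
        segments.put(baseOffset, String.format("%020d.kafka", baseOffset));
    }

    // The segment containing `offset` has the greatest base offset <= offset.
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        if (entry == null) {
            throw new IllegalArgumentException("offset precedes the first segment: " + offset);
        }
        return entry.getValue();
    }
}
```

The lookup costs O(log s) in the number of segments s, which is independent of how many messages each segment holds.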

Because each message is appended to the Partition, writes to disk are sequential and therefore very efficient (it has been verified that sequential writes to disk can be more efficient than random writes to memory, which is a very important guarantee for Kafka's high throughput).

For traditional message queues, messages that have been consumed are generally deleted, while the Kafka cluster retains all messages, whether they are consumed or not. Of course, because of disk limitations, it is impossible to keep all data permanently (and in fact it is not necessary), so Kafka provides two strategies to delete old data. One is based on time, and the other is based on Partition file size. For example, you can configure $KAFKA_HOME/config/server.properties to let Kafka delete data from a week ago, or delete old data when the Partition file exceeds 1GB. The configuration is as follows.

  
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824
# The interval at which log segments are checked to see if they can be deleted according to the retention policies
log.retention.check.interval.ms=300000
# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.
log.cleaner.enable=false
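The time-based rule can be pictured as a periodic scan that compares each segment's age with the configured retention. The following is only a sketch of that comparison, not the broker's code: the Segment type, the use of a last-modified timestamp, and the method name expired are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a time-based retention check.
final class RetentionSketch {
    static final class Segment {
        final String name;
        final long lastModifiedMs;

        Segment(String name, long lastModifiedMs) {
            this.name = name;
            this.lastModifiedMs = lastModifiedMs;
        }
    }

    // Returns the names of segments last modified more than retentionHours before nowMs.
    static List<String> expired(List<Segment> segments, long nowMs, long retentionHours) {
        long cutoffMs = nowMs - TimeUnit.HOURS.toMillis(retentionHours);
        List<String> result = new ArrayList<>();
        for (Segment segment : segments) {
            if (segment.lastModifiedMs < cutoffMs) {
                result.add(segment.name);
            }
        }
        return result;
    }
}
```

With retentionHours set to 168, as in the configuration above, a segment untouched for more than a week would be reported as eligible for deletion.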

It should be noted that because the time complexity of reading a specific message in Kafka is O(1), independent of the file size, deleting expired files has nothing to do with improving Kafka's performance. The deletion strategy you choose depends only on the disk and your specific needs.

In addition, Kafka keeps some metadata for each Consumer Group: the position of the currently consumed message, that is, the offset. This offset is controlled by the Consumer. Under normal circumstances, the Consumer increments the offset after consuming a message. The Consumer can also set the offset to a smaller value and re-consume some messages. Because the offset is controlled by the Consumer, the Kafka broker is stateless. It does not need to mark which messages have been consumed, nor does the broker need to ensure that only one Consumer in the same Consumer Group consumes a given message, so no locking mechanism is required, which also provides a strong guarantee for Kafka's high throughput.
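The division of responsibility described here, an append-only log on the broker side and a position owned by the consumer, can be sketched in a few lines. This is a simplified in-memory model with hypothetical names (PartitionLog, Consumer, seek), not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetSketch {
    // Append-only log: it keeps no per-consumer state and never deletes on read.
    static final class PartitionLog {
        private final List<String> messages = new ArrayList<>();

        void append(String message) {
            messages.add(message);
        }

        // Direct index lookup; returns null when the offset is past the end.
        String read(int offset) {
            return offset < messages.size() ? messages.get(offset) : null;
        }
    }

    // The consumer owns its position and may move it backwards to re-consume.
    static final class Consumer {
        private int offset = 0;

        String poll(PartitionLog log) {
            String message = log.read(offset);
            if (message != null) {
                offset++; // advance only after a message was actually read
            }
            return message;
        }

        void seek(int newOffset) {
            offset = newOffset;
        }

        int position() {
            return offset;
        }
    }
}
```

Since the log never records who has read what, any number of consumers can read the same log without coordinating through locks.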

Producer message routing

When the Producer sends a message to the broker, it chooses which Partition to store it in according to the partitioning mechanism. If the partitioning mechanism is set up properly, all messages are evenly distributed across the Partitions, thus achieving load balancing. If a topic corresponded to a single file, the I/O of the machine holding that file would become the performance bottleneck of the topic. With Partitions, different messages can be written in parallel to different Partitions on different brokers, which greatly improves throughput. You can specify the default number of partitions for a new topic through the configuration item num.partitions in $KAFKA_HOME/config/server.properties, specify it through parameters when creating a topic, or modify it with the tools provided by Kafka after the topic is created.

When sending a message, you can specify the key of the message, and the Producer decides which Partition the message should be sent to based on the key and the partitioning mechanism. The partitioning mechanism is selected through the Producer's partitioner.class parameter, whose value must be a class that implements the kafka.producer.Partitioner interface. In this example, if the key can be parsed as an integer, that integer is taken modulo the total number of Partitions, and the message is sent to the Partition with the resulting number. (Each Partition has a serial number, and the serial numbers start from 0.)

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class JasonPartitioner implements Partitioner {

    // Constructor taking the producer properties; nothing needs to be read from them here.
    public JasonPartitioner(VerifiableProperties verifiableProperties) {}

    @Override
    public int partition(Object key, int numPartitions) {
        try {
            // Integer keys map to partition (key mod numPartitions).
            return Math.abs(Integer.parseInt((String) key) % numPartitions);
        } catch (Exception e) {
            // Keys that are not integer strings fall back to the key's hash code.
            return Math.abs(key.hashCode() % numPartitions);
        }
    }
}

Suppose the class in the above example is used as partitioner.class, and the following code sends 20 messages (with keys 0, 1, 2, and 3) to topic3 (which contains 4 Partitions).

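import java.util.ArrayList;
import java.util.List;
import kafka.producer.KeyedMessage;

// Assumes `producer` is an already configured Producer<String, String> field of the enclosing class.
public void sendMessage() throws InterruptedException {
    for (int i = 1; i <= 5; i++) {
        List<KeyedMessage<String, String>> messageList = new ArrayList<KeyedMessage<String, String>>();
        for (int j = 0; j < 4; j++) {
            // Arguments: topic, key, message body.
            messageList.add(new KeyedMessage<String, String>("topic3", j + "", "The " + i + " message for key " + j));
        }
        producer.send(messageList);
    }
    producer.close();
}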

Then messages with the same key are sent to and stored in the same Partition, and each key's number is exactly the same as its Partition number. (Partition numbers start from 0, and the keys in this example also start from 0.) The following figure shows the list of messages printed after calling the Consumer from a Java program.

Consumer Group

(All descriptions in this section are based on the Consumer high level API rather than the low level API.)

When using the Consumer high level API, a message of a given topic can be consumed by only one Consumer within the same Consumer Group, but multiple Consumer Groups can consume the message at the same time.

This is how Kafka implements both broadcast (sending to all Consumers) and unicast (sending to one particular Consumer) of a Topic's messages. A Topic can correspond to multiple Consumer Groups. To implement broadcast, give each Consumer its own Group. To implement unicast, put all Consumers in the same Group. With Consumer Groups, Consumers can be grouped freely without the need to send messages to different topics multiple times.

In fact, one of Kafka's design philosophies is to provide both offline and real-time processing. Thanks to this feature, a real-time stream processing system such as Storm can process messages online in real time, a batch processing system such as Hadoop can process them offline, and the data can be backed up to another data center in real time, all at once. The Consumers used for these three operations simply belong to different Consumer Groups. The following figure is a simplified deployment diagram of Kafka at LinkedIn.

The following example shows the behavior of Kafka Consumer Groups more clearly. First create a Topic (named topic1, containing 3 Partitions), then create one Consumer instance belonging to group1 and three Consumer instances belonging to group2, and finally send messages with keys 1, 2, and 3 to topic1 through the Producer. The result is that the Consumer in group1 receives all three messages, while the three Consumers in group2 receive the messages with keys 1, 2, and 3 respectively, as shown below.
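The same experiment can be mimicked with a small in-memory model. In this sketch every group sees every message, but within a group only the member that owns the message's Partition receives it. The class name, the modulo-based Partition ownership, and the key-to-Partition rule are simplifying assumptions that stand in for Kafka's real rebalancing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of one topic consumed by several groups.
final class ConsumerGroupSketch {
    private final Map<String, List<String>> groups = new LinkedHashMap<>(); // group -> members
    private final Map<String, List<String>> inbox = new LinkedHashMap<>();  // member -> messages
    private final int numPartitions;

    ConsumerGroupSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    void join(String group, String consumer) {
        groups.computeIfAbsent(group, g -> new ArrayList<>()).add(consumer);
        inbox.put(consumer, new ArrayList<>());
    }

    void publish(int key, String message) {
        int partition = Math.floorMod(key, numPartitions);
        // Every group gets the message; inside a group exactly one member owns the partition.
        for (List<String> members : groups.values()) {
            String owner = members.get(partition % members.size());
            inbox.get(owner).add(message);
        }
    }

    List<String> received(String consumer) {
        return inbox.get(consumer);
    }
}
```

With one member in group1 and three in group2, publishing keys 1, 2, and 3 to a 3-Partition topic delivers all three messages to the single group1 member and exactly one message to each group2 member.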

Push vs. Pull

As a messaging system, Kafka follows the traditional design in which the Producer pushes messages to the broker and the Consumer pulls messages from the broker. Some logging-centric systems, such as Facebook's Scribe and Cloudera's Flume, use push mode. In fact, push mode and pull mode each have their own advantages and disadvantages.

The push mode is difficult to adapt to consumers with different consumption rates, because the message sending rate is determined by the broker. The goal of the push mode is to deliver messages as quickly as possible, but this can easily cause the Consumer to be too late to process messages, typically resulting in denial of service and network congestion. In the pull mode, messages can be consumed at an appropriate rate according to the consumer's consumption capacity.

For Kafka, pull mode is more appropriate. Pull mode simplifies the design of the broker. The Consumer independently controls the rate at which it consumes messages, and it also controls how it consumes them, either in batches or one by one. At the same time, it can choose different commit methods to achieve different delivery semantics.
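A minimal sketch of the pull side, under the assumption of an in-memory broker with a hypothetical fetch(offset, maxMessages) method: the consumer chooses the batch size, so a slow consumer simply issues more, smaller fetches and nothing is forced on it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PullSketch {
    static final class Broker {
        private final List<Integer> log = new ArrayList<>();

        // Producers push messages to the broker.
        void push(int message) {
            log.add(message);
        }

        // Consumers pull: at most maxMessages starting at the given offset.
        List<Integer> fetch(int offset, int maxMessages) {
            if (offset >= log.size()) {
                return Collections.emptyList();
            }
            int end = Math.min(log.size(), offset + maxMessages);
            return new ArrayList<>(log.subList(offset, end));
        }
    }

    // Drains the broker with a consumer-chosen batch size and returns the number of fetches made.
    static int drain(Broker broker, int batchSize, List<Integer> sink) {
        int offset = 0;
        int fetches = 0;
        while (true) {
            List<Integer> batch = broker.fetch(offset, batchSize);
            if (batch.isEmpty()) {
                return fetches;
            }
            sink.addAll(batch);
            offset += batch.size();
            fetches++;
        }
    }
}
```

Draining ten messages one at a time takes ten fetches, while a batch size of four takes three; both consumers end up with the same ten messages.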

Kafka delivery guarantee

There are several possible delivery guarantees:

  • At most once: messages may be lost, but are never redelivered.
  • At least once: messages are never lost, but may be redelivered.
  • Exactly once: each message is delivered once and only once, which is what the user wants in many cases.

    When the Producer sends a message to the broker, once the message is committed, it will not be lost, thanks to replication. However, if the Producer encounters a network problem after sending data to the broker and the communication is interrupted, the Producer cannot determine whether the message has been committed. Although Kafka cannot determine what happened during a network failure, the Producer could generate something similar to a primary key and retry idempotently in the event of a failure, thus achieving Exactly once. As of now (Kafka 0.8.2, 2015-03-04), this feature has not been implemented, and it is hoped that it will be implemented in a future version of Kafka. (So by default, delivery of a message from the Producer to the broker is At least once, and At most once can be achieved by setting the Producer to send asynchronously.)
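    The primary-key idea can be sketched as follows. Since the text notes that Kafka 0.8.2 does not implement this, everything here is hypothetical: an in-memory Broker remembers the ids it has already appended, so a retry with the same id becomes a no-op.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of idempotent retries keyed by a producer-generated id.
final class IdempotentSendSketch {
    static final class Broker {
        private final Set<String> seenIds = new HashSet<>();
        private final List<String> log = new ArrayList<>();

        // Appends the payload only the first time an id is seen; returns whether it appended.
        boolean append(String id, String payload) {
            if (!seenIds.add(id)) {
                return false; // duplicate caused by a retry, safely ignored
            }
            log.add(payload);
            return true;
        }

        int size() {
            return log.size();
        }
    }
}
```

    A producer that is unsure whether its first attempt was committed can resend with the same id, and the log still contains the message only once.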

    Next, consider the delivery guarantee semantics of messages from the broker to the Consumer (only for the Kafka Consumer high level API). After the Consumer reads a message from the broker, it can choose to commit, which saves in Zookeeper the offset of the message the Consumer has read in that Partition. The next time the Consumer reads the Partition, it starts reading from the next entry. If it does not commit, the start position of the next read is the same as the start position after the previous commit. The Consumer can also be set to autocommit, that is, the Consumer commits automatically as soon as it reads the data. If only this process of reading messages is considered, then Kafka ensures Exactly once. In actual use, however, the application does not stop once the Consumer has read the data but performs further processing, and the order of data processing and commit largely determines the delivery guarantee semantics of messages between the broker and the Consumer.

  • After reading the message, commit first and then process the message. In this mode, if the Consumer crashes after committing but before it has had time to process the message, it will not be able to read the committed but unprocessed message after it restarts. This corresponds to At most once.

  • After reading the message, process it first and then commit. In this mode, if the Consumer crashes after processing the message but before committing, the uncommitted message will be processed again when the Consumer restarts, even though it has in fact already been processed. This corresponds to At least once. In many usage scenarios, messages have a primary key, so message processing is often idempotent, that is, processing a message multiple times is equivalent to processing it only once, and this can be regarded as Exactly once. (I think this statement is far-fetched. After all, it is not a mechanism provided by Kafka itself, and a primary key alone cannot fully guarantee the idempotency of an operation. Delivery guarantee semantics are about how many times a message is processed, not about what the processing result is. Because processing methods vary so widely, we should not treat a characteristic of the processing, such as whether it is idempotent, as a feature of Kafka itself.)

  • If you must have Exactly once, you need to coordinate the offset with the output of the actual operation. The classic approach is to introduce two-phase commit. It is more concise and general if the offset and the operation output can be stored in the same place. This approach may be better, because many output systems do not support two-phase commit. For example, after the Consumer gets the data, it may put the data in HDFS. If the latest offset and the data itself are written to HDFS together, the output of the data and the update of the offset either both complete or both do not, which indirectly achieves Exactly once. (With the high level API, the offset is stored in Zookeeper and cannot be stored in HDFS, while with the low level API the offset is maintained by the application itself and can be stored in HDFS.)
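The difference between the first two orderings can be seen in a small simulation. The sketch below is not Kafka code: it walks a list of messages, simulates one crash between the two steps at a chosen position, and restarts from the last committed offset. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliverySketch {
    // Consumes `log`, crashing once at index crashAt between the commit and process steps,
    // then resuming from the committed offset. Returns the messages that were processed.
    static List<String> run(List<String> log, boolean commitFirst, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        boolean crashed = false;
        while (committed < log.size()) {
            int offset = committed; // a (re)start always begins at the committed offset
            String message = log.get(offset);
            if (commitFirst) {
                committed = offset + 1;
                if (!crashed && offset == crashAt) {
                    crashed = true;
                    continue; // crash after commit, before processing: message is lost
                }
                processed.add(message);
            } else {
                processed.add(message);
                if (!crashed && offset == crashAt) {
                    crashed = true;
                    continue; // crash after processing, before commit: message is redone
                }
                committed = offset + 1;
            }
        }
        return processed;
    }
}
```

For the messages a, b, c with a crash at index 1, committing first yields a, c (b is lost, At most once), while processing first yields a, b, b, c (b is repeated, At least once).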

In short, Kafka guarantees At least once by default, and allows At most once by setting the Producer to send asynchronously. Exactly once requires cooperation with an external storage system. Fortunately, the offset provided by Kafka can be used very directly and easily for this purpose.
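One way to picture that cooperation is a store that records the output and the next offset in a single atomic step, so a restarted consumer resumes exactly where the stored output ends. In the sketch below, the Store class is a hypothetical in-memory stand-in for an external system such as HDFS, and the synchronized method stands in for whatever atomic write that system provides.

```java
import java.util.ArrayList;
import java.util.List;

final class ExactlyOnceSketch {
    // Stand-in for an external store that updates output and offset together.
    static final class Store {
        private final List<String> output = new ArrayList<>();
        private int nextOffset = 0;

        // Both fields change inside one synchronized method, so they never diverge.
        synchronized void write(int offset, String result) {
            if (offset != nextOffset) {
                return; // already applied or out of order: ignore
            }
            output.add(result);
            nextOffset = offset + 1;
        }

        synchronized int nextOffset() {
            return nextOffset;
        }

        synchronized List<String> output() {
            return new ArrayList<>(output);
        }
    }

    // Resumes from the offset recorded in the store; stops early to simulate a crash
    // once crashAfterWrites writes have been made.
    static void consume(List<String> log, Store store, int crashAfterWrites) {
        int writes = 0;
        for (int offset = store.nextOffset(); offset < log.size(); offset++) {
            if (writes == crashAfterWrites) {
                return; // simulated crash
            }
            store.write(offset, log.get(offset).toUpperCase());
            writes++;
        }
    }
}
```

Running the consumer, crashing it after two writes, and running it again produces each result exactly once, because the restart position is read from the same place the results were written.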

About the Author

Guo Jun (Jason) holds a master's degree and works on the research and development of big data platforms. He is proficient in distributed messaging systems such as Kafka and stream processing systems such as Storm.

Contact: Sina Weibo: Guo Jun_Jason WeChat: habren Blog: http://www.jasongj.com

Coming next

The next article will explain in depth how Kafka does Replication and Leader Election. In versions prior to Kafka 0.8, if a broker goes down or there is a problem with the disk, the data of all partitions on the broker will be lost. After Kafka 0.8, the Replication mechanism was added, which can back up the data of each Partition in multiple copies. Even if some brokers are down, the availability of the system and the integrity of the data can be guaranteed.

Reprinted in: https://www.cnblogs.com/licheng/p/6443590.html
