Kafka basics and core concepts

In this article, we will try to answer the question: what is Apache Kafka?

Kafka is a distributed streaming platform, or a distributed message commit log. Let's break that description down.

Distributed

Kafka works as a cluster of one or more nodes, which can live in different data centers. We can distribute data and load across the nodes of the Kafka cluster, which makes Kafka inherently scalable, highly available, and fault-tolerant.

Streaming platform

Kafka stores data as a continuous stream of records that can be processed in different ways.

Commit log

When you push data to Kafka, it is appended to a stream of records, much like log lines appended to a log file, and that stream can be "replayed" or read from any point in time.

Is Kafka a message queue?

It can certainly act as a message queue, but it is not limited to that role. It can act as a FIFO queue, as a publish/subscribe messaging system, and as a real-time streaming platform. And thanks to Kafka's persistent storage, it can even be used as a database.

To summarize, Kafka is commonly used for real-time streaming data pipelines, i.e. transferring data between systems, building systems that transform continuously flowing data, and building event-driven systems.

We will now get into core Kafka concepts.

Message

Messages are the atomic unit of data in Kafka. Suppose you are building a log monitoring system and push each log record into Kafka, and your log message is JSON with this structure:

{
  "level" : "ERROR",
  "message" : "NullPointerException"
}

When you push this JSON to Kafka, you are actually pushing one message. Kafka saves this JSON as a byte array, and that byte array is the message as far as Kafka is concerned. This is the atomic unit: a JSON with the two keys "level" and "message". That doesn't mean you can't push anything else to Kafka; you can push strings, integers, JSON with different schemas, and so on, but we typically push different types of messages to different topics.
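
As a minimal sketch of what that push might look like with the Java client (the broker address and the appLogs topic name are illustrative assumptions, not from the original setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The JSON string is serialized to a byte array by the StringSerializer;
            // that byte array is the message Kafka stores.
            String log = "{\"level\": \"ERROR\", \"message\": \"NullPointerException\"}";
            producer.send(new ProducerRecord<>("appLogs", log));
        }
    }
}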

Messages may have an associated "key", which is simply some metadata used to determine the destination partition for the message.

Topic

A topic, as its name suggests, is a logical grouping of messages in Kafka: a stream of data of the same type. Going back to our logging system example, suppose our system generates application logs, ingress logs, and database logs and pushes them to Kafka for other services to consume. These three types of logs can be logically divided into three topics: appLogs, ingressLogs, and dbLogs. We can create these three topics in Kafka, and whenever there is an application log message we push it to the appLogs topic, ingress log messages go to the ingressLogs topic, and database logs go to the dbLogs topic. This gives us logical isolation between messages, a bit like using different tables to store different types of data. A sketch of creating these topics programmatically follows.
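
As a rough sketch, the three topics could be created with the Java AdminClient (the broker address, partition counts, and replication factors here are illustrative assumptions):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateLogTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition and replication counts are illustrative placeholders.
            admin.createTopics(Arrays.asList(
                new NewTopic("appLogs", 3, (short) 1),
                new NewTopic("ingressLogs", 3, (short) 1),
                new NewTopic("dbLogs", 3, (short) 1)
            )).all().get(); // block until the cluster confirms creation
        }
    }
}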

Partition

Partitioning is similar to sharding in a database and is the core concept behind Kafka's scalability. Suppose our system becomes very popular and now handles millions of log messages per second, so the node holding the appLogs topic can no longer store all the incoming data. We could initially solve this by adding more storage to the node, i.e. vertical scaling, but as we all know vertical scaling has its limits. Once we reach that threshold we need to scale horizontally, which means adding more nodes and splitting the data between them. When we split a topic's data into multiple streams, we call each of these smaller streams a "partition" of that topic.

This diagram depicts the concept of partitioning: a single topic has 4 partitions, and each partition contains a different set of data. The blocks you see here are the different messages in a partition. Think of the topic as an array; due to memory constraints we have split the single array into 4 smaller arrays. When we write a new message to a topic, the relevant partition is selected and the message is appended to the end of that array.

The offset of a message is its array index. The numbers on the blocks in this diagram represent the offsets; the first block is at offset 0 and the last block is at offset (n-1). The performance of your system also depends on how you set up your partitions, which we'll look at later in this article. (Note that in Kafka a partition is not an actual array; the array is just a mental model.)

Producer

Producers are Kafka clients that publish messages to Kafka topics. One of the core responsibilities of a producer is deciding which partition to send each message to. Depending on various configurations and parameters, the producer chooses the target partition in one of four ways:

  1. No key specified => When no key is specified in the message, the producer picks partitions so as to balance the total number of messages across all partitions.
  2. Key specified => When a message carries a key, the producer uses consistent hashing to map the key to a partition. If you don't know what consistent hashing is, don't worry: in short, it is a hashing mechanism that always produces the same hash for the same key and minimizes the redistribution of keys when nodes are added to or removed from the cluster. So, if in our logging system we use the source node ID as the key, the logs of the same node will always go to the same partition. This is very relevant to the ordering guarantees of messages in Kafka, as we will see shortly.
  3. Partition specified => You can also hardcode the target partition.
  4. Custom partitioning logic => We can plug in our own rules for choosing the partition. All four options are shown in the sketch below.
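
A minimal sketch of the four options, reusing the producer from the earlier sketch (logJson stands in for any serialized message; topic and key names are illustrative):

// 1. No key: the producer spreads messages across partitions.
producer.send(new ProducerRecord<>("appLogs", logJson));

// 2. With a key: messages with the same key (here, a node id) always land
//    in the same partition, preserving per-key ordering.
producer.send(new ProducerRecord<>("appLogs", "node-1", logJson));

// 3. Explicit partition: hardcode the target partition (partition 0 here).
producer.send(new ProducerRecord<>("appLogs", 0, "node-1", logJson));

// 4. Custom logic: point partitioner.class at your own Partitioner implementation.
// props.put("partitioner.class", "com.example.NodeAwarePartitioner"); // hypothetical class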

Consumer

So far we have produced messages; to read them we use a Kafka consumer. Consumers read messages from partitions in order, so if you push 1, 2, 3, 4 into a topic, a consumer will read them in that same order. Since every message has an offset, each time a consumer reads a message it stores the offset value in Kafka or Zookeeper, marking the last message it has read. If the consumer node fails, it can come back and resume from where it left off. And if at any point a consumer needs to go back in time and read old messages, it can do so by resetting its offset position.
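
A minimal sketch of a consumer loop using the Java client (broker address, group name, and topic are illustrative assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "log-readers");             // illustrative group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("appLogs"));
            while (true) {
                // poll() returns the next batch of records after the last consumed offset
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}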

Consumer group

A consumer group is a collection of consumers that work together to read messages from a topic. There are some very interesting concepts here, let’s take a look at them.

  1. Fan-out => A single topic can be subscribed to by multiple consumer groups. Suppose you are building an OTP service and need to send the OTP by both SMS and email. Your OTP service can push the OTP to Kafka once, and then both the SMS service consumer group and the email service consumer group receive the message and send the SMS and the email.

  2. Ordering guarantee => Now that we know a topic can be partitioned and multiple consumers can consume the same topic, you may ask how the order of messages is maintained on the consumer side. Good question. Within one consumer group, a partition cannot be read by more than one consumer; only a single consumer in the group reads from any given partition, and this is exactly what consumer groups enable.

Say your producer produces six messages as key-value pairs: ("A", "1"), ("C", "1"), ("B", "1"), ("C", "2"), ("A", "2"), ("B", "2"). (Note that by keys I mean the message keys we discussed earlier, not JSON or map keys.) Our topic has 3 partitions, and since messages with the same key are consistently hashed to the same partition, all messages with key "A" will be grouped together, and likewise for "B" and "C". Each partition has only one consumer, and that consumer gets the messages sequentially, so it will receive A1 before A2 and B1 before B2: order is maintained. Going back to our logging system example, with the source node ID as the key, all logs of node 1 always go to the same partition, and since they always land in the same partition, their order is maintained.

This would not be possible if multiple consumers in the same group read the same partition. And if consumers in different groups each read the same partition, the messages still arrive in order within each consumer group.

So with 3 partitions you can have at most 3 active consumers in a group; if you add a 4th consumer, it will sit idle. With 3 partitions and 2 consumers, one consumer reads from one partition and the other reads from two. If one of them goes down, the surviving consumer ends up reading from all three partitions, and when consumers are added back, the partitions are split among the consumers again. This is called rebalancing. A minimal sketch of the group configuration follows.
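
As a rough sketch of how this is wired up (the group names and the baseConsumerProps() helper are hypothetical), the only configuration separating fan-out from partition-splitting is group.id:

// Hypothetical helper that fills in bootstrap.servers and deserializers.
Properties smsProps = baseConsumerProps();
smsProps.put("group.id", "sms-service");     // group 1: receives every OTP message

Properties emailProps = baseConsumerProps();
emailProps.put("group.id", "email-service"); // group 2: also receives every message

// Fan-out: because the two consumers are in DIFFERENT groups, each group
// gets its own copy of every message on the topic.
// Scaling: start several consumers with the SAME group.id instead, and the
// topic's partitions are split among them; Kafka rebalances automatically
// whenever a consumer joins or leaves the group.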

Broker

A broker is a single Kafka server. Brokers receive messages from producers, assign offsets to them, and commit them to the partition's log, which basically means writing the data to disk; this is what gives Kafka its durability.

Cluster

A Kafka cluster is a set of broker nodes working together to provide scalability, availability, and fault tolerance. One node in the cluster acts as the controller, which assigns partitions to brokers, monitors brokers for failures, and performs other administrative work.

In a cluster, partitions are replicated to multiple brokers according to the topic's replication factor to provide failover. For a topic with a replication factor of 3, each partition of that topic exists on 3 different brokers. One of those brokers acts as the leader of the partition and the remaining two are followers. Data is always written to the leader broker and then replicated to the followers. This way we lose neither data nor cluster availability: if the leader goes down, another broker is elected leader.

Let's look at a practical example. I'm running a 5-node Kafka cluster locally, and I run this command:

bin/kafka-topics.sh --bootstrap-server 192.168.49.2:30092 --topic applog --partitions 5 --replication-factor 3 --create

The cluster will:

  1. Create the topic
  2. Create 5 partitions for the topic
  3. Replicate each of the 5 partitions onto 3 nodes

Let's take partition 0 as an example. The leader of this partition is node 2, and the data of this partition is replicated on nodes 2, 5, and 1. So one partition is replicated on 3 nodes, and this behavior is repeated for all 5 partitions. Notice also that each partition has a different leader node: to utilize the nodes properly, the Kafka controller distributes partition leadership evenly across all of them. Replication is likewise evenly spread, so no node is overloaded. All of this is done by the controller broker with the help of Zookeeper, or KRaft (production-ready since Kafka 3.3.1).

Now that you understand clustering, you can see that we can create more partitions for a topic, and for each partition we can add a dedicated consumer node, which lets us scale horizontally.

Something more advanced

Beyond the basics, there are a few slightly more advanced things you should know; we'll touch on them briefly.

Producer

You can send data to Kafka in three ways:

  1. Fire and forget
  2. Send synchronously
  3. Send asynchronously

Each has its own performance and consistency trade-offs, sketched below.
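
A minimal sketch of the three styles, assuming the producer and a ProducerRecord named record from the earlier sketches (the synchronous get() call throws checked exceptions, so run it where they are handled):

// 1. Fire and forget: send without checking the result (fastest; messages
//    can be lost silently).
producer.send(record);

// 2. Synchronous: block until the broker responds (slowest; strongest
//    guarantee). Throws if the send ultimately failed.
RecordMetadata meta = producer.send(record).get();

// 3. Asynchronous: register a callback that runs when the send completes.
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace(); // handle or log the failure
    }
});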

You can also configure the acknowledgment (acks) behavior on the producer.

  • acks=0: don't wait for any acknowledgment | FASTEST
  • acks=1: the send is considered acknowledged once the leader broker has received the message | FASTER
  • acks=all: the send is considered acknowledged once all in-sync replicas have received the message | FAST

You can compress and batch messages on the producer before sending them to the broker.

This yields higher throughput and lower disk usage, but higher CPU usage.
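
As a sketch, acknowledgments, compression, and batching are all plain producer configuration; the values below are illustrative, not recommendations:

// Illustrative producer settings added to the props from the earlier sketch.
props.put("acks", "all");               // "0", "1", or "all"
props.put("compression.type", "gzip");  // also: snappy, lz4, zstd
props.put("batch.size", 32768);         // bytes per partition batch
props.put("linger.ms", 20);             // wait up to 20 ms to fill a batch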

Avro serializer/deserializer

If you use Avro as your serializer/deserializer instead of plain JSON, you have to declare your schema up front, which gives better performance and saves storage space.
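
As a sketch, wiring this up typically involves Confluent's Avro serializer and a Schema Registry, both of which sit outside core Apache Kafka (the registry URL below is an assumption):

// Assumes Confluent's Avro serializer and a running Schema Registry.
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address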

Consumer

Polling

Kafka consumers continuously poll data from brokers; brokers do not push data to consumers.

The partition assignment strategy can be configured (a configuration sketch follows this list):

Range: each consumer gets a consecutive range of partitions

Round robin: partitions are assigned to consumers one by one in a circular fashion

Sticky: rebalancing keeps most existing assignments intact, minimizing disruption

Cooperative sticky: the same sticky assignment, but rebalances cooperatively so consumers can keep consuming during a rebalance
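
The strategy is selected through consumer configuration; a minimal sketch:

// Class names come from the Apache Kafka client library.
props.put("partition.assignment.strategy",
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
// Alternatives: RangeAssignor, RoundRobinAssignor, StickyAssignor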

Batch size

We can configure how many records and how much data are returned per poll call, as in the sketch below.
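
A sketch of the relevant consumer settings (values are illustrative):

props.put("max.poll.records", 100);     // cap the record count per poll()
props.put("fetch.min.bytes", 1024);     // broker waits until at least 1 KB is ready
props.put("fetch.max.bytes", 10485760); // cap the total bytes per fetch (10 MB here)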

Committing offsets

When reading messages, we can update the consumer's offset position; this is called committing the offset. Auto-commit can be enabled, or the application can commit offsets explicitly, either synchronously or asynchronously, as sketched below.
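
A minimal sketch of explicit commits, reusing the consumer setup from the earlier sketch (process() is a hypothetical handler):

props.put("enable.auto.commit", "false"); // commit manually instead of automatically

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("appLogs"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // hypothetical handler for your application logic
        }
        consumer.commitSync();     // synchronous: blocks until the commit succeeds
        // consumer.commitAsync(); // asynchronous: returns immediately, optional callback
    }
}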

End

Kafka is a great piece of software with tons of features covering a variety of use cases. Kafka is well suited to modern distributed systems because it is itself designed to be distributed. It was originally created at LinkedIn and is now an Apache project, with Confluent as a major contributor. It's used by top tech companies like Uber, Netflix, Activision, Spotify, Slack, Pinterest, Coursera, and more.
