Introduction to Kafka

Introduction
Apache Kafka® is a distributed streaming platform. What exactly does that mean?


We think a streaming platform has three key capabilities:


1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
2. It lets you store streams of records in a fault-tolerant way.
3. It lets you process streams of records as they occur.

What is Kafka good for?

It is used in two broad classes of applications:

1. Building real-time streaming data pipelines that reliably move data between systems or applications.
2. Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's dive into Kafka's capabilities from the bottom up.


First some concepts:

Kafka runs as a cluster on one or more servers.
A Kafka cluster stores streams of records in categories called topics.
Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

The Producer API allows an application to publish a stream of records to one or more Kafka topics.

The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams.

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.


In Kafka, communication between clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backward compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
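For concreteness, here is a minimal sketch of the Producer API using the Java client. The broker address localhost:9092 and the topic name my-topic are placeholder assumptions, not part of the original text:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record carries a key and a value; the timestamp is
            // assigned automatically unless you supply one.
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello, kafka"));
        }
    }
}

The Consumer API is used in a similar style; sketches of subscribing and of controlling the read position appear in the sections below.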



Topics and Logs


Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:




Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential id number called the offset, which uniquely identifies the record within the partition.

The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
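As a sketch, that two-day retention policy could be set per topic when the topic is created with the Java admin client; the topic name, partition count, and replication factor below are arbitrary placeholders:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // A topic with 3 partitions, replication factor 1, and a 2-day
            // retention period: 2 * 24 * 60 * 60 * 1000 ms = 172800000 ms.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 1)
                    .configs(Map.of("retention.ms", "172800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}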



In fact, the only metadata retained on a per-consumer basis is that consumer's offset, or position, in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from 'now'.
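A minimal sketch of a consumer taking control of its own position with the Java client. It assigns itself a single partition (the topic name and partition number are assumptions) and rewinds to the beginning of the partition to reprocess old data:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(List.of(partition));

            // The consumer controls the position: rewind to reprocess old
            // data, or use seekToEnd to jump to the most recent records.
            consumer.seekToBeginning(List.of(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}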

This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to 'tail' the contents of any topic without changing what is consumed by any existing consumers.



The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. Each partition has one server that acts as the leader and zero or more servers that act as followers. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and as a follower for others, so load is well balanced within the cluster.

Producers


Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record).
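As an illustration of a semantic partition function, here is a hypothetical custom partitioner for the Java client that routes records by a hash of their key, so all records with the same key land in the same partition (the default partitioner already behaves this way for keyed records; this sketch just makes the logic explicit):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: fall back to a fixed partition in this sketch
        }
        // Same key -> same hash -> same partition, preserving per-key order.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}

A producer would opt in to it with props.put("partitioner.class", KeyHashPartitioner.class.getName()).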

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then records are effectively load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record is broadcast to all the consumer processes.





A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we find that topics have a small number of consumer groups, one for each 'logical subscriber'. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
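A sketch of one such logical subscriber using the Java client. Every instance started with the same group.id shares the topic's partitions, while a group with a different name independently receives every record; the broker address, group name, and topic are placeholder assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "logical-subscriber-a");    // assumed group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            // Partitions are divided among all instances with this group.id;
            // each record on a partition is seen by exactly one of them.
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}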


The way consumption is implemented in Kafka is by dividing the partitions in the log over the consumer instances, so that at any point in time each instance is the exclusive consumer of a 'fair share' of the partitions. This process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.


Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records, this can be achieved with a topic that has only one partition, though that will mean only one consumer process per consumer group.

 Guarantees

At a high level, Kafka gives the following guarantees:

  
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

A consumer instance sees records in the order they are stored in the log.

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details about these guarantees will be given in the Design section of the document.


Kafka as a messaging system

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber: once one process reads the data, it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing, since every message goes to every subscriber.


The consumer group concept in Kafka generalizes these two models. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.



The advantage of Kafka's model is that every topic has both of these properties: it can scale processing and it is also multi-subscriber. There is no need to choose one or the other.

Kafka has stronger ordering guarantees than traditional messaging systems.



A traditional queue retains records in order on the server, and if multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered to consumers asynchronously, so they may arrive out of order at different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this with the notion of an 'exclusive consumer' that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.







Kafka does it better. By having a notion of parallelism within topics (the partition), Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in a topic to the consumers in a consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, the load is still balanced over many consumer instances. Note, however, that there cannot be more consumer instances in a consumer group than there are partitions.

Kafka as a storage system
 
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.



Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgment, so that a write is not considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
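A sketch of a producer waiting for a fully replicated write, reusing the same placeholder broker and topic as the earlier examples: acks=all makes the leader wait for the in-sync replicas, and blocking on the returned Future treats the write as incomplete until the broker has acknowledged it:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader acknowledges only after the record has been
        // replicated to all in-sync replicas.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("my-topic", "key-1", "durable write"))
                    .get(); // block until the broker acknowledges the write
            System.out.printf("acked at partition=%d offset=%d%n",
                    meta.partition(), meta.offset());
        }
    }
}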


The disk structures Kafka uses scale well: Kafka performs the same whether you have 50 KB or 50 TB of persistent data on the server.



Because it takes storage seriously and allows clients to control their own read position, you can think of Kafka as a kind of special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.

To learn more about the design of Kafka's commit log storage and replication, read this page.

Kafka for stream processing

It is not enough to just read, write, and store streams of data; the goal is to enable real-time processing of streams.


In Kafka, a stream processor can take a continuous stream of data from an input topic, do some processing on this input, and produce a continuous stream of data to an output topic.


For example, a retail application might take in input streams of sales and shipments and output a stream of reorders and price adjustments computed from this data.


It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, such as computing aggregations over streams or joining streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, and more.

The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
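As a small sketch of the Streams API, the following Java application consumes a continual stream from one topic, transforms each record, and produces the result to another topic. The application id, broker address, and topic names are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read a continual stream from the input topic, transform each
        // record's value, and write the results to the output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}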
Putting the pieces together


This combination of messaging, storage, and stream processing may seem unusual, but is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. A system like this can effectively store and process historical data from the past.


A traditional enterprise messaging system allows processing the future messages that arrive after you subscribe. Applications built this way process data as it arrives.


Kafka combines both of these capabilities, and the combination is critical to Kafka's use both as a platform for streaming applications and as a platform for streaming data pipelines.


By combining storage and low-latency subscriptions, streaming applications can treat past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise, for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; at the same time, the ability to store data reliably makes it possible to use it for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods for maintenance. The stream processing facilities make it possible to transform data as it arrives.
