Kafka Core Concepts

The content of this article is based on the official Apache Kafka documentation.


Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or an enterprise messaging system.
  • Store streams of records durably and reliably so they can be replayed.
  • Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:

  • Building reliable real-time streaming data pipelines between systems and applications.
  • Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka achieves these capabilities, let's explore Kafka's features in depth.
First, a few concepts:

  • Kafka runs as a cluster on one or more servers, which can span multiple data centers.
  • A Kafka cluster stores streams of records in categories called topics.
  • Each record consists of three parts: a key, a value, and a timestamp.

Kafka has four core APIs:

  • Producer API: allows an application to publish a stream of records to one or more Kafka topics.
  • Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • Streams API: allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.
  • Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

Kafka uses a simple, high-performance, language-agnostic TCP protocol for communication between clients and servers. The protocol is versioned and maintains backward compatibility with older versions. We provide a Java client for Kafka, but clients are available in many other languages.
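
To make the Producer API concrete, here is a minimal sketch in Java using the official kafka-clients library. The broker address localhost:9092 and the topic name my-topic are placeholder assumptions, not part of the original article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one (key, value) record to the topic "my-topic".
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
        } // close() flushes any buffered records before exiting.
    }
}
```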

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential id number called the offset, which uniquely identifies the record within the partition.
The Kafka cluster durably persists all published records according to a configurable retention period, whether or not they have been consumed. For example, if the retention policy is set to two days, a record can be consumed at any time within two days of publication, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
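
As an illustration of configuring retention, the following sketch uses the Java AdminClient to set the topic-level retention.ms property to two days (172,800,000 ms). The broker address and topic name are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // retention.ms = 2 days; records older than this become eligible for deletion.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "172800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(setRetention)))
                 .all().get();
        }
    }
}
```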

In fact, the only metadata retained on a per-consumer basis is that consumer's offset, its position in the log. The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data it has already processed, or skip ahead to the most recent record and start consuming from "now".
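
Because the offset is under the consumer's control, a consumer can rewind or skip ahead explicitly. Here is a minimal sketch, assuming a topic my-topic with a partition 0 and a broker at localhost:9092:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "replay-group");            // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(p0));

            consumer.seekToBeginning(Collections.singletonList(p0)); // reprocess everything
            // Or jump to a specific offset, or to the end to start from "now":
            // consumer.seek(p0, 42L);
            // consumer.seekToEnd(Collections.singletonList(p0));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```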
These characteristics mean that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command-line tools to "tail" the contents of any topic without affecting what any existing consumer is consuming.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic can have many partitions, so it can hold an arbitrary amount of data. Second, partitions act as the unit of parallelism, which is far more efficient than processing a single file.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for its share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.

Geo-Replication

Kafka MirrorMaker provides geo-replication support for clusters. With MirrorMaker, messages can be replicated across multiple data centers or cloud regions. You can use it in active/passive scenarios for backup and recovery, or to place data closer to your users and support data-locality requirements.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (say, based on a key in the record). We will say more about the use of partitioning later; a short sketch of these options follows.
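
The ProducerRecord constructors in the Java client expose these choices directly. A sketch; the topic name, key, and values are placeholders:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningExamples {
    public static void main(String[] args) {
        // No key: the producer spreads records across partitions to balance load.
        ProducerRecord<String, String> unkeyed =
                new ProducerRecord<>("my-topic", "some value");

        // With a key: all records with the same key land in the same partition,
        // so per-key ordering is preserved.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("my-topic", "user-42", "some value");

        // Explicit partition: bypass the partitioner entirely.
        ProducerRecord<String, String> pinned =
                new ProducerRecord<>("my-topic", 0, "user-42", "some value");
    }
}
```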

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be separate processes or separate machines.
If all the consumer instances have the same consumer group, the records are load-balanced over the consumer instances.
If all the consumer instances are in different consumer groups, each record is broadcast to all the consumer processes.
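
Here is a minimal sketch of a consumer joining a group; run several copies with the same group.id and Kafka splits the topic's partitions among them. The broker address, group name, and topic are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "group-a"); // instances sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // Each poll returns records only from the partitions assigned to this instance.
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.printf("partition=%d value=%s%n",
                                r.partition(), r.value()));
            }
        }
    }
}
```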

Figure: a two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups; consumer group A has two consumer instances, and group B has four.
More commonly, topics have a small number of consumer groups, one per "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is simply publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing the log's partitions over the consumer instances, so that each instance is the exclusive consumer of a "fair share" of the partitions at any point in time. Group membership is maintained dynamically by the Kafka protocol. If a new instance joins the group, it takes over some partitions from the other members; if an instance dies, its partitions are distributed to the instances that remain.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with partitioning the data by key is sufficient for most applications. If you need a total order over all records in a topic, use a topic with only one partition, though this means only one consumer process per consumer group.

Multi-tenancy

Kafka can be deployed as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operational support for quotas: administrators can define and enforce limits on requests, thereby controlling the broker resources used by clients. For more information, see the security documentation.
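
As one illustration of a quota, the following sketch uses the Java AdminClient (Kafka 2.6+) to cap the produce rate of a particular client id. The broker address, client name, and rate are assumptions for the example.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class SetQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        try (Admin admin = Admin.create(props)) {
            // The quota applies to any client identifying itself as "reporting-app".
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "reporting-app"));
            // Limit produce throughput to ~1 MB/s for that client.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                    Collections.singletonList(
                            new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0)));
            admin.alterClientQuotas(Collections.singletonList(alteration)).all().get();
        }
    }
}
```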

Guarantees

At a high level, Kafka provides the following guarantees:

  • Messages sent by a producer to a particular topic partition are appended in the order they are sent. That is, if messages M1 and M2 are sent by the same producer with M1 sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any records committed to the log (see the topic-creation sketch below).
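
For example, here is a sketch that creates a topic with replication factor 3 via the AdminClient, so the topic survives two broker failures. The broker address, topic name, and counts are placeholder assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3: tolerates the loss of any 2 brokers.
            NewTopic topic = new NewTopic("my-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```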

Kafka as a Messaging System

How does Kafka's notion of streams compare with a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each record goes to exactly one of them. In publish-subscribe, each record is broadcast to all consumers. Both models have strengths and weaknesses. The strength of queuing is that it allows you to divide the processing of data over multiple consumer instances, which scales your processing. Unfortunately, queues are not multi-subscriber: once one consumer reads a record, it is gone. Publish-subscribe allows you to broadcast data to multiple processes, but because every message goes to every subscriber, there is no way to scale out the processing.
The consumer group concept in Kafka generalizes both models. As with a queue, a consumer group lets you divide processing over a collection of processes (the members of the group). As with publish-subscribe, Kafka lets you broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both properties: it can scale processing and it is multi-subscriber, so you don't have to choose one or the other.
Kafka also has stronger ordering guarantees than a traditional messaging system.
A traditional queue retains records in order on the server, and when multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, delivery to consumers is asynchronous, so records may arrive at different consumers out of order. This effectively means the ordering of the records is lost under parallel consumption. Messaging systems often work around this with the notion of an "exclusive consumer" that allows only a single process to consume from a queue, but of course that gives up parallelism.
Kafka does better. With the partition as a notion of parallelism within topics, Kafka can provide both ordering guarantees and load balancing over a pool of consumer processes. The partitions of a topic are assigned to the consumers in a consumer group so that each partition is consumed by exactly one consumer in the group. This guarantees that the consumer is the only reader of that partition and consumes the data in order, while the many partitions still balance the load over many consumer instances. Note, however, that a consumer group cannot have more consumer instances than partitions.

Kafka as a Storage System

Any message queue that decouples the publishing of messages from their consumption is effectively acting as a storage system for in-flight messages. What makes Kafka different is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgement, so a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server being written to fails.
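
On the producer side this waiting behavior is controlled by the acks setting. A minimal sketch, with the broker address and topic name assumed:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the leader waits for the full set of in-sync replicas
        // to acknowledge the record before the write counts as complete.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the broker acknowledges the fully replicated write.
            producer.send(new ProducerRecord<>("my-topic", "key-1", "durable value")).get();
        }
    }
}
```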
Kafka's disk structures also scale well: Kafka performs the same whether you have 50 KB or 50 TB of persistent data on the server.
Because Kafka takes storage seriously and allows clients to control their own read position, you can think of Kafka as a special-purpose distributed storage system dedicated to high-performance, low-latency commit log storage, replication, and propagation.
For details on Kafka's commit log storage and replication design, see the documentation.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the goal is to enable real-time processing of streams.
In Kafka, a stream processor is anything that continually takes streams of data from input topics, performs some processing on that input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed from this data.
Simple processing can be done directly with the producer and consumer APIs. For more complex transformations, however, Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, such as computing aggregations over streams or joining streams together.
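
Here is a minimal Streams API sketch that reads from one topic, transforms each value, and writes to another. The topic names, application id, and broker address are assumptions for the example.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Continuously consume "input-topic", transform each record,
        // and produce the result to "output-topic".
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```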
These facilities help solve the hard problems this type of application faces: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and so on.
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input and output, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Putting the Pieces Together

Combining messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS stores static files for batch processing. A system like this is effective for storing and processing historical data from the past.
A traditional enterprise messaging system allows processing of future messages that arrive after you subscribe. Applications built this way process data as it arrives in the future.
Kafka combines both of these capabilities, and the combination is equally critical for Kafka's use as a platform for streaming applications and for streaming data pipelines.
By combining storage and low-latency subscription, streaming applications can treat past and future data the same way. A single application can process historical, stored data, and rather than ending when it reaches the last record, it can keep running and continue processing as future data arrives. This is the basic idea of stream processing, which subsumes batch processing as well as message-driven applications.
Likewise, for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines, while the ability to store data reliably makes it usable for critical data, or for integration with offline systems that load data only periodically or that go down for extended maintenance windows. The stream processing facilities make it possible to transform data as it arrives.
For more on the guarantees, APIs, and capabilities Kafka provides, see the rest of the documentation.

Original source: blog.csdn.net/pluto4596/article/details/89434848