Kafka Series 1: Kafka Overview

Kafka is one of the most popular distributed messaging middleware systems today. Thanks to its high-throughput design, it has become a developer favorite for log collection and messaging scenarios. This article covers some Kafka fundamentals, including the following:

About Kafka: Kafka features, Kafka basic concepts, Kafka architecture

Core concepts of Kafka: partitions, replication, sending messages, consumer groups, consumption offsets

Kafka engineering applications

About Kafka

Kafka Features

Kafka was originally developed at LinkedIn. It is a distributed, partitioned, multi-replica, multi-subscriber log system coordinated by ZooKeeper (it can also be used as a message queue), commonly used for web/nginx logs, access logs, messaging services, and so on. LinkedIn open-sourced it in early 2011 and contributed it to the Apache Software Foundation, where it later became a top-level project. Compared with other message queue middleware, Kafka's main design goals, and therefore its main features, are the following:

Message persistence with O(1) time complexity: constant-time access performance is maintained even with terabytes of data or more.

High throughput: even a very inexpensive commodity machine can support the transmission of 100K messages per second.

Support for partitioning messages across Kafka servers and for distributed consumption, while guaranteeing ordered delivery of messages within each partition.

Support for both offline data processing and real-time data processing.

Scale out: support for online horizontal scaling.

Kafka basic concepts

Broker

A Kafka cluster consists of one or more servers, each of which is called a broker. Brokers store the data of topics.

If a topic has N partitions and the cluster has N brokers, each broker stores one partition of the topic.

If a topic has N partitions and the cluster has N + M brokers, N of the brokers each store one partition of the topic, and the remaining M brokers store no data for the topic.

If a topic has N partitions and the cluster has fewer than N brokers, some brokers store more than one partition of the topic. Try to avoid this in production, because it can easily leave the data in the Kafka cluster unbalanced.
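The three cases above can be illustrated with a minimal sketch. It assumes a plain round-robin placement and ignores replicas; Kafka's real assignment logic is more elaborate, and the function name `assign_partitions` is hypothetical.

```python
def assign_partitions(num_partitions, broker_ids):
    """Place partitions on brokers round-robin (illustrative, not Kafka's algorithm)."""
    placement = {broker: [] for broker in broker_ids}
    for partition in range(num_partitions):
        broker = broker_ids[partition % len(broker_ids)]
        placement[broker].append(partition)
    return placement


# N partitions, N brokers: one partition per broker.
print(assign_partitions(3, [0, 1, 2]))
# N partitions, N + M brokers: M brokers hold no data for the topic.
print(assign_partitions(3, [0, 1, 2, 3, 4]))
# Fewer brokers than partitions: some brokers hold several partitions.
print(assign_partitions(5, [0, 1]))
```

In the last case broker 0 ends up with three partitions and broker 1 with two, which is the imbalance the text warns about.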

Topic

Every message published to Kafka has a category, called a topic. A topic is a logical concept.

Messages of different topics are stored separately on disk. Although the messages of one topic are physically stored on one or more brokers, users only need to specify the topic to produce or consume data, without caring where the data is stored.

Partition

A partition is a physical division of a topic. A topic can be divided into multiple partitions and has at least one.

Each partition stores its data in multiple segment files. Each partition is an ordered queue, but there is no ordering between the data of different partitions.

Each message in a partition is assigned a sequential ID (the offset).

Producer

The producer of messages and data. A producer publishes messages to a Kafka topic.

When a broker receives a message published by a producer, it appends the message to the segment file currently being written.

A message sent by a producer is stored in one partition; the producer can also specify the partition in which the data is stored.
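How a message ends up in one particular partition can be sketched as follows. This is only an illustration: the function name `choose_partition` is hypothetical, and `zlib.crc32` stands in for whatever hash a real client uses (an assumption made so the sketch runs with the standard library alone).

```python
import zlib


def choose_partition(num_partitions, key=None, partition=None, counter=0):
    """Pick a partition: explicit partition first, then key hash, then round-robin."""
    if partition is not None:
        if not 0 <= partition < num_partitions:
            raise ValueError(f"partition {partition} out of range")
        return partition
    if key is not None:
        # Same key always maps to the same partition, preserving per-key order.
        return zlib.crc32(key) % num_partitions
    # No key and no explicit partition: spread messages using a running counter.
    return counter % num_partitions


print(choose_partition(4, partition=2))      # explicit partition
print(choose_partition(4, key=b"user-1"))    # key-based, stable across calls
print(choose_partition(4, counter=5))        # round-robin fallback
```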

Consumer

The consumer of messages and data. A consumer reads data from brokers.

A consumer can consume data from multiple topics.

Consumer Group

Every consumer belongs to a specific consumer group.

A group name can be specified for each consumer; if none is specified, the consumer belongs to the default group.

A topic can have multiple consumer groups. Each message of the topic is delivered to every consumer group, but within each group it is delivered to only one consumer.

Consumer groups are how Kafka implements both broadcast and unicast of a topic's messages.

Leader

Each partition has multiple replicas, exactly one of which acts as the leader.

The leader is the replica currently responsible for reads and writes of the partition's data.

Follower

Followers follow the leader. All write requests are routed through the leader, data changes are propagated to all followers, and followers keep their data in sync with the leader.

If the leader fails, a new leader is elected from among the followers.

If a follower crashes, gets stuck, or syncs too slowly, the leader removes it from the "in sync replicas" (ISR) list, and a new follower is created.
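The leader and follower behavior described above can be sketched in a few lines. This is a simplified model under stated assumptions: lag is measured as time since a follower last caught up, the threshold `max_lag_ms` is an illustrative parameter, and both function names are hypothetical.

```python
def update_isr(leader, last_caught_up_ms, now_ms, max_lag_ms):
    """Keep the leader plus every follower that caught up recently enough."""
    isr = [leader]
    for replica, caught_up_at in last_caught_up_ms.items():
        if replica != leader and now_ms - caught_up_at <= max_lag_ms:
            isr.append(replica)
    return isr


def elect_leader(isr, failed):
    """Pick the first in-sync replica that has not failed, or None."""
    for replica in isr:
        if replica not in failed:
            return replica
    return None


# Replica 1 leads; follower 2 caught up 1 s ago, follower 3 fell 9 s behind.
isr = update_isr(1, {2: 9000, 3: 1000}, now_ms=10000, max_lag_ms=5000)
print(isr)
# The leader fails, so a follower from the in-sync list takes over.
print(elect_leader(isr, failed={1}))
```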

Kafka architecture

Kafka is generally deployed as a cluster. A typical Kafka cluster architecture is shown below:

[Figure: a typical Kafka cluster architecture]

Core Concepts of Kafka

Partition

Several characteristics of partitions:

The partition is Kafka's basic storage unit. A topic has one or more partitions, different partitions may reside on different server nodes, and physically each partition corresponds to a folder.

A partition contains one or more segments, and each segment in turn consists of a data file and a corresponding index file.

For write operations, each write goes to only one segment of the partition; for read operations, the segments within a partition are read sequentially.

Logically, a partition can be viewed as a very long array whose elements are accessed by index (the offset).
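The "long array split into segments" view can be sketched as follows. It is an in-memory toy, assuming each segment is identified by the offset of its first record and that a segment rolls over after a fixed number of records; the class name `SegmentedLog` and its parameters are hypothetical, and no files or index files are involved.

```python
import bisect


class SegmentedLog:
    """Append-only log split into segments, addressed by a global offset."""

    def __init__(self, max_segment_records):
        self.max_segment_records = max_segment_records
        self.base_offsets = [0]   # first offset of each segment, ascending
        self.segments = [[]]      # records held by each segment
        self.next_offset = 0

    def append(self, message):
        active = self.segments[-1]  # writes only ever touch the last segment
        if len(active) >= self.max_segment_records:
            self.base_offsets.append(self.next_offset)
            active = []
            self.segments.append(active)
        active.append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset):
        if not 0 <= offset < self.next_offset:
            raise IndexError(offset)
        # Binary search for the segment whose base offset is <= the target.
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i][offset - self.base_offsets[i]]


log = SegmentedLog(max_segment_records=2)
for message in "abcde":
    log.append(message)
print(log.base_offsets, log.read(3))
```

Finding the right segment by binary search over base offsets is what keeps offset lookups cheap even when a partition holds many segments.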

Partitioning is one of the ways Kafka achieves high throughput, in the following respects:

Because different partitions can reside on different machines, processing can be parallelized across machines.

Because a partition corresponds to a folder, multiple partitions can also reside on the same server, so different partitions on the same server can be placed on different disks, parallelizing processing across disks.

In general, increasing the number of partitions increases the system's parallel throughput, at the cost of slightly higher latency.

However, the following points need attention:

When a topic has multiple consumers, a message is consumed by only one consumer within a consumer group;

Because messages are assigned per partition, and leaving rebalancing aside, the data of one partition is consumed by only one consumer in a group. So if there are more consumers than partitions, some consumers will have no partition of the topic to consume, and adding more consumers at that point does not improve throughput;

From the perspective of producers and brokers, writes to different partitions are fully parallel, but consumer concurrency depends on the number of partitions. The actual number of partitions should be chosen based on the throughput the system is designed for.
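As a rough illustration of sizing partitions from throughput, the sketch below uses a common rule of thumb that is an assumption here, not something the article states: measure the throughput one partition sustains on the producer side and on the consumer side, then take enough partitions to cover the target on both sides. The function name and the numbers are hypothetical.

```python
import math


def estimate_partitions(target_mb_s, producer_mb_s_per_partition,
                        consumer_mb_s_per_partition):
    """Partitions needed so both producing and consuming can reach the target."""
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(by_producer, by_consumer))


# Target 100 MB/s; one partition sustains 10 MB/s in and 20 MB/s out.
print(estimate_partitions(100, 10, 20))
# Here consumption is the bottleneck at 8 MB/s per partition.
print(estimate_partitions(100, 25, 8))
```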

Replication

How replication works: Kafka uses ZooKeeper to maintain cluster membership. Each broker instance has a unique identifier, and on startup a broker registers its identifier by creating an ephemeral node in ZooKeeper. Other Kafka components watch the /brokers/ids path in ZooKeeper, so they are notified when a broker joins or leaves the cluster.

For data replication within the cluster, Kafka uses a scheme in which leader election is coordinated through ZooKeeper. The basic principle is: a leader is elected first and the other replicas become followers; all writes go to the leader first, and the messages are then replicated from the leader to the followers.

Replication is at the core of Kafka's architecture, because it keeps Kafka as a whole available when individual nodes are unavailable. Replication in Kafka operates at the partition level. A partition has multiple replicas, replicas are stored on brokers, and each broker can hold thousands of replicas belonging to different topics and partitions. There are two types of replica:

Leader replica: each partition has one leader replica, and all producer and consumer requests go through it;

Follower replica: does not handle client requests; its job is to copy messages from the leader and keep its state consistent with the leader's;

If the node hosting the leader goes down, one of the followers is elected leader and continues to serve requests;

Replication factor: the number of replicas a partition has.

Sending messages

From the producer's point of view, there are three ways to send a message to a broker:

Fire-and-forget: just send the message without caring about the result. This is essentially an asynchronous send: messages are first stored in a buffer and sent in batches once the configured conditions are met. It is the highest-throughput way to use Kafka, typically paired with acks=0 so that the producer does not wait for a server response and can send as fast as the network allows. It is also the least reliable way, because nothing is done about messages that fail to send.

Synchronous send: after sending a message, the producer gets back a Future object and checks its result to see whether the send succeeded. If the business requires messages to be sent strictly in order, use synchronous sending with a single partition, set retries so that failed sends are retried, and set max_in_flight_requests_per_connection=1 (max.in.flight.requests.per.connection in the Java client) so that the producer sends only one message to the server before receiving a response. Flush immediately after each message is sent successfully. This controls the order in which messages are sent.

Asynchronous send: the producer registers a callback function when sending a message, and the callback is invoked with the result as its argument when the producer receives the Kafka server's response. If the business needs to know whether a message was sent successfully but does not care about message order, send asynchronously with a callback, set retries=0, and write failed messages to a log file.
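The three sending styles differ only in what the caller does with the result of a send. The sketch below shows that difference using a stand-in producer built on the standard library's `concurrent.futures.Future`; `FakeProducer` is hypothetical, completes every send immediately, and is not the API of any real Kafka client.

```python
from concurrent.futures import Future


class FakeProducer:
    """Stand-in for a Kafka producer; every send completes immediately."""

    def __init__(self, fail_on=()):
        self.fail_on = set(fail_on)
        self.next_offset = 0

    def send(self, topic, value):
        future = Future()
        if value in self.fail_on:
            future.set_exception(RuntimeError(f"send failed: {value!r}"))
        else:
            future.set_result((topic, self.next_offset))
            self.next_offset += 1
        return future


producer = FakeProducer(fail_on={"bad"})

# 1. Fire-and-forget: the returned future is simply ignored.
producer.send("events", "m1")

# 2. Synchronous: block on the future before sending anything else.
topic, offset = producer.send("events", "m2").result(timeout=10)

# 3. Asynchronous with a callback: record failures without blocking.
failed = []


def on_done(value):
    def callback(future):
        if future.exception() is not None:
            failed.append(value)  # a real application might log this instead
    return callback


for value in ("m3", "bad"):
    producer.send("events", value).add_done_callback(on_done(value))

print(offset, failed)
```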

Message acknowledgment

When does a message delivered to a broker count as successful? Kafka has three acknowledgment modes:

No acknowledgment (acks=0): the message is considered delivered without waiting for any confirmation from the broker;

Leader acknowledgment (acks=1): delivery succeeds once the leader confirms it;

Full acknowledgment (acks=all): delivery succeeds only after the leader and all in-sync followers have confirmed it.

Across these three modes, performance decreases in that order while reliability increases.
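The three modes can be written down as a small decision function. This is a sketch of the rule only, with a hypothetical name `is_delivered`; the `"0"`, `"1"`, and `"all"` strings mirror the values commonly used for the producer's acks setting.

```python
def is_delivered(acks, leader_acked, follower_acks):
    """Decide whether a send counts as successful under the given acks mode."""
    if acks == "0":
        return True  # no confirmation is awaited at all
    if acks == "1":
        return leader_acked
    if acks == "all":
        return leader_acked and all(follower_acks)
    raise ValueError(f"unknown acks mode: {acks!r}")


# The leader has the message, but one in-sync follower has not confirmed yet.
for mode in ("0", "1", "all"):
    print(mode, is_delivered(mode, leader_acked=True, follower_acks=[True, False]))
```

The same situation counts as delivered under the first two modes and as not yet delivered under the third, which is where the extra reliability and the extra latency come from.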

Message retry mechanism

When the producer receives a recoverable, transient error from the broker, it resends the message. The maximum number of retries is determined by the retries property set when the producer is initialized. By default the producer waits 100 ms between retries, which can be changed through the retry.backoff.ms property.
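The retry behavior can be sketched as a loop with a fixed wait between attempts. The names `RetriableError` and `send_with_retries` are hypothetical, and the `retries` and `retry_backoff_ms` parameters only mirror the roles of the settings named above; real clients implement this internally.

```python
import time


class RetriableError(Exception):
    """A transient failure that is worth retrying."""


def send_with_retries(send, message, retries=3, retry_backoff_ms=100,
                      sleep=time.sleep):
    """Call send(message), retrying transient failures up to `retries` times."""
    attempt = 0
    while True:
        try:
            return send(message)
        except RetriableError:
            if attempt >= retries:
                raise  # retries exhausted: surface the error to the caller
            attempt += 1
            sleep(retry_backoff_ms / 1000)


calls = {"n": 0}


def flaky_send(message):
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RetriableError("leader not available")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = send_with_retries(flaky_send, "hello", sleep=waits.append)
print(result, calls["n"], waits)
```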

Batch sending

When multiple messages are to be sent to the same partition, the producer puts them in the same batch. Batching increases Kafka's throughput but also increases latency. Batching is controlled mainly by two properties set when constructing the producer:

batch.size: when the buffered messages for a partition reach this size (in bytes), a network request is triggered and all messages in the batch are sent;

linger.ms: the maximum time a message waits in the buffer; once this time is exceeded, batch.size is ignored and the client sends the messages immediately.
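How the two settings interact can be shown with a small simulation. It is a simplification under stated assumptions: a single partition, time passed in explicitly as `now_ms`, and a batch flushed as soon as its size reaches the limit. The class name `Batcher` is hypothetical.

```python
class Batcher:
    """Buffer payloads and flush on size (batch_size) or age (linger_ms)."""

    def __init__(self, batch_size, linger_ms):
        self.batch_size = batch_size  # bytes
        self.linger_ms = linger_ms
        self.buffer = []
        self.buffered_bytes = 0
        self.first_append_ms = None
        self.sent = []                # each entry is one flushed batch

    def append(self, payload, now_ms):
        if not self.buffer:
            self.first_append_ms = now_ms
        self.buffer.append(payload)
        self.buffered_bytes += len(payload)
        if self.buffered_bytes >= self.batch_size:
            self.flush()  # size threshold reached

    def tick(self, now_ms):
        """Called periodically; flushes a batch that has waited long enough."""
        if self.buffer and now_ms - self.first_append_ms >= self.linger_ms:
            self.flush()

    def flush(self):
        self.sent.append(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0
        self.first_append_ms = None


batcher = Batcher(batch_size=10, linger_ms=5)
for now_ms, payload in enumerate([b"aaaa", b"bbbb", b"cccc"]):
    batcher.append(payload, now_ms)   # third append crosses 10 bytes
batcher.append(b"dd", now_ms=3)       # too small to trigger a size flush
batcher.tick(now_ms=7)                # waited 4 ms: still buffered
batcher.tick(now_ms=8)                # waited 5 ms: flushed by linger
print(batcher.sent)
```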

Consumer group

A consumer group is the scalable, fault-tolerant consumption mechanism that Kafka provides. A group can contain multiple consumers, which share a common identifier, the group ID. All consumers in a group coordinate to consume all partitions of the topics they subscribe to, but a partition can be consumed by only one consumer within the same group.

Broadcast and Unicast

A topic can have multiple consumer groups. Each message of the topic is delivered to every consumer group, but within each group it goes to only one consumer. To implement broadcast, give each consumer its own consumer group; to implement unicast, put all consumers in the same consumer group.
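The difference between the two setups comes down to how consumers are grouped, which a short sketch can make concrete. It assumes each partition is owned by one member per group, chosen here by a simple modulo; the function name `receivers` and the group and consumer names are hypothetical.

```python
def receivers(partition, groups):
    """For a message in `partition`, return the one consumer per group that gets it."""
    return {group: members[partition % len(members)]
            for group, members in groups.items()}


# Unicast: all consumers share one group, so exactly one of them gets the message.
unicast = receivers(0, {"workers": ["c1", "c2", "c3"]})

# Broadcast: every consumer has its own group, so every consumer gets the message.
broadcast = receivers(0, {"g1": ["c1"], "g2": ["c2"], "g3": ["c3"]})

print(unicast)
print(broadcast)
```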

Rebalancing

When a new consumer joins a consumer group or an existing consumer leaves, partition ownership is transferred from one consumer to another. The rebalance protocol defines how all consumers in a group agree on the assignment of every partition of the subscribed topics. Rebalancing is triggered in three scenarios:

First, the membership of the consumer group changes;

Second, the number of subscribed topics changes;

Third, the number of partitions of a subscribed topic changes.
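A rebalance can be thought of as recomputing the partition assignment whenever one of these inputs changes. The sketch below assumes a simple round-robin strategy over sorted members, which is only one possible strategy; the function name `assign` is hypothetical.

```python
def assign(partitions, consumers):
    """Distribute partitions over group members round-robin."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# Two consumers share four partitions.
before = assign(range(4), ["c1", "c2"])
# A third consumer joins: ownership of some partitions moves.
after = assign(range(4), ["c1", "c2", "c3"])
# More consumers than partitions: one consumer is left idle.
crowded = assign(range(2), ["c1", "c2", "c3"])

print(before)
print(after)
print(crowded)
```

The last case shows why adding consumers beyond the partition count does not raise throughput: the extra member receives no partitions.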

Consumption offset

Kafka has a special topic named __consumer_offsets for storing offsets. Each time a consumer commits, it sends a message to this topic containing the offset of each partition it consumes. If the consumer keeps running, the committed offsets are of no use. But if a consumer crashes, or a new consumer joins the group and triggers a rebalance, a consumer may afterwards be assigned a partition it did not own before, and that is when the committed offsets come in handy. Maintaining offsets is essential for avoiding missed or repeated consumption and for exactly-once semantics. Offsets can be committed in the following ways:

Automatic commit: by default Kafka commits offsets automatically at a regular interval, 5 seconds by default. This approach can cause messages to be processed more than once;

Manual commit: the consumer must first disable automatic commit in its configuration and then call the commitSync method to commit offsets. The developer calls commitSync after the records have been processed, which reduces the number of reprocessed messages but may lower consumer throughput;

Asynchronous commit: use the commitAsync method to commit the latest offset. The consumer only sends the commit request and does not need to wait for the broker's response.
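Why infrequent commits lead to repeated processing can be shown with a small simulation. It assumes commits happen every `commit_every` processed records (a stand-in for a periodic auto-commit) and that a restart resumes from the last committed offset; the function name `run_until_crash` is hypothetical.

```python
def run_until_crash(records, start, crash_after, commit_every):
    """Process records from `start`, crashing after `crash_after` records.

    Returns the offsets processed and the last committed offset.
    """
    processed = []
    committed = start
    for offset in range(start, len(records)):
        if len(processed) == crash_after:
            break  # simulated crash: uncommitted progress is lost
        processed.append(offset)
        if len(processed) % commit_every == 0:
            committed = offset + 1  # commit the next offset to read
    return processed, committed


records = list("abcdefgh")

# Infrequent commits: offset 4 was processed but never committed.
done, committed = run_until_crash(records, 0, crash_after=5, commit_every=4)
replayed = [o for o in range(committed, len(records)) if o in done]
print(committed, replayed)

# Committing after every record: nothing is replayed after the crash.
done_sync, committed_sync = run_until_crash(records, 0, crash_after=5, commit_every=1)
print(committed_sync)
```

After the first run the consumer restarts at offset 4 and processes that record a second time, which is the duplicate-processing problem described for automatic commits.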

Kafka engineering applications

Kafka is mainly used in three scenarios:

User behavior data collection based on Kafka

Log collection based on Kafka

Traffic peak shaving based on Kafka

User behavior data collection based on Kafka

Obtaining the data needed to analyze user behavior takes several steps:

The front end reports event-tracking data

A receiver accepts the front-end data requests

The back end consumes the messages through Kafka and persists them to storage if necessary

User behavior is analyzed

Log collection based on Kafka

Application systems use Kafka as a high-throughput data buffer for their log output: logs are written uniformly to Kafka, which then exposes them to the various consumers through a unified interface. A unified logging platform collects important system logs into Kafka, and consumers then import them into ElasticSearch, HDFS, Storm, and other systems for real-time search and analysis, offline statistics, data backup, big data analysis, and so on.

Traffic peak shaving based on Kafka

To keep the system available under high traffic, a message queue can be added as a traffic buffer at key business points of the system, so that a burst of traffic generated in a short time does not overwhelm the entire application.
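The buffering effect can be illustrated with a toy simulation in which an in-memory queue stands in for the Kafka topic. The function name `drain_in_ticks` and the numbers are hypothetical; the point is only that the downstream service sees a bounded load per time slice regardless of the burst size.

```python
from collections import deque


def drain_in_ticks(burst, capacity_per_tick):
    """Buffer a burst, then consume at most `capacity_per_tick` items per tick."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    backlog = deque(burst)  # stands in for messages waiting in the queue
    per_tick = []
    while backlog:
        take = min(capacity_per_tick, len(backlog))
        batch = [backlog.popleft() for _ in range(take)]
        per_tick.append(len(batch))
    return per_tick


# Ten requests arrive at once, but the service handles three per tick.
loads = drain_in_ticks(range(10), capacity_per_tick=3)
print(loads)
```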

Origin blog.csdn.net/WANXT1024/article/details/104401718