Kafka general introduction


Background

 

Kafka was originally developed at LinkedIn, in Scala. It is a distributed, partitioned, multi-replica, multi-subscriber log system (a distributed MQ) that can be used for web/nginx logs, search logs, monitoring logs, access logs, and so on.

 

Kafka currently supports clients in a variety of languages: Java, Python, C++, PHP, and so on.

 

 

 

Kafka glossary and how it works:

 

Producer: the message producer, i.e. the client that publishes messages to a Kafka broker.

Consumer: the message consumer, i.e. the client that fetches messages from a Kafka broker.

Topic: Message topic.

Consumer Group (CG): the mechanism Kafka uses to implement both broadcast (deliver a message to every consumer) and unicast (deliver a message to exactly one consumer) for a topic. A topic can have multiple CGs. The topic's messages are copied (conceptually, not physically) to every CG, but within each CG a message is delivered to only one consumer. To broadcast, simply give each consumer its own CG; to unicast, put all consumers in the same CG. CGs also let consumers be grouped freely, without sending a message multiple times to different topics.

Broker: a single Kafka server is a broker. A cluster is composed of multiple brokers. One broker can host partitions of multiple topics.

Partition: for scalability, a very large topic can be distributed across multiple brokers (servers): a topic is divided into multiple partitions, each of which is an ordered queue. Every message in a partition is assigned a sequential id, called the offset. Kafka guarantees message order only within a single partition as delivered to a consumer; it does not guarantee order across a whole topic (i.e. across partitions).

Offset: Kafka names its stored files by offset. Naming files by offset makes messages easy to find: for example, to locate the message at position 2049, just look in the file 2048.kafka. Naturally, the first file is 00000000000.kafka.
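The lookup described above can be sketched as a binary search over segment base offsets. This is an illustrative model, not Kafka's actual code; the file names and offsets are made up for the example:

```python
import bisect

# Illustrative segment base offsets: each segment file is named after the
# offset of the first message it contains (e.g. 00000000000.kafka, 2048.kafka).
segment_base_offsets = [0, 2048, 4096]

def segment_for(offset):
    """Return the base offset of the segment file that contains `offset`."""
    # bisect_right finds the first base offset greater than `offset`;
    # the segment we want is the one immediately before it.
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    return segment_base_offsets[i]

print(segment_for(2049))  # -> 2048, i.e. look in 2048.kafka
```

Naming files by their base offset turns an offset lookup into a cheap search over a sorted list, which is why the scheme is described as convenient.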

 

Kafka characteristics (to be rewritten more abstractly in a later revision):

- Message persistence via an O(1) disk data structure, which keeps storage performance stable over long periods even with terabytes of messages.
- High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
- Supports both synchronous and asynchronous replication for high availability (HA).
- Consumers pull data and can read from arbitrary offsets; the sendfile system call provides zero-copy, and data is pulled in batches.
- Consumption state is stored on the client side.
- Messages are stored in write order (sequential writes).
- Data migration and expansion are transparent to users.
- Supports parallel data loading into Hadoop.
- Supports both online and offline scenarios.
- Persistence: data is persisted to disk and replicated to prevent loss.
- Scale out: machines can be added without downtime.
- Periodic deletion: partitions' segment files support a configurable retention time.

 

Reliability (consistency)

Kafka, as an MQ, must reliably transfer messages from producers and distribute them to consumers. Traditional MQ systems usually implement this with an acknowledgment (ack) mechanism between broker and consumer, with delivery state stored on the broker.

 

Even so, consistency is hard to guarantee (see the original paper). Kafka's approach is to have the consumer keep its own state, with no acknowledgments at all. Although this puts a heavier burden on the consumer, it is actually more flexible:

 

whenever a consumer needs to reprocess a message, for whatever reason, it can simply fetch it from the broker again.

 

Kafka cluster expansion

Kafka uses ZooKeeper for dynamic cluster expansion, so client (producer and consumer) configurations do not need to change. Brokers register themselves in ZooKeeper and keep the associated metadata (topic and partition information, etc.) up to date.

 

Clients register watchers on the corresponding ZooKeeper nodes. Whenever ZooKeeper changes, clients notice promptly and adjust accordingly. This ensures that when brokers are added or removed, load is still balanced automatically among the remaining brokers.

 

Kafka design goals

High throughput is one of its core design goals:

 

- Disk data persistence: messages are written directly to disk rather than cached in memory, taking full advantage of the disk's sequential read/write performance.
- Zero-copy: reduces the number of IO steps.
- Supports batch sending and batch pulling of data.
- Supports data compression.
- Topics are divided into multiple partitions, improving parallel processing capacity.

Producer load balancing and HA mechanism

The producer sends each message to a specific partition, chosen by a user-specified algorithm.

A topic has multiple partitions; each partition has its own replicas, distributed across different broker nodes.

Among a partition's replicas, one must be elected the lead partition; the leader handles all reads and writes, and ZooKeeper is responsible for failover.

ZooKeeper manages brokers and consumers dynamically joining and leaving.

Consumer's pull mechanism

Because Kafka persists data on the broker, the broker has no cache pressure, so a pull model is a better fit for consumers. The specifics are as follows:

 

It simplifies Kafka's design and reduces its complexity.

Consumers can independently control the speed at which they pull messages.

Consumers can choose their own consumption patterns, such as batch consumption, repeated consumption, or starting from a specified partition or position (offset).

Relationship between consumers and topics

For a topic, Kafka essentially guarantees that each message is consumed by only one consumer within each consumer group; conversely, a group may contain multiple consumers for a particular topic's messages.

Each message in a topic is delivered to only one consumer in each group that subscribes to it; a message is never sent to multiple consumers in the same group. All the consumers in a group therefore consume the topic's messages in an interleaved fashion.

If all consumers share the same group, this behaves like the JMS queue pattern: messages are load-balanced among the consumers.

If every consumer has its own group, this is publish-subscribe: each message is broadcast to all consumers.
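The two delivery modes can be simulated in a few lines. This is a toy model to illustrate the semantics, not Kafka's real delivery code; the group and consumer names are made up, and round-robin stands in for Kafka's actual partition assignment:

```python
from itertools import cycle

# Toy model: a message is conceptually copied to every group, but within a
# group it reaches exactly one consumer (round-robin here for illustration).
def deliver(message, groups):
    """groups maps group_id -> a cycle() iterator over that group's consumers."""
    return {gid: next(consumers) for gid, consumers in groups.items()}

# Queue mode: all consumers share one group, so messages are load-balanced.
queue = {"g1": cycle(["c0", "c1"])}
print(deliver("m0", queue))  # {'g1': 'c0'}
print(deliver("m1", queue))  # {'g1': 'c1'}

# Pub-sub mode: each consumer has its own group, so every message reaches all.
pubsub = {"gA": cycle(["c0"]), "gB": cycle(["c1"])}
print(deliver("m0", pubsub))  # {'gA': 'c0', 'gB': 'c1'}
```

The same delivery function produces queue or pub-sub behavior purely from how consumers are grouped, which is the point of the CG abstraction.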

 

 

 

In Kafka, a message in a partition is consumed by only one consumer in a group at any given time; consumers in different groups consume independently of one another. You can think of a group as a single "subscriber".

 

For each partition of a topic, only one consumer within the subscribing group will consume it, but one consumer can consume messages from multiple partitions at the same time.

 

Kafka can only guarantee ordering within a partition as it is consumed by a single consumer. In fact, from the topic's point of view, when there are multiple partitions, messages are still not globally ordered.

 

 

 

Typically, a consumer group contains multiple consumers. This not only improves concurrent consumption of the topic's messages, but also improves fault tolerance: if one consumer in the group fails, its partitions are automatically taken over by the other consumers. Kafka's design dictates that, for a given topic, the number of consumers in the same group consuming it at the same time cannot exceed the number of partitions; otherwise some consumers will never receive messages.

 

Producer balancing algorithm

Any broker in the Kafka cluster can provide metadata to a producer. This metadata includes the list of live servers in the cluster, the list of partition leaders, and other information (see the node information in ZooKeeper). After the producer obtains the metadata, it holds socket connections to the leaders of all of the topic's partitions.

Messages are sent by the producer directly to the broker over these sockets, without passing through any intermediate "routing layer". Which partition a message is routed to is, in fact, decided by the producer client.

Routing schemes such as "random", "key-hash", and "round-robin" can be used; when a topic has multiple partitions, achieving "balanced message distribution" on the producer side is necessary.

In the producer-side configuration, developers can specify the partition routing scheme.
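A minimal sketch of the routing schemes named above, assuming a hypothetical partitioner class (this is not Kafka's actual partitioner, just an illustration of key-hash plus round-robin fallback):

```python
import itertools

class SimplePartitioner:
    """Illustrative producer-side router: key-hash when a key is present,
    round-robin ("polling") for keyless messages."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._rr = itertools.count()  # round-robin counter

    def partition(self, key=None):
        if key is not None:
            # key-hash: the same key always lands on the same partition
            return hash(key) % self.num_partitions
        # round-robin spreads keyless messages evenly across partitions
        return next(self._rr) % self.num_partitions

p = SimplePartitioner(4)
assert p.partition("user-42") == p.partition("user-42")  # stable per key
```

Key-hash keeps all messages for one key on one partition (preserving per-key order), while round-robin maximizes balance when no key is given.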

 

Consumer balancing algorithm

When a consumer joins or leaves a group, a partition rebalance is triggered. The ultimate goal of rebalancing is to improve concurrent consumption of the topic.

1) Suppose topic1 has the following partitions: P0, P1, P2, P3

2) The group contains the following consumers: C0, C1

3) First, sort the partitions by partition index: P0, P1, P2, P3

4) Sort the consumers by consumer.id: C0, C1

5) Compute the multiple M = [P0, P1, P2, P3].size / [C0, C1].size, rounded up; in this example M = 2

6) Then assign the partitions: C0 = [P0, P1], C1 = [P2, P3], i.e. Ci = [P(i * M), P((i + 1) * M - 1)]
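The six steps above can be sketched as a small function (a hypothetical helper, not Kafka's source; it assumes partitions and consumers are already sorted as in steps 3 and 4):

```python
import math

def range_assign(partitions, consumers):
    """Range assignment: consumer i gets partitions [i*M, (i+1)*M - 1]."""
    m = math.ceil(len(partitions) / len(consumers))  # step 5: round up
    # step 6: slice the sorted partition list into chunks of size M
    return {c: partitions[i * m:(i + 1) * m] for i, c in enumerate(consumers)}

print(range_assign(["P0", "P1", "P2", "P3"], ["C0", "C1"]))
# {'C0': ['P0', 'P1'], 'C1': ['P2', 'P3']}
```

Note that when partitions do not divide evenly, the rounding up means the last consumers can receive fewer (or zero) partitions, consistent with the rule that excess consumers get no messages.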

 

Replica mechanism among brokers in a Kafka cluster

In Kafka, the replication strategy is based on partitions, not topics. Kafka replicates each partition's data to multiple servers; any partition has one leader and zero or more followers.

 

The number of replicas can be set in the broker configuration file. The leader handles all read and write requests, and the followers must stay synchronized with the leader. A follower acts like a "consumer": it pulls messages and stores them in its local log. The leader is responsible for tracking the state of all followers; if a follower falls too far behind or fails, the leader removes it from the synchronization (replica) list.

 

A message is considered "committed" only after all followers have successfully saved it; only then can consumers consume it. This synchronization strategy requires a good network environment between the followers and the leader.

 

Even if only one replica instance survives, messages can still be sent and received normally, as long as the ZooKeeper cluster is alive. (Note: this is unlike other distributed storage systems, such as HBase, which need a "majority" alive to keep working.)

 

 

Based on my current understanding of Kafka's functional modules, and to make it easier to learn and use, Kafka can be roughly divided into the following modules; later posts will introduce each module in turn, following the path of an incoming message.

Reference: https://blog.csdn.net/lizhitao/article/details/23743821

 

Origin www.cnblogs.com/zhy-heaven/p/10993876.html