Introduction to the basic principles of Kafka

What is Kafka? The official Kafka site puts it this way:

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Roughly speaking, it is a real-time data processing system that can scale horizontally, is highly reliable, is also remarkably fast, and is already used in production by many companies.

So what is a real-time data processing system? As the name suggests, it is a system that processes data quickly, as soon as the data is produced.

For real-time data processing, the tool we most commonly use is messaging middleware, also known as MQ (Message Queue) or a Message Broker.

In this article, I will look at Kafka from the messaging-middleware perspective and examine its internal structure to see how it achieves scalability and high reliability while also being remarkably fast.

Why messaging middleware?

Messaging middleware mainly does two things:

  • It decouples message production from message consumption.
  • It acts as a buffer.

Imagine a scenario: you implement a create-order operation. After an order is created, it needs to trigger a series of other actions, such as updating order statistics, sending an SMS to the user, sending an email to the user, and so on, like this:

createOrder(...){
    ...
    statOrderData(...);
    sendSMS();
    sendEmail();
}

The code looks fine at first, but after a while you introduce a user-behavior analysis service that also needs to run after an order is created. As the system keeps growing, more and more actions are triggered after order creation, and the code gradually expands into something like this:

createOrder(...){
    ...
    statOrderData(...);
    sendSMS();
    sendEmail();
    // new operation
    statUserBehavior(...);
    doXXX(...);
    doYYY(...);
    // more and more operations
    ...
}

The crux of the ever-expanding code is that message production and message consumption are coupled together: the createOrder method not only produces the "order created" message, it also has to handle that message itself.

It is as if a BBC reporter, upon learning that Real Madrid had won the Champions League, picked up the phone, opened a contact list of Real Madrid fans, and called them one by one to tell them that Real Madrid had won the championship.

In reality, the BBC reporter only needs to publish the news on the official website; the fans then visit the BBC site themselves to read it, or they subscribe to the BBC, and the subscription system pushes news published on the site to them.

Similarly, createOrder needs a carrier that plays the role of the BBC website: the messaging middleware. After an order is created, it simply puts a message with the topic "orderCreated" into the messaging middleware, without caring who the message has to be delivered to. That completes the production of the message.

As for the services that need to be triggered after an order is created, they only need to subscribe to the "orderCreated" topic. Whenever a new "orderCreated" message appears in the messaging middleware, they receive it and handle it accordingly.

Thus, by using messaging middleware, the code above is reduced to:

createOrder(...){
    ...
    sendOrderCreatedMessage(...);
}

If new requirements come up later, this code does not need to change at all; the new service simply subscribes to the message published after order creation.
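To make this concrete, here is a minimal sketch of what sendOrderCreatedMessage could look like using Kafka's Java producer client. It assumes a broker reachable at localhost:9092 and string-serialized keys and values; the topic name "orderCreated" comes from the example above, while the class, method, and variable names are purely illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderCreatedProducer {
    // Illustrative helper: publish the fact "an order was created" instead of
    // calling every downstream service directly.
    public static void sendOrderCreatedMessage(String orderId, String orderJson) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer does not know or care which services will consume this message.
            producer.send(new ProducerRecord<>("orderCreated", orderId, orderJson));
        }
    }
}

In real code the producer would be created once and reused rather than rebuilt on every call; it is recreated here only to keep the sketch self-contained.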

In addition, thanks to this decoupling, consumers gain flexibility in when they consume the data. They do not have to process every message the instant it is produced (even though the consumer side usually has its own buffering mechanisms such as thread pools); they can pick a convenient time and pull the data from the messaging middleware for processing. This is the buffering role of messaging middleware.
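On the consuming side, each downstream service (SMS, email, user-behavior analysis, and so on) subscribes to the same topic and pulls messages at its own pace. Below is a minimal sketch using Kafka's Java consumer, again assuming a broker at localhost:9092; the group id "sms-service" and the sendSMS stub are illustrative.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SmsOrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "sms-service");             // illustrative consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orderCreated"));
            while (true) {
                // poll() pulls whatever has accumulated; the consumer decides how fast to process it.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    sendSMS(record.value()); // handle the message at the consumer's own pace
                }
            }
        }
    }

    static void sendSMS(String orderJson) { /* illustrative stub */ }
}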

Kafka Generation One - Message Queue

From the description above, we can see that messaging middleware is able to decouple message production from message consumption mainly because it provides a place to store the messages: producers put messages in, and consumers take them out to process.

So what data structure should this message store use?

In most cases we want messages to be processed in the order they arrive (first in, first out), which matches most business logic; in a few cases we may give certain messages a higher priority. Either way, for messaging middleware a FIFO queue is a very suitable data structure:

(Figure: messages stored in a FIFO queue. Source: LinkedIn.com)

How, then, do we ensure that messages can be consumed in order?

Should the consumer always be handed the record at index = 0, with that record deleted once it has been returned?

Obviously not, because the number of consumers subscribed to a message may be 0, 1, or more than 1. If a message were deleted as soon as one consumer finished with it, the other consumers subscribed to it would never get it.

In fact, Kafka persists the data (for how long is configurable). Each consumer records an offset that marks where its consumption currently stands, so the next time it wants to continue, it simply resumes from offset + 1.

A consumer can even adjust its offset to re-consume earlier data.
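The idea boils down to a tiny sketch (plain Java, not the Kafka API): messages are appended to a log and never removed when read; each consumer just remembers its own offset, which it can also rewind to re-consume. All names here are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy "Kafka generation one": an append-only message list plus one offset per consumer.
class ToyLog {
    private final List<String> messages = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>(); // consumerId -> next offset to read

    void produce(String message) {
        messages.add(message);                      // messages are kept, never deleted on read
    }

    String consume(String consumerId) {
        int offset = offsets.getOrDefault(consumerId, 0);
        if (offset >= messages.size()) return null; // nothing new for this consumer
        offsets.put(consumerId, offset + 1);        // advance only this consumer's position
        return messages.get(offset);
    }

    void seek(String consumerId, int offset) {      // rewind (or skip ahead) to re-consume
        offsets.put(consumerId, offset);
    }
}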

Is this Kafka, then? No, this is just a very ordinary message queue; let's call it Kafka generation one for now.

Kafka generation one implements messaging middleware with a single message queue, and this simple implementation has plenty of problems:

  • Topics share one queue. Imagine a consumer that subscribes only to topic "A" having to dig through a queue mixing topics A, B, C, D, E, F, G... just to find the messages for topic A. Wouldn't that be slow?
  • Low throughput. All messages go into a single queue, which certainly cannot cope once requests pile up.

So we move on to Kafka generation two.

Kafka Generation Two - Partition

Solving the two problems of Kafka generation one is very simple: distribute the storage.

Kafka generation two introduces the concept of the Partition: it uses multiple queues, and each queue holds messages of only one topic:

(Figure: each topic stored in its own partitions. Source: LinkedIn.com)

The Partition is designed precisely to solve the two problems above:

  • One topic per queue. A queue holds only one topic, so consumers no longer have to worry about running into messages for topics they do not want.
  • Higher throughput. Messages of different topics go to different queues, so a single queue no longer has to take on everything by itself.

A queue holds only one topic, but the messages of one topic can be dispersed across multiple queues according to a custom key. In other words, in the figure, p1 and p2 could both be queues for the same topic. That is a more advanced usage, though, and we will discuss it when the opportunity arises.
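With the Kafka producer, this "disperse by key" simply means giving each record a key: records with the same key are hashed to the same partition of the topic, while different keys spread across its partitions. A minimal sketch, reusing the producer configuration from the earlier example; using the user id as the key, and the simplified hash below, are assumptions for illustration.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderProducer {
    // 'producer' is configured exactly as in the earlier sketch.
    static void publish(KafkaProducer<String, String> producer, String userId, String orderJson) {
        // Records carrying the same key always land in the same partition of "orderCreated";
        // different keys are spread across the topic's partitions (p1, p2, ...).
        producer.send(new ProducerRecord<>("orderCreated", userId, orderJson));
    }

    // Conceptually, the default partitioner picks a partition roughly like this
    // (Kafka actually hashes the serialized key with murmur2, but the idea is the same):
    static int pickPartition(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}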

Is Kafka generation two good enough? Of course not. Partitions improve performance, but we have overlooked a very important issue: high availability.

What if a machine goes down? A single-point system is never reliable, so we also have to think about standby nodes and data backups.

Kafka Generation Three - Broker Cluster

Obviously, in order to solve the availability problem, we need a cluster .

Kafka's support for clustering is also very friendly. In Kafka, each instance in the cluster is called a Broker, like this:

(Figure: a Kafka broker cluster, with each partition's leader and replicas spread over the brokers. Source: sookocheff.com)

Each partition is no longer a single copy: it has one Leader (red) and several Replicas (blue). Based on a message's topic and key, the producer determines which partition the message should go to (say p1), finds that partition's leader (i.e. the p1 on broker2), and sends the message to it. The leader is responsible for writing the message and synchronizing it with the other replicas.

If the leader of a partition goes down, one of its replicas is simply promoted to be the new leader, and the system keeps running normally.

With the broker cluster design, we not only solve the system's availability problem but also further improve its throughput, because the replicas can also serve data lookups for consumers.
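In terms of the client API, the number of partitions and replicas is fixed when the topic is created. Here is a minimal sketch using Kafka's Java AdminClient, assuming a broker at localhost:9092; the choice of 3 partitions and a replication factor of 3 (which requires a cluster of at least 3 brokers) is illustrative.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrderCreatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each stored as 1 leader + 2 replicas spread over the brokers.
            NewTopic topic = new NewTopic("orderCreated", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}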

 


Link: https://zhuanlan.zhihu.com/p/37405836
Source: Zhihu
