1. Overview of Kafka
1.1 Definition
Kafka is an open-source distributed event streaming platform that is widely used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
1.2 Comparison of message queues
At present, the most common message queue products include Kafka, RabbitMQ, and RocketMQ.
In big data scenarios, Kafka is mainly used as the message queue. In JavaEE development, RabbitMQ and RocketMQ are mainly used.
Comparison of several common MQs:
Feature | RabbitMQ | RocketMQ | Kafka |
---|---|---|---|
Company/Community | Rabbit | Alibaba | Apache |
Development language | Erlang | Java | Scala & Java |
Protocol support | AMQP, XMPP, SMTP, STOMP | Custom protocol | Custom protocol |
Availability | High | High | High |
Single-node throughput | Moderate | High | Very high |
Message latency | Microseconds | Milliseconds | Within milliseconds |
Message reliability | High | High | Moderate |
Prioritizing availability: Kafka, RocketMQ, RabbitMQ
Prioritizing reliability: RabbitMQ, RocketMQ
Prioritizing throughput: RocketMQ, Kafka
Prioritizing low message latency: RabbitMQ, Kafka
1.3 Application Scenarios of Traditional Message Queuing
The main application scenarios of traditional message queues are buffering/traffic peak shaving, decoupling, and asynchronous communication.
1) Buffering/traffic peak shaving
For example, during Double Eleven, concurrent requests reach 200 million per second, but the business system can process only 10 million per second.
The number of requests far exceeds the capacity of the system, so the system would crash.
If a message queue receives these requests and buffers them, the system only needs to consume data at its own processing speed.
Processing takes a little longer, but the availability of the entire business system is preserved.
The queue helps control and smooth the rate of data flowing through the system, and it resolves the mismatch between the speed at which messages are produced and the speed at which they are consumed.
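The buffering effect can be illustrated with a small simulation. This is only a sketch using an in-memory `deque`, not Kafka itself; the function name `simulate_peak_shaving` and the tick-based arrival numbers are made up for illustration.

```python
from collections import deque


def simulate_peak_shaving(arrivals_per_tick, capacity_per_tick):
    """Buffer bursts in a queue and drain them at a fixed processing rate.

    Returns the backlog size after each tick, until the queue is empty.
    """
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")

    queue = deque()
    backlog_history = []
    next_id = 0
    tick = 0
    while tick < len(arrivals_per_tick) or queue:
        # Producers enqueue requests as fast as they arrive.
        arrivals = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # The business system consumes only what it can handle per tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        backlog_history.append(len(queue))
        tick += 1
    return backlog_history


# Two ticks of burst traffic at twice the processing capacity.
history = simulate_peak_shaving([20, 20, 0, 0], capacity_per_tick=10)
print(history)
```

The backlog grows while the burst lasts and drains afterwards: no request is rejected, and the consumer never handles more than its capacity in a single tick.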
2) Decoupling
However the producer side and the consumer side change, neither needs multiple implementations: each only interacts with the message queue.
This greatly reduces the coupling and the development cost of the system.
Developers can extend or modify the processing on either side independently, as long as both sides obey the same interface constraints.
3) Asynchronous communication
For example, in a recharge process, the recharge itself is the most important task and must be executed immediately, while sending the text message is less important.
Executing the recharge and the SMS step one after the other in the same request would only add pressure to the system, so there is no need to do so.
After the recharge succeeds, the request to send the text message can be written to the message queue, and a consumer service can process these requests at its own pace.
Even if a message is lost and the SMS is not sent, the core business (recharge) is unaffected, and no system error results.
In short, a message queue allows a message to be enqueued without being processed immediately, and processed later when needed.
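The recharge example can be sketched with Python's standard `queue` and `threading` modules. This is an in-process illustration under assumed names (`recharge`, `sms_worker`, `sent_sms`), not a real payment flow or SMS gateway.

```python
import queue
import threading

sms_queue = queue.Queue()
balances = {}
sent_sms = []


def recharge(user, amount):
    """Core task: runs immediately and returns without waiting for the SMS."""
    balances[user] = balances.get(user, 0) + amount
    # Hand the non-critical work to the queue instead of doing it inline.
    sms_queue.put((user, f"Recharged {amount}"))
    return balances[user]


def sms_worker():
    """Consumer service: drains SMS requests at its own pace."""
    while True:
        item = sms_queue.get()
        if item is None:  # sentinel value: stop the worker
            sms_queue.task_done()
            break
        sent_sms.append(item)  # stand-in for a real SMS gateway call
        sms_queue.task_done()


worker = threading.Thread(target=sms_worker, daemon=True)
worker.start()

new_balance = recharge("alice", 50)

# The demo waits here only so the result can be inspected; the caller of
# recharge() already has its answer at this point.
sms_queue.join()
sms_queue.put(None)
worker.join()
print(new_balance, sent_sms)
```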
2. Two publishing modes of Kafka
2.1 Point-to-point mode
The consumer actively pulls data, and a message is removed from the queue once it has been received.
2.2 Publish/Subscribe Mode
There can be multiple topics (browse, like, favorite, comment, etc.).
After a consumer consumes data, the data is not deleted, so other consumers can still consume it. When the data is eventually deleted is decided separately.
Each consumer is therefore independent of the others, and every consumer can consume the data.
Because this mode adapts better to complex business environments, the publish/subscribe mode is used in most cases.
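The difference between the two modes can be shown with two toy classes. This is a minimal in-memory sketch; `PointToPointQueue` and `PubSubTopic` are hypothetical names, and the per-consumer offset is a simplification of how a retained log can serve independent readers.

```python
from collections import deque


class PointToPointQueue:
    """A message is removed as soon as one consumer receives it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def pull(self):
        # Returns None when there is nothing left to consume.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Messages are retained; each consumer tracks its own read position."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        self._log.append(message)

    def poll(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._log):
            return None
        self._offsets[consumer] = offset + 1
        return self._log[offset]


p2p = PointToPointQueue()
p2p.send("order-1")
first_pull = p2p.pull()    # the message is delivered and removed
second_pull = p2p.pull()   # nothing left for anyone else

topic = PubSubTopic()
topic.publish("like")
seen_by_c1 = topic.poll("c1")
seen_by_c2 = topic.poll("c2")   # still available to a second consumer
c1_again = topic.poll("c1")     # c1 has already read everything
```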
3. The architecture of Kafka
3.1 Basic architecture
1. To make scaling easier and improve throughput, a topic is divided into multiple partitions, and each partition is stored on a different Kafka node.
The advantage of partitioning: if a single Broker node can store only 1 TB of data but 2 TB needs to be stored, the data can be partitioned and stored across two Broker nodes.
2. Building on the partition design, the concept of a consumer group is introduced: the consumers in a group consume in parallel.
3. To improve availability, each partition has several replicas. Exactly one replica is the Leader and the others are Followers. Consumers only consume data from the Leader. If the Leader fails, one of the Followers is elected as the new Leader.
4. ZooKeeper (ZK) records which replica is the Leader. Starting with Kafka 2.8.0, however, Kafka can also be configured to run without ZK. Dropping ZK is the trend going forward, because it has become a bottleneck for Kafka.
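The partition and replica ideas in this list can be sketched in a few lines. This is an illustrative model only: the round-robin placement and the "first live replica wins" election are assumptions made for the sketch, and Kafka's real replica assignment and leader election are more involved. The function names `assign_replicas` and `elect_leaders` are hypothetical.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication_factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


def elect_leaders(assignment, live_brokers):
    """Pick the first live replica of each partition as its Leader.

    A partition with no live replica gets None (it is unavailable).
    """
    return {
        partition: next((b for b in replicas if b in live_brokers), None)
        for partition, replicas in assignment.items()
    }


brokers = ["b0", "b1", "b2"]
assignment = assign_replicas(num_partitions=3, brokers=brokers, replication_factor=2)

leaders_before = elect_leaders(assignment, live_brokers={"b0", "b1", "b2"})
# Simulate broker b1 failing: a Follower of partition 1 takes over.
leaders_after = elect_leaders(assignment, live_brokers={"b0", "b2"})
print(leaders_before, leaders_after)
```

With two replicas per partition, losing one broker leaves every partition with a live replica, so all partitions stay available.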
3.2 Role Description
Producer: the message producer, the client that sends messages to the Kafka broker.
Consumer: the message consumer, the client that fetches messages from the Kafka broker.
Consumer Group (CG): a consumer group, consisting of multiple consumers. Each consumer in a group is responsible for consuming data from different partitions, and a partition can be consumed by only one consumer in the group. Consumer groups do not affect each other. Every consumer belongs to a consumer group, so a consumer group is logically a single subscriber.
Broker: a Kafka server is a broker. A cluster consists of multiple brokers, and a broker can hold multiple topics.
Topic: can be understood as a queue. Both producers and consumers work against a topic.
Partition: for scalability, a very large topic can be distributed across multiple brokers (i.e. servers). A topic can be divided into multiple partitions, and each partition is an ordered queue.
Replica: a copy. Each partition of a topic has several replicas, consisting of one Leader and several Followers.
Leader: the "primary" among a partition's replicas. Producers send data to the Leader, and consumers consume data from the Leader.
Follower: a "secondary" among a partition's replicas. It synchronizes data from the Leader and stays in sync with the Leader's data. When the Leader fails, one Follower becomes the new Leader.
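The consumer group rule (within a group, each partition is consumed by exactly one consumer) can be sketched as a simple assignment function. The round-robin strategy and the name `assign_partitions` are assumptions for illustration; Kafka clients offer several assignment strategies, and this sketch does not reproduce any of them exactly.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by a group of two consumers.
balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# More consumers than partitions: the extra consumer stays idle.
with_idle = assign_partitions([0, 1], ["c1", "c2", "c3"])
print(balanced, with_idle)
```

Because a partition has only one owner per group, adding consumers beyond the number of partitions does not increase parallelism within that group.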