Kafka as a message queue

Table of contents

1. Overview of kafka

1. kafka definition

2. Introduction to Kafka

3. Features of kafka

4. Kafka system architecture

2. Overview of Zookeeper

1. Zookeeper working mechanism

2. Features of Zookeeper

3. Zookeeper data structure

4. Zookeeper application scenarios

5. Zookeeper election mechanism

3. Overview of message queues

1. Why do you need a message queue

2. The benefits of using message queues

3. Two modes of message queue

4. Install Zookeeper

1. Unzip and install the zookeeper software package

2. Modify the configuration file

3. Specify the corresponding node number for each machine

4. Start zookeeper

5. After opening, check the zookeeper status of the three nodes 

5. Install kafka 

1. Unzip and install the kafka software package

2. Modify the configuration file

3. Add relevant commands to the system environment

4. Start kafka

5. Create topic

6. Test topic 

6. Summary

1. Zookeeper

2. Message queue MQ

3. Kafka architecture


1. Overview of kafka

1. kafka definition

Kafka is a distributed publish/subscribe-based message queue (MQ, Message Queue), which is mainly used in the field of big data real-time processing.

2. Introduction to Kafka

Kafka was originally developed by LinkedIn and is written in Scala. It is a distributed message middleware system that supports partitions and replicas and relies on Zookeeper for coordination. Its biggest feature is that it can process large amounts of data in real time, which lets it serve many scenarios: Hadoop-based batch processing systems, low-latency real-time systems, Spark/Flink stream processing engines, nginx access logs, message services, and so on. LinkedIn open-sourced Kafka in early 2011 and donated it to the Apache Software Foundation, where it became a top-level open source project.

3. Features of kafka

  1. High throughput for both publishing and subscribing

    Kafka is designed to persist messages with O(1) time complexity, so access performance stays constant even with terabytes of stored data. Even on inexpensive commodity machines, a single node can handle 100K messages per second.

  2. Message persistence

    Messages are persisted to disk, so they can be used for batch consumption such as ETL as well as for real-time applications. Persisting to disk and replicating the data prevents data loss.

  3. Distributed

    Kafka supports partitioning messages across servers and distributed consumption, and it preserves message order within each partition. This makes it easy to scale out: there can be multiple producers, brokers and consumers, all distributed, and machines can be added without downtime.

  4. Consumers pull messages

    The consumption state of a message is maintained on the consumer side, not on the server side. The broker keeps no per-consumer processing state; each consumer tracks its own offset.

  5. Supports online and offline scenarios

    Both offline data processing and real-time data processing are supported.

4. Kafka system architecture

1) Broker

A Kafka server is a broker. A cluster consists of multiple brokers, and a broker can host multiple topics.

2) Topic

A topic can be understood as a queue: producers write to a topic and consumers read from one.

It is similar to a table name in a database or an index in Elasticsearch.

Messages of different topics are physically stored separately.

3) Partition (splits a topic's data into shards)

  • For scalability, a very large topic can be distributed across multiple brokers (i.e. servers). A topic is divided into one or more partitions, and each partition is an ordered queue. Kafka guarantees the order of records only within a partition, not across the partitions of a topic.
  • Each topic has at least one partition. When a producer sends data, it selects a partition according to the routing rules below and appends the message to the end of that partition.

Partition data routing rules:

  1. If a partition is specified, it is used directly;
  2. If no partition is specified but a key is specified (the key is an attribute of the message), the partition is chosen by hashing the key and taking the result modulo the number of partitions;
  3. If neither a partition nor a key is specified, a partition is chosen in round-robin fashion.
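The three routing rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `SketchPartitioner` is hypothetical, and `zlib.crc32` stands in for the hash function (Kafka's Java client uses a murmur2 hash, and newer clients batch keyless messages differently from plain round-robin).

```python
import itertools
import zlib
from typing import Optional


class SketchPartitioner:
    """Toy partition chooser that follows the three routing rules above."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key: Optional[bytes] = None,
               partition: Optional[int] = None) -> int:
        # Rule 1: an explicitly requested partition is used as-is.
        if partition is not None:
            if not 0 <= partition < self.num_partitions:
                raise ValueError("partition out of range")
            return partition
        # Rule 2: hash the key, then take it modulo the partition count.
        # crc32 is a stand-in hash chosen for this sketch only.
        if key is not None:
            return zlib.crc32(key) % self.num_partitions
        # Rule 3: neither partition nor key, so rotate through partitions.
        return next(self._round_robin)


partitioner = SketchPartitioner(num_partitions=3)
print(partitioner.choose(partition=2))             # 2
print(partitioner.choose(key=b"order-42")
      == partitioner.choose(key=b"order-42"))      # True: same key, same partition
print([partitioner.choose() for _ in range(4)])    # [0, 1, 2, 0]
```

The point of rule 2 is that all messages with the same key land in the same partition, which is what preserves their relative order.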

Note:

  • Each message in a partition is assigned an incrementing number, its offset, which starts from 0.
  • The data in each partition is stored in multiple segment files.
  • If a topic has multiple partitions, the order of the data cannot be guaranteed across partitions when consuming. In scenarios where the order of consumption must be strictly guaranteed (such as flash sales and grabbing red envelopes), the number of partitions needs to be set to 1.

The relationship between broker, topic and partition:

  • Broker stores topic data. If a topic has N partitions and the cluster has N brokers, each broker stores a partition of the topic.
  • If a topic has N partitions and the cluster has (N+M) brokers, then there are N brokers that store a partition of the topic, and the remaining M brokers do not store the partition data of the topic.
  • If a topic has N partitions and the number of brokers in the cluster is less than N, then some brokers store more than one partition of the topic. In a production environment, keep an eye on this situation, because it can lead to uneven data distribution in the Kafka cluster.
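The three cases above can be checked with a small sketch. It assumes a simplified placement in which partition i goes to broker i modulo the number of brokers; Kafka's real assignment is more involved, and the broker names are made up.

```python
from collections import Counter
from typing import Dict, List


def spread_partitions(num_partitions: int, brokers: List[str]) -> Dict[int, str]:
    """Place partition i on broker i mod len(brokers) (simplified round-robin)."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


def partitions_per_broker(num_partitions: int, brokers: List[str]) -> Dict[str, int]:
    """Count how many partitions of one topic each broker ends up storing."""
    counts = Counter(spread_partitions(num_partitions, brokers).values())
    return {broker: counts.get(broker, 0) for broker in brokers}


# N partitions, N brokers: one partition each.
print(partitions_per_broker(3, ["b1", "b2", "b3"]))
# N partitions, N+M brokers: M brokers hold nothing for this topic.
print(partitions_per_broker(3, ["b1", "b2", "b3", "b4", "b5"]))
# More partitions than brokers: some brokers hold several partitions.
print(partitions_per_broker(5, ["b1", "b2", "b3"]))
```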

Reasons for partitioning:

  • It makes the cluster easy to scale. Each partition can be sized to fit the machine that hosts it, and a topic can be composed of multiple partitions, so the cluster as a whole can handle data of any size;
  • It improves concurrency, because reads and writes happen per partition.

4) Replica

Replicas ensure that when a node in the cluster fails, the partition data on that node is not lost and Kafka can continue to work. Each partition of a topic has several replicas: one leader and several followers.

5) Leader

Each partition has multiple replicas, exactly one of which is the leader. The leader is the replica that currently handles reads and writes for the partition.

6) Follower

  • Followers replicate the leader. All write requests go through the leader, and followers fetch the changes from the leader to stay in sync. By default a follower only serves as a backup and does not handle client reads or writes.
  • If the leader fails, a new leader is elected from the followers.
  • When a follower hangs, gets stuck, or falls too far behind, the leader removes it from the ISR (in-sync replicas: the set of replicas that are in sync with the leader). The follower is added back to the ISR once it has caught up.
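The ISR bookkeeping can be pictured roughly as follows. This is a sketch, not broker code: `FollowerState` and `current_isr` are hypothetical names, and the 10-second threshold is an illustrative value (in Kafka the corresponding broker setting is `replica.lag.time.max.ms`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FollowerState:
    broker_id: int
    last_caught_up_ms: int  # last time this follower had fully caught up


def current_isr(leader_id: int, followers: List[FollowerState],
                now_ms: int, max_lag_ms: int = 10_000) -> List[int]:
    """Return the leader plus every follower that caught up within max_lag_ms."""
    isr = [leader_id]
    for follower in followers:
        if now_ms - follower.last_caught_up_ms <= max_lag_ms:
            isr.append(follower.broker_id)
    return isr


followers = [FollowerState(1, last_caught_up_ms=95_000),
             FollowerState(2, last_caught_up_ms=60_000)]
print(current_isr(0, followers, now_ms=100_000))  # [0, 1]: broker 2 lags too far

followers[1].last_caught_up_ms = 99_000           # broker 2 catches up again
print(current_isr(0, followers, now_ms=100_000))  # [0, 1, 2]
```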

7) Producer

  • The producer is the publisher of data; it pushes messages to a Kafka topic.
  • When the broker receives a message from the producer, it appends the message to the segment file currently open for appending.
  • Each message sent by the producer is stored in one partition, and the producer can also specify which partition to store it in.

8) Consumer

Consumers pull data from brokers. A consumer can consume data from multiple topics.

9) Consumer Group (CG)

  • A consumer group consists of one or more consumers.
  • Every consumer belongs to a consumer group, so a consumer group acts as one logical subscriber. A group name can be specified for each consumer; if none is specified, tools such as the console consumer assign one automatically.
  • Grouping multiple consumers to process the data of one topic increases consumption throughput.
  • Each consumer in a group is responsible for consuming data from different partitions. Within a group, a partition can be consumed by only one consumer, which prevents data from being read repeatedly.
  • Consumer groups do not affect each other.
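How partitions are divided inside one group can be sketched as below. The round-robin split and the consumer names are assumptions made for illustration; Kafka ships several assignment strategies that differ in detail.

```python
from typing import Dict, List


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer of the group, round-robin."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


# Three partitions shared by two consumers of the same group.
print(assign([0, 1, 2], ["c1", "c2"]))              # {'c1': [0, 2], 'c2': [1]}
# More consumers than partitions: the extra consumer stays idle.
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

Because every partition appears in exactly one list, no message is read twice within the group; a second group would get its own, independent assignment.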

10) Offset

  • An offset uniquely identifies a message within a partition.
  • The offset determines where reading starts. Because each consumer tracks its own offset, consumers do not interfere with one another. The consumer uses the offset to determine the message to be read next (that is, the consumption position).
  • After a message is consumed, it is not deleted immediately, so multiple services can reuse the same Kafka messages.
  • A service can also re-read messages by resetting its offset; this is controlled by the user.
  • Messages are eventually deleted; the default retention period is 1 week (7*24 hours).
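A toy model makes the role of the offset concrete. `PartitionLog` and `SketchConsumer` are hypothetical classes written for this sketch; they keep everything in memory and ignore retention.

```python
from typing import List


class PartitionLog:
    """Append-only list standing in for one partition."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, value: str) -> int:
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record, starting at 0

    def read(self, offset: int, max_records: int = 10) -> List[str]:
        # Reading never deletes anything from the log.
        return self._records[offset:offset + max_records]


class SketchConsumer:
    """Each consumer keeps its own position in the log."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        self.offset = offset  # rewind (or skip ahead) on purpose


log = PartitionLog()
for value in ["a", "b", "c"]:
    log.append(value)

billing, audit = SketchConsumer(log), SketchConsumer(log)
print(billing.poll())   # ['a', 'b', 'c']
print(audit.poll(2))    # ['a', 'b']: an independent position on the same data
billing.seek(1)
print(billing.poll())   # ['b', 'c']: re-read after resetting the offset
```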

11) Zookeeper

  • Kafka uses Zookeeper to store the metadata of the cluster.
  • A consumer may fail during consumption, for example because of a power outage or a crash. After it recovers, it needs to continue from the position it had reached before the failure. The consumer therefore has to record the offset it has consumed up to, so that it can resume after recovery.
  • Before Kafka version 0.9, the consumer saved the offset in Zookeeper by default; starting from version 0.9, the consumer saves the offset in a built-in Kafka topic by default, which is __consumer_offsets.
  • In other words, in the early versions clients relied on Zookeeper both to find out where the nodes of the Kafka cluster were and to store the offset that records where consumption last stopped. From version 0.9 on, clients connect to the brokers directly and offsets live in __consumer_offsets, while the brokers still use Zookeeper for cluster metadata.

2. Overview of Zookeeper

1. Zookeeper working mechanism

From the perspective of design patterns, Zookeeper is a distributed service management framework based on the observer pattern. It stores and manages the data that everyone cares about and accepts registrations from observers. Once the state of that data changes, Zookeeper notifies the observers registered with it so that they can react accordingly.

In other words, Zookeeper = file system + notification mechanism.
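The "file system + notification" idea can be sketched as a path-keyed store with watchers. `SketchZNodeStore` is a hypothetical in-memory class, not the ZooKeeper client API; it models watches as one-shot triggers, which is how standard ZooKeeper watches behave.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


class SketchZNodeStore:
    """Path-addressed data plus change notifications, in memory only."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str, str], None]]] = defaultdict(list)

    def get(self, path: str) -> Optional[str]:
        return self._data.get(path)

    def watch(self, path: str, callback: Callable[[str, str], None]) -> None:
        self._watchers[path].append(callback)  # observer registers interest

    def set(self, path: str, value: str) -> None:
        self._data[path] = value
        # Watches fire once and are then cleared; observers must re-register.
        callbacks, self._watchers[path] = self._watchers[path], []
        for callback in callbacks:
            callback(path, value)


store = SketchZNodeStore()
events = []
store.watch("/config/kafka", lambda path, value: events.append((path, value)))
store.set("/config/kafka", "v1")   # notifies the watcher
store.set("/config/kafka", "v2")   # no watcher registered any more
print(events)                      # [('/config/kafka', 'v1')]
print(store.get("/config/kafka"))  # v2
```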

2. Features of Zookeeper

1) A Zookeeper cluster consists of one leader and multiple followers.

2) As long as more than half of the nodes in the Zookeeper cluster survive, the cluster can serve normally. This is why Zookeeper is best installed on an odd number of servers.

3) Global data consistency: each server saves a copy of the same data, and the data is consistent no matter which server the client connects to.

4) Update requests are executed in order: requests from the same client are executed in the order in which they were sent (first in, first out).

5) Data updates are atomic: an update either succeeds or fails.

6) Timeliness: within a certain time bound, the client can read the latest data.
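The "more than half" rule in feature 2 also explains the advice about odd cluster sizes. The short calculation below (function names are made up for this sketch) shows that adding a fourth or sixth server does not let the cluster survive any additional failure.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of servers that is strictly more than half."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many servers may fail while a majority still survives."""
    return cluster_size - quorum_size(cluster_size)


for size in (3, 4, 5, 6):
    print(size, quorum_size(size), tolerated_failures(size))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
```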

3. Zookeeper data structure

The ZooKeeper data model is structured much like the Linux file system: as a whole it can be viewed as a tree, and each node is called a ZNode. By default each ZNode can store up to 1 MB of data, and each ZNode is uniquely identified by its path.

4. Zookeeper application scenarios

The services provided include: unified naming service, unified configuration management, unified cluster management, dynamic online and offline server nodes, soft load balancing, etc.

1) Unified naming service

In a distributed environment, it is often necessary to uniformly name applications/services for easy identification. For example: IP is not easy to remember, but domain name is easy to remember.

2) Unified configuration management

(1) In a distributed environment, configuration file synchronization is very common. It is generally required that in a cluster, the configuration information of all nodes is consistent, such as Kafka cluster. After the configuration file is modified, it is hoped that it can be quickly synchronized to each node.

(2) Configuration management can be implemented by ZooKeeper. Configuration information can be written to a Znode on ZooKeeper. Each client server listens to this Znode. Once the data in Znode is modified, ZooKeeper will notify each client server.

3) Unified cluster management

(1) In a distributed environment, it is necessary to know the status of each node in real time. Some adjustments can be made according to the real-time status of the nodes.

(2) ZooKeeper can realize real-time monitoring of node status changes. Node information can be written to a ZNode on ZooKeeper. Listening to this ZNode can obtain its real-time status changes.

4) The server goes online and offline dynamically

Clients can see in real time when servers go online or offline.

5) Soft load balancing

Record the number of visits of each server in Zookeeper, and let the server with the least number of visits handle the latest client requests.
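The soft load balancing idea can be sketched with a plain dictionary standing in for the counters kept in Zookeeper. The server names and the `pick_server` helper are hypothetical.

```python
from typing import Dict


def pick_server(visit_counts: Dict[str, int]) -> str:
    """Choose the server with the fewest recorded visits and count the new visit."""
    # Sorting the names first makes ties resolve in a predictable order.
    server = min(sorted(visit_counts), key=lambda name: visit_counts[name])
    visit_counts[server] += 1
    return server


visits = {"server-a": 5, "server-b": 2, "server-c": 2}
print(pick_server(visits))  # server-b
print(pick_server(visits))  # server-c
print(visits)               # {'server-a': 5, 'server-b': 3, 'server-c': 3}
```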

5. Zookeeper election mechanism

Election at first startup:

Suppose there are 5 servers:

1) Server 1 starts and initiates an election. Server 1 votes for itself. It now has one vote, which is less than a majority (3 votes), so the election cannot be completed and server 1 stays in the LOOKING state;

2) Server 2 starts and another election is initiated. Servers 1 and 2 each vote for themselves and exchange ballots. Server 1 finds that the myid of server 2 is larger than that of its current choice (server 1) and changes its vote to server 2. Server 1 now has 0 votes and server 2 has 2 votes. No server has a majority, so the election cannot be completed and servers 1 and 2 stay in the LOOKING state;

3) Server 3 starts and initiates an election. Servers 1 and 2 both change their votes to server 3. The result of this round: server 1 has 0 votes, server 2 has 0 votes, and server 3 has 3 votes. Server 3 now has a majority and is elected leader. Servers 1 and 2 change their state to FOLLOWING, and server 3 changes its state to LEADING;

4) Server 4 starts and initiates an election. Servers 1, 2, and 3 are no longer in the LOOKING state and do not change their ballots. The result of exchanging ballots: server 3 has 3 votes and server 4 has 1 vote. Server 4 follows the majority, changes its vote to server 3, and changes its state to FOLLOWING;

5) Server 5 starts and, like server 4, becomes a follower.
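The five steps can be replayed with a small simulation. It is a simplification under stated assumptions: servers start one at a time, all have the same epoch and transaction id at first startup, and every LOOKING server ends up voting for the largest myid seen so far. The function name is made up for this sketch.

```python
from typing import Dict, List, Optional, Tuple


def first_startup_election(start_order: List[int]) -> Tuple[Optional[int], Dict[int, str]]:
    """Replay a first-startup election for servers started in the given order."""
    quorum = len(start_order) // 2 + 1   # more than half of the whole cluster
    started: List[int] = []
    states: Dict[int, str] = {}
    leader: Optional[int] = None
    for sid in start_order:
        started.append(sid)
        if leader is not None:
            states[sid] = "FOLLOWING"    # a leader already exists, just follow it
            continue
        candidate = max(started)         # largest myid collects every vote
        votes = len(started)
        if votes >= quorum:
            leader = candidate
            for server in started:
                states[server] = "LEADING" if server == leader else "FOLLOWING"
        else:
            for server in started:
                states[server] = "LOOKING"
    return leader, states


leader, states = first_startup_election([1, 2, 3, 4, 5])
print(leader)  # 3
print(states)  # {1: 'FOLLOWING', 2: 'FOLLOWING', 3: 'LEADING', 4: 'FOLLOWING', 5: 'FOLLOWING'}
```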

Election when it is not the first startup:

1. When a server in the ZooKeeper cluster has one of the following two situations, it will start to enter the Leader election:

1) The server is initialized and started.

2) The connection to the Leader cannot be maintained while the server is running.

2. When a machine enters the Leader election process, the current cluster may also be in the following two states:

1) There is already a Leader in the cluster.

  • For the case where a leader already exists, when the machine tries to elect a leader, it will be informed of the leader information of the current server. For this machine, it only needs to establish a connection with the leader machine and perform state synchronization.

2) There is indeed no Leader in the cluster.

  • Suppose the ZooKeeper cluster consists of 5 servers with SIDs 1, 2, 3, 4, and 5 and ZXIDs 8, 8, 8, 7, and 7, and the server with SID 3 is the leader. At some point servers 3 and 5 fail, so a leader election begins among servers 1, 2, and 4.

  • Leader election rules:

    1. The larger EPOCH wins directly
    2. If the EPOCH is the same, the larger transaction id (ZXID) wins
    3. If the transaction id is the same, the larger server id (SID) wins

  • In the example, if the remaining servers have the same EPOCH, servers 1 and 2 tie on the largest ZXID (8), and server 2 becomes the leader because its SID is larger.
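The three rules amount to comparing (EPOCH, ZXID, SID) tuples in that order. The sketch below applies them to the example above; the `Vote` type is hypothetical, and an EPOCH of 1 for every surviving server is an assumption, since the example does not state the epochs.

```python
from typing import List, NamedTuple


class Vote(NamedTuple):
    epoch: int   # compared first
    zxid: int    # compared when epochs are equal
    sid: int     # compared when epoch and zxid are equal


def elect(candidates: List[Vote]) -> Vote:
    """Tuples compare field by field, which matches the three rules in order."""
    return max(candidates)


# Servers 3 and 5 have failed; servers 1, 2 and 4 remain (epoch assumed to be 1).
survivors = [Vote(1, 8, 1), Vote(1, 8, 2), Vote(1, 7, 4)]
print(elect(survivors).sid)  # 2
```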

3. Overview of message queues

1. Why do you need a message queue

The main reason is that in a high-concurrency environment, synchronous requests cannot be processed in time and often block. For example, a large number of requests hitting the database concurrently causes row locks and table locks. Eventually too many request threads pile up, which triggers "too many connections" errors and causes an avalanche effect.

We use message queues to ease the pressure on the system by processing requests asynchronously. Message queues are often used in scenarios such as asynchronous processing, traffic peak shaving, application decoupling, and message communication.

Currently, the more common MQ middleware are: ActiveMQ, RabbitMQ, RocketMQ, Kafka, etc.

2. The benefits of using message queues

1) Decoupling

Allows you to extend or modify the processing on both sides independently, as long as they adhere to the same interface constraints.

2) Recoverability

When a part of the system fails, it does not affect the whole system. The message queue reduces the coupling between processes, so even if a process that processes messages hangs up, the messages added to the queue can still be processed after the system recovers.

3) Buffer

It helps to control and optimize the speed of data flow through the system, and solve the situation that the processing speed of production messages and consumption messages is inconsistent.

4) Flexibility & peak processing capacity

In the case of a surge in traffic, the application still needs to continue to function, but such bursts of traffic are uncommon. It is undoubtedly a huge waste to invest resources on standby at all times to handle such peak access. The use of message queues can enable key components to withstand sudden access pressure without completely crashing due to sudden overload requests.

5) Asynchronous communication

Many times, users don't want to and don't need to process messages right away. Message queues provide an asynchronous processing mechanism that allows users to put a message into a queue without processing it immediately. Put as many messages on the queue as you want, and process them when needed.

3. Two modes of message queue

1) Point-to-point mode (one-to-one, consumers actively pull data, and the message is cleared after the message is received)

  • The message producer produces the message and sends it to the message queue, and then the message consumer takes out the message from the message queue and consumes the message. After the message is consumed, there is no more storage in the message queue, so it is impossible for the message consumer to consume the message that has already been consumed. The message queue supports multiple consumers, but for a message, only one consumer can consume it.

2) Publish/subscribe mode (one-to-many, also known as observer mode, consumers will not clear messages after consuming data)

  • A message producer (publisher) publishes a message to a topic, and multiple message consumers (subscribers) consume the message. Unlike the point-to-point mode, a message published to a topic is consumed by all subscribers.
  • The publish/subscribe mode defines a one-to-many dependency relationship between objects, so that whenever the state of an object (target object) changes, all objects (observer objects) that depend on it will be notified and automatically updated.
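The difference between the two modes can be seen in a minimal in-memory sketch. Both classes are hypothetical and only model delivery semantics, not persistence or networking.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class PointToPointQueue:
    """Each message is delivered to exactly one consumer and then removed."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        # The consumer pulls; a received message is gone from the queue.
        return self._messages.popleft() if self._messages else None


class SketchTopic:
    """Each published message is delivered to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, message: str) -> None:
        for callback in self._subscribers:
            callback(message)


queue = PointToPointQueue()
queue.send("job-1")
print(queue.receive())  # job-1
print(queue.receive())  # None: the message cannot be consumed a second time

inbox_a, inbox_b = [], []
topic = SketchTopic()
topic.subscribe(inbox_a.append)
topic.subscribe(inbox_b.append)
topic.publish("event-1")
print(inbox_a, inbox_b)  # ['event-1'] ['event-1']
```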

4. Install Zookeeper

Prepare 3 servers as a Zookeeper cluster:

192.168.80.5

192.168.80.8

192.168.80.9

Perform the following steps on all three machines.

1. Unzip and install the zookeeper software package

Upload the zookeeper installation package to /opt

cd /opt
tar zxf apache-zookeeper-3.5.7-bin.tar.gz
mv apache-zookeeper-3.5.7-bin /usr/local/zookeeper-3.5.7
cd /usr/local/zookeeper-3.5.7/conf/
cp zoo_sample.cfg zoo.cfg

2. Modify the configuration file

vim zoo.cfg
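The edited contents of zoo.cfg are not shown here. As a rough guide only, the sketch below renders what a three-node configuration might look like. The data and log directories and the peer ports 2888 and 3888 are assumptions; tickTime, initLimit, syncLimit and clientPort use the values found in zoo_sample.cfg. Each server.N line must match the myid of the corresponding machine.

```python
from typing import List


def render_zoo_cfg(hosts: List[str], data_dir: str, log_dir: str,
                   client_port: int = 2181, peer_port: int = 2888,
                   election_port: int = 3888) -> str:
    """Render a zoo.cfg for a small cluster; every value here is illustrative."""
    lines = [
        "tickTime=2000",               # heartbeat interval in milliseconds
        "initLimit=10",                # ticks a follower may take to connect and sync
        "syncLimit=5",                 # ticks a follower may lag behind the leader
        f"dataDir={data_dir}",         # the myid file must be placed in this directory
        f"dataLogDir={log_dir}",
        f"clientPort={client_port}",
    ]
    for myid, host in enumerate(hosts, start=1):
        # server.<myid>=<host>:<peer port>:<election port>
        lines.append(f"server.{myid}={host}:{peer_port}:{election_port}")
    return "\n".join(lines) + "\n"


print(render_zoo_cfg(
    ["192.168.80.5", "192.168.80.8", "192.168.80.9"],
    data_dir="/usr/local/zookeeper-3.5.7/data",   # assumed path
    log_dir="/usr/local/zookeeper-3.5.7/logs",    # assumed path
))
```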

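3. Specify the corresponding node number for each machine

The myid file must be placed in the directory that dataDir in zoo.cfg points to, and each machine needs a different number.

cd /usr/local/zookeeper-3.5.7
mkdir data logs
echo 1 > data/myid    # use 2 and 3 on the other two machines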

4. Start zookeeper

cd /usr/local/zookeeper-3.5.7/bin
./zkServer.sh start

5. After starting, check the zookeeper status of the three nodes

./zkServer.sh status

5. Install kafka 

Perform the following steps on all three machines.

1. Unzip and install the kafka software package

tar zxf kafka_2.13-2.7.1.tgz
mv kafka_2.13-2.7.1 /usr/local/kafka

2. Modify the configuration file

cd /usr/local/kafka/config/
vim server.properties
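The values edited in server.properties are not shown here. The sketch below renders a handful of settings that typically differ per broker. The property names are standard Kafka broker settings, but the concrete values are assumptions: the log directory is made up, port 9092 is taken from the netstat check in the start step, and 168 hours corresponds to the one-week retention mentioned earlier.

```python
from typing import List


def render_server_properties(broker_id: int, host: str, zk_hosts: List[str],
                             log_dir: str = "/usr/local/kafka/logs") -> str:
    """Render the per-broker part of server.properties; values are illustrative."""
    zk_connect = ",".join(f"{zk_host}:2181" for zk_host in zk_hosts)
    lines = [
        f"broker.id={broker_id}",                # must be unique for every broker
        f"listeners=PLAINTEXT://{host}:9092",    # address this broker listens on
        f"log.dirs={log_dir}",                   # where partition segment files are stored
        "log.retention.hours=168",               # keep messages for 7 days
        f"zookeeper.connect={zk_connect}",       # the Zookeeper ensemble
    ]
    return "\n".join(lines) + "\n"


zk_hosts = ["192.168.80.5", "192.168.80.8", "192.168.80.9"]
for broker_id, host in enumerate(zk_hosts):
    print(render_server_properties(broker_id, host, zk_hosts))
```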

3. Add relevant commands to the system environment

vim /etc/profile
# append the following two lines
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin

source /etc/profile   # reload the environment variables

4. Start kafka

cd /usr/local/kafka/config/
kafka-server-start.sh -daemon server.properties
netstat -antp | grep 9092

5. Create topic

[root@localhost bin]# pwd
/usr/local/kafka/bin
[root@localhost bin]# kafka-topics.sh --create --zookeeper \
> 192.168.80.5:2181,192.168.80.8:2181,192.168.80.9:2181 \
> --partitions 3 \
> --replication-factor 2 \
> --topic test
Created topic test.
[root@localhost bin]# kafka-topics.sh --describe --zookeeper 192.168.80.5:2181

6. Test topic 

6. Summary

1. Zookeeper

zookeeper: a distributed service management framework; role: file system + notification mechanism

Essence: it stores and manages the metadata of distributed applications and notifies clients when the state of an application service changes.

2. Message queue MQ

Web application middleware: nginx tomcat apache haproxy squid varnish

MQ message queue middleware: redis kafka rabbitMQ rocketMQ activeMQ

3. Kafka architecture

broker: a Kafka server; a Kafka cluster consists of multiple brokers.

topic: a message queue; both producers and consumers work against topics.

producer: pushes message data to a topic on the broker.

consumer: pulls message data from a topic on the broker.

partition: a topic can be divided into one or more partitions to speed up message transmission (reading and writing).

  • Message data within a partition is ordered, but there is no ordering across partitions. Scenarios that require strict ordering, such as flash sales and red envelopes, can use only one partition.

replica: a backup of a partition; the leader handles reads and writes, and the followers serve as backups.

offset: records the position up to which the consumer has consumed, so that consumption can continue from the next message.

zookeeper: stores the metadata of the Kafka cluster. Before Kafka 0.9 it also stored consumer offsets, and clients used it to locate the cluster; from 0.9 on, offsets are stored in the __consumer_offsets topic.


Origin blog.csdn.net/TTSuzuka/article/details/128290880