Kafka Learning (2): Basic Concepts

1. Overview

Kafka was originally developed by LinkedIn. It is a distributed, partitioned, multi-replica, multi-subscriber log system coordinated by ZooKeeper (it can also be used as an MQ system), and it is commonly used for web/nginx logs, access logs, messaging services, and so on. LinkedIn contributed it to the Apache Foundation, where it became a top-level open-source project.

Its main application scenarios are log collection systems and messaging systems.

The main design goals of Kafka are as follows:

  • Message persistence with O(1) time complexity, providing constant-time access performance even for terabytes of data.
  • High throughput: even on very cheap commodity machines, a single machine can handle 100K messages per second.
  • Partitioning of messages across Kafka servers and distributed consumption, while guaranteeing the order of messages within each partition.
  • Support for both offline data processing and real-time data processing.
  • Scale out: support for online horizontal expansion.

2. Advantages of Kafka

2.1 Decoupling

At the beginning of a project, it is extremely difficult to predict what requirements it will eventually face. A message system inserts an implicit, data-based interface layer in the middle of the processing pipeline, and the processes on both sides implement this interface. This allows you to extend or modify the processing on either side independently, as long as both sides adhere to the same interface constraints.

2.2 Redundancy (copy)

In some cases the process handling the data will fail, and unless the data is persisted it will be lost. A message queue persists data until it has been completely processed, which avoids the risk of data loss. In the "insert-get-delete" paradigm adopted by many message queues, before a message is deleted from the queue, your processing system must explicitly indicate that the message has been processed, ensuring that the data is kept safely until you are done with it.
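In Kafka terms, the closest analogue of "insert-get-delete" is committing consumer offsets only after processing. Below is a minimal sketch using the Kafka Java client, assuming a broker at localhost:9092; the group and topic names are hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "payment-processors");      // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // do not mark messages done automatically

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // your processing logic
                }
                // Commit only after processing: a crash before this line means the
                // records are re-delivered rather than lost (at-least-once semantics).
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) { /* ... */ }
}
```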

2.3 Scalability

Because the message queue decouples your processing, it is easy to increase the rate at which messages are enqueued and processed: just add more processors, with no code changes or parameter tuning. Scaling is as simple as turning a dial.

2.4 Flexibility & Peak Processing Capability

When access spikes, the application still needs to keep working, but such burst traffic is uncommon; it would be a huge waste to provision standby resources sized for the peak as the norm. A message queue lets critical components absorb sudden access pressure instead of collapsing completely under overloaded requests.

2.5 Recoverability

When one part of the system fails, it does not take down the whole system. The message queue reduces the coupling between processes, so even if a process that handles messages dies, the messages already added to the queue can still be processed once the system recovers.

2.6 Order guarantee

In most usage scenarios the order of data processing matters. Most message queues are inherently ordered and can guarantee that data is processed in a specific order. Kafka guarantees the order of messages within a partition.
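A common way to exploit the per-partition guarantee is to give related messages the same key: the default partitioner maps equal keys to the same partition, so those messages keep their relative order. A minimal sketch with the Java client (the broker address and topic name are assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records for order-42 share one key, so the default partitioner
            // routes them to the same partition and their relative order is kept.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.send(new ProducerRecord<>("orders", "order-42", "paid"));
            producer.send(new ProducerRecord<>("orders", "order-42", "shipped"));
            producer.flush();
        }
    }
}
```

Note that this ordering holds within the partition only; messages with different keys may land in different partitions and interleave arbitrarily.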

2.7 Buffer

In any nontrivial system there are components with different processing times; for example, loading an image takes less time than applying a filter to it. A message queue acts as a buffer layer that helps tasks run at top efficiency: writes to the queue are handled as fast as possible. This buffering helps control and optimize the speed at which data flows through the system.

2.8 Asynchronous communication

Often, users do not want or need to process messages immediately. The message queue provides an asynchronous processing mechanism: you can put a message on the queue without processing it right away, enqueue as many messages as you like, and process them when needed.
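With the Kafka Java client, this asynchrony is the default behavior: send() buffers the record and returns immediately, and an optional callback fires once the broker acknowledges it. A minimal sketch, with illustrative names:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "user-signup"); // hypothetical topic/payload
            // send() is asynchronous: it buffers the record and returns at once.
            // The callback runs later, when the broker confirms (or rejects) the write.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush(); // block until buffered records are actually sent
        }
    }
}
```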

3. Comparison of commonly used message queues

3.1 RabbitMQ

RabbitMQ is an open-source message queue written in Erlang. It supports many protocols (AMQP, XMPP, SMTP, STOMP), which makes it heavyweight and better suited to enterprise development. It implements the broker architecture, meaning messages are queued in a central queue before being delivered to clients, and it has good support for routing, load balancing, and data persistence.

3.2 Redis

Redis is a key-value NoSQL database with very active development and maintenance. Although it is a key-value storage system, it also supports MQ-style operations, so it can be used as a lightweight queue service. One benchmark compared RabbitMQ and Redis on enqueue and dequeue: each operation was executed 1 million times, with the elapsed time recorded every 100,000 operations, for four payload sizes (128 bytes, 512 bytes, 1 KB, and 10 KB). The results: on enqueue, Redis outperforms RabbitMQ for small payloads but becomes unbearably slow once the payload exceeds 10 KB; on dequeue, Redis performs very well regardless of payload size, while RabbitMQ's dequeue performance is far below Redis's.
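As a rough illustration of the queue usage (not of the benchmark itself), a Redis list with LPUSH/RPOP behaves as a FIFO queue. The sketch below assumes the Jedis Java client and a local Redis instance:

```java
import redis.clients.jedis.Jedis;

public class RedisQueueDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // assumed Redis address
            // Enqueue at the head of the list...
            jedis.lpush("jobs", "job-1", "job-2");
            // ...dequeue from the tail, so the list behaves as a FIFO queue.
            String job = jedis.rpop("jobs"); // returns "job-1"
            System.out.println("dequeued: " + job);
        }
    }
}
```

Note that a plain RPOP removes the message immediately: a consumer that crashes after popping loses it, since there is no acknowledgment step as in RabbitMQ.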

3.3 ZeroMQ

ZeroMQ is billed as the fastest message queuing system, especially for high-throughput scenarios. It can implement the advanced/complex queues that RabbitMQ is not good at, but developers need to assemble multiple technical pieces themselves, and that complexity is a challenge to applying it successfully. ZeroMQ's distinctive feature is its broker-less model: you do not need to install and run a message server or middleware, because your application itself plays that role. You simply reference the ZeroMQ library (installable via NuGet in the .NET world), and you can then happily send messages between applications. However, ZeroMQ provides only non-persistent queues, so if a node goes down, data is lost. Notably, Twitter's Storm used ZeroMQ as its default data-stream transport in versions before 0.9.0 (from version 0.9 onward, Storm supports both ZeroMQ and Netty as transport modules).

3.4 ActiveMQ

ActiveMQ is a sub-project of Apache. Like ZeroMQ, it can implement queues in both brokered and peer-to-peer styles; like RabbitMQ, it can implement advanced application scenarios efficiently with a small amount of code.

3.5 Kafka / Jafka

Kafka is an Apache project: a high-performance, cross-language, distributed publish/subscribe message queuing system. Jafka was incubated on top of Kafka as an upgraded version of it. Kafka has the following characteristics: fast persistence, with message persistence at O(1) overhead; high throughput, reaching around 100K messages per second on an ordinary server; a fully distributed system in which brokers, producers, and consumers all natively support distribution and automatic load balancing; and support for parallel loading of data into Hadoop. For systems like Hadoop that handle log data and offline analysis but also face real-time processing constraints, Kafka is a feasible solution: it unifies online and offline message processing through Hadoop's parallel loading mechanism. Compared with ActiveMQ, Kafka is a very lightweight messaging system; besides very good performance, it is also a well-functioning distributed system.

4. Architecture and terminology

4.1 Kafka architecture

(Figure: a typical Kafka cluster: several Producers, several Brokers, several Consumer Groups, and a ZooKeeper cluster)

As the figure shows, a typical Kafka cluster contains a number of producers (which can be page views generated by the web front end, server logs, system CPU or memory metrics, and so on), a number of brokers (Kafka supports horizontal expansion; in general, the more brokers, the higher the cluster throughput), a number of consumer groups, and a ZooKeeper cluster. Kafka uses ZooKeeper to manage cluster configuration, elect leaders, and rebalance when a consumer group changes. Producers publish messages to brokers in push mode; consumers subscribe to and consume messages from brokers in pull mode.

4.2 Explanation of terms

  • Broker

The Kafka cluster contains one or more servers, and the server nodes are called brokers.

The broker stores topic data. If a topic has N partitions and the cluster has N brokers, then each broker stores one of the topic's partitions.

If a topic has N partitions and the cluster has N + M brokers, then N brokers each store one partition of the topic, and the remaining M brokers store no partition data for that topic.

If a topic has N partitions and the cluster has fewer than N brokers, then each broker stores one or more of the topic's partitions. Try to avoid this situation in production, as it easily leads to unbalanced data across the Kafka cluster.

  • Topic

Every message published to the Kafka cluster has a category, called a topic. (Physically, messages of different topics are stored separately. Logically, a topic's messages may be stored on one or more brokers, but users only need to specify the topic to produce or consume data, without caring where the data is stored.)

  • Partition

The data in a topic is split into one or more partitions, and each topic has at least one partition. The data within each partition is stored as a sequence of segment files. Data within a partition is ordered, but ordering is not preserved between partitions: if a topic has multiple partitions, the overall order of data cannot be guaranteed at consumption time. In scenarios where strict message ordering must be guaranteed, set the number of partitions to 1.

  • Replica
  1. When a topic's replication factor is N (with N greater than 1), each partition has N replicas (Replica). Kafka's replicas consist of a leader and followers.
  2. The number of replicas is at most the number of brokers; that is, each broker holds at most one replica of a given partition, so a replica can be identified by the broker id it lives on.
  3. By default, the replicas of all partitions are distributed evenly across all brokers.
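Both the partition count and the replication factor are fixed when a topic is created. A minimal sketch using the Java AdminClient, assuming a cluster with at least two brokers reachable at localhost:9092; the topic name is hypothetical:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition gets one leader
            // and one follower, spread across the brokers. The cluster must have
            // at least 2 brokers, since a broker holds at most one replica per partition.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```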
  • Producer

The producer is the publisher of data; this role publishes messages to a Kafka topic. When a broker receives a message from a producer, it appends the message to the segment file currently used for appending data. A message sent by the producer is stored in one partition, and the producer can also specify which partition it is stored in.
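In the Java client, the ProducerRecord constructor accepts an optional explicit partition number, which bypasses the partitioner entirely. A minimal sketch (topic, partition, and payload are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExplicitPartitionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Writes straight to partition 0 of "metrics", skipping the partitioner.
            producer.send(new ProducerRecord<>("metrics", 0, "host-a", "cpu=0.73"));
            producer.flush();
        }
    }
}
```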

  • Consumer

Consumers read data from brokers; a consumer can consume data from multiple topics (as shown in the sketch after the Consumer Group entry below).

  • Consumer Group

Each consumer belongs to a specific consumer group (you can specify a group name for each consumer; if you do not, it belongs to the default group).
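In the Java client, group membership is just a configuration property: every consumer started with the same group.id joins the same group, and Kafka divides the subscribed partitions among them. A minimal sketch (broker address and all names are assumptions); running two copies of it shows the partitions being rebalanced between them:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics"); // same id => same group => shared partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // A consumer may subscribe to several topics at once.
            consumer.subscribe(Arrays.asList("orders", "payments"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("%s[%d]@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```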

  • Leader

Each partition has multiple replicas, and exactly one of them is the leader: the replica currently responsible for reading and writing data for that partition.

  • Follower

A follower tracks the leader: all write requests are routed through the leader, the data is then replicated to all followers, and the followers stay in sync with the leader. If the leader fails, a new leader is elected from among the followers. When a follower hangs, gets stuck, or falls too far behind, the leader removes it from the "In-Sync Replicas" (ISR) list.
