42. Introduction to Kafka

With the previous article, our coverage of Flume came to an end; from here on we turn to Kafka. Kafka is a distributed message queue based on the publish/subscribe model, used mainly for real-time processing of big data. Follow the column "Break the Cocoon and Become a Butterfly - Big Data" for more related content~


Table of Contents

I. Kafka Overview

II. Message Queues

III. Characteristics of Kafka

IV. Architecture of Kafka

V. Message Sending Process

VI. Kafka Application Scenarios


I. Kafka Overview

Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, and durable log service, mainly used to process active streaming data.

In a big data system, a common problem arises: the system is composed of various subsystems, and data needs to flow among them continuously, with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka appeared in order to serve both online applications (messages) and offline applications (data files, logs) at the same time. Kafka plays two roles: (1) it reduces the complexity of system networking; (2) it reduces programming complexity: instead of each pair of subsystems negotiating an interface with each other, every subsystem simply plugs into Kafka, which acts as a high-speed data bus.

II. Message Queues

Message queues have two modes: point-to-point and publish/subscribe. Kafka uses the latter.

The point-to-point mode is one-to-one: consumers actively pull data, and a message is removed from the queue once it has been received. The producer sends a message to the queue, and a consumer takes it out of the queue and consumes it. Once consumed, the message is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed. A queue can have multiple consumers, but each message can be consumed by only one of them.

The publish/subscribe model is one-to-many, and messages are not cleared after consumers read them. The producer (publisher) publishes a message to a topic, and multiple consumers (subscribers) consume the message at the same time. Unlike the point-to-point mode, a message published to a topic is consumed by all subscribers.
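In Kafka, the choice between queue-like and broadcast semantics is made through consumer groups: consumers sharing a group.id split a topic's partitions among themselves (each message reaches one of them), while consumers in different groups each receive every message. Below is a minimal consumer sketch using the official Java client; the broker address, topic name, and group id are placeholders for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        // Consumers with the SAME group.id share the topic's partitions (queue semantics);
        // consumers with DIFFERENT group.ids each get every message (publish/subscribe).
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running two copies of this program with the same group.id splits the partitions between them; changing the group.id in one copy makes both receive every message.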

III. Characteristics of Kafka

(1) High throughput for both publishing and subscribing. Reportedly, Kafka can produce about 250,000 messages (50 MB) per second and process about 550,000 messages (110 MB) per second.

(2) Persistent storage. Messages are persisted to disk, so they can serve batch consumers, such as ETL jobs, as well as real-time applications. Persisting data to disk combined with replication prevents data loss.

(3) A distributed system that is easy to scale out. There can be multiple producers, brokers, and consumers, all distributed, and machines can be added without downtime.

(4) The state of message consumption is maintained on the consumer side rather than on the server side, and consumers rebalance automatically on failure.

(5) Supports both online and offline scenarios.

IV. Architecture of Kafka

The overall architecture of Kafka is simple: it is an explicitly distributed architecture in which there can be multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka's client interfaces. Data is sent from producers to brokers, and the broker acts as an intermediary that caches the data and distributes it to the consumers registered with the system. A broker's role is similar to a cache sitting between active data and an offline processing system. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of any programming language. A few basic concepts follow:

1. Topic: a category of message feed handled by Kafka.

2. Partition: the physical grouping of a topic. A topic can be divided into multiple partitions, each of which is an ordered queue. Each message in a partition is assigned an ordered id (offset).

3. Message: the basic unit of communication. Each producer can publish messages to a topic.

4. Producers: producers of messages and data. Any process that publishes messages to a Kafka topic is called a producer.

5. Consumers: consumers of messages and data. Any process that subscribes to topics and processes the published messages is called a consumer.

6. Broker: a caching proxy. One or more servers in a Kafka cluster are collectively called brokers.

7. Segment: the unit of physical storage within a partition; a partition is composed of multiple segments.

8. Offset: each partition consists of a series of ordered, immutable messages that are continuously appended to it. Each message in a partition has a sequential number, called the offset, that uniquely identifies the message within the partition.

9. Consumer Group (CG): a group made up of multiple consumers. Each consumer in a group consumes data from different partitions, and a partition can be consumed by only one consumer within the group; consumer groups do not affect one another. Every consumer belongs to some consumer group, so the consumer group is the logical subscriber.
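To make the topic/partition/broker relationship concrete, here is a sketch that creates a topic with several partitions using the Java AdminClient; the topic name, partition count, and replication factor are illustrative values, and the replication factor must not exceed the number of brokers in the cluster.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic across brokers for parallelism;
            // replication factor 2 keeps a redundant copy of each partition.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```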

V. Message Sending Process

The producer publishes a message to a partition of the specified topic according to the configured partitioning method (round-robin, hash, etc.). After the Kafka cluster receives the message, it persists the message to disk and retains it for a configurable duration, regardless of whether the message has been consumed. The consumer pulls data from the Kafka cluster and controls the offset at which it reads messages.
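As a minimal sketch of this flow with the Java client, the producer below sends a keyed record (keyed records are hash-partitioned by key by the default partitioner) and uses a callback to print the partition and offset the broker assigned; the broker address, topic, key, and value are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with a key are hash-partitioned by key; records without
            // a key are spread across partitions by the default partitioner.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "user-42", "hello kafka");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    // The broker reports where the message landed.
                    System.out.printf("partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```

On the consuming side, the consumer can likewise take control of the read position, for example with KafkaConsumer.seek(), rather than relying only on committed offsets.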

VI. Kafka Application Scenarios

1. Message queue

Compared with most messaging systems, Kafka offers better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications. Messaging uses are often comparatively low-throughput, but they may require low end-to-end latency and depend on the strong durability guarantees Kafka provides. In this domain, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

2. Behavior tracking

Another application scenario for Kafka is tracking user behavior such as browsing and searching, recording it in real time into corresponding topics in a publish-subscribe fashion. Once subscribers receive these events, they can process them further in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse for processing.

3. Meta-information monitoring

Kafka can be used as a monitoring module for operational records, that is, for collecting and recording operational information; this can be understood as data monitoring for operations and maintenance purposes.

4. Log collection

For log collection there are many open-source products, including Scribe and Apache Flume, and many people use Kafka for log aggregation instead. Log aggregation generally means collecting log files from servers and putting them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and treats logs or events more cleanly as a stream of messages. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed processing. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

5. Stream processing 

This may be the most common set of scenarios, and it is easy to understand: collect and store streaming data, then hand it to Storm or another stream-computing framework (Spark Streaming, Flink) for processing. Many users process data from an original topic in stages, aggregating it, enriching it, or otherwise transforming it into new topics before continuing with subsequent processing. For example, the processing flow for article recommendation might first crawl article content from RSS feeds and throw it into a topic named "article"; subsequent steps might clean the content, for example normalizing the data or removing duplicates, and finally return content-matching results to the user. This creates a chain of real-time data-processing stages beyond the original independent topic. Storm and Samza are well-known frameworks that implement this kind of data transformation.
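As a sketch of such staged processing, the snippet below uses the Kafka Streams API (shipped with Kafka) to read from the "article" topic mentioned above, normalize each record, and write the result to a hypothetical downstream topic; the topic names and the cleanup logic are illustrative only.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ArticleCleaner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleaner");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("article");
        // Stage 1: normalize the text; further stages could deduplicate or enrich it.
        raw.mapValues(text -> text.trim().toLowerCase())
           .to("article-cleaned");   // hypothetical downstream topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```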

6. Event sourcing

Event sourcing is a style of application design in which state changes are recorded as a chronologically ordered sequence of records. Kafka's ability to store large amounts of log data makes it an excellent backend for applications built this way, such as a dynamic news feed.

7. Persistent log (commit log)

Kafka can serve as an external persistent log for a distributed system. Such a log helps replicate data between nodes and acts as a resynchronization mechanism for failed nodes to recover their data. Kafka's log compaction feature supports this usage. In this usage, Kafka is similar to the Apache BookKeeper project.
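Log compaction is enabled per topic. As an illustrative sketch, the topic below is created with cleanup.policy=compact, so Kafka retains at least the latest record for each key instead of deleting purely by age; the topic name and sizing are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CompactedTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic changelog = new NewTopic("state-changelog", 1, (short) 1)
                    // Compaction keeps the most recent record per key, which
                    // suits commit-log / changelog usage.
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(changelog)).all().get();
        }
    }
}
```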

 

The above is a basic introduction to Kafka. What problems did you encounter along the way? Feel free to leave a comment and let me see what problems you all ran into~
