Kafka Introduction, Basic Principles, Execution Process and Usage Scenarios

1. Introduction

Apache Kafka is a distributed publish-subscribe messaging system; that is, in fact, exactly how the Kafka official website defines it. It was originally developed at LinkedIn, which open-sourced it through the Apache Foundation in 2011, and it graduated to a top-level open source project in 2012. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit-log service.

A comparison of several distributed messaging systems:

(Figure: comparison of popular message queue systems)

Recommended reading, a comparison of various message queues and an in-depth analysis of Kafka:
http://blog.csdn.net/allthesametome/article/details/47362451

2. Kafka Basic Architecture

Its architecture includes the following components:

1. Topic: a category of message stream. Messages are byte payloads, and a topic is the category or feed name to which messages are published;

2. Producer: Any object that can publish messages to a topic;

3. Broker: published messages are stored on a set of servers called brokers, which together make up a Kafka cluster;

4. Consumer: any object that subscribes to one or more topics and pulls the published messages from the brokers;
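To make the four roles concrete, here is a minimal in-memory sketch in plain Python (no Kafka library; the class names and topic names are made up for illustration): one broker storing messages grouped by topic, producers publishing to topics, and a consumer subscribed to more than one topic.

```python
from collections import defaultdict

class Broker:
    """Stores published messages grouped by topic (a one-node stand-in for a Kafka cluster)."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of byte payloads

class Producer:
    """Any object that can publish messages to a topic."""
    def __init__(self, broker):
        self.broker = broker

    def publish(self, topic, payload):
        self.broker.topics[topic].append(payload)

class Consumer:
    """Subscribes to one or more topics and pulls their messages from the broker."""
    def __init__(self, broker, topics):
        self.broker = broker
        self.topics = list(topics)

    def pull(self):
        return {t: list(self.broker.topics[t]) for t in self.topics}

broker = Broker()
Producer(broker).publish("clicks", b"user-1 clicked /home")
Producer(broker).publish("searches", b"user-2 searched 'kafka'")

consumer = Consumer(broker, ["clicks", "searches"])
print(consumer.pull())
```

Note that the consumer asks the broker for data rather than the broker delivering it, which mirrors Kafka's pull model described later.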

(Figure: producers send data to brokers, which hold multiple topics; consumers fetch data from the brokers)

As the figure shows, producers send data to the brokers, each broker hosts multiple topics, and consumers fetch data from the brokers.

3. Basic Principles

We call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the intermediate storage layer the broker. With these three roles we can roughly describe the following scene:

(Figure: producer → broker → consumer)

The producer produces data and hands it to the broker for storage. When the consumer needs the data, it pulls it from the broker and then performs its own processing.

At first glance this looks too simple. Wasn't Kafka supposed to be distributed? Does merely placing the producer, broker, and consumer on three different machines count as distributed? Look at the official diagram from Kafka:

(Figure: official Kafka architecture, with multiple producers, brokers, and consumers coordinated through ZooKeeper)

Multiple brokers work together; producers and consumers are embedded in all kinds of business logic and invoked frequently; and the three parties coordinate requests and forwarding through ZooKeeper. Together they form a high-performance distributed message publish-subscribe system.

One detail in the figure deserves attention. The path from producer to broker is push: data is pushed to the broker as soon as it is produced. The path from consumer to broker is pull: the consumer actively pulls the data, rather than the broker pushing data out to the consumer.
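This push/pull asymmetry can be sketched with a toy append-only log (a hypothetical illustration, not Kafka's real API): the producer pushes by appending, while the consumer pulls by asking for messages starting at an offset it tracks itself. Consumer-side offsets are also what let a consumer control its own pace or re-read old data.

```python
class PullBroker:
    """Toy log: the broker only appends; consumers decide what to fetch and when."""
    def __init__(self):
        self.log = []

    def push(self, message):
        # Producer -> broker is push: data arrives as it is produced.
        self.log.append(message)

    def fetch(self, offset, max_messages=10):
        # Consumer -> broker is pull: the consumer asks for a batch at its offset.
        return self.log[offset:offset + max_messages]

broker = PullBroker()
for i in range(5):
    broker.push(f"event-{i}")

# The consumer tracks its own offset, so it controls pacing and batch size.
offset = 0
batch = broker.fetch(offset, max_messages=3)
offset += len(batch)
print(batch)  # ['event-0', 'event-1', 'event-2']

# Re-reading from the start is just another pull at offset 0.
replay = broker.fetch(0)
print(len(replay))  # 5
```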

4. The Role of ZooKeeper in Kafka

ZooKeeper was mentioned above, so what role does it play in Kafka?

(1) The Kafka cluster, producers, and consumers all rely on ZooKeeper to keep the system available; the cluster stores some of its metadata in ZooKeeper.

(2) Kafka uses ZooKeeper as its distributed coordination framework, tying together the processes of message production, message storage, and message consumption.

(3) With ZooKeeper's help, Kafka can establish subscription relationships between producers and consumers and balance load between them, while keeping all components, producers, consumers, and brokers alike, stateless.

5. Execution Process

First look at the following process:

(Figure: a single-broker deployment, with Server-1 as the Kafka broker, Server-2 as ZooKeeper, and Servers 3-5 as producers and consumers)

In the figure above the number of brokers has been reduced to just one. Now suppose we deploy as shown:

(1) Server-1 is the broker, i.e., the Kafka server itself; both producers and consumers connect to it. The broker is mainly responsible for storage.

(2) Server-2 is the ZooKeeper server. It maintains a table recording the IP address, port, and other information of each node.

(3) What Server-3, 4, and 5 have in common is that they are all configured with zkClient; more precisely, each must be configured with ZooKeeper's address before it runs. The reason is simple: the connections between them are coordinated through ZooKeeper, which is what makes the system distributed.

(4) Server-1 and Server-2 can be placed on the same machine or on separate ones, and ZooKeeper itself can be deployed as a cluster so that the failure of any single node does not bring the system down.

Briefly, the whole system operates in the following order:

(1) Start the ZooKeeper server.

(2) Start the Kafka server.

(3) When a producer produces data, it first locates the broker through ZooKeeper and then stores the data on that broker.

(4) When a consumer wants to consume data, it first finds the corresponding broker through ZooKeeper and then consumes from it.
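The four steps above can be simulated with a plain dictionary standing in for ZooKeeper's registry (a sketch only: node names like `broker-1` are made up, and real clients speak the ZooKeeper and Kafka protocols rather than reading a dict):

```python
# A dict standing in for ZooKeeper: a registry mapping node names to (ip, port).
zookeeper = {}

def start_zookeeper():
    zookeeper.clear()                    # step (1): the registry is up and empty

def start_kafka_broker(name, ip, port):
    zookeeper[name] = (ip, port)         # step (2): the broker registers itself

def producer_send(topic, message, storage):
    ip, port = zookeeper["broker-1"]     # step (3): look up the broker first...
    storage.setdefault(topic, []).append(message)  # ...then store data on it
    return (ip, port)

def consumer_poll(topic, storage):
    ip, port = zookeeper["broker-1"]     # step (4): same lookup on the consumer side
    return storage.get(topic, [])

storage = {}  # the broker's storage
start_zookeeper()
start_kafka_broker("broker-1", "10.0.0.1", 9092)
producer_send("orders", b"order-42", storage)
print(consumer_poll("orders", storage))  # [b'order-42']
```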

6. Features of Kafka

(1) High throughput, low latency: Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds. Each topic can be divided into multiple partitions, and a consumer group consumes the partitions in parallel;

(2) Scalability: a Kafka cluster supports hot expansion;

(3) Durability and reliability: messages are persisted to local disk, and data replication is supported to prevent loss;

(4) Fault tolerance: nodes in the cluster are allowed to fail (with a replication factor of n, up to n-1 nodes can fail);

(5) High concurrency: thousands of clients can read and write simultaneously;

(6) Support for both real-time and offline processing: messages can be processed in real time with a stream processing system such as Storm, or offline with a batch system such as Hadoop;
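The partition and consumer-group behavior from feature (1) can be sketched as follows. This is a simplification: Kafka's default partitioner hashes keys with murmur2 and group assignment is negotiated through a rebalance protocol, so CRC32 and a simple modulo assignment stand in here.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: bytes) -> int:
    # Hash the message key modulo the partition count, so messages with the
    # same key always land in the same partition (preserving per-key order).
    return zlib.crc32(key) % NUM_PARTITIONS

def assign(partitions, consumers):
    # Spread partitions over the consumers in a group, so each partition
    # is consumed by exactly one group member (parallel consumption).
    return {c: [p for p in partitions if p % len(consumers) == consumers.index(c)]
            for c in consumers}

# Same key -> same partition, run after run within one process.
assert partition_for(b"user-1") == partition_for(b"user-1")

print(assign(list(range(NUM_PARTITIONS)), ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2], 'consumer-b': [1, 3]}
```

Adding a consumer to the group shrinks each member's share of partitions, which is how a consumer group scales throughput up to the number of partitions.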

7. Kafka Usage Scenarios

(1) Log collection: a company can use Kafka to collect the logs of its various services and expose them through Kafka as a unified interface to consumers such as Hadoop, HBase, and Solr;

(2) Message system: decoupling producers from consumers, buffering messages, and so on;

(3) User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. Each server publishes this activity information to Kafka topics; subscribers then subscribe to those topics for real-time monitoring and analysis, or load the data into Hadoop or a data warehouse for offline analysis and mining;

(4) Operational metrics: Kafka is also often used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feeds for operations such as alerting and reporting;

(5) Stream processing: for example with Spark Streaming or Storm;

(6) Event sourcing.



Related Reading:

1. Apache Kafka: The Next Generation Distributed Messaging System


