Powerful distributed message middleware: Kafka

When we work with many distributed databases and distributed computing clusters, we run into problems like these:


- I want to analyze user behavior (pageviews) so that I can design better advertising placements.

- I want to give users statistics on search keywords and analyze current fashion trends. This is quite interesting; in economics there is a "long skirt" theory: if sales of long skirts are high, the economy is in recession, because girls have no money to buy all kinds of stockings.

- For some data, storing it in a database feels wasteful, yet writing it directly to the hard disk makes me worry that access will be inefficient.

At this point we can use a distributed messaging system. Although the scenarios above lean toward a logging system, it is true that Kafka is widely used for logging systems in practice.

First of all, we have to understand what a messaging system is. The Kafka official website defines Kafka as "a distributed publish-subscribe messaging system". Publish-subscribe means exactly that: publishing and subscribing, so it is more accurate to call Kafka a message subscription and publishing system. The publish-subscribe concept is important, because Kafka's design philosophy starts from it.

For now, call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the storage array in the middle the broker. Then we can roughly describe the scene: the producer produces data and hands it to the broker for storage; when the consumer needs to consume data, it takes the data from the broker and then completes a series of processing on it.

At first glance this seems too simple. Didn't we say it is distributed? Is it distributed just because the producer, broker, and consumer are placed on three different machines? Let's look at the official diagram Kafka provides:


Multiple brokers work in concert; producers and consumers are deployed inside various pieces of business logic and are called frequently; the three coordinate requests and forwarding through zookeeper. Together they form a high-performance distributed message publish-subscribe system. One detail in the figure deserves attention: the flow from producer to broker is push, that is, data is pushed to the broker as soon as it is produced, while the flow from consumer to broker is pull, where the consumer actively pulls data rather than the broker actively sending data to the consumer side.
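To make the push and pull sides concrete, here is a minimal sketch using the Java kafka-clients API; the broker address localhost:9092, the topic pageviews, and the group name demo-group are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PushPullDemo {
    public static void main(String[] args) {
        // Producer side: data is PUSHED to the broker as soon as it is produced.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("pageviews", "user-1", "/index.html"));
        }

        // Consumer side: data is PULLED; the broker never pushes to consumers.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group"); // assumed consumer group name
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("pageviews"));
            // A single poll may return empty while the group is still rebalancing.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}
```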

Where does such a system get its high performance? As described on the official website (translated):

(1) The cost of accessing data on disk is O(1). Typical systems store data on disk in a B-tree, where access costs O(log n).
(2) High throughput: even ordinary nodes can process hundreds of thousands of messages per second.
(3) Explicit distribution: there can be multiple producers, brokers, and consumers, all of them distributed.
(4) Support for loading data into Hadoop in parallel.
So far you should have a feel for what kind of system Kafka is, its basic structure, and what it can be used for. Next, let's return to the relationships among producer, consumer, broker, and zookeeper.



Looking at the picture above, we have reduced the number of brokers to just one. Now suppose we deploy according to that figure:

- Server-1 is the broker, i.e., the Kafka server itself, since both the producer and the consumer need to connect to it. The broker is mainly responsible for storage.

- Server-2 is the zookeeper server. You can look up zookeeper's exact role on its official website; here, imagine that it maintains a table recording the IP, port, and other information of each node (it also stores Kafka-related information, which will be discussed later).

- Servers 3, 4, and 5 all have a zkClient configured; more precisely, the zookeeper address must be configured before they run. The reason is simple: all the connections between them need zookeeper for coordination.

- As for Server-1 and Server-2, they can be placed on one machine or deployed separately, and zookeeper itself can be run as a cluster to guard against any single node going down.

Briefly, the order in which the entire system runs:

1. Start the zookeeper server

2. Start the kafka server

3. When the producer produces data, it first finds a broker through zookeeper and then stores the data in that broker

4. When the consumer wants to consume data, it first finds the corresponding broker through zookeeper and then consumes from it

Next, let's look at Kafka's core components in detail.

1. The basics of the Kafka publish-subscribe message system
Kafka is a distributed publish-subscribe message system. It was originally developed at LinkedIn, written in Scala, and later became a top-level Apache project. Kafka is a distributed, partitionable, multi-subscriber, redundantly backed-up persistent log service.
The Kafka ecosystem was introduced in the hands-on discussion of the Kafka ecosystem architecture. In big data computing, Kafka serves as an important data buffer: it buffers the data collected by Flume and feeds it to Storm for real-time computing. Like the Storm framework, Kafka's main application is real-time and streaming computing systems; it is mainly used for processing active streaming data.
2. The characteristics of Kafka
1. High throughput for both publishing and subscribing. Reportedly, Kafka can produce about 250,000 messages per second (50 MB) and process about 550,000 messages per second (110 MB).
2. Persistent operation. Messages are persisted to disk, so they can be used for batch consumption; persisting data to disk, combined with replication, prevents data loss.
3. A distributed system that is easy to scale out. There can be many producers, brokers, and consumers, all distributed, and machines can be added without downtime.
4. Each Kafka instance (broker) is stateless about consumption: regardless of how messages accumulate or who consumes them, the state of message processing lives with the consumer, which actively pulls messages from topic partitions. In other words, which messages get consumed is decided by the consumer, not by the server-side broker.
3. Kafka performance testing

A large accumulation of data can cause a broker to get stuck. Topic partitions are used here to spread out the log storage size across brokers.
4. Kafka core

1. Kafka core components
     Producer: message producer


Broker: a caching proxy; one or more servers in a Kafka cluster are collectively referred to as brokers.
     Topic: Specifically refers to the different categories of feeds of messages processed by Kafka.
Partition: the physical grouping of a topic. A topic can be divided into multiple partitions, and each partition is an ordered queue. Each message in a partition is assigned an ordered id (the offset).
Message: the basic unit of communication. Each producer can publish messages to a topic.
Producers: message and data producers; the process of publishing messages to a Kafka topic is called producers.
Consumers: message and data consumers; the process of subscribing to topics and processing the messages published to them is called consumers.

2. How the core components relate

(1) Producers push messages to topics on the Kafka broker (each topic contains many messages).
(2) As the number of messages in a topic grows, the messages are partitioned so that consumers can quickly locate the messages to consume, and can also consume data from multiple partitions at the same time, increasing the transfer rate.
(3) Consumer group: a group contains multiple consumers, each handling the data of different partitions. For example, the data of partition 1 goes to consumer 1 and the data of partition 2 goes to consumer 2, so a topic composed of multiple partitions is consumed by multiple consumers in the group; that is, the same topic's data is processed concurrently. A sketch of topic creation with partitions follows below.
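To make topics and partitions concrete, here is a minimal sketch that creates a partitioned topic with the Java AdminClient; the broker address, topic name, and partition/replication counts are assumptions for illustration:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions so up to 3 consumers in one group can read in parallel;
            // replication factor 1 because we assume a single-broker demo cluster.
            NewTopic topic = new NewTopic("events", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
            System.out.println("created topic 'events' with 3 partitions");
        }
    }
}
```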







3. Core component analysis

Producers

(1) The process by which message producers publish messages to a Kafka topic is what we call producers.
(2) Load balancing in effect: the producer decides which partition of the topic each message is sent to, achieving a balanced distribution of messages.
(3) Batch asynchronous sending: the producer and the broker (the client and server sides of messaging) are not on the same machine. If every single message required establishing a network connection, efficiency would inevitably suffer, so Kafka's message send path uses batching and asynchronous sending, as sketched below.
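A minimal sketch of batch, asynchronous sending with the Java client; linger.ms and batch.size are real client settings that control batching, while the broker address and topic name are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncBatchProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "20");     // wait up to 20 ms to fill a batch
        props.put("batch.size", "32768"); // 32 KB per-partition batch buffer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // send() is asynchronous: it queues the record into a batch and returns
                // immediately; the callback fires once the broker has acknowledged it.
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "payload-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            } else {
                                System.out.printf("partition=%d offset=%d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            }
        } // close() flushes any batches still in flight
    }
}
```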
Broker
(1) Supports message persistence: the broker keeps no replica backup, but Kafka persists messages to avoid loss. When a consumer fetches data from a Kafka broker, the broker does not hand the data over and discard it; the data stays in the broker's local log files (the path is configurable), with one log file per partition. Log writes are append-only, which yields ordered, durable messages. Writing the log files: data is cached first, and once the cache reaches a threshold, disk I/O is performed, improving performance.
(2) Who has consumed which messages is determined and maintained by the consumers themselves (concretely, the records live in zookeeper); brokers do not store this.
(3) When a failure occurs during consumption, Kafka can quickly locate the data that was not yet consumed. How does a consumer determine which messages it has not consumed? Zookeeper records which messages have been consumed; quickly finding the unconsumed ones involves Kafka's sparse index mechanism, which we will study later.

The three attributes of a Message:
(1) offset: the unique identifier of the message, through which the message is located.
(2) MessageSize: the size of the message.
(3) data: the message body itself.
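These three attributes map directly onto the Java client's ConsumerRecord; a minimal sketch, assuming a broker at localhost:9092 and a topic named events (serializedValueSize reports the serialized payload size in bytes):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MessageAttributes {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        props.put("group.id", "attr-demo");               // assumed group name
        props.put("enable.auto.commit", "false");         // we commit manually below
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d size=%d data=%s%n",
                        r.offset(),              // (1) unique id within the partition
                        r.serializedValueSize(), // (2) message size in bytes
                        r.value());              // (3) the message body itself
            }
            // The consumer, not the broker, records how far it has read.
            consumer.commitSync();
        }
    }
}
```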




The purpose of partitions:

(1) Limit the disk space occupied by any single log file. A large volume of messages must be persisted to files by the broker, which consumes hard disk space. Partitions make the message granularity finer, and each partition can be stored on different disk space, preventing the topic files on a single broker from exhausting its storage.

(2) Let different consumers process partition data at the same time. Consumer to partition in Kafka is 1:n: one partition is consumed by only one consumer, while one consumer can consume multiple different partitions at the same time. Partitions refine the message granularity, and the more consumers process partitions in parallel, the more efficiently messages are handled concurrently. A sketch follows below.
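A sketch of this 1:n relationship with the Java client: two consumers sharing one group.id split a topic's partitions between them (with the assumed 3-partition topic events, one member gets two partitions and the other gets one):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember implements Runnable {
    private final String name;

    GroupMember(String name) { this.name = name; }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "report-group"); // same group id => partitions are split among members
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumed 3-partition topic
            for (int i = 0; i < 10; i++) { // poll a few times, then exit the demo
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    // Each partition is owned by exactly one member of the group at a time.
                    System.out.printf("%s read partition=%d offset=%d%n", name, r.partition(), r.offset());
                }
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new GroupMember("consumer-1")).start();
        new Thread(new GroupMember("consumer-2")).start();
    }
}
```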

Introduction to Kafka

Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a distributed, partitionable, redundantly backed-up persistent log service. It is mainly used for processing active streaming data.
In big data systems, one problem comes up again and again: the whole system is composed of various subsystems, and data has to flow between them continuously with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka appeared in order to serve online applications (messages) and offline applications (data files, logs) at the same time. Kafka plays two roles:
1. It reduces the complexity of wiring the systems together.
2. It reduces programming complexity: instead of every pair of subsystems negotiating interfaces with each other, each subsystem simply plugs into Kafka like a plug into a socket, with Kafka acting as a high-speed data bus.
Kafka's main features:
High throughput for both publishing and subscribing. Reportedly, Kafka can produce about 250,000 messages per second (50 MB) and process about 550,000 messages per second (110 MB).
Persistent operation. Messages are persisted to disk, so they can be used for batch consumption, such as ETL, as well as by real-time applications. Persisting data to disk, combined with replication, prevents data loss.
A distributed system that is easy to scale out. There can be many producers, brokers, and consumers, all distributed, and machines can be added without downtime.
The state of how far messages have been processed is maintained on the consumer side, not on the server side, and consumers rebalance automatically on failure.
Both online and offline scenarios are supported.
Kafka's architecture:

The overall architecture of Kafka is very simple. It is an explicitly distributed architecture: there can be multiple producers, brokers (kafka), and consumers. Producers and consumers implement Kafka's registration interface; data is sent from producers to a broker, and the broker acts as intermediate cache and distributor, delivering the data to the consumers registered with it. The broker's role resembles a cache, that is, a cache between active data and an offline processing system. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of programming language. A few basic concepts:
Topic: Specifically refers to the different classifications of feeds of messages processed by Kafka.
Partition: The physical grouping of topics. A topic can be divided into multiple partitions, and each partition is an ordered queue. Each message in the partition is assigned an ordered id (offset).
Message: Message is the basic unit of communication. Each producer can publish some messages to a topic (topic).
Producers: message and data producers, the process of publishing messages to a topic in Kafka is called producers.
Consumers: message and data consumers, the process of subscribing to topics and processing the messages published by them is called consumers.
Broker: A caching broker, one or more servers in a Kafka cluster are collectively referred to as brokers.
The process of message sending:

1. The producer publishes the message to a partition of the specified topic according to the chosen partitioning method (round-robin, hash, etc.).
2. After receiving the producer's message, the Kafka cluster persists it to the hard disk and retains it for a specified (configurable) duration, regardless of whether the message has been consumed.
3. The consumer pulls data from the Kafka cluster and controls the offset from which messages are fetched, as sketched below.
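Step 3, consumer-controlled offsets, looks like this with the Java client; the topic, partition number, and starting offset are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetControl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // assumed partition 0
            consumer.assign(Collections.singletonList(tp));      // no group management
            consumer.seek(tp, 42L);                              // the consumer picks the offset
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```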
Kafka's design:

1. Throughput

High throughput is one of the core goals Kafka must achieve, and to this end Kafka makes the following designs:

Data is persisted to disk: messages are not cached in memory but written straight to disk, making full use of the disk's sequential read/write performance.

2. Load balancing

Messages are sent to a specified partition (see the sketch after this list).
There are multiple partitions, each partition has its own replicas, and the replicas are distributed across different broker nodes.
Among the partition replicas a lead partition must be elected; it is responsible for reads and writes, with zookeeper handling failover.
Brokers and consumers are managed through zookeeper.
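A sketch of directing messages to partitions with the Java client: a key is hashed to pick a partition, or the partition index can be given explicitly (broker address and topic name assumed):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hash partitioning: records with the same key always land in the same partition.
            producer.send(new ProducerRecord<>("events", "user-42", "clicked"));

            // Explicit partitioning: write directly to partition 1, bypassing the hash.
            producer.send(new ProducerRecord<>("events", 1, "user-42", "clicked-again"));
        }
    }
}
```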
3. Pulling system
Since the Kafka broker persists data and carries no memory pressure, consumers are well suited to consuming data by pulling, which has the following advantages (sketched after this list):
It simplifies Kafka's design.
The consumer controls the message pull rate according to its own consumption capacity.
The consumer independently chooses its consumption mode according to its own situation, such as batch consumption, repeated consumption, or consuming from the end.
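A sketch of that consumer-side control with the Java client; max.poll.records is a real client setting, while the topic and broker address are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PullModes {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        props.put("max.poll.records", "100"); // batch consumption: at most 100 records per pull
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // assumed partition
            consumer.assign(Collections.singletonList(tp));

            consumer.seekToBeginning(Collections.singletonList(tp)); // repeated consumption: re-read from the start
            // consumer.seekToEnd(Collections.singletonList(tp));    // or consume from the end only

            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```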
4. Scalability
When a broker node needs to be added, the new broker registers itself with zookeeper, and producers and consumers perceive these changes through the watchers they registered on zookeeper and adjust in time.

Application scenarios of Kafka:
1. Message queue
Compared with most messaging systems, Kafka has better throughput, built-in partitioning, redundancy, and fault tolerance, which makes it a good solution for large-scale message processing applications. Messaging systems generally have relatively low throughput but require low end-to-end latency, and they lean on the strong durability guarantees Kafka provides. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
2. Behavior Tracking
Another Kafka scenario is tracking user behavior such as browsing and searching, recording it in real time to the corresponding topics in a publish-subscribe fashion. Subscribers can then take these feeds and process them further in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse for processing.
3. Meta information monitoring

Kafka is used as a monitoring module for operational records, that is, to collect and record operations data; think of it as monitoring of an operations-and-maintenance nature.
4. Log collection

For log collection there are many open-source products, including Scribe and Apache Flume, yet many people use Kafka for log aggregation instead. Log aggregation typically collects log files from servers and processes them in a centralized place (a file server or HDFS). Kafka, however, abstracts away the details of files and treats logs or events more cleanly as a stream of messages. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed data processing. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally efficient performance, higher durability guarantees thanks to replication, and lower end-to-end latency.
5. Stream processing
This may be the most common scenario, and it is easy to understand: save the collected stream data so that Storm or another stream computing framework can process it later. Many users take data from an original topic, process, aggregate, or enrich it, and move it to a new topic for further processing. For example, an article recommendation flow might first grab article content from an RSS data source and throw it into a topic called "article"; follow-up steps might clean that content, for instance normalizing the data or removing duplicates, and finally return the results of content matching to the user. This yields a chain of real-time data processing stages beyond the single original topic. Storm and Samza are well-known frameworks that implement this kind of data transformation.
6. Event sourcing
Event sourcing is a style of application design in which state changes are recorded as a time-ordered sequence of records. Kafka can store very large amounts of log data, which makes it an excellent backend for applications built this way, such as a dynamic news feed.
7. Persistent log (commit log)
Kafka can serve as an external persistent log for a distributed system. Such a log helps back up data between nodes and acts as a resynchronization mechanism for failed nodes to recover their data. The log compaction feature in Kafka enables this usage. In this usage Kafka resembles the Apache BookKeeper project.

Design points of Kafka:
1. Directly use the Linux file system cache to cache data efficiently.
2. Use Linux zero-copy to improve send performance. Traditional data transmission requires 4 context switches; with the sendfile system call, data is exchanged directly in kernel space, reducing the context switches to 2. According to test results, data transmission performance can improve by 60%. For technical details of zero-copy see: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
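In Java, the sendfile idea surfaces as FileChannel.transferTo. A minimal sketch of the general zero-copy technique (not Kafka's internal code; the file name and socket address are assumptions) that pushes file bytes to a socket without copying them through user space:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get("segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo maps to sendfile on Linux: bytes go disk -> kernel -> NIC,
                // never entering this JVM's user-space buffers.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```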
3. The cost of accessing data on disk is O(1). Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log composed of multiple segments. Each segment stores multiple messages (see the figure below); a message's id is determined by its logical position, meaning the message id maps directly to the message's storage location and avoids an extra id-to-location index. Each partition keeps an index in memory that records the offset of the first message in each segment. Messages sent by a publisher to a topic are distributed evenly across the partitions (either randomly or according to a user-specified callback). A broker receiving a published message appends it to the last segment of the corresponding partition. When the number of messages in a segment reaches the configured limit, or the segment has been open longer than a threshold, the segment is flushed to disk; only messages flushed to disk can be subscribed to. Once a segment is full, the broker stops writing to it and creates a new segment.
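A conceptual sketch (not Kafka's actual implementation) of how a per-partition index of segment base offsets makes lookup cheap: a sorted map from each segment's first offset to its file, queried with floorEntry:

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentIndex {
    // Maps each segment's base offset (offset of its first message) to its log file name.
    private final TreeMap<Long, String> segments = new TreeMap<>();

    public void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    // Locating a message: the segment whose base offset is the greatest one <= the target offset.
    public String locate(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SegmentIndex index = new SegmentIndex();
        index.addSegment(0L, "00000000000000000000.log");
        index.addSegment(500L, "00000000000000000500.log");
        index.addSegment(1000L, "00000000000000001000.log");
        System.out.println(index.locate(742L)); // -> 00000000000000000500.log
    }
}
```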
4. Explicit distribution: there can be multiple producers, brokers, and consumers, all distributed. There is no load-balancing mechanism between producers and brokers; zookeeper is used for load balancing between brokers and consumers. All brokers and consumers register themselves in zookeeper, and zookeeper keeps some of their metadata. When a broker or consumer changes, all other brokers and consumers are notified.
