10 minutes to understand the underlying principles of Kafka

1. Background: what is a message queue?

Modern applications are increasingly real-time, which raises the bar for the underlying technology. How do we keep data moving quickly when massive amounts of it must be transferred? The message queue was born to answer this question.

"Message" is the unit of data transfer between two computers. Message can be very simple, for example, contain only text strings; may be more complex, and may contain embedded objects.

Messages are sent to a queue. A "message queue" is a container that holds messages while they are in transit. The message queue manager acts as an intermediary, relaying each message from its source to its target. The queue's main purpose is to provide routing and guarantee delivery: if the recipient is unavailable when a message is sent, the queue retains the message until it can be delivered successfully.

2. What are the advantages of a message queue?

1) Decoupling: removes the direct dependency between the two sides of a data transfer

2) Redundancy: ensures data is not lost

3) Scalability: increases data transfer capacity

4) Flexibility & peak handling: keeps data intake efficient under load spikes

5) Recoverability: a backup mechanism keeps data safe

6) Order guarantee: preserves the ordering of the data

7) Buffering: prevents data from being lost when the receiver cannot take delivery for external reasons

8) Asynchronous communication: the receiver does not have to process data in sync with the producer

3. What messaging models are there?

1) Point-to-point model (one-to-one; consumers actively pull data; a message is cleared from the queue once it has been received)

The point-to-point model is usually a polling- or pull-based messaging model: clients request messages from the queue rather than having messages pushed to them. Its defining characteristic is that each message sent to the queue is received and processed by one and only one receiver, even if multiple listeners are attached to the queue.
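To make these semantics concrete, here is a minimal in-process sketch in Java (a plain BlockingQueue from the standard library standing in for the queue; no messaging product involved). Each message is taken by exactly one of the competing receivers, as the model requires:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PointToPointDemo {
    public static void main(String[] args) throws InterruptedException {
        // The queue is the container holding messages in transit.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        // Two competing receivers: each message goes to exactly one of them.
        for (int i = 1; i <= 2; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    while (true) {
                        // take() removes the message from the queue once received.
                        System.out.println("receiver-" + id + " got: " + queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // The sender puts messages into the queue; it never talks to receivers directly.
        for (int n = 0; n < 5; n++) {
            queue.put("message-" + n);
        }
    }
}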

2) Publish/subscribe model (one-to-many; once data is produced, it is pushed to all subscribers)

The publish/subscribe model is a push-based messaging model. A topic can have different kinds of subscribers: a temporary subscriber receives messages only while it is actively listening to the topic, whereas a durable subscriber receives every message on the topic, even messages published while the subscriber was unavailable or offline.
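And a minimal publish/subscribe sketch, again in plain Java: the publisher pushes every message to every registered subscriber. Note this only models temporary subscribers; a durable subscriber would additionally require messages to be persisted for delivery after the subscriber comes back online:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class PubSubDemo {
    // All registered subscribers; each one receives every published message.
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    public void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    public void publish(String message) {
        // Push-based: the publisher delivers to all subscribers itself.
        for (Consumer<String> s : subscribers) {
            s.accept(message);
        }
    }

    public static void main(String[] args) {
        PubSubDemo topic = new PubSubDemo();
        topic.subscribe(m -> System.out.println("subscriber-1 got: " + m));
        topic.subscribe(m -> System.out.println("subscriber-2 got: " + m));
        topic.publish("hello");  // both subscribers receive it
    }
}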

4. What is Kafka?

In stream computing, Kafka is generally used to buffer data, and a processing engine such as Storm performs computation by consuming the data from Kafka.

1) Apache Kafka is an open-source messaging system written in Scala, developed as a project of the Apache Software Foundation.

2) Kafka was originally developed by LinkedIn and open-sourced in early 2011. It graduated from the Apache Incubator in October 2012. The project's goal is to provide a unified, high-throughput, low-latency platform for processing real-time data.

3) Kafka is a distributed message queue. Kafka stores messages classified by topic; a message sender is called a producer and a message receiver is called a consumer. In addition, a Kafka cluster is composed of multiple Kafka instances, and each instance (server) is called a broker.

4) Both the Kafka cluster and the consumers depend on a ZooKeeper cluster to store some meta information, which helps guarantee system availability.

5. What is Kafka made of?

1) Producer: the message producer; the client that sends messages to a Kafka broker;

2) Consumer: the message consumer; the client that fetches messages from a Kafka broker;

3) Topic: can be understood as a queue;

4) Consumer Group (CG): the mechanism Kafka uses to broadcast a topic's messages (deliver to every consumer) or unicast them (deliver to exactly one consumer). A topic can have multiple CGs. The topic's messages are copied (conceptually, not physically) to all of its CGs, but within each CG a partition's messages go to only one consumer. To broadcast, give every consumer its own CG; to unicast, put all consumers in the same CG. A CG also lets consumers be grouped freely without the producer having to send the message multiple times to different topics (see the consumer sketch after this list);

5) Broker: a Kafka server is a broker. A cluster is made up of multiple brokers, and a single broker can hold multiple topics;

6) Partition: for scalability, a very large topic can be spread over multiple brokers (i.e. servers). A topic can be divided into multiple partitions, each of which is an ordered queue. Every message in a partition is assigned a sequential id, called its offset. Kafka guarantees message order to consumers only within a single partition, not across a whole topic (i.e. across its partitions);

7) Offset: Kafka's storage files are named after the offset, which makes messages easy to locate. For example, to find the message at position 2049, just open the file named 2048.kafka. Naturally, the very first file is 00000000000.kafka.
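To illustrate items 4) and 7), here is a minimal consumer sketch using Kafka's Java client API (the topic first and the broker address kafka01:9092 are taken from the setup and command-line sections below). Whether a process receives all messages (broadcast) or shares them with other instances (unicast) depends only on the group.id it starts with:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092");
        // Same group.id in every instance -> unicast (the group shares the partitions);
        // a distinct group.id per instance -> broadcast (each instance gets all messages).
        props.put("group.id", args.length > 0 ? args[0] : "cg-1");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            while (true) {
                // Each record carries the partition and offset described in 6) and 7).
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}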

6. Setting up Kafka (example)

1) Prepare three virtual machines, named kafka01, kafka02, and kafka03

2) Extract the installation package

[kafka@kafka01 software]$ tar -zxvf kafka_2.11-0.11.0.0.tgz -C /opt/module/

3) Rename the extracted directory

[kafka@kafka01 module]$ mv kafka_2.11-0.11.0.0/ kafka

4) Create a logs folder under the /opt/module/kafka directory

[kafka@kafka01 kafka]$ mkdir logs

5) Modify the configuration file

[kafka@kafka01 kafka]$ cd config/

[kafka@kafka01 config]$ vi server.properties

Enter the following:

# The broker's globally unique id; must not be duplicated
broker.id=0

# Enable topic deletion
delete.topic.enable=true

# Number of threads handling network requests
num.network.threads=3

# Number of threads handling disk I/O
num.io.threads=8

# Socket send buffer size
socket.send.buffer.bytes=102400

# Socket receive buffer size
socket.receive.buffer.bytes=102400

# Maximum socket request size
socket.request.max.bytes=104857600

# Path where Kafka stores its log data
log.dirs=/opt/module/kafka/logs

# Default number of partitions per topic on this broker
num.partitions=1

# Number of threads used to recover and clean up data under each data directory
num.recovery.threads.per.data.dir=1

# Maximum time a segment file is retained before it is deleted
log.retention.hours=168

# ZooKeeper cluster connection addresses
zookeeper.connect=kafka01:2181,kafka02:2181,kafka03:2181

6) Configure environment variables

[kafka@kafka01 module]$ sudo vi /etc/profile

#KAFKA_HOME

export KAFKA_HOME=/opt/module/kafka

export PATH=$PATH:$KAFKA_HOME/bin

[kafka@kafka01 module]$ source /etc/profile

7) Distribute the installation package

[kafka@kafka01 module]$ xsync kafka/

      Note: remember to configure the environment variables on the other machines after distributing the package

8) Modify the configuration files on kafka02 and kafka03

Set broker.id=1 and broker.id=2 respectively in /opt/module/kafka/config/server.properties

      Note: broker.id must not be duplicated

9) Start the cluster

Start Kafka on the kafka01, kafka02, and kafka03 nodes in turn

[kafka@kafka01 kafka]$ bin/kafka-server-start.sh config/server.properties &

[kafka@kafka02 kafka]$ bin/kafka-server-start.sh config/server.properties &

[kafka@kafka03 kafka]$ bin/kafka-server-start.sh config/server.properties &
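As a convenience (an optional alternative, not a required step), the start script also accepts a -daemon flag to run the broker in the background, and jps can confirm that the Kafka process is up:

[kafka@kafka01 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties

[kafka@kafka01 kafka]$ jps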

7. Kafka command-line operations

1) View all topics on the current server

[kafka@kafka01 kafka]$ bin/kafka-topics.sh --zookeeper kafka01:2181 --list

2) Create a topic

[kafka@kafka01 kafka]$ bin/kafka-topics.sh --zookeeper kafka01:2181 \

--create --replication-factor 3 --partitions 1 --topic first

Option descriptions:

--topic: defines the topic name

--replication-factor: defines the number of replicas

--partitions: defines the number of partitions

3) Delete a topic

[kafka@kafka01 kafka]$ bin/kafka-topics.sh --zookeeper kafka01:2181 \

--delete --topic first

delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion, or a direct restart is needed to remove it.

4) Send messages

[kafka@kafka01 kafka]$ bin/kafka-console-producer.sh \

--broker-list kafka01:9092 --topic first

>hello world

>Kafka Kafka

5) Consume messages

[kafka@kafka01 kafka]$ bin/kafka-console-consumer.sh \

--zookeeper kafka01:2181 --from-beginning --topic first

--from-beginning: reads out all of the past data in the first topic. Whether to add this option depends on the business scenario.

6) View details of a Topic

[kafka@kafka01 kafka]$ bin/kafka-topics.sh --zookeeper kafka01:2181 \

--describe --topic first

8. Kafka workflow analysis

Write procedure:

1) The producer queries the "/brokers/.../state" node in ZooKeeper to find the leader of the partition

2) The producer sends the message to the leader

3) The leader writes the message to its local log

4) The followers pull the message from the leader, write it to their local logs, and send an ACK back to the leader

5) After receiving ACKs from all replicas in the ISR, the leader advances the HW (high watermark, the offset of the last commit) and sends an ACK to the producer
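The ACK behavior in steps 4) and 5) is exactly what the producer's acks setting controls. Below is a minimal producer sketch using Kafka's Java client (the topic first and the broker address reuse the values from the sections above):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092");
        // acks=all: wait for the leader to collect ACKs from the whole ISR (step 5 above);
        // acks=1 would return after step 3 (the leader's local log write) only.
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("first", "hello kafka"),
                    (metadata, e) -> {
                        if (e == null) {
                            // The callback fires once the broker has ACKed the write.
                            System.out.printf("written to partition=%d offset=%d%n",
                                    metadata.partition(), metadata.offset());
                        } else {
                            e.printStackTrace();
                        }
                    });
        }
    }
}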

Storage Policy:

Kafka retains all messages, whether or not they have been consumed. Two policies control the deletion of old data:

1) Time-based: log.retention.hours=168

2) Size-based: log.retention.bytes=1073741824

Note that because reading a specific message in Kafka takes O(1) time, i.e. it is independent of file size, deleting expired files has no bearing on improving Kafka's performance.

Consumption process:

Kafka allows developers to control the offset themselves and read from wherever they choose.

They can also control which partitions they connect to and implement custom load balancing across partitions.

This reduces the dependence on ZooKeeper (e.g. offsets do not have to be stored in ZK; consumers can store their own offsets, for example in a file or in memory).
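A sketch of this self-managed mode with Kafka's Java client: assign() connects to a hand-picked partition (bypassing group management) and seek() resumes from an offset the application stored itself. The topic name and the starting offset 2048 here are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.util.Collections;
import java.util.Properties;

public class ManualOffsetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092");
        props.put("enable.auto.commit", "false");  // the application owns the offset
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign(): connect to a chosen partition directly, bypassing group balancing.
            TopicPartition tp = new TopicPartition("first", 0);
            consumer.assign(Collections.singletonList(tp));
            // seek(): resume from an offset loaded from a file, a database, or memory.
            consumer.seek(tp, 2048L);
            for (ConsumerRecord<String, String> r : consumer.poll(1000)) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                // ...process, then persist r.offset() + 1 wherever the app keeps it.
            }
        }
    }
}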

Consumer Group:

Consumers work as consumer groups: one or more consumers form a group and together consume a topic. Each partition can be read by only one consumer in a given group at a time, but multiple groups can consume the same partition simultaneously. For example, in a group of three consumers, one consumer might read two of the topic's partitions while the other two each read one. A consumer reading a partition is also said to be the owner of that partition.

In this way, consumers can scale horizontally to read a large volume of messages at the same time. Moreover, if a consumer fails, the remaining group members automatically rebalance and take over the partitions that the failed consumer was reading.

Consumption mode:

Consumers use a pull model to read data from the broker.

A push model has difficulty adapting to consumers with different consumption rates, because the broker decides the rate at which messages are sent. The push model aims to deliver messages as fast as possible, but that easily overwhelms a consumer that cannot keep up, typically showing as denial of service and network congestion. The pull model instead lets the consumer consume messages at a rate suited to its own capacity.

For Kafka, the pull model is the better fit: it simplifies the broker's design, and the consumer can independently control its consumption rate as well as its consumption mode, one message at a time or in batches, while choosing different commit strategies to achieve different delivery semantics.

The drawback of the pull model is that when Kafka has no data, the consumer may spin in a loop waiting for data to arrive. To avoid this, the pull request takes parameters that let the consumer block in a "long poll" until data arrives (optionally waiting until a given number of bytes is available, to ensure a large transfer size).
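In Kafka's Java client this long polling surfaces as two consumer settings, fetch.min.bytes (the minimum amount of data the broker should return) and fetch.max.wait.ms (how long the broker may hold the request waiting for that much data); the values below are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Collections;
import java.util.Properties;

public class LongPollConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092");
        props.put("group.id", "cg-1");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Long-polling knobs: the broker holds the fetch until at least
        // fetch.min.bytes are available or fetch.max.wait.ms has elapsed,
        // so an idle consumer blocks instead of spinning.
        props.put("fetch.min.bytes", "1024");
        props.put("fetch.max.wait.ms", "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            while (true) {
                // poll() also blocks up to its own timeout when no data has arrived.
                ConsumerRecords<String, String> records = consumer.poll(500);
                System.out.println("fetched " + records.count() + " records");
            }
        }
    }
}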
