Big Data Section: Kafka

kafka.apache.org

What is Kafka?

Kafka is a high-throughput distributed publish/subscribe messaging system that can handle all the action-stream data of consumers on a website. These actions (web browsing, searches, and other user actions) are a key ingredient of many social features on the modern web. Because of the throughput required, this kind of data is usually handled by collecting and aggregating logs.

What is Kafka used for?

Processing persistent messages at the scale of millions of messages per second over massive data sets;

Conventional messaging tasks such as SMS and e-mail, asynchronous processing, peak shaving, and the like (of course, RabbitMQ, ActiveMQ, RocketMQ, etc. could also be used for these).

1 Two message-delivery modes

Message queues deliver messages in one of two modes: point-to-point, where each message is consumed by exactly one consumer, and publish/subscribe, where each message goes to every subscriber. Kafka follows the publish/subscribe model, with consumers pulling data from the broker (see section 5.2).

2 Kafka architecture

  1. Producer: the message producer, i.e. the client that sends messages to the kafka broker.

  2. Consumer: the message consumer, i.e. the client that fetches messages from the kafka broker.

  3. Topic: indicates the subject that messages are produced to and consumed from; it can be understood as a queue. (A classification of the data.)

  4. Consumer Group (CG): the mechanism kafka uses to implement broadcast (delivering a topic's message to every consumer) and unicast (delivering it to exactly one consumer). A topic can have multiple CGs. The topic's messages are copied to all of its CGs (a conceptual copy, not a physical one), but within a CG each partition delivers a message to only one consumer. To broadcast, just give every consumer its own CG; to unicast, put all the consumers in the same CG. CGs also let consumers be grouped freely, without a message having to be sent multiple times to different topics. (In the figure above, a message on TopicA sent to one group and received by ConsumerA cannot also be received by ConsumerB: that is unicast. If instead the three consumers each sit in their own group and the three groups jointly consume TopicA, that is broadcast.) (In theory, performance is best when the number of consumers in a CG equals the number of partitions in the topic.) The console-consumer example after this list illustrates both modes.

  5. Broker: one kafka server is one broker. A cluster is made up of multiple brokers, and one broker can hold multiple topics.

  6. Partition: for scalability, a very large topic can be distributed over multiple brokers (i.e. servers): a topic is divided into multiple partitions, and each partition is an ordered queue. Every message in a partition is assigned a sequential id, its offset. kafka guarantees the order of messages only within a single partition, not across the whole topic (i.e. between its partitions). (In kafka you can only read a partition's data from its leader.)

  7. Offset: kafka's storage files are named after the offset, which makes them easy to locate. For example, to find the message at position 2049, just open the 2048.kafka file. Naturally, the first file is 00000000000.kafka.

  8. zookeeper: stores the kafka cluster's configuration information.
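
A quick way to see unicast vs. broadcast with the console consumer (a sketch, assuming the top-test topic and cdh01.cm broker that section 4 below creates and uses):

# Unicast: two consumers in the SAME group split the topic's partitions,
# so each message reaches only one of them.
kafka-console-consumer --bootstrap-server cdh01.cm:9092 --topic top-test --group g1 &
kafka-console-consumer --bootstrap-server cdh01.cm:9092 --topic top-test --group g1 &
# Broadcast: consumers in DIFFERENT groups each receive every message.
kafka-console-consumer --bootstrap-server cdh01.cm:9092 --topic top-test --group g2 &
kafka-console-consumer --bootstrap-server cdh01.cm:9092 --topic top-test --group g3 &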

3 Monitoring

3.1 Kafkatool

This software can be used to view kafka's data, producer information, and so on; just download and install it.

3.2 CMAK

CMAK (formerly known as Kafka Manager) is a tool for managing Kafka clusters, used here mainly to observe consumer information. CMAK 3.0.x requires Java 11 or above and runs against ZooKeeper 3.5.x or above.
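
Before installing, it may help to confirm the Java requirement (assuming java is on the PATH):

# CMAK 3.0.x needs Java 11 or above
java -version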

  1. Upload and unpack the archive
mkdir /usr/local/src/CMAK
cd /usr/local/src/CMAK
unzip cmak-3.0.0.4.zip
cd cmak-3.0.0.4

  2. Edit the configuration
vim conf/application.conf

Note that the zk cluster configured here must match the zk configuration of kafka itself, otherwise the web UI will not find the consumer groups; if they cannot be found, try replacing IPs with hostnames, or hostnames with IPs.

Old version:

kafka-manager.zkhosts="192.168.xx.xx:2181,192.168.xx.xx:2181,192.168.xx.xx:2181"

New version:

cmak.zkhosts="192.168.xx.xx:2181,192.168.xx.xx:2181,192.168.xx.xx:2181"

  3. Start it up

cmak listens on port 8080 by default; use -Dhttp.port to specify a different port and -Dconfig.file=conf/application.conf to specify the configuration file:

nohup bin/cmak -Dconfig.file=conf/application.conf -Dhttp.port=8080 &

4 Command-line operations

4.1 Create topic

# Create the top-test topic with 2 partitions and 2 replicas
kafka-topics --create --zookeeper cdh01.cm:2181,cdh02.cm:2181,cdh03.cm:2181 --topic top-test --partitions 2 --replication-factor 2

4.2 View topic

kafka-topics --list --zookeeper cdh01.cm:2181,cdh02.cm:2181,cdh03.cm:2181

4.3 Delete topic

kafka-topics --delete --zookeeper cdh01.cm:2181,cdh02.cm:2181,cdh03.cm:2181 --topic top-test

4.4 View topic details

kafka-topics --describe --zookeeper cdh01.cm:2181,cdh02.cm:2181,cdh03.cm:2181 --topic top-test

4.5 Producer - Consumer

# Producer
kafka-console-producer --topic top-test --broker-list cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092
# Consumer (--from-beginning reads from the start of the topic)
kafka-console-consumer --topic top-test --bootstrap-server cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092  --from-beginning --group g1
  • First, enter several messages on the producer side (here I typed aa through ff); the effect is as follows:
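
After producing and consuming, you can check where group g1 stands with the consumer-groups tool (a sketch; the tool ships with kafka and prints, among other columns, CURRENT-OFFSET, LOG-END-OFFSET and LAG per partition):

# Describe group g1: partition assignment, committed offsets, lag
kafka-consumer-groups --bootstrap-server cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092 --describe --group g1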


5 Kafka workflow

In kafka, each partition's offsets start from 0; order is guaranteed within a partition, but not globally across the topic;

Producers are not registered in zk; consumers are registered in zk.
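
You can browse what kafka registers in zookeeper directly (a sketch assuming the CDH zookeeper-client wrapper; plain zkCli.sh takes the same arguments):

# Brokers register themselves under /brokers/ids
zookeeper-client -server cdh01.cm:2181 ls /brokers/ids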

A topic is a logical concept, while a partition is a physical one: each partition corresponds to a log file, and that log file stores the data being produced. Data written by producers is continuously appended to the end of the log file, and every record has its own offset. Each consumer in a consumer group records in real time the offset it has consumed up to, so that after a failure it can recover and resume consuming from the last position. As shown below:


5.1 Message production process

Step 1 explained: on startup, the producer reads the "/brokers/.../state" node in zookeeper to find the leader of the partition for its topic;

Steps 4 to 6 guarantee data reliability: kafka requires the ack to be returned only after all replicas have been synchronized successfully, so a send completes only once the synchronization is done.
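
With the console producer, this all-replicas acknowledgement can be requested explicitly (a sketch reusing the top-test topic from section 4; -1 means wait for all in-sync replicas):

# Produce with acks=-1: the send completes only after all replicas are synchronized
kafka-console-producer --topic top-test --broker-list cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092 --request-required-acks -1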


  • Reasons for partitioning

    • It makes scaling out in a cluster easy: each partition can be sized to fit the machine that hosts it, and a topic can be composed of multiple partitions, so the whole cluster can accommodate data of any size;
    • It improves concurrency, because reads and writes can be done per partition.
  • Partitioning rules

    • If a partition is specified, it is used directly;

    • If no partition is specified but a key is, the partition is chosen by hashing the key (the keyed-producer example after this list shows this);

    • If neither a partition nor a key is specified, a partition is chosen by round-robin polling.
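
Key-based partitioning can be tried from the console producer (a sketch reusing top-test; parse.key and key.separator are properties of the console producer's line reader):

# Each input line is split at ':' into key and value;
# equal keys hash to the same partition.
kafka-console-producer --topic top-test --broker-list cdh01.cm:9092,cdh02.cm:9092,cdh03.cm:9092 --property parse.key=true --property key.separator=:
# Then type, for example:
#   user1:hello
#   user1:world    (same key, same partition)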

5.2 Message consumption process

Consumers work in consumer groups: one or more consumers form a group and consume a topic together. Each partition can be read by only one consumer in the group at any given time, but multiple groups can consume the same partition simultaneously. In the figure, a group of three consumers reads a topic: one consumer reads two partitions and the other two read one partition each. A consumer that reads a partition is also called the owner of that partition.
In this way, consumers can scale horizontally to read a large number of messages simultaneously. Moreover, if one consumer fails, the other members of the group automatically rebalance and take over the partitions the failed consumer was reading.

  • Consumption model
    • Consumers use a pull model, reading data from the broker.
    • A push model has difficulty adapting to consumers with different consumption rates, because the delivery rate is decided by the broker. Its goal is to deliver messages as fast as possible, but this can easily overwhelm a consumer that has no chance to process them; the typical symptoms are denial of service and network congestion. With the pull model, a consumer can consume messages at a rate suited to its own capacity.
    • For Kafka, the pull model is the better fit: it simplifies the broker's design, lets the consumer control its consumption rate independently, and lets the consumer control how it consumes, one message at a time or in batches, with different commit strategies to achieve different delivery semantics.
    • The drawback of the pull model is that if kafka has no data, the consumer may spin in a loop, forever waiting for data to arrive. To avoid this, pull requests carry parameters that let the consumer block in a "long poll" until data arrives (optionally waiting until a given number of bytes is available, to guarantee a large transfer size), as the example below shows.
  • Consumers store their offset information in zookeeper, under the consumer group's node.
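
The long-poll parameters mentioned above can be set on the console consumer too (a sketch; fetch.min.bytes and fetch.max.wait.ms are standard consumer configs, and the values here are only illustrative):

# Block each fetch until 1024 bytes are available or 500 ms have passed
kafka-console-consumer --topic top-test --bootstrap-server cdh01.cm:9092 --group g1 --consumer-property fetch.min.bytes=1024 --consumer-property fetch.max.wait.ms=500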

5.3 Message storage

Because producers keep appending messages to the end of the log file, an ever-growing file would make locating data inefficient. To prevent this, Kafka uses a sharding and indexing mechanism: each partition is divided into multiple segments.

Each segment corresponds to two files, a ".index" file and a ".log" file. These files live in a per-partition folder whose naming rule is: topic name-partition number.

".Index and .log" file offset of the first message segment currently named as follows:

00000000000000000000.index

00000000000000000000.log

00000000000000135489.index

00000000000000135489.log

00000000000000268531.index

00000000000000268531.log

The figure below sketches the structure of the ".index" and ".log" files:
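
Besides the figure, the segment files can be inspected with a tool that ships with kafka (a sketch; the data directory /var/local/kafka/data is an assumption, substitute your broker's log.dirs):

# Dump the index and the log of partition 0's first segment
kafka-run-class kafka.tools.DumpLogSegments --files /var/local/kafka/data/top-test-0/00000000000000000000.index
kafka-run-class kafka.tools.DumpLogSegments --files /var/local/kafka/data/top-test-0/00000000000000000000.log --print-data-log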
