Kafka Essential Knowledge (Brief Summary)

1. The Kafka pipeline from production to consumption

1. The Kafka ecosystem has four roles: producers (Producer), the Kafka cluster (Broker), consumers (Consumer), and ZooKeeper.
2. Every consumer must belong to a consumer group, and a group can contain multiple consumers.
3. A topic can have one or more partitions (Partition).
4. A partition consists of one or more segments.
5. Each segment consists of an index file and a log file.
6. The replica that is copied from is called the leader; the replicas copied from it are called followers.
7. Producers write data only to the leader replica, and consumers pull data only from the leader replica.
8. Follower replicas only back up data from the leader; they serve no reads or writes.

[The following points are less obvious but important]
9. Within a consumer group, each partition can be consumed by only one consumer in the group.
10. A topic's maximum consumer concurrency is determined by its number of partitions.
11. The number of replicas must be less than or equal to the number of brokers.
12. A topic has multiple partitions, and each partition holds only part of the data; all the partitions together hold the topic's complete data.

13. ZooKeeper records broker ids, the offsets of consumed data, and the mapping between consumers and partitions (e.g., ConsumerA → Partition-0, ConsumerB → Partition-1).

2. What is Kafka?

Kafka is a distributed, multi-partition, multi-replica, multi-subscriber message publish/subscribe system.

3. Kafka use cases

Application decoupling, asynchronous processing, rate limiting and peak shaving, and message-driven systems.

4. Kafka's advantages and disadvantages

Advantages:

  • Reliability (distributed, partitioned, and replicated)
  • Scalability (scales out easily)
  • High performance (fast reads and writes)
  • Durability (data is persisted to disk)

Disadvantages:

  • Because messages are sent in batches, delivery is not truly real-time
  • Ordering is guaranteed only within a partition, not globally across a topic
  • Messages may be consumed more than once
  • Depends on ZooKeeper for metadata management

5. Kafka architecture (roles)

  • Producer
  • Kafka cluster (brokers)
  • Consumer
  • ZooKeeper

6. Kafka architecture (APIs)

  • Producer API
  • Consumer API
  • Streams API
  • Connect API

7. What does a topic consist of internally?

Each topic contains one or more partitions. A partition is made up of multiple segment file sets, and each segment has two parts: an .index file and a .log file, both named after the segment's base offset (e.g. 00000000000000000000.index and 00000000000000000000.log).

8. How do partitions relate to the consumers within a consumer group?

Partitions = consumers: a perfect fit; each consumer reads exactly one partition.
Partitions > consumers: some consumers read data from more than one partition.
Partitions < consumers: some consumers sit idle (do not create more consumers than there are partitions).
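The three cases above can be sketched with a toy round-robin assignment (illustrative only; the real group coordinator supports several assignment strategies):

```python
# Toy model of partition-to-consumer assignment within one group: round-robin,
# mirroring the three cases (equal, fewer, more consumers than partitions).
def assign(partitions, consumers):
    """Return dict consumer -> list of partitions (empty list = idle consumer)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: consumer A reads two partitions
assert assign([0, 1, 2], ["A", "B"]) == {"A": [0, 2], "B": [1]}
# 2 partitions, 3 consumers: consumer C is idle
assert assign([0, 1], ["A", "B", "C"]) == {"A": [0], "B": [1], "C": []}
```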

9. How do partition count, consumers, and read throughput relate?

The more partitions a topic has, the more consumers can consume it in parallel at the same time, so data is consumed faster and overall consumer throughput improves.

10. How does the replica count relate to the brokers?

The number of replicas of the data (including the leader itself) is generally less than or equal to the number of brokers.

11. What are leader and follower replicas?

The original replica is called the leader, and all followers are copied from the leader. (Followers can replicate only from the leader.)

12. What are the roles of the leader and follower replicas?

The leader replica handles all reads and writes of the data.
Follower replicas only back up the data; they serve no reads or writes.

13. What is the ISR?

The ISR (In-Sync Replicas) is the set of replicas that are fully synchronized with the leader (including the leader itself).

14. How does producer data get assigned to partitions in the Kafka cluster?

a) If neither a partition number nor a key is specified, data is distributed across partitions round-robin.
b) If no partition number is given but a key is, the partition is chosen by hashing the key and taking that value modulo the number of partitions.
c) If a partition number is specified, all data goes to that partition.
d) A custom partitioner can also be supplied.
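Rules a)–c) can be sketched as follows. This is a toy model, not the real client code: the real Java client uses murmur2 hashing for keys and, in newer versions, sticky batching for keyless records.

```python
# Toy model of producer partition selection: explicit partition wins, then
# hash(key) % num_partitions, then round-robin for keyless records.
class SimplePartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next_rr = 0  # round-robin counter for keyless records

    def partition(self, key=None, explicit_partition=None):
        if explicit_partition is not None:       # rule c): caller chose
            return explicit_partition
        if key is not None:                      # rule b): hash key, modulo
            return hash(key) % self.num_partitions
        p = self._next_rr % self.num_partitions  # rule a): round-robin
        self._next_rr += 1
        return p

p = SimplePartitioner(3)
assert p.partition(explicit_partition=2) == 2
assert p.partition(key="user1") == p.partition(key="user1")  # same key, same partition
assert [p.partition() for _ in range(4)] == [0, 1, 2, 0]     # keyless: round-robin
```

The same-key-same-partition property is what gives Kafka per-key ordering.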

15. How does a consumer consume data?

1. The consumer first connects to the broker hosting the leader of the specified topic partition, and uses binary search to determine which segment holds the target data.
2. Once the segment is determined, the consumer uses the segment's index file to find the data's exact position in the log and pulls the messages from there.
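The binary-search step works because each segment file is named after its base offset. A minimal sketch (the index-file lookup inside the segment is omitted):

```python
# Locate the segment holding a given offset: binary search over the sorted
# base offsets that name the segment files.
import bisect

def find_segment(base_offsets, target_offset):
    """base_offsets: sorted base offsets of the partition's segments.
    Returns the base offset of the segment containing target_offset."""
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    return base_offsets[i]

# e.g. segments starting at offsets 0, 170410 and 239430:
assert find_segment([0, 170410, 239430], 170417) == 170410
assert find_segment([0, 170410, 239430], 5) == 0
```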

16. What is Kafka's data-deletion mechanism?

Kafka deletes data a whole segment file at a time, driven by its retention settings: by time (log.retention.hours, 168 hours / 7 days by default) and/or by total size (log.retention.bytes). Segments that fall outside the retention window are removed.

17. How does Kafka ensure data is not lost?

1. How does the producer avoid losing data? Through the ack mechanism.
2. How does the Kafka cluster avoid losing data? Through data replication (replicas).
3. How does the consumer avoid losing data? By maintaining its offsets.

18. Where does Kafka's performance come from?

Sequential disk reads and writes, partitioning, batched sends, and data compression.

19. Why are Kafka's data queries efficient?

1. Kafka splits a partition's large log into many small segment files. Having many small segments makes it easy to periodically clean up or delete segments whose data has already been consumed, reducing disk usage.
2. The index information lets Kafka quickly locate a message and bound the size of a response.
3. The index metadata is mapped into memory, avoiding disk I/O on the segment index files.
4. Sparse indexing dramatically reduces how much space the index file metadata takes.

20. How do you get exactly-once data from Kafka (no re-reads)?

Avoid producing duplicate data during production, and avoid consuming the same data twice during consumption.
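On the consumption side, one common idea is idempotent processing: track the highest offset already processed per partition and skip redeliveries. A minimal sketch (real deployments persist this state atomically together with the processing results):

```python
# Idempotent-consumption sketch: skip any record at or below the highest
# offset already processed for its partition.
def consume_once(records, processed_upto):
    """records: list of (partition, offset, value) tuples.
    processed_upto: dict partition -> highest offset already processed.
    Returns the values that are newly processed."""
    out = []
    for partition, offset, value in records:
        if offset <= processed_upto.get(partition, -1):
            continue                        # duplicate delivery: skip it
        out.append(value)
        processed_upto[partition] = offset  # advance the high-water mark
    return out

state = {}
assert consume_once([(0, 0, "a"), (0, 1, "b")], state) == ["a", "b"]
# redelivery of offset 1 (e.g. after a rebalance) is ignored:
assert consume_once([(0, 1, "b"), (0, 2, "c")], state) == ["c"]
```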

21. What is Kafka's overall design?

Kafka organizes messages by topic. Programs that publish messages to a Kafka topic are producers; programs that subscribe to topics and consume their messages are consumers. Kafka runs as a cluster of one or more servers, each called a broker. Producers send messages to the Kafka cluster over the network, and the cluster serves those messages to consumers.

22. What are the three transactional delivery semantics?

1. At most once: a message is delivered at most once and never redelivered, but it may not be delivered at all, which can lose data.
2. At least once: a message is never lost and is delivered at least once, but it may be delivered more than once, which can cause duplicate consumption.
3. Exactly once: no loss and no duplication; every message is delivered once and only once. This is what everyone wants.

23. How does Kafka decide whether a node is still alive?

1. The node must maintain its connection to ZooKeeper, which checks each node via a heartbeat mechanism.
2. If the node is a follower, it must replicate the leader's writes promptly and must not fall too far behind.

24. How does Kafka differ from traditional messaging systems?

1. Kafka persists its logs, which can be retained indefinitely and read repeatedly.
2. Kafka is a distributed system: it runs as a cluster, scales flexibly, and replicates data internally for fault tolerance and high availability.
3. Kafka supports real-time stream processing.

25. How does Kafka place a new topic's partitions across brokers?

Premise: the replication factor cannot exceed the number of brokers.
The first replica of the first partition (partition 0) is placed on a broker chosen at random from the broker list;
the first replicas of the remaining partitions are placed on successive brokers after partition 0's.
For example, with 5 brokers and 5 partitions: if the first partition lands on the 4th broker, the second partition goes on the 5th broker, the third on the 1st, the fourth on the 2nd, and the fifth on the 3rd.
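The placement rule above can be written as a one-liner (a sketch covering only each partition's first replica; the start index is random in practice, fixed here for the example):

```python
# Round-robin placement of each partition's first replica: partition 0 starts
# at a chosen broker index, each subsequent partition advances by one, wrapping.
def place_first_replicas(num_brokers, num_partitions, start_index):
    return [(start_index + p) % num_brokers for p in range(num_partitions)]

# 5 brokers, 5 partitions, partition 0 on broker index 3 (the 4th broker):
assert place_first_replicas(5, 5, 3) == [3, 4, 0, 1, 2]
```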

26. In which directory is a new partition created?

If log.dirs is configured with a single directory, every partition assigned to the broker gets its own data folder under that directory.
If log.dirs lists multiple directories, Kafka creates the new partition's folder in the directory that currently holds the fewest partition folders, naming the folder topic name + partition id.
Note: it is the directory with the fewest partitions, not the one with the most free disk space.
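A sketch of that selection rule (directory paths are made-up examples):

```python
# Pick the log.dirs entry holding the fewest partition folders
# (NOT the one with the most free disk space).
def pick_log_dir(partition_counts):
    """partition_counts: dict log_dir -> number of partition folders in it."""
    return min(partition_counts, key=partition_counts.get)

counts = {"/data/kafka1": 12, "/data/kafka2": 7, "/data/kafka3": 9}
assert pick_log_dir(counts) == "/data/kafka2"

# the new partition folder is named "<topic>-<partition id>", e.g.:
assert "{}-{}".format("orders", 3) == "orders-3"
```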

27. How is a partition's data saved to disk?

Within a partition, messages are appended sequentially to the current segment file in arrival order, so older messages always precede newer ones. When a segment reaches its size limit (1 GB by default), a new segment is rolled; each segment's data file is stored alongside its index file.

28. Kafka's ack mechanism

request.required.acks takes three values: 0, 1, and -1.

  • 0: the producer does not wait for an ack from the broker. Lowest latency, weakest durability guarantee: data is lost if the broker goes down before storing it.
  • 1: the producer waits for the leader's ack. The leader acknowledges after it has written the message, but if the leader fails before the followers finish copying, electing a new leader can lose data.
  • -1: on top of the guarantees of 1, the leader sends the ack only after all follower replicas have received the data, so data is not lost.
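As a configuration sketch, the three levels map onto the producer settings like this (modern clients call the setting acks; request.required.acks is the legacy producer's name for it; the values shown are illustrative):

```properties
# producer.properties (illustrative)
acks=all      # same as -1: wait for the leader and all in-sync replicas
# acks=1      # leader-only acknowledgment
# acks=0      # fire-and-forget, no acknowledgment
```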

29. How do Kafka consumers consume data?

Each time a consumer consumes data, it records the offset of its consumption position, so that the next time it can continue from where it left off. It can also re-consume from any specified offset.

30. How do you make the data in a Kafka cluster globally ordered?

Create the topic with only one partition. (In practice this causes performance problems; confirm against the concrete business requirements before doing it.)

31. What data does ZooKeeper keep for Kafka?

1. The offsets committed by consumers.
2. Leader detection, distributed synchronization, configuration management, detection of nodes joining or leaving the cluster, and real-time node status.
3. The ownership relationship between consumers and partitions.
4. Broker ids.

32. What triggers a consumer rebalance in Kafka?

1. A consumer joins or leaves the consumer group, changing the group's membership list; all consumers in the group must then rebalance.
2. The partitions of a subscribed topic change; all consumers must then rebalance.

33. Describe the steps of a Kafka consumer rebalance.

1. Stop the fetch threads, clear the queues and message streams, and commit the offsets;
2. Release partition ownership by removing the consumer-partition ownership entries in ZooKeeper;
3. Reassign all partitions among the consumers, with each consumer receiving different partitions;
4. Write each partition's new owning consumer back to ZooKeeper, recording the partition ownership information;
5. Restart the consumer's fetch-thread manager, which manages one fetch thread per partition.

34. What are the benefits of committing offsets promptly?

Offsets are updated in a more timely way, avoiding the duplicate consumption caused by offsets not being updated.

35. Why does Kafka's data need to be periodically deleted or compacted?

Kafka is only a temporary store and buffer for data, not permanent storage (use HDFS for permanent storage).
------------------------------------------------------------------------------
That's all for this issue. Our readers' support is what keeps this blogger going! If anything above is wrong, corrections are always welcome.
I'm just a student at a training college, an amateur in the programming industry... haha.
The best relationships are ones of mutual achievement. See you next issue.

Origin blog.csdn.net/Mr_Yang888/article/details/105067549