In-Depth Understanding of Kafka's Partition Replica Mechanism

1. Kafka Cluster

Kafka uses ZooKeeper to maintain information about cluster members (brokers). Each broker has a unique identifier, broker.id, which identifies it within the cluster; it can be configured in server.properties or generated automatically by the program. The following is how a Kafka broker cluster forms automatically:

  • When each broker starts, it creates an ephemeral node under ZooKeeper's /brokers/ids path and writes its broker.id there, thereby registering itself with the cluster;
  • When there are multiple brokers, they all compete to create the /controller node on ZooKeeper. Since znodes cannot be duplicated, only one broker can succeed; that broker is called the controller broker. Besides the usual broker functions, it is also responsible for managing the state of topic partitions and their replicas;
  • When a broker crashes or exits voluntarily, causing its ZooKeeper session to time out, the watcher events registered on ZooKeeper fire and Kafka performs the appropriate failover; if the failed broker is the controller, a new controller election is triggered.
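The /controller race described above can be modeled with a compare-and-set: like an ephemeral znode, creation succeeds for exactly one caller. Below is a toy sketch only (a ConcurrentHashMap stands in for ZooKeeper; this is not a ZooKeeper client):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ControllerElection {
    // Toy model of the /controller znode: putIfAbsent succeeds for exactly
    // one caller, so exactly one broker becomes the controller.
    private final ConcurrentHashMap<String, Integer> znodes = new ConcurrentHashMap<>();

    boolean tryBecomeController(int brokerId) {
        // Returns true only for the first broker that "creates the node".
        return znodes.putIfAbsent("/controller", brokerId) == null;
    }

    Integer currentController() {
        return znodes.get("/controller");
    }

    public static void main(String[] args) {
        ControllerElection zk = new ControllerElection();
        System.out.println("broker 0 wins: " + zk.tryBecomeController(0)); // true
        System.out.println("broker 1 wins: " + zk.tryBecomeController(1)); // false
    }
}
```

In real Kafka the node is ephemeral, so when the controller's session expires the node disappears and the remaining brokers race again.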

2. Replica Mechanism

To guarantee high availability, Kafka keeps multiple replicas of each partition: if one replica is lost, the data can still be obtained from another. This requires the replicas' data to remain complete and consistent, which is the basis of Kafka's data consistency, so the controller broker manages the replicas specially. Kafka's replica mechanism is described in detail below.

2.1 Partitions and Replicas

A Kafka topic is divided into multiple partitions, and the partition is Kafka's basic storage unit. Each partition can have multiple replicas (specified with the replication-factor parameter when the topic is created). One replica is the leader replica: all events are sent directly to it. The other replicas are follower replicas, which must stay consistent with the leader by copying its data; when the leader replica becomes unavailable, one of the followers becomes the new leader.

(Figure: kafka-cluster)

2.2 The ISR Mechanism

Each partition has an ISR (In-Sync Replicas) list that tracks all replicas that are alive and in sync. The leader replica is by definition an in-sync replica, while a follower replica must meet the following criteria to be considered in sync:

  • it maintains an active session with ZooKeeper, i.e. it must periodically send heartbeats to ZooKeeper;
  • it has fetched the latest messages from the leader within the prescribed time, without lagging too far behind.

If a replica fails to meet these conditions, it is removed from the ISR list and is not added back until it satisfies them again.
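The "prescribed time" above corresponds to the broker setting replica.lag.time.max.ms. A sketch of the relevant server.properties entry (value illustrative; the default has varied across Kafka versions, e.g. 10 s in older releases and 30 s in more recent ones):

```properties
# server.properties -- how long a follower may go without fetching from the
# leader (or without catching up to the leader's log end) before it is
# removed from the ISR list.
replica.lag.time.max.ms=30000
```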

Here is an example of creating a topic: the replication factor is set to 3 with the --replication-factor parameter. After the topic is created successfully, the --describe command shows that partition 0 has three replicas, 0, 1 and 2, all of which are in the ISR list, with replica 1 as the leader.

(Figure: kafka-partition-replica)

2.3 Unclean Leader Election

The replica mechanism has an optional broker-level parameter, unclean.leader.election.enable, whose default value is false, meaning unclean leader election is disabled. It controls whether, when the leader replica fails and no other in-sync replica is available, a not-fully-synchronized replica may become the new leader. This can cause data loss or inconsistency, which is intolerable in scenarios with strong consistency requirements (such as finance), so the default is false. If some data inconsistency is acceptable, it can be set to true.
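For reference, the setting in its broker-level server.properties form (value shown is the default):

```properties
# server.properties (also settable per topic)
# false: prefer consistency -- the partition stays unavailable rather than
#        electing an out-of-sync replica as leader.
# true:  prefer availability, at the risk of losing unsynchronized messages.
unclean.leader.election.enable=false
```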

2.4 Minimum In-Sync Replicas

Another parameter related to the ISR mechanism is min.insync.replicas, which can be configured at the broker or topic level and specifies the minimum number of replicas that must be available in the ISR list. Suppose it is set to 2: when the number of available replicas falls below this value, the whole partition is considered unavailable, and a client writing to the partition will receive the exception org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
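The availability rule can be sketched as a simple predicate. This is illustrative only, not Kafka source: in a real cluster the broker enforces the check and the producer observes NotEnoughReplicasException:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinInsyncCheck {
    // A write with acks=all is accepted only while the current ISR is at
    // least as large as min.insync.replicas.
    static boolean canAcceptWrite(Set<Integer> isr, int minInsyncReplicas) {
        return isr.size() >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        // Only the leader (broker 1) is left in sync, but the minimum is 2.
        Set<Integer> isr = new HashSet<>(Arrays.asList(1));
        if (!canAcceptWrite(isr, 2)) {
            // Here a real producer would see NotEnoughReplicasException.
            System.out.println("rejected: fewer in-sync replicas than required");
        }
    }
}
```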

2.5 Send Acknowledgments

Kafka has an optional producer parameter, acks, which specifies how many partition replicas must receive a message before the producer considers the write successful:

  • acks=0: the message is considered successfully sent as soon as it is dispatched, without waiting for any response from the server;
  • acks=1: the producer receives a success response from the server as soon as the cluster's leader replica receives the message;
  • acks=all: the producer receives a success response from the server only after all in-sync replicas have received the message.
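For reference, the producer-side setting might look like this in a producer configuration file (values illustrative; note that newer client versions changed the default from acks=1 to acks=all):

```properties
# producer.properties
acks=all       # wait for every in-sync replica: strongest durability
# acks=1       # leader only: lower latency, small window for data loss
# acks=0       # fire and forget: fastest, no delivery guarantee
```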

3. Data Requests

3.1 The Metadata Request Mechanism

Among all replicas, only the leader replica can handle read and write requests. Since the leader replicas of different partitions may live on different brokers, if a broker receives a request for a partition whose leader replica is not on that broker, it returns a Not a Leader for Partition error to the client. To solve this problem, Kafka provides the metadata request mechanism.

First, every broker caches the partition replica information of the whole cluster, and the client periodically sends a metadata request for all topics and caches the result. The refresh interval can be configured on the client with metadata.max.age.ms. Once it has the metadata, the client knows which broker holds each leader replica and can send read and write requests directly to the corresponding broker.

If a partition leader election happens within the refresh interval, the cached information may be stale, and the client may receive a Not a Leader for Partition error. In that case the client re-issues the metadata request, updates its local cache, and retries the operation against the correct broker, as shown below:

(Figure: kafka-metadata-request)
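The retry-on-stale-metadata loop can be sketched as follows. Cluster and Broker here are hypothetical stand-ins for the metadata request and the produce/fetch call, not Kafka client APIs:

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataCache {
    // Stand-in for a metadata request: who leads this partition right now?
    interface Cluster { int leaderFor(int partition); }
    // Stand-in for a produce/fetch call; false models "Not a Leader for Partition".
    interface Broker { boolean tryWrite(int brokerId, int partition); }

    static boolean send(Map<Integer, Integer> cache, Cluster cluster,
                        Broker broker, int partition) {
        int leader = cache.containsKey(partition)
                ? cache.get(partition)            // use locally cached metadata
                : cluster.leaderFor(partition);   // cold start: fetch metadata
        if (broker.tryWrite(leader, partition)) return true;
        int refreshed = cluster.leaderFor(partition); // re-issue metadata request
        cache.put(partition, refreshed);              // update the local cache
        return broker.tryWrite(refreshed, partition); // retry on the right broker
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cache = new HashMap<>();
        cache.put(0, 0); // stale entry: leadership has actually moved to broker 1
        boolean ok = send(cache, p -> 1, (id, p) -> id == 1, 0);
        System.out.println("sent=" + ok + ", cached leader=" + cache.get(0));
    }
}
```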

3.2 Data Visibility

Note that not all data stored on the partition leader can be read by clients: to guarantee data consistency, only data that has been stored by all in-sync replicas (all replicas in the ISR) is readable by clients.

(Figure: kafka-data-visibility)

3.3 Zero-copy

All of Kafka's data reads and writes use zero-copy. The difference between the traditional approach and zero-copy is as follows:

Traditional mode: four copies and four context switches

Take sending a disk file over the network as an example. In the traditional mode, a method like the following pseudocode is used: the file data is first read into memory, then the data in memory is sent through the socket.

buffer = File.read
Socket.send(buffer)

This process actually involves four data copies. First, a system call reads the file data into a kernel-mode buffer (DMA copy); then the application reads the data from the kernel-mode buffer into a user-mode buffer (CPU copy); next, when the user program sends through the socket, the data is copied from the user-mode buffer into a kernel-mode socket buffer (CPU copy); finally, the data is copied from the kernel buffer to the NIC buffer (DMA copy). It is also accompanied by four context switches, as shown below:

(Figure: Kafka-BIO)

Zero-copy with sendfile and transferTo

Linux kernels 2.4 and later provide zero-copy through the sendfile system call. After the data is copied into the kernel-mode buffer by DMA, it is copied directly to the NIC buffer by DMA, with no CPU copy involved; this is where the name "zero-copy" comes from. Besides reducing data copies, the entire read-file-and-send operation completes within a single sendfile call, so there are only two context switches, greatly improving performance. The zero-copy flow is shown below:

(Figure: kafka-zero-copy)

In terms of the concrete implementation, Kafka's data transfer is done through TransportLayer. The transferFrom method of its subclass PlaintextTransportLayer implements zero-copy by calling the transferTo method of Java NIO's FileChannel, as shown below:

@Override
public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}

Note: transferTo and transferFrom do not guarantee that zero-copy is actually used. Whether it is depends on the operating system: if the OS provides a zero-copy system call such as sendfile, these two methods take full advantage of it through that call; otherwise they cannot achieve zero-copy by themselves.
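To see transferTo in action without a Kafka broker, here is a self-contained sketch that streams a file into a channel the same way. A ByteArrayOutputStream stands in for the socket; with a real SocketChannel target, the kernel can use sendfile:

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToDemo {
    // Streams the whole file into the target channel without copying it
    // through a user-space buffer in our code; transferTo may send fewer
    // bytes than requested, hence the loop.
    static long transfer(Path file, WritableByteChannel target) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0, size = ch.size();
            while (position < size) {
                position += ch.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".log");
        Files.write(tmp, "hello zero-copy".getBytes());
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long n = transfer(tmp, Channels.newChannel(sink));
        System.out.println(n + " bytes: " + sink.toString()); // 15 bytes: hello zero-copy
        Files.delete(tmp);
    }
}
```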

4. Physical Storage

4.1 Partition Allocation

When a topic is created, Kafka first decides how to allocate the partition replicas among brokers, following these principles:

  • distribute partition replicas evenly across all brokers;
  • ensure that each replica of a partition lives on a different broker;
  • if the broker.rack parameter is used to specify rack information for brokers, the replicas of each partition are assigned to brokers in different racks as far as possible, so that a single rack failure does not make a whole partition unavailable.
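As a rough illustration of the first two principles (not Kafka's actual algorithm, which also randomizes the starting broker and handles rack awareness), round-robin placement can be sketched like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReplicaAssignment {
    // Partition p's leader goes to broker (p % n); its followers go to the
    // next brokers in order. Replicas of one partition thus never share a
    // broker, and load spreads evenly across the cluster.
    static Map<Integer, List<Integer>> assign(List<Integer> brokers,
                                              int partitions, int replicationFactor) {
        if (replicationFactor > brokers.size())
            // Real Kafka raises InvalidReplicationFactorException here.
            throw new IllegalArgumentException("Replication factor: " + replicationFactor
                    + " larger than available brokers: " + brokers.size());
        Map<Integer, List<Integer>> assignment = new TreeMap<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add(brokers.get((p + r) % brokers.size()));
            }
            assignment.put(p, replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList(0, 1, 2), 3, 2));
        // {0=[0, 1], 1=[1, 2], 2=[2, 0]}
    }
}
```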

For these reasons, if you create a topic with 3 replicas on a single-node cluster, the following exception is usually thrown:

Error while executing topic command : org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.

4.2 Partition Data Retention Rules

Retaining data is a basic feature of Kafka, but Kafka does not keep data forever, nor does it wait until all consumers have read a message before deleting it. Instead, Kafka configures a retention policy for each topic: how long data may be kept before being deleted, or how much data may be kept before being cleaned up. These correspond to the following four parameters:

  • log.retention.bytes: the maximum amount of data allowed before deletion; the default is -1, meaning no limit;
  • log.retention.ms: the number of milliseconds to keep data files; if unset, the value of log.retention.minutes is used; the default is null;
  • log.retention.minutes: the number of minutes to keep data files; if unset, the value of log.retention.hours is used; the default is null;
  • log.retention.hours: the number of hours to keep data files; the default is 168, i.e. one week.
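The precedence among the three time-based settings can be sketched as follows (a model of the rules above, not Kafka source):

```java
public class RetentionConfig {
    // log.retention.ms wins over .minutes, which wins over .hours;
    // hours defaults to 168 (one week) when nothing is set.
    static long effectiveRetentionMs(Long ms, Long minutes, Long hours) {
        if (ms != null) return ms;
        if (minutes != null) return minutes * 60_000L;
        return (hours != null ? hours : 168L) * 3_600_000L;
    }

    public static void main(String[] args) {
        // All unset -> the 168-hour default: 604800000 ms = 7 days.
        System.out.println(effectiveRetentionMs(null, null, null)); // 604800000
    }
}
```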

Because finding and deleting messages inside one large file would be time-consuming and error-prone, Kafka splits each partition into several segments. The segment currently being written to is called the active segment, and the active segment is never deleted. If data is kept for the default one week and a new segment is started each day, then each day a new segment is created while the oldest one is deleted, so most of the time the partition has seven segments.

4.3 File Format

Usually the data format stored on disk is the same as the message format sent by the producer. If the producer sends compressed messages, a batch of messages is compressed together, transmitted as a "wrapper message" (format shown below), and then saved to disk. Consumers later read and decompress the wrapper message themselves to obtain the individual messages inside.

(Figure: kafka-compress-message)
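The wrapper-message behavior is driven by the producer's compression setting; whole record batches are compressed as a unit. An illustrative producer configuration:

```properties
# producer.properties -- compress whole record batches ("wrapper messages"),
# which consumers decompress on read
compression.type=gzip   # also supported: snappy, lz4, zstd (zstd since Kafka 2.1)
```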


More articles in the big data series can be found in my personal GitHub open-source project: Big Data Getting Started
