[Kafka] "Kafka Definitive Guide" - read data from Kafka

Applications use KafkaConsumer to subscribe to Kafka topics and receive messages from those topics. Reading data from Kafka is quite different from reading data from other messaging systems, and involves some unique concepts and ideas. Without first understanding these concepts, it is difficult to understand how to use the consumer API. So we will explain these important concepts first, and then walk through a few examples showing how to use the consumer API in different kinds of applications.

Consumers and consumer groups

Suppose we have an application that needs to read messages from a Kafka topic, validate them, and save the results to another data store. The application creates a consumer object, subscribes to the topic, starts receiving messages, validates them, and writes out the results. After a while, producers write messages to the topic faster than the application can validate them. What then? If only a single consumer processes the messages, the application will fall further and further behind. Obviously, we need to scale consumption. Just as multiple producers can write to the same topic, multiple consumers can read from the same topic, splitting the messages between them.

Kafka consumers belong to consumer groups. When multiple consumers in a group subscribe to the same topic, each consumer receives messages from a different subset of the topic's partitions.

Suppose topic T1 has four partitions and we create consumer C1, the only consumer in group G1, and use it to subscribe to T1. Consumer C1 will receive all messages from all four partitions of T1, as shown in Figure 4-1.

If we add consumer C2 to group G1, each consumer will receive messages from two partitions. Suppose C1 receives messages from partitions 0 and 2, and C2 receives messages from partitions 1 and 3, as shown in Figure 4-2.

If group G1 has four consumers, each consumer is assigned exactly one partition, as shown in Figure 4-3.

If we add even more consumers to the group, so that there are more consumers than the topic has partitions, the extra consumers will sit idle and receive no messages.

Adding consumers to a group is the main way to scale consumption. Kafka consumers often perform high-latency operations, such as writing data to a database or HDFS, or running time-consuming computations on the data. In these cases, a single consumer cannot keep up with the rate at which data is produced, so we add more consumers to share the load, each one processing messages from only a subset of the partitions. This is the primary means of scaling out. It is therefore a good idea to create topics with a large number of partitions, so that more consumers can be added when the load grows. Note, however, that there is no point in having more consumers than a topic has partitions; the excess consumers will simply be idle.

Besides scaling a single application by adding consumers, it is also very common for multiple applications to read data from the same topic. In fact, one of Kafka's main design goals is to make the data in Kafka topics available to many different use cases across the enterprise. In these scenarios, each application should receive all of the messages, not just a subset. As long as each application has its own consumer group, it will receive all of the messages on the topic. Unlike many traditional messaging systems, scaling the number of consumers and consumer groups in Kafka does not hurt performance.

Continuing the example above, if we add a new group G2 containing a single consumer, that consumer will receive all of the messages on topic T1, independently of what group G1 is doing. Group G2 can have more than one consumer, in which case each consumes a subset of the partitions, just like in group G1, as shown in Figure 4-5. Overall, group G2 will still receive all of the messages, regardless of the existence of other groups.

In short, create a consumer group for each application that needs all of the messages from one or more topics, then add consumers to the group to scale its reading and processing capacity; each consumer in the group will process only a subset of the messages.

Consumer groups and partition rebalancing

As we saw in the previous section, the consumers in a group share ownership of the partitions of the topics they subscribe to. When a new consumer joins the group, it starts reading messages from partitions previously read by other consumers. When a consumer shuts down or crashes, it leaves the group, and the partitions it used to read are taken over by the remaining consumers. Partition reassignment also happens when the subscribed topics change, for example when an administrator adds new partitions.

Moving partition ownership from one consumer to another is called a rebalance. Rebalances are important because they give consumer groups high availability and scalability (we can safely add or remove consumers), but in normal circumstances we do not want them to happen. During a rebalance, consumers cannot read messages, so the whole group becomes briefly unavailable. In addition, when partitions are reassigned to another consumer, the consumer's current read state is lost, and it may need to refresh its caches before it can resume, which slows the application down. Later in this chapter we will discuss how to rebalance safely and how to avoid unnecessary rebalances.

Consumers maintain their membership in a group and their ownership of partitions by sending heartbeats to a broker designated as the group coordinator (different groups may have different coordinators). As long as a consumer sends heartbeats at regular intervals, it is considered alive and reading messages from its partitions. Heartbeats are sent when the consumer polls (i.e., retrieves messages) and when it commits offsets. If the consumer stops sending heartbeats for long enough, its session times out, the group coordinator considers it dead, and a rebalance is triggered.

If a consumer crashes and stops reading messages, the group coordinator waits a few seconds without heartbeats before declaring it dead and triggering a rebalance. During those seconds, no messages are read from the dead consumer's partitions. When a consumer shuts down cleanly, it notifies the coordinator that it is leaving the group, and the coordinator triggers a rebalance immediately, minimizing the processing pause. Later in this chapter we will discuss the configuration parameters that control the session timeout and heartbeat interval, and how to tune them to your needs.

How the partition assignment process works

When a consumer wants to join a group, it sends a JoinGroup request to the group coordinator. The first consumer to join the group becomes the group leader. The leader receives the list of group members from the coordinator (the list includes all consumers that have recently sent heartbeats and are therefore considered alive) and is responsible for assigning partitions to each consumer. It uses a class implementing the PartitionAssignor interface to decide which partitions should be assigned to which consumer.

Kafka has two built-in assignment strategies, which we will discuss in the configuration section later. After deciding on the assignment, the group leader sends the assignment list to the group coordinator, which forwards the information to all of the consumers. Each consumer only sees its own assignment; the leader is the only member of the group that knows the full assignment. This process repeats every time a rebalance occurs.

Creating a Kafka consumer

Before reading messages, you need to create a KafkaConsumer object. Creating a KafkaConsumer is very similar to creating a KafkaProducer: you pass the properties you want the consumer to have in a Properties object. Later sections of this chapter discuss all of the properties in depth. Here we only need the three mandatory properties: bootstrap.servers, key.deserializer, and value.deserializer.

The following code shows how to create a KafkaConsumer object:

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
props.put("group.id", "CountryCounter");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);

key.deserializer and value.deserializer specify the classes (deserializers) that turn byte arrays into Java objects.

group.id specifies which consumer group the KafkaConsumer belongs to. group.id is not strictly required, but for now we will assume it is set. Creating a consumer that does not belong to any group is possible, but uncommon.

Subscribing to topics

Once the consumer is created, the next step is to subscribe to one or more topics. The subscribe() method takes a list of topics as a parameter:

consumer.subscribe(Collections.singletonList("customerCountries"));

Here we created a list containing a single element: the topic named "customerCountries". We can also pass a regular expression when calling subscribe(). The expression can match multiple topic names, and if someone creates a new topic whose name matches, a rebalance is triggered almost immediately and the consumers start reading from the new topic. This is useful for applications that need to read multiple topics and can handle the different types of data they contain. Subscribing to multiple topics with a regular expression is most commonly used by applications that replicate data between Kafka and other systems.

To subscribe to all test-related topics, you can do this (the Java client takes a compiled regular expression): consumer.subscribe(Pattern.compile("test.*"));

Polling

The poll loop is at the core of the consumer API: a simple loop that requests data from the server. Once the consumer subscribes to topics, the poll loop handles all the details of group coordination, partition rebalances, heartbeats, and data fetching, leaving the developer with a clean API that simply returns available data from the assigned partitions. The main body of the consumer code looks like the following.
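Here is a minimal sketch of such a poll loop, reusing the consumer created earlier; the record processing is just a placeholder, and the original listing may differ:

// assumes imports of org.apache.kafka.clients.consumer.ConsumerRecords
// and org.apache.kafka.clients.consumer.ConsumerRecord
try {
    while (true) {
        // poll() blocks for up to the given timeout when no data is available
        // (clients since 2.0 take a Duration instead of a long)
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            // placeholder processing: log what was read
            System.out.printf("topic = %s, partition = %d, offset = %d, key = %s, value = %s%n",
                record.topic(), record.partition(), record.offset(), record.key(), record.value());
        }
    }
} finally {
    // closing the consumer makes it leave the group immediately, triggering a
    // rebalance right away instead of waiting for the session to time out
    consumer.close();
}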

poll() does much more than just fetch data. The first time a new consumer calls poll(), it is responsible for finding the GroupCoordinator, joining the group, and receiving its partition assignment. If a rebalance occurs, the whole process also takes place inside poll(). And of course the heartbeats that keep the consumer alive are sent from within the poll loop. For this reason, whatever processing happens between polls must be as fast as possible.

Thread Safety

You cannot have multiple consumers from the same group running in a single thread, and multiple threads cannot safely share the same consumer. The rule is: one consumer per thread. To run multiple consumers from the same group in one application, run each one in its own thread. It is useful to wrap the consumer logic in its own object and then use Java's ExecutorService to start multiple threads, each running its own consumer, as sketched below.
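A minimal sketch of this pattern follows; the ConsumerRunnable wrapper class and the thread count are illustrative, not part of the Kafka API (assumes imports of java.util.*, java.util.concurrent.*, and org.apache.kafka.clients.consumer.*):

class ConsumerRunnable implements Runnable {
    private final KafkaConsumer<String, String> consumer;

    ConsumerRunnable(Properties props) {
        // each thread creates its own consumer; KafkaConsumer is not thread-safe
        this.consumer = new KafkaConsumer<>(props);
    }

    @Override
    public void run() {
        try {
            consumer.subscribe(Collections.singletonList("customerCountries"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    // application-specific processing goes here
                }
            }
        } finally {
            consumer.close();
        }
    }
}

// start three consumers from the same group, each on its own thread
ExecutorService executor = Executors.newFixedThreadPool(3);
for (int i = 0; i < 3; i++) {
    executor.submit(new ConsumerRunnable(props));
}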

Consumer Configuration

So far we have learned how to use the consumer API, but we have only introduced a handful of configuration properties: bootstrap.servers, key.deserializer, value.deserializer, and group.id. Kafka's documentation lists all of the consumer configuration options. Most of the parameters have reasonable defaults and generally do not need to be changed, but some of them have a big impact on the consumer's performance and availability. Let's look at the most important ones.

1. fetch.min.bytes

This property specifies the minimum number of bytes the consumer will fetch from the server. When a broker receives a fetch request from a consumer and the amount of available data is smaller than fetch.min.bytes, it waits until enough data is available before returning it to the consumer. This reduces the load on both the broker and the consumer, since they exchange fewer round trips when the topic is not very active (or during the off-peak hours of the day). If the consumer's CPU usage is high even though there isn't much data available, set this property above its default. If there are many consumers, raising it also reduces the load on the broker.

2. fetch.max.wait.ms

With fetch.min.bytes we tell Kafka to wait until it has enough data before returning it to the consumer. fetch.max.wait.ms specifies how long the broker will wait; the default is 500 ms. If there is not enough data flowing into Kafka to satisfy the consumer's minimum, the consumer sees up to 500 ms of extra latency. To limit this potential latency (for example, to meet an SLA), set the parameter to a lower value. If fetch.max.wait.ms is set to 100 ms and fetch.min.bytes to 1 MB, Kafka will respond to the fetch request either when it has 1 MB of data to return or after 100 ms, whichever happens first.
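For example, the two settings could be combined on the Properties object created earlier (the values are illustrative):

props.put("fetch.min.bytes", "1048576"); // wait until at least 1 MB is available...
props.put("fetch.max.wait.ms", "100");   // ...but respond after 100 ms at the latest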

3. max.partition.fetch.bytes

This property specifies the maximum number of bytes the server returns per partition; its default is 1 MB. That is, KafkaConsumer.poll() returns at most max.partition.fetch.bytes per partition. If a topic has 20 partitions and you have 5 consumers, each consumer needs at least 4 MB of memory available for receiving records. In practice, allocate more, because if a consumer in the group crashes, the remaining consumers have to handle more partitions. max.partition.fetch.bytes must be larger than the largest message the broker will accept (configured by the broker's message.max.bytes property); otherwise the consumer may be unable to read those messages and will hang retrying. Another consideration when setting this property is the time the consumer spends processing data. The consumer must call poll() frequently enough to avoid a session timeout and the resulting rebalance; if a single call to poll() returns too much data, the consumer may take so long to process it that it does not get back to the next poll() in time. If this happens, lower max.partition.fetch.bytes or increase the session timeout.
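As a sketch (the value is illustrative; it must stay above the broker's maximum accepted message size):

props.put("max.partition.fetch.bytes", "2097152"); // 2 MB per partition, larger than the broker's message.max.bytes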

4. session.timeout.ms

This property specifies how long a consumer can be out of contact with the server before it is considered dead; the default is 3 s. If the consumer does not send a heartbeat to the group coordinator within session.timeout.ms, it is considered dead and the coordinator triggers a rebalance, assigning its partitions to the other consumers in the group. This property is closely related to heartbeat.interval.ms, which specifies how frequently poll() sends heartbeats to the coordinator, while session.timeout.ms specifies how long the consumer can go without sending one. The two are therefore usually modified together: heartbeat.interval.ms must be lower than session.timeout.ms, typically one-third of it. So if session.timeout.ms is 3 s, heartbeat.interval.ms should be 1 s. Setting session.timeout.ms lower than the default lets the group detect and recover from failures faster, but long polls or garbage collection pauses may cause unwanted rebalances. Setting it higher reduces accidental rebalances, but means real failures take longer to detect.
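A sketch keeping the suggested 3:1 ratio between the two timeouts:

props.put("session.timeout.ms", "3000");    // declared dead after 3 s without heartbeats
props.put("heartbeat.interval.ms", "1000"); // heartbeat every second, one-third of the session timeout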

5. auto.offset.reset

This property controls what the consumer does when it starts reading a partition for which it has no committed offset, or whose committed offset is invalid (because the consumer was down for so long that the record with that offset has already been aged out and deleted). The default is latest, meaning that without a valid offset the consumer starts reading from the newest records (records produced after the consumer started). The alternative is earliest, meaning that without a valid offset the consumer reads from the beginning of the partition.
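For example, to read from the beginning of the partition whenever no valid committed offset exists (a sketch):

props.put("auto.offset.reset", "earliest"); // start from the oldest available record instead of the newest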

6. enable.auto.commit

We will discuss several different ways of committing offsets later in this chapter. This property specifies whether the consumer commits offsets automatically; the default is true. To avoid duplicate processing and data loss, you can set it to false and control when offsets are committed yourself. If it is set to true, you can also control the commit frequency with auto.commit.interval.ms.
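A minimal sketch of taking control of commits yourself, committing synchronously after each processed batch (error handling omitted):

props.put("enable.auto.commit", "false"); // disable automatic commits; we commit explicitly

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        // process each record fully before its offset is committed
    }
    // commit the offsets of everything returned by the last poll()
    consumer.commitSync();
}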

7. partition.assignment.strategy

We know that partitions are assigned to the consumers in a group. A PartitionAssignor decides, given the consumers and the topics they subscribed to, which partitions are assigned to which consumer. Kafka has two built-in assignment strategies.

- Range

This strategy assigns each consumer a consecutive range of partitions from each topic. Suppose consumers C1 and C2 both subscribe to topics T1 and T2, and each topic has three partitions. Then C1 is likely to be assigned partitions 0 and 1 from both topics, while C2 is assigned partition 2 from both. Because each topic has an odd number of partitions and the assignment is done independently within each topic, the first consumer ends up with more partitions than the second. This happens whenever the Range strategy is used and the number of consumers does not divide the number of partitions evenly.

- RoundRobin

This strategy takes all of the partitions from all subscribed topics and assigns them to the consumers one by one. If the RoundRobin strategy assigns partitions to consumers C1 and C2, then C1 gets partitions 0 and 2 from topic T1 and partition 1 from topic T2, while C2 gets partition 1 from topic T1 and partitions 0 and 2 from topic T2. In general, if all consumers subscribe to the same topics (which is common), RoundRobin assignment gives every consumer the same number of partitions (or at most one partition fewer).

You select the partition strategy with partition.assignment.strategy. The default is org.apache.kafka.clients.consumer.RangeAssignor, which implements the Range strategy; you can change it to org.apache.kafka.clients.consumer.RoundRobinAssignor. You can also use a custom strategy, in which case partition.assignment.strategy is set to the name of your custom class.
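For example, switching the group to round-robin assignment (a sketch):

props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.RoundRobinAssignor");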

8. client.id

This property can be any string. The broker uses it to identify which client sent each request; it is typically used in logging, metrics, and quotas.

9. max.poll.records

This property controls the maximum number of records a single call to poll() returns. It helps you control the amount of data your application must handle in each iteration of the poll loop.

10. receive.buffer.bytes and send.buffer.bytes

These properties set the sizes of the TCP send and receive buffers used by the sockets when reading and writing data. If they are set to -1, the operating system defaults are used. If the producers or consumers are in a different data center than the brokers, it can make sense to increase these values, since links between data centers usually have higher latency and lower bandwidth.



Origin: juejin.im/post/5cf5c023f265da1b725bedb8