[Kafka] "Kafka: The Definitive Guide" - Commits and Offsets

Whenever we call KafkaConsumer's poll() method, it returns records written to Kafka that the consumers in our group have not read yet, which gives us a way to track which records were read by which consumer in the group. As discussed before, one of Kafka's unique characteristics is that it does not track acknowledgments from consumers the way many JMS queues do. Instead, it allows consumers to use Kafka to track their position (offset) in each partition.

We call the action of updating the current position in the partition a commit.

So how does a consumer commit an offset? It produces a message to a special topic called _consumer_offsets, containing the committed offset for each partition. As long as all the consumers are up and running, the committed offsets have little effect. However, if a consumer crashes or a new consumer joins the group, this triggers a rebalance. After a rebalance, each consumer may be assigned a new set of partitions, not the ones it processed before. In order to know where to pick up the work, the consumer reads the latest committed offset of each partition and continues from there.

If the committed offset is smaller than the offset of the last message the client processed, the messages between the committed offset and the last processed offset will be processed twice, as shown in Figure 4-6.

If the committed offset is larger than the offset of the last message the client actually processed, all messages between the last processed offset and the committed offset will be missed by the consumer group, as shown in Figure 4-7.

Clearly, how offsets are committed has a significant impact on the client application. The KafkaConsumer API provides multiple ways of committing offsets.

Auto Commit

The easiest way to commit offsets is to let the consumer do it for you. If enable.auto.commit is set to true, then every 5 seconds the consumer commits the largest offset returned by the poll() method. The 5-second interval is the default and is controlled by auto.commit.interval.ms. Just like everything else in the consumer, automatic commits are driven by the poll loop: on each poll, the consumer checks whether it is time to commit, and if it is, it commits the offsets returned by the previous poll.
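As an illustration, the relevant configuration might be built like this. The broker address, group id, and serializer choices are placeholders for this sketch, not values from the original text:

```java
import java.util.Properties;

class AutoCommitConfig {
    // Build consumer properties with automatic offset commits enabled.
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("group.id", "example-group");           // placeholder group id
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "true");          // commit offsets automatically
        props.put("auto.commit.interval.ms", "5000");     // the default: every 5 seconds
        return props;
    }
}
```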

Before using this convenient option, however, it is important to understand the consequences. Suppose we use the default 5-second commit interval and a rebalance occurs 3 seconds after the most recent commit. After the rebalance, consumers start consuming from the last committed offset. In this case the offset is 3 seconds old, so all events that arrived during those 3 seconds will be processed twice. It is possible to configure a smaller commit interval and commit more frequently, reducing the window in which duplicates can appear, but duplicates cannot be eliminated completely.

With autocommit enabled, a call to poll() always commits the offsets returned by the previous call. The consumer has no idea which messages were actually processed, so it is important to finish processing all the messages returned by one poll() before calling poll() again (calling close() also commits offsets automatically). This is usually not a problem, but pay attention when you handle exceptions or exit the poll loop prematurely.

Automatic commits are convenient, but they don't give developers enough control to avoid processing messages twice.

Committing the Current Offset

Most developers exercise more control over when offsets are committed, both to eliminate the possibility of missing messages and to reduce the number of messages duplicated during a rebalance. The consumer API provides another way to commit offsets: developers can commit the current offset at a point that makes sense to the application, rather than based on a timer.

To turn off automatic commits, set enable.auto.commit to false and let the application decide when to commit offsets. The simplest and most reliable of the commit APIs is commitSync(). It commits the latest offsets returned by poll() and returns as soon as the commit succeeds, throwing an exception if the commit fails.

Keep in mind that commitSync() commits the latest offsets returned by poll(), so make sure you call commitSync() only after you are done processing all the records in the batch, or you risk missing messages. If a rebalance is triggered, all the messages from the beginning of the most recent batch until the time of the rebalance will be processed twice.

Here is an example that uses commitSync() to commit offsets after we finish processing the latest batch of messages.
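The original code listing was lost in this copy; the following is a sketch along the lines of the book's example. It assumes a KafkaConsumer<String, String> named consumer that is already configured with enable.auto.commit=false and subscribed to a topic, and a logger named log:

```java
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // "Processing" here is just printing; a real application would do its work instead.
        System.out.printf("topic = %s, partition = %d, offset = %d, value = %s%n",
            record.topic(), record.partition(), record.offset(), record.value());
    }
    try {
        consumer.commitSync(); // commit the latest offsets returned by poll()
    } catch (CommitFailedException e) {
        log.error("commit failed", e); // nothing more we can do unless we retry elsewhere
    }
}
```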

Asynchronous Commit

One drawback of synchronous commits is that the application is blocked until the broker responds to the commit request, which limits the application's throughput. We could improve throughput by committing less frequently, but then a rebalance would create more duplicates.

The asynchronous commit API is another option: we just send the commit request and carry on, without waiting for the broker's response.
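A minimal sketch of an asynchronous commit loop, under the same assumptions as before (a subscribed consumer variable with automatic commits disabled):

```java
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // process the record here ...
    }
    consumer.commitAsync(); // fire and forget: the commit is not retried if it fails
}
```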

commitSync() keeps retrying until the commit succeeds or it hits an unrecoverable error (blocking the application the whole time), but commitAsync() does not retry, and this is its weak point. The reason it does not retry is that by the time it receives a response from the server, a later commit with a larger offset may already have succeeded. Imagine we send a request to commit offset 2000, and a temporary communication problem means the server never receives the request and therefore never responds. Meanwhile, we process another batch and successfully commit offset 3000. If commitAsync() now retried the commit of offset 2000, it might succeed after the commit of offset 3000, and a rebalance at that point would cause more duplicates.

We mention the importance and complexity of commit order because commitAsync() also supports a callback that is invoked when the broker responds. Callbacks are commonly used to log commit errors or to count them in a metric, but if you want to use the callback for retries, you have to be aware of the order of commits.
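A sketch of an asynchronous commit with a logging callback, again assuming an already-configured consumer and a logger named log:

```java
consumer.commitAsync(new OffsetCommitCallback() {
    @Override
    public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception e) {
        if (e != null) {
            // Record the failure; do not retry blindly from here, because a
            // commit with a larger offset may already have succeeded.
            log.error("Commit failed for offsets {}", offsets, e);
        }
    }
});
```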

Retrying Asynchronous Commits

A simple pattern for getting the commit order right with asynchronous retries is to use a monotonically increasing sequence number. Increase the sequence number every time you commit, and record it (along with the offsets) in the commitAsync callback. When you are about to retry, check whether the sequence number captured by the callback equals the current one. If they are equal, there was no newer commit and it is safe to retry. If the current sequence number is larger, a newer commit has already been sent, so you should stop and not retry.
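Stripped of the Kafka client, the bookkeeping can be sketched as a tiny helper class. The class and method names here are ours, purely for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sequence-number pattern for deciding whether an async commit retry is safe.
class CommitRetryGuard {
    private final AtomicLong commitSeq = new AtomicLong(0);

    // Call when sending a commit; capture the returned number in the callback.
    long onCommitSent() {
        return commitSeq.incrementAndGet();
    }

    // Call from the failure callback: retry only if no newer commit was sent since.
    boolean shouldRetry(long seqCapturedInCallback) {
        return commitSeq.get() == seqCapturedInCallback;
    }
}
```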

Combining Synchronous and Asynchronous Commits

Normally, occasional commit failures without retries are not a big problem: if a commit fails because of a temporary issue, a subsequent commit will succeed. But if we know this is the last commit before the consumer closes, or the last one before a rebalance, we want to make extra sure the commit succeeds.

Therefore, a common pattern is to combine commitAsync() with commitSync() just before shutdown. It works as follows (we will discuss how to commit an offset right before a rebalance when we get to rebalance listeners):
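A sketch of the combined pattern, assuming a subscribed consumer, a logger named log, and a closing flag set by whatever triggers shutdown:

```java
try {
    while (!closing) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            // process the record here ...
        }
        consumer.commitAsync(); // fast; an occasional failure is acceptable mid-run
    }
} catch (Exception e) {
    log.error("unexpected error", e);
} finally {
    try {
        consumer.commitSync(); // final commit: block and retry until it succeeds
    } finally {
        consumer.close();
    }
}
```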

Committing a Specific Offset

Committing the latest offset only lets you commit as often as you finish processing a batch. But what if you want to commit more frequently than that? What if poll() returns a huge batch, and you want to commit offsets in the middle of the batch to avoid having to process the whole batch again if a rebalance occurs? You can't simply call commitSync() or commitAsync(), because they would commit the last offset returned, which you have not finished processing yet.

Fortunately, the consumer API allows you to pass a map of partitions and offsets that you wish to commit when calling commitSync() or commitAsync(). Suppose you are in the middle of processing a batch, and the last message you got from partition 3 of the topic "customers" has offset 5000; you can call commitSync() to commit that offset. However, because a consumer may be consuming more than one partition, you need to track offsets on all of them, so controlling commits at this level adds complexity to your code.

Here is an example of committing specific offsets:
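The original listing is missing here; this sketch follows the book's pattern of tracking offsets per partition and committing every 1000 records. It assumes the usual subscribed consumer; the 1000-record threshold is illustrative:

```java
Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();
int count = 0;

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // process the record here ...

        // Track the offset of the NEXT message to read (hence offset + 1).
        currentOffsets.put(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1));
        if (count % 1000 == 0) {
            consumer.commitAsync(currentOffsets, null); // commit mid-batch, no callback
        }
        count++;
    }
}
```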

Rebalance Listeners

As mentioned in the section about committing offsets, a consumer will want to do some cleanup work before exiting and before a partition rebalance.

If you know your consumer is about to lose ownership of a partition, you will want to commit the offsets of the last records it processed. If your consumer maintains a buffer of events that it processes only occasionally, you will want to process the buffered records before losing ownership of the partition. You may also need to close file handles, database connections, and so on.

The consumer API allows you to run your own code when partitions are added to or removed from the consumer: pass a ConsumerRebalanceListener when calling the subscribe() method. ConsumerRebalanceListener has two methods you can implement.

(1) public void onPartitionsRevoked(Collection<TopicPartition> partitions) is called after the consumer stops consuming messages and before the rebalance starts. This is where you want to commit offsets, so whoever takes over the partition next will know where to start reading.

(2) public void onPartitionsAssigned(Collection<TopicPartition> partitions) is called after partitions have been reassigned to the consumer and before it starts consuming messages.

The following example shows how to use onPartitionsRevoked() to commit offsets before losing ownership of a partition. In the next section we will show an example that also uses onPartitionsAssigned().
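The listing itself did not survive in this copy; a sketch in the spirit of the book's example follows. It assumes the consumer variable, a currentOffsets map as in the previous section, and a topics collection:

```java
class HandleRebalance implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Nothing to do when new partitions arrive in this example.
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Commit everything processed so far, before losing the partitions.
        consumer.commitSync(currentOffsets);
    }
}

// Register the listener when subscribing:
consumer.subscribe(topics, new HandleRebalance());
```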

Consuming Records Starting at a Specific Offset

So far, we have used poll() to start consuming messages from the last committed offset in each partition. However, sometimes you want to start reading at a different offset.

If you want to start reading all messages from the beginning of a partition, or skip all the way to the end and consume only new messages, you can use seekToBeginning(Collection<TopicPartition> tp) and seekToEnd(Collection<TopicPartition> tp).

Kafka also provides an API for seeking to a specific offset. It has many uses, such as going back a few messages or skipping ahead a few (a time-sensitive application that is falling behind may want to skip ahead to more recent messages), but the most exciting use case is when offsets are stored in a system other than Kafka.

Imagine this scenario: your application reads events from Kafka (perhaps a clickstream of users on a website), processes them (perhaps cleaning out clicks made by automated programs and adding session information), and then stores the results in a database, NoSQL store, or Hadoop. Suppose we really don't want to lose any data, but we also don't want to store the same results in the database twice.

In this case, the consumer loop might look like this:
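The listing is missing from this copy; a sketch along the book's lines follows. processRecord and storeRecordInDB are hypothetical application helpers, and the consumer and currentOffsets map are assumed to be set up as before:

```java
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        currentOffsets.put(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1));
        processRecord(record);    // hypothetical application logic
        storeRecordInDB(record);  // hypothetical persistence helper
        consumer.commitAsync(currentOffsets, null); // commit after every record
    }
}
```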

In this example, we commit the offset after processing each record. Even so, the application could still crash after the record was stored in the database but before the offset was committed, causing the record to be processed again and the database to contain duplicates.

This could be avoided if there were a way to store the record and the offset in one atomic operation: either both the record and the offset are committed, or neither is. As long as the records are written to a database and the offsets to Kafka, this is impossible.

But what if we wrote both the record and the offset to the database, within the same transaction? Then we would know that either we are done with the record and the offset is committed, or we are not, and the record will be reprocessed.

Now the only problem is: if the offset is stored in a database rather than in Kafka, how will our consumer know where to start reading when it is assigned a new partition? This is what seek() can be used for. When the consumer starts or is assigned new partitions, it can look up the offset in the database and seek() to that location.

The following example gives a rough idea of how to use this API. We use a ConsumerRebalanceListener and the seek() method to make sure we start processing messages at the offsets stored in the database:
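Since the listing was lost here, the following is a sketch of the pattern. commitDBTransaction, getOffsetFromDB, and storeOffsetInDB are hypothetical database helpers, and consumer, topics, processRecord, and storeRecordInDB are assumed from the previous examples:

```java
class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Flush processed records and their offsets to the database together.
        commitDBTransaction();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Resume each newly assigned partition from the offset stored in the database.
        for (TopicPartition partition : partitions) {
            consumer.seek(partition, getOffsetFromDB(partition));
        }
    }
}

consumer.subscribe(topics, new SaveOffsetsOnRebalance());
consumer.poll(Duration.ofMillis(0)); // join the group, get partitions, trigger seeks

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);
        storeRecordInDB(record);
        storeOffsetInDB(record.topic(), record.partition(), record.offset());
    }
    commitDBTransaction(); // records and offsets commit or roll back together
}
```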

There are many ways to implement exactly-once semantics by storing offsets and records in the same external system, but all of them need to combine a ConsumerRebalanceListener with the seek() method to make sure offsets are stored in time and that the consumer always starts reading messages from the correct location.

How to Quit

Earlier, when we discussed the poll loop, we told you not to worry about the consumer polling in an infinite loop; here we explain how to exit the loop cleanly.

When you decide to exit the poll loop, you need another thread to call consumer.wakeup(). If the loop runs in the main thread, this can be done from a ShutdownHook. Note that consumer.wakeup() is the only consumer method that is safe to call from a different thread. Calling wakeup() causes poll() to exit by throwing a WakeupException; if consumer.wakeup() is called while the thread is not waiting in poll(), the exception is thrown on the next call to poll(). You don't need to handle the WakeupException, since it is just a way of breaking out of the loop. But before exiting the thread, it is important to call consumer.close(): it commits any offsets not yet committed and sends the group coordinator (a broker) a message that the consumer is leaving the group, which triggers a rebalance immediately instead of waiting for the session to time out.

Here is the exit code for a consumer running in the main application thread:
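The listing did not survive in this copy; a sketch of the pattern follows, assuming a configured consumer and a reference to the main thread named mainThread:

```java
// Register a shutdown hook that interrupts the poll loop from another thread.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    System.out.println("Starting exit...");
    consumer.wakeup();     // the only consumer call that is thread-safe
    try {
        mainThread.join(); // wait for the poll loop to finish cleanly
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}));

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        for (ConsumerRecord<String, String> record : records) {
            // process the record here ...
        }
        consumer.commitSync();
    }
} catch (WakeupException e) {
    // Ignore: this is just the signal to leave the loop.
} finally {
    consumer.close(); // commits pending offsets and leaves the group immediately
}
```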

Origin juejin.im/post/5cf85fc2f265da1bb31c2852