Missed consumption and repeated consumption
Repeated consumption: the data has already been consumed, but the offset was not committed before a failure, so the same records are delivered again after restart.

Missed consumption: the offset is committed first and the data is consumed afterwards; if the consumer crashes after the commit but before processing, those records are never processed.
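The two failure modes above come down to the order of "process" and "commit offset". A minimal sketch (a hypothetical in-memory simulation, not the real Kafka API) makes the difference concrete:

```python
def run_consumer(log, start_offset, commit_first, crash_at=None):
    """Process records from `log` beginning at `start_offset`.

    commit_first=True  commits the offset BEFORE processing a record;
    commit_first=False commits AFTER.  `crash_at` simulates a crash
    between those two steps for that record index.  Returns
    (processed records, last committed offset).
    """
    processed, committed = [], start_offset
    for i in range(start_offset, len(log)):
        if commit_first:
            committed = i + 1                # commit first ...
            if i == crash_at:
                return processed, committed  # crash: record lost forever
            processed.append(log[i])         # ... then process
        else:
            if i == crash_at:
                # crash: record processed but offset NOT committed
                return processed + [log[i]], committed
            processed.append(log[i])         # process first ...
            committed = i + 1                # ... then commit
    return processed, committed

log = ["m0", "m1", "m2"]

# Missed consumption: commit first, crash before processing m0.
seen, off = run_consumer(log, 0, commit_first=True, crash_at=0)
seen2, _ = run_consumer(log, off, commit_first=True)
print(seen + seen2)   # -> ['m1', 'm2']  (m0 is never processed)

# Repeated consumption: process first, crash before committing m0.
seen, off = run_consumer(log, 0, commit_first=False, crash_at=0)
seen2, _ = run_consumer(log, off, commit_first=False)
print(seen + seen2)   # -> ['m0', 'm0', 'm1', 'm2']  (m0 processed twice)
```

Commit-first gives at-most-once delivery; process-first gives at-least-once. Neither alone gives exactly-once, which is why the transactional approach below is needed.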
Consumer transactions
To achieve exactly-once consumption on the consumer side, the Kafka consumer must bind the consumption step and the offset commit into a single atomic operation. This requires saving Kafka's offsets in a custom medium that supports transactions (such as MySQL); this part is only covered in a follow-up project.
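The idea can be sketched as follows. This is a simulation, not a real Kafka or MySQL API: an in-memory `TransactionalStore` class (a name invented here) stands in for the database, and its atomic commit stands in for a MySQL transaction that writes the processed result and the new offset together.

```python
class TransactionalStore:
    """Stands in for a MySQL database holding both results and offsets."""
    def __init__(self):
        self.results = []
        self.offset = 0

    def commit_atomically(self, result, new_offset):
        # In MySQL this would be one transaction:
        # BEGIN; INSERT result; UPDATE offset; COMMIT;
        self.results.append(result)
        self.offset = new_offset

def consume_exactly_once(log, store, crash_at=None):
    """Resume from the offset stored in `store`; `crash_at` simulates a
    failure before the atomic commit of that record."""
    for i in range(store.offset, len(log)):
        if i == crash_at:
            return                               # crash: neither result nor offset saved
        store.commit_atomically(log[i].upper(), i + 1)

log = ["a", "b", "c"]
store = TransactionalStore()
consume_exactly_once(log, store, crash_at=1)   # crash while handling "b"
consume_exactly_once(log, store)               # restart: resumes at offset 1
print(store.results)                           # -> ['A', 'B', 'C'] (each exactly once)
```

Because the result and the offset live or die together in one transaction, a crash can never separate "consumed" from "committed", so a restart neither skips nor repeats a record.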
Data backlog (how consumers can improve throughput)
1) If Kafka's consumption capacity is insufficient, consider increasing the number of topic partitions and, at the same time, increasing the number of consumers in the consumer group, so that the number of consumers = the number of partitions.
2) If downstream processing cannot keep up, increase the amount of data pulled per batch. If each batch pulls too little data (data pulled per batch / processing time < production speed), consumption falls behind production and a backlog also builds up.
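The condition in point 2 is simple arithmetic. The numbers below are illustrative assumptions, not from the source; they only show how to check whether a given configuration keeps up:

```python
def backlog_after(seconds, production_rate, records_per_poll, poll_processing_time):
    """Records of backlog accumulated after `seconds`, given a production
    rate (records/s) and a consumer that handles `records_per_poll`
    records in `poll_processing_time` seconds per poll."""
    consumption_rate = records_per_poll / poll_processing_time  # records/s
    growth = max(0.0, production_rate - consumption_rate)       # records/s
    return growth * seconds

# 1000 records/s produced; 500 records per poll, 1 s to process a poll
# -> consumption rate 500 records/s, so backlog grows.
print(backlog_after(60, production_rate=1000,
                    records_per_poll=500, poll_processing_time=1.0))  # -> 30000.0

# Doubling the records per poll (assuming processing time stays ~1 s)
# closes the gap and the backlog stops growing.
print(backlog_after(60, production_rate=1000,
                    records_per_poll=1000, poll_processing_time=1.0))  # -> 0.0
```

The two parameters in the table below (`fetch.max.bytes` and `max.poll.records`) are the knobs that raise the records-per-poll side of this inequality.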
parameter | description |
---|---|
fetch.max.bytes | Default: 52428800 (50 MB). The maximum number of bytes the consumer fetches from the server in one batch. If a single batch of data on the server side is larger than this value, the batch can still be returned, so this is not an absolute maximum. The batch size is also limited by message.max.bytes (broker config) or max.message.bytes (topic config). |
max.poll.records | The maximum number of records returned by a single poll. Default: 500. |
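As a sketch, the two settings from the table might be raised together when a backlog forms. The property names are the real Kafka configuration keys; the values and the plain-dict form are illustrative assumptions to be tuned per workload and passed to whatever consumer client is in use:

```python
# Illustrative consumer configuration (values are assumptions, not defaults
# to copy blindly): raise both knobs so each poll returns more data.
consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "backlog-demo",
    # Pull more bytes per fetch: 100 MB instead of the 50 MB default.
    "fetch.max.bytes": 104857600,
    # Return more records per poll: 1000 instead of the 500 default.
    "max.poll.records": 1000,
}

print(consumer_config["fetch.max.bytes"] // (1024 * 1024))  # -> 100
```

Note that raising `max.poll.records` lengthens the time one poll takes to process; if that exceeds the session/poll timeout the consumer is kicked from the group, so the two values should be increased gradually.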