Kafka log cleaner crashing

Yannick:

Please note that I have already read the following article several times and tried to find relevant info on different forums, but without success: https://medium.com/@anishekagarwal/kafka-log-cleaner-issues-80a05e253b8a I hope you will understand my issue and can give me some clues =)

Here is the story:

A few days ago, we deployed a deduplication service on a Kafka cluster. Since we started using that service, we noticed that the __consumer_offsets topic started growing. The reason was that the log cleaner (used to compact this topic, among others) had crashed with the following error: java.lang.IllegalStateException: This log contains a message larger than maximum allowable size of 1000012

From what we understood, we first thought it was a message size issue, so we increased the max.message.bytes value (up to more than 20 MB), but then we got the same issue (with the error message correctly updated to the new value).
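For reference, this kind of topic-level override is typically applied with kafka-configs.sh, roughly like this (the ZooKeeper address is a placeholder and 20971520 is just our ~20 MB value):

kafka-configs.sh --zookeeper zk:2181 --alter --entity-type topics --entity-name __consumer_offsets --add-config max.message.bytes=20971520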

So we started to think that it was maybe some kind of "corrupted" message size value or a "misunderstood" segment (as if this Kafka version's log cleaner did not handle the message correctly).

We were able to isolate the segment offset that was causing the problem. It was quite strange: when we consumed it with a simple consumer, the record was about 4 KB, yet it took 7 or 8 minutes for the consumer to consume only that record (during this polling, a tcpdump clearly showed many >1000-byte packets coming from the broker).
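For what it's worth, here is a minimal sketch (2.0 client API) of the kind of plain consumer that can read that single record. This is not our exact code: the broker address and partition number are placeholders, only the offset is the real suspect one.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ReadOneOffsetsRecord {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");              // placeholder
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        props.put("enable.auto.commit", "false");                   // read-only, no consumer group needed

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("__consumer_offsets", 12); // placeholder partition
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 4740626096L);                          // the suspect offset
            // Generous timeout: in our case the fetch took 7-8 minutes
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMinutes(10));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                // Only the first record at/after the seek position matters here
                System.out.printf("offset=%d keysize=%d valuesize=%d%n",
                        record.offset(),
                        record.key() == null ? -1 : record.key().length,
                        record.value() == null ? -1 : record.value().length);
                break;
            }
        }
    }
}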

So we used the DumpLogSegments tool to have a look at the segment; the output below is what we got (I replaced some values to anonymise it a bit).
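For reference, a dump like this can be produced with something along these lines (exact flags vary a bit between versions; --offsets-decoder makes the tool decode the __consumer_offsets key/value formats):

kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --offsets-decoder --files 00000000004293321003.log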

Dumping 00000000004293321003.log

Starting offset: 4293321003

baseOffset: 4310760245 lastOffset: 4310760245 count: 1 baseSequence: -1 lastSequence: -1 producerId: 66007 producerEpoch: 2 partitionLeaderEpoch: 50 isTransactional: true isControl: true position: 0 CreateTime: 1556544968606 size: 78 magic: 2 compresscodec: NONE crc: 2072858171 isvalid: true

| offset: 4310760245 CreateTime: 1556544968606 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 0

baseOffset: 4310760295 lastOffset: 4310760295 count: 1 baseSequence: -1 lastSequence: -1 producerId: 65010 producerEpoch: 2 partitionLeaderEpoch: 50 isTransactional: true isControl: true position: 78 CreateTime: 1556544968767 size: 78 magic: 2 compresscodec: NONE crc: 2830498104 isvalid: true

| offset: 4310760295 CreateTime: 1556544968767 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 0

baseOffset: 4310760731 lastOffset: 4310760731 count: 1 baseSequence: -1 lastSequence: -1 producerId: 64005 producerEpoch: 2 partitionLeaderEpoch: 50 isTransactional: true isControl: true position: 156 CreateTime: 1556544969525 size: 78 magic: 2 compresscodec: NONE crc: 3044687360 isvalid: true

| offset: 4310760731 CreateTime: 1556544969525 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 0

baseOffset: 4310760732 lastOffset: 4310760732 count: 1 baseSequence: -1 lastSequence: -1 producerId: 66009 producerEpoch: 2 partitionLeaderEpoch: 50 isTransactional: true isControl: true position: 234 CreateTime: 1556544969539 size: 78 magic: 2 compresscodec: NONE crc: 1011583163 isvalid: true

| offset: 4310760732 CreateTime: 1556544969539 keysize: 4 valuesize: 6 sequence: -1 headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 0

So we saw a lot of batches like the ones above.

And then the faulty offset that makes the log cleaner crash:

baseOffset: 4740626096 lastOffset: 4740626096 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 50 isTransactional: false isControl: false position: 50471272 CreateTime: 1557322162912 size: 4447 magic: 2 compresscodec: NONE crc: 3030806995 isvalid: true

| offset: 4740626096 CreateTime: 1557322162912 keysize: 25 valuesize: 4352 sequence: -1 headerKeys: [] key: {"metadata":"MYGROUPID"} payload: {"protocolType":"consumer","protocol":"range","generationId":32,"assignment":"{CLIENTID=[TOPICA-0, TOPICA-1, TOPICA-2, TOPICA-3, TOPICA-4, TOPICA-5, TOPICA-6, TOPICA-7, TOPICA-8, TOPICA-9, TOPICA-10, TOPICA-11], AND THIS FOR ABOUT 10 OTHER TOPICS}"}  ==> approximately 4 KB

This does not look like the standard __consumer_offsets data schema (the [groupId,topicName,partitionNumber]::offset schema), and I think that's because the new service uses Kafka transactions.

We think this crash may be due to the fact that our Kafka cluster is on 0.9.11 (or maybe 1.0.1) while our deduplication service uses the Kafka 2.0.0 API (and uses transactions).

So here are some questions I have:

  • How does the __consumer_offsets topic handle committed offsets when dealing with Kafka transactions? I don't get the structure at all. It seems like there are multiple COMMIT marker messages (but with no clue which topic or partition they refer to, so how does this work? :/), always followed by this non-transactional record that includes a metadata tag. Is there any documentation on this structure?

  • Would it be possible that the log cleaner of a 1.1.0 Kafka cluster does not correctly handle these kinds of transactional messages in __consumer_offsets (fed via the 2.0.0 API)?

Any clue / correction would be welcome here.

Regards,

Yannick

Yannick:

After some research, I found out why we had this behaviour and found the solution.

It's a well-known Kafka bug that impacts at least version 1.1.0 of Kafka: https://issues.apache.org/jira/browse/KAFKA-6854

Solution: the easy way is to upgrade to version 2.x (or 1.1.1, which handles it, as you can see in the Jira).

It happens because of segments full of transaction markers: once the delete retention time is reached and those markers are due to be removed, the LogCleaner crashes during compaction (while trying to double its buffer multiple times).
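To illustrate the failure mode, here is a rough Java sketch of my understanding of the bug (the real cleaner is Scala code inside the broker, and readAndFilter below is an invented stand-in): a chunk made up almost entirely of removable transaction markers filters down to nothing, the cleaner takes the lack of output to mean the batch did not fit in its read buffer, doubles the buffer, and eventually hits the cap and throws the exact IllegalStateException we saw.

// Rough, illustrative sketch of the KAFKA-6854 failure mode, not the broker's actual code.
public class LogCleanerCrashSketch {

    // Stand-in for reading a chunk of the segment and filtering out removable
    // control batches; returns true if anything survived the filtering.
    static boolean readAndFilter(int bufferSize) {
        return false; // simulate a chunk containing only removable transaction markers
    }

    public static void main(String[] args) {
        int bufferSize = 128 * 1024;        // initial read buffer (example value)
        final int maxMessageSize = 1000012; // the configured max message size, used as the growth cap

        while (!readAndFilter(bufferSize)) {
            if (bufferSize >= maxMessageSize)
                throw new IllegalStateException(
                        "This log contains a message larger than maximum allowable size of "
                                + maxMessageSize);
            bufferSize = Math.min(bufferSize * 2, maxMessageSize); // keep doubling until the cap
        }
    }
}

That would also explain why raising max.message.bytes only changed the number in the error message: it just raised the cap the buffer is allowed to double up to.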

If you want more details on the segment structure and how exactly the LogCleaner crashed, there is more information and research in this article:

https://medium.com/@ylambrus/youve-got-a-friend-logcleaner-1fac1d7ec04f

Yannick
