How to manage offsets when Spark Streaming consumes Kafka (2)

The previous article discussed how to manage Kafka offsets when consuming from Spark Streaming. In this article, I will walk through the last case: a failure we ran into after scaling out.

About a month ago, we wanted to improve the parallel processing performance of a Spark Streaming program, which meant increasing the number of Kafka partitions. A note here: for the newer Spark Streaming and Kafka integration, the official Spark documentation recommends that the number of executors equal the number of Kafka partitions, so that each executor processes the data of exactly one partition at maximum efficiency. If the number of executors exceeds the number of partitions, the surplus executors simply process no data at all, and that part of the cluster's capacity is wasted.

If the number of executors is less than the number of Kafka partitions, some executors have to process data from multiple partitions. Hence the official recommendation: keep the number of Spark executors and the number of Kafka partitions equal, as in the sketch below.
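To make that sizing concrete, here is a minimal Scala sketch assuming a hypothetical topic with 12 partitions (the app name and numbers are illustrative, not from our actual setup); the same settings can equally be passed on the spark-submit command line:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One single-core executor per Kafka partition: 12 partitions -> 12 executors.
val conf = new SparkConf()
  .setAppName("streaming-kafka-demo")
  .set("spark.executor.instances", "12") // equal to the Kafka partition count
  .set("spark.executor.cores", "1")      // one partition's worth of work each
val ssc = new StreamingContext(conf, Seconds(5))
```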

So if you want to improve the parallel processing performance of Spark Streaming, the only option is to add Kafka partitions. Adding partitions to Kafka is easy and can be done with a single command. Note, however, that a topic's partition count can only ever be increased, never decreased, so think carefully about how many partitions are appropriate before adding them.
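The "single command" here refers to Kafka's own CLI tooling; as an alternative, here is a sketch of the same operation through Kafka's AdminClient API (assuming Kafka 1.0+, with a hypothetical broker address and topic name):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewPartitions}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
val admin = AdminClient.create(props)
// Kafka only allows growing a topic's partition count, never shrinking it,
// so choose the target count carefully before running this.
admin.createPartitions(
  Collections.singletonMap("my-topic", NewPartitions.increaseTo(12))
).all().get()
admin.close()
```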

Next, we increased the number of Kafka partitions, changed the Spark Streaming executor count to match it one-to-one, and started the streaming program. A strange problem then appeared, as follows:

We produced a few test records into Kafka and found that the program could only process some of them; part of the data was lost every single time. In theory nothing should have broken: the code had not changed, we had only increased the number of Kafka partitions and Spark Streaming executors. So I re-tested the original partition layout and program, and found no problem there. Comparing the two, the issue only appeared after new partitions were added to Kafka; that is when data started going missing. Together with the ops colleagues, we then looked at the disk directories of the newly added Kafka partitions to see whether any data had landed in them, and found that the new partitions had indeed received data. That made it even stranger: how exactly was the lost data getting lost?
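Instead of inspecting disk directories, the same check can be done from a plain Kafka consumer by reading each partition's log-end offset; a sketch assuming the 0.10+ consumer API and hypothetical broker and topic names:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](props)

// A non-zero log-end offset on a newly added partition means data has landed there.
val partitions = consumer.partitionsFor("my-topic").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
consumer.endOffsets(partitions.asJava).asScala.foreach { case (tp, end) =>
  println(s"partition ${tp.partition}: log end offset = $end")
}
consumer.close()
```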

Finally, I checked the Kafka offsets that we save ourselves, and found that they contained no entries for the newly added partitions. At that point the problem was clear: if a newly added partition has no saved offset, its data is never processed while the program runs, even though data is indeed landing in it. That is the cause of the strange data loss described above: the new partitions' data was never processed by the program, because the new partitions' offsets were never recorded in our self-managed offset store.

The problem was found; so how do we repair the lost data online?

At the time I could only think of a rather crude method. Our Kafka cluster keeps data for 7 days by default, and the old partitions' data had already been processed while the new partitions' data had not. So we deleted the already-processed old partitions' data, then restarted the streaming program during an off-peak period of business traffic, configured to start consuming from the earliest offset. Since the old partitions' data had been deleted and only the new partitions still held data, this effectively replayed and repaired the lost records. Once the repair was complete, we stopped the program again, configured it to start from the latest offset so that the newly added partitions were now recorded in the saved offsets, and then resumed normal processing.
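In terms of consumer configuration, the two restarts differ only in the offset-reset policy; a sketch in the style of the 0.10 integration's kafkaParams, with hypothetical broker and group names:

```scala
import org.apache.kafka.common.serialization.StringDeserializer

// Step 1 of the repair: with the old partitions' data deleted, replaying from
// the earliest offsets re-processes exactly the data left in the new partitions.
val repairParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "group.id" -> "my-streaming-group",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "auto.offset.reset" -> "earliest" // replay everything still on the brokers
)

// Step 2: once caught up, stop the job and restart from the latest offsets,
// so the offsets saved from now on include the new partitions.
val resumeParams = repairParams.updated("auto.offset.reset", "latest")
```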

Note that deleting the data of old Kafka partitions is a fairly dangerous operation; it required all Kafka nodes to be restarted to take effect, so unless circumstances truly demand it, do not resort to such a risky method.

Later, I carefully analyzed the source code of the open source program we used to manage offsets and found that it has a small bug: it does not take newly added Kafka partitions into account. That is, if your topic's partition count is increased, the program cannot recognize the new partitions after a restart, so if data keeps arriving in them, your program is guaranteed to lose it. Because expanding Kafka partitions is an uncommon operation, this bug is not easy to trigger.

Once the cause is known, the fix is straightforward: every time before starting the streaming program, compare the partition count in our saved offsets with the topic's current partition count read from ZooKeeper. If they differ, add the new partitions to the saved information and initialize their offsets to 0; then, once the program starts, the new partitions' data is picked up automatically. A sketch of this reconciliation step follows.
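A minimal Scala sketch of that check, under stated assumptions: `savedOffsets` stands for whatever was persisted externally, and the current partition count is taken as a parameter (in our setup it would be read from ZooKeeper at startup); both names are hypothetical:

```scala
import org.apache.kafka.common.TopicPartition

def reconcileOffsets(topic: String,
                     savedOffsets: Map[TopicPartition, Long],
                     currentPartitionCount: Int): Map[TopicPartition, Long] = {
  (0 until currentPartitionCount)
    .map(i => new TopicPartition(topic, i))
    .map { tp =>
      // Known partitions keep their saved offset; partitions we have never
      // seen before start at 0, so their data is consumed from the beginning.
      tp -> savedOffsets.getOrElse(tp, 0L)
    }
    .toMap
}
```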

So, looking back at the problem above, the simplest and most elegant solution would have been to manually edit our saved Kafka partition offset information, add entries for the new partitions, and then restart the streaming program.

This case is also the third scenario I mentioned in the previous article: if you manage Kafka offsets manually, you must make sure newly added partitions are handled compatibly, otherwise the program may lose data.

If you have any questions, you can scan the QR code and follow the WeChat public account woshigcs, and leave a message in the background for consultation. Technical debt cannot be owed, and neither can health debt. On the road of seeking the Tao, may we walk together.
