How to Manage Kafka Offsets in Spark Streaming (1)




I have been a bit busy with work recently, so I have been updating articles less often; my apologies. I have written about managing offsets in Spark Streaming before, but at the time I only knew how to use the mechanism without really understanding why it was done that way. Recently I took some time to read the source code of an open-source project on GitHub that manages offsets itself, and I now understand it fairly thoroughly. There is also a story about an upgrade that failed because of this lack of understanding, which I will share in the next article. In this article, we look from a theoretical point of view at how to manage offset state when Spark Streaming integrates with Kafka.

Spark Streaming version: 2.1

Kafka version: 0.9.0.0



Before that, let's restate the strategies for managing offsets in Spark Streaming. By default, Spark Streaming manages offsets by checkpointing the state of each batch and persisting it to HDFS. If the machine or the program fails and stops, then on the next start the RDD state at the time of the failure can still be read from the checkpoint directory, and processing can continue from the data that was being handled last time. The biggest drawback of the checkpoint approach is that when the code is upgraded, the new version of the jar cannot reuse the serialized state written by the old version, so there is no smooth transition between the two versions: data is either lost or duplicated. This is why hardly anyone dares to run important streaming jobs in production this way.
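For reference, here is a minimal sketch of the default checkpoint-based recovery described above, using Spark's StreamingContext.getOrCreate; the checkpoint path is hypothetical. The serialized state written to this directory is exactly what a new version of the jar cannot reuse after an upgrade.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build the DStream graph here ...
  ssc
}

// Rebuilds the whole job (including in-flight offsets) from the checkpoint
// if one exists; otherwise creates a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```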




Therefore, the more common solution is to write your own code to manage offsets when Spark Streaming integrates with Kafka. Managing offsets in code really just means storing the offsets of each batch in an external storage system (HBase, HDFS, ZooKeeper, Kafka, a database, etc.). Whichever storage system you use, you need to account for the offset state in three scenarios; otherwise the offset state is incomplete, which can lead to subtle bugs.
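Whatever store you pick, the per-batch update usually looks like the sketch below: grab the batch's offset ranges from the direct stream's RDD and persist them after processing. Here `stream` is assumed to be a direct stream created as in the scenario sketches that follow, and `saveOffsets` is a hypothetical helper for your chosen store.

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Hypothetical helper: persist one partition's end offset to your store
def saveOffsets(topic: String, partition: Int, untilOffset: Long): Unit = ???

stream.foreachRDD { rdd =>
  // RDDs produced by the direct stream carry their Kafka offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Process the batch first...
  rdd.foreachPartition { records =>
    // ... business logic ...
  }

  // ...then record where this batch ended, one entry per partition
  offsetRanges.foreach { or =>
    saveOffsets(or.topic, or.partition, or.untilOffset)
  }
}
```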



Scenario 1:

When a new Spark Streaming + Kafka project starts for the first time, the external storage system has no offsets recorded for any partition of the topic, so the input stream is created directly with KafkaUtils.createDirectStream, which by default consumes from the latest offset. (If the topic is brand new, the latest and earliest offsets are both 0.) In every subsequent batch, the latest offsets are then written to the external storage system, which is continuously updated.
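A minimal sketch of this first-start path, using the spark-streaming-kafka-0-8 direct API (the connector that matches a 0.9 broker); the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("offset-demo")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092", // placeholder broker list
  "auto.offset.reset"    -> "largest"       // 0-8 naming: start from the latest offset
)

// No stored offsets yet, so pass only the topic set; consumption starts
// wherever auto.offset.reset points
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```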


Scenario 2:

When the streaming project is stopped and restarted, it first checks the external storage system for recorded offsets. If they exist, it reads them and passes the offset map to KafkaUtils.createDirectStream when building the InputDStream, so processing continues from where the last run stopped; each batch then keeps updating the offsets in the external storage system. This makes the handover seamless, whether the job stopped because of a failure or for an application upgrade, and the restart is handled transparently.
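A sketch of the restart path under the same assumptions: `loadOffsets` is a hypothetical helper that reads the stored offsets from your external store, while `ssc` and `kafkaParams` are built as in the previous sketch.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical helper: read the last committed offset per partition
// from your external store (ZK, HBase, a DB, ...)
def loadOffsets(topic: String): Map[TopicAndPartition, Long] = ???

val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets("my-topic")

// Turn each Kafka record into whatever the job needs downstream
val messageHandler =
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

// The fromOffsets variant resumes exactly where the last run stopped
val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)
```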



Scenario 3:

Suppose we increase the number of Kafka partitions while a Spark Streaming + Kafka job is running. Note that the running streaming job cannot sense the newly added partitions; to make the program recognize them, the Spark Streaming application must be restarted. At the same time, if you manage offsets in your own code, you must make sure entries for the new partitions are inserted alongside the offsets already stored for the existing partitions; otherwise the restarted program will still read only the original partitions' offsets and lose the data in the new ones.
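One way to sketch this reconciliation, assuming offsets are stored per TopicAndPartition and using ZkUtils from the Kafka 0.9 client to list the topic's current partitions (the ZooKeeper address is a placeholder):

```scala
import kafka.common.TopicAndPartition
import kafka.utils.ZkUtils

def reconcilePartitions(topic: String,
                        stored: Map[TopicAndPartition, Long]): Map[TopicAndPartition, Long] = {
  val zkUtils = ZkUtils("zk-host:2181", 10000, 10000, isZkSecurityEnabled = false)
  try {
    // Ask ZooKeeper for the partitions that exist right now
    val current = zkUtils.getPartitionsForTopics(Seq(topic))(topic)
    current.map { p =>
      val tp = TopicAndPartition(topic, p)
      // Known partitions keep their stored offset; new partitions start at 0
      // (or the earliest available offset, to be safe against log retention)
      tp -> stored.getOrElse(tp, 0L)
    }.toMap
  } finally {
    zkUtils.close()
  }
}
```

The result can be fed straight into the fromOffsets parameter from the Scenario 2 sketch before the job is restarted.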



Summary:

If you manage Kafka offsets yourself, you must pay attention to the three scenarios above. If any of them is not handled completely, you may run into strange problems.

If you have any questions, you can scan the QR code and follow the WeChat public account woshigcs, and leave a message in the background for consultation. Technical debt must not be owed, and neither must health debt. On the road of seeking the Way, we walk together.


