Checkpointing in Spark Structured Streaming and Flink: different implementations (Part 1)

Introduction

This article is divided into two parts: Part 1 covers theory and Part 2 covers source code. It explains how checkpointing is implemented inside Spark and Flink, and roughly why each is implemented that way. Only real-time (streaming) systems are discussed here; everything else is out of scope.

Why checkpoint

For a real-time processing system, checkpointing is essential: it is the basis the system recovers from, and therefore the basis of its fault tolerance.

Theory

Take a system of operators connected by edges as an example. How should the state of such a system be expressed for a checkpoint?

That is, for a given input data stream, the state of the entire system is the collection of the states of all edges together with the states of all operators.
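One way to write this down, as a sketch only: S and StateOfEdge are names used elsewhere in this article, while StateOfOperator and the index sets E (edges) and O (operators) are assumptions introduced here for illustration. The sums stand for "collect the state of every edge and every operator", not for arithmetic addition.

```latex
S \;=\; \sum_{e \in E} \mathrm{StateOfEdge}(e) \;+\; \sum_{o \in O} \mathrm{StateOfOperator}(o)
```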

For example, suppose the input data stream is 1,2,3,4,5. At some moment, elements 1,2,3 may still be in transit over the network while the operator already holds elements 4 and 5. The state S of the entire system then includes all of this data: the elements still in the network and the elements 4,5 that have already arrived at the operator.

Trade-off

Let me state the conclusion first:

The edge state must be empty! That is, edges must not hold any state; the state of the system can be expressed only through the state of the operators.

The reason:

When an error occurs, recovering from a checkpoint means replaying the whole process while keeping the system consistent. If the state of the entire system included StateOfEdge (the edge state), replay would be essentially impossible, because an edge is only an abstraction. Concretely, an in-flight element might be high and low voltage levels on a wire, an optical signal, data in a network card's buffer, or data in a kernel buffer. It is very hard to replay a state in which elements such as 2,3 were in transit on a cable while 4,5 were already part of an operator's state.

The goal:

No edge state; all state lives in the operators. Put plainly, no data is in transit and all user data is inside the operators.

How to do it

Put plainly again: how do we make sure that all data has been fully processed, so that no element is still in transit or left on an edge?

Systems with a determined input boundary

If the input is determined and the computation is determined, the output side can tell when all elements have been processed (that is, it can tell that no edge holds any state). For example, suppose the input is 1,2,3,4,5 and the computation is the average. We know exactly that the data set is 1,2,3,4,5, so once the output-side operator has received 1,2,3,4,5, we know for certain that no edge holds any state. At that point, the checkpoint records the operator's output, average = 3, to express the state for the input 1,2,3,4,5.
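The averaging example can be sketched in a few lines of Python. This is a toy model under stated assumptions, not code from Spark or Flink: `AverageOperator`, `expected_count`, and `run_bounded` are hypothetical names, and the "determined boundary" is modeled simply as knowing the number of elements in advance.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AverageOperator:
    """Hypothetical operator that averages an input whose size is known up front."""
    expected_count: int                      # the determined input boundary
    received: List[int] = field(default_factory=list)

    def on_element(self, value: int) -> Optional[dict]:
        """Consume one element; return a checkpoint once the whole input has arrived."""
        self.received.append(value)
        if len(self.received) < self.expected_count:
            return None                      # some elements may still be on an edge
        # Every expected element has arrived, so no edge can still hold state.
        return {
            "count": len(self.received),
            "average": sum(self.received) / len(self.received),
        }


def run_bounded(batch: List[int]) -> Optional[dict]:
    """Feed a bounded input to the operator and return the final checkpoint."""
    op = AverageOperator(expected_count=len(batch))
    checkpoint = None
    for value in batch:
        checkpoint = op.on_element(value)
    return checkpoint
```

Calling `run_bounded([1, 2, 3, 4, 5])` yields a checkpoint only after the fifth element, which is the point at which the operator state alone describes the whole system.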

This brings us back to the title. In Spark, the input of each micro-batch is determined and the computation is determined, so at the end of each batch the state is written to external storage as a checkpoint. That checkpoint can express the state of the entire streaming system.
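The idea of checkpointing at the end of each micro-batch can be simulated without Spark. The sketch below is an illustration only and does not reproduce Spark's internals: the JSON file stands in for external storage, and `load_checkpoint`, `save_checkpoint`, `run`, and `fail_at` are hypothetical names. (In a real Structured Streaming query, the checkpoint directory is configured through the `checkpointLocation` option.)

```python
import json
import os
import tempfile
from typing import Dict, List


def load_checkpoint(path: str) -> Dict[str, int]:
    """Return the last committed state, or an empty state if none exists."""
    if not os.path.exists(path):
        return {"next_batch": 0, "count": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path: str, state: Dict[str, int]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(batches: List[List[int]], path: str, fail_at: int = -1) -> Dict[str, int]:
    """Process micro-batches, checkpointing at the end of each one.

    fail_at simulates a crash at the second element of that batch index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_batch"], len(batches)):
        count, total = state["count"], state["total"]
        for n, value in enumerate(batches[i]):
            if i == fail_at and n == 1:
                raise RuntimeError("simulated crash mid-batch")
            count += 1
            total += value
        # The batch boundary is known, so nothing is in flight at this point.
        state = {"next_batch": i + 1, "count": count, "total": total}
        save_checkpoint(path, state)
    return state
```

A possible usage: run `[[1, 2], [3, 4], [5]]` with `fail_at=1`, catch the error, then call `run` again with the same path. The second call resumes from batch 1, and the partial work done before the crash is discarded because it was never checkpointed.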

Systems with an undetermined input boundary

The question then becomes: when we take a checkpoint, how do we know that no edge holds any state (any element)? We can inject a special marker into the input. When a downstream operator receives this marker, it knows that none of the elements sent before the marker are still on the edge. For example, suppose our input is x -> 1 -> 2, with 1 and 2 ahead of x. When the downstream operator receives x, we can be sure that 1 and 2 are no longer on the edge. This is how Flink's checkpoint works, and in Flink this special marker is called a Barrier. With a little more thought, it should also be clear why Flink waits until the Barriers from all partitions have arrived before taking a checkpoint: we have to wait until the state of every edge is empty (recall that the system state collects the state of all edges).
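The waiting behavior described above can be illustrated with a toy model. This is a simplified sketch of barrier alignment under stated assumptions, not Flink's implementation: `AligningOperator`, its fields, and the channel names are hypothetical, the operator state is just a running sum, and the marker is the literal `"x"` from the example.

```python
from typing import List, Set, Tuple, Union

BARRIER = "x"  # the special marker from the text
Event = Union[int, str]


class AligningOperator:
    """Hypothetical operator that sums its inputs and snapshots on aligned barriers."""

    def __init__(self, channels: List[str]) -> None:
        self.channels: Set[str] = set(channels)
        self.total = 0                            # operator state
        self.barrier_seen: Set[str] = set()       # channels whose barrier has arrived
        self.held: List[Tuple[str, Event]] = []   # events that arrived behind a barrier
        self.snapshots: List[int] = []            # one entry per completed checkpoint

    def on_event(self, channel: str, event: Event) -> None:
        if channel in self.barrier_seen:
            # This channel is already past its barrier: hold the event back
            # so it does not leak into the checkpoint being taken.
            self.held.append((channel, event))
            return
        if event != BARRIER:
            self.total += event
            return
        self.barrier_seen.add(channel)
        if self.barrier_seen == self.channels:
            # Barriers arrived on every input, so every pre-barrier element
            # has been consumed: the edges are empty with respect to this checkpoint.
            self.snapshots.append(self.total)
            self.barrier_seen.clear()
            held, self.held = self.held, []
            for ch, ev in held:                   # replay in arrival order
                self.on_event(ch, ev)
```

With two channels `a` and `b`, feeding `a:1, b:10, a:x, a:2, b:20, b:x, a:3` produces a single snapshot that contains 1, 10, and 20 but not 2, because 2 arrived on `a` after its barrier and was held until `b` delivered its own.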

Closing

My writing is fairly weak, and I cannot produce the storytelling kind of article that goes viral. I write mainly to clarify my own thinking, and also to practice the open-source spirit; after all, sharing knowledge is part of that spirit. If the article is badly written or could be improved somewhere, feel free to leave a comment. I don't expect many people to read it, though. If you have read this far, I hope it helped a little.

WeChat official account

WeChat official account: Big Data onslaught

Focused on big data technology. For questions or suggestions, please leave a message on the official account. You can also read the articles without following the account; I won't stop you, haha.


Origin juejin.im/post/5e5779c7518825494822c88d