Flink's checkpoint mechanism

The checkpoint mechanism is the cornerstone of Flink's reliability. It guarantees that if the Flink cluster fails for some reason (such as an operator exiting abnormally), the application can be restored to a state the flow graph held before the failure, keeping the state of the entire application flow graph consistent.

How the checkpoint mechanism works



Flink checkpoints are initiated by the JobMaster. When the program starts, the JobMaster creates a CheckpointCoordinator, which periodically sends barriers downstream so that each operator backs up its computed state. When the last operator successfully backs up its state data, the checkpoint is complete. When a failure occurs, the program simply reads the backup data from the most recent successful checkpoint to restore each operator's computed state.
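As a minimal sketch of what this looks like in application code (the interval, pause, and timeout values below are arbitrary examples, not values from the original setup):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointEnableExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Ask the CheckpointCoordinator to trigger a checkpoint every 10 seconds.
        env.enableCheckpointing(10_000L);

        // Exactly-once mode (the default): barriers are aligned across input channels.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Leave at least 5 seconds between the end of one checkpoint and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000L);

        // Consider a checkpoint failed if it does not complete within 1 minute.
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);
    }
}
```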

Related components and concepts



JobMaster: the JobMaster is Flink's master node. It is responsible for receiving, distributing, and coordinating the execution of tasks, and for recovering from checkpoint data when a job is restored.
barrier: a barrier is a lightweight marker inserted into the original data stream according to certain rules (on the coordinator's schedule). It neither affects the performance of data processing nor changes the order of the original data.
CheckpointCoordinator: the checkpoint coordinator, a thread started by the JobMaster when the program launches. For every application that requires checkpoints, Flink's JobManager creates a CheckpointCoordinator at startup, and that CheckpointCoordinator is solely responsible for taking this application's snapshots.
Source operator: the operator that loads data from the data source.
Intermediate operator: any operator that transforms or processes data in the middle of the pipeline.
Sink operator: the operator that finally writes the data out.
Snapshot: a backup of an operator's computed state data at one point in time.
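A snapshot needs somewhere durable to live. The hedged sketch below points checkpoint state at a file-system state backend (the HDFS path is a placeholder; newer Flink releases express the same idea through the checkpoint storage setting instead of FsStateBackend):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointStorageExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);

        // Keep operator snapshots in a durable file system (the HDFS path is a placeholder),
        // so that state survives TaskManager failures and can be read back during recovery.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
    }
}
```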

Detailed analysis of the checkpoint process
Checkpoint process with a single input source



1. The CheckpointCoordinator periodically sends barriers to all source operators of the stream application.
2. When a source operator receives a barrier, it pauses data processing, makes a snapshot of its current state, and saves it to the specified persistent storage. It then reports its snapshot status to the CheckpointCoordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
3. When a downstream operator receives the barrier, it likewise pauses its own data processing, makes a snapshot of its own state, saves it to the specified persistent storage, reports its snapshot status to the CheckpointCoordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
4. Each operator keeps taking snapshots and broadcasting downstream as in step 3, until the barrier finally reaches the sink operator and the snapshot is finished.
5. When the CheckpointCoordinator has received the reports from all operators, the snapshot for this cycle is considered successful; otherwise, if it has not received all reports within the prescribed time, the snapshot for this cycle is considered failed. The snapshot step is sketched in code right below.
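To make the snapshot step concrete, here is a sketch of a hypothetical CountingMapper that implements Flink's CheckpointedFunction interface: snapshotState is the hook Flink invokes when the barrier reaches the operator, and initializeState is where the state is read back during recovery.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

/**
 * A counting map operator that participates in checkpoints: when a barrier
 * reaches this operator, snapshotState() is called and the count is written
 * to the configured state backend; after a failure, initializeState() reads
 * the count back from the last successful checkpoint.
 */
public class CountingMapper implements MapFunction<String, String>, CheckpointedFunction {

    private transient ListState<Long> checkpointedCount; // state handle managed by Flink
    private long count = 0L;                             // in-memory working value

    @Override
    public String map(String value) {
        count++;
        return count + ": " + value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Called when a barrier arrives: persist the current count into the snapshot.
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("count", Long.class));
        // On recovery, restore the count from the most recent successful checkpoint.
        if (context.isRestored()) {
            for (Long c : checkpointedCount.get()) {
                count = c;
            }
        }
    }
}
```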

Checkpoint process with two input sources



If an operator has two input sources, it temporarily blocks the input source whose barrier arrives first, until the barrier with the same number arrives from the second input source; only then does it take its own snapshot and broadcast the barrier downstream. The specific steps are as follows.

1. Suppose operator C has two input sources, A and B. In the i-th snapshot cycle, the barrier from input source A arrives first for some reason (such as processing delay or network latency). Operator C then temporarily blocks the input channel of source A and only receives data from source B.
2. When the barrier emitted by input source B arrives, operator C makes its own snapshot and reports the snapshot status to the CheckpointCoordinator, then merges the two barriers into one and broadcasts it to all downstream operators.
3. When a failure occurs for some reason, the CheckpointCoordinator notifies all operators in the flow graph to uniformly roll back to the state of a certain checkpoint cycle, and then resumes data stream processing. This distributed checkpoint mechanism ensures that data is processed exactly once (Exactly Once); see the configuration sketch below.
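A hedged sketch of the configuration that ties this together: exactly-once checkpointing plus a restart strategy, so that on failure the job automatically rolls all operators back to the last completed checkpoint (the interval and retry values are illustrative):

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceRecoveryExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Aligned barriers + exactly-once mode: each record's effect on operator state
        // is reflected in exactly one checkpoint, even with multiple input channels.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // After a failure, restart the job up to 3 times, 10 seconds apart; each restart
        // restores operator state from the most recent completed checkpoint.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    }
}
```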

Summary



The checkpoint mechanism is an important feature of Flink: it is Flink's lightweight implementation of fault tolerance. When a Flink program fails during operation, it only needs to restore the state data saved by the checkpoint to the operators; it does not need to re-run the computation to recover the data. Mastering the checkpoint mechanism is an important part of learning Flink.
