[Big Data] Detailed Explanation of Flink (2): Core Part III


29. How does Flink implement a reliable fault-tolerance mechanism?

Flink achieves reliable fault tolerance through lightweight distributed snapshots, implemented as checkpoints (Checkpoint).


30. What is a Checkpoint?

Checkpoint is the core of Flink's fault-tolerance mechanism and the cornerstone of Flink's reliability. According to the configured interval, it periodically generates snapshots of the State of every Operator in the stream and persists that state data. If the Flink program later crashes unexpectedly, one of these snapshots can be selected when the job is rerun, restoring the program state that the failure interrupted.

Flink's Checkpoint mechanism is based on the Chandy-Lamport algorithm, a distributed snapshot algorithm.
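A minimal configuration sketch, assuming the DataStream API, showing how checkpointing is typically enabled; the interval, the timeout values and the HDFS path are illustrative assumptions, not values from the original article:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // generate a snapshot of every operator's State every 60 seconds
        env.enableCheckpointing(60_000);

        // leave at least 30 s between two checkpoints and give up on one that takes longer than 2 minutes
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        env.getCheckpointConfig().setCheckpointTimeout(120_000);

        // persist the snapshots to a durable state backend (the HDFS path is an assumption;
        // newer Flink versions use HashMapStateBackend plus setCheckpointStorage instead)
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
    }
}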

Note the distinction between State and Checkpoint:

1. State

  • It generally refers to the state of a specific Task/Operator (the state of an Operator is the intermediate results that the operator produces while running).
  • By default, State data is stored in Java heap memory on the TaskManager node.
  • State can be recorded so that the data can be recovered after a failure.

2. Checkpoint

  • It represents a global state snapshot of a Flink Job at a specific moment, including the state of all Tasks/Operators.

  • A Checkpoint can be understood as a periodic, persistent store of State data.

  • For example, the Offset state maintained in the KafkaConsumer operator can be obtained from the Checkpoint when the task resumes.

31. What is a Savepoint?

Savepoint is Flink's savepoint mechanism: a complete snapshot/backup of an application built on top of Flink's Checkpoint mechanism. It saves the job's state so that the job can be restored from that state in another cluster or at another point in time. It is suitable for application upgrades, cluster migration, Flink version upgrades, A/B testing, what-if scenarios, pause-and-resume, archiving, and similar use cases. A Savepoint can be viewed as a Map of (operator ID → State): for every stateful operator, the key is the operator ID and the value is that operator's State.

32. What is the CheckpointCoordinator?

Flink's checkpoint coordinator is called the CheckpointCoordinator. It is responsible for coordinating the distributed snapshot of the operators' State. When a snapshot is triggered, the CheckpointCoordinator injects a Barrier message into the Source operators and then waits for all Tasks to acknowledge that the checkpoint is complete; the acknowledgement messages carry the State handles reported by the Tasks, which the coordinator holds on to.

33. What information is saved in a Checkpoint?

Let's take a WordCount-style job in which Flink consumes data from Kafka as an example of what a Checkpoint stores:

1. We read logs from Kafka one by one, parse the app_id out of each log, and keep the running statistics in an in-memory Map, with app_id as the key and the corresponding pv value as the value. For every record we only need to add 1 to the pv value of its app_id and put the result back into the Map;

2. The Kafka topic is test;

3. Flink's execution flow is as follows:

The Kafka topic has one and only one partition.

Assuming Kafka's topic test has only one partition, Flink's Source task records, for every partition of the test topic it consumes, the offset it has currently reached.

Example: (0, 1000) means that partition 0 has currently been consumed up to offset 1000.

Flink's pv task records the pv value currently computed for each app. For ease of explanation, assume there are two apps here: app1 and app2.

Example: (app1, 50000) (app2, 10000)
means that the current pv value of app1 is 50000
and the current pv value of app2 is 10000.
For every incoming record, we only need to determine its app_id, add 1 to the corresponding value, and put it back into the Map.

In this case, what the nth Checkpoint actually saves is the consumed offset information and the pv value of each app: it records the current state and persists that state to the corresponding state backend, as shown below. (Note: the state backend is where state is saved; how it stores state and how it guarantees high availability are separate topics. Here we only need to know that the offset and pv information can be read back from the state backend. The state backend must be highly available, otherwise a state backend that fails frequently would make it impossible to restore the application from a Checkpoint.)

chk-100
offset: (0, 1000)
pv: (app1, 50000) (app2, 10000)
This state information means that at the 100th Checkpoint, partition 0 had been consumed up to offset 1000, together with the pv statistics at that moment.
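A minimal sketch of this pv job, assuming the legacy FlinkKafkaConsumer connector from flink-connector-kafka; the broker address, the checkpoint interval and the parseAppId() helper are illustrative assumptions:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AppPvCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // every checkpoint persists the Kafka offsets and the per-app pv values together
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "pv-counter");

        // Source task: its State remembers the consumed offset per partition, e.g. (0, 1000)
        DataStream<String> logs = env.addSource(
                new FlinkKafkaConsumer<>("test", new SimpleStringSchema(), props));

        // pv task: keyed state holds (app_id -> pv), e.g. (app1, 50000), (app2, 10000)
        logs.map(line -> Tuple2.of(parseAppId(line), 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .sum(1)
            .print();

        env.execute("pv per app_id");
    }

    // hypothetical parser: assume the log line is CSV and app_id is the first field
    private static String parseAppId(String logLine) {
        return logLine.split(",")[0];
    }
}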

34. When a job fails, how is it restored from a Checkpoint?

Flink provides both an automatic application recovery mechanism and a manual job recovery mechanism.

1. Automatic application recovery

Flink lets you configure a restart strategy for failed jobs. There are three types (a configuration sketch follows the list):

  • Fixed-delay strategy (fixed-delay): restarts the Job a given number of times; if the maximum number of restarts is exceeded, the Job ultimately fails. Between two consecutive restart attempts the strategy waits for a fixed delay. The default number of attempts is Integer.MAX_VALUE.

  • Failure-rate strategy (failure-rate): restarts the Job after a failure, but once the failure rate is exceeded the Job is ultimately considered failed. Between two consecutive restart attempts the strategy waits for a fixed delay.

  • No-restart strategy (none): the Job is not restarted after a failure.
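A configuration sketch of the three strategies, assuming the DataStream API; the attempt counts and delays are illustrative values only:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExamples {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fixed-delay: at most 3 restart attempts, waiting 10 s between attempts
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // failure-rate: at most 3 failures within a 5-minute window, 10 s delay between attempts
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));

        // none: never restart, the Job fails on the first failure
        env.setRestartStrategy(RestartStrategies.noRestart());
        // (only the last setRestartStrategy call takes effect; all three are shown for comparison)
    }
}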

2. Manual job recovery

Because Flink's checkpoint directory is tied to the JobId, a new JobId is generated every time the job is resubmitted via flink run or the web UI. Flink therefore allows a specific checkpoint directory to be passed with the -s parameter at startup, so that the new job can read that Checkpoint's metadata file and state information and start from the chosen point in time.

The way to start it is as follows:

bin/flink run -s /flink/checkpoints/03112312a12398740a87393/chk-50/_metadata

35. When a job fails, how is it restored from a Savepoint?

Restoring a job from a Savepoint is not always straightforward, especially when the job has changed (for example, the logic was modified or a bug was fixed). The following cases need to be considered (see the sketch after this list):

  • The order of operators changes. If the corresponding UIDs are unchanged, the job can be restored; if the UIDs have changed, the restore fails.
  • A new operator has been added to the job. If it is a stateless operator, there is no impact and the job restores normally; if it is a stateful operator, it is treated like a newly added stateless operator, i.e. it starts without any state.
  • A stateful operator was removed from the job. By default, all operator state recorded in the Savepoint must be restored; if a stateful operator has been deleted, its OperatorID cannot be found when restoring from the Savepoint and an error is reported. You can add --allowNonRestoredState (short form: -n) to the command to skip the state of operators that can no longer be restored.
  • Stateless operators were added or removed. If the UIDs were set manually, the job can be restored, since stateless operators are not recorded in the Savepoint. If the UIDs were assigned automatically, the UIDs of the stateful operators may change (Flink generates UIDs with a monotonically increasing counter, and when the DAG changes the counter is likely to change), so the restore is likely to fail.
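A minimal sketch showing why explicit UIDs matter for Savepoint compatibility; the operators and UID strings are illustrative assumptions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("app1", "app2", "app1")
           .map(app -> Tuple2.of(app, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .uid("parse")          // stateless operator, explicit UID still recommended
           .keyBy(t -> t.f0)
           .sum(1)
           .uid("pv-counter")     // stateful operator: the savepoint stores its State under this ID
           .print()
           .uid("print-sink");

        env.execute("explicit operator UIDs");
    }
}

With fixed UIDs, the Savepoint's (operator ID → State) entries can still be matched even if operators are reordered or new ones are added later.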

36. How does Flink implement lightweight asynchronous distributed snapshots?

The key to taking a distributed snapshot is being able to split the data stream. Flink splits the stream with barriers (Barrier). Barriers are injected into the data flow periodically and, as part of the flow, travel through the operators from upstream to downstream. A barrier strictly keeps its position in the stream and never overtakes the records in front of it. Barriers divide the records into sets: all records in the stream between two barriers belong to the same checkpoint. Each barrier carries the ID of the snapshot it belongs to. Barriers flow along with the data without interrupting it, so they are very lightweight. A data flow may contain several barriers belonging to different snapshots at the same time, so distributed snapshots can be executed concurrently and asynchronously, as shown in the figure below:
Barriers are injected into the parallel data flow at the data source. The position of barrier n is the starting position from which data is reprocessed during recovery. In Kafka, for example, this position is the offset of the last record in the partition; when the job resumes, it requests data from Kafka starting from this offset. This offset is one of the things saved in State.

The barrier is then passed downstream. When a non-source operator has received the barrier of snapshot n from all of its input streams, it saves a snapshot of its own State and broadcasts the barrier of snapshot n downstream. Once the Sink operator receives the barrier, there are two situations:

  • If the guarantee is exactly-once inside the engine: when the Sink operator has received barrier n from all upstream inputs, it takes a snapshot of its own State and then notifies the checkpoint coordinator (CheckpointCoordinator). After all operators have reported success to the checkpoint coordinator, the coordinator confirms to all operators that the snapshot is complete.
  • If the guarantee is end-to-end exactly-once: when the Sink operator has received barrier n from all upstream inputs, it takes a snapshot of its own State and pre-commits the transaction (the first phase of the two-phase commit), then notifies the checkpoint coordinator (CheckpointCoordinator). Once the coordinator has confirmed to all operators that the snapshot is complete, the Sink operator commits the transaction (the second phase of the two-phase commit), and the transaction is finished.

Let's continue with the case from question 33 to explain concretely how a distributed snapshot is taken:

Continuing the pv example: after the Source task receives the chk-100 Checkpoint trigger request from the JobManager, it finds that it has just emitted the record at offset (0, 1000) read from Kafka, so it inserts a barrier after the record at offset (0, 1000) and before the record at offset (0, 1001), and then takes its own snapshot, i.e. it saves offset (0, 1000) to the state backend as part of chk-100. The barrier is then sent downstream. When the pv-counting task receives the barrier, it likewise pauses processing and saves the pv information held in its memory, (app1, 50000) (app2, 10000), to the state backend as part of chk-100. This is, roughly, the principle by which Flink saves snapshots.

When the pv-counting task receives the barrier, all data before the barrier has already been processed by it, so no data is lost.

37. What is Barrier alignment?

The figure shows, from left to right: beginning alignment, alignment, checkpoint execution, and continuing to process data.

Once an operator has received checkpoint barrier n from one input stream, it cannot process any further records from that stream until it has also received barrier n from its other input streams. Otherwise it would mix records belonging to snapshot n with records belonging to snapshot n+1;

As shown in the figure above:

  • Figure 1: the operator has received the barrier of the number stream, but the barrier of the letter stream has not arrived yet.
  • Figure 2: after receiving the number stream's barrier, the operator keeps receiving data from the number stream, but those records cannot be processed; they are put into a buffer while the operator waits for the letter stream's barrier to arrive. The records 1, 2, 3 that arrive before the letter stream's barrier are cached.
  • Figure 3: when the letter stream's barrier arrives, the operator aligns its state, starts an asynchronous snapshot, and broadcasts the barrier downstream without waiting for the snapshot to finish.
  • Figure 4: while the asynchronous snapshot is being taken, the operator first processes the backlog of records in the cache and only then resumes fetching data from the input channels.

38. What is Barrier misalignment?

A checkpoint is not complete until all barriers have arrived.

In Figure 2 above, while the barriers of the other input streams have not yet arrived, the records 1, 2, 3 that arrive after the already-received barrier are kept in the buffer and can only be processed after the barriers of the other streams arrive.

Barrier misalignment means that when the barriers of some streams have not yet arrived, the operator does not bother waiting: in order not to hurt performance it simply keeps processing the records that follow the barrier. Only once the barriers of all streams have arrived can the operator take its checkpoint.
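Related but distinct: since Flink 1.11 there is also an "unaligned checkpoints" feature that likewise avoids blocking on barrier alignment, but keeps exactly-once by storing the in-flight buffered records as part of the checkpoint. A configuration sketch, assuming Flink 1.11 or later:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);
        // let barriers overtake buffered records; the in-flight data is snapshotted too,
        // so exactly-once is preserved even without alignment
        env.getCheckpointConfig().enableUnalignedCheckpoints();
    }
}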

39. Why do we need to perform Barrier alignment? Is it okay to not align?

For Exactly Once, barriers must be aligned; if barriers are not aligned, the guarantee degrades to At Least Once.

The purpose of a Checkpoint is to save a snapshot. Without alignment, by the time the chk-100 snapshot is taken, some data after the offset recorded in chk-100 has already been processed. When the program later resumes the task from chk-100, the data after that offset is processed again, so there is duplicate consumption.
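The choice between the two guarantees is made when checkpointing is enabled. A configuration sketch, assuming the DataStream API; the interval is an illustrative value:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // aligned barriers, no duplicates on recovery (this is the default mode)
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // alternative: skip alignment for lower latency, but records between the first and
        // last barrier may be replayed after a failure
        // env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
    }
}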

41. What conditions are needed to realize Exactly-Once?

To achieve Exactly-Once in a streaming system, the upstream Source layer, the intermediate computing layer and the downstream Sink layer must all satisfy end-to-end exactly-once processing at the same time, as shown in the figure below:

Source: when data enters Flink from upstream, every message must be consumed strictly once, and the Source must support replay. Otherwise, if the Flink computation layer receives a message but fails and restarts before processing it, the message is lost.

Flink computing layer: the Checkpoint mechanism persists state data periodically. If the Flink program fails, it can be restored from a chosen state point, avoiding both data loss and data duplication.

Sink side: when Flink sends the processed data to the Sink, it uses the two-phase commit protocol, i.e. the TwoPhaseCommitSinkFunction, which extracts and encapsulates the common logic of the two-phase commit protocol and guarantees exactly-once semantics when Flink writes to the Sink. In addition, the Sink must support a transaction mechanism, i.e. it must be able to roll back data or be idempotent (a sketch of such a sink follows the list below).

  • Rollback mechanism: when a job fails, the partially written results can be rolled back to the previously written state.
  • Idempotency: the same operation, no matter how many times it is repeated, yields the same result as performing it once. When a job fails, some results may already have been written; rewriting all the results must not produce incorrect data, so repeated writes cause no harm.
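A minimal sketch of a custom transactional sink built on TwoPhaseCommitSinkFunction, writing to local files; the temp directory, the file naming and the rename-based commit are illustrative assumptions rather than a production design:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// the "transaction" object (TXN) is simply the path of a per-checkpoint temp file
public class TransactionalFileSink extends TwoPhaseCommitSinkFunction<String, String, Void> {

    public TransactionalFileSink() {
        super(StringSerializer.INSTANCE, VoidSerializer.INSTANCE);
    }

    @Override
    protected String beginTransaction() throws Exception {
        // one temp file per transaction, i.e. per checkpoint interval (directory is an assumption)
        return "/tmp/flink-sink/" + UUID.randomUUID() + ".tmp";
    }

    @Override
    protected void invoke(String txnFile, String value, Context context) throws Exception {
        Path tmp = Paths.get(txnFile);
        Files.createDirectories(tmp.getParent());
        Files.write(tmp, (value + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    @Override
    protected void preCommit(String txnFile) throws Exception {
        // phase 1: all records of this checkpoint are already on disk in the temp file,
        // so there is nothing more to flush here
    }

    @Override
    protected void commit(String txnFile) {
        // phase 2: atomically rename the temp file so that the data becomes visible
        try {
            Path tmp = Paths.get(txnFile);
            if (Files.exists(tmp)) {
                Files.move(tmp, Paths.get(txnFile.replace(".tmp", ".out")),
                        StandardCopyOption.ATOMIC_MOVE);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void abort(String txnFile) {
        // failure: discard the pre-committed data
        try {
            Files.deleteIfExists(Paths.get(txnFile));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}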

42. What is a two-phase commit protocol?

The two-phase commit protocol (Two-Phase Commit, 2PC) is the most common way to solve distributed transaction problems. It guarantees that in a distributed transaction either all participants commit or all of them abort, which realizes the A (atomicity) in ACID.

There are two roles in the two-phase commit protocol: the coordinator (Coordinator) and the participants (Participant). There is exactly one coordinator, which coordinates and manages the distributed transaction, and there are multiple participants.

The protocol is divided into two phases: the voting phase (Voting) and the commit phase (Commit); a compact sketch of the coordinator logic follows the description below.

(1) Voting phase

  • The coordinator sends a prepare request together with the transaction content to all participants, asks whether the transaction can be prepared for commit, and waits for the participants' responses.
  • The participants execute the operations contained in the transaction and write an undo log (for rollback) and a redo log (for replay), but do not actually commit.
  • Each participant returns the result of the transaction operation to the coordinator: yes if the execution succeeded, no if it failed.

(2) Commit phase

  • There are two cases: success and failure.
  • If all participants return yes, the transaction can be committed:
    • The coordinator sends a commit request to all participants.
    • After receiving the commit request, a participant actually commits the transaction, releases the transaction resources it holds, and returns ack to the coordinator.
    • Once the coordinator has received the ack messages from all participants, the transaction has completed successfully, as shown in the figure below:


  • If any participant returns no or fails to respond before a timeout, the transaction is interrupted and must be rolled back:
    • The coordinator sends a rollback request to all participants.
    • After receiving the rollback request, a participant rolls back to the state before the transaction using the undo log, releases the transaction resources it holds, and returns ack to the coordinator.
    • Once the coordinator has received the ack messages from all participants, the transaction rollback is complete.
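A compact, single-process sketch of the coordinator logic described above; the Participant interface and its methods are illustrative assumptions, not a real library API:

import java.util.List;

interface Participant {
    boolean prepare(String txId);   // voting phase: write undo/redo logs, vote yes/no
    void commit(String txId);       // commit phase: actually commit, release resources
    void rollback(String txId);     // commit phase (failure case): undo using the log
}

class Coordinator {
    private final List<Participant> participants;

    Coordinator(List<Participant> participants) {
        this.participants = participants;
    }

    boolean runTransaction(String txId) {
        // phase 1: voting - ask every participant to prepare
        for (Participant p : participants) {
            boolean voteYes;
            try {
                voteYes = p.prepare(txId);
            } catch (Exception timeoutOrFailure) {
                voteYes = false;                    // a timeout or failure counts as a "no" vote
            }
            if (!voteYes) {
                // phase 2 (failure): a single "no" vote aborts the whole transaction
                participants.forEach(q -> q.rollback(txId));
                return false;
            }
        }
        // phase 2 (success): everyone voted yes, tell all participants to commit
        participants.forEach(p -> p.commit(txId));
        return true;
    }
}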


43. How does Flink guarantee Exactly-Once semantics?

Flink guarantees Exactly-Once semantics through the two-phase commit protocol.

For the Source side: exactly-once processing on the Source side is relatively simple, because the data has to enter Flink anyway, so Flink only needs to save the offset of the data it has consumed. If the Source is Kafka, Flink uses the Kafka Consumer as the Source and saves the offset; if a subsequent task fails, the connector can reset the offset and consume the data again, which guarantees consistency.

For the Sink side: the Sink side is the most complicated, because the data lands in external systems. Once the data leaves Flink, Flink can no longer track it, so the exactly-once semantics must also cover the writes Flink performs to those external systems. The external systems must therefore provide a way to commit or roll back these write operations, and this mechanism must work in coordination with Flink's Checkpoint (Kafka 0.11 and later implement exactly-once processing semantics).

Let's take Kafka → Flink → Kafka as an example of how Exactly-Once semantics is guaranteed.

As shown in the figure, the Flink job contains the following operators:

  • A Source operator that reads data from Kafka (i.e. the KafkaConsumer)
  • A window operator that performs an aggregation over time windows (i.e. window + a window function)
  • A Sink operator that writes the result back to Kafka (i.e. the KafkaProducer)

Flink uses the two-phase commit protocol, consisting of a pre-commit (Pre-commit) phase and a commit (Commit) phase, to guarantee end-to-end exactly-once processing.

1. Pre-commit phase

(1) When a Checkpoint starts, the job enters the pre-commit phase. The JobManager injects a checkpoint barrier (CheckpointBarrier) into the Source Task; the Source Task inserts the barrier into the data flow and broadcasts it downstream, as shown in the following figure:

(2) Source side: the Flink Data Source is responsible for saving the offsets of the Kafka topic. When the Checkpoint succeeds, Flink commits these offsets; otherwise it aborts and discards them. After saving the offsets, the Source passes the checkpoint barrier (checkpoint dividing line) on to the next operator, and every operator takes a snapshot of its current state and saves it to the State Backend.

For the Source task, the current offset is saved as state. The next time the job recovers from a Checkpoint, the Source task can reset the offset and start consuming again from the last saved position, as shown in the following figure:

(3) Sink side: starting from the Source, every internal Transformation task saves its state into the Checkpoint when it encounters the checkpoint barrier (checkpoint dividing line). When the processed data reaches the Sink, the Sink task first writes the data to the external Kafka; these records belong to a pre-committed transaction and cannot be consumed yet. At the same time the Sink must also pre-commit its external transaction, as shown in the following figure:

2. Commit phase

(4) When the snapshots of all operator tasks are complete (all the created snapshots are considered part of the Checkpoint), i.e. when the Checkpoint finishes, the JobManager sends a notification to all tasks confirming that the Checkpoint is complete. This ends the pre-commit phase and starts the second phase of the two-phase commit protocol: the Commit phase. In this phase, the JobManager invokes the checkpoint-completed callback of every Operator in the application.

The Data Source and the window operator in this example have no external state, so they do not need to do anything in this phase. The Data Sink, however, does have external state, so it must now commit its external transaction. When the Sink task receives the confirmation notification, it formally commits the earlier transaction; the previously unconfirmed data in Kafka is marked as "confirmed" and can actually be consumed, as shown in the following figure:

Note: in Flink, the JobManager coordinates the TaskManagers when storing Checkpoints. Checkpoints are stored in the StateBackend (state backend). The default StateBackend is memory-based, and it can be changed to a file-based backend for persistent storage.
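A configuration sketch of an exactly-once Kafka sink, assuming Flink 1.14+ and the newer KafkaSink API from flink-connector-kafka; the broker address, topic name and transactional-id prefix are illustrative assumptions:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceKafkaSink {
    public static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker1:9092")              // assumed broker address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                 // assumed topic name
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // EXACTLY_ONCE makes the sink write through Kafka transactions:
                // pre-commit when the barrier reaches the sink, commit when the checkpoint completes
                .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("pv-job")
                .build();
    }
}

Downstream Kafka consumers must read with isolation.level=read_committed, otherwise they would also see the pre-committed, not-yet-confirmed records.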

44. That was very clear. Finally, can you summarize Flink's end-to-end Exactly-Once semantics?

In short: the Source (e.g. Kafka) must be replayable, and Flink saves the consumed offsets in the Checkpoint; inside the engine, barrier-based, aligned Checkpoints guarantee that each record affects the state exactly once; on the Sink side, the two-phase commit protocol (TwoPhaseCommitSinkFunction) ties the external transaction to the Checkpoint, pre-committing data when the barrier reaches the Sink and committing it only after the JobManager confirms that the Checkpoint is complete. Together, these three parts provide end-to-end Exactly-Once.


Source: blog.csdn.net/be_racle/article/details/132198120