Apache Ignite Transaction Architecture: Failure and Recovery

The previous article in this series explored the concurrency model and isolation levels. The remaining articles will cover the following topics:

  • Failure and recovery
  • Transaction processing for the Ignite persistence layer (WAL, checkpointing, and others)
  • Transaction processing for third-party persistence layers

This article focuses on failure and recovery during transaction execution.
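For context, here is a minimal sketch of the kind of explicit Ignite transaction whose failure scenarios are discussed below. The cache name, key and value types, and the chosen concurrency and isolation modes are illustrative assumptions, not something prescribed by this series.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class BasicTxSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // A transactional cache with one backup copy per partition (illustrative settings).
        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("accounts");
        cacheCfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        cacheCfg.setBackups(1);

        IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cacheCfg);

        // The node that starts the transaction acts as its coordinator;
        // the keys map to primary and backup nodes elsewhere in the cluster.
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
            cache.put(1, "A");
            cache.put(2, "B");
            tx.commit(); // Drives the two-phase commit: prepare, then commit.
        }
    }
}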

A distributed Ignite cluster consists of transaction coordinators, primary nodes, and backup nodes. Any of these nodes can fail, and the failure scenarios, in order of increasing severity, are:

  • Backup node failure
  • Primary node failure
  • Transaction coordinator failure

The following sections analyze these scenarios one by one and explain how Ignite handles each of them, starting with backup node failure.

Backup node failure

Recall from the first article in this series that the two-phase commit protocol has a prepare phase and a commit phase. If a backup node fails during either phase, the transaction is not affected: it simply continues to execute on the remaining primary and backup nodes in the cluster, as shown in Figure 1.

Figure 1: Backup node failure

After all active transactions (including this one) complete, Ignite updates the network topology version to reflect the node failure and selects one or more nodes to host the data previously held by the failed node. Ignite then starts a rebalancing process in the background to restore the required level of data replication.
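If the application needs to observe these topology changes and the background rebalancing, Ignite exposes them as cluster events. The sketch below is a minimal, assumed example: events are disabled by default and must be enabled explicitly, and the particular event types used here are our choice for illustration.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class RebalanceMonitorSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Events are disabled by default and must be enabled explicitly.
        cfg.setIncludeEventTypes(
            EventType.EVT_NODE_FAILED,
            EventType.EVT_CACHE_REBALANCE_STARTED,
            EventType.EVT_CACHE_REBALANCE_STOPPED);

        Ignite ignite = Ignition.start(cfg);

        // Log node failures and rebalancing activity observed on the local node.
        IgnitePredicate<Event> listener = evt -> {
            System.out.println("Cluster event: " + evt.name());
            return true; // Keep the listener registered.
        };

        ignite.events().localListen(listener,
            EventType.EVT_NODE_FAILED,
            EventType.EVT_CACHE_REBALANCE_STARTED,
            EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}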

Next, let's look at how Ignite handles primary node failures.

Primary node failure

Primary node failures require different handling depending on whether the failure occurs during the prepare phase or the commit phase.

If the failure occurs during the prepare phase, the transaction coordinator throws an exception to the application, as shown in Figure 2 (step 3, exception). It is then up to the application to decide how to handle the exception and what to do next, for example, restarting the transaction or applying some other error-handling strategy.

Figure 2: Primary node failure during the prepare phase
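From the application's point of view, a common reaction to such an exception is to retry the transaction a bounded number of times. The sketch below is only one possible pattern, assuming a hypothetical retry budget; the exception types caught (IgniteException and the JCache CacheException) are deliberately broad rather than a statement of exactly what Ignite throws in this case.

import javax.cache.CacheException;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteException;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class RetryOnPrepareFailureSketch {
    private static final int MAX_RETRIES = 3; // Illustrative retry budget.

    public static void deposit(Ignite ignite, IgniteCache<Integer, Long> cache, int key, long amount) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
                Long balance = cache.get(key);
                cache.put(key, (balance == null ? 0L : balance) + amount);

                tx.commit();
                return; // Committed successfully.
            }
            catch (IgniteException | CacheException e) {
                // The transaction was rolled back (for example, because a primary node
                // failed during the prepare phase); give up after the last attempt.
                if (attempt == MAX_RETRIES)
                    throw e;
            }
        }
    }
}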

If the failure occurs during the commit phase, as shown in Figure 3, the transaction coordinator waits for a specific message (step 4, ACK) from one of the backup nodes.

Figure 3: Primary node failure during the commit phase

When a backup node detects the primary node failure, it notifies the coordinator that the transaction was committed successfully. Because backup copies of the data exist, no data is lost and the application's access to the data is not affected.

After the transaction coordinator completes the transaction, Ignite rebalances the cluster in response to the primary node failure and assigns a new primary node for the partitions held by the failed node. Next, let's look at how Ignite handles transaction coordinator failures.

Transaction coordinator failure

The worst case is a failure of the transaction coordinator, because the primary and backup nodes only know their local transaction state and cannot see the global transaction state: some nodes receive the commit message while others do not, as shown in Figure 4.

Figure 4: Transaction coordinator failure

The solution in this failure scenario is for the nodes to exchange their local transaction states with each other, as shown in Figure 4, so that each of them learns the global transaction state.

At this point, Ignite initiates a recovery protocol. Every node participating in the transaction sends a message to every other participating node, asking whether it received the prepare message. If any node replies that it did not, the transaction is rolled back; otherwise it is committed. However, some nodes may have already committed the transaction before receiving the recovery message. To handle this case, every node keeps the IDs of completed transactions for a period of time. If no in-progress transaction is found for a given ID, the log of completed transactions is checked; if that log does not contain the transaction either, the transaction was never started. Therefore, if the failure occurs before the prepare phase completes, the transaction is rolled back. The recovery protocol also works if a primary or backup node fails together with the transaction coordinator.
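How quickly this recovery machinery kicks in depends, among other things, on how fast the cluster declares an unresponsive node failed and on the transaction timeout in effect. The sketch below shows two configuration knobs that are relevant here; the concrete values are placeholders, not recommendations.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class RecoveryTuningSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long the cluster waits before declaring an unresponsive node failed,
        // which is what ultimately triggers topology changes and transaction recovery.
        cfg.setFailureDetectionTimeout(10_000); // milliseconds, placeholder value

        // Default timeout applied to transactions that do not specify one explicitly.
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(30_000); // milliseconds, placeholder value
        cfg.setTransactionConfiguration(txCfg);

        Ignition.start(cfg);
    }
}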

Summary

Nodes of various types may fail at different stages of transaction execution. The scenarios above show how Ignite handles these failures gracefully and provides recovery mechanisms.
