Distributed OLTP database system

OLTP

OLTP VS OLAP

On-line Transaction Processing (OLTP):

  • Short-lived read/write txns.
  • Small footprint: each txn reads/updates only a small amount of data.
  • Repetitive operations (the same queries with different inputs).

On-line Analytical Processing (OLAP):

  • Long-running, read-only queries.
  • Complex joins.
  • Exploratory (ad hoc) queries.

Three questions:

  1. What happens if a node fails?
  2. What happens if messages arrive late?
  3. What happens if we don't wait for every node to agree?

Important assumption:
All nodes in a distributed DBMS are assumed to be well-behaved and under the same administrative domain: if we tell a node to commit a txn, it will commit the txn (unless it has failed).

If you do not trust the other nodes in a distributed DBMS, you need a Byzantine Fault Tolerant (BFT) protocol for txns (e.g., blockchain systems).

Atomic Commit Protocol

When a multi-node txn finishes, the DBMS needs to ask all of the involved nodes whether it is safe to commit.

Examples:

  • Two-Phase Commit
  • Three-Phase Commit
  • Paxos
  • Raft
  • Apache Zookeeper
  • Viewstamped Replication

Two-Phase Commit/Abort

Two-Phase Commit

Each node records the inbound/outbound messages and the outcome of each phase in a log on non-volatile storage.

When recovering, examine the log for 2PC messages:

  • If the local txn is in the prepared state, contact the Coordinator to learn the outcome.
  • If the local txn is not prepared, abort it.
  • If the local txn was committing and this node is the Coordinator, resend the commit message to the Participants.
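
The recovery rules above can be sketched as a small decision function; the state names and action strings are illustrative, not from any particular DBMS:

```python
from enum import Enum

class TxnState(Enum):
    NOT_PREPARED = 0   # node never logged a prepare vote
    PREPARED = 1       # node voted to commit, outcome unknown
    COMMITTING = 2     # commit was decided but not yet fully acknowledged

def recover(state: TxnState, is_coordinator: bool) -> str:
    """Decide the recovery action for one in-flight txn found in the 2PC log."""
    if state is TxnState.NOT_PREPARED:
        return "abort"                # safe: no commit vote was ever logged
    if state is TxnState.PREPARED:
        return "ask-coordinator"      # only the coordinator knows the outcome
    # state is COMMITTING
    if is_coordinator:
        return "resend-commit"        # re-deliver COMMIT to the participants
    return "apply-commit"             # participant: finish applying locally
```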

Two-Phase Commit Failures

What happens if the Coordinator crashes?

  • Participants must decide what to do after a timeout.
  • The system is unavailable while they wait.

What happens if a Participant crashes?

  • The Coordinator assumes that any node that has not sent an acknowledgment responded with an abort.
  • Likewise, nodes use timeouts to determine that a Participant has died.

2PC Optimizations

Early Prepare Voting (Rare)
If the DBMS sends a query to a remote node and knows it is the last query that will execute there, the node can return its vote for the prepare phase along with the query result.

Early Ack After Prepare (Common)
If all nodes voted to commit the txn, the Coordinator can send the client an acknowledgment that the txn was successful before the commit phase finishes.

Early acknowledgment after prepare is a 2PC optimization that improves transaction response time and throughput. The basic idea: once every participant has voted to commit, the outcome is decided, so the coordinator can send an acknowledgment message to the client before the commit phase, indicating that the transaction completed successfully. The client can then start a new transaction immediately instead of waiting for the commit phase to end, and the coordinator's reply no longer depends on collecting completion messages from all participants.

The advantage of early acknowledgment after prepare is lower average transaction latency and higher concurrency. The disadvantage is added complexity and risk: if a failure or network partition occurs during the commit phase, the client has already received an acknowledgment for a txn whose effects are not yet fully applied, so the system must guarantee that a decision reached in the prepare phase is durable and will eventually be carried out on every participant.
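
A toy simulation (hypothetical event names, no real networking) shows where the client acknowledgment moves under this optimization:

```python
def run_2pc(votes, early_ack=False):
    """Simulate one 2PC round over pre-collected prepare-phase votes.

    Returns the ordered list of protocol events; `votes` holds one boolean
    vote per participant. All message names are illustrative.
    """
    events = ["prepare-sent"]
    if not all(votes):                       # any NO vote aborts the txn
        events += ["abort-sent", "client-ack:abort"]
        return events
    if early_ack:
        events.append("client-ack:commit")   # reply before the commit phase
    events += ["commit-sent", "commit-acked"]
    if not early_ack:
        events.append("client-ack:commit")   # reply after the commit phase
    return events
```

With `early_ack=True` the client sees the commit acknowledgment before the commit phase even starts; with `early_ack=False` it sees it only after all participants have acknowledged the commit.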

PAXOS

A consensus protocol in which the coordinator proposes an outcome (e.g., commit or abort), and then participants vote on whether that outcome should succeed.
It does not block if a majority of Participants are available, and it has provably minimal message delays in the best case.
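
A minimal single-decree sketch (one proposer, no networking or failures; class and method names are my own) illustrates the promise/accept and majority rules:

```python
class Acceptor:
    """Minimal Paxos acceptor: promises and accepts based on proposal numbers."""

    def __init__(self):
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)  # report any prior accepted value
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised:                 # honor the promise rule
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

def chosen(acceptors, n, value):
    """A value is chosen once a strict majority of acceptors accept it."""
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return acks > len(acceptors) // 2
```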

MULTI-PAXOS

If the system elects a single leader that oversees proposed changes for some period of time, it can skip the propose phase; whenever a failure occurs, it reverts to full Paxos.
The system periodically renews the leader's position (called a lease) with another round of Paxos. Nodes must exchange log entries during leader election so that everyone is up to date.

2PC VS PAXOS

2 Phase Commit
If the Coordinator fails after sending the prepare message, the Participants block until the Coordinator recovers.

PAXOS
Does not block if a majority of Participants are alive, provided there is a sufficiently long period without further failures.

Replication

A DBMS can replicate data across redundant nodes to increase availability.

Design decisions:

  • Replica configuration
  • Propagation scheme
  • Propagation timing
  • Update method

Replica Configurations

Method 1: Primary-Replica

  • All updates for an object go to its designated primary node.
  • The primary propagates updates to its replicas without an atomic commit protocol.
  • Read-only txns may be allowed to access replicas.
  • If the primary goes down, an election selects a new primary.
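
A toy model of this flow (names and election rule are my own; real systems use a proper consensus-based election):

```python
class PrimaryReplica:
    """Toy primary-replica group: all writes go through the primary, which
    then pushes them to the replicas without an atomic commit protocol."""

    def __init__(self, node_ids):
        self.stores = {nid: {} for nid in node_ids}
        self.primary = node_ids[0]              # designated primary node

    def write(self, key, value):
        self.stores[self.primary][key] = value  # the update lands at the primary
        for nid, store in self.stores.items():
            if nid != self.primary:
                store[key] = value              # propagated, no 2PC involved

    def read(self, key, node_id):
        return self.stores[node_id].get(key)    # read-only txns may use replicas

    def fail_primary(self):
        del self.stores[self.primary]           # the primary is gone
        self.primary = min(self.stores)         # election stand-in: lowest id wins
```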

Method 2: Multi-Primary

  • Txns can update data objects on any replica.
  • Replicas must synchronize with each other using an atomic commit protocol.

K-Safety

K-Safety is a threshold for determining the fault tolerance of a replicated database.
The value K represents the number of replicas per data object that must always be available.
If the number of replicas drops below this threshold, the DBMS halts execution and takes itself offline.
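
The rule can be expressed as a one-line check (function name and input shape are assumptions for illustration):

```python
def k_safe(replica_counts: dict, k: int) -> bool:
    """Return True while every data object still has at least k live replicas.

    `replica_counts` maps object id -> number of currently available replicas.
    If any object drops below k, the DBMS must take itself offline.
    """
    return all(count >= k for count in replica_counts.values())
```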

Propagation Scheme

When a txn commits on a replicated database, the DBMS must decide whether to wait for the txn's changes to propagate to the other nodes before sending an acknowledgment to the application.

Propagation levels:

  1. Synchronous (strong consistency): The primary sends updates to the replicas and then waits for them to acknowledge that they have fully applied (i.e., logged) the changes.
  2. Asynchronous (eventual consistency): The primary returns the acknowledgment to the client immediately, without waiting for the replicas to apply the changes.
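
A sketch of the two propagation levels (in-memory lists stand in for replica logs; the return shape is my own):

```python
def commit(change, replicas, synchronous):
    """Toy propagation scheme: returns once it is safe to ack the client.

    Synchronous: the change is applied on every replica before the ack
    (strong consistency). Asynchronous: the ack is returned immediately and
    replication would happen in the background (eventual consistency).
    """
    applied = 0
    if synchronous:
        for log in replicas:
            log.append(change)   # wait until each replica has logged the change
            applied += 1
    return {"ack": True, "replicas_applied": applied}
```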

Propagation Timing

Method 1: Continuous: The DBMS sends log messages as it generates them, including the final commit/abort message.
Method 2: On Commit: The DBMS sends a txn's log messages to the replicas only once the txn has committed. This avoids wasting resources on log records for aborted txns, but assumes that a txn's log records fit entirely in memory.
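
A sketch of what each timing policy ships to the replicas (message names are illustrative):

```python
def ship_log(records, committed, timing):
    """Which log records reach the replicas under each timing policy.

    Continuous: every record is shipped as it is generated, followed by the
    final outcome message. On-commit: records are shipped only if the txn
    actually commits, so aborted txns cost no replication traffic.
    """
    if timing == "continuous":
        return records + (["COMMIT"] if committed else ["ABORT"])
    # on-commit
    return (records + ["COMMIT"]) if committed else []
```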

Update Method

Method 1: Active-Active

  • The txn executes independently at each replica.
  • At the end, the DBMS must check whether the txn produced the same result at every replica.
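
The end-of-txn check in Active-Active can be sketched as (function name is my own):

```python
def active_active_ok(results):
    """Active-Active check: the txn may commit only if every replica,
    having executed it independently, produced the same result."""
    return all(r == results[0] for r in results)
```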

Method 2: Active-Passive

  • Each txn executes at a single location and propagates its changes to the replicas.
  • The DBMS can perform either physical or logical replication.
  • Note: this is a separate design choice from primary-replica vs. multi-primary.

CAP Theorem

  • Consistent
  • Always Available
  • Network Partition Tolerant

One flaw is that it ignores the consistency-latency tradeoff.

Origin blog.csdn.net/weixin_47895938/article/details/132337445