Distributed System Concepts and Design - Distributed Transactions


Distributed transactions

Transactions that access objects managed by multiple servers are called distributed transactions.

When a distributed transaction ends, the atomicity of the transaction requires that all the servers participating in it either all commit it or all abort it.

How this is achieved:

  • One of the servers takes on the role of coordinator, which ensures that the same outcome is reached at all servers.
  • How the coordinator achieves this depends on the protocol chosen: the two-phase commit protocol lets the servers communicate with one another to jointly reach a decision to commit or abort.

Flat and nested distributed transactions


Coordinator of distributed transactions

  • Servers that execute requests as part of a distributed transaction need to communicate with one another to coordinate their actions when the transaction commits.
  • The server that creates a distributed transaction becomes its coordinator, responsible for committing or aborting the transaction when it ends.
  • Every server that manages an object accessed by a distributed transaction is a participant in that transaction, and provides an object we call the participant.
  • Each participant is responsible for keeping track of all the recoverable objects at its server that are involved in the transaction.
  • The participants cooperate with the coordinator to carry out the commit protocol.
  • During the transaction, the coordinator records a reference to each participant in its participant list, and each participant records a reference to the coordinator.
  • As a result, the coordinator knows every participant in the transaction and every participant knows the coordinator, so both can collect the information they need during the commit process.


Atomic commit protocol

The two-phase commit protocol

The starting point is that any participant may independently commit or abort its own part of the transaction.

By the atomicity of the transaction, if any one participant aborts its part, the entire distributed transaction must abort.

  • In the first phase, each participant votes on whether the transaction should be committed or aborted.
    • Once a participant has voted to commit the transaction, it is no longer allowed to abort it.
    • Before voting to commit, a participant must therefore make sure that it will eventually be able to carry out its part of the commit protocol, even if it fails in the meantime.
    • A participant that can eventually commit its part of the transaction is said to be in the prepared state for that transaction.
    • To ensure this, each participant must save all the objects changed in the transaction, together with its own status, to persistent storage.
  • In the second phase, every participant carries out the joint decision: if any participant voted to abort, the decision is to abort; only if all participants voted to commit does the whole transaction commit.
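The voting logic above can be sketched in a few lines of Python. The `Participant` and `two_phase_commit` names, and the string-valued votes and states, are illustrative assumptions rather than any concrete system's API:

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A participant that votes "yes" must first save its
        # changes to persistent storage and enter the prepared state.
        if self.can_commit:
            self.state = "prepared"   # no longer allowed to abort unilaterally
            return "yes"
        self.state = "aborted"
        return "no"

    def finish(self, decision):
        # Phase 2: act on the coordinator's global decision.
        self.state = decision         # "committed" or "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted yes; otherwise abort everywhere.
    decision = "committed" if all(v == "yes" for v in votes) else "aborted"
    for p in participants:
        if p.state != "aborted":
            p.finish(decision)
    return decision
```

A single "no" vote (`can_commit=False`) forces every prepared participant to abort, which is exactly the all-or-nothing property the text describes.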

Failure model for the commit protocol

The two-phase commit protocol is a consensus protocol.

In an asynchronous system it is impossible to guarantee consensus if even one process can crash (the FLP impossibility result).

Two-phase commit nevertheless reaches consensus because a crashed process is replaced by a new process whose state is restored from persistent storage, allowing it to respond in the old process's place.

Concurrency control in distributed transactions

Locks

In a distributed transaction, the locks on an object are always held at the server that manages that object, and only that server's local lock manager can grant them.

The local lock manager decides whether to grant each lock request or make it wait.

It cannot release any of a transaction's locks until it knows that the transaction has been committed or aborted at all the servers involved.

Locking in distributed transactions gives rise to the problem of distributed deadlock, discussed below.
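A toy local lock manager illustrates the points above: locks live at the server that manages the object, a request is either granted or made to wait, and nothing is released until the transaction ends. The class and method names are assumptions made for this sketch:

```python
class LocalLockManager:
    def __init__(self):
        self.locks = {}      # object id -> transaction currently holding it
        self.waiting = []    # (transaction, object) requests that must wait

    def request(self, tx, obj):
        holder = self.locks.get(obj)
        if holder is None or holder == tx:
            self.locks[obj] = tx
            return True                  # lock granted
        self.waiting.append((tx, obj))
        return False                     # conflicting lock: request waits

    def release_all(self, tx):
        # Called only once tx has committed or aborted everywhere:
        # release every lock the transaction holds at this server.
        for obj in [o for o, t in self.locks.items() if t == tx]:
            del self.locks[obj]
```

Calling `release_all` only from the commit/abort path is what makes this strict two-phase locking.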

Distributed locks are a mechanism for ensuring the consistency and reliability of concurrent access in distributed systems. Designing and implementing them requires attention to the following aspects:

  1. Lock management: a distributed lock must handle acquisition and release correctly under concurrency; it is usually built on a mutual-exclusion primitive.
  2. Node management: the lock service must track the nodes in the distributed environment to ensure availability and fault tolerance, typically using a distributed coordination service such as ZooKeeper.
  3. Communication protocol: nodes must communicate over a reliable protocol (typically TCP) so that lock state stays consistent and correct.

When implementing distributed locks, the following methods can be used:

  1. Database-based: use the database's own locking (row locks, table locks, etc.). This is simple, but limits concurrency.
  2. Coordination-service-based: build the lock on a distributed coordination service such as ZooKeeper. This gives strong consistency and reliability, at the cost of extra system complexity and overhead.
  3. Redis-based: use Redis's distributed lock pattern, which provides consistency and reliability with high performance and scalability.

The choice among these implementations depends on factors such as system complexity, concurrency requirements, and reliability. Whichever is chosen, issues such as lock granularity, lock timeouts, and retry behavior must also be addressed to ensure the lock behaves correctly and reliably.
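The Redis-based approach, for example, reduces to an atomic "set if absent, with expiry" plus an owner-token check on release. The sketch below simulates that pattern in plain Python with an explicit clock, so no Redis server or client library is needed; the `ExpiringLock` class and its methods are assumptions for illustration, not the redis-py API:

```python
import uuid


class ExpiringLock:
    """Simulates the 'SET key value NX EX ttl' lock pattern with a manual clock."""

    def __init__(self):
        self.store = {}   # lock name -> (owner token, expiry time)

    def acquire(self, name, ttl, now):
        entry = self.store.get(name)
        if entry is None or entry[1] <= now:      # free, or previous holder expired
            token = str(uuid.uuid4())
            self.store[name] = (token, now + ttl)
            return token
        return None                               # someone else holds the lock

    def release(self, name, token):
        # Only the holder's token may release, so a slow client whose lock
        # already expired cannot remove a newer holder's lock.
        entry = self.store.get(name)
        if entry is not None and entry[0] == token:
            del self.store[name]
            return True
        return False
```

The token check on release is the important detail: without it, a client that stalls past its TTL could delete a lock that now belongs to someone else.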

Distributed timestamp ordering concurrency control

  • In distributed transactions, the coordinators must ensure that each transaction is assigned a globally unique timestamp.
  • The globally unique timestamp is issued to the client by the first coordinator the transaction accesses.
  • Whenever a transaction performs an operation on an object at a server, the transaction timestamp is passed to the coordinator at that server.
  • All the servers in a distributed transaction jointly guarantee the serial-equivalence property:
    • If the version of an object accessed by transaction U at one server is committed after transaction T's access, then at any other server whose objects T and U both access, their accesses must be committed in the same order.
    • To guarantee the same order at all servers, the coordinators must agree on the timestamp ordering.
    • A timestamp is a two-tuple: compare the local timestamps first, then the server ids.
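Because Python compares tuples lexicographically, the ⟨local timestamp, server id⟩ pair described above needs no custom comparison logic. This small example (names are illustrative) demonstrates the ordering:

```python
# Globally unique timestamps as (local time, server id) pairs: tuple
# comparison checks local time first, then breaks ties on server id.

def make_timestamp(local_time, server_id):
    return (local_time, server_id)

ts_a = make_timestamp(100, "server1")
ts_b = make_timestamp(100, "server2")   # same local time, higher server id
ts_c = make_timestamp(99, "server9")

# ts_c sorts first despite its high server id: local time dominates.
order = sorted([ts_a, ts_b, ts_c])
```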

In a distributed system, timestamp ordering concurrency control (TOCC) is a mechanism for ensuring the consistency and isolation of concurrent transactions. Its basic idea is to order transactions by global timestamps: fixing the execution order between transactions in this way guarantees consistency and isolation.

The implementation of TOCC can be divided into two stages:

  1. Transaction commit phase: when a transaction commits, it is assigned a global timestamp, which can be generated by an incrementing sequence number, synchronized clocks, or similar means. The transaction's operation sequence and timestamp are then submitted to the coordinating node for processing.
  2. Transaction processing phase: the coordinating node sorts transactions by timestamp to fix their execution order, and locks concurrently accessed data to preserve isolation. When processing completes, the result is returned to the client together with the timestamp.
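At each object, the processing phase can enforce this order with the basic timestamp-ordering rule: an operation that arrives "too late" relative to the object's recorded read and write timestamps forces its transaction to abort and restart. The sketch below is a generic textbook rule, not any particular system's code:

```python
class TimestampedObject:
    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read this object
        self.write_ts = 0   # largest timestamp that has written it

    def read(self, ts):
        if ts < self.write_ts:          # a younger transaction already wrote
            return False                # caller must abort and restart
        self.read_ts = max(self.read_ts, ts)
        return True

    def write(self, ts):
        # Writing is allowed only if no younger transaction has read or
        # written the object; otherwise the write arrived too late.
        if ts < self.read_ts or ts < self.write_ts:
            return False
        self.write_ts = ts
        return True
```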

The advantage of TOCC is that it guarantees the consistency and isolation of transactions while supporting highly concurrent access. Its disadvantage is that timestamp sorting and locking add overhead, which affects the system's performance and scalability.

When implementing TOCC, the following aspects need to be considered:

  1. Timestamp generation: choose a generation method that guarantees timestamps are unique and monotonically increasing.
  2. Timestamp ordering: choose a sorting algorithm that orders timestamps correctly.
  3. Lock management: lock concurrently accessed data to preserve transaction isolation and correctness.
  4. Fault tolerance: account for failures in a distributed environment to keep the system available and reliable.

Optimistic concurrency control

  • Every transaction must be validated before it is allowed to commit.
  • A transaction is assigned a transaction number when validation starts, and serializability is achieved by ordering transactions according to these numbers.
  • Validation of a distributed transaction is carried out jointly by a group of servers, each validating the accesses to its own objects.

In a distributed system, optimistic concurrency control (OCC) is a mechanism for ensuring the consistency and isolation of concurrent transactions. Its basic idea is to assume that conflicts between transactions are rare, so all transactions are allowed to execute concurrently; version checks at commit time then preserve consistency and isolation.

The implementation of OCC can be divided into two stages:

  1. Transaction execution phase: each transaction records a version number (generated by incrementing sequence numbers, timestamps, etc.) and tags the data it accesses with the versions it observed during execution.
  2. Transaction commit phase: at commit, check whether the data the transaction accessed has been modified by other transactions. If not, the transaction commits; if it has, the transaction is rolled back and executed again.
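The commit-phase check can be sketched as validation against per-object version numbers: any mismatch between the versions a transaction read and the current versions means a concurrent commit intervened, so the transaction aborts and retries. The function name and dictionary-based store are assumptions for illustration:

```python
def validate_and_commit(store, versions, read_set, write_set):
    """store: object -> value; versions: object -> current version number;
    read_set: object -> version the transaction observed;
    write_set: object -> new value to install on success."""
    for obj, seen_version in read_set.items():
        if versions.get(obj, 0) != seen_version:
            return False                      # conflict: abort, caller retries
    for obj, value in write_set.items():      # validation passed: install writes
        store[obj] = value
        versions[obj] = versions.get(obj, 0) + 1
    return True
```

No locks are held while the transaction runs; the version comparison at commit is what detects interference.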

The advantage of OCC is that it supports highly concurrent access without locking, which improves system performance and scalability. Its disadvantage is that version control and conflict detection add complexity and overhead.

When implementing OCC, the following aspects need to be considered:

  1. Version control: choose a versioning method that guarantees version numbers are unique and monotonically increasing.
  2. Conflict detection: choose a detection method that preserves transaction consistency and isolation.
  3. Fault tolerance: account for failures in a distributed environment to keep the system available and reliable.

The choice of OCC implementation depends on factors such as system complexity, concurrency requirements, and reliability; the versioning method, conflict-detection method, and fault-tolerance mechanism all need to be chosen carefully to ensure OCC behaves correctly and reliably.

Distributed deadlock

Deadlock is a common problem in distributed systems: it occurs when multiple processes or threads each wait for the others to release lock resources. Distributed deadlock arises because a distributed system contains multiple independent resource managers whose lock holdings depend on one another.

The solutions to distributed deadlocks can be divided into the following categories:

  1. Timeout mechanism: In a distributed system, a timeout mechanism can be used to solve the deadlock problem. When a transaction waits for other transactions to release lock resources for more than a certain threshold, it is considered that a deadlock has occurred and the transaction is rolled back.
  2. Deadlock detection: In a distributed system, a deadlock detection mechanism can be used to solve the deadlock problem. The deadlock detection mechanism can judge whether a deadlock has occurred by checking the resource allocation in the system, and take corresponding measures to remove the deadlock.
  3. Lock granularity control: In a distributed system, the occurrence of deadlocks can be reduced by controlling the lock granularity. For example, the lock granularity can be lowered from the table level to the row level, thereby reducing the probability of deadlocks.
  4. Resource allocation strategy: In a distributed system, the occurrence of deadlocks can be reduced by optimizing the resource allocation strategy. For example, a resource reservation strategy can be adopted, that is, the required resources are reserved before the transaction is executed, so as to avoid the occurrence of deadlocks.
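Deadlock detection (approach 2 above) is commonly framed as finding a cycle in the wait-for graph, where an edge T → U means transaction T is waiting for a lock held by U; a cycle means deadlock, and one transaction on the cycle is chosen as the victim and aborted. The depth-first search below is a generic sketch, not any specific system's detector:

```python
def find_cycle(wait_for):
    """wait_for: transaction -> set of transactions it is waiting on."""
    visited, on_path = set(), set()

    def dfs(node):
        if node in on_path:
            return True            # back edge: we walked into our own path
        if node in visited:
            return False
        visited.add(node)
        on_path.add(node)
        for nxt in wait_for.get(node, ()):
            if dfs(nxt):
                return True
        on_path.discard(node)
        return False

    return any(dfs(t) for t in wait_for)
```

In a distributed setting the edges of this graph are scattered across servers, which is why collecting them (or probing along them) is the hard part of distributed deadlock detection.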

Distributed transaction recovery

  • The atomicity of a transaction requires that the objects it accesses reflect the effects of all committed transactions and none of the effects of incomplete or aborted transactions.
    • Durability: objects are stored permanently in reliable storage and remain available. Once a client's commit is acknowledged, all of the transaction's object updates have been saved to persistent storage.
    • Failure atomicity: the effects of a transaction are atomic even when the server fails.
  • Transaction recovery ensures the durability of the server's objects and provides the failure-atomicity service.
  • The recovery manager:
    • saves the objects of committed transactions to persistent storage
    • restores the server's objects after a server crash
    • reorganizes the recovery file to improve the performance of recovery
    • reclaims storage space

Log

  • In the logging technique, the recovery file contains a history of all the operations performed at the server.
  • The history consists of object values, transaction status entries, and intentions lists.
  • The order of entries reflects the order in which transactions prepared, committed, or aborted at the server.
  • During normal operation, the recovery manager is invoked each time a transaction prepares to commit, commits, or aborts.
  • Appends to the recovery file are atomic: a write is always complete.
  • If the server crashes, only the last write can be incomplete.
  • For efficiency, several writes may be buffered and flushed together in a single commit-time write.
  • Sequential writes to disk are also more efficient than random writes.
  • After a crash, all uncommitted transactions are abandoned based on the log contents.
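Recovery from such a log can be sketched as a single replay pass: redo the writes of transactions whose commit record reached the recovery file, and abandon the rest. The `(kind, transaction, ...)` record format below is an illustrative assumption:

```python
def recover(log):
    """Rebuild object state from an append-only recovery log."""
    # A transaction counts as committed only if its commit record was
    # durably appended before the crash.
    committed = {t for kind, t, *_ in log if kind == "commit"}
    state = {}
    for kind, tx, *rest in log:
        if kind == "write" and tx in committed:
            obj, value = rest
            state[obj] = value     # redo writes of committed transactions only
    return state
```

A write that belongs to a transaction with no commit record is simply ignored, which is how the log's contents cause uncommitted transactions to be abandoned after a crash.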

In a distributed system, transaction recovery is an important mechanism to ensure the consistency and reliability of transactions. Distributed transaction recovery usually includes the following steps:

  1. Fault detection: In a distributed system, it is necessary to monitor the fault conditions in the system, such as node faults, network faults, etc. When a fault occurs, it needs to be detected in time and dealt with accordingly.
  2. Logging: In a distributed system, it is necessary to log the operations during transaction execution. Log records can be used to restore data and state during transaction execution.
  3. Transaction rollback: When a failure occurs, unfinished transactions need to be rolled back. Rollback operations can restore data and state by undoing the operations of the transaction.
  4. Transaction replay: In the process of fault recovery, it is necessary to replay the committed transactions to ensure the consistency and reliability of data and state.

When implementing distributed transaction recovery, the following aspects need to be considered:

  1. Logging method: choose a logging method that keeps data and state reliable and consistent.
  2. Rollback mechanism: design a rollback mechanism that preserves transaction consistency and reliability.
  3. Replay strategy: choose a replay strategy that keeps data and state consistent and reliable.
  4. Fault tolerance: account for failures in a distributed environment to keep the system available and reliable.

When implementing a distributed system, it is necessary to consider the transaction recovery mechanism and take corresponding measures to ensure the stability and reliability of the system. At the same time, transaction recovery testing and debugging are required to ensure the correctness and reliability of the system.

Origin blog.csdn.net/weixin_38233104/article/details/131039068