Cloud Native Distributed Database Transaction Isolation Level (Part 2)

In the previous article, we introduced the basic concepts of transactions and the different transaction isolation levels. This article focuses on the development of snapshot isolation.

 

Part 3 Development of Snapshot Isolation

 

The definition of Snapshot Isolation was proposed in the paper A Critique of ANSI SQL Isolation Levels:

  • A transaction's reads are served from a snapshot of committed data. The snapshot time can be any time before the transaction's first read, recorded as its StartTimestamp (StartTS);

  • When the transaction is ready to commit, it obtains a CommitTimestamp (CommitTS), which must be larger than every existing StartTS and CommitTS;

  • A conflict check is performed at commit time: if no other transaction has committed data intersecting the transaction's WriteSet within the [StartTS, CommitTS] interval, the transaction may commit;

  • Snapshot isolation allows a transaction to run with a very old StartTS so that it is not blocked by any write operation, or so that it can read historical data.

Note here:

  • The conflict check above spans both time (the [StartTS, CommitTS] interval) and space (the WriteSet), and mainly serves to prevent the Lost Update anomaly;

  • Implementations usually use locks plus a LastCommit map: before committing, the transaction locks the corresponding rows, then traverses its WriteSet to check whether any row's LastCommit falls within its [StartTS, CommitTS];

  • If there is no conflict, it records its CommitTS into LastCommit, commits the transaction, and releases the locks.
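The commit path described above can be sketched as follows. This is a minimal single-process illustration, assuming an in-memory LastCommit map and a local timestamp counter; `BasicSIEngine` and its method names are assumptions, not any particular engine's API (a real engine would also lock the rows before checking):

```python
# Sketch of Basic SI commit-time conflict detection using a LastCommit map
# (row key -> CommitTS of the last committed write). Illustrative only.

class BasicSIEngine:
    def __init__(self):
        self.last_commit = {}   # row key -> CommitTS of last committed write
        self.next_ts = 1        # monotonically increasing timestamp source

    def begin(self):
        start_ts = self.next_ts
        self.next_ts += 1
        return start_ts

    def commit(self, start_ts, write_set):
        """Abort (return None) if any row in write_set was committed by
        another transaction after start_ts; otherwise commit."""
        for key in write_set:
            if self.last_commit.get(key, 0) > start_ts:
                return None  # WW conflict inside (start_ts, now]
        commit_ts = self.next_ts
        self.next_ts += 1
        for key in write_set:
            self.last_commit[key] = commit_ts  # update LastCommit
        return commit_ts
```

With two concurrent transactions writing the same row, the one committing second is aborted, which is exactly the Lost Update prevention described above.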

Looking carefully at the definition of snapshot isolation above, consider the following questions:

  1. Acquisition of CommitTS: how do we obtain a timestamp larger than every existing StartTS and CommitTS, especially in a distributed system;

  2. Acquisition of StartTS: although StartTS can be a very old time, does it need to increase monotonically;

  3. The conflict check at commit time exists to prevent the Lost Update anomaly; for this anomaly, is the write-write conflict check both sufficient and necessary;

  4. How to implement a distributed, decentralized snapshot isolation level.

The sections below expand on these problems. From here on, the snapshot isolation defined above is referred to as Basic SI.

 

Distributed snapshot isolation

This section covers the engineering progress of HBase, Percolator, and Omid on distributed snapshot isolation.

HBase

Snapshot isolation in HBase is a distributed SI built entirely on multiple HBase tables:

  • Version Table: records the last CommitTS of each row;

  • Committed Table: records the commit log; a transaction writes its commit record here at commit time, after which it is considered committed;

  • PreCommit Table: used to check concurrency conflicts; can be understood as a lock table;

  • Write Label Table: used to generate a globally unique Write Label;

  • Committed Index Table: speeds up the generation of StartTS;

  • DS: stores the actual data.

The protocol is roughly implemented as follows:

  • StartTS: traverse the Committed Table to find the maximum commit timestamp below which the committed timestamps are contiguous, i.e. there is no hole before it (a hole is a transaction that has taken a CommitTS but, in CommitTS order, has not yet committed);

  • Committed Index: to avoid traversing too much data while obtaining StartTS, each transaction writes to the Committed Index Table after obtaining its StartTS, and later transactions can start traversing from that timestamp; it is effectively a cache;

  • read: to judge whether a transaction's data has been committed, check the Version Table and the Committed Table;

  • precommit: first check the Committed Table for conflicting transactions, then record a row in the PreCommit Table, and then check the PreCommit Table for conflicting transactions;

  • commit: obtain a CommitTS, write a record to the Committed Table, and update the PreCommit Table.

HBase implements distributed SI in a structurally decoupled manner: all state is stored in HBase, with a separate table for each need, but this decoupling also costs performance. Below, the snapshot isolation implemented by HBase is referred to as HBaseSI.
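The hole-free StartTS rule above can be illustrated with a small sketch. The set-based inputs stand in for a scan of the Committed Table; `start_ts` and its arguments are assumptions for illustration, not HBase APIs:

```python
# Sketch of HBaseSI's StartTS rule: the snapshot point is the largest
# CommitTS below which every allocated commit timestamp has actually
# committed (no "holes"). Illustrative only.

def start_ts(allocated, committed):
    """allocated: all CommitTS values handed out; committed: those whose
    commit record was written. Returns the highest hole-free timestamp."""
    ts = 0
    for t in sorted(allocated):
        if t in committed:
            ts = t
        else:
            break  # first hole: this CommitTS was taken but not committed
    return ts
```

If timestamps 1-4 were allocated but 3 has not committed, the snapshot point is 2, even though 4 has committed; this is why the Committed Index cache matters, since the scan always starts from the last known hole-free point.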

 

Percolator

Percolator, proposed in 2010, engineers the HBase approach further: it merges the multiple tables involved into one, adding lock and write columns alongside the original data:

  • lock: used for WW conflict checking; in practice, locks are divided into a Primary Lock and Secondary Locks;

  • write: can be understood as the commit log. Transaction commit still uses 2PC; when the coordinator decides to commit, it writes a commit record in the write column, after which the transaction is considered committed.

As a distributed SI scheme, Percolator still relies on 2PC for atomic commit, but its prewrite and commit phases neatly combine transaction locking with the 2PC prepare step, and Bigtable's single-row transactions avoid much of the conflict handling required in the HBaseSI scheme. Below, the snapshot isolation implemented by Percolator is referred to as PercolatorSI.
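A minimal sketch of the prewrite/commit flow over a per-row store with data, lock, and write columns, assuming single-row operations are atomic (as Bigtable's single-row transactions provide); `PercolatorStore` and its method names are illustrative assumptions, not Percolator's actual interface:

```python
# Sketch of Percolator-style prewrite/commit. data holds versioned values,
# lock is the lock column, write is the commit log column. Illustrative.

class PercolatorStore:
    def __init__(self):
        self.data = {}    # (key, start_ts) -> value
        self.lock = {}    # key -> (start_ts, primary_key)
        self.write = {}   # (key, commit_ts) -> start_ts  (the commit log)

    def prewrite(self, key, value, start_ts, primary):
        # WW conflict: a commit record newer than our snapshot...
        if any(k == key and cts > start_ts for (k, cts) in self.write):
            return False
        # ...or a lock left by another (possibly concurrent) transaction.
        if key in self.lock:
            return False
        self.lock[key] = (start_ts, primary)
        self.data[(key, start_ts)] = value
        return True

    def commit(self, key, start_ts, commit_ts):
        held = self.lock.get(key)
        if held is None or held[0] != start_ts:
            return False  # our lock was cleaned up; must abort
        self.write[(key, commit_ts)] = start_ts  # commit point
        del self.lock[key]
        return True
```

Committing the primary row's write record is the atomic commit point; secondary rows can then be committed lazily, which is why leftover locks from failed clients must be resolved by later readers.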

 

Omid

Omid is Yahoo's work, also built on HBase, but in contrast to Percolator's pessimistic approach, Omid takes an optimistic one. Its architecture is elegant and concise, and its engineering is solid; in recent years, Omid papers have appeared at ICDE, FAST, and PVLDB.

Although Percolator's lock-based scheme simplifies transaction conflict checking, it leaves driving the transaction to the client: when a client fails, its leftover locks must be cleaned up before other transactions can proceed, and maintaining the extra lock and write columns obviously adds overhead. In an optimistic scheme such as Omid, commit is decided entirely by a central node, which makes transaction recovery simpler; moreover, Omid is easier to adapt to different distributed storage systems, being less intrusive.

The ICDE 2014 paper lays out the Omid architecture:

  • TSO (Timestamp Oracle): responsible for timestamp allocation and transaction commit;

  • BookKeeper: a distributed log component, used to record the transaction commit log;

  • DataStore: HBase stores the actual data; other distributed storage systems can also be adapted.

The TSO maintains the following state:

  • Timestamp: a monotonically increasing timestamp, used as SI's StartTS and CommitTS;

  • lastCommit: the commit timestamps of all data, used for WW conflict detection; it is truncated by transaction commit time so that it fits in memory;

  • committed: whether a transaction has committed; a transaction is identified by its StartTS, so this records the StartTS -> CommitTS mapping;

  • uncommitted: transactions that have been assigned a CommitTS but have not yet committed;

  • T_max: the low-water mark of lastCommit; a transaction whose StartTS is below this timestamp is aborted at commit.

lastCommit is the key here: at commit time, Omid no longer uses Percolator's pessimistic lock-then-check approach, but instead:

  • sends the commit request to the TSO for optimistic conflict detection;

  • using the lastCommit information, the TSO checks whether the transaction's WriteSet overlaps lastCommit in time and space; if there is no conflict, it updates lastCommit and writes the commit log to BookKeeper;

  • the TSO's lastCommit would clearly consume a lot of memory and become a performance bottleneck, so only the most recent lastCommit entries are kept, with T_max maintaining the low-water mark; a transaction whose StartTS falls below T_max is always aborted.
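The TSO-side check just described might be sketched as follows; `TSO`, `try_commit`, and the trimming policy are illustrative assumptions, not Omid's actual implementation:

```python
# Sketch of Omid's centralized optimistic commit check: lastCommit keeps
# only recent commit timestamps, and transactions older than the low-water
# mark T_max are aborted outright because they can no longer be verified.

class TSO:
    def __init__(self):
        self.ts = 0
        self.last_commit = {}  # row -> CommitTS, trimmed to recent entries
        self.t_max = 0         # low-water mark below which we cannot check

    def next_ts(self):
        self.ts += 1
        return self.ts

    def try_commit(self, start_ts, write_set):
        if start_ts < self.t_max:
            return None  # too old: lastCommit no longer covers the interval
        for key in write_set:
            if self.last_commit.get(key, 0) > start_ts:
                return None  # WW conflict in (start_ts, now]
        commit_ts = self.next_ts()
        for key in write_set:
            self.last_commit[key] = commit_ts
        return commit_ts
```

Note that trimming lastCommit trades memory for false aborts: any transaction older than T_max is rejected even if it has no real conflict.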

In addition, a client-side cache of committed information is proposed to reduce queries to the TSO: in a transaction's start request, the TSO returns the set of transactions committed up to the start time, so the client can directly judge whether a transaction has committed. The overall architecture is shown in the figure below.

In FAST 2017, Omid adjusted the previous architecture and made some engineering optimizations:

  • the commit log is no longer stored in BookKeeper but in an additional HBase table;

  • the client no longer caches committed information; instead it is cached on the data table itself, so most of the time a reader can judge from the commit field whether a row has been committed.

In PVLDB 2018, Omid again made significant engineering optimizations to cover more scenarios:

  • the Commit Log is no longer written by the TSO but offloaded to the client, which improves scalability and reduces transaction latency;

  • single-row read-write transactions are optimized: a maxVersion record is kept in memory alongside the data, so such transactions no longer need validation by the central node.

 

Decentralized snapshot isolation

The implementations above are all distributed SI, and they share a common trait: a central node is retained, either for transaction coordination or for timestamp allocation. For large-scale or cross-region transactional systems, this remains a risk. In response, there has been a series of explorations into decentralized snapshot isolation.

 

Clock-SI

Clock-SI first points out that the correctness of Snapshot Isolation rests on three points:

  • Consistent Snapshot: the snapshot contains all and only the transactions whose commits precede SnapshotTS;

  • Commit Total Order: all transaction commits form a total order, and each commit produces a snapshot, identified by its CommitTS;

  • Write-Write Conflict: transactions Ti and Tj conflict when their WriteSets intersect and their [SnapshotTS, CommitTS] intervals intersect.

Based on these three points, Clock-SI proposes the following algorithm:

  • StartTS: obtained directly from the local clock.

  • Read: when the clock of the target node is behind StartTS, the read waits (the Read Delay in the figure below); it also waits while the target data is in the Prepared or Committing state; afterwards, the latest version below StartTS can be read. The Read Delay is what guarantees a Consistent Snapshot.

  • CommitTS: single-partition and distributed transactions are distinguished. A single-partition transaction can commit directly using its local clock as CommitTS; a distributed transaction selects max{PrepareTS} as its CommitTS during 2PC. To guarantee a total order on CommitTS, the node id is appended to the timestamp, in the same way as a Lamport Clock.

  • Commit: for both single-partition and multi-partition transactions, WW conflict detection is performed by the single-machine engine.
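Two of these rules can be sketched directly, assuming each partition has its own physical clock; the function names are assumptions for illustration:

```python
# Sketch of two Clock-SI rules: a read waits until the target partition's
# clock passes SnapshotTS, and a distributed commit takes CommitTS =
# max(PrepareTS) with the node id appended Lamport-style so that CommitTS
# values form a total order. Illustrative only.

def must_delay_read(snapshot_ts, partition_clock):
    """Read Delay: serving the read now could miss a transaction that will
    still commit on this partition with CommitTS < snapshot_ts."""
    return partition_clock < snapshot_ts

def commit_ts(prepare_ts_by_node):
    """2PC commit: pick the largest (prepare_ts, node_id) pair; tuple
    comparison breaks timestamp ties by node id, as with Lamport clocks."""
    return max(prepare_ts_by_node)
```

The Read Delay is what keeps the snapshot consistent despite clock skew between partitions: a fast-clock reader simply waits out the skew instead of missing commits.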

Clock-SI has several innovations:

  • it uses ordinary physical clocks and no longer relies on a central node to allocate timestamps;

  • intrusion into the single-machine transaction engine is small, so distributed SI can be built on top of a single-machine Snapshot Isolation database;

  • distinguishing single-machine from distributed transactions barely reduces single-machine transaction performance, while distributed transactions use 2PC for atomic commit.

In an engineering implementation, the following issues need to be considered:

  • StartTS selection: an older snapshot time can be used so that reads are not blocked by concurrent transactions;

  • clock drift: although the algorithm's correctness is unaffected by clock drift, drift increases transaction latency and the abort rate;

  • Session Consistency: after a transaction commits, its timestamp is returned to the client as latestTS; the client attaches this latestTS to its next request, and the serving node waits until its clock passes it.

The paper's experimental results are impressive, but its correctness argument is relatively informal and needs further proof.

 

ConfluxDB

If Clock-SI has a shortcoming, it is the dependence on physical clocks: under clock drift, transaction latency and the abort rate suffer. Can decentralization be achieved without relying on physical clocks?

In the scheme proposed by ConfluxDB, only logical clocks are used to capture the precedence relationships among transactions, and conflicts are detected from those relationships:

  • when transaction Ti is ready to commit, the 2PC coordinator requests the concurrent(Ti) list from all participants, where concurrent(Ti) is defined as the set of transactions Tj with begin(Tj) < commit(Ti);

  • after the coordinator receives concurrent(Ti) from all participants, it merges them into a global gConcurrent(Ti) and sends it back to all participants;

  • each participant checks whether there exists a transaction Tj with dependents(Ti, Tj) ∧ (Tj ∈ gConcurrent(Ti)) ∧ (Tj ∈ serials(Ti)), i.e. a transaction Tj whose precedence relationship with Ti differs across partitions, which would violate Consistent Snapshot;

  • each participant sends its conflict-detection result back to the coordinator, which decides accordingly whether to Commit or Abort;

  • in addition, the coordinator needs to generate a CommitTS for the transaction; here a method similar to Clock-SI is used, commitTS = max{prepareTS}, with prepareTS and commitTS propagated between nodes as Logical Clocks.

ConfluxDB's scheme needs neither physical clocks nor waiting, and does not even require the single-machine transaction engine to support reading a point-in-time snapshot. Its drawbacks are that the abort rate may not be good, and distributed transactions incur extra latency.
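The coordinator/participant exchange above can be sketched with plain sets; `g_concurrent`, `participant_ok`, and the set-valued inputs are assumptions standing in for the paper's concurrent(Ti) and serials(Ti), not ConfluxDB code:

```python
# Sketch of ConfluxDB's commit-time validation: the coordinator merges the
# per-partition concurrent sets, and each participant aborts Ti if some Tj
# is concurrent globally but serialized before Ti locally, i.e. two
# partitions would order Ti and Tj differently. Illustrative only.

def g_concurrent(per_participant_concurrent):
    """Coordinator step: union of concurrent(Ti) over all participants."""
    merged = set()
    for s in per_participant_concurrent:
        merged |= s
    return merged

def participant_ok(gconc, serials):
    """Participant step: Ti may commit here only if no Tj is both in the
    global concurrent set and in this partition's serials(Ti)."""
    return not (gconc & serials)
```

The extra round trip to collect and redistribute the concurrent sets is the source of the distributed-transaction latency noted above.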

 

Generalized SI

Generalized SI applies Snapshot Isolation to replicated databases, so that a transaction's snapshot can be read from a slave node in a replication group. This has two benefits: using an old snapshot avoids being blocked by currently running transactions, reducing transaction latency; and reading from secondary nodes achieves a degree of read-write separation, scaling out read performance.

 

Parallel SI

In the scheme above, read requests can be offloaded to secondary nodes, which scales read performance to some extent. Extending this idea, can transaction commit also be handed to secondary nodes?

This is the idea of Parallel Snapshot Isolation. In cross-region replication scenarios, businesses usually have geographic locality requirements: users in Shanghai send requests to the nearest Shanghai data center, and users in Guangzhou to the Guangzhou data center; and in real business scenarios, requirements on consistency and isolation can often be relaxed. Parallel SI abandons Snapshot Isolation's Commit Total Order constraint, thereby enabling multi-point transaction commit. Such a scheme may be hard to use in a general-purpose database, but it is valuable in real business scenarios.

 

Serializable SI

The difference between Snapshot Isolation and Serializable is the Write Skew anomaly. To eliminate this anomaly, we can optimize on top of Snapshot Isolation while trying to retain its excellent properties, which leads to Serializable SI (SSI).

The paper Serializable Isolation for Snapshot Databases, published by Alan D. Fekete and Michael J. Cahill in 2009, is the theoretical result of early SSI research.

The paper starts from serialization graph theory: in the multi-version serialization graph, an edge called an RW dependency is added; when transaction T1 reads a version and transaction T2 writes a newer version of the same item, an RW dependency from T1 to T2 is generated. When this graph contains a cycle, serializability is violated.

In the paper, the authors prove that in any cycle produced under SI, two RW edges must be adjacent, which implies a pivot transaction with both an incoming and an outgoing RW edge. As long as this pivot is detected and one of the involved transactions is aborted, the cycle is naturally broken. The core of the algorithm is to detect this structure dynamically, so each transaction records some state; to reduce memory usage, only two booleans, inConflict and outConflict, are kept, and as a transaction executes its reads and writes, its RW dependencies with other transactions are recorded in these two flags.

  • Although using boolean flags reduces memory usage, it clearly also increases false positives, causing some transactions with no actual anomaly to be aborted;

  • According to the paper's experiments, performance is better than S2PL (strict two-phase locking, which also achieves serializability), the abort rate is lower, and the overhead added to Snapshot Isolation is relatively small;

  • However, judging from PostgreSQL's later SSI implementation, a lot of work is still required to reduce memory usage; if interested, see Serializable Snapshot Isolation in PostgreSQL.
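The two-flag pivot detection can be sketched as follows; `Txn` and `record_rw` are illustrative assumptions capturing only the bookkeeping idea, not the paper's full algorithm (which also tracks dependencies via locks and version metadata):

```python
# Sketch of SSI's inConflict/outConflict bookkeeping: a transaction with
# both an incoming and an outgoing rw-edge is a pivot, and aborting it
# breaks the potential cycle. Illustrative only.

class Txn:
    def __init__(self, name):
        self.name = name
        self.in_conflict = False   # some txn has an rw-edge into us
        self.out_conflict = False  # we have an rw-edge out to some txn

def record_rw(reader, writer):
    """reader read a version that writer overwrote: rw-edge reader -> writer.
    Returns a transaction to abort if either endpoint became a pivot."""
    reader.out_conflict = True
    writer.in_conflict = True
    for t in (reader, writer):
        if t.in_conflict and t.out_conflict:
            return t  # pivot detected: abort this transaction
    return None
```

Because the flags never record *which* transaction caused a conflict, a pivot may be flagged even when no actual cycle exists, which is exactly the false-positive abort noted above.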

 

Write SI

Yabandeh proposed Write-Snapshot Isolation in the paper A Critique of Snapshot Isolation. The author first criticizes Basic SI for being misleading: it suggests that write-write conflict detection is necessary. The paper opens by arguing that preventing the Lost Update anomaly in SI does not actually require preventing WW conflicts; switching to RW detection and allowing WW conflicts not only prevents Lost Update but also achieves Serializable, killing two birds with one stone.

Why is WW detection unnecessary? Briefly: under MVCC, conflicting writers write to different versions, so why must that be a conflict? The anomaly arises only when both transactions read and then write the same data; if one of them only writes, Lost Update cannot occur. In other words, there is no need to detect WW conflicts; RW conflicts are the root cause.

Based on the idea of RW conflict detection, the author proposes Write Snapshot Isolation and names the earlier Snapshot Isolation Read Snapshot Isolation. As shown in the figure below:

  • TXNn and TXNc' conflict, because TXNc' modifies the ReadSet of TXNn;

  • TXNn and TXNc do not conflict; although both modify record r', where Basic SI would report a conflict, Write SI observes that TXNc does not modify the ReadSet of TXNn, so there is no RW conflict.

How is an RW conflict detected? Maintain the ReadSet during the transaction's reads and writes, and at commit check whether that ReadSet has been modified by other transactions. In practice it is not so simple: maintaining a ReadSet is usually more expensive than maintaining a WriteSet, and how should the check itself be done — by adding read locks? So in the original paper the author only explains how a centralized Write SI is implemented (BadgerDB uses this algorithm to implement a transactional KV engine); for a decentralized implementation, some hints can be found in CockroachDB.

Nevertheless, RW detection brings many benefits:

  • a read-only transaction needs no conflict detection, and its StartTS and CommitTS are equal;

  • a write-only transaction needs no conflict detection, since its ReadSet is empty;

  • more importantly, the isolation level this algorithm implements is Serializable, not just Snapshot Isolation.
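A centralized Write SI validator can be sketched by checking the ReadSet instead of the WriteSet at commit; `WriteSIEngine` is an illustrative assumption (this is the idea BadgerDB builds on, not its actual code):

```python
# Sketch of centralized Write SI validation: at commit, check whether any
# row the transaction READ was committed by someone else after its StartTS.
# WW overlaps alone do not abort. Illustrative only.

class WriteSIEngine:
    def __init__(self):
        self.ts = 0
        self.last_commit = {}  # row -> CommitTS of its latest committed write

    def begin(self):
        self.ts += 1
        return self.ts

    def commit(self, start_ts, read_set, write_set):
        # Read-only (empty write_set) and write-only (empty read_set)
        # transactions pass trivially, as the benefits above note.
        for key in read_set:
            if self.last_commit.get(key, 0) > start_ts:
                return None  # RW conflict: our snapshot of this row is stale
        self.ts += 1
        commit_ts = self.ts
        for key in write_set:
            self.last_commit[key] = commit_ts
        return commit_ts
```

Note the contrast with the Basic SI sketch earlier: two blind writers to the same key both commit here, while a transaction that read the key under an old snapshot is aborted.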

Traditionally, serializability could only be achieved with lock-based concurrency control, whose performance problems made it hard to apply in real projects. Serializable SI offers a new path to high-performance serializability.

The content above draws mainly on the Snapshot Isolation overview.

 


 

Closing remarks

 

On its wiki, PostgreSQL explains SSI in one sentence: "Documentation of Serializable Snapshot Isolation (SSI) in PostgreSQL compared to plain Snapshot Isolation (SI). These correspond to the SERIALIZABLE and REPEATABLE READ transaction isolation levels, respectively, in PostgreSQL beginning with version 9.1." When discussing the isolation level implemented by any database product, it is therefore necessary to understand the algorithm behind that implementation.

 

References

1. https://book.douban.com/subject/26851605/

2. https://cs.uwaterloo.ca/~ddbook/

3. https://dl.acm.org/doi/abs/10.1145/568271.223785

4. https://zhuanlan.zhihu.com/p/54979396

5. https://zhuanlan.zhihu.com/p/37087894

6. https://dgraph.io/blog/post/badger-txn/

7. https://wiki.postgresql.org/wiki/SSI
