An Introduction to Transaction Concurrency Control in the Yunxi Distributed Database

1. Overview of transaction concurrency control principles

1.1 Why concurrency control
A database is a shared resource, and many transactions usually run against it at the same time. When multiple transactions access the database concurrently, the same data may be read or modified simultaneously. If these concurrent operations are not controlled, incorrect data may be read or stored and the consistency of the database may be destroyed, so the database management system must provide a concurrency control mechanism.

In the figure below, on the left, the write W(A) of transaction T2 at time t6 overwrites the write of transaction T1 at time t5, so T1's update is lost. On the right, transaction T2 reads a value written by transaction T1 that is later rolled back; the data it reads is invalid, which is called a dirty read. Concurrent execution can also cause non-repeatable reads, phantom reads, and read-write partial ordering anomalies, which are not covered here.

Figure 1: Lost update (left) and dirty read (right)

To improve resource utilization and transaction throughput and to reduce response time, databases allow transactions to execute concurrently. However, when multiple transactions operate on the same object at the same time, conflicts are inevitable: the intermediate state of one transaction may be exposed to others, and some transactions may write incorrect values based on that intermediate state. The database must therefore provide a mechanism that keeps each transaction's execution unaffected by concurrent transactions, so that each user feels as if only their own transactions are running. This property is isolation. Because strict isolation constrains the execution order of transactions, most databases offer options that let users trade part of the isolation for better performance; these options are the transaction isolation levels. They are commonly divided into four levels, from lowest to highest: read uncommitted, read committed, repeatable read, and serializable. The ultimate goal is to arrange concurrently executing transactions into a reasonable, logically serial order, so that their interleaving in time cannot destroy the consistency of the database.

Figure 2 Serial execution sequence of concurrent operations

1.2 Common concurrency control methods
Common concurrency control methods include locks, timestamps, validity checks, snapshots, and multi-version mechanisms. Here we introduce the methods most relevant to the Yunxi database.

1.2.1 Lock

To maximize the concurrency of database transactions, database locks come in two modes: shared locks and exclusive locks. A transaction that acquires a shared lock on a row can only read it, so a shared lock is also called a read lock; a transaction that acquires an exclusive lock on a row can both read and write it, so an exclusive lock is also called a write lock. If a transaction cannot obtain the lock it needs for a row, it waits until other transactions release their lock on that data, after which it can acquire the lock and perform its operation.

Figure 3: Compatibility of exclusive and shared locks

The Two-Phase Locking protocol (2PL) guarantees the serializability of transactions. It divides a transaction's lock acquisition and release into two distinct phases: growing and shrinking. During the growing phase a transaction may acquire locks but may not release any; during the shrinking phase it may only release locks and may not acquire new ones.
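To make the rule concrete, here is a minimal sketch of a per-transaction lock tracker that enforces the two-phase discipline; the names (TxnLocks, Acquire, Release) are illustrative, not Yunxi internals:

    package twopl

    import "errors"

    // LockMode distinguishes shared (read) from exclusive (write) locks.
    type LockMode int

    const (
        Shared LockMode = iota
        Exclusive
    )

    // TxnLocks tracks the locks of a single transaction and enforces the
    // two-phase rule: once any lock is released, no new lock may be acquired.
    type TxnLocks struct {
        held      map[string]LockMode
        shrinking bool // set to true after the first release
    }

    func NewTxnLocks() *TxnLocks {
        return &TxnLocks{held: make(map[string]LockMode)}
    }

    // Acquire is only legal during the growing phase.
    func (t *TxnLocks) Acquire(key string, mode LockMode) error {
        if t.shrinking {
            return errors.New("2PL violation: acquire after first release")
        }
        t.held[key] = mode
        return nil
    }

    // Release moves the transaction into the shrinking phase.
    func (t *TxnLocks) Release(key string) {
        t.shrinking = true
        delete(t.held, key)
    }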

Deadlocks are a familiar problem in multi-threaded programming: once multiple threads or transactions compete for resources, we must consider whether they can deadlock. Whether the directed wait-for graph contains a cycle determines whether a deadlock has occurred. Recovering from a deadlock is straightforward in principle: the most common approach is to pick one transaction in the cycle and roll it back, breaking the cycle in the wait-for graph.


Figure 4: How a deadlock forms
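As a sketch of the idea, deadlock detection reduces to finding a cycle in the directed wait-for graph; the WaitForGraph type below is hypothetical:

    package deadlock

    // WaitForGraph maps each transaction ID to the transactions it is waiting on.
    type WaitForGraph map[string][]string

    // HasCycle reports whether the wait-for graph contains a cycle,
    // i.e. whether a deadlock exists, using depth-first search.
    func (g WaitForGraph) HasCycle() bool {
        const (
            unvisited = 0
            inStack   = 1
            done      = 2
        )
        state := make(map[string]int)

        var visit func(txn string) bool
        visit = func(txn string) bool {
            state[txn] = inStack
            for _, next := range g[txn] {
                if state[next] == inStack {
                    return true // back edge: cycle found
                }
                if state[next] == unvisited && visit(next) {
                    return true
                }
            }
            state[txn] = done
            return false
        }

        for txn := range g {
            if state[txn] == unvisited && visit(txn) {
                return true
            }
        }
        return false
    }

For example, a graph such as {T1: [T2], T2: [T1]} contains a cycle, so one of T1 or T2 would be chosen as the victim and rolled back.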

1.2.2 Timestamp

Timestamp ordering (T/O) assigns a timestamp to each transaction and uses it to determine the order in which transactions are executed: if transaction 1's timestamp is smaller than transaction 2's, the system must make it appear as if transaction 1 executed before transaction 2. Reads and writes take no locks; instead, each record is tagged with the timestamps of the transactions that last wrote and last read it. If a transaction's timestamp is smaller than a record's write timestamp (it would be reading data from its "future"), the transaction must abort and re-execute. Suppose record X carries a write timestamp WTS(X) and a read timestamp RTS(X), and the transaction's timestamp is TTS; visibility is judged as follows:

Read:
TTS < WTS(X): the object is not visible to the transaction; abort the transaction and restart it with a new timestamp.
TTS > WTS(X): the object is visible to the transaction; update RTS(X) = max(TTS, RTS(X)). To satisfy repeatable reads, the transaction keeps a copy of X's value.

Write:
TTS < WTS(X) || TTS < RTS(X): abort the transaction and restart it.
TTS > WTS(X) && TTS > RTS(X): the transaction updates X and sets WTS(X) = TTS.

Its drawbacks: long-running transactions are prone to starvation, because a long transaction has a small timestamp and, after running for a while, is likely to encounter data updated by newer transactions and be forced to abort; and read operations also generate writes (they update RTS).
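The read and write rules above map almost directly to code. The following is a minimal sketch of the basic T/O checks under the stated rules; Record, TryRead, and TryWrite are hypothetical names, not Yunxi APIs:

    package tso

    import "errors"

    // ErrAbort signals that the transaction must abort and restart
    // with a new timestamp.
    var ErrAbort = errors.New("timestamp ordering conflict: abort and restart")

    // Record is one data item annotated with the timestamps of the
    // last transaction that wrote it (WTS) and read it (RTS).
    type Record struct {
        Value []byte
        WTS   uint64
        RTS   uint64
    }

    // TryRead applies the read rule: a transaction may not read a value
    // written in its "future".
    func (r *Record) TryRead(tts uint64) ([]byte, error) {
        if tts < r.WTS {
            return nil, ErrAbort
        }
        if tts > r.RTS {
            r.RTS = tts // reads also write the RTS, one of the noted drawbacks
        }
        return r.Value, nil
    }

    // TryWrite applies the write rule: the write must not invalidate an
    // earlier read or be overtaken by a later write.
    func (r *Record) TryWrite(tts uint64, value []byte) error {
        if tts < r.WTS || tts < r.RTS {
            return ErrAbort
        }
        r.Value = value
        r.WTS = tts
        return nil
    }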

1.2.3 Multi-version concurrency control

The database maintains multiple physical versions of each record. A write creates a new version of the data, and a read obtains the latest version that already existed at the snapshot taken at the start of the transaction or statement. The most immediate benefits are that writes do not block reads, reads do not block writes, and read requests never fail due to conflicts (as in single-version T/O) or wait (as in single-version 2PL). Since read requests usually far outnumber writes, almost all mainstream databases have adopted this technique.

Figure 5 Multi-version concurrent read and write operations
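A minimal sketch of multi-version reads and writes, assuming the versions of a key are kept ordered by timestamp (the VersionChain type and its methods are illustrative only):

    package mvcc

    // Version is one physical version of a key, written at Timestamp.
    type Version struct {
        Timestamp uint64
        Value     []byte
    }

    // VersionChain holds the versions of a single key, ordered from
    // oldest to newest.
    type VersionChain []Version

    // Write appends a new version; writers never overwrite older versions,
    // so readers at older snapshots are not blocked.
    func (c VersionChain) Write(ts uint64, value []byte) VersionChain {
        return append(c, Version{Timestamp: ts, Value: value})
    }

    // ReadAt returns the newest version visible at the snapshot timestamp,
    // i.e. the latest version with Timestamp <= snapshotTS.
    func (c VersionChain) ReadAt(snapshotTS uint64) ([]byte, bool) {
        for i := len(c) - 1; i >= 0; i-- {
            if c[i].Timestamp <= snapshotTS {
                return c[i].Value, true
            }
        }
        return nil, false // no version existed at that snapshot
    }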

 

2. Yunxi database concurrency control mechanism
The Yunxi database adopts a Percolator-style concurrency control model. Values are not written directly to the storage layer; instead, every write is first recorded in a temporary state called a "Write Intent", which carries an extra field identifying the transaction record the value belongs to. Whenever an operation encounters a write intent, it looks up the state of that transaction record to decide how to handle the intent's value.

2.1 Transaction records
To track the status of an executing transaction, a transaction record is written to the KV store, and the transaction's write intents point to this record, which allows any operation that encounters an intent to check it. This is essential for supporting concurrency in a distributed environment. A transaction record is in one of the following states:

  1. PENDING: The initial state, indicating that the transaction that wrote the intent is still in progress.
  2. COMMITTED: The transaction has completed, and its values can be read.
  3. STAGING: Used to enable the parallel commit feature. Depending on the state of the write intents referenced by this record, the transaction may or may not be committed.
  4. ABORTED: The transaction failed or was aborted by the client.
  5. Record does not exist: If a transaction encounters a write intent whose transaction record does not exist, it uses the intent's timestamp to decide how to proceed. If the intent's timestamp is within the transaction liveness threshold, the transaction is treated as PENDING; otherwise it is treated as ABORTED.
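The five cases above can be sketched as follows; TxnStatus, StatusForIntent, and the liveness-threshold value are illustrative stand-ins for the real Yunxi structures:

    package txnrecord

    import "time"

    // TxnStatus enumerates the states a transaction record can be in.
    type TxnStatus int

    const (
        Pending TxnStatus = iota
        Committed
        Staging
        Aborted
    )

    // livenessThreshold is the window within which a transaction without a
    // record is still presumed to be alive (the value here is illustrative).
    const livenessThreshold = 5 * time.Second

    // StatusForIntent decides how to treat a write intent. If the transaction
    // record exists, its status is authoritative; if it is missing, the intent's
    // own timestamp is compared against the liveness threshold.
    func StatusForIntent(record *TxnStatus, intentTime, now time.Time) TxnStatus {
        if record != nil {
            return *record
        }
        if now.Sub(intentTime) <= livenessThreshold {
            return Pending // the writer may still be running
        }
        return Aborted // the writer is presumed dead
    }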

2.2 Write Intent
Write intents are essentially multi-version concurrency control (MVCC) values, explained in more depth in the storage layer, with an extra field identifying the transaction record the value belongs to. Think of them as a combination of a replicated lock and a replicated provisional value.

Whenever an operation encounters a write intent (rather than an ordinary MVCC value), it looks up the state of the intent's transaction record to decide how to handle the value. If the transaction record is missing, the operation checks the intent's timestamp to evaluate whether it has expired.

Whenever an operation encounters a write intent for a key, it tries to "resolve" it; the outcome depends on the state of the intent's transaction record:

  1. COMMITTED: The operation reads the write intent and converts it into an ordinary MVCC value by removing the intent's pointer to the transaction record.
  2. ABORTED: The write intent is ignored and removed.
  3. PENDING: There is a transaction conflict that must be resolved.
  4. STAGING: The operation checks whether the transaction record's heartbeat is still live; if it is, the operation must wait.
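Putting the four outcomes together, an operation that encounters a write intent might dispatch on the record's status roughly as below; ResolveIntent and the Action values are hypothetical names used only for illustration:

    package intents

    import "fmt"

    // Action describes what the reader does with an encountered write intent.
    type Action int

    const (
        ConvertToMVCCValue Action = iota // COMMITTED: strip the txn pointer
        DiscardIntent                    // ABORTED: ignore and remove the intent
        HandleConflict                   // PENDING: enter conflict resolution
        WaitOnHeartbeat                  // STAGING: wait while the record is live
    )

    // TxnStatus mirrors the transaction-record states described above.
    type TxnStatus int

    const (
        Pending TxnStatus = iota
        Committed
        Staging
        Aborted
    )

    // ResolveIntent maps the intent's transaction-record status to an action.
    func ResolveIntent(status TxnStatus) (Action, error) {
        switch status {
        case Committed:
            return ConvertToMVCCValue, nil
        case Aborted:
            return DiscardIntent, nil
        case Pending:
            return HandleConflict, nil
        case Staging:
            return WaitOnHeartbeat, nil
        default:
            return 0, fmt.Errorf("unknown transaction status: %d", status)
        }
    }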

2.3 Conflict resolution
2.3.1 Write conflict

1. If the transactions have explicit priorities, the priorities are compared to decide the push: the higher-priority transaction forces the lower-priority transaction to roll back.

2. If there are no priorities, the transaction with the larger timestamp enters the wait queue of the transaction with the smaller timestamp and waits for that transaction's record to become COMMITTED or ABORTED.

2.3.2 Write and read conflicts
1. If the transactions have explicit priorities, the priorities are compared to decide the push: the higher-priority transaction pushes the lower-priority transaction's timestamp up past its own timestamp.

2. If there are no priorities, the transaction with the larger timestamp enters the wait queue of the transaction with the smaller timestamp and waits for that transaction's record to become COMMITTED or ABORTED.

2.3.3 Write after read

When a read operation reads a value, its read timestamp is stored in a timestamp cache, which records the high-water mark of reads on that key. When a write operation arrives, its timestamp is checked against this cache; if the write transaction's timestamp is below the latest value in the cache, a write-after-read conflict has occurred. The write transaction's timestamp is pushed forward, which may force the transaction to refresh its reads or restart.
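A minimal sketch of the timestamp cache described above, tracking a per-key read high-water mark and pushing lagging write timestamps forward (the names are illustrative):

    package tscache

    // TimestampCache records, for each key, the highest timestamp at which
    // the key has been read: the read high-water mark.
    type TimestampCache struct {
        maxReadTS map[string]uint64
    }

    func New() *TimestampCache {
        return &TimestampCache{maxReadTS: make(map[string]uint64)}
    }

    // RecordRead bumps the high-water mark for a key after a read.
    func (c *TimestampCache) RecordRead(key string, readTS uint64) {
        if readTS > c.maxReadTS[key] {
            c.maxReadTS[key] = readTS
        }
    }

    // ClampWriteTS pushes a write timestamp forward if it falls below the
    // read high-water mark, so the write cannot invalidate an earlier read.
    // The caller may then need to refresh or restart the transaction.
    func (c *TimestampCache) ClampWriteTS(key string, writeTS uint64) (newTS uint64, pushed bool) {
        if hw, ok := c.maxReadTS[key]; ok && writeTS <= hw {
            return hw + 1, true
        }
        return writeTS, false
    }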

 

3. Yunxi Database Concurrency Controller
3.1 Why build a concurrency controller?

  1. It centralizes request synchronization and transaction conflict handling in one place, so these concerns can be documented, understood, and tested in isolation.
  2. It simplifies transaction queuing, reduces how often transactions need to issue push RPCs, and lets waiters proceed immediately after intent resolution.
  3. It creates a locking framework that can support KV-level SELECT FOR UPDATE and SELECT FOR SHARE.
  4. It provides stronger fairness guarantees when transactions conflict, reducing tail latency under contention.

3.2 Basic structure of concurrent controller

The concurrency manager is a structure that orders incoming requests and provides isolation between the transactions issuing those requests when they perform conflicting operations. During sequencing, conflicts are discovered and resolved through a combination of passive queuing and active pushing. Once a request has been sequenced, it is free to evaluate without worrying about conflicts with other requests. This isolation is guaranteed for the lifetime of the request but ends when the request completes.

Nevertheless, some isolation must outlive individual requests: a transaction's requests need to stay isolated from other transactions' requests even after each request completes (assuming it acquired locks), for as long as the transaction itself is running.

The core consists of two parts: the latch manager and the lock table.

  1. The latch manager sequences incoming requests and guarantees their isolation while they execute.
  2. The lock table provides locking and sequencing of requests. It is an in-memory, per-node data structure holding the set of locks acquired by in-progress transactions. The locking mechanism is compatible with write intents, so when a request discovers an external lock (a write intent) during evaluation, that information is imported into the lock table.

3.3 Concurrent Controller Control Process

  1. SequenceReq acquires latches for the request, ensuring it does not conflict with other in-flight requests, and checks the in-memory lock table; if there is a conflict, it releases the latches and waits for the corresponding lock.
  2. The request is then evaluated normally, and after it applies, the locks it acquired are added to the lock table.
  3. When the request completes, FinishReq releases its latches so that other requests can proceed.
  4. Finally, after the transaction commits or rolls back and its intents are resolved, the locks it held in the lock table are released and other requests waiting on that transaction are woken up.
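The four steps above can be sketched against hypothetical interfaces (SequenceReq, FinishReq, a latch guard, a lock table); this shows the shape of the flow rather than the actual Yunxi API:

    package concurrency

    import "context"

    // Request, Guard, and Result stand in for the real request, latch-guard,
    // and evaluation-result types; they are placeholders for illustration only.
    type Request struct{ Keys []string }
    type Guard struct{}
    type Result struct{}

    // Manager is a minimal view of the concurrency manager's surface.
    type Manager interface {
        // SequenceReq acquires latches and checks the in-memory lock table;
        // it blocks (releasing latches) while conflicting locks are held.
        SequenceReq(ctx context.Context, req *Request) (*Guard, error)
        // FinishReq drops the request's latches so later requests can proceed.
        FinishReq(g *Guard)
    }

    // LockTable is a minimal view of the in-memory lock table.
    type LockTable interface {
        AcquireLocks(g *Guard, keys []string) // step 2: record the txn's locks
        ReleaseLocks(txnID string)            // step 4: after intent resolution
    }

    // runRequest wires the steps together: sequence, evaluate, record locks,
    // and release latches. Lock release happens later, at commit/rollback time.
    func runRequest(ctx context.Context, m Manager, lt LockTable, req *Request,
        evaluate func(*Request) (*Result, error)) (*Result, error) {

        g, err := m.SequenceReq(ctx, req) // step 1: latches + lock table check
        if err != nil {
            return nil, err
        }
        defer m.FinishReq(g) // step 3: release latches when the request is done

        res, err := evaluate(req) // step 2: normal evaluation
        if err != nil {
            return nil, err
        }
        lt.AcquireLocks(g, req.Keys) // step 2 (cont.): record this txn's locks
        return res, nil
    }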

3.4 latch manager

The latch manager sequences incoming requests and provides isolation between them under the supervision of the concurrency manager. A latch behaves like a low-level, short-duration mutex.

How it works:

       1. Write requests to a range are serialized by the range's leaseholder and placed in a definite order.

       2. To enforce this serialization, the leaseholder uses latches to provide uncontended access to the keys being written.

       3. Other requests arriving at the leaseholder for the same set of keys held by a latch must acquire the latch before continuing.

       4. Read requests also acquire latches; multiple read requests may hold latches on the same key at the same time (they are compatible), but read latches and write latches are mutually incompatible.

Another way to look at a latch is as a mutex that is needed only for the duration of a single low-level request. To coordinate longer-running, higher-level operations (i.e. client transactions), the persistent write-intent system is used.
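As a rough sketch, a per-key latch behaves like a reader/writer lock held for the duration of one request; the Manager type below is illustrative and ignores key spans:

    package latch

    import "sync"

    // keyLatch provides read/write latching for a single key: multiple read
    // latches may be held at once, while a write latch is exclusive. It is
    // deliberately just a reader/writer lock held for one request's duration.
    type keyLatch struct{ mu sync.RWMutex }

    // Manager hands out per-key latches. Real latch managers work on key
    // spans and interval trees; a flat map is enough for illustration.
    type Manager struct {
        mu      sync.Mutex
        latches map[string]*keyLatch
    }

    func NewManager() *Manager {
        return &Manager{latches: make(map[string]*keyLatch)}
    }

    func (m *Manager) latchFor(key string) *keyLatch {
        m.mu.Lock()
        defer m.mu.Unlock()
        l, ok := m.latches[key]
        if !ok {
            l = &keyLatch{}
            m.latches[key] = l
        }
        return l
    }

    // AcquireRead blocks while a write latch is held; concurrent readers share.
    func (m *Manager) AcquireRead(key string) func() {
        l := m.latchFor(key)
        l.mu.RLock()
        return l.mu.RUnlock
    }

    // AcquireWrite blocks until no other latch (read or write) is held.
    func (m *Manager) AcquireWrite(key string) func() {
        l := m.latchFor(key)
        l.mu.Lock()
        return l.mu.Unlock
    }

A caller would acquire and release around a single request, e.g. release := m.AcquireWrite("k"); defer release().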

3.5 lock table

The lock table is an in-memory, per-node data structure holding the set of locks acquired by in-progress transactions. Each lock has a wait queue associated with it, in which transactions waiting for the lock to be released are queued. Entries in the locally stored lockWaitQueue are propagated via RPC into the TxnWaitQueue, which lives on the leader node of the Raft group containing the lock holder's transaction record.

Not all locks are stored directly under the lock table's control, so not all locks are discoverable during sequencing. In particular, write intents (replicated, exclusive locks) are stored inline in the MVCC keyspace, so they are not detected until a request is evaluated. To accommodate this form of lock storage, information about such external locks is folded into the concurrency manager's structures when they are discovered.

A lock can outlive the request that acquired it: it extends the isolation provided on a particular key to the lifetime of the lock holder's transaction. Locks are (usually) released only when the transaction commits or aborts.

Currently, the concurrency manager operates on a non-replicated lock table structure.
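A minimal sketch of the lock table idea, assuming one holder and one wait queue per key (the real structure also tracks lock strength, key spans, and discovered intents; all names here are hypothetical):

    package locktable

    // waiter is a request (identified by its transaction) parked in a lock's
    // wait queue until the lock is released.
    type waiter struct {
        txnID string
        done  chan struct{} // closed when the waiter may retry
    }

    // lockState is the per-key entry: which transaction holds the lock and
    // who is queued behind it.
    type lockState struct {
        holderTxnID string
        waitQueue   []*waiter
    }

    // LockTable is an in-memory map from key to lock state.
    type LockTable struct {
        locks map[string]*lockState
    }

    func New() *LockTable {
        return &LockTable{locks: make(map[string]*lockState)}
    }

    // Acquire records that txnID holds the lock on key, or queues the caller
    // behind the current holder and returns a channel to wait on.
    func (t *LockTable) Acquire(key, txnID string) (wait <-chan struct{}, acquired bool) {
        ls, ok := t.locks[key]
        if !ok {
            ls = &lockState{}
            t.locks[key] = ls
        }
        if ls.holderTxnID == "" || ls.holderTxnID == txnID {
            ls.holderTxnID = txnID
            return nil, true
        }
        w := &waiter{txnID: txnID, done: make(chan struct{})}
        ls.waitQueue = append(ls.waitQueue, w)
        return w.done, false
    }

    // Release frees every lock held by txnID (on commit or abort) and wakes
    // the transactions queued behind it.
    func (t *LockTable) Release(txnID string) {
        for _, ls := range t.locks {
            if ls.holderTxnID != txnID {
                continue
            }
            ls.holderTxnID = ""
            for _, w := range ls.waitQueue {
                close(w.done)
            }
            ls.waitQueue = nil
        }
    }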

3.6 TxnWaitQueue

The TxnWaitQueue tracks all transactions that could not push the transaction blocking them and must wait for that blocking transaction to complete before continuing. It is a data structure keyed by the blocking transaction's ID.

Importantly, all of this activity happens on a single node: the leader of the Raft group for the range containing the blocking transaction's record.

Once the blocking transaction is resolved, a signal is sent to the TxnWaitQueue, allowing all transactions blocked by it to begin executing.

Blocked transactions also check their own status to make sure they are still active; if a blocked transaction has itself been aborted, it is simply removed from the queue.

If there is a deadlock between transactions (i.e. they are each blocked by each other's Write Intents), one of the transactions is randomly aborted.

3.7 lockTableWaiter

The lockTableWaiter handles waiting on conflicting transactions that hold locks in the lock wait queues; it ensures that requests can still make progress in the event of a transaction coordinator crash or a deadlock.

The waiter implements the logic of a request waiting for conflicting locks in the lock table to be released, and likewise of waiting for conflicting requests that are ahead of the caller in the lock wait queues the caller has joined.

While waiting, it responds to a set of state transitions in the lock table:

1. Conflicting locks are released

2. Conflicting locks are updated so that they no longer conflict

3. Conflicting requests in the lock wait queue acquire the lock

4. Conflicting requests in the lock wait queue leave the lock wait queue

These state transitions are usually reactive: a waiter can simply wait for locks to be released or for queues to be exited by other participants.

LockManager supports reacting to state transitions of conflicting locks

The RequestSequencer interface supports reacting to state transitions of conflicting lock wait-queues

However, in the event of a transaction coordinator failure or a transaction deadlock, the awaited state transition may never happen without intervention from the waiter. To ensure forward progress, the waiter may need to actively push either the conflicting lock holder or the head of the conflicting lock wait queue. This push requires an RPC to the leaseholder of the range containing the conflicting transaction's record, and it usually ends up queued in that leaseholder's TxnWaitQueue. Because this is expensive, the push is not issued immediately; it is issued only after a delay.
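The delayed push can be sketched as a timer raced against the lock-state change; pushDelay and pushTransaction below are hypothetical stand-ins for the real configuration and RPC:

    package waiter

    import (
        "context"
        "time"
    )

    // pushDelay is how long a waiter sits passively before resorting to an
    // expensive push RPC (the actual value is configuration-dependent).
    const pushDelay = 50 * time.Millisecond

    // waitOrPush waits for the conflicting lock's state to change; only if the
    // delay elapses first does it push the lock holder's transaction record.
    // lockReleased is closed by the lock table on a relevant state transition;
    // pushTransaction stands in for the RPC to the record's leaseholder.
    func waitOrPush(ctx context.Context, lockReleased <-chan struct{},
        pushTransaction func(context.Context) error) error {

        timer := time.NewTimer(pushDelay)
        defer timer.Stop()

        select {
        case <-lockReleased:
            return nil // the passive path: the conflict resolved itself
        case <-timer.C:
            // Deadlock or coordinator failure may mean no transition ever
            // comes; actively push the conflicting transaction.
            return pushTransaction(ctx)
        case <-ctx.Done():
            return ctx.Err()
        }
    }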
