[SequoiaDB] Sequoia Tech | Introduction to the Implementation Principles of SequoiaDB Distributed Transactions

1

Distributed transaction background

As distributed database technology has matured, industry expectations have shifted: a distributed database was once expected only to store massive volumes of data and serve read-oriented, peripheral workloads, but it is now being moved into core transaction processing. To support core-account workloads of this kind, a distributed database must strengthen its distributed transactions until they are on par with those of a traditional relational database. In other words, a distributed transaction must satisfy the standard definition of a transaction and provide the ACID properties, just as a transaction in a traditional relational database does.

A distributed database stores its data across multiple nodes on multiple machines, and this storage architecture makes distributed transactions much harder to implement. When a transaction operates on data, its operations are executed at whichever storage locations hold that data, and those locations may be different disks on different machines across the network.

 

2

The basic concept of a transaction

2.1 Transaction usage scenarios

A banking application is the classic example of why transactions are needed. Suppose the bank's database has two tables: a checking account table (check) and a savings account table (save). To transfer 200 yuan from Li Lei's checking account to the same customer's savings account, at least three steps are required:

1. Check that the checking account balance is greater than 200 yuan;

2. Subtract 200 yuan from the checking account balance;

3. Add 200 yuan to the savings account balance;

All of these operations must be wrapped in a single transaction, so that if any step fails, every step already completed is rolled back. A transaction usually begins with a START TRANSACTION statement; it then either makes its changes permanent with COMMIT or undoes them with ROLLBACK. A sample transaction in SQL looks like this:

START TRANSACTION;
SELECT balance FROM check WHERE customer_id = 10233276;
UPDATE check SET balance = balance - 200.00 WHERE customer_id = 10233276;
UPDATE save SET balance = balance + 200.00 WHERE customer_id = 10233276;
COMMIT;
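The same transfer can be tried out locally. The following is a minimal sketch using Python's standard sqlite3 module rather than SequoiaDB. The table names checking and savings, the integer balances, the helper names, and the fail_after_debit switch are all assumptions made for illustration (CHECK is a reserved word in SQLite, so the table names differ from the SQL above).

```python
import sqlite3

def open_bank():
    # isolation_level=None disables sqlite3's implicit transactions, so
    # BEGIN / COMMIT / ROLLBACK are issued explicitly by the code below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE checking (customer_id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE savings (customer_id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO checking VALUES (10233276, 500)")
    conn.execute("INSERT INTO savings VALUES (10233276, 0)")
    return conn

def balances(conn, customer_id):
    """Return (checking balance, savings balance) for one customer."""
    c = conn.execute("SELECT balance FROM checking WHERE customer_id = ?", (customer_id,)).fetchone()[0]
    s = conn.execute("SELECT balance FROM savings WHERE customer_id = ?", (customer_id,)).fetchone()[0]
    return c, s

def transfer(conn, customer_id, amount, fail_after_debit=False):
    """Move amount from checking to savings; return True only if committed."""
    conn.execute("BEGIN")
    try:
        row = conn.execute("SELECT balance FROM checking WHERE customer_id = ?", (customer_id,)).fetchone()
        if row is None or row[0] < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE checking SET balance = balance - ? WHERE customer_id = ?", (amount, customer_id))
        if fail_after_debit:
            # Hypothetical fault injected between the two updates.
            raise RuntimeError("simulated failure")
        conn.execute("UPDATE savings SET balance = balance + ? WHERE customer_id = ?", (amount, customer_id))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")   # undo every step completed so far
        return False
```

Calling transfer(conn, 10233276, 200, fail_after_debit=True) leaves both balances untouched, because the debit is rolled back together with the rest of the transaction.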

Bank transfers like this one must run inside a transaction, and transactions in a real production environment are far more complex than this example.

2.2 Transaction concepts and features

A transaction is a sequence of operations that access and modify data items in a database, for example a combination of SQL create, read, update, and delete (CRUD) operations. It is usually delimited by begin-transaction and end-transaction statements.

A transactional database system must provide the following properties:

  • Atomicity: the operations in a transaction either all succeed or all fail.

  • Consistency: the integrity of the data must be preserved before and after the transaction.

  • Isolation: when multiple users access the database concurrently, the transaction the database opens for each user must not be disturbed by the operations of other transactions. Each transaction behaves as if no other transaction were running at the same time.

  • Durability: once a transaction completes successfully, its changes to the database are permanent and survive a system failure.

Transaction isolation levels

For transaction isolation, the SQL standard defines four isolation levels. Each comes with specific rules that determine which changes are visible inside and outside a transaction and which are not. The four isolation levels are:

  • READ UNCOMMITTED

At the READ UNCOMMITTED isolation level, every transaction can "see" the results of uncommitted transactions. Reading uncommitted data is also called a "dirty read."

  • READ COMMITTED

READ COMMITTED is the default isolation level of most database systems. It satisfies the simple definition of isolation given earlier: a transaction "sees" only changes that other transactions have already committed, and its own changes are invisible to others until it commits. This level does not support repeatable reads, which means that running the same statement twice within a transaction can return different results.

  • REPEATABLE READ

REPEATABLE READ solves the problems that READ UNCOMMITTED and READ COMMITTED allow. It guarantees that rows read by a transaction look the same when that transaction reads them again. In theory, however, it still allows another tricky problem: the phantom read. Put simply, a phantom read occurs when a transaction reads a range of rows, another transaction inserts a new row into that range, and the first transaction then reads the range again and finds a new "phantom" row. Storage engines such as MySQL's InnoDB and Falcon solve the phantom read problem with multiversion concurrency control (MVCC).

  • SERIALIZABLE

SERIALIZABLE is the highest isolation level. It forces transactions to be ordered so that they cannot conflict with one another, which solves the phantom read problem. In short, SERIALIZABLE places a lock on every row it reads. At this level, many timeouts and heavy lock contention can occur, so applications rarely choose it. It may still be the right choice when an application needs data stability strongly enough to accept reduced concurrency.
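The difference between READ COMMITTED and REPEATABLE READ can be illustrated with a toy model. The sketch below is not how any real engine works; the Store, ReadCommittedTxn, and RepeatableReadTxn classes are hypothetical names, and the store only models committed values for single keys.

```python
class Store:
    """Toy store that holds only committed values."""
    def __init__(self, data):
        self.committed = dict(data)

    def commit(self, key, value):
        # Stands in for another transaction committing a change.
        self.committed[key] = value

class ReadCommittedTxn:
    """Each read returns the latest committed value at the time of the read."""
    def __init__(self, store):
        self.store = store

    def read(self, key):
        return self.store.committed[key]

class RepeatableReadTxn:
    """Remembers what it has read, so re-reading a key returns the same value."""
    def __init__(self, store):
        self.store = store
        self.seen = {}

    def read(self, key):
        if key not in self.seen:
            self.seen[key] = self.store.committed[key]
        return self.seen[key]

store = Store({"balance": 500})
rc, rr = ReadCommittedTxn(store), RepeatableReadTxn(store)
first_rc, first_rr = rc.read("balance"), rr.read("balance")
store.commit("balance", 300)      # a concurrent transaction commits
second_rc, second_rr = rc.read("balance"), rr.read("balance")
```

Here the READ COMMITTED transaction observes two different values for the same statement, while the REPEATABLE READ transaction observes the same value both times.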

 

3

Distributed Transaction

Implementing distributed transactions means guaranteeing atomicity, consistency, isolation, and durability. The basic technical approach to achieving these ACID properties is:

  • The two-phase commit (2PC) protocol provides atomicity, consistency, and durability;

  • Isolation levels are usually guaranteed by multiversion concurrency control, most commonly implemented with snapshot isolation.

These two concepts are introduced briefly below.

3.1 Two-phase commit

Two-phase commit (2PC) is a protocol designed to keep all nodes in a distributed system consistent when a transaction commits.

The two-phase commit algorithm rests on the following assumptions:

  • One node in the distributed system acts as the transaction coordinator and the other nodes act as transaction managers (commonly called participants), and the nodes can communicate with one another over the network.

  • Every node uses write-ahead logging (WAL), and once written the log is kept on reliable storage, so log data is not lost even if the node is damaged.

  • No node is damaged permanently; a damaged node can always be recovered.

The two phases of the algorithm are described below.

Phase one (commit-request phase)

The transaction coordinator asks every transaction manager node whether it can commit, then waits for their responses. Each transaction manager node executes all transactional operations up to the point of the inquiry and writes Undo and Redo information to its log.

Each transaction manager node then responds to the coordinator's inquiry. If its operations actually succeeded, it returns an "agree" message; if they failed, it returns an "abort" message. Phase one is sometimes called the voting phase, because each transaction manager votes on whether to proceed with the commit.

Phase two (commit-execution phase)

Success

When the coordinator has received an "agree" response from every transaction manager node:

1. The coordinator sends a "commit" request to all transaction manager nodes.

2. Each transaction manager node completes the operation and releases the resources it held during the transaction.

3. Each transaction manager node sends a "done" message to the coordinator.

4. After receiving the "done" message from every transaction manager node, the coordinator completes the transaction.

Failure

If any transaction manager node returned "abort" in phase one, or if the coordinator could not collect responses from all transaction manager nodes before the phase-one inquiry timed out:

1. The coordinator sends a "rollback" request to all transaction manager nodes.

2. Each transaction manager node rolls back using the Undo information written earlier and releases the resources it held during the transaction.

3. Each transaction manager node sends a "rollback complete" message to the coordinator.

4. After receiving the "rollback complete" message from every transaction manager node, the coordinator cancels the transaction.

Phase two is sometimes called the completion phase, because whatever the outcome, the coordinator must end the current transaction in this phase.
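The two phases can be sketched as a small in-process simulation. This is only an illustration of the message flow described above, not SequoiaDB code: the Participant class, the Vote enum, and the responds flag (which models a vote that never arrives before the timeout) are hypothetical, and the write-ahead log is reduced to a Python list.

```python
from enum import Enum

class Vote(Enum):
    AGREE = "agree"
    ABORT = "abort"

class Participant:
    """Stands in for a transaction manager node."""
    def __init__(self, name, can_commit=True, responds=True):
        self.name = name
        self.can_commit = can_commit
        self.responds = responds   # False models a vote lost to a timeout
        self.log = []              # stand-in for the write-ahead log
        self.state = "init"

    def prepare(self):
        """Phase one: do the work, log Undo/Redo information, then vote."""
        if not self.responds:
            return None
        self.log.append("undo/redo")
        self.state = "prepared"
        return Vote.AGREE if self.can_commit else Vote.ABORT

    def commit(self):
        self.state = "committed"
        return "done"

    def rollback(self):
        self.state = "rolled back"
        return "rollback complete"

def two_phase_commit(participants):
    """Run both phases as the coordinator and return the outcome."""
    # Phase one: collect votes; a missing vote (None) counts as a timeout.
    votes = [p.prepare() for p in participants]
    if all(v is Vote.AGREE for v in votes):
        # Phase two, success path: everyone commits and acknowledges.
        for p in participants:
            p.commit()
        return "committed"
    # Phase two, failure path: any negative or missing vote rolls back all.
    for p in participants:
        p.rollback()
    return "aborted"
```

A single negative vote, or a single missing vote, is enough to roll back every node, which is what gives the protocol its all-or-nothing behaviour.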

[Figure: communication flow between the transaction coordinator and the transaction managers]

The biggest drawback of two-phase commit is that nodes block while it runs. While a node waits for messages from the others, it can do nothing else. In particular, if a node holding a resource blocks while waiting for another node's response, any third node that tries to access that resource is blocked along with it.

In addition, once the coordinator has instructed the transaction manager nodes to commit (or to perform another operation), a crash on one of those nodes can leave the coordinator unable ever to collect all the responses. The coordinator can then only fall back on its own timeout mechanism. When a timeout fires, the coordinator usually instructs the transaction managers to roll back, which is a rather conservative strategy.

 

3.2 Snapshot isolation

Snapshot isolation is one technique for implementing multiversion concurrency control. Its prerequisite is that the database keeps a version for each data item: every write operation (update, insert, delete) produces a new version of the data once its transaction commits successfully. The key point is that a new version comes into existence only after a successful commit; until then, the changes have no effect.

So what is a snapshot? Simply put, it is the set of the latest versions of all data in the database at a particular time. For example, suppose a database holds only three records, and at time T1 their state is as follows:

 

That is, row1 [version = 10], row2 [version = 1], and row3 [version = 19] form the snapshot of the database at time T1. A few minutes later, at time T2, if no write committed successfully between T1 and T2, the state of the database has not changed, so the snapshots at T1 and T2 are identical. A few minutes later again, at time T3, suppose that between T2 and T3 an update of row2, a delete of row3, and an insert of row4 all committed successfully. The state of the database becomes:

​ 

That is, row1 [version = 10], row2 [version = 2], row3 [version = 20], and row4 [version = 1] form the snapshot of the database at time T3. Note how the versions changed: each record's version advances independently of the others.

Note also that under multiversioning the delete does not physically remove row3; it produces a new version of row3. A real database will not necessarily set the value to null as in the example above; it may instead use a special flag to mark the version as a "delete" version.

A snapshot is always tied to a specific time; talking about a snapshot without a time makes no sense. If no write commits during some period, the database snapshot is the same at every moment within that period. We can therefore say that every successfully committed transaction that contains a write produces a different snapshot of the database. Many implementations use timestamps directly as versions, rather than the small numbers used in the example above.
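The T1/T2/T3 example can be reproduced with a small versioned store. This is a sketch under assumptions: the VersionedStore class, the numeric times (0, 100, 200, 250, 300), and the stored values are invented for illustration; only the row names and version numbers come from the example above.

```python
class VersionedStore:
    def __init__(self):
        # row id -> list of (commit_time, version, value, deleted), oldest first
        self.history = {}

    def commit_write(self, commit_time, row, version, value=None, deleted=False):
        """Record a new version; callers must commit in increasing time order."""
        self.history.setdefault(row, []).append((commit_time, version, value, deleted))

    def snapshot(self, at_time):
        """Map each row to its newest version committed at or before at_time."""
        snap = {}
        for row, versions in self.history.items():
            visible = [v for v in versions if v[0] <= at_time]
            if visible:
                snap[row] = visible[-1][1]
        return snap

    def is_deleted(self, row, at_time):
        """True if the newest visible version of row is a delete marker."""
        visible = [v for v in self.history.get(row, []) if v[0] <= at_time]
        return bool(visible) and visible[-1][3]

T1, T2, T3 = 100, 200, 300
store = VersionedStore()
store.commit_write(0, "row1", 10, "a")
store.commit_write(0, "row2", 1, "b")
store.commit_write(0, "row3", 19, "c")
# Writes committed between T2 and T3: update row2, delete row3, insert row4.
store.commit_write(250, "row2", 2, "b2")
store.commit_write(250, "row3", 20, deleted=True)
store.commit_write(250, "row4", 1, "d")
```

The delete of row3 appears in the T3 snapshot as a new version flagged as deleted, not as a missing row.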

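When a transaction starts, it records the current time as its start timestamp (Start-Timestamp). The transaction can read only the data snapshot as of that start timestamp. When a transaction commits, it records the current time as its commit timestamp (Commit-Timestamp); once it has committed successfully, a data snapshot at Commit-Timestamp is formed. Only transactions that start afterwards can see the data this transaction wrote (if it wrote any).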

 

​ 

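In the figure above, the three horizontal lines represent three transactions. Transaction T2 cannot see any data written by transaction T1, because T1 had not yet committed when T2 started. Transaction T3 can see the data written by both T1 and T2, because both had committed by the time it started.

Snapshot isolation relies on a locking mechanism to prevent write conflicts; read operations take no locks. If several transactions write the same data item at the same time, the locking mechanism guarantees that at most one of them can commit successfully. Because reads take no locks, snapshot isolation improves performance significantly.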
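The visibility rule and the write-conflict rule can be sketched together. The code below is a toy model, not SequoiaDB's implementation: SnapshotDB and Txn are hypothetical names, a logical counter stands in for real timestamps, and the extra check that rejects a write when the key changed after the transaction started is a common companion to the write lock that is assumed here rather than taken from the text.

```python
import itertools

class SnapshotDB:
    def __init__(self):
        self.clock = itertools.count(1)   # logical clock standing in for time
        self.versions = {}                # key -> list of (commit_ts, value)
        self.write_locks = {}             # key -> transaction writing it

    def begin(self):
        return Txn(self, next(self.clock))

class Txn:
    def __init__(self, db, start_ts):
        self.db, self.start_ts = db, start_ts
        self.writes = {}

    def read(self, key):
        """Reads take no lock and see only versions committed by start_ts."""
        if key in self.writes:            # a transaction sees its own writes
            return self.writes[key]
        visible = [v for ts, v in self.db.versions.get(key, []) if ts <= self.start_ts]
        return visible[-1] if visible else None

    def write(self, key, value):
        committed = self.db.versions.get(key, [])
        if committed and committed[-1][0] > self.start_ts:
            raise RuntimeError("write conflict: key changed after start")
        holder = self.db.write_locks.setdefault(key, self)
        if holder is not self:
            raise RuntimeError("write conflict: key is locked by another writer")
        self.writes[key] = value

    def commit(self):
        commit_ts = next(self.db.clock)
        for key, value in self.writes.items():
            self.db.versions.setdefault(key, []).append((commit_ts, value))
            del self.db.write_locks[key]  # release the write lock
        return commit_ts

db = SnapshotDB()
t1 = db.begin()
t1.write("x", "from T1")
t2 = db.begin()                 # T2 starts before T1 commits
t2.write("y", "from T2")
t1.commit()
t2_sees_x = t2.read("x")        # still invisible to T2
t2.commit()
t3 = db.begin()                 # T3 starts after both have committed
t3_sees = (t3.read("x"), t3.read("y"))
```

This mirrors the three-transaction scenario: T2 never sees T1's write, while T3 sees the writes of both.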

 

4

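SequoiaDB distributed transaction implementation

4.1 Basic concepts and definitions

To implement distributed transactions, SequoiaDB uses a global time to coordinate and manage global transactions that span data shards. To determine this global time, SequoiaDB defines a set of timestamp concepts and introduces a timestamp management mechanism. The definitions are as follows:

  • LLT (Local Logical Timestamp): each node (CATALOG, COORD, DATA) maintains its own local logical time (smallest unit: microsecond)

  • ULT (Universal Logical Timestamp): the local time of the primary CATALOG node is defined as the global logical time (smallest unit: microsecond)

  • LRT (Local Real Timestamp): the local UTC time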

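To keep the global time consistent and accurate across the whole cluster, coordinator nodes (COORD) and data nodes (DATA) must periodically synchronize their time with the primary catalog node (CATALOG). Synchronization follows these rules:

1. The LLT of the primary CATALOG node (that is, the ULT) is calculated from the CPU ticks of all machines

2. The LLT of every other node is maintained by synchronizing with the ULT of the primary CATALOG node

3. The synchronization interval is ULTSyncInterval (default: 60 seconds)

4. After synchronization, the time difference must be smaller than the global tolerance ULTTolerance (default: 1 ms)

5. ULTTolerance is adjusted dynamically according to the synchronized time difference and network conditions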
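The synchronization rules can be sketched as follows. The source does not describe the actual algorithm, so this sketch assumes a simple round-trip estimate (half the round-trip time is taken as the uncertainty of the result). The Node class, its method names, and the fetch_ult_us callable are hypothetical; only ULTSyncInterval, ULTTolerance, and their defaults come from the rules above.

```python
ULT_SYNC_INTERVAL_US = 60 * 1_000_000   # ULTSyncInterval default: 60 seconds
ULT_TOLERANCE_US = 1_000                # ULTTolerance default: 1 ms

class Node:
    """A COORD or DATA node that keeps its LLT aligned with the ULT."""
    def __init__(self, local_ticks_us):
        self.local_ticks_us = local_ticks_us   # callable: local microseconds
        self.offset_us = 0                     # correction from local time to ULT
        self.last_sync_us = None

    def llt(self):
        return self.local_ticks_us() + self.offset_us

    def needs_sync(self):
        if self.last_sync_us is None:
            return True
        return self.local_ticks_us() - self.last_sync_us >= ULT_SYNC_INTERVAL_US

    def sync(self, fetch_ult_us, tolerance_us=ULT_TOLERANCE_US):
        """Fetch the ULT; accept the result only if its error bound is small."""
        sent = self.local_ticks_us()
        ult = fetch_ult_us()                   # assumed call to the primary CATALOG node
        received = self.local_ticks_us()
        half_rtt = (received - sent) // 2      # assumed uncertainty of the estimate
        if half_rtt >= tolerance_us:
            return False                       # too imprecise; caller may retry
        self.offset_us = ult + half_rtt - received
        self.last_sync_us = received
        return True
```

A slow network produces a large round trip and the result is rejected, which is one plausible reason a tolerance would be adjusted according to network conditions.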

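Once the global time and its rules are defined, they can be used to implement distributed transactions. Distributed transactions are implemented with the two-phase commit mechanism; following its principles, these transaction timestamps are defined:

  • TBT (Transaction Begin Timestamp): the time at which the transaction begins

  • TPCT (Transaction Pre-Commit Timestamp): the time at which the transaction pre-commits

  • TCP (Transaction Commit Timestamp): the time at which the transaction commits

For a given transaction, there must be a transaction time interval between its TBT and its TPCT; this interval takes the value of ULTTolerance at that moment. The transaction time interval can also be defined as the minimum tolerable error between transaction times issued by different nodes. That is, if the transaction times of two different nodes differ by less than the transaction time interval, the two times are considered equal within error.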
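The "equal within error" rule amounts to a tolerance-aware comparison. The two helper functions below are hypothetical names introduced for illustration; times and the interval are in microseconds.

```python
def compare_txn_times(a_us, b_us, interval_us):
    """Return 0 if the two times are equal within error, else -1 or 1."""
    if abs(a_us - b_us) < interval_us:
        return 0
    return -1 if a_us < b_us else 1

def earliest_precommit_time(tbt_us, interval_us):
    """The TPCT must lie at least one transaction time interval after the TBT."""
    return tbt_us + interval_us
```

With the default tolerance of 1 ms (1000 microseconds), two timestamps 500 microseconds apart compare as equal, while timestamps a full millisecond apart are ordered.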

 

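4.2 Two-phase commit implementation

SequoiaDB implements distributed transactions with classic two-phase commit (2PC). It uses the global time to coordinate and manage global transactions uniformly, so that the different nodes of the distributed cluster operate on a transaction in a unified way. Over the course of a transaction, the work initiated by the client falls into three parts:

Part one: transaction begin. The client sends a "begin transaction" request to the database server. The server generates a transaction begin time from its local logical time and records it.

Part two: the transaction's insert, delete, update, and query operations. This part is the series of operations that make up the transaction's atomic unit, covering the four basic kinds of data operation. When the first SQL statement of the atomic unit is executed, the cluster must check the time difference between the coordinator node and the data node. If the difference exceeds the tolerance, the COORD and DATA nodes are required to synchronize their time with the primary CATALOG node, and the SQL operation is then reissued. If the difference is within the tolerance, the statement is executed directly. Once the first operation of the transaction succeeds, the time comparison has passed, and the remaining operations are executed directly.

Part three: transaction completion. This is the final part of the transaction. All of its operations have been executed, and the commit is initiated. The commit goes through two phases: a pre-commit first, and then, once the pre-commit succeeds, the commit itself, which completes the whole commit flow.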
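The clock check performed on the first statement of part two can be sketched as a small helper. This is an assumption-laden illustration: run_first_statement, resync, and execute are hypothetical names, and what happens if the clocks are still too far apart after synchronization is not covered by the text and not modeled here.

```python
def run_first_statement(coord_llt_us, data_llt_us, tolerance_us, resync, execute):
    """Check clock skew before the first statement of a transaction.

    resync: callable that makes the COORD and DATA nodes synchronize
            with the primary CATALOG node.
    execute: callable that runs the SQL statement and returns its result.
    Returns (statement result, whether a resync was needed).
    """
    resynced = False
    if abs(coord_llt_us - data_llt_us) > tolerance_us:
        resync()            # synchronize first, then issue the statement again
        resynced = True
    return execute(), resynced
```

Later statements in the same transaction would skip this check, since the first successful statement shows that the time comparison has passed.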

The detailed flow of a SequoiaDB transaction is shown in the figure below:

 

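4.3 Concurrency control

SequoiaDB implements multiversion concurrency control (MVCC) through a design that combines transaction locks, in-memory old versions, and old versions rebuilt from an on-disk rollback segment. The idea behind this architecture is to make good use of memory structures to store old versions of data and indexes, enabling fast concurrent access to data.

The basic principle of the architecture is to cache old versions in memory to speed up reads, while combining transaction visibility conditions with MVCC to meet the access requirements of the different isolation levels (RC/RR) for global transactions. In its MVCC implementation, SequoiaDB also balances runtime efficiency against the storage space consumed by multiple versions and the cost of reclaiming them.

In the transaction lock implementation, a read under the RR (repeatable read) configuration can release its lock as soon as it has finished using the record; it does not need to hold the lock until the transaction commits or rolls back. A write, however, must hold its insert, update, or delete lock until the transaction commits or rolls back. SequoiaDB uses pessimistic locking, similar to the mainstream locking mechanism of traditional relational databases.

Besides pessimistic locking, SequoiaDB's multiversion implementation uses an in-memory old-version mechanism to improve concurrent access. A structure holding the original data and the related index entries is attached to the record lock, so the old version of the data is kept in memory.

Every transactional write (update, delete, insert) saves in this structure a copy of the record as it was before the transaction started, together with the original version of every index entry it changes. When a reader tries to acquire the record lock while the record is being modified, the lock attempt fails and the reader obtains the lock's old-version structure through a callback, and so reads the data as of the last commit. When the transaction commits, the record lock is released and the space holding the old-version record and index entries is then reclaimed asynchronously; users can choose to enable asynchronous deletion of the data pending deletion. In addition, if the lock or record is used by the next write operation, the old versions are reclaimed synchronously. The structure of the old version is as follows: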
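The reader fallback described above can be sketched with a record lock that carries the old version. This is a simplified model, not SequoiaDB's data structure: RecordLock and Table are hypothetical names, index entries are left out, and the old version is reclaimed synchronously at commit instead of asynchronously.

```python
import copy

class RecordLock:
    def __init__(self):
        self.owner = None          # id of the writing transaction, if any
        self.old_version = None    # copy of the record as of the last commit

class Table:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.locks = {key: RecordLock() for key in rows}

    def write(self, txn_id, key, new_value):
        lock = self.locks[key]
        if lock.owner not in (None, txn_id):
            raise RuntimeError("record is locked by another writer")
        if lock.owner is None:
            lock.owner = txn_id
            # Save the pre-transaction copy before changing the record.
            lock.old_version = copy.deepcopy(self.rows[key])
        self.rows[key] = new_value

    def read(self, txn_id, key):
        lock = self.locks[key]
        if lock.owner not in (None, txn_id):
            return lock.old_version    # lock not available: use the old version
        return self.rows[key]

    def commit(self, txn_id):
        for lock in self.locks.values():
            if lock.owner == txn_id:
                lock.owner, lock.old_version = None, None   # reclaim old version

    def rollback(self, txn_id):
        for key, lock in self.locks.items():
            if lock.owner == txn_id:
                self.rows[key] = lock.old_version           # restore the record
                lock.owner, lock.old_version = None, None
```

While transaction 1 is modifying a record, a read by transaction 2 returns the last committed value without waiting for the lock.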

 

 

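Besides transaction locks and in-memory old versions, SequoiaDB completes its concurrency control strategy with an on-disk rollback segment. Memory is fast, but it is comparatively small and loses its contents on power failure. To address this, the rollback segment mechanism persists the in-memory old-version data to disk, so that abnormal events such as power loss do not disrupt normal transaction processing.

The rollback segment uses a system collection space named "SYSRBS". Internally it also uses 1 collection, named in the format "SYSRBSXXXX", where XXXX is a cyclic number in the range 0~4096. The rollback segment uses the first collection (that is, SYSRBS0000) to store the RBS metadata, including the current RBS collection and the last free RBS collection. At startup SequoiaDB checks whether MVCC is supported. If it is, SequoiaDB checks whether the "SYSRBS" collection space exists; if not, it creates the collection space along with the SYSRBSCL0000 and SYSRBSCL0001 collections. If the rollback segment's collection space and collections already exist, SequoiaDB reads the metadata from SYSRBSCL0000 and creates the next SYSRBSCLXXXX based on the current RBS collection and last free RBS collection information.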

To further speed up reads, SequoiaDB combines the in-memory old versions with the on-disk rollback segment: the most recent old version still hangs on the record lock's oldversionContainer, while older versions are placed on disk. Most short transactions therefore read only the old version in memory without touching the disk, which improves read speed. To handle a primary-node failure, the rollback-segment data needed to maintain multiple old versions of records is also replicated to the standby nodes, so that a standby node promoted to primary can rebuild the old versions from the rollback segment.

When a transaction ID is less than the global minimum transaction ID (lowTranID), a background asynchronous thread in the database reclaims the memory of old-version records and index nodes. When memory is reclaimed, the old versions are written to the RBS so that they are preserved. Disk cleanup of old versions starts from the last free collection (lastFreeCL) and compares the maximum transaction ID (MaxGTID) of each collection in turn; if it is less than the global minimum transaction ID, the collection (that is, SYSRBSCLXXXX) can be deleted.
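The disk cleanup rule can be sketched as a scan over the rollback-segment collections. The function name, the list-of-tuples representation, and the decision to stop at the first collection that may still be needed are assumptions; the cyclic numbering of collection names is also ignored here.

```python
def reclaim_rbs_collections(collections, last_free_index, low_tran_id):
    """Find rollback-segment collections that can be deleted.

    collections: list of (name, max_gtid) in allocation order.
    Starting at last_free_index, a collection is dropped when its MaxGTID is
    below low_tran_id; the scan stops at the first one that may still be needed.
    Returns (dropped_names, new_last_free_index).
    """
    dropped = []
    index = last_free_index
    while index < len(collections):
        name, max_gtid = collections[index]
        if max_gtid >= low_tran_id:
            break
        dropped.append(name)
        index += 1
    return dropped, index

cols = [("SYSRBSCL0001", 90), ("SYSRBSCL0002", 120), ("SYSRBSCL0003", 200)]
dropped, new_last_free = reclaim_rbs_collections(cols, 0, 150)
```

In this example the first two collections hold only versions older than every active transaction, so they are the ones selected for deletion.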

 

5

Summary

SequoiaDB implements multiversion concurrency control through a design that combines transaction locks, in-memory old versions, and old versions rebuilt from an on-disk rollback segment. By making good use of memory structures to store old versions of data and indexes, this design enables fast concurrent access to multiple versions of data.


Origin www.cnblogs.com/sequoiadbsql/p/12175397.html