First public disclosure: JDTX, JD Digits' strongly consistent, high-performance distributed transaction middleware

Original link: https://mp.weixin.qq.com/s?__biz=MzUzMTA2NTU2Ng==&mid=2247488055&idx=2&sn=1eadfaf8123203f811d0a5ccf88d53d7&chksm=fa496d86cd3ee49010a7d13bf8240984a46dd956cd9deba7ed28222c5c4721e25f17ff46ad0a&mpshare=1&scene=23&srcid=1025hwBdMmWafRLPzq2ts4pz&sharer_sharetim

Source: https://www.infoq.cn/article/BAXzcfjRTcgmKisa7JHm

Today, as terms such as distributed database, cloud-native database, and NewSQL keep emerging in the database field, change has become increasingly inevitable even in this relatively stable area. Compared with complete reinvention, incremental enhancement is more popular in industries with a heavy accumulated legacy.

As with distributed solutions in every field, transparent data sharding based on divide and conquer is the core idea a new generation of databases uses to handle massive data. Horizontal splitting makes distributed transactions even more important than they are in vertically split, service-oriented systems. Elastic scale-out (and scale-in) and HTAP are also focal points for the new generation of databases. Apache ShardingSphere, open-sourced by JD Digits, has matured in data sharding. JDTX, a distributed transaction middleware developed on top of this scenario, fits together with it like puzzle pieces to form the core of a distributed database.

JDTX is a distributed transaction middleware built by the JD Digits data R&D team. This article is JDTX's first public appearance. It covers JDTX's core design philosophy and the technical difficulties of implementing it, in the hope of offering some ideas to teams building distributed transaction solutions.

Background

A database transaction must satisfy four properties, known as ACID: atomicity, consistency, isolation, and durability.

On a single data node, a transaction is limited to controlling access to the resources of a single database and is called a local transaction. Almost all mature relational databases support local transactions natively. In distributed application environments based on microservices, however, more and more scenarios require access to multiple services and their corresponding database resources to be included in the same transaction. Distributed transactions arose to meet this need.

Relational databases provide full native ACID support for local transactions, but in distributed scenarios that support becomes a shackle on system performance. The focus of distributed transaction work is how to satisfy the ACID properties in a distributed scenario, or how to find a suitable alternative.

Local transactions

With no distributed transaction manager enabled, each data node manages its own transactions. The nodes have no ability to coordinate or communicate with one another, and none of them knows whether transactions on the other nodes succeeded. Local transactions incur no performance loss, but they fall short on both strong consistency and eventual consistency.

Two-phase commit

The earliest distributed transaction model is the X/Open Distributed Transaction Processing (DTP) model proposed by the X/Open consortium, commonly referred to as the XA protocol.

Distributed transactions based on the XA protocol are minimally invasive to business code. Their biggest advantage is transparency to the user: XA-based distributed transactions can be used in the same way as local transactions. The XA protocol can strictly guarantee the ACID properties of a transaction.

Strictly guaranteeing ACID is a double-edged sword. All required resources are locked while a transaction executes, so the approach suits short transactions with a predictable execution time. A long transaction holds its data exclusively for its whole duration, which significantly degrades concurrent performance in business systems that depend on hot data. Therefore, in scenarios where high-concurrency performance comes first, distributed transactions based on XA two-phase commit are not the best choice.
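To make the locking cost concrete, here is a minimal sketch of a two-phase commit coordinator. The `Participant` class and the `two_phase_commit` function are hypothetical names invented for illustration. They are not part of XA or of any real transaction manager, and failure handling such as timeouts and coordinator crashes is left out.

```python
class Participant:
    """Hypothetical resource manager taking part in a two-phase commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"
        self.locked = False

    def prepare(self):
        # Phase 1: lock the resources and vote on the outcome.
        self.locked = True
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the change permanent and release the locks.
        self.state = "committed"
        self.locked = False

    def rollback(self):
        # Phase 2 (failure): undo the change and release the locks.
        self.state = "rolled_back"
        self.locked = False


def two_phase_commit(participants):
    """Commit only if every participant votes yes in the prepare phase."""
    votes = [p.prepare() for p in participants]  # all locks are held from here on
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


healthy = [Participant("orders"), Participant("stock")]
one_refuses = [Participant("orders"), Participant("stock", can_commit=False)]
healthy_result = two_phase_commit(healthy)
refused_result = two_phase_commit(one_refuses)
```

Every participant keeps its locks from `prepare` until the coordinator's final decision arrives, which is why long transactions hurt concurrency under this scheme.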

Flexible transactions

If transactions that implement the ACID properties are called rigid transactions, then transactions based on the BASE properties are called flexible transactions. BASE stands for Basically Available, Soft state, and Eventual consistency.

ACID transactions place high demands on consistency and isolation, and all resources must remain held while a transaction executes. The idea of flexible transactions is to use business logic to move mutual exclusion from the resource level up to the business level. By relaxing the requirements for strong consistency and isolation, they require only that the data be consistent when the whole transaction finally ends. During execution, any data that is read may still change. This design trades weak consistency for higher system throughput. Saga and TCC are typical flexible transaction implementations.
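As an illustration of the flexible style, the following is a minimal Saga-like sketch, assuming each step comes with a business-level compensation. The `run_saga` helper and the account example are hypothetical and are not taken from any particular Saga framework.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, compensate in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Undo the steps that already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
    return True


balances = {"alice": 100, "bob": 0}


def debit_alice():
    balances["alice"] -= 30


def refund_alice():
    balances["alice"] += 30


def credit_bob_fails():
    raise RuntimeError("bob's account service is unavailable")


saga_ok = run_saga([(debit_alice, refund_alice), (credit_bob_fails, lambda: None)])
```

Between `debit_alice` and `refund_alice`, other readers can observe Alice's balance as 70. That is the relaxed isolation described above: only the final state is guaranteed to be consistent.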

Summary

Neither ACID-based two-phase transactions nor BASE-based eventually consistent transactions are a silver bullet. The table below compares them in detail.

Aspect                    | Two-phase commit                     | Flexible transactions
Business changes required | None                                 | Must implement the relevant interfaces
Consistency               | Supported                            | Eventual consistency
Isolation                 | Supported                            | Guaranteed by the business side
Concurrent performance    | Severe degradation                   | Slight degradation
Suitable scenarios        | Short transactions, low concurrency  | Long transactions, high concurrency

Two-phase transactions, which lack concurrency guarantees, cannot be called a perfect distributed transaction solution. Flexible transactions, which lack ACID support in its original sense, cannot even be called database transactions and are better suited to transaction handling at the service layer.

At present, it is hard to find a universal distributed transaction solution that requires no trade-offs.

The JDTX distributed transaction solution

JDTX is designed to be a fully distributed transaction middleware that is strongly consistent (supporting ACID transactions in their original sense), high-performance (even better than local transactions), and 1PC (completely abandoning two-phase commit and two-phase locking). It currently works with relational databases. It adopts a fully open SPI design, so it can be connected to NoSQL stores and can keep multiple heterogeneous data stores within the same transaction.

Design concept

First, an architecture diagram gives an intuitive view of what JDTX consists of.

JDTX consists of a Transaction Manager (TM) and a Resource Manager (RM).

The Transaction Manager generates globally monotonically increasing transaction log sequence numbers (LSNs), handles the core commit and rollback processes, and holds the local tuples of uncommitted transactions.

The Resource Manager manages the data of active transactions. A distinctive feature of JDTX's design is that it separates data that is in transactions (called active data) from data that is not (called landed data). Active data is first persisted to the write-ahead log (WAL) and then saved to a self-developed multi-version snapshot (MVCC) in-memory engine. Landed data is produced by asynchronous flushing, which synchronizes the data in the MVCC engine to the final storage medium (for example, a relational database) at a controlled rate.

A query inside a transaction merges landed data with active data and, according to the current transaction's isolation level, returns the version of the data that is visible to that transaction.

Highlights

Lossless transaction solution

JDTX uses WAL + MVCC to implement ACID transactions in their original sense.

Atomicity and consistency support

JDTX's MVCC engine can be viewed as a centralized cache, which reduces two-phase commit to a single-phase commit. Atomicity and consistency are maintained on the data of a single node, which narrows the scope of a distributed transaction down to that of a local transaction.

The MVCC engine can scale horizontally through sharding and stay highly available through a primary/standby mode. JDTX ensures that all access to transactional data goes through the active data in the MVCC engine merged with the landed data at the final storage end, which guarantees the atomicity and consistency of the data.

Isolation support

JDTX implements transaction isolation through multi-version snapshots. Of the four standard isolation levels, it currently fully supports read committed and repeatable read, which already meets most needs.

Durability support

Before active transaction data is stored in the MVCC engine, JDTX first persists it to the WAL engine on disk. If a server crashes and in-memory data is lost, the active data can still be fully recovered from the WAL engine.
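The write-ahead ordering can be sketched as follows. This is a toy under stated assumptions, not JDTX's WAL engine: the `WalBackedStore` class, the JSON-lines record format, and the file name are all hypothetical, and a plain dict stands in for the MVCC engine.

```python
import json
import os
import tempfile


class WalBackedStore:
    """Toy store that appends to a log file before updating memory."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memory = {}  # stand-in for the in-memory MVCC engine
        self.lsn = 0      # monotonically increasing log sequence number

    def put(self, key, value):
        self.lsn += 1
        record = {"lsn": self.lsn, "key": key, "value": value}
        # 1. Append to the log and force it to disk first.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then update the in-memory state.
        self.memory[key] = value

    @classmethod
    def recover(cls, wal_path):
        """Rebuild the in-memory state by replaying the log in order."""
        store = cls(wal_path)
        if os.path.exists(wal_path):
            with open(wal_path, encoding="utf-8") as wal:
                for line in wal:
                    record = json.loads(line)
                    store.memory[record["key"]] = record["value"]
                    store.lsn = record["lsn"]
        return store


wal_path = os.path.join(tempfile.mkdtemp(), "jdtx-sketch.wal")
store = WalBackedStore(wal_path)
store.put("order:1", "paid")
store.put("order:1", "shipped")
recovered = WalBackedStore.recover(wal_path)  # simulates a restart after memory loss
```

Because the log is sequential and append-only, replaying it in order is enough to restore the latest value for each key.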

High performance

JDTX flushes active data to the database asynchronously, which greatly raises the performance ceiling for data writes. The write bottleneck shifts from the time taken to write to the database to the time taken to persist to the WAL and to store data in the MVCC engine.

Like the WAL of a database system, JDTX's WAL uses sequential log appends, so the WAL cost in JDTX can be taken as roughly equal to the WAL cost in a database system. The MVCC engine uses a hash-based data structure for its cache, so writing to it takes less time than a database write that must maintain a B-tree index. As a result, data updates under JDTX transactions can perform even better than updates without transactions enabled.

In addition, JDTX adopts a rollback strategy that needs no UNDO log. Uncommitted data never enters the MVCC engine; it is held by the local transaction manager. Rolling back a transaction therefore only requires discarding its uncommitted data. Doing without an UNDO log further improves transaction performance.
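The rollback strategy can be illustrated with a small sketch. `MvccEngine` and `LocalTransaction` are hypothetical stand-ins, and the WAL write that would precede a real commit is omitted for brevity.

```python
class MvccEngine:
    """Stand-in for the shared engine; it only ever sees committed data."""

    def __init__(self):
        self.committed = {}


class LocalTransaction:
    """Holds uncommitted writes locally, as the text describes for the TM."""

    def __init__(self, engine):
        self.engine = engine
        self.pending = {}  # uncommitted writes, invisible to other transactions

    def write(self, key, value):
        self.pending[key] = value

    def read(self, key):
        # A transaction sees its own uncommitted writes first.
        if key in self.pending:
            return self.pending[key]
        return self.engine.committed.get(key)

    def commit(self):
        # Publish the buffered writes to the shared engine.
        self.engine.committed.update(self.pending)
        self.pending.clear()

    def rollback(self):
        # Nothing reached the shared engine, so there is nothing to undo.
        self.pending.clear()


engine = MvccEngine()
tx1 = LocalTransaction(engine)
tx1.write("stock", 9)
tx1.rollback()
tx2 = LocalTransaction(engine)
tx2.write("stock", 8)
tx2.commit()
```

Rollback is a constant-time discard of a local buffer, which is why no UNDO log is needed in this design.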

High availability

Both the WAL engine and the MVCC engine use sharding plus primary/standby replication to ensure that JDTX has no single point of failure. If the MVCC engine becomes completely unavailable, a recovery mode can synchronize the data in the WAL to the database to preserve data integrity.

Transactions across multiple databases

Because JDTX separates active transaction data from landed data, it places no restrictions on the storage end that holds landed data. All active transaction data is written to the backend database by the asynchronous flush executor, so whether the backend databases are homogeneous makes no practical difference.

JDTX can therefore keep distributed transactions that span diverse storage ends (for example MySQL, PostgreSQL, or even NoSQL stores such as MongoDB and Redis) under the same transaction semantics.

Implementation difficulties

MVCC kernel

Transaction isolation levels have two common implementations: lock-based and MVCC-based. Apart from Informix and a few other databases, most relational databases use an MVCC implementation.

The four standard isolation levels (read uncommitted, read committed, repeatable read, and serializable) are defined by ANSI on the basis of a lock-based implementation. Concurrency decreases as the isolation level increases. Except for serializable, which has the lowest concurrency, every isolation level comes with trade-offs and sacrifices in consistency.

The table below shows the isolation levels as realized by a lock-based implementation.

Isolation level  | Dirty read | Non-repeatable read | Phantom read
Read uncommitted | Possible   | Possible            | Possible
Read committed   | Impossible | Possible            | Possible
Repeatable read  | Impossible | Impossible          | Possible
Serializable     | Impossible | Impossible          | Impossible

The isolation levels realized through MVCC are really only two: SI (snapshot isolation) and SSI (serializable snapshot isolation). SI and SSI do not map exactly onto the four ANSI isolation levels. In an MVCC implementation, read uncommitted has no performance advantage over read committed, so it can be ignored. SI therefore corresponds to two isolation levels, read committed and repeatable read. In fact, phantom reads do not occur under the SI isolation level either.

Because snapshot-based concurrency control cannot guarantee that transactions are truly serializable, concurrent transactions can still produce data anomalies. These differ from anomalies such as dirty reads and lost updates: they are anomalies at the logical, semantic level of business data, caused by violating semantic constraints that hold between data items. This anomaly is called write skew. It can be detected through the read and write dependencies between concurrent transactions in the multiversion serialization graph, which is what the SSI isolation level does.
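Write skew is easiest to see in a small simulation. The sketch below uses the common on-call example, which is not from the original article: the business rule is that at least one person stays on call. Each transaction reads its own snapshot, and the only conflict check is the overlap of write sets, which is an assumed simplification of snapshot isolation.

```python
def simulate_write_skew():
    """Two snapshot transactions with disjoint write sets break a shared rule."""
    on_call = {"alice": True, "bob": True}

    # Both transactions start from the same committed state.
    snapshot_1 = dict(on_call)
    snapshot_2 = dict(on_call)
    writes_1, writes_2 = {}, {}

    # Each one checks the rule against its own snapshot, then goes off call.
    if sum(snapshot_1.values()) >= 2:
        writes_1["alice"] = False
    if sum(snapshot_2.values()) >= 2:
        writes_2["bob"] = False

    # Snapshot isolation only rejects overlapping write sets.
    write_conflict = bool(set(writes_1) & set(writes_2))
    if not write_conflict:
        on_call.update(writes_1)
        on_call.update(writes_2)
    return on_call, write_conflict


final_state, write_conflict = simulate_write_skew()
```

No write-write conflict is reported, yet nobody is left on call. Catching this requires tracking what each transaction read as well as what it wrote, which is the extra work SSI takes on.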

The table below shows the isolation levels as realized by an MVCC implementation.

Isolation level  | Dirty read      | Non-repeatable read | Phantom read    | Write skew
Read uncommitted | Not implemented | Not implemented     | Not implemented | Not implemented
Read committed   | Impossible      | Possible            | Possible        | Possible
Repeatable read  | Impossible      | Impossible          | Impossible      | Possible
Serializable     | Impossible      | Impossible          | Impossible      | Impossible

The self-developed MVCC engine is one of the main difficulties of JDTX. JDTX uses an MVCC implementation similar to PostgreSQL's: the snapshot range of a transaction is marked by xmin and xmax, and each data tuple in the MVCC engine also stores xmin and xmax transaction information. Multiple versions of the same row are stored in a linked list, and the version visible to the current transaction is found by checking each version's xmin and xmax against the transaction's snapshot.
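A simplified visibility check in this style might look like the sketch below. It assumes PostgreSQL-like meanings (xmin is the transaction that created a version, xmax the one that invalidated it) and represents a snapshot as the set of transaction IDs committed when the snapshot was taken. The `Version` class and both functions are hypothetical, not JDTX's data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    value: str
    xmin: int                   # ID of the transaction that created this version
    xmax: Optional[int] = None  # ID of the transaction that invalidated it, if any


def is_visible(version, committed_at_snapshot):
    """Visible if its creator had committed and its invalidator had not."""
    created = version.xmin in committed_at_snapshot
    invalidated = version.xmax is not None and version.xmax in committed_at_snapshot
    return created and not invalidated


def read(chain, committed_at_snapshot):
    """Walk the version chain from newest to oldest; return the first visible value."""
    for version in chain:
        if is_visible(version, committed_at_snapshot):
            return version.value
    return None


# Version chain for one row, newest first.
chain = [
    Version("v3", xmin=30),
    Version("v2", xmin=20, xmax=30),
    Version("v1", xmin=10, xmax=20),
]
```

A transaction whose snapshot contains only transaction 10 reads `v1`, even after transactions 20 and 30 commit, which is the behaviour repeatable read needs.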

Because MySQL does not implement the SSI isolation level either, JDTX currently implements only the SI isolation level and has not yet implemented SSI.

Cleaning up (vacuuming) MVCC data is another technical difficulty. Long transactions can produce too many MVCC versions, which take up a lot of storage space. JDTX's MVCC engine stores active data in memory, so it is even more sensitive to reclaiming memory. Because JDTX flushes to disk asynchronously, whether a version has already been landed becomes an additional cleanup rule on top of the standard MVCC garbage collection logic.

SQL query engine

Querying the active data of transactions through SQL is another technical difficulty of JDTX. The MVCC engine is not a relational database and cannot query data by interpreting SQL. JDTX builds on the SQL parsing module previously developed in Apache ShardingSphere, using its abstract syntax tree (AST) to understand SQL and query the in-memory MVCC engine accordingly.

For OLTP-type SPJ (select-project-join) SQL, the primary keys of the data can be obtained from the SQL query result. JDTX takes the landed data queried from the backend database as the base of the final result, queries the MVCC engine for the active data visible to the current transaction on top of it, and merges the two result sets. In other words, every query inside a transaction is answered by merging landed data with active data. Part of the merge engine's design draws on the LSM tree.
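The primary-key merge can be sketched as an overlay in which active data wins over landed data. The `merge_rows` function, the `TOMBSTONE` marker for rows deleted inside the transaction, and the `id` column are assumptions made for illustration.

```python
TOMBSTONE = object()  # hypothetical marker for a row deleted by an active transaction


def merge_rows(landed_rows, active_rows):
    """Overlay active rows (keyed by primary key) on rows read from the backend."""
    merged = {row["id"]: row for row in landed_rows}
    for primary_key, row in active_rows.items():
        if row is TOMBSTONE:
            merged.pop(primary_key, None)  # hide a row deleted in the transaction
        else:
            merged[primary_key] = row      # the newer active version wins
    return [merged[primary_key] for primary_key in sorted(merged)]


landed = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
active = {
    2: {"id": 2, "status": "paid"},  # updated but not yet flushed
    3: {"id": 3, "status": "new"},   # inserted but not yet flushed
    1: TOMBSTONE,                    # deleted but not yet flushed
}
merged_result = merge_rows(landed, active)
```

This mirrors how an LSM tree answers a read: newer in-memory entries shadow older on-disk entries with the same key.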

For non-SPJ, OLAP-type SQL, JDTX uses a different query approach. SQL with grouping and aggregate functions cannot match landed data in the backend database with data in the MVCC engine directly by primary key. JDTX therefore rewrites the SQL to exclude the primary keys of the active data held in the MVCC engine, uses the new SQL to query the backend database for data that does not overlap the active data, and then merges the results.
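One way to picture this rewrite is the sketch below, which uses Python's built-in `sqlite3` module as a stand-in backend. The `orders` table, the `total_amount` function, and the exact shape of the rewritten SQL are assumptions for illustration; the article does not show the SQL that JDTX generates.

```python
import sqlite3


def total_amount(conn, active_rows):
    """Sum amounts across landed and active data without double counting.

    active_rows maps primary key -> amount, or None for a row deleted
    inside the transaction.
    """
    keys = list(active_rows)
    sql = "SELECT COALESCE(SUM(amount), 0) FROM orders"
    if keys:
        # Rewritten query: skip every row that also has an active version.
        placeholders = ", ".join("?" for _ in keys)
        sql += f" WHERE id NOT IN ({placeholders})"
    landed_total = conn.execute(sql, keys).fetchone()[0]
    active_total = sum(amount for amount in active_rows.values() if amount is not None)
    return landed_total + active_total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Row 2 updated, row 4 inserted, row 3 deleted inside the current transaction.
active = {2: 25, 4: 5, 3: None}
merged_total = total_amount(conn, active)
```

Excluding the active keys from the backend query keeps each row in exactly one of the two partial results, so the aggregates can simply be combined.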

Limitations

There is no silver bullet in distributed systems, a view widely accepted by today's distributed system architects. JDTX has many advantages, but it still has some usage restrictions, mainly the following three.

  1. The database must be accessed through JDTX. Atomicity, consistency, and isolation are controlled by JDTX's MVCC transaction engine, and durability is controlled by its WAL. If a system that uses JDTX bypasses the transaction middleware and queries the database directly, the data it reads may not reflect the correct transactional state, and modifying the database directly can corrupt data.

  2. SQL support needs continuous improvement. The SQL compatibility and dialect coverage of the MVCC query engine must keep improving. Reduced SQL compatibility is the trade-off JDTX makes in exchange for lossless ACID transaction support in its original sense.

  3. Data without a primary key is not supported. JDTX merges data from the database with data in the MVCC engine by primary key, so records without a primary key cannot be processed.

JDTX and Apache ShardingSphere

The JDBC access end provided by Apache ShardingSphere lets JDTX connect seamlessly to Java applications. Besides the JDBC access end, Apache ShardingSphere also provides proxy access ends based on the MySQL and PostgreSQL protocols, so that JDTX can serve distributed transactions as if it were a single database. Apache ShardingSphere plans to split out the access ends in the future so that JDTX can be used independently.

Apache ShardingSphere provides a unified SPI for distributed transactions. By implementing the SPI that ShardingSphere provides, JDTX integrates easily into the Apache ShardingSphere ecosystem. Used together, Apache ShardingSphere and JDTX combine data sharding and distributed transactions seamlessly.

Used on their own, Apache ShardingSphere and JDTX are flexible, decoupled, and highly customizable, like basic Lego bricks. Used together, they react with each other and can even form the infrastructure of a distributed database. In the combined product, Apache ShardingSphere sits at the front and handles SQL parsing, the database protocol, and data sharding; JDTX sits in the middle and handles active transaction data by key through MVCC; the database at the bottom serves only as the final store for landed data. The figure shows the architecture of ShardingSphere + JDTX.

Finally, here is the MySQL architecture diagram, so that readers can see the similarities for themselves.

JDTX follow-up plan

JDTX aims to become a standard distributed transaction solution. Once the core functions (the core transaction flow, the MVCC engine, the WAL engine, and high availability) have matured, JDTX will focus on the following areas:

  1. Improving SQL compatibility and supporting more databases;

  2. Implementing the SSI isolation level to provide a complete MVCC isolation solution;

  3. Improving the management and monitoring console.

Beyond the JDTX middleware itself, JDTX will also be integrated with ShardingSphere, other databases, and other middleware to provide more complete distributed database services, and integrated more deeply with Kubernetes and other cloud-native platforms to serve cloud-native databases.

About the author

Zhang Liang is head of data R&D at JD Digits, initiator and PPMC member of Apache ShardingSphere, and lead of JDTX.

He loves open source and leads the open source projects ShardingSphere (formerly Sharding-JDBC) and Elastic-Job. He specializes in Java-based distributed architecture, values elegant code, and has studied in depth how to write expressive code.

He currently devotes most of his energy to ShardingSphere and JDTX, building industry-leading, finance-grade data solutions. ShardingSphere has entered the Apache Incubator. It is the first open source project from JD Group to enter the Apache Foundation and the Apache Foundation's first distributed database middleware.

GitHub: https://github.com/terrymanu. Technical discussion and corrections are always welcome.


Origin blog.csdn.net/zl1zl2zl3/article/details/102742440