Analysis of a Lakehouse Architecture with Storage-Compute Separation Supporting Multi-Model Data Analysis and Exploration (Part 2)

When an enterprise needs an independent data warehouse system to support BI and analytics, it ends up with a hybrid "data lake + data warehouse" architecture. This hybrid architecture, however, brings higher construction, management, and business development costs. As big data technology has matured, adding distributed transactions, metadata management, high-performance SQL, and SQL and data API interfaces to the data lake layer allows enterprises to support both data lake and data warehouse workloads on a unified architecture: this is the integrated lake-warehouse (lakehouse) architecture. This article continues the series by introducing Transwarp Inceptor and Apache Delta Lake.

— Transwarp Inceptor —

Transwarp Inceptor is a distributed relational analysis engine first developed in 2013 and mainly used for analytical workloads over ODS layers, data lakes, and other structured data. Inceptor supports most of the ANSI SQL-92, SQL-99, and SQL-2003 standards, is compatible with the dialects of traditional relational databases such as Oracle, IBM DB2, and Teradata, and supports stored procedures. Starting in 2014, some domestic bank customers began migrating data processing workloads whose performance had become insufficient on DB2 to Inceptor. These data tasks, originally built on relational databases, involve a large number of concurrent update and delete operations, and multiple data pipelines often write to the same result table, so they place strong demands on concurrent transaction performance. Moreover, users expected not only to migrate batch processing workloads from relational databases to the Hadoop platform, but also to gain an order-of-magnitude performance improvement. Transwarp therefore began developing a distributed transaction mechanism on top of HDFS in 2014 to support data warehouse scenarios. The related technology was officially released with Inceptor 4.3 in 2016, has been put into production at hundreds of financial industry customers in the years since, and has reached the maturity required for financial-grade workloads.

The overall architecture of Inceptor is shown above. Because ORC is a columnar format with very good statistical analysis performance, we chose to build the distributed transaction capability on top of the ORC file format. Since HDFS supports neither random access within files nor in-file transactional operations, we use an MVCC mechanism for concurrent updates and deletes: each update does not rewrite the data file directly but appends a new version, and all data operations within a transaction are written into a delta file. When reading, the relevant file data is loaded into memory and the versions of the same rows across multiple files are merged according to transaction order and visibility; this is the Merge-on-Read mechanism. It must be emphasized that data warehouse workloads are batch oriented: a single SQL statement in a batch job may touch a large amount of data, and its acceptable latency is often tens of seconds or even minutes. The distributed transaction throughput required is therefore tens to hundreds of TPS, not the tens of thousands of TPS (or more) that distributed databases need for OLTP workloads. Lock conflicts are a major bottleneck for concurrency in batch processing: if multiple ETL tasks operate on the same table at the same time, lock conflicts on that table cause the tasks to wait on each other and slow down the overall processing cadence. To solve this performance problem, we developed an independent Lock Manager to handle the generation of distributed transactions and their visibility judgments, and we support three lock granularities, database, table, and partition, which eliminates many unnecessary lock conflicts.
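To make the Merge-on-Read idea above concrete, here is a minimal Python sketch assuming simplified in-memory structures; the names (`DeltaRecord`, `merge_on_read`, and so on) are illustrative and are not Inceptor's internal APIs.

```python
# Illustrative sketch of Merge-on-Read: base rows plus versioned delta records
# are merged at read time by transaction order; not Inceptor's actual code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeltaRecord:
    txn_id: int          # commit order of the transaction that wrote this record
    row_key: str         # identifier of the affected row
    op: str              # "UPSERT" or "DELETE"
    value: Optional[dict] = None

def merge_on_read(base_rows: dict, delta_records: list,
                  visible_txns: set) -> dict:
    """Apply only deltas from transactions visible to the reader's snapshot,
    in commit order, on top of the base file contents."""
    result = dict(base_rows)
    for rec in sorted(delta_records, key=lambda r: r.txn_id):
        if rec.txn_id not in visible_txns:
            continue                      # skip uncommitted / invisible versions
        if rec.op == "DELETE":
            result.pop(rec.row_key, None)
        else:                             # UPSERT
            result[rec.row_key] = rec.value
    return result

# Example: one committed update and one in-flight (invisible) delete
base = {"r1": {"amount": 10}, "r2": {"amount": 20}}
deltas = [
    DeltaRecord(txn_id=101, row_key="r1", op="UPSERT", value={"amount": 15}),
    DeltaRecord(txn_id=102, row_key="r2", op="DELETE"),
]
print(merge_on_read(base, deltas, visible_txns={101}))
# {'r1': {'amount': 15}, 'r2': {'amount': 20}}
```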

In addition, during data warehouse processing many tables are both the result table of one task and the data source of other tasks, so read-write conflicts also arise and can limit concurrency in some cases. To solve this, we introduced a snapshot mechanism and a Serializable Snapshot isolation level. A read of a table can go directly against a snapshot, so it no longer needs to generate transaction conflicts with concurrent writes. In our design, snapshots are not persisted and do not require significant additional physical storage; they are a lightweight, globally consistent logical concept that lets transaction processing quickly determine whether a given version of the data should be visible or excluded.
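One way to picture this lightweight snapshot, again as an illustrative sketch rather than Inceptor's implementation: a snapshot is simply the frozen set of transaction IDs committed at the moment a reader starts, so visibility checks are cheap and readers never block writers.

```python
# Hypothetical illustration: a snapshot as a frozen set of committed txn IDs.
# Readers evaluate visibility against this set and take no locks.
class SnapshotManager:
    def __init__(self):
        self.committed = set()

    def commit(self, txn_id: int) -> None:
        self.committed.add(txn_id)

    def take_snapshot(self) -> frozenset:
        # A logical, globally consistent view; nothing is copied or persisted.
        return frozenset(self.committed)

mgr = SnapshotManager()
mgr.commit(101)
snapshot = mgr.take_snapshot()   # reader pins this view
mgr.commit(102)                  # a writer commits afterwards
print(101 in snapshot, 102 in snapshot)  # True False
```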

In terms of transaction isolation, Inceptor supports five isolation levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable, and Serializable Snapshot. For concurrency control, Inceptor provides both a pessimistic and an optimistic serializable isolation level, strict two-phase-lock-based serializable isolation and snapshot-based serializable isolation, and users can choose the one appropriate to their business scenario. Strict two-phase lock serializable isolation (S2PL-based Serializable Isolation) is a pessimistic concurrency control technique: locks are acquired first, reads and writes are then processed, and no lock is released until the transaction commits. It is relatively simple to implement; the main difficulty is handling deadlocks, which is solved through cycle detection. Because reads and writes cannot proceed concurrently under this scheme, transaction concurrency is poor. Serializable Snapshot Isolation improves transaction concurrency through optimistic, snapshot-based control: reads and writes no longer block each other as they do under two-phase locking, so concurrency improves. Although snapshot isolation avoids dirty reads, non-repeatable reads, and phantom reads, it does not by itself guarantee serializability: when concurrent transactions share integrity constraints, anomalies known as write skew can still occur. To address write skew, Inceptor adds a serializable-snapshot conflict check that detects cycles in the read-write dependencies between transactions under snapshot isolation, finds the related conflicts, and aborts the offending transactions.
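The write skew anomaly can be shown with a small, self-contained sketch (the on-call constraint is a textbook illustration, not taken from Inceptor): two transactions read the same snapshot, write disjoint rows, and together violate a constraint that each of them individually verified.

```python
# Classic write-skew scenario under plain snapshot isolation (illustrative).
# Constraint: at least one of the two rows must keep "on_call" == True.
snapshot = {"alice": {"on_call": True}, "bob": {"on_call": True}}

def txn_take_off_call(snap, me, other):
    # Each transaction reads the OTHER row from its snapshot and, seeing the
    # constraint still satisfied, updates only its OWN row.
    if snap[other]["on_call"]:
        return {me: {"on_call": False}}   # write set of this transaction
    return {}

# Two concurrent transactions start from the same snapshot.
writes_t1 = txn_take_off_call(snapshot, "alice", "bob")
writes_t2 = txn_take_off_call(snapshot, "bob", "alice")

# Their write sets do not overlap, so a simple write-write conflict check
# raises no error, yet applying both breaks the constraint: write skew.
final = {**snapshot, **writes_t1, **writes_t2}
print(final)
# {'alice': {'on_call': False}, 'bob': {'on_call': False}}
```

A serializable-snapshot check like Inceptor's avoids this by tracking the read-write dependency between the two transactions (each reads a row the other writes) and aborting one of them when the dependencies form a cycle.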

As described above, Inceptor has relatively complete technical accumulation in distributed transaction concurrency, transaction isolation, and SQL performance, and it has been widely adopted in the financial industry since 2016, so its maturity in data warehouse scenarios is relatively high. Since Inceptor was not designed for machine learning scenarios, it does not provide a data API layer for direct use by machine learning frameworks. In addition, Inceptor has no dedicated architecture for real-time data writes and cannot effectively support a stream-batch integrated architecture; Transwarp has instead addressed the stream-batch integration requirement in its distributed database ArgoDB.

— Apache Delta Lake —

Because a large number of users run machine learning tasks on Databricks Cloud, the main design goals of Delta Lake include:

  • Excellent SQL performance
    SQL analysis performance is a core requirement of BI and analytics software, so Delta Lake needs columnar file formats suited to statistical analysis (such as Parquet), together with vectorized execution engines, data access caches, tiered data storage (such as hot/cold data separation), and other techniques, to raise SQL analysis performance in the data lake to the level required of a data warehouse.
  • Provide distributed transaction and schema support

Data lakes mostly store data as schemaless files, which gives data analysis flexibility but cannot provide database-grade ACID management. Delta Lake improves on file storage by providing a strict schema mechanism and an MVCC-based multi-version transaction mechanism, which together deliver database ACID semantics and support highly concurrent update and delete SQL operations (see the sketch after this list). In addition, Delta Lake is based on an open data format (Parquet), so it can operate directly on HDFS and also allows other computing engines to access the data, which improves ecosystem compatibility.

  • Data APIs for flexibly integrating machine learning tasks and exploratory analysis

Machine learning and AI training are core business scenarios for Databricks, so Delta Lake pays close attention to supporting them in its design. It not only provides a DataFrame API but also supports programming language interfaces such as Python and R, and strengthens integration with Spark MLlib, SparkR, Pandas, and other machine learning frameworks.
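As a rough illustration of these capabilities, the following PySpark sketch shows transactional UPDATE/DELETE on a Delta table and reading it back as a DataFrame for downstream ML code. It assumes a Spark installation with the open-source delta-spark package; the table path and column names are placeholders.

```python
# Minimal PySpark sketch: ACID update/delete on a Delta table plus DataFrame
# access for downstream ML code. Assumes the delta-spark package is installed;
# "/tmp/delta/events" is a placeholder path.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"

# Write an initial versioned, schema-enforced table in the open Parquet-based format.
spark.createDataFrame(
    [(1, "click", 0.2), (2, "view", 0.9)], ["id", "event", "score"]
).write.format("delta").mode("overwrite").save(path)

# Transactional UPDATE and DELETE through the DeltaTable API.
tbl = DeltaTable.forPath(spark, path)
tbl.update(condition="event = 'click'", set={"score": "score * 2"})
tbl.delete(condition="score < 0.5")

# The same table reads back as a DataFrame and can feed MLlib or pandas code.
df = spark.read.format("delta").load(path)
df.toPandas()  # e.g. hand off to pandas-based feature engineering
```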

With the above capabilities, combining Spark's compute with Delta Lake's storage yields a data architecture built entirely on Databricks storage and compute technology that can support BI analysis, real-time analysis, and machine learning. Because Delta Lake uses an open storage format, it can also be connected to other compute engines such as Presto for interactive analysis. In terms of the projects' original design goals, Hudi focuses on high-concurrency update/delete performance, Iceberg focuses on query performance over very large data volumes, and Delta Lake's design centers on unifying real-time and offline computation. Through deep integration with Spark Structured Streaming, a Delta table can serve both as a streaming source and directly as a streaming sink, with exactly-once semantics guaranteed. The Delta community combined this with the multi-hop data architecture to design a stream-batch-integrated reference architecture that, much like the Kappa architecture, uses a single data store to serve both streaming and batch scenarios.
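A minimal sketch of this stream-batch pattern, reusing the `spark` session and column names from the previous example (the paths and the filter condition are placeholders): one Delta table is read as a streaming source and another is written as the streaming sink, with the checkpoint location providing exactly-once writes.

```python
# Illustrative stream-batch sketch: a Delta table as both streaming source and
# streaming sink (one "hop" in a multi-hop pipeline). Paths are placeholders
# and the `spark` session is assumed from the previous sketch.
bronze_path = "/tmp/delta/bronze"
silver_path = "/tmp/delta/silver"

# Read the bronze Delta table as an unbounded stream of appended records.
bronze_stream = spark.readStream.format("delta").load(bronze_path)

# Light transformation, then continuously write into the silver Delta table;
# the checkpoint gives exactly-once semantics for the sink.
query = (bronze_stream
         .filter("score >= 0.5")          # assumes a "score" column exists
         .writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/silver")
         .start(silver_path))

# The same silver table remains queryable as an ordinary batch DataFrame,
# which is the stream-batch-integrated usage described above.
silver_batch = spark.read.format("delta").load(silver_path)
```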

Because Databricks open-sources only part of Delta Lake, and some features work best with the Databricks File System and the Databricks engine, community attention is not as high as for Hudi and Iceberg. In addition, Delta Lake does not provide primary keys by design, so highly concurrent update/delete is not as strong as Hudi's, nor does it provide Iceberg-style metadata-level query optimization, so query performance may not match Iceberg's. However, Delta Lake's emphasis on a stream-batch-integrated data architecture built with Spark, together with its native API-level support for machine learning applications, gives it good generality across business scenarios.

— Summary —

From the time dimension, Transwarp Inceptor was the first product to explore providing data warehouse capabilities on a data lake, and it reached large-scale production in 2016, so its product maturity is relatively high, with a clear advantage in the completeness of its distributed transaction implementation.

Hudi is designed for business scenarios with high-concurrency updates and deletes; like Transwarp Inceptor, it builds update and delete capability on top of Hadoop. Relatively speaking, the implementation details of Hudi's distributed transactions still need more time and production hardening to mature.

Iceberg's design suits analytical scenarios over massive data with many partitions but few transactional operations, making it a better fit for Internet companies. Iceberg also has very good abstractions in its software design, relatively complete support for various compute engines, and very careful partition optimization, so it remains attractive for scenarios with low transactional requirements.

Databricks open-sources only part of Delta Lake, and some features work best with the Databricks File System and engine, so community attention is not as high as for Hudi and Iceberg. Delta Lake has no standout design in terms of performance, its distributed transaction implementation is relatively simple, and transaction concurrency and isolation are still at an early stage. At present the project puts more emphasis on the stream-batch data architecture built with Spark and on native API-level support for machine learning applications.

As each project gradually completes its initial design goals, all of them want to further expand their applicable scenarios and are entering each other's territory. The rapid development of each project is also driving rapid iteration of the integrated lake-warehouse architecture.
