Analytical Databases: The Distributed Analytical Database

Another development direction for analytical databases is to replace MPP-style parallel computing with distributed technology. Distributed technology offers better scalability than MPP and better support for heterogeneous underlying hardware and software, which addresses several of the key architectural problems of MPP databases. This article introduces the distributed analytical database.

— Background —

At present, there has been relatively little academic research on distributed analytical databases in recent years; the industry has been the main driver of the related technologies. The storage layer of a distributed analytical database generally uses distributed storage or cloud storage, while the compute layer uses an independent distributed computing engine; in an MPP database, by contrast, both storage and compute are handled by multiple database instances. This is the biggest structural difference between the two.

When Hadoop began to rise, the scalability and elasticity advantages of the distributed architecture gradually became apparent. Big data technology represented by HDFS advanced big data processing in terms of scalability, elasticity, fault tolerance, and cost, but sacrificed key features of traditional SQL databases such as transactions, SQL support, the relational model, and security controls. SQL, as the de facto standard language of the database field, has inherent advantages over building big data analysis solutions with APIs (such as the MapReduce and Spark APIs). Since 2013, a large number of SQL-on-Hadoop engines have appeared and matured rapidly, with many production deployments at enterprises at home and abroad, which fully demonstrates the importance of SQL.

When a distributed analytical database is used to build a data warehouse, it must solve the problems of distributed transactions and high-concurrency batch processing, so the distributed transaction engine and the computing engine need to be rebuilt. Different databases in the industry adopt different technical solutions: most distributed transaction engines are built from scratch, while distributed computing engines generally use a DAG-style computing model.

A typical difference between distributed databases and MPP databases is the separation of compute and storage: storage services and computing services are no longer bound into one process, and their node counts can differ, enabling elastic compute. In real production workloads, the demand for compute elasticity is relatively large, while storage needs are more predictable. Some vendors use self-developed storage, such as Transwarp's ArgoDB, while others build directly on cloud storage and target the public cloud, such as AWS Redshift, Snowflake, and Databricks SQL. However, the design philosophies of analytical databases in private-cloud and public-cloud scenarios are very different, and the actual architectural differences are also significant; we will expand on this in subsequent chapters.

— Overall Architecture —

Because distributed databases started later, their designers could fully absorb the strengths of MPP databases, Hadoop, and other technologies, avoid the architectural defects of MPP databases, and solve problems such as elasticity and multi-tenant isolation. The logical architecture of a distributed analytical database is shown in the figure below; it mainly includes the service layer, SQL engine, distributed transaction engine, distributed computing engine, and storage engine. The main difference from the logical architecture of an MPP database is that the computing engine and storage engine are independent, whereas the bottom layer of an MPP database is based on a single-node relational database that bundles SQL, transaction, compute, and storage capabilities. Since the engines are relatively independent, the flexibility of the architecture offers many ways to solve the architectural problems of the original MPP database.

[Figure: logical architecture of a distributed analytical database]

  • Distributed Storage Engine

At the distributed storage engine level, there are currently quite a few highly available distributed storage systems in the industry based on the Paxos or Raft protocol. Because they serve analytical scenarios, the data is generally stored in a columnar format; typical implementations include the ORC and Parquet file formats. In analytical scenarios, only the required columns are read and irrelevant columns are skipped, which saves IO and yields high read throughput. In addition, the data within a column uses the same encoding (such as RLE, delta, or dictionary encoding), so it compresses well, and a compression ratio of 5~10x can generally be achieved. Moreover, since different columns are stored separately, an API for parallel data reading is generally provided, with each thread reading a different column, thereby improving parallel read performance. Another advantage of columnar storage is better support for both structured and unstructured data, so that data analysis of various data types can be supported on one platform.
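To illustrate why columnar encodings compress well, the following minimal Python sketch shows run-length encoding on a low-cardinality column. It is purely illustrative and not tied to the ORC or Parquet internals:

```python
def rle_encode(values):
    """Run-length encode a column into [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A low-cardinality column (e.g. a status flag) compresses dramatically.
column = ["OK"] * 1000 + ["FAIL"] * 2 + ["OK"] * 998
runs = rle_encode(column)
print(len(column), "values ->", len(runs), "runs")  # 2000 values -> 3 runs
assert rle_decode(runs) == column
```

Because every value in a column has the same type and often low cardinality, such encodings routinely reach the 5~10x ratios mentioned above, whereas row-oriented layouts interleave heterogeneous fields and compress far less effectively.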

 

Columnar storage has great performance advantages for reading data. Interfaces are generally designed to connect with the upper computing layer, providing read and write operations such as scans, filter pushdown, and index access. For data writing, however, columnar storage is not as good as row-based storage, and generally needs to be accelerated by an in-memory buffer at the upper layer, such as MariaDB's Version Buffer.
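The write-buffering idea can be sketched as follows: rows are appended to a cheap row-oriented buffer, and only when the buffer fills are they transposed into columnar chunks. This is a hypothetical minimal sketch of the pattern, not any vendor's actual implementation:

```python
class BufferedColumnWriter:
    """Buffer rows in memory (cheap appends), flush to column chunks in bulk."""

    def __init__(self, columns, flush_threshold=4):
        self.columns = columns
        self.flush_threshold = flush_threshold
        self.buffer = []                         # row-oriented write buffer
        self.chunks = {c: [] for c in columns}   # columnar on-"disk" chunks

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Transpose the buffered rows into one chunk per column.
        for c in self.columns:
            self.chunks[c].append([row[c] for row in self.buffer])
        self.buffer.clear()
```

Amortizing many small writes into one columnar flush is what lets a column store sustain write workloads despite its read-optimized layout.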

In addition, distributed transactions are an important feature and requirement of distributed storage, and are generally implemented with MVCC and a compaction mechanism. Modifying a given row in columnar storage is relatively complicated, so in practice each transaction does not modify the corresponding field in place; instead it generates a new version, writing a new data block that contains only the modified values. Each new version operation generates a new data block. When reading, the relevant delta blocks are selected according to the visible transaction numbers and merged with the base data to produce the final values. As the number of versions grows, read speed decreases, so a compaction mechanism must periodically merge the many multi-version files into a few files, thereby achieving an effective balance of read and write performance.
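The version-merge-compact cycle described above can be condensed into a small illustrative sketch (this is a toy model of the general MVCC-plus-compaction idea, not ArgoDB's implementation):

```python
class MvccColumn:
    """Updates append versioned delta blocks; reads merge the base data
    with every delta visible at the snapshot; compaction folds old deltas
    back into the base to keep reads fast."""

    def __init__(self, base):
        self.base = list(base)   # base column data
        self.deltas = []         # [(txn_id, {row_index: new_value})]

    def update(self, txn_id, changes):
        # Never modify in place: record a new delta block per transaction.
        self.deltas.append((txn_id, dict(changes)))

    def read(self, snapshot_txn):
        merged = list(self.base)
        for txn_id, changes in self.deltas:
            if txn_id <= snapshot_txn:          # visible to this snapshot
                for idx, value in changes.items():
                    merged[idx] = value
        return merged

    def compact(self, up_to_txn):
        # Merge all deltas up to a transaction into the base file.
        self.base = self.read(up_to_txn)
        self.deltas = [(t, c) for t, c in self.deltas if t > up_to_txn]
```

Note how `read` gets slower as `deltas` grows, which is exactly why `compact` is needed to restore the read/write balance.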

 

In practice, distributed management and operations capabilities must also be considered, including the ability to add and remove disks or nodes, data migration, and so on.

  • Distributed Computing Engine

The distributed computing engine is another important component. An excellent engine includes a computing framework, a variety of distributed operators, an optimizer, and resource management capabilities. For the computing framework, DAG or MPP mode is generally selected according to the design requirements. For operators, a large number of implementations can be designed around SQL primitives; for example, JOIN can be implemented as hash join, sort-merge join, index-scan join, skew-scan join, and so on, with a cost-based optimizer (CBO) automatically choosing among them. The long-term evolution of this direction is the autonomous database: a large number of optimization rules plus machine learning are used to generate further optimization rules for user scenarios, so that the database can automatically select the most appropriate execution plan without manual DBA intervention.
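As a sketch of why a CBO has several JOIN implementations to choose from, here are simplified single-node versions of two of the strategies mentioned above (inner equi-joins; a real engine runs these over column batches, with spilling and distribution handling):

```python
def hash_join(left, right, key):
    """Build a hash table on one side, then probe it with the other.
    Good when one side fits in memory."""
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    out = []
    for row in right:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

def sort_merge_join(left, right, key):
    """Sort both inputs on the key, then merge with two cursors.
    Good when inputs are large or already sorted."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i = [], 0
    for l in left:
        while i < len(right) and right[i][key] < l[key]:
            i += 1                                   # skip smaller keys
        j = i
        while j < len(right) and right[j][key] == l[key]:
            out.append({**l, **right[j]})            # emit all equal-key pairs
            j += 1
    return out
```

Both produce the same rows; the CBO's job is to pick the cheaper one based on estimated cardinalities, sort orders, and memory.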

In terms of resource management, effective integration with existing resource management frameworks, including YARN, Kubernetes, and the various public cloud platforms, is also one of the important tasks of a distributed database. Whether open-source computing frameworks such as Spark and Flink or the various commercial analytical databases, all are vigorously improving their resource management models to better support multi-tenancy and integration with cloud computing.

  • Distributed Transaction Engine

In the field of distributed databases, distributed transaction processing and optimization are key, actively developed technologies. Ensuring data consistency under a complex, fault-tolerant system architecture and supporting multiple transaction isolation levels (serializable, repeatable read, read committed, etc.) allows the database to support more applications. Two-phase commit (2PC), MVCC, and snapshot-based transaction isolation are all important implementation techniques. Analytical databases mainly handle low-concurrency transactions, most of which are batch modifications or insertions, so the requirements for transaction concurrency are not high; the implementation can even use lower-concurrency but simpler algorithms such as two-phase locking (2PL).
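The control flow of 2PC can be sketched minimally as follows. This shows only the happy-path protocol shape; a real implementation also needs write-ahead logging, timeouts, and coordinator failure recovery:

```python
class Participant:
    """A 2PC participant: votes in the prepare phase, then applies
    whatever global decision the coordinator broadcasts."""

    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "INIT"

    def prepare(self):
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit          # the participant's vote

    def finish(self, commit):
        self.state = "COMMITTED" if commit else "ABORTED"

def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)               # commit only if everyone voted yes
    # Phase 2: broadcast the global commit/abort decision.
    for p in participants:
        p.finish(decision)
    return decision
```

A single "no" vote in phase 1 aborts the whole transaction, which is how atomicity across nodes is preserved.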

  • SQL Engine

The SQL engine provides developers with SQL development capabilities and is the core interface for business development, so every database strives to provide comprehensive SQL support and complete SQL optimization capabilities. Since the SQL features of databases such as Oracle and Teradata are very complete, providing compatibility with Oracle and Teradata is a very challenging task that requires long-term, continuous investment.

— ArgoDB, Transwarp's Analytical Database —

With the application of big data technology in enterprises, the data architecture of domestic enterprises has become more and more complex, mainly reflected in the coexistence of offline and online business, of analytical and retrieval workloads, and of structured and unstructured data, with ever-higher requirements for database performance and multi-tenant service capabilities. Enterprises value performance more than elasticity or cost, so there is an urgent need for distributed analytical databases with extreme performance; this is also the main development direction of analytical databases in the private-cloud field.

Software design needs to fully consider the characteristics of the hardware. From SAS hard disks to SATA SSDs, PCIe SSDs, and then memory, performance has increased by orders of magnitude, which has also driven the redesign of database architectures. Driven by both application requirements and the technical architecture, Transwarp began planning an analytical database that does not rely on Hadoop storage in 2014, reusing existing SQL, transaction, and distributed computing engine capabilities and developing a new generation of flash-based distributed storage. The flash database ArgoDB was officially launched in 2018, aiming to serve as a unified solution for data warehouses, data lakes, and data marts.

[Figure: ArgoDB architecture]

The architecture of ArgoDB is shown in the figure above. Its core components include the distributed computing engine Crux, the SQL compiler, the distributed storage management layer TDDMS, and the storage engine Holodesk.

The SQL compiler inherits the capabilities of the Inceptor product line, achieving full SQL-99 compatibility, supporting the PL/SQL and DB2 SQL PL stored procedure specifications, and natively supporting the dialects of Oracle, DB2, and Teradata. To meet enterprises' data warehouse needs, ArgoDB also supports distributed transaction management and the four isolation levels.

ArgoDB implements its own resource scheduling inside the database to better support concurrent SQL tasks from different businesses, and combines it with the platform's own scheduling system to achieve two levels of fine-grained resource management and scheduling. First, ArgoDB runs as a resident service: after the database starts, it pre-allocates CPU and memory resources and divides them into multiple resource pools. In addition to allocating resources to each SQL statement based on FIFO or FAIR strategies, ArgoDB adds a Furion mechanism that manages resources as a tree: child nodes under the same tree node may borrow resources from each other, and each tree node allows ACLs or affinity to be set for different users or applications. In actual scheduling, as long as a CPU core is idle, a task is scheduled onto it to maximize resource utilization. To better support multiple businesses, ArgoDB allows different priorities and scheduling policies to be set based on characteristics such as user name, IP, business type, and submission time, and allows preemptive scheduling. In addition, each resource pool is guaranteed a minimum amount of resources, thus avoiding scheduling starvation.
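The tree-of-pools-with-borrowing idea can be sketched as below. This is loosely modeled on the Furion behavior described above; the class, field names, and single-sibling borrowing policy are illustrative assumptions, not ArgoDB internals:

```python
class ResourcePool:
    """Tree-structured resource pools: each pool has a guaranteed minimum
    (so no pool starves), and sibling pools may lend idle guaranteed cores
    to each other."""

    def __init__(self, name, guaranteed_cores, parent=None):
        self.name = name
        self.guaranteed = guaranteed_cores
        self.used = 0        # cores this pool is consuming
        self.lent = 0        # cores lent out to siblings
        self.parent = parent
        self.children = []
        if parent:
            parent.children.append(self)

    def siblings(self):
        if not self.parent:
            return []
        return [c for c in self.parent.children if c is not self]

    def try_acquire(self, cores):
        # Use this pool's own guaranteed share first (minus anything lent out).
        if self.used + cores + self.lent <= self.guaranteed:
            self.used += cores
            return True
        # Otherwise try to borrow idle guaranteed cores from a sibling.
        need = self.used + cores + self.lent - self.guaranteed
        for sib in self.siblings():
            idle = sib.guaranteed - sib.used - sib.lent
            if need <= idle:
                sib.lent += need
                self.used += cores
                return True
        return False
```

Because each pool's guaranteed share is respected before any lending happens, a bursty tenant can soak up idle capacity without ever starving its siblings.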

 

ArgoDB decomposes the distributed storage engine into two parts: the general distributed storage management layer TDDMS and the underlying storage engine Holodesk. TDDMS abstracts the underlying storage engine into a set of interfaces, including storage read/write interfaces, transaction interfaces, and computing engine optimization interfaces; any storage engine that implements these interfaces can be plugged into ArgoDB. TDDMS manages storage engines on top of the distributed consensus protocol Raft, providing high availability, backup, and disaster recovery for storage management, as well as operations and maintenance capabilities. Because TDDMS decouples storage management from the specific engine, new dedicated storage can be plugged in, removing the dependence on the Hadoop ecosystem. TDDMS can also accommodate multi-model storage and, together with the upper-level computing engine, realize unified multi-model storage and analysis. This is an important innovation in practice, as it avoids each database having to implement its own vertical storage management.
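The pluggable-engine idea can be sketched as an abstract interface plus one toy implementation. The method names here are hypothetical illustrations of the kind of contract described above, not the real TDDMS API:

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Abstract contract a storage engine must satisfy to plug in."""

    @abstractmethod
    def begin(self):
        """Start a transaction and return its id."""

    @abstractmethod
    def commit(self, txn_id):
        """Commit the transaction."""

    @abstractmethod
    def write(self, table, rows):
        """Append rows to a table."""

    @abstractmethod
    def scan(self, table, columns, predicate=None):
        """Read the given columns, optionally pushing down a filter."""

class InMemoryEngine(StorageEngine):
    """Toy engine: anything implementing the interface can be managed
    by the storage layer without the upper layers changing."""

    def __init__(self):
        self.tables, self.next_txn = {}, 0

    def begin(self):
        self.next_txn += 1
        return self.next_txn

    def commit(self, txn_id):
        return True

    def write(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

    def scan(self, table, columns, predicate=None):
        rows = self.tables.get(table, [])
        if predicate:
            rows = [r for r in rows if predicate(r)]   # filter pushdown
        return [{c: r[c] for c in columns} for r in rows]
```

The value of such a seam is that replication, high availability, and operations tooling live above the interface and are shared by every engine beneath it.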

 

Holodesk uses flash-based hybrid row-column storage, leveraging both the strong random IO of flash SSDs and the sequential read/write capabilities of ordinary HDDs. It is specially optimized for data reading and writing, achieving fast read/write performance and thereby giving the business better analytical capability. Holodesk also supports a variety of secondary indexing techniques and block-level pre-aggregation, which greatly improves data retrieval performance and better fits mixed workloads. Moreover, Holodesk is not limited to SSDs: it also supports three-tier hybrid storage of memory + flash + disk, and this multi-level storage lets users find a better balance between performance and hardware budget.
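A three-tier placement policy can be sketched as below. The capacities and the eviction rule are assumptions chosen for illustration, not Holodesk's actual policy:

```python
class TieredStore:
    """Memory / flash / disk tiers: new blocks land in the fastest tier,
    and when a tier is full, a block is demoted to the next tier down."""

    def __init__(self, mem_cap=2, flash_cap=4):
        self.caps = {"memory": mem_cap, "flash": flash_cap}
        self.tiers = {"memory": {}, "flash": {}, "disk": {}}

    def _demote(self, tier, lower):
        # Evict one block from the full tier into the slower tier.
        key, value = self.tiers[tier].popitem()
        self.put(key, value, lower)

    def put(self, key, value, tier="memory"):
        if tier != "disk" and len(self.tiers[tier]) >= self.caps[tier]:
            self._demote(tier, "flash" if tier == "memory" else "disk")
        self.tiers[tier][key] = value

    def get(self, key):
        # Probe the fastest tier first.
        for tier in ("memory", "flash", "disk"):
            if key in self.tiers[tier]:
                return self.tiers[tier][key]
        return None
```

Sizing the memory and flash tiers is exactly the performance-versus-budget knob the text describes: a larger fast tier keeps more hot blocks cheap to read.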

 

The distributed computing engine Crux is a vectorized engine. It adopts a DAG-based computing framework and is composed of multiple stateless executors, so compute can be scaled elastically with business load. The engine can not only read batch storage files quickly, but also run both simple and complex queries over small amounts of data at high speed. The in-memory data format is designed to be compatible with columnar storage, minimizing in-memory data conversion time. At the same time, it can dynamically analyze the SQL structure and, based on vectorization, select an efficient runtime row/column object model, which significantly saves memory while improving performance. The overall query flow of ArgoDB is as follows: the user's SQL is compiled by the SQL compiler into an execution plan and a Runtime Context, which are sent to the Crux Executors; the Executors access data in the storage layer through TDDMS, where F1/F2/F3/F4 each represent a data block. Holodesk stores 3 replicas by default, and the actual data blocks on the local file system are accessed through the TDDMS Tablet Server.

[Figure: ArgoDB query execution flow]
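The vectorized execution style can be illustrated with a tiny sketch: operators consume whole column batches instead of being called once per row, which amortizes per-tuple interpretation overhead. This is a generic illustration of vectorization, not Crux's implementation:

```python
def batches(column, batch_size):
    """Split a column into fixed-size batches (the unit of execution)."""
    for i in range(0, len(column), batch_size):
        yield column[i:i + batch_size]

def vectorized_sum_where(column, threshold, batch_size=1024):
    """Filter + aggregate, one batch at a time."""
    total = 0
    for batch in batches(column, batch_size):
        # The filter produces a selection mask over the whole batch...
        mask = [v > threshold for v in batch]
        # ...and the aggregate consumes only the selected values.
        total += sum(v for v, keep in zip(batch, mask) if keep)
    return total
```

In a real engine the per-batch loops compile down to tight SIMD-friendly kernels over columnar buffers, which is where the large speedups over tuple-at-a-time interpretation come from.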

Crux Executors and TDDMS storage are layered independently and can each elastically scale in and out according to load, thus solving the scalability problem, especially for compute. In the future, we plan to connect the TDDMS Tablet Server to various cloud platforms so that it can interact directly with cloud file systems at high speed, bringing data analysis capabilities to the cloud and serving enterprise customers on the public cloud. ArgoDB implements resource management inside the database, while the bottom layer uses container technology and Kubernetes for system-level resource scheduling, achieving very good multi-tenant capabilities through this two-layer resource scheduling mechanism.

 

Based on this architecture design and planning, ArgoDB has landed a large number of financial-grade production deployments within two years. In addition, in the TPC-DS data analysis and decision-support benchmark of the international benchmark organization TPC, Transwarp Inceptor was the first product in the world to pass the test, and ArgoDB was the fourth, which fully demonstrates the soundness of the overall architecture.

— Summary —

This article introduced the architectural principles of the distributed analytical database and the core capabilities of the ArgoDB analytical database. Compared with MPP databases, distributed databases separate storage from compute, thereby achieving elastic compute. Furthermore, in traditional enterprise data applications, an enterprise's data is often scattered across various data stores, and analysis requirements often span databases. How can such requirements be met? The next article will introduce the data federation architecture.

Origin blog.csdn.net/mkt_transwarp/article/details/130146934