PieCloudDB Database: The Birth of a Cloud-Native Distributed Virtual Data Warehouse

PieCloudDB Database, the flagship product of Hangzhou OpenPie Technology Development Co., Ltd. (OpenPie), is a cloud-native distributed virtual data warehouse. Through a range of innovative technologies, PieCloudDB consolidates physical data warehouses into a cloud-native data computing platform. It can create virtual data warehouses dynamically and compute flexibly on demand, breaking down data silos and supporting the data and computation that larger models require.

PieCloudDB 1.0 was released on October 24, 2022. It separates compute from storage, makes both elastic so that each can be paid for on demand, and provides multi-tenant isolation. On March 14, 2023, OpenPie officially released the cloud-on-cloud version of PieCloudDB. It is currently built on Alibaba Cloud and will soon be extended to other cloud platforms. The cloud-on-cloud version is intended to meet users' diverse data analysis needs and to establish best practices for public cloud data warehouse services.

PieCloudDB currently has four product versions, namely: 

  • Cloud-on-Cloud (CoC) version (free trial) 
  • Community Edition (free download) 
  • Enterprise Edition 
  • All-in-one version 

The PieCloudDB cloud-on-cloud version builds a rock-solid virtual data warehouse for enterprises and opens up unlimited data computing possibilities through optimal configuration of cloud resources. PieCloudDB Enterprise Edition and Community Edition provide enterprises with brand-new digital solutions based on cloud data warehouses, helping them build competitive barriers with data assets at the core, reduce operation and maintenance costs, and save development time for the customer.

The OpenPie R&D team chose cloud native as PieCloudDB's main track for several reasons. First, customer environments running traditional MPP databases share a common pain point: data silos. A customer's production environment often contains too many MPP database clusters, and even with data federation and similar techniques this can cause data consistency problems and waste storage space. Second, data, as a new factor of production, must circulate in order to generate greater value. To address these pain points, PieCloudDB implements a storage-compute separation architecture: user data is kept in shared storage (S3, HDFS, NAS) and metadata in FoundationDB, a shared NoSQL database. Storage and compute resources can then scale elastically and independently, which allows more flexible scaling out and in while maintaining high performance and helps users reduce costs and increase efficiency.

Let's review how PieCloudDB was born. PieCloudDB reworks PostgreSQL to separate storage from compute. We did not redesign and develop the database from the bottom layer up because, as the saying goes, every trade has its specialty: a sports car maker does not manufacture its own wheels, and what we focus on is coordinating the wheels underneath so the car runs faster and more stably. The underlying components of the database are like those wheels. Building on PostgreSQL, PieCloudDB has been extensively transformed and optimized to become distributed and to separate storage from compute. It makes full use of PostgreSQL's continuous innovation and resources and, combined with the innovation of the OpenPie R&D team, adds a large number of aggressive optimizations for distributed, OLAP, and cloud-native scenarios. The result is the cloud-native virtual data warehouse PieCloudDB.

1. The birth of PieCloudDB 

We designed and built PieCloudDB around four aspects: metadata management, data storage, data access acceleration, and distribution.

1.1 Metadata Management 

To break down data silos, PieCloudDB adopts a design that separates storage from compute. Under this design, users may start multiple virtual data warehouses that operate on the same data, so one virtual data warehouse may be updating data while another is reading it. To ensure data consistency across multiple virtual data warehouses (multi-coordinator), we designed a metadata service that decouples metadata from both user data and virtual data warehouses. The metadata service allows the metadata of different tenants under the same organization to be shared, and it implements distributed transactions and distributed locks.

The metadata service of PieCloudDB is built on FoundationDB. PieCloudDB emulates lightweight locks using FoundationDB's serializable transactions and ensures data consistency by implementing distributed locks on top of them.
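
As a rough illustration of how a lock can be layered on serializable transactions, the sketch below uses a toy in-memory key-value store with optimistic, version-checked commits. It is not PieCloudDB's code and does not use the FoundationDB API; `ToyKV`, `try_lock`, `unlock`, and the key and owner names are all hypothetical.

```python
import threading


class ConflictError(Exception):
    """Raised when a key read by the transaction changed before commit."""


class ToyKV:
    """In-memory stand-in for a transactional KV store (NOT FoundationDB)."""

    def __init__(self):
        self._data = {}                 # key -> (version, value)
        self._mutex = threading.Lock()  # makes each commit atomic

    def read(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        with self._mutex:
            return self._data.get(key, (0, None))

    def commit(self, read_versions, writes):
        """Apply writes only if every key that was read is still unchanged."""
        with self._mutex:
            for key, version in read_versions.items():
                if self._data.get(key, (0, None))[0] != version:
                    raise ConflictError(key)
            for key, value in writes.items():
                old_version = self._data.get(key, (0, None))[0]
                self._data[key] = (old_version + 1, value)


def try_lock(kv, lock_key, owner):
    """Take the lock if it is free; concurrent attempts conflict at commit."""
    version, holder = kv.read(lock_key)
    if holder is not None:
        return False
    try:
        kv.commit({lock_key: version}, {lock_key: owner})
        return True
    except ConflictError:
        return False


def unlock(kv, lock_key, owner):
    """Release the lock, but only if `owner` currently holds it."""
    version, holder = kv.read(lock_key)
    if holder != owner:
        return False
    try:
        kv.commit({lock_key: version}, {lock_key: None})
        return True
    except ConflictError:
        return False
```

If two virtual data warehouses call `try_lock` on the same key at the same time, both may read the key as free, but only one commit passes the version check, so only one of them gets the lock.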

Because PieCloudDB's metadata is stored in FoundationDB, every distributor reads metadata from FoundationDB, which puts pressure on it. PieCloudDB's metadata management therefore reduces access to the NoSQL database through caching and by persisting in advance the metadata that will never change.
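
The caching idea can be sketched as a read-through cache that keeps only entries flagged as immutable, so repeated lookups never reach the remote store. This is a minimal sketch under assumptions: `MetadataCache`, the `immutable` flag, and the example keys are hypothetical and not taken from PieCloudDB.

```python
class MetadataCache:
    """Read-through cache that retains only metadata marked as immutable."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable(key) -> value; hits the remote store
        self._immutable = {}     # entries that never change, safe to keep
        self.remote_reads = 0    # counter showing how often we go remote

    def get(self, key, immutable=False):
        if key in self._immutable:
            return self._immutable[key]
        self.remote_reads += 1
        value = self._fetch(key)
        if immutable:
            # Safe to keep forever: the caller promises it will not change.
            self._immutable[key] = value
        return value


# Usage: a plain dict stands in for the remote metadata store.
remote_store = {"file/0001/schema": "int,text", "table/t1/row_count": 42}
cache = MetadataCache(remote_store.__getitem__)
cache.get("file/0001/schema", immutable=True)
cache.get("file/0001/schema", immutable=True)  # served locally
```

Mutable entries such as the row count are deliberately not cached here; a real system would need an invalidation scheme for them.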

1.2 Data Storage and the Birth of JANM

PostgreSQL's HEAP is a row-oriented OLTP storage engine and is not well suited to analytical (OLAP) scenarios. To become a cloud-native OLAP database, PieCloudDB therefore needed extensive optimization and improvement for OLAP and cloud-native scenarios.

As a cloud-native virtual data warehouse, PieCloudDB's storage engine must support the low-cost object storage offered by cloud platforms while still delivering high-performance queries. In practice, this means being compatible with S3 storage while ensuring efficient access. Object storage such as S3 is easy to use and inexpensive, but its network latency is relatively high and random access to files performs poorly.

To suit these characteristics of S3, PieCloudDB created its own storage engine, Jianmo (JANM). The name Jianmo comes from "bamboo slips and ink writing": vertical bamboo strips are bound together horizontally to form a bamboo-slip book, a vivid picture of PieCloudDB's hybrid row-column storage layout. Jianmo's design exploits the strengths of S3 while taking measures to overcome its weaknesses in access latency and random reads and writes. As the database's storage engine, Jianmo ensures that all data in one S3 file has consistent MVCC visibility. To improve query performance, Jianmo applies many optimizations: for example, collected statistics are stored separately in KV storage and used for Data Skipping at query time, and aggregate calculations such as SUM and COUNT benefit from query optimizations such as pre-aggregation. Jianmo also provides TDE (transparent data encryption), data compression, support for large columns, an in-memory Arrow format, and cache friendliness. Its data block size is set to 16 MB, which avoids producing many small files that would make metadata management harder and effectively avoids the random file reads and writes that UPDATE/DELETE would otherwise trigger.
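
To make Data Skipping and pre-aggregation concrete, here is a minimal sketch that keeps per-block min/max, row count, and sum, then uses them to prune blocks for a range filter and to answer SUM and COUNT without reading any block. The `BlockStats` layout and function names are assumptions for illustration, not Jianmo's actual format.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    block_id: int
    min_value: int
    max_value: int
    row_count: int   # pre-aggregated COUNT for the block
    total: int       # pre-aggregated SUM for the block


def build_stats(blocks):
    """Collect statistics once, when blocks are written."""
    return [
        BlockStats(i, min(b), max(b), len(b), sum(b))
        for i, b in enumerate(blocks)
    ]


def blocks_to_scan(stats, low, high):
    """Keep only blocks whose [min, max] range overlaps [low, high]."""
    return [
        s.block_id for s in stats
        if s.max_value >= low and s.min_value <= high
    ]


def sum_and_count(stats):
    """Answer SUM and COUNT over all rows from statistics alone."""
    return sum(s.total for s in stats), sum(s.row_count for s in stats)


blocks = [[1, 5, 9], [20, 25, 30], [100, 150]]
stats = build_stats(blocks)
print(blocks_to_scan(stats, 22, 28))  # only block 1 can hold matching rows
print(sum_and_count(stats))           # (340, 8), no block was read
```

With large blocks on object storage, every skipped block saves a whole high-latency remote read, which is why storing the statistics separately from the data pays off.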

In addition, to take full advantage of modern hardware, PieCloudDB's storage engine design takes the cache access patterns of modern CPUs and GPUs into account and further optimizes data locality to support SIMD, SIMT, and parallel computing.

When selecting a storage engine, the R&D team also evaluated open source storage formats such as Parquet. In the end, the team decided to create its own storage format rather than use one of them directly, mainly for the following reasons:

  • No need to store a schema: many storage formats such as Parquet carry their own schema, while PieCloudDB has little need to store one;
  • Native Postgres-aware storage format: this avoids some extra deserialization work when reading stored data;
  • Flexible and controllable: features similar to TOAST can be implemented without being constrained by an external format.

Although open source storage formats such as Parquet are not used directly, PieCloudDB supports access to them through the Foreign Data Wrapper feature so that users keep flexibility in their data storage formats.

1.3 Data Access Acceleration 

PieCloudDB's storage engine Jianmo supports object storage such as S3. S3 storage has limitations of its own, including bandwidth and latency constraints, and it is not friendly to random writes. To address these bottlenecks, PieCloudDB has made many optimizations to speed up data access so that storage is no longer the bottleneck during query execution:

  • Caching: caching is a common way to speed up data access. PieCloudDB implements local caching, and a more efficient distributed cache is planned. PieCloudDB also manages data in "cold", "warm", and "hot" tiers.
  • Consistent hashing of cache files: because the cache is local, cached data might otherwise have to be read across nodes. To avoid cross-node cache reads and writes, PieCloudDB places cache files by consistent hashing, which improves the cache hit rate.
  • Data Skipping: PieCloudDB implements many query optimizations for better query performance. Data Skipping (Block Skipping) uses pre-computed statistics to determine whether a data block contains the required data, skipping every block that does not.
  • Common optimizations for S3 access: parallelization, read-ahead, asynchronous access, MPP engine "Steal", etc.
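
The consistent-hashing idea from the list above can be sketched as a hash ring with virtual nodes: each cache file maps to one node, and adding a node moves only the files that land on the new node. This is a generic textbook construction under assumptions; `HashRing`, the `vnodes` parameter, and the node and file names are hypothetical and not PieCloudDB's implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, file_name: str) -> str:
        """Return the node owning the first ring point after the file's hash."""
        idx = bisect.bisect_right(self._points, _hash(file_name))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["exec-0", "exec-1", "exec-2"])
owner = ring.node_for("block-000042")  # always the same node for this file
```

Because a given file always resolves to the same node, a query that needs that file can be routed to the node most likely to have it cached already.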

1.4 Distributed PieCloudDB 

PieCloudDB implements a distributed engine. Metadata in FoundationDB is accessed only on the Coordinator, which reduces the number of requests to FoundationDB and avoids putting excessive pressure on it. Data for the executors is dispatched accurately and efficiently, mainly by the distributor, which has itself been heavily optimized.

With these four steps complete, PieCloudDB was born. Next, to give PieCloudDB better query performance while keeping it stable, we continued to optimize the metadata management system and added support for features such as aggregation pushdown, precomputation, and block skipping. We also completed optimizations and enhancements for bulk data modification, an initial backup feature, VACUUM enhancements, and automatic collection and update of statistics.

2. The road is long and hard, but those who keep walking will arrive

PieCloudDB is still growing. Next, we will continue to polish the PieCloudDB core, iterate on metadata storage, user data storage, and the computing engine, and make the optimizer friendlier to OLAP scenarios.

  • metadata storage

For metadata storage, we will first optimize the cache in depth, using an efficient cache system to greatly reduce access to FoundationDB. Second, metadata will be decoupled from state, so that state that does not need reliable storage no longer has to be kept in FoundationDB. Third, through some refactoring, we will make the abstractions more decoupled, reduce complexity, and improve stability.

  • user data storage 

For user data storage, more features, including dictionary pages and Bloom filters, will be added according to the priority of computing needs, and distributed caching and scheduling will be optimized.
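
For readers unfamiliar with Bloom filters, the sketch below shows the general technique: a compact bit array that can say "definitely not present" or "possibly present", which lets a scan skip blocks that cannot contain a value. It illustrates the data structure only, not the planned PieCloudDB feature; `BloomFilter` and its default parameters are assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


block_filter = BloomFilter()
for customer_id in ["c-1001", "c-1002", "c-1003"]:
    block_filter.add(customer_id)
# A False answer here would let a scan skip the block without reading it.
can_skip = not block_filter.might_contain("c-9999")
```

Min/max statistics work well for range filters on sorted or clustered data, while a Bloom filter helps with equality lookups on high-cardinality columns where min/max ranges overlap heavily.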

  • computing engine 

The computing engine will be the focus of the next iteration of PieCloudDB. First, we plan to finish the SIMD executor and various computation optimizations. Next, we will complete the Pipeline engine to make full use of multi-core CPUs, improve CPU utilization, and get the most out of a single node, further reducing costs and increasing efficiency. Third, we will complete resource-scheduling isolation for the computing engine and make PieCloudDB an "operating system" and cornerstone of data computing, which is PieCloudDB's long-term goal.

This is how PieCloudDB was born. You are welcome to visit the official website and try the PieCloudDB cloud-on-cloud version. We also look forward to you joining our technical community and walking hand in hand with us!

Origin blog.csdn.net/OpenPie/article/details/131227183