Backed by Data Warehouse Virtualization Technology, PieCloudDB Passes the 2023 "Trusted Database" Performance Evaluation of the China Academy of Information and Communications Technology

"Trusted Database" is the first database evaluation system in China, and is widely recognized by the industry as one of the important measurement standards for product capabilities. PieCloudDB demonstrated excellent data processing speed, stability and scalability in this evaluation, providing users with powerful data analysis and query capabilities.

On June 15 and 16, the China Academy of Information and Communications Technology held the expert review meeting for its "Trusted Database" evaluation for the first half of 2023. After on-site testing, product data review, test report review, question-and-answer sessions, centralized evaluation and other review stages, a total of 33 products from 28 companies passed the review. PieCloudDB, the cloud-native virtual data warehouse that is Tuoshupai's first data computing engine, passed the evaluation and obtained the distributed analytical database performance test certificate.

 

The Road to Performance Innovation of Cloud Native Virtual Data Warehouse PieCloudDB

The excellent performance of PieCloudDB stems from its innovative architecture and advanced technical design. PieCloudDB realizes the separation of storage and computing on the cloud, data warehouse virtualization, and high-performance storage and computing capabilities, and provides users with efficient and reliable solutions for massive data processing, fast query and analysis.

For a cloud-native virtual data warehouse product, performance is a key indicator of success. The PieCloudDB team has developed a number of technologies that give PieCloudDB the flexibility and scalability of a cloud-native product while also delivering strong performance in data analysis.

1 Storage optimization

PieCloudDB separates metadata from user data on the cloud. User data is stored in the object storage provided by the public cloud platform (such as AWS S3), which can greatly reduce the user's storage costs. However, object storage such as S3 has limitations: network latency for reads is high, random reads and writes within a file are not supported, and so on. PieCloudDB designed a new storage engine, Jianmo (JANM), around the strengths and weaknesses of object storage. The name Jianmo comes from "Bamboo Slips and Ink Book": vertical bamboo strips are tied together horizontally to form bamboo slips, which is a vivid picture of PieCloudDB's hybrid row-column storage layout. Jianmo's design takes advantage of the strengths of S3 while mitigating its access latency and its lack of random reads and writes.

The PieCloudDB team has optimized Jianmo extensively for the characteristics of OLAP and cloud-native scenarios. For example, statistics are collected at the data block level and used at query time for Data Skipping and query optimization; aggregate functions such as SUM and COUNT are pre-computed; and the storage format is designed to support transparent data encryption (TDE), efficient data compression, cache-friendly access and other features.
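The idea of block-level statistics and pre-computed aggregates can be illustrated with a small sketch. This is not Jianmo's actual format: `BlockStats`, `write_block` and `count_and_sum` are hypothetical names, and the sketch only assumes that each block records its row count, sum, minimum and maximum when it is written.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-block statistics, computed once at write time."""
    row_count: int
    total: int
    minimum: int
    maximum: int


def write_block(values: List[int]) -> BlockStats:
    # Collect the statistics while the block is being written.
    return BlockStats(len(values), sum(values), min(values), max(values))


def count_and_sum(blocks: List[BlockStats]) -> Tuple[int, int]:
    # COUNT(*) and SUM(col) are answered from statistics alone;
    # no block data has to be fetched from object storage.
    return sum(b.row_count for b in blocks), sum(b.total for b in blocks)


blocks = [write_block([1, 2, 3]), write_block([10, 20]), write_block([7])]
count, total = count_and_sum(blocks)
print(count, total)
```

Under these assumptions an unfiltered `SELECT COUNT(*), SUM(col)` touches only the small statistics records, which matters when every data read is a network round trip.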

In addition, Jianmo's design takes the architecture of modern CPUs and GPUs into account, so that SIMD and SIMT execution can be used to further improve data access efficiency.

2 Data Access Optimization

Jianmo is a cloud-native storage engine with many optimizations targeted at the object storage of public cloud platforms. Even so, network latency remains a factor that cannot be ignored during data access, and improving data access speed was an important problem to solve in PieCloudDB's architecture design.

PieCloudDB has done a lot of work on data access acceleration:

2.1 Data cache

Caching is a common way to speed up data access. PieCloudDB implements a local cache, and a more efficient distributed cache is planned. PieCloudDB also manages the cache in "cold", "warm" and "hot" tiers according to data access frequency.
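As a rough illustration of frequency-based tiering, the sketch below assigns a tier from a simple access counter. The class name `TieredCacheIndex` and the thresholds `warm_after` and `hot_after` are assumptions made for this example; the article does not describe how PieCloudDB actually decides tier boundaries.

```python
from collections import Counter


class TieredCacheIndex:
    """Tracks access counts and maps each cached file to a tier (illustrative only)."""

    def __init__(self, warm_after: int = 2, hot_after: int = 5):
        self.warm_after = warm_after  # assumed threshold, not from the article
        self.hot_after = hot_after    # assumed threshold, not from the article
        self.accesses = Counter()

    def record_access(self, file_name: str) -> None:
        self.accesses[file_name] += 1

    def tier(self, file_name: str) -> str:
        count = self.accesses[file_name]  # Counter returns 0 for unseen keys
        if count >= self.hot_after:
            return "hot"
        if count >= self.warm_after:
            return "warm"
        return "cold"


index = TieredCacheIndex()
for name, times in [("a.blk", 6), ("b.blk", 2), ("c.blk", 1)]:
    for _ in range(times):
        index.record_access(name)
tiers = {name: index.tier(name) for name in ("a.blk", "b.blk", "c.blk")}
print(tiers)
```

A real implementation would likely also age the counters over time so that data that was hot last week can cool down; that is left out here for brevity.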

2.2 Use consistent hash algorithm to improve cache hit rate

Because PieCloudDB's cache is local to each node, a query could otherwise end up needing cached data that lives on a different node. To avoid cross-node cache reads and writes, PieCloudDB uses consistent hashing to decide which node caches each file, which improves the cache hit rate.
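A minimal consistent-hash ring, assuming files are placed by hashing their names, might look like the following. `ConsistentHashRing`, the node names and the use of MD5 with 64 virtual nodes per node are choices made for this sketch, not details taken from PieCloudDB.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Maps file names to nodes; each node owns several points on the ring."""

    def __init__(self, nodes, vnodes: int = 64):
        points = []
        for node in nodes:
            for i in range(vnodes):
                points.append((_hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    def node_for(self, file_name: str) -> str:
        # First ring point clockwise from the file's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(file_name)) % len(self._hashes)
        return self._nodes[idx]


files = [f"block-{i}.dat" for i in range(1000)]
ring3 = ConsistentHashRing(["node-a", "node-b", "node-c"])
ring2 = ConsistentHashRing(["node-a", "node-b"])  # node-c removed
before = {f: ring3.node_for(f) for f in files}
after = {f: ring2.node_for(f) for f in files}
moved = [f for f in files if before[f] != after[f]]
print(len(moved), "of", len(files), "files changed node")
```

The property that matters for cache hit rate is that removing a node only relocates the files that node was holding; files cached on the remaining nodes keep their placement, so their cache entries stay valid.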

2.3 Use Data Skipping to accurately load data

Data Skipping (Block Skipping) uses pre-computed block statistics to determine whether a data block can contain the required data, and skips the blocks that cannot.
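The sketch below shows the general min/max form of data skipping for a predicate of the form `value > threshold`. The `Block` type and `scan_greater_than` function are hypothetical; the article does not say which statistics PieCloudDB keeps beyond being block-level.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    minimum: int
    maximum: int
    rows: List[int]


def make_block(rows: List[int]) -> Block:
    return Block(min(rows), max(rows), rows)


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks that had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.maximum <= threshold:
            continue  # no row in this block can satisfy value > threshold
        blocks_read += 1
        matches.extend(v for v in block.rows if v > threshold)
    return matches, blocks_read


blocks = [make_block([1, 5, 9]), make_block([12, 18, 15]), make_block([30, 22, 41])]
matches, blocks_read = scan_greater_than(blocks, 20)
print(matches, blocks_read)
```

Here two of the three blocks are rejected using their maximum alone, so only one block would need to be fetched from object storage.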

2.4 General optimization of S3 access

These include reading data in parallel, prefetching data, and reading data asynchronously.
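To show why parallel ranged reads help when each request carries network latency, here is a small sketch using a thread pool. `fetch_range` is a stand-in that sleeps to imitate a round trip; it is not an S3 client call, and `OBJECT`, `parallel_read` and the chunk size are assumptions for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

OBJECT = bytes(range(256)) * 16  # stand-in for a 4 KiB object in object storage


def fetch_range(start: int, end: int) -> bytes:
    """Hypothetical ranged GET; sleeps to imitate network round-trip latency."""
    time.sleep(0.01)
    return OBJECT[start:end]


def parallel_read(size: int, chunk: int, workers: int = 8) -> bytes:
    ranges = [(off, min(off + chunk, size)) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the chunks join correctly.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


data = parallel_read(len(OBJECT), chunk=512)
print(len(data))
```

With eight ranges and eight workers the requests overlap, so the total wait is close to one round trip rather than eight. Prefetching applies the same idea ahead of time, issuing requests for blocks the scan is expected to need next.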

3 Powerful query optimizer

If efficient data access is one side of the coin, a powerful query optimizer is the other. For a cloud-native data warehouse product, the query optimizer is just as important to the product's success. Its primary task is to turn the complex SQL found in users' OLAP workloads into an efficient query plan.

The PieCloudDB team built a distributed query optimizer, "Daqi", to optimize users' SQL queries end to end. Query optimization in PieCloudDB is divided into four stages: the preprocessing stage, the scan/join optimization stage, the optimization stage other than scanning/joining, and the post-processing stage.

3.1 Preprocessing stage

In the preprocessing stage, the optimizer "Daqi" transforms the query tree into a simpler and more efficient form through logically equivalent rewrites. In addition to the common predicate pushdown, the Daqi optimizer applies a large number of SQL rewrites, for example:

  • Convert IN, EXISTS and other types of subqueries into semi-joins
  • Pull up subqueries in the FROM clause into joins
  • Convert OUTER JOIN to INNER JOIN/ANTI JOIN
  • Distribute constraints
  • Build equivalence classes
  • Collect outer join information
  • Eliminate useless joins
  • Simplify expressions
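The first rewrite in the list, turning an IN subquery into a semi-join, can be sketched on in-memory rows. The tables, column layout and function names below are invented for illustration and say nothing about how Daqi represents plans; the point is only that the two forms return the same rows while the semi-join evaluates the inner side once.

```python
# Rows are (order_id, customer_id) and (customer_id, region).
orders = [(1, "c1"), (2, "c2"), (3, "c1"), (4, "c9")]
customers = [("c1", "EU"), ("c2", "US"), ("c1", "EU")]  # note the duplicate c1 row


def in_subquery(orders, customers, region):
    # Naive evaluation of:
    #   SELECT * FROM orders
    #   WHERE customer_id IN (SELECT customer_id FROM customers WHERE region = ?)
    # The subquery is re-evaluated for every outer row.
    return [o for o in orders
            if o[1] in [c[0] for c in customers if c[1] == region]]


def hash_semi_join(orders, customers, region):
    # Semi-join: build the inner key set once, then emit each outer row
    # at most once, no matter how many inner rows match it.
    keys = {c[0] for c in customers if c[1] == region}
    return [o for o in orders if o[1] in keys]


naive = in_subquery(orders, customers, "EU")
rewritten = hash_semi_join(orders, customers, "EU")
print(naive == rewritten, rewritten)
```

Note that the duplicate customer row does not duplicate any order in the output, which is what distinguishes a semi-join from an ordinary inner join.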

3.2 Scan/join optimization stage

At this stage, the work of the optimizer "Daqi" can be divided into two steps. First, scan paths are generated for each base table, and the cost of each scan path and the size of its result set are estimated; these estimates feed the cost of the subsequent join operations. Second, "Daqi" searches the space of join orders to find the optimal join path. This step is computationally very expensive. PieCloudDB provides two algorithms for it, dynamic programming and a genetic algorithm, and chooses between them based on a GUC (configuration parameter) value. If the query involves outer joins, the join order cannot be rearranged as freely as with inner joins, because outer joins constrain the order; this further increases the complexity of the step.
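The dynamic programming approach to join ordering can be sketched as a search over table subsets. This is a textbook-style simplification, not Daqi's algorithm: it considers only left-deep plans over inner joins, and the cost model (sum of estimated intermediate result sizes), the cardinalities and the selectivities are all made up for the example.

```python
from itertools import combinations


def best_join_order(card, sel):
    """card: {table: rows}; sel: {frozenset({a, b}): selectivity}.

    Pairs missing from sel are treated as cross products (selectivity 1.0).
    Returns (cost, estimated_rows, join_order) for the cheapest left-deep plan.
    """
    tables = sorted(card)
    # best[subset] = cheapest known plan that joins exactly those tables
    best = {frozenset([t]): (0.0, float(card[t]), (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for t in combo:  # t is the table joined last
                rest = subset - {t}
                cost, rows, order = best[rest]
                factor = 1.0
                for r in rest:
                    factor *= sel.get(frozenset((r, t)), 1.0)
                out_rows = rows * card[t] * factor
                candidate = (cost + out_rows, out_rows, order + (t,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(tables)]


card = {"a": 1000, "b": 100, "c": 10}
sel = {frozenset(("a", "b")): 0.01, frozenset(("b", "c")): 0.1}
cost, rows, order = best_join_order(card, sel)
print(order, cost)
```

Because every subset of tables is examined, the work grows exponentially with the number of tables. That is the usual reason for offering a heuristic search such as a genetic algorithm as an alternative when a query joins many tables.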

3.3 Optimization stage other than scanning/joining

At this stage, "Daqi" first processes GROUP BY, aggregation, window functions and DISTINCT, then set operations, and finally ORDER BY. Each of these operations generates one or more paths. "Daqi" filters the paths by cost and adds LockRows, Limit and ModifyTable nodes to the paths that remain.

3.4 Post-processing stage

After the first three stages, "Daqi" has selected the optimal path. In the post-processing stage, "Daqi" converts this path into a query plan and makes some final adjustments to the plan.

It is worth mentioning that, on the strength of its work on the cloud-native virtual data warehouse, Tuoshupai participated in compiling the "Database Development Research Report (2023)" released at the 2023 Trusted Database Development Conference, and was selected for the "China Database Industry Map (2023)".

Databases are complex system software, and distributed databases make the software even more complex. Therefore, it is even more difficult to build an efficient distributed cloud-native database. The results obtained by PieCloudDB fully demonstrate the industry's recognition of PieCloudDB's performance.

 


 

 

 


Origin my.oschina.net/u/5944765/blog/10089415