How does a data warehouse achieve integrated lake-house data analysis?

Introduction: With the growing popularity of cloud computing and expanding data analysis requirements, integrated data lake + data warehouse ("lake house") analysis has become a core capability of next-generation data analysis systems. Compared with data warehouses, data lakes have clear advantages in cost, flexibility, and multi-source data analysis. Of the ten forecasts for China's cloud computing market released by IDC in 2021, three relate to data lake analysis. It is foreseeable that cross-system integration, data control, and more comprehensive data-driven capabilities will be key areas of competition for future data analysis systems.

1. Background


 

AnalyticDB PostgreSQL edition (ADB PG for short) is a cloud-native data warehouse product built by the Alibaba Cloud database team on the PostgreSQL kernel (PG for short). ADB PG has distinctive technical strengths in business scenarios such as real-time interactive analysis of PB-scale data, HTAP, ETL, and BI report generation. As a data warehouse product, how does ADB PG provide integrated lake-house analysis? This article describes how ADB PG builds its data lake analysis capability on top of PG's foreign table mechanism.

 


ADB PG inherits PG's foreign table functionality, and its current lake-house capability is built mainly on foreign tables. Through PG foreign tables, ADB PG can query and write data in other data analysis systems: while staying compatible with multiple data sources, it reuses the strengths of ADB PG's existing optimizer and execution engine. ADB PG's integrated lake-house analysis currently supports reading or writing multiple data sources, including OSS, MaxCompute, Hadoop, RDS PG, Oracle, and RDS MySQL. Users can flexibly apply ADB PG to areas such as data storage, interactive analysis, and ETL, and can implement multiple data analysis functions within a single instance. ADB PG can carry the core of a data analysis workflow on its own, or serve as one link among many in a larger data pipeline.

 

However, analyzing external data relies on external SDKs and network I/O for reading and writing. Because the characteristics of a network differ greatly from those of a local disk, foreign table access requires different technical treatment and different performance optimizations than local storage. Taking OSS foreign table reads and writes as an example, this article introduces some important problems ADB PG encountered while building its lake-house analysis capability, along with their solutions.

 

2. Problem analysis

The ADB PG kernel can be divided into the optimizer, the execution engine, and the storage engine. Foreign table analysis reuses the core of the original optimizer and execution engine with only minor modifications; the main extension is at the storage engine layer, where foreign table data is read and written through external interfaces. Foreign table data lives in another distributed system and must be reached from ADB PG over the network, which is the core difference from reading local files. On the one hand, different external systems expose different remote access interfaces that must be accommodated in engineering terms; for example, the data reading interfaces of OSS and MaxCompute differ. On the other hand, accessing data on remote machines over the network raises common issues such as network latency, request amplification, bandwidth limits, and network stability.

 


 

This article focuses on the core challenges above and introduces some important technical points from ADB PG's foreign table analysis work in supporting OSS data analysis. OSS is a low-cost distributed storage system launched by Alibaba Cloud; it stores a large amount of hot and cold data and therefore carries substantial data analysis demand. To make development easier, OSS provides SDKs for mainstream languages such as Java, Go, C/C++, and Python; ADB PG is developed with the OSS C SDK. ADB PG now fully supports OSS foreign table analysis: apart from a different table creation statement, users can access OSS foreign tables just like local tables, with concurrent reads and writes and common data formats such as CSV, ORC, and Parquet.

 


 

3. Foreign table analysis optimizations

 

Next, we introduce some core technical problems ADB PG solved while developing OSS foreign table analysis on top of the OSS C SDK.

 

3.1 Fragmented network requests

In analytical database scenarios, the industry generally holds that columnar storage outperforms row storage on I/O: a columnar scan reads only the columns a query needs, while a row-store scan reads the full data, so columnar storage saves I/O resources. During development, however, the team found that in some scenarios, such as scans of wide tables with many columns, the supposedly faster columnar formats actually performed worse than scanning CSV, a row-oriented text format. Investigation revealed two causes: when scanning the ORC/Parquet formats, the client interacted with the OSS server too frequently, and each individual request ADB PG sent to OSS covered too little data. Together these caused serious performance problems.

 

Compared with local disk I/O, the round-trip latency of network I/O is often several orders of magnitude higher. If parsing a columnar format such as ORC/Parquet treats each network request like a local disk read, the bandwidth saved by the format's high compression ratio is not enough to offset the round-trip latency amplified by fragmented requests, so measured performance falls short of expectations. The fix is to reduce fragmented network requests through caching: each time ADB PG scans OSS data, it "preloads" a sufficiently large chunk and caches it. Each read first checks the cache; on a hit it is served directly from the cache, otherwise the next round of preloading begins. This reduces the number of network requests and raises the efficiency of each one. The preload cache size is configurable and defaults to 1MB.
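The preload logic described above amounts to a small read-through cache. The following Python sketch is illustrative only; the class name and the `fetch(offset, size)` callback are our assumptions, not the actual ADB PG implementation.

```python
class PrefetchReader:
    """Read-through cache over a remote object: each cache miss fetches one
    large chunk, so many small format-level reads cost one network round trip."""

    def __init__(self, fetch, chunk_size=1024 * 1024):
        self.fetch = fetch            # fetch(offset, size) -> bytes; one network request
        self.chunk_size = chunk_size  # "preload" size, 1MB by default
        self.buf_off = 0
        self.buf = b""

    def read(self, offset, size):
        # Miss: preload a whole chunk starting at the requested offset.
        if not (self.buf_off <= offset and
                offset + size <= self.buf_off + len(self.buf)):
            self.buf_off = offset
            self.buf = self.fetch(offset, max(size, self.chunk_size))
        start = offset - self.buf_off
        return self.buf[start:start + size]
```

With a 1MB chunk, a hundred 64-byte reads of adjacent column data trigger a single network request instead of a hundred, which is exactly the round-trip amplification the cache is meant to remove.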


3.2 Column filtering and predicate pushdown

Because network I/O performance is usually lower than local storage I/O performance, bandwidth consumption should be minimized when scanning external data. ADB PG uses column filtering and predicate pushdown to achieve this when processing ORC and Parquet format files.

 

Column filtering means the foreign table requests only the data columns the SQL query needs and ignores the rest. Because ORC and Parquet are both columnar formats, a foreign table scan can issue network requests for just the byte ranges of the required columns, greatly reducing network I/O. ORC and Parquet also compress column data, reducing I/O further.
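To make the saving concrete, here is a small sketch of column filtering over per-column byte ranges such as a columnar file's stripe footer would describe. The column names and sizes are made up for illustration and do not come from a real file.

```python
# Hypothetical per-column (offset, length) byte ranges within one stripe.
stripe_columns = {
    "l_orderkey": (0, 400_000),
    "l_quantity": (400_000, 350_000),
    "l_comment":  (750_000, 4_000_000),  # a wide text column dominates the stripe
    "l_shipdate": (4_750_000, 300_000),
}

def ranges_to_request(columns, wanted):
    """Column filtering: request only byte ranges of columns the query references."""
    return sorted(columns[name] for name in wanted)

# A query like SELECT sum(l_quantity) FROM lineitem touches a single column:
needed = ranges_to_request(stripe_columns, ["l_quantity"])
stripe_bytes = sum(length for _, length in stripe_columns.values())
read_bytes = sum(length for _, length in needed)
```

In this made-up layout the scan fetches 350,000 of 5,050,000 bytes, under 7% of the stripe, before compression is even considered.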

 

Predicate pushdown moves filter conditions from upper nodes of the execution plan (such as WHERE clause conditions) down to the foreign table scan node, so that the scan's network requests can skip data blocks that cannot satisfy the query, again reducing network I/O. ORC/Parquet files store statistics such as min/max/sum for each column of a block in the block's header. A foreign table scan first reads these header statistics and compares them against the pushed-down conditions; if a column's statistics cannot satisfy the conditions, the block's data is skipped outright.

 

Here is a brief look at how predicate pushdown is implemented for ORC foreign tables. An ORC file is divided by rows into several Stripes, and data within a Stripe is stored by column. Each Stripe is further divided into Row Groups, with every 10,000 rows across all columns forming one Row Group, as shown below.

[Figure: ORC file layout, Stripes subdivided into Row Groups]

 

ORC files store statistics at three levels: file-level and Stripe-level statistics at the end of the file, and Row Group-level statistics at the head of each Stripe. With these, an ORC foreign table scan can filter at file, Stripe, and Row Group granularity. Concretely, whenever a new ORC file is scanned, the file-level statistics at its tail are read first; if they cannot satisfy the query conditions, the whole file is skipped. Next, all Stripe-level statistics at the tail are read to filter out Stripes that cannot match. Finally, for each surviving Stripe, the Row Group statistics in its header are read to filter out unnecessary data.
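The three-level filtering above can be sketched as nested min/max range checks. This is a simplified model, assuming a range predicate `lo <= col <= hi` and representing every statistics entry as a `(min, max)` pair; the function names are ours.

```python
def overlaps(stats, lo, hi):
    """A (min, max) statistics entry can contain matching rows only if the
    ranges overlap; otherwise the whole unit is skipped unread."""
    smin, smax = stats
    return smin <= hi and smax >= lo

def row_groups_to_scan(file_stats, stripes, lo, hi):
    """Three-level filtering for a predicate lo <= col <= hi:
    file -> Stripe -> Row Group, each level using its own min/max statistics.
    `stripes` is a list of (stripe_stats, [row_group_stats, ...])."""
    if not overlaps(file_stats, lo, hi):
        return []                      # skip the whole file
    selected = []
    for si, (s_stats, groups) in enumerate(stripes):
        if not overlaps(s_stats, lo, hi):
            continue                   # skip the whole Stripe
        for gi, g_stats in enumerate(groups):
            if overlaps(g_stats, lo, hi):
                selected.append((si, gi))
    return selected
```

A selective predicate can thus reduce the scan to a handful of Row Groups, and the network requests only ever cover those groups' byte ranges.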

 

3.3 The "996" problem

The OSS C SDK defines a set of error codes for abnormal conditions; "996" refers to error code -996 defined by the OSS C SDK, alongside similar codes such as -998, -995, and -992. These errors typically mean an OSS foreign table import or export failed because of a network abnormality, and -996 is the most common.

 

Internally, the OSS C SDK uses CURL to communicate with the OSS server over the network; the corresponding CURL errors are commonly CURL 56 (Connection reset by peer), 52, and so on. These network abnormalities usually occur because, under high load, the OSS server actively closes client connections it considers "inactive". During a large OSS data import or export, the client spends time in other stages of the execution plan and cannot keep the connection busy with continuous communication, so the OSS server treats the connection as inactive and closes it.

 

The usual remedy is for the client to retry. During development, however, it turned out that even with an automatic retry mechanism added to the client interface, the errors did not go away. Investigation showed that the OSS C SDK maintains a connection pool of CURL handles to improve connection efficiency, but handles that have hit a network error are also returned to the pool. A retry may therefore pick up the same faulty CURL handle for communication, so the -996 errors persist.

 

With the root cause known, the solution is straightforward: in the CURL handle recycling interface, we added a check of the handle's state and destroy abnormal handles instead of returning them to the connection pool. This keeps invalid CURL handles out of the pool, so when the client interface retries, it picks a valid handle or creates a new CURL connection. Of course, automatic retry only applies to situations that are actually retryable.

 

[Figure: CURL connection pool workflow]

 

① When ADB PG accesses an OSS foreign table, it first takes a connection from the CURL connection pool, creating a new one if none exists.

② ADB PG communicates with the OSS Server through the CURL connection handle.

③ The OSS Server returns the result of the communication through the CURL connection handle.

④ A handle that returned normally is put back into the pool after use, ready for reuse.

⑤ A handle in an abnormal state is destroyed.
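The five steps above amount to a connection pool with a health check on recycle. Here is a minimal sketch; the class and attribute names are illustrative stand-ins, not the OSS C SDK's API.

```python
class Conn:
    """Stand-in for a CURL easy handle (illustrative, not the SDK's type)."""
    def __init__(self):
        self.broken = False   # set when a network error is observed on this handle
        self.closed = False

    def close(self):
        self.closed = True

class ConnectionPool:
    """Pool that never recycles a handle that hit a network error:
    on release, a broken handle is destroyed instead of being pooled."""

    def __init__(self, factory):
        self.factory = factory
        self.idle = []

    def acquire(self):
        # Step 1: reuse an idle handle, or create a new one if none exists.
        return self.idle.pop() if self.idle else self.factory()

    def release(self, conn):
        if conn.broken:
            conn.close()             # step 5: destroy the abnormal handle
        else:
            self.idle.append(conn)   # step 4: pool the healthy handle for reuse
```

Because a broken handle never re-enters the pool, the next `acquire()` after a failure always yields a healthy or freshly created connection, which is what makes the retry mechanism effective.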

 

3.4 Compatibility issues of memory management schemes

ADB PG is based on the PostgreSQL kernel and inherits PostgreSQL's memory management mechanism. PostgreSQL manages memory through the process-oriented MemoryContext, while the OSS C SDK uses the thread-safe APR Pool. Under MemoryContext, each allocation can be explicitly released by a free call, and the context handles memory bookkeeping. Under APR Pool, however, the only visible operations are creating a pool, allocating from it, and destroying the pool; there is no interface to explicitly free an individual allocation.

 

This means we need a clear understanding of the lifecycle of memory held through OSS C SDK interfaces; otherwise, memory leaks and use-after-free problems are very easy to introduce. We usually allocate APR Pool memory in one of two ways:

· Method 1 suits low-frequency operation interfaces, such as fetching the list of OSS files.

· Method 2 suits frequently re-entered operation interfaces, such as periodically requesting data from a specified range of a specified file on OSS.
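The original illustration of the two methods is not reproduced here. As a rough Python analogy only (not the APR API, and with the pool scoping of the two methods being our assumption), arena-style lifetimes look like this:

```python
class ArenaPool:
    """Rough analogy of an APR-style pool: individual allocations are never
    freed one by one; everything is reclaimed together when the pool is
    cleared or destroyed."""

    def __init__(self):
        self.allocs = []

    def alloc(self, size):
        buf = bytearray(size)
        self.allocs.append(buf)   # no per-allocation free exists
        return buf

    def clear(self):
        self.allocs.clear()       # release every allocation at once

# Method 1 (assumed scoping): a short-lived pool for one low-frequency call.
def list_files_once():
    pool = ArenaPool()
    buf = pool.alloc(4096)        # request/response buffers for this call only
    # ... perform the listing using buf ...
    pool.clear()                  # the whole pool is torn down with the call

# Method 2 (assumed scoping): a long-lived pool cleared between iterations.
session_pool = ArenaPool()
for _ in range(3):
    chunk = session_pool.alloc(1 << 20)   # per-iteration read buffer
    # ... fill chunk from the remote range; copy out what must survive ...
    session_pool.clear()          # reclaim iteration memory, keep the pool
```

The key discipline in both patterns is that nothing allocated from a pool may outlive the pool's next clear or destroy, which is exactly the lifecycle understanding the text calls for.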

 

In this way, the memory management incompatibility between ADB PG and the OSS C SDK can be handled cleanly.


3.5 Compatibility and optimization of data format

Most data on OSS is in CSV, ORC, Parquet, and similar formats. Because formats like ORC/Parquet encode the underlying data differently from ADB PG's own encoding, data type conversion is an unavoidable step in foreign table scans. Type conversion essentially re-encodes data from one representation into another. For example, ORC and ADB PG represent the Decimal type differently: in ORC, a Decimal64 stores the numeric value in an int64, with precision and scale giving the digit count and decimal position, whereas ADB PG stores a Decimal's numeric value in an int16 array. The conversion algorithm must perform repeated division and modulo operations on every value, which is very CPU-intensive.
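To illustrate why the conversion is CPU-heavy, here is a simplified Python sketch of turning an ORC Decimal64 unscaled value into base-10000 digits of the kind a PostgreSQL-style numeric stores in its int16 array. The function name is ours, and real numerics additionally track sign, weight, and display scale.

```python
NBASE = 10000  # PostgreSQL-style numerics store base-10000 digits in an int16 array

def unscaled_to_digits(value):
    """The CPU-heavy inner loop of Decimal conversion: repeated divide/modulo
    turns a binary unscaled integer into base-10000 digits, most significant
    first. Sign, weight, and scale bookkeeping is omitted for brevity."""
    value = abs(value)
    digits = []
    while value:
        value, rem = divmod(value, NBASE)
        digits.append(rem)
    return list(reversed(digits)) or [0]

# An ORC Decimal64 with unscaled value 123456789 and scale 2 represents
# 1234567.89; its base-10000 digit array is [1, 2345, 6789].
```

Running this divide/modulo loop for every Decimal in a scan of billions of rows is what the binary-export optimization below avoids.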

 

To cut the CPU cost of type conversion and further improve foreign table query performance, ADB PG skips the conversion step when exporting data through a foreign table and writes its data into the external file directly in binary form; querying the foreign table then requires no type conversion at all. For example, when exporting to an ORC foreign table, any data type can be written directly as ORC's Binary type. The binary data stored in ORC is already encoded as the corresponding ADB PG type, so querying the ORC foreign table can skip the conversion step entirely, reducing CPU consumption. TPCH query tests show an overall performance improvement of roughly 15%-20%.

 

4. Performance test

For how to use foreign table analysis in ADB PG, see the Alibaba Cloud product manual (https://help.aliyun.com/document_detail/164815.html?spm=a2c4g.11186623.6.602.78db2394eaa9rq). Apart from the table creation statement, operating on a foreign table is almost identical to operating on a local table, so the learning curve is gentle. Below we compare the performance of OSS foreign table analysis against local table analysis.

 

Environment configuration. The test machines are Alibaba Cloud ECS d1ne.4xlarge instances, each with 16 Intel Xeon E5-2682v4 cores, 64GB of memory, and 4 local HDDs with read/write speeds of about 200MB/s per disk. The test deployed ECS instances as two master nodes and four segment nodes, for a total of 16 segments, and ran the TPCH queries against the 1TB data set generated by the official tool.

 

For local tables, we tested the compressed column store (AOCS) and HEAP formats; for OSS foreign tables, we tested the CSV, ORC, Parquet, and JSON formats. The table below shows the total execution time of the 22 TPCH queries. The data shows that of the two local formats, AOCS queries run slightly faster than HEAP. For foreign tables, CSV, ORC, and Parquet are somewhat slower than local tables, by roughly 50%, while JSON is markedly slower than the other formats. The JSON gap is mainly due to the slow parsing speed of the format itself and has nothing to do with foreign tables.

[Table: total execution time of the 22 TPCH queries by storage format]

The figure below shows per-query times for the 22 TPCH queries; the gap between local tables and foreign tables varies by query. Given the advantages of foreign tables in storage cost, flexibility, and scalability, the application potential of ADB PG foreign table analysis is substantial.

[Figure: per-query execution times for the 22 TPCH queries]

5. Summary

Lake-house integration is an important capability of next-generation data warehouse products. As a powerful, extensible data warehouse, ADB PG has built analysis and write support for a variety of data sources on top of PG foreign tables and has accumulated substantial performance optimization techniques along the way. Going forward, ADB PG will continue to invest in product functionality, cost-effectiveness, cloud-native capability, and lake-house integration to offer users more functionality, better performance, and lower cost.

 

Source: blog.csdn.net/weixin_43970890/article/details/115180882