Exploration and Practice of HashData-Based Lake Storage Integrated Solution

On April 7, 2023, the twelfth "Data Technology Carnival" (DTC 2023), co-sponsored by the China DBA Union (ACDU) and the Motianlun Community, opened at the Crowne Plaza Beijing New Yunnan. On April 8, Li Jun, Senior Solution Architect at HashData, delivered a talk titled "Exploration and Practice of HashData-Based Lake Storage Integrated Solution" at special session 6, "Fusion Application: Lake Storage Technology Innovation".

This article is compiled from a transcript of the speech. The full text follows (reading time: about 20 minutes):

1. The evolution of integration of lake and warehouse

The concept of the data warehouse gained wide acceptance after Bill Inmon published the book "Building the Data Warehouse" in 1991. After 30 years of development, data warehouses are now widely used in industries such as finance, telecommunications, and aviation.

The data warehouse offers easy access for BI and reporting systems and strong data management and governance capabilities. However, with the rise of big data, its disadvantages became apparent: no support for unstructured data, the high cost of proprietary systems, proprietary data formats, and limited flexibility.

The concept of the data lake emerged alongside big data around 2010. It offers low storage costs and supports unstructured data. Data lakes were once expected to replace data warehouses, but as they entered practical use, people gradually discovered their disadvantages: insufficient support for BI systems, low query performance, non-real-time data interaction, and poor reliability.

After fierce debate in academia and industry over data lakes versus data warehouses, a consensus was eventually reached: data warehouses and data lakes are like apples and oranges. They are entirely different things and will not replace each other.

Rather than replacing each other, the two will coexist and together form an enterprise's data platform. The logical data warehouse concept proposed by Gartner includes both parts, data warehouse and data lake, and this is also the current state of most enterprises.

But innovators were not satisfied with the status quo. Around 2020, Databricks first proposed the concept of the Lakehouse, commonly translated in China as "lake-warehouse integration".

As the name suggests, Lakehouse takes its first half from Data Lake and its second half from Data Warehouse. The implication is that the Lakehouse absorbs the advantages of both data lakes and data warehouses to create a new kind of platform.

The Lakehouse raises new requirements in data format, data type, data access, reliability, governance and security, performance, scalability, and user-scenario support.

To meet these new requirements, a Lakehouse must have the following key capabilities.

  • Separation of storage and compute

Key capabilities that data lakes need to improve:

  • Transactions

  • BI support

  • Performance

  • Data governance and security

Key capabilities that data warehouses need to improve:

  • Multiple data types

  • Machine learning

  • Cost

2. The development of lakehouse technology abroad

Speaking of lakehouse technology abroad, the three most discussed open-source solutions are Delta Lake, Hudi, and Iceberg. Delta Lake is Databricks' own solution. I had the opportunity to take part in Delta Lake product training and trials; it does deliver the key capabilities in transactions, BI support, performance, and so on, and the experience was very good.

Apache Hudi is a competitor to Delta Lake.

Apache Iceberg is another competitor to Delta Lake. It was precisely the rapid development of the open-source Hudi and Iceberg that pushed Delta Lake from a commercial model to open source.

When it comes to Iceberg, one concept deserves attention: the Table Format. A Table Format is an abstraction layer that lets the compute engine work with tables instead of directly operating on the underlying storage formats (ORC, Parquet, etc.) as before. This concept is very important and will come up again later in this talk.
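To make the Table Format idea concrete, here is a toy Python sketch of what such a layer tracks: the table schema and, per committed snapshot, the list of data files that make up the table at that moment. This is an illustration of the concept only, not Iceberg's actual metadata layout; the class and field names are invented for this example.

```python
class ToyTableFormat:
    """Toy illustration of a table format: readers ask this layer for the
    current file list instead of listing object storage directly."""

    def __init__(self, schema):
        self.schema = schema      # column name -> type
        self.snapshots = []       # each snapshot is an immutable file list

    def commit(self, data_files):
        # A commit produces a new snapshot referencing data files
        # (e.g. Parquet or ORC objects) without rewriting old snapshots.
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(data_files))

    def current_files(self):
        # Compute engines plan a scan from this list, not from storage.
        return self.snapshots[-1] if self.snapshots else []


table = ToyTableFormat({"plate": "string", "ts": "timestamp"})
table.commit(["s3://bucket/tbl/part-000.parquet"])
table.commit(["s3://bucket/tbl/part-001.parquet"])
print(table.current_files())
```

Because every commit is a new snapshot, older snapshots remain readable, which is the basis for features like time travel in real table formats.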

The three open-source solutions above, Delta Lake, Apache Hudi, and Apache Iceberg, all take the technical route of evolving data lakes toward data warehouses. As a data warehouse vendor, HashData will open a new perspective for everyone: evolving the data warehouse toward the data lake.

3. HashData Innovation and Exploration Practice

HashData's original product prototype was based on Greenplum, a typical MPP architecture in which storage and compute are coupled: data storage and computation both reside on the same data nodes.

After iterative redesign for cloud native, HashData v3 adopts an architecture that separates services, compute, and storage. This effectively eliminates the bucket effect (weakest-link bottleneck) of traditional MPP, enabling the HashData warehouse to support ultra-large-scale clusters.

HashData has been successfully applied in an ultra-large-scale data warehouse service at Bank C. As of the end of 2022, more than 20,000 data nodes were running in production, storing roughly 13PB of data.

Another challenge in evolving the warehouse toward the lake is how to keep costs low. Figures from Huawei Cloud's official website show that object storage costs only a fraction of the price of disks and SSDs, so storing all data in object storage greatly reduces the overall cost of the solution. Unfortunately, object storage performs poorly on I/O, which sacrifices performance. To balance price and performance, we adopt multi-level storage: persistent data lives in object storage, and a hotspot cache is added at the compute layer, which solves the problem well.
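The multi-level storage idea can be sketched as a small LRU cache sitting between the compute layer and a slow object store. This is a minimal toy model, not HashData's actual cache implementation; the class names and eviction policy are assumptions chosen for illustration.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for object storage: cheap capacity, slow per-request I/O."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        self.reads += 1  # each call models a slow network round trip
        return self._blobs[key]


class HotspotCache:
    """Toy LRU cache at the compute layer: hot blocks are served locally,
    and only cache misses touch the object store."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)      # mark as recently used
            return self._cache[key]
        data = self.store.get(key)            # slow path: object-store read
        self._cache[key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict least recently used
        return data


store = ObjectStore()
store.put("block-1", b"segment data")
cache = HotspotCache(store, capacity=2)
cache.get("block-1")   # miss: one object-store read
cache.get("block-1")   # hit: served from the local cache
print(store.reads)     # -> 1
```

With a hot working set that fits in the cache, repeated scans hit local storage, which is why the benchmark results stay close to local-disk performance despite object storage underneath.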

Using object storage, the overall cost of the HashData data lake solution can be reduced to one tenth of the original, while the hotspot cache preserves performance. Benchmark reports show performance very close to the original level.

For machine-generated data such as IoT data, HashData supports quasi-real-time writes from streaming compute engines, improving the timeliness of data analysis.
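Quasi-real-time ingestion of this kind is commonly built as micro-batching: the streaming engine buffers records and flushes them to the warehouse in small batches. The sketch below is a generic illustration of that pattern under assumed names (`MicroBatchWriter`, `flush_size`), not HashData's actual ingestion API.

```python
class MicroBatchWriter:
    """Toy micro-batch sink: buffer streaming records and flush them to a
    downstream sink in batches for quasi-real-time loading."""

    def __init__(self, flush_size, sink):
        self.flush_size = flush_size
        self.sink = sink          # callable that receives a list of records
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


batches = []
writer = MicroBatchWriter(flush_size=3, sink=batches.append)
for reading in [{"sensor": i, "value": i * 0.1} for i in range(7)]:
    writer.write(reading)
writer.flush()  # flush the trailing partial batch
print([len(b) for b in batches])   # -> [3, 3, 1]
```

The flush size (or a companion time interval) is the knob that trades ingestion latency against per-batch load overhead.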

In the case of Energy Group A, the unified data lake already stores 1.7PB of data covering oil reservoirs, geology, exploration, and production, including the streaming data generated by the machinery and equipment mentioned above.

Semi-structured data is now well supported by most databases, so we will not dwell on it. The focus is unstructured data. A database can in fact store pictures in binary form, but this is cumbersome to use and not a good solution.

For unstructured data analysis, our current solution has two parts:

  1. Raw files are stored in object storage.

  2. The parsed structured data is stored in the database for easy retrieval and comparison.

    To illustrate further, consider the case of highway checkpoint-camera data analysis. After a camera captures the license plate, the original photo is stored in object storage as primary evidence. The parsed license plate number, color, and timestamp are stored in the HashData database to support applications such as traffic statistics monitoring and toll-evasion audits.

For machine learning, HashData supports calling functions from SQL to run machine learning inside the database, and now also offers more open native Python support.

To sum up, the HashData lakehouse solution is built on an architecture that separates services, compute, and storage, and serves multiple scenarios, including data warehouses, data lakes, and data element markets.
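The pattern of invoking machine learning from SQL can be sketched with standard-library SQLite standing in for the warehouse: a Python-backed function is registered and then called directly inside a query. This only demonstrates the SQL-callable-function pattern; the function name `sigmoid` and the schema are invented for this example, and HashData's actual in-database ML functions differ.

```python
import math
import sqlite3

# SQLite stands in for the warehouse to show the pattern: register a
# Python function, then score rows from plain SQL without moving data out.
conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, lambda x: 1.0 / (1.0 + math.exp(-x)))

conn.execute("CREATE TABLE features (x REAL)")
conn.executemany("INSERT INTO features VALUES (?)", [(0.0,), (2.0,), (-2.0,)])

# The model inference runs inside the SQL engine's row loop.
scores = [round(s, 3) for (s,) in
          conn.execute("SELECT sigmoid(x) FROM features ORDER BY x")]
print(scores)   # -> [0.119, 0.5, 0.881]
```

Keeping inference in the database avoids exporting feature tables to a separate ML runtime, which matters at warehouse scale.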

 
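Returning to the checkpoint-camera case, the two-part pattern (raw photo in object storage, parsed attributes in the warehouse) can be sketched with a hypothetical schema. SQLite again stands in for the warehouse, and the table, columns, and object key below are all invented for illustration.

```python
import sqlite3

# Hypothetical schema: only parsed attributes plus a pointer (the object
# key) go into the warehouse; the raw photo stays in object storage.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plate_capture (
        plate      TEXT,
        color      TEXT,
        capture_ts TEXT,
        object_key TEXT   -- pointer to the original photo in object storage
    )
""")
conn.execute(
    "INSERT INTO plate_capture VALUES (?, ?, ?, ?)",
    ("A12345", "blue", "2023-04-08 09:30:00",
     "s3://evidence-bucket/2023/04/08/cam42/0001.jpg"),
)

# Traffic statistics and audits query the structured table; the photo is
# fetched from object storage only when the original evidence is needed.
row = conn.execute(
    "SELECT object_key FROM plate_capture WHERE plate = ?", ("A12345",)
).fetchone()
print(row[0])
```

The structured table stays small and fast to query, while object storage holds the bulky originals at low cost.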

4. Thinking and Prospect of the Integration of Lake and Warehouse

After lakehouse convergence, a pattern of unified storage plus multiple compute engines will take shape. For the convergence of data formats, HashData will later introduce Iceberg as its Table Format.

Beyond the platform convergence shared today, for more on data models, data governance, and data asset management, please refer to the two magazines mentioned above.


Origin blog.csdn.net/m0_54979897/article/details/130153833