A Detailed Explanation of the Data Warehouse, the Data Lake, and Lake-Warehouse Integration


With the rise of the data lake concept in recent years, the industry has been comparing, and even arguing about, data warehouses and data lakes. Some say the data lake is the next-generation big data platform; major cloud vendors are proposing their own data lake solutions, and some cloud data warehouse products have added features that integrate with data lakes.

But what actually distinguishes a data warehouse from a data lake? Is this a dispute over technical routes, or a battle over data management? Are the two incompatible, or can they coexist and even complement each other?

The author of this article works in Alibaba's computing platform division and has been deeply involved in building Alibaba's big data and data middle platform. Taking a historical perspective, he analyzes where data lakes and data warehouses come from, explains the new direction in which the two are converging and evolving, namely lake-warehouse integration, and introduces a lake-warehouse integration solution based on Alibaba Cloud MaxCompute and EMR DataLake.

01 Twenty years of change in the big data field

1.1 Overview

The big data field has been developing for some twenty years, from the beginning of this century to the present. Viewed at a macro level, its development can be summarized in the following five aspects:

1. Data keeps growing rapidly. Measured against the 5V core attributes, the big data field continues to grow quickly. The Alibaba economy, a heavy user of and investor in big data, has sustained high growth in data volume over the past five years (60%-80% annualized), and that growth is expected to continue for the foreseeable future. For start-ups, data in the big data field is growing by more than 200% per year.

2. Big data is widely recognized as a new factor of production. The value positioning of big data has shifted from "exploration" to "inclusiveness": it has become a core function of enterprises and governments and carries key workloads. Again taking Alibaba as an example, 30% of employees directly submit big data jobs. As big data inclusiveness moves into production environments, enterprise-grade capabilities such as reliability, security, management and control, and ease of use have been strengthened accordingly.

3. Data management capabilities have become a new focus. Data warehouse (data middle platform) capabilities have become popular, and making good use of data has become a core competitive advantage for enterprises.

4. Engine technology has entered a period of convergence. With Spark (general-purpose computing), Flink (stream computing), HBase (key-value), Presto (interactive analysis), Elasticsearch (search), and Kafka (data bus) gradually establishing themselves in the open-source ecosystem between roughly 2010 and 2015, few new open-source engines have emerged in the past five years; instead, each engine has been developing in depth (better performance, production-grade stability, and so on).

5. Platform technology is evolving along two paths: the data lake and the data warehouse. Both focus on data storage and management (platform technology), but in different directions.

1.2 Looking at lakes and warehouses from the perspective of big data technology development

First, the concept of the data warehouse appeared much earlier than the data lake; it can be traced back to the 1990s, when databases reigned supreme. It is therefore worth sorting out, in historical context, roughly when these terms emerged, where they came from, and, more importantly, why. Broadly speaking, the development of data processing technology in computer science can be divided into four stages:

1. Phase 1: the database era. Databases were first born in the 1960s, and the relational databases we know today appeared in the 1970s. They shone for the following thirty years or so, giving rise to many excellent relational databases such as Oracle, SQL Server, MySQL, and PostgreSQL, and became an indispensable part of mainstream computer systems of the time. By the 1990s, the concept of the data warehouse had been born.

At this point, the data warehouse was mostly a methodology for managing the many database instances within an enterprise. But constrained for a long time by the processing power of single-machine databases and the high cost of multi-machine setups (sharding by database and table), data warehouses remained out of reach for ordinary enterprises and users. People were even still arguing about which was more feasible: the data warehouse (unified, centralized management) or the data mart (managed by department or domain).

2. Phase 2: the "exploration period" of big data technology. Around 2000, with the explosion of the Internet, billions and then tens of billions of web pages and massive volumes of user clicks ushered in a new era of rapidly growing global data volumes.

Traditional database solutions could no longer provide computing power at an acceptable cost. The enormous demand for data processing began looking for a breakthrough, and the big data era began to take shape. In 2003, 2004, and 2006, Google published three classic papers (GFS, MapReduce, and BigTable) that laid the foundations of the basic technical framework of the big data era: distributed storage, distributed scheduling, and the distributed computing model.

Then, almost simultaneously, excellent distributed technology systems emerged, represented by Google's stack, Microsoft Cosmos, and open-source Hadoop; this also includes Alibaba's Feitian system. At this point, people were excitedly pursuing the scale of data processing, that is, "big" data, and had no time to debate whether the result was a data warehouse or a data lake.

3. Phase 3: the "development period" of big data technology. Entering the second decade of the 21st century, as more and more resources were invested in big data computing, big data technology entered a stage of vigorous development, and the field as a whole began to shift from merely usable to easy to use.

Instead of expensive hand-written MapReduce jobs, a variety of SQL-based computing engines sprang up. These engines are optimized for different scenarios, but they all use SQL, a language with a very low barrier to entry, which greatly reduced the cost of adopting big data technology. The unified data warehouse that people had dreamed of in the database era finally became a reality, and the methodologies of the database era began to resurface. During this period, the technical routes began to diverge.

Integrated systems promoted by cloud vendors, such as AWS Redshift, Google BigQuery, Snowflake, and MaxCompute, became the data warehouses of the big data era. The open HDFS storage represented by the open-source Hadoop ecosystem, together with open file formats, open metadata services, and a collaborative, multi-engine working model (Hive, Presto, Spark, Flink, and so on), formed the prototype of the data lake.

4. Phase 4: the "popularization period" of big data technology. Today, big data technology is no longer rocket science; it has penetrated all walks of life, and the popularization period has arrived. Beyond scale, performance, and ease of use, the market now places more comprehensive enterprise-grade production requirements on big data products, such as cost, security, and stability.

  • On the open-source Hadoop line, iteration of basic components such as engines, metadata, and storage has reached a relatively stable state, and public awareness of open-source big data technology is higher than ever. On one hand, the convenience of the open architecture has won it a good market share. On the other hand, that loose, open architecture runs into bottlenecks when building enterprise-grade capabilities, particularly in data security, strict identity and permission control, data governance, and collaboration efficiency (for example, Ranger as the permission-control component and Atlas as the data-governance component still cannot fully cover today's mainstream engines). At the same time, the evolution of the engines themselves poses further challenges to the existing open architecture: the emergence of self-contained, closed-loop designs such as Delta Lake and Hudi has, to some degree, cracked the foundation of "one storage system, one metadata service, many collaborating engines".
  • It was AWS that truly popularized the data lake concept. AWS built an open, collaborative product portfolio with S3 as centralized storage, Glue as the metadata service, and EMR and Athena as engines. Its openness resembles that of the open-source ecosystem, and Lake Formation was launched in 2019 to solve the problem of security and trust between these products. Although this architecture still lags relatively mature cloud data warehouse products in enterprise-grade capabilities, it remains very attractive to users of open-source technology stacks because the architecture is similar and easy to understand. Following AWS, other cloud vendors have adopted the data lake concept and offer similar product solutions on their own clouds.
  • The data warehouse products promoted primarily by cloud vendors have developed well, and their core capabilities keep growing. Performance and cost have improved substantially (MaxCompute completed a comprehensive upgrade of its core engine with large performance gains, setting the TPCx-BigBench world record three years in a row); data management capabilities have been strengthened as never before (data middle platform modeling methodology, intelligent data warehousing); enterprise-grade security capabilities have flourished (multiple authorization models including ACL-based and rule-based authorization, column-level fine-grained authorization, trusted computing, storage encryption, data masking, and so on); and federated computing has generally been enhanced, so that data not stored in the warehouse itself has, to some extent, begun to be brought under management, and the boundary with the data lake has become increasingly blurred.

To sum up, the data warehouse is a concept born in the database era that blossomed, in the big data era, into the various data warehouse services of cloud vendors; today it usually refers to the integrated, big data-based services provided by cloud vendors. The data lake grew out of the open design of the open-source technology stack of the big data era; after AWS integrated and promoted the concept, it usually denotes a big data solution composed of a series of cloud products or open-source components.

02 What is a data lake

In recent years the data lake concept has been very popular, but its definition is not uniform. Let's first look at some definitions of the data lake.

Wikipedia's definition of a data lake:

A data lake is a system that stores data in its natural or raw format, such as object blobs or files. It is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, data analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). A data lake can be built on the Apache Hadoop distributed file system, on cloud storage services such as Azure Data Lake or AWS Lake Formation, or with solutions such as the Alluxio virtual data lake. A data swamp is a deteriorated data lake that is inaccessible to its users or provides little value.

AWS's definition is relatively succinct:

A data lake is a centralized repository that allows you to store all structured and unstructured data at any scale. You can store data as is (without first structuring it) and run different types of analytics – from dashboards and visualizations to big data processing, real-time analytics and machine learning to guide better decision making.

Other cloud vendors such as Azure also have their own definitions, so this article will not repeat them.

But no matter how different the definition of data lake is, the essence of data lake actually includes the following four parts:

1. Unified storage system

2. Store raw data

3. Rich computational models/paradigms

4. A data lake is independent of whether it runs on the cloud

Judging by these four criteria, HDFS, the storage system of open-source big data, is a standard data lake architecture: a unified store for raw data. The "data lake" that has been widely discussed recently is in fact a narrower concept, referring specifically to a data lake system built on cloud-hosted storage, with storage and compute separated architecturally, for example a data lake based on AWS S3 or Alibaba Cloud OSS.

The following figure shows how the data lake architecture has evolved, in three stages overall:

1. Phase 1: a self-built, open-source Hadoop data lake. Raw data is stored uniformly on HDFS, the engines are mainly the open-source Hadoop and Spark ecosystem, and storage and compute are co-located. The drawback is that the enterprise must operate, maintain, and manage the entire cluster itself, which is costly, and cluster stability is poor.

2. Phase 2: a Hadoop data lake hosted on the cloud (that is, an EMR open-source data lake). The underlying physical servers and open-source software versions are provided and managed by the cloud vendor; data is still stored uniformly on HDFS, and the engines are still mainly the Hadoop and Spark open-source ecosystem.

Through the cloud's IaaS layer, this architecture improves flexibility and stability at the machine level and lowers the enterprise's overall operations cost. However, the enterprise still has to manage and govern HDFS and the running state of its services, that is, the operations work at the application layer. Moreover, because storage and compute are coupled, stability is not ideal, the two kinds of resources cannot scale independently, and the cost of use is not optimal.

3. Phase 3: a data lake architecture on the cloud. Fully managed cloud storage gradually replaces HDFS as the storage infrastructure of the data lake, and the range of engines keeps expanding: in addition to the Hadoop and Spark ecosystem, cloud vendors have developed their own engine products for the data lake.

For example, data lake engines for analytics include AWS Athena and Huawei DLI, while AWS SageMaker serves AI workloads. This architecture keeps the pattern of one storage system and many engines, which makes a unified metadata service crucial: AWS launched Glue for this purpose, and Alibaba Cloud EMR will soon release a unified metadata service for the data lake (a hedged sketch of this shared-catalog pattern appears after the list below). Compared with the native HDFS data lake architecture, the advantages of this architecture are:

  • It frees users from the difficult job of operating and maintaining an HDFS system themselves. HDFS operations are hard because, compared with a compute engine, a storage system has higher stability requirements and higher operational risk. With a storage-compute separation architecture, storage is decoupled and handed over to the cloud vendor for unified operation and maintenance, which solves the stability and operations problems.
  • The separated storage system can scale independently and no longer needs to be coupled with compute, which reduces overall cost.
  • Adopting a data lake architecture also, in practice, helps customers unify their storage (solving the problem of multiple HDFS data silos).
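To make the "one storage system, many engines, one metadata service" pattern concrete, here is a minimal, illustrative sketch in Athena-style SQL. Everything in it is a placeholder assumed for the example (the s3://example-datalake bucket, the raw_logs database, the columns); the point is only that once the table is registered in the Glue catalog, Athena, and any Spark or Hive job configured to use the same Glue catalog, can read it without copying data.

```sql
-- Hedged sketch: register click logs that already live on S3 in the Glue
-- catalog, then query them from Athena. All names are illustrative.
CREATE DATABASE IF NOT EXISTS raw_logs;

CREATE EXTERNAL TABLE raw_logs.click_events (
  user_id   STRING,
  url       STRING,
  event_ts  TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://example-datalake/click_events/';

-- Discover the partition directories that already exist under the S3 prefix.
MSCK REPAIR TABLE raw_logs.click_events;

-- Interactive analysis from Athena; a Spark or EMR Hive job pointed at the
-- same Glue catalog can run essentially the same query on the same files.
SELECT dt, COUNT(*) AS clicks
FROM raw_logs.click_events
WHERE dt >= '2020-01-01'
GROUP BY dt;
```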

The figure below shows the architecture of the Alibaba Cloud EMR data lake. It is a big data platform based on the open-source ecosystem and supports both an HDFS open-source data lake and an OSS data lake on the cloud.

Figure 4. Alibaba Cloud EMR data lake architecture

When enterprises use data lake technology to build a big data platform, the work mainly covers data ingestion, data storage, computation and analysis, data management, and access control. The figure below shows a reference architecture defined by Gartner. Because of the flexibility and openness of current data lake technology, it is not yet very mature in performance, security control, and data governance, and it still faces great challenges when it comes to enterprise-grade production requirements (elaborated in Chapter 4).

03 The birth of the data warehouse and its relationship with the data middle platform

The concept of the data warehouse originated in the database field, mainly to handle complex, data-oriented query and analysis scenarios. As big data technology developed, it borrowed heavily from database technology, such as the SQL language and query optimizers, to form the big data warehouse, which became mainstream thanks to its powerful analytical capabilities.

In recent years, the combination of the data warehouse and cloud-native technology has evolved into the cloud data warehouse, which solves the resource-provisioning problem of deploying a data warehouse. As a higher-level (enterprise-grade) big data platform capability, the cloud data warehouse has attracted more and more attention for being ready to use out of the box, scaling without limit, and being easy to operate.

Wikipedia's definition of a data warehouse:

In computing, a data warehouse (also known as an enterprise data warehouse) is a system for reporting and data analysis that is considered a core component of business intelligence. A data warehouse is a central repository of integrated data from one or more disparate sources. A data warehouse stores current and historical data together to create analytical reports for employees across the enterprise.

A more academic account is that the data warehouse was proposed in 1990 by W. H. Inmon, the father of the data warehouse. It is a data storage structure in which data is systematically analyzed and organized so as to support analysis methods such as online analytical processing (OLAP) and data mining, and further supports the building of decision support systems (DSS) and executive information systems (EIS). It helps decision makers quickly and effectively extract valuable information from large volumes of data, make decisions, respond rapidly to changes in the external environment, and build business intelligence (BI).

The essence of a data warehouse consists of the following three parts:

1. A built-in storage system that exposes data through abstractions (for example, tables or views) and does not expose the file system.

2. Data needs to be cleaned and transformed, usually by ETL/ELT

3. Emphasis on modeling and data management for business intelligence decisions

Judging by these criteria, both traditional data warehouses (such as Teradata) and emerging cloud data warehouses (AWS Redshift, Google BigQuery, Alibaba Cloud MaxCompute) embody this design essence: none of them exposes a file system to the outside world; instead they provide service interfaces for moving data in and out.

For example, Teradata provides a CLI data import tool, Redshift provides the COPY command to import data from S3 or EMR, BigQuery provides the Data Transfer service, and MaxCompute provides the Tunnel service as well as the MMA migration tool for uploading and downloading data (a hedged sketch of such an import interface appears after the list below). This design brings several advantages:

1. The engine understands the data deeply, so storage and compute can be deeply optimized

2. Full data lifecycle management and a well-developed lineage system

3. Fine-grained data management and governance

4. Well-developed metadata management capabilities, making it easy to build an enterprise-level data middle platform
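Returning to the import interfaces mentioned above, here is a hedged sketch of loading data into a Redshift table with the COPY command; the schema, table, S3 path, and IAM role are all illustrative placeholders, and the other warehouses expose analogous but product-specific interfaces (BigQuery Data Transfer, MaxCompute Tunnel).

```sql
-- Hedged sketch: data enters the warehouse only through a managed interface,
-- never by writing files into the warehouse's storage directly.
-- Table name, S3 path, and IAM role below are illustrative placeholders.
COPY analytics.orders
FROM 's3://example-staging/orders/2020-07-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
FORMAT AS PARQUET;

-- Once loaded, the engine owns layout, statistics, and lifecycle,
-- so queries run against warehouse-managed storage:
SELECT order_date, SUM(amount) AS revenue
FROM analytics.orders
GROUP BY order_date;
```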

Precisely for these reasons, when Alibaba's Feitian big data platform was first built, the data warehouse architecture was chosen, in the form of the MaxCompute big data platform. MaxCompute (formerly ODPS) is not only the big data platform of the Alibaba economy, but also a secure, reliable, efficient, and low-cost online big data computing service on Alibaba Cloud that scales on demand from gigabytes to exabytes (Figure 6 shows the MaxCompute product architecture; see the Alibaba Cloud MaxCompute official website for details).

As an enterprise-grade cloud data warehouse delivered as SaaS, MaxCompute is widely used within the Alibaba economy and by thousands of Alibaba Cloud customers across the Internet, new finance, new retail, digital government, and other sectors.

Figure 6. MaxCompute cloud data warehouse product architecture

Thanks to the MaxCompute data warehouse architecture, Alibaba gradually built management capabilities on top of it such as data security, data quality, data governance, and data labeling, eventually forming Alibaba's big data middle platform. It is fair to say that Alibaba, the earliest proponent of the data middle platform concept, owes its data middle platform to the data warehouse architecture.

04 Data Lake VS Data Warehouse

To sum up, the data warehouse and the data lake represent two design orientations for big data architecture. The fundamental difference between them lies in how access to the storage system is controlled, how permissions are managed, and what modeling is required.

A lake-first design gives data entering the lake maximum flexibility by opening up the underlying file storage. Data entering a data lake can be structured, semi-structured, or even completely unstructured raw logs. Open storage also gives the upper-layer engines more flexibility: each engine can freely read and write the data stored in the lake according to its own scenario, and only needs to follow a fairly loose compatibility convention (such loose conventions carry hidden risks, discussed later).

At the same time, direct access to the file system makes many higher-level capabilities hard to implement: for example, fine-grained permission management (at a granularity smaller than a file), unified file management, and upgrades to read/write interfaces are all very difficult (an upgrade counts as done only after every engine has been upgraded).

A warehouse-first design focuses more on enterprise-grade requirements for growth, such as data usage efficiency, large-scale data management, and security and compliance. Data enters the warehouse through a unified but open service interface, usually with a predefined schema, and users access files in the distributed storage system through the data service interface or the computing engine.

The warehouse-first design abstracts away the data access interface, permission management, and the data itself in exchange for higher performance (of both storage and compute), a closed-loop security system, and data governance capabilities; we call these qualities growth.

Flexibility and growth matter differently to enterprises at different stages.

1. When an enterprise is in its start-up stage, the path from data generation to data consumption still requires a period of innovation and exploration before it settles. For the big data system supporting this kind of business, flexibility matters more, so the data lake architecture is a better fit.

2. When the enterprise matures and has settled on a set of data processing workflows, the problems shift to ever-growing data volumes, ever-rising processing costs, and ever more people and departments involved in the data workflows. For the big data system supporting this kind of business, the quality of growth determines how far the business can go, so the data warehouse architecture is a better fit.

We have observed that a good number of enterprises (especially in the emerging Internet industry) built their big data stack from scratch on the popular open-source Hadoop ecosystem and went through exactly this process from exploration and innovation to mature modeling. In that process, because the data lake architecture is too flexible and lacks supervision, control, and the necessary governance of data, operations costs keep rising and data governance efficiency keeps falling, and the enterprise slides into a "data swamp": too much data accumulates in the lake, yet it becomes hard to extract the truly valuable part efficiently.

In the end, only by migrating to a big data platform with a warehouse-first design could they solve the operations, cost, and data governance problems that arise once the business grows past a certain scale. Take Alibaba as an example: Alibaba's successful data middle platform strategy took shape gradually as the Alibaba Group completed the full replacement of multiple Hadoop clusters (data lakes) with MaxCompute (a data warehouse) around 2015, in the "Moon Landing" project.

 

05 The next-generation evolution direction: lake-warehouse integration

Having elaborated on and compared data lakes and data warehouses in depth, this article holds that, as two different evolution routes of big data systems, each has its own unique advantages and limitations.

The data lake favors flexibility in the start-up phase, while the data warehouse favors growth as the business matures. For an enterprise, must this be an either/or choice? Is there a solution that offers both the flexibility of a data lake and the growth of a cloud data warehouse, combining the two effectively to lower the user's total cost of ownership?

Integrating the data warehouse and the data lake has also been an industry trend in recent years, and several products and projects have made attempts in this direction:

1. Data warehouse supports data lake access

  • In 2017, Redshift launched Redshift Spectrum, which lets Redshift data warehouse users access data in the S3 data lake.
  • In 2018, Alibaba Cloud MaxCompute launched its external table capability, supporting access to a variety of external stores, including OSS, OTS, and RDS databases.

However, with both Redshift Spectrum and MaxCompute external tables, users still need to create an external table in the data warehouse to bring the data lake's open storage paths into the warehouse's conceptual model. Because plain open storage cannot describe changes to its own data, creating external tables and adding partitions for this data (essentially creating a schema for the data in the lake) cannot be fully automated: it requires manual work or periodic triggering of ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE. This is acceptable for low-frequency, ad-hoc queries, but it is rather cumbersome for production use.
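For example, assuming an external table dw.ext_click_events has already been declared over a lake path (Hive-style DDL much like the sketch in Chapter 2; Redshift Spectrum and MaxCompute differ in the exact syntax), every new directory that lands in open storage still has to be registered before the warehouse can see it. The table, paths, and partition values below are illustrative placeholders.

```sql
-- Hedged sketch; table and paths are illustrative placeholders.
-- Open storage cannot announce its own changes, so each newly arrived
-- partition directory must be registered by hand or by a scheduled job ...
ALTER TABLE dw.ext_click_events
  ADD IF NOT EXISTS PARTITION (dt='2020-07-01')
  LOCATION 'oss://example-datalake/click_events/dt=2020-07-01/';

-- ... or the whole location must be rescanned periodically:
MSCK REPAIR TABLE dw.ext_click_events;
```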

2. The data lake supports data warehouse capabilities 

  • In 2011, Hortonworks, a company behind the open-source Hadoop ecosystem, started two open-source projects, Apache Atlas and Ranger, corresponding to two core data warehouse capabilities: data lineage tracking and data permission security. Neither project developed smoothly; they did not graduate from incubation until 2017, and even today their adoption in the community and the industry is far from active. The root cause is the data lake's inherent flexibility. Ranger, for example, as the component for unified, secure management of data permissions, by nature requires every engine to integrate with it so that there are no security holes. But engines on a data lake prize flexibility, and new engines in particular prioritize implementing features and scenarios rather than integrating with Ranger, which leaves Ranger in an awkward position on the data lake.
  • In 2018, Netflix open-sourced Iceberg, an internally enhanced metadata service system that provides data warehouse capabilities including MVCC (multi-version concurrency control). But because the open-source HMS (Hive Metastore) had become the de facto standard, the open-source version of Iceberg works as a plug-in compatible with HMS, which greatly weakens its data warehouse management capabilities.
  • In 2018 and 2019, Uber and Databricks launched Apache Hudi and Delta Lake respectively, introducing incremental file formats to support data warehouse features such as update/insert and transactions. The new features changed the file format and layout, breaking the original, simple convention for shared storage among the lake's multiple engines. To preserve compatibility, Hudi had to invent two table types, Copy-On-Write and Merge-On-Read, plus three query types, Snapshot Query, Incremental Query, and Read Optimized Query, and publish a support matrix (see Figure 10), which considerably increases the complexity of use.

Delta Lake chose to guarantee the experience on Spark, its primary engine, at the relative expense of compatibility with other mainstream engines. This imposes many restrictions and inconveniences on other engines that want to access Delta data in the lake. For example, for Presto to use a Delta Lake table, it must first use Spark to generate a manifest file, then create an external table based on that manifest, and also keep the manifest up to date; Hive faces even more restrictions when using Delta Lake tables, which not only causes confusion at the metadata level but also means Hive cannot write to the table.
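A hedged sketch of that manifest workflow follows; the paths, table names, and columns are illustrative placeholders, and the exact steps vary by Delta Lake and Presto version. The first statement runs in Spark SQL; the second is Hive-style DDL for the engine that will read the table through the manifest.

```sql
-- Hedged sketch; paths and names are illustrative placeholders.

-- 1) In Spark SQL, ask Delta Lake to (re)generate the symlink manifest.
--    This must be rerun (or configured to auto-update) whenever the table changes.
GENERATE symlink_format_manifest FOR TABLE delta.`oss://example-lake/events_delta`;

-- 2) In Presto/Hive, expose the table through an external table that reads
--    the manifest rather than the Delta transaction log.
CREATE EXTERNAL TABLE dw.events_delta_ext (
  user_id  STRING,
  url      STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://example-lake/events_delta/_symlink_format_manifest/';
```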

These attempts to build data warehouse capabilities on top of the data lake architecture have not fully succeeded, which shows that the data warehouse and the data lake differ at a fundamental level and that it is hard to build a complete data warehouse on a data lake system. Since directly merging the data lake and the data warehouse into one system is so difficult, the author's team began exploring the idea of integrating the two instead.

We therefore propose the evolution direction for the next generation of big data technology: lake-warehouse integration, that is, connecting the data warehouse and the data lake so that data and compute can flow freely between lake and warehouse, thereby forming a complete, organic big data technology ecosystem.

We believe that building lake-warehouse integration requires solving three key problems:

1. The data and metadata of the lake and the warehouse are connected seamlessly, without manual intervention by the user

2. The lake and the warehouse provide a unified development experience, so data stored in either system can be operated on through a single development and management platform

3. The system automatically caches and moves data between the data lake and the data warehouse, deciding by automatic rules which data goes into the warehouse and which stays in the lake, thereby forming one integrated whole

In the next chapter, we describe in detail how the Alibaba Cloud lake-warehouse integration solution solves these three problems.

06 Alibaba Cloud's Lake-Warehouse Integration Solution

6.1 Overall Architecture

Building on its original data warehouse architecture, Alibaba Cloud MaxCompute integrates with open-source data lakes and cloud data lakes to realize the overall lake-warehouse integration architecture (Figure 11).

In this architecture, although multiple underlying storage systems coexist, a unified storage access layer and unified metadata management present an integrated interface to the upper-layer engines, and users can query tables in the data warehouse and the data lake together. The overall architecture also provides unified middle platform capabilities for data security, management, and governance.

To address the three key problems of lake-warehouse integration raised in Chapter 5, MaxCompute implements the following four key technical points.

1. Quick access

  • MaxCompute's newly developed PrivateAccess network connection technology, while complying with the cloud's virtual network security standards, connects specific user jobs, in multi-tenant mode, to IDC/ECS/EMR Hadoop cluster networks, with low latency and high dedicated bandwidth.
  • After quick and simple activation and security-configuration steps, the data lake can be connected to the MaxCompute data warehouse the user has purchased.

2. Unified data/metadata management

  • MaxCompute implements integrated metadata management across lake and warehouse, using one-click database metadata mapping to connect the data lake's metadata seamlessly with the MaxCompute warehouse. By letting users create external projects, MaxCompute maps an entire database in the data lake's Hive Metastore directly to a MaxCompute project: changes to the Hive database are reflected in this project in real time, and the data in it can be accessed and computed on from the MaxCompute side at any time through that project. Alibaba Cloud's EMR data lake solution will also launch Data Lake Formation, and the MaxCompute lake-warehouse integration solution will support one-click mapping of that unified data lake metadata service as well. Operations performed on the external project from the MaxCompute side are likewise reflected on the Hive side in real time, achieving genuinely seamless linkage between warehouse and lake and removing the manual metadata steps required by federated-query style solutions.
  • MaxCompute implements a storage access layer spanning lake and warehouse. It supports not only its built-in, optimized storage but also external storage systems, covering both HDFS data lakes and OSS cloud-storage data lakes, and can read and write a variety of open-source file formats.

3. Unified development experience

  • A Hive database in the data lake is mapped to a MaxCompute external project, which behaves no differently from an ordinary project and enjoys the same data development, lineage tracking, and management functions as the MaxCompute warehouse. Built on the powerful data development, management, and governance capabilities of DataWorks, this provides a unified lake-warehouse development experience and reduces the cost of managing two systems.
  • MaxCompute is highly compatible with Hive and Spark, and a single set of jobs can run flexibly and seamlessly across both the lake and the warehouse.
  • MaxCompute also provides an efficient data channel interface that lets Hadoop-ecosystem engines in the data lake access it directly, improving the openness of the data warehouse.

4. Automatic data warehousing

  • Lake-warehouse integration requires users to layer and place data sensibly between the lake and the warehouse according to how their data assets are used, in order to get the best of both. MaxCompute has developed intelligent caching technology that identifies hot and cold data by analyzing historical tasks, and automatically uses idle bandwidth to cache hot data from the data lake into the data warehouse in an efficient file format, accelerating the warehouse's subsequent processing of that data. This not only resolves the bandwidth bottleneck between lake and warehouse, but also achieves tiered data management/governance and performance acceleration without any user involvement.

6.2 Building a lake-warehouse integrated data middle platform

Based on MaxCompute's lake-warehouse integration technology, DataWorks can further encapsulate the two systems, shield users from the heterogeneous cluster details of lake and warehouse, and build an integrated big data middle platform, so that a single set of data and a single set of tasks can be scheduled and managed seamlessly across lake and warehouse.

Enterprises can use these lake-warehouse middle platform capabilities to optimize their data management architecture and fully combine the respective strengths of the data lake and the data warehouse, using the data lake as centralized raw data storage to exploit its flexibility and openness.

Meanwhile, production-oriented, high-frequency data and tasks are seamlessly scheduled into the data warehouse through lake-warehouse integration technology to obtain better performance and cost, followed by a series of production-oriented data governance and optimization steps, ultimately letting enterprises find the best balance between cost and efficiency.

Overall, MaxCompute offers enterprises a more flexible, efficient, and economical data platform solution. It suits not only enterprises building a new big data platform, but also enterprises upgrading the architecture of an existing one, protecting existing investments and allowing assets to be reused.

6.3 Typical customer case: Sina Weibo builds a hybrid-cloud AI computing middle platform with lake-warehouse integration

  • Case background

The Weibo machine learning platform team works mainly on recommendation/ranking, text and image classification, anti-spam, anti-cheating, and related technologies in the social media domain.

Its technical architecture revolves around the open-source Hadoop data lake solution: one HDFS store plus multiple computing engines (Hive, Spark, Flink) to meet the needs of its AI-oriented, multi-scenario computing. However, Weibo, as China's leading social media application, has entered an open-source "no man's land" given its current business volume and complexity: the open-source data lake solution can no longer meet its requirements for performance and cost.

With the help of Alibaba's Feitian big data and AI platform capabilities (MaxCompute + PAI + DataWorks), Weibo solved the performance bottlenecks of feature engineering, model training, and matrix computation at very large scale, forming a pattern in which the Alibaba MaxCompute platform (data warehouse) and the open-source platform (data lake) coexist.

  • Core pain points

Weibo hoped that these two heterogeneous big data platforms together would preserve the flexibility of its AI-oriented data and computation while also solving the performance and cost problems of computation and algorithms at very large scale. But because the two platforms are completely separate at the cluster level, data and computation cannot flow freely between them, which silently adds large costs for data movement and duplicated computation development, and in turn constrains business development.

The main pain points were: 1) a dedicated person had to be assigned to synchronize training data, a huge workload; 2) moving the large volume of training data took so long that real-time training requirements could not be met; 3) SQL data-processing queries written on the new platform could not reuse the original Hive SQL queries.

  • Solution

To address these pain points, the Alibaba Cloud product team and the Weibo machine learning platform team jointly applied the new lake-warehouse integration technology, connected the Alibaba MaxCompute cloud data warehouse with the EMR Hadoop data lake, and built an AI computing middle platform spanning lake and warehouse.

MaxCompute fully upgraded its network infrastructure to reach into the user's VPC private domain and, relying on one-click mapping of Hive databases and its powerful, complete SQL and PAI engine capabilities, seamlessly connected the MaxCompute cloud data warehouse with the EMR Hadoop data lake technology stack, achieving unified, intelligent management and scheduling across lake and warehouse.

  • Case value
  • It combines the advantages of the data lake and the data warehouse to find the best balance between flexibility and efficiency, and quickly delivered a unified AI computing platform that greatly improved the business support capability of the machine learning platform team. A single set of jobs can be scheduled seamlessly and flexibly across the MaxCompute cluster and the EMR cluster, without migrating data or jobs.
  • SQL data processing tasks now run widely on the MaxCompute cluster with markedly better performance. Building on the rich and powerful algorithm capabilities of Alibaba PAI, the team packaged a variety of algorithm services close to its business scenarios to meet more business needs.
  • MaxCompute's cloud-native elastic resources and the EMR cluster's resources complement each other, shaving peaks and filling valleys across the two systems, which reduces both job queueing and overall cost.

07 Summary

The data lake and the data warehouse are two design orientations for building distributed data architectures under today's big data technology; the difference lies in whether the balance tilts toward flexibility or toward enterprise-grade qualities such as cost, performance, security, and governance.

However, the boundary between data lakes and data warehouses is gradually blurring: the governance capabilities of data lakes and the ability of data warehouses to reach external storage are both being strengthened. Against this background, MaxCompute took the lead in proposing lake-warehouse integration, presenting the industry and users with an architecture in which the data lake and the data warehouse complement each other and work together.

Such an architecture gives users both the flexibility of the data lake and the many enterprise-grade features of the data warehouse, and further lowers the total cost of owning big data. We believe this is the evolution direction of the next-generation big data platform.
