A detailed explanation of the data warehouse, data lake, data middle platform, and lakehouse

This article's contents:

1. Introduction
2. Concept Analysis

  1. Data warehouse
  2. Data lake
  3. Data middle platform

3. The Specific Differences

  1. Data warehouse vs. data lake
  2. Data warehouse vs. data middle platform
  3. Summary

4. Lakehouse Integration

  1. Current data storage solutions
  2. Data Lakehouse

1. Introduction

The wave of digital transformation has stirred up all kinds of old and new concepts. Data lakes, data warehouses, and data middle platforms take turns flooding everyone's feeds. Some say "the data middle platform is nothing, the data lake is the trend"; others say "goodbye data lake, the data warehouse and data middle platform are here to stay"...


Before an enterprise has even opened the door to digitalization, it is first tripped up by all these concepts. So what is the difference between the three? Don't worry: let me first share two helpful analogies.

1. Library vs. street stall

If a data warehouse is a "library", then a data lake is a "street stall". When you borrow a book (data) from the library, its quality is guaranteed, but you have to wait. Waiting for what? You only get the book you want once the librarian has worked out which category it belongs to and which shelf it sits on. At the street stall, nobody checks anything for you; there are all kinds of books, and you rummage through them yourself. That is far more convenient than the library, but the process of finding a book leaves no record and cannot be reused, and occasionally you may walk away with more or fewer books than you meant to without noticing.

2. An upgraded bank

Suppose the data warehouse, data lake, and data middle platform are all banks that offer services such as cash and gold. In the past, before entering the bank, everyone had to ask the doorman which service the number on each door corresponded to. Cash or gold? Only then could you push open the right door and collect what you needed. With the "data middle platform" bank, the moment you walk in you see windows clearly labeled "cash" and "gold".

These two analogies are not perfectly rigorous, but they capture the basic trade-offs among the three. The data warehouse is standardized, but the path from data to use is long; the data lake is more real-time and holds more, but data quality is hard to guarantee; the data middle platform can respond to business needs accurately and quickly, and sits closest to the business side.

To distinguish the three more clearly, let's look at their respective definitions and how their applications differ.

2. Concept Analysis

1. Data warehouse

The data warehouse was born in 1990 and is definitely the "old-timer" of the three. It is a relatively concrete, functional concept. The mainstream definition today is a large-capacity repository that sits on top of multiple databases. Its job is to store large volumes of structured data and support frequent, repeatable analysis, helping enterprises build business intelligence (BI).

The formal definition:

A data warehouse (Data Warehouse) is a subject-oriented (Subject Oriented), integrated (Integrated), non-volatile (Non-Volatile), time-variant (Time Variant) collection of data used to support management decision-making and enterprise-wide information sharing. Its main function is to take the large volumes of data accumulated over the years by the information systems' online transaction processing (OLTP) and, through the storage structures specific to data warehouse theory, analyze them to extract valuable information.

  • Subject-oriented : refers to the key aspects users care about when making decisions with the warehouse, such as revenue, customers, or sales channels. Being subject-oriented means the information in the warehouse is organized by subject, rather than by business function as in operational support systems.

  • Integrated : means the information in the warehouse is not simply extracted from the various business systems; it goes through a series of processing, cleansing, and aggregation steps, so the information in the warehouse is consistent, enterprise-wide information.

  • Time-variant : means the information in the warehouse does not only reflect the enterprise's current state; it records information from some point in the past up to the present. With this information, one can quantitatively analyze and forecast the enterprise's development history and future trends.
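To make the "integrated" property concrete, here is a minimal sketch (table, column, and source names are all hypothetical) using Python's built-in sqlite3: two business systems encode the same attribute differently, and the warehouse load step normalizes both into one consistent, subject-oriented table.

```python
import sqlite3

# Hypothetical example: two source systems encode gender differently;
# the warehouse load step normalizes both into one consistent table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_customer (customer_id TEXT, gender TEXT, source TEXT)")

crm_rows = [("c1", "M"), ("c2", "F")]          # CRM encodes gender as M/F
erp_rows = [("c3", "male"), ("c4", "female")]  # ERP spells it out

def normalize(gender):
    """Map every source encoding onto the warehouse's single convention."""
    return {"M": "male", "F": "female"}.get(gender, gender)

for cid, g in crm_rows:
    conn.execute("INSERT INTO dw_customer VALUES (?, ?, ?)", (cid, normalize(g), "crm"))
for cid, g in erp_rows:
    conn.execute("INSERT INTO dw_customer VALUES (?, ?, ?)", (cid, normalize(g), "erp"))

# After integration, the attribute is consistent across all sources.
genders = {g for (g,) in conn.execute("SELECT DISTINCT gender FROM dw_customer")}
print(sorted(genders))  # ['female', 'male']
```

The point is the cleansing step between extraction and storage: the warehouse holds one global convention, not each system's local one.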

The role of the data warehouse:

A data warehouse system integrates data across business lines and systems, providing unified data support for management analysis and business decision-making. Fundamentally, a data warehouse helps you transform the company's operational data into high-value, accessible information (or knowledge), delivering the right information to the right people, at the right time, in the right way.

  • It is a tool for data integration, analysis, and presentation, serving business analysis and performance appraisal for middle and senior management;

  • It is mainly used for historical, comprehensive, and in-depth data analysis;

  • Its data sources are ERP systems (e.g., SAP) or other business systems;

  • It can provide flexible, intuitive, concise, and easy-to-use multi-dimensional query and analysis;

  • It is not a day-to-day transactional system and does not itself generate transaction data;

Real-time data warehouse

A real-time data warehouse is very similar to an offline one. It arose mainly because, in recent years, enterprises have increasingly demanded real-time data services. Its internal data model is also layered, like the middle platform's: ODS, CDM, ADS. But because the real-time requirements are extremely high, storage generally uses a log-based MQ such as Kafka, and computation uses a stream-processing engine such as Flink.
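As a toy illustration of that layering (pure in-memory Python standing in for Kafka topics and Flink jobs; the event fields are made up), each layer consumes the previous one's output:

```python
import json
from collections import Counter

# Toy stand-in for a real-time pipeline: in production the events would
# flow through Kafka topics and each function would be a Flink job.
raw_events = [
    '{"user": "u1", "action": "click"}',
    '{"user": "u2", "action": "buy"}',
    '{"user": "u1", "action": "buy"}',
]

def ods(events):
    """ODS layer: raw events, parsed but otherwise untouched."""
    return [json.loads(e) for e in events]

def cdm(events):
    """CDM layer: cleaned/derived data, here keeping only purchases."""
    return [e for e in events if e["action"] == "buy"]

def ads(events):
    """ADS layer: application-facing aggregate, purchases per user."""
    return Counter(e["user"] for e in events)

buys_per_user = ads(cdm(ods(raw_events)))
print(sorted(buys_per_user.items()))  # [('u1', 1), ('u2', 1)]
```

The offline and real-time variants share this layered shape; what differs is that each arrow between layers becomes a continuously running stream job rather than a nightly batch.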

2. Data Lake

A data lake is a continuously evolving, scalable infrastructure for big data storage, processing, and analysis. It is like a large warehouse storing an enterprise's diverse raw data. It is data-oriented, realizing full acquisition, full storage, multi-mode processing, and full life-cycle management of data from any source, at any speed, at any scale, and of any type. It has strong information-processing capability and can handle a virtually unlimited number of concurrent tasks or jobs.

The data lake ingests raw data from the enterprise's many data sources. That data may be any type of information, from fully structured to fully unstructured, and through interaction and integration with various external heterogeneous data sources, the lake supports applications at every level of the enterprise. Combined with advanced data science and machine learning, it can help enterprises build better-optimized operating models and provide further capabilities such as predictive analytics and recommendation models, fueling the subsequent growth of enterprise capabilities.

Entering the Internet age brought two crucial changes.

First, data reached an unprecedented scale. A successful Internet product can exceed 100 million daily active users; products you know well, such as Toutiao, Douyin, Kuaishou, and NetEase Cloud Music, generate hundreds of billions of user-behavior events every day. Traditional data warehouses are hard to scale and simply cannot carry such massive volumes.

Second, data types became heterogeneous. In the Internet era, besides structured data from business databases, there is front-end tracking data from apps and the web, and back-end logs from business servers, and this data is generally semi-structured or even unstructured. A traditional data warehouse imposes strict requirements on the data model: before data is imported, the model must be defined in advance, and the data must be stored according to that design.

Therefore, limited by both data scale and data type, traditional data warehouses could not support business intelligence in the Internet era.

In 2005, Hadoop was born. Compared with the traditional data warehouse, Hadoop has two main advantages:

  • It is fully distributed and easy to scale, so a cluster of low-cost machines can provide the computing and storage power needed to process massive data;

  • The data format is relaxed. Once data lands in Hadoop, it need not conform to any predefined format; the data model is decoupled from the data storage. When the data (including the raw data) is used, it can be read under different models, meeting the need for flexible analysis over heterogeneous data. The data warehouse, by contrast, focuses on data that can serve as a factual basis.
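That decoupling of model from storage is the schema-on-read idea. A minimal sketch (the record fields are invented for illustration): the same raw records are read under two different models, with no schema enforced at write time.

```python
import json

# Schema-on-read sketch: the store keeps raw records with no enforced model;
# each consumer projects its own schema at read time. Fields are invented.
lake = [
    '{"id": 1, "name": "alice", "clicks": 3, "country": "DE"}',
    '{"id": 2, "name": "bob", "clicks": 7, "country": "FR"}',
]

def read_with_schema(raw_records, fields):
    """Apply a consumer-specific model when reading, not when writing."""
    return [{f: json.loads(r).get(f) for f in fields} for r in raw_records]

# Two different models over the same stored bytes:
marketing_view = read_with_schema(lake, ["id", "country"])
product_view = read_with_schema(lake, ["id", "clicks"])
print(marketing_view[0])  # {'id': 1, 'country': 'DE'}
print(product_view[1])    # {'id': 2, 'clicks': 7}
```

A warehouse would instead validate every record against one fixed schema before it is ever stored (schema-on-write).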

With the maturing of Hadoop and object storage, the concept of the data lake was proposed around 2010: a data lake (Data Lake) is a repository or system that stores data in its raw format (meaning the data lake's underlying layer should not be coupled to any particular storage).

Correspondingly, a data lake that is not well governed (lacking metadata, defined data sources, data access and security policies, and processes for moving and cataloguing data) turns into a data swamp.

In product form, a data warehouse is usually a standalone, standardized product, while a data lake is more of an architectural guideline: it needs a set of surrounding tools to realize the data lake the business requires.

3. Data middle platform

The large-scale application of data gradually exposed some problems.

In the early stages of business development, chimney-style development, done to meet business requirements quickly, fragmented data across the enterprise's business lines, and even across different applications within one business line. The same indicator in two data applications can show inconsistent results, eroding the operations team's trust in the data. If you work in operations and, wanting to see how a product is selling, you find two different values for an indicator called "sales" on two reports, how do you feel? Your first reaction is surely that the data is wrong, and you no longer dare to use it.

Another problem with data fragmentation is massive duplicated computation and development, which wastes engineering effort as well as compute and storage resources, so the cost of applying big data keeps rising.

  • If you are in operations and ask for a piece of data, and development tells you it will take at least a week, you surely think it is too slow. Can't it be faster?

  • If you are a data developer facing a flood of requests, you surely complain that there are too many demands and too few people, and the work cannot all be done.

  • If you own the business and watch the monthly bill grow exponentially, you surely think it is too expensive. Can't we save a bit, or it will become unaffordable?

At the root of these problems is that data cannot be shared. In 2016, Alibaba was the first to raise the banner of the "data middle platform". Its core idea is to avoid repeated computation over the same data and, through data services, improve data sharing and empower data applications. Before, intermediate data was hard to share and could not accumulate; after building a data middle platform, the pace of data-application development is no longer limited by the pace of data development. Many data applications can be incubated quickly for each scenario, and those applications are what make the data valuable.

Data middle platform template

In building a middle platform, the following points are generally emphasized:

  • Efficiency, quality, and cost determine whether data can support the business well. The goal of building a data middle platform is high efficiency, high quality, and low cost.

  • Processing each piece of data only once is the core of building a data middle platform; in essence, it means sinking common computation logic into a shared layer and reusing it.

  • If your enterprise has more than three data application scenarios, and data products are still being developed and updated, you should seriously consider building a data middle platform.

Now let's look at Alibaba's data middle platform practice.

As mentioned above, processing data only once is the core of building a data middle platform, which essentially means the sinking and reuse of common computation logic. Alibaba's data middle platform proposed several "One" ideas, such as:

  • OneData: only one copy of each piece of public data is stored
  • OneService: data is exposed through a single service interface
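A rough sketch of how these two ideas fit together (function and metric names are illustrative, not Alibaba's actual APIs): the metric is derived once into a single shared store (OneData), and every consumer reads it through one service interface (OneService) instead of recomputing it.

```python
# OneData/OneService sketch (names are illustrative, not Alibaba's APIs):
# the metric is derived once into a single shared store, and all consumers
# read it through one service interface instead of recomputing it.
orders = [("2024-01-01", 100), ("2024-01-01", 50), ("2024-01-02", 70)]

_metric_store = {}  # OneData: the single shared copy of the derived metric

def build_daily_sales(rows):
    """Run the shared computation exactly once."""
    for day, amount in rows:
        _metric_store[day] = _metric_store.get(day, 0) + amount

def metric_api(name, day):
    """OneService: every application fetches metrics through this interface."""
    if name == "daily_sales":
        return _metric_store[day]
    raise KeyError(name)

build_daily_sales(orders)
print(metric_api("daily_sales", "2024-01-01"))  # 150
```

Because every report reads the same stored metric through the same interface, the "two reports, two sales numbers" problem described above cannot arise.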

3. The Specific Differences

1. Data warehouse vs. data lake

By comparison, the data lake is a relatively new technology with an evolving architecture. Data lakes store raw data in any form (structured or unstructured) and any format (text, audio, video, images). By definition, a data lake is not subject to data governance, but experts agree that good data management is essential to keep it from turning into a data swamp. Data lakes apply schema at read time. Compared with data warehouses, data lakes are less structured, more flexible, and more agile. Notably, data lakes are well suited to machine learning and deep learning workloads, such as data mining and data analysis, and to extracting value from unstructured data.

2. Data warehouse vs. data middle platform

The starting point of the data warehouse, like the traditional data platform, is a supporting technical system: first consider what data I have, then what I can do with it, so data quality and metadata management are strongly emphasized. The starting point of the data middle platform is not the data but the business: instead of beginning from what data the systems hold, it begins from what data services are needed to solve the business's problems.

In concrete technical terms the two also differ significantly: data preprocessing is shifting from the traditional ETL structure to an ELT structure. The traditional data warehouse integration architecture is ETL, a key step in building a warehouse: users extract the needed data from the sources, clean it, and load it into the warehouse. The architecture of the big data era is ELT: the desired raw data is extracted from the data middle platform at any time and modeled and analyzed on demand, according to the upper-layer application's requirements.
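The ETL-versus-ELT contrast fits in a few lines (rows and the helper name are invented for illustration): the same cleaning logic is applied at different points, before load in ETL and at read time in ELT.

```python
# ETL vs. ELT in miniature (rows and the transform are invented for
# illustration): same cleaning logic, applied at different points.
source = ["  Alice , 10 ", " Bob , 20 "]

def transform(row):
    """Clean one raw row into a modeled (name, value) tuple."""
    name, value = row.split(",")
    return (name.strip().lower(), int(value))

# ETL: transform first, then load -- the warehouse only sees modeled rows.
warehouse = [transform(r) for r in source]

# ELT: load the raw rows untouched; each analysis transforms at read time.
lake = list(source)
on_demand = [transform(r) for r in lake if "Alice" in r]

print(warehouse)  # [('alice', 10), ('bob', 20)]
print(on_demand)  # [('alice', 10)]
```

ETL pays the modeling cost once, up front, for every row; ELT keeps the raw rows and pays it per query, which is exactly what makes it flexible for heterogeneous data.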

3. Summary

Based on the concepts and comparisons of the data warehouse, data lake, and data middle platform above, we summarize as follows:

  • There is no direct lineage among the data middle platform, data warehouse, and data lake;

  • The three emphasize different dimensions of how data generates value for the business;

  • The data middle platform is an enterprise-level logical concept reflecting the ability to turn enterprise data into business value; it serves the business primarily through data APIs;

  • The data warehouse is a relatively concrete functional concept: a managed collection of one or more subject areas of data; it serves the business primarily through analytical reports;

  • The data middle platform sits closer to the business and can respond more quickly to business and application development needs, so it serves the business faster;

  • The data warehouse exists to support management decision-making and analysis, while the data middle platform serves data to business systems and is not limited to analytical scenarios; it also applies to transactional ones;

  • The data middle platform can be built on top of a data warehouse and a data platform; it is the middle layer that accelerates the enterprise's journey from data to business value.

4. Lakehouse Integration

Some say that "the lakehouse will be the next beacon, and the separate data warehouse and data lake architectures will soon quit the group chat."

In 2020, the big data company Databricks first proposed the concept of the Data Lakehouse, hoping to unify data lake and data warehouse technology. As soon as the concept appeared, cloud vendors followed suit.

The Data Lakehouse is a new data architecture that absorbs the advantages of both the data warehouse and the data lake. Data analysts and data scientists can work on the data in the same store, and it also brings more convenience to the company's data governance.

1. Current data storage solutions

Historically, we have used two kinds of data stores to structure our data:

  • Data warehouse : mainly stores structured data organized in relational form. The data is transformed, consolidated, and cleaned, then imported into target tables. In the warehouse, the structure of the stored data strongly matches its predefined schema.

  • Data lake : stores data of any type, including unstructured data such as images and documents. Data lakes are usually larger, and their storage is cheaper. The stored data need not satisfy a specific schema, and the lake does not try to enforce one. Instead, the data's consumer typically resolves the schema when reading (schema-on-read) and applies transformations while processing the data.

Today, many companies build both architectures at once: one large data warehouse plus several small data lakes. As a result, some data is stored redundantly in both.

2. Data Lakehouse

The Data Lakehouse attempts to bridge the gap between data warehouses and data lakes. Building warehouse capabilities on the data lake makes storage cheaper and more flexible; at the same time, the lakehouse can effectively improve data quality and reduce data redundancy. In building a lakehouse, ETL plays a very important role: it converts the lake layer's unstructured data into the warehouse layer's structured data.

In more detail:

Data Lakehouse :

Per Databricks' definition, a Lakehouse is a new paradigm that combines the advantages of data lakes and data warehouses and addresses the limitations of data lakes. The Lakehouse uses a new system design: implementing data structures and data management features similar to a warehouse's directly on the low-cost storage of the data lake.

An expanded explanation :

Put simply, the lakehouse combines enterprise-oriented data warehouse technology with data lake storage technology, giving the enterprise a unified, shareable data foundation.

It avoids moving data back and forth between a traditional lake and warehouse: raw data, processed and cleaned data, and modeled data all live in one integrated "lakehouse". This supports high-concurrency, precise, high-performance queries over both historical and real-time data, and carries analytical workloads such as reporting, batch processing, and data mining.

The lakehouse solution helps enterprises build a new, integrated data platform. With support for machine learning and AI algorithms, it closes the data lake + data warehouse loop and improves business efficiency. The capabilities of the lake and the warehouse combine and complement each other, while connecting to the diverse computing ecosystem above.

The Lakehouse has the following key characteristics :

  • Transaction support : in enterprise use, many data pipelines read and write data concurrently, often with multiple parties using SQL at the same time; the lakehouse guarantees ACID transactional consistency.

  • Schema enforcement and governance : the lakehouse should support schema enforcement and evolution, including DW schema patterns such as star/snowflake schemas. The system should be able to reason about data integrity and should have robust governance and auditing mechanisms.

  • BI support : BI tools can run directly on the source data. This reduces staleness and latency, improves freshness, and removes the cost of operating two copies of the data, one in the lake and one in the warehouse.

  • Separation of storage and compute : in practice this means storage and compute run on separate clusters, so the system can scale to more concurrent users and larger data volumes. Some modern data warehouses also have this property.

  • Openness : the storage formats used are open and standardized, such as Parquet, and multiple APIs are provided, including machine learning and Python/R libraries, so a variety of tools and engines can access the data directly and efficiently.

  • Support for data types from unstructured to structured : the lakehouse can store, refine, analyze, and access the data types many new data applications need, including images, video, audio, semi-structured data, and text.

  • Support for diverse workloads : including data science, machine learning, and SQL analytics. These may rely on different tools, but they all depend on the same data repository.

  • End-to-end streaming : real-time reporting is an everyday need for many businesses. Native stream-processing support removes the need for a separate system dedicated to serving real-time data applications.

The picture above is Databricks' reference diagram of this architecture evolution.

We can see that the traditional data warehouse has a very clear goal and suits BI analysis and reporting over merged business data sources. As enterprises need to process more and more kinds of data, including customer behavior, IoT, images, and video, data volume grows exponentially.

Data lake technology was then introduced to play the role of a general-purpose data storage and processing platform. Thanks to its distributed storage and compute, the data lake can also better support machine learning workloads. In the data lake era, we usually see the data lake and the data warehouse coexisting.

With the big data era in full swing, can big data technology replace the traditional data warehouse and form a unified data processing architecture? The lakehouse is exactly this kind of exploration and practice.
