Database, Data Warehouse, Data Lake and Huawei Smart Data Lake Solution

Nowadays, with the continuous development of technologies such as the Internet and the Internet of Things, more and more data is being produced, and data management tools have developed rapidly as well. Big-data concepts have sprung up like mushrooms after rain: database, data warehouse, data lake, lake-warehouse integration (the lakehouse), and so on. What do these concepts refer to, and how are they related? And what are Huawei's corresponding products and solutions? This article introduces and compares them one by one.

What is a database?

A database is "a warehouse that organizes, stores, and manages data according to a data structure".

Databases in the broad sense were already in use on computers in the 1960s. At that stage, however, database structures were mainly hierarchical or network models, and data was tightly coupled to the programs that used it, so their application was relatively limited.

Today, "database" usually means a relational database, that is, a database that organizes data using the relational model. It stores data in rows and columns and offers a high degree of structure, strong data independence, and low redundancy. The birth of the relational model in 1970 cleanly separated data from programs in software, and relational databases became an indispensable part of mainstream computer systems. They are now the most important category of database product: almost every new product released by database vendors supports the relational model, and even most non-relational products provide interfaces to relational databases.

Relational databases are mainly used for online transaction processing (OLTP): basic, day-to-day transaction processing such as bank transactions.
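To make the OLTP idea concrete, here is a minimal sketch of a bank transfer using Python's built-in sqlite3 module. The `accounts` table, the account names, and the `transfer` helper are assumptions made up for illustration; a production OLTP system would use a server database, but the principle is the same: several small writes either all succeed or all fail.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one atomic transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            # The debit above is undone by the rollback.
            raise ValueError("unknown destination account")

def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)
print(balances(conn))  # {'alice': 70, 'bob': 80}
```

If the second update fails, the rollback restores the first account's balance, so the data never shows a half-finished transfer.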

What is a data warehouse?

With the large-scale adoption of databases, data in the information industry grew explosively. To study the relationships within that data and tap its hidden value, more and more people needed online analytical processing (OLAP) to uncover deeper relationships and information. However, data is hard to share between different databases, and integrating and analyzing it is also a major challenge.

To solve enterprises' data integration and analysis problems, Bill Inmon, the father of the data warehouse, proposed the data warehouse in 1990. Its main function is to run OLAP over the large volume of data that OLTP systems accumulate over the years, using a storage structure designed for that purpose, so that decision makers can quickly and effectively extract valuable information from large amounts of data and get decision support. Since data warehouses emerged, the information industry has gradually evolved from operational systems based on relational databases toward decision support systems.

Compared with the database, the data warehouse mainly has the following two characteristics:

1) The data warehouse is subject-oriented and integrated. A data warehouse is built to support various businesses, and its data comes from scattered operational data. The required data is therefore extracted from multiple heterogeneous data sources, processed and integrated, reorganized by subject, and finally loaded into the data warehouse.

2) The data warehouse is mainly used to support enterprise decision-making and analysis, and the operations involved are mainly queries. It therefore improves query speed and reduces overhead by optimizing table structures and storage methods.
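The two characteristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (a web shop exporting CSV and a point-of-sale system exporting JSON), their field names, and the `sales` table are hypothetical, and sqlite3 stands in for a real warehouse engine. The sketch extracts records from the two differently shaped sources, integrates them into one subject table, and then runs a read-only analytical query.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical operational sources with different formats and field names.
WEB_ORDERS_CSV = "order_id,region,amount_cents\nw1,north,1200\nw2,south,800\n"
POS_ORDERS_JSON = (
    '[{"id": "p1", "area": "North", "total": 5.0},'
    ' {"id": "p2", "area": "South", "total": 7.5}]'
)

def extract_web(text):
    """Yield (order_id, channel, region, amount_cents) rows from the CSV export."""
    for row in csv.DictReader(io.StringIO(text)):
        yield (row["order_id"], "web", row["region"].lower(), int(row["amount_cents"]))

def extract_pos(text):
    """Yield rows in the same shape from the JSON export, converting units."""
    for rec in json.loads(text):
        yield (rec["id"], "pos", rec["area"].lower(), round(rec["total"] * 100))

def build_warehouse():
    """Integrate both sources into a single subject-oriented `sales` table."""
    wh = sqlite3.connect(":memory:")
    wh.execute(
        "CREATE TABLE sales (order_id TEXT, channel TEXT, region TEXT, amount_cents INTEGER)"
    )
    rows = list(extract_web(WEB_ORDERS_CSV)) + list(extract_pos(POS_ORDERS_JSON))
    wh.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    wh.commit()
    return wh

def revenue_by_region(wh):
    """A typical analytical query: aggregate over many rows, write nothing."""
    query = "SELECT region, SUM(amount_cents) FROM sales GROUP BY region ORDER BY region"
    return wh.execute(query).fetchall()

warehouse = build_warehouse()
print(revenue_by_region(warehouse))  # [('north', 1700), ('south', 1550)]
```

Note that the normalization (lowercasing regions, converting amounts to cents) happens before loading, which is what lets the later query stay simple.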

Table 1  Comparison between data warehouse and database

| Dimension | Data warehouse | Database |
| --- | --- | --- |
| Application scenario | OLAP | OLTP |
| Data sources | Multiple data sources | Single data source |
| Data standardization | Non-standardized schema | Highly standardized static schema |
| Read/write optimization | Optimized for read operations | Optimized for write operations |

What is a data lake?

Within enterprises, it is now widely agreed that data is an important asset. As an enterprise grows, its data keeps accumulating. Enterprises want to preserve all data relevant to production and operations, manage and govern it centrally, and mine and explore its value.

It is against this background that the data lake was born. A data lake is a large-scale data repository that centrally stores all kinds of structured and unstructured data. It can store raw data from multiple data sources and of various data types, and the data can be accessed, processed, analyzed, and transmitted without first being structured. Data lakes can help enterprises quickly complete federated analysis of heterogeneous data sources and mine and explore the value of data.
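The idea of storing raw data and structuring it only when it is used is often called schema-on-read. The sketch below illustrates it with a temporary directory standing in for lake storage; the file names, sample payloads, and the `ingest` and `read_records` helpers are assumptions for illustration, not part of any Huawei product.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

def ingest(lake, name, payload):
    """Store the payload byte-for-byte; nothing is parsed or validated on write."""
    path = lake / name
    path.write_bytes(payload)
    return path

def read_records(path):
    """Apply a structure only when the data is read (schema-on-read)."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        return json.loads(text)
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    # Unstructured data is returned as-is for downstream tools to interpret.
    return [{"raw": text}]

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    ingest(lake, "sensors.json", b'[{"device": "d1", "temp": 21.5}]')  # semi-structured
    ingest(lake, "orders.csv", b"order_id,amount\no1,12\n")            # structured
    ingest(lake, "support.log", b"user reported slow login\n")         # unstructured
    for path in sorted(lake.iterdir()):
        print(path.name, read_records(path))
```

Writes are cheap because nothing is checked, which is also why the cleaning cost shows up later, at analysis time.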

The essence of a data lake is a solution consisting of "data storage architecture + data processing tools".

1) Data storage architecture: It must have sufficient scalability and reliability to store massive data of any type, including structured, semi-structured and unstructured data.

2) Data processing tools are divided into two categories:

The first category of tools focuses on how to "move" data into the lake, including defining data sources, formulating data synchronization strategies, moving data, and compiling data catalogs.

The second category of tools focuses on how to analyze, mine, and utilize the data in the lake. Data lakes need comprehensive data management capabilities, diverse data analysis capabilities, comprehensive data lifecycle management capabilities, and secure data acquisition and data release capabilities. Without these data governance tools, and without metadata, the quality of data in the lake cannot be guaranteed, and the data lake will eventually turn into a data swamp.
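As a rough illustration of why metadata matters, the following sketch keeps a tiny in-memory data catalog. The `CatalogEntry` fields, the `DataCatalog` class, and the sample dataset names are all hypothetical; real governance platforms track far more (lineage, ownership, quality rules), but even this much makes datasets findable and makes undocumented files visible.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    name: str
    source: str      # where the data came from
    fmt: str         # storage format, e.g. "csv" or "json"
    size_bytes: int
    sha256: str      # checksum for later integrity checks

class DataCatalog:
    """A minimal metadata registry for datasets stored in a lake."""

    def __init__(self):
        self._entries = {}

    def register(self, name, source, fmt, payload):
        """Record metadata for a dataset at ingestion time."""
        entry = CatalogEntry(
            name, source, fmt, len(payload), hashlib.sha256(payload).hexdigest()
        )
        self._entries[name] = entry
        return entry

    def find(self, fmt=None, source=None):
        """Return names of registered datasets matching the given filters."""
        return sorted(
            e.name
            for e in self._entries.values()
            if (fmt is None or e.fmt == fmt) and (source is None or e.source == source)
        )

    def uncataloged(self, stored_names):
        """Return stored files with no metadata: candidates for a 'data swamp'."""
        return sorted(set(stored_names) - set(self._entries))

catalog = DataCatalog()
catalog.register("orders.csv", source="web-shop", fmt="csv", payload=b"order_id,amount\no1,12\n")
catalog.register("sensors.json", source="iot-gateway", fmt="json", payload=b'[{"device": "d1"}]')

print(catalog.find(fmt="csv"))  # ['orders.csv']
print(catalog.uncataloged(["orders.csv", "sensors.json", "dump_2019.bin"]))  # ['dump_2019.bin']
```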

With the development of big data and AI, the value of the data in the data lake is steadily rising and being redefined. Data lakes can bring enterprises a variety of capabilities, such as centralized data management, which helps them build more optimized operating models, as well as capabilities such as predictive analysis and recommendation models. These models can stimulate further growth of enterprise capabilities.

The difference between a data warehouse and a data lake can be compared to the difference between a warehouse and a lake: a warehouse stores goods from specific sources, whereas a lake's water flows in from rivers, streams, and other sources and is kept in its raw form, like raw data.

Table 2  Comparison of Data Lake and Data Warehouse

| Dimension | Data lake | Data warehouse |
| --- | --- | --- |
| Application scenario | Exploratory analysis of all types of data, including machine learning, data discovery, profiling, and prediction | Analysis of historical structured data |
| Cost | Low initial cost, high later cost | High initial cost, low later cost |
| Data quality | Contains a large amount of raw data that must be cleaned and standardized before use | High quality; can be used as a factual basis |
| Target users | Mainly data scientists and data developers | Business analysts |

Huawei Intelligent Data Lake Solution

Huawei's data enablement service DAYU provides large government and enterprise customers with a tailored, intelligent data resource management solution that spans isolated systems and is business-aware. It brings data from across the organization into the lake and helps these customers mine data value from multiple perspectives, levels, and granularities to achieve data-driven digital transformation.

The core of DAYU is Huawei's intelligent data lake FusionInsight, which includes computing engines such as databases, data warehouses, and data lakes, together with the data governance center, the DataArts Studio platform. It provides full life cycle management of data, including data management and data open services.

The services corresponding to the Huawei FusionInsight solution are as follows:

Database:

Relational databases include: cloud database RDS, cloud database GaussDB (for MySQL), cloud database GaussDB, cloud database PostgreSQL, cloud database SQL Server, etc.

Non-relational databases include: document database service DDS, cloud database GaussDB NoSQL (including Influx, Redis, Mongo, Cassandra), etc.

Data warehouse: Data Warehouse Service (DWS).

Data lake: cloud-native big data service MRS (MapReduce Service), Data Lake Insight (DLI), etc.

Data governance platform: DataArts Studio, the data governance center.

Reposted from: https://support.huaweicloud.com/dataartsstudio_faq/dataartsstudio_03_0004.html

Origin blog.csdn.net/fuhanghang/article/details/132170673