A detailed look at big data storage architectures: data warehouse, data mart, data lake, data mesh, and lakehouse

Foreword

This article is part of the column "Theoretical System of Big Data". The column is the author's original work; please credit the source when quoting it. If you find any shortcomings or mistakes, please point them out in the comment area. Thank you!

For the directory structure and references of this column, please refer to "Big Data Theory System".


Mind map

[Image: mind map of this article]


Data warehouse

A data warehouse is a subject-oriented, integrated,
non-volatile, and time-variant collection of data.

The main goal of a data warehouse is to provide consistent, reliable, and easily accessible data
to support business decision-making and analysis.

It can help companies understand their business, market and customers,
and provide decision support and predictive analysis capabilities.

Data warehouses are widely used in business intelligence and data analysis.

For more information about data warehouses, please refer to my blog post "What is a data warehouse?"

For more information about business intelligence, please refer to my blog post "What is business intelligence (BI)?"


Database vs Data Warehouse

| Difference | Database | Data warehouse |
| --- | --- | --- |
| Design goal | Support the day-to-day business operations of the enterprise | Support enterprise decision-making and analysis |
| Data structure | Application-oriented design | Subject-oriented design |
| Processing mode | Online transaction processing (OLTP) | Online analytical processing (OLAP) |
| Data range | Current state data | Historical, complete data that reflects changes over time |
| Data changes | Frequent inserts, deletes, updates, and queries | Data is appended but generally not deleted or modified, so history is preserved |
| Design theory | Follows the three normal forms and avoids redundancy | Deliberately relaxes the normal forms and allows appropriate redundancy |
| Processing profile | Frequent, small-batch, high-concurrency, low-latency | Infrequent, large-batch, high-throughput, higher-latency |

For details on the comparison between databases and data warehouses, please refer to my blog post "What is the difference between a data warehouse and a database?"
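The "data range" and "data changes" rows of the comparison can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the table names `customer` and `customer_history` and the `load_date` column are assumptions made for this example, not part of the original article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational database: one row per customer, updated in place,
# so only the current state survives.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Beijing')")
conn.execute("UPDATE customer SET city = 'Shanghai' WHERE id = 1")

# Warehouse-style table (hypothetical layout): rows are only appended,
# each stamped with the date it was loaded, so history is preserved.
conn.execute(
    "CREATE TABLE customer_history (id INTEGER, city TEXT, load_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_history VALUES (?, ?, ?)",
    [(1, "Beijing", "2023-01-01"), (1, "Shanghai", "2023-06-01")],
)

current = conn.execute("SELECT city FROM customer WHERE id = 1").fetchall()
history = conn.execute(
    "SELECT city, load_date FROM customer_history WHERE id = 1 ORDER BY load_date"
).fetchall()
```

After the update, `current` holds only the latest city, while `history` still contains both versions together with their load dates.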

OLTP vs OLAP

| Comparison item | OLTP | OLAP |
| --- | --- | --- |
| Users | Operators, low-level managers | Decision makers, senior managers |
| Function | Daily operations | Analysis and decision-making |
| DB design | Based on the ER model, application-oriented | Star/snowflake/constellation models, subject-oriented |
| DB size | GB to TB | ≥ TB |
| Data | Up-to-date, detailed, two-dimensional, discrete | Historical, aggregated, multidimensional, integrated |
| Records accessed per operation | Reads/writes a few (up to hundreds of) records | Reads millions (or even hundreds of millions) of records |
| Operation frequency | Very frequent (on the order of seconds) | Relatively infrequent (hourly or even weekly) |
| Unit of work | Strict transactions | Complex queries |
| Number of users | Hundreds to tens of millions | A few to hundreds |
| Metric | Transaction throughput | Query throughput, response time |

For details about the comparison between OLTP and OLAP, please refer to my blog post "What is the difference between OLTP and OLAP?"
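To make the "unit of work" contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tiny star-style schema (`dim_product`, `fact_sales`) and its values are invented for illustration only; a real OLAP system would run on a dedicated analytical engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical star-style schema: one dimension table, one fact table.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales (product_id, amount)
        VALUES (1, 10.0), (1, 15.0), (2, 40.0);
    """
)

# OLTP-style unit of work: a short transaction that touches a single record.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)", (2, 20.0)
    )

# OLAP-style unit of work: a query that scans and aggregates many records.
totals = conn.execute(
    """
    SELECT p.category, SUM(s.amount)
    FROM fact_sales AS s
    JOIN dim_product AS p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
```

The insert is measured by how many such transactions complete per second, whereas the aggregate query is measured by how quickly it returns over a large number of rows.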

Data warehouse layering

[Image: data warehouse layering diagram]

For details on the layering of the data warehouse, please refer to my blog post "How is a data warehouse layered?"

Data warehouse modeling

[Image: data warehouse modeling diagram]

For details about modeling methodology, please refer to my following two blog posts:

  1. Typical Data Warehouse Modeling Methodology
  2. How is a data warehouse modeled?

Data mart

A data mart is a subset of a data warehouse specialized for a particular business unit or subject area.
It holds a small, selected portion of the company's data drawn from the larger storage system,
and obtains data from fewer sources than a data warehouse.

For more information about data marts, please refer to my blog post "What is a data mart? What is the difference between a data mart and a data warehouse?"

If the data warehouse is regarded as the data collection of the whole company, a data mart can be regarded as one department within it, responsible only for the data of a specific business area.

Data Mart vs Data Warehouse

A Data Warehouse is a repository for the entire enterprise that contains integrated data from different businesses, systems, and departments. It is built on an enterprise-wide data model and targets enterprise-wide topics.

Features of a data warehouse include:

  • Enterprise-wide coverage: The data warehouse provides decision support for departments and operations across the enterprise.
  • Integrated data: The data warehouse brings together data from multiple businesses, systems, and departments, and through data cleaning, integration, and transformation, it meets the analysis and reporting needs of the enterprise.
  • Enterprise-level architecture: A data warehouse is an enterprise-level solution, usually designed, built and maintained by a professional team.
  • Enterprise-oriented theme: The theme of the data warehouse is related to the operation of the entire enterprise, such as sales, customers, supply chain, etc.

A data mart is a subject-specific data repository for a particular business domain or functional unit. It is usually departmental and provides decision support to managers within a limited scope.

Features of a data mart include:

  • Department-level application: The data mart mainly serves the business needs of a specific department or functional unit, and provides data analysis and reports for the department.
  • Department-oriented theme: The theme of the data mart is related to a specific business or functional unit, such as sales performance, marketing, finance, etc.
  • Data source: The data of a data mart can come from the data warehouse (a dependent data mart) or directly from production systems (an independent data mart).
  • Relatively small scale: A data mart is usually on the order of tens of gigabytes, which is small compared with a data warehouse.

Below is a table describing the difference between a data warehouse and a data mart:

| | Data warehouse | Data mart |
| --- | --- | --- |
| Scope | Entire enterprise | A specific department or functional unit |
| Data sources | Integrates data from different businesses, systems, and departments | The data warehouse, or individual production systems |
| Scale | Larger (enterprise level) | Relatively small (department level) |
| Architecture | Enterprise-level architecture | Department-level architecture |
| Theme | Enterprise-wide themes | Department-specific themes |
| Goal | Decision support for all departments across the enterprise | Decision support for a specific department |
| Function | Enterprise-wide data analysis and reporting | Department-level data analysis and reporting |
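A dependent data mart, one that draws its data from the warehouse, can be sketched as a filtered copy of a warehouse table. The example below uses Python's built-in sqlite3 module; the table names `warehouse_facts` and `sales_mart` and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Enterprise-wide warehouse table covering several departments.
    CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL);
    INSERT INTO warehouse_facts VALUES
        ('sales', 'revenue', 120.0),
        ('sales', 'orders', 30.0),
        ('finance', 'costs', 80.0),
        ('marketing', 'leads', 45.0);

    -- Dependent data mart: only the rows the sales department needs.
    CREATE TABLE sales_mart AS
        SELECT metric, value FROM warehouse_facts WHERE department = 'sales';
    """
)

mart_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
warehouse_count = conn.execute(
    "SELECT COUNT(*) FROM warehouse_facts"
).fetchone()[0]
```

The mart ends up smaller and narrower than the warehouse it was derived from. An independent data mart would instead be loaded directly from the production systems.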

Data lake

A data lake is an approach to storing large-scale, diverse data. It can store structured, semi-structured, and unstructured data. It is a large, flexible data repository that can integrate all of an enterprise's data sources.

For more information about data lakes, please refer to my blog post "What is a data lake? Why do you need a data lake?"

Structured, semi-structured and unstructured data

Structured, semi-structured, and unstructured data are three categories of data, distinguished by how strictly the data is organized.

  1. Structured data: Structured data refers to data that can be represented and stored in a relational database, usually in the form of a two-dimensional table. Structured data has the following characteristics:

    • The data is in units of rows, each row of data represents the information of an entity, and the attributes of each row are the same.
    • Data can be represented by a unified structure, such as numbers, symbols, etc.
    • Data is logically represented as a two-dimensional table made up of attributes (columns) and tuples (rows). For example, in a transcript table, "score" is an attribute, and one student's row containing a score of 90 is a tuple.
    • There are certain rules in storage and arrangement, which are convenient for operations such as query and modification.
  2. Semi-structured data: Semi-structured data is a form of structured data that does not fully conform to the specifications of relational data. Semi-structured data has the following characteristics:

    • Semi-structured data has both data and structure, but the structure is not strictly fixed.
    • Semi-structured data can use various data representation formats, such as XML and JSON.
    • The structure of the data may vary from record to record, but is still somewhat parseable and organized.
    • Semi-structured data is commonly found in scenarios such as web data, log files, and configuration files.
  3. Unstructured data: Unstructured data refers to data without a fixed structure and format, and usually cannot be stored and represented in the form of a relational database. Unstructured data has the following characteristics:

    • The data does not have a clear organizational structure, and may take the form of free text, images, audio, video, and so on.
    • Unstructured data is not suitable for storage and management using traditional relational databases.
    • The analysis and processing of unstructured data requires the use of specific technologies and tools, such as natural language processing, image processing, audio processing, etc.
    • Unstructured data is commonly found in social media content, emails, documents, multimedia files, and more.

To sum up, structured data is data with a fixed structure and regular arrangement, semi-structured data is a form of data between structured and unstructured data, and unstructured data is data without a clear structure or format. These different types of data require different methods and tools for processing and management.
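The three categories can be seen side by side in a short Python sketch that uses only the standard library. The sample transcript rows, JSON records, and review sentence are made up for illustration.

```python
import csv
import io
import json

# Structured: every row has the same attributes, like a relational table.
table_text = "student,course,score\nAlice,Math,90\nBob,Math,85\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: each record carries its own structure, which may vary.
json_lines = [
    '{"user": "alice", "tags": ["a", "b"]}',
    '{"user": "bob", "device": {"os": "linux"}}',
]
records = [json.loads(line) for line in json_lines]
field_sets = [sorted(record) for record in records]

# Unstructured: free text has no schema; any structure must be derived
# by processing it (here, a naive whitespace word count).
review = "The delivery was late but the product is great."
word_count = len(review.split())
```

The CSV rows all share the same three attributes, the two JSON records expose different fields (`tags` versus `device`), and the review text only yields structure after some form of processing.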

Data Warehouse vs Data Lake

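| Parameter | Data warehouse | Data lake |
| --- | --- | --- |
| Data stored | Structured data | Structured, semi-structured, and unstructured data |
| Data preparation | Cleaned and processed data | Raw data, no preprocessing required |
| Data structure | Predefined, strict schema | No fixed schema; data is stored in raw form |
| Purpose | Business intelligence and analytics | Exploratory analysis and machine learning |
| Users | Business analysts and business users | Data scientists and engineers |
| Data access | SQL queries | A variety of tools and technologies, such as Apache Spark and Hadoop |
| Data scale | Relatively small (compared with a data lake) | Can store data at large scale, including petabytes |
| Processing approach | Extract, transform, load (ETL) | Extract, load, transform (ELT) |
| Processing speed | High performance, suited to historical data analysis | Highly flexible, suited to real-time and streaming data analysis |
| Data architecture | Star or snowflake schema | No specific data architecture |
| Cost | Relatively high; requires predefined schemas and planning | Relatively low; can store all types of data |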
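The ETL versus ELT row of the comparison can be sketched in a few lines of plain Python. The event format and the `transform` helper are assumptions made for this example; real pipelines would typically use dedicated tools rather than in-memory lists.

```python
import json

# Raw events as they arrive from a source system (hypothetical format).
raw_events = [
    '{"user": "alice", "amount": "12.5"}',
    '{"user": "bob"}',  # missing amount
    '{"user": "carol", "amount": "7"}',
]


def transform(line):
    """Parse one raw event into a typed row, or return None if it is invalid."""
    event = json.loads(line)
    if "amount" not in event:
        return None
    return {"user": event["user"], "amount": float(event["amount"])}


# ETL (warehouse style): transform and validate first, load only clean rows.
warehouse_table = [row for row in map(transform, raw_events) if row is not None]

# ELT (lake style): load everything as-is, transform only when reading.
lake_storage = list(raw_events)
total_on_read = sum(
    row["amount"] for row in map(transform, lake_storage) if row is not None
)
```

The warehouse-style table never contains the incomplete event, while the lake keeps all three raw lines and decides how to interpret them at read time.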

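Data mesh

Data Mesh is an emerging concept that aims to help organizations better manage and use data assets scattered across different systems and applications. It emphasizes turning data assets into reusable, composable, and interoperable data elements that support business innovation and digital transformation within and across organizations.

The core idea of Data Mesh is based on an event-driven architecture: by combining business events with data elements, it turns data assets into programmable, composable services and functions. This approach helps organizations better understand and use their data assets, and supports more efficient and flexible business processes and data processing.

Data Mesh also emphasizes data governance and data security to ensure the accuracy, reliability, and security of data. It provides a set of data management and governance tools that help organizations manage their data assets and comply with regulations and standards.

For details about data mesh, please refer to my blog post "What is a data mesh?"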

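Data Warehouse vs Data Mesh

| Characteristic | Data warehouse | Data Mesh |
| --- | --- | --- |
| Source | Traditionally, a data warehouse integrates heterogeneous data sources into one centralized location (usually a database). | A data mesh distributes data across domain teams, each responsible for its own data products. |
| Data ownership | Usually managed and maintained by a central team. | Ownership is delegated to domain teams, each of which manages and owns its own data. |
| Architecture | Usually centralized: data is integrated into one central store. | Distributed: data is stored with the different domain teams and is connected and exchanged through standardized rules and syntax. |
| Data redundancy and business alignment | Usually merges and consolidates data to eliminate redundancy and meet business needs. | Allows redundancy between domain teams so that each can meet its own business needs. |
| Importance of data observability | Data quality must be monitored to keep data high-quality and reliable. | Data quality must likewise be monitored to keep data reliable and discoverable. |
| Goal | Provide a consistent, trustworthy data source for enterprise decision support and analysis. | Enable faster insight and analysis through data products owned by domain teams, and drive data-driven decision-making. |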
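The ownership and discoverability ideas in this comparison can be sketched with a toy model. Everything here is hypothetical: `DataProduct`, `Catalog`, and the team and product names are invented for illustration and do not correspond to any particular data mesh tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A dataset published and owned by one domain team (hypothetical model)."""

    name: str
    owner_team: str
    schema: dict
    rows: list = field(default_factory=list)


class Catalog:
    """Shared registry that makes data products discoverable across domains."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        # Each team publishes its own product; no central team copies the data.
        self._products[product.name] = product

    def find(self, name):
        return self._products[name]

    def owned_by(self, team):
        return sorted(
            p.name for p in self._products.values() if p.owner_team == team
        )


catalog = Catalog()
catalog.publish(
    DataProduct(
        "orders",
        "sales",
        {"order_id": "int", "amount": "float"},
        [{"order_id": 1, "amount": 9.5}],
    )
)
catalog.publish(DataProduct("campaigns", "marketing", {"campaign_id": "int"}))

orders = catalog.find("orders")
```

The catalog holds only references and metadata for discovery; each product's data and schema remain the responsibility of the team that owns it, in contrast to a warehouse where a central team integrates everything into one store.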

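Lakehouse

The lakehouse is a new, open data architecture that combines the strengths of the data lake and the data warehouse,
offering the flexibility and scalability of a data lake together with the data management capabilities of a data warehouse.
It is built on the low-cost storage infrastructure of a data lake.
It keeps the characteristics of a data lake, such as storing unstructured and semi-structured data,
while also supporting transactions, data governance, and data modeling, which are capabilities of a data warehouse.

For details about the lakehouse, please refer to my blog post "What is a lakehouse?"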
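One way to picture "transactions on top of low-cost file storage" is a set of data files plus an append-only commit log, where readers only trust files recorded in the log. The sketch below is a toy illustration of that idea under simplifying assumptions (a single writer, local files, JSON data); the `ToyTable` class and file names are hypothetical and are not how any specific lakehouse product is implemented.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Data files plus an append-only commit log; readers trust only the log."""

    def __init__(self, root):
        self.root = Path(root)
        self.log = self.root / "_commits.jsonl"
        self.log.touch()

    def write(self, name, rows, commit=True):
        # Step 1: write the data file to cheap storage.
        (self.root / name).write_text(json.dumps(rows))
        # Step 2: the file becomes visible only once it is recorded in the log.
        if commit:
            with self.log.open("a") as log_file:
                log_file.write(json.dumps({"add": name}) + "\n")

    def read(self):
        # Readers ignore any file that is not listed in the commit log.
        rows = []
        for line in self.log.read_text().splitlines():
            name = json.loads(line)["add"]
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(tmp)
    table.write("part-0.json", [{"id": 1}, {"id": 2}])
    # Simulate a writer that crashed before committing its file.
    table.write("part-1.json", [{"id": 3}], commit=False)
    visible = table.read()
```

The uncommitted file exists on disk but is invisible to readers, which is the essence of the all-or-nothing behavior that a plain directory of files does not provide on its own.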

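Data Warehouse vs Lakehouse

| Characteristic | Data warehouse | Lakehouse |
| --- | --- | --- |
| Data stored | Structured data | Structured, semi-structured, and unstructured data |
| Processing mode | Batch processing | Batch and real-time processing |
| Data integration | Integrated | Non-integrated |
| Data model | Fact and dimension models | No explicit data model |
| Update frequency | Periodic updates | Real-time or near-real-time updates |
| Data access | Predefined queries | Self-service queries |
| Scalability | Limited | Highly scalable |
| Data security | Strict access control | Flexible access control |
| Processing tools and technologies | ETL tools and SQL | Big data processing tools and technologies |
| Target users | Decision makers and analysts | Decision makers, analysts, and data scientists |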

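Summary

Databases, data warehouses, data marts, data lakes, data meshes, and lakehouses are different solutions for data management and storage. They differ in the following ways:

  1. A database is a place to store related data, used to capture the data of a specific situation. It can be structured, relational, unstructured, or NoSQL. Databases are mainly used for online transaction processing (OLTP): they process real-time transactional data and serve a specific purpose and application.
  2. A data warehouse is the organization's core analytical system, used to store historical data and support data analysis. It works together with an operational data store (ODS) to capture data from various databases and store it in one unified location. Using extract-transform-load (ETL) or a similar ELT process, data is extracted from the databases, transformed and cleaned, and then loaded into the warehouse. Data warehouses are usually queried with SQL and use tables, indexes, keys, views, and data types to organize data and maintain integrity. They are mainly used for online analytical processing (OLAP), supporting internal data analysis and business intelligence.
  3. A data mart is a subset of a data warehouse that serves a specific business department or business unit. It is usually built for a particular need, to meet a department's data analysis and decision-making requirements. A data mart is contained in the data warehouse, and its datasets are used for real-time analysis and actionable results.
  4. A data lake is a large repository for raw data that can store structured, semi-structured, and unstructured data. It accepts data from different sources without converting it to a specific format or processing it first. The stored data can be processed and analyzed when needed. A data lake suits scenarios that require storing large amounts of raw data and performing flexible analysis and exploration.
  5. Data Mesh is a concept for data organization and architecture that aims at data autonomy and sharing. It encourages delegating data ownership and management responsibility to data owners, to better support data sharing and collaboration across organizations and teams.
  6. A lakehouse is a solution that integrates the data lake and the data warehouse. It combines the flexibility of a data lake with the structured analysis capabilities of a data warehouse, so that users can both explore raw data and analyze historical data.

In short, databases are mainly used for online transaction processing; data warehouses store historical data and support data analysis; data marts are subsets of a data warehouse that meet the needs of specific business departments; data lakes store raw data and support flexible analysis; Data Mesh encourages data autonomy and sharing; and the lakehouse integrates the data lake and the data warehouse.

The following table describes the main differences between databases, data warehouses, data marts, data lakes, data meshes, and lakehouses: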

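| | Database | Data warehouse | Data mart | Data lake | Data Mesh | Lakehouse |
| --- | --- | --- | --- | --- | --- | --- |
| Definition | A place to store related data | Stores historical data and supports data analysis | A data subset for a specific business department | A large repository for raw data | Data autonomy and sharing | A solution that integrates the data lake and the data warehouse |
| Use | Online transaction processing (OLTP) | Online analytical processing (OLAP) | Data analysis and decision support for a specific business department | Flexible data analysis and exploration | Data sharing and collaboration across organizations and teams | Raw data exploration and historical data analysis |
| Data types | Structured, relational, unstructured, NoSQL | Structured | Structured | Structured, semi-structured, unstructured | Structured, semi-structured, unstructured | Structured, semi-structured, unstructured |
| Data processing | Real-time transactional data processing | Extract-transform-load (ETL) or a similar ELT process | Data extraction and integration for specific needs | Raw data storage; processing and analysis on demand | Data-owner autonomy, distributed data sharing | Combines raw data exploration and historical data analysis |
| Query | SQL queries | SQL queries | SQL queries | On-demand processing and analysis | Distributed data querying and sharing | Combines raw data exploration and historical data analysis |
| Data organization | Tables, indexes, keys, views, data types | Tables, indexes, keys, views, data types | Tables, indexes, keys, views, data types | Flexible data organization | Distributed data organization and architecture | Flexible data organization |
| Data sharing | Limited sharing capability | Sharing for specific users and departments | Sharing for specific business units | Emphasis on sharing across organizations and teams | Emphasis on data autonomy and sharing | Combines the sharing capabilities of data lakes and data warehouses |
| Data analysis | Real-time transactional data analysis | Historical data analysis and business intelligence | Data analysis and decision support for specific business units | Flexible data analysis and exploration | Data analysis and collaboration across organizations and teams | Combines raw data exploration and historical data analysis |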

Origin blog.csdn.net/Shockang/article/details/131512410