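1. Introduction

This article compares the emerging data lake technology with data warehouse technology, and then briefly introduces three common data lake implementations.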

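2. Data Warehouse Pain Points

Unstructured data is not stored

This does not mean that a data warehouse cannot store unstructured data. Rather, the warehouse's layered model means that data is cleaned and computed into structured form, and modeling and analysis are then performed on the processed data.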

A typical warehouse layering model is ODS -> DWD -> DWS -> APP. Data analysts usually work on the APP or DWS layer rather than analyzing the ODS (raw data) layer directly.
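The layering described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not how any particular warehouse implements it: the sample records, field names, and the `build_dwd`, `build_dws`, and `build_app` helpers are all hypothetical. It shows how each layer moves further from the raw input, and how records that do not fit the expected structure are dropped before they reach the layers analysts query.

```python
import json
from collections import defaultdict

# ODS: raw records exactly as they arrived (hypothetical sample data).
ODS = [
    '{"user": "u1", "amount": "12.50", "ts": "2024-01-01T10:00:00"}',
    '{"user": "u2", "amount": "7.00", "ts": "2024-01-01T11:30:00"}',
    '{"user": "u1", "amount": "oops", "ts": "2024-01-02T09:15:00"}',  # malformed amount
    'not even json',                                                  # unparseable line
    '{"user": "u1", "amount": "3.50", "ts": "2024-01-02T18:45:00"}',
]


def build_dwd(ods_rows):
    """DWD: parse and clean raw rows into typed, structured records."""
    cleaned = []
    for raw in ods_rows:
        try:
            rec = json.loads(raw)
            cleaned.append({
                "user": rec["user"],
                "amount": float(rec["amount"]),
                "day": rec["ts"][:10],
            })
        except (ValueError, KeyError, TypeError):
            # Rows that cannot be normalized never reach the upper layers.
            continue
    return cleaned


def build_dws(dwd_rows):
    """DWS: aggregate the cleaned detail into per-user, per-day totals."""
    totals = defaultdict(float)
    for row in dwd_rows:
        totals[(row["user"], row["day"])] += row["amount"]
    return dict(totals)


def build_app(dws_totals):
    """APP: a report-ready view, here the top spender for each day."""
    per_day = {}
    for (user, day), total in dws_totals.items():
        if day not in per_day or total > per_day[day][1]:
            per_day[day] = (user, total)
    return per_day


dwd = build_dwd(ODS)
dws = build_dws(dwd)
app = build_app(dws)
```

In this toy run, two of the five ODS rows never reach DWD. An analyst working only on `dws` or `app` cannot see them, which is the limitation this section describes.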

Raw data is not retained

For cost reasons, companies usually keep raw data in the ODS layer only for a limited retention period.
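As a rough illustration of such a retention policy, the sketch below picks the ODS partitions that a cleanup job would delete. The 90-day window, the date-named partitions, and the `expired_partitions` helper are assumptions made for this example, not values from any specific system.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # hypothetical retention window


def expired_partitions(partitions, today, retention_days=RETENTION_DAYS):
    """Return the date-named partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if date.fromisoformat(p) < cutoff)


# Hypothetical ODS partitions, one per ingestion day.
ods_partitions = ["2024-01-01", "2024-03-15", "2024-04-01"]
to_delete = expired_partitions(ods_partitions, today=date(2024, 4, 10))
```

Once a partition is deleted, the raw data for that day can no longer be reprocessed or analyzed in a new way. Only the structured results derived from it remain.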

The core ideas of a data lake are as follows:

Three data lake implementations are currently popular in the industry:

Apache Hudi

Hudi is a rich platform to build streaming data lakes with incremental data pipelines on a self-managing database layer, while being optimized for lake engines and regular batch processing.

Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to safely work with the same tables, at the same time.

Delta Lake

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.

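Below is a comparison of the three options:

By comparing the characteristics of data lakes and data warehouses, this article finds that data lakes did not emerge to replace data warehouses, but to fill in capabilities that warehouses lack.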