What is Delta Lake?

Foreword

This article belongs to the column "Big Data Technology System". The column is the author's original work; please cite the source when quoting it, and feel free to point out any shortcomings or mistakes in the comments section. Thank you!

For the column's table of contents and references, please refer to Big Data Technology System.


Background

Data lakes are very useful and convenient, so let's analyze why they are needed and how they are used.

For data lakes, please refer to my blog - What is a data lake? Why do you need a data lake?

Hadoop systems and data lakes are often mentioned together.

In a deployment based on a distributed processing architecture, data is loaded into the Hadoop Distributed File System (HDFS) and stored across many compute nodes of a Hadoop cluster.

However, data lakes are increasingly built using cloud object storage services rather than Hadoop.

Some NoSQL databases are also used as platforms for data lakes.

About NoSQL, please refer to my blog - what is NoSQL?

Large data sets containing structured, unstructured, and semi-structured data are often stored in data lakes.

Most data warehouses are built on relational systems (ROLAP) and are not well suited to this kind of data.

About ROLAP, please refer to my blog - What are the implementation methods of OLAP?

Relational systems can usually store only structured transactional data because they require a fixed schema.

Data lakes, by contrast, do not require any upfront schema definition and support a variety of schemas.

As a result, they can manage many different data types in many different forms.

Data lakes built on the Hadoop framework also lack a very basic feature: ACID compliance.

Hive tries to overcome some of these limitations by providing update functionality, but the overall process is messy.

Different companies in the industry have different solutions to the above problems. Databricks (the company behind Spark) has proposed a unique solution, namely Delta Lake.

Delta Lake allows ACID transactions on existing data lakes.

It can be seamlessly integrated with many big data frameworks, such as Spark, Presto, Athena, Redshift, Snowflake, etc.


WHAT


Delta Lake is an open-source project that runs on top of your existing data lake, builds a lakehouse architecture on it, and is fully compatible with the Apache Spark APIs.

For the lakehouse architecture, please refer to my blog - What is Lakehouse?

Delta Lake provides ACID transactions, scalable metadata processing, and unifies streaming and batch data processing on existing data lakes such as S3, ADLS, GCS, and HDFS.

Specifically, Delta Lake offers:

  • ACID transactions on Spark: Serializable isolation levels ensure readers never see inconsistent data.
  • Scalable metadata handling: Uses Spark's distributed processing power to easily handle all the metadata of petabyte-scale tables with billions of files.
  • Stream-batch unification: A Delta Lake table is both a batch table and a streaming source and sink. Streaming data ingestion, batch historical backfill, and interactive queries all work out of the box.
  • Schema enforcement: Automatically handles schema changes and prevents bad records from being inserted during ingestion.
  • Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Upsert/delete: Supports merge, update, and delete operations to enable complex use cases such as change data capture, slowly changing dimension (SCD) operations, and streaming upserts (see the sketch after this list).
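
To make these features concrete, here is a minimal PySpark sketch, assuming a Spark environment with the delta-spark package installed; the table path, schema, and sample rows are illustrative assumptions. It writes a Delta table, reads an earlier version via time travel, and performs an upsert with MERGE.

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Build a Delta-enabled Spark session (assumes the delta-spark pip package).
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events_delta"  # placeholder path

# ACID write: this creates version 0 of the table.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "type"]) \
    .write.format("delta").mode("overwrite").save(path)

# An append creates version 1; time travel can still read version 0.
spark.createDataFrame([(3, "view")], ["id", "type"]) \
    .write.format("delta").mode("append").save(path)
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Upsert (MERGE): update matching rows, insert the rest.
updates = spark.createDataFrame([(2, "purchase"), (4, "click")], ["id", "type"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```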

What do we do with Delta Lake?

Big data architectures are currently challenging to develop, operate and maintain.

Modern data architectures typically combine at least three kinds of systems: real-time (streaming) compute, data lakes, and data warehouses.

Business data is transported over streaming systems, such as Apache Kafka, that prioritize fast delivery.

About Apache Kafka, please refer to my blog - what is Kafka?

The data is then collected in data lakes, such as Apache Hadoop (HDFS) or Amazon S3, which provide large-scale, inexpensive storage.

The most important data is then loaded into a data warehouse because, unfortunately, data lakes alone cannot support high-end business applications in terms of performance or quality.

Warehouse storage costs significantly more than data lake storage, but in exchange it offers far better performance, concurrency, and security.

In the Lambda architecture, a common data-preparation pattern, batch and stream processing systems prepare records in parallel.

The results are then combined at query time to provide a complete answer.

This architecture is known for meeting strict latency requirements when handling recently generated events as well as older ones.

The main disadvantage of this architecture is the development and operational burden of maintaining two separate systems.

In the past, there have been attempts to integrate batch and stream processing into one system.

However, these attempts have not always been successful.

A key property of most databases is ACID compliance.

However, in terms of HDFS or S3, it is challenging to provide the same reliability as ACID databases.

Delta Lake implements ACID transactions through a transaction log that tracks every commit made to the table directory.

The Delta Lake architecture provides a serializable isolation level to ensure data consistency across many concurrent users.
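
As a rough sketch of how a single Delta table can replace the two pipelines of a Lambda architecture, the example below streams events into one table while batch queries read that same table. It assumes the Delta-enabled spark session from the earlier sketch, a Kafka source with the Spark Kafka connector available, and placeholder broker, topic, and paths.

```python
# Assumes the Delta-enabled `spark` session from the earlier sketch; the broker
# address, topic name, and paths are illustrative placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")
)

# Speed and batch layer in one: the stream appends to a single Delta table,
# with ACID guarantees per micro-batch.
stream_query = (
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/raw_events/_checkpoints")
    .outputMode("append")
    .start("/tmp/raw_events")
)

# The very same table is queryable in batch at any time: no second pipeline,
# no merge-at-query-time step.
spark.read.format("delta").load("/tmp/raw_events").groupBy("id").count().show()
```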


Parquet vs. Delta Lake

The underlying data of Delta Lake is stored in the Parquet format; Delta Lake is essentially a wrapper around Parquet that provides additional functionality on top of it.

About Parquet, please refer to my blog - what is Parquet?

Here are the differences:

  • Parquet is a columnar data storage format; Delta Lake is an ACID transactional storage layer.
  • Parquet encodes type declarations with the data; Delta Lake adds scalable metadata handling.
  • Parquet does not support data versioning; Delta Lake supports data versions.
  • Parquet can be used by any project in the Hadoop ecosystem, regardless of the data processing framework; Delta Lake is designed for the Spark processing framework, but can be integrated with Presto, Athena, and others.
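
To illustrate the relationship between the two formats, here is a hedged sketch that converts an existing Parquet directory into a Delta table in place, by generating a transaction log alongside the data files. The path and the date partition column are placeholder assumptions, and the Delta-enabled spark session from the earlier sketches is assumed.

```python
from delta.tables import DeltaTable

# Convert an existing Parquet directory (placeholder path, partitioned by `date`)
# into a Delta table by writing a _delta_log next to the data files.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/sales_parquet`", "date STRING")

# SQL equivalent:
# CONVERT TO DELTA parquet.`/tmp/sales_parquet` PARTITIONED BY (date STRING)

# After conversion, the same files are readable as a Delta table.
sales = spark.read.format("delta").load("/tmp/sales_parquet")
sales.printSchema()
```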

Architecture of Delta Lake

[Figure: Delta Lake architecture with Bronze, Silver, and Gold tables]

The Delta Lake architecture is generally divided into three layers.

Here, the Bronze tables are a typical data lake, with massive amounts of data constantly pouring in. At this point the data may be dirty (i.e. of low quality) because it comes from different sources, some of which are not very clean.

For data quality, please refer to my blog - how to evaluate data quality?

From there, the data flows continuously into the Silver tables, like the headwaters of a stream feeding the data lake: fast moving and constantly flowing.

As the data flows downstream, it is cleaned and filtered by various functions, filters, and query transformations, becoming purer as it moves.

By the time it reaches downstream processing, i.e. our Gold tables, it goes through some final cleansing and rigorous testing to make it ready for consumption, because the consumers, i.e. machine learning algorithms, data analytics, and so on, are very picky and will not tolerate polluted data.
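
A minimal sketch of such a Bronze-to-Silver-to-Gold pipeline with Spark Structured Streaming is shown below. The table paths, column names, and cleansing rules are illustrative assumptions rather than a prescribed layout, and the Delta-enabled spark session from the earlier sketches is assumed.

```python
from pyspark.sql import functions as F

# Bronze: raw data as it lands in the lake (assumed path and columns).
bronze = spark.readStream.format("delta").load("/tmp/bronze/events")

# Silver: cleaned, typed, and de-duplicated records.
silver = (
    bronze
    .filter(F.col("event_type").isNotNull())             # drop obviously bad rows
    .withColumn("event_ts", F.to_timestamp("event_ts"))  # normalize types
    .dropDuplicates(["event_id"])
)
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/silver/_checkpoints")
    .outputMode("append")
    .start("/tmp/silver/events"))

# Gold: aggregated, consumption-ready table for BI and ML consumers.
gold = (
    spark.readStream.format("delta").load("/tmp/silver/events")
    .groupBy(F.window("event_ts", "1 hour"), "event_type")
    .count()
)
(gold.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/gold/_checkpoints")
    .outputMode("complete")
    .start("/tmp/gold/hourly_event_counts"))
```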


How does Delta Lake work?

In order to understand how Delta Lake works, it is necessary to understand how the transaction log works.

The transaction log is the thread that runs through many of Delta Lake's most important features, including ACID transactions, scalable metadata handling, and time travel.

Whenever a user executes a command that modifies the table, Delta Lake breaks it down into a series of steps made up of one or more actions.

Action is a concept in Spark. For details, please refer to my blog - Spark Core core concepts all in one go.

These actions include the following; a sketch showing how to inspect them follows the list:

  • Add file: adds a data file to the table.
  • Delete file: removes a data file from the table.
  • Update metadata: updates the table's metadata, such as its schema.
  • Set transaction: records that a Structured Streaming job has committed a micro-batch with the given ID.
  • Change protocol: enables new features by upgrading the Delta Lake transaction log to the latest protocol version.
  • Commit info: contains information about the commit.
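
As a sketch of how these actions show up in practice, the snippet below assumes the table created in the first example, stored on a local filesystem. It prints the table history via the DeltaTable API and then reads the first commit file in _delta_log, where each JSON line carries one action.

```python
import json
import os

from delta.tables import DeltaTable

path = "/tmp/events_delta"  # placeholder: the table from the first sketch

# High-level view: every commit with its operation and parameters.
(DeltaTable.forPath(spark, path)
    .history()
    .select("version", "timestamp", "operation", "operationParameters")
    .show(truncate=False))

# Low-level view: each commit is a zero-padded JSON file under _delta_log,
# and every line holds one action (commitInfo, protocol, metaData, add, remove, txn).
first_commit = os.path.join(path, "_delta_log", "00000000000000000000.json")
with open(first_commit) as f:  # assumes a local filesystem path
    for line in f:
        print(list(json.loads(line).keys()))
```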


Origin blog.csdn.net/Shockang/article/details/126804682