Big News | Delta Lake, Long Awaited by the Apache Spark Community, Is Now Open Source

Past Memory Big Data | Original article: https://www.iteblog.com/archives/2545.html

At the Spark+AI Summit 2019, held in San Francisco on April 24, 2019, Databricks co-founder and CEO Ali Ghodsi announced that Delta Lake, part of Databricks Runtime, would be open sourced under the Apache License 2.0. Delta Lake is a storage layer that brings ACID transactions to Apache Spark and big data workloads. Through optimistic concurrency control between writes and snapshot isolation, it provides consistent reads while data is being written, bringing reliability to data lakes built on HDFS and cloud storage. Delta Lake also provides built-in data versioning for easy rollback. The project home page is https://delta.io/, and the code is maintained at https://github.com/delta-io/delta.

Why do we need Delta Lake

Many companies now include a data lake in their internal data architecture. A data lake is a large data repository and processing engine: it can store huge volumes of data of many different types, offers powerful processing capabilities, and can handle an almost unlimited number of concurrent tasks or jobs. The concept was first proposed by Pentaho CTO James Dixon in 2011. Although data lakes were a big step forward in the range of data they can hold, they also face many problems, mainly the following:

  • Reads and writes to the data lake are unreliable. Data engineers often run into the problem that writes to the data lake are not safe, so readers see garbage data while a write is in progress, and they have to build workarounds to ensure readers always see consistent data during writes.
  • The quality of the data in the data lake is low. It is very easy to dump unstructured data into a data lake, but this comes at the cost of data quality. With no mechanism to validate schemas and data, data lakes end up full of poor-quality data, and the analytics projects that try to mine that data fail as well.
  • Performance degrades as data grows. As the amount of data stored in the data lake increases, so do the number of files and directories. Data processing jobs and query engines spend a large amount of time on metadata operations; for streaming jobs the problem is even more pronounced.
  • Updating data in a data lake is very hard. Engineers need to build complex pipelines that read an entire partition or table, modify the data, and write it back. This pattern is inefficient and difficult to maintain.

Because of these challenges, many big data projects fail to realize their vision, and some fail completely. We need a solution that lets data practitioners keep using their existing data lakes while guaranteeing data quality. This is the background against which Delta Lake was born.

Introduction to the Delta Lake Open Source Project

Delta Lake addresses the problems above and simplifies how we build data lakes. It provides the following main features:

Support for ACID transactions

Delta Lake provides ACID transaction guarantees between multiple concurrent writes. Every write is a transaction, and the order of writes is recorded in a transaction log. The transaction log tracks writes at the file level and uses optimistic concurrency control, which fits data lakes well because multiple writes or modifications to the same file rarely happen. When conflicts do occur, Delta Lake throws a concurrent-modification exception so that users can handle it and retry their jobs. Delta Lake also offers a strong serializable isolation level, which lets engineers keep writing to a directory or table while consumers keep reading from it; readers see the most recent snapshot that existed when their read began.
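As a rough sketch of what this looks like in practice (assuming the delta-core package is on the Spark classpath; the table path below is illustrative), each successful write shows up as one ordered JSON commit in the table's _delta_log directory:

    // Each write is committed atomically; a JSON entry is appended to _delta_log.
    val df = spark.range(0, 100).toDF("id")
    df.write.format("delta").mode("append").save("/tmp/delta/events")

    // The transaction log sits next to the data files:
    //   /tmp/delta/events/_delta_log/00000000000000000000.json  (first commit)
    //   /tmp/delta/events/_delta_log/00000000000000000001.json  (next commit)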

Schema management

Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the schema of the table. Columns that exist in the table but not in the DataFrame are set to null; if the DataFrame contains columns that are not present in the table, the operation throws an exception. Delta Lake also supports DDL to add new columns explicitly, as well as automatic schema updates.
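A minimal sketch of how this behaves (table path and column names are made up; the mergeSchema write option shown for explicit schema evolution is the one exposed by Delta Lake releases, so check it against the version you are running):

    import spark.implicits._   // spark is the active SparkSession, e.g. in spark-shell

    // The first write defines the table schema: (id: Long, action: String)
    Seq((1L, "click")).toDF("id", "action")
      .write.format("delta").save("/tmp/delta/clicks")

    // Appending a DataFrame with an extra column fails schema validation
    // with an AnalysisException instead of silently corrupting the table.
    val extra = Seq((2L, "view", "mobile")).toDF("id", "action", "device")
    // extra.write.format("delta").mode("append").save("/tmp/delta/clicks")   // throws

    // Schema evolution has to be requested explicitly:
    extra.write.format("delta").mode("append")
      .option("mergeSchema", "true")
      .save("/tmp/delta/clicks")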

Scalable metadata processing

Delta Lake stores the metadata of a table or directory in the transaction log rather than in a metastore. This lets Delta Lake list files in large directories in constant time and read data very efficiently.

Data versioning

Delta Lake allows users to read earlier snapshots of a table or directory. When a file is modified, Delta Lake creates a new version of the file and preserves the old one. When users want to read an older version of a table or directory, they can pass a timestamp or a version number to the Apache Spark read APIs, and Delta Lake reconstructs the full snapshot for that timestamp or version from the information in the transaction log. This lets users reproduce earlier results and, when needed, roll a table back to an older version of its data.
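A short sketch of the read-side API, using the versionAsOf and timestampAsOf options (the path and timestamp are illustrative):

    // Read the table as of a specific commit version...
    val v0 = spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta/events")

    // ...or as of a point in time.
    val yesterday = spark.read.format("delta")
      .option("timestampAsOf", "2019-04-23 00:00:00")
      .load("/tmp/delta/events")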

Unified Streaming and Batch Processing Sink

In addition to batch writes, Delta Lake can be used as an efficient streaming sink for Apache Spark Structured Streaming. Combined with ACID transactions and scalable metadata handling, the efficient streaming sink enables many near-real-time analytics use cases without having to maintain separate, complex streaming and batch pipelines.
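For example, a Structured Streaming query can write directly into a Delta table. The sketch below uses Spark's built-in rate source and made-up paths:

    // Any streaming source works; "rate" just generates (timestamp, value) rows.
    val stream = spark.readStream.format("rate").load()

    val query = stream.writeStream
      .format("delta")
      .option("checkpointLocation", "/tmp/delta/events_stream/_checkpoints")
      .start("/tmp/delta/events_stream")

    // The same table can also be read in batch, or as a streaming source:
    // spark.readStream.format("delta").load("/tmp/delta/events_stream")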

Open data storage format

All data in Delta Lake is stored in the Apache Parquet format, which lets Delta Lake take advantage of Parquet's efficient native compression and encoding schemes.

Record update and deletion

This feature will be available soon. Delta Lake will support DML commands such as merge, update and delete, letting data engineers easily insert, update and delete records in the data lake. Because Delta Lake tracks and modifies data at file-level granularity, this is far more efficient than reading and overwriting entire partitions or tables.
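The DML API that later landed in the open-source project (io.delta.tables.DeltaTable, available from Delta Lake 0.3.0 onward) looks roughly like this; the table path, column names and the updates DataFrame are made up for illustration:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val table = DeltaTable.forPath(spark, "/tmp/delta/events")

    // Delete rows matching a predicate
    table.delete(col("action") === "bot")

    // Upsert: update rows that match on id, insert the rest
    val updates = Seq((1L, "purchase")).toDF("id", "action")
    table.as("t")
      .merge(updates.as("u"), "t.id = u.id")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()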

Data exception handling

Delta Lake will also support a new API for setting data expectations on tables or directories. Engineers will be able to specify a boolean condition and tune the severity with which anomalies are handled. When an Apache Spark job writes to the table or directory, Delta Lake will automatically validate the records and, when an anomaly is found, handle the record according to the configured setting.

100% compatible with Apache Spark API

This point is very important: developers can use Delta Lake in their existing data pipelines with only minimal changes. For example, if we previously saved processing results as Parquet files, switching to Delta Lake requires only a change along the following lines:
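A minimal sketch of that change, assuming a DataFrame named dataframe and an illustrative path; the migration is essentially swapping the "parquet" format string for "delta":

    // Before: results were written and read as plain Parquet
    dataframe.write.format("parquet").save("/data/events")
    spark.read.format("parquet").load("/data/events")

    // After: the same pipeline, but as a Delta table
    dataframe.write.format("delta").save("/data/events")
    spark.read.format("delta").load("/data/events")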
