Understanding Lakehouse Concurrency Control: Are We Too Optimistic?


1. Overview

Transactions on data lakes are now considered a key feature of the Lakehouse. But what has actually been accomplished so far? What approaches exist today? How do they perform in the real world? These questions are the focus of this blog.

Having had the pleasure of working on a variety of database projects - RDBMS (Oracle), NoSQL key-value stores (Voldemort), streaming databases (ksqlDB), closed-source real-time data stores, and of course Apache Hudi - I can say with confidence that differences in workload profoundly affect the concurrency control mechanisms adopted in different databases. This blog also describes how we rethought the concurrency control mechanism for the Apache Hudi data lake.

First, let's get straight to the point: RDBMSs offer the richest set of transactional features and the widest range of concurrency control mechanisms (different isolation levels, fine-grained locking, deadlock detection/avoidance, and more), because they must support row-level mutations and reads across multiple tables while enforcing key constraints and maintaining indexes. NoSQL stores, by contrast, offer much weaker guarantees, such as eventual consistency and simple row-level atomicity, in exchange for better scalability on simpler workloads. Traditional data warehouses provide more or less the full set of features you would find in an RDBMS on top of columnar storage, enforcing locking and key constraints, while cloud data warehouses seem to focus more on the storage-compute separation architecture while providing fewer isolation levels. As a surprising example, key constraints are not enforced.

2. Pitfalls in Data Lake Concurrency Control

Historically, data lakes have been viewed as batch jobs reading and writing files on cloud storage, so it is interesting to see how most new work extends this view and uses some form of "optimistic concurrency control" (OCC) to implement file version control. OCC jobs take a table-level lock to check whether they touch overlapping files and abort the operation if there is a conflict; sometimes the lock is even just a JVM-level lock held on a single Apache Spark driver node. This may be fine as lightweight coordination for old-style batch jobs that mostly append files to tables, but it does not apply broadly to modern data lake workloads. Such approaches are built with immutable/append-only data models in mind and are not suited to incremental data processing or keyed updates/deletes. OCC is very optimistic that a real conflict will never happen. The developer evangelism comparing OCC to the full-fledged transactional capabilities of an RDBMS or a traditional data warehouse is simply wrong; quoting Wikipedia directly: "If there is frequent contention for data resources, the cost of repeatedly restarting transactions can significantly hurt performance, in which case other methods of concurrency control might be more appropriate." When conflicts do occur, they waste enormous amounts of resources, because you have batch jobs that fail after running for hours on every attempt!
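To make the mechanics concrete, here is a minimal, hypothetical sketch of file-level OCC (the names occ_write, committed_versions, etc. are illustrative, not any engine's actual API): all the expensive work happens before the lock, and the lock is held only to validate and commit, which is exactly why an aborted long-running job wastes hours of compute.

    # Hypothetical sketch of file-level optimistic concurrency control.
    from threading import Lock

    table_lock = Lock()        # table-level lock, held only while validating/committing
    committed_versions = []    # sets of files touched by each committed write, in order

    def occ_write(read_version, files_written):
        """Try to commit a write that began reading at table version `read_version`."""
        with table_lock:
            # Validate against every commit that landed after we started reading.
            for later in committed_versions[read_version:]:
                if later & files_written:      # overlapping files => conflict
                    raise RuntimeError("conflict detected, aborting after the work is done")
            committed_versions.append(set(files_written))
            return len(committed_versions)     # the new table version

    v = len(committed_versions)                # snapshot the version before doing work
    # ... hours of batch work producing files ...
    occ_write(v, {"part-0001.parquet"})        # cheap validation, expensive retry on failure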

Imagine a real-world scenario with two writers: an ingest job that writes new data every 30 minutes and a GDPR-driven delete job that takes 2 hours to complete its deletes. The deletes are likely to touch files at random across the table, so the delete job is almost guaranteed to starve and fail to commit every time. In database terms, mixing long-running transactions with optimism leads to disappointment, because the longer a transaction runs, the more likely it is to overlap with another one.

So what are the alternatives? Locks? Wikipedia also says: "However, lock-based ('pessimistic') approaches may also provide poor performance, since locks greatly limit effective concurrency even if deadlocks are avoided." This is where Hudi takes a different approach, one we believe is better suited to modern data lake transactions, which are often long-running or even continuous. Data lake workloads share more characteristics with high-throughput stream processing jobs than with standard database reads/writes, and that is where we draw inspiration from. In stream processing, events are serialized into a single ordered log, avoiding any locking/concurrency bottlenecks, and users can continuously process millions of events per second. Hudi implements a file-level, log-based concurrency control protocol on the Hudi timeline, which in turn relies on minimal atomic writes to cloud storage. By making the event log a core piece of inter-process coordination, Hudi is able to offer several flexible deployment models that provide higher concurrency than pure OCC approaches that only track table snapshots.
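As a toy illustration of the log-based idea (a sketch only, not the actual Hudi timeline format or API), coordinating processes through a single ordered, append-only event log looks roughly like this; the action names mirror Hudi's commit/deltacommit/compaction actions, and the append stands in for an atomic write to cloud storage.

    # Toy sketch of log-based coordination: every action is appended as an ordered,
    # atomic event, and any process derives table state by replaying the log,
    # with no coarse table lock in the way.
    import itertools

    _seq = itertools.count()

    def append_action(timeline, action_type, files):
        """Append one atomic, ordered event describing what a process did."""
        entry = {"seq": next(_seq), "type": action_type, "files": set(files)}
        timeline.append(entry)   # stands in for an atomic object write to cloud storage
        return entry

    timeline = []
    append_action(timeline, "commit",      {"f1.parquet"})            # ingest writer
    append_action(timeline, "deltacommit", {"f2.log"})                # streaming updates
    append_action(timeline, "compaction",  {"f2.log", "f2.parquet"})  # async table service

    for e in timeline:           # replay the log in order to reconstruct state
        print(e["seq"], e["type"], sorted(e["files"]))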

3. Model 1: Single writer, inline table services

The simplest form of concurrency control is no concurrency at all. Data lake tables typically have common services run on them to keep them efficient: reclaiming storage space from old versions and logs, coalescing files (clustering in Hudi), merging deltas (compaction in Hudi), and more. Hudi can simply eliminate the need for concurrency control and maximize throughput by supporting these table services out of the box and running them inline after every write to the table.

Execution plans are idempotent, persisted to the timeline, and automatically recovered on failure. For most simple use cases, this means that just writing is enough to get a well-managed table that needs no concurrency control.
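For reference, a single-writer write with inline table services might look like the following PySpark sketch. The configuration keys follow the Hudi documentation for recent 0.x releases, so exact names and defaults may vary by version, and the DataFrame (df), table path, and field names are hypothetical placeholders.

    # Single writer with inline table services: cleaning/compaction run as part of
    # each write, so no concurrency control is needed at all.
    hudi_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Inline table services (illustrative settings):
        "hoodie.compact.inline": "true",        # compact deltas right after each write
        "hoodie.clustering.inline": "true",     # optionally coalesce small files
        "hoodie.clean.automatic": "true",       # reclaim old file versions
    }

    (df.write.format("hudi")          # df: an existing Spark DataFrame with these columns
       .options(**hudi_options)
       .mode("append")
       .save("s3://bucket/warehouse/trips"))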

https://hudi.apache.org/assets/images/SingleWriterInline-d18346421aa3f1d11a3247164389e1ce.gif

4. Model 2: Single writer, asynchronous table services

Our delete/ingest example above is not that simple. While ingest/writes might only update the last N partitions of the table, deletes may span the entire table, and mixing them into the same workload can significantly hurt ingest latency. Hudi therefore offers the option of running table services asynchronously, where most of the heavy lifting (such as actually rewriting columnar data via the compaction service) happens asynchronously, eliminating any repeated wasteful retries while also optimizing the table with clustering. A single writer can thus consume both regular updates and GDPR deletes and serialize them into the log. Given that Hudi has a record-level index and avro log writes are much cheaper (writing parquet, by comparison, can be 10x or more expensive), ingestion latency can be sustained while still enjoying excellent traceability. In fact, we were able to scale this model to 100 PB of data at Uber by ordering all deletes and updates into the same source Apache Kafka topic. Concurrency control is about more than just locks: Hudi accomplishes all of this without any external locks.
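A sketch of what this mode might look like with the Spark Structured Streaming sink follows. Again, the config keys are taken from the Hudi docs for recent 0.x releases and may differ in your version; the DataFrame, bucket paths, and field names are placeholders.

    # Single writer with asynchronous table services: the writer only appends cheap
    # avro log records; compaction/clustering rewrite parquet in the background,
    # off the ingestion hot path.
    async_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.compact.inline": "false",                     # keep compaction off the write path
        "hoodie.datasource.compaction.async.enable": "true",  # async compaction (streaming sink)
        "hoodie.clustering.async.enabled": "true",            # async clustering
    }

    (df.writeStream.format("hudi")    # df: a streaming DataFrame, e.g. read from Kafka
       .options(**async_options)
       .outputMode("append")
       .option("checkpointLocation", "s3://bucket/checkpoints/trips")
       .start("s3://bucket/warehouse/trips"))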

https://hudi.apache.org/assets/images/SingleWriterAsync-3d7ddf7312381eab7fdb91a7f2746376.gif

5. Model 3: Multiple writers

But it is not always possible to serialize deletes into the same write stream, and SQL-based deletes may be required. With multiple distributed processes, some form of locking becomes unavoidable, but just like a real database, Hudi's concurrency model is smart enough to separate what is actually written to the table from the table services that manage or optimize it. Hudi provides similar optimistic concurrency control across multiple writers, while table services can still run completely lock-free and asynchronously. This means delete jobs only need to encode deletes, ingest jobs only log updates, and the compaction service then applies the updates/deletes to the base files. Although delete jobs and ingest jobs can still compete with and starve each other as described above, their run times and wasted work are far lower, because compaction does the heavy lifting of the parquet/columnar data writes.
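Enabling this multi-writer mode typically means turning on optimistic concurrency, using lazy cleaning for failed writes, and configuring an external lock provider. The sketch below shows a ZooKeeper-based provider; the keys follow the Hudi concurrency-control docs (DynamoDB and Hive-metastore providers also exist), and the hostnames and lock paths are placeholders.

    # Multi-writer mode: optimistic concurrency control between writers,
    # while table services remain asynchronous and lock-free.
    multi_writer_options = {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": "zk1.example.com",
        "hoodie.write.lock.zookeeper.port": "2181",
        "hoodie.write.lock.zookeeper.lock_key": "trips",
        "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
    }

    # Each concurrent writer (ingest job, delete job, ...) layers these options on
    # top of its normal Hudi write options before calling df.write / df.writeStream.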

https://hudi.apache.org/assets/images/MultiWriter-6068037346e21d41e0e620fb514e2342.gif

That said, building on this foundation, there are still many ways we can improve:

  • First, Hudi has implemented a marker mechanism that tracks all files that are part of an active write transaction, and a heartbeat mechanism that tracks active writers to a table. These can be used directly by other active transactions/writers to detect what other writers are doing and, if a conflict is detected, abort early, returning cluster resources to other jobs sooner (see the configuration sketch after this list).

  • While optimistic concurrency control is attractive when serializable snapshot isolation is required, it is neither the best nor the only way to handle concurrency among writers. We plan to use CRDTs and widely adopted stream processing concepts to achieve fully lock-free concurrency control through our log merge API, which has already been proven to sustain enormous continuous write volumes for the data lake.

  • When it comes to key constraints, Hudi is the only lake transaction layer today that enforces unique key constraints, although only on the table's record key. We will look to extend this capability to non-record-key fields in a more general form, using the newer concurrency models described above.
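The marker and heartbeat mechanisms from the first item above are surfaced through write configs. The keys below follow recent Hudi 0.x documentation; the early-conflict-detection flag in particular is a newer, version-dependent option, so treat the whole block as an assumption-laden sketch rather than a definitive setting.

    # Marker / heartbeat related knobs (names per recent Hudi 0.x docs; may vary):
    marker_heartbeat_options = {
        "hoodie.write.markers.type": "TIMELINE_SERVER_BASED",   # track files of in-flight writes
        "hoodie.client.heartbeat.interval_in_ms": "60000",      # writer liveness heartbeats
        "hoodie.client.heartbeat.tolerable.misses": "2",
        # Newer releases can use markers/heartbeats to fail conflicting writers early,
        # before hours of compute are wasted (flag name is version-dependent):
        "hoodie.write.concurrency.early.conflict.detection.enable": "true",
    }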

Finally, for a data lake to successfully transform into a Lakehouse, we must learn from the failure of the "Hadoop warehouse" vision, which shared similar goals with the new "Lakehouse" vision. Designers did not pay close enough attention to the technology gaps relative to data warehouses, which created unrealistic expectations for the actual software. As transactions and database capabilities finally become mainstream in data lakes, we must apply these lessons and be upfront about current shortcomings. If you are building a Lakehouse, I hope this article encourages you to think carefully about the various operational and efficiency aspects of concurrency control. Join our rapidly growing community by trying out Apache Hudi, or join us on Slack for further discussion.
