Databend + lakeFS: Embed data versioning into your analytics workflow

Author: Shang Zhuoran (PsiACE)

Master's student at Macau University of Science and Technology, Databend R&D engineer intern

Apache OpenDAL (Incubating) Committer

https://github.com/PsiACE

picture

Cloud computing provides cheap, elastic, shared storage for data-centric applications, which brings obvious benefits to modern data processing workflows: massive data volumes, highly concurrent access, and large throughput. As a result, more and more teams are migrating their old technology stacks to a data lake architecture.

When we put the data lake in the cloud, new problems arise:

  • Legacy data warehouse and big data analytics technologies were not designed for cloud object storage, so their performance and compatibility are often far from ideal, and a lot of resources have to be invested in maintenance. How do we provide truly modern, low-cost, high-performance, high-quality analytics services on a data lake?

  • The demand for data management has only grown, raising the bar for the reproducibility of analysis results and the shareability of data sources. How do we make data elastic and manageable, so that data scientists, data analysts, and data engineers can collaborate closely on a logically consistent view?

If you have questions, there will be answers!

Databend builds a truly cross-cloud, cloud-native data warehouse on top of cloud object storage. Designed around the serverless concept, it provides a distributed, elastically scalable, and easy-to-operate high-performance query engine. It supports common structured and semi-structured data and integrates tightly with the modern data stack.

lakeFS is dedicated to data sharing and collaboration. It brings Git-like semantics to object storage, uses versioning to provide a logically consistent view of data, embeds meaningful branch names and commit messages into modern data workflows, and offers solutions for keeping data and documentation in step.

In this article, we will combine the two and provide a simple and clear workshop to help you quickly build a modern data workflow.

Why you need Databend

As the amount of data increases, traditional data warehouses face huge challenges. They cannot effectively store and process massive amounts of data, and it is difficult to flexibly adjust computing and storage resources according to workload, resulting in high usage costs. In addition, data processing is complex, requiring a lot of resources to be invested in ETL, and data rollback and version control are also very difficult.

Databend aims to solve these pain points. It is an open source, elastic, load-aware cloud data warehouse developed using Rust, which can provide cost-effective and complex analysis capabilities for very large-scale data sets.

  • Cloud-friendly: Seamlessly integrates with various cloud storage services, such as AWS S3, Azure Blob, and Cloudflare R2.

  • High performance: Developed in Rust, utilizing SIMD and vectorization processing to achieve extremely fast analysis.

  • Economic flexibility: Innovative design, independent scaling of storage and computing, optimizing cost and performance.

  • Simple data management: Built-in data preprocessing capabilities, no external ETL tools required.

  • Data version control: Provides multi-version storage similar to Git, supporting data query, cloning and rollback at any point in time.

  • Rich data support: Supports multiple data formats and types such as JSON, CSV, Parquet, etc.

  • AI-enhanced analysis: Integrates AI functions to provide analysis capabilities driven by large language models.

  • Community-driven: It has a friendly and continuously growing community and provides an easy-to-use cloud analysis platform.

picture

The picture above is the Databend architecture diagram, taken from datafuselabs/databend.

Why you need lakeFS

Object storage often lacks atomicity, rollback, and similar capabilities, so data safety cannot be well guaranteed, and quality and recoverability suffer. To protect data in production, teams often have to rehearse and test against isolated copies, which not only consumes resources but also makes genuine collaboration difficult.

When it comes to collaboration, you may think of Git, but Git was not designed for data management. Besides the inconvenience of managing binary data, Git LFS's limit on single-file size also restricts the scenarios where it can be used.

This is where lakeFS comes in: it provides open source data version control for data lakes, with branch, commit, merge, and revert, just like managing code with Git. With support for zero-copy dev/test isolation environments, continuous quality verification, atomic rollback of bad data, reproducibility, and other advanced features, you can even verify ETL workflows against production data without worrying about damaging your business.

picture

The picture above shows the data workflow recommended by lakeFS, taken from https://lakefs.io/.

Workshop: Use lakeFS to support your analysis business

In this workshop, we will use lakeFS to create a branch of the repository and use Databend to analyze and transform the preloaded data.

Since the experimental environment pulls in quite a few dependencies, the first startup may take a while. We also recommend the combination of Databend Cloud + lakeFS Cloud, which lets you skip the time-consuming environment setup and start experiencing data analysis and transformation right away.

Environment settings

In addition to lakeFS, this environment also includes MinIO as the underlying object storage service, as well as commonly used data science tools such as Jupyter and Spark. You can check out the treeverse/lakeFS-samples repository for more information.

picture

The picture above is a schematic diagram of this experimental environment, taken from treeverse/lakeFS-samples.

Clone repository

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples

Start the full stack experimental environment

docker compose --profile local-lakefs up

Once the experimental environment is started, you can log in to lakeFS and MinIO using the default configuration to observe data changes in subsequent steps.

Data observation

During the environment setup process, a repository named quickstart will be prepared in lakeFS in advance. In this step, we will make some simple observations about it.

If you deploy your own lakeFS + MinIO environment:

  • You may need to manually create the corresponding bucket in MinIO first.

picture

  • Then create the corresponding repository in lakeFS and check the option to populate it with sample data.

picture

lakeFS

Open lakeFS (http://127.0.0.1:8000) in the browser, then enter the Access Key ID and Secret Access Key to log in.

Then open the quickstart repository. You can see that some sample data already exists, along with a built-in tutorial.

picture

lakeFS's repository model corresponds almost one-to-one with code hosting platforms such as GitHub, so there is virtually no learning curve. Here, lakes.parquet is the pre-prepared data, and lakes.source.md in the data folder describes where the data comes from; the scripts folder contains the scripts, and the complete workflow that drives them can be found in the _lakefs_actions directory, written in a format similar to GitHub Actions; README.md is the Markdown source of the tutorial shown below, and images contains all the images it uses.

MinIO

Since we use MinIO as the underlying storage in this experimental environment, a bucket named quickstart can also be found in MinIO. This is determined by the StorageNamespace that lakeFS uses when creating the repository.

picture

The dummy file is created when a new lakeFS repository is created, to ensure that there are sufficient permissions to write to the bucket.

The _lakefs directory contains only two files, created when importing data from sources such as S3; they record references to the original locations of the imported files.

New objects written through lakeFS will be located in the data directory.

Data correspondence

Opening the data directory, we can find some files, but it is hard to tell how they correspond to the data in lakeFS.

picture

Let's go back to lakeFS, click the gear icon on the right side of a file, and select Object Info to easily find the corresponding object.

picture

Data analysis and transformation

In this step, we will deploy the Databend service, mount the data in lakeFS through a Stage for analysis, and replace the lakes.parquet data file in the denmark-lakes branch with the transformed result.

Deploy Databend

Databend's storage engine also supports advanced features such as Time Travel and atomic rollback, so there is no need to worry about operational errors.
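To make this concrete, here is a minimal sketch of Time Travel, assuming your Databend version supports the AT clause; the table name refers to the lakes table we will create later in this workshop, and the timestamp is a placeholder you would replace with a real point in time.

-- Hedged example: query the table as it looked at an earlier point in time.
-- Replace the placeholder timestamp with a real one from your own session.
SELECT * FROM lakes AT (TIMESTAMP => '2023-11-10 12:00:00'::TIMESTAMP) LIMIT 5;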

Here we use a single-node Databend service with MinIO as the storage backend. For the overall deployment process, refer to the official Databend documentation. Some details worth noting are as follows:

  • Since we have already deployed the MinIO service in the steps above, we only need to open 127.0.0.1:9000 and create a bucket named databend.

  • Next, you need to prepare relevant directories for logs and Meta data.

sudo mkdir /var/log/databend
sudo mkdir /var/lib/databend
sudo chown -R $USER /var/log/databend
sudo chown -R $USER /var/lib/databend

  • Secondly, because the default http_handler_port is already occupied by the services started earlier, you need to edit databend-query.toml and make some changes to avoid conflicts:
http_handler_port = 8088

  • In addition, we also need to configure the administrator user according to Docs | Configuring Admin Users. Since this is just a workshop, we choose the simplest approach here and just uncomment the [[query.users]] field and the root user:
[[query.users]]
name = "root"
auth_type = "no_password"

  • Since we are using MinIO as the storage backend, we need to configure the [storage] section accordingly.
[storage]
# fs | s3 | azblob | obs | oss
type = "s3"

# To use S3-compatible object storage, uncomment this block and set your values.
[storage.s3]
bucket = "databend"
endpoint_url = "http://127.0.0.1:9000"
access_key_id = "minioadmin"
secret_access_key = "minioadmin"
enable_virtual_host_style = false

Next, you can start Databend normally:

./scripts/start.sh

We strongly recommend using BendSQL as the client. Because http_handler_port has been changed, you need to connect to the Databend service with bendsql -P 8088. Of course, Databend also supports other access methods such as the MySQL client and the HTTP API.
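Once connected, a trivial query serves as a sanity check that the service is reachable; this is just a minimal sketch, and any simple statement works.

-- Confirm the connection by printing the Databend version.
SELECT VERSION();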

Create a branch

Using lakeFS is similar to using GitHub. Open the Branches page in the Web UI, click the Create Branch button, and create a branch named denmark-lakes.

picture

Create a stage

Databend can mount a data directory located in a remote storage service through a Stage. Since lakeFS provides an S3 Gateway API, we can configure the connection as an S3-compatible service. Note that the URL here must be constructed as s3://<repo>/<branch>, and that ENDPOINT_URL points at lakeFS, whose port is 8000.

CREATE STAGE lakefs_stage
URL='s3://quickstart/denmark-lakes/'
CONNECTION = (
  REGION = 'auto'
  ENDPOINT_URL = 'http://127.0.0.1:8000'
  ACCESS_KEY_ID = 'AKIAIOSFOLKFSSAMPLES'
  SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY');

By executing the following SQL statement, we can list only the Parquet data files in the directory.

LIST @lakefs_stage PATTERN = '.*[.]parquet';

picture

Since Databend already supports SELECT from Stage, basic queries can be performed without importing the data.

SELECT * FROM @lakefs_stage/lakes.parquet LIMIT 5;

picture

Create a table and run some simple queries

Before cleaning the data, let's import the data into Databend and perform some simple queries.

Thanks to Databend's built-in Infer Schema capability, tables can easily be created directly from files.

CREATE TABLE lakes AS SELECT * FROM @lakefs_stage/lakes.parquet;

picture
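If you would like to inspect the inferred column types before creating a table, one hedged option is the INFER_SCHEMA table function, assuming it is available in your Databend version.

-- Assumed available: show the schema Databend infers from the Parquet file in the stage.
SELECT * FROM INFER_SCHEMA(location => '@lakefs_stage/lakes.parquet');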

Next, let's list the 5 countries with the most lakes.

SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT    5;

picture

Data cleaning

The goal of this data cleaning step is to build a small lake dataset that retains only the Danish lakes. This is easy to do with the DELETE FROM statement.

DELETE FROM lakes WHERE Country != 'Denmark';

picture

Next let's query the lake data again and check if only Danish lakes remain.

SELECT   country, COUNT(*)
FROM     lakes
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT    5;

picture

Use PRESIGN to write results back to lakeFS

In this step, we need to replace the lakes.parquet file in the denmark-lakes branch with the cleaned result.

First, we can use the COPY INTO <location> syntax to export the data to the built-in anonymous Stage.

COPY INTO @~ FROM lakes FILE_FORMAT = (TYPE = PARQUET);

Next, let's list the result files under the @~ Stage.

LIST @~ PATTERN = '.*[.]parquet';

picture

By executing the PRESIGN DOWNLOAD statement, we can get a URL for downloading the resulting data file:

PRESIGN DOWNLOAD @~/<your-result-data-file>;

picture

Open a new terminal and use the curl command to download the data file.

curl -O '<your-presign-download-url>'

Next, using the PRESIGN UPLOAD statement, we can obtain a pre-signed URL for uploading the data file. We target @lakefs_stage/lakes.parquet here because we want to replace lakes.parquet with our cleaned Danish lake data.

PRESIGN UPLOAD @lakefs_stage/lakes.parquet;

picture

Back in the terminal, use the curl command to complete the upload.

curl -X PUT -T <your-result-data-file> '<your-presign-upload-url>'

At this point, the file has been replaced with the cleaned data. List the Parquet files in the Stage again. You can see that the file size and last modification time have changed.

LIST @lakefs_stage PATTERN = '.*[.]parquet';

Query the data file again for verification to confirm that it is clean data.

SELECT   country, COUNT(*)
FROM     @lakefs_stage/lakes.parquet
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT    5;

picture

Commit changes

In this step, we will commit the changes to lakeFS so they are saved.

In the lakeFS Web UI, open the Uncommitted Changes page and make sure the denmark-lakes branch is selected.

Click the Commit Changes button in the upper right corner, write a commit message, and confirm the commit.

picture

Check the original data in the main branch

The original data in denmark-lakes has been replaced with the smaller, cleaned dataset. Let's switch back to the main branch and check whether the original data has been affected.

Similarly, mount the data files by creating a Stage.

CREATE STAGE lakefs_stage_check
URL='s3://quickstart/main/'
CONNECTION = (
  REGION = 'auto'
  ENDPOINT_URL = 'http://127.0.0.1:8000'
  ACCESS_KEY_ID = 'AKIAIOSFOLKFSSAMPLES'
  SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY');

Then query the lake data and list the five countries with the largest number of lakes.

SELECT   country, COUNT(*)
FROM     @lakefs_stage_check/lakes.parquet
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT    5;

picture

Everything in the main branch remains as it was, and we get a cleaned Danish lakes dataset without disturbing the original data.

Extra challenge

In this workshop, we learned how to create isolated branches for data, and performed some simple queries and cleaning work in Databend.

If you want more of a challenge, you can follow the official lakeFS tutorial to try branch merging and data rollback; you can also follow the official Databend tutorials to experience data transformation during import and Time Travel.

We also welcome you to bring Databend and lakeFS into production and validate them under real workloads.

About Databend

Databend is an open source, elastic, low-cost data warehouse built on object storage that can also perform real-time analysis. We look forward to your attention and to exploring cloud-native data warehouse solutions together, creating a new generation of open source Data Cloud.

Databend Cloud: https://databend.cn

Databend documentation: https://databend.rs/


GitHub: https://github.com/datafuselabs/databend
