Author: Shang Zhuoran (PsiACE)
Master's student at Macau University of Science and Technology; Databend R&D engineer intern
Apache OpenDAL(Incubating) Committer
Cloud computing provides cheap, elastic, shared storage services for data-centric applications, which brings obvious benefits to modern data processing workflows: massive data volumes, highly concurrent access, and large throughput. As a result, more and more teams are migrating their old technology stacks to a data lake architecture.
When we put the data lake in the cloud, new problems arise:
- Older data warehouse and big data analytics stacks may not have been designed for the cloud and object storage; their performance and compatibility may be poor, and maintaining them requires heavy investment. How can we provide truly modern, low-cost, high-performance, high-quality analysis services?
- Demand for data management has only grown stronger, raising the bar for the reproducibility of analysis results and the shareability of data sources. How can we make data elastic and manageable, so that data scientists, data analysts, and data engineers can collaborate closely on a logically consistent view?
If you have questions, there will be answers!
Databend builds a truly cross-cloud, native data warehouse on top of cloud object storage. Designed around the serverless concept, it provides a distributed, elastically scalable, easy-to-operate high-performance query engine. It supports common structured and semi-structured data and integrates tightly with the modern data stack.
lakeFS is dedicated to providing solutions for sharing and collaborating on data. It empowers object storage with Git-like semantics, uses versioning to provide a logically consistent view of data, embeds meaningful branch names and commit messages into modern data workflows, and provides solutions for integrating data and documentation.
In this article, we will combine the two and provide a simple and clear workshop to help you quickly build a modern data workflow.
Why you need Databend
As the amount of data increases, traditional data warehouses face huge challenges. They cannot effectively store and process massive amounts of data, and it is difficult to flexibly adjust computing and storage resources according to workload, resulting in high usage costs. In addition, data processing is complex, requiring a lot of resources to be invested in ETL, and data rollback and version control are also very difficult.
Databend aims to solve these pain points. It is an open source, elastic, load-aware cloud data warehouse developed using Rust, which can provide cost-effective and complex analysis capabilities for very large-scale data sets.
- Cloud-friendly: seamlessly integrates with various cloud storage services, such as AWS S3, Azure Blob, Cloudflare R2, etc.
- High performance: developed in Rust, using SIMD and vectorized processing for extremely fast analysis.
- Economic elasticity: innovative design that scales storage and compute independently, optimizing both cost and performance.
- Simple data management: built-in data preprocessing capabilities, with no external ETL tools required.
- Data version control: Git-like multi-version storage, supporting query, cloning, and rollback of data at any point in time.
- Rich data support: supports multiple data formats and types such as JSON, CSV, and Parquet.
- AI-enhanced analysis: integrated AI functions providing analysis capabilities driven by large models.
- Community-driven: a friendly, steadily growing community and an easy-to-use cloud analysis platform.
The picture above is the Databend architecture diagram, taken from datafuselabs/databend .
Why you need lakeFS
Because object storage often lacks atomicity, rollback, and similar capabilities, data safety cannot be well guaranteed, and quality and recoverability suffer. To protect data in the production environment, isolated copies are often needed for rehearsal and testing, which not only consumes resources but also makes real collaboration difficult.
When it comes to collaboration, you may think of Git, but Git was not designed for data management. Besides the inconvenience of managing binary data, Git LFS's limit on the size of a single file also restricts its applicable scenarios.
lakeFS was born for this: it provides open source data version control for data lakes, with branch, commit, merge, and revert, just like using Git to manage code. With support for zero-copy dev/test isolation environments, continuous quality verification, atomic rollback of erroneous data, reproducibility, and other advanced features, you can even safely verify ETL workflows against production data without worrying about harming your business.
The picture above shows the data workflow recommended by lakeFS, taken from https://lakefs.io/ .
Workshop: Use lakeFS to support your analysis business
In this workshop, we will use lakeFS to create branches for the repository, and use Databend to analyze and transform the preset data.
Since the experimental environment contains a number of dependencies, the first startup may take a while. We also recommend the combination of Databend Cloud + lakeFS Cloud, which lets you skip the time-consuming environment setup and jump straight into data analysis and transformation.
Environment settings
In addition to lakeFS, the environment used this time will also include MinIO as the underlying object storage service, as well as commonly used data science tools such as Jupyter and Spark. You can check out this git repository for more information.
The picture above is a schematic diagram of this experimental environment, taken from treeverse/lakeFS-samples .
Clone repository
git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples
Start the full stack experimental environment
docker compose --profile local-lakefs up
Once the experimental environment is started, you can log in to lakeFS and MinIO using the default configuration to observe data changes in subsequent steps.
Data observation
During the environment setup process, a repository named quickstart will be prepared in lakeFS in advance. In this step, we will make some simple observations about it.
If you use your own deployed lakeFS + MinIO environment:
- You may need to manually create the corresponding bucket in MinIO first.
- Then create the corresponding repository in lakeFS and check the option to populate the sample data.
lakeFS
Open lakeFS (http://127.0.0.1:8000) in the browser and enter the Access Key ID and Secret Access Key to log in.
Then open the quickstart repository. You can see that some default data already exists, and it also contains a default tutorial.
The data repository model of lakeFS corresponds almost exactly to code repositories such as GitHub, so there is almost no learning cost: lakes.parquet is the pre-prepared data, and lakes.source.md in the data folder describes where the data comes from; the scripts folder contains the validation script, and the complete workflow, written in a format similar to GitHub Actions, can be found in the _lakefs_actions directory; README.md is the Markdown source file of the tutorial shown below, and images contains all the images it uses.
MinIO
Since we use MinIO as the underlying storage in the experimental environment, a bucket named quickstart can also be found in MinIO. This is the StorageNamespace, determined by lakeFS when the repository was created.
Among them, the dummy file is created when a new lakeFS repository is created, to ensure that we have sufficient permissions to write to the bucket.
The _lakefs directory contains only two files, created when importing data from data sources such as S3, which record references to the original locations of the imported files.
New objects written through lakeFS will be located in the data directory.
Data correspondence
Opening the data directory, we can find some files, but it is difficult to match them to the data in lakeFS.
Let's go back to lakeFS, click the gear icon on the right side of a file, and select Object Info to easily find the correspondence.
Data analysis and transformation
In this step, we will deploy the Databend service, mount the data in lakeFS through a Stage and analyze it, and replace lakes.parquet in the denmark-lakes branch with the transformed results.
Deploy Databend
Databend's storage engine also supports advanced features such as Time Travel and atomic rollback, so there is no need to worry about operational errors.
Here we use a single-node Databend service with MinIO as the storage backend. For the overall deployment process, you can refer to the Databend official documentation. Some details to note are as follows:
- Since we have already deployed the MinIO service in the steps above, we only need to open 127.0.0.1:9000 and create a bucket named databend.
- Next, prepare the directories for logs and meta data:
sudo mkdir /var/log/databend
sudo mkdir /var/lib/databend
sudo chown -R $USER /var/log/databend
sudo chown -R $USER /var/lib/databend
- Secondly, because the default http_handler_port has already been taken by an earlier service, you need to edit databend-query.toml and make some changes to avoid conflicts:
http_handler_port = 8088
- In addition, we also need to configure the administrator user according to Docs | Configuring Admin Users. Since this is just a workshop, we take the simplest route: uncomment the [[query.users]] field and the root user entry:
[[query.users]]
name = "root"
auth_type = "no_password"
- Finally, since we are using MinIO as the storage backend, we need to configure the [storage] section accordingly:
[storage]
# fs | s3 | azblob | obs | oss
type = "s3"
# To use S3-compatible object storage, uncomment this block and set your values.
[storage.s3]
bucket = "databend"
endpoint_url = "http://127.0.0.1:9000"
access_key_id = "minioadmin"
secret_access_key = "minioadmin"
enable_virtual_host_style = false
Next, you can start Databend normally:
./scripts/start.sh
We strongly recommend using BendSQL as the client. Because http_handler_port has changed, you need to connect to the Databend service with bendsql -P 8088. Of course, Databend also supports multiple access methods, such as the MySQL client and the HTTP API.
Create a branch
The usage of lakeFS is similar to GitHub. Open the branches page of the Web UI, click the Create Branch button, and create a branch named denmark-lakes.
Create stage
Databend can mount a data directory located in a remote storage service through a Stage. Since lakeFS provides an S3 Gateway API, we can configure the connection as an S3-compatible service. Note that the URL here must be constructed as s3://<repo>/<branch>, and the ENDPOINT_URL uses lakeFS's port 8000.
CREATE STAGE lakefs_stage
URL='s3://quickstart/denmark-lakes/'
CONNECTION = (
REGION = 'auto'
ENDPOINT_URL = 'http://127.0.0.1:8000'
ACCESS_KEY_ID = 'AKIAIOSFOLKFSSAMPLES'
SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY');
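The stage URL above is just the lakeFS repository name and branch name joined under the s3 scheme. As a quick sketch (the helper function below is hypothetical, not part of Databend or lakeFS), the construction looks like this:

```python
def lakefs_stage_url(repo: str, branch: str, path: str = "") -> str:
    """Build an s3://<repo>/<branch>/<path> URL for a lakeFS S3 Gateway stage.

    lakeFS exposes each branch as a prefix under the repository, so the
    'bucket' part of the S3 URL is the repository name and the first path
    segment is the branch name.
    """
    return f"s3://{repo}/{branch}/{path}"

print(lakefs_stage_url("quickstart", "denmark-lakes"))
# s3://quickstart/denmark-lakes/
print(lakefs_stage_url("quickstart", "main", "lakes.parquet"))
# s3://quickstart/main/lakes.parquet
```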
By executing the following SQL statement, we can filter out the Parquet format data files in the directory.
LIST @lakefs_stage PATTERN = '.*[.]parquet';
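PATTERN takes a regular expression that is matched against the object names in the stage. A minimal Python sketch of the same match, over a hypothetical file list:

```python
import re

# The same regex as in: LIST @lakefs_stage PATTERN = '.*[.]parquet';
# '[.]' matches a literal dot, so only names ending in .parquet are kept.
pattern = re.compile(r".*[.]parquet")

files = ["lakes.parquet", "lakes.source.md", "README.md"]
matches = [name for name in files if pattern.fullmatch(name)]
print(matches)  # ['lakes.parquet']
```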
Since Databend already supports SELECT FROM Stage, basic queries can be performed without importing the data first.
SELECT * FROM @lakefs_stage/lakes.parquet LIMIT 5;
Create a table and run some simple queries
Before cleaning the data, let's import the data into Databend and perform some simple queries.
Thanks to Databend's built-in schema inference (Infer Schema) capability, tables can be created directly from files.
CREATE TABLE lakes AS SELECT * FROM @lakefs_stage/lakes.parquet;
Next, let's list the 5 countries with the most lakes.
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
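The query groups rows by country and returns the five countries with the most lakes. The same counting logic, sketched in Python over toy data (the country values here are made up for illustration):

```python
from collections import Counter

# Toy stand-in for the 'country' column of the lakes table.
countries = ["Denmark", "Sweden", "Denmark", "Norway", "Sweden", "Denmark"]

# Equivalent of:
#   SELECT country, COUNT(*) FROM lakes
#   GROUP BY country ORDER BY COUNT(*) DESC LIMIT 5;
top5 = Counter(countries).most_common(5)
print(top5)  # [('Denmark', 3), ('Sweden', 2), ('Norway', 1)]
```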
Data cleaning
The goal of this data cleaning step is to construct a small lake dataset that retains only the Danish lakes. This is easily done with a DELETE FROM statement.
DELETE FROM lakes WHERE Country != 'Denmark';
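The DELETE keeps only the rows whose Country equals 'Denmark'. The equivalent filter, sketched over a toy Python list (the column and lake names here are illustrative only):

```python
# Toy stand-in for the lakes table; the real columns come from lakes.parquet.
lakes = [
    {"country": "Denmark", "lake": "Arresø"},
    {"country": "Sweden", "lake": "Vänern"},
    {"country": "Denmark", "lake": "Esrum Sø"},
]

# Equivalent of: DELETE FROM lakes WHERE Country != 'Denmark';
lakes = [row for row in lakes if row["country"] == "Denmark"]
print(len(lakes))  # 2
```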
Next, let's query the lake data again and check that only Danish lakes remain.
SELECT country, COUNT(*)
FROM lakes
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
Use PRESIGN to write results back to lakeFS
In this step, we need to replace the Parquet file in the denmark-lakes branch with the cleaned results.
First, we can use the COPY INTO <location> syntax to export the data to the built-in anonymous Stage.
COPY INTO @~ FROM lakes FILE_FORMAT = (TYPE = PARQUET);
Next, let's list the result files under the anonymous Stage @~.
LIST @~ PATTERN = '.*[.]parquet';
By executing the PRESIGN DOWNLOAD statement, we can get a URL for downloading the result data file:
PRESIGN DOWNLOAD @~/<your-result-data-file>;
Open a new terminal and use the curl command to download the data file.
curl -O '<your-presign-download-url>'
Next, using the PRESIGN UPLOAD statement, we can obtain a pre-signed URL for uploading the data file. We target @lakefs_stage/lakes.parquet here so that lakes.parquet is replaced with our cleaned Danish lake data.
PRESIGN UPLOAD @lakefs_stage/lakes.parquet;
Open the terminal again and use the curl command to complete the upload.
curl -X PUT -T <your-result-data-file> '<your-presign-upload-url>'
At this point, the file has been replaced with the cleaned data. List the Parquet files in the Stage again. You can see that the file size and last modification time have changed.
LIST @lakefs_stage PATTERN = '.*[.]parquet';
Query the data file again for verification to confirm that it is clean data.
SELECT country, COUNT(*)
FROM @lakefs_stage/lakes.parquet
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
Commit changes
In this step we will commit the changes to lakeFS so they are saved.
In the lakeFS Web UI, open the Uncommitted Changes page and make sure the denmark-lakes branch is selected.
Click the Commit Changes button in the upper right corner, write a commit message, and confirm the commit.
Check the original data in the main branch
The original data in denmark-lakes has been replaced with the cleaned, smaller dataset. Let's switch back to the main branch and check whether the original data has been affected.
Similarly, mount the data files by creating a Stage.
CREATE STAGE lakefs_stage_check
URL='s3://quickstart/main/'
CONNECTION = (
REGION = 'auto'
ENDPOINT_URL = 'http://127.0.0.1:8000'
ACCESS_KEY_ID = 'AKIAIOSFOLKFSSAMPLES'
SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY');
Then query the lake data and list the five countries with the largest number of lakes.
SELECT country, COUNT(*)
FROM @lakefs_stage_check/lakes.parquet
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
Everything in the main branch remains as it was, and we have obtained a cleaned Danish lakes dataset without disturbing the original data.
Extra challenge
In this workshop, we learned how to create isolated branches for data, and performed some simple queries and cleaning work in Databend.
If you want more of a challenge, you can follow the official lakeFS tutorial to try branch merging and data rollback; you can also follow the official Databend tutorial to experience data cleaning at import time and time travel.
We also welcome you to bring Databend and lakeFS into production and validate them under real workloads.
About Databend
Databend is an open source, flexible, low-cost, new data warehouse based on object storage that can also perform real-time analysis. We look forward to your attention and exploring cloud native data warehouse solutions together to create a new generation of open source Data Cloud.
Databend Cloud: https://databend.cn
Databend documentation: https://databend.rs/
Wechat:Dating evening
GitHub:https://github.com/datafuselabs/databend