Managing data with ByteHouse and Apache Airflow


Apache Airflow combined with ByteHouse provides a powerful and efficient solution for managing and executing data flows. This article highlights the key benefits and features of using Apache Airflow with ByteHouse, showing how to simplify data workflows and drive business success.

Main advantages:

  1. Scalable and reliable data pipelines: Apache Airflow provides a powerful platform for designing and orchestrating data pipelines, allowing you to easily handle complex workflows. With ByteHouse, a cloud-native data warehouse solution, you can efficiently store and process large amounts of data, ensuring scalability and reliability.

  2. Automated Workflow Management: Airflow's intuitive interface makes it easy to create and schedule data workflows with a visual DAG (Directed Acyclic Graph) editor. By integrating with ByteHouse, you can automate the extract, transform, and load (ETL) process, reducing manual effort and enabling more efficient data management.

  3. Simple Deployment and Management: Both Apache Airflow and ByteHouse are designed to be simple to deploy and manage. Airflow can be deployed on-premises or in the cloud, while ByteHouse provides a fully managed cloud-native data warehouse solution. This combination makes the setup and maintenance of the data infrastructure seamless.

Customer scenario

Business scenario

In this customer scenario, an analytics firm called Data Insights Ltd. (pseudonym) uses Apache Airflow as a data pipeline orchestration tool. They chose ByteHouse as their data warehouse solution to take advantage of its powerful analytics and machine learning capabilities.

Data Insights Ltd. operates in the e-commerce industry and collects large amounts of customer and transaction data, which is stored in AWS S3. They need to regularly load this data into ByteHouse and run various analysis tasks to gain insight into business operations.

Data pipeline

Using Apache Airflow, Data Insights Ltd. set up a data loading pipeline driven by specific events or schedules. For example, they can configure Airflow to trigger the loading process at a fixed time of day, or whenever a new data file lands in a specified AWS S3 bucket. When a trigger fires, Airflow starts the load by retrieving the relevant data files from AWS S3, authenticating against the S3 buckets with the proper credentials and API integration. Once the data is ingested from AWS S3, Airflow coordinates its transformation and loading into ByteHouse, using ByteHouse's integration capabilities to store and organize the data efficiently according to predefined schemas and data models.
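As a rough illustration of a single load step, the shell commands below sketch what one scheduled run could do, assuming the AWS CLI is already configured for the bucket; the bucket name, file path, and ecommerce.transactions table are hypothetical and not part of the original setup:

# Fetch the latest export from S3 (hypothetical bucket and key)
aws s3 cp s3://data-insights-exports/transactions/latest.csv /tmp/transactions.csv

# Load the file into ByteHouse with the CLI, following the INSERT ... INFILE pattern used in the DAG later in this article
bytehouse-cli -cf ~/bytehouse-cli/conf.toml \
  "INSERT INTO ecommerce.transactions FORMAT csv INFILE '/tmp/transactions.csv'"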

After successfully loading data into ByteHouse, Data Insights Ltd. can leverage ByteHouse's capabilities for analytics and machine learning tasks. They can use ByteHouse's SQL-like language to query data, perform complex analytics, generate reports, and reveal meaningful insights about customers, sales trends, and product performance.
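For instance, an ad-hoc sales-trend query might look like the following when run through the ByteHouse CLI; the table and columns are illustrative, and the functions follow the ClickHouse-style SQL that ByteHouse is built on:

# Hypothetical daily sales-trend query run through the ByteHouse CLI
bytehouse-cli -cf ~/bytehouse-cli/conf.toml -q \
  "SELECT toDate(order_time) AS day, count() AS orders, sum(amount) AS revenue
   FROM ecommerce.transactions
   GROUP BY day
   ORDER BY day DESC
   LIMIT 30"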

Additionally, Data Insights Ltd. leverages ByteHouse to create interactive dashboards and visualizations. They can build dynamic dashboards that display real-time metrics, monitor key performance indicators, and share actionable insights with stakeholders across the organization.

Finally, Data Insights Ltd. utilizes ByteHouse's machine learning capabilities to develop predictive models, recommender systems, or customer segmentation algorithms. ByteHouse provides the computing power and storage infrastructure needed to train and deploy machine learning models, enabling Data Insights Ltd. to obtain valuable predictive and prescriptive insights.

Summary

By using Apache Airflow as a data pipeline orchestration tool and integrating it with ByteHouse, Data Insights Ltd. achieved a smooth, automated process for loading data from AWS S3 into ByteHouse. They leverage ByteHouse's analytics, machine learning, and dashboard capabilities to gain valuable insights and drive data-driven decision-making across the organization.

ByteHouse <> Airflow Quick Start

Prerequisites

Install pip in your local or virtual environment. Install the ByteHouse CLI in the same environment and log in to your ByteHouse account; refer to the ByteHouse CLI documentation for installation help. For example, using Homebrew on macOS:

brew install bytehouse-cli
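To confirm the CLI is installed and can reach your ByteHouse account, you can run a trivial query; the conf.toml path below is whatever you configured during ByteHouse CLI setup:

# Verify the connection to ByteHouse with a trivial query
bytehouse-cli -cf ~/bytehouse-cli/conf.toml -q "SELECT 1"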

Install Apache Airflow

In this tutorial, we use pip to install Apache Airflow in your local or virtual environment. For more information, see the official Airflow documentation.

# Airflow needs a home directory; ~/airflow is the default,
# but you can pick another location if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

AIRFLOW_VERSION=2.1.3
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"

# For example: 3.6
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

If the installation fails with pip, try pip3 install instead.

After the installation is complete, run the command airflow info to get more information about Airflow.

Airflow initialization

Initialize Airflow's web server by executing the following command:

# Initialize the database
airflow db init


airflow users create \
--username admin \
--firstname admin \
--lastname admin \
--role Admin \
--email admin

# Start the web server; the default port is 8080
# or set web_server_port in airflow.cfg to change it
airflow webserver --port 8080

Once the web server is set up, you can log in to the Airflow console at http://localhost:8080/ using the username and password you set up earlier.


In a new terminal, use the following command to start the Airflow scheduler. Then refresh http://localhost:8080/.
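# Start the Airflow scheduler and leave it running in this terminal
airflow scheduler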

Airflow configuration

Enter the Airflow folder with the cd ~/airflow command. Open the configuration file named airflow.cfg.

Add the configuration below to connect Airflow to a database. By default it uses SQLite, but you can also connect to MySQL.

# SQLite is used by default; you can also connect to MySQL
sql_alchemy_conn = mysql+pymysql://airflow:[email protected]:8080/airflow

# authenticate = False
# Disable the SQLAlchemy connection pool to prevent failures when starting the Airflow scheduler: https://github.com/apache/airflow/issues/10055
sql_alchemy_pool_enabled = False

# Folder where the Airflow pipelines live, usually a subfolder inside the code repository. This path must be absolute.
dags_folder = /home/admin/airflow/dags

Create a directed acyclic graph (DAG) job

Create a folder named dags under the Airflow home directory, then create test_bytehouse.py inside it to start a new DAG job.

cd ~/airflow
mkdir dags
cd dags
nano test_bytehouse.py

Add the following code to test_bytehouse.py. The job connects to ByteHouse through the ByteHouse CLI and uses BashOperator to run queries and load data into ByteHouse.

from datetime import timedelta
from textwrap import dedent

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
with DAG(
    'test_bytehouse',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    tags=['example'],
) as dag:
    
    # Load a local CSV file into ByteHouse through the ByteHouse CLI
    tImport = BashOperator(
        task_id='ch_import',
        depends_on_past=False,
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml "INSERT INTO korver.cell_towers_1 FORMAT csv INFILE \'/opt/bytehousecli/data.csv\'"',
    )

    # Query ByteHouse and write the result to a local CSV file
    tSelect = BashOperator(
        task_id='ch_select',
        depends_on_past=False,
        bash_command='$Bytehouse_HOME/bytehouse-cli -cf /root/bytehouse-cli/conf.toml -q "select * from korver.cell_towers_1 limit 10 into outfile \'/opt/bytehousecli/dataout.csv\' format csv"',
    )
    
    tSelect >> tImport

Run python test_bytehouse.py in the dags folder to make sure the file parses without errors; Airflow will then register the new DAG.
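For example, assuming you are still in the dags folder:

python test_bytehouse.py

# Confirm that Airflow has picked up the new DAG
airflow dags list | grep test_bytehouse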

Refresh the page in your browser. You can see the newly created DAG named test_bytehouse in the DAG list.


Execute the DAG

Run the following Airflow commands in a terminal to list the tasks in the test_bytehouse DAG and show their hierarchy. You can test the query task and the data import task separately.

# Print the list of tasks in the "test_bytehouse" DAG
[root@VM-64-47-centos dags]# airflow tasks list test_bytehouse
ch_import
ch_select

# Print the hierarchy of tasks in the "test_bytehouse" DAG
[root@VM-64-47-centos dags]# airflow tasks list test_bytehouse --tree
<Task(BashOperator): ch_select>
<Task(BashOperator): ch_import>
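
To run the tasks one at a time without waiting for the schedule, or to trigger the whole DAG manually, the standard Airflow commands can be used; the date below is just an example execution date:

# Test a single task without recording state in the metadata database
airflow tasks test test_bytehouse ch_select 2023-06-01
airflow tasks test test_bytehouse ch_import 2023-06-01

# Unpause and trigger the whole DAG
airflow dags unpause test_bytehouse
airflow dags trigger test_bytehouse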

After the DAG runs, check the query history page and the database module in your ByteHouse account. You should see that the query and the data load executed successfully.


About Volcano Engine ByteHouse

ByteHouse is a cloud-native data warehouse product developed independently by ByteDance. Its technical architecture was rebuilt on top of the open-source ClickHouse engine, adding cloud-native deployment and operations management, separation of storage and compute, and multi-tenant management, with large improvements in scalability, stability, operability, performance, and resource utilization. Inside ByteDance, ByteHouse is deployed on more than 18,000 nodes, with single clusters of more than 2,400 nodes. After proving itself in hundreds of internal application scenarios with tens of thousands of users, it has been adopted by many external enterprise customers.


Source: blog.csdn.net/ByteDanceTech/article/details/131297955