Data pipeline management tool: Airflow


What is ETL

ETL is a commonly used data processing pattern. At my previous companies, ETL was pretty much the foundation of all data processing, so it had to be very stable, highly fault tolerant, and well monitored. ETL stands for Extract, Transform, Load: messy raw data is preprocessed and then placed into a storage layer, which can be a SQL or NoSQL database, or simply flat files.

My initial design was to handle all the processing with a few cron jobs plus Celery, storing our log files in HDFS and some data in MySQL, running roughly once a day. The core requirements were scalability, stability, fault tolerance, and the ability to roll back. Our data warehouse sits in the cloud, so that part is easy to handle.

With an ETL system of our own I feel much more at ease, and it will make data processing and machine learning much more convenient down the road.

Here comes the problem

My initial design was very similar to Uber's early ETL, because I thought it would be convenient. But I ran into a serious problem: I couldn't build it all by myself. First, I would need to write at least a front-end UI to monitor the cron jobs, since the ones on the market are very poor. Second, fault-tolerant auto-restart is laborious to write, or maybe I just haven't found a good way to handle it. Finally, deployment is quite troublesome. Writing all of this myself would take at least a month, and the result still might not be particularly robust. After spending a couple of days writing some fragmented scripts, I realized this was dragging on far too long.

Highly recommended tool

Airbnb is my favorite company. They have open-sourced a lot of tools, and I think Airflow is the most practical of them. Airflow can manage data pipelines and can even be used as a more advanced cron. These days most big companies no longer call their data processing ETL; they call it a data pipeline, which may have something to do with what Google advocates. Airbnb's Airflow is written in Python; it can schedule workflows, makes the whole process more reliable, and ships with its own UI (probably thanks to Airbnb's strong design culture). Without further ado, here are two screenshots:


[Screenshots: the Airflow web UI]

What is DAG

One of the most important concepts in airflow is DAG.

A DAG is a directed acyclic graph, a structure that shows up in a lot of machine learning work. In Airflow, though, you can think of a DAG as a small project or a small process: each one contains a number of "directed" tasks that together accomplish some goal. The official introduction describes the characteristics of these workflows as:

  • Scheduled: each job should run at a certain scheduled interval
  • Mission critical: if some of the jobs aren’t running, we are in trouble
  • Evolving: as the company and the data team matures, so does the data processing
  • Heterogenous: the stack for modern analytics is changing quickly, and most companies run multiple systems that need to be glued together

YEAH! Awesome, right? After reading all of this, I found it fit Prettyyes perfectly.
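
To make the "directed" part concrete, here is a minimal toy sketch of my own (the task names and commands are placeholders, not from any real pipeline), using the same BashOperator that appears later in this post and declaring dependencies with set_downstream:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators import BashOperator

# A toy daily DAG: extract -> transform -> load (names are illustrative only)
default_args = {'owner': 'airflow', 'start_date': datetime(2015, 6, 1)}
dag = DAG('toy_etl', default_args=default_args, schedule_interval=timedelta(days=1))

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo transform', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# The directed edges: extract runs first, then transform, then load
extract.set_downstream(transform)
transform.set_downstream(load)

If an upstream task fails, the downstream ones won't run, which is exactly the "mission critical" behaviour described in the list above.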

How to install

Installing Airflow is super easy, just use pip. The current version of Airflow is 1.6.1; it has one small bug, and I will explain how to patch it later.

pip install airflow

There is one gotcha here: Airflow depends on a lot of data processing packages, so pandas and numpy get installed along with it (data scientists will be familiar with these). Inside China, installing them through the default pip index is very slow, and the Douban mirror has a few minor issues of its own. My workaround is to install numpy and pandas from the Douban mirror first, and then install airflow. For automated deployments you can adjust the order in requirements.txt accordingly.
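
Concretely, the install order looks roughly like this (the mirror URL below is the commonly used Douban index; swap in whichever mirror works for you):

pip install numpy pandas -i http://pypi.douban.com/simple/
pip install airflow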

How to run

Taken from the official website

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

Then you can view all your DAGs in the web UI and monitor your processes.
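
Note that the webserver only serves the UI; what actually triggers DAG runs on their schedule_interval is the scheduler, which the official snippet above does not start. It is the same command the supervisord config at the end of this post keeps alive:

airflow scheduler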

How to import dag

Generally, after the first run Airflow creates an airflow folder in the default location (AIRFLOW_HOME); you then just need to create a dags folder inside it and put your DAG files there. The file tree I deployed on Alibaba Cloud looks like this:


[Screenshot: the deployed file tree]

Below is one of the small DAGs I wrote at our company Prettyyes; it needs to process logs every day:

from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import ConfigParser  # Python 2; on Python 3 this module is called configparser


# Read the working directory, log output path and virtualenv path from a config file
config = ConfigParser.ConfigParser()
config.read('/etc/conf.ini')
WORK_DIR = config.get('dir_conf', 'work_dir')
OUTPUT_DIR = config.get('dir_conf', 'log_output')
PYTHON_ENV = config.get('dir_conf', 'python_env')

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,              # each run is independent of the previous one
    'start_date': datetime.today() - timedelta(days=1),
    'retries': 2,                           # retry a failed task twice...
    'retry_delay': timedelta(minutes=15),   # ...waiting 15 minutes between attempts
}

# One DAG run per day
dag = DAG('daily_process', default_args=default_args, schedule_interval=timedelta(days=1))

# {{ ds }} is Airflow's Jinja macro for the execution date (YYYY-MM-DD)
templated_command = "echo 'single' | {python_env}/python {work_dir}/mr/LogMR.py"\
    .format(python_env=PYTHON_ENV, work_dir=WORK_DIR) + " --start_date {{ ds }}"


task = BashOperator(
    task_id='process_log',
    bash_command=templated_command,
    dag=dag
)

After writing it, just put this DAG file into the dags folder created earlier, and then run:

python <dag_file>

to make sure there are no syntax errors. In the DAG above you can see that I set

schedule_interval=timedelta(days=1)

In this way, our data processing task runs once a day. More importantly, besides bash operators Airflow also provides many Hadoop-related operators, which will make it easy to hook into a Hadoop stack later on. The specifics are all in the official documentation.
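
As an illustration only (not part of my production DAG), here is roughly what a Hadoop-side step could look like, assuming the hive extras are installed (pip install airflow[hive]) and a Hive CLI connection is configured; the table name and HQL are made up:

from airflow.operators import HiveOperator  # importable once the hive extras are installed

# Hypothetical follow-up task: load the day's processed logs into a Hive table
load_to_hive = HiveOperator(
    task_id='load_log_to_hive',
    hql="LOAD DATA INPATH '/logs/{{ ds }}' INTO TABLE daily_logs PARTITION (dt='{{ ds }}')",
    dag=dag
)

# Only run it after the bash log-processing task above succeeds
task.set_downstream(load_to_hive)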

One of the small bugs

Airflow 1.6.1 has a small bug in the web UI. After a successful installation, clicking into a task's log in a DAG brings up the following page:


[Screenshot: the error page]

The fix is simple: just replace the file

airflow/www/utils.py

in your installed package with the latest utils.py from the Airflow GitHub repository. The specific issue is here:

fixes datetime issue when persisting logs
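
If you are not sure where pip put the package (and therefore which utils.py to overwrite), a quick way to locate it:

python -c "import airflow; print(airflow.__file__)"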

Daemon with supervisord

Airflow itself does not have a daemon mode, so it's fine to just run it under supervisord. We only need four lines of configuration:

[program:airflow_web]
command=/home/kimi/env/athena/bin/airflow webserver -p 8080

[program:airflow_scheduler]
command=/home/kimi/env/athena/bin/airflow scheduler
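
After adding these two program blocks to the supervisord configuration (adjust the airflow paths to your own environment), reload supervisord and check that both processes are up, roughly:

supervisorctl reread
supervisorctl update
supervisorctl status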

I think Airflow is especially suitable for small teams: it's powerful and genuinely easy to deploy, and it connects seamlessly with Hadoop and mrjob, which has been a big boost for our business.
