What is ETL
ETL is a commonly used data-processing pattern. At my previous companies, ETL was effectively the foundation of all data processing, so it had to be very stable, fault tolerant, and well monitored. ETL stands for Extract, Transform, Load: messy raw data is preprocessed and then written to a storage layer, which can be a SQL or NoSQL database, or simply files.
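To make the three steps concrete, here is a minimal Python sketch of Extract, Transform, Load, using an in-memory list in place of real log files and a dict in place of a real store; the field names and cleaning rules are illustrative assumptions, not our actual pipeline:

```python
def extract(raw_lines):
    """Extract: split raw log lines into records, skipping empty lines."""
    return [line.split(",") for line in raw_lines if line.strip()]

def transform(records):
    """Transform: drop malformed rows and normalize the fields."""
    cleaned = []
    for rec in records:
        if len(rec) == 2:  # keep only well-formed rows
            user, value = rec
            cleaned.append({"user": user.strip(), "value": int(value)})
    return cleaned

def load(rows, store):
    """Load: write cleaned rows into the target store (a plain dict here)."""
    for row in rows:
        store[row["user"]] = row["value"]

store = {}
load(transform(extract(["alice, 1", "bad_row", "bob, 2"])), store)
print(store)  # {'alice': 1, 'bob': 2}
```

In a real system the extract step would read from HDFS or MySQL and the load step would write to the warehouse, but the shape of the pipeline is the same.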
My initial design was to handle all the processing with a handful of cron jobs plus celery, storing our log files in HDFS and some data in MySQL, running roughly once a day. The core requirements were scalability, stability, fault tolerance, and the ability to roll back. Our data warehouse lives in the cloud, which makes this easy to manage.
With an ETL system of our own I feel much more at ease, and it will make future data processing and machine learning work considerably more convenient.
Here comes the problem
My initial design was very similar to Uber's early ETL, because I found it convenient. But I ran into a serious problem: I couldn't build it all by myself. First, I would at least need to write a front-end UI to monitor the cron jobs, and the ones on the market are poor. Second, fault-tolerant auto-restart is laborious to implement, or at least I haven't found a good way to handle it. Finally, deployment is quite troublesome. Writing all of this myself would take at least a month, and the result still might not be particularly robust. After spending a couple of days writing fragmented scripts, I realized this was dragging on far too long.
Highly recommended tool
Airbnb is my favorite company. They have open-sourced a lot of tools, and I think Airflow is the most practical of them. Airflow can manage data pipelines, and can even serve as a more advanced cron. These days most large companies no longer call their data processing "ETL"; they call it a "data pipeline", which may have something to do with what Google advocates. Airbnb's Airflow is written in Python; it can schedule workflows, provides more reliable processes, and ships with its own UI (probably thanks to Airbnb's design leadership). Without further ado, here are two screenshots:
What is DAG
One of the most important concepts in airflow is DAG.
DAG stands for directed acyclic graph, a structure with many applications in machine learning. In Airflow, you can think of a DAG as a small project or process: each one can contain many "directed" tasks that together accomplish some goal. The introduction on the official website lists the characteristics such pipelines should have:
- Scheduled: each job should run at a certain scheduled interval
- Mission critical: if some of the jobs aren’t running, we are in trouble
- Evolving: as the company and the data team matures, so does the data processing
- Heterogenous: the stack for modern analytics is changing quickly, and most companies run multiple systems that need to be glued together
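The "directed acyclic" idea is just tasks plus one-way dependencies with no cycles, which guarantees a valid execution order always exists. A plain-Python sketch (the task names are made up for illustration) using the standard library's topological sort:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical daily-pipeline tasks: each key maps to the tasks it depends on.
deps = {
    "extract_logs": set(),
    "clean_logs": {"extract_logs"},
    "aggregate": {"clean_logs"},
    "load_warehouse": {"aggregate"},
}

# A valid execution order respects every edge: upstream tasks run first.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract_logs', 'clean_logs', 'aggregate', 'load_warehouse']
```

Airflow's scheduler does essentially this, plus retries, scheduling, and monitoring on top.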
YEAH! Awesome, right? After reading all of this, I found it a perfect fit for Prettyyes.
How to install
Installing Airflow is super easy; just use pip. The current version of Airflow is 1.6.1, which has a small bug — I'll show you how to patch it later.
pip install airflow
One pitfall: Airflow pulls in a lot of data-processing packages, so pandas and numpy get installed along with it (data scientists will be familiar with these). Installing them through the default pip index is very slow from inside China, and the Douban mirror has some small issues with Airflow itself. My workaround is to install numpy and pandas from the Douban mirror first, and then install Airflow. In an automated deployment you can enforce this by adjusting the order in requirements.txt.
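As a sketch of that ordering, a requirements.txt can simply list the heavy dependencies before Airflow itself (the version pins here are illustrative, matching the 1.6.1 release discussed in this post):

```
numpy
pandas
airflow==1.6.1
```

pip processes the file top to bottom, so numpy and pandas are already satisfied by the time Airflow's own dependency resolution runs.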
How to run
Taken from the official website
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install airflow
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
Then you can view all DAGs in the web UI and monitor your processes. Note that to actually execute scheduled DAGs you also need the scheduler process running (`airflow scheduler`; we start it via supervisord below).
How to import dag
Generally, after the first run Airflow generates the airflow folder in the default location, and you then only need to create a dags folder inside it for your DAG files. The file tree of my deployment on Alibaba Cloud looks like this:
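For reference (my exact tree isn't reproduced here), a freshly initialized AIRFLOW_HOME typically looks roughly like this, with the dags folder added by hand:

```
airflow/
├── airflow.cfg      # main configuration, generated on first run
├── airflow.db       # metadata database (SQLite by default)
└── dags/            # create this folder and put your DAG files here
    └── daily_process.py
```

The dag file name `daily_process.py` is just a placeholder for whatever you call your DAG module.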
Below is one of the small DAGs I wrote at our company Prettyyes, which processes our logs every day:
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import ConfigParser  # Python 2; on Python 3 this module is named configparser

# Read deployment-specific paths from a config file
config = ConfigParser.ConfigParser()
config.read('/etc/conf.ini')
WORK_DIR = config.get('dir_conf', 'work_dir')
OUTPUT_DIR = config.get('dir_conf', 'log_output')
PYTHON_ENV = config.get('dir_conf', 'python_env')

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,  # each run is independent of the previous one
    'start_date': datetime.today() - timedelta(days=1),
    'retries': 2,  # retry a failed task twice...
    'retry_delay': timedelta(minutes=15),  # ...15 minutes apart
}

# One DAG named daily_process, scheduled to run once per day
dag = DAG('daily_process', default_args=default_args,
          schedule_interval=timedelta(days=1))

# {{ ds }} is a Jinja template variable that Airflow fills in at run time
templated_command = "echo 'single' | {python_env}/python {work_dir}/mr/LogMR.py"\
    .format(python_env=PYTHON_ENV, work_dir=WORK_DIR) + " --start_date {{ ds }}"

task = BashOperator(
    task_id='process_log',
    bash_command=templated_command,
    dag=dag
)
After writing the DAG, just put the file into the dags folder created earlier, and then run:
python <dag_file>
to make sure there are no syntax errors. In the code above, note the
schedule_interval=timedelta(days=1)
This makes our data-processing task run once a day. More importantly, besides the Bash operator, Airflow also provides many Hadoop-related interfaces, which will make it convenient to connect to a Hadoop system later. Many more features can be found in the official documentation.
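The `{{ ds }}` in the templated command is a Jinja template variable that Airflow replaces with the execution date when the task runs. As a rough stdlib-only illustration of what the rendered command looks like (the paths are placeholders, and a plain string replace stands in for Jinja here):

```python
# Illustration only: Airflow renders "{{ ds }}" via Jinja at run time.
# We mimic that single substitution with str.replace for a given run date.
template = "echo 'single' | /env/bin/python /app/mr/LogMR.py --start_date {{ ds }}"
rendered = template.replace("{{ ds }}", "2016-01-01")
print(rendered)
# echo 'single' | /env/bin/python /app/mr/LogMR.py --start_date 2016-01-01
```

So each daily run of the BashOperator receives its own date on the command line, which is what lets the script process exactly one day's logs.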
One of the small bugs
Airflow 1.6.1 has a small bug in the web UI. After a successful installation, clicking a log inside a DAG produces the following page:
To fix it, just replace the file
airflow/www/utils.py
with the latest utils.py from the Airflow GitHub repository. The specific issue is here:
fixes datetime issue when persisting logs
Daemon with supervisord
Airflow itself does not have a daemon mode, so the simplest approach is to use supervisord. We only need four lines of configuration:
[program:airflow_web]
command=/home/kimi/env/athena/bin/airflow webserver -p 8080
[program:airflow_scheduler]
command=/home/kimi/env/athena/bin/airflow scheduler
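Optionally, supervisord can also restart either process if it dies and capture its output; a couple of extra directives per program section do this (the log path is a placeholder, and these directives are standard supervisord options, not something Airflow-specific):

```
[program:airflow_scheduler]
command=/home/kimi/env/athena/bin/airflow scheduler
autostart=true
autorestart=true
stdout_logfile=/var/log/airflow_scheduler.log
```

This covers the fault-tolerant auto-restart that was so laborious to build by hand in the cron-based design.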
I think Airflow is especially suitable for small teams: it's powerful, and it's really easy to deploy. It connects seamlessly with Hadoop and mrjob, which has greatly improved our workflow.