Python Workflow: Airflow

Apache Airflow is an open-source tool for orchestrating complex computational workflows and data processing pipelines. If you find yourself running cron scripts that take a long time to finish, or big data batch jobs, Airflow may be able to get you out of that predicament. This article offers an introductory tutorial on writing Airflow pipelines, aimed at readers who are looking for a new tool or who have heard of Airflow but have not yet worked with it.

Airflow workflows are designed as directed acyclic graphs (DAGs). This means that when writing a workflow, you should think about how your work can be divided into independent tasks, and then combine those tasks into a logical whole, a graph, so that together they achieve the workflow's goal. For example:

[Figure: an example Airflow DAG]

The shape of the graph determines the overall logic of your workflow. An Airflow DAG can have multiple branches, and you can decide which branches to follow and which to skip during workflow execution. This makes for a very flexible design, because if an error occurs, each task can be retried several times or even stopped completely, and the workflow run can be resumed after a restart from the last unfinished task.

When designing Airflow task nodes, it is important to remember that they may be executed more than once. Each task should be idempotent, that is, it should be able to run multiple times without causing unintended consequences.
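As a hypothetical illustration (the output path and task body are invented for this example, not taken from the article), compare a non-idempotent task body with an idempotent one:

    # Non-idempotent: appending means a retried task duplicates rows.
    def load_results_bad(rows):
        with open('/tmp/results.csv', 'a') as f:
            for row in rows:
                f.write(row + '\n')

    # Idempotent: rewriting the file means a retry leaves the same final state.
    def load_results_good(rows):
        with open('/tmp/results.csv', 'w') as f:
            for row in rows:
                f.write(row + '\n')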

Airflow Terminology

The following is a brief overview of some of the terms used when designing Airflow workflows:

  • An Airflow DAG is a combination of many Tasks.
  • Each Task is implemented as an Operator.
  • When a DAG is started, Airflow creates a DagRun record in the database.
  • When a task runs, what actually executes is a Task Instance, which runs in the context of a DagRun.
  • AIRFLOW_HOME is the base directory where Airflow looks for DAGs and plugins.

Preparing the Environment

Airflow is written in Python, which makes it very simple to install on our machines. I am using Python 3.5 here; those of you still on Python 2 should climb out of that pit quickly, Python 3 will win you over. Airflow does support Python 2, apparently as far back as Python 2.6, but I strongly recommend using Python 3. I will use virtualenv to manage the development environment for this and the subsequent experiments.

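The original screenshot is not reproduced here; preparing the environment probably looked something like this sketch (the pip command is an assumption):

    # install virtualenv with the system Python 3
    pip3 install virtualenv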

Installation Airflow

For convenience, I have created a separate airflow user for these experiments, with that user's home directory /home/airflow as the airflow working directory. If you would like to see the same results as I do, I hope you will follow my steps:

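A sketch of those steps (the user creation flags and virtualenv name are assumptions, not the original screenshot):

    # create the dedicated user and switch to it
    sudo useradd -m -d /home/airflow airflow
    sudo su - airflow
    # create and activate a Python 3 virtualenv in its home directory
    virtualenv -p python3 venv
    source venv/bin/activate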

That only gets us into the virtualenv environment; the next step is to install airflow itself. As of the time of writing, the latest version of airflow is 1.8, so I am using version 1.8 here:
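A sketch of the install command (note that since release 1.8.1 the PyPI package has been renamed apache-airflow; for 1.8.0, as used here, it was still called airflow):

    pip install airflow==1.8.0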

After a somewhat long wait, our airflow should be installed successfully. You can see during installation that airflow depends on a large number of other libraries; we will get to these gradually later on. The next step is to configure airflow's environment.

The first thing to configure is the AIRFLOW_HOME environment variable, which is the basis for all of airflow's work: the DAGs and plugins added later will be looked up with AIRFLOW_HOME as the root directory. Following the earlier description, this should be the home directory /home/airflow, so it can be set up like this:

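A sketch of the setting (you would normally also add this line to the shell profile so it survives logins):

    export AIRFLOW_HOME=/home/airflow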

With that, what is arguably the simplest part of the configuration is done. Let's see something useful: try typing the airflow version command:

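A sketch of the check (the exact output varies by version, but it should end with the version string):

    airflow version
    # ...prints an ASCII-art banner followed by the version, e.g. 1.8.0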

If you see output along those lines, our airflow has been installed and configured successfully. Also, if we run the ls -al command, we should find the following two files in the /home/airflow directory:

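If memory serves for Airflow 1.8, the two generated files are airflow.cfg and unittests.cfg (the second name is my own recollection, not the original screenshot):

    ls -al /home/airflow
    # airflow.cfg  unittests.cfg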

Open the airflow.cfg file and you will find many configuration options inside, most of them with default values. This file is airflow's configuration file; in later deep dives we will come across many settings that need changing, but for now we will experiment with the default configuration. If you can't wait to tweak it and play around right now, the Airflow Configuration document can help you understand what each option means.

Initializing the Airflow Database

You may be a little shocked: why initialize a database? Well, airflow needs to maintain the internal state of DAGs and keep the history of task executions, and all of this is stored in a database. In other words, we need to create tables in the database first. However, since we are using Python, we don't have to write raw SQL ourselves; airflow provides a convenient command line for this, and we simply run:
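The initialization command (this is the subcommand in Airflow 1.x; newer releases have since renamed it to airflow db init):

    airflow initdb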

It is worth noting here that the default configuration uses SQLite, so after initialization a database file named airflow.db will appear locally. If you want to use another database (for example MySQL), that is entirely possible; it only requires a small configuration change, which we will cover later.


The Airflow Web UI

Airflow offers several ways to interact with it. The two main ones are the command line and the Web UI. Airflow's Web UI is written with Flask, and starting it is also simple: just run this command in the AIRFLOW_HOME directory:
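The command to start it (port 8080 is the default; the -p flag overrides it):

    airflow webserver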

Then you can see the result in a browser. The default port is 8080, so open a browser and visit the following URL: http://localhost:8080/admin. As if by magic, you will see a page that looks something like this:

[Screenshot: the Airflow web UI listing the available DAGs]

The First DAG

As mentioned from the very beginning, Airflow's two major features are DAGs and plugins, yet only now do we get to DAGs. A DAG is a concept from discrete mathematics; its full name is directed acyclic graph. A graph is made of nodes; directed means that the edges between nodes have a direction, or, in engineering terms, that there are dependencies between nodes; acyclic means that the dependencies can only go one way: you cannot have A depend on B, B depend on C, and then C depend back on A, forming a cycle.

In Airflow, every node of the graph is a task. It can be a command line (BashOperator), a piece of Python code (PythonOperator), and so on. Connected by their dependencies, these nodes form a flow, a graph, called a DAG, and every DAG has a unique DagId.

Creating a DAG is simple as well. First we need to create a dags directory under AIRFLOW_HOME; airflow will look for DAGs in this directory. So let's create it first, and then create a tutorial.py file inside it:

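A sketch of those steps (assuming AIRFLOW_HOME is /home/airflow as set earlier):

    mkdir -p /home/airflow/dags
    touch /home/airflow/dags/tutorial.py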

Then let's look at how the DAG file is written:

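The original file was shown only as a screenshot; below is a minimal sketch in its spirit, modeled on the classic Airflow 1.8 tutorial (the task ids, bash commands, and dates are assumptions; the set_downstream calls match what the article describes):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2017, 5, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # The dag_id 'tutorial' must be unique; schedule_interval is the
    # period on which the scheduler will trigger this DAG.
    dag = DAG('tutorial', default_args=default_args,
              schedule_interval=timedelta(days=1))

    t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
    t2 = BashOperator(task_id='sleep', bash_command='sleep 5', dag=dag)
    t3 = BashOperator(task_id='echo_done', bash_command='echo done', dag=dag)

    # Dependencies: t2 and t3 both run after t1 finishes.
    t1.set_downstream(t2)
    t1.set_downstream(t3)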

We can see this DAG's dependency structure in the Web UI:

[Screenshot: the DAG's graph view in the web UI]

This defines a few task nodes and combines them into a DAG. You can also see that the dependencies are set up through set_downstream; this is just one way of doing it, and later on we will see an even more convenient one.

Running the DAG

For the DAG to run, we need to trigger it. There are several ways to trigger a DAG, but the most natural one is the timer. For instance, in our task above you can see that we set a parameter, schedule_interval, the period on which the task is triggered. But setting a period alone does nothing; we also need a scheduler to actually do the scheduling, so we have to run the scheduler:

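The scheduler command (run it in the same environment, with AIRFLOW_HOME still exported):

    airflow scheduler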

I used the LocalExecutor here. Airflow has three executors:

  • SequentialExecutor: executes DAG tasks one at a time, sequentially
  • LocalExecutor: executes DAG tasks with local processes
  • CeleryExecutor: executes DAG tasks with Celery workers

The first, SequentialExecutor, can be used during development and debugging; do not use it in a production environment. The second and third can be used for development, testing, and production, but when there are many tasks, the third one, CeleryExecutor, is recommended. The executor is chosen in airflow.cfg, as sketched below.
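A sketch of the relevant airflow.cfg setting (note that LocalExecutor and CeleryExecutor also require a real database backend such as MySQL or PostgreSQL instead of the default SQLite):

    [core]
    executor = LocalExecutor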

Summary

This article started from installing the Airflow environment and gave a simple description of how to use Airflow, but it has been positioned as an introductory piece throughout; Airflow's advanced features will be described in many follow-up posts on this blog, and in the meantime, please explore them on your own.

