A powerful tool for task orchestration and scheduling: Apache Airflow

Introduction: Apache Airflow is an open-source workflow scheduling system created at Airbnb and open-sourced in 2015. It is written in Python and is used to author, schedule, and monitor workflows. An Airflow workflow is composed of a series of tasks, which form a directed acyclic graph (DAG) according to their dependencies. The essence of Airflow is its flexibility and extensibility: workflows are defined in Python code, so they can be versioned, tested, and refactored like ordinary programs. Airflow supports many task types, including but not limited to bash commands, Python functions, and SQL queries, and more task types and systems can be supported through its plugin mechanism.
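As a minimal sketch of what "workflows as code" looks like (the dag_id, schedule, and task below are illustrative, not part of any real project):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal, illustrative DAG: a single task that echoes a greeting once a day.
with DAG(
    "hello_airflow",                  # hypothetical dag_id
    start_date=datetime(2023, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from airflow'",
    )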

Role: Airflow is mainly used to manage data-processing workflows. For example, you can use Airflow to define and schedule an ETL workflow that extracts data from multiple sources, cleans and transforms it, and loads it into a data warehouse (a sketch of such an ETL-shaped DAG appears after the next paragraph). Airflow can also be used for machine learning pipelines, automated testing, infrastructure management, and more.

Workflow: In Airflow, a workflow is defined as a DAG (directed acyclic graph). Each DAG is composed of multiple tasks; each task is a unit of work that can run independently, and dependencies can be set between tasks. Airflow's scheduler automatically schedules and runs tasks according to the DAG structure and task dependencies.
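Here is a sketch of how such dependencies are wired, using the ETL shape mentioned above; EmptyOperator stands in for real work, and all ids are made up:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Illustrative fan-out/fan-in wiring; EmptyOperator does nothing and only
# serves to show the dependency structure.
with DAG(
    "etl_demo",                       # hypothetical dag_id
    start_date=datetime(2023, 5, 1),
    schedule_interval=None,           # run only when triggered manually
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform_a = EmptyOperator(task_id="transform_a")
    transform_b = EmptyOperator(task_id="transform_b")
    load = EmptyOperator(task_id="load")

    # extract runs first, the two transforms may run in parallel, load waits for both
    extract >> [transform_a, transform_b] >> load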

Previous articles:

Flask + APScheduler + WeChat Work bot message push

Python: scheduled tasks with Celery + Redis

Python: installing and using Celery + Redis + Flower

Python + Jenkins + Selenium Grid for distributed web UI automated testing (CentOS + Win10 example)

GitLab + GitLab Runner + Docker + Dockerfile + docker-compose + Flask continuous integration and deployment

CentOS 7.6: installing Python and Miniconda

Environment installation:

1. Install conda: see the previous articles above.

2. Create a dedicated environment: conda create --name airflow66 python=3.8

3. Activate the environment: conda activate airflow66

4. Install dependencies:
   pip install -i https://mirrors.aliyun.com/pypi/simple/ numpy
   pip install -i https://mirrors.aliyun.com/pypi/simple/ "apache-airflow==2.4.3"
   pip install -i https://mirrors.aliyun.com/pypi/simple/ requests

5. Initialize the metadata database: airflow db init

6. Start the web server: airflow webserver -p 8666 -D

7. Create an account: airflow users create --username admin --firstname tom --lastname lucky --role Admin --email tom@123.com
   When prompted, enter and confirm the password.

8. Open a browser and log in: http://<ip>:8666

9. Start the Airflow scheduler: airflow scheduler -D

Log in:

[screenshot]

Example:

[screenshot]

Building automated test cases with Airflow:

1. Create a dags directory in the Airflow home directory (~/airflow by default)

2. Create a file under the dags path: test.py

# -*- coding: utf-8 -*-
# time: 2023/5/13 13:00
# file: test.py
# WeChat official account: 玩转测试开发

import unittest

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


# Task functions under test
def add(a, b):
    return a + b


def sub(a, b):
    return a - b


# Unit tests
class TestMath(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(1, 2), 3)

    def test_sub(self):
        self.assertEqual(sub(3, 2), 1)


def run_tests(**kwargs):
    # Run the test suite; the returned dict is pushed to XCom automatically
    suite = unittest.TestLoader().loadTestsFromTestCase(TestMath)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return {
        "total": result.testsRun,
        "failures": len(result.failures),
        "errors": len(result.errors),
        "details": [str(f[1]) for f in (result.failures + result.errors)],
    }


def notify_wechat(**kwargs):
    # Pull the summary produced by the run_tests task from XCom
    ti = kwargs["ti"]
    result = ti.xcom_pull(task_ids="run_tests")
    content = (
        "Test results:\n\n"
        f"Total tests: {result['total']}\n"
        f"Failures: {result['failures']}\n"
        f"Errors: {result['errors']}\n"
        f"Details: {' '.join(result['details'])}"
    )
    data = {"msgtype": "text", "text": {"content": content}}
    # Post the summary to a WeChat Work group bot webhook
    requests.post(
        "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR-KEY",
        json=data,
    )


# DAG definition
default_args = {
    "owner": "airflow",
    "start_date": days_ago(2),
}

with DAG(
    "test_dag",
    default_args=default_args,
    description="A simple test DAG",
    schedule_interval="* * * * *",  # run every minute
    catchup=False,
) as dag:
    t1 = PythonOperator(
        task_id="run_tests",
        python_callable=run_tests,
    )

    t2 = PythonOperator(
        task_id="notify_wechat",
        python_callable=notify_wechat,
    )

    t1 >> t2
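Note on the example above: in Airflow 2 the task context (including ti) is passed to the Python callable automatically, so the provide_context=True flag from Airflow 1.x is no longer needed. And because a PythonOperator's return value is pushed to XCom by default, the dictionary returned by run_tests is exactly what notify_wechat reads back with xcom_pull.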

3. Enable or disable the DAG from the web console (this can also be done from the CLI, as shown below)

[screenshot]
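The same toggle is also available from the Airflow CLI (using the dag_id test_dag from the example above):

airflow dags unpause test_dag   # enable scheduling
airflow dags pause test_dag     # stop scheduling
airflow dags trigger test_dag   # queue a manual run, independent of the schedule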

Result:

[screenshot]

Caveats:

1. DAGs are defined in Python code, so some Python programming ability is required.

2. Airflow's scheduler does not guarantee that tasks start at exactly their scheduled time; depending on scheduler load and configuration, some task instances may start with a delay.

3. Airflow itself does not execute the work of a task; it generates commands for tasks and runs them via subprocesses or on remote worker machines. You therefore need to ensure that everything a task depends on (Python libraries, system commands, etc.) is available in the environment where the task actually runs.

4. Airflow uses the SequentialExecutor by default, which means all tasks are executed one at a time in a single process. If you need to run multiple tasks in parallel, you have to configure another executor, such as the LocalExecutor or CeleryExecutor (see the configuration sketch below).
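For reference, the executor is selected in airflow.cfg or via the equivalent environment variable; note that LocalExecutor and CeleryExecutor require a non-SQLite metadata database, such as PostgreSQL or MySQL:

# in airflow.cfg, under the [core] section
executor = LocalExecutor

# or equivalently, via an environment variable
export AIRFLOW__CORE__EXECUTOR=LocalExecutor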

5. Airflow's web interface provides a lot of useful information, but it is not meant for complex operations such as modifying DAG or task definitions; those should be done in Python code.

6. For large-scale workflows, Airflow may consume significant resources. In that case, consider optimizing your DAGs or running Airflow on more powerful hardware.

Takeaway: A key feature of Airflow is its programming model: workflows are written and managed as code, which makes complex task logic easier to understand and maintain. Airflow also has strong scheduling and monitoring capabilities, letting you schedule tasks easily, monitor their run status, and receive alerts when tasks fail.

