Empowering the Industrial Internet of Things, the service domain of industrial big data: the architectural components of AirFlow [32]

Knowledge point 05: Architecture components of AirFlow

  • Goal : Understand the architectural components of AirFlow

  • path

    • step1: Architecture
    • step2: components
  • implement

    • architecture

      • Client: the development client for AirFlow-scheduled programs, used to write the Python DAG programs for AirFlow
      • Master: the master node in the distributed architecture, responsible for running the WebServer and the Scheduler
      • Worker: responsible for running the Executor, which executes the Tasks of submitted workflows
    • components

      A scheduler, which handles both triggering scheduled workflows, and submitting Tasks to the executor to run.
      An executor, which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors actually push task execution out to workers.
      A webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
      A folder of DAG files, read by the scheduler and executor (and any workers the executor has).
      A metadata database, used by the scheduler, executor and webserver to store state.
      
      • WebServer: provides the interactive interface and monitoring, allowing developers to debug and monitor the execution of all Tasks
      • Scheduler: responsible for parsing and scheduling Tasks and submitting them to the Executor for execution
      • Executor: the execution component, responsible for running the Tasks assigned by the Scheduler; the Tasks run on the Workers
      • DAG Directory: the directory of DAG programs; put your own programs into this directory, and AirFlow's WebServer and Scheduler will read them automatically
        • AirFlow keeps all programs in this one directory
        • It automatically detects new programs appearing in this directory
      • MetaData DataBase: AirFlow's metadata database, which records the information of all DAG programs
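      • Where the components read from (a minimal sketch, not part of the original notes, assuming a default Airflow 2.x installation): the DAG Directory path and the configured Executor can be inspected from Python through Airflow's configuration API

        # Minimal sketch (assumption: Airflow 2.x installed and initialized with `airflow db init`)
        from airflow.configuration import conf

        # Directory that the Scheduler and WebServer scan for DAG programs
        print(conf.get("core", "dags_folder"))
        # Executor type currently configured, e.g. SequentialExecutor by default
        print(conf.get("core", "executor"))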
  • summary

    • Understand the architectural components of AirFlow

Knowledge point 06: AirFlow development rules

  • Goal : Master the development rules of AirFlow

  • path

    • step1: Develop a Python scheduling program
    • step2: Submit the Python scheduling program
  • implement

    • official document

      • Concepts: http://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html
      • Example: http://airflow.apache.org/docs/apache-airflow/stable/tutorial.html

    • Develop a Python scheduling program

      • To develop a Python program, the program file needs to contain the following parts

      • Note: running this file does not support UTF-8 encoding, so do not write Chinese in it

      • step1: Import packages

        # Required: import the airflow DAG workflow class
        from airflow import DAG
        # Required: import the concrete Task Operator type
        from airflow.operators.bash import BashOperator
        # Optional: import the scheduling time utilities
        from airflow.utils.dates import days_ago
        # Also required below: timedelta for the retry and schedule intervals
        from datetime import timedelta
        

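        • A hedged note, not part of the original: in newer AirFlow releases days_ago() is deprecated, and an explicit datetime can be passed as start_date instead

          # Sketch (assumption: a newer Airflow version where days_ago() is deprecated)
          from datetime import datetime

          # Pass this value directly as start_date when defining the DAG below
          start_date = datetime(2021, 10, 1)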

      • step2: Define DAG and configuration

        # Base configuration of the current workflow
        default_args = {
            # Owner of the current workflow
            'owner': 'airflow',
            # Email recipients for alerts from this workflow
            'email': ['[email protected]'],
            # Whether to send an email alert when the workflow fails
            'email_on_failure': True,
            # Whether to send an email alert when the workflow retries
            'email_on_retry': True,
            # Number of retries
            'retries': 2,
            # Interval between retries
            'retry_delay': timedelta(minutes=1),
        }
        
        # Define the DAG object of the current workflow
        dagName = DAG(
            # Name of the current workflow, a unique id
            'airflow_name',
            # Parameter configuration to use
            default_args=default_args,
            # Description of the current workflow
            description='first airflow task DAG',
            # Scheduling period of the current workflow: periodic scheduling [optional]
            schedule_interval=timedelta(days=1),
            # Time from which the workflow starts to be scheduled
            start_date=days_ago(1),
            # Tags marking which group this workflow belongs to
            tags=['itcast_bash'],
        )
        
        • Builds an instance of a DAG workflow together with its configuration
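        • Equivalent context-manager style (a sketch, not part of the original notes, assuming the same default_args as above): newer AirFlow code often defines the DAG in a with block, so Tasks created inside it attach to the DAG automatically

          # Sketch of the equivalent with-block style
          from datetime import timedelta
          from airflow import DAG
          from airflow.utils.dates import days_ago

          # Assumes the default_args dictionary defined above
          with DAG(
              'airflow_name',
              default_args=default_args,
              description='first airflow task DAG',
              schedule_interval=timedelta(days=1),
              start_date=days_ago(1),
              tags=['itcast_bash'],
          ) as dag:
              # Tasks instantiated here do not need an explicit dag= argument
              pass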
      • step3: Define Tasks

        • Task type: http://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html

        • Commonly used: BashOperator and PythonOperator (shown below)

        • Other operator types are listed in the documentation linked above

        • BashOperator: defines a Shell command Task

          # Import BashOperator
          from airflow.operators.bash import BashOperator
          # Define a Task object
          t1 = BashOperator(
              # Unique name of the Task
              task_id='first_bashoperator_task',
              # The specific Linux command to execute
              bash_command='echo "hello airflow"',
              # The DAG object this Task belongs to
              dag=dagName
          )
          
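          • bash_command is a templated field (a sketch, not part of the original notes, assuming AirFlow's default Jinja macros), so the logical run date macro {{ ds }} can be embedded in the command

            # Sketch: BashOperator with a templated command (assumes dagName from step2)
            from airflow.operators.bash import BashOperator

            t1_templated = BashOperator(
                task_id='templated_bash_task',
                # {{ ds }} is rendered to the logical execution date at run time
                bash_command='echo "run date: {{ ds }}"',
                dag=dagName
            )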
        • PythonOperator: defines a Python code Task

          # Import PythonOperator
          from airflow.operators.python import PythonOperator

          # Define the code logic to execute
          def sayHello():
              print("this is a program")

          # Define a Task object
          t2 = PythonOperator(
              # Unique name of the Task
              task_id='first_pyoperator_task',
              # The Python function to call
              python_callable=sayHello,
              # The DAG object this Task belongs to
              dag=dagName
          )

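          • Passing arguments to the callable (a sketch, not part of the original notes): PythonOperator forwards op_kwargs to python_callable as keyword arguments

            # Sketch: PythonOperator with keyword arguments (assumes dagName from step2)
            from airflow.operators.python import PythonOperator

            def sayHelloTo(name):
                print("hello " + name)

            t3 = PythonOperator(
                task_id='py_task_with_args',
                python_callable=sayHelloTo,
                # Forwarded to the callable as sayHelloTo(name='airflow') at run time
                op_kwargs={'name': 'airflow'},
                dag=dagName
            )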
      • step4: Run Task and specify dependencies

        • Define the Tasks

          Task1:runme_0
          Task2:runme_1
          Task3:runme_2
          Task4:run_after_loop
          Task5:also_run_this
          Task6:this_will_skip
          Task7:run_this_last
          
        • Requirement

          • Task1, Task2, and Task3 run in parallel; Task4 runs after they all finish
          • Task4, Task5, and Task6 run in parallel; Task7 runs after they all finish

        • The code (an equivalent list form is sketched below)

          task1 >> task4
          task2 >> task4
          task3 >> task4
          task4 >> task7
          task5 >> task7
          task6 >> task7
          
        • If there is only one Task, just write the name of the Task object directly

          task1
          
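        • Equivalent list form (a sketch of the same dependencies): the >> operator also accepts a list of Tasks

          # Sketch: the same dependency graph written with Task lists
          [task1, task2, task3] >> task4
          [task4, task5, task6] >> task7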
    • Submit the Python scheduling program

      • Either way of submitting requires waiting for a short while

      • Automatic submission: need to wait for automatic detection

        • Put the developed program into the DAG Directory of AirFlow
        • The default path is: /root/airflow/dags
      • Manual submission: manually run the file to have AirFlow load it

        python xxxx.py
        
      • Scheduling status

        • No status (scheduler created empty task instance): The scheduled task has been created, but no task instance has been generated yet

        • Scheduled (scheduler determined task instance needs to run): The scheduled task has generated a task instance and is waiting to run

        • Queued (scheduler sent task to executor to run on the queue): the task instance is waiting in the executor's queue before being executed

        • Running (worker picked up a task and is now running it): The task is being executed on the worker node

        • Success (task completed): Task execution completed successfully
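        • Note (not part of the original): these states correspond to constants in AirFlow's code base, which is how they are stored in the metadata database; a minimal sketch assuming Airflow 2.x

          # Sketch (assumption: Airflow 2.x)
          from airflow.utils.state import State

          print(State.NONE, State.SCHEDULED, State.QUEUED, State.RUNNING, State.SUCCESS)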

  • summary

    • Master the development rules of AirFlow
