Supporting the Industrial Internet of Things and the Industrial Big Data Service Domain: Introduction to AirFlow [31]

02: Task flow scheduling review

  • Goal: Review the requirements and common tools for task flow scheduling

  • Path

    • step1: Requirements
    • step2: Common tools
  • Implementation

    • Requirements

      • Within the same business line, different requirements are implemented by different programs. Combining these programs according to their requirements forms a workflow, also called a task flow.

      • The task flow is then run automatically as a workflow


      • Requirement 1: Time-based task execution

        • job1 and job2 run automatically after 0:00 every day
      • Requirement 2: Dependency-based task execution

        • job3 must wait for job1 to finish successfully before it can run
        • job5 must wait for both job3 and job4 to finish successfully before it can run
      • Scheduling types (see the sketch after this list)

        • Time-based scheduling: run the workflow according to a time rule
          • Schedules the workflow as a whole
        • Dependency scheduling: run tasks according to their dependency relationships
          • Dependencies between the programs in a workflow
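      • As a rough illustration of dependency scheduling (plain Python, independent of any particular scheduler; the job names follow the hypothetical job1–job5 above):

          # Model the dependency requirement as "job -> jobs it must wait for"
          dependencies = {
              "job1": [],
              "job2": [],
              "job3": ["job1"],          # job3 waits for job1
              "job4": [],
              "job5": ["job3", "job4"],  # job5 waits for job3 and job4
          }

          def run_order(deps):
              """Return a run order that respects the dependencies (simple topological sort)."""
              done, order = set(), []
              while len(done) < len(deps):
                  ready = [j for j, reqs in deps.items()
                           if j not in done and all(r in done for r in reqs)]
                  if not ready:
                      raise ValueError("cyclic dependency")
                  for j in ready:
                      order.append(j)
                      done.add(j)
              return order

          print(run_order(dependencies))  # ['job1', 'job2', 'job4', 'job3', 'job5']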
    • Common tools

      • Oozie: developed by Cloudera; feature-rich, relies on MapReduce (MR) to achieve distributed execution, and integrates very conveniently with Hue for development and use

        • Traditional development: xml file

          <workflow-app name="example-workflow" xmlns="uri:oozie:workflow:0.5">
              <start to="action1"/>
              <action name="action1">
                  <shell xmlns="uri:oozie:shell-action:0.2">
                      <!-- shell action configuration -->
                  </shell>
                  <ok to="action2"/>
                  <error to="killAction"/>
              </action>
              <action name="action2">
                  <shell xmlns="uri:oozie:shell-action:0.2">
                      <!-- shell action configuration -->
                  </shell>
                  <ok to="action3"/>
                  <error to="killAction"/>
              </action>
              ……
              <kill name="killAction">
                  <message>Workflow failed</message>
              </kill>
              <end name="end"/>
          </workflow-app>
          
        • Current development: Hue edits the DAG directly through a graphical interface

        • Scenario: CDH big data platform

        • Azkaban: developed by LinkedIn; friendly web interface, rich plug-in support, its own distributed execution, and workflows can be developed with properties or JSON files

          • Develop the properties file and compress it into a zip archive

            # appname2.job (the job name comes from the file name)
            type=command
            dependencies=appname1
            command=sh xxxx.sh
            
          • Upload to the web interface

          • Scenario: Apache platform

        • AirFlow: developed by Airbnb; its own distributed execution, workflows are developed and operated in Python, and it covers a richer range of application scenarios

          • Develop Python files (a minimal DAG sketch is shown after this list)

            # step1: import the required packages
            # step2: call functions (define the DAG and its tasks)
            
          • Submit to run

          • Scenario: The entire data platform is developed based on Python

        • DolphinScheduler: developed and open-sourced by Analysys in China; highly reliable, highly scalable, and easy to use
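
      • As a rough sketch of what an AirFlow Python file looks like (the DAG id, task ids, and echo commands are made up for illustration; the operator import path follows Airflow 2.x and may differ in other versions):

          from datetime import datetime

          from airflow import DAG
          from airflow.operators.bash import BashOperator

          # Requirement 1: run automatically every day at 00:00
          with DAG(
              dag_id="example_dag",
              start_date=datetime(2021, 1, 1),
              schedule_interval="0 0 * * *",
              catchup=False,
          ) as dag:
              job1 = BashOperator(task_id="job1", bash_command="echo job1")
              job2 = BashOperator(task_id="job2", bash_command="echo job2")
              job3 = BashOperator(task_id="job3", bash_command="echo job3")
              job4 = BashOperator(task_id="job4", bash_command="echo job4")
              job5 = BashOperator(task_id="job5", bash_command="echo job5")

              # Requirement 2: job3 waits for job1; job5 waits for job3 and job4
              job1 >> job3
              [job3, job4] >> job5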

  • Summary

    • Review the requirements and common tools for task flow scheduling

03: Introduction to AirFlow

  • Objective: Understand the features and application scenarios of AirFlow

  • Path

    • step1: Background
    • step2: Design
    • step3: Function
    • step4: Features
    • step5: Application
  • Implementation


    • Origin
      • In 2014, Airbnb created the workflow scheduling system Airflow to handle the complex ETL processing in its business: from cleaning to joining, everything only needs to be set up as a set of Airflow flow graphs.
      • In 2016, it entered the Apache Foundation incubator.
      • In 2019, it became a top-level project of the Apache Foundation: http://airflow.apache.org/.
    • Design: take advantage of the portability and versatility of Python to quickly build a task flow scheduling platform
    • Function: implement dependency scheduling and time-based scheduling in Python
    • Features
      • Distributed task scheduling: the tasks of a workflow can be executed on multiple workers at the same time
      • DAG task dependencies: task dependencies are built as a directed acyclic graph
      • Task atomicity: each task in a workflow is an atomic, retryable unit; if one step of a workflow fails, it can be retried automatically or manually (a retry sketch is shown below)
      • Self-customization: any task or processing tool you need to schedule can be constructed in code
        • Advantage: high flexibility
        • Disadvantage: development is more complex
    • Application
      • Recommended when the platform is developed primarily in Python
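    • As a rough illustration of the retry behavior described above (the DAG id, task id, and command are made up; retries and retry_delay are standard Airflow task parameters):

        from datetime import datetime, timedelta

        from airflow import DAG
        from airflow.operators.bash import BashOperator

        with DAG(
            dag_id="retry_example",
            start_date=datetime(2021, 1, 1),
            schedule_interval=None,  # triggered manually
        ) as dag:
            # If this task fails, AirFlow retries it automatically up to 3 times,
            # waiting 5 minutes between attempts; it can also be re-run manually
            # from the web UI without re-running the whole workflow.
            flaky_task = BashOperator(
                task_id="flaky_task",
                bash_command="sh some_job.sh",
                retries=3,
                retry_delay=timedelta(minutes=5),
            )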
  • Summary

    • Understand the features and application scenarios of AirFlow

04: AirFlow deployment and startup

  • Goal: Understand AirFlow tool deployment and management

  • Path

    • step1: Installation and deployment
    • step2: Startup and test
    • step3: Shutdown
  • Implementation

    • Installation and deployment

      • Install it yourself: see "Appendix 1"
      • Skip the installation: restore the virtual machine snapshot "AirFlow installation complete"
    • Startup and test

      • Delete old run records (only needed from the second startup onward)

        rm -f /root/airflow/airflow-*
        
      • Start Redis (the message queue):

        • nohup keeps the redis-server process running after the terminal closes; /opt/redis-4.0.9/src/redis-server is the server binary
        • /opt/redis-4.0.9/src/redis.conf is the Redis configuration file to load
        • output.log is the file that stores the logs
        • The 2 in 2>&1 stands for stderr; it is redirected into stdout and therefore into output.log, otherwise error messages would be printed on the Linux command line
        • The trailing & runs the process in the background
        nohup /opt/redis-4.0.9/src/redis-server /opt/redis-4.0.9/src/redis.conf > output.log 2>&1 &
        ps -ef | grep redis
        


      • Start AirFlow

        # Start the services as background (daemon) processes
        airflow webserver -D        # web UI
        airflow scheduler -D        # scheduler that triggers DAG runs
        airflow celery flower -D    # Celery monitoring web UI
        airflow celery worker -D    # Celery worker that executes the tasks
        


        • Test the web ports

          • AirFlow Web UI: node1:8085

            • Username / password: admin

          • Celery (Flower) Web UI: node1:5555

  • Summary

    • Understand AirFlow tool deployment and management
