Article Directory
02: Task flow scheduling review
-
Goal : Review the requirements and common tools for task flow scheduling
-
path
- step1: Requirements
- step2: common tools
-
implementation
-
requirements
-
Within the same business line, different requirements are implemented by multiple programs. Chaining the requirements of these programs together forms a workflow, also called a task flow.
-
The task flow then needs to run automatically as a workflow.
-
Requirement 1: Time-based task execution
- job1 and job2 run automatically after 0:00 every day
-
Requirement 2: Dependency-based task execution
- job3 must wait for job1 to run successfully before it can run
- job5 must wait for both job3 and job4 to run successfully before it can run
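As an illustration of the time-based requirement, here is a minimal sketch (plain Python, not tied to any particular scheduler; the function name is hypothetical) that computes the next 0:00 trigger time for job1 and job2:

```python
from datetime import datetime, timedelta

def next_midnight(now: datetime) -> datetime:
    """Return the next 0:00 after `now`, i.e. when job1 and job2 should fire."""
    # Truncate to today's midnight, then advance one day.
    return datetime(now.year, now.month, now.day) + timedelta(days=1)

# At 2024-01-15 08:30, the next run is 2024-01-16 00:00.
print(next_midnight(datetime(2024, 1, 15, 8, 30)))
```

A real scheduler evaluates a rule like this (usually expressed as a cron string) on every tick and fires the jobs whose trigger time has arrived.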
-
scheduling types
- Time-based scheduling: run jobs according to a time rule (e.g. every day at 0:00)
- Schedules the workflow as a whole
- Dependency-based scheduling: run jobs according to their dependency relationships
- Schedules the programs inside a workflow relative to each other
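Dependency-based scheduling amounts to running jobs in a topological order of the dependency graph. A minimal sketch (plain Python standard library, using the job names from the example above; this is an illustration, not any scheduler's real API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Dependencies from the example: job3 needs job1; job5 needs job3 and job4.
deps = {
    "job3": {"job1"},
    "job5": {"job3", "job4"},
}

# static_order() yields each job only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A real scheduler does the same resolution, but also waits for each upstream job to finish *successfully* before launching its downstream jobs.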
-
-
Common tools
-
Oozie: developed by Cloudera. Powerful, relies on MapReduce for distributed execution, and integrates conveniently with Hue for development and use.
-
Traditional development: hand-written XML file
<workflow>
    <start to="action1"/>
    <action name="action1">
        <shell> </shell>
        <ok to="action2"/>
        <error to="killAction"/>
    </action>
    <action name="action2">
        <shell> </shell>
        <ok to="action3"/>
        <error to="killAction"/>
    </action>
    ...
</workflow>
-
Current practice: build the DAG visually through Hue's graphical interface
-
Scenario: CDH big data platform
-
-
Azkaban: developed by LinkedIn. Friendly web interface, rich plug-in support, distributes on its own (no MapReduce dependency), and workflows can be defined with properties or JSON files.
-
Write the properties file and compress it into a zip archive
name='appname2'
type=command
dependencies=appname1
command='sh xxxx.sh'
-
Upload to the web interface
-
Scenario: Apache platform
-
-
AirFlow: developed by Airbnb. Distributes on its own, workflows are developed and operated in Python, and it covers a richer set of application scenarios.
-
Develop a Python file
# step1: import packages
# step2: call functions to define tasks and dependencies
-
Submit to run
-
Scenario: The entire data platform is developed based on Python
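The Python file sketched above could look like the following (a sketch assuming Airflow 2.x; the DAG id, schedule, and shell commands are illustrative, not from the source, and it encodes the two requirements from earlier: a daily 0:00 trigger plus the job1/job3/job4/job5 dependencies):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Runs every day at 0:00 (requirement 1); the >> arrows encode requirement 2.
with DAG(
    dag_id="example_flow",          # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * *",  # daily at midnight
    catchup=False,
) as dag:
    job1 = BashOperator(task_id="job1", bash_command="echo job1")
    job2 = BashOperator(task_id="job2", bash_command="echo job2")
    job3 = BashOperator(task_id="job3", bash_command="echo job3")
    job4 = BashOperator(task_id="job4", bash_command="echo job4")
    job5 = BashOperator(task_id="job5", bash_command="echo job5")

    job1 >> job3            # job3 waits for job1
    [job3, job4] >> job5    # job5 waits for both job3 and job4
```

The file is dropped into Airflow's dags folder; the scheduler picks it up and runs it on the declared schedule, so it behaves like a configuration fragment rather than a standalone script.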
-
-
DolphinScheduler: developed by Analysys, an open-source project from China. High reliability, high scalability, and easy to use.
-
-
summary
- Review the requirements and common tools for task flow scheduling
03: Introduction to AirFlow
-
Objective : To understand the functional characteristics and application scenarios of AirFlow
-
path
- step1: background
- step2: design
- step3: function
- step4: Features
- step5: application
-
implementation
- origin
In 2014, Airbnb created the workflow scheduling system Airflow to handle the complex ETL processing in its business: from cleaning to joining, an entire pipeline can be described as a single Airflow flow chart.
- In 2016, it was open sourced to the Apache Foundation.
- In 2019, it became a top-level project of the Apache Foundation: http://airflow.apache.org/.
Design: leverage Python's portability and versatility to quickly build a task-flow scheduling platform
- Function : implement dependency scheduling and timing scheduling based on Python
- features
- Distributed task scheduling: the tasks of one workflow can run in parallel on multiple workers
- DAG task dependencies: task dependencies are built as a directed acyclic graph
- Task atomicity: every task in a workflow is an atomic, retryable unit; when one step of a workflow fails, that task can be retried automatically or manually without rerunning the whole flow
- Self-customization: any task or processing tool you need to schedule can be built in code
- Advantage: great flexibility
- Disadvantage: development is more complex
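The atomic-retry feature can be pictured as each task carrying its own retry policy. A minimal sketch (plain Python; the wrapper and the flaky task are hypothetical illustrations, not Airflow's actual implementation):

```python
def run_with_retries(task, max_retries=3):
    """Run a task; on failure, retry the whole task up to max_retries times."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # give up after the last attempt; a human can rerun manually

# A task that fails twice and then succeeds: the retry wrapper absorbs the failures.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # ok
```

Because only the failed task reruns (not the whole workflow), upstream results are preserved, which is what makes per-task atomicity valuable.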
- application
Recommended when the platform is developed primarily in Python
-
summary
- Understand the features and application scenarios of AirFlow
04: AirFlow deployment and startup
-
Goal : Understand AirFlow tool deployment and management
-
path
- step1: Install and deploy
- step2: Start the test
- step3: Shutdown
-
implementation
-
Installation and deployment
- Install it yourself: see "Appendix 1"
- Skip the installation: restore the virtual machine snapshot to "AirFlow installation complete"
-
start and test
-
Delete old run records (needed from the second startup onward):
rm -f /root/airflow/airflow-*
-
Start Redis (the message queue):
- nohup: keep the redis process running after the terminal session ends, /opt/redis-4.0.9/src/redis-server
- Load the redis configuration file, /opt/redis-4.0.9/src/redis.conf
- output.log stores the process output
- 2>&1: file descriptor 2 is stderr; redirecting it to stdout sends error messages into output.log instead of printing them on the Linux command line
- &: run in the background
nohup /opt/redis-4.0.9/src/redis-server /opt/redis-4.0.9/src/redis.conf > output.log 2>&1 &
ps -ef | grep redis
-
Start AirFlow
# start each service as a background daemon process
airflow webserver -D
airflow scheduler -D
airflow celery flower -D
airflow celery worker -D
-
test the web ports
-
Airflow Web UI:
node1:8085
-
Username and password: admin
-
Celery Web UI:
node1:5555
-
-
summary
- Understand AirFlow tool deployment and management