Airflow scheduling principles and K8s scheduling principles

 

Airflow

 

 

Airflow is a task-scheduling component that defines entire workflows as DAGs (directed acyclic graphs). It addresses needs that plain crontab scheduling cannot: task dependencies, a web service, task pausing, and so on. Airflow also integrates well with Python, Spark, Hive, K8s, etc.

  • Airflow architecture

Airflow consists of the following components:

Metadata database (stores DAG and task state)

Worker (the executor process, responsible for running tasks)

Scheduler (responsible for triggering tasks)

Webserver (the web service)

 

  • Principles of Airflow task scheduling

Airflow's multi-process scheduling works roughly as follows:

First, Airflow creates a DagFileProcessor process pool that traverses all DAG files in the dags folder, one process per DAG file. Each process writes the resulting DagRuns (the state of the graph) and TaskInstances into the metadata database; at this point each TaskInstance is marked as Queued.

Meanwhile, a SchedulerJob object periodically polls the database, pushes the TaskInstances marked Queued into the executor queue, and updates their status in the database to Scheduled.

Each available executor takes a TaskInstance from the executor queue, runs it, and marks its status in the database as Running.

When a TaskInstance finishes, the executor records its final status in the database (success, failed, skipped, etc.), and tasks are processed in turn according to the dependencies of the graph.

When a process finishes one DAG file, it repeats the above steps for the next one.

Once all DAG files have been processed, Airflow starts the next cycle. If a DAG takes too long to process, the process pool skips it in the next cycle to avoid blocking.
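The state transitions above can be sketched as a small simulation. This is a hypothetical, heavily simplified model of the Queued → Scheduled → Running → finished lifecycle, not Airflow's real classes:

```python
from enum import Enum

class State(Enum):
    QUEUED = "queued"        # set by the DagFileProcessor after parsing the DAG file
    SCHEDULED = "scheduled"  # set by the SchedulerJob when pushed to the executor queue
    RUNNING = "running"      # set by the executor when the task starts
    SUCCESS = "success"      # final state recorded by the executor

class TaskInstance:
    def __init__(self, task_id):
        self.task_id = task_id
        self.state = State.QUEUED

def scheduler_pass(db, executor_queue):
    """SchedulerJob: move QUEUED task instances into the executor queue."""
    for ti in db:
        if ti.state is State.QUEUED:
            ti.state = State.SCHEDULED
            executor_queue.append(ti)

def executor_pass(executor_queue):
    """Executor: run each scheduled task and record its final state."""
    while executor_queue:
        ti = executor_queue.pop(0)
        ti.state = State.RUNNING
        # ... the task's actual work would happen here ...
        ti.state = State.SUCCESS

db = [TaskInstance("extract"), TaskInstance("load")]
queue = []
scheduler_pass(db, queue)
executor_pass(queue)
print([ti.state.value for ti in db])  # ['success', 'success']
```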

 

  • Airflow's Celery mechanism:

CeleryExecutor is mainly used to scale out the pool of workers. To use it, you need to deploy the supporting services (a broker and result backend such as RabbitMQ or Redis), and change the executor in Airflow's configuration to CeleryExecutor.
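As a sketch, the relevant entries in airflow.cfg might look like the following; the Redis and PostgreSQL connection strings are placeholders to be replaced with your own infrastructure:

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder connection strings -- substitute your own broker and backend.
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```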

 

1. The webserver fetches task-execution logs from the workers

2. The webserver reads the DAG structure from the DAG files

3. The webserver reads task status from the database

4. Workers read the DAG structure from the DAG files and execute the tasks

5. Workers read and store connection configuration, variables, and XCom data in the database

6. Workers save task status to the Celery result backend

7. Workers store the commands they execute in the Celery queue broker

8. The scheduler stores DAG-run state and related tasks in the database

9. The scheduler reads the DAG structure from the DAG files

10. The scheduler gets the status of completed tasks from the Celery result backend

11. The scheduler places the commands to be executed on the Celery broker.

  • Airflow DAG creation

An Airflow DAG is created with a Python script that defines the tasks and the dependencies between them. Take the DAG created below as an example.

First, import DAG and BashOperator from the airflow package.

Then create a dictionary of default arguments:

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import timedelta

args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

'owner' defaults to 'airflow' if not specified.

Then create the DAG object:

dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    dagrun_timeout=timedelta(minutes=60),
)

dag_id is the attribute that uniquely identifies the DAG. default_args sets the DAG's default parameters. schedule_interval is the execution interval of the DAG (here a cron expression meaning daily at midnight). dagrun_timeout is the timeout for a DAG run: if the run has not finished within this time, its remaining unexecuted tasks are marked as failed.

Create a task:

run_this_last = DummyOperator(
    task_id='run_this_last',
    dag=dag,
)

Depending on the operation, tasks come in the form of DummyOperator(), PythonOperator(), BashOperator(), and so on; users can also write custom operators. Each is instantiated with parameters such as task_id and dag. Several more tasks are defined below, one after another.

# [START howto_operator_bash]
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    dag=dag,
)
# [END howto_operator_bash]
run_this >> run_this_last
for i in range(3):
    task = BashOperator(
        task_id='runme_' + str(i),
        bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        dag=dag,
    )
    task >> run_this

# [START howto_operator_bash_template]
also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag,
)
# [END howto_operator_bash_template]
also_run_this >> run_this_last
The dependencies between tasks are defined with << and >>. For example, also_run_this >> run_this_last means that run_this_last executes after also_run_this has finished, which is equivalent to run_this_last << also_run_this.

You can also set dependencies through set_upstream and set_downstream: also_run_this.set_downstream(run_this_last) is likewise equivalent to run_this_last.set_upstream(also_run_this).
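Under the hood, Airflow implements the >> and << syntax by overloading Python's shift operators. A minimal toy sketch of the idea, using a hypothetical Task class rather than Airflow's actual operator code:

```python
class Task:
    """Toy stand-in for an Airflow operator, showing how >> can wire dependencies."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []    # tasks that must finish before this one
        self.downstream = []  # tasks that run after this one

    def set_downstream(self, other):
        self.downstream.append(other)
        other.upstream.append(self)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):   # self >> other
        self.set_downstream(other)
        return other               # returning `other` allows chaining: a >> b >> c

    def __lshift__(self, other):   # self << other
        self.set_upstream(other)
        return other

also_run_this = Task('also_run_this')
run_this_last = Task('run_this_last')
also_run_this >> run_this_last
print([t.task_id for t in run_this_last.upstream])  # ['also_run_this']
```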

Put the script created above in Airflow's default dags folder; Airflow will scan the DAG and execute it automatically on its schedule. You can also trigger the DAG manually through the command-line interface (CLI) or the web UI.

  • Airflow's web UI

Airflow has a powerful web UI that supports viewing the DAG as a graph or a tree, viewing logs, source code, and execution status, pausing tasks, and other operations.

 

K8S

K8s (Kubernetes) is a container-orchestration platform with a microservice-oriented architecture; it runs workloads by scheduling pods onto nodes. Its distributed architecture consists of a cluster control plane (the Master) and multiple worker nodes, and its core scheduling problem is matching pods to the resources available on nodes. The Master node is responsible for managing the entire cluster, provides the cluster's management interface, and monitors and orchestrates each worker node.

Each worker node runs containers in the form of pods, and each node must be configured with the services and resources that its containers depend on.

K8s scheduling

Affinity / anti-affinity

Some services interact frequently or share common resources, so their pods should be deployed on the same node or in the same topology domain; other services should instead be spread across different nodes or regions for security and disaster tolerance. These business considerations lead to the affinity and anti-affinity deployment rules.

With node affinity, a pod can declare that it must be deployed on nodes carrying a certain label. For example, a pod that requires fast I/O can demand nodes with SSD disks: it can only be scheduled onto such nodes, otherwise scheduling fails. A later, improved version introduced the concepts of hard and soft requirements. A hard requirement behaves as above; a soft requirement means the condition is preferred, but the pod can still be scheduled even if it is not met.

Pod affinity / anti-affinity means that, before deployment, the scheduler checks whether a pod with a certain label already exists on the node or in the domain; the scope of this check is determined by the topology domain defined on the pod.
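As an illustration, the SSD example above maps onto the affinity stanza of a pod spec. The snippet below expresses it as a Python dict mirroring the YAML manifest; the disktype and zone labels are assumed example values:

```python
# Pod spec fragment as a Python dict mirroring the YAML manifest.
# "requiredDuring..." is the hard requirement; "preferredDuring..." is the soft one.
pod_spec = {
    "affinity": {
        "nodeAffinity": {
            # Hard requirement: only schedule onto nodes labeled disktype=ssd.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [
                        {"key": "disktype", "operator": "In", "values": ["ssd"]}
                    ]
                }]
            },
            # Soft requirement: prefer (but do not insist on) a given zone.
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": 1,
                "preference": {
                    "matchExpressions": [
                        {"key": "zone", "operator": "In", "values": ["zone-a"]}
                    ]
                },
            }],
        }
    }
}
print(sorted(pod_spec["affinity"]["nodeAffinity"]))
```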

Taints and tolerations

A node can define taints, meaning that only pods which tolerate those taints can be scheduled onto it; any other pod cannot be deployed there.

A pod can define tolerations to match a node's taints.

 

 

 

 

 


Origin blog.csdn.net/wangyhwyh753/article/details/103427926