2. Airflow series: summary of problems after a multi-worker-node installation on K8S

1. Unable to log in after uninstalling and reinstalling

Recall that in install.sh I disabled the user-creation job and also dropped the airflow_plus database. With the configuration below, no default user is created, so I could not log in. The fix is to set it back to true and reinstall.

# disable the user-creation job
--set webserver.defaultUser.enabled=false \
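A minimal sketch of the corrected reinstall, assuming the official apache-airflow/airflow chart with both the release and the namespace named airflow; keep the other flags from install.sh as they are:

# re-enable the job that creates the default admin user
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  --set webserver.defaultUser.enabled=true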

2. Task failure caused by worker node restart

The cause was a version conflict among the custom-installed dependencies. Checking the events of the restarted Worker pod showed the following:

"/home/airflow/.local/lib/python3.7/site-packages/kombu/entity.py", line 7, in <module> from .serialization import prepare_accept_content File "/home/airflow/.local/lib/python3.7/site-packages/kombu/serialization.py", line 440, in <module> for ep, args in entrypoints('kombu.serializers'): # pragma: no cover File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/compat.py", line 82, in entrypoints for ep in importlib_metadata.entry_points().get(namespace, []) AttributeError: 'EntryPoints' object has no attribute 'get'

Check the dependencies and pin the conflicting package to a fixed version:

importlib-metadata==4.13.0
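One way to apply the pin, assuming the workers install their dependencies from the requirements.txt shipped with the DAG repo (the path is the one used in section 3 below):

# pin the conflicting package in the repo's requirements file
echo "importlib-metadata==4.13.0" >> /opt/airflow/dags/repo/dags/requirements.txt
# or install the pinned version directly on an affected worker
pip install "importlib-metadata==4.13.0"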

3. Dynamic dependency installation problem

  • For a single worker node, you can run an install routine before each task; it works but is cumbersome and not very practical:
import logging
import os
log = logging.getLogger(__name__)

def install():
    log.info("begin install requirements")
    os.system("pip install -r /opt/airflow/dags/repo/dags/requirements.txt")
    os.system("pip install -I my_utils")
    log.info("finish install requirements")

Import it at the beginning of the execution file of the first task in the DAG, so that the install log shows up in the WebServer task log:

import sys
sys.path.insert(0, '/opt/airflow/dags/repo')
import dags.install as install
install.install()
  • For multiple worker nodes, since consecutive tasks in a flow may run on different workers, this approach does not really hold up; the simplest workaround is to install the dependencies manually on every worker, as sketched below.
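A sketch of the manual installation, assuming the workers carry the chart's component=worker label, run in the airflow namespace, and use a container named worker; adjust the names to your deployment:

# install the repo's requirements on every worker pod
for pod in $(kubectl -n airflow get pods -l component=worker -o name); do
  kubectl -n airflow exec "$pod" -c worker -- \
    pip install -r /opt/airflow/dags/repo/dags/requirements.txt
done
# note: anything installed this way is lost when a worker pod restarts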

4. The triggerer fails to schedule properly after startup

{triggerer_job.py:101} INFO - Starting the triggerer
[2023-03-17T13:47:30.947+0000] {triggerer_job.py:348} ERROR - Triggerer's async thread was blocked for 0.36 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.
[2023-03-17T22:15:17.394+0000] {triggerer_job.py:348} ERROR - Triggerer's async thread was blocked for 0.27 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.

This appears to be a version problem. I changed the Airflow version to 2.2.1, uninstalled, cleared the database, and reinstalled [I troubleshot this for a long time without finding a better fix]. Note that the version must be written in values.yaml and removed from install.sh, otherwise it does not take effect:

airflowVersion: 2.2.1
defaultAirflowTag: 2.2.1
config:
  core:
    dags_folder: /opt/airflow/dags/repo/dags
    hostname_callable: airflow.utils.net.get_host_ip_address
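A sketch of the reinstall, assuming both the release and the namespace are named airflow; the version now comes from values.yaml only, so there must be no --set airflowVersion / --set defaultAirflowTag flags on the command line:

helm uninstall airflow -n airflow
# clear the metadata database here, as described above, then:
helm install airflow apache-airflow/airflow -n airflow -f values.yaml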

5. .airflowignore file problem

The .airflowignore file is used to skip non-DAG files during DAG parsing; it is placed under dags_folder. dags_folder is configured in values.yaml; I set it to /opt/airflow/dags/repo/dags, so the .airflowignore file goes there.

For example, to ignore the non-DAG .py files under the merge directory, the entry is:

jh/merge/*

With this entry, a file in the jh directory cannot be named merge_dag.py, or its DAG will not show up in the WebServer: .airflowignore entries are regular expressions, and jh/merge/* (where /* means "zero or more slashes", not a wildcard) matches any path containing jh/merge, including jh/merge_dag.py. We solved this by uniformly switching the DAG file name prefix to dag_.
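For illustration, a rename that takes the file out of the pattern's reach (the paths are hypothetical, based on the dags_folder configured above):

# jh/dag_merge.py no longer contains the substring jh/merge, so it gets parsed
mv /opt/airflow/dags/repo/dags/jh/merge_dag.py \
   /opt/airflow/dags/repo/dags/jh/dag_merge.py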

6. Mounting problem

I tried mounting a cephfs directory into the worker nodes. The mount itself succeeded, but the workers kept blocking and eventually failing, so I abandoned storing intermediate tables as CSV files and wrote them to ClickHouse instead.

7. Scheduler 401 permission issue

Make sure the executor label on the Scheduler's Deployment is CeleryKubernetesExecutor; any other value causes this problem.
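A quick way to check and correct this, assuming both the release and the namespace are named airflow:

# the Scheduler Deployment should carry the label executor=CeleryKubernetesExecutor
kubectl -n airflow get deployment airflow-scheduler --show-labels
# if it does not, upgrade the release with the executor set explicitly
helm upgrade airflow apache-airflow/airflow -n airflow \
  --set executor=CeleryKubernetesExecutor -f values.yaml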

8. Scheduler and Triggerer health check problems

Liveness probe failed: No alive jobs found

Modify the liveness probe configuration in values.yaml and reinstall:

scheduler:
  livenessProbe:
    command: ["bash", "-c", "airflow jobs check --job-type SchedulerJob --allow-multiple --limit 100"]
triggerer:
  livenessProbe:
    command: ["bash", "-c", "airflow jobs check --job-type TriggererJob --allow-multiple --limit 100"]

I hope this is helpful to you. You are welcome to follow my official account, Algorithm Niche.

Origin blog.csdn.net/SJshenjian/article/details/129643119