Airflow (v1.10) Task Scheduling Platform Installation Guide

0.5 Background

I really could not figure it out: whether measured by GitHub stars or by community activity, Airflow is far ahead of Azkaban and EasyScheduler, yet there was not a single complete installation tutorial to be found. Is it just too advanced for my needs? I was thoroughly worn out; repeated rounds with search engines and YouTube left me unsatisfied. Fortunately, after stepping into every pit along the way, I finally got a working environment and working Operators. So, without further ado, here is the Airflow installation tutorial.

1. Preparation Before Installation

  • Installation notes

Tool      Version   Purpose
Python    3.6.5     Runs airflow and its dependencies; DAGs are developed in Python
MySQL     5.7       Serves as airflow's metadata database
Airflow   1.10.0    The task scheduling platform

Please use a clean physical machine or cloud host. Otherwise, I take no responsibility for any side effects or consequences!

  • Make sure you are familiar with the Linux environment and basic shell commands. Some basic Python commands will also come up along the way; if you are not familiar with them, please go brush up first.

2. Install Python3

For installing Python 3, refer to my earlier article; I will not repeat it here.

3. Install MySQL

About three years ago I wrote a CentOS MySQL installation tutorial. It is practical, but far too long, so here we will install and configure MySQL (and its users) in the simplest possible way. Of course, if you use a ready-made RDS instance, you can skip the installation steps and jump straight to creating the airflow database and user.

  • The old rules: uninstall mariadb first

rpm -qa | grep mariadb

sudo rpm -e --nodeps mariadb-libs-5.5.52-1.el7.x86_64

rpm -qa | grep mariadb
  • Download the MySQL repo source:

wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm

  • Install the repo via rpm:

sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm

  • Install MySQL and fix the data-directory ownership:

sudo yum install mysql-server
sudo chown -R mysql:mysql /var/lib/mysql

  • Start MySQL:

sudo service mysqld start

The following operations are carried out on the MySQL client, so first connect and log in to MySQL.

Connect as the root user:

mysql -uroot
  • Reset the MySQL password:

use mysql;

update user set password=password('root') where user='root';

flush privileges;

(Note: on MySQL 5.7 and later the password column was renamed to authentication_string; there, use ALTER USER 'root'@'localhost' IDENTIFIED BY 'root'; instead.)
  • Create the airflow database and user

Create the database:

create database airflow;

Create the users:

create user 'airflow'@'%' identified by 'airflow';

create user 'airflow'@'localhost' identified by 'airflow';

Grant privileges to the users:

grant all on airflow.* to 'airflow'@'%';

flush privileges;

exit;

4. Installing Airflow

Now that everything is in place, let's get to today's main topic!

4.1 Basics

  • 1) Install airflow via pip

Before installing, you need to set the temporary environment variable SLUGIFY_USES_TEXT_UNIDECODE, otherwise the installation will fail. The command is as follows:

export SLUGIFY_USES_TEXT_UNIDECODE=yes

Install airflow:

sudo pip install apache-airflow==1.10.0

airflow will be installed into Python 3's site-packages directory; the complete path is ${PYTHON_HOME}/lib/python3.6/site-packages/airflow. My airflow directory looks as follows:

  • 2) Configure the airflow installation directory

Before formally setting up airflow, we need to configure its home directory AIRFLOW_HOME. To make the airflow commands convenient to use, we also add airflow to the environment variables, once and for all.

Edit the system environment variable file /etc/profile:

sudo vim /etc/profile

Make the following changes (of course, replace the directories with the specific paths of your own installation; do not just copy them blindly):
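The original screenshot of the changes is missing here; below is a typical sketch of the lines to append, based on the variables this tutorial refers to (the paths are examples, adjust them to your own installation):

```shell
# Directory where airflow generates its config, logs and dags (example path)
export AIRFLOW_HOME=/home/airflow/airflow
# Airflow's install location inside Python 3's site-packages (example path)
export SITE_AIRFLOW_HOME=${PYTHON_HOME}/lib/python3.6/site-packages/airflow
# Make the airflow command available on the PATH
export PATH=${PATH}:${SITE_AIRFLOW_HOME}/bin
```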

Make the modified environment variables take effect immediately (note: source is a shell builtin, so it is run without sudo):

source /etc/profile
  • 3) Run the airflow command to perform the initial operations

Because we configured airflow through the SITE_AIRFLOW_HOME environment variable, we can execute the following command from anywhere:

airflow

After this, airflow will generate its files in the AIRFLOW_HOME directory we just configured. Some errors may be reported when the command is executed; you can ignore them. The generated file list is as follows:

  • 4) Install airflow's mysql module

sudo pip install 'apache-airflow[mysql]'

Other airflow dependency packages can be installed in the same manner; for the full list, refer to the airflow official documentation.


[Knock on the blackboard, key point]

You may hit the following error when installing the mysql module:

mysql_config not found

Resolve it as follows:

(1) Check whether the mysql_config file already exists:

find / -name mysql_config

(2) If not, install mysql-devel:

sudo yum install mysql-devel

(3) After the installation completes, verify again that the mysql_config file exists.

  • 5) Use mysql as airflow's metadata database

Modify the airflow.cfg file to configure mysql as airflow's metadata database:

Here lies a giant pit. Many tutorials tell you to write the connection string directly as below, and it will ruin you: do not believe them, or database initialization later on will fail, and you will search in vain for an effective solution! Likewise, do not believe anyone who tells you to simply switch to the pymysql package; that is even worse, because you will hit data type parsing problems with no idea what went wrong. Remember this!

sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow

or

sql_alchemy_conn = mysql+pymysql://airflow:airflow@localhost:3306/airflow

Since neither of these works, what do we do? There are always more solutions than problems! The original MySQLdb package only supports Python 2, not Python 3, but its fork mysqlclient does support Python 3 and gives the best results.

First, install mysqlclient via pip (it provides the MySQLdb module, so no separate MySQLdb install is needed; in fact, pip install MySQLdb would fail on Python 3):

sudo pip install mysqlclient

Then, modify sql_alchemy_conn in the airflow.cfg configuration file:

sql_alchemy_conn = mysql+mysqldb://airflow:airflow@localhost:3306/airflow
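To avoid typos in the connection string, it can help to assemble it programmatically; a minimal sketch (build_sql_alchemy_conn is a hypothetical helper for illustration, not part of Airflow):

```python
def build_sql_alchemy_conn(user, password, host, port, db, driver="mysqldb"):
    """Assemble the SQLAlchemy connection string used in airflow.cfg."""
    return "mysql+{}://{}:{}@{}:{}/{}".format(driver, user, password, host, port, db)

print(build_sql_alchemy_conn("airflow", "airflow", "localhost", 3306, "airflow"))
# mysql+mysqldb://airflow:airflow@localhost:3306/airflow
```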

With this, we have configured airflow's metadata database information and prepared the dependencies.

  • 6) Initialize the metadata database (in fact, this creates the tables airflow depends on)

airflow initdb

At this point, our mysql metadata database (the one named airflow) contains the newly created tables airflow depends on:


[Knock on the blackboard, key point]

When initializing the database, you may hit the following error:

Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

The solution is explained in the Airflow official FAQ, linked here: airflow.apache.org/faq.html . MySQL needs to be reconfigured by modifying its my.cnf file, with the following steps:

(1) Find the location of my.cnf:

mysql --help | grep my.cnf

(2) Modify the my.cnf file:

Under the [mysqld] section (the exact position within the section does not matter), add the following configuration:

explicit_defaults_for_timestamp=true

(3) Restart MySQL for the configuration to take effect:

sudo service mysqld restart

(4) Check whether the modified configuration took effect, e.g. by running show global variables like 'explicit_defaults_for_timestamp'; on the mysql client.

(5) Re-run airflow initdb.


  • 7) Basic commands
    • airflow components: webserver, scheduler, worker, flower
    • Start a component in the background: airflow xxx -D
    • List DAGs: airflow list_dags
    • List the tasks of a DAG: airflow list_tasks dag_id
    • Pause / resume a DAG: airflow pause/unpause dag_id
    • Test a DAG task: airflow test dag_id task_id execution_date

[Knock on the blackboard, key point]

You may hit the following errors when starting the webserver component:

Error 1:

Error: 'python:airflow.www.gunicorn_config' doesn't exist

Install the version of gunicorn matching your Airflow version:

(1) Airflow 1.10 needs gunicorn version 19.4.0:

sudo pip install gunicorn==19.4.0

(2) Airflow 1.8 needs gunicorn version 19.3.0:

sudo pip install gunicorn==19.3.0

Error 2:

FileNotFoundError: [Errno 2] No such file or directory: 'gunicorn': 'gunicorn'

You only need to add Python's bin directory to the PATH environment variable (you can also refer to www.cnblogs.com/lwglinux/p/... ):

sudo vim /etc/profile

source /etc/profile

4.2 Advanced

  • 1) Choosing an executor (requires a restart to take effect)

Why modify this? Because SequentialExecutor, the default, executes tasks sequentially in a single process and is usually only used for testing. LocalExecutor executes tasks locally with multiple processes. CeleryExecutor is used for distributed scheduling (though it can also run stand-alone) and is common in production environments. DaskExecutor is used for dynamic task scheduling, commonly in data analysis.
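To switch executors, change the executor key in the [core] section of airflow.cfg; a sketch, with LocalExecutor chosen as an example:

```ini
[core]
# One of: SequentialExecutor (default), LocalExecutor, CeleryExecutor, DaskExecutor
executor = LocalExecutor
```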

  • 2) Changing the time zone to UTC+8 (requires a restart to take effect)

Why change the time zone? Because Airflow uses UTC by default. While this guarantees that nodes of an Airflow cluster located in different time zones agree on the time, UTC is 8 hours behind Beijing time, which does not match our reading habits and is not intuitive. Since in most cases we either run a single node or, even when scaling out, stay within the same time zone, we change the time zone to UTC+8, i.e. Beijing time, which is more convenient for us.

Let's get started!

(1) Modify the airflow.cfg file:

default_timezone = Asia/Shanghai

This changes the scheduler's schedule time, meaning scheduled times in DAG definitions can now be written directly in Beijing time.

(2) Modify the time shown at the top right of the webserver page:

You need to modify the ${PYTHON_HOME}/lib/python3.6/site-packages/airflow/www/templates/admin/master.html file.

The modified result is shown below:

(3) Modify the webserver's lastRun time:

First, modify the ${PYTHON_HOME}/lib/python3.6/site-packages/airflow/models.py file and add a utc2local method:

def utc2local(self, utc):
    # Shift a naive UTC datetime by the host's local UTC offset
    import time
    epoch = time.mktime(utc.timetuple())
    offset = datetime.fromtimestamp(epoch) - datetime.utcfromtimestamp(epoch)
    return utc + offset
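The conversion logic can be sanity-checked outside of Airflow; a minimal standalone sketch (forcing the time zone via TZ/tzset assumes a Unix host):

```python
import os
import time
from datetime import datetime

def utc2local(utc):
    # Compute the host's local UTC offset and shift the naive UTC datetime by it
    epoch = time.mktime(utc.timetuple())
    offset = datetime.fromtimestamp(epoch) - datetime.utcfromtimestamp(epoch)
    return utc + offset

# Force the local time zone to Asia/Shanghai (UTC+8) for a reproducible demo
os.environ["TZ"] = "Asia/Shanghai"
time.tzset()

print(utc2local(datetime(2019, 1, 1, 0, 0)))  # 2019-01-01 08:00:00
```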

Results are as follows:

Second, modify the ${PYTHON_HOME}/lib/python3.6/site-packages/airflow/www/templates/airflow/dags.html file:

dag.utc2local(last_run.execution_date).strftime("%Y-%m-%d %H:%M")
dag.utc2local(last_run.start_date).strftime("%Y-%m-%d %H:%M")

The result is as follows:

After these modifications, restart the webserver to see the effect!

  • 3) Adding user authentication

Here, simple password authentication is enough for us!

(1) Install the password component:

sudo pip install 'apache-airflow[password]'

(2) Modify the airflow.cfg configuration file:

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth

(3) Write a Python script to add a user account:

Create an add_account.py file:
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

# Build the user record
user = PasswordUser(models.User())
user.username = 'airflow'
user.email = '[email protected]'
user.password = 'airflow'

# Persist it to the metadata database
session = settings.Session()
session.add(user)
session.commit()
session.close()

Execute the add_account.py file:

python add_account.py

You will find that the user table in the mysql metadata database has gained one more record.

Of course, you can also use a third-party plug-in to get a visual UI for user accounts as well as for creating/modifying DAG code. Link: github.com/lattebank/a... Note that it only supports Python 2.x; I will follow up with an upgraded version.

  • 4) Modify the webserver address (requires a restart to take effect)
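The original screenshot is missing here; a sketch of the relevant [webserver] keys in airflow.cfg (the values are examples, adjust them to your host):

```ini
[webserver]
# The base URL the webserver advertises, and the interface/port it binds to
base_url = http://localhost:8080
web_server_host = 0.0.0.0
web_server_port = 8080
```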

  • 5) Modify the new-DAG detection interval (requires a restart to take effect)

If the scheduler scans for new DAGs too often, it can lead to very high CPU load. The default detection interval is 0, that is, no interval at all.

You can set the interval via min_file_process_interval in the airflow.cfg file; below, I change the detection interval to 5 seconds:
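A sketch of the corresponding airflow.cfg fragment (5 seconds as in the text; the key lives in the [scheduler] section):

```ini
[scheduler]
# Minimum number of seconds between two scans of the same DAG file
min_file_process_interval = 5
```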

  • 6) Modify the scheduler's concurrency limit (requires a restart to take effect)

The parallelism setting in the airflow.cfg file controls how many tasks the scheduler may run concurrently:
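A sketch of the corresponding airflow.cfg fragment (32 is the default; adjust it to your machine; the key lives in the [core] section):

```ini
[core]
# Maximum number of task instances that may run concurrently across the installation
parallelism = 32
```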

4.3 Expert

  • 1) Airflow distributed cluster configuration

(To be continued...)

Reproduced from: https://juejin.im/post/5d02a5bbe51d45778f076d26


Origin blog.csdn.net/weixin_34306446/article/details/93167642