Use Apache DolphinScheduler for EMR task scheduling

By AWS Team

Preface

As an enterprise grows and its business data increases, Hadoop/Spark frameworks are commonly used to run ETL and aggregation/analysis jobs over large volumes of data, and these jobs need to be scheduled regularly by a unified job scheduling platform.

With Amazon EMR, you can invoke jobs using AWS Step Functions, Amazon Managed Workflows for Apache Airflow (MWAA), or self-managed schedulers such as Apache Oozie or Azkaban. However, as Apache DolphinScheduler matures and its community grows, more and more companies are choosing it as their task scheduling service because it is simple and easy to use, highly reliable, highly scalable, supports rich usage scenarios, and provides a multi-tenant mode.

DolphinScheduler can be installed and deployed inside an Amazon EMR cluster. However, given the characteristics of Amazon EMR and its usage best practices, it is not recommended to run one large, all-purpose, long-lived EMR cluster that provides every big data service. Instead, clusters are usually split along different dimensions, such as R&D stage (development, testing, production), workload (ad hoc query, batch processing), time sensitivity, job duration requirements, or organization. Therefore DolphinScheduler, as the unified scheduling platform, does not need to be installed on a fixed EMR cluster: it is deployed independently, jobs are distributed to different EMR clusters, and they are assembled as DAGs (Directed Acyclic Graphs) to achieve unified scheduling and management.

This article introduces the installation and deployment of DolphinScheduler, job orchestration in DolphinScheduler, and the use of Python scripts to schedule EMR tasks: creating a cluster, checking cluster status, submitting EMR Step jobs, checking EMR Step job status, and terminating the cluster after all jobs have finished.

Amazon EMR

Amazon EMR is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark on AWS to process and analyze massive amounts of data. Users can start a cluster containing many Hadoop-ecosystem data processing and analysis services with one click, without complicated manual configuration.

Apache DolphinScheduler

Apache DolphinScheduler is a distributed, easily scalable, open-source visual DAG workflow task scheduling system. Aimed at enterprise-level scenarios, it provides a visual solution for operating tasks, workflows, and the full data processing life cycle.

Features

  • Simple and easy to use

    • Visual DAG: user-friendly, drag-and-drop workflow definition, modular runtime control tools
    • Operation: Modular for easy customization and maintenance
  • Rich usage scenarios

    • Supports multiple task types: supports more than 10 task types such as Shell, MR, Spark, SQL, etc., and supports cross-language
    • Easily expandable and rich workflow operations: Workflows can be scheduled, paused, resumed and stopped, making it easy to maintain and control global and local parameters.
  • High Reliability

High reliability: decentralized design to ensure stability. Native HA task queue support provides overload fault tolerance. DolphinScheduler provides a highly robust environment.

  • High Scalability

High scalability: supports multi-tenancy and online resource management. Supports the stable operation of 100,000 data tasks per day.

Architecture diagram:

Main capabilities:

  • Tasks are associated according to task dependencies in the form of a DAG diagram, allowing real-time visual monitoring of the running status of tasks.
  • Supports a variety of task types: Shell, MR, Spark, SQL (mysql, oceanbase, postgresql, hive, sparksql), Python, Sub_Process, Procedure, etc.
  • Supports workflow scheduled scheduling, dependent scheduling, manual scheduling, manual pause/stop/resume, and also supports operations such as failed retry/alarm, failed recovery from specified nodes, and Kill tasks.
  • Supports workflow priority, task priority, task failover and task timeout alarm/failure
  • Supports workflow global parameters and node custom parameter settings
  • Supports online upload/download, management, etc. of resource files, supports online file creation and editing
  • Supports online viewing and scrolling of task logs, online downloading of logs, etc.
  • Implement cluster HA and achieve decentralization of Master cluster and Worker cluster through Zookeeper
  • Supports online viewing of Master/Worker CPU usage, load, and memory
  • Supports workflow running history tree/Gantt chart display, task status statistics, and process status statistics.
  • Supports backfilling (complementing) historical data
  • Support multi-tenancy

Install DolphinScheduler

DolphinScheduler supports multiple deployment methods

  • Standalone deployment: only suitable for a quick first experience of DolphinScheduler
  • Pseudo cluster deployment: The purpose of pseudo cluster deployment is to deploy the DolphinScheduler service on a single machine. In this mode, the master, worker, and api server are all on the same machine.
  • Cluster deployment: The purpose of cluster deployment is to deploy the DolphinScheduler service on multiple machines to run a large number of tasks.

If you are new and just want to try out DolphinScheduler, Standalone is recommended; if you want more complete functionality or a heavier workload, pseudo cluster deployment is recommended; for production use, cluster deployment or Kubernetes is recommended.

This experiment will introduce the deployment of DolphinScheduler in pseudo-cluster mode on AWS.

  1. Start an EC2

Launch an m5.xlarge EC2 instance running Amazon Linux 2 in an AWS public subnet, and open TCP port 12345 in its security group.
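An AWS CLI sketch of the equivalent launch, assuming placeholder IDs ({sg-id}, {amazon-linux2-ami-id}, {keypair-name}, {public-subnet-id}, {your-ip}) that you replace with your own; restricting the port-12345 rule to your own IP is recommended:

# open the DolphinScheduler UI/API port on an existing security group
aws ec2 authorize-security-group-ingress --group-id {sg-id} \
    --protocol tcp --port 12345 --cidr {your-ip}/32
# launch the instance in a public subnet
aws ec2 run-instances --image-id {amazon-linux2-ami-id} --instance-type m5.xlarge \
    --key-name {keypair-name} --subnet-id {public-subnet-id} \
    --security-group-ids {sg-id} --associate-public-ip-address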

  1. Install JDK and configure JAVA_HOME environment
java -version  
openjdk version "1.8.0_362"  
OpenJDK Runtime Environment (build 1.8.0_362-b08)  
OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
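If the JDK is not yet present, a sketch of installing OpenJDK 8 on Amazon Linux 2 and exporting JAVA_HOME; the derived JVM path may differ slightly from the one referenced later in dolphinscheduler_env.sh:

sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
# derive JAVA_HOME from the installed java binary and persist it for all users
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
echo "export JAVA_HOME=${JAVA_HOME}" | sudo tee /etc/profile.d/java.sh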
  1. Install and start Zookeeper
bin/zkServer.sh status  
/usr/bin/java  
ZooKeeper JMX enabled by default  
Using config: /usr/local/src/apache-zookeeper-3.8.1-bin/bin/../conf/zoo.cfg  
Client port found: 2181. Client address: localhost. Client SSL: false.  
Mode: standalone
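If Zookeeper is not installed yet, a minimal sketch of downloading and starting a standalone Zookeeper 3.8.1 (the mirror URL and version may need adjusting if this release has been archived):

cd /usr/local/src
wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.1/apache-zookeeper-3.8.1-bin.tar.gz
tar -zxvf apache-zookeeper-3.8.1-bin.tar.gz
cd apache-zookeeper-3.8.1-bin
# the sample config already uses clientPort=2181, matching the status output above
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start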
  1. Set up MySQL; Aurora Serverless is used here


  1. Install AWS CLI v2
aws --version  
aws-cli/2.11.4 Python/3.11.2 Linux/5.10.167-147.601.amzn2.x86_64 exe/x86_64.amzn.2 prompt/off
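A sketch of the standard AWS CLI v2 installation on Linux x86_64:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version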
  1. Update python version to 3.9
python --version  
Python 3.9.1
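Amazon Linux 2 ships with an older Python, so one option is building 3.9 from source; a sketch, where the build dependencies and the final symlink target are assumptions you may want to adapt:

sudo yum install -y gcc openssl-devel bzip2-devel libffi-devel zlib-devel
cd /usr/local/src
wget https://www.python.org/ftp/python/3.9.1/Python-3.9.1.tgz
tar -xzf Python-3.9.1.tgz && cd Python-3.9.1
./configure --enable-optimizations
sudo make altinstall                              # installs /usr/local/bin/python3.9
sudo ln -sf /usr/local/bin/python3.9 /usr/bin/python   # make `python` resolve to 3.9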
  1. Download DolphinScheduler
cd /usr/local/src  
wget https://dlcdn.apache.org/dolphinscheduler/3.1.4/apache-dolphinscheduler-3.1.4-bin.tar.gz
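The archive also needs to be extracted before the permission step below; a one-line sketch:

tar -zxvf apache-dolphinscheduler-3.1.4-bin.tar.gz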
  1. Configure the deployment user, passwordless sudo, and permissions
# Creating the user requires logging in as root  
useradd dolphinscheduler  

# Set the password  
echo "dolphinscheduler" | passwd --stdin dolphinscheduler  

# Configure passwordless sudo  
sed -i '$adolphinscheduler  ALL=(ALL)  NOPASSWD: ALL' /etc/sudoers
sed -i 's/Defaults    requiretty/#Defaults    requiretty/g' /etc/sudoers  

# Change directory ownership so the deployment user can operate on the extracted apache-dolphinscheduler-*-bin directory  
cd /usr/local/src  
chown -R dolphinscheduler:dolphinscheduler apache-dolphinscheduler-*-bin 
  1. Configure passwordless SSH login on the machine
# Switch to the dolphinscheduler user
su dolphinscheduler  
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa  
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   
chmod 600 ~/.ssh/authorized_keys  
# Note: after configuration, run `ssh localhost` to verify; if you can log in without entering a password, it succeeded
  1. Initialize the database
cd /usr/local/src  
# Download the mysql-connector driver  
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.31.tar.gz   
tar -zxvf mysql-connector-j-8.0.31.tar.gz  
# Copy the driver into each service  
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/api-server/libs/  
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/alert-server/libs/  
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/master-server/libs/  
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/worker-server/libs/  
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/tools/libs/  

# Install the mysql client  
# Replace {mysql-endpoint} with your MySQL connection address  
# Replace {user} and {password} with your MySQL username and password  
mysql -h {mysql-endpoint} -u{user} -p{password}  
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci; 

# Replace {user} and {password} with the username and password you want to create  
mysql> CREATE USER '{user}'@'%' IDENTIFIED BY '{password}';   
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'%';  
mysql> CREATE USER '{user}'@'localhost' IDENTIFIED BY '{password}';   
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'localhost';   
mysql> FLUSH PRIVILEGES;  

# Modify the database configuration  
vi bin/env/dolphinscheduler_env.sh  

# Database related configuration: set database type, username and password  
# Replace {rds-endpoint} with your MySQL (RDS) connection address  
# Replace {user} and {password} with your MySQL username and password
export DATABASE=${DATABASE:-mysql}   
export SPRING_PROFILES_ACTIVE=${DATABASE}  
export SPRING_DATASOURCE_URL="jdbc:mysql://{rds-endpoint}/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false"   
export SPRING_DATASOURCE_USERNAME={user}  
export SPRING_DATASOURCE_PASSWORD={password}  

# Run the database initialization  
bash apache-dolphinscheduler-3.1.4-bin/tools/bin/upgrade-schema.sh
  1. Modify install_env.sh
cd /usr/local/src/apache-dolphinscheduler-3.1.4-bin
vi bin/env/install_env.sh   

# Replace the IPs with the private IP address of the EC2 instance where DolphinScheduler is deployed  
ips=${ips:-"10.100.1.220"}  
masters=${masters:-"10.100.1.220"}
workers=${workers:-"10.100.1.220:default"}
alertServer=${alertServer:-"10.100.1.220"}
apiServers=${apiServers:-"10.100.1.220"}
installPath=${installPath:-"~/dolphinscheduler"}  
  1. Modify dolphinscheduler_env.sh
cd /usr/local/src/  
mv apache-dolphinscheduler-3.1.4-bin apache-dolphinscheduler   
cd ./apache-dolphinscheduler  
# Modify the DolphinScheduler environment variables  
vi bin/env/dolphinscheduler_env.sh  

export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.362.b08-1.amzn2.0.1.x86_64}
export PYTHON_HOME=${PYTHON_HOME:-/bin/python} 
  1. Start DolphinScheduler
cd /usr/local/src/apache-dolphinscheduler 
su dolphinscheduler  
bash ./bin/install.sh
  1. Visit DolphinScheduler

Access the UI using the public IP address of the EC2 instance where DolphinScheduler is deployed: http://ec2-endpoint:12345/dolphinscheduler/ui/login

Initial username/password: admin / dolphinscheduler123

Configure DolphinScheduler

  1. Create a tenant

  2. Bind users to tenants


  1. Create an AWS IAM policy
{  
    "Version":"2012-10-17",  
    "Statement":[  
        {  
            "Sid":"ElasticMapReduceActions",  
            "Effect":"Allow",  
            "Action":[  
                "elasticmapreduce:RunJobFlow",  
                "elasticmapreduce:DescribeCluster",  
                "elasticmapreduce:AddJobFlowSteps",  
                "elasticmapreduce:DescribeStep",  
                "elasticmapreduce:TerminateJobFlows",  
                "elasticmapreduce:SetTerminationProtection"  
            ],  
            "Resource":"*"  
        },  
        {  
            "Effect":"Allow",  
            "Action":[  
                "iam:GetRole",  
                "iam:PassRole"  
            ],  
            "Resource":[  
                "arn:aws:iam::accountid:role/EMR_DefaultRole",  
                "arn:aws:iam::accountid:role:role/EMR_EC2_DefaultRole"  
            ]  
        }  
    ]  
}  
  1. Create an IAM role

Enter AWS IAM, create a role, and assign the policy created in the previous step
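If you prefer the CLI over the console, a sketch of creating the role and an instance profile, assuming the policy above is saved as dolphinscheduler-emr-policy.json and an EC2 trust policy as ec2-trust.json (the file names, role name, and profile name are placeholders):

aws iam create-policy --policy-name dolphinscheduler-emr-policy \
    --policy-document file://dolphinscheduler-emr-policy.json
aws iam create-role --role-name dolphinscheduler-emr-role \
    --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name dolphinscheduler-emr-role \
    --policy-arn arn:aws:iam::{account-id}:policy/dolphinscheduler-emr-policy
# EC2 attaches roles through an instance profile
aws iam create-instance-profile --instance-profile-name dolphinscheduler-emr-profile
aws iam add-role-to-instance-profile --instance-profile-name dolphinscheduler-emr-profile \
    --role-name dolphinscheduler-emr-role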

  1. Bind the role to the EC2 instance where DolphinScheduler is deployed


Bind the EC2 instance to the role created in the previous step so that the DolphinScheduler host has permission to call EMR, for example:
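A CLI equivalent of the console operation; the instance profile name matches the sketch above and is an assumption:

aws ec2 associate-iam-instance-profile --instance-id {instance-id} \
    --iam-instance-profile Name=dolphinscheduler-emr-profile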

  1. Install boto3 and the other Python components needed
sudo pip install boto3  
sudo pip install redis
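To verify that the packages are importable under the Python interpreter DolphinScheduler will use:

python -c "import boto3, redis; print(boto3.__version__, redis.__version__)"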

Using DolphinScheduler for job scheduling

The tasks are implemented as Python scripts.

Job execution sequence diagram:


  1. Create EMR cluster creation task

Create an EMR cluster with 3 MASTER and 3 CORE nodes, specify the subnet and permissions, and automatically terminate the cluster after it has been idle for ten minutes. The specific parameters can be found in the boto3 run_job_flow documentation.

import boto3  
from datetime import date  
import redis  

def run_job_flow():  
    response = client.run_job_flow(  
        Name='create-emrcluster-'+ d1,  
        LogUri='s3://s3bucket/elasticmapreduce/',  
        ReleaseLabel='emr-6.8.0',  
        Instances={  
            'KeepJobFlowAliveWhenNoSteps': False,  
            'TerminationProtected': False,  
            # Replace {Subnet-id} with the id of the subnet you want to deploy into  
            'Ec2SubnetId': '{Subnet-id}',  
            # Replace {Keypairs-name} with the name of your EC2 key pair  
            'Ec2KeyName': '{Keypairs-name}',  
            'InstanceGroups': [  
                {  
                    'Name': 'Master',  
                    'Market': 'ON_DEMAND',  
                    'InstanceRole': 'MASTER',  
                    'InstanceType': 'm5.xlarge',  
                    'InstanceCount': 3,  
                    'EbsConfiguration': {  
                        'EbsBlockDeviceConfigs': [  
                            {  
                                'VolumeSpecification': {  
                                    'VolumeType': 'gp3',  
                                    'SizeInGB': 500  
                                },  
                                'VolumesPerInstance': 1  
                            },  
                        ],  
                        'EbsOptimized': True  
                    },  
                },  
                {  
                    'Name': 'Core',  
                    'Market': 'ON_DEMAND',  
                    'InstanceRole': 'CORE',  
                    'InstanceType': 'm5.xlarge',  
                    'InstanceCount': 3,  
                    'EbsConfiguration': {  
                        'EbsBlockDeviceConfigs': [  
                            {  
                                'VolumeSpecification': {  
                                    'VolumeType': 'gp3',  
                                    'SizeInGB': 500  
                                },  
                                'VolumesPerInstance': 1  
                            },  
                        ],  
                        'EbsOptimized': True  
                    },  
                }  
            ],  
        },  
        Applications=[{'Name': 'Spark'},{'Name': 'Hive'},{'Name': 'Pig'},{'Name': 'Presto'}],  
        Configurations=[  
            { 'Classification': 'spark-hive-site',  
                'Properties': {  
                    'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}  
            },  
            { 'Classification': 'hive-site',  
                'Properties': {  
                    'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}  
            },  
            { 'Classification': 'presto-connector-hive',  
                'Properties': {  
                    'hive.metastore.glue.datacatalog.enabled': 'true'}  
            }  
        ],  
        JobFlowRole='EMR_EC2_DefaultRole',  
        ServiceRole='EMR_DefaultRole',  
        EbsRootVolumeSize=100,  
        # Terminate the cluster automatically after it has been idle for ten minutes  
        AutoTerminationPolicy={  
            'IdleTimeout': 600  
        }  
      )  
    return response  

if __name__ == "__main__":  
    today = date.today()  
    d1 = today.strftime("%Y%m%d")  
    # Replace {region} with the Region where you want to create the EMR cluster  
    client = boto3.client('emr',region_name='{region}')  
    # Create the EMR cluster  
    clusterCreate = run_job_flow()  
    job_id = clusterCreate['JobFlowId']  

    # Use redis to save the information passed between DolphinScheduler job steps as parameters; the MySQL instance used by DolphinScheduler or another store would also work  
    # Replace {redis-endpoint} with your redis connection address  
    pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)  
    r = redis.Redis(connection_pool=pool)  
    r.set('cluster_id_'+d1, job_id) 
  1. Create an EMR cluster status check task

Check whether the EMR cluster has been created

import boto3  
import redis  
import time  
from datetime import date  

if __name__ == "__main__":  
    today = date.today()  
    d1 = today.strftime("%Y%m%d")  

    # Replace {region} with the Region where the EMR cluster is created  
    client = boto3.client('emr',region_name='{region}')  
    # Replace {redis-endpoint} with your redis connection address  
    pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)  
    r = redis.Redis(connection_pool=pool)  
    # Get the id of the EMR cluster that was created  
    job_id = r.get('cluster_id_' + d1)  
    print(job_id)  
    while True:  
        result = client.describe_cluster(ClusterId=job_id)  
        emr_state = result['Cluster']['Status']['State']  
        print(emr_state)  
        if emr_state == 'WAITING':  
            # The EMR cluster was created successfully  
            break  
        elif emr_state == 'FAILED':  
            # Cluster creation failed  
            # do something...  
            break  
        else:  
            time.sleep(10)
  1. Submit the Spark job to the created EMR cluster
import time  
import re  
import boto3  
from datetime import date  
import redis  

def generate_step(step_name, step_command):  
    cmds = re.split('\\s+', step_command)  
    print(cmds)  
    if not cmds:  
        raise ValueError  
    return {  
        'Name': step_name,  
        'ActionOnFailure': 'CANCEL_AND_WAIT',  
        'HadoopJarStep': {  
            'Jar': 'command-runner.jar',  
            'Args': cmds  
        }  
    }  


if __name__ == "__main__":  
    today = date.today()  
    d1 = today.strftime("%Y%m%d")  

    # Replace {region} with the Region where the EMR cluster is created  
    client = boto3.client('emr',region_name='{region}')  

    # Get the EMR cluster id  
    # Replace {redis-endpoint} with your redis connection address  
    pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)  
    r = redis.Redis(connection_pool=pool)  
    job_id = r.get('cluster_id_' + d1)  

    # The job launch command  
    spark_submit_cmd = """spark-submit 
                s3://s3bucket/file/spark/spark-etl.py 
                s3://s3bucket/input/ 
                s3://s3bucket/output/spark/"""+d1+'/'  

    steps = []  
    steps.append(generate_step("SparkExample_"+d1 , spark_submit_cmd),)  
    # Submit the EMR Step job  
    response = client.add_job_flow_steps(JobFlowId=job_id, Steps=steps)  
    step_id = response['StepIds'][0]  
    # Save the step id so the status-check task can use it  
    r.set('SparkExample_'+d1, step_id)
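The script s3://s3bucket/file/spark/spark-etl.py referenced by spark_submit_cmd is not shown in this article; a hypothetical minimal version that matches the two arguments passed above (input prefix and dated output prefix) could look like the following sketch, and your real ETL logic will differ:

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # arguments passed by spark_submit_cmd: input prefix and output prefix
    input_path, output_path = sys.argv[1], sys.argv[2]
    spark = SparkSession.builder.appName("SparkExample").getOrCreate()
    # trivial example transformation: read CSV, drop duplicates, write parquet
    df = spark.read.option("header", "true").csv(input_path)
    df.dropDuplicates().write.mode("overwrite").parquet(output_path)
    spark.stop()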
  1. Create a job execution status check task
import boto3  
import redis  
import time  
from datetime import date  


if __name__ == "__main__":  
    today = date.today()  
    d1 = today.strftime("%Y%m%d")  

    # Replace {region} with the Region where the EMR cluster is created  
    client = boto3.client('emr',region_name='{region}')  

    # Replace {redis-endpoint} with your redis connection address  
    pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)  
    r = redis.Redis(connection_pool=pool)  
    job_id = r.get('cluster_id_' + d1)  
    step_id = r.get('SparkExample_' + d1)  
    print(job_id)  
    print(step_id)  

    while True:  
        # Query the job execution result  
        result = client.describe_step(ClusterId=job_id,StepId=step_id)  
        emr_state = result['Step']['Status']['State']  
        print(emr_state)  
        if emr_state == 'COMPLETED':  
            # Job execution completed  
            break  
        elif emr_state == 'FAILED':  
            # Job execution failed  
            # do something  
            # ......  
            break  
        else:  
            time.sleep(10)
  1. Set the execution order

Create a workflow under DolphinScheduler - Project Management - Workflow - Workflow Definition, create Python tasks, and chain the above Python scripts together as tasks.

  1. Save and go online

Save the workflow and click Go Online.

  1. Execute

You can click to execute immediately, or define a schedule so the workflow runs at the specified time.

View execution status in EMR

EMR creation status - starting

EMR Step execution status - running


  1. Check execution results and execution logs

Check the execution status and execution logs under DolphinScheduler - Project Management - Workflow - Workflow Instance, then view the execution status in EMR.

EMR Creation Status - Waiting

EMR Step execution status - completed

  1. Terminate the cluster

For ad hoc jobs or batch jobs scheduled to run daily, the EMR cluster can be terminated once the jobs finish to save cost (an EMR best practice). The cluster can be terminated automatically after being idle, using EMR's native auto-termination feature, or terminated explicitly with an API call.

Automatic termination is configured when creating the cluster:

AutoTerminationPolicy={  
    'IdleTimeout': 600  
} 

The cluster will terminate automatically after being idle for ten minutes. To terminate the EMR cluster manually:

import boto3  
from datetime import date  
import redis  

if __name__ == "__main__":  
    today = date.today()  
    d1 = today.strftime("%Y%m%d")  

    # Get the cluster id  
    # Replace {region} with the Region where the EMR cluster is created  
    client = boto3.client('emr',region_name='{region}')  

    # Replace {redis-endpoint} with your redis connection address  
    pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)  
    r = redis.Redis(connection_pool=pool)  
    job_id = r.get('cluster_id_' + d1)  
    # Turn off cluster termination protection  
    client.set_termination_protection(JobFlowIds=[job_id],TerminationProtected=False)  
    # Terminate the cluster  
    client.terminate_job_flows(JobFlowIds=[job_id])  

Add this script to the DolphinScheduler job flow, and the job flow will execute the script after all tasks are completed to terminate the EMR cluster.

Summary

As enterprises adopt big data analysis platforms, more and more data processing pipelines and tasks need a simple, easy-to-use scheduling system to sort out their intricate dependencies and run them according to an execution plan, together with easy-to-use, extensible visual DAG capabilities. Apache DolphinScheduler meets exactly these needs.

This article introduced the independent deployment of DolphinScheduler on AWS and, following EMR's characteristics and best practices, demonstrated the complete process from creating an EMR cluster and submitting ETL jobs to terminating the cluster after all jobs have finished. Users can refer to this document to quickly deploy and build their own big data scheduling system.

Author

Wang Xiao, AWS solutions architect, is responsible for consulting on and designing AWS cloud solution architectures and for promoting AWS cloud platform technology and solutions in China. He has rich experience in enterprise IT architecture and currently focuses on the big data field.

This article is published by Beluga Open Source Technology!


Origin: blog.csdn.net/DolphinScheduler/article/details/131939115