By AWS Team
Preface
As enterprises grow and business data accumulates, the Hadoop/Spark frameworks are commonly used to run large-scale ETL and aggregation/analysis jobs, and these jobs need to be scheduled regularly by a unified job scheduling platform.
With Amazon EMR, you can orchestrate jobs using AWS Step Functions, Amazon Managed Workflows for Apache Airflow (MWAA), or self-managed tools such as Apache Oozie or Azkaban. However, as Apache DolphinScheduler has matured, with an increasingly active community and features such as simplicity, ease of use, high reliability, high scalability, support for rich usage scenarios, and multi-tenancy, more and more companies are choosing it as their task scheduling service.
DolphinScheduler can be installed and deployed inside an Amazon EMR cluster. However, given the characteristics of Amazon EMR and its usage best practices, we do not recommend running one large, all-purpose, long-lived EMR cluster for all big data services. Instead, clusters are typically split along dimensions such as development stage (development, testing, production), workload (ad hoc query, batch processing), time sensitivity, job duration requirements, and organization type. As the unified scheduling platform, DolphinScheduler therefore does not need to be installed on any fixed EMR cluster. It is deployed independently, jobs are distributed across different EMR clusters, and they are assembled as a DAG (Directed Acyclic Graph) to achieve unified scheduling and management.
This article introduces the installation and deployment of DolphinScheduler, as well as job orchestration in DolphinScheduler, using Python scripts to schedule EMR tasks: creating a cluster, checking cluster status, submitting EMR Step jobs, checking EMR Step job status, and terminating the cluster once all jobs have finished.
Amazon EMR
Amazon EMR is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark on AWS to process and analyze massive amounts of data. Users can launch a cluster containing many Hadoop-ecosystem data processing and analysis services with one click, without complicated manual configuration.
Apache DolphinScheduler
Apache DolphinScheduler is a distributed, easily scalable, open source visual DAG workflow task scheduling system. Designed for enterprise-grade scenarios, it provides a visual solution for operating tasks, workflows, and the full data processing life cycle.
Features
Simple and easy to use
- Visual DAG: user-friendly, drag-and-drop workflow definition with modular runtime control tools
- Operations: modular design for easy customization and maintenance
Rich usage scenarios
- Multiple task types: supports more than 10 task types such as Shell, MR, Spark, and SQL, with cross-language support
- Easily extensible, rich workflow operations: workflows can be scheduled, paused, resumed, and stopped, and global and local parameters are easy to maintain and control
High reliability
- Decentralized design ensures stability; native HA task queue support provides overload fault tolerance, making DolphinScheduler a highly robust environment
High scalability
- Supports multi-tenancy and online resource management, and sustains stable operation of 100,000 data tasks per day
Architecture diagram:
Main capabilities:
- Tasks are associated according to task dependencies in the form of a DAG diagram, allowing real-time visual monitoring of the running status of tasks.
- Supports a variety of task types: Shell, MR, Spark, SQL (mysql, oceanbase, postgresql, hive, sparksql), Python, Sub_Process, Procedure, etc.
- Supports workflow scheduled scheduling, dependent scheduling, manual scheduling, manual pause/stop/resume, and also supports operations such as failed retry/alarm, failed recovery from specified nodes, and Kill tasks.
- Supports workflow priority, task priority, task failover and task timeout alarm/failure
- Supports workflow global parameters and node custom parameter settings
- Supports online upload/download and management of resource files, as well as online file creation and editing
- Supports online viewing and scrolling of task logs, online downloading of logs, etc.
- Implement cluster HA and achieve decentralization of Master cluster and Worker cluster through Zookeeper
- Supports online viewing of Master/Worker CPU usage, memory, and load
- Supports workflow running history tree/Gantt chart display, task status statistics, and process status statistics.
- Supports backfilling data (complement)
- Support multi-tenancy
Install DolphinScheduler
DolphinScheduler supports multiple deployment methods:
- Standalone deployment: suitable only for a quick trial of DolphinScheduler
- Pseudo-cluster deployment: deploys the DolphinScheduler services on a single machine; in this mode the master, worker, and API server all run on the same machine
- Cluster deployment: deploys the DolphinScheduler services across multiple machines to run large numbers of tasks
If you are new and want to try out DolphinScheduler, Standalone is recommended; for more complete functionality or a larger workload, use pseudo-cluster deployment; for production, use cluster deployment or Kubernetes.
This walkthrough deploys DolphinScheduler in pseudo-cluster mode on AWS.
- Launch an EC2 instance
Launch an m5.xlarge EC2 instance running Amazon Linux 2 in an AWS public subnet, and open TCP port 12345 in its security group.
- Install the JDK and configure the JAVA_HOME environment variable
java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-b08)
OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
- Install and start Zookeeper
bin/zkServer.sh status
/usr/bin/java
ZooKeeper JMX enabled by default
Using config: /usr/local/src/apache-zookeeper-3.8.1-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: standalone
- Start MySQL; Aurora Serverless is used here
- Install AWS CLI v2
aws --version
aws-cli/2.11.4 Python/3.11.2 Linux/5.10.167-147.601.amzn2.x86_64 exe/x86_64.amzn.2 prompt/off
- Update the Python version to 3.9
python --version
Python 3.9.1
- Download DolphinScheduler
cd /usr/local/src
wget https://dlcdn.apache.org/dolphinscheduler/3.1.4/apache-dolphinscheduler-3.1.4-bin.tar.gz
- Configure the deployment user, passwordless sudo, and permissions
# Log in as root to create the user
useradd dolphinscheduler
# Set a password
echo "dolphinscheduler" | passwd --stdin dolphinscheduler
# Configure passwordless sudo
sed -i '$adolphinscheduler ALL=(ALL) NOPASSWD: ALL' /etc/sudoers
sed -i 's/Defaults requiretty/#Defaults requiretty/g' /etc/sudoers
# Change the directory ownership so the deployment user can operate on the extracted apache-dolphinscheduler-*-bin directory
cd /usr/local/src
chown -R dolphinscheduler:dolphinscheduler apache-dolphinscheduler-*-bin
- Configure passwordless SSH login on the machine
# Switch to the dolphinscheduler user
su dolphinscheduler
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Note: after configuring, run `ssh localhost` to verify; if you can log in without entering a password, it succeeded
- Initialize the database
cd /usr/local/src
# Download mysql-connector
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-j-8.0.31.tar.gz
tar -zxvf mysql-connector-j-8.0.31.tar.gz
# Copy the driver
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/api-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/alert-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/master-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/worker-server/libs/
cp mysql-connector-j-8.0.31/mysql-connector-j-8.0.31.jar ./apache-dolphinscheduler-3.1.4-bin/tools/libs/
# Install the mysql client
# Replace {mysql-endpoint} with your MySQL connection address
# Replace {user} and {password} with your MySQL username and password
mysql -h {mysql-endpoint} -u{user} -p{password}
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
# Replace {user} and {password} with the username and password you want to create
mysql> CREATE USER '{user}'@'%' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'%';
mysql> CREATE USER '{user}'@'localhost' IDENTIFIED BY '{password}';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO '{user}'@'localhost';
mysql> FLUSH PRIVILEGES;
Modify the database configuration
vi bin/env/dolphinscheduler_env.sh
# Database related configuration: set the database type, username, and password
# Replace {user} and {password} with your MySQL username and password, and {rds-endpoint} with the database connection address
export DATABASE=${DATABASE:-mysql}
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://{rds-endpoint}/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME={user}
export SPRING_DATASOURCE_PASSWORD={password}
# Run the data initialization
bash apache-dolphinscheduler/tools/bin/upgrade-schema.sh
- Modify install_env.sh
cd /usr/local/src/apache-dolphinscheduler
vi bin/env/install_env.sh
# Replace the IPs below with the private IP address of the EC2 instance where DolphinScheduler is deployed
ips=${ips:-"10.100.1.220"}
masters=${masters:-"10.100.1.220"}
workers=${workers:-"10.100.1.220:default"}
alertServer=${alertServer:-"10.100.1.220"}
apiServers=${apiServers:-"10.100.1.220"}
installPath=${installPath:-"~/dolphinscheduler"}
- Modify dolphinscheduler_env.sh
cd /usr/local/src/
mv apache-dolphinscheduler-3.1.4-bin apache-dolphinscheduler
cd ./apache-dolphinscheduler
# Modify the DolphinScheduler environment variables
vi bin/env/dolphinscheduler_env.sh
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.362.b08-1.amzn2.0.1.x86_64}
export PYTHON_HOME=${PYTHON_HOME:-/bin/python}
- Start DolphinScheduler
cd /usr/local/src/apache-dolphinscheduler
su dolphinscheduler
bash ./bin/install.sh
- Visit DolphinScheduler
Access the UI via the public IP address of the EC2 instance where DolphinScheduler is deployed: http://ec2-endpoint:12345/dolphinscheduler/ui/login
Initial username/password: admin/dolphinscheduler123
Configure DolphinScheduler
- Create a tenant
- Bind users to the tenant
- Create an IAM policy in AWS
{
"Version":"2012-10-17",
"Statement":[
{
"Sid":"ElasticMapReduceActions",
"Effect":"Allow",
"Action":[
"elasticmapreduce:RunJobFlow",
"elasticmapreduce:DescribeCluster",
"elasticmapreduce:AddJobFlowSteps",
"elasticmapreduce:DescribeStep",
"elasticmapreduce:TerminateJobFlows",
"elasticmapreduce:SetTerminationProtection"
],
"Resource":"*"
},
{
"Effect":"Allow",
"Action":[
"iam:GetRole",
"iam:PassRole"
],
"Resource":[
"arn:aws:iam::accountid:role/EMR_DefaultRole",
"arn:aws:iam::accountid:role/EMR_EC2_DefaultRole"
]
}
]
}
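If you prefer to create the policy programmatically instead of through the console, the same document can be submitted with boto3's `iam.create_policy` call. A sketch follows; the policy name is a hypothetical example, and "accountid" must be replaced with your AWS account id:

```python
import json

# The same policy document as above, expressed as a Python dict.
# Replace "accountid" with your AWS account id.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ElasticMapReduceActions",
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:RunJobFlow",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:DescribeStep",
                "elasticmapreduce:TerminateJobFlows",
                "elasticmapreduce:SetTerminationProtection",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["iam:GetRole", "iam:PassRole"],
            "Resource": [
                "arn:aws:iam::accountid:role/EMR_DefaultRole",
                "arn:aws:iam::accountid:role/EMR_EC2_DefaultRole",
            ],
        },
    ],
}

def create_scheduler_policy(policy_name="DolphinSchedulerEMRPolicy"):
    # Requires IAM permissions to run; the policy name is a hypothetical example.
    # boto3 is imported here so the policy document above can be inspected
    # even where boto3 is not installed.
    import boto3
    iam = boto3.client("iam")
    return iam.create_policy(
        PolicyName=policy_name, PolicyDocument=json.dumps(policy_document)
    )
```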
- Create an IAM role
In AWS IAM, create a role and attach the policy created in the previous step.
- Bind the role to the DolphinScheduler EC2 instance
Attach the role created in the previous step to the EC2 instance so that the instance running DolphinScheduler has permission to call EMR.
- Install boto3 and the other Python components needed
sudo pip install boto3
sudo pip install redis
Using DolphinScheduler for job scheduling
The scheduling tasks are implemented as Python scripts.
Job execution sequence diagram:
- Create the EMR cluster creation task
Create an EMR cluster with 3 MASTER and 3 CORE nodes, specify the subnet and permissions, and automatically terminate the cluster after it has been idle for ten minutes. Specific parameters can be found in the link.
import boto3
from datetime import date
import redis
def run_job_flow():
response = client.run_job_flow(
Name='create-emrcluster-'+ d1,
LogUri='s3://s3bucket/elasticmapreduce/',
ReleaseLabel='emr-6.8.0',
Instances={
'KeepJobFlowAliveWhenNoSteps': False,
'TerminationProtected': False,
# Replace {subnet-id} with the id of the subnet to deploy into
'Ec2SubnetId': '{subnet-id}',
# Replace {keypair-name} with the name of your EC2 key pair
'Ec2KeyName': '{keypair-name}',
'InstanceGroups': [
{
'Name': 'Master',
'Market': 'ON_DEMAND',
'InstanceRole': 'MASTER',
'InstanceType': 'm5.xlarge',
'InstanceCount': 3,
'EbsConfiguration': {
'EbsBlockDeviceConfigs': [
{
'VolumeSpecification': {
'VolumeType': 'gp3',
'SizeInGB': 500
},
'VolumesPerInstance': 1
},
],
'EbsOptimized': True
},
},
{
'Name': 'Core',
'Market': 'ON_DEMAND',
'InstanceRole': 'CORE',
'InstanceType': 'm5.xlarge',
'InstanceCount': 3,
'EbsConfiguration': {
'EbsBlockDeviceConfigs': [
{
'VolumeSpecification': {
'VolumeType': 'gp3',
'SizeInGB': 500
},
'VolumesPerInstance': 1
},
],
'EbsOptimized': True
},
}
],
},
Applications=[{'Name': 'Spark'},{'Name': 'Hive'},{'Name': 'Pig'},{'Name': 'Presto'}],
Configurations=[
{ 'Classification': 'spark-hive-site',
'Properties': {
'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
},
{ 'Classification': 'hive-site',
'Properties': {
'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
},
{ 'Classification': 'presto-connector-hive',
'Properties': {
'hive.metastore.glue.datacatalog.enabled': 'true'}
}
],
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
EbsRootVolumeSize=100,
# Automatically terminate the cluster after ten minutes idle
AutoTerminationPolicy={
'IdleTimeout': 600
}
)
return response
if __name__ == "__main__":
today = date.today()
d1 = today.strftime("%Y%m%d")
# Replace {region} with the Region where the EMR cluster will be created
client = boto3.client('emr',region_name='{region}')
# Create the EMR cluster
clusterCreate = run_job_flow()
job_id = clusterCreate['JobFlowId']
# Use redis to store information passed between DolphinScheduler job steps; the MySQL database used by DolphinScheduler, or another store, would also work
# Replace {redis-endpoint} with your redis connection address
pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)
r = redis.Redis(connection_pool=pool)
r.set('cluster_id_'+d1, job_id)
- Create an EMR cluster status check task
Check whether the EMR cluster has finished being created.
import boto3
import redis
import time
from datetime import date
if __name__ == "__main__":
today = date.today()
d1 = today.strftime("%Y%m%d")
# Replace {region} with the Region where the EMR cluster was created
client = boto3.client('emr',region_name='{region}')
# Replace {redis-endpoint} with your redis connection address
pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)
r = redis.Redis(connection_pool=pool)
# Get the id of the EMR cluster that was created
job_id = r.get('cluster_id_' + d1)
print(job_id)
while True:
result = client.describe_cluster(ClusterId=job_id)
emr_state = result['Cluster']['Status']['State']
print(emr_state)
if emr_state == 'WAITING':
# The EMR cluster was created successfully
break
elif emr_state == 'FAILED':
# Cluster creation failed
# do something...
break
else:
time.sleep(10)
- Start the Spark job using the created EMR cluster
import time
import re
import boto3
from datetime import date
import redis
def generate_step(step_name, step_command):
cmds = re.split('\\s+', step_command)
print(cmds)
if not cmds:
raise ValueError
return {
'Name': step_name,
'ActionOnFailure': 'CANCEL_AND_WAIT',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': cmds
}
}
if __name__ == "__main__":
today = date.today()
d1 = today.strftime("%Y%m%d")
# Replace {region} with the Region of the EMR cluster
client = boto3.client('emr',region_name='{region}')
# Get the EMR cluster id
# Replace {redis-endpoint} with your redis connection address
pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)
r = redis.Redis(connection_pool=pool)
job_id = r.get('cluster_id_' + d1)
# Job launch command
spark_submit_cmd = """spark-submit
s3://s3bucket/file/spark/spark-etl.py
s3://s3bucket/input/
s3://s3bucket/output/spark/"""+d1+'/'
steps = []
steps.append(generate_step("SparkExample_"+d1 , spark_submit_cmd),)
# Submit the EMR Step job
response = client.add_job_flow_steps(JobFlowId=job_id, Steps=steps)
step_id = response['StepIds'][0]
# Save the step id so its status can be checked later
r.set('SparkExample_'+d1, step_id)
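To make the Step structure concrete, here is the same generate_step helper run on a sample command (the bucket paths and date are placeholders):

```python
import re

def generate_step(step_name, step_command):
    # Same helper as above: split the command on whitespace and wrap it
    # in a command-runner.jar Step definition.
    cmds = re.split('\\s+', step_command)
    if not cmds:
        raise ValueError
    return {
        'Name': step_name,
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': cmds
        }
    }

step = generate_step(
    "SparkExample_20230101",
    "spark-submit s3://s3bucket/file/spark/spark-etl.py "
    "s3://s3bucket/input/ s3://s3bucket/output/spark/20230101/"
)
print(step['HadoopJarStep']['Args'])
# ['spark-submit', 's3://s3bucket/file/spark/spark-etl.py',
#  's3://s3bucket/input/', 's3://s3bucket/output/spark/20230101/']
```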
- Create the job execution status check task
import boto3
import redis
import time
from datetime import date
if __name__ == "__main__":
today = date.today()
d1 = today.strftime("%Y%m%d")
# Replace {region} with the Region of the EMR cluster
client = boto3.client('emr',region_name='{region}')
# Replace {redis-endpoint} with your redis connection address
pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)
r = redis.Redis(connection_pool=pool)
job_id = r.get('cluster_id_' + d1)
step_id = r.get('SparkExample_' + d1)
print(job_id)
print(step_id)
while True:
# Query the job execution result
result = client.describe_step(ClusterId=job_id,StepId=step_id)
emr_state = result['Step']['Status']['State']
print(emr_state)
if emr_state == 'COMPLETED':
# The job completed successfully
break
elif emr_state == 'FAILED':
# The job failed
# do something
# ......
break
else:
time.sleep(10)
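The while True loops above will spin forever if a job hangs, so in a scheduler task it is safer to bound the wait. A generic sketch follows; the function name, state list, and timeout values are illustrative, not part of the original scripts:

```python
import time

def poll_until_terminal(get_state,
                        terminal_states=('COMPLETED', 'FAILED', 'CANCELLED'),
                        timeout_s=3600, interval_s=10):
    """Call get_state() until it returns a terminal state or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = get_state()
        if state in terminal_states:
            return state
        time.sleep(interval_s)
    raise TimeoutError('job did not reach a terminal state within %ss' % timeout_s)

# Usage with the EMR step check above (sketch; client/job_id/step_id as before):
# state = poll_until_terminal(
#     lambda: client.describe_step(ClusterId=job_id, StepId=step_id)
#                   ['Step']['Status']['State'])
```

A DolphinScheduler task that raises instead of blocking forever lets the workflow's own failure-retry and alert settings take over.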
- Set the execution order
In DolphinScheduler, go to Project Management - Workflow - Workflow Definition, create a workflow, add Python tasks, and chain the Python scripts above together as tasks.
- Save and bring online
Save the workflow and click Go Online.
- Execute
You can click to run immediately, or configure a schedule so the workflow runs at the specified time.
View the execution status in EMR
EMR cluster status: Starting
EMR Step status: Running
- Check execution results and logs
Check the execution status and logs in DolphinScheduler under Project Management - Workflow - Workflow Instance, and view the execution status in EMR.
EMR cluster status: Waiting
EMR Step status: Completed
- Terminate the cluster
For ad hoc jobs or batch jobs scheduled daily, the EMR cluster can be terminated after the jobs finish to save costs (an EMR best practice). The cluster can be terminated automatically after an idle period using native EMR functionality, or terminated manually through the API.
To terminate automatically, configure the policy when creating the cluster:
AutoTerminationPolicy={
'IdleTimeout': 600
}
The cluster will then terminate automatically after being idle for ten minutes. To terminate the EMR cluster manually:
import boto3
from datetime import date
import redis
if __name__ == "__main__":
today = date.today()
d1 = today.strftime("%Y%m%d")
# Get the cluster id
# Replace {region} with the Region of the EMR cluster
client = boto3.client('emr',region_name='{region}')
# Replace {redis-endpoint} with your redis connection address
pool = redis.ConnectionPool(host='{redis-endpoint}', port=6379, decode_responses=True)
r = redis.Redis(connection_pool=pool)
job_id = r.get('cluster_id_' + d1)
# Disable cluster termination protection
client.set_termination_protection(JobFlowIds=[job_id],TerminationProtected=False)
# Terminate the cluster
client.terminate_job_flows(JobFlowIds=[job_id])
Add this script to the DolphinScheduler workflow; after all tasks have completed, the workflow executes it to terminate the EMR cluster.
Summary
As enterprise big data analysis platforms grow, more and more data processing pipelines and tasks need a simple, easy-to-use scheduling system that can untangle their intricate dependencies and schedule them according to an execution plan, while providing easy-to-use, easy-to-extend visual DAG capabilities. Apache DolphinScheduler meets exactly these needs.
This article introduced how to deploy DolphinScheduler independently on AWS and, following EMR's characteristics and best practices, demonstrated the complete job pipeline: creating an EMR cluster, submitting ETL jobs, and finally terminating the cluster after all jobs complete. Users can refer to this article to quickly deploy and build their own big data scheduling system.
Author
Wang Xiao is an AWS solutions architect responsible for consulting on and designing AWS cloud solution architectures and for promoting the AWS cloud platform and its solutions in China. He has rich experience in enterprise IT architecture and currently focuses on the big data field.
This article is published by Beluga Open Source Technology!