requirements background
Depending on business needs, data must be pulled on a regular schedule using sqoop, mysql, and hive. Business data is loaded from mysql into hive through sqoop; ETL result data is loaded from hive back into mysql through sqoop.
solution
The big data components HUE and OOZIE schedule shell scripts that execute the sqoop commands, which makes the jobs easy to manage and troubleshoot.
implementation steps
1. Write a shell file
sqoop-mysql2hive.sh
#!/bin/bash
# Use the date passed as the first argument if one is given; otherwise default to the previous day
#do_date=$(date -d "-1 day" +%F)
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d "-1 day" +%F)
fi
jdbc_url_dduser="jdbc:mysql://xxx:13306/dduser?serverTimezone=Asia/Shanghai&characterEncoding=utf8&tinyInt1isBit=false"
jdbc_username=root
jdbc_password=123456
echo "=== Start extracting business data for $do_date from mysql ==="
# sqoop-mysql2hive-appconfig: import the mysql table app_config into the hive table dd_database_bigdata.ods_app_config
# For an import, NULL handling is set with --null-string / --null-non-string (the --input-null-* options apply to exports)
sqoop import --connect "$jdbc_url_dduser" --username "$jdbc_username" --password "$jdbc_password" --table app_config --hive-overwrite --hive-import --hive-table dd_database_bigdata.ods_app_config --target-dir /warehouse/dd/bigdata/ods/tmp/ods_app_config --hive-drop-import-delims -m 1 --null-string '\\N' --null-non-string '\\N'
echo "=== Finished extracting data for $do_date from mysql ==="
Notes on the script:
To define a variable in a shell file, assign it directly with no spaces around the equals sign, for example jdbc_username=root; to use it, prefix the name with a dollar sign: $jdbc_username
`date -d "-1 day" +%F` returns the previous day's date in the shell, and `date +%F` returns the current day's date, both in YYYY-MM-DD format.
When the shell action needs to receive parameters, HUE passes them as positional arguments, so the script reads them as $1, $2, and $3. This comes up again later when creating the Schedule.
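The three notes above can be tried out locally before involving HUE. The following is a minimal sketch, assuming GNU date is available (as the script itself does); the function name resolve_do_date is a hypothetical helper introduced here for illustration and is not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: print the first argument if one is given,
# otherwise print yesterday's date.
resolve_do_date() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        date -d "-1 day" +%F   # GNU date; %F formats as YYYY-MM-DD
    fi
}

jdbc_username=root                     # define: no spaces around '='
echo "user:     $jdbc_username"        # use: prefix the name with '$'
echo "explicit: $(resolve_do_date 2024-01-05)"   # argument given, used as-is
echo "default:  $(resolve_do_date)"              # no argument, falls back to yesterday
```

Running the real script as `bash sqoop-mysql2hive.sh 2024-01-05` exercises the same branch as the explicit call, while running it with no argument exercises the default.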
2. Put the sh file on hdfs
/warehouse/dd/oozie/workspace/workspace-sqoop-hive2mysql-now/shell/sqoop-hive2mysql-now-shell.sh
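One way to do the upload from the command line is sketched below, assuming the hdfs client is installed on the machine that holds the script. The helper names run and upload_script and the DRY_RUN switch are hypothetical conveniences added for illustration; with DRY_RUN=1 the commands are only printed, so the steps can be reviewed without touching the cluster.

```shell
#!/bin/bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: upload a local script into an HDFS directory.
upload_script() {
    local local_file=$1 hdfs_dir=$2
    run hdfs dfs -mkdir -p "$hdfs_dir"                # create the target directory if missing
    run hdfs dfs -put -f "$local_file" "$hdfs_dir/"   # -f overwrites an existing copy
    run hdfs dfs -ls "$hdfs_dir"                      # list the directory to confirm the upload
}

# Example (dry run, using the path from this step):
# DRY_RUN=1 upload_script sqoop-hive2mysql-now-shell.sh /warehouse/dd/oozie/workspace/workspace-sqoop-hive2mysql-now/shell
```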
3. Create a workflow
4. Execute the test
The execution status of the program can be viewed in the job.
5. Create a coordinator schedule
If the start time (the "from" field) is set to a time in the past, then once the schedule is created and submitted, it first runs several jobs in a row to catch up on every scheduled run that falls between the selected start time and now.
For example: suppose a task runs at minute 10 of every hour, the current time is 12:15, and "from" is set to 12:00. When this coordinator is submitted, one workflow job runs immediately; it is the run that was due at 12:10. So for "from", it is enough to keep the default value, which is the time at which the schedule is created.
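The catch-up count in this example can be worked out with a little arithmetic. The sketch below is a simplified model of the behavior described above, not Oozie's actual implementation: it assumes an hourly schedule at a fixed minute, GNU date, and times interpreted as UTC. The function name count_catchup_runs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the hourly runs scheduled at minute $3 of each hour
# whose scheduled time falls within [from, now].
#   $1 = from time, $2 = current time, $3 = minute of the hour (0-59)
count_catchup_runs() {
    local from_ts now_ts offset first
    from_ts=$(date -u -d "$1" +%s)
    now_ts=$(date -u -d "$2" +%s)
    offset=$(( $3 * 60 ))
    # First scheduled time at or after "from": round up to the next HH:<minute> slot.
    first=$(( (from_ts - offset + 3599) / 3600 * 3600 + offset ))
    if [ "$first" -gt "$now_ts" ]; then
        echo 0
    else
        echo $(( (now_ts - first) / 3600 + 1 ))
    fi
}

# The example from the text: from 12:00, now 12:15, runs at minute 10.
count_catchup_runs "2024-01-01 12:00" "2024-01-01 12:15" 10
# An earlier "from" of 09:00 covers the 09:10, 10:10, 11:10, and 12:10 slots.
count_catchup_runs "2024-01-01 09:00" "2024-01-01 12:15" 10
```

The first call prints 1 and the second prints 4, which is why an old start time leads to a burst of jobs right after submission.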
6. Run the coordinator