Solution: How to execute plain SQL files on Amazon EMR Serverless?

For a long time, SQL has been the programming language of choice for ETL thanks to its ease of use and high development efficiency, and it has played an irreplaceable role in building data warehouses and data lakes. Hive and Spark SQL were built on this very premise, and they firmly occupy a central position in today's big data ecosystem. In a regular Spark environment, developers can use the spark-sql command to execute SQL files directly. This seemingly trivial function is actually very important: on the one hand, it greatly lowers the barrier to using Spark, since users only need to know how to write SQL; on the other hand, driving the execution of SQL files from the command line greatly simplifies the submission of SQL jobs, making job submission itself "codable", which is convenient for large-scale engineering development and automated deployment.

Unfortunately, Amazon EMR Serverless does not provide native support for executing SQL files. Users can only embed SQL statements in Scala/Python code, which is unfriendly to users who rely heavily on pure SQL to build data warehouses or data lakes. To address this, we developed a set of tools for reading, parsing, and executing SQL files. With these tools, users can execute SQL files directly on Amazon EMR Serverless. This article introduces the solution in detail.

1. Solution design

Given that the way to execute a SQL statement in the Spark programming environment is spark.sql("..."), we can design a general-purpose job class that reads the SQL file at the location specified by a startup parameter, splits it into individual SQL statements, and executes each one with a call to spark.sql("..."). To make the job class more flexible and versatile, wildcards can also be introduced so that multiple SQL files can be loaded and executed at once. In addition, ETL jobs often need to run specific batches according to time parameters generated by the job scheduling tool, and these parameters are also applied in the SQL. Therefore, the job class should also allow users to embed custom variables in SQL files and assign values to those variables via parameters when submitting the job. Based on this design, we developed a project that implements the above functions. The project address is:

| Project name | Project address |
| --- | --- |
| Amazon EMR Serverless Utilities | https://github.com/bluishglc/emr-serverless-utils |

The class com.github.emr.serverless.SparkSqlJob in the project is the general-purpose SQL job class. It accepts two optional parameters:

| Parameter | Description | Example value |
| --- | --- | --- |
| --sql-files | Path of the SQL file(s) to execute; Java file system wildcards are supported, so multiple files can be specified and executed together | s3://my-spark-sql-job/sqls/insert-into-*.sql |
| --sql-params | Assigns values, in the form K1=V1,K2=V2,..., to the variables ${K1}, ${K2}, ... defined in the SQL files | CUST_CITY=NEW YORK,ORD_DATE=2008-07-15 |

The program has the following characteristics:

① Allows a single SQL file to contain multiple SQL statements
② Allows variables of the form ${K1}, ${K2}, ... to be defined in SQL files, and assigns values to them with parameters of the form K1=V1,K2=V2,... when the job is executed
③ Supports Java file system wildcards, allowing multiple SQL files to be executed at once
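
To make the design concrete, the following is a minimal sketch of how such a general-purpose job class can work: parse the two parameters, read the SQL file(s), substitute the user-defined variables, split the text into individual statements, and hand each one to spark.sql(...). It was written for this article as an illustration only (the object name SparkSqlJobSketch is made up); it is not the source code of SparkSqlJob:

import org.apache.spark.sql.SparkSession

// A minimal sketch of the idea behind a general-purpose SQL job class.
// NOTE: illustration only, NOT the source code of SparkSqlJob.
object SparkSqlJobSketch {
  def main(args: Array[String]): Unit = {
    // Naive parsing of: --sql-files <path-or-glob> [--sql-params K1=V1,K2=V2,...]
    val argMap    = args.sliding(2, 2).collect { case Array(k, v) => k -> v }.toMap
    val sqlFiles  = argMap("--sql-files")
    val sqlParams = argMap.getOrElse("--sql-params", "")

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Parse "K1=V1,K2=V2,..." into a map of variable assignments
    val params = sqlParams.split(",").filter(_.contains("=")).map { kv =>
      val Array(k, v) = kv.split("=", 2); k -> v
    }.toMap

    // wholeTextFiles accepts S3 paths and wildcards, e.g. s3://bucket/sqls/*.sql
    spark.sparkContext.wholeTextFiles(sqlFiles).collect().foreach { case (_, text) =>
      // Substitute every ${K} with the value assigned to K
      val rendered = params.foldLeft(text) { case (t, (k, v)) =>
        t.replace("${" + k + "}", v)
      }
      // Split the file into individual statements on ';' and run them in order
      rendered.split(";").map(_.trim).filter(_.nonEmpty).foreach(spark.sql)
    }

    spark.stop()
  }
}

A real implementation also has to handle details this sketch ignores, for example semicolons that appear inside string literals or comments.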

Below, we demonstrate how to use the project's tool classes to submit pure SQL jobs in two environments: the AWS console and the command line.

2. Practical demonstration

2.1. Environment preparation

To submit a job on EMR Serverless, you need to prepare an "EMR Serverless Application" and an "EMR Serverless Job Execution Role", and the latter must have read and write permissions on S3 and the Glue Data Catalog. The Application can be created easily through the wizard on the EMR Serverless console (EMR Studio), with all default settings being sufficient, and the Execution Role can be created quickly with the script provided in Section 5 of the article "CDC One-Key Lake: When Apache Hudi DeltaStreamer Meets Serverless Spark".

Next, prepare the Jar package and SQL files needed to submit the job. First create a bucket on S3; the bucket used in this article is named my-spark-sql-job (please replace the bucket name with your own when working in your environment). Then download the compiled emr-serverless-utils.jar package from [here] and upload it to the s3://my-spark-sql-job/jars/ directory.


The demonstration also uses 5 sample SQL files. Download them from [here], decompress them, and upload them to the s3://my-spark-sql-job/sqls/ directory.


2.2. Submit a pure SQL file job on the console

2.2.1. Execute a single SQL file

Open the EMR Serverless console (EMR Studio), select your EMR Serverless Application, and submit a job configured as follows:


① Script location: set to the path of the previously uploaded Jar package, s3://my-spark-sql-job/jars/emr-serverless-utils-1.0.jar
② Main class: set to com.github.emr.serverless.SparkSqlJob
③ Script arguments: set to ["--sql-files","s3://my-spark-sql-job/sqls/drop-tables.sql"]

No special settings are required for the other options; just keep the default configuration. For jobs deployed to production, you can configure them flexibly according to your needs, e.g. the resource allocation of the Spark Driver/Executors. Note that jobs created through the console enable the Glue Data Catalog by default (i.e. Additional settings -> Metastore configuration -> Use AWS Glue Data Catalog is checked by default); to make it easy to inspect the results of the SQL scripts in Glue and Athena, it is recommended that you keep this default configuration.

The above configuration describes the following job: a Spark job is launched with s3://my-spark-sql-job/jars/emr-serverless-utils-1.0.jar as the application Jar and com.github.emr.serverless.SparkSqlJob as the main class. The array ["--sql-files","s3://my-spark-sql-job/sqls/drop-tables.sql"] is the argument list passed to SparkSqlJob, telling it where to find the SQL file to execute. The SQL file for this job contains only three simple DROP TABLE statements; it is a basic example that demonstrates the tool class's ability to execute multiple SQL statements from a single file.

2.2.2. Execute SQL files with custom parameters

The next demonstration shows the second function of the tool class: executing SQL files with custom parameters. Create a new job or simply clone the previous one (select the previous job on the console and click Actions -> Clone job), then set "Script arguments" to:

["--sql-files","s3://my-spark-sql-job/sqls/create-tables.sql","--sql-params","APP_S3_HOME=s3://my-spark-sql-job"]


In addition to specifying the SQL file with the --sql-files parameter, this job also uses the --sql-params parameter to assign values to the user-defined variables that appear in the SQL. As introduced earlier, APP_S3_HOME=s3://my-spark-sql-job is a "Key=Value" string, meaning that the value s3://my-spark-sql-job is assigned to the variable APP_S3_HOME, so every occurrence of ${APP_S3_HOME} in the SQL will be replaced with s3://my-spark-sql-job. Looking at the create-tables.sql file, you can find the custom variable ${APP_S3_HOME} in the LOCATION clause of the CREATE TABLE statements:

CREATE EXTERNAL TABLE IF NOT EXISTS ORDERS (
    ... ...
)
... ...
LOCATION '${APP_S3_HOME}/data/orders/';

When SparkSqlJob reads the SQL file, it replaces every ${APP_S3_HOME} with s3://my-spark-sql-job according to the key-value string APP_S3_HOME=s3://my-spark-sql-job, so the SQL actually executed becomes:

CREATE EXTERNAL TABLE IF NOT EXISTS ORDERS (
    ... ...
)
... ...
LOCATION 's3://my-spark-sql-job/data/orders/';
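
The substitution itself is plain string replacement. The following standalone snippet (an illustration written for this article, with a hard-coded sample fragment and the made-up object name SubstitutionSketch; it is not the project's code) shows the effect using the values from this example:

// Standalone illustration of the ${KEY} -> value substitution step.
object SubstitutionSketch extends App {
  val sqlText = "LOCATION '${APP_S3_HOME}/data/orders/';"       // sample SQL fragment
  val params  = Map("APP_S3_HOME" -> "s3://my-spark-sql-job")   // parsed from --sql-params

  // Replace every ${KEY} occurrence with its assigned value
  val rendered = params.foldLeft(sqlText) { case (text, (k, v)) =>
    text.replace("${" + k + "}", v)
  }
  println(rendered)  // prints: LOCATION 's3://my-spark-sql-job/data/orders/';
}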

After the job is submitted and executed, log in to the Athena console to check whether the data table is created successfully.

2.2.3. Using wildcards to execute multiple files

Sometimes we need to execute all SQL files in a folder in one batch, or use wildcards to selectively execute some of them. SparkSqlJob supports such requirements via Java file system wildcards. The following job demonstrates the use of wildcards. Again, create a new job or clone the previous one, then set "Script arguments" to:

["--sql-files","s3://my-spark-sql-job/sqls/insert-into-*.sql"]


The --sql-files parameter of this job uses the path wildcard insert-into-*.sql, which matches two SQL files, insert-into-orders.sql and insert-into-customers.sql; they insert multiple records into the ORDERS and CUSTOMERS tables respectively. After the execution completes, you can log in to the Athena console to check whether data has been generated in these tables.
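
For reference, this kind of wildcard matching can be implemented with the Hadoop FileSystem glob API. The snippet below (an illustration only, assuming a SparkSession named spark is already in scope; it is not taken from the project's source) shows how such a pattern resolves to concrete files:

import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve a glob pattern against S3 with the Hadoop FileSystem API
// (illustration only; assumes a SparkSession named `spark` is in scope).
val pattern = new Path("s3://my-spark-sql-job/sqls/insert-into-*.sql")
val fs: FileSystem = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)
val matched = fs.globStatus(pattern).map(_.getPath.toString)
// Expected to list insert-into-orders.sql and insert-into-customers.sql
matched.foreach(println)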

2.2.4. A Composite Example

Finally, let's submit a more representative composite example: file wildcards plus user-defined parameters. Once again, create a new job or clone the previous one, then set "Script arguments" to:

["--sql-files","s3://my-spark-sql-job/sqls/select-*.sql","--sql-params","APP_S3_HOME=s3://my-spark-sql-job,CUST_CITY=NEW YORK,ORD_DATE=2008-07-15"]


The --sql-files parameter of this job uses the path wildcard select-*.sql, which matches the select-tables.sql file. This file contains three user-defined variables, namely ${APP_S3_HOME}, ${CUST_CITY} and ${ORD_DATE}:

CREATE EXTERNAL TABLE ORDERS_CUSTOMERS
    ... ...
    LOCATION '${APP_S3_HOME}/data/orders_customers/'
AS SELECT
    ... ...
WHERE
    C.CUST_CITY = '${CUST_CITY}' AND
    O.ORD_DATE = CAST('${ORD_DATE}' AS DATE);

The --sql-params parameter assigns values to these three custom variables: APP_S3_HOME=s3://my-spark-sql-job, CUST_CITY=NEW YORK, ORD_DATE=2008-07-15. The SQL above is therefore converted into the following before being executed:

CREATE EXTERNAL TABLE ORDERS_CUSTOMERS
    ... ...
    LOCATION 's3://my-spark-sql-job/data/orders_customers/'
AS SELECT
    ... ...
WHERE
    C.CUST_CITY = 'NEW YORK' AND
    O.ORD_DATE = CAST('2008-07-15' AS DATE);

So far, we have demonstrated all the functions for submitting pure SQL file jobs through the console.

2.3. Submit a pure SQL file job through the command line

In fact, many EMR Serverless users submit their jobs not from the console but via the AWS CLI, which is more common in engineering projects and job scheduling systems. So next, let's look at how to submit a pure SQL file job from the command line.

For submitting EMR Serverless jobs from the command line, this article follows the practices given in the article "Best Practices: How to Submit an Amazon EMR Serverless Job Elegantly?". First, log in to a Linux environment with the AWS CLI installed and user credentials configured (Amazon Linux 2 is recommended), install jq, the command-line tool for manipulating JSON files (it is used in the subsequent scripts), with sudo yum -y install jq, and then complete the following preparations:

① Create or select a job-specific working directory and S3 bucket
② Create or select an EMR Serverless Execution Role
③ Create or select an EMR Serverless Application

Next, export all environment-related variables (please replace the corresponding values in the commands according to your AWS account and local environment):

export APP_NAME='change-to-your-app-name'
export APP_S3_HOME='change-to-your-app-s3-home'
export APP_LOCAL_HOME='change-to-your-app-local-home'
export EMR_SERVERLESS_APP_ID='change-to-your-application-id'
export EMR_SERVERLESS_EXECUTION_ROLE_ARN='change-to-your-execution-role-arn'

Here is an example:

export APP_NAME='my-spark-sql-job'
export APP_S3_HOME='s3://my-spark-sql-job'
export APP_LOCAL_HOME='/home/ec2-user/my-spark-sql-job'
export EMR_SERVERLESS_APP_ID='00fbfel40ee59k09'
export EMR_SERVERLESS_EXECUTION_ROLE_ARN='arn:aws:iam::123456789000:role/EMR_SERVERLESS_ADMIN'

"Best Practices: How to Submit an Amazon EMR Serverless Job Elegantly?" " The article provides multiple general scripts for operating Jobs, which are very practical. This article will also reuse these scripts directly. However, since we need to submit multiple times and the parameters are different each time, for ease of use and simplification of the text, we Encapsulate part of the script in the original text as a Shell function, named submit-spark-sql-job:

submit-spark-sql-job() {
    sqlFiles="$1"
    sqlParams="$2"
    cat << EOF > $APP_LOCAL_HOME/start-job-run.json
{
    "name":"my-spark-sql-job",
    "applicationId":"$EMR_SERVERLESS_APP_ID",
    "executionRoleArn":"$EMR_SERVERLESS_EXECUTION_ROLE_ARN",
    "jobDriver":{
        "sparkSubmit":{
        "entryPoint":"$APP_S3_HOME/jars/emr-serverless-utils-1.0.jar",
        "entryPointArguments":[
            $([[ -n "$sqlFiles" ]] && echo "\"--sql-files\", \"$sqlFiles\"")
            $([[ -n "$sqlParams" ]] && echo ",\"--sql-params\", \"$sqlParams\"")
        ],
         "sparkSubmitParameters":"--class com.github.emr.serverless.SparkSqlJob --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
   },
   "configurationOverrides":{
        "monitoringConfiguration":{
            "s3MonitoringConfiguration":{
                "logUri":"$APP_S3_HOME/logs"
            }
        }
   }
}
EOF
    jq . $APP_LOCAL_HOME/start-job-run.json
    export EMR_SERVERLESS_JOB_RUN_ID=$(aws emr-serverless start-job-run \
        --no-paginate --no-cli-pager --output text \
        --name my-spark-sql-job \
        --application-id $EMR_SERVERLESS_APP_ID \
        --execution-role-arn $EMR_SERVERLESS_EXECUTION_ROLE_ARN \
        --execution-timeout-minutes 0 \
        --cli-input-json file://$APP_LOCAL_HOME/start-job-run.json \
        --query jobRunId)
    now=$(date +%s)sec
    while true; do
        jobStatus=$(aws emr-serverless get-job-run \
                        --no-paginate --no-cli-pager --output text \
                        --application-id $EMR_SERVERLESS_APP_ID \
                        --job-run-id $EMR_SERVERLESS_JOB_RUN_ID \
                        --query jobRun.state)
        if [ "$jobStatus" = "PENDING" ] || [ "$jobStatus" = "SCHEDULED" ] || [ "$jobStatus" = "RUNNING" ]; then
            for i in {0..5}; do
                echo -ne "\E[33;5m>>> The job [ $EMR_SERVERLESS_JOB_RUN_ID ] state is [ $jobStatus ], duration [ $(date -u --date now-$now +%H:%M:%S) ] ....\r\E[0m"
                sleep 1
            done
        else
            printf "The job [ $EMR_SERVERLESS_JOB_RUN_ID ] is [ $jobStatus ]%50s\n\n"
            break
        fi
    done
}

The function accepts two positional arguments:

① The first positional argument specifies the path of the SQL file(s) to execute; its value is passed to SparkSqlJob's --sql-files parameter
② The second positional argument specifies values for the user-defined variables in the SQL files; its value is passed to SparkSqlJob's --sql-params parameter

The Jar package and SQL files used by the function are the same ones prepared in Section 2.1 "Environment preparation", so that preparation must also be completed before submitting jobs with this script. Next, we use this function to perform the same operations as in Section 2.2.

2.3.1. Execute a single SQL file

The operations in this section are exactly the same as those in Section 2.2.1, except that the command line is used instead. The command is as follows:

submit-spark-sql-job "$APP_S3_HOME/sqls/drop-tables.sql"

2.3.2. Execute the SQL file with custom parameters

The operations in this section are exactly the same as those in Section 2.2.2, except that the command line is used instead. The command is as follows:

submit-spark-sql-job "$APP_S3_HOME/sqls/create-tables.sql" "APP_S3_HOME=$APP_S3_HOME"

2.3.3. Using wildcards to execute multiple files

The operations in this section are exactly the same as those in Section 2.2.3, except that the command line is used instead. The command is as follows:

submit-spark-sql-job "$APP_S3_HOME/sqls/insert-into-*.sql"

2.3.4. A Composite Example

The operations in this section are exactly the same as those in Section 2.2.4, except that the command line is used instead. The command is as follows:

submit-spark-sql-job "$APP_S3_HOME/sqls/select-tables.sql" "APP_S3_HOME=$APP_S3_HOME,CUST_CITY=NEW YORK,ORD_DATE=2008-07-15"

3. Call the tool class in your own code

Although SQL statements can be executed directly in the Spark programming environment in the form spark.sql(...), the previous examples show that the SQL file execution capability provided by emr-serverless-utils is more convenient and powerful. You can also call the relevant tool class in your own source code to get the same SQL file processing capability. The method is very simple; you only need to:

① Add emr-serverless-utils-1.0.jar to your classpath
② Declare the implicit type conversion
③ Call execSqlFile() directly on spark


// Initialize the SparkSession and perform other operations
...

// Declare the implicit type conversion
import com.github.emr.serverless.SparkSqlSupport._

// Call execSqlFile() directly on spark
spark.execSqlFile("s3://YOUR/XXX.sql")

// Call execSqlFile() directly on spark, with custom parameter values
spark.execSqlFile("s3://YOUR/XXX.sql", "K1=V1,K2=V2,...")

// Other operations
...
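
For reference, the following is a minimal sketch of how an implicit class can add execSqlFile() to SparkSession, which is the kind of implicit conversion the import above enables. The object name SparkSqlSupportSketch and its body are assumptions made for illustration; this is not the actual source of com.github.emr.serverless.SparkSqlSupport:

import org.apache.spark.sql.SparkSession

// A sketch (not the project's actual code) of an implicit class that
// decorates SparkSession with execSqlFile(path, params)
object SparkSqlSupportSketch {
  implicit class SqlFileOps(spark: SparkSession) {
    def execSqlFile(path: String, params: String = ""): Unit = {
      // Parse "K1=V1,K2=V2,..." into key-value pairs
      val kvs = params.split(",").filter(_.contains("=")).map { kv =>
        val Array(k, v) = kv.split("=", 2); k -> v
      }
      // Read the file(s), substitute ${K} variables, split on ';', execute each statement
      spark.sparkContext.wholeTextFiles(path).collect().foreach { case (_, text) =>
        val rendered = kvs.foldLeft(text) { case (t, (k, v)) => t.replace("${" + k + "}", v) }
        rendered.split(";").map(_.trim).filter(_.nonEmpty).foreach(spark.sql)
      }
    }
  }
}

With import SparkSqlSupportSketch._ in scope, a call such as spark.execSqlFile("s3://YOUR/XXX.sql") compiles because the compiler wraps spark in SqlFileOps; the project's SparkSqlSupport import plays the same role.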


Origin blog.csdn.net/bluishglc/article/details/132314907