AWS High Availability Solution Architecture Design - Glue (ETL) Deployment and Development

Prerequisite: this article assumes an understanding of the basics of AWS architecture design.

AWS Glue is a fully managed ETL (extract, transform, and load) service that enables you to easily and cost-effectively classify, cleanse, and enrich data, and move data reliably between various data stores and data streams. AWS Glue consists of a central metadata repository called the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to set up or manage.

AWS Glue is designed to work with semi-structured data. It introduces a component called the DynamicFrame that you can use in your ETL scripts. A DynamicFrame is similar to an Apache Spark DataFrame, which is a data abstraction for organizing data into rows and columns, except that each record is self-describing, so no schema is required to begin with. With DynamicFrames you get schema flexibility and a set of advanced transformations designed specifically for them, and you can convert between DynamicFrames and Spark DataFrames to take advantage of both AWS Glue and Spark transformations in your analysis.
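
For example, a minimal sketch of converting back and forth, assuming a GlueContext named glueContext and a DynamicFrame named dyf (as in the job scripts later in this article):

from awsglue.dynamicframe import DynamicFrame

# Convert a DynamicFrame to a Spark DataFrame to use native Spark transformations...
df = dyf.toDF()
df_filtered = df.filter(df["value"] > 0)

# ...then convert back to a DynamicFrame to keep using Glue transforms and writers
dyf_filtered = DynamicFrame.fromDF(df_filtered, glueContext, "dyf_filtered")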

You can use the AWS Glue console to discover data, transform it, and make it available for search and query. The console calls the underlying services to coordinate the work needed to transform the data. You can also use AWS Glue API operations to interact with AWS Glue services. Use a familiar development environment to edit, debug, and test your Python or Scala Apache Spark ETL code.

1. Deploy Glue

Deploy Glue using CloudFormation, including databases, connections, crawlers, jobs, and triggers.

Create an IAM role

Attached managed policies:

AmazonS3FullAccess
AmazonSNSFullAccess
AWSGlueServiceRole
AmazonRDSFullAccess
SecretsManagerReadWrite
AWSLambdaRole

Trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
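
The role and policy attachments can also be created programmatically. A minimal boto3 sketch, assuming the role name my-glue-service-role (an example) and the China (aws-cn) partition used elsewhere in this article; use arn:aws:iam::aws:policy/... in standard regions:

import json
import boto3

iam = boto3.client("iam")

# Trust policy from above: let the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="my-glue-service-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies listed above
for policy_path in [
    "AmazonS3FullAccess",
    "AmazonSNSFullAccess",
    "service-role/AWSGlueServiceRole",
    "AmazonRDSFullAccess",
    "SecretsManagerReadWrite",
    "service-role/AWSLambdaRole",
]:
    iam.attach_role_policy(
        RoleName="my-glue-service-role",
        PolicyArn=f"arn:aws-cn:iam::aws:policy/{policy_path}",
    )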

Create the Glue resources (CloudFormation template):

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
    Environment:
        Type: String
        Default: DEV
    EnvironmentName:
        Type: String
        Default: d
    CustomerName:
        Description: The name of the customer
        Type: String
        #TODO:
        Default: your-company-name
    ProjectName:
        Description: The name of the project
        Type: String
        #TODO:
        Default: your-project-name
    CrawlerRoleARN:
        Type: String
        #TODO:
        Default: XXXXXXXXXXXXX
    ScriptLocation:
        Type: String
        #TODO: an empty file
        Default: s3://XXXXXX-s3/aws-glue-scripts
    SSLCertificateLocation:
        Type: String
        #TODO: a PEM file
        Default: s3://XXXXXX-s3/aws-glue-scripts/xxxxxxx.pem
    ConnAvailabilityZone:
        Description:
            The name of the Availability Zone. Currently this field must be populated,
            but it will be deprecated in the future
        Type: String
        #TODO:
        Default: cn-northwest-xxx
    ConnSecurityGroups:
        Description: The security group IDs used by the Glue connection
        Type: List<AWS::EC2::SecurityGroup::Id>
        #TODO:
        Default: sg-xxxxxxxxx, sg-xxxxxxxxx
    ConnSubnetId:
        Description: The subnet ID used by the Glue connection
        Type: String
        #TODO:
        Default: subnet-xxxxxxxxx
    OriginSecretid:
        Description: The Secrets Manager secret for the Origin database
        Type: String
        #TODO:
        Default: xxxxxxxxxxxxxxxxx
    OriginJDBCString:
        Type: String
        #TODO: jdbc:postgresql://{database ARN}:{port}/{databasename}
        Default: jdbc:postgresql://xxxx:xxx/xxxx
    OriginJDBCPath:
        Type: String
        #TODO: Database/Schema/%
        Default: xxxx/xxxx/%
    # The Target parameters below mirror the Origin ones; they are referenced by the Target connection and crawler
    TargetSecretid:
        Description: The Secrets Manager secret for the Target database
        Type: String
        #TODO:
        Default: xxxxxxxxxxxxxxxxx
    TargetJDBCString:
        Type: String
        #TODO: jdbc:postgresql://{database ARN}:{port}/{databasename}
        Default: jdbc:postgresql://xxxx:xxx/xxxx
    TargetJDBCPath:
        Type: String
        #TODO: Database/Schema/%
        Default: xxxx/xxxx/%
Resources:
    #Create Origin to contain tables created by the crawler
    OriginDatabase:
        Type: AWS::Glue::Database
        Properties:
            CatalogId: !Ref AWS::AccountId
            DatabaseInput:
                Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-gluedatabase
                Description: 'AWS Glue container to hold metadata tables for the Origin crawler'
    #Create Origin Connection
    OriginConnectionPostgreSQL:
        Type: AWS::Glue::Connection
        Properties:
            CatalogId: !Ref AWS::AccountId
            ConnectionInput:
                Description: 'Connect to Origin PostgreSQL database.'
                ConnectionType: 'JDBC'
                PhysicalConnectionRequirements:
                    AvailabilityZone: !Ref ConnAvailabilityZone
                    SecurityGroupIdList: !Ref ConnSecurityGroups
                    SubnetId: !Ref ConnSubnetId
                ConnectionProperties:
                    {
                        'JDBC_CONNECTION_URL': !Ref OriginJDBCString,
                        # If using SSL
                        'JDBC_ENFORCE_SSL': true,
                        'CUSTOM_JDBC_CERT': !Ref SSLCertificateLocation,
                        'SKIP_CUSTOM_JDBC_CERT_VALIDATION': true,
                        'USERNAME': !Join [ '', [ '{{resolve:secretsmanager:', !Ref OriginSecretid, ':SecretString:username}}' ] ],
                        'PASSWORD': !Join [ '', [ '{{resolve:secretsmanager:', !Ref OriginSecretid, ':SecretString:password}}' ] ]
                    }
                Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-glueconn
    #Create Target to contain tables created by the crawler
    TargetDatabase:
        Type: AWS::Glue::Database
        Properties:
            CatalogId: !Ref AWS::AccountId
            DatabaseInput:
                Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-gluedatabase
                Description: 'AWS Glue container to hold metadata tables for the Target crawler'
    #Create Target Connection
    TargetConnectionPostgreSQL:
        Type: AWS::Glue::Connection
        Properties:
            CatalogId: !Ref AWS::AccountId
            ConnectionInput:
                Description: 'Connect to Target PostgreSQL database.'
                ConnectionType: 'JDBC'
                PhysicalConnectionRequirements:
                    AvailabilityZone: !Ref ConnAvailabilityZone
                    SecurityGroupIdList: !Ref ConnSecurityGroups
                    SubnetId: !Ref ConnSubnetId
                ConnectionProperties:
                    {
                        'JDBC_CONNECTION_URL': !Ref TargetJDBCString,
                        # If using SSL
                        'JDBC_ENFORCE_SSL': true,
                        'CUSTOM_JDBC_CERT': !Ref SSLCertificateLocation,
                        'SKIP_CUSTOM_JDBC_CERT_VALIDATION': true,
                        'USERNAME': !Join [ '', [ '{{resolve:secretsmanager:', !Ref TargetSecretid, ':SecretString:username}}' ] ],
                        'PASSWORD': !Join [ '', [ '{{resolve:secretsmanager:', !Ref TargetSecretid, ':SecretString:password}}' ] ]
                    }
                Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-glueconn
    #Create a crawler to crawl the Origin data in PostgreSQL database
    OriginCrawler:
        Type: AWS::Glue::Crawler
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-gluecrawler
            Role: !Sub arn:aws-cn:iam::${AWS::AccountId}:role/${CrawlerRoleARN}
            Description: AWS Glue crawler to crawl Origin data
            DatabaseName: !Ref OriginDatabase
            Targets:
                JdbcTargets:
                    - ConnectionName: !Ref OriginConnectionPostgreSQL
                      Path: !Ref OriginJDBCPath
            TablePrefix: !Sub ${ProjectName}_${EnvironmentName}_
            SchemaChangePolicy:
                UpdateBehavior: 'UPDATE_IN_DATABASE'
                DeleteBehavior: 'LOG'
            Tags:
                ApplName:  your-app-name
    #Create a crawler to crawl the Target data in PostgreSQL database
    TargetCrawler:
        Type: AWS::Glue::Crawler
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-gluecrawler
            Role: !Sub arn:aws-cn:iam::${AWS::AccountId}:role/${CrawlerRoleARN}
            Description: AWS Glue crawler to crawl Target data
            DatabaseName: !Ref TargetDatabase
            Targets:
                JdbcTargets:
                    - ConnectionName: !Ref TargetConnectionPostgreSQL
                      Path: !Ref TargetJDBCPath
            TablePrefix: !Sub ${ProjectName}_${EnvironmentName}_
            SchemaChangePolicy:
                UpdateBehavior: 'UPDATE_IN_DATABASE'
                DeleteBehavior: 'LOG'
            Tags:
                ApplName: your-app-name
    #Job  sync from Origin to Target
    JobDataSync:
        Type: AWS::Glue::Job
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-data-sync-${EnvironmentName}-gluejob
            Role: !Ref CrawlerRoleARN
            DefaultArguments: {'--job-language': 'python', '--enable-continuous-cloudwatch-log': 'true', '--enable-continuous-log-filter': 'true'}
            # If the script is written in Scala, set DefaultArguments={'--job-language': 'scala', '--class': 'your Scala class'}
            Connections:
                Connections:
                    - !Ref OriginConnectionPostgreSQL
                    - !Ref TargetConnectionPostgreSQL
            Description: AWS Glue job for Data sync from Origin to Target
            GlueVersion: '2.0'
            Command:
                Name: glueetl
                PythonVersion: '3'
                ScriptLocation: !Sub ${ScriptLocation}/${CustomerName}-${ProjectName}-data-sync-gluejob.py
            Timeout: 60
            WorkerType: Standard
            NumberOfWorkers: 2
            ExecutionProperty:
                MaxConcurrentRuns: 1
            Tags:
                ApplName: your-app-name
    #Trigger
    TriggerDataSync:
        Type: AWS::Glue::Trigger
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-data-sync-${EnvironmentName}-gluetrigger
            Description: AWS Glue trigger for Data sync from Origin to Target
            Type: SCHEDULED
            Actions:
                - JobName: !Ref JobDataSync
            Schedule: cron(0 12 * * ? *)
            StartOnCreation: true
            Tags:
                ApplName: your-app-name
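
With the role in place, the stack above can be deployed from the console or programmatically. A minimal boto3 sketch, assuming the template is saved as glue-stack.yaml and the stack name glue-data-sync-dev (both are examples):

import boto3

cfn = boto3.client("cloudformation")

with open("glue-stack.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="glue-data-sync-dev",
    TemplateBody=template_body,
    Parameters=[
        # Override the #TODO defaults as needed, e.g. the crawler role created earlier
        {"ParameterKey": "CrawlerRoleARN", "ParameterValue": "my-glue-service-role"},
    ],
)

# Wait until all Glue resources have been created
cfn.get_waiter("stack_create_complete").wait(StackName="glue-data-sync-dev")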

2. Glue automated deployment (CD)

name: build-and-deploy

# Controls when the action will run. Triggers the workflow on push 
# but only for the master branch.
on:
  push:
    branches: [ master ]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains two jobs called "build" and "deploy"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        
      # Set up Python
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
          
      # Install nbconvert to convert notebook file to python script
      - name: Install nbconvert
        run: |
          python -m pip install --upgrade pip
          pip install nbconvert

      # Convert notebook file to python
      - name: Convert notebook
        run: jupyter nbconvert --to python traffic.ipynb

      # Persist python script for use between jobs
      - name: Upload python script
        uses: actions/upload-artifact@v2
        with:
          name: traffic.py
          path: traffic.py
  
  # Upload python script to S3 and update Glue job
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Download python script from build
        uses: actions/download-artifact@v2
        with:
          name: traffic.py
          
      # Install the AWS CLI
      - name: Install AWS CLI
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install
          
      # Set up credentials used by AWS CLI
      - name: Set up AWS credentials
        shell: bash
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          mkdir -p ~/.aws
          touch ~/.aws/credentials
          echo "[default]
          aws_access_key_id = $AWS_ACCESS_KEY_ID
          aws_secret_access_key = $AWS_SECRET_ACCESS_KEY" > ~/.aws/credentials
          
      # Copy the file to the S3 bucket
      - name: Upload to S3
        run: aws s3 cp traffic.py s3://${{ secrets.S3_BUCKET }}/traffic_${GITHUB_SHA}.py --region us-east-1
      
      # Update the Glue job to use the new script
      - name: Update Glue job
        run: |
          aws glue update-job --job-name "Traffic ETL" --job-update \
            "Role=AWSGlueServiceRole-TrafficCrawler,Command={Name=glueetl,ScriptLocation=s3://${
    
    {secrets.S3_BUCKET}}/traffic_${GITHUB_SHA}.py},Connections={Connections=redshift}" \
            --region us-east-1
      
      # Remove stored credentials file
      - name: Cleanup
        run: rm -rf ~/.aws
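
If you prefer not to build the --job-update string for the CLI, the same update can be expressed with boto3 in a small deploy script; a sketch mirroring the step above (it assumes S3_BUCKET and GITHUB_SHA are exposed as environment variables to the step):

import os
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point the existing job at the newly uploaded script version
glue.update_job(
    JobName="Traffic ETL",
    JobUpdate={
        "Role": "AWSGlueServiceRole-TrafficCrawler",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": f"s3://{os.environ['S3_BUCKET']}/traffic_{os.environ['GITHUB_SHA']}.py",
        },
        "Connections": {"Connections": ["redshift"]},
    },
)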
    

3. Low-code Glue development (recommended)

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and run them on AWS Glue's Apache Spark-based serverless ETL engine, and you can inspect the schema and data profile results at each step of the job.

(Figure: the AWS Glue Studio visual job editor)

4. Python development

Basic Python job boilerplate:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

4.1 Import data source

PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="[name of the Glue source database you created]",
    table_name="[table name generated by the crawler]",
    additional_options={"jobBookmarkKeys": ["[bookmark column of the table, must not be null]"], "jobBookmarkKeysSortOrder": "[asc or desc]"},
    transformation_ctx="PostgreSQLtable_node1",
)

transformation_ctx is the name of the bookmark. A bookmark records how far the data has been processed, just like a bookmark in a book, which makes it very useful for incremental synchronization.

For the bookmark to take effect, both of the following must be true (see the sketch after this list):

1) Bookmarks are enabled for the Glue job under "Advanced Settings" -> "Enable Bookmarks" -> "Enable";

2) The additional_options shown above (jobBookmarkKeys and jobBookmarkKeysSortOrder) are set.
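
Bookmarks can also be controlled per run through the --job-bookmark-option argument; a minimal boto3 sketch (the job name follows the naming pattern from the CloudFormation template with its default parameter values):

import boto3

glue = boto3.client("glue")

# Start a run with bookmarks enabled (same effect as "Enable Bookmarks" in the console)
glue.start_job_run(
    JobName="your-company-name-your-project-name-data-sync-d-gluejob",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)

# To reprocess all data from scratch, reset the bookmark first:
# glue.reset_job_bookmark(JobName="your-company-name-your-project-name-data-sync-d-gluejob")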

4.2 Apply field mapping

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=PostgreSQLtable_node1,
    mappings=[
        ("id", "decimal(19,0)", "id", "decimal(19,0)"),
        ("updatetime", "timestamp", "updatetime", "timestamp"),
        ("value", "decimal(19,0)", "value", "decimal(19,0)"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

Choosing types in the field mapping takes some trial and error. For example, directly defining a decimal with a precision greater than 8 can cause problems when exporting the data, so a certain amount of experience and experimentation is required.
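
A quick way to sanity-check the chosen types before writing anywhere is to print the schema of the mapped frame (using the frame from the snippet above):

# Inspect the schema produced by ApplyMapping
ApplyMapping_node2.printSchema()

# Or look at the Spark view of the same data with a few sample rows
ApplyMapping_node2.toDF().show(5)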

4.3 Insert data incrementally

# Script generated for node PostgreSQL table
PostgreSQLtable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="[name of the Glue target database you created]",
    table_name="[table name generated by the crawler]",
    transformation_ctx="PostgreSQLtable_node3",
)

As in section 4.1, transformation_ctx names the bookmark that drives incremental synchronization; remember to call job.commit() at the end of the script so the bookmark advances.

4.4 Insert data in full (with empty table)

df = ApplyMapping_node2.toDF()
df.write.format("jdbc").mode('overwrite') \
  .option("url", "jdbc:postgresql://[host]:5432/[database name]") \
  .option("user", "[username]") \
  .option("password", "[password]") \
  .option("dbtable", "[dbo.table_name]") \
  .option("truncate", "true") \
  .save()

Use this approach if you want to truncate the table first and then write the full data set.

4.5 Use configuration parameters and execute custom SQL

import boto3
import psycopg2

data_frame = ApplyMapping_node2.toDF()
glue = boto3.client('glue')
connection = glue.get_connection(Name="[name of the Glue target database connection you created]")
pg_url = connection['Connection']['ConnectionProperties']['JDBC_CONNECTION_URL']
# Extract the host name from the JDBC URL
pg_url = pg_url.split('/')[2].split(':')[0]
pg_user = connection['Connection']['ConnectionProperties']['USERNAME']
pg_password = connection['Connection']['ConnectionProperties']['PASSWORD']
magento = data_frame.collect()

# Use the connection parameters retrieved above
db = psycopg2.connect(host=pg_url, user=pg_user, password=pg_password, database="[database name]")
cursor = db.cursor()
for r in magento:
    insertQry = """INSERT INTO dbo.gluetest(id, updatetime, value) VALUES(%s, %s, %s);"""
    cursor.execute(insertQry, (r.id, r.updatetime, r.value))
    # Consider committing in batches instead of per row
    db.commit()
cursor.close()

Using this method requires the psycopg2 package, which Glue installs into the job environment before the run (similar to a package pre-installed in a Docker image).

Configure it under "Security Configuration, Script Libraries and Job Parameters (Optional)" -> "Job Parameters" in the Glue job (the same key/value can also be added to DefaultArguments in the CloudFormation job definition from section 1):

Glue version | Key                         | Value
2.0          | --additional-python-modules | psycopg2-binary==2.8.6
3.0          | --additional-python-modules | psycopg2-binary==2.9.0

4.6 Upsert (Insert & update)

Incrementally update the data using updatetime as the bookmark column (it must not be null): new rows are inserted and existing rows are updated.

from py4j.java_gateway import java_import

# Reuse the SparkContext created in the job boilerplate above (do not create a second one)
java_import(sc._gateway.jvm, "java.sql.Connection")
java_import(sc._gateway.jvm, "java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm, "java.sql.DriverManager")
java_import(sc._gateway.jvm, "java.sql.SQLException")

data_frame = PostgreSQLtable_node1.toDF()
magento = data_frame.collect()
source_jdbc_conf = glueContext.extract_jdbc_conf('[name of the Glue target database connection you created]')
page = 0
conn = None
try:
    conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/[database name]', source_jdbc_conf.get('user'), source_jdbc_conf.get('password'))
    insertQry = """INSERT INTO dbo.gluetest(id, updatetime, value) VALUES(?, ?, ?) ON CONFLICT (id) DO UPDATE
            SET updatetime = excluded.updatetime, value = excluded.value
            WHERE dbo.gluetest.updatetime is distinct from excluded.updatetime;"""
    stmt = conn.prepareStatement(insertQry)
    conn.setAutoCommit(False)
    for r in magento:
        stmt.setBigDecimal(1, r.id)
        stmt.setTimestamp(2, r.updatetime)
        stmt.setBigDecimal(3, r.value)
        stmt.addBatch()
        page += 1
        # Commit in batches of 1000 rows
        if page % 1000 == 0:
            stmt.executeBatch()
            conn.commit()
            page = 0
    if page > 0:
        stmt.executeBatch()
        conn.commit()
finally:
    if conn:
        conn.close()
job.commit()

Main points:

The above is the PostgreSQL approach; Oracle uses MERGE, and SQL Server likewise uses MERGE (or an equivalent insert-or-update pattern).

The Spark-bundled Java JDBC classes used here (via py4j) are an alternative to psycopg2 that does not require installing any extra packages.

The drawback of psycopg2 is that installing the package adds roughly one minute to each run, so for time-sensitive jobs the native approach is recommended.

5. Local Glue debugging (auxiliary)

Develop and test AWS Glue job scripts locally.

Set up the container to use Visual Studio Code

Prerequisites:

  1. Install Visual Studio Code.

  2. Install Python.

  3. Install the Visual Studio Code Remote - Containers extension.

  4. Open the workspace folder in Visual Studio Code.

  5. Select Settings.

  6. Select Workspace.

  7. Select Open Settings (JSON).

  8. Paste the following JSON and save it.

    {
        "python.defaultInterpreterPath": "/usr/bin/python3",
        "python.analysis.extraPaths": [
            "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/"
        ]
    }
    

Steps:

  1. Run the Docker container.

    docker run -it -v D:/Projects/AWS/Projects/Glue/.aws:/home/glue_user/.aws -v D:/Projects/AWS/Projects/Glue:/home/glue_user/workspace/ -e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

  2. Start Visual Studio Code.

  3. Select Remote Explorer from the left menu, then select amazon/aws-glue-libs:glue_libs_3.0.0_image_01.

  4. Right-click and select Attach to Container. If a dialog box appears, select Got it.

  5. Open /home/glue_user/workspace/.

  6. Run the following command first in the VS Code terminal:

    export AWS_REGION=cn-northwest-x

  7. Create the Glue PySpark script, then choose Run.

    You will see the script run successfully.
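
For the last step, any small PySpark script is enough to verify the container environment; a minimal sketch that needs no AWS resources:

# sample.py - verify the Glue libraries and Spark work inside the container
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Build a tiny DataFrame locally and show it
df = spark.createDataFrame([(1, "hello"), (2, "glue")], ["id", "value"])
df.show()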


Source: blog.csdn.net/black0707/article/details/124987653