AWS High-Availability Architecture Design: Deploying and Developing with Glue (ETL)

Prerequisite: this article assumes an understanding of the basics of AWS architecture design.

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize, clean, and enrich data, and to move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository called the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to set up or manage.

AWS Glue is designed to work with semi-structured data. It introduces a component called the dynamic frame (DynamicFrame), which you can use in your ETL scripts. A dynamic frame is similar to an Apache Spark DataFrame, a data abstraction that organizes data into rows and columns, except that each record is self-describing, so no schema is required up front. Dynamic frames give you schema flexibility and a set of advanced transformations designed specifically for them. You can convert between dynamic frames and Spark DataFrames to combine AWS Glue and Spark transformations in the analysis you need.
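To make "self-describing" concrete, here is a toy sketch in plain Python. It is not the Glue API: the function name `infer_schema` and the sample records are assumptions made for illustration. Each record carries its own field names and types, and a schema is derived only when it is needed.

```python
from collections import defaultdict

def infer_schema(records):
    """Derive a schema from self-describing records (plain dicts).

    A field seen with more than one type is reported with every type it took,
    loosely mirroring how a dynamic frame tolerates inconsistent records.
    """
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}

records = [
    {"id": 1, "value": 10},
    {"id": 2, "value": "ten"},               # same field, different type
    {"id": 3, "updatetime": "2021-01-01"},   # extra field, no "value"
]

schema = infer_schema(records)
print(schema)
```

No record had to match a predeclared schema; the conflicting types of `value` only surface when the schema is computed.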

You can use the AWS Glue console to discover data, transform it, and make it available for search and querying. The console calls the underlying services to orchestrate the work required to transform your data. You can also use the AWS Glue API operations to interact with AWS Glue services. Use a familiar development environment to edit, debug, and test your Python or Scala Apache Spark ETL code.

1. Deploy Glue

Deploy Glue with CloudFormation, including the databases, connections, crawlers, jobs, and triggers.

Create an IAM role

Attached policies

AmazonS3FullAccess
AmazonSNSFullAccess
AWSGlueServiceRole
AmazonRDSFullAccess
SecretsManagerReadWrite
AWSLambdaRole

Trust relationship

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
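If you would rather generate the trust policy than paste it, the same document can be built and serialized in Python. This is a minimal sketch; the helper name `glue_trust_policy` is hypothetical, and the resulting string is what you would supply as the role's trust policy.

```python
import json

def glue_trust_policy(service="glue.amazonaws.com"):
    """Return a trust policy that lets the given AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }

policy_json = json.dumps(glue_trust_policy(), indent=4)
print(policy_json)
```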

Create the Glue resources

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
    Environment:
        Type: String
        Default: DEV
    EnvironmentName:
        Type: String
        Default: d
    CustomerName:
        Description: The name of the customer
        Type: String
        #TODO:
        Default: your-company-name
    ProjectName:
        Description: The name of the project
        Type: String
        #TODO:
        Default: your-project-name
    CrawlerRoleARN:
        Type: String
        #TODO:
        Default: XXXXXXXXXXXXX
    ScriptLocation:
        Type: String
        #TODO: a empty file
        Default: s3://XXXXXX-s3/aws-glue-scripts
    SSLCertificateLocation:
        Type: String
        #TODO:a pem file
        Default: s3://XXXXXX-s3/aws-glue-scripts/xxxxxxx.pem
    ConnAvailabilityZone:
        Description:
            The name of the Availability Zone. Currently the field must be populated, but it will be
            deprecated in the future
        Type: String
        #TODO:
        Default: cn-northwest-xxx
    ConnSecurityGroups:
        Description: The IDs of the security groups used by the connection
        Type: List<AWS::EC2::SecurityGroup::Id>
        #TODO:
        Default: sg-xxxxxxxxx, sg-xxxxxxxxx
    ConnSubnetId:
        Description: The ID of the subnet used by the connection
        Type: String
        #TODO:
        Default: subnet-xxxxxxxxx
    OriginSecretid:
        Description: The name of the Secret
        Type: String
        #TODO:
        Default: xxxxxxxxxxxxxxxxx
    OriginJDBCString:
        Type: String
        #TODO: jdbc:postgresql://{database ARN}:{port}/{databasename}
        Default: jdbc:postgresql://xxxx:xxx/xxxx
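    OriginJDBCPath:
        Type: String
        #TODO: Database/Schema/%
        Default: xxxx/xxxx/%
    # The Target* parameters are referenced by the target connection and crawler below
    TargetSecretid:
        Description: The name of the Secret
        Type: String
        #TODO:
        Default: xxxxxxxxxxxxxxxxx
    TargetJDBCString:
        Type: String
        #TODO: jdbc:postgresql://{database ARN}:{port}/{databasename}
        Default: jdbc:postgresql://xxxx:xxx/xxxx
    TargetJDBCPath:
        Type: String
        #TODO: Database/Schema/%
        Default: xxxx/xxxx/%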
Resources:
    #Create Origin to contain tables created by the crawler
    OriginDatabase:
        Type: AWS::Glue::Database
        Properties:
            CatalogId: !Ref AWS::AccountId
            DatabaseInput:
                Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-gluedatabase
                Description: 'AWS Glue container to hold metadata tables for the Origin crawler'
    #Create Origin Connection
    OriginConnectionPostgreSQL:
        Type: AWS::Glue::Connection
        Properties:
            CatalogId: !Ref AWS::AccountId
            ConnectionInput:
                Description: 'Connect to Origin PostgreSQL database.'
                ConnectionType: 'JDBC'
                PhysicalConnectionRequirements:
                    AvailabilityZone: !Ref ConnAvailabilityZone
                    SecurityGroupIdList: !Ref ConnSecurityGroups
                    SubnetId: !Ref ConnSubnetId
                ConnectionProperties:
                    {
                        'JDBC_CONNECTION_URL': !Ref OriginJDBCString,
                        # Keep the next three properties only if SSL is used
                        'JDBC_ENFORCE_SSL': true,
                        'CUSTOM_JDBC_CERT': !Ref SSLCertificateLocation,
                        'SKIP_CUSTOM_JDBC_CERT_VALIDATION': true,
                        'USERNAME': !Join [ '', [ '{{resolve:secretsmanager:', !Ref OriginSecretid, ':SecretString:username}}' ] ],
                        'PASSWORD': !Join [ '', [ '{{resolve:secretsmanager:', !Ref OriginSecretid, ':SecretString:password}}' ] ]
                    }
                Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-glueconn
    #Create Target to contain tables created by the crawler
    TargetDatabase:
        Type: AWS::Glue::Database
        Properties:
            CatalogId: !Ref AWS::AccountId
            DatabaseInput:
                Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-gluedatabase
                Description: 'AWS Glue container to hold metadata tables for the Target crawler'
    #Create Target Connection
    TargetConnectionPostgreSQL:
        Type: AWS::Glue::Connection
        Properties:
            CatalogId: !Ref AWS::AccountId
            ConnectionInput:
                Description: 'Connect to Target PostgreSQL database.'
                ConnectionType: 'JDBC'
                PhysicalConnectionRequirements:
                    AvailabilityZone: !Ref ConnAvailabilityZone
                    SecurityGroupIdList: !Ref ConnSecurityGroups
                    SubnetId: !Ref ConnSubnetId
                ConnectionProperties:
                    {
                        'JDBC_CONNECTION_URL': !Ref TargetJDBCString,
                        # Keep the next three properties only if SSL is used
                        'JDBC_ENFORCE_SSL': true,
                        'CUSTOM_JDBC_CERT': !Ref SSLCertificateLocation,
                        'SKIP_CUSTOM_JDBC_CERT_VALIDATION': true,
                        'USERNAME': !Join [ '', [ '{{resolve:secretsmanager:', !Ref TargetSecretid, ':SecretString:username}}' ] ],
                        'PASSWORD': !Join [ '', [ '{{resolve:secretsmanager:', !Ref TargetSecretid, ':SecretString:password}}' ] ]
                    }
                Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-glueconn
    #Create a crawler to crawl the Origin data in PostgreSQL database
    OriginCrawler:
        Type: AWS::Glue::Crawler
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-origin-${EnvironmentName}-gluecrawler
            Role: !Sub arn:aws-cn:iam::${AWS::AccountId}:role/${CrawlerRoleARN}
            Description: AWS Glue crawler to crawl Origin data
            DatabaseName: !Ref OriginDatabase
            Targets:
                JdbcTargets:
                    - ConnectionName: !Ref OriginConnectionPostgreSQL
                      Path: !Ref OriginJDBCPath
            TablePrefix: !Sub ${ProjectName}_${EnvironmentName}_
            SchemaChangePolicy:
                UpdateBehavior: 'UPDATE_IN_DATABASE'
                DeleteBehavior: 'LOG'
            Tags:
                ApplName:  your-app-name
    #Create a crawler to crawl the Target data in PostgreSQL database
    TargetCrawler:
        Type: AWS::Glue::Crawler
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-target-${EnvironmentName}-gluecrawler
            Role: !Sub arn:aws-cn:iam::${AWS::AccountId}:role/${CrawlerRoleARN}
            Description: AWS Glue crawler to crawl Target data
            DatabaseName: !Ref TargetDatabase
            Targets:
                JdbcTargets:
                    - ConnectionName: !Ref TargetConnectionPostgreSQL
                      Path: !Ref TargetJDBCPath
            TablePrefix: !Sub ${ProjectName}_${EnvironmentName}_
            SchemaChangePolicy:
                UpdateBehavior: 'UPDATE_IN_DATABASE'
                DeleteBehavior: 'LOG'
            Tags:
                ApplName: your-app-name
    #Job  sync from Origin to Target
    JobDataSync:
        Type: AWS::Glue::Job
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-data-sync-${EnvironmentName}-gluejob
            Role: !Ref CrawlerRoleARN
            DefaultArguments: {'--job-language': 'python', '--enable-continuous-cloudwatch-log': 'true', '--enable-continuous-log-filter': 'true'}
            # If the script is written in Scala, set DefaultArguments to {'--job-language': 'scala', '--class': 'your scala class'}
            Connections:
                Connections:
                    - !Ref OriginConnectionPostgreSQL
                    - !Ref TargetConnectionPostgreSQL
            Description: AWS Glue job for Data sync from Origin to Target
            GlueVersion: '2.0'
            Command:
                Name: glueetl
                PythonVersion: 3
                ScriptLocation:
                    !Sub ${ScriptLocation}/${CustomerName}-${ProjectName}-data-sync-gluejob.py
            Timeout: 60
            WorkerType: Standard
            NumberOfWorkers: 2
            ExecutionProperty:
                MaxConcurrentRuns: 1
            Tags:
                ApplName: your-app-name
    #Trigger
    TriggerDataSync:
        Type: AWS::Glue::Trigger
        Properties:
            Name: !Sub ${CustomerName}-${ProjectName}-data-sync-${EnvironmentName}-gluetrigger
            Description: AWS Glue trigger for Data sync from Origin to Target
            Type: SCHEDULED
            Actions:
                - JobName: !Ref JobDataSync
            Schedule: cron(0 12 * * ? *)
            StartOnCreation: true
            Tags:
                ApplName: your-app-name
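
Every resource name in the template follows the same customer-project-role-environment pattern. The sketch below reproduces that substitution with Python's `string.Template` so the resulting names can be checked before deploying. The values are the template's default parameters, and `resource_name` is a hypothetical helper, not part of CloudFormation.

```python
from string import Template

# Defaults taken from the template's Parameters section.
params = {
    "CustomerName": "your-company-name",
    "ProjectName": "your-project-name",
    "EnvironmentName": "d",
}

def resource_name(pattern, values=params):
    """Expand ${Name} placeholders the way !Sub does for plain parameters."""
    return Template(pattern).substitute(values)

origin_db = resource_name("${CustomerName}-${ProjectName}-origin-${EnvironmentName}-gluedatabase")
table_prefix = resource_name("${ProjectName}_${EnvironmentName}_")
print(origin_db)
print(table_prefix)
```

Tables created by the crawlers are named with `table_prefix` followed by the source table name, which is the name the job scripts later pass as `table_name`.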

2. Automated Glue deployment (CD)

name: build-and-deploy

# Controls when the action will run. Triggers the workflow on push 
# but only for the master branch.
on:
  push:
    branches: [ master ]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains two jobs called "build" and "deploy"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        
      # Set up Python
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
          
      # Install nbconvert to convert notebook file to python script
      - name: Install nbconvert
        run: |
          python -m pip install --upgrade pip
          pip install nbconvert

      # Convert notebook file to python
      - name: Convert notebook
        run: jupyter nbconvert --to python traffic.ipynb

      # Persist python script for use between jobs
      - name: Upload python script
        uses: actions/upload-artifact@v4
        with:
          name: traffic.py
          path: traffic.py

  # Upload python script to S3 and update Glue job
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Download python script from build
        uses: actions/download-artifact@v4
        with:
          name: traffic.py
          
      # Install the AWS CLI
      - name: Install AWS CLI
        run: |
          curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
          unzip awscliv2.zip
          sudo ./aws/install
          
      # Set up credentials used by AWS CLI
      - name: Set up AWS credentials
        shell: bash
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          mkdir -p ~/.aws
          touch ~/.aws/credentials
          echo "[default]
          aws_access_key_id = $AWS_ACCESS_KEY_ID
          aws_secret_access_key = $AWS_SECRET_ACCESS_KEY" > ~/.aws/credentials
          
      # Copy the file to the S3 bucket
      - name: Upload to S3
        run: aws s3 cp traffic.py s3://${{secrets.S3_BUCKET}}/traffic_${GITHUB_SHA}.py --region us-east-1
      
      # Update the Glue job to use the new script
      - name: Update Glue job
        run: |
          aws glue update-job --job-name "Traffic ETL" --job-update \
            "Role=AWSGlueServiceRole-TrafficCrawler,Command={Name=glueetl,ScriptLocation=s3://${{secrets.S3_BUCKET}}/traffic_${GITHUB_SHA}.py},Connections={Connections=redshift}" \
            --region us-east-1
      
      # Remove stored credentials file
      - name: Cleanup
        run: rm -rf ~/.aws
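
The deploy job versions each script by commit SHA and then points the Glue job at the new object. The sketch below shows how those two strings are assembled, which can be handy for checking the naming outside the workflow. The bucket name and SHA are made-up sample values, and both helper names are hypothetical.

```python
def script_key(sha, prefix="traffic"):
    """Object key used for the uploaded script, one per commit."""
    return f"{prefix}_{sha}.py"

def job_update_arg(role, bucket, sha, connection):
    """Shorthand value passed to `aws glue update-job --job-update`."""
    location = f"s3://{bucket}/{script_key(sha)}"
    return (
        f"Role={role},"
        f"Command={{Name=glueetl,ScriptLocation={location}}},"
        f"Connections={{Connections={connection}}}"
    )

arg = job_update_arg("AWSGlueServiceRole-TrafficCrawler", "example-bucket", "abc123", "redshift")
print(arg)
```

Because every commit produces a distinct key, rolling back amounts to pointing the job at an earlier object.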

3. Low-code Glue development (recommended)

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and run them seamlessly on AWS Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and the data results at each step of the job.

AWS Glue Studio


4. Python development

Basic Python setup:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

4.1 Import the data source

PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="[name of the Glue source database you created]",
    table_name="[table name generated by the crawler]",
    additional_options={
        "jobBookmarkKeys": ["[bookmark column of the table; must not be null]"],
        "jobBookmarkKeysSortOrder": "[asc or desc]",
    },
    transformation_ctx="PostgreSQLtable_node1",
)

transformation_ctx is the name of the bookmark. A bookmark marks how far the data has been processed, much like a bookmark in a book, which is very useful for incremental synchronization.

For the bookmark to take effect, both of the following must hold:

1) In the Glue job, "Advanced properties" -> "Job bookmark" is set to "Enable";

2) The additional_options argument is set on the source, as in the code above.
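
As a conceptual model only (Glue keeps the bookmark state itself, and its real mechanics are more involved), an ascending bookmark behaves roughly like the plain-Python sketch below. The function `read_incremental` and the integer `updatetime` values are assumptions for illustration.

```python
def read_incremental(rows, bookmark_key, last_seen=None):
    """Return rows newer than the bookmark, plus the advanced bookmark.

    rows: list of dicts; bookmark_key: a field that only grows (ascending order).
    """
    ordered = sorted(rows, key=lambda r: r[bookmark_key])
    fresh = [r for r in ordered if last_seen is None or r[bookmark_key] > last_seen]
    new_bookmark = fresh[-1][bookmark_key] if fresh else last_seen
    return fresh, new_bookmark

table = [{"id": 1, "updatetime": 100}, {"id": 2, "updatetime": 200}]

# First run: no bookmark yet, so everything is read.
first_run, bookmark = read_incremental(table, "updatetime")

# A new row arrives; the second run only picks up what lies past the bookmark.
table.append({"id": 3, "updatetime": 300})
second_run, bookmark = read_incremental(table, "updatetime", bookmark)
print([r["id"] for r in first_run], [r["id"] for r in second_run], bookmark)
```

This also shows why the bookmark column must not be null: a row without a comparable key cannot be placed before or after the bookmark.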

4.2 Apply the field mapping

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=PostgreSQLtable_node1,
    mappings=[
        ("id", "decimal(19,0)", "id", "decimal(19,0)"),
        ("updatetime", "timestamp", "updatetime", "timestamp"),
        ("value", "decimal(19,0)", "value", "decimal(19,0)"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

Getting the type mapping right takes repeated trial. For example, directly declaring a decimal with more than 8 digits can cause problems when the data is exported, so expect some experimentation.
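
Each mapping is a 4-tuple of (source field, source type, target field, target type). The toy sketch below applies such tuples to a plain dict to show the rename, cast, and drop behavior. It is not the Glue implementation; the `CASTERS` table and `apply_mapping` function are hypothetical and only cover the two types used above.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical casters for the two target types used in the mapping above.
CASTERS = {
    "decimal(19,0)": lambda v: Decimal(v).quantize(Decimal(1)),
    "timestamp": lambda v: v if isinstance(v, datetime) else datetime.fromisoformat(v),
}

def apply_mapping(record, mappings):
    """Keep only the mapped fields, renaming and casting each one."""
    out = {}
    for source, _source_type, target, target_type in mappings:
        if source in record:
            out[target] = CASTERS[target_type](record[source])
    return out

mappings = [
    ("id", "decimal(19,0)", "id", "decimal(19,0)"),
    ("updatetime", "timestamp", "updatetime", "timestamp"),
    ("value", "decimal(19,0)", "value", "decimal(19,0)"),
]

row = apply_mapping(
    {"id": "7", "updatetime": "2021-06-01T12:00:00", "value": 42, "extra": "dropped"},
    mappings,
)
print(row)
```

Fields that do not appear in the mapping list, such as `extra` here, are absent from the output.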

4.3 Insert data incrementally

# Script generated for node PostgreSQL table
PostgreSQLtable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="[name of the Glue target database you created]",
    table_name="[table name generated by the crawler]",
    transformation_ctx="PostgreSQLtable_node3",
)

As with the read in 4.1, transformation_ctx is the name of the bookmark for this step.

4.4 Insert the full data set (emptying the table first)

df = ApplyMapping_node2.toDF()
df.write.format("jdbc").mode('overwrite') \
  .option("url", "jdbc:postgresql://[host]:5432/[database name]") \
  .option("user", "[user]") \
  .option("password", "[password]") \
  .option("dbtable", "[dbo.table name]") \
  .option("truncate", "true") \
  .save()

Use the code above when you want to empty the table before writing the data.
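
The `truncate` option matters because, in overwrite mode, Spark's JDBC writer otherwise drops and recreates the target table, which discards anything defined on it. The sketch below illustrates the difference using SQLite from the Python standard library as a stand-in for PostgreSQL. SQLite has no TRUNCATE, so a `DELETE` plays that role here; the table and index names are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE gluetest (id INTEGER PRIMARY KEY, value INTEGER)")
db.execute("CREATE INDEX idx_value ON gluetest (value)")
db.execute("INSERT INTO gluetest VALUES (1, 10)")

def index_names(conn):
    """List user-defined indexes currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name NOT LIKE 'sqlite_%'"
    )
    return [name for (name,) in rows]

# "Truncate": remove the rows but keep the table definition and its index.
db.execute("DELETE FROM gluetest")
after_truncate = index_names(db)

# "Drop and recreate": the index disappears together with the old table.
db.execute("DROP TABLE gluetest")
db.execute("CREATE TABLE gluetest (id INTEGER, value INTEGER)")
after_recreate = index_names(db)

print(after_truncate, after_recreate)
```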

4.5 Use connection settings and run custom SQL

import boto3
import psycopg2

data_frame = ApplyMapping_node2.toDF()
glue = boto3.client('glue')
connection = glue.get_connection(Name="[name of the Glue target database connection you created]")
pg_url = connection['Connection']['ConnectionProperties']['JDBC_CONNECTION_URL']
# Keep only the host part of jdbc:postgresql://host:port/database
pg_url = pg_url.split('/')[2].split(':')[0]
pg_user = connection['Connection']['ConnectionProperties']['USERNAME']
pg_password = connection['Connection']['ConnectionProperties']['PASSWORD']
magento = data_frame.collect()

# The connection settings read above are used below
db = psycopg2.connect(host = pg_url, user = pg_user, password = pg_password, database = "[database name]")
cursor = db.cursor()
insertQry = """ INSERT INTO dbo.gluetest(id, updatetime, value) VALUES(%s, %s, %s) ;"""
for r in magento:
    cursor.execute(insertQry, (r.id, r.updatetime, r.value))
    # Consider committing in batches rather than once per row
    db.commit()
cursor.close()
db.close()

This approach requires the psycopg2 package, which must be installed before the job runs (much like pre-installing a package in a Docker image).

Set it in the Glue job under "Security configuration, script libraries, and job parameters (optional)" -> "Job parameters":

Glue version	Key	Value
2.0	--additional-python-modules	psycopg2-binary==2.8.6
3.0	--additional-python-modules	psycopg2-binary==2.9.0
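
The script in 4.5 derives the host from the JDBC URL with two `split` calls and commits once per row. The sketch below shows one possible alternative for both steps using only the standard library. The helper names `jdbc_host_port` and `chunked`, and the sample URL, are assumptions for illustration.

```python
from urllib.parse import urlsplit

def jdbc_host_port(jdbc_url):
    """Extract (host, port) from a URL such as jdbc:postgresql://host:5432/db."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    return parts.hostname, parts.port

def chunked(rows, size):
    """Yield successive lists of at most `size` rows, for one commit per chunk."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

host, port = jdbc_host_port("jdbc:postgresql://db.example.internal:5432/sales")
batches = list(chunked(list(range(5)), 2))
print(host, port, batches)
```

In the job script, each chunk could be written with a single `cursor.executemany(...)` call followed by one `db.commit()`, which reduces the number of round trips compared with committing every row.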

4.6 Upsert (insert and update)

Update the data incrementally, using the update time as the bookmark (it must not be null): new rows are inserted and existing rows are updated.

from py4j.java_gateway import java_import

# Reuse the SparkContext created during job setup; a second SparkContext() call would fail
sc = SparkContext.getOrCreate()
java_import(sc._gateway.jvm,"java.sql.Connection")
java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm,"java.sql.DriverManager")
java_import(sc._gateway.jvm,"java.sql.SQLException")

data_frame = PostgreSQLtable_node1.toDF()
magento = data_frame.collect()
source_jdbc_conf = glueContext.extract_jdbc_conf('[name of the Glue target database connection you created]')
page = 0
conn = None  # keeps the finally block safe if getConnection raises
try:
    conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/[database name]',source_jdbc_conf.get('user'),source_jdbc_conf.get('password'))
    insertQry="""INSERT INTO dbo.[table name](id, updatetime, value) VALUES(?, ?, ?) ON CONFLICT (id) DO UPDATE
            SET updatetime = excluded.updatetime, value = excluded.value
            WHERE dbo.[table name].updatetime is distinct from excluded.updatetime;"""
    stmt = conn.prepareStatement(insertQry)
    conn.setAutoCommit(False)
    for r in magento:
        stmt.setBigDecimal(1, r.id)
        stmt.setTimestamp(2, r.updatetime)
        stmt.setBigDecimal(3, r.value)
        stmt.addBatch()
        page += 1
        # Send and commit every 1000 rows
        if page % 1000 == 0:
            stmt.executeBatch()
            conn.commit()
            page = 0
    # Flush the last, partially filled batch
    if page > 0:
        stmt.executeBatch()
        conn.commit()
finally:
    if conn:
        conn.close()
job.commit()

Key points:

The code above is the PostgreSQL approach; Oracle uses MERGE, and SQL Server uses a similar insert-or-update syntax.

The native Spark Java packages used here can serve as an alternative to "psycopg2" without importing any new packages.

The drawback of "psycopg2" is that installing the package takes about 1 minute. For time-sensitive jobs, the native packages are recommended.
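
The upsert semantics can be tried locally without a Glue job. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL, assuming a bundled SQLite of version 3.24 or later (the first with `ON CONFLICT ... DO UPDATE`). `IS NOT` is SQLite's null-safe counterpart of PostgreSQL's `IS DISTINCT FROM`; the table layout and sample rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE gluetest (id INTEGER PRIMARY KEY, updatetime TEXT, value INTEGER)")

UPSERT = """
INSERT INTO gluetest (id, updatetime, value) VALUES (?, ?, ?)
ON CONFLICT (id) DO UPDATE
    SET updatetime = excluded.updatetime, value = excluded.value
    WHERE gluetest.updatetime IS NOT excluded.updatetime
"""

def upsert(conn, rows, batch_size=1000):
    """Apply the rows in batches, committing once per batch."""
    for start in range(0, len(rows), batch_size):
        conn.executemany(UPSERT, rows[start:start + batch_size])
        conn.commit()

upsert(db, [(1, "2021-01-01", 10), (2, "2021-01-01", 20)])
upsert(db, [(2, "2021-02-01", 25), (3, "2021-02-01", 30)])  # updates id 2, inserts id 3
result = db.execute("SELECT id, updatetime, value FROM gluetest ORDER BY id").fetchall()
print(result)
```

The `WHERE` clause skips the update when the incoming row carries the same update time as the stored one, so unchanged rows are left alone.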

5. Local Glue debugging (supplementary)

Develop and test AWS Glue job scripts

Configure the container for use with Visual Studio Code

Prerequisites:

Install Visual Studio Code.
Install Python.
Install the Visual Studio Code Remote - Containers extension.
Open the workspace folder in Visual Studio Code.
Select Settings.
Select Workspace.
Select Open Settings (JSON).

Paste the following JSON and save it.

{
    "python.defaultInterpreterPath": "/usr/bin/python3",
    "python.analysis.extraPaths": [
        "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/"
    ]
}

Steps:

Run the Docker container.

docker run -it -v D:/Projects/AWS/Projects/Glue/.aws:/home/glue_user/.aws -v D:/Projects/AWS/Projects/Glue:/home/glue_user/workspace/ -e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

Start Visual Studio Code.
Select Remote Explorer in the left menu, then select amazon/aws-glue-libs:glue_libs_3.0.0_image_01.
Right-click and select Attach to Container. If a dialog appears, select Got it.
Open /home/glue_user/workspace/.
First run the following command in the VS Code terminal:
```
export AWS_REGION=cn-northwest-x
```
Create the Glue PySpark script, then choose Run.

You will see the script run successfully.