Using AWS Glue for ETL work

Data lake

A data lake stores large volumes of raw data of all kinds; the data is then accessed, processed, and analyzed as needed. For storage, the common open-source choice is HDFS, and the major cloud vendors also offer their own storage services, such as Amazon S3 and Azure Blob Storage.

Since a data lake holds raw data, the data generally needs ETL (Extract-Transform-Load) before it is useful. For large data sets, Spark (and PySpark) are the most commonly used frameworks. Once ETL is done, the cleaned data is written back to the storage system (e.g., HDFS or S3), and data analysts or machine learning engineers can analyze it or train models on it. Another very important question in this process is: how do we manage the metadata of the data?
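To make the ETL step concrete, here is a minimal sketch of such a job in Spark (Scala). The bucket paths and selected columns are placeholders, not part of the walkthrough below.

import org.apache.spark.sql.SparkSession

object SimpleEtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("simple-etl-sketch").getOrCreate()

    // Extract: read raw, headerless CSV files from the data lake (path is a placeholder)
    val raw = spark.read.option("header", "false").csv("s3://my-bucket/raw/")

    // Transform: keep only the columns needed downstream (Spark names headerless columns _c0, _c1, ...)
    val cleaned = raw.select("_c0", "_c1", "_c2")

    // Load: write the cleaned data back to the lake in a columnar format
    cleaned.write.mode("overwrite").parquet("s3://my-bucket/cleaned/")

    spark.stop()
  }
}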

On AWS, the Glue service provides not only ETL but also metadata management (the Glue Data Catalog). Below we use S3 + Glue + EMR to walk through a simple data lake + ETL + data analysis flow.

 

Prepare data

This example uses the GDELT data set, available at:

https://registry.opendata.aws/gdelt/

In this data set, each file name contains the date of the file. As the raw data, we first put a 2015 file under a year=2015 S3 prefix:

aws s3 cp s3://xxx/data/20151231.export.csv s3://xxxx/gdelt/year=2015/20151231.export.csv

 

Use Glue to crawl data definitions

Create a crawler in Glue to infer the data format of this file. The data source path is specified as s3://xxxx/gdelt/.

 

For details on this feature, refer to the official AWS documentation:

https://docs.aws.amazon.com/zh_cn/glue/latest/dg/console-crawlers.html
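In this walkthrough the crawler is created in the Glue console, following the documentation linked above. For reference, the same crawler could be created and started programmatically with the AWS SDK for Java (v1) from Scala, roughly as sketched below; the crawler name and IAM role are placeholders.

import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{CrawlerTargets, CreateCrawlerRequest, S3Target, StartCrawlerRequest}
import scala.collection.JavaConverters._

object CreateGdeltCrawler {
  def main(args: Array[String]): Unit = {
    val glue = AWSGlueClientBuilder.defaultClient()

    // Crawl the raw GDELT prefix into the "default" database of the Glue Data Catalog
    val targets = new CrawlerTargets()
      .withS3Targets(List(new S3Target().withPath("s3://xxxx/gdelt/")).asJava)

    glue.createCrawler(new CreateCrawlerRequest()
      .withName("gdelt-raw-crawler")          // hypothetical crawler name
      .withRole("AWSGlueServiceRoleDefault")  // placeholder: an IAM role with Glue and S3 permissions
      .withDatabaseName("default")
      .withTargets(targets))

    glue.startCrawler(new StartCrawlerRequest().withName("gdelt-raw-crawler"))
  }
}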

 

After the crawler finishes, the newly created gdelt table appears in the Glue Data Catalog:

 

The raw data is in CSV format. Since there is no header row, the columns are named col0, col1, ..., col57. Because the directory structure under S3 is year=2015, the crawler automatically recognizes year as a partition column.

At this point, the metadata for the raw data is stored in Glue. Before doing ETL, we can use AWS EMR to verify that this metadata is usable.

 

AWS EMR

AWS EMR is the managed big data cluster service from AWS. It lets you start a cluster running common frameworks such as Hive, HBase, Presto, and Spark with one click.

Launch an EMR cluster with Hive and Spark selected, and configure them to use Glue as the metastore for their tables. After the cluster starts, log in to the master node and start Hive:

> show tables;

gdelt

Time taken: 0.154 seconds, Fetched: 1 row(s)
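Since Spark on the same cluster is also configured to use the Glue Data Catalog, the table can be checked from Spark as well. A minimal sketch, run for example from spark-shell on the master node (the query is illustrative):

import org.apache.spark.sql.SparkSession

// enableHiveSupport makes Spark SQL use the cluster's metastore, which EMR points at the Glue Data Catalog
val spark = SparkSession.builder().appName("glue-catalog-check").enableHiveSupport().getOrCreate()

spark.sql("show tables").show()
spark.sql("select count(*) from gdelt where year = '2015'").show()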

 

The table is visible in Hive; now execute a query:

> select * from gdelt where year=2015 limit 3;

OK

498318487       20060102        200601  2006    2006.0055       CVL     COMMUNITY                                               CVL                                                                                                   1       53      53      5       1       3.8     3       1       3       -2.42718446601942       1       United States   US      US      38.0    -97.0   US      0                               NULL  NULL            1       United States   US      US      38.0    -97.0   US      20151231        http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896       2015

498318488       20060102        200601  2006    2006.0055       CVL     COMMUNITY                                               CVL                     USA     UNITED STATES   USA                                                           1       51      51      5       1       3.4     3       1       3       -2.42718446601942       1       United States   US      US      38.0    -97.0   US      1       United States   US    US      38.0    -97.0   US      1       United States   US      US      38.0    -97.0   US      20151231        http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896       2015

498318489       20060102        200601  2006    2006.0055       CVL     COMMUNITY                                               CVL                     USA     UNITED STATES   USA                                                           1       53      53      5       1       3.8     3       1       3       -2.42718446601942       1       United States   US      US      38.0    -97.0   US      1       United States   US    US      38.0    -97.0   US      1       United States   US      US      38.0    -97.0   US      20151231        http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896       2015

 

The raw data has a large number of columns. Suppose we only need four of them for analysis: event ID, country code, date, and URL. Our next step, then, is ETL.
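Before writing the full Glue job, those four columns can be previewed interactively with the Spark session from the earlier sketch; the column positions follow the crawler's col0...col57 naming, and the aliases are illustrative:

spark.sql("""
  select col0 as event_id, col52 as country_code, col56 as event_date, col57 as url
  from gdelt
  where year = '2015'
  limit 5
""").show(truncate = false)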

 

Glue ETL

The Glue service also provides an ETL engine: you write a Spark-based script in Python or Scala and submit it to Glue ETL for execution. In this example we extract the col0, col52, col56, col57, and year columns and rename them, keep only the records whose country code is "UK", and finally write the result to the destination S3 path under a date=current_day partition, in Parquet format. The Glue programming interface can be used from Python or Scala; this article uses Scala:

import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import java.text.SimpleDateFormat
import java.util.Date

object Gdelt_etl {
  def main(sysArgs: Array[String]) {
      
    val sc: SparkContext = new SparkContext ()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark = glueContext.getSparkSession
    
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    // db and table
    val dbName = "default"
    val tblName = "gdelt"

    // s3 location for output
    val format = new SimpleDateFormat("yyyy-MM-dd")
    val curdate = format.format(new Date())
    val outputDir = "s3://xxx-xxx-xxx/cleaned-gdelt/date=" + curdate + "/"

    // Read data into DynamicFrame
    val raw_data = glueContext.getCatalogSource(database=dbName, tableName=tblName).getDynamicFrame()

    // Re-map and rename only the columns we need
    val cleanedDyF = raw_data.applyMapping(Seq(
      ("col0",  "long",   "EventID",     "string"),
      ("col52", "string", "CountryCode", "string"),
      ("col56", "long",   "Date",        "string"),
      ("col57", "string", "url",         "string"),
      ("year",  "string", "year",        "string")))

    // Spark SQL on a Spark DataFrame
    val cleanedDF = cleanedDyF.toDF()
    cleanedDF.createOrReplaceTempView("gdlttable")
    
    // Get Only UK data
    val only_uk_sqlDF = spark.sql("select * from gdlttable where CountryCode = 'UK'")
    
    val cleanedSQLDyF = DynamicFrame(only_uk_sqlDF, glueContext).withName("only_uk_sqlDF")        
    
    // Write it out in Parquet
    glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> outputDir)), format = "parquet").writeDynamicFrame(cleanedSQLDyF)    
    
    Job.commit()
  }
}

 

Save this script as gdelt.scala and submit it as a Glue ETL job.
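In the original walkthrough the job is created and run through the Glue console. For reference, roughly the same thing could be done with the AWS SDK for Java (v1) from Scala as sketched below; the job name, IAM role, and script location are placeholders, and the --class argument points Glue at the entry-point object defined in the script.

import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{CreateJobRequest, JobCommand, StartJobRunRequest}
import scala.collection.JavaConverters._

object SubmitGdeltEtl {
  def main(args: Array[String]): Unit = {
    val glue = AWSGlueClientBuilder.defaultClient()

    // Register the Scala script uploaded to S3 as a Glue ETL job
    glue.createJob(new CreateJobRequest()
      .withName("gdelt-etl")                  // hypothetical job name
      .withRole("AWSGlueServiceRoleDefault")  // placeholder IAM role
      .withCommand(new JobCommand()
        .withName("glueetl")
        .withScriptLocation("s3://xxx-xxx-xxx/scripts/gdelt.scala"))
      .withDefaultArguments(Map(
        "--job-language" -> "scala",
        "--class" -> "Gdelt_etl").asJava))

    // Start a run of the job
    glue.startJobRun(new StartJobRunRequest().withJobName("gdelt-etl"))
  }
}

After the run completes, we can see that the output files have been generated in S3: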

> aws s3 ls s3://xxxx-xxx-xxx/cleaned-gdelt/date=2020-04-12/

part-00000-d25201b8-2d9c-49a0-95c8-f5e8cbb52b5b-c000.snappy.parquet
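Before crawling the cleaned data, the Parquet output can also be sanity-checked directly from Spark on EMR; a minimal sketch (the bucket name and date are the placeholders used above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("check-cleaned-gdelt").getOrCreate()

// Read the Parquet files written by the Glue job and inspect the schema and a few rows
val cleaned = spark.read.parquet("s3://xxxx-xxx-xxx/cleaned-gdelt/date=2020-04-12/")
cleaned.printSchema()
cleaned.show(5, truncate = false)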

 

Then we run a new Glue crawler against this /cleaned-gdelt/ directory:

 

After it completes, a new table appears in Glue. Its structure is:

 

You can see that the input and output formats are both Parquet, the partition key is date, and the table contains only the columns we need.

Back in Hive on EMR, the new table is now visible:

hive> describe cleaned_gdelt;
OK
eventid                 string
countrycode             string
date                    string
url                     string
year                    string
date                    string

# Partition Information
# col_name              data_type               comment
date                    string

 

Query this table:

hive> select * from cleaned_gdelt limit 10;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
498318821       UK      20151231        http://wmpoweruser.com/microsoft-denies-lumia-950-xl-withdrawn-due-issues-says-stock-due-strong-demand/ 2015
498319466       UK      20151231        http://www.princegeorgecitizen.com/news/police-say-woman-man-mauled-by-2-dogs-in-home-in-british-columbia-1.2142296     2015
498319777       UK      20151231        http://www.catchnews.com/life-society-news/happy-women-do-not-live-any-longer-than-sad-women-1451420391.html    2015
498319915       UK      20151231        http://www.nationalinterest.org/feature/the-perils-eu-army-14770        2015
Time taken: 0.394 seconds, Fetched: 10 row(s)

 

As you can see, the CountryCode of every result is UK, which is exactly what we wanted.

 

Automation

Next we automate the Glue crawling + ETL process. Under Workflows in the Glue ETL section, create a workflow, as shown below:

 

As shown in the figure, the workflow proceeds as follows (a programmatic sketch follows the list):

  1. The workflow starts at 11:40 every night
  2. It triggers the gdelt crawler to crawl the metadata of the raw data
  3. It then triggers the gdelt ETL job
  4. Finally, it triggers the gdelt-cleaned crawler to crawl the metadata of the cleaned data
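In this walkthrough the workflow and its triggers are created in the Glue console. For reference, the workflow and its scheduled entry point could be created with the AWS SDK for Java (v1) from Scala, roughly as sketched below; the names and cron expression are placeholders (11:40 pm is assumed), and the conditional triggers that chain crawler, ETL job, and second crawler would be configured analogously.

import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{Action, CreateTriggerRequest, CreateWorkflowRequest}
import scala.collection.JavaConverters._

object CreateGdeltWorkflow {
  def main(args: Array[String]): Unit = {
    val glue = AWSGlueClientBuilder.defaultClient()

    // A workflow to group the crawler -> ETL job -> crawler chain
    glue.createWorkflow(new CreateWorkflowRequest().withName("gdelt-workflow"))

    // Scheduled trigger that starts the raw-data crawler every night (assumed 23:40 UTC)
    glue.createTrigger(new CreateTriggerRequest()
      .withName("gdelt-nightly-start")
      .withWorkflowName("gdelt-workflow")
      .withType("SCHEDULED")
      .withSchedule("cron(40 23 * * ? *)")
      .withStartOnCreation(true)
      .withActions(List(new Action().withCrawlerName("gdelt-raw-crawler")).asJava))
  }
}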

 

Now we add a new file to the raw data directory; this new file contains data for year=2016:

aws s3 cp s3://xxx-xxxx/data/20160101.export.csv s3://xxx-xxx-xxx/gdelt/year=2016/20160101.export.csv

 

Then execute this workflow.

While it runs, we can see that the ETL job is triggered as expected after raw_crawler_done:

 

After the job is completed, the 2016 data can be queried in Hive:

select * from cleaned_gdelt where year=2016 limit 10;
OK
498554334       UK      20160101        http://medicinehatnews.com/news/national-news/2015/12/31/support-overwhelming-for-bc-couple-mauled-by-dogs-on-christmas-day/    2016
498554336       UK      20160101        http://medicinehatnews.com/news/national-news/2015/12/31/support-overwhelming-for-bc-couple-mauled-by-dogs-on-christmas-day/    2016

 
