Use Apache Flink to build a unified data lake on Amazon EMR

To build a data-driven enterprise, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and determine data schema, format, and location. The Amazon Glue Data Catalog provides a unified repository that lets disparate systems store and find the metadata needed to keep track of data in data silos.

Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications. It provides precise time and state management with fault tolerance. Flink can process bounded streams (batch) and unbounded streams (streaming) with a unified API or application. After data is processed with Apache Flink, downstream applications can access the curated data through a unified data catalog. With unified metadata, both data processing and data consuming applications can access the tables using the same metadata.
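For example, in Flink SQL the same query can run over bounded or unbounded input simply by switching the runtime mode; the orders table below is hypothetical:

```sql
-- Hypothetical example: the same aggregation can run over bounded or unbounded input.
SET 'execution.runtime-mode' = 'streaming';   -- or 'batch' for bounded execution

SELECT order_status, COUNT(*) AS order_count
FROM orders                                   -- hypothetical table registered in the current catalog
GROUP BY order_status;
```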

This post shows how to integrate Apache Flink on Amazon EMR with the Amazon Glue Data Catalog so that streaming data can be ingested in real time and accessed in near real time for business analysis.

Apache Flink connector and catalog architecture

Apache Flink uses connectors and catalogs to interact with data and metadata. The diagram below shows the Apache Flink connector architecture for reading/writing data and the catalog architecture for reading/writing metadata.

 

To read/write data, Flink provides the DynamicTableSourceFactory interface for read operations and the DynamicTableSinkFactory interface for write operations. A Flink connector implements these two interfaces to access data in a particular store. For example, the Flink FileSystem connector provides FileSystemTableFactory for reading/writing data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector provides HBase2DynamicTableFactory for reading/writing data in HBase, and the Flink Kafka connector provides KafkaDynamicTableFactory for reading/writing data in Kafka.
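For example, declaring a table with one of these connectors from Flink SQL looks roughly like the following minimal sketch; the table name, schema, and bucket are hypothetical:

```sql
-- Hypothetical table backed by the Flink FileSystem connector, writing Parquet files to Amazon S3.
CREATE TABLE customer_s3 (
  customer_id BIGINT,
  name        STRING,
  update_time TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path'      = 's3://DOC-EXAMPLE-BUCKET/curated/customer/',
  'format'    = 'parquet'
);
```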

For reading/writing metadata, Flink provides the Catalog interface, which has three built-in implementations. GenericInMemoryCatalog stores catalog data in memory. JdbcCatalog stores catalog data in a relational database accessed through JDBC; as of this writing, it supports MySQL and PostgreSQL. HiveCatalog stores catalog data in a Hive Metastore and uses HiveShim to provide compatibility with different Hive versions. The metastore client can be configured to use either Hive Metastore or the Amazon Glue Data Catalog.
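The following is a minimal Flink SQL sketch that registers a HiveCatalog; the Hive configuration directory shown is an assumed Amazon EMR location, and whether metadata lands in Hive Metastore or the Glue Data Catalog depends on the metastore client configured there:

```sql
-- Minimal sketch: register a HiveCatalog. The hive-site.xml found under hive-conf-dir
-- determines whether metadata is stored in a Hive Metastore or in the Glue Data Catalog.
CREATE CATALOG glue_catalog WITH (
  'type'             = 'hive',
  'default-database' = 'default',
  'hive-conf-dir'    = '/etc/hive/conf'   -- assumed location of hive-site.xml on Amazon EMR
);

USE CATALOG glue_catalog;
```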

Most of Flink's built-in connectors (such as Kafka, Amazon Kinesis, Amazon DynamoDB, Elasticsearch, or FileSystem) can use Flink's HiveCatalog to store metadata in the Amazon Glue Data Catalog. However, some connector implementations (such as Apache Iceberg) have their own catalog management mechanism. FlinkCatalog in Iceberg implements the Catalog interface in Flink and wraps Iceberg's own catalog implementations. The diagram below shows the relationship between Apache Flink, the Iceberg connector, and catalogs.

 

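As a sketch of this wrapping, an Iceberg catalog backed by the Glue Data Catalog can be registered from Flink SQL roughly as follows; the catalog name and warehouse path are hypothetical:

```sql
-- Minimal sketch: Iceberg's FlinkCatalog delegating metadata operations to the Glue Data Catalog.
CREATE CATALOG glue_iceberg_catalog WITH (
  'type'         = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
  'io-impl'      = 'org.apache.iceberg.aws.s3.S3FileIO',
  'warehouse'    = 's3://DOC-EXAMPLE-BUCKET/iceberg/'   -- hypothetical warehouse location
);
```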
Apache Hudi also has its own catalog management. Both HoodieCatalog and HoodieHiveCatalog implement the Catalog interface in Flink. HoodieCatalog stores metadata in a file system such as HDFS. HoodieHiveCatalog stores metadata in either the Hive Metastore or the Amazon Glue Data Catalog, depending on whether hive.metastore.client.factory.class is configured to use com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory. The following diagram shows the relationship between Apache Flink, the Hudi connector, and catalogs.

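A corresponding minimal sketch for Hudi, assuming hms mode so that metadata flows through the configured Hive/Glue metastore client; the catalog name and paths are hypothetical:

```sql
-- Minimal sketch: Hudi catalog in hms mode; metadata goes through the configured metastore client.
CREATE CATALOG glue_hudi_catalog WITH (
  'type'          = 'hudi',
  'mode'          = 'hms',                           -- 'dfs' would use the file-system-based HoodieCatalog
  'catalog.path'  = 's3://DOC-EXAMPLE-BUCKET/hudi/',
  'hive.conf.dir' = '/etc/hive/conf'                 -- assumed location of hive-site.xml on Amazon EMR
);
```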
Solution overview

The following diagram shows the overall architecture of the solution described in this article.

 

 

In this solution, the Amazon RDS for MySQL binlog is enabled to capture transaction changes in real time. The Flink CDC connector on Amazon EMR reads the binlog data and processes it, and the transformed data can be stored in Amazon S3. The Amazon Glue Data Catalog stores the metadata, such as table schema and table location. Downstream data consumer applications, such as Amazon Athena or Trino on Amazon EMR, access the data for business analysis.

Here are the general steps for setting up this solution:

  1. Enable binlog for Amazon RDS for MySQL and initialize the database.
  2. Create an EMR cluster using Amazon Glue Data Catalog.
  3. Use the Apache Flink CDC connector on Amazon EMR to extract change data capture (CDC) data (see the Flink SQL sketch after this list).
  4. Store processed data in Amazon S3 and metadata in Amazon Glue Data Catalog.
  5. Confirm that all table metadata is stored in the Amazon Glue Data Catalog.
  6. Use data for business analysis via Athena or Amazon EMR Trino.
  7. Update and delete source records in Amazon RDS for MySQL and verify that corresponding changes occur in the data lake tables.

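To illustrate steps 3 and 4, here is a hedged Flink SQL sketch: a MySQL CDC source table over a hypothetical orders table, and a continuous INSERT that writes the change stream into an Iceberg table registered in the Glue-backed catalog sketched earlier. Host names, credentials, database, and schema are placeholders:

```sql
-- Hypothetical CDC source table reading the RDS for MySQL binlog through the Flink CDC connector.
CREATE TABLE orders_src (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_status STRING,
  order_total  DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = '<rds-endpoint>',
  'port'          = '3306',
  'username'      = '<user>',
  'password'      = '<password>',
  'database-name' = 'salesdb',
  'table-name'    = 'orders'
);

-- Hypothetical Iceberg table whose metadata lives in the Glue Data Catalog
-- (format v2 with upsert enabled so CDC updates and deletes can be applied).
CREATE TABLE IF NOT EXISTS glue_iceberg_catalog.salesdb.orders_iceberg (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_status STRING,
  order_total  DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'format-version'       = '2',
  'write.upsert.enabled' = 'true'
);

-- Continuously apply inserts, updates, and deletes from the CDC stream to the data lake table.
INSERT INTO glue_iceberg_catalog.salesdb.orders_iceberg SELECT * FROM orders_src;
```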
Using Amazon Glue Data Catalog

Create an EMR cluster

Starting with Amazon EMR 6.9.0, the Flink Table API/SQL can integrate with the Amazon Glue Data Catalog. To use Flink's integration with Amazon Glue, you must create a cluster with Amazon EMR 6.9.0 or a later release.

  1. Create the file iceberg.properties to integrate Trino on Amazon EMR with the Data Catalog. When the table format is Iceberg, your file should contain the following content:

 
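A minimal sketch of such a file, assuming Trino's Iceberg connector uses the Glue Data Catalog as its metastore (your setup may require additional properties):

```properties
connector.name=iceberg
iceberg.catalog.type=glue
```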

2. Upload iceberg.properties to an S3 bucket, such as DOC-EXAMPLE-BUCKET.

3. Create the trino-glue-catalog-setup.sh file to configure the integration of Trino with the Data Catalog. Use trino-glue-catalog-setup.sh as the bootstrap action script.
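A rough sketch of such a bootstrap script follows; the destination directory is an assumption about where Trino on Amazon EMR reads catalog properties, so verify it for your EMR release:

```bash
#!/bin/bash
# trino-glue-catalog-setup.sh (sketch): copy the Iceberg catalog properties to where Trino reads them.
# Assumption: Trino on Amazon EMR loads catalog properties from /etc/trino/conf/catalog;
# verify this path for your EMR release.
set -euo pipefail

sudo mkdir -p /etc/trino/conf/catalog
sudo aws s3 cp s3://DOC-EXAMPLE-BUCKET/iceberg.properties /etc/trino/conf/catalog/iceberg.properties
```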

4. Upload trino-glue-catalog-setup.sh to the S3 bucket (DOC-EXAMPLE-BUCKET).

5. Create the flink-glue-catalog-setup.sh file to configure the integration of Flink with the Data Catalog.

6. Use the script runner to run the flink-glue-catalog-setup.sh script as an EMR step.

7. Upload flink-glue-catalog-setup.sh to the S3 bucket (DOC-EXAMPLE-BUCKET).

8. Create an EMR 6.9.0 cluster with the Hive, Flink, and Trino applications.

You can create an EMR cluster using the Amazon command line interface (Amazon CLI) or the Amazon Management Console.
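For reference, a hedged sketch of such a create-cluster call with the applications, Glue configuration, bootstrap action, and script-runner step described above; the key pair, subnet, region, and bucket values are placeholders you would replace:

```bash
# Sketch of creating the EMR 6.9.0 cluster from the Amazon CLI; placeholders in angle brackets
# must be replaced, and DOC-EXAMPLE-BUCKET is the bucket used in the earlier steps.
aws emr create-cluster \
  --name "flink-glue-unified-datalake" \
  --release-label emr-6.9.0 \
  --applications Name=Hive Name=Flink Name=Trino \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes 'KeyName=<your-key-pair>,SubnetId=<your-subnet-id>' \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
  --bootstrap-actions 'Path=s3://DOC-EXAMPLE-BUCKET/trino-glue-catalog-setup.sh,Name=TrinoGlueSetup' \
  --steps 'Type=CUSTOM_JAR,Name=FlinkGlueSetup,ActionOnFailure=CONTINUE,Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://DOC-EXAMPLE-BUCKET/flink-glue-catalog-setup.sh]'
```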

Summary

This post shows how to integrate Apache Flink on Amazon EMR with the Amazon Glue Data Catalog. You can use Flink SQL connectors to read/write data in different stores, such as Kafka, CDC sources, HBase, Amazon S3, Iceberg, or Hudi, and store the metadata in the Data Catalog. The Flink Table API uses the same connector and catalog implementation mechanism. Within a single session, you can register multiple catalog instances of different types (such as IcebergCatalog and HiveCatalog) and then use them interchangeably in queries, as sketched below. You can also write code with the Flink Table API to build the same solution integrating Flink and the Data Catalog. With the unified batch and streaming processing of Flink on Amazon EMR, data can be extracted and processed with a single compute engine. By integrating Apache Iceberg and Hudi on Amazon EMR, you can build an evolvable and scalable data lake. With the Amazon Glue Data Catalog, you can manage all enterprise data catalogs in a unified way and easily consume the data.
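For example, a single hypothetical query can join tables from two different catalogs by using fully qualified names (the customer table here is assumed to exist in the Hive/Glue-backed catalog):

```sql
-- Hypothetical cross-catalog query: fully qualified names let one query span catalogs in a session.
SELECT o.order_id, c.name
FROM glue_iceberg_catalog.salesdb.orders_iceberg AS o
JOIN glue_catalog.salesdb.customer AS c
  ON o.customer_id = c.customer_id;
```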

Translated from: Build a unified data lake on Amazon EMR using Apache Flink (https://aws.amazon.com/cn/blogs/china/build-a-unified-data-lake-with-apache-flink-on-amazon-emr/)
