Build a real-time data lake on Amazon EMR

Foreword

When a company's business growth hits a bottleneck, business analysts and decision makers usually want to cross-analyze large volumes of business data and user behavior data to answer questions such as "Why is profit declining?" and "Why is inventory turnover slowing down?", and ultimately extract actionable insights that drive the business forward.

The Amazon cloud technology developer community provides developers around the world with development technology resources: technical documentation, development cases, technical columns, training videos, events and competitions, and more. It helps Chinese developers connect with the world's most cutting-edge technologies, ideas, and projects, and recommends outstanding Chinese developers and technologies to the global cloud community.

 

Databases are generally not good at large-scale analytical workloads, which is why data warehouses emerged. However, data warehouses usually require more powerful underlying infrastructure and therefore cost more. To control costs, companies tend to load only well-modeled, high-value data into them. Yet much of the data produced in day-to-day operations is hard to value in the short term: discarding it risks losing future opportunities, keeping all of it in a data warehouse is prohibitively expensive, and exploring the value of data sitting in cold storage is not easy.

This is where the data lake comes in. On the one hand, it is built on object storage such as Amazon Simple Storage Service (Amazon S3), so storage costs are low; on the other hand, it connects to a whole series of data processing tools, enabling fast data exploration and data mining and efficient "panning for gold in the sand".

Although a traditional data lake keeps storage costs low, its data freshness is also low. Traditional data lake solutions are usually built around T+1 offline jobs on Apache Spark: the data team kicks off Spark ETL and data modeling in the early hours of the morning so that, by the time the business arrives at work, it can use data as of midnight. As business models evolve, however, enterprises increasingly gain competitive advantage from real-time recommendation, real-time reconciliation, real-time alerting, and so on, all of which place very high demands on data freshness. The author once heard of an e-commerce company where an operator misplaced a decimal point, listing a product at one tenth of its real price, and customers rushed to place orders. Working from day-old sales reports, the company's analysts only discovered the problem hours into the next day, by which point the loss of assets and credibility was already irreparable. In the author's view, if real-time data technology had been adopted, with machine learning or a rule engine applied to the real-time data stream to monitor anomalies as they happen, the loss could easily have been avoided.

This article introduces a real-time data lake solution that helps enterprises make use of massive amounts of data at low cost and respond to business needs faster. With the help of Amazon cloud technology managed services, it can be implemented quickly and is easy to operate and maintain.

Solution Architecture

To keep the data in the whole architecture fresh, the access layer uses Apache Kafka to ingest the binlog of the business databases; for data such as user behavior, the event-tracking servers can call Kafka's producer API directly to push the data into Kafka.

In the data processing layer, we use Apache Flink + Apache Hudi for incremental consumption, and then Apache Spark + Apache Hudi for incremental ETL. The whole architecture keeps data latency at the second level.

With an Amazon Elastic MapReduce (Amazon EMR) cluster managed by Amazon Cloud Technology, services such as Flink and Spark are available out of the box. In addition, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to host the Kafka message queue, so that Kafka is also ready to use out of the box and can scale in and out according to the data volume.

The overall architecture is as follows:

[Architecture diagram]

Introduction to key services

Throughout the architecture we use Amazon EMR, Amazon MSK, Amazon S3, Apache Airflow, Apache Hudi, and other services or open source products. Below, the author briefly introduces each of them and explains why they make it easier to build the whole real-time data lake.

Amazon S3

Amazon S3 is purpose-built object storage for storing and retrieving any amount of data from any location. Amazon S3 can provide storage services for different users and scenarios, such as data lakes, websites, mobile applications, general data backup and recovery, and big data analysis. It's a simple storage service that delivers industry-leading durability, availability, performance, security, and virtually unlimited scalability at a fraction of the cost. Using Amazon S3, you can easily build applications that use native cloud storage. Amazon S3 is highly scalable and pay-as-you-go, so you can start small and expand storage as needed.

Amazon EMR


Amazon EMR is an industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using multiple open source frameworks (such as Apache Spark, Apache Hive, and Presto). With Amazon EMR you can focus on transforming and analyzing data instead of spending time and effort managing compute capacity or open source applications, while also saving money. Amazon EMR adopts an architecture that separates storage and compute: data is stored on Amazon S3, and compute resources come from Amazon Elastic Compute Cloud (Amazon EC2) instances. After the cluster is created, jobs read and write data on Amazon S3 through EMRFS, an implementation of the HDFS interface backed by S3. Amazon EMR defines three roles for the servers in a cluster:

  1. Master node - manages the cluster: it coordinates the distribution of MapReduce executables and subsets of raw data to the core and task instance groups, tracks the execution status of each task, and monitors the health of the instance groups. There is only one master node in a cluster; it maps to the Hadoop master node.
  2. Core nodes - run tasks and store data using the Hadoop Distributed File System (HDFS); they map to Hadoop slave nodes.
  3. Task nodes (optional) - run tasks only; they also map to Hadoop slave nodes.

Amazon MSK

Amazon MSK is a highly available, highly secure managed Kafka service from Amazon Cloud Technology. It is the foundation for message delivery in data analytics and therefore plays an important role in streaming data into the lake.

Apache Airflow

Apache Airflow is an open source project launched by Airbnb in 2014 to provide a solution for building and managing batch workflows across increasingly complex data management, scripting, and analysis tools. Functionally, it is a scalable, distributed workflow scheduling system that models workflows as directed acyclic graphs (DAGs), which simplifies creating, orchestrating, and monitoring each processing step in a data pipeline.

Apache Hudi

Apache Hudi is a platform that helps companies build streaming data lakes. The name Hudi comes from Hadoop Upserts and Incrementals; its main purpose is to efficiently reduce data latency during ingestion. Originally developed and open-sourced by Uber, it provides a table format together with libraries for Flink and Spark, and integrates easily with existing big data platforms.

Infrastructure

Except for Apache Airflow, the infrastructure used in this solution relies on managed services in the Amazon Cloud Technology China (Ningxia) region. Apache Airflow is installed with Docker in the same VPC as the other components.

Construction plan

Data ingestion

To reduce data latency, data is ingested as streams. Business data produced during enterprise operations is generally stored in a relational database such as MySQL; we can ingest it by feeding its binlog into Kafka, either through a binlog collection tool such as Maxwell or directly through Flink CDC. For user behavior data, the various event-tracking (instrumentation) schemes can write the behavior events directly into Kafka.
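For illustration only, here is a minimal sketch of the Flink CDC route, assuming the flink-sql-connector-mysql-cdc jar has been placed under /usr/lib/flink/lib/ and using placeholder connection details; the Maxwell-to-Kafka route used later in this article works just as well.

CREATE TABLE mysql_order_cdc (
  order_id BIGINT,
  user_mail STRING,
  status STRING,
  good_count BIGINT,
  city STRING,
  amount DECIMAL(10, 2),
  create_time STRING,
  update_time STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',          -- requires the flink-sql-connector-mysql-cdc jar
  'hostname' = 'xxxxx.rds.cn-northwest-1.amazonaws.com.cn',  -- placeholder endpoint
  'port' = '3306',
  'username' = 'flink_reader',        -- placeholder credentials
  'password' = 'xxxxxxxx',
  'database-name' = 'shop',           -- placeholder database and table names
  'table-name' = 'order_table'
);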

Use Apache Flink + Apache Hudi to build ODS layer table

After the data lands in Kafka, we can build the ODS table with Apache Flink + Apache Hudi. You may ask: how is an ODS table built this way different from the traditional one?

Because Hudi provides two primitives on top of distributed storage, upsert (insert or update) and incremental consumption, an ODS table built with Hudi naturally stays in sync with changes in the source data.

Below, I use Amazon EMR release 6.4 to demonstrate the process. Starting an EMR cluster is very simple and is not covered in detail here; please refer to the official Amazon Cloud Technology documentation.
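For reference, a cluster with the required applications can be created with a single CLI call roughly like the following sketch; the cluster name, key pair, subnet, and instance sizing are placeholders that you should adapt to your own environment.

aws emr create-cluster \
  --name "realtime-datalake" \
  --release-label emr-6.4.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Flink \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair,SubnetId=subnet-xxxxxxxx \
  --region cn-northwest-1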

  1. Log in to the Amazon EMR master node and download the hudi-flink-bundle jar to the /usr/lib/flink/lib/ directory so that Flink can use Hudi (download the dependency package that matches your Flink version from the Central Repository under org/apache/hudi)
  2. Start a Flink session
checkpoints=s3://xxxxxxxx/flink/checkpoints/

flink-yarn-session -jm 1024 -tm 4096 -s 2 \
  -D state.backend=rocksdb \
  -D state.checkpoint-storage=filesystem \
  -D state.checkpoints.dir=${checkpoints} \
  -D execution.checkpointing.interval=60000 \
  -D state.checkpoints.num-retained=5 \
  -D execution.checkpointing.mode=EXACTLY_ONCE \
  -D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \
  -D state.backend.incremental=true \
  -D execution.checkpointing.max-concurrent-checkpoints=1 \
  -D rest.flamegraph.enabled=true \
  -d \
  -t /etc/hive/conf/hive-site.xml
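The YARN application id of this session is printed when the session starts; if you need to look it up again for the next step, you can list the running YARN applications, for example:

yarn application -list -appStates RUNNING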
  3. Start the Flink SQL client, where {application id} is the application id of the Flink session
/usr/lib/flink/bin/sql-client.sh -s {application id}
  4. In the Flink SQL client, create a Kafka streaming table. Here is an example for an order table
CREATE TABLE kafka_order (
  order_id BIGINT,
  user_mail STRING,
  status STRING, 
  good_count BIGINT,
  city STRING,
  amount DECIMAL(10, 2),
  create_time STRING,
  update_time STRING
) WITH (
 'connector' = 'kafka',
 'topic' = 'order_table',
 'properties.bootstrap.servers' = 'xxxxxxx.cn-northwest-1.amazonaws.com.cn:9092',
 'properties.group.id' = 'testGroup1',
 'format' = 'maxwell-json'
);

Note that the connector in the DDL above is set to kafka, so you also need to download the Flink Kafka connector jar to the /usr/lib/flink/lib/ directory.
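As a hedged example, on EMR 6.4 (Flink 1.13.1) the SQL Kafka connector can be fetched from Maven Central roughly as follows; adjust the version and Scala suffix to match your Flink build.

sudo wget -P /usr/lib/flink/lib/ \
  https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.12/1.13.1/flink-sql-connector-kafka_2.12-1.13.1.jar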

  5. Create a Hudi table using the Hudi connector.
CREATE TABLE flink_hudi_order_ods (
  order_id BIGINT,
  user_mail STRING,
  status STRING,
  good_count BIGINT,
  city STRING,
  amount DECIMAL(10, 2),
  create_time STRING,
  update_time STRING,
  ts TIMESTAMP(3),
  logday VARCHAR(255),
  hh VARCHAR(255)
) PARTITIONED BY (`logday`, `hh`)
WITH (
  'connector' = 'hudi',
  'path' = 's3://xxxxx/flink/flink_hudi_order_ods/',
  'table.type' = 'COPY_ON_WRITE',
  'write.precombine.field' = 'ts',
  'write.operation' = 'upsert',
  'hoodie.datasource.write.recordkey.field' = 'order_id',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 'flink_hudi_order_ods',
  'hive_sync.mode' = 'HMS',
  'hive_sync.use_jdbc' = 'false',
  'hive_sync.username' = 'hadoop',
  'hive_sync.partition_fields' = 'logday,hh',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
);
  6. Insert streaming data into the Hudi table we just created
insert into flink_hudi_order_ods
select *,
  CURRENT_TIMESTAMP as ts,
  DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyy-MM-dd') as logday,
  DATE_FORMAT(CURRENT_TIMESTAMP, 'HH') as hh
from kafka_order;
  7. Verify. After a minute or two, you should be able to see the flink_hudi_order_ods table through the Hive or Glue catalog, and you can query it with Hive or Amazon Athena. You can also verify data updates: update a record in the source MySQL database and, after ten seconds or so, query the flink_hudi_order_ods table in the data lake through Hive; you should find that the update has already been synchronized into the data lake.
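For example, a simple check in Hive or Athena (assuming the table was synced to the default database) could be:

SELECT logday, hh, COUNT(*) AS order_cnt
FROM flink_hudi_order_ods
GROUP BY logday, hh
ORDER BY logday, hh;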

Incremental Data ETL with Apache Spark + Apache Hudi

Next, we will demonstrate how to use Apache Spark + Apache Hudi to perform incremental ETL on this data.

  1. First, download the hudi-spark-bundle package that matches your Spark version to the Amazon EMR master node

  2. When writing ETL tasks, you need to change the output format to hudi

// Assumes the hudi-spark-bundle is on the classpath; df, targetTbName and basePath
// come from the surrounding ETL job.
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils._
import org.apache.spark.sql.SaveMode

df.write
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD.key(), "logday")       // field used to deduplicate records
  .option(RECORDKEY_FIELD.key(), "logday")        // record key of the target table
  .option(PARTITIONPATH_FIELD.key(), "logday")    // partition field
  .option(OPERATION.key(), "upsert")
  .option("hoodie.table.name", targetTbName)
  // Sync the resulting table to the Hive metastore (Glue catalog on EMR)
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", targetTbName)
  .option("hoodie.datasource.hive_sync.mode", "HMS")
  .option("hoodie.datasource.hive_sync.use_jdbc", "false")
  .option("hoodie.datasource.hive_sync.username", "hadoop")
  .option("hoodie.datasource.hive_sync.partition_fields", "logday")
  .option("hoodie.datasource.hive_sync.partition_extractor_class",
    "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(basePath)
  3. Since we will schedule the job through Airflow, your Airflow DAG should submit the job roughly as follows
spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --class com.xxxx.xxxx.Demo \
    --jars {app_dir}/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,{app_dir}/spark-avro_2.12-3.1.2.jar \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.dynamicAllocation.enabled=false' \
    {app_dir}/spark-scala-examples-1.0-SNAPSHOT.jar
  4. You can also query the Hudi table directly with spark-shell; the following shows how to start spark-shell with Hudi integrated
spark-shell --jars ./hudi-spark3.1.2-bundle_2.12-0.10.1.jar,spark-avro_2.12-3.1.2.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.dynamicAllocation.enabled=false'
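Inside such a spark-shell session, an incremental query against the ODS table written earlier might look like the following sketch; the S3 path and begin instant time are placeholders, and the option keys are the standard Hudi Spark read options.

// `spark` is already available in spark-shell; shown here for completeness.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

// Read only the records committed after the given instant time (yyyyMMddHHmmss).
val incDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20220101000000")  // placeholder instant
  .load("s3://xxxxx/flink/flink_hudi_order_ods/")

incDf.createOrReplaceTempView("order_ods_incremental")
spark.sql("select count(*) from order_ods_incremental").show()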

Deploy Apache Airflow and schedule Spark tasks

Launch an Amazon EC2 instance in the same VPC as the Amazon EMR cluster; the operating system can be Amazon Linux or CentOS. Then install Apache Airflow following the steps below.

  1. Install Docker

  2. Install Docker Compose

  3. git clone the Apache Airflow tutorial repository (GitHub - Apache Airflow tutorial)

  4. cd into the airflow-tutorial directory

  5. Run the docker-compose up -d command

After the command in step 5 finishes, the Apache Airflow components are downloaded and installed. Once installation is complete, we can schedule Spark jobs by writing DAGs.

How do we schedule Spark jobs on Amazon EMR? We recommend the Airflow SSH hook, which is easy to configure and lets Airflow be deployed independently. The Airflow host must be able to reach your Amazon EMR cluster: if Airflow is deployed outside the Amazon EMR cluster, make sure the two networks can communicate and that Airflow can reach the Amazon EMR master node over SSH. Concretely, configure an SSH connection in Airflow with the Amazon EMR master node address, user account, and the path to the master node's SSH private key. One thing to watch out for: since we start Airflow with Docker, the SSH private key must be accessible from inside the container.
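As a hedged sketch (not the authors' exact DAG), a minimal Airflow DAG that submits the Spark job over SSH could look like the following; the connection id emr_master_ssh, the schedule, and the {app_dir} jar directory are placeholders, and it assumes the apache-airflow-providers-ssh package is installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# The spark-submit command from the previous section; {app_dir} is the directory holding the jars.
SPARK_SUBMIT_CMD = """
spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --class com.xxxx.xxxx.Demo \
    --jars {app_dir}/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,{app_dir}/spark-avro_2.12-3.1.2.jar \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.dynamicAllocation.enabled=false' \
    {app_dir}/spark-scala-examples-1.0-SNAPSHOT.jar
"""

with DAG(
    dag_id="hudi_incremental_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="*/10 * * * *",   # placeholder: run every 10 minutes
    catchup=False,
) as dag:
    # emr_master_ssh is an Airflow SSH connection pointing at the EMR master node,
    # configured with the hadoop user and the master node's private key.
    submit_etl = SSHOperator(
        task_id="submit_spark_etl",
        ssh_conn_id="emr_master_ssh",
        command=SPARK_SUBMIT_CMD,
    )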

Summary

This article has demonstrated how to integrate Flink, Spark, and other services with Hudi on an Amazon EMR cluster and, together with Airflow, Amazon MSK, and other services, implement a streaming data lake, effectively reducing the latency from data generation to data consumption.

With Amazon EMR and Amazon MSK, the operational overhead of foundational services such as Flink, Spark, and Kafka is eliminated; these services are available out of the box, so we only need to focus on building the data lake and processing the data on it.

The authors of this article


Xu Tingxin

Xiyun Data Solution Architect with 10+ years of experience in product development and solution consulting. He has rich hands-on experience in e-commerce, internet finance, and smart vehicles, and is good at using cloud computing, big data, AI, and other technologies to uncover users' underlying needs and achieve precise operations.


Cai Ruhai

Xiyun Data Solution Architect with 10+ years of development and architecture experience. He previously worked at a well-known multinational company, has rich experience in media, finance, and other business domains, is good at cloud computing, machine learning, and other technologies, and has extensive project management experience.

Article source: https://dev.amazoncloud.cn/column/article/6309c8990c9a20404da7914f?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=CSDN 
