Use Amazon EMR and Apache Hudi to insert, update, and delete data in S3

Storing your data in Amazon S3 provides lots of benefits in terms of scale, reliability, and cost effectiveness. On top of that, you can use Amazon EMR to process and analyze your data with open source tools like Apache Spark, Hive, and Presto. As powerful as these tools are, it can still be challenging to deal with use cases that need incremental data processing and record-level insert, update, and delete operations.

Talking with customers, we found that there are use cases that need to handle incremental changes to individual records, for example:

  • Complying with data privacy regulations, where users choose to exercise their right to be forgotten, or change their consent as to how their data can be used.
  • Working with streaming data, when you have to handle specific data insertion and update events.
  • Implementing change data capture (CDC) to track and ingest database change logs from enterprise data warehouses or operational data stores.
  • Reinstating late-arriving data, or analyzing data as of a specific point in time.

Starting today, EMR release 5.28.0 includes Apache Hudi (incubating), so you no longer need to build a custom solution to perform record-level insert, update, and delete operations. Hudi development started at Uber in 2016 to address inefficiencies across data ingestion and ETL pipelines. In recent months the EMR team has worked closely with the Apache Hudi community, contributing patches that include updating Hudi to Spark 2.4.4, supporting Spark Avro, adding support for the AWS Glue Data Catalog, and multiple bug fixes.

Using Hudi, you can perform record-level inserts, updates, and deletes on S3, allowing you to comply with data privacy laws, consume real-time streams and change data captures, reinstate late-arriving data, and track history and rollbacks in an open, vendor-neutral format. You create datasets and tables, and Hudi manages the underlying data format. Hudi uses Apache Parquet and Apache Avro for data storage, and includes built-in integrations with Spark, Hive, and Presto, so you can query Hudi datasets with the same tools you use today, with near-real-time access to fresh data.

When you launch an EMR cluster, the Hudi libraries and tools are installed and configured automatically any time you select at least one of the following components: Hive, Spark, or Presto. You can use Spark to create new Hudi datasets, and to insert, update, and delete data. Each Hudi dataset is registered in your cluster's configured metastore (including the AWS Glue Data Catalog), and appears as a table that can be queried using Spark, Hive, and Presto.

Hudi supports two storage types that define how data is written, indexed, and read from S3:

  • Copy on Write – data is stored in columnar format (Parquet), and updates create a new version of the files during a write. This storage type is best used for read-heavy workloads, because the latest version of the dataset is always available in efficient columnar files.

  • Merge on Read – data is stored with a combination of columnar (Parquet) and row-based (Avro) formats; updates are logged to row-based delta files and compacted later, creating a new version of the columnar files. This storage type is best used for write-heavy workloads, because new commits are written quickly as delta files, but reading the dataset requires merging the compacted columnar files with the delta files. How you select between the two is shown in the short sketch after this list.
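The storage type is chosen with a single write option on Hudi's Spark data source; Copy on Write is the default when the option is omitted. A minimal sketch, using the DataSourceWriteOptions class that also appears in the full example later in this post:

import org.apache.hudi.DataSourceWriteOptions

// Copy on Write is the default storage type when this option is omitted
val cowStorage = Map(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE")

// Merge on Read stores updates as row-based delta files that are compacted later
val morStorage = Map(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ")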

Let's take a quick look at how you can set up and use Hudi datasets in an EMR cluster.

Using Apache Hudi with Amazon EMR

I start creating a cluster from the EMR console. In the advanced options, I select EMR release 5.28.0 (the first to include Hudi) and the following applications: Spark, Hive, and Tez. In the hardware options, I add 3 task nodes to ensure I have enough capacity to run both Spark and Hive.

When the cluster is ready, I use the key pair I selected in the security options to SSH into the master node and access the Spark shell. I use the following command to start the Spark shell to use it together with Hudi:

$ spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
              --conf "spark.sql.hive.convertMetastoreParquet=false" \
              --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

The following sample Scala code imports some ELB logs into a Hudi dataset using the Copy on Write storage type:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/"
val hudiTableName = "elb_logs_hudi_cow"
val hudiTablePath = "s3://MY-BUCKET/PATH/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "request_ip",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "request_verb", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "request_timestamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "request_verb", 
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiOptions)
       .mode(SaveMode.Overwrite)
       .save(hudiTablePath)

In the Spark shell, I can now count the records that were written to the Hudi dataset:

scala> inputDF.count()
res1: Long = 10491958

In the options, I used the integration with the Hive metastore configured for the cluster, so that the table is created in the default database. In this way, I can use Hive to query the data in the Hudi dataset:

hive> use default;
hive> select count(*) from elb_logs_hudi_cow;
...
OK
10491958

Now, I want to update or delete a single record in the dataset. In the Spark shell, I set up a few variables to find the record I want to update, and prepare a SQL statement to select the value of the column I want to change:

val requestIpToUpdate = "243.80.62.181"
val sqlStatement = s"SELECT elb_name FROM elb_logs_hudi_cow WHERE request_ip = '$requestIpToUpdate'"

I execute the SQL statement to see the current value of the column:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_003|
+------------+

Then, I select and update the record:

// Create a DataFrame with a single record and update column value
val updateDF = inputDF.filter(col("request_ip") === requestIpToUpdate)
                      .withColumn("elb_name", lit("elb_demo_001"))

Now, I update the dataset with a syntax similar to the one I used to create it. But this time, the DataFrame I am writing contains only one record, and I use the upsert operation:

// Write the DataFrame as an update to existing Hudi dataset
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark shell, I check the result of the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Now I want to delete the same record. To delete it, I pass the EmptyHoodieRecordPayload payload in the write options:

// Write the DataFrame with an EmptyHoodieRecordPayload for deleting a record
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
                "org.apache.hudi.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark shell, I see that the record is no longer available:

scala> spark.sql(sqlStatement).show()
+--------+                                                                      
|elb_name|
+--------+
+--------+

How are all those updates and deletes managed by Hudi? I connect to the dataset through the Hudi command-line interface (CLI), and see these changes interpreted as commits.

The commit timeline shows that this dataset uses the Copy on Write storage type, which means that each time a record is updated, the file that contains it is rewritten to include the updated values. You can also see how many records were written by each commit: the bottom row of the commit table describes the initial creation of the dataset, above it is the single-record update, and at the top is the single-record delete.
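From the Spark shell, you can also peek at the metadata columns that Hudi stores with every record, for example to see which commit last wrote each record that is still in the dataset. This is a minimal sketch using the table name from the example above, not a replacement for the full commit timeline shown by the CLI:

// Every Hudi record carries metadata columns such as _hoodie_commit_time;
// grouping by it shows which commit last rewrote each surviving record
spark.sql("SELECT _hoodie_commit_time, count(*) FROM elb_logs_hudi_cow GROUP BY _hoodie_commit_time ORDER BY _hoodie_commit_time").show()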

Using Hudi, you can roll back to each commit. For example, I roll back the delete operation with:

hudi:elb_logs_hudi_cow->commit rollback --commit 20191104121031

In the Spark shell, the record is now back to where it was just after the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Copy on Write is the default storage type. By adding the following to our hudiOptions, we can repeat the steps above to create and update a Merge on Read dataset instead:

DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ"
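For example, this is a minimal sketch of writing the same input DataFrame as a new Merge on Read dataset; the elb_logs_hudi_mor table name and its S3 path are hypothetical names chosen for this sketch:

// Reuse the Copy on Write options, overriding the storage type and the table names
// (elb_logs_hudi_mor and its path are hypothetical names for this example)
val hudiMorOptions = hudiOptions ++ Map(
    DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ",
    HoodieWriteConfig.TABLE_NAME -> "elb_logs_hudi_mor",
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "elb_logs_hudi_mor")

// Write the input DataFrame as a new Merge on Read dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiMorOptions)
       .mode(SaveMode.Overwrite)
       .save("s3://MY-BUCKET/PATH/elb_logs_hudi_mor")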

If you update a Merge on Read dataset and look at the commits with the Hudi CLI, you can see how different Merge on Read is compared to Copy on Write. With Merge on Read, you write only the updated rows, not the whole file as with Copy on Write. This is why Merge on Read is helpful for use cases that require more writes, or update/delete-heavy workloads with fewer reads. Delta commits are written to disk as Avro records (row-based storage), while compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi automatically compacts the dataset so that reads are as performant as possible.

When you create a Merge on Read dataset, two Hive tables are created:

  • A first table with the same name as the dataset.
  • A second table with the characters _rt appended to its name; the _rt postfix stands for real-time.

When queried, the first table returns the data that has been compacted, and does not show the latest delta commits. Using this table provides the best performance, but omits the freshest data. Querying the real-time table merges the compacted data with the delta commits at read time, hence this dataset being called Merge on Read. This gives access to the latest data, but introduces a performance overhead, so it is not as performant as querying the compacted data. In this way, data engineers and analysts have the flexibility to choose between performance and data freshness.
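Keeping with the hypothetical elb_logs_hudi_mor table names from the sketch above, comparing the two query paths from the Spark shell would look like this:

// Read-optimized query: returns only compacted columnar data, best query performance
spark.sql("SELECT count(*) FROM elb_logs_hudi_mor").show()

// Real-time query: merges the delta files at read time to return the latest data
spark.sql("SELECT count(*) FROM elb_logs_hudi_mor_rt").show()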

Now available

This new feature is available now in all regions with EMR release 5.28.0. There is no additional cost for using Hudi with EMR. You can learn more about Hudi in the EMR documentation. This new tool simplifies the way you process, update, and delete data in S3. Let us know which use cases you are going to put it to work on!
