One-Click CDC Ingestion into the Data Lake: When Apache Hudi DeltaStreamer Meets Serverless Spark


Apache Hudi's DeltaStreamer is a utility class that ingests data in near real time and writes it to Hudi tables; it simplifies streaming data into the lake and storing it as Hudi tables. Since the 0.10.0 release, Hudi has added Debezium-based CDC data processing capabilities to DeltaStreamer, allowing the CDC data collected by Debezium to be written directly into Hudi tables. This feature greatly simplifies data integration from source business databases to the Hudi data lake. This article was published on the Apache Hudi official WeChat account; the original address is https://laurence.blog.csdn.net/article/details/132011197. Please indicate the source when reprinting.

On the other hand, thanks to the out-of-the-box, zero-maintenance experience, more and more cloud users are embracing serverless products. EMR on the Amazon cloud platform is a computing platform that integrates a variety of mainstream big data tools. Since the 6.6.0 release, EMR has offered a Serverless edition that provides a serverless Spark runtime environment, allowing users to submit Spark jobs easily without maintaining a Hadoop/Spark cluster.

One is the "full configuration" Hudi tool class, and the other is the "out-of-the-box" Spark operating environment. Combining the two, there is no need to write CDC processing code, no need to build a Spark cluster, and it can be easily implemented with only one command Entering CDC data into the lake is a very attractive technical solution. In this article, we will introduce the overall architecture and implementation details of this solution in detail.

1. Overall architecture

The DeltaStreamer CDC capability introduced by Apache Hudi in version 0.10.0 is the final link in the entire CDC data processing pipeline. To understand clearly the position and role DeltaStreamer plays in it, we need to look at the complete architecture:

[Figure: overall architecture of the CDC data pipeline, with components ① – ⑦ described below]

①: MySQL is a business database and the source of CDC data;

②: The system uses a CDC ingestion tool to read MySQL's binlog in real time. Mainstream CDC ingestion tools include Debezium, Maxwell, and Flink CDC; in this architecture, Kafka Connect with the Debezium MySQL Connector installed is selected;

③: More and more CDC data ingestion solutions are introducing a Schema Registry in order to better manage schema changes in the upstream business systems and achieve more controllable schema evolution. In the open-source community the mainstream product is the Confluent Schema Registry, and it is currently the only one supported by Hudi's DeltaStreamer, so it is also the one selected for this architecture. With the Schema Registry in place, when Kafka Connect captures a record it first checks whether the corresponding schema already exists in its local schema cache. If it does, it takes the schema ID directly from the cache; if not, it submits the schema to the Schema Registry, which registers it and returns the generated schema ID to Kafka Connect. Kafka Connect then encapsulates (serializes) the original CDC record based on the schema ID: it adds the schema ID to the message and, when the message is delivered in Avro format, strips the schema portion of the Avro message and keeps only the raw data. Since the schema is cached locally by the producer and consumer, or can be fetched from the Schema Registry once, there is no need to transmit it alongside the raw data, which greatly reduces the size of Avro messages and improves transmission efficiency. All of this work is done by the Avro converter provided by Confluent (io.confluent.connect.avro.AvroConverter); a reference sketch of the registry's REST API follows this list;

④: Kafka Connect delivers the encapsulated Avro messages to Kafka;

⑤: EMR Serverless provides a serverless Spark operating environment for DeltaStreamer;

⑥: Hudi's DeltaStreamer runs as a Spark job in the EMR Serverless environment. After reading an Avro message from Kafka, it uses the Avro deserializer provided by Confluent (io.confluent.kafka.serializers.KafkaAvroDeserializer) to parse the message and obtain the schema ID and the raw data. The deserializer likewise first looks up the schema in its local schema cache by that ID; if it is found, it deserializes the raw data with it, and if not, it requests the schema corresponding to the ID from the Schema Registry and then deserializes;

⑦: DeltaStreamer writes the parsed data into the Hudi table stored on S3. If the data table does not exist, it will automatically create the table and synchronize it to the Hive MetaStore.
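
As a concrete reference for the Schema Registry interactions described in ③ and ⑥, here is a minimal sketch (not a hands-on step) that inspects registered subjects and schemas through the registry's REST API. It assumes SCHEMA_REGISTRY_URL (exported in the global-variables step below) points at your registry and uses the subject name from the example later in this article; adjust both to your environment:

# Example (not a hands-on step): inspect the Confluent Schema Registry REST API
# List all registered subjects (Debezium topics typically map to "<topic>-value" subjects)
curl -s $SCHEMA_REGISTRY_URL/subjects
# Fetch the latest schema version of one subject, including its schema ID
curl -s $SCHEMA_REGISTRY_URL/subjects/osci.mysql-server-3.inventory.orders-value/versions/latest
# Fetch a schema directly by the ID embedded in each Avro message, e.g. ID 1
curl -s $SCHEMA_REGISTRY_URL/schemas/ids/1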

2. Environment preparation

Due to space limitations, this article will not introduce the construction of links ①, ②, ③, and ④. Readers can refer to the following documents to build a complete test environment by themselves:

①MySQL: If it is only for testing purposes, it is recommended to use the official Docker image provided by Debezium; refer to its official documentation for setup (the CDC data processed in the operation examples below comes from the inventory database shipped with that MySQL image);

②Kafka Connect: If it is only for testing purposes, it is recommended to use the official Docker image provided by Confluent (refer to its official documentation for setup), or to use the Kafka Connect service hosted on AWS: Amazon MSK Connect. Note that two plug-ins, the Debezium MySQL Connector and the Confluent Avro Converter, must be installed on Kafka Connect, so they need to be added manually on top of the official image (a reference sketch for registering the connector follows this list);

③Confluent Schema Registry: If it is only for testing purposes, it is recommended to use the official Docker image provided by Confluent; refer to its official documentation for setup;

④Kafka: If it is only for testing purposes, it is recommended to use the official Docker image provided by Confluent (refer to its official documentation for setup), or use the Kafka service hosted on AWS: Amazon MSK.
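
For reference, the following sketch (not a hands-on step) registers a Debezium MySQL connector with Avro converters through the Kafka Connect REST API. The host names, credentials, connector name, and server ID are placeholders, and the exact configuration keys depend on your Debezium version (the keys below follow Debezium 2.x; 1.x uses database.server.name and database.history.* instead of topic.prefix and schema.history.internal.*):

# Example (not a hands-on step): register a Debezium MySQL connector that produces Avro messages
curl -s -X POST -H "Content-Type: application/json" http://<your-kafka-connect-host>:8083/connectors -d '{
    "name": "mysql-inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "<your-mysql-host>",
        "database.port": "3306",
        "database.user": "<your-mysql-user>",
        "database.password": "<your-mysql-password>",
        "database.server.id": "184054",
        "topic.prefix": "osci.mysql-server-3",
        "database.include.list": "inventory",
        "schema.history.internal.kafka.bootstrap.servers": "<your-kafka-bootstrap-servers>",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "<your-schema-registry-url>",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "<your-schema-registry-url>"
    }
}'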

After completing the above work, we will have the addresses of the two dependent services, the Confluent Schema Registry and the Kafka bootstrap servers. They are prerequisites for starting the DeltaStreamer CDC job and will be passed to the job as parameters.

3. Configure global variables

With the environment ready, we can start on parts ⑤, ⑥, and ⑦. All operations in this article are performed from the command line and are provided to readers as shell scripts. The hands-on steps are numbered in the scripts; when a step offers two alternatives, they are marked with the letters a/b, and some steps also come with examples for reference. To make the scripts portable, the environment-specific dependencies and the configuration items that users need to customize are extracted and configured centrally as global variables. To run these operations in your own environment, you only need to modify the global variables below, not the individual commands:

Variable | Description | When set
APP_NAME | Name assigned by the user to this application | Set in advance
APP_S3_HOME | Dedicated S3 bucket set by the user for this application | Set in advance
APP_LOCAL_HOME | Local working directory set by the user for this application | Set in advance
SCHEMA_REGISTRY_URL | Address of the Confluent Schema Registry in the user's environment | Set in advance
KAFKA_BOOTSTRAP_SERVERS | Address of the Kafka bootstrap servers in the user's environment | Set in advance
EMR_SERVERLESS_APP_SUBNET_ID | ID of the subnet that the EMR Serverless Application to be created will belong to | Set in advance
EMR_SERVERLESS_APP_SECURITY_GROUP_ID | ID of the security group that the EMR Serverless Application to be created will belong to | Set in advance
EMR_SERVERLESS_APP_ID | ID of the EMR Serverless Application to be created | Generated during the process
EMR_SERVERLESS_EXECUTION_ROLE_ARN | ARN of the EMR Serverless Execution Role to be created | Generated during the process
EMR_SERVERLESS_JOB_RUN_ID | Job run ID returned after submitting an EMR Serverless job | Generated during the process

Next, we enter the hands-on stage. You need a Linux environment with the AWS CLI installed and user credentials configured (Amazon Linux 2 is recommended). After logging in via SSH, first install jq, the command-line JSON processor used by the subsequent scripts, with sudo yum -y install jq, and then export all of the global variables above (replace the corresponding values according to your AWS account and local environment):

# Hands-on step (1)
export APP_NAME='change-to-your-app-name'
export APP_S3_HOME='change-to-your-app-s3-home'
export APP_LOCAL_HOME='change-to-your-app-local-home'
export SCHEMA_REGISTRY_URL='change-to-your-schema-registry-url'
export KAFKA_BOOTSTRAP_SERVERS='change-to-your-kafka-bootstrap-servers'
export EMR_SERVERLESS_APP_SUBNET_ID='change-to-your-subnet-id'
export EMR_SERVERLESS_APP_SECURITY_GROUP_ID='change-to-your-security-group-id'

Here is an example:

# Example (not a hands-on step)
export APP_NAME='apache-hudi-delta-streamer'
export APP_S3_HOME='s3://apache-hudi-delta-streamer'
export APP_LOCAL_HOME='/home/ec2-user/apache-hudi-delta-streamer'
export SCHEMA_REGISTRY_URL='http://localhost:8081'
export KAFKA_BOOTSTRAP_SERVERS='localhost:9092'
export EMR_SERVERLESS_APP_SUBNET_ID='subnet-0a11afe6dbb4df759'
export EMR_SERVERLESS_APP_SECURITY_GROUP_ID='sg-071f18562f41b5804'

The remaining three variables, EMR_SERVERLESS_APP_ID, EMR_SERVERLESS_EXECUTION_ROLE_ARN, and EMR_SERVERLESS_JOB_RUN_ID, will be generated and exported in subsequent steps.

4. Create a dedicated working directory and bucket

As a best practice, we first create a dedicated local working directory (the path set in APP_LOCAL_HOME) and a dedicated S3 bucket (the bucket set in APP_S3_HOME) for this application (job). The application's scripts, configuration files, dependency packages, logs, and generated data are all kept in the dedicated directory and bucket, which makes them easy to maintain:

# Hands-on step (2)
mkdir -p $APP_LOCAL_HOME
aws s3 mb $APP_S3_HOME

5. Create EMR Serverless Execution Role

To run an EMR Serverless job, you need to configure an IAM Role that grants the job permission to access the relevant AWS resources. Our DeltaStreamer CDC job needs at least:

  • Read and write permissions for S3 dedicated buckets
  • Read and write access to the Glue Data Catalog
  • Read and write access to the Glue Schema Registry

You can create this Role manually by following the official EMR Serverless documentation and then export its ARN as a variable (replace the value according to your AWS account environment):

# Hands-on step (3/a)
export EMR_SERVERLESS_EXECUTION_ROLE_ARN='change-to-your-emr-serverless-execution-role-arn'

Here is an example:

# Example (not a hands-on step)
export EMR_SERVERLESS_EXECUTION_ROLE_ARN='arn:aws:iam::123456789000:role/EMR_SERVERLESS_ADMIN'

Considering that creating this Role manually is cumbersome, this article provides the following script, which creates a Role named EMR_SERVERLESS_ADMIN with administrator privileges in your AWS account so that you can complete this step quickly. (Note: since this Role carries the highest level of permissions, it should be used with caution; after finishing the quick verification, you should still configure a dedicated Execution Role with strictly limited permissions for production. A least-privilege sketch follows the script below.)

# Hands-on step (3/b)
EMR_SERVERLESS_EXECUTION_ROLE_NAME='EMR_SERVERLESS_ADMIN'
cat << EOF > $APP_LOCAL_HOME/assume-role-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EMRServerlessTrustPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
jq . $APP_LOCAL_HOME/assume-role-policy.json
export EMR_SERVERLESS_EXECUTION_ROLE_ARN=$(aws iam create-role \
    --no-paginate --no-cli-pager --output text \
    --role-name "$EMR_SERVERLESS_EXECUTION_ROLE_NAME" \
    --assume-role-policy-document file://$APP_LOCAL_HOME/assume-role-policy.json \
    --query Role.Arn)
aws iam attach-role-policy \
    --policy-arn "arn:aws:iam::aws:policy/AdministratorAccess" \
    --role-name "$EMR_SERVERLESS_EXECUTION_ROLE_NAME"

6. Create EMR Serverless Application

Before submitting a job to EMR Serverless, you need to create an EMR Serverless Application. This is an EMR Serverless concept that can be understood as a virtual EMR cluster. When creating an Application, you specify the EMR release, the network configuration, the cluster size, pre-initialized capacity, and other settings. Usually, the following command is all that is needed:

# Example (not a hands-on step)
aws emr-serverless create-application \
    --name "$APP_NAME" \
    --type "SPARK" \
    --release-label "emr-6.11.0"

However, an Application created this way has no network configuration. Since our DeltaStreamer CDC job needs to access the Confluent Schema Registry and the Kafka bootstrap servers located in a specific VPC, we must explicitly set the subnet and security group for the Application to ensure that DeltaStreamer can communicate with those two services. We therefore create an Application with an explicit network configuration using the following command:

# Hands-on step (4)
cat << EOF > $APP_LOCAL_HOME/create-application.json
{
    "name":"$APP_NAME",
    "releaseLabel":"emr-6.11.0",
    "type":"SPARK",
    "networkConfiguration":{
        "subnetIds":[
            "$EMR_SERVERLESS_APP_SUBNET_ID"
        ],
        "securityGroupIds":[
            "$EMR_SERVERLESS_APP_SECURITY_GROUP_ID"
        ]
    }
}
EOF
jq . $APP_LOCAL_HOME/create-application.json
export EMR_SERVERLESS_APP_ID=$(aws emr-serverless create-application \
    --no-paginate --no-cli-pager --output text \
    --release-label "emr-6.11.0" --type "SPARK" \
    --cli-input-json file://$APP_LOCAL_HOME/create-application.json \
    --query "applicationId")

7. Submit the Apache Hudi DeltaStreamer CDC job

Once the Application is created, jobs can be submitted to it. The Apache Hudi DeltaStreamer CDC job is relatively complex and has a large number of configuration items, as can be seen from the example in the official Hudi blog. What we need to do is "translate" the job submitted there with the spark-submit command into an EMR Serverless job.

7.1 Prepare job description file

Submitting an EMR Serverless job from the command line requires a job description file in JSON format; the parameters normally configured on the spark-submit command line are described in this file. The DeltaStreamer job has many configuration items, and due to space limitations we cannot explain them one by one, but comparing the job description file below with the native Spark job in the official Hudi blog should make its role relatively easy to understand.

Note that when executing the following script, you need to replace all of the <your-xxx> placeholders according to your AWS account and local environment. These values depend on the source database, source table, Kafka topic, Schema Registry, and other details of your environment, and they have to be adjusted every time a different table is ingested, so they are not extracted into global variables.

In addition, this job does not actually depend on any third-party jar packages: the Confluent Avro converter it uses is already integrated into hudi-utilities-bundle.jar. The --conf spark.jars=$(...) setting in the example is provided only for the reader's reference, in case additional dependencies are needed.

# Hands-on step (5)
cat << EOF > $APP_LOCAL_HOME/start-job-run.json
{
    "name":"apache-hudi-delta-streamer",
    "applicationId":"$EMR_SERVERLESS_APP_ID",
    "executionRoleArn":"$EMR_SERVERLESS_EXECUTION_ROLE_ARN",
    "jobDriver":{
        "sparkSubmit":{
        "entryPoint":"/usr/lib/hudi/hudi-utilities-bundle.jar",
        "entryPointArguments":[
            "--continuous",
            "--enable-sync",
            "--table-type", "COPY_ON_WRITE",
            "--op", "UPSERT",
            "--target-base-path", "<your-table-s3-path>",
            "--target-table", "<your-table>",
            "--min-sync-interval-seconds", "60",
            "--source-class", "org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource",
            "--source-ordering-field", "_event_origin_ts_ms",
            "--payload-class", "org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload",
            "--hoodie-conf", "bootstrap.servers=$KAFKA_BOOTSTRAP_SERVERS",
            "--hoodie-conf", "schema.registry.url=$SCHEMA_REGISTRY_URL",
            "--hoodie-conf", "hoodie.deltastreamer.schemaprovider.registry.url=${SCHEMA_REGISTRY_URL}/subjects/<your-registry-name>.<your-src-database>.<your-src-table>-value/versions/latest",
            "--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer",
            "--hoodie-conf", "hoodie.deltastreamer.source.kafka.topic=<your-kafka-topic-of-your-table-cdc-message>",
            "--hoodie-conf", "auto.offset.reset=earliest",
            "--hoodie-conf", "hoodie.datasource.write.recordkey.field=<your-table-recordkey-field>",
            "--hoodie-conf", "hoodie.datasource.write.partitionpath.field=<your-table-partitionpath-field>",
            "--hoodie-conf", "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "--hoodie-conf", "hoodie.datasource.write.hive_style_partitioning=true",
            "--hoodie-conf", "hoodie.datasource.hive_sync.database=<your-sync-database>",
            "--hoodie-conf", "hoodie.datasource.hive_sync.table==<your-sync-table>",
            "--hoodie-conf", "hoodie.datasource.hive_sync.partition_fields=<your-table-partition-fields>"
        ],
         "sparkSubmitParameters":"--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.jars=<your-app-dependent-jars>"
        }
   },
   "configurationOverrides":{
        "monitoringConfiguration":{
            "s3MonitoringConfiguration":{
                "logUri":"<your-s3-location-for-emr-logs>"
            }
        }
   }
}
EOF
jq . $APP_LOCAL_HOME/start-job-run.json

Here is an example:

# Example (not a hands-on step)
cat << EOF > $APP_LOCAL_HOME/start-job-run.json
{
    "name":"apache-hudi-delta-streamer",
    "applicationId":"$EMR_SERVERLESS_APP_ID",
    "executionRoleArn":"$EMR_SERVERLESS_EXECUTION_ROLE_ARN",
    "jobDriver":{
        "sparkSubmit":{
        "entryPoint":"/usr/lib/hudi/hudi-utilities-bundle.jar",
        "entryPointArguments":[
            "--continuous",
            "--enable-sync",
            "--table-type", "COPY_ON_WRITE",
            "--op", "UPSERT",
            "--target-base-path", "$APP_S3_HOME/data/mysql-server-3/inventory/orders",
            "--target-table", "orders",
            "--min-sync-interval-seconds", "60",
            "--source-class", "org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource",
            "--source-ordering-field", "_event_origin_ts_ms",
            "--payload-class", "org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload",
            "--hoodie-conf", "bootstrap.servers=$KAFKA_BOOTSTRAP_SERVERS",
            "--hoodie-conf", "schema.registry.url=$SCHEMA_REGISTRY_URL",
            "--hoodie-conf", "hoodie.deltastreamer.schemaprovider.registry.url=${SCHEMA_REGISTRY_URL}/subjects/osci.mysql-server-3.inventory.orders-value/versions/latest",
            "--hoodie-conf", "hoodie.deltastreamer.source.kafka.value.deserializer.class=io.confluent.kafka.serializers.KafkaAvroDeserializer",
            "--hoodie-conf", "hoodie.deltastreamer.source.kafka.topic=osci.mysql-server-3.inventory.orders",
            "--hoodie-conf", "auto.offset.reset=earliest",
            "--hoodie-conf", "hoodie.datasource.write.recordkey.field=order_number",
            "--hoodie-conf", "hoodie.datasource.write.partitionpath.field=order_date",
            "--hoodie-conf", "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "--hoodie-conf", "hoodie.datasource.write.hive_style_partitioning=true",
            "--hoodie-conf", "hoodie.datasource.hive_sync.database=inventory",
            "--hoodie-conf", "hoodie.datasource.hive_sync.table=orders",
            "--hoodie-conf", "hoodie.datasource.hive_sync.partition_fields=order_date"
        ],
         "sparkSubmitParameters":"--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.jars=$(aws s3 ls $APP_S3_HOME/jars/ | grep -o '\S*\.jar$'| awk '{print "'"$APP_S3_HOME/jars/"'"$1","}' | tr -d '\n' | sed 's/,$//')"
        }
   },
   "configurationOverrides":{
        "monitoringConfiguration":{
            "s3MonitoringConfiguration":{
                "logUri":"$APP_S3_HOME/logs"
            }
        }
   }
}
EOF
jq . $APP_LOCAL_HOME/start-job-run.json
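
A note on the spark.jars sub-shell in the example above: it simply lists whatever jar files you may have uploaded to $APP_S3_HOME/jars/ and joins them into a comma-separated list. If you do need extra dependencies, the following sketch (not a hands-on step; the jar file name is purely illustrative) shows how to upload them and preview the value the sub-shell will produce:

# Example (not a hands-on step): upload optional dependency jars and preview the spark.jars value
aws s3 cp ./<your-extra-dependency>.jar $APP_S3_HOME/jars/
aws s3 ls $APP_S3_HOME/jars/ | grep -o '\S*\.jar$' | \
    awk '{print "'"$APP_S3_HOME/jars/"'"$1","}' | tr -d '\n' | sed 's/,$//'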

7.2 Submitting the job

With the job description file ready, the job can be formally submitted with the following command:

# Hands-on step (6)
export EMR_SERVERLESS_JOB_RUN_ID=$(aws emr-serverless start-job-run \
    --no-paginate --no-cli-pager --output text \
    --name apache-hudi-delta-streamer \
    --application-id $EMR_SERVERLESS_APP_ID \
    --execution-role-arn $EMR_SERVERLESS_EXECUTION_ROLE_ARN \
    --execution-timeout-minutes 0 \
    --cli-input-json file://$APP_LOCAL_HOME/start-job-run.json \
    --query jobRunId)

7.3 Monitoring jobs

After the job is submitted, you can view its running status in the console. If you want to continuously monitor the job from a command-line window, you can use the following script:

# Hands-on step (7)
now=$(date +%s)sec
while true; do
    jobStatus=$(aws emr-serverless get-job-run \
                    --no-paginate --no-cli-pager --output text \
                    --application-id $EMR_SERVERLESS_APP_ID \
                    --job-run-id $EMR_SERVERLESS_JOB_RUN_ID \
                    --query jobRun.state)
    if [ "$jobStatus" = "PENDING" ] || [ "$jobStatus" = "SCHEDULED" ] || [ "$jobStatus" = "RUNNING" ]; then
        for i in {0..5}; do
            echo -ne "\E[33;5m>>> The job [ $EMR_SERVERLESS_JOB_RUN_ID ] state is [ $jobStatus ], duration [ $(date -u --date now-$now +%H:%M:%S) ] ....\r\E[0m"
            sleep 1
        done
    else
        echo -ne "The job [ $EMR_SERVERLESS_JOB_RUN_ID ] is [ $jobStatus ]\n\n"
        break
    fi
done

7.4 Error retrieval

After the job starts running, the Spark driver and executors continuously produce logs, which are stored under the configured $APP_S3_HOME/logs path. If the job fails, you can use the following script to quickly retrieve the error messages:

# Hands-on step (8)
JOB_LOG_HOME=$APP_LOCAL_HOME/log/$EMR_SERVERLESS_JOB_RUN_ID
rm -rf $JOB_LOG_HOME && mkdir -p $JOB_LOG_HOME
aws s3 cp --recursive $APP_S3_HOME/logs/applications/$EMR_SERVERLESS_APP_ID/jobs/$EMR_SERVERLESS_JOB_RUN_ID/ $JOB_LOG_HOME >& /dev/null
gzip -d -r -f $JOB_LOG_HOME >& /dev/null
grep --color=always -r -i -E 'error|failed|exception' $JOB_LOG_HOME

7.5 Stopping a job

DeltaStreamer is a continuously running job. If you need to stop the job, you can use the following command:

# Hands-on step (9)
aws emr-serverless cancel-job-run \
    --no-paginate --no-cli-pager\
    --application-id $EMR_SERVERLESS_APP_ID \
    --job-run-id $EMR_SERVERLESS_JOB_RUN_ID

8. Result Verification

After the job starts, it automatically creates the data table and writes data to the specified S3 location. Use the following commands to view the automatically created table and the data files landed on S3:

# Hands-on step (10)
aws s3 ls --recursive <your-table-s3-path>
aws glue get-table --no-paginate --no-cli-pager \
    --database-name <your-sync-database> --name <your-sync-table>
# Example (not a hands-on step)
aws s3 ls --recursive $APP_S3_HOME/data/mysql-server-3/inventory/orders/
aws glue get-table --no-paginate --no-cli-pager \
    --database-name inventory --name orders
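
If Amazon Athena in your account can read the Glue database synced above, you can also verify the table contents with a query. The following sketch (not a hands-on step) assumes the example's inventory.orders table and uses an arbitrary location under the application bucket for query results:

# Example (not a hands-on step): query the synced Hudi table via Athena
QUERY_ID=$(aws athena start-query-execution \
    --no-cli-pager --output text \
    --query-string "SELECT order_number, order_date FROM orders LIMIT 10" \
    --query-execution-context Database=inventory \
    --result-configuration OutputLocation=$APP_S3_HOME/athena-results/ \
    --query QueryExecutionId)
sleep 10   # give the query a moment to finish; poll get-query-execution for a robust status check
aws athena get-query-results --no-cli-pager --query-execution-id $QUERY_ID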

9. Evaluation and Outlook

In this article, we have introduced in detail how to run Apache Hudi DeltaStreamer on EMR Serverless to ingest CDC data into Hudi tables. It is an ultra-lightweight solution featuring "zero coding" and "zero operation and maintenance". Its limitation, however, is also obvious: one DeltaStreamer job can ingest only one table, which is impractical for data lakes that need to ingest hundreds or even thousands of tables. Hudi does provide HoodieMultiTableDeltaStreamer for multi-table ingestion, but the current maturity and completeness of that utility class are not yet sufficient for production. In addition, since version 0.10.0 Hudi has also provided a Hudi Sink plug-in for Kafka Connect (currently limited to a single table), which opens up a new path for ingesting CDC data into the Hudi data lake and is a new highlight worth following.

In the long run, ingesting CDC data into the lake and landing it as Hudi tables is a very common requirement, and the community's calls for iterating and improving native components such as DeltaStreamer, HoodieMultiTableDeltaStreamer, and the Kafka Connect Hudi Sink plug-in will only grow stronger. I believe that as Hudi continues to develop vigorously, these components will keep maturing and gradually be applied in production environments.
