Best practices for building data ETL tasks on Google Cloud Platform

In data processing we often need to build ETL tasks: load the data, transform it, and write it to a data store. Google Cloud Platform offers a variety of ways to build such tasks. I studied these options and compared their advantages and disadvantages to find the one that best fits my business scenario.

Assume our business scenario requires periodically pulling data from Kafka. After some cleaning, joining, and enrichment, the data is written to the BigQuery data warehouse so that statistical analysis reports can be generated later.

Google Cloud Platform provides several solutions to accomplish this task:

1. Data Fusion: design the ETL pipeline in a graphical UI; the pipeline is then converted into a Spark application and deployed to run on Dataproc.

2. Write a Spark application and run it on Dataproc, or schedule it with the Spark operator on a Kubernetes (K8s) cluster.

3. Write Apache Beam code and execute it on VMs through the Dataflow runner.

The advantage of option one is that essentially no code is needed; the pipeline can be designed entirely in the graphical interface. The disadvantages are that extra requirements can be awkward to implement and, above all, it is expensive: Data Fusion must run on its own instance 24 hours a day, the Enterprise edition of that instance costs a few dollars per hour, and each pipeline run additionally schedules a Dataproc instance, which incurs further cost.

The advantage of option two is that Spark code can flexibly meet all kinds of requirements. The disadvantage is that it is also relatively expensive: Dataproc is based on a Hadoop cluster and needs VMs for ZooKeeper as well as the driver and executors. With a K8s cluster, the Spark operator has to run in its own pod 24 hours a day, and additional driver and executor pods must be scheduled for each run.

After weighing everything up, option three is the best fit: Beam provides a unified batch and stream processing framework whose code can run on Spark, Flink, Dataflow and other engines, and Dataflow is an excellent engine provided by Google. While a job runs, Dataflow schedules VMs on demand and charges only for the runtime.

Therefore, option three is the most appropriate for my business scenario. Below I describe the whole implementation process.

Implementation of Beam batch processing tasks

Among the official Dataflow templates there is an example that consumes Kafka data and writes it to BigQuery, but it is implemented as a streaming job. My scenario does not need such real-time processing; the data only needs to be consumed periodically, so batch processing is more appropriate and also significantly cheaper.

Beam's Kafka I/O connector reads unbounded (streaming) data by default. To process in batch mode you need to call the withStartReadTime and withStopReadTime methods, which determine the start and end offsets of the Kafka topic to read so that the data becomes bounded. Note that if Kafka has no message with a timestamp greater than or equal to the given timestamp, an error will be raised, so we first need to determine a suitable timestamp.

The following code checks whether every partition of the Kafka topic contains a message with a timestamp greater than or equal to the timestamp we specify. If not, we fall back to the earliest of the partitions' latest timestamps. For example, suppose the topic has 3 partitions and the intended stop timestamp is 1697289783000, but all messages in the 3 partitions are older than that; we then look up the latest message timestamp in each partition and take the earliest of those three values as our designated timestamp.

import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckKafkaMsgTimestamp {
    private static final Logger LOG = LoggerFactory.getLogger(CheckKafkaMsgTimestamp.class);

    public static KafkaResult getTimestamp(String bootstrapServer, String topic, long startTimestamp, long stopTimestamp) {
        long max_timestamp = stopTimestamp;
        long max_records = 5L;   // number of trailing records to inspect in each partition
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServer);
        props.setProperty("group.id", "test");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "1000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Get all the partitions of the topic
        int partition_num = consumer.partitionsFor(topic).size();
        HashMap<TopicPartition, Long> search_map = new HashMap<>();
        ArrayList<TopicPartition> tp = new ArrayList<>();
        for (int i=0;i<partition_num;i++) {
            search_map.put(new TopicPartition(topic, i), stopTimestamp);
            tp.add(new TopicPartition(topic, i));
        }
        // Check if message exist with timestamp greater than search timestamp
        boolean flag = true;
        ArrayList<TopicPartition> selected_tp = new ArrayList<>();
        //LOG.info("Start to check the timestamp {}", stopTimestamp);
        Map<TopicPartition, OffsetAndTimestamp> results = consumer.offsetsForTimes(search_map);
        for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : results.entrySet()) {
            OffsetAndTimestamp value = entry.getValue();
            if (value == null) {   // at least one partition has no message with a timestamp >= stopTimestamp
                flag = false;
                break;
            }
        }
        // If the check failed, look up the latest message timestamp in each partition
        // and use the earliest of those per-partition maxima as the stop timestamp.
        if (!flag) {
            max_timestamp = 0L;
            consumer.assign(tp);
            Map<TopicPartition, Long> endoffsets = consumer.endOffsets(tp);
            for (Map.Entry<TopicPartition, Long> entry : endoffsets.entrySet()) {
                Long temp_timestamp = 0L;
                int record_count = 0;
                TopicPartition t = entry.getKey();
                long offset = entry.getValue();
                if (offset < 1) {
                    LOG.warn("Can not get max_timestamp as partition has no record!");
                    continue;
                }
                consumer.assign(Arrays.asList(t));
                consumer.seek(t, offset > max_records ? offset - max_records : 0);   // rewind to read only the last few records
            
                Iterator<ConsumerRecord<String, String>> records = consumer.poll(Duration.ofSeconds(2)).iterator();
                while (records.hasNext()) {
                    record_count++;
                    ConsumerRecord<String, String> record = records.next();
                    LOG.info("Topic: {}, Record Timestamp: {}, recordcount: {}", t, record.timestamp(), record_count);
                    if (temp_timestamp == 0L || record.timestamp() > temp_timestamp) {
                        temp_timestamp = record.timestamp();
                    }
                }
                //LOG.info("Record count: {}", record_count);
                if (temp_timestamp > 0L && temp_timestamp > startTimestamp) {
                    if (max_timestamp == 0L || max_timestamp > temp_timestamp) {
                        max_timestamp = temp_timestamp;
                    }
                    selected_tp.add(t);
                    LOG.info("Temp_timestamp {}", temp_timestamp);
                    LOG.info("Selected topic partition {}", t);
                    LOG.info("Partition offset {}", consumer.position(t));
                    //consumer.seek(t, -1L);
                }
            }
        } else {
            selected_tp = tp;
        }
        consumer.close();
        LOG.info("Max Timestamp: {}", max_timestamp);
        return new KafkaResult(max_timestamp, selected_tp);
    }
}
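
The KafkaResult type returned above is not a library class but a small holder for the adjusted stop timestamp and the selected partitions. Below is a minimal sketch of such a class, with field names taken from the usage above; treat it as an illustration rather than the exact original definition:

import java.util.ArrayList;
import org.apache.kafka.common.TopicPartition;

// Simple value holder returned by CheckKafkaMsgTimestamp.getTimestamp.
public class KafkaResult {
    public final long max_timestamp;                     // adjusted stop timestamp
    public final ArrayList<TopicPartition> selected_tp;  // partitions that contain usable data

    public KafkaResult(long max_timestamp, ArrayList<TopicPartition> selected_tp) {
        this.max_timestamp = max_timestamp;
        this.selected_tp = selected_tp;
    }
}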

By calling the code above we obtain the partitions to read and the corresponding stop timestamp. With these two pieces of information we can turn the Kafka data in the specified time range into a bounded collection. The following code creates the Beam pipeline, processes the data, and writes it to BigQuery:

KafkaResult checkResult = CheckKafkaMsgTimestamp.getTimestamp(options.getBootstrapServer(), options.getInputTopic(), start_read_time, stop_read_time);
stop_read_time = checkResult.max_timestamp;
ArrayList<TopicPartition> selected_tp = checkResult.selected_tp;

PCollection<String> input = pipeline
    .apply("Read messages from Kafka",
        KafkaIO.<String, String>read()
            .withBootstrapServers(options.getBootstrapServer())
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withConsumerConfigUpdates(ImmutableMap.of("group.id", "telematics_statistic.app", "enable.auto.commit", true))
            .withStartReadTime(Instant.ofEpochMilli(start_read_time))
            .withStopReadTime(Instant.ofEpochMilli(stop_read_time))
            .withTopicPartitions(selected_tp)
            .withoutMetadata())
    .apply("Get message contents", Values.<String>create());

PCollectionTuple msgTuple = input
    .apply("Filter message", ParDo.of(new DoFn<String, TelematicsStatisticsMsg>() {
        @ProcessElement
        public void processElement(@Element String element, MultiOutputReceiver out) {
            TelematicsStatisticsMsg msg = GSON.fromJson(element, TelematicsStatisticsMsg.class);
            if (msg.timestamp==0 || msg.vin==null) {
                out.get(otherMsgTag).output(element);
            } else {
                if (msg.timestamp<start_process_time || msg.timestamp>=stop_process_time) {
                    out.get(otherMsgTag).output(element);
                } else {
                    out.get(statisticsMsgTag).output(msg);
                }
            }
        }
    })
    .withOutputTags(statisticsMsgTag, TupleTagList.of(otherMsgTag))); 

// Get the filtered statistics messages
PCollection<TelematicsStatisticsMsg> statisticsMsg = msgTuple.get(statisticsMsgTag);
// Save the raw records to BigQuery
statisticsMsg
    .apply("Convert raw records to BigQuery TableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
        .via(msg -> new TableRow()
            .set("timestamp", Instant.ofEpochMilli(msg.timestamp).toString())
            .set("vin", msg.vin)
            .set("service", msg.service)
            .set("type", msg.messageType)))
    .apply("Save raw records to BigQuery", BigQueryIO.writeTableRows()
        .to(options.getStatisticsOutputTable())
        .withSchema(new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("timestamp").setType("TIMESTAMP"),
            new TableFieldSchema().setName("vin").setType("STRING"),
            new TableFieldSchema().setName("service").setType("STRING"),
            new TableFieldSchema().setName("type").setType("STRING"))))
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

PipelineResult result = pipeline.run();
try {
    result.getState();
    result.waitUntilFinish();
} catch (UnsupportedOperationException e) {
    // do nothing
} catch (Exception e) {
    e.printStackTrace();
}
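
The pipeline fragment above relies on a few declarations that are not shown, such as the output TupleTags, the TelematicsStatisticsMsg message class, the custom pipeline options, and the GSON instance and start_process_time/stop_process_time bounds. The sketch below shows what the first three of these might look like if they lived in the TelematicsBatch main class; the names and fields are inferred from the usage above, so treat them as placeholders rather than the original definitions:

import java.io.Serializable;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.values.TupleTag;

public class TelematicsBatch {
    // Output tags for the multi-output ParDo; anonymous subclasses preserve the element type.
    static final TupleTag<TelematicsStatisticsMsg> statisticsMsgTag = new TupleTag<TelematicsStatisticsMsg>() {};
    static final TupleTag<String> otherMsgTag = new TupleTag<String>() {};

    // Shape of the Kafka JSON payload parsed with Gson.
    public static class TelematicsStatisticsMsg implements Serializable {
        public long timestamp;   // epoch milliseconds
        public String vin;
        public String service;
        public String messageType;
    }

    // Custom options read via options.getXxx() in the pipeline code.
    public interface TelematicsPipelineOptions extends DataflowPipelineOptions {
        String getBootstrapServer();
        void setBootstrapServer(String value);

        String getInputTopic();
        void setInputTopic(String value);

        String getStatisticsOutputTable();
        void setStatisticsOutputTable(String value);
    }
}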

Note that after each run completes we need to record the current stopReadTime and use it as the startReadTime of the next run; this avoids missing data in some situations. The timestamp can be stored in a GCS bucket; that bookkeeping code is omitted here.
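
As an illustration only (not the original code), here is one possible way to persist the timestamp with the google-cloud-storage Java client; the bucket and object names are placeholders:

import java.nio.charset.StandardCharsets;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// Persists the last stopReadTime in a small GCS object between runs.
public class ReadTimeCheckpoint {
    private static final String BUCKET = "your-bucket-name";            // placeholder
    private static final String OBJECT = "telematics/last_stop_time";   // placeholder

    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    // Save the stopReadTime of the finished run so the next run can start from it.
    public void save(long stopReadTime) {
        BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(BUCKET, OBJECT)).build();
        storage.create(blobInfo, Long.toString(stopReadTime).getBytes(StandardCharsets.UTF_8));
    }

    // Load the previously saved timestamp, falling back to a default when no checkpoint exists yet.
    public long load(long defaultStartTime) {
        Blob blob = storage.get(BlobId.of(BUCKET, OBJECT));
        if (blob == null) {
            return defaultStartTime;
        }
        return Long.parseLong(new String(blob.getContent(), StandardCharsets.UTF_8).trim());
    }
}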

Submitting the Dataflow job

Next we use Google Cloud Build to package the code as a Dataflow Flex Template.

First, run mvn clean package in the Java project to build the jar file.

Then set the following environment variables on the command line:

export TEMPLATE_PATH="gs://[your project ID]/dataflow/templates/telematics-pipeline.json" 
export TEMPLATE_IMAGE="gcr.io/[your project ID]/telematics-pipeline:latest" 
export REGION="us-west1"

Then run the gcloud command to build the Flex Template image:

gcloud dataflow flex-template build $TEMPLATE_PATH --image-gcr-path "$TEMPLATE_IMAGE" --sdk-language "JAVA" --flex-template-base-image "gcr.io/dataflow-templates-base/java17-template-launcher-base:20230308_RC00" --jar "target/telematics-pipeline-1.0-SNAPSHOT.jar" --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.TelematicsBatch"

Finally, run the following command to submit the job for execution:

gcloud dataflow flex-template run "analytics-pipeline-`date +%Y%m%d-%H%M%S`" --template-file-gcs-location "$TEMPLATE_PATH" --region "us-west1" --parameters ^~^bootstrapServer="kafka-1:9094,kafka-2:9094"~statisticsOutputTable="yourprojectid:dataset.tablename"~serviceAccount="[email protected]"~region="us-west1"~usePublicIps=false~runner=DataflowRunner~subnetwork="XXXX"~tempLocation=gs://bucketname/temp/~startTime=1693530000000~stopTime=1697216400000~processStartTime=1693530000000~processStopTime=1697216400000

If the task needs to run automatically on a regular schedule, we can also create a pipeline in Dataflow by importing it from the TEMPLATE_PATH specified earlier, and then simply set the recurrence interval and start time of the job, which is very convenient.
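
As an alternative to configuring the schedule in the Dataflow console, a Cloud Scheduler job can call the Dataflow flexTemplates.launch REST endpoint on a schedule. The following is only a rough sketch; the job name, schedule, project ID, and parameters are placeholders and would need to match your template and options:

gcloud scheduler jobs create http telematics-batch-daily --location="us-west1" --schedule="0 2 * * *" --http-method=POST --uri="https://dataflow.googleapis.com/v1b3/projects/[your project ID]/locations/us-west1/flexTemplates:launch" --oauth-service-account-email="[email protected]" --message-body='{"launchParameter": {"jobName": "telematics-batch", "containerSpecGcsPath": "gs://[your project ID]/dataflow/templates/telematics-pipeline.json", "parameters": {"bootstrapServer": "kafka-1:9094,kafka-2:9094"}}}'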

