Building a data ETL task with Google Cloud Data Fusion

Google Cloud Platform offers Data Fusion, a graphical pipeline editor built on the open source CDAP project that lets you put together data processing tasks without writing code. Suppose we want to build an ETL task that consumes data from Kafka, processes it, and stores the result in BigQuery.

First we need to prepare some test data and send it to Kafka. Here I set up a Kafka pod in a GKE environment and sent a few simple JSON messages to the test topic.
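As a rough sketch, a few JSON test messages can be produced with a short Python script like the one below. It assumes the kafka-python package; the broker address and topic name come from my setup, and the message fields are only placeholders.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic from my test setup; replace with your own.
BROKERS = "10.0.0.100:9094"
TOPIC = "test"

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A few simple JSON messages; the field names are only examples.
for i in range(10):
    producer.send(TOPIC, {"id": i, "name": f"user_{i}", "score": i * 10})

producer.flush()
producer.close()
```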

Create Data Fusion Instance

Open Data Fusion in the GCP console and select Create an instance. On the configuration page I did not choose the latest version, 6.9.2, because I found it has problems parsing JSON; I chose 6.8.3 instead. Under Advanced Options I selected Enable Private IP, because my Kafka broker does not expose a public IP, so the Data Fusion instance can only reach it over a private IP. For the Associated Network, select the VPC in which this private IP should be allocated. Then click Create to create the instance.
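For reference, the same instance can in principle be created programmatically. The following is an untested sketch against what I understand to be the Data Fusion v1 REST API (the project, region, instance name, VPC name, and IP allocation range are placeholders, and the exact body fields should be checked against the API reference).

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholder values; replace with your own project, region, and VPC.
PROJECT = "my-project"
LOCATION = "us-central1"
INSTANCE_ID = "my-fusion-instance"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# Assumed request body: a private instance on version 6.8.3,
# attached to the VPC where Kafka lives.
body = {
    "type": "BASIC",
    "version": "6.8.3",
    "privateInstance": True,
    "networkConfig": {
        "network": "my-vpc",
        "ipAllocation": "10.128.0.0/22",  # assumed /22 range for the tenant network
    },
}

resp = session.post(
    f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/instances",
    params={"instanceId": INSTANCE_ID},
    json=body,
)
resp.raise_for_status()
print(resp.json())  # returns a long-running operation to poll
```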

VPC Network Peering

Creating the instance takes a while. Once it is done, click the instance name to see its details and copy its Tenant project ID. Then go to VPC Network and set up VPC network peering. Data Fusion schedules its tasks onto a Dataproc cluster that runs in a separate tenant network, so if that cluster is to reach our Kafka broker, we need to peer the tenant network with the VPC where Kafka lives. Select Create peering connection, choose the VPC network where Kafka is located under Your VPC network, select In another project under Peered network, enter the Tenant project ID you just copied, check Export custom routes under Exchange IPv4 custom routes, and then click Create.
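The same peering can also be created with the Compute Engine API. The sketch below uses the google-api-python-client networks.addPeering call; the project and network names are placeholders, and the tenant network name in particular is an assumption (copy the exact peered network shown in the console).

```python
from googleapiclient import discovery  # pip install google-api-python-client

# Placeholder names; replace with your own values.
MY_PROJECT = "my-project"
MY_NETWORK = "my-vpc"                   # the VPC where Kafka lives
TENANT_PROJECT = "tenant-project-id"    # the copied Tenant project ID
TENANT_NETWORK = "tenant-network-name"  # the Data Fusion tenant VPC (check the console)

compute = discovery.build("compute", "v1")

body = {
    "networkPeering": {
        "name": "datafusion-peering",
        "network": f"projects/{TENANT_PROJECT}/global/networks/{TENANT_NETWORK}",
        "exchangeSubnetRoutes": True,
        # Equivalent to checking "Export custom routes" in the console.
        "exportCustomRoutes": True,
    }
}

request = compute.networks().addPeering(
    project=MY_PROJECT, network=MY_NETWORK, body=body
)
print(request.execute())
```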

Firewall rule settings

In my VPC network the default-allow-internal firewall rule had been deleted for security reasons, but the VMs in the Dataproc cluster need to communicate with each other, so we have to add a rule for them. In the VPC network's Firewall rules, add a rule with Direction set to Ingress, and set both Sources and Targets to tags, using the tag name Dataproc. This tag will be set in the Data Fusion compute profile later.
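A sketch of an equivalent rule via the Compute Engine API is below. The project and network names are placeholders, and the port ranges simply mirror what default-allow-internal used to permit between the tagged Dataproc VMs.

```python
from googleapiclient import discovery  # pip install google-api-python-client

MY_PROJECT = "my-project"  # placeholder
MY_NETWORK = "my-vpc"      # placeholder

compute = discovery.build("compute", "v1")

# Ingress rule letting VMs tagged "Dataproc" reach each other,
# mirroring the deleted default-allow-internal rule.
body = {
    "name": "allow-dataproc-internal",
    "network": f"global/networks/{MY_NETWORK}",
    "direction": "INGRESS",
    "sourceTags": ["Dataproc"],
    "targetTags": ["Dataproc"],
    "allowed": [
        {"IPProtocol": "tcp", "ports": ["0-65535"]},
        {"IPProtocol": "udp", "ports": ["0-65535"]},
        {"IPProtocol": "icmp"},
    ],
}

print(compute.firewalls().insert(project=MY_PROJECT, body=body).execute())
```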

Set up ETL Pipeline

In the Data Fusion Instances list, select View instance to open the instance we just created, then click Wrangler and select Add connection. Choose Kafka, enter the broker address in Kafka Brokers, for example 10.0.0.100:9094, and click Test connection. If the network setup from the previous steps is complete, the connection should succeed. Wrangler then opens the Kafka connection and lists all the topics in the cluster. Click the topic we want to test and you can see the data already in it. Next, click the small arrow next to the Message column in the upper left corner and choose Parse as JSON from the drop-down menu; the JSON fields are parsed into individual columns, which are listed in the panel on the right. Here I renamed the fields so that they match the column names of the BigQuery table I will create later.
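Conceptually, these two Wrangler steps amount to the following transformation, shown here as a plain Python sketch. The field names and the rename mapping are hypothetical; use whatever your JSON messages and BigQuery columns actually contain.

```python
import json

# A raw Kafka message value as Wrangler sees it (placeholder fields).
message = '{"id": 1, "name": "user_1", "score": 10}'

# "Parse as JSON" turns the message body into individual columns.
record = json.loads(message)

# Renaming so the output columns match the BigQuery table schema
# (hypothetical mapping; substitute your own column names).
rename_map = {"id": "user_id", "name": "user_name", "score": "user_score"}
row = {rename_map.get(key, key): value for key, value in record.items()}

print(row)  # {'user_id': 1, 'user_name': 'user_1', 'user_score': 10}
```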

Once the data parses correctly, click Create pipeline in the upper right corner to create a task.

In the newly opened pipeline editing window there are currently two stages: the Kafka connection and the Wrangler. We need to add one more stage to write the data parsed by Wrangler into BigQuery. Under Sink in the left-hand menu, click BigQuery, then drag the arrow from the Wrangler box to connect it to the newly added BigQuery stage. Open the BigQuery stage's Properties, select Use connection, and choose the BigQuery dataset and table you created earlier. The important point is that the field names output by Wrangler must match the column names of the table.
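The target dataset and table need to exist before the pipeline runs. A minimal sketch with the google-cloud-bigquery client, reusing the placeholder column names from the Wrangler example above, looks like this; the project, dataset, and table names are assumptions.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder project, dataset, and table names.
dataset_id = "my-project.kafka_etl"
table_id = f"{dataset_id}.events"

client.create_dataset(bigquery.Dataset(dataset_id), exists_ok=True)

# Column names must match the fields output by Wrangler.
schema = [
    bigquery.SchemaField("user_id", "INTEGER"),
    bigquery.SchemaField("user_name", "STRING"),
    bigquery.SchemaField("user_score", "INTEGER"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
```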

After the pipeline is set up, click Deploy to deploy it.

Set up Compute Profile

Click System admin in the upper right corner of Data Fusion, then under Configuration, open System compute profiles and create a new profile.

In the profile we can configure the machines of the Dataproc cluster that will be provisioned. Under General settings I need to set the correct subnet, because my VPC network has several subnets with different policies, and the cluster must run in the subnet where Kafka is located. In the Network tags field of Cluster metadata, enter the tag Dataproc that we used when configuring the firewall rule, so that the rule applies to our cluster.

After setting the new profile as the default, we can run the pipeline by clicking Run in the interface. Opening Logs, we can follow the entire pipeline run. Once the pipeline finishes successfully, we can open the corresponding BigQuery table and see that the data has been consumed from Kafka and written to BigQuery.
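Besides eyeballing the table in the console, a quick query through the BigQuery client confirms the rows arrived. The table name below is the same placeholder used earlier.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder table name; use whatever the pipeline actually writes to.
query = "SELECT COUNT(*) AS row_count FROM `my-project.kafka_etl.events`"
for row in client.query(query).result():
    print(f"rows written by the pipeline: {row.row_count}")
```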
