Use Google CDC to synchronize Cloud SQL data to BigQuery

On Google Cloud Platform, I created a Cloud SQL for PostgreSQL instance and stored some business data in it. Now we need to synchronize this data to the BigQuery data warehouse on a regular basis, so that we can analyze and process it in BigQuery and generate data reports.

Google provides the Datastream service, which uses CDC (change data capture) to replicate changes in the Cloud SQL database, such as inserts, updates and deletes, to a BigQuery dataset. The rest of this post describes how to set up Datastream to do this.

My Cloud SQL instance does not expose a public IP, so we need VPC peering to connect Datastream's VPC to the VPC network of my GCP project. In addition, Cloud SQL lives in a separate service network that is itself peered with our VPC. Because VPC peering is not transitive, Datastream cannot reach Cloud SQL through our VPC directly, so we also need a reverse proxy inside our VPC to forward the traffic.

Set up Datastream private connection

In Datastream's Private Connectivity page, create a new private connectivity configuration. In it we select the VPC network, which is the VPC network our project currently uses. Then we assign an IP address range to Datastream, from which it creates a subnet. This range must not overlap with any range that is already allocated in the VPC, and it must be at least a /29 block.
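Before entering a range in the console, it can help to check the two constraints locally. The following is a minimal sketch in plain bash, assuming IPv4 CIDR notation; the function names and the example ranges are hypothetical and not part of Datastream.

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print the first and last address (as integers) of a CIDR block.
cidr_bounds() {
  local ip="${1%/*}" prefix="${1#*/}"
  local size=$(( 1 << (32 - prefix) ))
  local start=$(( $(ip_to_int "$ip") & ~(size - 1) ))
  echo "$start $(( start + size - 1 ))"
}

# Succeed if the two CIDR blocks share at least one address.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_bounds "$1")"
  read -r s2 e2 <<< "$(cidr_bounds "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}

# Succeed if the block is /29 or larger (prefix length 29 or smaller).
prefix_is_large_enough() {
  (( ${1#*/} <= 29 ))
}

candidate="10.0.1.0/29"   # hypothetical range for Datastream
in_use="10.0.0.0/24"      # hypothetical subnet that already exists

if prefix_is_large_enough "$candidate" && ! cidrs_overlap "$candidate" "$in_use"; then
  echo "$candidate looks usable"
else
  echo "$candidate is too small or overlaps $in_use"
fi
```

The check only compares against the ranges you pass in, so it has to be repeated for every subnet and allocated range in the VPC.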

Set firewall rules

In the firewall settings of the VPC network, add two new rules, one for ingress and one for egress. In both rules, enter the address range we just assigned to Datastream as the source range (ingress) or the destination range (egress), and allow TCP port 5432.
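The same two rules can also be created from the command line. The sketch below builds the gcloud commands and only prints them unless APPLY=1 is set, so it is safe to run as a dry run first. The network name, rule names and address range are hypothetical; the flags are those of gcloud compute firewall-rules create.

```shell
#!/bin/bash
NETWORK="my-vpc"                 # hypothetical VPC network name
DATASTREAM_RANGE="10.0.1.0/29"   # hypothetical range assigned to Datastream

# Print the gcloud command for one rule; run it instead when APPLY=1.
firewall_rule() {
  local name="$1" direction="$2" range_flag="$3"
  local cmd=(gcloud compute firewall-rules create "$name"
    --network="$NETWORK" --direction="$direction"
    --action=ALLOW --rules=tcp:5432
    "$range_flag=$DATASTREAM_RANGE")
  if [[ "${APPLY:-0}" == 1 ]]; then
    "${cmd[@]}"
  else
    echo "${cmd[*]}"
  fi
}

firewall_rule datastream-pg-ingress INGRESS --source-ranges
firewall_rule datastream-pg-egress EGRESS --destination-ranges
```

Running it with APPLY=1 requires an authenticated gcloud CLI with permission to edit firewall rules in the project.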

Set up reverse proxy

Set up a VM in the VPC network, and then run the following script on it as root to set up a reverse proxy:

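#!/bin/bash

# Address and port of the Cloud SQL PostgreSQL instance (replace the placeholders).
export DB_ADDR=[IP]
export DB_PORT=[PORT]

# Name of the VM's network interface: the first one that is not the loopback.
export ETH_NAME=$(ip -o link show | awk -F': ' '{print $2}' | grep -vx lo | head -n 1)

# The VM's own IPv4 address on that interface.
export LOCAL_IP_ADDR=$(ip -4 addr show "$ETH_NAME" | grep -Po 'inet \K[\d.]+')

# Let the kernel forward packets between interfaces.
echo 1 > /proc/sys/net/ipv4/ip_forward
# Redirect incoming connections on the database port to Cloud SQL.
iptables -t nat -A PREROUTING -p tcp -m tcp --dport "$DB_PORT" -j DNAT \
--to-destination "$DB_ADDR:$DB_PORT"
# Rewrite the source address so that replies come back through this VM.
iptables -t nat -A POSTROUTING -j SNAT --to-source "$LOCAL_IP_ADDR"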

Here, set DB_ADDR and DB_PORT to the private address and port of the Cloud SQL PostgreSQL instance.
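To confirm that the proxy forwards traffic, a TCP connection can be attempted from another machine in the VPC. The proxy VM itself is not a good place to test from, because locally generated packets do not pass through the PREROUTING chain. This is a small sketch that uses bash's built-in /dev/tcp redirection; the proxy address is a hypothetical value, and the check only shows that the port accepts connections, not that PostgreSQL authentication works.

```shell
#!/bin/bash
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

PROXY_ADDR="${PROXY_ADDR:-10.0.0.5}"   # hypothetical internal IP of the proxy VM
PROXY_PORT="${PROXY_PORT:-5432}"

if port_open "$PROXY_ADDR" "$PROXY_PORT"; then
  echo "proxy reachable on $PROXY_ADDR:$PROXY_PORT"
else
  echo "proxy not reachable on $PROXY_ADDR:$PROXY_PORT"
fi
```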

PG configuration

Connect to the PostgreSQL database, then create a publication and a replication slot.

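The following command gives the user the REPLICATION attribute, which it needs in order to read the replication stream (replace USER_NAME with the user that Datastream will connect as):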

ALTER USER USER_NAME WITH REPLICATION;

Create a publication that lists the tables to replicate. For example, to replicate the test table of the public schema, replace SCHEMA1 with public and TABLE1 with test. Further tables are added as a comma-separated list, as with SCHEMA2.TABLE2 below.

CREATE PUBLICATION PUBLICATION_NAME FOR TABLE SCHEMA1.TABLE1, SCHEMA2.TABLE2;

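Create a replication slot that uses the pgoutput plugin. Note that a slot name may contain only lowercase letters, digits and underscores.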

SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('REPLICATION_SLOT_NAME', 'pgoutput');
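Before moving on to Datastream, it may be worth checking that the database is ready for logical replication. The sketch below only generates the checking SQL; the function name and the publication and slot names are hypothetical. SHOW wal_level should report logical, which on Cloud SQL is controlled by setting the cloudsql.logical_decoding flag to on. The other two queries read the standard pg_publication_tables and pg_replication_slots catalog views.

```shell
#!/bin/bash
# Emit SQL that checks the logical-replication prerequisites.
verification_sql() {
  local publication="$1" slot="$2"
  cat <<SQL
SHOW wal_level;
SELECT schemaname, tablename FROM pg_publication_tables WHERE pubname = '$publication';
SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots WHERE slot_name = '$slot';
SQL
}

# Hypothetical names; unquoted identifiers are stored in lowercase.
verification_sql my_publication my_slot
```

The output can be piped into psql with your own connection settings. The second query should list every table you expect to replicate, and the third should show one row with plugin pgoutput.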

Set up Datastream

Finally, create the stream. For the source connection, enter the address and port of the reverse proxy VM we just set up (not the Cloud SQL address), together with the names of the publication and replication slot configured above. Once the stream is running, we can test it: change some rows in the PostgreSQL table, wait a short while, and check that the changes appear in the corresponding BigQuery table.
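The "wait a short while" step can be scripted as a simple polling loop. This is a sketch under several assumptions: the bq CLI is installed and authenticated, the replicated table has an id column, and the BigQuery table name and row id shown here are hypothetical placeholders to replace with your own.

```shell
#!/bin/bash
# Hypothetical names: adjust to your project, dataset, table and test row.
BQ_TABLE="my-project.my_dataset.public_test"
ROW_ID=42

# Run a command up to $1 times, sleeping $2 seconds between attempts.
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Succeed once the test row is visible in BigQuery (needs the bq CLI).
row_arrived() {
  local count
  count=$(bq --quiet --format=csv query --use_legacy_sql=false \
    "SELECT COUNT(*) FROM \`$BQ_TABLE\` WHERE id = $ROW_ID" | tail -n 1)
  [[ "$count" =~ ^[0-9]+$ ]] && (( count > 0 ))
}

# After inserting the row in PostgreSQL, poll for up to 5 minutes:
#   wait_until 30 10 row_arrived && echo "row replicated"
```

Polling with a bounded number of attempts keeps the test from hanging forever if the stream is misconfigured.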