Copy data from Postgres DB to AWS S3 in one step

Nav_cfc :

I have a use-case to move selected data from Postgres to Amazon S3. This should happen in a single step. I am writing a Java program to get this done.

I have figured out a way to copy the data in two steps: I use the CopyManager class and its copyOut method to pull the data onto my local machine, and then I move the resulting file into S3 using Java.

Postgres code to get the data onto my local machine:

CopyManager copyManager = new CopyManager((BaseConnection) con);
try (FileWriter fileWriter = new FileWriter("file.csv")) {
    copyManager.copyOut("COPY (SELECT stmt) TO STDOUT WITH DELIMITER '\t' CSV HEADER", fileWriter);
}

AWS code to move the file from local to S3:

AmazonS3 conn = new AmazonS3Client(credentials);
conn.setEndpoint("xxx.com");
conn.putObject(bucket1.getName(), "request.json", new File("file.csv"));

I expect it to happen in one go instead of writing to a file and then moving the file to S3.

Adam Bethke :

If you're open to doing it in Python, here's an example that should work:

import boto
import boto.s3.key
import gzip
import psycopg2
import tempfile

# database connection setup
connection = psycopg2.connect('postgresql://scott:tiger@localhost/mydatabase')
connection.autocommit = True
cursor = connection.cursor()

# aws connection setup
s3_connection = boto.connect_s3('<aws access key>', '<aws secret key>')
bucket = s3_connection.get_bucket('<bucket>')

with tempfile.NamedTemporaryFile() as t:
    # stream the COPY output through gzip into the temporary file
    with gzip.GzipFile(t.name, mode='wb') as g:
        cursor.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format('<select_query>'), g)
    # upload the compressed file; the temp file is removed when the outer block exits
    key = boto.s3.key.Key(bucket, '<s3_key>')
    key.set_contents_from_filename(g.name)

This process makes use of the tempfile module in Python, which lets you create a file that is used and then removed within the process. The context manager (with tempfile...) simplifies the management of the file-writing process, so you don't have to delete the file manually. Depending on how you set up the tempfile, you can make the file accessible to, or never visible to, system users. Basically, you're streaming the output of the SELECT statement to STDOUT, and then writing that output to a tempfile. You're still beholden to your database for the SELECT statement in terms of memory management, speed and access.

The benefit is that you don't need to keep the entire file in memory while trying to transfer it to S3; the drawbacks are that you need enough disk space to temporarily store the file, and that it's obviously slower because you're writing to disk as opposed to doing the entire thing in memory.
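
If the disk write is the part you want to avoid, here's a minimal sketch of the same idea that gzips into an in-memory io.BytesIO buffer and hands it to boto's set_contents_from_file instead of going through a temp file. The placeholders are the same as above, and the trade-off flips: you now hold the entire compressed result in memory instead of on disk.

import gzip
import io

import boto
import psycopg2
from boto.s3.key import Key

# same assumed connection details as above -- swap in your own
connection = psycopg2.connect('postgresql://scott:tiger@localhost/mydatabase')
cursor = connection.cursor()
s3_connection = boto.connect_s3('<aws access key>', '<aws secret key>')
bucket = s3_connection.get_bucket('<bucket>')

# gzip the COPY output into an in-memory buffer instead of a temp file
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as g:
    cursor.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format('<select_query>'), g)

buf.seek(0)  # rewind so boto reads the buffer from the start
key = Key(bucket, '<s3_key>')
key.set_contents_from_file(buf)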

The other thing to note is that I kept in the step where Python compresses the file with gzip before uploading. I did that to save space on upload; it's especially useful if you're uploading a table with a lot of repetitive data.

As an aside: you should not use this as-is in an environment where you're open to SQL injection; there are better ways of generating the COPY command if that's a part of your use-case.
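
For example (just a sketch, assuming psycopg2 2.7+ and made-up table/column names), the psycopg2.sql module will quote identifiers for you instead of pasting them in with str.format:

from psycopg2 import sql

# build the COPY statement from driver-quoted identifiers rather than
# raw string formatting; 'my_table', 'id' and 'name' are placeholders
copy_stmt = sql.SQL("COPY (SELECT {columns} FROM {table}) TO STDOUT WITH CSV HEADER").format(
    columns=sql.SQL(', ').join([sql.Identifier('id'), sql.Identifier('name')]),
    table=sql.Identifier('my_table'),
)
cursor.copy_expert(copy_stmt.as_string(connection), g)

Since copy_expert doesn't take query parameters, literal values have to be composed into the statement the same way, via sql.Literal.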
