I'm reading a potentially massive CSV file from Google Cloud Storage using an HTTP client. Once I've got the CSV file, I need to upload it to another bucket and load the data into a BigQuery table. Unfortunately, the file I'm downloading is encoded in UTF-16, and BigQuery only supports UTF-8, so I need a way to convert the data from UTF-16 to UTF-8. I know that I could simply read the data from the HTTP response input stream as UTF-16, then write it to a new input stream as UTF-8, like so:
byte[] data = IOUtils.toByteArray(response.getEntity().getContent());
String csv = new String(data, StandardCharsets.UTF_16); // decode the UTF-16 bytes
ByteArrayInputStream inputStream = new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8)); // re-encode as UTF-8
However, given that the CSV file has no maximum size and could be really big, I'd like to avoid reading it all into memory if possible. The end product of this process also needs to be an InputStream, so as to not break the contract of the interface.
I've thought about using a BufferedReader to read the input stream one line at a time and convert the encoding, but I'm not sure that's any more memory-efficient once the result has been turned back into a new input stream.
Is there a memory-efficient way to convert the UTF-16 contents of an input stream to UTF-8?
Since you already use the Apache Commons IO library, this might be just what you're looking for:
InputStreamReader utf16Reader = new InputStreamReader(is, StandardCharsets.UTF_16);
ReaderInputStream utf8IS = new ReaderInputStream(utf16Reader, StandardCharsets.UTF_8);
This double-wraps is: first into a UTF-16-decoding reader, then into a UTF-8-encoding byte stream. Both wrappers work through small internal buffers, so the whole file is never held in memory at once.
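In case it helps, here's a minimal, runnable sketch of that double-wrapping. The ByteArrayInputStream is just a stand-in for your HTTP response body, and the final toByteArray drain exists only so the demo can verify the result; in your case you'd hand utf8IS straight to the bucket upload / BigQuery load:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.commons.io.IOUtils;
import org.apache.commons.io.input.ReaderInputStream;

public class Utf16ToUtf8Demo {
    public static void main(String[] args) throws Exception {
        // Stand-in for response.getEntity().getContent(): a stream of UTF-16 bytes.
        String csv = "id,name\n1,café\n";
        InputStream is = new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_16));

        // Decode UTF-16 bytes into chars, then re-encode those chars as UTF-8 bytes.
        // Both layers stream through fixed-size buffers, so memory use stays
        // constant no matter how large the underlying CSV is.
        InputStreamReader utf16Reader = new InputStreamReader(is, StandardCharsets.UTF_16);
        InputStream utf8IS = new ReaderInputStream(utf16Reader, StandardCharsets.UTF_8);

        // Drained here only to check the conversion; in real use, pass utf8IS on.
        byte[] converted = IOUtils.toByteArray(utf8IS);
        System.out.println(Arrays.equals(converted, csv.getBytes(StandardCharsets.UTF_8)));
    }
}
```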