Is there a memory-efficient way to convert input stream encoding?

Ben Green :

I'm reading a potentially massive CSV file from Google Cloud Storage using an HTTP client. Once I've got the CSV file, I need to upload it to another bucket and load the data into a BigQuery table. Unfortunately, the file I'm downloading is encoded in UTF-16, and BigQuery only supports UTF-8. I need a way to convert the data from UTF-16 to UTF-8. I know that I could simply read the data from the HTTP response input stream as UTF-16, then write it to a new input stream as UTF-8, like so:

// reads the entire response body into memory at once
byte[] data = IOUtils.toByteArray(response.getEntity().getContent());
String csv = new String(data, StandardCharsets.UTF_16);
ByteArrayInputStream inputStream = new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));

However, given that the CSV file has no maximum size and has the potential to be really big, I'd like to avoid reading it into memory if possible. I need the end product of this process to be an InputStream, so as not to break the contract of the interface.

I've thought about using a BufferedReader to read the input stream one line at a time and convert the encoding, but I'm not really sure that's any more efficient once the result has been collected into a new input stream.
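Roughly something like this (a sketch; it still ends up buffering the whole converted file in memory, which is exactly what I want to avoid):

BufferedReader reader = new BufferedReader(
        new InputStreamReader(response.getEntity().getContent(), StandardCharsets.UTF_16));
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
String line;
while ((line = reader.readLine()) != null) {
    // re-encode each line as UTF-8 and collect it
    buffer.write((line + "\n").getBytes(StandardCharsets.UTF_8));
}
InputStream inputStream = new ByteArrayInputStream(buffer.toByteArray());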

Is there any way to convert UTF-16 contents in an input stream to UTF-8 contents in a memory-efficient way?

maklas :

Since you already use the commons-io library, this might be just what you're looking for:

InputStreamReader utf16Reader = new InputStreamReader(is, StandardCharsets.UTF_16);
ReaderInputStream utf8IS = new ReaderInputStream(utf16Reader, StandardCharsets.UTF_8);

This double-wraps is: first into a UTF-16-decoding reader, then into a UTF-8-encoding byte stream. The re-encoding happens on the fly as the stream is read, so only a small internal buffer is ever held in memory, no matter how large the file is.
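For example, here's a minimal self-contained sketch (the CSV bytes are just a stand-in for your HTTP response stream):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.input.ReaderInputStream;

public class Utf16ToUtf8Demo {
    public static void main(String[] args) throws Exception {
        // stand-in for response.getEntity().getContent(): UTF-16-encoded bytes
        byte[] utf16Bytes = "id,name\n1,Böb\n".getBytes(StandardCharsets.UTF_16);
        InputStream is = new ByteArrayInputStream(utf16Bytes);

        // decode UTF-16 chars and re-encode them as UTF-8 lazily, as the stream is consumed
        InputStreamReader utf16Reader = new InputStreamReader(is, StandardCharsets.UTF_16);
        InputStream utf8IS = new ReaderInputStream(utf16Reader, StandardCharsets.UTF_8);

        // utf8IS can now be handed to whatever expects an InputStream
        System.out.println(IOUtils.toString(utf8IS, StandardCharsets.UTF_8));
    }
}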
