How do I decode supplementary Unicode characters from a byte array in Java?

rm-green :

I'm using an InputStream to read bytes from a TCP server (written in C#) into a byte[], and decoding them into a String using new String(byteArray, "UTF-16LE"). This handles characters in the Basic Multilingual Plane just fine, but does not handle supplementary characters.

I understand that bytes in C# are unsigned whereas Java bytes are signed, and that a character takes either one 16-bit code unit or, for supplementary characters, two (a surrogate pair).
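For reference, a supplementary character always occupies two UTF-16 code units; a quick self-contained check (using U+1D11E MUSICAL SYMBOL G CLEF as an arbitrary example):

```java
import java.nio.charset.StandardCharsets;

public class Surrogates {
    public static void main(String[] args) {
        // U+1D11E is outside the BMP, so it needs a surrogate pair
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(clef.getBytes(StandardCharsets.UTF_16LE).length); // 4 bytes on the wire
    }
}
```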

        ByteBuffer wrapped = ByteBuffer.wrap(dataBytes);
        wrapped.order(ByteOrder.LITTLE_ENDIAN);
        short noOfSites = wrapped.getShort();

        for(int i = 0; i < noOfSites; i++){
            short siteNo = wrapped.getShort();
            short textLength = wrapped.getShort();
            byte[] textBytes = new byte[textLength];
            wrapped.get(textBytes, 0, textLength);

            for(byte bite : textBytes){
                System.out.print(bite+" ");
            } //just to see what's in the byte array

            // "UTF_16LE" is not a valid charset name; use the constant (needs
            // import java.nio.charset.StandardCharsets) or the string "UTF-16LE"
            String siteText = new String(textBytes, StandardCharsets.UTF_16LE);
            System.out.println(siteNo + ": " + siteText);
            siteList.add(new Site(siteNo, siteText));
            publishProgress(siteNo + " - " + siteText);
        }

In this instance, dataBytes is the byte array containing the bytes read from the server, noOfSites is the number of objects to be read from the server, siteNo is an ID, textLength is the number of bytes containing the name of the site, and textBytes is the array that holds these bytes.

When receiving the word "MÜNSTER" from the server, the bytes read into the buffer are: 77 0 -3 -1 78 0 83 0 84 0 69 0 82 0. However, the "Ü" character is unrecognised, which I suppose is down to the -3 -1 pair that Java is trying (and failing) to decode. I understand that in C#, "Ü" is represented by DC-00, but I don't understand why this becomes -3 -1 in Java. Any help would be greatly appreciated.

jsbueno :

The "Ü" character is never reaching your side. The sequence arriving at your sink, "-3, -1", is the bytes 0xFD 0xFF, which in little-endian order is 0xFFFD: the UTF-16LE encoding of U+FFFD, the REPLACEMENT CHARACTER.
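You can confirm this mapping with a minimal sketch (nothing specific to your server, just the two bytes you printed):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCheck {
    public static void main(String[] args) {
        byte[] pair = { -3, -1 };  // signed Java view of 0xFD 0xFF
        String s = new String(pair, StandardCharsets.UTF_16LE);
        // little-endian 0xFD 0xFF decodes to U+FFFD REPLACEMENT CHARACTER
        System.out.println(s.charAt(0) == '\uFFFD');  // true
    }
}
```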

Without seeing the server-side code it is hard to tell exactly what is going on, but it is clearly wrong. UTF-16 handles characters like "Ü" without going out of its way: U+00DC is not even outside the first 256 Unicode code points, much less outside the Basic Multilingual Plane. (It is a character common in many Western languages, written in the Latin script; it could hardly fall outside a plane designed to hold the characters of most of the world's languages.)
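You can also verify how "Ü" should look on the wire, which ties in with your signed/unsigned observation: Java prints the 0xDC byte as -36 because Java bytes are signed, but it is the same "DC 00" you expected from C#:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UmlautBytes {
    public static void main(String[] args) {
        // "Ü" is U+00DC, so UTF-16LE encodes it as the two bytes 0xDC 0x00
        byte[] b = "Ü".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(b));  // [-36, 0] — signed view of 0xDC 0x00
    }
}
```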

What is happening is that somewhere along the path from the original text to the UTF-16 bytes meant for wire transfer, the data is being run through a decode or encode step with a replacement fallback, so any character outside plain ASCII (code points 0x00-0x7F, which cover only unaccented Latin letters) gets squashed to the replacement character.

To be clear: the data is being corrupted server-side, and every character that does not fit in ASCII will likely be squashed to the replacement character. No amount of fiddling with your client-side code can fix that.
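One plausible server-side mistake that reproduces your exact wire bytes — purely an assumption, shown in Java for illustration even though the server is C# — is holding the name as Latin-1 bytes, mis-decoding them as UTF-8 (0xDC is malformed UTF-8, so the decoder substitutes U+FFFD), and then re-encoding as UTF-16LE:

```java
import java.nio.charset.StandardCharsets;

public class CorruptionSketch {
    public static void main(String[] args) {
        // Hypothetical: "MÜNSTER" stored as Latin-1 → 4D DC 4E 53 54 45 52
        byte[] latin1 = "MÜNSTER".getBytes(StandardCharsets.ISO_8859_1);

        // Mis-decoded as UTF-8: the lone 0xDC byte is malformed, so the
        // default replacement behaviour yields "M" + U+FFFD + "NSTER"
        String mangled = new String(latin1, StandardCharsets.UTF_8);

        // Re-encoded as UTF-16LE, this produces exactly the observed bytes:
        // 77 0 -3 -1 78 0 83 0 84 0 69 0 82 0
        for (byte b : mangled.getBytes(StandardCharsets.UTF_16LE)) {
            System.out.print(b + " ");
        }
    }
}
```

Whatever the real mechanism is on your server, the fix belongs there: keep the text in one consistent encoding end to end instead of round-tripping it through a lossy decode.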
