Java string "hello" has 12 bytes when getBytes("UTF-16")?

Hind Forsum :

I expected that when a Java string is encoded as "UTF-16", each character uses 2 bytes, so "hello" should take 10 bytes. But this code:

    String h = "hello";
    System.out.println(new String(h.getBytes("UTF-16"), "UTF-16").length());
    System.out.println(new String(h.getBytes("UTF-8"), "UTF-8").getBytes("UTF-16").length);

Will print "5 12"

My question:

(1) I expected that the first println should get "10" as I mentioned. But why 5?

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

I'm using a Mac and my region is Hong Kong. Could you explain what is happening in the program, and how "5 12" actually came out?

Thanks a lot!

Stephen C :

(1) I expected that the first println should get "10" as I mentioned. But why 5?

You take a 5-character string and encode it as bytes using the UTF-16 encoding.
Then you create a new string by decoding those bytes (correctly) from UTF-16, which gives you a new string consisting of your original 5 characters again. String.length() counts char values (UTF-16 code units), not bytes, so the result is 5.
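As a rough illustration of the round trip described above, here is a minimal sketch. The class and method names (LengthVsBytes, roundTrip) are made up for this example, and it uses StandardCharsets instead of the charset name strings only to avoid the checked UnsupportedEncodingException:

```java
import java.nio.charset.StandardCharsets;

public class LengthVsBytes {
    // Hypothetical helper: encode to UTF-16 bytes, then decode those bytes back.
    static String roundTrip(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_16);
        return new String(encoded, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String h = "hello";
        String copy = roundTrip(h);
        System.out.println(copy.equals(h)); // true: same characters as the original
        System.out.println(copy.length());  // 5: length() counts chars, not bytes
    }
}
```

The byte array only exists in the middle of the expression; by the time length() is called, you are back to an ordinary 5-character String.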

(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.

This part of the code:

    new String(h.getBytes("UTF-8"), "UTF-8")

is actually a no-op. It is just a rather expensive way to copy a string. You encode the string to bytes using UTF-8 as the encoding scheme, and then you create a new string by decoding the UTF-8 encoded bytes.
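A small sketch of why the UTF-8 round trip changes nothing, assuming the string contains only well-formed text such as "hello". The names (NoOpCheck, viaUtf8) are invented for this example:

```java
import java.nio.charset.StandardCharsets;

public class NoOpCheck {
    // Hypothetical helper: the UTF-8 encode/decode round trip from the question.
    static String viaUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String h = "hello";
        String copy = viaUtf8(h);
        // The copy is a different object with the same contents.
        System.out.println(copy.equals(h)); // true
        // Encoding either string as UTF-16 therefore yields the same number of bytes.
        System.out.println(copy.getBytes(StandardCharsets.UTF_16).length
                == h.getBytes(StandardCharsets.UTF_16).length); // true
    }
}
```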

So effectively, you are doing this:

    "hello".getBytes("UTF-16").length

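The extra 2 bytes are a BOM (byte order mark). When encoding, Java's "UTF-16" charset writes a BOM (the bytes FE FF) as the first 2-byte code unit, followed by the 10 bytes for the five characters. The "UTF-16BE" and "UTF-16LE" charsets do not write a BOM.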
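To see the BOM directly, you could dump the first two bytes and compare against an encoding that does not write one. This is only a sketch; the class and helper names (BomDemo, utf16, utf16be) are invented for the example:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helpers wrapping the two encodings being compared.
    static byte[] utf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);   // writes a BOM first
    }

    static byte[] utf16be(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE); // no BOM
    }

    public static void main(String[] args) {
        byte[] withBom = utf16("hello");
        byte[] noBom = utf16be("hello");
        // Mask with 0xFF so the bytes print as unsigned hex values.
        System.out.printf("%02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF); // FE FF
        System.out.println(withBom.length); // 12 = 2 (BOM) + 5 * 2
        System.out.println(noBom.length);   // 10 = 5 * 2
    }
}
```

So if you want the 10 bytes you expected, encoding with "UTF-16BE" or "UTF-16LE" is one way to get them.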

For more information, read the Unicode FAQs on "UTF-8, UTF-16, UTF-32 & BOM".
