Java String.getBytes(charset) and new String(bytes, charset) with two different character sets

ParkCheolu :

As far as I know, in String.getBytes(charset), the charset argument means that the method returns the bytes of the string encoded in the given charset.

In new String(bytes, charset), the second argument, charset, means that the method decodes the bytes using the given charset and returns the decoded result.

From the above, as I understand it, the charset arguments of the two methods must be the same for new String(bytes, charset) to return a proper string. (I guess this is what I'm missing.)
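For example, I expect a round trip through a single charset to be lossless (a minimal sketch; UTF-8 is just one choice of charset here):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "테스트";
        // Encode and decode with the same charset: the round trip is
        // lossless, because UTF-8 can represent every character here.
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded)); // prints true
    }
}
```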

I have an incorrectly decoded string, and I tested the following code with it:

import java.io.UnsupportedEncodingException;

String originalStr = "Å×½ºÆ®"; // 테스트
String[] charSet = {"utf-8", "euc-kr", "ksc5601", "iso-8859-1", "x-windows-949"};

for (int i = 0; i < charSet.length; i++) {
    for (int j = 0; j < charSet.length; j++) {
        try {
            System.out.println("[" + charSet[i] + "," + charSet[j] + "] = "
                    + new String(originalStr.getBytes(charSet[i]), charSet[j]));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

The output is:

[utf-8,utf-8] = Å×½ºÆ®
[utf-8,euc-kr] = ��쩍쨘�짰
[utf-8,ksc5601] = ��쩍쨘�짰
[utf-8,iso-8859-1] = Å×½ºÆ®
[utf-8,x-windows-949] = 횇횞쩍쨘횈짰
[euc-kr,utf-8] = ?����������
[euc-kr,euc-kr] = ?×½ºÆ®
[euc-kr,ksc5601] = ?×½ºÆ®
[euc-kr,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[euc-kr,x-windows-949] = ?×½ºÆ®
[ksc5601,utf-8] = ?����������
[ksc5601,euc-kr] = ?×½ºÆ®
[ksc5601,ksc5601] = ?×½ºÆ®
[ksc5601,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[ksc5601,x-windows-949] = ?×½ºÆ®
[iso-8859-1,utf-8] = �׽�Ʈ
[iso-8859-1,euc-kr] = 테스트
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,iso-8859-1] = Å×½ºÆ®
[iso-8859-1,x-windows-949] = 테스트
[x-windows-949,utf-8] = ?����������
[x-windows-949,euc-kr] = ?×½ºÆ®
[x-windows-949,ksc5601] = ?×½ºÆ®
[x-windows-949,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[x-windows-949,x-windows-949] = ?×½ºÆ®

As you can see, I found a way to get the original string back:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트  
[iso-8859-1,x-windows-949] = 테스트 

How is this possible? How can the string be encoded and decoded properly with different character sets?

Holger :

From the above, as I understand it, the charset arguments of the two methods must be the same for new String(bytes, charset) to return a proper string.

That’s what you should aim for, to write correct code. But this does not imply that every wrong operation will always produce a wrong result. A simple example is a string consisting of ASCII letters only: a lot of encodings produce the same byte sequence for such a string, so a test using only such a string is not sufficient to spot encoding-related errors.
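To illustrate (a quick sketch, using the same charset names as the question's test):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSafe {
    public static void main(String[] args) throws Exception {
        String ascii = "hello";
        byte[] utf8 = ascii.getBytes(StandardCharsets.UTF_8);
        // All of these charsets encode plain ASCII identically,
        // so a mismatched encode/decode pair goes unnoticed.
        for (String cs : new String[] {"euc-kr", "iso-8859-1", "x-windows-949"}) {
            System.out.println(cs + ": " + Arrays.equals(utf8, ascii.getBytes(cs)));
        }
    }
}
```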

As you can see, I found a way to get the original string back:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트  
[iso-8859-1,x-windows-949] = 테스트 

How is this possible? How can the string be encoded and decoded properly with different character sets?

Well, when I execute

System.out.println(Charset.forName("euc-kr") == Charset.forName("ksc5601"));

on my machine, it prints true. Or, if I execute

System.out.println(Charset.forName("euc-kr").aliases());

it prints

[ksc5601-1987, csEUCKR, ksc5601_1987, ksc5601, 5601, euc_kr, ksc_5601, ks_c_5601-1987, euckr]

So for euc-kr and ksc5601 the answer is simple: these are two different names for the same character encoding.
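So encoding under one name and decoding under the other is effectively the same operation (a sketch; it assumes your JDK maps both names to the same charset, as the alias list above shows):

```java
import java.util.Arrays;

public class SameCharset {
    public static void main(String[] args) throws Exception {
        String korean = "테스트";
        // Same charset under two names: identical bytes either way.
        byte[] a = korean.getBytes("euc-kr");
        byte[] b = korean.getBytes("ksc5601");
        System.out.println(Arrays.equals(a, b));
        // Decoding with the "other" name is therefore not a mismatch at all.
        System.out.println(new String(a, "ksc5601"));
    }
}
```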

For x-windows-949, I have to resort to Wikipedia:

Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).

So it is an extension of ksc5601, which will lead to the same result as long as you’re not using any characters affected by the extension (think of the ASCII example above).
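A character from the extension shows where the two encodings diverge (a sketch; 뷁 is a Hangul syllable outside the original KS X 1001 repertoire, and the expected output assumes the JDK's default behavior of substituting '?' for unmappable characters):

```java
import java.util.Arrays;

public class UhcExtension {
    public static void main(String[] args) throws Exception {
        String syllable = "뷁"; // in UHC/windows-949, but not in plain EUC-KR
        byte[] euckr = syllable.getBytes("euc-kr");      // unmappable: replaced
        byte[] uhc = syllable.getBytes("x-windows-949"); // two-byte UHC code
        System.out.println(Arrays.toString(euckr));
        System.out.println(Arrays.toString(uhc));
        System.out.println(Arrays.equals(euckr, uhc)); // prints false
    }
}
```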

Generally, this does not invalidate your premise: correct results are only guaranteed when using the same encoding on both sides. It just means that testing such code is much harder, as it requires sufficient test input data to spot errors. For example, a common error in the Western world is to confuse ISO-Latin-1 (ISO 8859-1) with Windows codepage 1252, which may not get spotted with simple text.
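That particular confusion can be made visible with a character where the two tables differ, such as the Euro sign (a sketch; windows-1252 assigns byte 0x80 to '€', while ISO 8859-1 maps that byte to a control character):

```java
public class Latin1Confusion {
    public static void main(String[] args) throws Exception {
        String euro = "€";
        byte[] cp1252 = euro.getBytes("windows-1252"); // single byte 0x80
        // The mismatch is silent: we get U+0080 (a control char), not '€'.
        String wrong = new String(cp1252, "iso-8859-1");
        System.out.println(euro.equals(wrong));    // prints false
        System.out.println((int) wrong.charAt(0)); // prints 128
        // Plain ASCII text hides the bug completely:
        byte[] ascii = "plain text".getBytes("windows-1252");
        System.out.println(new String(ascii, "iso-8859-1")); // prints "plain text"
    }
}
```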
