Java's inconsistent behavior when handling a UTF-8 string with a BOM

ZhaoGang :

I open Windows Notepad, enter 18, and save the file with UTF-8 encoding. I know that my file will have a BOM header, so it is a UTF-8 encoded file (with a BOM header).

The problem is that, when printing the string with the code below:

// str is the string read from the file using StandardCharsets.UTF_8
System.out.println(str);
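
For context, the string could have been read like this (a minimal sketch; the file name test.txt and the use of Files.readString are assumptions, not from the question):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomDemo {
    public static void main(String[] args) throws Exception {
        // Reading with UTF-8 does not strip the BOM: it stays in the
        // string as the character U+FEFF at position 0
        String str = Files.readString(Path.of("test.txt"), StandardCharsets.UTF_8);
        System.out.println(str);
        System.out.println(str.startsWith("\uFEFF")); // true
    }
}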

On Windows I got:

?18

But on Linux I got:

18

So why is the behavior of Java different? How should I understand it?

Joop Eggen :

A BOM (U+FEFF) is a zero-width no-break space, so in principle it is invisible.

However, the Windows console does not use UTF-8 but one of the many single-byte encodings. The conversion from String to output bytes turns the BOM, which is missing from that charset, into a question mark.
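
That replacement can be reproduced directly (a minimal sketch; Cp1252 here merely stands in for whatever single-byte charset the console actually uses):

import java.nio.charset.Charset;

public class BomToQuestionMark {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("Cp1252");
        String s = "\uFEFF18"; // BOM followed by the file content
        // getBytes replaces unmappable characters with the charset's
        // default replacement byte, which for Cp1252 is '?'
        byte[] encoded = s.getBytes(cp1252);
        System.out.println(new String(encoded, cp1252)); // prints ?18
    }
}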

Still, Notepad will recognize the BOM and display the UTF-8 text correctly.

Linux nowadays generally uses UTF-8, including in the console, so it has no such problem.


Further explanation

On Windows, System.out writes to the console, and that console uses a single-byte charset/encoding such as Cp850, covering only some 256 characters. Characters like ĉ or the BOM char may very well be missing from it. If a Java String contains such characters, they cannot be encoded to one of the 256 available chars, and hence they are converted to a ?.
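
To see which charset your JVM picked as the default (a minimal sketch; note that on Windows the default charset may differ from the console's code page):

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // Typically windows-1252 on Windows, UTF-8 on modern Linux
        System.out.println(Charset.defaultCharset());
    }
}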

Using a CharsetEncoder you can check this beforehand:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

String s = ...; // the string about to be printed
// Check whether the default (platform) charset can represent every char
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
    System.out.println("A problem");
}

Windows itself also generally runs on a single-byte encoding, such as Cp1252, again limited to 256 chars. Editors, however, may deal with several encodings, and if the font can represent the character (Unicode code point), then everything works.
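
If the BOM itself is the only offender, a common workaround is to strip it after reading (a minimal sketch, assuming str was read as UTF-8 as above):

if (str.startsWith("\uFEFF")) {
    str = str.substring(1); // drop the leading BOM before printing
}
System.out.println(str); // now prints 18 on both platforms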
