I open Windows Notepad, type 18, and save the file with UTF-8 encoding. I know the file will start with a BOM, so it is a UTF-8 encoded file (with a BOM header).
The problem is that when I print that string with the code below:
//str is that string read from the file using StandardCharsets.UTF_8 encoding
System.out.println(str);
On Windows I get:
?18
But on Linux I get:
18
So why does Java behave differently on the two systems? How should I understand this?
A BOM is the character U+FEFF (a zero-width no-break space), so in principle it is invisible.
However, the Windows console traditionally does not use UTF-8 but one of the many single-byte encodings. When the String is converted for output, the BOM, which does not exist in that charset, is turned into a question mark.
Still, Notepad recognizes the BOM and displays the UTF-8 text correctly.
Linux nowadays generally uses UTF-8 throughout, so it has no such problem, not even in the console.
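If only the leading BOM is the problem, the simplest fix is to strip it from the string after reading. A minimal sketch (the class and method names are my own, and the sample string stands in for the text read from the file):

```java
public class StripBom {
    // U+FEFF is the byte order mark as it appears at the start of the decoded String
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // Stand-in for the text read from the file with StandardCharsets.UTF_8
        String str = "\uFEFF18";
        System.out.println(stripBom(str)); // prints 18 on both Windows and Linux
    }
}
```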
Further explanation
On Windows, System.out writes to the console, and that console uses a charset/encoding such as Cp850, a single-byte charset of some 256 characters. Characters like ĉ or the BOM may well be missing from it. If a Java String contains such characters, they cannot be encoded to one of the 256 available characters, and hence they are converted to a ?.
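This replacement can be reproduced without a Windows console by encoding to any single-byte charset; here ISO-8859-1 serves as a stand-in for the console's Cp850, and the sample string is hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class BomBecomesQuestionMark {
    public static void main(String[] args) {
        String str = "\uFEFF18"; // decoded file contents, BOM included
        // getBytes replaces characters the charset cannot map with '?'
        byte[] bytes = str.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // prints ?18
    }
}
```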
You can check in advance with a CharsetEncoder whether the default charset can encode a string:
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

String s = ...; // the text read from the file
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
    System.out.println("A problem");
}
Windows itself also generally runs on a single-byte encoding, such as Cp1252, again 256 characters. Editors, however, may handle several encodings, and as long as the font can represent the character (Unicode code point), everything works.
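To see exactly which characters a single-byte charset would drop, you can test each code point individually. A sketch, again using ISO-8859-1 as a stand-in for the console charset; the sample string with the BOM and ĉ is my own:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class FindUnencodable {
    public static void main(String[] args) {
        String s = "\uFEFF18ĉ";
        // ISO-8859-1 is a single-byte charset: it covers only U+0000..U+00FF
        CharsetEncoder enc = Charset.forName("ISO-8859-1").newEncoder();
        s.chars()
         .filter(c -> !enc.canEncode((char) c))
         .forEach(c -> System.out.printf("U+%04X cannot be encoded%n", c));
        // Reports the BOM (U+FEFF) and ĉ (U+0109); the digits pass
    }
}
```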