The causes of garbled characters in XML, and how encodings work, are explained very clearly in this blog. To summarize how the encoding is selected when an XML file is read:
- If the file has a BOM, the BOM determines the file encoding, and that encoding is used;
- If there is no BOM, check the encoding attribute declared in the file header. If it is declared as
  <?xml version="1.0" encoding="GB2312"?>
  then GB2312 encoding is used;
- If neither of the above is present, UTF-8 encoding is used by default.
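The selection order above can be sketched in a few lines of Python. This is only an illustration, not how any particular XML parser is implemented: the function name guess_xml_encoding is hypothetical, and the regular expression assumes that a file without a BOM is ASCII-compatible at least up to the end of its declaration.

```python
import codecs
import re

# BOM signatures. UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: return a Python codec name for raw XML bytes."""
    # 1. A BOM, if present, decides the encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise use the encoding declared in the XML header.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 3. Otherwise fall back to UTF-8.
    return "utf-8"
```

For example, data.decode(guess_xml_encoding(data)) decodes a GB2312 file whose header declares encoding="GB2312". The codec names utf-8-sig, utf-16 and utf-32 are chosen so that Python also strips the BOM while decoding.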
Two points are worth adding:
- How to view a file's encoding on Linux: run
  file -i config.xml
  which prints, for example,
  config.xml: application/xml; charset=utf-8
  meaning that the encoding of this file is detected as UTF-8.
- Why can ISO-8859-1 encoded files be read correctly as GBK or GB2312? GBK and GB2312 were designed to represent Chinese characters, which they encode as two bytes each, while the English alphabet (the ASCII range) is encoded with the same single bytes as in ISO-8859-1. A file that contains only such characters therefore decodes identically under all three encodings; the compatibility does not extend to ISO-8859-1 bytes above 0x7F, such as accented letters. GBK can represent both traditional and simplified characters, while GB2312 can only represent simplified characters, and GBK is compatible with GB2312.
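The relationship between these encodings can be checked with a small Python sketch. The helper names decodes_same and can_encode are hypothetical and exist only for this illustration; the codec names are the ones Python ships with.

```python
def decodes_same(raw: bytes, *encodings: str) -> bool:
    """Return True if raw decodes without error to the same text under every encoding."""
    try:
        results = {raw.decode(enc) for enc in encodings}
    except UnicodeDecodeError:
        return False
    return len(results) == 1


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Pure ASCII content is byte-for-byte identical in all three encodings.
ascii_only = b"<name>config</name>"
print(decodes_same(ascii_only, "iso-8859-1", "gbk", "gb2312"))

# An accented Latin letter is a single high byte in ISO-8859-1,
# which GBK treats as the first half of a two-byte character.
accented = "\u00e9".encode("iso-8859-1")
print(decodes_same(accented, "iso-8859-1", "gbk"))

# A simplified character has the same two bytes in GB2312 and GBK,
# while a traditional character exists only in GBK.
print("\u9f99".encode("gb2312") == "\u9f99".encode("gbk"))
print(can_encode("\u9f8d", "gbk"), can_encode("\u9f8d", "gb2312"))
```

Here \u9f99 is the simplified character for "dragon" and \u9f8d is its traditional form; they are written as escapes so the example does not depend on the encoding of the source file itself.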