Analysis of garbled characters in XML

The problem of garbled characters in XML, and the encoding background behind it, has been explained very clearly in this blog. To summarize how an XML reader selects the encoding:

  • If the file has a BOM, the BOM determines the file encoding, and that encoding is used;
  • If there is no BOM, check the encoding attribute declared in the file header. For example, if the header is <?xml version="1.0" encoding="GB2312"?>, GB2312 encoding is used;
  • If neither of the above is present, UTF-8 encoding is used by default.
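
The three rules above can be sketched in Python. This is a simplified illustration, not the algorithm of any particular parser: the function name guess_xml_encoding and the regular expression are assumptions made for the example, and it only handles declarations written in an ASCII-compatible encoding.

```python
import codecs
import re

# Known BOMs. UTF-32 LE must be tested before UTF-16 LE because
# its BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding attribute of an XML declaration at the start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: pick an encoding for raw XML bytes."""
    # Rule 1: a BOM decides the encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Rule 2: no BOM, so use the encoding declared in the header.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Rule 3: neither is present, so fall back to UTF-8.
    return "utf-8"


raw = b'<?xml version="1.0" encoding="GB2312"?><a/>'
print(guess_xml_encoding(raw))  # GB2312
```

If the declared encoding does not match the bytes actually stored in the file, decoding with the returned name is exactly what produces garbled characters.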

A few points worth adding:

  • How to check a file's encoding on Linux: run file -i config.xml. A result such as config.xml: application/xml; charset=utf-8 indicates that the file is encoded as UTF-8. Note that file infers the charset from the bytes in the file, so the result is a best guess.
  • Why can a file reported as ISO-8859-1 be read correctly with GBK or GB2312? GBK and GB2312 are designed for Chinese text and use two bytes per Chinese character, but they encode ASCII characters (English letters, digits, common punctuation) as the same single bytes that ISO-8859-1 uses. A file containing only such characters therefore decodes identically under all three encodings; bytes above 0x7F do not. As for the two Chinese encodings, GBK can represent both traditional and simplified characters, while GB2312 covers only simplified characters, and GBK is backward compatible with GB2312.
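
The compatibility claims in the last point can be checked with Python's built-in codecs. This is a small sketch; the helper names decodes_identically and can_encode and the sample strings are made up for the illustration.

```python
ASCII_SAMPLE = b'<?xml version="1.0"?><name>config</name>'


def decodes_identically(data: bytes, encodings) -> bool:
    """True if every listed encoding decodes the bytes to the same text."""
    return len({data.decode(enc) for enc in encodings}) == 1


def can_encode(text: str, encoding: str) -> bool:
    """True if the text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ASCII-only content reads the same under all three encodings.
print(decodes_identically(ASCII_SAMPLE, ["iso-8859-1", "gb2312", "gbk"]))  # True

# Chinese text does not: ISO-8859-1 maps each byte to one character,
# so the result is garbled instead of raising an error.
chinese = "汉字".encode("gb2312")
print(decodes_identically(chinese, ["iso-8859-1", "gbk"]))  # False

# GBK is compatible with GB2312: simplified characters get the same bytes.
print(chinese == "汉字".encode("gbk"))  # True

# A traditional character is representable in GBK but not in GB2312.
print(can_encode("漢", "gbk"), can_encode("漢", "gb2312"))  # True False
```

Decoding with ISO-8859-1 never fails, because every byte value maps to some character. A wrong guess therefore shows up as garbled text rather than as an exception.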

