Python + request + unittest learning (1): Causes of and solutions for garbled text when reading files (the UTF-8 BOM problem)

Phenomenon

When reading text, there will often be a series of errors.
Example 1: The county at the beginning, in fact, the text at the beginning is h, http is shown as the county at ttp
Example 2: The seam at the beginning, in fact, the text at the beginning is p, the public is shown as the seam in the ulic,
as long as the first letter of the text is a city This kind of error will be encountered during the use of Python, Java, PHP, etc. This type of error has nothing to do with the language. The cause of the error is UTF-8 BOM.
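
The garbled prefix can be reproduced in a few lines. This is a minimal sketch, assuming the reader decodes the bytes as GBK (a common default on Chinese-locale Windows); the sample text `http` comes from Example 1, and the variable names are only illustrative.

```python
import codecs

# Bytes of a UTF-8 file saved with a BOM: EF BB BF followed by the text.
raw = codecs.BOM_UTF8 + b"http"

# Assumption: the reader decodes with GBK instead of UTF-8.
# GBK reads EF BB as one two-byte character, then BF together with the
# byte for "h" as a second one, so the "h" disappears from the output.
garbled = raw.decode("gbk")

print(garbled)        # two unexpected Chinese characters followed by "ttp"
print(len(garbled))   # 5
```

Decoding the same bytes as UTF-8 does not produce Chinese characters, but it still leaves an invisible U+FEFF in front of the text, so the first character is not `h` either.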

Cause

BOM stands for Byte Order Mark. In a UTF-8 document it acts as a Unicode signature: the three bytes EF BB BF at the very start of the file. When an editor saves a file as UTF-8 with BOM, it automatically writes these three bytes at the head of the file; when UTF-8 without BOM is selected, the three bytes are left out.
The BOM is optional and can be used to detect whether a byte stream is UTF-8 encoded. Microsoft software performs this check, but some software does not and treats the three bytes as ordinary characters, which is where the garbled text comes from.
Microsoft tools prepend EF BB BF to the UTF-8 text files they save, and programs such as Notepad on Windows rely on these three bytes to decide whether a text file is ASCII or UTF-8. This is mostly a Microsoft convention: the Unicode standard permits the signature in UTF-8 but neither requires nor recommends it, and UTF-8 text files on other platforms usually do not carry it.
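
The signature check described above can be sketched as follows. The helper name `has_utf8_bom` is hypothetical, not a standard function; `codecs.BOM_UTF8` is the standard-library constant for the bytes EF BB BF. The example also shows how Python's own codecs behave: plain `utf-8` keeps the BOM as the character U+FEFF, while `utf-8-sig` skips it.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

with_bom = codecs.BOM_UTF8 + "public".encode("utf-8")
without_bom = "public".encode("utf-8")

print(has_utf8_bom(with_bom))      # True
print(has_utf8_bom(without_bom))   # False

# The plain utf-8 codec treats the BOM as an ordinary character (U+FEFF).
print(repr(with_bom.decode("utf-8")))      # '\ufeffpublic'
# The utf-8-sig codec skips a leading BOM if one is present.
print(repr(with_bom.decode("utf-8-sig")))  # 'public'
```

Because `utf-8-sig` also accepts input that has no BOM, it is a reasonable choice when it is unknown whether a file carries the signature.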

Solution

1. Prefer text editors such as Notepad++, Sublime Text, or EditPlus, which do not add a BOM by default.
2. Use an editor with a binary (hex) mode, such as UltraEdit, to delete the three BOM bytes.
3. Reopen the document in one of the editors from item 1 and save it again as UTF-8 without BOM.
4. Save the file as ASCII. This only works when the content is pure ASCII, so it is impractical for Chinese text.
5. Remove the BOM with Python:

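import codecs

# Read raw bytes so the BOM can be compared byte for byte
with open("Test.txt", "rb") as f:
    data = f.read()
# Drop the three-byte UTF-8 BOM (EF BB BF) if it is present
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
print(data.decode("utf-8"))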
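
Items 3 and 5 can be combined into a small script that rewrites a file as UTF-8 without BOM. This is only a sketch: the function name `strip_bom_in_place` and the demo file content are hypothetical, and it assumes the file is small enough to be read into memory at once.

```python
import codecs
import os
import tempfile

def strip_bom_in_place(path: str) -> bool:
    """Rewrite the file without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True

# Demo on a throwaway file with made-up content.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "http://example.com".encode("utf-8"))

removed = strip_bom_in_place(demo_path)
with open(demo_path, encoding="utf-8") as f:
    text = f.read()
os.remove(demo_path)

print(removed)  # True
print(text)     # http://example.com
```

If the file only needs to be read and not modified, opening it with `open(path, encoding="utf-8-sig")` skips a leading BOM without touching the file on disk.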

Original link: https://blog.csdn.net/mighty13/java/article/details/78077867

Origin www.cnblogs.com/zhaocbbb/p/12676366.html