Encoding and decoding between Unicode and bytes in Python 3 (reprint)

Today I was playing with a Python web crawler. I downloaded a web page and tried to write its entire content to a txt file, and ran into these errors:

TypeError: write() argument must be str, not bytes

AttributeError: 'URLError' object has no attribute 'code'

UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 5747: illegal multibyte sequence

At first glance these are encoding problems, which I did not understand. Baidu turned up little on the subject, and what it did say was unclear, so I studied it for an evening and worked out the basics.

Let's begin at the beginning. Because countries use different languages, many different encodings were originally invented to represent text in a computer (for example, gb2312 for Chinese). This caused compatibility problems, so Unicode was created as a single character set covering all of them. The str type in Python 3 holds Unicode text, which is why a Python 3 string can mix many languages and symbols without turning into garbled characters.
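As a minimal sketch of the compatibility problem, the snippet below encodes the same two-character string with two different encodings and then tries to decode one result with the wrong codec. The sample text and the variable names are chosen only for illustration.

```python
text = '你好'

utf8_bytes = text.encode('utf-8')    # three bytes per character for this text
gb_bytes = text.encode('gb2312')     # two bytes per character for this text

print(len(utf8_bytes), len(gb_bytes))

# Decoding with the codec that produced the bytes recovers the text.
assert gb_bytes.decode('gb2312') == text

# Decoding with a different codec fails here (in other cases it may
# silently produce garbled characters instead).
try:
    wrong = gb_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    wrong = None
    print('cannot decode gb2312 bytes as utf-8:', exc.reason)
```

The same abstract text has two unrelated byte representations, so bytes are only meaningful together with the name of the encoding that produced them.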

Encoding converts abstract characters (Unicode) into a binary form (bytes) according to a particular scheme; in Python 3 this is encode. Decoding goes the other way: it converts binary data back into Unicode according to a particular scheme; in Python 3 this is decode.

[Figure: the core of encoding. A str is turned into bytes with encode, and bytes are turned back into a str with decode.]

First, character encoding

In Python, a bytes literal is written with a 'b' prefix; either single or double quotes may be used.

The code below illustrates the encode/decode flow described above:

s = '你好'

print(type(s))  # output: <class 'str'>
s = s.encode('utf-8')
print(s)  # output: b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(type(s))  # output: <class 'bytes'>
s = s.decode('utf-8')
print(s)  # output: 你好
print(type(s))  # output: <class 'str'>

One more thing: calling decode on a str raises an error, and likewise calling encode on a bytes object raises an error.
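A small sketch of that point: in Python 3, str simply has no decode method and bytes has no encode method, so the failure is an AttributeError. The variable names are illustrative.

```python
text = 'abc'
data = b'abc'

# str has encode() but no decode(); bytes has decode() but no encode().
print(hasattr(text, 'decode'))   # False
print(hasattr(data, 'encode'))   # False

errors = []
for obj, method in ((text, 'decode'), (data, 'encode')):
    try:
        getattr(obj, method)('utf-8')
    except AttributeError as exc:
        errors.append(type(exc).__name__)
        print(exc)
```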

Second, file encoding

In Python 3, characters are stored as Unicode. "Stored" here means stored in memory; on disk, Python 3 stores characters as bytes. In other words, to write characters to disk you must encode them. To spell this out for writing a str to a file: if the file is opened in 'w' mode, the content written must be of type str; if it is opened in 'wb' mode, the content written must be of type bytes. The errors at the beginning of this article appeared because the write mode and the data type of the content did not match.

s1 = '你好'
# In 'w' mode, pass an encoding so the str is encoded when written;
# otherwise the platform default (e.g. gbk) is used and may raise an error
with open('F:\\1.txt', 'w', encoding='utf-8') as f1:
    f1.write(s1)
s2 = s1.encode('utf-8')  # convert to bytes
# Now the mode must be 'wb', and the encoding parameter must not be given
with open('F:\\2.txt', 'wb') as f2:
    f2.write(s2)
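The mismatch errors quoted at the top of the article can be reproduced with a short sketch like the one below. It writes to a temporary directory rather than to F:\, and the sample text and variable names are assumptions made for illustration.

```python
import os
import tempfile

text = 'price:\xa05'          # '\xa0' is a no-break space, absent from gbk
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 1) Text mode ('w') rejects bytes.
try:
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text.encode('utf-8'))
except TypeError as exc:
    type_error = str(exc)

# 2) Text mode with an encoding that lacks the character fails.
try:
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)
except UnicodeEncodeError as exc:
    encode_error = exc.encoding

# 3) Matching mode and data type works either way.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

print(type_error)
print(encode_error)
```

Case 2 is the likely source of the 'gbk' error on a Chinese-locale Windows machine: when no encoding is passed to open, the platform default is used.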

Some people will ask: 2.txt was written as bytes, yet when I open it with the system's text editor it shows '你好' rather than b'\xe4\xbd\xa0\xe5\xa5\xbd'. Why? Because the text editor decodes the file when it opens it, and only then displays the result.
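What the text editor does can be imitated in Python by reading the same file back in two ways. This is a sketch that uses a temporary file in place of F:\2.txt; the variable names are illustrative.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), '2.txt')
with open(path, 'wb') as f:
    f.write('你好'.encode('utf-8'))

# Binary mode returns the raw bytes stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode decodes those bytes, which is what a text editor does.
with open(path, 'r', encoding='utf-8') as f:
    shown = f.read()

print(raw)
print(len(shown))
```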

Third, web page encoding

Web page encoding works much like file encoding. If a page downloaded with urlopen is read with read() and then decoded with decode('utf-8'), it must be written to the file in 'w' mode. If it is only read with read() and not decoded with decode('utf-8'), it must be written in 'wb' mode.

Writing in 'w' mode:

from urllib.request import urlopen

response = urlopen('http://blog.csdn.net/gs_zhaoyang/article/details/13768925', timeout=5)
# Decode as UTF-8 here; the decoded data is stored in html as a Unicode str
html = response.read().decode('utf-8')
print(type(html))  # output: <class 'str'>
# The file must be opened with an encoding here, so that the str is
# encoded as UTF-8 binary data when it is written
with open('F:\\DownloadAppData\\html.txt', 'w', encoding='utf-8') as f:
    f.write(html)

Writing in 'wb' mode:

from urllib.request import urlopen

response = urlopen('http://blog.csdn.net/gs_zhaoyang/article/details/13768925', timeout=5)
html = response.read()  # no decoding needed here; the downloaded data is already bytes
print(type(html))  # output: <class 'bytes'>
with open('F:\\DownloadAppData\\html.txt', 'wb') as f:
    f.write(html)

In Python 3, if you want to search a page downloaded with urlopen for characters, you will almost certainly have to decode it first, for example before handing it to lxml.etree.
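Decoding assumes you know which encoding the page uses, and the article simply assumes UTF-8. As a hedged sketch, a hypothetical helper such as decode_page below could try a short list of candidate encodings in turn. The helper name, the candidate list, and the sample markup are all assumptions for illustration; real pages usually declare their charset in the Content-Type header or a meta tag, which is the more reliable source.

```python
def decode_page(raw, encodings=('utf-8', 'gbk')):
    """Try each candidate encoding in order and return (text, encoding)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes with U+FFFD instead of failing.
    return raw.decode(encodings[0], errors='replace'), encodings[0]


# Stand-ins for response.read(): the same markup in two encodings.
utf8_page = '<title>你好</title>'.encode('utf-8')
gbk_page = '<title>你好</title>'.encode('gbk')

text1, used1 = decode_page(utf8_page)
text2, used2 = decode_page(gbk_page)

print(used1, used2)
print('<title>' in text1, text1 == text2)
```

Once the page is a str, ordinary string searches such as the `in` test above work on characters rather than on raw bytes.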
----------------
Copyright notice: This is an original article by CSDN blogger "Austrian Chen _", released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/chb4715/article/details/79104299

Origin www.cnblogs.com/sidianok/p/11769019.html