WeChat group talk: Python character encoding in plain terms

Python was created by Guido van Rossum in 1989. It is one of the world's most popular programming languages, and its ecosystem makes it a language that is "useful once learned, usable once learned, and usable for a long time."

For this reason CSDN, the largest Chinese IT community, set up a Python study class for Python fans, to help people avoid detours and learn more efficiently. On March 21 at 20:00, we invited the well-known technical expert Liu Zhijun to give a talk in the class.

Liu Zhijun has 6 years of development experience and previously worked at ZTE and Liberal Arts Interactive. He specializes in Web architecture and has a strong interest in web crawlers and data mining. He currently works on data analysis at a large pharmaceutical group. WeChat official account: Python Zen (vttalk)

The following is what was shared last night:
Why talk about this subject?
It is said that everyone who does Python development has been confused by character encoding at some point. The most common errors are UnicodeEncodeError and UnicodeDecodeError. You seem to know how to fix them, but then the same error turns up somewhere else, and the problem keeps coming back. The encode and decode methods for converting between str and unicode are also hard to remember and easy to mix up. So where exactly does the problem lie?

To clear this up, I decided to work through it in plain terms, starting from how Python strings are built and from the details of character encoding.
Bytes and characters
All data stored in a computer, whether text, images, video, audio, or software, is ultimately a sequence of bytes made of 0s and 1s. One byte equals 8 bits.

A character is a symbol: a Chinese character, a letter, a digit, or a punctuation mark can each be called a character.

Bytes are convenient for storage and network transmission, while characters are meant for display and reading. For example, the character "p" is stored on disk as the binary data 01110000, which is one byte long.
Encoding and decoding
The characters we see when we open a file in a text editor are ultimately stored on disk as a binary sequence of bytes. The conversion from characters to bytes is called encoding (encode), and the reverse is called decoding (decode); the two are inverse processes. Encoding serves storage and transmission, while decoding serves display and reading.

For example, the character "p" is encoded and stored on disk as the binary sequence 01110000, one byte long. The character "禅" (Zen) may be stored as "11100111 10100110 10000101", occupying 3 bytes. Why only "may"? More on that later.
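The byte patterns quoted above can be checked with a few lines of code. This is a minimal sketch in Python 3, where str holds characters and bytes holds bytes; the helper name to_bits is made up for illustration, and UTF-8 is assumed as the encoding.

```python
def to_bits(text, encoding):
    """Encode text and render each resulting byte as 8 binary digits."""
    encoded = text.encode(encoding)  # characters -> bytes
    return " ".join(format(byte, "08b") for byte in encoded)


p_bits = to_bits("p", "utf-8")      # a single byte
zen_bits = to_bits("禅", "utf-8")   # three bytes under UTF-8
print(p_bits)
print(zen_bits)

# Decoding with the same encoding recovers the original character.
round_trip = "禅".encode("utf-8").decode("utf-8")
print(round_trip)
```

Passing a different encoding name to to_bits changes the byte count for "禅", which is the point the article returns to later.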

Why is encoding in Python such a pain? Of course, developers are not to blame.

The first reason is that Python 2 uses ASCII as its default encoding, and ASCII cannot represent Chinese.

Why not UTF-8? Because Guido wrote the first line of Python code in the winter of 1989 and officially released the first open-source version in February 1991, while Unicode was released in October 1991. In other words, UTF-8 did not yet exist when Python was created.

The second reason is that Python 2 muddled strings into two types, unicode and str, and developers keep tripping over the difference.

Python 3 reworked strings thoroughly and kept only one string type, but that is a story for later.
str and unicode
Python 2 has two string types, str and unicode. A str is essentially a sequence of bytes. In the sample below, the str "禅" prints as the hexadecimal \xec\xf8, which corresponds to the binary byte sequence '1110110011111000'. (These two bytes are the GBK encoding of the character, presumably because the terminal used for this session was set to GBK.)

>>> s = '禅'
>>> s
'\xec\xf8'
>>> type(s)
<type 'str'>

A unicode value such as u"禅" corresponds to the character's Unicode code point, u'\u7985'.

>>> u = u"禅"
>>> u
u'\u7985'
>>> type(u)
<type 'unicode'>
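For readers on Python 3, a rough equivalent of the two sessions above is sketched below. It assumes that Python 3's bytes plays the role of Python 2's str and that Python 3's str plays the role of unicode. GBK is chosen only because it reproduces the \xec\xf8 bytes shown in the first session; the variable names are illustrative.

```python
text = "禅"               # Python 3 str: a sequence of characters
raw = text.encode("gbk")  # Python 3 bytes: a sequence of bytes

code_point = "U+%04X" % ord(text)                    # the Unicode code point
hex_bytes = " ".join("%02x" % b for b in raw)        # the GBK bytes in hex
bit_string = "".join(format(b, "08b") for b in raw)  # the same bytes as bits

print(type(text).__name__, code_point)
print(type(raw).__name__, hex_bytes, bit_string)
```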

To save unicode text to a file or send it over a network, it must first be encoded into a str. Python provides the encode method to convert unicode to str, and the decode method for the reverse.

encode

>>> u = u"禅"
>>> u
u'\u7985'
>>> u.encode("utf-8")
'\xe7\xa6\x85'

decode

>>> s = "禅"
>>> s.decode("utf-8")
u'\u7985'

Beginners often cannot remember whether to use encode or decode when converting between str and unicode. Just remember that a str is essentially binary data, while unicode is characters (symbols). Encoding (encode) is the process of turning characters into binary data, so converting unicode to str uses encode, and the reverse direction uses decode.
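The rule of thumb can be sketched in Python 3, which builds it into the types: str (characters) only has encode, and bytes only has decode. This is an illustration under that Python 3 assumption, not Python 2 behavior, where both types expose both methods.

```python
chars = "禅"                  # characters (unicode in Python 2)
data = chars.encode("utf-8")  # encode: characters -> bytes
back = data.decode("utf-8")   # decode: bytes -> characters

# Python 3 enforces the direction: each type offers only one of the two methods.
str_can_decode = hasattr(chars, "decode")
bytes_can_encode = hasattr(data, "encode")

print(repr(data), repr(back), str_can_decode, bytes_can_encode)
```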

Now that the conversion between str and unicode is clear, let's look at when UnicodeEncodeError and UnicodeDecodeError occur.
UnicodeEncodeError
A UnicodeEncodeError occurs when a unicode string is converted to a str byte sequence. Consider an example that saves a unicode string to a file.

# -*- coding: utf-8 -*-

def main():
    name = u'Python之禅'
    f = open("output.txt", "w")
    f.write(name)

Error log

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)
Why does UnicodeEncodeError occur?

When you call the write method, Python first checks the type of the string. If it is a str, it is written to the file directly with no encoding step, because a str is already a binary byte sequence.

If the string is unicode, Python first calls encode to convert it into a str and only then writes it to the file. That implicit encode uses Python's default encoding, ascii, which is equivalent to:

>>> u"Python之禅".encode("ascii")

However, the ASCII character set contains only 128 characters (Latin letters, digits, punctuation, and control characters) and no Chinese characters, hence the error 'ascii' codec can't encode characters. To encode correctly, you must specify an encoding that covers Chinese characters, such as UTF-8 or GBK.

>>> u"Python之禅".encode("utf-8")
'Python\xe4\xb9\x8b\xe7\xa6\x85'
>>> u"Python之禅".encode("gbk")
'Python\xd6\xae\xec\xf8'
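The failure and the two working encodings can be reproduced in one place. The sketch below uses Python 3, where the ascii codec has to be requested explicitly rather than being applied implicitly by write; the variable names are illustrative.

```python
name = "Python之禅"

failed_span = None
offending = ""
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the first bad index, exc.end is one past the last bad index.
    failed_span = (exc.start, exc.end)
    offending = name[exc.start:exc.end]

utf8_bytes = name.encode("utf-8")  # 6 ASCII bytes + 3 bytes per Chinese character
gbk_bytes = name.encode("gbk")     # 6 ASCII bytes + 2 bytes per Chinese character
print(failed_span, offending, len(utf8_bytes), len(gbk_bytes))
```

The span reported by the exception covers exactly the two Chinese characters, matching the positions named in the error message.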

So to write a unicode string to a file correctly, encode it as UTF-8 or GBK beforehand.

def main():
    name = u'Python之禅'
    name = name.encode('utf-8')
    with open("output.txt", "w") as f:
        f.write(name)

Of course, there is more than one way to write unicode strings to a file correctly, but the principle is the same, so the others are not covered here. The same principle applies when storing strings in a database or sending them over the network.
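As one illustration of such an alternative (not necessarily the one the author had in mind), io.open accepts an encoding argument and performs the encode step itself. io.open is available in Python 2.6+ and in Python 3; the temporary directory is used here only to keep the sketch self-contained.

```python
import io
import os
import tempfile

name = u"Python之禅"

# A throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The file object encodes on write, so no explicit name.encode(...) is needed.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(name)

# Reading the file back in binary mode shows the bytes that reached the disk.
with io.open(path, "rb") as f:
    stored = f.read()

print(repr(stored))
```

The bytes on disk are the same as those produced by calling encode('utf-8') by hand; only the place where the encoding is named has moved.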
UnicodeDecodeError
A UnicodeDecodeError occurs when a str byte sequence is decoded into a unicode string.

>>> a = u"禅"
>>> a
u'\u7985'
>>> b = a.encode("utf-8")
>>> b
'\xe7\xa6\x85'
>>> b.decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x85 in position 2: incomplete multibyte sequence

When the byte sequence '\xe7\xa6\x85' produced by UTF-8 encoding is decoded as GBK, a UnicodeDecodeError occurs. GBK uses 2 bytes per Chinese character, while UTF-8 uses 3 for this character (and for most Chinese characters), so decoding as GBK leaves one byte over that cannot be parsed. The key to avoiding UnicodeDecodeError is to decode with the same encoding that was used to encode.

This also answers the question from the beginning of the article: the character "禅" may occupy 3 bytes or 2 bytes when saved to a file, depending on the encoding specified when encode is called.
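The byte counts and the mismatch can be checked directly. The following Python 3 sketch mirrors the session above, with bytes standing in for Python 2's str; the variable names are illustrative.

```python
utf8_bytes = "禅".encode("utf-8")  # 3 bytes
gbk_bytes = "禅".encode("gbk")     # 2 bytes

error_position = None
try:
    utf8_bytes.decode("gbk")  # wrong codec for these bytes
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the byte that could not be decoded

# Decoding each byte sequence with the codec that produced it gives the same character.
same_char = utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")
print(len(utf8_bytes), len(gbk_bytes), error_position, same_char)
```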

Here is another example of UnicodeDecodeError:

>>> x = u"Python"
>>> y = "之禅"
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

When + is applied to a str and a unicode string, Python implicitly converts (decodes) the str byte sequence y into the same unicode type as x. By default Python decodes with ascii, and ASCII does not include Chinese characters, hence the error. The implicit step is equivalent to:

>>> y.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The correct approach is to decode y explicitly with UTF-8 or GBK, whichever matches the bytes it holds.

>>> x = u"Python"
>>> y = "之禅"
>>> y = y.decode("utf-8")
>>> x + y
u'Python\u4e4b\u7985'
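The same fix can be sketched in Python 3, with bytes standing in for the Python 2 str. Note that Python 3 does not attempt the implicit ascii decode at all: adding str and bytes raises a TypeError, so the explicit decode is the only option there. The ascii decode below is written out by hand to show what Python 2 tries behind the scenes.

```python
x = "Python"
y = "之禅".encode("utf-8")  # bytes, like the Python 2 str y above

bad_byte = None
try:
    y.decode("ascii")  # what Python 2 attempts implicitly for x + y
except UnicodeDecodeError as exc:
    bad_byte = y[exc.start]  # first byte outside the ASCII range

joined = x + y.decode("utf-8")  # decode explicitly, then concatenate
print(hex(bad_byte), joined)
```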

Everything above applies to Python 2. Character encoding in Python 3 is a separate topic for another time, so stay tuned.




Origin blog.csdn.net/mengyidan/article/details/80128690