A nice article on character encoding in Python 2

https://mp.weixin.qq.com/s/ImVH-XZk5RyjT8D7aMCmZw

Note: The experiments in this article are based on Windows 7 with Python 2.7, and on Linux with Python 2.7. Unless otherwise specified, all commands are entered interactively in the terminal; if no platform is emphasized, the result shown is from Windows. Below is some default environment information (its importance will be explained later).

windows

>>> import sys,locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getdefaultlocale()
('zh_CN', 'cp936')
>>> sys.stdin.encoding
'cp936'
>>> sys.stdout.encoding
'cp936'
>>> sys.getfilesystemencoding()
'mbcs'

The full list of standard encodings can be viewed at https://docs.python.org/2/library/codecs.html#standard-encodings.

Linux

>>> import sys,locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getdefaultlocale()
('zh_CN', 'UTF-8')
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.getfilesystemencoding()
'UTF-8'

Let's start with character encoding

First, let's talk about the terms gbk, gb2312, unicode, and utf-8. These terms have nothing to do with any particular programming language.

The computer world has only 0s and 1s, so any character (that is, an actual literal symbol) is also represented by a string of 0s and 1s. For convenience, computers group 8 bits into a byte, and the smallest unit for representing a character is the byte; that is, a character occupies one or more bytes. A character set assigns a code to each character, and encoding is the process of mapping the characters in a character set to unique binary values.

The computer originated in the United States, which uses English letters. The 26 letters in upper and lower case, plus the digits 0 to 9, plus symbols and control characters, do not add up to much, so one byte (8 bits) is enough to represent all of these characters. This is the ASCII encoding (American Standard Code for Information Interchange). For example, the ASCII code of the lowercase letter 'a' is 01100001, which is 97 in decimal and 0x61 in hexadecimal. In computing, hexadecimal is generally used to describe character encodings.
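A quick check of these values in the Python 2 interpreter, using only built-ins:

>>> ord('a')
97
>>> hex(ord('a'))
'0x61'
>>> chr(0x61)
'a'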

However, when computers reached China, ASCII would not do: there are far too many Chinese characters for one byte to represent. Hence GB2312 (the Chinese national standard simplified Chinese character set). GB2312 uses two bytes to encode one character, where the first byte (called the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE. GB2312 can represent several thousand Chinese characters, and it is also compatible with ASCII.

GB2312 later proved insufficient, so it was extended into GBK (the Chinese character internal code extension specification). Like GB2312, GBK represents one character with two bytes, but it relaxes the requirements on the low byte, expanding the representable range to more than 20,000 characters. Later, GB18030 appeared in order to accommodate minority languages and the characters of other countries that use Chinese characters. GB18030 is compatible with GBK and GB2312 and can accommodate even more characters. Unlike GBK and GB2312, GB18030 encodes characters in three ways: single-byte, double-byte, and four-byte.

Therefore, as far as the Chinese characters we care about are concerned, the representation ranges of the three encodings satisfy:

GB18030 > GBK > GB2312

That is, GBK is a superset of GB2312, and GB18030 is a superset of GBK. As will be seen later, some Chinese characters can be represented by GBK but not by GB2312.

Of course, there are many more languages and scripts in the world, each with its own set of encoding rules, so as soon as text crosses a border, garbled characters appear. A globally unified solution was urgently needed. At this point ISO (the International Organization for Standardization) stepped in and devised the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "unicode". The goal was simple: do away with all regional coding schemes and create a new one covering all cultures and all letters and symbols on Earth!

Unicode assigns a unified and unique binary code to each character in every language, to meet the requirements of cross-language, cross-platform text conversion and processing. A unicode code point is conventionally written with a leading u (for example, u4e25).

However, unicode is just an encoding specification: a mapping from all characters to binary values, not a concrete encoding rule. In other words, unicode is a representation form, not a storage form; it does not define how each character is actually stored as bytes. This is different from GBK: for GBK, the representation form is also the storage form.

For example, the unicode code point of the Chinese character "严" ("strict") is u4e25, whose binary is 100111000100101. But when that binary is transmitted over the network or stored in a file, there is no way to know how to parse it, and it is easily confused with other bytes. So how should unicode be stored? This is where UTF (Unicode Transformation Format) comes in: it is a concrete encoding rule, i.e. for UTF the representation is the same as the storage format.

Therefore, it can be said that GBK and UTF-8 are at the same level, while unicode is at another level. unicode floats in the air; for it to land, it must be converted to utf-8 or GBK. The difference is that after converting to utf-8, everyone in the world can interpret it, whereas after converting to GBK, only systems that know GBK can.

UTF has different implementations, such as UTF-8 and UTF-16; here UTF-8 is used as the example (the following subsection quotes Ruan Yifeng's article).

unicode and utf-8

One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1 to 4 bytes to represent a symbol, with the byte length varying by symbol. The encoding rules of UTF-8 are very simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the last 7 bits are the unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII code are the same.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits, not mentioned above, are filled with the binary of the symbol's unicode code point.

The following table summarizes the encoding rules, the letter x indicates the bits of the available encoding.

Unicode symbol range  | UTF-8 encoding
(hexadecimal)         | (binary)
----------------------+---------------------------------------------
0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Take the Chinese character "严" as an example to demonstrate how UTF-8 encoding is applied.

It is known that the unicode code point of "严" is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of "严", fill the x positions from back to front, padding the extra bits with 0. The UTF-8 encoding of "严" is thus "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
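This can be spot-checked in the Python 2 interpreter (a small verification, using only built-ins):

>>> bin(0x4e25)
'0b100111000100101'
>>> u'\u4e25'.encode('utf-8')
'\xe4\xb8\xa5'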

When the codec meets Python 2.x

Now let's verify the above theory using the Python language. In this chapter, "unicode" generally refers to the unicode type, i.e. the type in Python; unicode encoding and the unicode function will also be mentioned, so please mind the distinction.

In addition, "encoding" has two meanings. The first is as a noun, referring to the binary representation of characters, as in unicode encoding or gbk encoding. The second is as a verb, referring to the process of mapping characters to binary. In the text below, however, encoding as a verb is understood narrowly as the process of converting from the unicode type to the str type, and decoding as the opposite process. Note also that an instance of the unicode type is necessarily in unicode encoding, while an instance of the str type may be in gbk, ascii, or utf-8 encoding.

The difference between unicode and str

In Python 2.7 there are two "string" types, str and unicode, which share the base class basestring. str is the plain string type; it should really be called a byte string, because each byte counts as one unit of length. unicode is the unicode string type; this is the true string, where one character (possibly multiple bytes) counts as one unit of length.
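A small illustrative check of these type relationships in the interpreter:

>>> isinstance('abc', basestring), isinstance(u'abc', basestring)
(True, True)
>>> isinstance('abc', unicode), isinstance(u'abc', str)
(False, False)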

In Python 2.7, a unicode literal is written by prefixing the text with u.

>>> us = u'严'
>>> print type(us), len(us)
<type 'unicode'> 1
>>> s = '严'
>>> print type(s), len(s)
<type 'str'> 2

As can be seen above, first, us and s have different types; second, the same Chinese character has a different length under each type. For an instance of the unicode type, the length is always the number of characters, while for an instance of the str type, the length is the number of bytes of the character's encoding. Note that the length of s (s = '严') differs across environments!

The difference between __str__ and __repr__

These are two magic methods in Python that easily confuse newcomers, because in many cases the two have the same implementation, yet they are used in different places.

__str__ is mainly used for display. It is called by str(obj) or print obj, and its return value must be a str object.

__repr__ is called by repr(obj), or when the interactive terminal echoes obj directly.
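A minimal made-up class (the name Demo is arbitrary) makes the division of labor visible:

>>> class Demo(object):
...     def __str__(self):
...         return 'from __str__'
...     def __repr__(self):
...         return 'from __repr__'
...
>>> d = Demo()
>>> d
from __repr__
>>> print d
from __str__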

>>> us = u'严'
>>> us
u'\u4e25'
>>> print us
严

You can see that without print, what is echoed is a result that better reflects the essence of the object: us is a unicode object (the leading u marks the type), and the unicode code point of "严" is indeed 4E25. print calls us.__str__, which is equivalent to print str(us), making the result more user-friendly. So how does unicode.__str__ convert to str? The answer will be revealed later.

The relationship between unicode, str, and utf-8

As mentioned earlier, unicode is just an encoding specification (only a mapping between characters and binary), while utf-8 is a concrete encoding rule (not only a mapping between characters and binary, but one whose binary can be stored and transmitted). That is, utf-8 is responsible for converting unicode into a binary string that can be stored and transmitted, i.e. the str type; we call this conversion process encoding. The process from the str type to the unicode type we call decoding.

Python uses decode() and encode() for decoding and encoding, with the unicode type as the intermediate type, as the following diagram shows:

      decode            encode
str ---------> unicode ---------> str

That is, the str type calls the decode method to convert to the unicode type, and the unicode type calls the encode method to convert to the str type. For example:

>>> us = u'严'
>>> ss = us.encode('utf-8')
>>> ss
'\xe4\xb8\xa5'
>>> type(ss)
<type 'str'>
>>> ss.decode('utf-8') == us
True

From the above we can see what the encode and decode functions do. It can also be seen that the utf-8 encoding of '严' is indeed E4B8A5.

So unicode.encode converts the unicode type to the str type. It was also mentioned above that unicode.__str__ converts the unicode type to the str type. What is the difference between the two?

The difference between unicode.encode and unicode.__str__

First look at the documentation

str.encode([encoding[, errors]])
  Return an encoded version of the string. Default encoding is the current default string encoding.
  
object.__str__(self)
  Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object.

Note: in str.encode above, str refers to basestring, the common base class of str and unicode.

You can see that the encode method has optional parameters encoding and errors; in the example above, encoding was utf-8. __str__, by contrast, takes no parameters, so we can guess that for the unicode type, the __str__ function must also use some default encoding to encode the unicode.

First of all, one cannot help asking: what happens if the encode method is called without parameters?

>>> us.encode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e25' in position 0: ordinal not in range(128)

It is not hard to see that by default the ascii codec is used to encode the unicode. Why ascii? Because it is the system default encoding (the return value of sys.getdefaultencoding()). The ascii codec obviously cannot represent Chinese characters, so an exception is thrown. With utf-8, since utf-8 can represent this Chinese character, no error is reported.
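This can be confirmed directly; on a stock Python 2 setup the following holds:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'abc'.encode() == u'abc'.encode(sys.getdefaultencoding())
True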

What happens if you print ss (the return value of us.encode('utf-8')) directly?

>>> print ss
涓

The result is a little strange: it differs from the result of us.__str__ (i.e. printing us directly). Then what about encoding to gbk?

>>> print us.encode('gbk')
严

Now it clicks! In fact, when printing, Python encodes the unicode into the str type using the terminal's default encoding (check it with locale.getdefaultlocale(); on Windows it is gbk).

On Linux (terminal encoding is utf-8), the result is as follows:

>>> us = u'严'
>>> print us.encode('utf-8')
严
>>> print us.encode('gbk')
▒▒

Note the garbled characters above!

Conversion between unicode and gbk

The previous section introduced how unicode can be encoded with utf-8 (encoding=utf-8) into a str represented in utf-8; it also showed that unicode can be encoded with gbk (encoding=gbk) into a str represented in gbk. It gets a bit dizzying here, so I will leave that as the first question and explain it later.

The conversion between unicode and utf-8 can be computed by formula, but there is no formula for converting between unicode and gbk; there is only a mapping table, recording the correspondence between the unicode representation and the gbk representation of each Chinese character.
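To make "computed by formula" concrete, here is a minimal Python 2 sketch of the three-byte UTF-8 rule from the table above; the helper name utf8_3bytes is made up for illustration:

# pack a code point from the range 0800-FFFF into 1110xxxx 10xxxxxx 10xxxxxx
def utf8_3bytes(cp):
    b1 = 0xE0 | (cp >> 12)           # 1110xxxx: top 4 bits of the code point
    b2 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
    return chr(b1) + chr(b2) + chr(b3)

print utf8_3bytes(0x4E25) == u'\u4e25'.encode('utf-8')   # prints True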

>>> us = u'严'
>>> us
u'\u4e25'
>>> us.encode('gbk')
'\xd1\xcf'
>>> us.encode('gb2312')
'\xd1\xcf'
>>> us.encode('gb18030')
'\xd1\xcf'
>>> s = '严'
>>> s
'\xd1\xcf'
>>>

It is not hard to see from the above that the unicode code point of 严 is 4e25 and its GBK encoding is d1cf, so encoding us with gbk yields d1cf. It can also be seen that GB18030, GBK, and GB2312 are compatible with one another.

Why does print us.encode('utf-8') print "涓"

ss = us.encode('utf-8') is of type str, and printing it directly gives the slightly strange result of a single character, "涓". What binary makes up that str-type "涓"?

>>> s = '涓'
>>> s
'\xe4\xb8'

As you can see, the binary of the str-type "涓" is E4B8, which differs from the utf-8 encoding of '严' (E4B8A5) only by the trailing A5. That A5 simply is not displayed, as the following verifies:

>>> print '--%s--' % ss
--涓?--

So it just happens to display "涓"; in fact ss has nothing to do with "涓".

Answering the first question: what exactly is the str type?

The previous sections mentioned both utf-8-encoded str and gbk-encoded str, which feels a bit confusing. We know the Chinese character '严' can be stored either in gbk ('\xd1\xcf') or in utf-8 ('\xe4\xb8\xa5'), so when we type this character in the terminal, which format is it in? It depends on the terminal's default encoding.

On Windows (default terminal encoding is gbk):

>>> s = '严'
>>> s
'\xd1\xcf'

On Linux (default terminal encoding is utf-8):

>>> a = '严'
>>> a
'\xe4\xb8\xa5'


The same Chinese character, likewise of the str type in Python, has different binary under different encoding formats, and therefore a different length. For the str type, the length is the corresponding byte length.

It can also be seen that the byte length under gbk is generally smaller than under utf-8, which is one reason gbk continues to exist.

Here, it should be emphasized that the binary form of unicode has nothing to do with the encoding format of the terminal! This is not difficult to understand.

The unicode function

To convert the str type to the unicode type, besides the str.decode mentioned above there is also the unicode function. The signatures of the two are:

unicode(object[, encoding[, errors]])

Return the Unicode string version of object using one of the following modes:

str.decode([encoding[, errors]])
  Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding.

The parameters of the two are the same, and in fact the two are equivalent; the default value of encoding is also the same, namely the result of sys.getdefaultencoding(). For example:

>>> s = '严'
>>> newuse = unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)

>>> newuse = unicode(s, 'utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
>>> newuse = unicode(s, 'gbk')
>>> newuse
u'\u4e25'

The first UnicodeDecodeError occurs because the system default encoding is ascii; the second because the encoding of s (an instance of the str type) depends on the terminal's default encoding (gbk under Windows). To convert s to the unicode type, one can therefore only look up the mapping table between gbk and unicode, i.e. decode with gbk.
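As a quick sanity check that the two are indeed equivalent (using the gbk bytes of 严 from earlier):

>>> s = '\xd1\xcf'
>>> unicode(s, 'gbk') == s.decode('gbk')
True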

Why call sys.setdefaultencoding

Much Python 2 code begins with the following three lines:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
It is not difficult to guess that setdefaultencoding and getdefaultencoding come in a pair. The reason for setting the system default encoding to utf-8 is to solve the conversion problem from str to unicode.

As mentioned in the previous section, when using the unicode function to convert the str type to the unicode type, two factors matter: first, what encoding the str itself is in; second, if the encoding parameter is not passed, sys.getdefaultencoding() is used by default. The encoding parameter must match the encoding of the str itself, otherwise a UnicodeDecodeError results.

Programmers who write Python code know that we often write this on the first line of a py file:

# -*- coding: utf-8 -*-

This line tells the interpreter that all str literals in the file use utf-8 encoding, and that the file itself should also be saved in utf-8 format.

The following code is then used in the file.

s = '中文'
us = unicode(s)

When coercing with the unicode function like this, it is common not to pass the encoding parameter. To guarantee that the default encoding parameter matches the encoding of the str itself, setdefaultencoding is used to set the system default encoding to utf-8.

Garbled characters and UnicodeError

The following introduces several common cases of garbled characters and of the UnicodeError exception. The causes of most of them have already been mentioned above. For some of the garbled cases, feasible solutions are also given.

UnicodeError includes UnicodeDecodeError and UnicodeEncodeError. The former is decode, that is, an exception occurs when str is converted to unicode, and the latter is encode, that is, an exception occurs when unicode is converted to str.

Printing a str directly

This is the example mentioned repeatedly above:

>>> ss = us.encode('utf-8')
>>> print ss
涓

If a str comes from the network or from reading a file, it is best to first decode it into unicode according to the peer's encoding, and then output it (on output it will be automatically converted to a str in the encoding the terminal expects).
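A minimal sketch of this advice, assuming a hypothetical file demo.txt that the peer saved in utf-8:

raw = open('demo.txt', 'rb').read()  # str: the peer's utf-8 bytes
text = raw.decode('utf-8')           # decode at the boundary -> unicode
print text                           # converted to the terminal encoding on output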

Chinese characters outside an encoding's range

Straight to the example:

>>> newus = u'囍'
>>> newus
u'\u56cd'
>>> newus.encode('gbk')
'\x87\xd6'
>>> newus.encode('gb2312')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u56cd' in position 0: illegal multibyte sequence


As you can see, the character '囍' can be encoded by gbk but not by gb2312.

UnicodeDecodeError when converting str to unicode

An example was already given above when discussing the unicode function: a UnicodeDecodeError exception is raised.

This error more often arises from the implicit conversion of str to unicode, for example when a str is added to a unicode:

>>> a = '严'
>>> b = u'严'
>>> c = a + b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)

When unicode and str are added, the str is converted to unicode using the default encoding, i.e. unicode(strobj, encoding=sys.getdefaultencoding()).
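The robust fix is to decode explicitly with the correct encoding instead of relying on the ascii default (here a holds the gbk bytes of 严, as in the examples above):

>>> a = '\xd1\xcf'
>>> b = u'\u4e25'
>>> a.decode('gbk') + b
u'\u4e25\u4e25'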

A string that looks like unicode encoding

In some cases, we print a str and see a result like '\u4e25'. Does this look familiar? Yes: the unicode representation of '严' is u'\u4e25'. Looking closely, the printed value is just missing the u (which marks the unicode type) in front of the quotation mark. So when we see a '\u4e25', how do we know which Chinese character it corresponds to? For a known literal in this format, one can of course manually add the u and evaluate it in the terminal; but if it is a variable, it must be converted to unicode programmatically. This is where unicode_escape from the Python-specific encodings comes in.

>>> s = '\u4e25'
>>> s
'\\u4e25'
>>> us = s.decode('unicode_escape')
>>> us
u'\u4e25'

Hexadecimal-format strings

Sometimes you will also see a str like '\\xd1\\xcf'. It looks very familiar, very similar to '\xd1\xcf', the gbk encoding of the Chinese character "严"; the difference is the extra backslashes, which prevent it from being interpreted as hexadecimal bytes. The solution is string_escape from the Python-specific encodings.

>>> s = '\\xd1\\xcf'
>>> s
'\\xd1\\xcf'
>>> print s
\xd1\xcf
>>> news = s.decode('string_escape')
>>> news
'\xd1\xcf'
>>> print news
严

A question for readers

Leave a question here:

u'严' == '严'

Is the return value True or False? The context is deliberately omitted here, but clearly the answer differs in different encoding environments, and the reasons are all above!

Summary and suggestions

No matter how it is explained, character encoding in Python 2.x remains a headache; even once understood, it is easily forgotten. For this problem, here are several suggestions:

First: use Python 3, and you no longer need to worry about str versus unicode; but developers rarely get the final say on this.

Second: don't use Chinese at all, and write all comments in English. The ideal is beautiful, the reality difficult; it also tends to breed a lot of pinyin.

Third: for Chinese strings, use unicode rather than str. In practice this is not easy to enforce, as everyone is reluctant to type the extra u.

Fourth: only encode unicode when transmitting or persisting it; in the opposite direction, decode (see the sketch after this list).

Fifth: for network interfaces, agree on the codec format with the other side; utf-8 is strongly recommended.

Sixth: don't panic when you see a UnicodeXXXError. If XXX is Encode, the problem must be in converting unicode to str; if it is Decode, the problem must be in converting str to unicode.
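As promised above, a minimal sketch of suggestions four and five together, assuming a hypothetical file name out.txt:

us = u'\u4e25'
with open('out.txt', 'wb') as f:
    f.write(us.encode('utf-8'))            # encode only at the boundary
with open('out.txt', 'rb') as f:
    assert f.read().decode('utf-8') == us  # decode immediately after reading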
