Strings and character encodings in Python

Hello and welcome, this is Token_w's blog.
Today's article explains strings and character encoding in Python, covering both basic theory and practical application. I hope you find it helpful; if you do, likes and favorites are appreciated.

1. Introduction

Character encoding in Python is a well-worn topic, and many people have written articles about it. Some merely repeat what others say, while some go into real depth. Recently I saw the issue discussed again in a teaching video from a well-known training institution, and the explanation was still unsatisfactory, so I wanted to write this article: on the one hand to organize the relevant knowledge for myself, and on the other hand in the hope of helping others.

The default encoding of Python 2 is ASCII, which cannot handle Chinese characters, so the encoding must be specified explicitly; the default encoding of Python 3 is UTF-8, which can handle Chinese characters.

You have probably seen explanations of "handling Chinese in Python" similar to the above in many articles, and they probably made sense the first time you read them. But after a while, when you keep running into related problems, you realize your understanding was never really clear. If we understand what the "default encoding" mentioned above actually does, the meaning of that sentence becomes much clearer.

2. Related concepts

2.1 Characters and bytes

A character is not equivalent to a byte. Characters are symbols that humans can recognize; to be saved in computer storage, these symbols must be represented as bytes that computers can recognize. A single character often has multiple possible representations, and different representations use different numbers of bytes. These different representations are what we call character encodings. For example, the letters A-Z can be represented in ASCII (one byte each), in UNICODE (two bytes each), or in UTF-8 (one byte each). The role of a character encoding is to convert human-recognizable characters into machine-recognizable bytes, and vice versa.

UNICODE is the real string, while character encodings such as ASCII, UTF-8, and GBK represent byte strings. On this point, we often see descriptions like "Unicode string" and "translating a Unicode string into a sequence of bytes" in Python's official documentation.

We write code in files, and characters are stored in those files as bytes, so it is understandable that when we define a string in a file, it starts out as a byte string. However, what we usually need is a string, not a byte string. A good programming language should strictly distinguish between the two and provide clean, complete support for both. Java does this so well that, before learning about Python and PHP, I never had to think about issues that really should not be a programmer's problem. Unfortunately, many programming languages blur "string" and "byte string" and use byte strings as strings; PHP and Python 2 are among them. The operation that best illustrates the problem is taking the length of a string that contains Chinese characters:

  • The length of a string should be the number of characters, regardless of whether they are Chinese or English
  • The length of the byte string corresponding to a string depends on the character encoding used in the encoding (encode) process (for example, in UTF-8 a Chinese character takes 3 bytes; in GBK a Chinese character takes 2 bytes)
    Note: The default character encoding of the Windows cmd terminal is GBK, so Chinese characters typed in cmd are represented with two bytes each
# Python2
a = 'Hello,中国'  # byte string; length = number of bytes = len('Hello,') + len('中国') = 6 + 2*2 = 10
b = u'Hello,中国'  # string; length = number of characters = len('Hello,') + len('中国') = 6 + 2 = 8
c = unicode(a, 'gbk')  # b's definition is really shorthand for c's: both decode a GBK byte string into a Unicode string

print(type(a), len(a))
# (<type 'str'>, 10)
print(type(b), len(b))
# (<type 'unicode'>, 8)
print(type(c), len(c))
# (<type 'unicode'>, 8)

String support was substantially reworked in Python 3; the details are introduced below.

2.2 Encoding and decoding

First, a bit of background: the UNICODE character set is also a mapping between characters and numbers, but the numbers here are called code points, and they are conventionally written as hexadecimal numbers.
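
For instance, in Python 3, ord() maps a character to its code point and chr() maps a code point back to a character (a quick sketch):

# Python3
print(hex(ord('A')))   # 0x41   -> code point U+0041
print(hex(ord('中')))  # 0x4e2d -> code point U+4E2D
print(chr(0x4E2D))     # 中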

The official Python documentation describes the relationship between Unicode strings, byte strings, and encodings:

A Unicode string is a sequence of code points, where each code point ranges from 0 to 0x10FFFF (1114111 in decimal). To be stored (in memory or on physical disk), this sequence of code points must be represented as a set of bytes (values between 0 and 255), and the rules for converting a Unicode string into a sequence of bytes are called an encoding.

The "encoding" here does not refer to a particular character encoding; it refers to the encoding process and the mapping rules between Unicode code points and bytes used in that process. This mapping does not have to be a simple one-to-one mapping, so the encoding process does not have to handle every possible Unicode character. For example:

The rules for converting a Unicode string to ASCII are simple - for each code point:

  • If the code point value is < 128, the byte value is the same as the code point value

  • If the code point value is >= 128, the Unicode string cannot be represented in this encoding (in this case, Python raises a UnicodeEncodeError exception)

Converting a Unicode string to UTF-8 uses the following rules:

  • If the code point value is < 128, it is represented by the corresponding byte value (the same as the ASCII case above)

  • If the code point value is >= 128, it is converted into a sequence of 2, 3, or 4 bytes, where each byte in the sequence is between 128 and 255

Brief summary:

  • Encoding (encode): the process and rules for converting a Unicode string (a sequence of code points) into a byte string in a specific character encoding

  • Decoding (decode): the process and rules for converting a byte string in a specific character encoding back into the corresponding Unicode string (a sequence of code points)

As you can see, whether encoding or decoding, one essential ingredient is required: a specific character encoding. This matters because the same character encoded with different character encodings usually produces different byte values and different byte counts, and the reverse is also true.
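
A minimal Python 3 sketch of both directions (the sample character is arbitrary):

# Python3
s = '中'                   # a Unicode string containing one code point, U+4E2D
b = s.encode('utf-8')      # encode: code points -> bytes
print(b, len(b))           # b'\xe4\xb8\xad' 3  (3 bytes in UTF-8)
print(b.decode('utf-8'))   # decode: bytes -> code points, back to '中'
try:
    s.encode('ascii')      # U+4E2D >= 128, so ASCII cannot represent it
except UnicodeEncodeError as e:
    print(e)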

3. The default encoding in Python

3.1 Execution process of Python source code file

We all know that files on disk are stored in binary, and that text files are stored as bytes in some specific encoding. The character encoding of a source code file is determined by the editor. For example, when we write a Python program in PyCharm with the project encoding and file encoding set to UTF-8, the Python code is converted into the corresponding UTF-8 bytes (the encode process) when it is saved to disk. When the code file is executed, the Python interpreter reads the byte string from the file and must convert it into a UNICODE string (the decode process) before performing any subsequent operations.

As explained above, this conversion (the decode process) requires us to specify the character encoding of the bytes stored in the file, so that the interpreter knows which code points in the UNICODE character set those bytes correspond to. The way to specify the character encoding here should be familiar to everyone:

# -*- coding:utf-8 -*-


3.2 Default encoding

So, if we do not specify a character encoding at the top of the code file, which character encoding will the Python interpreter use to convert the bytes read from the file into UNICODE code points? Just as much software ships with default configuration options, the Python interpreter has a default character encoding set internally to resolve exactly this situation. This is the "default encoding" mentioned at the beginning of the article. The Python "Chinese character problem" people talk about can therefore be summed up in one sentence: when bytes cannot be converted with the default character encoding, a decoding error (UnicodeDecodeError) occurs.

The default encoding used by the Python2 and Python3 interpreters is different. We can get the default encoding through sys.getdefaultencoding():

# Python2
import sys
print(sys.getdefaultencoding())
# 'ascii'

# Python3
import sys
print(sys.getdefaultencoding())
# 'utf-8'

Therefore, in Python 2, when the interpreter reads the bytes of Chinese characters and tries to decode them, it first checks whether the header of the current code file declares the character encoding of the bytes stored in that file. If no encoding is declared, it falls back to the default character encoding, ASCII; the decode fails, and the following error results:

SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

For Python 3 the execution process is the same, except that the Python 3 interpreter uses UTF-8 as the default encoding. This still does not make Chinese problems disappear entirely. For example, when we develop on Windows, the project and code files may use the system default GBK encoding, meaning the Python code files are converted into GBK bytes when saved to disk. When the Python 3 interpreter executes such a file and tries to decode it as UTF-8, the decode also fails, producing the following error:

SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
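
The same mismatch also bites when reading data files in Python 3, where open() defaults to the platform's preferred encoding. A small sketch (the file name demo_gbk.txt is made up for illustration):

# Python3
with open('demo_gbk.txt', 'w', encoding='gbk') as f:
    f.write('中国')  # saved to disk as GBK bytes

with open('demo_gbk.txt', 'r', encoding='gbk') as f:
    print(f.read())  # 中国

# Reading it back as UTF-8 instead would raise UnicodeDecodeError,
# just like the interpreter's failure above.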

3.3 Best Practices

  • After creating a project, first confirm that the project's character encoding is set to UTF-8
  • To stay compatible with both Python 2 and Python 3, declare the character encoding at the top of each code file: # -*- coding:utf-8 -*-

4. Support for strings in Python2 and Python3

In fact, the improved string support in Python 3 is not merely a change of default encoding: the string type was reimplemented with built-in support for UNICODE. In this respect, Python is now as good as Java. Let's look at the differences between string support in Python 2 and Python 3:

(1) Python2

Support for strings in Python2 is provided by the following three classes

class basestring(object)
    class str(basestring)
    class unicode(basestring)

Executing help(str) and help(bytes) shows the same result, the definition of the str class, which confirms that str in Python 2 is a byte string; the separate unicode class corresponds to the real string.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

a = '你好'
b = u'你好'

print(type(a), len(a))
print(type(b), len(b))

Output result:

(<type 'str'>, 6)
(<type 'unicode'>, 2)

(2) Python3

String support in Python 3 was simplified at the level of the implementation classes: the unicode class was removed and a bytes class was added. On the surface, you could say that Python 3's str and Python 2's unicode merged into one.

class bytes(object)
class str(object)

In fact, Python 3 recognized the earlier mistake and began to distinguish strings from bytes strictly. str in Python 3 is a real string, and bytes are represented by the separate bytes class. In other words, Python 3 defines strings as Unicode by default, providing built-in UNICODE support and reducing the string-handling burden on programmers.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

a = '你好'
b = u'你好'
c = '你好'.encode('gbk')

print(type(a), len(a))
print(type(b), len(b))
print(type(c), len(c))

Output result:

<class 'str'> 2
<class 'str'> 2
<class 'bytes'> 4
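
Python 3 also refuses to mix the two types implicitly, which follows directly from this separation (a minimal sketch):

# Python3
s = '你好'
b = s.encode('utf-8')
print(s + s)   # fine: str + str
print(b + b)   # fine: bytes + bytes
try:
    s + b      # str + bytes is a TypeError in Python 3
except TypeError as e:
    print(e)   # e.g. can only concatenate str (not "bytes") to str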

5. Character encoding conversion

As mentioned above, a UNICODE string can be converted into bytes in any character encoding. This naturally raises a question: can bytes in different character encodings be converted into one another via Unicode? The answer is yes.

The character encoding conversion process for strings in Python 2 is:

byte string -> decode('original character encoding') -> Unicode string -> encode('new character encoding') -> byte string

#!/usr/bin/env python
# -*- coding:utf-8 -*-


utf_8_a = '我爱中国'
gbk_a = utf_8_a.decode('utf-8').encode('gbk')
print(gbk_a.decode('gbk'))

Output result:

我爱中国
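
A classic Python 2 pitfall is worth noting here (a hedged sketch, assuming the file is saved as UTF-8 with a coding declaration): calling encode() directly on a byte string makes Python 2 first decode it implicitly with the default ASCII encoding, which fails on Chinese bytes:

# Python2
utf_8_a = '我爱中国'       # a UTF-8 byte string
try:
    utf_8_a.encode('gbk')  # implicitly runs utf_8_a.decode('ascii') first
except UnicodeDecodeError as e:
    print(e)               # 'ascii' codec can't decode byte 0xe6 ...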

A string defined in Python 3 is Unicode by default, so there is no need to decode it first; it can be encoded directly into a new character encoding:

string -> encode('new character encoding') -> byte string

#!/usr/bin/env python
# -*- coding:utf-8 -*-


utf_8_a = '我爱中国'
gbk_a = utf_8_a.encode('gbk')
print(gbk_a.decode('gbk'))

Output result:

我爱中国

The last thing to make clear is that Unicode is not Youdao Dictionary, nor Google Translate: it cannot translate Chinese into English. A correct character encoding conversion only changes the byte representation of the same characters; the characters themselves do not change, so not every conversion between character encodings is meaningful. How should we understand this sentence? For example, after GBK-encoded "中国" is converted to UTF-8, its representation merely grows from 4 bytes to 6 bytes, but the text is still "中国"; it does not become "Hello", and it does not become the English word "China".
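
A short Python 3 sketch makes the point concrete:

# Python3
s = '中国'
print(s.encode('gbk'))     # b'\xd6\xd0\xb9\xfa' -> 4 bytes
print(s.encode('utf-8'))   # b'\xe4\xb8\xad\xe5\x9b\xbd' -> 6 bytes
# Different bytes, same two characters: decoding each with its own
# encoding returns exactly the same string.
print(s.encode('gbk').decode('gbk') == s.encode('utf-8').decode('utf-8'))  # True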

I spent a lot of space up front on concepts and theory and focused on practice afterwards; I hope it has been helpful to you.

6. Appendix: Character encodings

1. ASCII code

ASCII is an early encoding standard developed in the United States. It can represent only 128 characters, including the English letters, Arabic numerals, Western punctuation and symbols, and 32 control characters.

2. Extended ASCII code (Extended ASCII)

Simply put, extended ASCII arose because plain ASCII was not enough, so the table was extended to 256 symbols. But different countries adopted different extended-ASCII standards, and that incompatibility helped prompt the birth of Unicode.

3. Unicode

To be precise, Unicode is not an encoding format but a character set, one that contains essentially all the symbols in use in the world today. Originally, some characters could be represented in a single byte, that is, 8 bits; in the early Unicode design, the length of every character was unified to 16 bits, making characters fixed length. Written as code points, Unicode looks like this:

\u4f60\u597d\u4e2d\u56fd\uff01\u0068\u0065\u006c\u006c\u006f\uff0c\u0031\u0032\u0033

The code points above spell out "你好中国！hello，123" ("Hello China! hello, 123").
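
You can verify this in Python 3, where \u escapes denote code points directly (a quick sketch):

# Python3
s = '\u4f60\u597d\u4e2d\u56fd\uff01\u0068\u0065\u006c\u006c\u006f\uff0c\u0031\u0032\u0033'
print(s)               # 你好中国！hello，123
print(hex(ord(s[0])))  # 0x4f60, the code point of 你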

Regarding Unicode, all characters can be found on this website: https://unicode-table.com/en/

4. GB2312

When Chinese users got computers, Chinese characters needed an encoding. GB2312 builds on the ASCII table: values below 128 keep their original meanings, while two bytes above 127 are combined to represent one Chinese character. The first byte (called the high byte) runs from 0xA1 (161) to 0xF7 (247), 87 values; the second byte (the low byte) runs from 0xA1 (161) to 0xFE (254), 94 values. Together they give roughly 8,000 combinations, used to represent 6,763 simplified Chinese characters plus mathematical symbols, Roman letters, Japanese kana, and so on. The re-encoded digits, punctuation, and letters are two-byte codes called "full-width" characters, while the original characters below 128 in the ASCII table are called "half-width" characters. Simply put, GB2312 is an extension of ASCII for simplified Chinese.

GB2312 code table: http://www.fileformat.info/info/charset/GB2312/list.htm

5. GBK

In simple terms, GBK is a further extension of GB2312 (the K comes from kuo zhan, the pinyin for "extension"). It contains 21,886 Chinese characters and symbols and is fully compatible with GB2312.

6. GB18030

GB18030 contains 70,244 Chinese characters and symbols, making it more comprehensive still, and it is compatible with GB 2312-1980 and GBK. GB18030 supports the scripts of China's ethnic minorities and also covers traditional Chinese characters as well as Japanese and Korean Han characters. It is a variable-length encoding that uses one, two, or four bytes per character.
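
A small Python 3 sketch of the family's compatibility and byte widths (the sample characters are arbitrary):

# Python3
s = '中国'
print(s.encode('gb2312'))   # b'\xd6\xd0\xb9\xfa'
print(s.encode('gbk'))      # same bytes: GBK is a superset of GB2312
print(s.encode('gb18030'))  # same bytes again: GB18030 is a superset of GBK
ext = '\U00020000'          # a CJK Extension B character, outside GBK
print(len(ext.encode('gb18030')))  # 4 -> falls into GB18030's four-byte range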

7. UTF (UCS Transformation Format)

UTF is the most widely used implementation of Unicode on the Internet. The most common variant is UTF-8, which transmits data 8 bits at a time; there is also UTF-16.

In UTF-8, the same string "你好中国！hello，123" is stored as the following bytes (hexadecimal):

E4 BD A0  E5 A5 BD  E4 B8 AD  E5 9B BD  EF BC 81  68 65 6C 6C 6F  EF BC 8C  31 32 33
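
You can reproduce these bytes in Python 3 (bytes.hex() with a separator needs Python 3.8+):

# Python3
s = '你好中国！hello，123'
print(s.encode('utf-8').hex(' '))
# e4 bd a0 e5 a5 bd e4 b8 ad e5 9b bd ef bc 81 68 65 6c 6c 6f ef bc 8c 31 32 33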

8. Brief summary

  • Chinese users extended and adapted ASCII to produce the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters.
  • There are far more Chinese characters than that, including traditional forms and rarer characters, so the GBK encoding was created: it includes everything in GB2312 and adds a great deal more.
  • China is a multi-ethnic country, and each ethnic group has its own writing system; to represent those characters, GBK was extended further into the GB18030 encoding.
  • Every country, like China, encoded its own languages, so many different encodings appeared; without the corresponding encoding installed, a computer cannot interpret what the encoded text is meant to say.
  • Finally, an organization called ISO could not stand it any longer and, together with others, created UNICODE, a code large enough to hold every character and symbol in the world. As long as a computer supports UNICODE, a file saved in a UNICODE encoding can be interpreted normally by any other such computer, whatever language it contains.
  • For transmitting UNICODE over networks, two standards appeared, UTF-8 and UTF-16, transmitting 8 bits and 16 bits at a time respectively. Some people then ask: if UTF-8 can represent so many characters and symbols, why do so many people in China still use GBK and similar encodings? Because for Chinese text UTF-8 takes more bytes and therefore more space, so where most users are Chinese, GBK and similar encodings remain a workable choice.
