Understanding character encoding in Python

Before digging into the origins of character encoding and its role in Python, it is worth clarifying a few basic concepts. We touch, and even use, some of them every day without necessarily understanding them: byte, character, character set, character code, and character encoding.

Byte

A byte is an abstract unit of computer measurement: a group of 8 binary digits (0s and 1s) makes up 1 byte (1 Byte = 8 bits). The byte is the basic unit of data storage in a computer.

All data in a computer, whether stored in a file on disk or transmitted over the network (text, pictures, video, audio), is composed of bytes.

Character

A character is also an abstract concept: a unit of information, and a general term for letters, words, and symbols of all kinds. For example, an English letter is a character, a Chinese character is a character, and a punctuation mark is also a character.

Character set

A character set is a collection of characters within a certain range. Different character sets contain different numbers of characters. For example, the ASCII character set has 128 characters in total, covering English letters, Arabic numerals, punctuation marks, and control characters. The GB2312 character set defines 7445 characters, including most common Chinese characters.

Character code

A character code (code point) is the numerical index of a character within its character set. For example, the ASCII character set uses the 128 consecutive numbers 0-127 to represent its 128 characters; the character code of "A" is 65. (Strictly speaking, this is binary data such as 01000001; the ASCII table records each character's binary value as a decimal number to make it easier for people to read.)
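In Python, the built-in ord() and chr() functions expose this character-to-code-point mapping directly (a quick sketch):

    # ord() maps a character to its code point; chr() maps it back.
    print(ord('A'))       # 65
    print(chr(65))        # 'A'
    print(bin(ord('A')))  # '0b1000001', i.e. 01000001 when padded to 8 bits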

Character encoding

Character encoding is a concrete scheme for mapping the character codes of a character set to byte streams. Common character encodings include ASCII, UTF-8, and GBK. In a sense, a character set and a character encoding correspond to each other; for example, the ASCII character set corresponds to the ASCII encoding. ASCII encodes every character in the lower 7 bits of a single byte. For example, the code of "A" is 65, which is 0x41 as a single byte, so when written to a storage device it is the byte 01000001.

Encoding and decoding

Encoding is the process of converting characters into a byte stream; decoding is the process of parsing a byte stream back into characters.
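In Python 3, str.encode() and bytes.decode() perform exactly these two steps (a minimal sketch):

    # Encoding: characters -> byte stream; decoding: byte stream -> characters.
    text = 'ABC'
    data = text.encode('ascii')   # b'ABC', a byte stream
    print(data.decode('ascii'))   # 'ABC', characters again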

The evolution of character encoding

1. ASCII code

We know that inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol: 256 symbols in total, from 00000000 to 11111111.

The computer was invented in the United States. In the English-speaking world the commonly used characters are very limited: 26 letters (upper and lower case), 10 digits, punctuation marks, and control characters. One byte of storage is more than enough to represent all of them.

The American National Standards Institute (ANSI) developed a character code that uniformly specified the relationship between English characters and binary values. This is ASCII (American Standard Code for Information Interchange), and it is still in use today.

The ASCII code defines encodings for 128 characters in total. For example, the space character "SPACE" is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the last 7 bits of a byte; the first bit is always 0.

2. Non-ASCII encoding

128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry diacritical marks that cannot be represented in ASCII. As a result, some European countries decided to encode new symbols using the highest bit of the byte, which ASCII leaves idle. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries could represent up to 256 symbols.

However, a new problem arises. Different countries use different letters, so even though they all use a 256-symbol encoding, the same values represent different letters. For example, 130 represents é in a French encoding, but the letter gimel (ג) in a Hebrew encoding, and yet another symbol in a Russian encoding. In any case, in all of these encodings the symbols 0-127 are the same; only the range 128-255 differs.
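This can be reproduced in Python using the legacy DOS code pages cp437, cp862, and cp866 as stand-ins for the French, Hebrew, and Russian encodings the text alludes to (a small sketch):

    # One byte value, three different characters under three legacy code pages.
    b = bytes([130])            # 0x82, binary 10000010
    print(b.decode('cp437'))    # 'é' (Western European, DOS)
    print(b.decode('cp862'))    # 'ג' (Hebrew, DOS)
    print(b.decode('cp866'))    # 'В' (Cyrillic, DOS)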

As for the writing systems of Asian countries, they use far more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it could represent up to 256 x 256 = 65536 symbols.
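Python's gb2312 codec shows the two-byte representation directly, using the character 严 as an example (a quick sketch):

    # A Chinese character occupies two bytes in GB2312.
    data = '严'.encode('gb2312')
    print(data)        # b'\xd1\xcf'
    print(len(data))   # 2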

3. Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; interpreting it with the wrong encoding produces garbled text. Why do emails so often appear garbled? Because the sender and the recipient use different encodings.

It is conceivable that if there were one encoding that incorporated all the symbols in the world, with each symbol given a unique character code, the problem of garbled text would disappear. This is Unicode: as its name implies, an encoding of all symbols.

Unicode is, first of all, a coded character set. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS, which can also be read as short for "Unicode Character Set".

UCS specifies how to use multiple bytes to represent various characters. How these codes are transmitted is specified by the UTF (UCS Transformation Format) specifications. Common UTF specifications include UTF-8, UTF-7, and UTF-16.

Unicode comes in two forms: UCS-2 and UCS-4. UCS-2 encodes each character in two bytes (16 bits), so in theory it can represent at most 65536 characters. But 65536 code points are far from enough to represent all the characters in the world; Chinese characters alone number close to 100,000. The Unicode 4.0 specification therefore defines a set of supplementary characters: UCS-4 uses 4 bytes (actually only 31 bits are used; the highest bit must be 0) and can, in theory, cover the symbols of every language.

For the specific symbol correspondence tables, you can consult unicode.org, or dedicated Chinese character tables.

Unicode acts like a translator between humans and computers: it translates the characters that humans understand into numeric character codes that computers can process.

4. Limitations of Unicode

It should be noted that Unicode is only a symbol set: it specifies only the binary code of each symbol, not how that code should be stored.

For example, the Unicode code of the Chinese character "严" (yán, "strict") is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101), so representing this symbol requires at least 2 bytes. Other, larger symbols may require 3 or 4 bytes, or even more.

The first question: how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols?

The second problem: when a Unicode character is transmitted over a network or stored, requiring two bytes for every character is wasteful. We already know that English letters (the characters covered by ASCII) need only one byte each. If Unicode stipulated that every character code take at least two bytes, every English letter would carry at least one zero byte, a great waste of storage: text files would become two or three times larger, which is unacceptable.
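The waste in the second problem is easy to measure in Python, using the utf-16-le codec as a stand-in for raw two-byte UCS-2 storage (a small sketch):

    # Pure ASCII text: 2 bytes per character in UCS-2, 1 byte in UTF-8.
    text = 'hello'
    print(len(text.encode('utf-16-le')))  # 10 -- every other byte is 0
    print(len(text.encode('utf-8')))      # 5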

These two problems had two consequences: 1) there are multiple storage formats for Unicode, that is, many different binary formats that can represent the Unicode character set; 2) Unicode could not gain wide adoption for a long time, until the advent of the Internet.

5. UTF-8

Unicode arrived together with the rise of computer networks, and how Unicode should be transmitted over a network also had to be considered. Many UTF (UCS Transformation Format) standards for transmission therefore appeared: UTF-8 transmits data 8 bits at a time, and UTF-16 transmits 16 bits at a time. For reliability of transmission, there is no direct byte-for-byte correspondence from Unicode to UTF; the conversion requires certain algorithms and rules.

The popularity of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the web. To repeat: UTF-8 is one way of implementing Unicode.

One of UTF-8's biggest features is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, and the length varies from symbol to symbol.

The encoding rules of UTF-8 are very simple, there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding is identical to the ASCII code.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits are filled with the Unicode code of the symbol, padded at the front with 0s.

The following table summarizes the encoding rules; the letter x marks the bits available for the code point.

Unicode code point range (hex)    UTF-8 byte sequence (binary)
0000 0000 - 0000 007F             0xxxxxxx
0000 0080 - 0000 07FF             110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF             1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Next, take the Chinese character "严" as an example to demonstrate how UTF-8 encoding works.

We know that the Unicode code of "严" is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of "严", fill in the x positions from back to front, and pad the remaining positions with 0. The UTF-8 encoding of "严" is therefore "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.

A minimal Python sketch can verify this, filling in the bits by hand and comparing the result with the built-in UTF-8 codec:
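    # Encode U+4E25 by hand, following the three-byte rule from the table.
    cp = ord('严')                             # 0x4E25
    b1 = 0b11100000 | (cp >> 12)               # 1110xxxx: top 4 bits of the code
    b2 = 0b10000000 | ((cp >> 6) & 0b111111)   # 10xxxxxx: middle 6 bits
    b3 = 0b10000000 | (cp & 0b111111)          # 10xxxxxx: lowest 6 bits
    print(bytes([b1, b2, b3]).hex())           # 'e4b8a5'
    print('严'.encode('utf-8').hex())          # 'e4b8a5' -- matches the codec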

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. The conversion between them can be done programmatically.
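In Python, for instance, the round trip is one line in each direction (a minimal sketch):

    # Code point -> UTF-8 bytes, and back.
    print(hex(ord('严')))                           # '0x4e25'
    print('严'.encode('utf-8').hex())               # 'e4b8a5'
    print(bytes.fromhex('e4b8a5').decode('utf-8'))  # '严'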

On the Windows platform, the easiest way to convert is the built-in Notepad program, Notepad.exe. Open the file, click "Save As" on the "File" menu, and a dialog box pops up with an "Encoding" drop-down at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default encoding. For English files this means ASCII; for Simplified Chinese files it means GB2312 (on the Simplified Chinese edition of Windows; the Traditional Chinese edition uses Big5).

2) Unicode here refers to the UCS-2 encoding, that is, storing each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian encoding corresponds to the previous option. I'll explain what little endian and big endian mean in the next section.

4) UTF-8 encoding, which is the encoding method mentioned in the previous section.

After selecting the "encoding method", click the "Save" button, and the encoding method of the file will be converted immediately.

7. Little endian and Big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character "严" as an example: its Unicode code is 4E25, which needs two bytes of storage, one byte 4E and the other 25. Storing 4E first and 25 second is the big endian method; storing 25 first and 4E second is the little endian method.
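Python's int.to_bytes() shows both byte orders (a quick sketch):

    # The same 16-bit value, stored in each byte order.
    cp = 0x4E25
    print(cp.to_bytes(2, 'big').hex())     # '4e25' -- big endian
    print(cp.to_bytes(2, 'little').hex())  # '254e' -- little endian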

These two odd names come from Jonathan Swift's Gulliver's Travels. In the book, a civil war breaks out in Lilliput over whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). Over this matter six wars were fought, one emperor lost his life, and another lost his throne.

Hence the names: storing the most significant byte first is "big endian", and storing the least significant byte first is "little endian".

So naturally, a question arises: how does the computer know which byte order a given file uses?

The Unicode specification defines that a character indicating the byte order may be placed at the very beginning of each file. This character is named "ZERO WIDTH NO-BREAK SPACE" and has the code FEFF: exactly two bytes, with FF one greater than FE. (This marker is commonly called the byte order mark, or BOM.)

If the first two bytes of a text file are FE FF, the file is big endian; if they are FF FE, it is little endian.
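A sketch of this check in Python, using the BOM constants from the standard codecs module (utf16_byte_order is a hypothetical helper name):

    import codecs

    # Sniff the byte order of a UTF-16 file from its first two bytes.
    def utf16_byte_order(path):
        with open(path, 'rb') as f:
            bom = f.read(2)
        if bom == codecs.BOM_UTF16_BE:   # b'\xfe\xff' -> big endian
            return 'big endian'
        if bom == codecs.BOM_UTF16_LE:   # b'\xff\xfe' -> little endian
            return 'little endian'
        return 'unknown (no BOM)'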

8. Examples

Here is an example.

Open the "Notepad" program Notepad.exe, create a new text file, the content is a "strict" word, and save it in ANSI, Unicode, Unicode big endian and UTF-8 encoding.

Then, use the hex-view feature of the text editor UltraEdit to observe the internal encoding of each file.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of "严". This also implies that GB2312 stores the high byte first, big endian style.

2) Unicode: the file is four bytes, "FF FE 25 4E". "FF FE" indicates little endian storage, and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25". "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate UTF-8 encoding (the UTF-8 BOM), and the last three bytes "E4 B8 A5" are the actual encoding of "严"; its storage order is the same as the encoding order.
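The whole Notepad experiment can be reproduced in Python without Notepad; the BOM constants come from the standard codecs module (a sketch):

    import codecs

    ch = '严'
    # 1) "ANSI" on Simplified Chinese Windows, i.e. GB2312:
    print(ch.encode('gb2312').hex())                             # 'd1cf'
    # 2) "Unicode": little endian UCS-2 with a BOM:
    print((codecs.BOM_UTF16_LE + ch.encode('utf-16-le')).hex())  # 'fffe254e'
    # 3) "Unicode big endian":
    print((codecs.BOM_UTF16_BE + ch.encode('utf-16-be')).hex())  # 'feff4e25'
    # 4) UTF-8 with a BOM:
    print(ch.encode('utf-8-sig').hex())                          # 'efbbbfe4b8a5'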

