Thoroughly Understanding the Python Chinese Garbled-Text Problem

Foreword

The problem of garbled Chinese characters in Python troubled me for years. Every time garbled text appeared, I had to search the Internet for an answer; the immediate problem would get solved, but the next time it appeared I was confused all over again, because I never understood the underlying cause. To avoid the problem, some people even refuse to use Chinese in their code, writing all comments and prompts in English. I have done that myself, but that is not solving the problem, it is running away from it. Today we will solve Python's Chinese garbled-text problem once and for all.

Basic knowledge

ASCII

A long time ago, a group of people decided to use 8 transistors, each of which could be switched on or off, to combine into different states and represent everything in the world. They saw that 8 switch states were good, so they called this a "byte". Later they built machines that could process these bytes; the machines started up, the bytes combined into many states, and the states began to change. They saw that this was good, so they called the machine a "computer". At first, computers were used only in the United States. An 8-bit byte can form 256 (2 to the 8th power) different states. The 32 states numbered from 0 were reserved for special purposes: once these agreed-upon bytes were sent to a terminal or printer, some agreed-upon action had to be performed. On 0x0A the terminal moves to a new line; on 0x07 it beeps at you; on 0x1B the printer prints in reverse, or the terminal displays letters in color. They saw that this was good, so they called these byte states below 0x20 "control codes". They then assigned all spaces, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, up to number 127, so that computers could store English text in bytes. Everyone saw that this was good, so the scheme was called ASCII (American Standard Code for Information Interchange). All the computers in the world at that time used the same ASCII scheme to store English text.
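A minimal sketch of what the paragraph above describes, runnable under both Python 2 and 3: ASCII characters fit in exactly one byte each, and the control codes sit below 0x20.

```python
# Every ASCII character occupies exactly one byte.
ascii_bytes = u"Hello".encode('ascii')
print(len(ascii_bytes))      # 5 characters -> 5 bytes

# Control codes are the byte values below 0x20.
print(ord(u"\n"))            # line feed is 0x0A
print(ord(u"\x07") < 0x20)   # the bell character is a control code
```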

GB2312

Later, as with the Tower of Babel, computers spread all over the world, but many countries did not use English, and many of their letters were not in ASCII. To store their own scripts in the computer, they decided to use the space after number 127 to represent these new letters and symbols, and they also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, numbering all the way up to the last state, 255. The characters from 128 to 255 are called the "extended character set". From then on there were no new states left for greedy humanity to use. The Americans probably never imagined that people in third-world countries would also want to use computers! When the Chinese got computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored.

But this could not stump the Chinese. We bluntly discarded the strange symbols after number 127 and stipulated: a byte smaller than 128 keeps its original meaning, but two bytes larger than 127, taken together, represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE, which lets us combine more than 7,000 simplified Chinese characters. Into these codes we also packed mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation, and letters that already existed in ASCII were re-encoded as two-byte-long codes; these are the so-called "full-width" characters, while the ones below 128 are called "half-width" characters. The Chinese saw that this was good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
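A small sketch of the double-byte idea; the byte values below are the GBK bytes for 中文 that appear later in this article (GBK is the superset of GB2312 introduced in the next section):

```python
# In GBK each Chinese character is two bytes, while ASCII characters stay one byte.
gbk_bytes = u"中文".encode('gbk')
print(repr(gbk_bytes))            # b'\xd6\xd0\xce\xc4'
print(len(gbk_bytes))             # 2 characters -> 4 bytes
print(len(u"a中".encode('gbk')))  # 1 + 2 = 3 bytes
```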

GBK

But China has far too many Chinese characters, and we soon found that many people's names could not be typed, which was especially awkward for certain national leaders. So we had to keep finding unused code points in GB2312 and press them into service. Later even that was not enough, so the requirement that the low byte must come after number 127 was dropped: as long as the first byte is greater than 127, it marks the beginning of a Chinese character, whether or not the second byte falls in the extended character set. The expanded scheme was named the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new characters (including traditional Chinese characters) and symbols. Later, ethnic minorities also needed to use computers, so thousands of minority-script characters were added as well, and GBK was extended into GB18030.

From then on, the culture of the Chinese nation could be passed down in the computer age. Chinese programmers saw that this series of Chinese character encoding standards was good, so they called them "DBCS" (Double Byte Character Set). The defining feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. To support Chinese, programs therefore had to check the value of every byte in a string: if it was greater than 127, a character from the double-byte set was assumed to follow. In those days, every blessed computer monk who knew programming had to recite the following mantra hundreds of times a day: "One Chinese character counts as two English characters! One Chinese character counts as two English characters..."
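In a DBCS encoding such as GBK, "length" really means byte length, so a mixed string counts each Chinese character twice. A minimal illustration:

```python
# Under GBK, byte length mixes 2-byte Chinese and 1-byte English characters.
mixed = u"中a文b"
gbk = mixed.encode('gbk')
print(len(mixed))  # 4 characters
print(len(gbk))    # 2 + 1 + 2 + 1 = 6 bytes: "one Chinese character counts as two"
```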

At that time every country devised its own encoding standard, just as China did, and the result was that no one understood anyone else's encoding and no one supported anyone else's. Even mainland China and Taiwan, brother regions only 150 nautical miles apart and speaking the same language, adopted different DBCS schemes. Back then, to display Chinese on a computer, mainland users had to install a "Chinese character system" to handle the display and input of Chinese characters; but a fortune-telling program written by the "benighted feudal" folk in Taiwan would only run under a different system supporting BIG5 encoding, such as the ETen Chinese System. Install the wrong character system and the display turned into a mess! What then? And what about all the peoples who could not yet use computers at all? What about their scripts? It truly was the computer world's own Tower of Babel!

UNICODE

At this point the Archangel Gabriel arrived just in time: an international body called ISO (the International Organization for Standardization) decided to solve the problem. Their approach was simple: scrap all the regional encoding schemes and start over with a new encoding that includes every culture and every letter and symbol on earth! They called it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as unicode. By the time unicode was being drawn up, computer memory had grown enormously and space was no longer an issue, so ISO simply stipulated that all characters must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters of ASCII, unicode keeps the original code values unchanged but extends their length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded from scratch.

Since a "half-width" English symbol only needs the low 8 bits, its high 8 bits are always 0, so this generous scheme wastes twice the space when storing English text. Around this time, programmers from the old world noticed a strange phenomenon: their strlen function had become unreliable; a Chinese character was no longer equivalent to two characters but to one! Yes, starting with unicode, a half-width English letter and a full-width Chinese character are both uniformly "one character", and both are uniformly "two bytes". Note the difference between the terms "character" and "byte": a "byte" is an 8-bit physical storage unit, while a "character" is a culturally meaningful symbol. In unicode, one character is two bytes. The era of "one Chinese character counts as two English characters" was nearly over.
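The character-versus-byte distinction is easy to see in Python (shown here with Python 3 semantics, where a text string counts characters and a bytes object counts bytes):

```python
text = u"中文"
print(len(text))                      # 2 characters, regardless of encoding
print(len(text.encode('utf-16-le'))) # 4 bytes: two 16-bit code units
print(len(text.encode('utf-8')))     # 6 bytes: UTF-8 uses 3 bytes per character here
```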

But unicode is not perfect either; two problems remain. First, how can we distinguish unicode from ASCII? How does the computer know that several bytes together represent a single symbol, rather than several separate symbols? Second, we already know that English letters need only one byte. If unicode uniformly stipulated that each symbol be represented by three or four bytes, then every English letter would necessarily be preceded by two or three bytes of zeros, which is a huge waste of storage: text files would become two or three times larger, which is unacceptable.

UTF-8

For a long time unicode could not catch on, until the Internet appeared. To solve the problem of transmitting unicode over the network, many UTF (UCS Transfer Format) standards for transmission appeared. As the name suggests, UTF-8 transmits data 8 bits at a time, while UTF-16 transmits 16 bits at a time. UTF-8 is the most widely used implementation of unicode on the Internet. It is an encoding designed for transmission, and it makes encoding borderless, so that the characters of every culture in the world can be displayed. One of UTF-8's biggest features is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, the byte length varying with the symbol. When a character is in the ASCII range, it is represented by one byte, and the one-byte encodings of the ASCII characters are preserved as a subset. Note that unicode uses 2 bytes for a Chinese character, while UTF-8 uses 3 bytes. Going from unicode to UTF-8 is not a direct one-to-one copy; it requires some algorithms and rules to convert.
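The variable length is easy to verify in Python (the emoji line assumes a modern Python 3 build that handles characters outside the Basic Multilingual Plane):

```python
# UTF-8 byte lengths grow with the code point.
print(len(u"A".encode('utf-8')))   # 1 byte: ASCII stays ASCII
print(len(u"é".encode('utf-8')))   # 2 bytes
print(len(u"中".encode('utf-8')))  # 3 bytes
print(len(u"😀".encode('utf-8')))  # 4 bytes
```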

Having read this far, you are either thoroughly confused or suddenly enlightened. If you are thoroughly confused, I suggest reading it a few more times; review the old to learn the new. If you are suddenly enlightened, let's carry on.

Chinese garbled example explanation

With the basics covered, let's talk about how characters are stored in Python. First, look at an example of garbled output. Create a new file demo.py, stored in utf-8 format, with the following content.

s = "中文"
print s

Run python demo.py in cmd. What? I just want to print the two characters 中文 and it gives me an error. Outrageous!

[screenshot: cmd error message]

Quickly open the IDLE that comes with Python and try it: no problem at all. Why is that?

[screenshot: Python IDLE prints correctly]

Go back and look closely at the error cmd reported: Non-ASCII character '\xe4' in file demo.py on line 1, but no encoding declared. In other words, there is a non-ASCII character '\xe4' on the first line of demo.py, and no encoding has been declared. From the basics above, we know ASCII cannot represent Chinese characters. The first line of demo.py contains two Chinese characters, and the file is stored in utf-8 format, so 中文 is stored as \xe4\xb8\xad\xe6\x96\x87.

[screenshot: hexadecimal view of demo.py]

To view the hexadecimal bytes, you can use the HEX-Editor plug-in for notepad++; the repr function can also show the raw string, as follows.

# encoding:utf-8
import sys
print sys.getdefaultencoding()
s = "中文"
print repr(s)

[screenshot: repr output]

sys.getdefaultencoding() shows that Python's default encoding is ASCII, and ASCII does not know what \xe4 is, hence the error Non-ASCII character '\xe4' in file demo.py on line 1, but no encoding declared. The fix is simply to add # encoding:utf-8 at the top. Although it is a comment, Python sees it and knows that what follows should be treated as utf-8-encoded; since demo.py is also stored as utf-8, everything then works.
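For reference, the default encoding differs between interpreter generations: under Python 2 it is ascii (which is why the error above occurs), while Python 3 defaults to utf-8 and no longer needs the declaration. A quick check:

```python
import sys

# Python 2 reports 'ascii'; Python 3 reports 'utf-8'.
default = sys.getdefaultencoding()
print(default)
```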

# encoding:utf-8
s = "中文"
print s

You can also write the encoding declaration as # -*- coding: utf-8 -*-; any comment matching the regular expression ^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+) will do.
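We can check that both declaration styles satisfy the regular expression quoted above:

```python
import re

# The pattern quoted in this article for recognizing an encoding declaration.
pattern = r'^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)'

for line in ("# encoding:utf-8", "# -*- coding: utf-8 -*-"):
    m = re.match(pattern, line)
    print(m.group(1))  # both capture 'utf-8'
```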

Now let's run python demo.py under cmd again.

[screenshot: garbled Chinese in cmd]

What, what, what? It was supposed to show Chinese! Are you kidding me? Let's try it in Python IDLE.

[screenshot: Python IDLE displays correctly]

Why does the same file work in Python IDLE? There must be something wrong with cmd. Yes, that's what I thought too. So I tried entering Python's interactive mode under cmd and printing Chinese there, and it actually worked. Now I'm completely lost.

[screenshot: cmd interactive mode displays correctly]

Don't panic, let me analyze it. When printing characters in cmd or IDLE, the file's encoding is no longer what matters; what matters is the output environment, that is, the encoding of cmd or IDLE. Run cmd's encoding command chcp: it returns 936, and looking that up online shows that code page 936 means GBK. Now we roughly know the cause: demo.py is stored, and declared, as utf-8, but cmd's display encoding is GBK. The utf-8 bytes of 中文 are \xe4\xb8\xad\xe6\x96\x87; forced through a GBK decoder, every two bytes become one character, so the six bytes are decoded into three characters rather than two. Unfortunately those three characters, 涓枃, are not the characters we want, so we see them as mojibake. Then why does the Python interactive prompt under cmd work? Because when you type s = "中文" at the interactive prompt, the two characters arrive already in cmd's default encoding, GBK, so s actually holds GBK bytes. If you don't believe it, print repr(s): you get \xd6\xd0\xce\xc4, which is the GBK encoding of 中文, not the utf-8 bytes \xe4\xb8\xad\xe6\x96\x87. Stored as GBK and displayed as GBK, everything matches. Next, let me show how to print Chinese correctly when running a file under cmd.
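The byte-level mismatch can be reproduced without cmd at all (Python 3 shown; the byte values are the ones discussed above). Decoding bytes with the wrong codec either yields the wrong characters or fails outright:

```python
utf8_bytes = u"中文".encode('utf-8')
gbk_bytes = u"中文".encode('gbk')
print(repr(utf8_bytes))  # b'\xe4\xb8\xad\xe6\x96\x87'
print(repr(gbk_bytes))   # b'\xd6\xd0\xce\xc4'

# Decoding GBK bytes as utf-8 fails: 0xd6 starts a 2-byte UTF-8 sequence,
# but 0xd0 is not a valid continuation byte.
try:
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print("decode failed:", e)
```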

1. Store the demo.py file as GBK and declare the encoding as GBK

This method is rather clumsy: change demo.py to be stored as GBK, and declare the encoding as GBK as well. Not recommended.

# encoding:gbk
s = "中文"
print s
print repr(s)

[screenshot: GBK output]

2. Chinese is represented by unicode

Just add a small u mark in front of the Chinese string, and the string that follows is stored as unicode.

# encoding:utf-8
s = u"中文"
print s
print repr(s)

Under cmd, unicode characters can be printed, as follows.

[screenshot: unicode output]

3. Force-convert the Chinese to GBK or unicode encoding

Encodings can be converted into one another in Python, for example from utf-8 to gbk, but two encodings cannot be converted directly; you must go through unicode as an intermediate step. From the basics above, unicode is a character set, not an encoding, while utf-8 is an encoding that concretely implements the unicode idea. Converting utf-8 to unicode is a decoding process, done with decode, which decodes utf-8 into unicode.

# encoding:utf-8
s = "中文"
u = s.decode('utf-8')
print u
print type(u)
print repr(u)

[screenshot: unicode decode output]

To force-convert to gbk encoding: the previous step already converted utf-8 to unicode; going from unicode to gbk is an encoding process, done with encode.

# encoding:utf-8
s = "中文"
u = s.decode('utf-8')
g = u.encode('gbk')
print g
print type(g)
print repr(g)

[screenshot: gbk encode output]
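Putting the two steps together, in Python 3 terms (where str is already unicode and the byte strings are explicit bytes objects), the whole round trip looks like this:

```python
# utf-8 bytes -> unicode text -> gbk bytes, and back again.
utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'   # "中文" stored as utf-8
text = utf8_bytes.decode('utf-8')          # decode: bytes -> unicode
gbk_bytes = text.encode('gbk')             # encode: unicode -> bytes
print(repr(gbk_bytes))                     # b'\xd6\xd0\xce\xc4'
assert gbk_bytes.decode('gbk') == text     # lossless round trip
```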

To sum up: the Windows cmd window does not support utf-8. If you want to display Chinese there, you must convert to gbk or unicode, whereas Python IDLE supports all three encodings. Garbled Chinese is caused by inconsistent encodings: store as utf-8 but print as gbk, and you get mojibake. To keep characters from being garbled as much as possible, it is recommended to use unicode throughout.

decode (decoding)

Converting from another encoding to unicode is called decoding, done with decode. Its first parameter is the actual encoding of the string being decoded; if it is wrong, an error is reported. For example, s is utf-8-encoded, so decoding it as gbk raises an error.

# encoding:utf-8
s = "中文"
u = s.decode('gbk')
print u
print repr(u)

[screenshot: decode error]

Small tip
If you type s = "中文" directly at the Python IDLE or cmd interactive prompt, s is stored in gbk encoding; but if s = "中文" is written in a file stored in utf-8 format, then s is stored in utf-8 encoding. This subtle difference is a pit I have stepped in myself: the same code ran fine in Python IDLE but failed when run as a file.

encode (encoding)

You cannot convert directly from utf-8 to gbk; you must go through unicode in between. This is very important: the original string being encoded must be unicode, otherwise an error is reported.
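Python 3 makes this rule structural: a bytes object simply has no encode method, so you are forced to decode to text first. A sketch:

```python
utf8_bytes = u"中文".encode('utf-8')

# Raw bytes cannot be re-encoded directly; there is no bytes.encode in Python 3.
print(hasattr(utf8_bytes, 'encode'))   # False

# The correct path: decode to unicode text, then encode to the target.
gbk_bytes = utf8_bytes.decode('utf-8').encode('gbk')
print(repr(gbk_bytes))                 # b'\xd6\xd0\xce\xc4'
```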

raw_input

raw_input obtains a value typed by the user, and that value's encoding depends on the current runtime environment. For example, cmd's default encoding is gbk, so Chinese characters typed there arrive gbk-encoded, regardless of demo.py's file encoding and encoding declaration.

# encoding:utf-8
s = raw_input("input something: ")
print s
print type(s)
print repr(s)

[screenshot: raw_input output in gbk]

GBK encodes a Chinese character in two bytes, while UTF-8 usually encodes a Chinese character in three bytes.
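We can verify the byte counts, and simulate what raw_input returns under a GBK console by decoding GBK bytes directly:

```python
# Byte counts per Chinese character in the two encodings.
print(len(u"中".encode('gbk')))    # 2 bytes
print(len(u"中".encode('utf-8')))  # 3 bytes

# What a GBK console hands to raw_input when the user types 中文:
console_bytes = b'\xd6\xd0\xce\xc4'
print(console_bytes.decode('gbk'))  # 中文
```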

Careful readers will have noticed that my raw_input prompt was in English. Change it to Chinese and look: it really does come out garbled.

# encoding:utf-8
s = raw_input("请输入中文汉字:")
print s
print type(s)
print repr(s)

[screenshot: garbled raw_input prompt]

What to do? Force the prompt string into gbk encoding and it works fine; unicode and utf-8 both fail.

# encoding:utf-8
s = raw_input(u"请输入中文汉字:".encode('gbk'))
print s
print type(s)
print repr(s)

[screenshot: raw_input prompt displays correctly]

Equality trap

The same text "中文" stored under different encodings is not equal: the utf-8-encoded bytes and the gbk-encoded bytes of "中文" are different.

[screenshot: the two strings compare not equal]
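The trap, shown directly: the byte strings differ even though they represent the same text, and they only compare equal after both are decoded back to unicode:

```python
utf8_s = u"中文".encode('utf-8')
gbk_s = u"中文".encode('gbk')
print(utf8_s == gbk_s)                                # False: different byte sequences
print(utf8_s.decode('utf-8') == gbk_s.decode('gbk'))  # True: same text
```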

Summary

That was a lot in one breath; I hope it made sense. If you want to avoid garbled characters, just remember the following five rules.

  1. Store the file in utf-8 format and declare the encoding as utf-8 with # encoding:utf-8
  2. Add u in front of wherever Chinese characters appear
  3. There is no direct conversion between different encodings; you must jump through unicode in between
  4. utf-8 encoding is not supported under cmd
  5. The raw_input prompt string can only be gbk-encoded
