A thorough understanding of character encoding in Python, in simple terms



Question 1: Where is the problem?

A concrete problem gives our study a target. If we study without a problem in mind, we will not grasp the key points.
The environment used in this article is CentOS 6.7 with Python 2.7.

Type python in the shell to open the Python interactive interpreter, then enter the following two statements:

s = "中国zg"
e = s.encode("utf-8")

Now the question is: will this code work?


The answer is no. The following error is reported:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Note the 0xe4 in the error message; it is the key clue for our analysis.

Many people have run into this error. That raises the next question.



Question 2: Why?

To find the reason, let us walk carefully through how these two statements execute. First, we typed 中国zg into the Python interactive interpreter from the keyboard, wrapped it in double quotes, and assigned it to the variable s. It looks commonplace, doesn't it? In fact, a lot is going on underneath.

Keyboard input reaches a program through the operating system. The 中国zg we see on the screen is the operating system's feedback to us humans, telling us: "Hi buddy, you entered the characters 中国zg into the program."

What does the operating system hand to the program? A bit string, that is, a sequence of 0s and 1s. What does this bit string look like, and how is it generated?
The operating system encodes 中国zg with its own default encoding and sends the encoded bit string to the program.

The default encoding of the CentOS system we use is UTF-8, so once you know the UTF-8 encoding of each character in 中国zg, you know what the bit string is.
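The encodings involved can be inspected from Python itself. The following is a minimal sketch using only the standard library; the `encoding_report` name is made up for illustration, and the values it returns depend on the machine and the Python version (on Python 2.7, `sys.getdefaultencoding()` is normally `ascii`).

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that matter for this discussion."""
    return {
        # Encoding preferred by the operating system's locale
        # (UTF-8 on the CentOS machine described above).
        "locale": locale.getpreferredencoding(False),
        # Python's own default string encoding.
        "python_default": sys.getdefaultencoding(),
        # Encoding of terminal input; may be None when stdin is not a terminal.
        "stdin": getattr(sys.stdin, "encoding", None),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```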

Looking them up gives the following encodings (in hexadecimal and binary):

| Character | 中 | 国 | z | g |
|:----------|:--:|:--:|:--:|:--:|
| Hex | E4B8AD | E59BBD | 7A | 67 |
| Binary | 11100100 10111000 10101101 | 11100101 10011011 10111101 | 01111010 | 01100111 |
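The table can be reproduced rather than looked up. This is a small sketch, with `utf8_table` as a hypothetical helper name; it encodes each character to UTF-8 and formats the resulting bytes in hexadecimal and binary.

```python
def utf8_table(text):
    """Return (character, hex, binary) for each character's UTF-8 bytes."""
    rows = []
    for ch in text:
        data = bytearray(ch.encode("utf-8"))  # iterating a bytearray yields ints
        hex_form = "".join("%02X" % b for b in data)
        bin_form = " ".join(format(b, "08b") for b in data)
        rows.append((ch, hex_form, bin_form))
    return rows


# \u4e2d and \u56fd are the two Chinese characters from the example string.
rows = utf8_table(u"\u4e2d\u56fdzg")
for ch, hex_form, bin_form in rows:
    print("%s  %s" % (hex_form, bin_form))
```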

Now we know what the bit string passed from the operating system to the program looks like. What does the program do with it?

The program sees that the bit string is surrounded by double quotes, so it knows the bits represent a string. That string is then assigned to s.

That is the execution logic of the first statement.

Now for the second statement.

e = s.encode("utf-8") means: encode the string s as UTF-8 and assign the result to e. The problem is that the program has the bit string in s and knows it represents a string, but which encoding does that bit string use?


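We must know the existing encoding of the bit string before we can parse out the characters in it, and only then can we re-encode them with a new encoding such as UTF-8. The operating system only handed the program the bit string; it did not tell the program which character encoding the bit string uses.

At this point, Python falls back on its own default encoding as the encoding of s in order to interpret the contents of s. In Python 2 this default is ASCII, so it interprets the bit string as ASCII to recover the characters, and then converts those characters to UTF-8.

The first byte the program meets is E4 (11100100), and it is stumped: ASCII has no such byte, because the first bit of every ASCII byte is 0.

What can it do?
Report an error, which is the error we saw above.
The 0xe4 in the error message is the first byte of the UTF-8 encoding of the character "中".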
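The implicit step can be written out by hand. This sketch assumes, as explained above, that Python 2 decodes the byte string with ASCII before it encodes; the bytes literal stands in for what the terminal delivered (on Python 2.7 a `b"..."` literal is an ordinary str).

```python
# -*- coding: utf-8 -*-
raw = b"\xe4\xb8\xad\xe5\x9b\xbdzg"  # UTF-8 bytes of 中国zg

failure = None
try:
    # What s.encode("utf-8") effectively attempts first in Python 2.
    raw.decode("ascii").encode("utf-8")
except UnicodeDecodeError as err:
    # Keep the details: the codec, the offending byte, and its position.
    failure = (err.encoding, bytearray(raw)[err.start], err.start)

print(failure)
```

The offending byte and position recorded here correspond to the 0xe4 and "position 0" in the error message.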


Question 3: How?

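Now that we know where the problem lies, how do we solve it?

Clearly, if we just tell the program that the bit string in s is encoded as UTF-8, the program should work correctly.

But this solution has a drawback: it is not general enough.

Suppose I have a program that reads many text files, each in a different encoding. Would it have to keep track of the encoding of every file it reads? That is tedious.

Furthermore, if the contents of these files need to be compared or concatenated with one another while their encodings differ, wouldn't that be even more trouble?

How does Python solve this problem neatly?

Very simply: decode!

decode means this: you have a string and you know its encoding. Decode the string with that encoding, and Python recognizes the characters in it and builds an array of ints that stores the Unicode code point of each character.

If every string is treated this way, strings obtained from any source have the same representation while the program runs, and can be conveniently operated on together.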
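As a small illustration of why a common representation helps, the sketch below pretends that the same two characters arrived from two files saved in different encodings. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
text = u"\u4e2d\u56fd"  # 中国

# Pretend these bytes came from two files saved in different encodings.
from_utf8_file = text.encode("utf-8")
from_gbk_file = text.encode("gbk")

# The raw bytes differ, so comparing them directly is meaningless.
bytes_equal = from_utf8_file == from_gbk_file

# After decoding each with its own encoding, both have the same form.
a = from_utf8_file.decode("utf-8")
b = from_gbk_file.decode("gbk")
decoded_equal = a == b
joined = a + b  # concatenation is now safe as well

print(bytes_equal, decoded_equal, len(joined))
```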


The int array described above is wrapped by Python in an object, namely a unicode object.
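To see the numbers behind a decoded string, one can list the code point of each character with `ord`. This is only a way of viewing the content; it does not claim anything about how the interpreter lays the object out in memory.

```python
# -*- coding: utf-8 -*-
raw = b"\xe4\xb8\xad\xe5\x9b\xbdzg"  # UTF-8 bytes of 中国zg

decoded = raw.decode("utf-8")
code_points = [ord(ch) for ch in decoded]        # one int per character
as_hex = ["U+%04X" % cp for cp in code_points]   # conventional notation

print(as_hex)
```

Eight bytes go in and four characters come out, because each Chinese character occupies three bytes in UTF-8.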


Question 4: How do we get it done?

Next, enter the following two lines in the Python interactive interpreter:

e = s.decode("utf-8")
isinstance(e, unicode)

The output is True, which means that the e returned by decode is indeed a unicode object.
Here unicode is a class, a built-in type in Python 2.

e is called a unicode string: it stores the Unicode code point of each character and is not tied to any byte encoding.

We can then encode e into any encoding. For example, both of the following work:

e.encode("utf-8")
e.encode("gbk")

This works as long as the chosen encoding can represent the characters in e; if it cannot, an error is reported.
For example, try this:

e.encode("ascii")

Since ASCII cannot encode the two Chinese characters, a UnicodeEncodeError is raised.
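The three encode calls can be tried side by side. In this sketch `try_encode` is a hypothetical helper that returns None instead of letting the error propagate, so that all three outcomes are visible at once.

```python
# -*- coding: utf-8 -*-
e = b"\xe4\xb8\xad\xe5\x9b\xbdzg".decode("utf-8")  # 中国zg as text


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


results = dict((name, try_encode(e, name)) for name in ("utf-8", "gbk", "ascii"))
for name in sorted(results):
    print("%s: %r" % (name, results[name]))
```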

So far we have met two kinds of errors, a decode error and an encode error, and seen how to resolve both.


Question 5: How to evaluate Python's approach to character encoding?

First of all, this approach is very simple. Any text is decoded once when it enters the program and becomes a unicode object, which stores the Unicode code point of each character as an int. Whenever text is to be output, it is encoded again into whatever encoding we need.
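The decode-on-input, encode-on-output pattern can be sketched as a single function. The name `transcode` and the processing step (strip and upper-case) are placeholders chosen for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode on the way in, work on text, encode on the way out."""
    text = data.decode(source_encoding)   # bytes -> text at the input boundary
    text = text.strip().upper()           # all processing happens on text
    return text.encode(target_encoding)   # text -> bytes at the output boundary


# Input arrives as GBK bytes; output is requested as UTF-8 bytes.
gbk_input = u"  \u4e2d\u56fdzg\n".encode("gbk")
utf8_output = transcode(gbk_input, "gbk", "utf-8")
print(repr(utf8_output))
```

Only the two boundary lines need to know about encodings; the processing in between does not.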


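The question is: isn't it a waste of space to represent every character with an int? After all, in ASCII an English character takes only one byte.

It does cost some space, but memory today is large enough, and we use this representation only inside the program. When a string is written to a file or sent over the network, we encode it appropriately.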
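The space trade-off can be estimated roughly. The sketch below uses the UTF-32 encoding as a stand-in for "one fixed-size int per character"; that is an assumption made for illustration only, since the actual in-memory layout of a unicode object depends on the Python version and build.

```python
# -*- coding: utf-8 -*-

def byte_sizes(text):
    """Return (UTF-8 size, UTF-32 size) in bytes for the same text."""
    # "utf-32-be" writes exactly four bytes per character and no byte order mark.
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))


english = byte_sizes(u"zg")
chinese = byte_sizes(u"\u4e2d\u56fd")  # 中国
print(english, chinese)
```

Under this assumption the overhead is largest for plain English text and much smaller for Chinese text.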


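There is one more question: what about strings hard-coded in the program? Do we have to decode them every time we use them? Different operating systems use different default encodings: on Linux we usually need to decode with UTF-8, while on (Chinese-language) Windows we usually need to decode with GBK. Our code would then run only on a particular platform.

Python gives us a very simple way out: put a u in front of the string literal, and Python completes the decode for us automatically. In the interactive interpreter it uses the terminal's encoding; in a source file it uses the encoding declared at the top of the file (for example # -*- coding: utf-8 -*-).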
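A short check of what the u prefix buys us, assuming the source file is saved as UTF-8 and says so in its first line: the literal should equal the result of decoding the same bytes by hand.

```python
# -*- coding: utf-8 -*-
# The line above declares the encoding of this source file.

literal = u"中国zg"  # decoded by Python when the file is read
manual = b"\xe4\xb8\xad\xe5\x9b\xbdzg".decode("utf-8")  # decoded by hand

same = literal == manual
length = len(literal)
print(same, length)
```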


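Question 6: To sum up, what have we learned?

Starting from a very common error, this article analyzed the encoding problem in Python in detail. We have seen how simply Python handles characters, which also helps explain why Python is so capable at text processing.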


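Quiz: see whether you have really understood.

Suppose a Linux machine has a file a.txt whose content is the two characters "中文", encoded as UTF-8.

Now we write the following statements in a Python program: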

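import codecs

s = ""
# Read the first line of a.txt, decoding it from UTF-8 into a unicode object.
with codecs.open("a.txt", encoding="utf-8") as f:
    s = f.readline().strip()

# Write s to b.txt through a plain file object, with no encoding specified.
with open("b.txt", "w") as f:
    f.write(s)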

Will this code work? Why?

Answer: No!

s is a unicode object, and Python 2 implicitly encodes it when writing it out. The default ASCII encoding cannot encode the two characters of "中文", so a UnicodeEncodeError is raised.
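One possible fix, sketched under the assumption that b.txt should also be UTF-8 (the quiz does not say), is to name the encoding explicitly when writing. This sketch uses `io.open` rather than the quiz's `codecs.open`, because `io.open` accepts an encoding argument on both Python 2.7 and Python 3; the helper name `copy_first_line` and the temporary directory are illustrative only.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def copy_first_line(src_path, dst_path, encoding="utf-8"):
    """Decode the first line of src_path, then encode it explicitly into dst_path."""
    with io.open(src_path, "r", encoding=encoding) as f:
        s = f.readline().strip()
    with io.open(dst_path, "w", encoding=encoding) as f:
        f.write(s)
    return s


# Build a stand-in for a.txt containing the UTF-8 bytes of 中文.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
dst = os.path.join(workdir, "b.txt")
with open(src, "wb") as f:
    f.write(b"\xe4\xb8\xad\xe6\x96\x87\n")

copied = copy_first_line(src, dst)
with open(dst, "rb") as f:
    written = f.read()
print(repr(written))
```

Because the encoding is stated at both ends, nothing is left to the interpreter's default.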

