Understanding Encoding Issues in Python

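Encoding has always been a headache in Python 2.x. These notes record what I have worked out from reading online and from my own experiments, for future reference.

Python first appeared before Unicode was in widespread use, so the Python 2 interpreter's default encoding is ASCII.

Studying Python 2's encoding problems is useful only as a way to learn how encoding works. The main focus should still be Python 3.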

Note:

str in Python 3 corresponds to unicode in Python 2
bytes in Python 3 corresponds to str in Python 2
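As a small illustration of the mapping above, the following Python 3 sketch converts between the two types. The sample text and the variable names are chosen here purely for illustration.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "中文"                      # str (what Python 2 called unicode)
encoded = text.encode("utf-8")     # bytes (what Python 2 called str)
decoded = encoded.decode("utf-8")  # back to str

print(type(text).__name__, type(encoded).__name__)
print(encoded)
print(decoded == text)
```

Encoding always goes from text to bytes and decoding from bytes to text; most of the problems discussed in these notes come from doing one of the two steps with the wrong encoding.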

In the terminal

First, the encoding-related functions in Python

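Addendum:

When the built-in open() opens a file in text mode (Python 3) without an explicit encoding, encoding=None and the encoding returned by the function below is used.

In Python 3, if you do not specify an encoding in open(), you can instead open the file in rb mode to read raw bytes and then convert them to str with decode().

locale.getpreferredencoding() # encoding used for text data; open() uses it

The official documentation for this function points out that the result is only a guess: "so this function only returns a guess."

The platform's default encoding is taken as the file's encoding. On a Chinese-locale Windows system the default is cp936, which is essentially GBK.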
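The two ways of reading a file described above can be sketched as follows in Python 3. This is a minimal example that assumes a temporary file written as GBK; the file and the variable names are illustrative, not part of the original notes.

```python
import locale
import os
import tempfile

text = "中文"

# Create a scratch file and write the text as GBK.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Option 1: read raw bytes, then decode explicitly.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("gbk")

    # Option 2: let open() decode, with the encoding stated explicitly.
    with open(path, encoding="gbk") as f:
        via_open = f.read()
finally:
    os.remove(path)

# What open() would have used if no encoding had been given.
print(locale.getpreferredencoding(False))
print(raw, decoded, via_open)
```

Stating the encoding explicitly keeps the result independent of the platform default that locale.getpreferredencoding() guesses.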

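import sys, locale
print sys.getdefaultencoding()      # Python's default encoding, used by the interpreter
print sys.getfilesystemencoding()   # encoding used for file names
print locale.getdefaultlocale()     # current OS locale as (language code, encoding); a superset of what getpreferredencoding() reports
print sys.stdin.encoding            # input encoding, used by input()
print sys.stdout.encoding           # output encoding, used by print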

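The input and output encodings should agree with the encoding of the source file and with Python's default encoding; mismatches between them are what produce garbled text.

In the Windows cmd or shell, the output is: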

ascii
mbcs
('zh_CN', 'cp936')
cp936
cp936

In a Linux terminal, the output is:

ascii
UTF-8
('zh_CN', 'UTF-8')
UTF-8
UTF-8
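For comparison, a rough Python 3 counterpart of the inspection code above might look like this. It is only a sketch: apart from the default encoding, which Python 3 always reports as utf-8, the values depend on the platform and on how the script is launched, so no output is shown.

```python
import locale
import sys

# Collect the same kinds of encodings that the Python 2 snippet prints.
encodings = {
    "default": sys.getdefaultencoding(),
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(False),
    # stdin/stdout may be replaced or missing in some environments.
    "stdin": getattr(sys.stdin, "encoding", None),
    "stdout": getattr(sys.stdout, "encoding", None),
}

for name, value in encodings.items():
    print(name, value)
```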

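Different operating systems and editors use many different encodings, of course, but the Python 2 interpreter's own default encoding is ASCII and does not change with the system.

The role of the interpreter's default encoding is simple: wherever no encoding is specified, for example when a str is implicitly converted to or from unicode, ASCII is used.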
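The effect of an ASCII default can be reproduced in Python 3 by naming the ascii codec explicitly. The sketch below assumes a UTF-8 byte string and mirrors what Python 2 does implicitly when a non-ASCII str meets a unicode object.

```python
data = "中文".encode("utf-8")  # non-ASCII bytes

error = None
try:
    # Python 2 performs this decode implicitly with the default encoding.
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__)
print(data.decode("utf-8"))  # decoding with the right codec succeeds
```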

To change the interpreter's default encoding from ASCII to UTF-8 (this works only in Python 2):

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

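This changes the Python 2 interpreter's default encoding, so implicit conversions no longer fail on Chinese text. (reload(sys) is needed because site.py deletes sys.setdefaultencoding at startup.)

In an interactive Python session (cmd or a terminal) this takes effect at once, because code is executed line by line as it is typed: once the lines above have run, UTF-8 is the default encoding for everything entered afterwards. The terminal's input and output encodings must also be UTF-8.

The Windows shell uses GBK, so there it is sys.setdefaultencoding("gbk") that avoids encoding errors.

Garbled output on Windows is most often caused by overlooking that the terminal's encoding is GBK. The nature of Python 2's str type also plays a part.

In a Linux terminal: sys.setdefaultencoding("utf-8")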

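In an editor

Most of the time, though, code is written in an editor as a .py file and then run, and there the approach above is not enough on its own.

A .py file is not simply read and executed one line at a time: the whole file is compiled first (this is also when the .pyc file is produced). So even if the file contains sys.setdefaultencoding("utf-8"), that call has not yet run when the source is being read, and it does not control how the source file is decoded anyway. Without an encoding declaration, Python 2 reads the source as ASCII, and as soon as it meets a Chinese comment or string it reports: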

SyntaxError: Non-ASCII character '\xe8' in file d:/xiangou/python/tian_jin/test.py on line 8, but no
encoding declared; see http://www.python.org/peps/pep-0263.html for details

This is the non-ASCII character error.

The error message also includes a URL,

http://www.python.org/peps/pep-0263.html

which describes the solution:

Defining the Encoding
Python will default to ASCII as standard encoding if no other encoding hints are given.

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:

# coding=<encoding name>
or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
More precisely, the first or second line must match the following regular expression:

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)
The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration. If the first line matches the second line is ignored.

To aid with platforms such as Windows, which add Unicode BOM marks to the beginning of Unicode files, the UTF-8 signature \xef\xbb\xbf will be interpreted as 'utf-8' encoding as well (even if no magic encoding comment is given).

If a source file uses both the UTF-8 BOM mark signature and a magic encoding comment, the only allowed encoding for the comment is 'utf-8'. Any other encoding will cause an error.

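In short, the fix described there is to add a line such as

# -*- coding: utf-8 -*-

at the top of the .py file (first or second line). Any encoding known to Python may be declared, and the comment does not have to take exactly this form, as long as it matches the regular expression

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)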
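To see which comment styles satisfy that regular expression, it can be tried directly with Python's re module. This is a minimal sketch; the helper name declared_encoding is hypothetical and is not how the interpreter itself is implemented.

```python
import re

# The pattern quoted from PEP 263.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding named in a magic comment, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

print(declared_encoding("# -*- coding: utf-8 -*-"))
print(declared_encoding("# coding=gbk"))
print(declared_encoding("# vim: set fileencoding=latin-1 :"))
print(declared_encoding("x = 1  # coding: utf-8"))  # a statement precedes the comment
```

The last line does not match because the pattern requires the comment to start the line, which agrees with the rule that no Python statement may share the line with the declaration.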

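The line # -*- coding: utf-8 -*- declares that the contents of the file are encoded in UTF-8. The declared encoding must match the encoding the editor actually uses when it saves the file, so the editor should be set to UTF-8 as well. Note that this concerns the file on disk; the value of

print sys.stdin.encoding

describes terminal input and is a separate setting. When the interpreter reads the declaration on the first (or second) line, it decodes the rest of the file as UTF-8, so Chinese characters later in the file no longer cause an error. That is all the declaration does, however. The Python 2 interpreter's default encoding is still ASCII: even with # -*- coding: utf-8 -*- in place,

print sys.getdefaultencoding()

still prints ascii.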
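The same separation between source encoding and default encoding can be observed in Python 3, where the default is utf-8 rather than ascii. The sketch below assumes a source held as GBK bytes and compiles it from memory; the file name passed to compile() is a placeholder.

```python
import sys

# Source bytes as an editor would save them in GBK:
# the string literal is "中文" encoded in GBK.
source = b"# -*- coding: gbk -*-\ns = '\xd6\xd0\xce\xc4'\n"

namespace = {}
exec(compile(source, "<gbk_example>", "exec"), namespace)

# The declaration told the parser how to decode the literal...
print(namespace["s"] == "中文")
# ...but it did not touch the interpreter-wide default encoding.
print(sys.getdefaultencoding())
```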

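When writing a crawler, the Chinese text in a fetched page sometimes comes out garbled. One possible cause is that the page's Content-Type does not declare an encoding, so the bytes end up being decoded with a default encoding (ASCII in Python 2) instead of the page's real one.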
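One way to guard against this is to read the charset from the Content-Type value and fall back to an explicit choice when it is missing. The following Python 3 sketch uses the standard library's email.message.Message to parse the header; the helper names and the utf-8 fallback are assumptions made for illustration, and a real crawler may also need to inspect the page's meta tags.

```python
from email.message import Message

def charset_from_content_type(content_type, fallback="utf-8"):
    """Extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or fallback

def decode_body(body, content_type, fallback="utf-8"):
    """Decode response bytes using the declared charset, or the fallback."""
    charset = charset_from_content_type(content_type, fallback)
    return body.decode(charset, errors="replace")

body = "中文".encode("gbk")  # pretend this came from a GBK page
print(decode_body(body, "text/html; charset=GBK"))
print(charset_from_content_type("text/html"))  # no charset declared
```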

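str in Python 2

In Python 2 a str is a sequence of bytes, and string operations (concatenation, replacement and so on) work on those bytes. A str literal, Chinese or English, holds the bytes exactly as they appear in the source file, in the encoding the file was saved in. The default encoding (ASCII unless changed) comes into play when a str is implicitly converted to or from unicode. On output, the bytes of a str are written as they are and the terminal interprets them in its own encoding; sys.stdout.encoding is what a unicode object is encoded with when it is printed.

This is why, on Windows, Chinese text can still come out garbled when a .py file is run in the terminal even though its first line declares #-*-coding:utf-8-*-: the UTF-8 bytes are displayed by a terminal that expects GBK.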
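The mismatch between UTF-8 bytes and a GBK terminal can be simulated in Python 3 by decoding the bytes with the wrong codec. This is a small sketch with illustrative variable names; errors="replace" is used so that undecodable bytes do not raise.

```python
text = "中文"

# Bytes as a UTF-8 source file would contain them.
utf8_bytes = text.encode("utf-8")

# What a terminal expecting GBK makes of those bytes.
shown_in_gbk_terminal = utf8_bytes.decode("gbk", errors="replace")

# Bytes produced in the terminal's own encoding display correctly.
gbk_bytes = text.encode("gbk")
shown_correctly = gbk_bytes.decode("gbk")

print(shown_in_gbk_terminal == text)
print(shown_correctly == text)
```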

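If the first line declares an encoding (UTF-8 here):

print("中文")    # str: the UTF-8 bytes from the source file, written as they are

print("Hello World")    # str: pure ASCII bytes, identical in UTF-8

print(u"中文")    # unicode object, encoded with sys.stdout.encoding on output

If the first line does not declare an encoding (ASCII is assumed), taking each statement as a file of its own:

print("中文")    # error: non-ASCII bytes in an undeclared source file

print("Hello World")    # fine, the source is pure ASCII

print(u"中文")    # error as well: the source still contains non-ASCII bytes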

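Only Unicode strings are universal. A byte string, whatever encoding it is in, may show up as garbage wherever a different encoding is expected. Therefore:

Wherever Chinese appears in a literal, prefix it with u if at all possible.

Use the built-in functions unicode() and unichr(), which can be regarded as the Unicode versions of str() and chr().

Do not use str(); use unicode() instead.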
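A short Python 3 sketch shows why working on Unicode text is safer than working on encoded bytes: lengths and slices of text count characters, while the same operations on bytes can cut a character in half. The sample string is arbitrary.

```python
text = "中文abc"
raw = text.encode("utf-8")

print(len(text))  # number of characters
print(len(raw))   # number of bytes

first_char = text[:1]  # a whole character
cut = raw[:1]          # only the first byte of a three-byte character

try:
    cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(first_char, cut_is_valid)
```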

Note:

str in Python 3 corresponds to unicode in Python 2
bytes in Python 3 corresponds to str in Python 2