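Python Encoding Problems: UnicodeDecodeError (Part 1)

Encoding problems are among the most frustrating parts of using Python 2. If you are reading this article, you are probably struggling with one right now.

The main reason Python encoding problems are hard is that the terminology is confusing. On top of that, code that only handles simple characters usually works without trouble, so most people never give the issue a thought. Then one day they have to process characters that ASCII cannot represent, and they run head-first into a brick wall...

If you have just hit the Python 2 encoding wall, here are three ideas that will help you understand strings and unicode better: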

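1. The str type is bytes, not strings!

The first step toward solving a Unicode problem is to stop thinking of <type 'str'> as a string in the usual sense (a sequence of human-readable characters). Instead, start thinking of <type 'str'> as a container for bytes: a str object stores a byte sequence.

To help this sink in, look at the string literals in your code. Every time you see 'abc', "abc", or """abc""", tell yourself: "That is a sequence of 3 bytes, the ASCII encodings of the letters a, b, and c."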
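
To check this for yourself, here is a minimal sketch that looks at the bytes behind a literal. It uses a b'' literal and bytearray so that it should behave the same on Python 2.6+ and on Python 3 (on Python 2, b'abc' is simply a str); the variable names are only illustrative.

```python
# On Python 2, b'abc' and 'abc' are the same type (str). The b prefix just
# makes the intent explicit and keeps this snippet valid on Python 3.
data = b'abc'

# bytearray exposes each byte as an integer on both Python 2 and Python 3.
byte_values = list(bytearray(data))

print(len(data))     # number of bytes, not number of characters
print(byte_values)   # the ASCII codes of a, b, c: 97, 98, 99
```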

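2. unicode is the type that represents strings

The second step toward solving a Unicode problem is to use <type 'unicode'> as the container for your strings.

To begin with, this means putting the u prefix in front of your string literals, which creates unicode objects. Plain quotes without the prefix create str objects.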
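
As a small illustration of the difference, the sketch below compares a u'' literal with its UTF-8 encoded form. The escape \xe9 stands for the accented e so that no source-encoding declaration is needed; the u prefix is also accepted on Python 3.3+, where it is a no-op.

```python
name_text = u'Jos\xe9'                   # Unicode object: 4 characters
name_bytes = name_text.encode('utf-8')   # byte sequence: 5 bytes

# The character count and the byte count differ, because the accented e
# takes two bytes in UTF-8.
print(len(name_text))    # 4
print(len(name_bytes))   # 5
```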

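3. UTF-8, UTF-16, and UTF-32 are serialization formats, not Unicode

UTF-8 is an encoding, just like ASCII, and its output is bytes. The difference is that UTF-8 can represent any Unicode character, while ASCII can only represent the 128 code points from 0 to 127. A Unicode object, by contrast, is just itself: it has not been encoded. You can think of a Unicode object as an abstract representation of text, while ASCII, UTF-8, UTF-16, and UTF-32 are ways of encoding that text as bytes.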
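
The following sketch makes the distinction concrete by serializing one and the same Unicode character with several codecs. The explicit little-endian codec names ('utf-16-le', 'utf-32-le') are chosen here only so that no byte order mark is added to the output.

```python
char = u'\xe9'   # one Unicode character: U+00E9, e with acute accent

# The same abstract character, serialized three different ways.
encodings = {}
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    encodings[codec] = char.encode(codec)

# ASCII has no byte for this character, so encoding fails.
try:
    char.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

for codec in sorted(encodings):
    print(codec, repr(encodings[codec]))
print('ascii can encode it:', ascii_ok)
```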

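OK, but why can't I use str for strings?

Because a str carries an implicit encoding, and that implicit encoding (or an attempt to decode the bytes with the wrong one) is the root of most Unicode problems in Python 2.

What do I mean by encoding here? It is the sequence of bits used to represent our characters. For example, the string "abc" is actually stored as 01100001 01100010 01100011 (ASCII codes 97 98 99).

But there are other ways to represent "abc". In UTF-8 it looks exactly the same as in ASCII, because the two encodings agree on every character in the ASCII range, which includes the basic Latin letters. In UTF-16 it becomes 0000000001100001 0000000001100010 0000000001100011.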
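
The bit patterns above can be reproduced with a short helper. This is only a sketch: bits() is a hypothetical function written for this article, and 'utf-16-be' is used on the assumption that the UTF-16 pattern shown is big-endian with no byte order mark.

```python
def bits(data, width=8):
    """Render a byte sequence as space-separated groups of `width` bits."""
    step = width // 8
    raw = bytearray(data)          # integers on both Python 2 and Python 3
    groups = []
    for i in range(0, len(raw), step):
        value = 0
        for byte in raw[i:i + step]:
            value = (value << 8) | byte   # combine bytes, big-endian
        groups.append(format(value, '0{}b'.format(width)))
    return ' '.join(groups)

ascii_bits = bits(u'abc'.encode('ascii'))
utf8_bits = bits(u'abc'.encode('utf-8'))
utf16_bits = bits(u'abc'.encode('utf-16-be'), width=16)

print(ascii_bits)
print(utf8_bits)
print(utf16_bits)
```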

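Encoding matters whenever text has to leave your program: you must encode it to write it to a file, send it over the network, store it in a database, and so on. If the bytes you send are in a different encoding from the one the other side expects, you get Unicode errors.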
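
Here is a minimal sketch of what a mismatch looks like in practice. The "sender" and "receiver" are just two lines of the same script, and the variable names are illustrative.

```python
original = u'Jos\xe9'
wire_bytes = original.encode('utf-8')    # the sender encodes as UTF-8

# Receiver assumes Latin-1: no error is raised, but the text is silently
# corrupted (two wrong characters appear in place of the accented e).
garbled = wire_bytes.decode('latin-1')

# Receiver assumes ASCII: decoding fails outright.
try:
    wire_bytes.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Receiver uses the same codec as the sender: the text survives intact.
recovered = wire_bytes.decode('utf-8')

print(repr(garbled))
print(ascii_error)
print(recovered == original)
```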

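The str type is the main source of encoding problems in Python 2.7. What makes it hard to reason about is that the encoding of a str is implicit: the only way to find out what it is is to try decoding the byte sequence and see whether an error is raised. Unfortunately, there are many places where these bytes get decoded implicitly, which causes a great deal of confusion. Here are some examples: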
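
"Try decoding and see" can be written down as a small helper. The sketch below is a hypothetical illustration, not a reliable detector: a successful decode only shows that the bytes are valid in that codec, not that the codec is the one the author used (Latin-1, for instance, accepts any byte sequence).

```python
def first_working_codec(data, candidates=('ascii', 'utf-8')):
    """Return the first codec name that decodes `data` without error.

    Returns None if every candidate fails.
    """
    for codec in candidates:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        return codec
    return None

print(first_working_codec(b'Bob'))             # plain ASCII bytes
print(first_working_codec(b'Jos\xc3\xa9'))     # UTF-8 bytes
print(first_working_codec(b'\xff'))            # valid in neither candidate
```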

# Set up the variables we'll use
>>> uni_greeting = u'Hi, my name is %s.' #unicode object
>>> utf8_greeting = uni_greeting.encode('utf-8')

>>> uni_name = u'José'  # Note the accented e.
>>> utf8_name = uni_name.encode('utf-8')

# Plugging a Unicode into another Unicode works fine
>>> uni_greeting % uni_name
u'Hi, my name is Jos\xe9.'

# Plugging UTF-8 into another UTF-8 string works too
>>> utf8_greeting % utf8_name
'Hi, my name is Jos\xc3\xa9.'

# You can plug Unicode into a UTF-8 byte sequence...
>>> utf8_greeting % uni_name  # UTF-8 invisibly decoded into Unicode; note the return type
u'Hi, my name is Jos\xe9.'

# But plugging a UTF-8 string into a Unicode doesn't work so well...
>>> uni_greeting % utf8_name  # Invisible decoding doesn't work in this direction.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Unless you plug in ASCII-compatible data, that is.
>>> uni_greeting % u'Bob'.encode('utf-8')
u'Hi, my name is Bob.'

# And you can forget about string interpolation completely if you're using UTF-16.
>>> uni_greeting.encode('utf-16') % uni_name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unsupported format character '' (0x0) at index 33

# Well, you can interpolate utf-16 into utf-8 because these are just byte sequences
>>> utf8_greeting % uni_name.encode('utf-16')  # But this is a useless mess
'Hi, my name is \xff\xfeJ\x00o\x00s\x00\xe9\x00.'

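The examples above show many of the problems with the str type. Implicit decoding during string operations, combined with the implicit encoding carried by str, hides serious problems for as long as every character you handle is ASCII. Then one day a non-ASCII character (Chinese included) shows up, and things break badly.

The solution: the Unicode "airlock"

The best way to solve this problem, as with many things in Python, is to be explicit. That means every string in your code must be clearly identified as either Unicode or a byte sequence.

The most systematic way to do this is to turn your code into a Unicode-only "clean room". In other words, your code uses Unicode internally, and you can even add checks for the unicode type at key places. Then you put Unicode "airlocks" at the entry points of your code, to make sure that every byte sequence trying to get in has been wrapped in Unicode first. (The point is to leave no opening for any implicit ASCII conversion.)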
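
One possible shape for such an airlock is sketched below. Both airlock() and greet() are hypothetical names invented for this illustration, and the default of UTF-8 is an assumption about the incoming data; the text_type trick is there only so the sketch runs on Python 2 and Python 3 alike.

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3

def airlock(value, encoding='utf-8'):
    """Hypothetical entry-point helper: only Unicode text gets through."""
    if isinstance(value, text_type):
        return value                          # already Unicode
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)  # decode explicitly, never implicitly
    raise TypeError('expected text or bytes, got %r' % type(value))

def greet(name):
    # Type check inside the clean room, to catch bytes that slipped through.
    assert isinstance(name, text_type), 'greet() expects Unicode'
    return u'Hi, my name is %s.' % name

# Bytes coming from outside pass through the airlock before use.
greeting = greet(airlock(b'Jos\xc3\xa9'))
print(repr(greeting))
```

The design choice here is that decoding happens in exactly one place, so an encoding mistake shows up at the boundary instead of deep inside a string operation.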

For example:

import codecs

with open('file.txt') as f:  # BAD--gives you bytes
    data = f.read()
with codecs.open('file.txt', encoding='utf-8') as f:  # GOOD--gives you Unicode
    text = f.read()

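This approach may sound cumbersome, but it is actually simple, and many well-known Python libraries already follow the same principle.

Airlock construction kit (useful Unicode tools)

Almost every Unicode problem can be solved by applying these tools properly. They help you build the airlocks that keep the inside of your code clean: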
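
As a runnable variant of the file example, the sketch below swaps in io.open, which also takes an encoding argument and is available on Python 2.6+ (on Python 3 it is the same function as the built-in open). The temporary path is created only for the demonstration.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# Airlock on the way out: the file object encodes Unicode to UTF-8 for us.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Jos\xe9')

# Airlock on the way in: the file object hands back Unicode, not bytes.
with io.open(path, encoding='utf-8') as f:
    text = f.read()

# Binary mode shows the raw bytes that are actually stored on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```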

encode(): Unicode -> bytes
decode(): bytes -> Unicode
codecs.open(encoding='utf-8'): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
u'': Makes your string literals into Unicode objects rather than byte sequences.

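Note: do not call encode() on bytes, and do not call decode() on Unicode. In Python 2 both calls trigger an implicit ASCII conversion first, which is exactly where the surprising errors come from.

Troubleshooting

The key to troubleshooting a Unicode error is knowing what types your data has. Once you know that, try the following steps:

1. If some of your variables are byte sequences rather than Unicode objects, convert them to Unicode with decode() or u'' before working with them. For example: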
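
Put together, the tools give a simple round trip: decode on the way in, work on Unicode, encode on the way out. A minimal sketch, assuming the incoming bytes are UTF-8:

```python
raw_in = b'caf\xc3\xa9'            # UTF-8 bytes arriving from outside

text = raw_in.decode('utf-8')      # decode(): bytes -> Unicode
shout = text.upper()               # all processing happens on Unicode
raw_out = shout.encode('utf-8')    # encode(): Unicode -> bytes

print(repr(text))
print(repr(raw_out))
```

Because upper() runs on Unicode rather than on raw bytes, the accented character is uppercased as one character instead of being treated as two unrelated bytes.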

>>> uni_greeting % utf8_name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
# Solution:
>>> uni_greeting % utf8_name.decode('utf-8')
u'Hi, my name is Jos\xe9.'

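2. If all of your variables are byte sequences, the encodings probably do not match. Convert them to Unicode with decode() or u'' and try again.

3. If all of your variables are already Unicode, then some part of the code may not know how to handle Unicode objects. Either fix that code, or encode the Unicode objects to byte sequences before handing them over (and decode anything that comes back into Unicode again):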

>>> with open('test.out', 'wb') as f:
...     f.write(uni_name)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
# Solution:
>>> with open('test.out', 'wb') as f:
...     f.write(uni_name.encode('utf-8'))
...
# Better Solution:
>>> import codecs
>>> with codecs.open('test.out', 'w', encoding='utf-8') as f:
...     f.write(uni_name)
...
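
Step 2 above (everything is a byte sequence, but the encodings disagree) has no transcript of its own, so here is a small sketch of that situation. The two names and the choice of Latin-1 for the second one are made up for the illustration.

```python
utf8_part = u'Jos\xe9'.encode('utf-8')        # UTF-8 encoded bytes
latin1_part = u'Ren\xe9e'.encode('latin-1')   # Latin-1 encoded bytes

# Gluing the raw bytes together produces data that is not valid UTF-8
# (and that reads as garbage if decoded as Latin-1).
mixed = utf8_part + b' & ' + latin1_part
try:
    mixed.decode('utf-8')
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False

# Decoding each piece with its own codec first gives clean Unicode.
names = utf8_part.decode('utf-8') + u' & ' + latin1_part.decode('latin-1')

print(mixed_is_valid_utf8)
print(repr(names))
```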

Other tips:

Python 3 is more explicit about encodings: string literals are Unicode by default, and byte sequences are stored in a separate data type, bytes.

Good luck.

Reference: https://www.azavea.com/blog/2014/03/24/solving-unicode-problems-in-python-2-7/
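
A quick way to see the difference is to mix the two types. The sketch below assumes it is run on a Python 3 interpreter, where the mix is rejected; on Python 2 the same line would succeed through an implicit ASCII decode.

```python
# Try to concatenate a byte sequence with a Unicode string.
try:
    mixed = b'Hi, ' + u'Jos\xe9'
    outcome = 'implicit conversion'   # what Python 2 would do
except TypeError:
    outcome = 'TypeError'             # Python 3 refuses to guess an encoding

# The explicit version works on both Python 2 and Python 3.
explicit = b'Hi, '.decode('ascii') + u'Jos\xe9'

print(outcome)
print(repr(explicit))
```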