Python Unicode Transcoding Explained

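1. Converting between unicode and ordinary str strings

In Python 2, text is represented internally as unicode, so unicode usually serves as the intermediate form when converting between encodings: first decode the string from its original encoding into unicode, then encode the unicode into the target encoding.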

unicodestring = u"Hello world"

"decode": decoding

Converts an ordinary Python string into unicode.

str->unicode: unicode(b, "utf-8") or b.decode("utf-8")

# utf8string and asciistring are byte strings (str); the results are unicode objects
unicodestring1 = unicode(utf8string, "utf-8")
unicodestring2 = unicode(asciistring, "ascii")

"encode": encoding

Converts unicode into an ordinary Python string.

unicode->str: a.encode("utf-8")

utf8string = unicodestring.encode("utf-8")  
asciistring = unicodestring.encode("ascii")

So before transcoding, first work out which encoding the str is in, decode it to unicode, and only then encode it into the other encoding.
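
The decode-then-encode rule above can be sketched as a small helper. This is a minimal illustration, not a standard function: the name transcode and its parameters are assumptions made for this example, and the input is assumed to really be encoded in the stated source encoding.

```python
# Sketch: convert a byte string from one encoding to another via unicode.
# `transcode` is a hypothetical helper name, not part of any library.

def transcode(data, src_encoding, dst_encoding):
    """Decode data from src_encoding, then encode it to dst_encoding."""
    text = data.decode(src_encoding)    # byte string -> unicode
    return text.encode(dst_encoding)    # unicode -> byte string

# Two Chinese characters (U+4E2D U+6587), first encoded as GBK.
gbk_bytes = u"\u4e2d\u6587".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(repr(utf8_bytes))
```

If the source encoding is guessed wrongly, the decode step either raises UnicodeDecodeError or silently produces the wrong characters, which is why the encoding of the input has to be known first.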

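The default encoding of a str literal in source code is the same as the encoding of the source file itself.

Taking "utf-8" as an example: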

unicode->str: a.encode("utf-8")
str->unicode: unicode(b, "utf-8") or b.decode("utf-8")

The code is as follows:

#coding=utf-8
a = u'你好'
b = a.encode("utf-8")
c = unicode(b, "utf-8")
print a, type(a)
print b, type(b)
print c, type(c)
print c + u"hahaha", type(c + u"hahaha")

Output:
你好 <type 'unicode'>
你好 <type 'str'>
你好 <type 'unicode'>
你好hahaha <type 'unicode'>

s = u'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'
print s

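Output: 人生苦短，py是岸

2. Checking whether a string is unicode

isinstance(s, unicode)  # checks whether s is a unicode object

Calling encode on a str that is not a unicode object can raise an error: Python 2 first decodes the str implicitly with the default ascii codec, which fails with UnicodeDecodeError if the str contains non-ASCII bytes.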
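
One way to avoid that error is to check the type before converting. The sketch below is only an illustration: to_unicode and text_type are hypothetical names introduced here, and the default of "utf-8" is an assumption about the input data.

```python
# Sketch: decode only when the value is not already unicode.
# `to_unicode` and `text_type` are hypothetical names for this example.

text_type = type(u"")   # the unicode text type of the running interpreter

def to_unicode(value, encoding="utf-8"):
    """Return value unchanged if it is unicode, otherwise decode it."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)

already_text = to_unicode(u"abc")                    # returned as-is
from_utf8 = to_unicode(b"\xe4\xb8\xad")              # decoded as UTF-8
from_gb2312 = to_unicode(b"\xd6\xd0", "gb2312")      # decoded as gb2312
```

After this step every value is unicode, so a later encode call no longer triggers the implicit ascii decode.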

3. Getting the system default encoding

#!/usr/bin/env python
#coding=utf-8
import sys
print sys.getdefaultencoding()

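On English Windows XP this program prints: ascii

4. Garbled console output

In some IDEs, printed strings always come out garbled or even raise errors. The cause is that the IDE's output console cannot display the string's encoding; the program itself is not at fault.

For example, run the following code in UliPad: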

s=u"中文"
print s

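This reports: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). The reason is that UliPad's console output window on English Windows XP writes its output using the ascii encoding (the default encoding on an English system is ascii), while the string in the code above is unicode, so the output step fails.

If the last line is changed to print s.encode('gb2312'), the two characters "中文" are displayed correctly. If it is changed to print s.encode('utf8'), the output is \xe4\xb8\xad\xe6\x96\x87, which is what the console output window shows when it outputs a utf8-encoded string as ascii.

unicode(str, 'gb2312') and str.decode('gb2312') are equivalent: both convert a gb2312-encoded str into unicode.

str.__class__ shows whether an object is a str or a unicode object. It reports the type only, not which encoding a str's bytes are in.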
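
The byte sequences behind these cases can be inspected directly. The following sketch also passes the "replace" error handler to encode, which substitutes "?" for characters the target encoding cannot represent instead of raising UnicodeEncodeError. The helper name encode_for_console is an assumption made for this example.

```python
# Sketch: encode the same unicode text for consoles with different encodings.
# `encode_for_console` is a hypothetical helper name for this example.

def encode_for_console(text, encoding):
    """Encode unicode text, replacing unencodable characters with '?'."""
    return text.encode(encoding, "replace")

s = u"\u4e2d\u6587"   # the two characters used in the UliPad example

ascii_bytes = encode_for_console(s, "ascii")     # both characters become '?'
gb_bytes = encode_for_console(s, "gb2312")       # two bytes per character
utf8_bytes = encode_for_console(s, "utf-8")      # three bytes per character

print(repr(ascii_bytes))
print(repr(gb_bytes))
print(repr(utf8_bytes))
```

Replacing characters loses information, so this is only suitable for display. Data that will be stored or transmitted should be encoded with an encoding that can represent it.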