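Python Programming, Lesson 7, Chapter 4: Unicode and Strings

The first half hour of this class had a technical problem (no audio), so only about one hour of material was actually covered.

Similarities and differences between Python 2 and Python 3 in Unicode support

VII. Python 2.7 and Unicode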

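1. str and unicode

Python 2 has two types for representing strings. One is the byte string (a sequence of bytes, i.e. an 8-bit string), which holds 8-bit text and binary data. The other is the Unicode string (a sequence of Unicode code points, i.e. Unicode text), whose literal takes a u prefix (s=u'abc'). The former has type str, the latter has type unicode.

Example: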
>>>s='abc'
>>>type(s)
<type 'str'>
>>>s=u'abc'
>>>type(s)
<type 'unicode'>
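
For comparison, here is a rough sketch of the same distinction under Python 3 (an assumption: the notes themselves use Python 2.7). In Python 3 the roles are renamed: str holds Unicode text and bytes holds 8-bit data. The variable names are made up for illustration.

```python
# Python 3 sketch (assumption: run under Python 3, not the 2.7 used in these notes).
text = 'abc'     # str: a sequence of Unicode code points (Python 2: u'abc')
data = b'abc'    # bytes: a sequence of 8-bit values (Python 2: 'abc')

print(type(text))
print(type(data))

# The two types never compare equal, even when the content looks the same.
same = (text == data)
print(same)

# Indexing shows the difference: str yields characters, bytes yields integers.
first_char = text[0]
first_byte = data[0]
print(first_char, first_byte)
```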

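2. Converting between strings and Unicode code points

unichr(), chr():

Each takes an integer (which can be read as a Unicode code point) and returns the corresponding one-character string: unichr() returns a Unicode string, chr() returns an 8-bit string (byte string). Their argument ranges differ (chr() accepts only 0 to 255), and the output of unichr() carries an extra u prefix.

Example: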
>>>unichr(97)
u'a'
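>>>unichr(128)
u'\x80'                    this character is not printable, so its hex value is shown;
                           \x means the next two digits are hexadecimal

>>>chr(255)
'\xff'                     255 is the upper limit: chr(256) raises ValueError
>>>unichr(256)
u'\u0100'                  \u plays the same role as the U+ notation seen earlier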

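ord():

Takes a Unicode string or an 8-bit string containing exactly one character and returns an integer (the code point; for values below 128 this is the same as the ASCII code).
*In real work you rarely need to call these three functions by hand; they are quite low-level.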
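
As a point of comparison, a small Python 3 sketch of the same conversions (assumption: Python 3, where unichr() no longer exists and chr() covers the whole Unicode range). bytes([n]) is used here as the nearest analogue of Python 2's chr(); the variable names are illustrative only.

```python
# Python 3 sketch: chr() takes over the job of Python 2's unichr().
a = chr(97)          # 'a'
latin = chr(256)     # U+0100, beyond what a single byte can hold
back = ord(latin)    # ord() is the inverse of chr()

# A single byte still holds only 0..255.
one_byte = bytes([255])
try:
    bytes([256])
    overflow = False
except ValueError:
    # Same kind of range error that Python 2's chr(256) reports.
    overflow = True

print(a, repr(latin), back, one_byte, overflow)
```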

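3. encode vs. decode

Converting a Unicode string (a sequence of Unicode code points) into a byte string is called encoding, and the rule followed during the conversion is the encoding (for example UTF-8). Converting a byte string (a sequence of bytes) back into a Unicode string is called decoding.

*From now on, your programs should handle only Unicode strings internally. All input from outside the program should be decoded from byte strings into Unicode strings before processing, and after processing the Unicode strings should be encoded into the byte sequence used by the target output platform.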
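
The "decode on the way in, encode on the way out" rule above might look roughly like this in Python 3 (assumption: Python 3 syntax; the function name shout and its parameters are hypothetical, chosen only for this sketch).

```python
def shout(raw_input_bytes, input_encoding='utf-8', output_encoding='utf-8'):
    """Hypothetical helper: decode at the boundary, work on text, encode on exit."""
    text = raw_input_bytes.decode(input_encoding)   # bytes -> Unicode text
    processed = text.upper()                        # all processing happens on text
    return processed.encode(output_encoding)        # Unicode text -> bytes

# 'caf\xe9' is "cafe" with an accented e (U+00E9).
incoming = 'caf\xe9'.encode('utf-8')
outgoing = shout(incoming)
outgoing_latin1 = shout(incoming, output_encoding='latin-1')

print(outgoing)
print(outgoing_latin1)
```

The processing step never sees bytes, so it does not need to know which encoding the input or the output uses.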

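4. The decoding function: unicode()

Takes a byte string, returns a Unicode string.

unicode(object[,encoding[,errors]])
*>>>unicode(chr(128)) raises UnicodeDecodeError: when no encoding is given, the default ASCII encoding is used, and 128 is outside the ASCII range

*How the three values of the errors argument, 'strict', 'replace' and 'ignore', differ:
'strict'   a decoding failure raises UnicodeDecodeError (a subclass of ValueError)
'replace'  each undecodable byte is replaced with the character U+FFFD
'ignore'   undecodable bytes are silently dropped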

Example:
>>>unicode('\x80abc',errors='strict')      raises an error, because 0x80 is 128 in decimal, outside ASCII
>>>unicode('\x80abc',errors='replace')
u'\ufffdabc'
>>>unicode('\x80abc',errors='ignore')
u'abc'
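
The same three error modes can be tried in Python 3 with bytes.decode() (assumption: Python 3, where the unicode() builtin is gone and decoding is a method on bytes). A minimal sketch:

```python
raw = b'\x80abc'   # 0x80 = 128, not valid ASCII

# 'strict' is the default: the bad byte raises UnicodeDecodeError.
try:
    raw.decode('ascii')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

replaced = raw.decode('ascii', errors='replace')   # bad byte becomes U+FFFD
ignored = raw.decode('ascii', errors='ignore')     # bad byte is dropped

# UnicodeDecodeError is a subclass of ValueError.
is_value_error = isinstance(strict_error, ValueError)

print(repr(replaced), repr(ignored), is_value_error)
```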

Example:
>>>unicode('abcdef')
u'abcdef'
>>>s=unicode('abcdef')
>>>type(s)
<type 'unicode'>
>>>chr(127)
'\x7f'
>>>unicode(chr(127))
u'\x7f'

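There are many different encodings; they are not recorded here. Whatever encoding was used to encode must be used again to decode, otherwise errors come easily, unless the two encodings happen to be compatible for the data involved.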
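
A short Python 3 sketch of what goes wrong when the two sides disagree (assumption: Python 3; the sample text and the choice of latin-1 as the "wrong" encoding are just for illustration):

```python
original = '\u4e2d\u6587'               # two Chinese characters
utf8_bytes = original.encode('utf-8')   # 3 bytes per character here

# Same encoding on both sides: the text survives.
round_trip = utf8_bytes.decode('utf-8')

# Different encoding: no exception, but the result is garbage,
# because latin-1 maps every byte to its own character.
garbled = utf8_bytes.decode('latin-1')

# Compatible case: pure ASCII data decodes identically as UTF-8.
compatible = 'abc'.encode('ascii').decode('utf-8')

print(round_trip == original, len(garbled), compatible)
```

The mismatched case is the dangerous one because it fails silently; the text only looks wrong later.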

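5. The encoding method: .encode()

Called on a Unicode string, returns a byte string.

s.encode([encoding[,errors]])      s is the name of a variable holding a Unicode string
*After the call the u prefix is gone, which means the return value is never a Unicode string but always a byte string.

Example (compare with the UTF-8 variable-length rules covered earlier):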
>>>s=u'a'                                         
>>>s.encode('utf-8')
'a' 
>>>s=u'\x7f'              127
>>>s.encode('utf-8')
'\x7f'

>>>s=u'\x80'              128
>>>s.encode('utf-8')
'\xc2\x80'
>>>s=u'\xff'              255
>>>s.encode('utf-8')
'\xc3\xbf'
>>>s=u'\u0100'            256
>>>s.encode('utf-8')
'\xc4\x80'
>>>s=u'\u07ff'            2047
>>>s.encode('utf-8')
'\xdf\xbf'

>>>s=u'\u0800'            2048
>>>s.encode('utf-8')
'\xe0\xa0\x80'
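
The boundaries in the example above (127/128 and 2047/2048) can be checked mechanically. A Python 3 sketch (assumption: Python 3; the two extra boundaries 0xFFFF and 0x10000 are not in the notes and are added here only to complete the picture):

```python
# Code points on either side of each UTF-8 length boundary.
boundaries = [0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000]

# Map each code point to the number of bytes its UTF-8 encoding uses.
lengths = {cp: len(chr(cp).encode('utf-8')) for cp in boundaries}

for cp in boundaries:
    print('U+%04X -> %d byte(s)' % (cp, lengths[cp]))
```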

Example:
>>>s=unichr(40960)+u'abcd'+unichr(1972)
>>>s.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>>s.encode('ascii')               raises UnicodeEncodeError
>>>s.encode('ascii','ignore')
'abcd'
>>>s.encode('ascii','replace')
'?abcd?'
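
The same example carried over to Python 3 (assumption: Python 3, so unichr() becomes chr() and the results are bytes objects). Note that on the encoding side 'replace' substitutes '?', whereas on the decoding side it substitutes U+FFFD.

```python
s = chr(40960) + 'abcd' + chr(1972)   # U+A000, 'abcd', U+07B4

utf8 = s.encode('utf-8')

# ASCII cannot represent the first and last characters.
try:
    s.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = s.encode('ascii', 'ignore')     # unencodable characters dropped
replaced = s.encode('ascii', 'replace')   # unencodable characters become '?'

print(utf8, strict_failed, ignored, replaced)
```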

6. The decoding method for 8-bit strings: str.decode()

s.decode([encoding[,errors]])      s is the name of a variable holding a byte string

Example:
>>>s=unichr(40960)+u'abcd'+unichr(1972)
>>>s
u'\ua000abcd\u07b4'
>>>type(s)
<type 'unicode'>
>>>utf8_version=s.encode('utf-8')
>>>utf8_version
'\xea\x80\x80abcd\xde\xb4'
>>>type(utf8_version)
<type 'str'>
>>>s2=utf8_version.decode('utf-8')
>>>s2
u'\ua000abcd\u07b4'

7. Can encode() be called on a str object?

>>>s='a'
>>>type(s)
<type 'str'>
>>>s.encode('utf-8')
'a'                              works

>>>s='\xff'                      255
>>>s.encode('utf-8')             raises UnicodeDecodeError

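str.encode(encoding) is effectively equivalent to str.decode(sys.getdefaultencoding()).encode(encoding): the byte string is first decoded (using the system default encoding, which is ascii unless it has been changed; this is where the error comes from) and then encoded.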
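
The hidden decode step can be made explicit. Python 3 removed encode() from bytes, so the sketch below emulates the Python 2 behaviour with a hypothetical helper, py2_str_encode (assumption: Python 3; the helper name and its default_encoding parameter are invented for illustration and are not part of any library).

```python
def py2_str_encode(byte_string, encoding, default_encoding='ascii'):
    """Hypothetical emulation of Python 2's str.encode(): decode first, then encode."""
    return byte_string.decode(default_encoding).encode(encoding)

# Pure ASCII input survives the implicit decode.
ok = py2_str_encode(b'a', 'utf-8')

# 0xff fails in the decode step, so the error raised by an "encode" call
# is a decode error.
try:
    py2_str_encode(b'\xff', 'utf-8')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Python 3 bytes objects have no encode() method at all.
has_encode = hasattr(b'a', 'encode')

print(ok, error_name, has_encode)
```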

Edited 2020-3-26 16:43