Python Cookbook Notes, Part 2

Chapter 2: Strings and Text

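—Splitting a String on Multiple Delimiters—

You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent.

# str.split() only handles very simple splitting cases
# When you need more flexible splitting, use re.split() instead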
>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
# Watch out for capturing groups in the pattern: if there is one, the matched delimiters are included in the result
>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
# To group alternatives without keeping the delimiters, use a non-capturing group (?:...)
>>> re.split(r'(?:,|;|\s)\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# Capturing the delimiters can also be useful in some cases
>>> values = fields[::2]
>>> delimiters = fields[1::2] + ['']
>>> values
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>> delimiters
[' ', ';', ',', ',', ',', '']
>>> # Reform the line using the same delimiters
>>> ''.join(v+d for v,d in zip(values, delimiters))
'asdf fjdk;afed,fjek,asdf,foo'

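—Matching Text at the Start or End of a String—

You need to check the start or end of a string against specific text patterns, such as filename extensions, URL schemes, and so on.

# Use the str.startswith() or str.endswith() method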
>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False

To check against several choices, put them all in a tuple (it must be a tuple) and pass it to startswith() or endswith()

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
[ 'Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h' ]
>>> [name for name in filenames if name.endswith(('.c', '.h')) ]
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True
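
The tuple requirement is easy to trip over when the choices arrive as a list. A minimal sketch, assuming a hypothetical list of URL schemes named choices (not part of the original notes), shows the failure and the usual fix of converting with tuple():

```python
# Hypothetical list of URL schemes; startswith() accepts a str or a tuple of str only.
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

try:
    url.startswith(choices)      # passing a list raises TypeError
    accepted_list = True
except TypeError:
    accepted_list = False

# Converting the list to a tuple first makes the call valid.
matches = url.startswith(tuple(choices))
print(accepted_list, matches)
```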

You might also think of doing this with a regular expression:

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object at 0x101253098>

startswith() and endswith() also combine nicely with other operations, such as common data reductions

if any(name.endswith(('.c', '.h')) for name in os.listdir(dirname)):
    ...

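—Matching Strings Using Shell Wildcard Patterns—

You want to match text strings using the wildcard patterns common in Unix shells (such as *.py, Dat[0-9]*.csv, etc.)

# The fnmatch module provides two functions, fnmatch() and fnmatchcase(), for this kind of matching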
>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']

# fnmatch() follows the case-sensitivity rules of the underlying operating system
>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True
# To get case-sensitive matching regardless of the platform, use fnmatchcase()
>>> fnmatchcase('foo.txt', '*.TXT')
False

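—Matching and Searching for Text Patterns—

You want to match or search text for a specific pattern

# To match a literal string, the basic string methods such as str.find(), str.endswith() and str.startswith() are usually enough; for more complex matching, use regular expressions and the re module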
>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes

If you are going to match the same pattern many times, precompile it into a pattern object with re.compile() first

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
# match() always tries to match at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead
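
As a rough sketch of how these methods differ, the snippet below runs one compiled pattern through match(), search(), findall(), finditer() and fullmatch(). The sample text is borrowed from the search-and-replace section of these notes; the variable names are only illustrative.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

at_start = datepat.match(text)      # None: the text does not begin with a date
first = datepat.search(text)        # first occurrence anywhere in the text
all_dates = datepat.findall(text)   # list of tuples, one tuple of groups per match
spans = [m.span() for m in datepat.finditer(text)]  # match objects, found lazily

# fullmatch() succeeds only when the whole string matches the pattern.
exact = datepat.fullmatch('11/27/2012')
not_exact = datepat.fullmatch('11/27/2012abc')

print(first.group(), all_dates, spans)
```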

When defining a regular expression, it is common to use parentheses to create capture groups

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012')
>>> m
<_sre.SRE_Match object at 0x1005d2750>
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()    # month, day, year = m.groups()
('11', '27', '2012')

Tip: if you are going to do a lot of matching and searching, compile the regular expression first and then reuse it

—Searching and Replacing Text—

You want to search a string for a text pattern and replace it

# For simple literal patterns, use str.replace()
>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'

# For more complex patterns, use the sub() function in the re module
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)  # backslashed digits such as \3 refer to capture group numbers in the pattern
'Today is 2012-11-27. PyCon starts 2013-3-13.'
# If you will perform many substitutions with the same pattern, consider compiling it with re.compile() first for better performance

# For more complex substitutions, pass a callback function instead of a replacement string
>>> from calendar import month_abbr
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'

# To find out how many substitutions were made in addition to getting the new text, use re.subn() instead
>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2

—Case-Insensitive Search and Replace—

# To ignore case in text operations, pass the re.IGNORECASE flag to the re functions you use
def matchcase(word):
	def replace(m):
		text = m.group()
		if text.isupper():
			return word.upper()
		elif text.islower():
			return word.lower()
		elif text[0].isupper():
			return word.capitalize()
		else:
			return word
	return replace

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'

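# matchcase('snake') returns a callback function (whose argument must be a match object). As noted in the previous section, sub() accepts a callback function as well as a replacement string.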

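—Shortest-Match Patterns—

You are trying to match a text pattern with a regular expression, but it finds the longest possible match. You want it to find the shortest possible match instead.

>>> str_pat = re.compile(r'\"(.*)\"')  # intended to match text enclosed in double quotes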
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.']

>>> str_pat = re.compile(r'\"(.*?)\"')  # adding ? after the * operator makes the match non-greedy (lazy)
>>> str_pat.findall(text2)
['no.', 'yes.']
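
A common alternative to the lazy quantifier, shown here only as a sketch, is a negated character class that refuses to cross a closing quote. For this input both patterns give the same result; the names lazy and negated are illustrative.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# Lazy quantifier: match as little as possible up to the next quote.
lazy = re.compile(r'"(.*?)"').findall(text2)

# Negated character class: match any run of characters that are not a quote.
negated = re.compile(r'"([^"]*)"').findall(text2)

print(lazy, negated)
```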

—Multiline Patterns—

You are using a regular expression to match a block of text, and the match needs to span multiple lines.

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]

# To fix this, change the pattern so that it allows for newlines
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\n multiline comment ']

# The re.DOTALL flag makes the dot (.) match any character, including newlines
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\n multiline comment ']

—Normalizing Unicode Text—

You are working with Unicode strings and need to make sure every string has the same underlying representation

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15

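# NFC means characters should be fully composed (use a single code point where possible), while NFD means characters should be fully decomposed into combining characters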
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'

>>> s = '\ufb01' # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'
# Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'

# combining() tests whether a character is a combining character
>>> t1 = unicodedata.normalize('NFD', s1)
>>> ''.join(c for c in t1 if not unicodedata.combining(c))
'Spicy Jalapeno'

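—Using Unicode in Regular Expressions—

You are processing text with regular expressions and need them to handle Unicode characters

# By default the re module already has basic support for some Unicode character classes. For example, \d matches any Unicode digit character
>>> import re
>>> num = re.compile(r'\d+')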
>>> # ASCII digits
>>> num.match('123')
<_sre.SRE_Match object at 0x1007d9ed0>
>>> # Arabic digits
>>> num.match('\u0661\u0662\u0663')
<_sre.SRE_Match object at 0x101234030>

# To include specific Unicode characters in a pattern, use their escape sequences (such as \uFFFF or \UFFFFFFFF)
>>> arabic = re.compile('[\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff]+')
>>> pat = re.compile('stra\u00dfe', re.IGNORECASE)
>>> s = 'straße'
>>> pat.match(s) # Matches
<_sre.SRE_Match object at 0x10069d370>
>>> pat.match(s.upper()) # Doesn't match
>>> s.upper() # Case folds
'STRASSE'
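
The mismatch above comes from re.IGNORECASE comparing characters one at a time, without full case folding. One possible workaround, sketched here for plain string comparison rather than regex matching, is str.casefold(); the variable names are illustrative.

```python
import re

s = 'straße'
pat = re.compile('stra\u00dfe', re.IGNORECASE)

# IGNORECASE does not expand 'ß' to 'SS', so the upper-cased text fails to match.
regex_hit = pat.match(s.upper())

# casefold() applies full Unicode case folding, mapping 'ß' to 'ss'.
folded_equal = s.casefold() == s.upper().casefold()

print(regex_hit, s.casefold(), folded_equal)
```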

—Stripping Unwanted Characters from Strings—

strip() removes leading and trailing characters. lstrip() and rstrip() strip from the left and from the right, respectively

>>> # Whitespace stripping
>>> s = ' hello world \n'
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>
>>> # Character stripping
>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=h')
'ello'

# To handle whitespace in the middle of a string, use replace() or a regular expression substitution
>>> s = ' hello     world \n'.strip()
>>> s.replace(' ', '')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'

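—Sanitizing and Cleaning Up Text—

Some bored script kiddie has entered the text "pýtĥöñ" into a form on your web page, and you would like to clean it up

>>> s = 'pýtĥöñ\fis\tawesome\r\n'  # simple cleanup can also use upper(), lower(), str.replace(), re.sub(), etc.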
>>> s
'pýtĥöñ\x0cis\tawesome\r\n'

>>> remap = {
...     ord('\t') : ' ',
...     ord('\f') : ' ',
...     ord('\r') : None  # Deleted
... }
>>> a = s.translate(remap)
>>> a
'pýtĥöñ is awesome\n'

>>> import unicodedata
>>> import sys
>>> # combining() returns the canonical combining class of a character (0 if it has none)
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
...                          if unicodedata.combining(chr(c)))
...
>>> b = unicodedata.normalize('NFD', a)
>>> b
'pýtĥöñ is awesome\n'

>>> b.translate(cmb_chrs)
'python is awesome\n'

—Aligning Text Strings—

Use the string methods ljust(), rjust() and center()
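
A short sketch of the three methods, assuming the same 'Hello World' sample used with format() in this section. Each method takes a width and an optional fill character.

```python
text = 'Hello World'

left = text.ljust(20)              # pad on the right to 20 columns
right = text.rjust(20)             # pad on the left to 20 columns
centered = text.center(20)         # pad on both sides

# An optional second argument sets the fill character.
right_fill = text.rjust(20, '=')
center_fill = text.center(20, '*')

print(repr(left), repr(right), repr(centered))
print(right_fill, center_fill)
```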

# The format() function can also align strings easily. Use the <, > or ^ character followed by the desired width
>>> text = 'Hello World'
>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '

>>> format(text, '=>20s')
'=========Hello World'
>>> format(text, '*^20s')
'****Hello World*****'

>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'

>>> x = 1.2345
>>> format(x, '>10')
'    1.2345'
>>> format(x, '^10.2f')
'   1.23   '

—Combining and Concatenating Strings—

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'

>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'

>>> print('{} {}'.format(a,b))
Is Chicago Not Chicago?

>>> print(a, b, sep=' ')
Is Chicago Not Chicago?

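—Interpolating Variables in Strings—

You want to create a string in which embedded variable names are replaced by string representations of the variables' values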

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'

>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'

# One drawback of format() and format_map() is that they do not cope gracefully with missing values
class safesub(dict):
	"""Leave the placeholder in place when a key is missing."""
	def __missing__(self, key):
		return '{' + key + '}'

>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'

# You can wrap the substitution step in a utility function
import sys
def sub(text):
	return text.format_map(safesub(sys._getframe(1).f_locals))

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}

# A few other approaches are available
>>> name = 'Guido'
>>> n = 37
>>> '%(name)s has %(n)s messages.' % vars()
'Guido has 37 messages.'

>>> import string
>>> s = string.Template('$name has $n messages.')
>>> s.substitute(vars())
'Guido has 37 messages.'

—Reformatting Text to a Fixed Number of Columns—

You have long strings that you want to reformat to fit a given column width

>>> s = '123456789'
>>> import textwrap
>>> textwrap.fill(s, 4)
'1234\n5678\n9'
>>> textwrap.fill(s,4,initial_indent='----')
'----1\n2345\n6789'

# os.get_terminal_size() returns the size of the terminal
>>> import os
>>> os.get_terminal_size().columns
80
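
The two pieces can be combined to wrap text to the current terminal. This sketch swaps in shutil.get_terminal_size(), which falls back to a default size when no terminal is attached (os.get_terminal_size() raises OSError in that case). The sample sentence and the 20-column minimum are arbitrary assumptions.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "not around the eyes, don't look around the eyes.")

# Fall back to 80x24 when the size cannot be determined (e.g. output is piped).
columns = shutil.get_terminal_size(fallback=(80, 24)).columns
width = max(columns, 20)   # arbitrary lower bound so very narrow values still wrap sensibly

wrapped = textwrap.fill(s, width=width)
print(wrapped)
```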

—Handling HTML and XML in Strings—

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".

# If you are producing ASCII text and want to embed character references for non-ASCII characters, pass errors='xmlcharrefreplace' to the relevant I/O functions
>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'

>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> import html  # HTMLParser.unescape() was removed in Python 3.9; use html.unescape()
>>> html.unescape(s)
'Spicy "Jalapeño".'
>>>
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'

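—Tokenizing Text—

You have a string and want to parse it left to right into a stream of tokens. (A token here is a lexical unit, a (type, value) pair such as ('NUM', '23'), not a security token that stands in for sensitive data.)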

text = 'foo = 23 + 42 * 10'
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),
('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()    # in the interactive interpreter, "_" holds the last result
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('NUM', '42')

# In practice, this technique is easy to package into a generator like the following
from collections import namedtuple

def generate_tokens(pat, text):
	Token = namedtuple('Token', ['type', 'value'])
	scanner = pat.scanner(text)
	for m in iter(scanner.match, None):
		yield Token(m.lastgroup, m.group())
# Example use
for tok in generate_tokens(master_pat, 'foo = 42'):
	print(tok)
# Produces output
# Token(type='NAME', value='foo')
# Token(type='WS', value=' ')
# Token(type='EQ', value='=')
# Token(type='WS', value=' ')
# Token(type='NUM', value='42')

# To filter the token stream, define more generator functions or use a generator expression. For example, this filters out all whitespace tokens
tokens = (tok for tok in generate_tokens(master_pat, text)
			if tok.type != 'WS')
for tok in tokens:
	print(tok)

—Writing a Recursive Descent Parser—
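
The notes record nothing for this section. As a rough sketch of the idea only (not the cookbook's own code), the evaluator below builds on a regex tokenizer like the one in the previous section and writes one method per grammar rule. The names ExpressionEvaluator, tokenize and TOKEN_PAT, and the small arithmetic grammar, are assumptions made for illustration.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

TOKEN_PAT = re.compile('|'.join([
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<MINUS>-)',
    r'(?P<TIMES>\*)',
    r'(?P<DIVIDE>/)',
    r'(?P<LPAREN>\()',
    r'(?P<RPAREN>\))',
    r'(?P<WS>\s+)',
]))


def tokenize(text):
    """Yield non-whitespace tokens (finditer() silently skips unknown characters)."""
    for m in TOKEN_PAT.finditer(text):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())


class ExpressionEvaluator:
    """Grammar, one method per rule:
        expr   ::= term { ('+' | '-') term }
        term   ::= factor { ('*' | '/') factor }
        factor ::= NUM | '(' expr ')'
    """

    def parse(self, text):
        self.tokens = tokenize(text)
        self.tok = None        # last token consumed
        self.nexttok = None    # one-token lookahead
        self._advance()
        value = self.expr()
        if self.nexttok is not None:
            raise SyntaxError('unexpected trailing input')
        return value

    def _advance(self):
        """Move one token forward."""
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self, toktype):
        """Consume the lookahead token if it has the given type."""
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        return False

    def expr(self):
        value = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            value = value + right if op == 'PLUS' else value - right
        return value

    def term(self):
        value = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            value = value * right if op == 'TIMES' else value / right
        return value

    def factor(self):
        if self._accept('NUM'):
            return int(self.tok.value)
        if self._accept('LPAREN'):
            value = self.expr()
            if not self._accept('RPAREN'):
                raise SyntaxError('expected )')
            return value
        raise SyntaxError('expected NUM or (')


e = ExpressionEvaluator()
print(e.parse('2 + 3 * 4'))
print(e.parse('(2 + 3) * 4'))
```

Because term() is called from expr(), multiplication and division bind more tightly than addition and subtraction, and parentheses recurse back into expr().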

—Performing Text Operations on Byte Strings—

Byte strings support most of the same built-in operations as text strings

>>> data = b'Hello World'
>>> data[0:5]
b'Hello'
>>> data.startswith(b'Hello')
True
>>> data.split()
[b'Hello', b'World']
>>> data.replace(b'Hello', b'Hello Cruel')
b'Hello Cruel World'

# These operations also work with byte arrays
>>> data = bytearray(b'Hello World')
>>> data[0:5]
bytearray(b'Hello')
>>> data.startswith(b'Hello')
True
>>> data.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> data.replace(b'Hello', b'Hello Cruel')
bytearray(b'Hello Cruel World')

# You can match byte strings with regular expressions, but the patterns themselves must also be bytes
>>> data = b'FOO:BAR,SPAM'
>>> import re
>>> re.split('[:,]',data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/re.py", line 191, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: can't use a string pattern on a bytes-like object
>>> re.split(b'[:,]',data) # Notice: pattern as bytes
[b'FOO', b'BAR', b'SPAM']

There are also some differences to watch out for

# Indexing a byte string returns an integer, not a single character
>>> a = 'Hello World' # Text string
>>> a[0]
'H'
>>> a[1]
'e'
>>> b = b'Hello World' # Byte string
>>> b[0]
72
>>> b[1]
101

# Byte strings do not have a nice string representation and do not print cleanly unless decoded first
>>> s = b'Hello World'
>>> print(s)
b'Hello World' # Observe b'...'
>>> print(s.decode('ascii'))
Hello World

# Byte strings have no format() method (since Python 3.5 they do support %-style formatting)
>>> b'{} {} {}'.format(b'ACME', 100, 490.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'format'
    
# One way to format byte strings is to format a normal text string first and then encode it as bytes
>>> '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
b'ACME              100     490.10'
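
On Python 3.5 and later, %-style formatting works directly on bytes, although format() still does not. A minimal sketch comparing it with the encode-after-formatting approach (variable names are illustrative):

```python
# %-formatting on bytes (Python 3.5+); %s takes a bytes value here, not a str.
formatted = b'%-10s %10d %10.2f' % (b'ACME', 100, 490.1)

# The same layout produced by formatting text first and encoding afterwards.
via_text = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')

print(formatted)
print(formatted == via_text)
```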

# Finally, note that using byte strings can change the semantics of some operations, especially those related to the filesystem
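
One example of this change in semantics: os.listdir() returns names of the same type as the path it is given. The sketch below uses a temporary directory and a hypothetical file name so that it is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a hypothetical name.
    with open(os.path.join(tmp, 'spam.txt'), 'w'):
        pass

    as_text = os.listdir(tmp)                 # str path in, str names out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes path in, raw bytes names out

print(as_text, as_bytes)
```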

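One last point: some programmers lean toward byte strings instead of text strings to make their programs run faster. Manipulating bytes can indeed be slightly more efficient than text (because of the overhead inherent in Unicode handling), but doing so usually leads to very messy code. You will often find that byte strings do not play well with the rest of Python, and you end up handling all the encoding and decoding by hand. Frankly, if you are working with text, use normal text strings in your program, not byte strings. Don't ask for trouble and you won't get any! (from Python Cookbook)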