1. Formatting a string to a specified column width

We often have long strings that need to be reformatted to fit a given column width.

textwrap.fill()

import textwrap
import os

s = "Look into my eyes, look into my eyes, \
the eyes, the eyes, the eyes, not around the eyes, don't look \
around the eyes, look into my eyes, you're under."
length = 50  # or os.get_terminal_size().columns to follow the terminal width
print(textwrap.fill(s, length))
# >>> Look into my eyes, look into my eyes, the eyes,
#     the eyes, the eyes, not around the eyes, don't
#     look around the eyes, look into my eyes, you're
#     under.

print(textwrap.fill(s, length, initial_indent='    '))
# >>>     Look into my eyes, look into my eyes, the
#     eyes, the eyes, the eyes, not around the eyes,
#     don't look around the eyes, look into my eyes,
#     you're under.

print(textwrap.fill(s, length, subsequent_indent='    '))
# >>> Look into my eyes, look into my eyes, the eyes,
#         the eyes, the eyes, not around the eyes, don't
#         look around the eyes, look into my eyes,
#         you're under.

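The textwrap module is very useful for printing strings, especially when the output should adapt to the size of the terminal. The terminal size can be obtained with os.get_terminal_size().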
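
As a rough sketch of wrapping to the terminal width: the helper below (fill_to_terminal and its fallback_columns parameter are names made up for this illustration) uses shutil.get_terminal_size() rather than os.get_terminal_size(), on the assumption that the script may also run with its output redirected to a file or pipe, where os.get_terminal_size() can raise OSError.

```python
import shutil
import textwrap


def fill_to_terminal(text, width=None, fallback_columns=80):
    """Wrap text to `width`, or to the current terminal width if width is None.

    Hypothetical helper for illustration; not part of textwrap.
    """
    if width is None:
        # shutil.get_terminal_size() returns the fallback instead of raising
        # when no terminal is attached (for example, when output is piped).
        columns = shutil.get_terminal_size(fallback=(fallback_columns, 24)).columns
        width = columns or fallback_columns
    return textwrap.fill(text, width)


s = ("Look into my eyes, look into my eyes, the eyes, the eyes, the eyes, "
     "not around the eyes, don't look around the eyes, look into my eyes, "
     "you're under.")
print(fill_to_terminal(s))            # follows the terminal width
print(fill_to_terminal(s, width=40))  # explicit width
```

Passing an explicit width keeps the output reproducible, which is convenient in tests; leaving it out lets the text follow whatever terminal the program happens to run in.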

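2. Tokenizing a string

Suppose we have a string that needs to be parsed from left to right into a stream of tokens.

To tokenize a string, matching patterns is not enough; we also need to identify the type of each pattern that matched. For example, we may want to turn the string into a sequence of (token type, value) pairs.

The first step toward producing such pairs is to define every possible token, including whitespace, as a regular expression with a named capture group.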

import re

text = 'foo = 23 + 42 * 10'
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
scanner = master_pat.scanner(text)
res = scanner.match()
print(res.lastgroup, res.group())
# >>> NAME foo

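In the patterns above, the ?P<TOKENNAME> syntax gives a group a name that can be referred to later.

To tokenize, use the scanner() method of the pattern object. It creates a scanner object; calling match() on that object repeatedly steps through the target text, producing one match per call.

In practice, this technique is easy to package into a generator.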
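
A minimal sketch of such a generator, reusing the token patterns from the example above. Token and generate_tokens are names chosen for this illustration. Note also that scanner() is, as far as I know, not described in the official re documentation, so it is best treated as an implementation detail of CPython.

```python
import re
from collections import namedtuple

# Token patterns, identical to the example above.
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

# Illustrative container for one (type, value) pair.
Token = namedtuple('Token', ['type', 'value'])


def generate_tokens(pat, text):
    """Yield one Token per successive match, scanning text left to right."""
    scanner = pat.scanner(text)
    # iter(callable, sentinel) keeps calling scanner.match() until it
    # returns None, i.e. until no token pattern matches at the current position.
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())


# Drop whitespace tokens while consuming the stream.
tokens = [tok for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
for tok in tokens:
    print(tok)
```

One thing to keep in mind with this sketch: scanning stops silently at the first character that no pattern matches, so real code would probably want to check that the whole input was consumed.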

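3. String operations on byte strings

What if we want to perform text operations (such as stripping, searching, and replacing) on byte strings? Byte strings support most of the same built-in operations as text strings.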

data = b'Hello World'
print(data[0:5])
# >>> b'Hello'
print(data.startswith(b'Hello'))
# >>> True
print(data.split())
# >>> [b'Hello', b'World']
print(data.replace(b'Hello', b'Hello Cruel'))
# >>> b'Hello Cruel World'

The same operations also work on byte arrays. For example:

data = bytearray(b'Hello World')
print(data[0:5])
# >>> bytearray(b'Hello')
print(data.startswith(b'Hello'))
# >>> True
print(data.split())
# >>> [bytearray(b'Hello'), bytearray(b'World')]
print(data.replace(b'Hello', b'Hello Cruel'))
# >>> bytearray(b'Hello Cruel World')
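
Byte strings support most, not all, of the behavior of text strings. The sketch below is not one of the original examples. It illustrates two differences that tend to matter in practice: indexing yields integers rather than characters, and a bytearray can be modified in place while bytes cannot. The variable names are only for illustration.

```python
text = 'Hello World'
data = b'Hello World'

# Indexing a str gives a one-character str; indexing bytes gives an int.
first_char = text[0]
first_byte = data[0]
print(first_char, first_byte)
# >>> H 72

# bytes is immutable, but a bytearray can be changed in place.
buf = bytearray(data)
buf[0:5] = b'HELLO'
print(buf)
# >>> bytearray(b'HELLO World')

# To treat the data as text, decode it explicitly first.
decoded = data.decode('ascii')
print(decoded.upper())
# >>> HELLO WORLD
```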

Regular expressions can also be matched against byte strings, but the pattern itself must then be given as bytes too. For example:

import re

data = b'FOO:BAR,SPAM'
# A str pattern fails on bytes; uncommenting the next line raises TypeError:
# print(re.split('[:,]', data))
# >>> Traceback (most recent call last):
#     File "<stdin>", line 1, in <module>
#     File "ByteString.py", line 3, in split
#     return _compile(pattern, flags).split(string, maxsplit)
#     TypeError: can't use a string pattern on a bytes-like object
print(re.split(b'[:,]', data))  # note: the pattern is given as bytes
# >>> [b'FOO', b'BAR', b'SPAM']
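
To make the rule concrete, here is a small sketch of the two consistent ways to split this data: keep both the pattern and the data as bytes, or decode the data and use a text pattern. It assumes the data is ASCII; for other encodings the decode() argument would need to change. The variable names are illustrative only.

```python
import re

data = b'FOO:BAR,SPAM'

# Option 1: bytes pattern on bytes data, result is a list of bytes.
fields_bytes = re.split(rb'[:,]', data)
print(fields_bytes)
# >>> [b'FOO', b'BAR', b'SPAM']

# Option 2: decode first, then use an ordinary str pattern.
fields_text = re.split(r'[:,]', data.decode('ascii'))
print(fields_text)
# >>> ['FOO', 'BAR', 'SPAM']

# Mixing a str pattern with bytes data raises TypeError.
try:
    re.split(r'[:,]', data)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
# >>> False
```

Which option fits better probably depends on what happens next: if the fields are written back to a binary protocol or file, staying in bytes avoids a round trip; if they are shown to a user or formatted, decoding early is usually simpler.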

Reference: python3-cookbook

