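>>> # Removing whitespace
... s = ' hello world \n'
>>> s.strip()  # strip leading and trailing whitespace
'hello world'
>>> s.lstrip()  # strip leading whitespace
'hello world \n'
>>> s.rstrip()  # strip trailing whitespace
' hello world'
>>>
>>> # Removing other specified characters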
... s = 'www.example.com'
>>> s.strip('cmowz.')
'example'
>>> s.lstrip('cmowz.')
'example.com'
>>> s.rstrip('cmowz.')
'www.example'

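The whitespace case in the code above is straightforward. When other characters are specified, the chars argument is treated as a set of characters rather than as a prefix or suffix: characters are removed from the relevant end one at a time, and removal stops at the first character that is not in the set. (In the lstrip() example, the character e is not in the set cmowz., so stripping stops there and the rest of the string is returned.)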
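To make the set semantics concrete, here is a minimal sketch. The helper drop_suffix is a hypothetical name introduced only for illustration; it shows one way to remove an exact suffix in cases where set-based stripping would remove too much.

```python
url = 'www.example.com'

# chars is a set of characters, so the order in which they are listed
# makes no difference.
assert url.lstrip('w.') == url.lstrip('.w') == 'example.com'

# Stripping stops at the first character outside the set, so the 'm' inside
# "example" survives even though 'm' is in the set.
stripped = url.strip('cmowz.')
print(stripped)  # example

# Pitfall: chars is not a suffix. The set {'.', 'c', 'o', 'm'} also matches
# the last three letters of "telecom".
over_stripped = 'telecom.com'.rstrip('.com')
print(over_stripped)  # tele


def drop_suffix(text, suffix):
    """Hypothetical helper: remove an exact suffix if it is present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(drop_suffix('telecom.com', '.com'))  # telecom
```

On Python 3.9 and later, the built-in str.removeprefix() and str.removesuffix() methods cover the exact-match case directly.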

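These strip() methods are called frequently when preparing data for later use. Note, however, that they have no effect on text in the middle of a string, as the following example shows: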

>>> s = ' hello      world \n'
>>> s.strip()
'hello      world'

To remove whitespace in the middle of a string, consider the replace() method or regular-expression substitution with re.sub(). For example:

>>> s = 'hello      world'
>>> s.replace(' ','')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'
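The two steps (trimming the ends and squeezing the interior) can be combined. The following is a small sketch; the function names collapse_whitespace and collapse_whitespace_split are made up for this illustration.

```python
import re


def collapse_whitespace(text):
    """Trim both ends, then squeeze each whitespace run to one space."""
    return re.sub(r'\s+', ' ', text.strip())


def collapse_whitespace_split(text):
    # str.split() with no arguments splits on runs of whitespace and drops
    # empty pieces, so joining with a single space has the same effect here.
    return ' '.join(text.split())


sample = ' hello      world \n'
print(repr(collapse_whitespace(sample)))        # 'hello world'
print(repr(collapse_whitespace_split(sample)))  # 'hello world'
```

The split/join variant avoids importing re, which may be preferable when no other pattern matching is needed.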

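Cleaning text strings

Some users, as a prank, enter unusual Unicode text such as Un̄ićŏdè into web forms. The task here is to clean up such characters.

Text cleaning touches on a range of issues in text parsing and data processing. In the simplest cases, you can convert the text to a standard case (for example with str.upper() or str.lower()) and then delete or replace specific characters with a substitution operation (such as str.replace() or re.sub()).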
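As a rough sketch of that simple case, the function below lowercases the text, turns tabs into spaces, and drops everything outside a small whitelist. The name simple_clean and the choice of allowed characters are assumptions made for this example, not something prescribed by the text.

```python
import re


def simple_clean(text):
    """Lowercase, replace tabs with spaces, keep only a-z, 0-9 and spaces."""
    text = text.lower()
    text = text.replace('\t', ' ')
    # Delete every character that is not a lowercase ASCII letter,
    # a digit, or a space.
    return re.sub(r'[^a-z0-9 ]', '', text)


print(simple_clean('Hello,\tWorld!'))  # hello world
```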

str.translate() $^{[2]}$

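str.translate(table) returns a copy of the string in which each character has been mapped through the given translation table. The table argument must be an object that implements indexing via __getitem__(), typically a mapping or a sequence, and it is indexed by Unicode ordinals (integers). For example: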

>>> s = 'Un̄ićŏdè, the\fworld standard\tfor text and emoji\r\n'
>>> given_map = {
...     ord('\t'): ' ',
...     ord('\f'): ' ',
...     ord('\r'): None
... }
>>> a = s.translate(given_map)
>>> a
'Un̄ićŏdè, the world standard for text and emoji\n'

In the result, the whitespace characters \t and \f have each been remapped to a single space, while mapping \r to None deletes it.
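The same kind of table can also be built with the static method str.maketrans() instead of writing the dictionary by hand. A minimal sketch, using a shortened ASCII-only sample string chosen for this example:

```python
# Three-argument form: characters in the first string map pairwise to the
# characters in the second; characters in the third string are deleted.
table = str.maketrans('\t\f', '  ', '\r')

s = 'the\fworld standard\tfor text\r\n'
cleaned = s.translate(table)
print(repr(cleaned))  # 'the world standard for text\n'

# Characters that are not keys in the table pass through unchanged.
```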

The unicodedata module

unicodedata.normalize() $^{[3]}$

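Strings such as the Un̄ićŏdè mentioned above contain combining characters (diacritical marks). One approach is to build a more complete translation table for str.translate() that deletes every combining character. For example: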

>>> import sys
>>> import unicodedata
>>> a = 'Un̄ićŏdè, the world standard for text and emoji\n'
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
>>> b = unicodedata.normalize('NFD', a)
>>> b
'Un̄ićŏdè, the world standard for text and emoji\n'
>>> b.translate(cmb_chrs)
'Unicode, the world standard for text and emoji\n'
>>>

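dict.fromkeys(seq[, value]) returns a new dictionary whose keys come from seq, each mapped to value, which defaults to None. In this example, dict.fromkeys() builds a dictionary in which the code point of every Unicode combining character is a key and every value is None.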
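Scanning the whole Unicode range builds a large table up front. As an alternative sketch, the table can be built from only the characters that actually occur in the string being cleaned. The function name strip_combining is hypothetical, and the sample string is written with escapes so that its code points are unambiguous.

```python
import unicodedata


def strip_combining(text):
    """Decompose text, then delete the combining characters it contains."""
    decomposed = unicodedata.normalize('NFD', text)
    # Keys are code points (integers); every value is None, meaning "delete".
    table = dict.fromkeys(
        ord(ch) for ch in set(decomposed) if unicodedata.combining(ch)
    )
    return decomposed.translate(table)


print(dict.fromkeys('ab'))  # {'a': None, 'b': None}
print(strip_combining('Un\u0304i\u0107\u014fd\u00e8'))  # Unicode
```

The trade-off is that the table is rebuilt on every call, so a precomputed table may still be the better fit when many strings are processed.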

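unicodedata.normalize(form, unistr) returns the normal form form of the Unicode string unistr. Valid values for form are NFC, NFKC, NFD, and NFKD. unicodedata.combining() returns a character's canonical combining class, which is nonzero for combining characters and 0 otherwise. The Unicode text in the example consists of letters combined with combining characters; NFD converts it to decomposed form, and translate() then deletes all the combining characters.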
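The difference between composed and decomposed forms is easy to miss because both render identically. A short sketch using a single accented letter (chosen here purely as an illustration):

```python
import unicodedata

composed = '\u00e9'     # 'é' as one precomposed code point
decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but compare unequal.
print(composed == decomposed)  # False

nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(nfc == composed, len(nfc))    # True 1
print(nfd == decomposed, len(nfd))  # True 2

# combining() is 0 for the base letter and nonzero for the accent.
print(unicodedata.combining('e'), unicodedata.combining('\u0301'))  # 0 230
```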

encode() and decode()

Another approach relies on the encoding and decoding functions used for I/O. For example:

>>> a
'Un̄ićŏdè, the world standard for text and emoji\n'
>>> b = unicodedata.normalize('NFD', a)
>>> b.encode('ascii', 'ignore').decode('ascii')
'Unicode, the world standard for text and emoji\n'

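The flow is the same at first: normalize() decomposes the combining characters. The ASCII encode/decode round trip then discards them, along with any other non-ASCII characters. This approach is only appropriate when the goal is an ASCII representation of the text.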
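The limitation is worth seeing in practice: characters that have no decomposition into an ASCII base letter are silently dropped. A minimal sketch, with to_ascii as a hypothetical helper name and sample words chosen for this example:

```python
import unicodedata


def to_ascii(text):
    """Decompose, then drop everything that cannot be encoded as ASCII."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('caf\u00e9'))      # cafe
# 'ß' has no decomposition, so it disappears entirely.
print(to_ascii('Stra\u00dfe'))    # Strae
# Non-Latin text is removed as well.
print(to_ascii('\u4f60\u597d!'))  # !
```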

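On performance: for simple substitutions, str.replace() tends to have the clearer advantage. For more complex cleanup that remaps or deletes many different characters, translate() performs better. Which method to use is best decided by trying and measuring several options.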
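One way to run such a comparison is with the standard timeit module. The sketch below is only a starting point: the sample text, the repetition count, and the function names are arbitrary choices for this example, and the timings will vary with the Python version and the input, so no expected numbers are shown.

```python
import timeit

text = 'a\tb\fc\rd' * 1000
table = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}


def with_replace(s):
    # One pass over the string per replaced character.
    return s.replace('\t', ' ').replace('\f', ' ').replace('\r', '')


def with_translate(s):
    # A single pass that consults the table for every character.
    return s.translate(table)


# Both approaches must produce the same cleaned text before timing them.
assert with_replace(text) == with_translate(text)

for fn in (with_replace, with_translate):
    elapsed = timeit.timeit(lambda: fn(text), number=200)
    print(f'{fn.__name__}: {elapsed:.4f}s')
```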

References

Sources

[1] David M. Beazley; Brian K. Jones. Python Cookbook, 3rd Edition. O'Reilly Media. 2013.
[2] "4. Built-in Types — Python 3.6.10 documentation". python.org. Retrieved 13 November 2019.
[3] "6.5. unicodedata — Unicode Database". docs.python.org. Retrieved 3 January 2020.

That covers the main content of this post.

Author: 大梦三千秋
