
2.7 Shortest pattern matching

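Problem: you are matching text with a regular expression, but it finds the longest possible match of the pattern, and you want the shortest possible match instead.
Solution: add the ? modifier after the repetition operator in the pattern.
In the example below, the pattern r'\"(.*)\"' is meant to match text enclosed in double quotes. The * operator is greedy, however, so matching finds the longest possible match. As a result, the second example, which searches text2, does not return what we want.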

import re
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Computer says "no."'
str_pat.findall(text1)

['no.']

text2 = 'Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

Adding the ? modifier after * makes the match non-greedy, so it finds the shortest match:

str_pat = re.compile(r'\"(.*?)\"')
str_pat.findall(text2)

['no.', 'yes.']

str_pat.findall(text1)

['no.']
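As a cross-check on the greedy versus non-greedy behavior above, the sketch below runs three variants of the quoted-text pattern over the same input. The negated character class "([^"]*)" does not appear in these notes; it is shown only as a common alternative that happens to give the same result here. All variable names are illustrative.

```python
import re

text = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last double quote in the string.
greedy = re.compile(r'"(.*)"')
# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.compile(r'"(.*?)"')
# Alternative (an assumption, not from the notes): forbid quotes inside the group.
negated = re.compile(r'"([^"]*)"')

greedy_result = greedy.findall(text)
lazy_result = lazy.findall(text)
negated_result = negated.findall(text)

print(greedy_result)
print(lazy_result)
print(negated_result)
```

The negated class avoids backtracking over the closing quote, but unlike .*? it also matches across newlines, so the two are not interchangeable in every situation.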

2.8 Multiline pattern matching

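Problem: you are matching a large block of text with a regular expression and need the match to span multiple lines.
Solution: see the example.
The dot (.) matches any character except a newline. If you put .* between a start and an end delimiter (such as quotes), matching finds the longest possible match for the pattern. This usually causes several delimited pieces of text to be skipped over and swallowed into a single, longer match. Adding a ? after an operator such as * or + forces the matching algorithm to look for the shortest possible match instead.
Even so, the dot (.) still does not match a newline, which is why the second search below finds nothing.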

comment = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is comment */'
text2 = '''/* this is a 
multiline comment */
'''

comment.findall(text1)

[' this is comment ']

comment.findall(text2)

[]

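Now modify the pattern so that it matches across lines. In this pattern, (?:.|\n) specifies a non-capturing group, that is, a group used purely for matching that is not captured separately or numbered.
Note that the ? after * must be the ASCII question mark. A full-width ？ is treated as a literal character, and the pattern then matches nothing.

text2 = '''/* this is a 
 multiline comment */
'''
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
comment.findall(text2)

[' this is a \n multiline comment ']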

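re.compile() accepts a flag called re.DOTALL, which makes the dot in a regular expression match any character, including a newline.
The dot stands for any character, * means zero or more repetitions, and ? switches to non-greedy mode. Chained together, .*? matches as few characters as possible.
. matches any character except \n (unless re.DOTALL is set)
* matches zero or more repetitions
+ matches one or more repetitions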

comment = re.compile(r'/\*(.*?)\*/',re.DOTALL)
comment.findall(text2)

[' this is a \n multiline comment ']
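To compare the approaches in this section side by side, here is a minimal sketch that applies three spellings of the comment pattern to a string containing two comments, one of which spans a line break. The sample text and variable names are made up for illustration. The inline (?s) flag is the in-pattern equivalent of passing re.DOTALL.

```python
import re

# Hypothetical sample: one multiline comment and one single-line comment.
text = '/* first\n comment */ code(); /* second */'

# Non-capturing group that accepts either "any char except newline" or a newline.
noncapture = re.compile(r'/\*((?:.|\n)*?)\*/')
# Same idea using the DOTALL flag so that . also matches newlines.
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)
# Same flag written inline at the start of the pattern.
inline = re.compile(r'(?s)/\*(.*?)\*/')

results = [p.findall(text) for p in (noncapture, dotall, inline)]
for r in results:
    print(r)
```

Because every variant keeps the non-greedy *?, each comment is returned separately instead of being merged into one long match.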

2.9 Normalizing Unicode text

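Problem: you are working with Unicode strings and need to ensure that all of them have the same underlying representation.
Solution: normalize the text first with the unicodedata module.
In Unicode, some characters can be represented by more than one valid sequence of code points.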

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

s1

'Spicy Jalapeño'

s2

'Spicy Jalapeño'

s1 ==  s2

False

len(s1)

14

len(s2)

15

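The text "Spicy Jalapeño" above is represented in two forms. The first uses the precomposed character "ñ" (U+00F1); the second uses the Latin letter "n" followed by the combining tilde "~" (U+0303).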

import unicodedata as ucd

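The first argument of normalize() specifies how the string should be normalized. NFC means characters should be fully composed, that is, a single code point is used where possible.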

t1 = ucd.normalize('NFC',s1)
t2 = ucd.normalize('NFC',s2)

t2 ==t1

True

print(ascii(t1))
print(ascii(t2))

'Spicy Jalape\xf1o'
'Spicy Jalape\xf1o'

NFD means characters should be fully decomposed into a base character followed by combining characters.

t3 = ucd.normalize('NFD',s1)
t4 = ucd.normalize('NFD',s2)

t3 ==t4

True

print(ascii(t3))
print(ascii(t4))

'Spicy Jalapen\u0303o'
'Spicy Jalapen\u0303o'
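In practice the comparison above is often wrapped in a small helper. The sketch below is one way to do that; same_text is a hypothetical name, not a function from unicodedata, and the choice of NFC as the default form is an assumption.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Return True if a and b are equal after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = 'Spicy Jalape\u00f1o'   # precomposed ñ
s2 = 'Spicy Jalapen\u0303o'  # n followed by a combining tilde

raw_equal = (s1 == s2)                # plain comparison sees different code points
nfc_equal = same_text(s1, s2)         # compare in composed form
nfd_equal = same_text(s1, s2, 'NFD')  # compare in decomposed form

# Length depends on the form: composed is shorter than decomposed.
lengths = (len(unicodedata.normalize('NFC', s2)),
           len(unicodedata.normalize('NFD', s1)))

print(raw_equal, nfc_equal, nfd_equal, lengths)
```

Either form works for equality checks as long as both sides use the same one.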

Python also supports the NFKC and NFKD forms, which add extra compatibility handling for certain characters.

s = '\ufb01'
s

'ﬁ'

import unicodedata as ucd
ucd.normalize('NFD',s)

'ﬁ'

ucd.normalize('NFKD',s)

'fi'

ucd.normalize('NFKC',s)

'fi'
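The ligature is only one kind of compatibility character. As a rough illustration, the sketch below applies NFC and NFKC to a few samples chosen for this example (the fi ligature, full-width digits, and a superscript two). The canonical form leaves them alone, while the compatibility form folds them to plain characters.

```python
import unicodedata

# Illustrative samples: fi ligature, full-width digits 1 2 3, and x followed by superscript two.
samples = ['\ufb01', '\uff11\uff12\uff13', 'x\u00b2']

# NFC only applies canonical composition, so these compatibility characters are unchanged.
unchanged_by_nfc = all(unicodedata.normalize('NFC', t) == t for t in samples)

# NFKC additionally replaces compatibility characters with their plain equivalents.
folded = {t: unicodedata.normalize('NFKC', t) for t in samples}

print(unchanged_by_nfc)
for original, plain in folded.items():
    print(ascii(original), '->', ascii(plain))
```

Compatibility folding loses information (the superscript becomes an ordinary digit), so it is usually better suited to search and comparison than to text that will be displayed.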

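Normalization also matters when sanitizing and filtering text. For example, suppose you want to remove the diacritical marks from some text (perhaps for searching and matching).
The combining() function tests whether a character is a combining character.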

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'
t1 = ucd.normalize('NFD',s1)
''.join(c for c in t1 if not ucd.combining(c))

'Spicy Jalapeno'
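The decompose-then-filter step above can be packaged as a reusable function. This is a minimal sketch; strip_accents is a hypothetical helper name, and the final NFC pass is an added assumption to recompose anything that remains.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks by decomposing, filtering, then recomposing."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', stripped)

result = strip_accents('Spicy Jalape\u00f1o')
print(result)

# Letters with no canonical decomposition, such as U+00F8, pass through unchanged.
print(ascii(strip_accents('\u00f8')))
```

Both representations of the input give the same output, because the text is normalized before filtering.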

2.10 Using Unicode in regular expressions

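Problem: you are processing text with regular expressions, but you care about handling Unicode characters.
Solution: by default, the re module already has basic support for some Unicode character classes. For example, \d already matches any Unicode digit character.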

import re
num = re.compile(r'\d+')
num.match('123ada')

<_sre.SRE_Match object; span=(0, 3), match='123'>

num.match('\u0661\u0662\u0663')

<_sre.SRE_Match object; span=(0, 3), match='١٢٣'>
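If the Unicode-aware behavior of \d is not wanted, the re.ASCII flag restricts it to 0-9. The sketch below contrasts the two on the Arabic-Indic digits used above; the variable names are illustrative. It also shows that int() accepts such digits, which is standard Python behavior for Unicode decimal digits.

```python
import re

arabic_digits = '\u0661\u0662\u0663'  # Arabic-Indic digits one, two, three

unicode_digits = re.compile(r'\d+')          # default for str patterns: any Unicode decimal digit
ascii_digits = re.compile(r'\d+', re.ASCII)  # only the characters 0-9

unicode_match = unicode_digits.match(arabic_digits)
ascii_match = ascii_digits.match(arabic_digits)

# int() understands Unicode decimal digits, so the matched text converts directly.
value = int(unicode_match.group())

print(unicode_match.span(), ascii_match, value)
```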

2.11 Stripping unwanted characters from strings

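Problem: you want to remove unwanted characters, such as whitespace, from the beginning, end, or middle of a text string.
Solution: strip() removes characters from the beginning and end; lstrip() and rstrip() strip from the left and right sides respectively.
By default these methods remove whitespace, but you can specify other characters.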

s = ' hello world \n'
s.strip()

'hello world'

s.lstrip()

'hello world \n'

s.rstrip()

' hello world'

s

' hello world \n'

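As the results above show, these three methods do not modify the original object; each returns a new string.
As the results below show, they only remove characters from the two ends of the text.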

t = '-----hello  --world== --=='
t.strip('-')

'hello  --world== --=='

t.strip('=')

'-----hello  --world== --'

t.strip('-=')

'hello  --world== '
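One detail worth spelling out from the strip('-=') call above: the argument is treated as a set of characters, not as a prefix or suffix string. The sketch below illustrates this with the same t and with a made-up URL string.

```python
# The argument to strip() is a set of characters; their order does not matter.
t = '-----hello  --world== --=='
both_orders = (t.strip('-='), t.strip('=-'))

# lstrip() and rstrip() apply the same rule to one side only.
left_only = t.lstrip('-')
right_only = t.rstrip('=-')

# Illustrative example: every leading or trailing c, m, o, w or . is removed,
# so this does not just remove the literal text 'www.' and '.com'.
url = 'www.example.com'
stripped = url.strip('cmow.')

print(both_orders)
print(repr(left_only), repr(right_only))
print(stripped)
```

Stripping stops at the first character that is not in the set, which is why the space before the trailing --== survives.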

To handle spaces in the middle of the text, use replace() or a regular expression.

t.replace(' ','')

'-----hello--world==--=='

import re
re.sub(r'\s+','',t)

'-----hello--world==--=='
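Removing interior whitespace entirely is not always what is wanted; often the goal is to collapse each run into a single space. The following sketch shows both, plus the split/join idiom as an alternative. The messy string is a made-up sample.

```python
import re

t = '-----hello  --world== --=='

# Delete every run of whitespace.
removed = re.sub(r'\s+', '', t)
# Collapse every run of whitespace into one space.
collapsed = re.sub(r'\s+', ' ', t)

# split() with no arguments splits on any whitespace run and drops empty
# pieces, so joining with one space both trims and collapses.
messy = '  hello \t world\n'
joined = ' '.join(messy.split())

print(removed)
print(collapsed)
print(repr(joined))
```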

strip() is often combined with other iteration, for example when reading multiple lines of data from a file.

with open('data_file/test1_3.txt') as f:
    lines = (line.strip() for line in f)
    for line in lines:
        print(line)

my name is flfl
love python
say hello
to the world
python nihao
yongyuan
yanthon
pythonnn
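The generator expression used above strips each line lazily, so the file is never loaded into memory as a whole. The sketch below reproduces the idea without needing a file on disk by using io.StringIO as a stand-in, and adds a second stage that drops blank lines. The sample content is invented.

```python
import io

# Stand-in for an open text file; the content is made up for this example.
fake_file = io.StringIO('  my name is flfl \n\n love python\t\n   \nsay hello\n')

# Stage 1: strip whitespace from each line as it is read (lazy, no list is built).
stripped = (line.strip() for line in fake_file)
# Stage 2: keep only the lines that still contain something.
non_blank = [line for line in stripped if line]

for line in non_blank:
    print(line)
```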

2.12 Sanitizing and cleaning up text strings

Problem: you want to clean up certain characters in a string.
Solution: use str.translate().

s = 'pýtĥöñ\fis\tawesome\r\n'
s

'pýtĥöñ\x0cis\tawesome\r\n'

Step 1: clean up the whitespace characters

remap = {
    ord('\t'):' ',
    ord('\f'):' ',
    ord('\r'):None
}
a = s.translate(remap)
a

'pýtĥöñ is awesome\n'

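Step 2: remove all combining characters. Build a table that maps every combining character to None, decompose the text with NFD, then translate.

import unicodedata
import sys
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
b = unicodedata.normalize("NFD",a)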
b

'pýtĥöñ is awesome\n'

b.translate(cmb_chrs)

'python is awesome\n'
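The two steps above can be combined into one function. The sketch below is one possible arrangement, not the method from the book: clean_text is a hypothetical name, str.maketrans builds the whitespace table, and an ASCII encode/decode round trip is swapped in for the combining-character table.

```python
import unicodedata

def clean_text(text):
    """Normalize whitespace, then drop diacritics and any other non-ASCII characters."""
    # Map tab and form feed to a space and delete carriage returns.
    remap = str.maketrans({'\t': ' ', '\f': ' ', '\r': None})
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', text.translate(remap))
    # Assumption: the target is plain ASCII, so everything else is discarded.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

cleaned = clean_text('pýtĥöñ\fis\tawesome\r\n')
print(repr(cleaned))
```

The encode/decode shortcut removes every non-ASCII character, not only combining marks, so it is only appropriate when ASCII output is the goal. The translate() table is the better choice when other Unicode text must be preserved.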

Python Cookbook study notes ch2_02
