(Data Science Learning Handbook 32) A detailed introduction to the re module in Python

 

1. Introduction

  The previous article (Data Science Learning Handbook 31) introduced regular expressions in detail. This article summarizes the commonly used functions of re, the regular expression module in Python's standard library.

  The re module provides functions that cover most text-processing tasks. They are introduced one by one below:

 

2. re.compile()

  We used this function in the previous article. It compiles a regular expression string into a pattern object, which can save work when the same expression is applied many times. The main parameters are as follows:

pattern: the regular expression to compile, passed as a string, such as 'aa*'

flags: compilation flags that modify how the regular expression matches. Commonly used flags are:

  re.S: make . match every character, including newlines

  re.I: make matching case-insensitive

  re.U: match according to Unicode rules, which is useful for Chinese text (in Python 3 this is already the default for str patterns)

Here are a few simple examples:

import re

text = '''Even if you haven't heard of "Wikipedia Theory of Six Degrees of Separation", chances are you've heard of "Kevin Bacon's Game of Six Degrees of Separation". Both games connect two unrelated topics (Wikipedia uses the links between entries; Kevin Bacon's game uses actors who appear in the same movie) through a chain of no more than six topics (including the original two).'''

# Compile the regular expression: capture everything inside double quotes (excluding the quotes themselves)
regex = re.compile('"(.*?)"')

# Print the matches
print(regex.findall(text))

Output:

['Wikipedia Theory of Six Degrees of Separation', "Kevin Bacon's Game of Six Degrees of Separation"]

As you can see, all matched content is returned as a list.

import re

text = '''Even if you haven't heard of "Wikipedia Theory of Six Degrees of Separation", chances are you've heard of "Kevin Bacon's Game of Six Degrees of Separation". Both games connect two unrelated topics (Wikipedia uses the links between entries; Kevin Bacon's game uses actors who appear in the same movie) through a chain of no more than six topics (including the original two).'''

# Compile the regular expression: one or more consecutive English letters, uppercase or lowercase
regex = re.compile('[A-Za-z]+')

# Print the matches
print(regex.findall(text))

Output (truncated):

['Even', 'if', 'you', 'haven', 't', 'heard', 'of', 'Wikipedia', 'Theory', 'of', 'Six', 'Degrees', 'of', 'Separation', ...]

Next, we pass a value to the flags parameter to see its effect:

import re

text = '''Even if you haven't heard of "Wikipedia Theory of Six Degrees of Separation", chances are you've heard of "Kevin Bacon's Game of Six Degrees of Separation". Both games connect two unrelated topics (Wikipedia uses the links between entries; Kevin Bacon's game uses actors who appear in the same movie) through a chain of no more than six topics (including the original two).'''

# Compile the regular expression: one or more consecutive lowercase English letters
regex = re.compile('[a-z]+')  # no flags, so matching is case-sensitive

# Print the matches
print(regex.findall(text))

Output (truncated):

['ven', 'if', 'you', 'haven', 't', 'heard', 'of', 'ikipedia', 'heory', 'of', 'ix', 'egrees', 'of', 'eparation', ...]

Because the regular expression is [a-z]+, the uppercase letters are not matched. Below we leave the regular expression unchanged and pass a value to flags instead:

import re

text = '''Even if you haven't heard of "Wikipedia Theory of Six Degrees of Separation", chances are you've heard of "Kevin Bacon's Game of Six Degrees of Separation". Both games connect two unrelated topics (Wikipedia uses the links between entries; Kevin Bacon's game uses actors who appear in the same movie) through a chain of no more than six topics (including the original two).'''

# Compile the same lowercase-only expression, this time with a flag
regex = re.compile('[a-z]+', flags=re.I)  # re.I makes matching ignore case

# Print the matches
print(regex.findall(text))

Output (truncated):

['Even', 'if', 'you', 'haven', 't', 'heard', 'of', 'Wikipedia', 'Theory', 'of', 'Six', 'Degrees', 'of', 'Separation', ...]

With flags=re.I, the same regular expression also matches uppercase letters.
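The re.S flag from the list above is not exercised by the examples in this section. The following is a minimal sketch of its effect, using a made-up two-line string (the variable names and sample text are illustrative assumptions, not from the original article):

```python
import re

# Hypothetical sample: a quoted phrase that spans two lines
sample = 'start "first line\nsecond line" end'

# Without re.S, "." cannot cross the newline, so no quoted span is found
without_flag = re.compile('"(.*?)"').findall(sample)

# With re.S, "." also matches "\n", so the whole quoted span is captured
with_flag = re.compile('"(.*?)"', flags=re.S).findall(sample)

print(without_flag)
print(with_flag)
```

Several flags can be combined with the | operator, for example flags=re.S | re.I.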

 

3. re.match()

  Personally, I find this function is not used very often. re.match() tries to match the regular expression only at the beginning of the target string; a match that begins anywhere else is not found. The following is a simple example:

import re

text = 'What are you waiting for?'

# Matches successfully because the string starts with W (re.I ignores case)
print(re.match('w', text, re.I).group())

Output:

W

When the beginning of the string does not match, re.match() returns None even if a later part of the string would match (this is what matching only the beginning means):

import re

text = 'What are you waiting for? where are you fucking from?'

# Fails to match, because the string starts with Wha, not whe
print(re.match('whe', text, re.I))

Output:

None
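Because a failed match gives None, calling .group() directly on the result raises AttributeError when nothing matched. One possible way to guard against that is sketched below; first_match is a hypothetical helper name introduced only for this illustration:

```python
import re

text = 'What are you waiting for?'

def first_match(pattern, string, flags=0):
    """Hypothetical helper: return the text matched at the start of string, or None."""
    m = re.match(pattern, string, flags)
    # Only call .group() when a match object was actually returned
    return m.group() if m is not None else None

print(first_match('w', text, re.I))    # the string starts with W, so this matches
print(first_match('are', text, re.I))  # 'are' occurs later, not at the start
```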

 

4. re.search()

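  re.search() is called like re.match(), with the three arguments pattern, string and flags. Unlike match, which only looks at the beginning, search scans the string, returns the first part that satisfies the pattern, and does not look any further. The following is a simple example: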

import re

text = 'What are you waiting for? where are you fucking from?'

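# Matches the first occurrence of the target; the rest of the string is not searched
print(re.search('a',text,re.I).group())

Output:

a

The text contains many occurrences of a, but search stops at the first one and returns it.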

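Note that the group() method used in the previous examples belongs to the object that match or search returns on success, called a match object. Its commonly used methods are:

  start(): returns the position where the match starts

  end(): returns the position where the match ends (the index just after the last matched character)

  group(): returns the string matched by re

  span(): returns a tuple of the form (start, end) marking where the match starts and ends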
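The four methods listed above can be tried together on one match object. A small sketch, reusing a sentence from the earlier examples (the variable name m is just a local choice):

```python
import re

text = 'What are you waiting for?'
m = re.search('you', text)

print(m.group())             # the matched text
print(m.start(), m.end())    # start index and the index just past the end
print(m.span())              # the same two numbers as a tuple

# Slicing the original string with start() and end() gives back the match
print(text[m.start():m.end()])
```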

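Although search returns only one match object, we can write the regular expression as several parenthesized subexpressions so that each part of the match can be retrieved separately: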

import re

text = '1213sdsdjAKNNK'

# Match the compound expression once (each parenthesized subexpression becomes a group), then print groups 1, 2 and 3
match = re.search('([1-9]+)*([a-z]+)*([A-Z]+)',text)
print(match.group(1))
print(match.group(2))
print(match.group(3))

Output:

1213
sdsdj
AKNNK
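Besides asking for one group at a time, a match object can hand back all groups at once through its groups() method, and group(0) gives the whole match. A short sketch using the same pattern and text as above (the names digits, lower and upper are illustrative):

```python
import re

text = '1213sdsdjAKNNK'
m = re.search('([1-9]+)*([a-z]+)*([A-Z]+)', text)

print(m.group(0))   # the entire matched string
print(m.groups())   # every captured group as one tuple

# Tuple unpacking gives each block a readable name
digits, lower, upper = m.groups()
print(digits, lower, upper)
```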

 

5. re.findall()

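  Note that this is spelled differently from the findAll() method used when parsing BeautifulSoup objects (although the two do similar jobs). Unlike match and search, findall extracts every part of the target string that satisfies the regular expression and returns them as a list. The following is a simple example on a Chinese string: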

import re

text = '即使你没听说过“维基百科六度分隔理论”,也很可能听过“凯文 · 贝肯(Kevin Bacon)的六度分隔值游戏”。在这两个游戏中,都是把两个不相干的主题(维基百科里是用词条之间的连接,凯文 · 贝肯的六度分隔值游戏是用出现在同一部电影中的演员来连接)用一个总数不超过六条的主题连接起来(包括原来的两个主题)。'

# Match every two-character string in text that starts with 听
print(re.findall('听.',text))

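Output:

['听说', '听过']

Unlike the findall calls in the re.compile() section, which had the form compiled_pattern.findall(target_string), this call has the form re.findall(regular_expression, target_string). The two forms are functionally equivalent.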
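The equivalence of the two call forms can be checked directly. A minimal sketch on a made-up string (the sample text and variable names are assumptions for illustration):

```python
import re

text = 'cat bat rat cab'
pattern = '.at'

# Form 1: module-level function, pattern passed as a string
via_module = re.findall(pattern, text)

# Form 2: compile once, then call findall on the pattern object
compiled = re.compile(pattern)
via_compiled = compiled.findall(text)

print(via_module)
print(via_compiled)
print(via_module == via_compiled)
```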

 

6. re.finditer()

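  Sometimes the target string is very long (perhaps an entire novel) and it contains a great many matches. Using re.findall() as before to extract every result at once into one huge list consumes a lot of memory. This is where lazy iteration, the idea behind Python's generators, comes in handy.

  re.finditer(pattern,string,flags=0) uses this mechanism. It returns an iterator over the matches of pattern in string, and each match is computed only when the loop asks for it. At any moment only the current position and the current match are held, which saves memory. The following is a simple example: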

import re

text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

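# Build the iterator
obj = re.finditer('a.',text)

# Iterate over obj; each step yields one match object, from which we print the matched text and its start and end positions
for i in obj:
    print(i.group())
    print(i.span())

Output (first three matches shown):

ab
(0, 2)
an
(7, 9)
ad
(11, 13)
...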
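To see that the matches really are produced on demand, the iterator can be advanced by hand instead of with a for loop. The sketch below uses a short made-up string and itertools.islice from the standard library; the variable names are illustrative assumptions:

```python
import re
from itertools import islice

text = 'a1 b2 a3 c4 a5 a6'
matches = re.finditer('a.', text)

# Pull only the first match; nothing after it has been searched yet
first = next(matches)
print(first.group(), first.span())

# Take the next two matches without building the full list
next_two = [m.group() for m in islice(matches, 2)]
print(next_two)

# Whatever is left can still be consumed later
remaining = [m.group() for m in matches]
print(remaining)
```

An iterator like this can be consumed only once; to walk the matches again, call re.finditer() again.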

 

7. re.sub()

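  re.sub() is similar to the string method replace(), except that replace() can only work with a fixed piece of text, whereas re.sub(pattern,repl,string,count) selects what to replace with a regular expression. pattern is the regular expression, repl is the replacement content, string is the target string, and count is the maximum number of replacements (by default all matches are replaced). The clean news article we obtained at the end of the previous article was produced with this function. Here is another simple example: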

import re

text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

# Apply the replacement rule
obj = re.sub('a.','嘻嘻',text)

# Print the text after replacement
print(obj)

Output:

嘻嘻jijdi嘻嘻bd嘻嘻jijijih嘻嘻hih嘻嘻ihhhiihi嘻嘻ihid嘻嘻hih嘻嘻dhih嘻嘻d嘻嘻i
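The count parameter described above is not used in the example. A minimal sketch of its effect, on a shortened piece of the same string and with '#' as an arbitrary replacement chosen for readability:

```python
import re

text = 'abjijdianbdad'

# Default: every match of 'a.' is replaced
replace_all = re.sub('a.', '#', text)

# count=1: only the first match is replaced
replace_first = re.sub('a.', '#', text, count=1)

print(replace_all)
print(replace_first)
```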

 

8. re.split()

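  re.split() is similar to the string method split(), extended with regular expressions. In re.split(pattern,string,maxsplit), pattern is the regular expression for the separator, string is the target string, and maxsplit is the maximum number of splits to perform. The following is a simple example: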

import re

text = 'abjijdianbdadjijijiha8hihanihhhiihiaaihidaihihaidhihaidahi'

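# Apply the splitting rule
obj = re.split('i.',text)

# Print the pieces after splitting
print(obj)

Output (adjacent matches produce the empty strings):

['abj', 'd', 'nbdadj', '', '', 'a8h', 'an', 'hh', 'h', 'a', '', 'a', '', 'a', 'h', 'a', 'ahi']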
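A more typical use of re.split() is splitting on several different separators at once, and maxsplit limits how many cuts are made. The sketch below assumes a made-up string with mixed separators; the pattern and variable names are illustrative:

```python
import re

text = 'one, two;three  four'

# Split on any run of commas, semicolons or whitespace
parts = re.split(r'[,;\s]+', text)

# maxsplit=1: stop after the first split and keep the rest intact
limited = re.split(r'[,;\s]+', text, maxsplit=1)

print(parts)
print(limited)
```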

  

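  These are the commonly used functions of the re module. The next article will walk through a hands-on example of collecting web data in a real business scenario.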

 
