Advanced Python - Regular Expressions

What is the use?

Let's look at an example.

A text file stores job listings in the following format:

Python3 高级开发工程师 上海互教教育科技有限公司上海-浦东新区2万/月02-18满员
测试开发工程师(C++/python) 上海墨鹍数码科技有限公司上海-浦东新区2.5万/每月02-18未满员
Python3 开发工程师 上海德拓信息技术股份有限公司上海-徐汇区1.3万/每月02-18剩余11人
测试开发工程师(Python) 赫里普(上海)信息科技有限公司上海-浦东新区1.1万/每月02-18剩余5人
Python高级开发工程师 上海行动教育科技股份有限公司上海-闵行区2.8万/月02-18剩余255人
python开发工程师 上海优似腾软件开发有限公司上海-浦东新区2.5万/每月02-18满员

Now we need to write a program that extracts the salary of every position from this text.

The result should look like this:

2
2.5
1.3
1.1
2.8
2.5

How to do it?
 

This is typical string processing.

Analyzing the text, we can see that every salary is followed by the keyword 万/月 or 万/每月.

So we can write the following code:

content = '''
Python3 高级开发工程师 上海互教教育科技有限公司上海-浦东新区2万/月02-18满员
测试开发工程师(C++/python) 上海墨鹍数码科技有限公司上海-浦东新区2.5万/每月02-18未满员
Python3 开发工程师 上海德拓信息技术股份有限公司上海-徐汇区1.3万/每月02-18剩余11人
测试开发工程师(Python) 赫里普(上海)信息科技有限公司上海-浦东新区1.1万/每月02-18剩余5人
Python高级开发工程师 上海行动教育科技股份有限公司上海-闵行区2.8万/月02-18剩余255人
python开发工程师 上海优似腾软件开发有限公司上海-浦东新区2.5万/每月02-18满员
'''

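# Split the text into a list of lines
lines = content.splitlines()
for line in lines:
    # Find where '万/月' occurs in the line
    pos2 = line.find('万/月')
    if pos2 < 0:
        # Find where '万/每月' occurs in the line
        pos2 = line.find('万/每月')
        # Neither keyword was found
        if pos2 < 0:
            continue

    # Reaching here means a salary keyword was found.
    # Next, locate where the salary number starts
    # by scanning backwards from pos2.
    idx = pos2 - 1

    # Keep moving back while the character is a digit or a decimal point
    # (idx >= 0 stops the scan at the start of the line)
    while idx >= 0 and (line[idx].isdigit() or line[idx] == '.'):
        idx -= 1

    # idx now points at the character just before the salary number,
    # so the salary starts at index idx + 1
    pos1 = idx + 1

    print(line[pos1:pos2])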

Is there an easier way to search a string for substrings that follow a certain pattern?

The solution is what we are going to introduce today: regular expressions.


If we use regular expressions, the code can be like this

content = '''
Python3 高级开发工程师 上海互教教育科技有限公司上海-浦东新区2万/月02-18满员
测试开发工程师(C++/python) 上海墨鹍数码科技有限公司上海-浦东新区2.5万/每月02-18未满员
Python3 开发工程师 上海德拓信息技术股份有限公司上海-徐汇区1.3万/每月02-18剩余11人
测试开发工程师(Python) 赫里普(上海)信息科技有限公司上海-浦东新区1.1万/每月02-18剩余5人
Python高级开发工程师 上海行动教育科技股份有限公司上海-闵行区2.8万/月02-18剩余255人
python开发工程师 上海优似腾软件开发有限公司上海-浦东新区2.5万/每月02-18满员
'''

import re
for one in  re.findall(r'([\d.]+)万/每{0,1}月', content):
    print(one)

Run it and see, the result is the same.

But the code is much simpler.

A regular expression is a syntax for describing the characteristics of the strings you want to search for.

In the call

re.findall(r'([\d.]+)万/每{0,1}月', content)

the string ([\d.]+)万/每{0,1}月 is the regular expression; it specifies the characteristics of the substrings to search for.

Why is it written this way? We will explain that later.

The findall function returns all matching substrings in a list.
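One detail of findall is worth knowing early: when the expression contains a group in parentheses, findall returns only the text captured by the group, not the whole match. A minimal sketch (the sample string and variable names are made up for illustration):

```python
import re

# Illustrative sample text, not taken from the job list above
text = 'A 2万/月, B 2.5万/每月'

# With a group: findall returns only the text captured by the group
with_group = re.findall(r'([\d.]+)万/每{0,1}月', text)
print(with_group)      # ['2', '2.5']

# Without a group: findall returns each whole matched substring
without_group = re.findall(r'[\d.]+万/每{0,1}月', text)
print(without_group)   # ['2万/月', '2.5万/每月']
```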

As the example above shows, the key to using regular expressions is knowing how to write a correct expression.

The official Python documentation for the re module describes the details, including the full syntax.

This tutorial will introduce you to some common regular expression syntax.

Online verification

How can you verify that the expression you wrote correctly matches the strings you are searching for?

You can use the website regex101 (build, test, and debug regex).

Enter the search text and the expression as shown in the schematic picture below to see whether your expression matches the string correctly.

That website is hosted outside China; if you cannot access it, try a regular expression testing website hosted in China instead.

Common syntax

Ordinary characters written in a regular expression simply match themselves.

For example, to find every occurrence of test in a text, the regular expression is just test.

As follows:

The same is true for Chinese characters: to find Chinese characters, just write them directly in the regular expression.


But some characters are special; they are called metacharacters.

When they appear in a regular expression, they do not match themselves but carry a special meaning.

These special metacharacters include the following:

. * + ? \ [ ] ^ $ { } | ( )

Let's introduce their meanings separately:

Dot - matches any single character

. matches any single character except a newline.

For example, you want to select all colors from the text below.

苹果是绿色的
橙子是橙色的
香蕉是黄色的
乌鸦是黑色的

That is, find every 色 together with the one character before it.

You can write the regular expression like this: .色

The dot stands for any character; note that it stands for exactly one character.

Combined, .色 means any one character followed by 色, and the match consists of those two characters.

Verify it, as shown in the figure below

As long as the expression is correct, it can be written in Python code, as follows

content = '''苹果是绿色的
橙子是橙色的
香蕉是黄色的
乌鸦是黑色的'''

import re
p = re.compile(r'.色')
for one in  p.findall(content):
    print(one)

The result of the operation is as follows

绿色
橙色
黄色
黑色

Asterisk - repeat match any number of times

* Indicates matching the preceding subexpression any number of times, including 0 times.

For example, from the text below you want to select, on each line, the comma and everything after it. Note that the comma in the expression must be the same character as the comma in the text; a full-width Chinese comma and an ASCII comma are different characters.

苹果,是绿色的
橙子,是橙色的
香蕉,是黄色的
乌鸦,是黑色的
猴子,

You can write the regular expression like this: ,.*

  • The * immediately following the . means that any character can appear any number of times, so the whole expression means the comma and all characters after it

Verify it, as shown in the figure below

Note the last line in particular: there are no characters after the comma on the 猴子 line, but * can also match 0 times, so the expression still matches there.

As long as the expression is correct, it can be written in Python code, as follows

content = '''苹果,是绿色的
橙子,是橙色的
香蕉,是黄色的
乌鸦,是黑色的
猴子,'''

import re
p = re.compile(r',.*')
for one in  p.findall(content):
    print(one)

The result of the operation is as follows

,是绿色的
,是橙色的
,是黄色的
,是黑色的
,

Note that .* is very common in regular expressions and means to match any character any number of times.

Of course, * does not have to be preceded by a dot; it can follow other characters as well.
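As a small illustration of * following an ordinary character (the sample string is made up for this sketch), the expression 绿油* matches 绿 followed by any number of 油 characters, including none:

```python
import re

# Made-up sample text for illustration
text = '绿,绿油,绿油油油'

# 油* allows the character 油 to repeat zero or more times after 绿
matches = re.findall(r'绿油*', text)
print(matches)   # ['绿', '绿油', '绿油油油']
```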

Plus sign - repeat match one or more times

+ means match the preceding subexpression one or more times, but not 0 times.

For example, still the above example, you need to select the string content after the comma in each line from the text, including the comma itself.

But add a condition, if there is no content after the comma, don't choose it.

For example, in the text below, if there is no content after the comma in the last line, do not select it.

苹果,是绿色的
橙子,是橙色的
香蕉,是黄色的
乌鸦,是黑色的
猴子,

You can write regular expressions like this  ,.+ .

Verify it, as shown in the figure below

In the last line there are no characters after the comma on the 猴子 line, and + requires at least one match, so nothing is selected from the last line.
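Written as Python code in the same style as the earlier examples, this could look like the following sketch (the variable name result is introduced here for illustration):

```python
import re

content = '''苹果,是绿色的
橙子,是橙色的
香蕉,是黄色的
乌鸦,是黑色的
猴子,'''

p = re.compile(r',.+')
result = p.findall(content)
for one in result:
    print(one)
# Four lines are printed; the final line of the text yields no match
```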

Question mark - matches 0 or 1 times

? indicates that the preceding subexpression is matched 0 or 1 times.

For example, from the text below you want to select the comma on each line together with the 1 character after it.

苹果,绿色的
橙子,橙色的
香蕉,黄色的
乌鸦,黑色的
猴子,

You can write the regular expression like this: ,.?

Verify it, as shown in the figure below

In the last line there are no characters after the comma on the 猴子 line, but ? allows 1 or 0 matches, so the comma alone is selected from the last line.
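The same check can be done in Python. A minimal sketch (the variable name result is chosen here for illustration):

```python
import re

content = '''苹果,绿色的
橙子,橙色的
香蕉,黄色的
乌鸦,黑色的
猴子,'''

p = re.compile(r',.?')
result = p.findall(content)
print(result)   # [',绿', ',橙', ',黄', ',黑', ',']
```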

Curly braces - match the specified number of times

Curly braces indicate that the preceding character is matched a specified number of times.

For example, the following text

红彤彤,绿油油,黑乎乎,绿油油油油

The expression 油{3} matches the character 油 exactly 3 times in a row

The expression 油{3,4} matches the character 油 at least 3 and at most 4 times in a row

In this text, both can only match within the last item, 绿油油油油, as shown below:
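A short sketch of both expressions applied to this text (the variable names are illustrative):

```python
import re

text = '红彤彤,绿油油,黑乎乎,绿油油油油'

# Exactly three 油 in a row: only the run of four is long enough
exactly_three = re.findall(r'油{3}', text)
print(exactly_three)    # ['油油油']

# Three or four in a row: greedy, so it takes all four
three_or_four = re.findall(r'油{3,4}', text)
print(three_or_four)    # ['油油油油']
```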

Greedy and non-greedy

We want to extract all the HTML tags from the following string

source = '<html><head><title>Title</title>'

and get a list like this

['<html>', '<head>', '<title>', '</title>']

It is natural to think of the regular expression <.*>

So we write the following code

source = '<html><head><title>Title</title>'

import re
p = re.compile(r'<.*>')

print(p.findall(source))

But the result is

['<html><head><title>Title</title>']

What happened? In regular expressions, the quantifiers *, + and ? are all greedy: they match as much text as possible.

Therefore, the asterisk in <.*> (any number of repetitions) keeps matching all the way to the e at the end of the last </title> in the string.

To solve this problem, use non-greedy mode by adding a ? after the asterisk, so the expression becomes <.*?>

The code changes to

source = '<html><head><title>Title</title>'

import re
# Note the extra question mark
p = re.compile(r'<.*?>')

print(p.findall(source))

Run it again and it should work
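The same ? suffix makes the other quantifiers non-greedy as well (+? and {m,n}?). A brief sketch; the digit string in the second half is made up for illustration:

```python
import re

source = '<html><head><title>Title</title>'

# Non-greedy *? stops at the first > it can
star_tags = re.findall(r'<.*?>', source)
print(star_tags)   # ['<html>', '<head>', '<title>', '</title>']

# +? behaves the same way but requires at least one character
plus_tags = re.findall(r'<.+?>', source)
print(plus_tags)   # ['<html>', '<head>', '<title>', '</title>']

# {2,4} is greedy (takes 4 digits); {2,4}? is non-greedy (takes 2)
greedy = re.findall(r'\d{2,4}', '12345678')
lazy = re.findall(r'\d{2,4}?', '12345678')
print(greedy)   # ['1234', '5678']
print(lazy)     # ['12', '34', '56', '78']
```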

Escaping metacharacters

The backslash \ is used in several ways in regular expressions.

For example, we want to search for all strings preceding a dot in the text below, including the dot itself

苹果.是绿色的
橙子.是橙色的
香蕉.是黄色的

If we write the regular expression as .*. , you have probably already noticed that something is wrong.

The dot is a metacharacter: when it appears directly in a regular expression it matches any single character, so it cannot stand for the literal dot character itself.

What can we do?

If the content we want to search for contains metacharacters, we can escape them with a backslash.

Here we use expressions like this: .*\.

For example, the Python program is as follows

content = '''苹果.是绿色的
橙子.是橙色的
香蕉.是黄色的'''

import re
p = re.compile(r'.*\.')
for one in  p.findall(content):
    print(one)

The result of the operation is as follows

苹果.
橙子.
香蕉.
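To see the difference the backslash makes, the unescaped and escaped expressions can be compared side by side. A minimal sketch (the variable names are illustrative):

```python
import re

content = '''苹果.是绿色的
橙子.是橙色的
香蕉.是黄色的'''

# Unescaped: the final . matches any character, so each whole line matches
unescaped = re.findall(r'.*.', content)
print(unescaped)   # ['苹果.是绿色的', '橙子.是橙色的', '香蕉.是黄色的']

# Escaped: \. matches only a literal dot
escaped = re.findall(r'.*\.', content)
print(escaped)     # ['苹果.', '橙子.', '香蕉.']
```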

Match a certain character type

A backslash followed by certain characters matches a single character of a particular type.

For example

\d matches any digit character from 0 to 9, equivalent to the expression [0-9]

\D matches any character that is not a digit from 0 to 9, equivalent to the expression [^0-9]

\s matches any whitespace character, including spaces, tabs, newlines, etc., equivalent to the expression [ \t\n\r\f\v]

\S matches any non-whitespace character, equivalent to the expression [^ \t\n\r\f\v]

\w matches any word character, including uppercase and lowercase letters, digits, and the underscore, equivalent to the expression [a-zA-Z0-9_]

By default (for str patterns), \w also matches Unicode word characters such as Chinese characters, and \d and \s likewise match Unicode digits and whitespace. If the re.ASCII flag is specified, only the ASCII characters listed above are matched.

\W matches any non-word character, equivalent to the expression [^a-zA-Z0-9_]

Backslash sequences can also be used inside square brackets; for example, [\s,.] matches any whitespace character, a comma, or a period.
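A quick sketch of these character types in use. The sample strings are made up for illustration; the second half shows the effect of the re.ASCII flag on \w:

```python
import re

# Made-up sample text for illustration
text = 'tel: 138 0000, id_7'

digits = re.findall(r'\d+', text)
print(digits)    # ['138', '0000', '7']

words = re.findall(r'\w+', text)
print(words)     # ['tel', '138', '0000', 'id_7']

# By default \w also matches Chinese characters; re.ASCII restricts it
mixed = '苹果 apple'
unicode_words = re.findall(r'\w+', mixed)
ascii_words = re.findall(r'\w+', mixed, re.ASCII)
print(unicode_words)   # ['苹果', 'apple']
print(ascii_words)     # ['apple']
```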

Square brackets - match one of several characters

Square brackets match any one of the characters listed inside them.

For example

[abc] matches any one of the characters a, b, or c. It is equivalent to [a-c].

In [a-c], the - in the middle indicates a range, here from a to c.

If you want to match all lowercase letters, you can use [a-z]

Some metacharacters lose their special meaning inside square brackets and behave like ordinary characters.

For example

[akm.] matches any one of the characters a, k, m, or .

Here the . inside the brackets no longer means any character; it matches the literal . character.
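A small sketch of both bracket forms (the sample strings are made up for illustration):

```python
import re

# A range: any one character from a to c
in_range = re.findall(r'[a-c]', 'abcdef')
print(in_range)   # ['a', 'b', 'c']

# A list: a, k, m, or a literal dot
listed = re.findall(r'[akm.]', 'make.it')
print(listed)     # ['m', 'a', 'k', '.']
```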


If ^ is the first character inside square brackets, it means any character that is NOT in the bracketed set.

for example

content = 'a1b2c3d4e5'

import re
p = re.compile(r'[^\d]' )
for one in  p.findall(content):
    print(one)

[^\d] selects the non-digit characters

The output is:

a
b
c
d
e

Start, end position and single-line, multi-line mode

^ matches the start position of the text.

Regular expressions can be used in single-line mode or multi-line mode.

In single-line mode, ^ matches the start of the entire text.

In multi-line mode, ^ matches the start of each line of the text.

For example, in the following text, the number at the start of each line is the fruit's ID, and the number at the end is the price

001-苹果价格-60
002-橙子价格-70
003-香蕉价格-80

If we want to extract all the fruit IDs, we can use the regular expression ^\d+

The above regular expression, used in the Python program, is as follows

content = '''001-苹果价格-60
002-橙子价格-70
003-香蕉价格-80'''

import re
p = re.compile(r'^\d+', re.M)
for one in  p.findall(content):
    print(one)

Note that the second argument of compile, re.M, turns on multi-line mode.

The result of the operation is as follows

001
002
003

If the second parameter re.M of compile is removed, the running results are as follows

001

There is only the first line.

Because in single-line mode, ^ will only match the beginning of the entire text.


$ matches the end position of the text.

In single-line mode, $ matches the end of the entire text.

In multi-line mode, $ matches the end of each line of the text.

For example, in the following text, the number at the start of each line is the fruit's ID, and the number at the end is the price

001-苹果价格-60
002-橙子价格-70
003-香蕉价格-80

If we want to extract all the prices, we can use the regular expression \d+$

The corresponding code is

content = '''001-苹果价格-60
002-橙子价格-70
003-香蕉价格-80'''

import re
p = re.compile(r'\d+$', re.MULTILINE)
for one in  p.findall(content):
    print(one)

Note that the second argument of compile, re.MULTILINE, turns on multi-line mode.

The result of the operation is as follows

60
70
80

If the second parameter re.MULTILINE of compile is removed, the running results are as follows

80

There is only the last line.

Because in single-line mode, $ will only match the end position of the entire text.

Vertical bar - matches one of several alternatives

A vertical bar means: match either the expression on its left or the expression on its right.

Note in particular that the vertical bar has the lowest priority in regular expressions, which means that each part separated by vertical bars is treated as a whole.

For example, 绿色|橙 matches 绿色 or 橙,

not 绿色 or 绿橙.
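The priority rule can be checked with a short sketch. The sample strings are made up for illustration, and the second expression uses (?:...), a non-capturing group from the re module, to limit the alternatives to a single character:

```python
import re

# Made-up sample text for illustration
text = '苹果是绿色的,橙子是橙色的'

# 绿色|橙 means: 绿色 OR 橙
either = re.findall(r'绿色|橙', text)
print(either)    # ['绿色', '橙', '橙']

# To limit the alternatives to one character, wrap them in parentheses;
# (?:...) groups without capturing, so findall still returns whole matches
limited = re.findall(r'绿(?:色|橙)', '绿色 绿橙 橙')
print(limited)   # ['绿色', '绿橙']
```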

Parentheses - grouping

Parentheses are called the grouping construct of a regular expression.

They mark certain parts of the content matched by the regular expression as a group.

We can mark multiple groups in one regular expression.

Why do we need groups? Because we often need to extract certain parts of the matched content.

Similar to an earlier example, suppose that from the text below we want to select the string before the comma on each line, including the comma itself.

苹果,苹果是绿色的
橙子,橙子是橙色的
香蕉,香蕉是黄色的

You can write the regular expression like this: ^.*,

But what if the result must not include the comma?

Of course you cannot simply write ^.*

The comma is the feature that marks where the match ends; if you remove it from the expression, there is no way to find the part before the comma.

But if the comma stays in the expression, the match includes the comma again.

The solution is to use the grouping construct: parentheses.

We write ^(.*), and the result is as follows

As you can see, we put the part to be extracted in parentheses, so that the fruit names are captured in a group of their own.

The corresponding Python code is as follows

content = '''苹果,苹果是绿色的
橙子,橙子是橙色的
香蕉,香蕉是黄色的'''

import re
p = re.compile(r'^(.*),', re.MULTILINE)
for one in  p.findall(content):
    print(one)

Grouping can also be used multiple times.

For example, we want to extract each person's name and corresponding mobile phone number from the following text

张三,手机号码15945678901
李四,手机号码13945677701
王二,手机号码13845666901

A regular expression like this can be used ^(.+),.+(\d{11})

The following code can be written

content = '''张三,手机号码15945678901
李四,手机号码13945677701
王二,手机号码13845666901'''

import re
p = re.compile(r'^(.+),.+(\d{11})', re.MULTILINE)
for one in  p.findall(content):
    print(one)
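When an expression contains more than one group, findall returns one tuple per match, with one element per group. A sketch of the same example that unpacks each tuple (the names pairs, name and phone are chosen here for illustration):

```python
import re

content = '''张三,手机号码15945678901
李四,手机号码13945677701
王二,手机号码13845666901'''

p = re.compile(r'^(.+),.+(\d{11})', re.MULTILINE)
pairs = p.findall(content)

# Each element of pairs is a (name, phone) tuple
for name, phone in pairs:
    print(name, phone)
# 张三 15945678901
# 李四 13945677701
# 王二 13845666901
```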

When there are multiple groups, we can name each group using the format (?P<group_name>...).

The advantage is that later code can extract the contents of each group more conveniently.

for example

content = '''张三,手机号码15945678901
李四,手机号码13945677701
王二,手机号码13845666901'''

import re
p = re.compile(r'^(?P<name>.+),.+(?P<phone>\d{11})', re.MULTILINE)
for match in  p.finditer(content):
    print(match.group('name'))
    print(match.group('phone'))

Let dot match newline

As mentioned earlier, the dot does not match newline characters, but sometimes the pattern we are looking for spans several lines. For example, suppose we want to find all the job titles in the following text

<div class="el">
        <p class="t1">           
            <span>
                <a>Python开发工程师</a>
            </span>
        </p>
        <span class="t2">南京</span>
        <span class="t3">1.5-2万/月</span>
</div>
<div class="el">
        <p class="t1">
            <span>
                <a>java开发工程师</a>
            </span>
		</p>
        <span class="t2">苏州</span>
        <span class="t3">1.5-2/月</span>
</div>

If you use the expression class=\"t1\">.*?<a>(.*?)</a> directly, you will find that it matches nothing, because there are two line breaks between t1 and <a>.

In this case the dot needs to match newlines as well, which you can enable with the DOTALL flag

like this

content = '''
<div class="el">
        <p class="t1">           
            <span>
                <a>Python开发工程师</a>
            </span>
        </p>
        <span class="t2">南京</span>
        <span class="t3">1.5-2万/月</span>
</div>
<div class="el">
        <p class="t1">
            <span>
                <a>java开发工程师</a>
            </span>
		</p>
        <span class="t2">苏州</span>
        <span class="t3">1.5-2/月</span>
</div>
'''

import re
p = re.compile(r'class=\"t1\">.*?<a>(.*?)</a>', re.DOTALL)
for one in  p.findall(content):
    print(one)
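The effect of the flag is easiest to see by running the same expression with and without it. A minimal sketch on a shortened, made-up version of the HTML above (re.S is the short alias for re.DOTALL):

```python
import re

# Shortened, made-up version of the HTML above
html = '''<p class="t1">
    <span>
        <a>Python开发工程师</a>
    </span>
</p>'''

pattern = r'class="t1">.*?<a>(.*?)</a>'

# Without DOTALL the dot cannot cross the line breaks, so nothing matches
without_flag = re.findall(pattern, html)
print(without_flag)   # []

# With DOTALL the dot also matches newlines
with_flag = re.findall(pattern, html, re.DOTALL)
print(with_flag)      # ['Python开发工程师']
```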

Back to the opening example

With the knowledge above, let's look again at the example at the beginning of this article:

extract the salaries of all positions from the text below.

Python3 高级开发工程师 上海互教教育科技有限公司上海-浦东新区2万/月02-18满员
测试开发工程师(C++/python) 上海墨鹍数码科技有限公司上海-浦东新区2.5万/每月02-18未满员
Python3 开发工程师 上海德拓信息技术股份有限公司上海-徐汇区1.3万/每月02-18剩余11人
测试开发工程师(Python) 赫里普(上海)信息科技有限公司上海-浦东新区1.1万/每月02-18剩余5人

The expression we use is ([\d.]+)万/每{0,1}月

Why is it written this way?

[\d.]+ matches one or more digits or dots in a row, so it matches numbers such as 3, 33 and 33.33

万/每{0,1}月 must follow immediately. Without this part, the expression would also match other numbers, such as the 3 in Python3.

In it, 每{0,1} means that the character 每 can appear 0 or 1 times, so this part matches both 万/月 and 万/每月.

Is there another way to write 每{0,1}月?

Yes: 每?月 also works, because the question mark means the preceding character is matched 0 or 1 times.
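A quick check that the two spellings behave the same, using two lines from the job list (the variable names are illustrative):

```python
import re

# Two lines taken from the job list above
content = '''Python3 高级开发工程师 上海互教教育科技有限公司上海-浦东新区2万/月02-18满员
Python3 开发工程师 上海德拓信息技术股份有限公司上海-徐汇区1.3万/每月02-18剩余11人'''

with_braces = re.findall(r'([\d.]+)万/每{0,1}月', content)
with_question = re.findall(r'([\d.]+)万/每?月', content)

print(with_braces)     # ['2', '1.3']
print(with_question)   # ['2', '1.3']
```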

Splitting strings

The split method of string objects is only suitable for simple splitting. Sometimes you need more flexible splitting.

For example, we need to extract the names of the generals from the following string.

names = '关羽; 张飞, 赵云,马超, 黄忠  李逵'

We found that some of these names are separated by semicolons, some are separated by commas, and some are separated by spaces, and there are an indefinite number of spaces around the separator

At this time, you can use the split method in the regular expression:

import re

names = '关羽; 张飞, 赵云,   马超, 黄忠  李逵'

namelist = re.split(r'[;,\s]\s*', names)
print(namelist)

The regular expression [;,\s]\s* specifies that the delimiter is a semicolon, a comma, or a whitespace character, followed by any number of additional whitespace characters.

String replacement

Replacing by pattern

The replace method of string objects is only suitable for simple substitutions. Sometimes you need more flexible substitution.

For example, we need to find every link fragment in the following text that has the form /avxxxxxx/, that is, /av followed by a series of digits and a closing /.

These strings are then all replaced with /cn345677/.

names = '''

下面是这学期要学习的课程:

<a href='https://www.bilibili.com/video/av66771949/?p=1' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是牛顿第2运动定律

<a href='https://www.bilibili.com/video/av46349552/?p=125' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是毕达哥拉斯公式

<a href='https://www.bilibili.com/video/av90571967/?p=33' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是切割磁力线
'''

The content to be replaced is not fixed, so the replace method of the string cannot be used.

At this time, you can use the sub method in the regular expression:

import re

names = '''

下面是这学期要学习的课程:

<a href='https://www.bilibili.com/video/av66771949/?p=1' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是牛顿第2运动定律

<a href='https://www.bilibili.com/video/av46349552/?p=125' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是毕达哥拉斯公式

<a href='https://www.bilibili.com/video/av90571967/?p=33' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是切割磁力线
'''

newStr = re.sub(r'/av\d+?/', '/cn345677/' , names)
print(newStr)

The sub method also replaces strings, but the content to be replaced is described by a regular expression, which stands for every string that fits the pattern.

Here the first argument, /av\d+?/, is the regular expression; it means that every string starting with /av, followed by a series of digits, and ending with / is to be replaced.

The second argument, here the string '/cn345677/', is what to replace the matches with.

The third argument is the source string.

Specify a replacement function

In the previous example, what we used to replace was a fixed string  /cn345677/.

Suppose we require instead that the number after replacement is the original number + 6; for example, /av66771949/ is replaced with /av66771955/.

For this more complex replacement, we can pass a function as the second argument of sub; the return value of the function is the string used for the replacement.

As follows

import re

names = '''

下面是这学期要学习的课程:

<a href='https://www.bilibili.com/video/av66771949/?p=1' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是牛顿第2运动定律

<a href='https://www.bilibili.com/video/av46349552/?p=125' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是毕达哥拉斯公式

<a href='https://www.bilibili.com/video/av90571967/?p=33' target='_blank'>点击这里,边看视频讲解,边学习以下内容</a>
这节讲的是切割磁力线
'''

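# Replacement function; its argument is a Match object
def subFunc(match):
    # group(0) of the Match object returns the whole matched string
    src = match.group(0)
    
    # group(1) of the Match object returns the content of the first group
    number = int(match.group(1)) + 6
    dest = f'/av{number}/'

    print(f'{src} replaced with {dest}')

    # The return value is the string used as the replacement
    return dest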

newStr = re.sub(r'/av(\d+?)/', subFunc , names)
print(newStr)

The strings in the groups can be obtained as follows

match.group(0) # the whole matched string
match.group(1) # the string in group 1
match.group(2) # the string in group 2

In Python 3.6 and later, you can also write this more concisely, using subscripts as with a list, as follows

match[0]
match[1]
match[2]
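A small sketch showing that both spellings return the same strings. It uses re.search, which returns the first Match object found, or None if there is no match; the sample string and the name m are made up for illustration:

```python
import re

# Made-up sample string for illustration
m = re.search(r'/av(\d+)/', 'video/av66771949/?p=1')

whole_by_method = m.group(0)
whole_by_index = m[0]
print(whole_by_method, whole_by_index)   # /av66771949/ /av66771949/

first_by_method = m.group(1)
first_by_index = m[1]
print(first_by_method, first_by_index)   # 66771949 66771949
```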

In the example above:

When the re.sub function runs, each time it finds a matching substring it will:

  • Instantiate a match object

    This match object contains information about this match, such as: what is the whole string, what is the matching part of the string, and what are the group strings in it

  • Call and execute the second parameter object of the sub function, that is, call the callback function subFunc

    And pass the match object just generated as a parameter to subFunc

Origin blog.csdn.net/weixin_47649808/article/details/126325560