A detailed, step-by-step guide to Python regular expressions that complete beginners can follow

Regular expressions are a practical tool for processing strings and are used often in Python, for example to extract strings from crawled data. Support for regular expressions is built into Python and is available by importing the re module. Most people who are just starting to learn Python have at least heard the term "regex".

Today I would like to share a fairly detailed walkthrough of Python regular expressions. By the end you should be comfortable with the commonly used syntax and functions.



1. re module

Before talking about regular expressions, we first need to know where they are used. In this article they are mostly passed to the findall() function of the re module, which can handle most string searches.

1. Import the re module
Before using regular expressions, you need to import the re module.

import re	

2. The syntax of findall():

After importing the re module, you can call findall(). Its syntax is:

re.findall(pattern, string)

findall() takes a regular expression (the pattern) and a target string. The target string is the text you want to search, and the regular expression describes what to look for, which is the focus of this article.

findall() returns a list of all non-overlapping substrings of the target string that match the pattern (for patterns without capture groups, as used in this article). If nothing matches, the list is empty.
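
As a small sketch of the return value described above (the sample text and variable names are made up for illustration), findall() gives back a list of matched substrings, and an empty list when nothing matches:

```python
import re

text = "cat bat cat"  # sample text, chosen only for illustration

found = re.findall("cat", text)    # every non-overlapping match, left to right
missing = re.findall("dog", text)  # no match gives an empty list, not None

print(found)    # ['cat', 'cat']
print(missing)  # []
```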


2. Regular expressions

(1) String matching

1. Ordinary characters

Most letters and characters can match themselves.

import re
a = "abc123+-*"
b = re.findall('abc',a)
print(b)

Output result:

['abc']

2. Metacharacters

Metacharacters are special characters such as . ^ $ * + ? {} \ []. They let us describe more flexible searches of the target string and return exactly the results we want.

Here I will introduce 10 commonly used metacharacters and explain the use of each one in turn.


(1) []

There are three main ways to use []:

  • Commonly used to specify a character set.
s = "a123456b"
rule = "a[0-9][1-6][1-6][1-6][1-6][1-6]b"	# verbose for now; a shorter way that avoids repeating [1-6] comes later
l = re.findall(rule,s)
print(l)

The output is:

['a123456b']
  • Can represent a range.

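For example, to select every three-character substring of "abcabcaccaac" that starts with a, ends with c, and has a, b, or c in the middle:

s = "abcabcaccaac"
rule = "a[a-c]c"  # [a-c] is the range a to c, the same as [abc]; no commas are needed inside []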
l = re.findall(rule, s)
print(l)

The output is:

['abc', 'abc', 'acc', 'aac']
  • Most metacharacters lose their special meaning inside [] and are treated as ordinary characters.

For example, to select "caa^" from the string "caa^bcabcaabc":

print(re.findall("caa[a^]", "caa^bcabcaabc"))

The output is:

['caa^']

Note: When ^ is the first character inside [], it negates the set, so [^a] matches any character except a. For example, move ^ to the front of []:

print(re.findall("caa[^a]", "caa^bcabcaabc"))

output:

['caa^', 'caab'] 
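
The three uses of [] described above can be put side by side in a minimal sketch. The sample string is invented for illustration:

```python
import re

s = "a1c abc a-c aXc"  # sample text, chosen only for illustration

digits = re.findall("a[0-9]c", s)    # a set given as a range: any one digit
letters = re.findall("a[a-z]c", s)   # any one lowercase letter
negated = re.findall("a[^a-z]c", s)  # ^ first inside []: anything except a-z

print(digits)   # ['a1c']
print(letters)  # ['abc']
print(negated)  # ['a1c', 'a-c', 'aXc']
```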

(2)^

^ matches the beginning of the string (and, in multiline mode, the beginning of each line), for example:

print(re.findall("^abca", "abcabcabc"))

Output result:

['abca']


(3) $
$ matches the end of the string (and, in multiline mode, the end of each line), for example:

print(re.findall("abc$", "accabcabc"))

Output result:

['abc']
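
By default ^ and $ anchor to the whole string. To anchor to every line, the re.MULTILINE flag can be passed as a third argument to findall(). The following is a minimal sketch with an invented two-line string:

```python
import re

text = "abc x abc\nabc y abc"  # two lines, sample text for illustration

start_default = re.findall("^abc", text)              # start of the string only
start_lines = re.findall("^abc", text, re.MULTILINE)  # start of every line
end_default = re.findall("abc$", text)                # end of the string only
end_lines = re.findall("abc$", text, re.MULTILINE)    # end of every line

print(start_default, start_lines)  # ['abc'] ['abc', 'abc']
print(end_default, end_lines)      # ['abc'] ['abc', 'abc']
```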


(4)\

A backslash followed by certain characters has a special meaning. The following three are common.

  • \d: matches any decimal digit; for ASCII text this is equivalent to [0-9]
print(re.findall(r"c\d\d\da", "abc123abc"))

The output is:

['c123a']

\ can also escape a metacharacter so that it is matched as an ordinary character, for example:

print(re.findall(r"\^abc", "^abc^abc"))

Output result:

['^abc', '^abc']
  • \s

Matches any whitespace character, for example:

print(re.findall(r"\s\s", "a     c"))

Output result:

['  ', '  ']
  • \w

Matches any letter, digit, or underscore; for ASCII text this is equivalent to [a-zA-Z0-9_], for example:

print(re.findall(r"\w\w\w", "abc12_"))

output:

['abc', '12_']
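
A practical note on the backslash classes above: writing the pattern as a raw string (the r"..." prefix) keeps Python from interpreting the backslash itself, which avoids invalid-escape warnings in recent Python versions. A minimal sketch with an invented sample string:

```python
import re

text = "a_1 b\t2"  # sample text; \t here is a real tab character

digits = re.findall(r"\d", text)  # each decimal digit
spaces = re.findall(r"\s", text)  # each whitespace character (space, tab, ...)
words = re.findall(r"\w", text)   # each letter, digit, or underscore

print(digits)  # ['1', '2']
print(spaces)  # [' ', '\t']
print(words)   # ['a', '_', '1', 'b', '2']
```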


(5){n}

{n} avoids writing the same thing repeatedly. For example, earlier we wrote \w three times in a row; with {n} we write it once, where n is the number of repetitions to match, for example:

print(re.findall(r"\w{2}", "abc12_"))

Output result:

['ab', 'c1', '2_']
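
With {n}, the repetitive rule from the [] section can be shortened. This sketch compares the two forms on the same string; the variable names are chosen for illustration:

```python
import re

s = "a123456b"

verbose = "a[0-9][1-6][1-6][1-6][1-6][1-6]b"  # the earlier, repetitive rule
compact = "a[0-9][1-6]{5}b"                   # same rule, [1-6] repeated 5 times

print(re.findall(verbose, s))  # ['a123456b']
print(re.findall(compact, s))  # ['a123456b']
```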

(6)*

* means match zero or more times (as many as possible), for example:

print(re.findall(r"010-\d*", "010-123456789"))

output:

['010-123456789']

(7) +

+ means match one or more times, for example:

print(re.findall(r"010-\d+", "010-123456789"))

output:

['010-123456789']

(8) .

. matches any single character except a newline, for example:

print(re.findall(".", "010\n?!"))

output:

['0', '1', '0', '?', '!']

(9) ?

? means match zero or one time, for example:

print(re.findall(r"010-\d?", "010-123456789"))

output:

['010-1']

Here we should pay attention to greedy mode and non-greedy mode.

Greedy mode: match as much data as possible. This is the default behavior of a quantifier, such as \d*:

print(re.findall(r"010-\d*", "010-123456789"))

output:

['010-123456789']

Non-greedy mode: match as little data as possible, written as a quantifier followed by ?, such as \d*?:

print(re.findall(r"010-\d*?", "010-123456789"))

The output is:

['010-']
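
The difference between the two modes is easier to see when something must follow the quantifier. The sketch below uses an invented HTML-like string and the pattern <.*> in both modes:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # sample text for illustration

greedy = re.findall("<.*>", html)  # .* grabs as much as it can
lazy = re.findall("<.*?>", html)   # .*? stops at the first possible >

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```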

(10){m,n}
m and n are non-negative integers; {m,n} means repeat at least m times and at most n times, for example:

print(re.findall(r"010-\d{3,5}", "010-123456789"))

output:

['010-12345']

Adding ? after {m,n} matches as few repetitions as possible:

print(re.findall(r"010-\d{3,5}?", "010-123456789"))

output:

['010-123']

{m,n} has other flexible ways of writing, such as:

  • {1,} is equivalent to + described above
  • {0,1} is equivalent to ? described above
  • {0,} is equivalent to * described above
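
The equivalences listed above can be checked with a short sketch that runs each pair of patterns on the sample phone string used earlier. The list name pairs is chosen only for illustration:

```python
import re

s = "010-123456789"

# Each pair of patterns should behave the same on this sample string.
pairs = [
    (r"010-\d+", r"010-\d{1,}"),
    (r"010-\d?", r"010-\d{0,1}"),
    (r"010-\d*", r"010-\d{0,}"),
]

for short_form, brace_form in pairs:
    print(short_form, re.findall(short_form, s), re.findall(brace_form, s))
```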


That is all for the commonly used metacharacters. Next, let's look at other ways of working with regular expressions.


(2) Using regular expressions

1. Compiling a regular expression

The re module can compile a regular expression into a reusable pattern object with re.compile(pattern), for example:

import re
s = "010-123456789"
rule = r"010-\d*"
rule_compile = re.compile(rule)  # returns a compiled pattern object
# print(rule_compile)
s_compile = rule_compile.findall(s)
print(s_compile)  # print the list returned by findall() on the compiled pattern

Output result:

['010-123456789']
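
One common reason to compile a pattern is to reuse it on many strings. The following is a minimal sketch; the list of lines is invented sample data:

```python
import re

phone_rule = re.compile(r"010-\d+")  # compiled once, reused below

lines = ["call 010-12345", "no number here", "010-678 or 010-9"]  # sample data

results = [phone_rule.findall(line) for line in lines]

for line, found in zip(lines, results):
    print(line, "->", found)
```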

2. How to use pattern objects

A compiled pattern object can be used not only through findall(), introduced earlier, but also through other methods, each with a different effect. Here is a brief summary:

(1) findall()
finds all substrings matched by the regular expression and returns them as a list

(2) search()
scans the string for the first position where the regular expression matches and returns a Match object (or None if there is no match)

(3) match()
checks whether the regular expression matches at the beginning of the string and returns a Match object (or None if it does not)
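
The difference between the three methods summarized above can be sketched with a string in which the number does not come first. The sample string is invented for illustration:

```python
import re

s = "tel: 010-123456789"  # the number is not at the start of the string
rule = re.compile(r"010-\d*")

at_start = rule.match(s)     # None: the string does not begin with 010-
anywhere = rule.search(s)    # Match object for the first occurrence
all_found = rule.findall(s)  # list of every matched substring

print(at_start)          # None
print(anywhere.group())  # 010-123456789
print(all_found)         # ['010-123456789']
```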

Take the pattern object returned by compile() above as an example. This time we use match() instead of findall() and look at the result:

s = "010-123456789"
rule = r"010-\d*"
rule_compile = re.compile(rule)  # returns a compiled pattern object
# print(rule_compile)
s_compile = rule_compile.match(s)
print(s_compile)  # print the Match object returned by match()

output:

<re.Match object; span=(0, 13), match='010-123456789'>

The result is a Match object whose span is (0, 13): the match starts at index 0 and ends just before index 13, and the matched text is 010-123456789. Since an object is returned, let's look at some methods of the Match object.



3. Methods of the Match object

Let me introduce the methods first and give an example afterwards. The commonly used methods of the Match object are as follows:

(1) group()
returns the string matched by re

(2) start()
returns the position where the match starts

(3) end()
returns the position where the match ends

(4) span()
returns a tuple: (start, end) position

Example: Use span() on the object returned by match():

s = "010-123456789"
rule = r"010-\d*"
rule_compile = re.compile(rule)  # returns a compiled pattern object
s_compile = rule_compile.match(s)
print(s_compile.span())  # call span() on the returned Match object

The result is:

(0, 13)
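
The four methods listed above can be tried together in one sketch. Because search() and match() return None when nothing matches, it is safer to check the result before calling these methods. The sample string is invented:

```python
import re

s = "tel: 010-123456789"
m = re.search(r"010-\d*", s)

if m is not None:  # search() and match() return None when nothing matches
    print(m.group())  # 010-123456789
    print(m.start())  # 5
    print(m.end())    # 18
    print(m.span())   # (5, 18)
    print(s[m.start():m.end()] == m.group())  # True: end() is exclusive
```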

4. Functions of the re module

Besides findall(), introduced above, the re module provides several other functions:

(1) findall()
returns all the matched strings according to the regular expression.

(2) sub(pattern, replacement, string)
sub() replaces every match of the pattern in the string, for example:

s = "abcabcacc"  # original string
l = re.sub("abc", "ddd", s)  # string after processing by sub()
print(l)

output:

ddddddacc	# every abc is replaced with ddd

(3) subn(pattern, replacement, string)
subn() replaces like sub() but returns a tuple of the new string and the number of replacements

s = "abcabcacc"  # original string
l = re.subn("abc", "ddd", s)  # (new string, number of replacements) from subn()
print(l)

output:

('ddddddacc', 2)
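
Two variations on the replacement functions above, as a minimal sketch: the optional count argument limits how many matches are replaced, and the first argument can be any pattern rather than fixed text. The variable names are chosen for illustration:

```python
import re

s = "abcabcacc"

first_only = re.sub("abc", "ddd", s, count=1)  # replace at most one match
new_s, n = re.subn("abc", "ddd", s)            # unpack the (string, count) tuple
masked = re.sub(r"\d", "*", "010-123456789")   # a pattern, not just fixed text

print(first_only)  # dddabcacc
print(new_s, n)    # ddddddacc 2
print(masked)      # ***-*********
```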

(4) split()
split() splits the string at each match of the pattern, for example:

s = "abcabcacc"
l = re.split("b",s)
print(l)

Output result:

['a', 'ca', 'cacc']
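
Unlike str.split(), re.split() accepts a pattern, so one call can handle several different separators. This sketch uses an invented string and the optional maxsplit argument:

```python
import re

s = "a, b;c  d"  # sample text with mixed separators

parts = re.split(r"[,;\s]+", s)                # runs of comma, semicolon, whitespace
limited = re.split(r"[,;\s]+", s, maxsplit=1)  # stop after the first split

print(parts)    # ['a', 'b', 'c', 'd']
print(limited)  # ['a', 'b;c  d']
```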



3. Conclusion

That is all I have to say about regular expressions. They are a basic skill that is useful in almost every area of Python. I wish you success on your Python journey!


Origin blog.csdn.net/zhiguigu/article/details/130483778