Ten minutes to get started with regular expressions in Python

Regular expressions have three common uses: validating data, finding text that meets given requirements, and splitting and replacing text.
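As a quick preview, the three uses can be sketched as follows. The phone number format and the log line are hypothetical sample data chosen only for illustration, not something the article prescribes.

```python
import re

# Hypothetical sample data, used only for illustration.
phone = "138-0013-8000"
log = "error 404 at 10:32, error 500 at 11:05"

# 1. Validation: fullmatch() succeeds only if the whole string fits the pattern.
is_valid = re.fullmatch(r"\d{3}-\d{4}-\d{4}", phone) is not None

# 2. Searching: findall() collects every piece of text that matches.
codes = re.findall(r"\d{3}", log)

# 3. Splitting and replacing: split() cuts the text, sub() rewrites it.
parts = re.split(r",\s*", log)
masked = re.sub(r"\d", "#", phone)

print(is_valid)  # True
print(codes)     # ['404', '500']
print(parts)     # ['error 404 at 10:32', 'error 500 at 11:05']
print(masked)    # ###-####-####
```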

1. Metacharacters

Metacharacters are characters that carry a special meaning in a regular expression instead of matching themselves literally.

They fall roughly into the following categories: those that stand for a single special character, those that stand for whitespace, those that describe a range of characters, those that specify how many times something repeats (quantifiers), and those that express assertions.

1.1 Special single characters

The dot (.) matches any single character except a line break, \d matches any single digit, \w matches any single letter, digit, or underscore, and \s matches any single whitespace character. The last three each have an uppercase counterpart, \D, \W, and \S, which matches the opposite set.

import re

# Match every digit
txt = "123d5sdf23"
result = re.findall(r'\d', txt)
print(result)
# Output: ['1', '2', '3', '5', '2', '3']

# Match every letter, digit, and underscore
txt = 'sdfw234_sdf12'
result = re.findall(r'\w', txt)
print(result)
# Output: ['s', 'd', 'f', 'w', '2', '3', '4', '_', 's', 'd', 'f', '1', '2']
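The uppercase classes and the dot can be tried out the same way. The following is a minimal sketch on a made-up sample string:

```python
import re

# Made-up sample string, used only for illustration.
txt = "a1 b_2!"

digits = re.findall(r"\d", txt)       # digits only
non_digits = re.findall(r"\D", txt)   # everything except digits
non_word = re.findall(r"\W", txt)     # not a letter, digit, or underscore
non_space = re.findall(r"\S", txt)    # everything except whitespace
any_char = re.findall(r".", "a\nb")   # the dot skips the line break

print(digits)      # ['1', '2']
print(non_digits)  # ['a', ' ', 'b', '_', '!']
print(non_word)    # [' ', '!']
print(non_space)   # ['a', '1', 'b', '_', '2', '!']
print(any_char)    # ['a', 'b']
```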

1.2 Whitespace characters

Different systems use different line-break sequences at the end of each line of text: Windows uses \r\n, while Linux and macOS use \n.

\r carriage return character

\n newline character

\f form feed character

\t tab character

\v vertical tab character

\s any whitespace character

# Get the start of every line after the first (each match begins with the line break)
with open('hello.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
print(txt)
# Output:
# 小明
# 小红
# 小月
result = re.findall(r'\n\w*', txt)
print(result)
# Output: ['\n小红', '\n小月']
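Because line endings differ between systems, a pattern that accepts both forms can be handy. This is a small sketch, assuming a hypothetical string with mixed line endings rather than a real file:

```python
import re

# Hypothetical text with mixed Windows (\r\n) and Unix (\n) line endings.
txt = "first\r\nsecond\nthird"

# \r?\n matches an optional carriage return followed by a newline.
lines = re.split(r"\r?\n", txt)

# \s covers spaces, tabs, and line breaks alike.
whitespace = re.findall(r"\s", "a b\tc\nd")

print(lines)       # ['first', 'second', 'third']
print(whitespace)  # [' ', '\t', '\n']
```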

1.3 Quantifiers

In regular expressions, the asterisk (*) means zero or more occurrences, the plus sign (+) means one or more, the question mark (?) means zero or one, and {m,n} means between m and n occurrences.

*: zero or more times

+: one or more times

?: zero or one time

{m}: exactly m times

{m,}: at least m times

{m,n}: between m and n times

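# Find groups of three digits followed by a whitespace character
txt = '123 1 sfd 2342 aa 23g 342'
result = re.findall(r'\d{3}\s', txt)
print(result)
# Output: ['123 ', '342 ']
# Note: '342 ' is the tail of '2342'; the final '342' has no trailing whitespace, so it is not matched.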
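To see how the different quantifiers behave, here is a minimal sketch on made-up input. It relies on \b (a word boundary) so that each pattern has to cover a whole number or word:

```python
import re

# Made-up sample string, used only for illustration.
txt = "1 12 123 1234 12345"

# \b is a word boundary, so each pattern must cover a whole number.
exactly_three = re.findall(r"\b\d{3}\b", txt)
at_least_three = re.findall(r"\b\d{3,}\b", txt)
two_to_four = re.findall(r"\b\d{2,4}\b", txt)

# ? makes the trailing "s" optional.
optional_s = re.findall(r"\bcats?\b", "cat cats cattle")

print(exactly_three)   # ['123']
print(at_least_three)  # ['123', '1234', '12345']
print(two_to_four)     # ['12', '123', '1234']
print(optional_s)      # ['cat', 'cats']
```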

1.4 Ranges

|: alternation (or); for example, ab|bc matches ab or bc

[...]: character set; matches any single character listed in the brackets

[a-z]: matches any single character in the range a to z

[^...]: negation; matches any single character not listed in the brackets


# A resource URL may start with http://, https://, or ftp://
txt = 'http://www.baidu.com'
result = re.match(r'(https?|ftp)://', txt)
print(result.span())
# Output: (0, 7)
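The four range constructs can be compared side by side. The word list below is made up for illustration:

```python
import re

# Made-up sample string, used only for illustration.
txt = "cat bat rat hat 42"

alternation = re.findall(r"cat|rat", txt)  # either whole alternative
one_of = re.findall(r"[cb]at", txt)        # c or b, then "at"
in_range = re.findall(r"[a-h]at", txt)     # any letter from a to h, then "at"
negated = re.findall(r"[^cb]at", txt)      # any character except c or b, then "at"

print(alternation)  # ['cat', 'rat']
print(one_of)       # ['cat', 'bat']
print(in_range)     # ['cat', 'bat', 'hat']
print(negated)      # ['rat', 'hat']
```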

2. Quantifiers and greed

2.1 Greedy mode

In regular expressions, quantifiers are greedy by default: a greedy quantifier matches as many characters as it can.

# Greedy matching
txt = 'aaabb'
result = re.findall(r'a*', txt)
print(result)
# Output: ['aaa', '', '', '']

When a* starts at the first a, it consumes as many a's as it can and stops at the first b, which gives the match 'aaa'. At each remaining position (before each of the two b's and at the end of the string) a* can only match zero a's, so findall() records an empty string each time.

In short, greedy mode always takes the longest match available.

2.2 Non-greedy mode

Adding a question mark (?) after a quantifier makes it non-greedy; for example, a* becomes a*?.

# Non-greedy matching
txt = 'aaabb'
result = re.findall(r'a*?', txt)
print(result)
# Output: ['', 'a', '', 'a', '', 'a', '', '', '']

Non-greedy mode matches as few characters as possible, which is why the result alternates between empty strings and single a's.
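The difference is easier to see when the pattern has a delimiter to stop at. A small sketch, using a made-up snippet of tagged text:

```python
import re

# Made-up sample string, used only for illustration.
txt = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.findall(r"<.*>", txt)

# Non-greedy: .*? stops at the first ">" it can reach.
lazy = re.findall(r"<.*?>", txt)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```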

3. Functions

3.1 findall() function

The findall() function returns a list containing all matches.

# findall() returns the matches in the order in which they are found

txt = "China is a great country"
x = re.findall("China", txt)
print(x)
# Output: ['China']
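Two more cases are worth trying: a pattern that matches several times, and one that does not match at all. A minimal sketch on the same sentence:

```python
import re

txt = "China is a great country"

# Every word that contains the letter "a", in order of appearance.
words_with_a = re.findall(r"\w*a\w*", txt)

# With no match, findall() returns an empty list rather than None.
no_match = re.findall(r"Japan", txt)

print(words_with_a)  # ['China', 'a', 'great']
print(no_match)      # []
```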

3.2 search() function

The search() function searches a string for a match and returns a Match object if one exists.

If there are several matches, only the first is returned; if there is no match, None is returned.

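# search() looks for a match in the string and returns a Match object if one exists
txt = "China is a great country"
x = re.search(r"\s", txt)

print("Position of the first whitespace character:", x.start())
# Output: Position of the first whitespace character: 5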
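Since search() can return None, it is safer to test the result before calling methods on it. One way to sketch this:

```python
import re

txt = "China is a great country"

found = re.search(r"great", txt)
missing = re.search(r"\d", txt)   # the sentence contains no digits

# Test the result before calling methods on it; None has no start().
if found is not None:
    print("found at", found.start())  # found at 11

print(missing)  # None
```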

3.3 split() function

The split() function returns a list in which the string has been split at each match:

# Split on whitespace
txt = "China is a great country"
x = re.split(r"\s", txt)
print(x)
# Output: ['China', 'is', 'a', 'great', 'country']

Limit the number of splits with the maxsplit parameter:

# Limit the number of splits with the maxsplit parameter
txt = "China is a great country"
x = re.split(r"\s", txt, maxsplit=2)
print(x)
# Output: ['China', 'is', 'a great country']
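split() becomes more useful when the separator itself varies. The following sketch splits a made-up string on commas, semicolons, and runs of whitespace:

```python
import re

# Made-up sample string with inconsistent separators.
txt = "apple, banana;cherry  date"

# A character set plus + treats any run of separators as one split point.
items = re.split(r"[,;\s]+", txt)

# maxsplit=1 stops after the first split and keeps the rest intact.
head_and_rest = re.split(r"[,;\s]+", txt, maxsplit=1)

print(items)          # ['apple', 'banana', 'cherry', 'date']
print(head_and_rest)  # ['apple', 'banana;cherry  date']
```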

3.4 sub() function

The sub() function replaces each match with text of your choice:

# sub() replaces each match with the text you choose
txt = "China is a great country"
x = re.sub("is", "IS", txt)
print(x)
# Output: China IS a great country
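sub() also accepts a count argument that limits how many matches are replaced, and the related subn() function reports how many replacements were made. Neither is covered above; this is a brief sketch on made-up input:

```python
import re

# Made-up sample string, used only for illustration.
txt = "one 1, two 2, three 3"

# Replace every digit.
all_replaced = re.sub(r"\d", "#", txt)

# count=1 replaces only the first match.
first_only = re.sub(r"\d", "#", txt, count=1)

# subn() returns the new string together with the number of replacements.
result, n = re.subn(r"\d", "#", txt)

print(all_replaced)  # one #, two #, three #
print(first_only)    # one #, two 2, three 3
print(result, n)     # one #, two #, three # 3
```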

3.5 Match object

A Match object contains information about a search and its result.

Note: If there is no match, None is returned instead of a Match object.

Match objects provide properties and methods for retrieving information about the search and the result:

  • span() returns a tuple containing the start and end positions of the match
  • .string returns the string passed into the function
  • group() returns the part of the string that matched

# Search for any word that starts with an uppercase "C"
txt = "China is a great country"
x = re.search(r"\bC\w+", txt)
print(x.span())
# Output: (0, 5)
# Print the part of the string that matched
print(x.group())
# Output: China
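The members listed above fit together as follows. This sketch searches the same sentence for a word starting with a lowercase g and also uses start() and end(), which return the two halves of span():

```python
import re

txt = "China is a great country"
m = re.search(r"\bg\w+", txt)

if m is not None:
    print(m.group())           # great
    print(m.span())            # (11, 16)
    print(m.start(), m.end())  # 11 16
    print(m.string)            # China is a great country
    # Slicing the original string with the span gives back the matched text.
    print(txt[m.start():m.end()] == m.group())  # True
```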

Origin blog.csdn.net/kan_Feng/article/details/132189886