Python learning: regular expressions

Strings are the most commonly used data structure in programming, and the need to manipulate them comes up almost everywhere. For example, to decide whether a string is a valid Email address, you could extract the substrings before and after @ programmatically and then check that they are a word and a domain name respectively, but this is tedious and the code is hard to reuse.

Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that satisfies the rule is considered a "match", and any other string is invalid.

So to judge whether a string is a valid Email, we:

  1. Create a regular expression that matches Email;
  2. Use the regular expression to match the user's input and determine whether it is valid.

Because regular expressions are also represented by strings, we must first understand how to use characters to describe characters.

In a regular expression, a character given literally is matched exactly. \d matches a digit, and \w matches a letter, digit or underscore, so:

  • '00\d' can match '007', but cannot match '00A';
  • '\d\d\d' can match '010';
  • '\w\w\d' can match 'py3';

. can match any single character, so:

  • 'py.' can match 'pyc', 'pyo', 'py!' and so on.
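As a quick check of the rules above, here is a small sketch. It uses re.fullmatch, which only succeeds when the whole string fits the pattern; the helper name matches is made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))    # True
print(matches(r'00\d', '00A'))    # False: 'A' is not a digit
print(matches(r'\d\d\d', '010'))  # True
print(matches(r'\w\w\d', 'py3'))  # True
print(matches(r'py.', 'py!'))     # True: . matches any single character
print(matches(r'py.', 'py'))      # False: . needs exactly one character
```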

To match a variable number of characters, use * for any number of characters (including 0), + for at least one character, ? for 0 or 1 characters, {n} for exactly n characters, and {n,m} for n to m characters.

Let's look at a more complex example: \d{3}\s+\d{3,8}. (Here \s stands for a whitespace character.)

Let's read it from left to right:

  1. \d{3} means match 3 digits, such as '010';
  2. \s can match a space (or another whitespace character such as Tab), so \s+ means at least one whitespace character, such as ' ' or '  ';
  3. \d{3,8} means 3-8 numbers, such as '1234567'.

Taken together, the regular expression above can match a phone number whose area code is separated from the rest by one or more whitespace characters.
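To see the three parts working together, here is a minimal sketch. The function name is_phone and the sample strings are my own, chosen only to exercise each part of the pattern.

```python
import re

PHONE = r'\d{3}\s+\d{3,8}'

def is_phone(text):
    """True if text is 3 digits, some whitespace, then 3 to 8 digits."""
    return re.fullmatch(PHONE, text) is not None

print(is_phone('010 12345'))       # True
print(is_phone('010    1234567'))  # True: \s+ accepts several spaces
print(is_phone('010\t12345'))      # True: Tab counts as whitespace
print(is_phone('01012345'))        # False: no whitespace at all
print(is_phone('010 12'))          # False: fewer than 3 digits at the end
```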

What if you want to match a number like '010-12345'? '-' has a special meaning inside [], so to be safe it is escaped with '\' here, and the regular expression becomes \d{3}\-\d{3,8}. (Outside [] the escape is optional.)
However, this still cannot match '010 - 12345', because that string contains spaces, so we need a more sophisticated way of matching.

Advanced

For more precise matching, you can use [] to indicate the range, for example:

  • [0-9a-zA-Z_] can match a digit, letter or underscore;
  • [0-9a-zA-Z_]+ can match a string of at least one digit, letter or underscore, such as 'a100', '0_z', 'Py3000', etc. (+ means at least one, and the characters can appear in any order);
  • [a-zA-Z_][0-9a-zA-Z_]* can match a string that starts with a letter or underscore, followed by any number of digits, letters or underscores, i.e. a legal Python variable name (* means any number);
  • [a-zA-Z_][0-9a-zA-Z_]{0,19} more precisely limits the length of the variable name to 1-20 characters (1 character at the front + up to 19 characters after it).
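
A|B can match A or B, so (P|p)ython can match 'Python' or 'python'.

^ indicates the start of a line; ^\d means the string must start with a digit. (So read ^ together with what follows it.)

$ indicates the end of a line; \d$ means the string must end with a digit. (So read $ together with what precedes it.)

You may have noticed that py can also match 'python', but ^py$ turns it into a whole-line match that can only match 'py'.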
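The rules in this section can be tried out with a short sketch. The helper is_identifier is a name I made up; note that this ASCII-only pattern is narrower than what Python itself accepts as an identifier.

```python
import re

# Letter or underscore first, then up to 19 more letters, digits or underscores.
IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'

def is_identifier(name):
    """True if the whole name fits the 1-20 character pattern (hypothetical helper)."""
    return re.fullmatch(IDENTIFIER, name) is not None

print(is_identifier('Py3000'))   # True
print(is_identifier('_tmp'))     # True
print(is_identifier('3py'))      # False: starts with a digit
print(is_identifier('a' * 21))   # False: longer than 20 characters

# Alternation: either P or p, then 'ython'.
print(bool(re.match(r'(P|p)ython', 'python')))  # True

# Anchors: without them 'py' only has to match at the start of 'python'.
print(bool(re.match(r'py', 'python')))    # True
print(bool(re.match(r'^py$', 'python')))  # False
print(bool(re.match(r'^py$', 'py')))      # True
```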

re module

With the preparation knowledge, we can use regular expressions in Python. Python provides the re module, which includes all regular expression functions. Since Python strings themselves are also escaped with \, pay special attention to:

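s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'

Therefore, we strongly recommend using Python's r prefix so that you don't have to think about escaping:

s = r'ABC\-001'
# the corresponding regular expression string is unchanged:
# 'ABC\-001'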
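A small sketch to confirm that the two spellings produce the same string, and that the resulting pattern matches a plain 'ABC-001'. The variable names escaped and raw are mine.

```python
import re

escaped = 'ABC\\-001'  # ordinary string: the backslash has to be doubled
raw = r'ABC\-001'      # raw string: the backslash is kept as written

print(escaped == raw)                  # True: both hold the same 8 characters
print(len(raw))                        # 8
print(bool(re.match(raw, 'ABC-001')))  # True: \- matches a literal '-'
```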

First, let's see how to judge whether a regular expression matches:

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

The match() method determines whether there is a match. If the match is successful, it returns a Match object, otherwise it returns None. The common judgment method is:

test = 'string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')

Split string

Splitting a string with a regular expression is more flexible than splitting on a fixed character. Here is the ordinary split:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

Well, runs of consecutive spaces are not handled. Try a regular expression:

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

No matter how many spaces there are, the string is split correctly. Add , and try:

>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']

Add ; as well and try:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

If users enter a set of tags, remember next time to use a regular expression to turn the irregular input into a proper list.
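For the tag case, a sketch might look like the following. The function name split_tags and the sample inputs are my own. One detail worth knowing: when the input starts or ends with a separator, re.split returns empty strings at the edges, so the sketch filters them out.

```python
import re

def split_tags(raw):
    """Split user-entered tags on any run of whitespace, commas or semicolons."""
    # Separators at the very start or end of the input produce empty strings,
    # so those are dropped here.
    return [tag for tag in re.split(r'[\s,;]+', raw) if tag]

print(split_tags('python, regex;;  re   module'))  # ['python', 'regex', 're', 'module']
print(split_tags('  a,b ; c,, '))                  # ['a', 'b', 'c']
```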

Grouping

Besides simply judging whether a string matches, regular expressions can also extract substrings. Use () to mark each group to extract. For example:
^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

If a group is defined in the regular expression, the substring can be extracted using the group() method on the Match object.

Note that group(0) is always the whole text that was matched, while group(1), group(2), ... are the 1st, 2nd, ... substrings. (How many there are depends on how many groups you wrote.)
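In real code the match can fail, so it is worth checking for None before calling group(). A minimal sketch, where parse_phone is a hypothetical helper name and the 'ext 9' string is an invented input:

```python
import re

def parse_phone(text):
    """Return (area_code, local_number), or None if text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_phone('010-12345'))  # ('010', '12345')
print(parse_phone('010 12345'))  # None

# Without ^ and $, group(0) is only the part that matched, not the full input.
m = re.match(r'(\d{3})-(\d{3,8})', '010-12345 ext 9')
print(m.group(0))                # 010-12345
```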

Extracting substrings is very useful. Let's look at a more hardcore example:

>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')

Don't be intimidated: this example is simpler than it looks. 0[0-9] means the character 0 followed by any single digit, nothing more exotic than that. A letter or digit that appears on its own in a regular expression simply stands for itself.

This regular expression can recognize a valid time directly. In some cases, however, a regular expression alone cannot do the full validation, for example when recognizing dates:

'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

Invalid dates such as '2-30' and '4-31' still get past this regular expression, and writing one that rejects them is very difficult. At that point the program has to help with the check.
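One way the program could help is sketched below: the regular expression checks the shape, and calendar.monthrange from the standard library checks the day against the month. The function name is_valid_month_day and the default year 2000 (a leap year, so '2-29' is accepted) are assumptions of mine, since the pattern carries no year.

```python
import calendar
import re

MONTH_DAY = r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

def is_valid_month_day(text, year=2000):
    """Check the shape with the regex, then the calendar with plain code."""
    m = re.match(MONTH_DAY, text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    # The pattern also lets a lone '0' through, so the range is checked here.
    if not 1 <= month <= 12:
        return False
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month

print(is_valid_month_day('2-30'))   # False
print(is_valid_month_day('4-31'))   # False
print(is_valid_month_day('2-29'))   # True (the assumed year 2000 is a leap year)
print(is_valid_month_day('12-31'))  # True
```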

Greedy match

Finally, it is important to point out that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, try to match the 0s at the end of a number:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

Since \d+ is greedy, it swallows all the trailing 0s, so 0* can only match the empty string.

To let 0* match the trailing 0s, \d+ must use non-greedy matching (that is, match as little as possible). Adding ? after it makes \d+ non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
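The same effect shows up whenever a pattern has a closing delimiter. A small sketch with an invented tag-like string, using re.findall (which returns every non-overlapping match in the text):

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the text, so there is one big match.
greedy = re.findall(r'<.+>', text)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.+?>', text)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```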

Compile

When we use a regular expression in Python, the re module does two things internally:

  1. Compile the regular expression; if the pattern string itself is invalid, an error is raised;
  2. Use the compiled regular expression to match the string.

If a regular expression is going to be reused thousands of times, we can precompile it for efficiency. Later uses then skip the compile step and match directly:

>>> import re
>>> # compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
>>> # use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')

After compilation, a Regular Expression object is generated. Since the object already contains the regular expression, there is no need to pass the pattern again when calling its methods.
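A typical use is to compile once and then run the same object over many inputs. A minimal sketch, with an invented list of lines:

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

lines = ['010-12345', '021-8086', 'not a number', '0755-1234']
parsed = []
for line in lines:
    m = re_telephone.match(line)  # no pattern argument: the object already holds it
    if m:
        parsed.append(m.groups())

print(parsed)                # [('010', '12345'), ('021', '8086')]
print(re_telephone.pattern)  # ^(\d{3})-(\d{3,8})$
```

'0755-1234' is skipped because the pattern allows exactly three digits before the hyphen.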

Summary

Regular expressions are so powerful that they cannot be covered in one short section. Explaining everything about them would fill a thick book. If you run into regular expression problems often, you may want a reference book on the subject.

Exercise

Please try to write a regular expression to verify email addresses. Version one should be able to verify a similar email:

My thinking:
it is really very simple. The key is to take the two emails apart: they differ only in the part at the front, and the part at the back has the same shape.
Then there must be a ^... Forget it, it's too simple to waste time on.

I was too naive; my thinking hadn't shifted yet:

import re
def is_valid_email(addr):
        return re.match(r'^([\w]+?)(\.{0}|\.)(\w+@\w+\.com)$', addr)

if __name__=="__main__":
	assert is_valid_email('[email protected]')
	assert is_valid_email('[email protected]')
	assert not is_valid_email('bob#example.com')
	assert not is_valid_email('[email protected]')
	print('ok')

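After the change:

def is_valid_email(addr):
    # (\.{0}|\.) is unnecessary; fold it into the part before it
    # ([\w]+?)(\.{0}|\.) is merged into [\w\.]+
    # which matches at least one letter, digit, underscore or .
    # @ and . don't need to be inside groups, so they are written outside
    # [] denotes a set of characters: not necessarily a numeric range, it can be any set you choose
    # [\w\.] is one such set
    return re.match(r'^([\w\.]+)@(\w+)\.(\w+)$', addr)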
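Here is a self-contained sketch for trying this pattern out. All the addresses except 'bob#example.com' are hypothetical ones I made up for the test. The last case shows a limitation: the pattern allows only one dot after @.

```python
import re

# Inside [] the dot needs no escape, so [\w.] is the same set as [\w\.].
EMAIL = re.compile(r'^([\w.]+)@(\w+)\.(\w+)$')

def is_valid_email(addr):
    """True if addr looks like name@domain.suffix under this simple pattern."""
    return EMAIL.match(addr) is not None

print(is_valid_email('someone@example.com'))       # True
print(is_valid_email('first.last@example.org'))    # True
print(is_valid_email('bob#example.com'))           # False: no @
print(is_valid_email('a-b@example.com'))           # False: '-' is not in [\w.]
print(is_valid_email('someone@mail.example.com'))  # False: two dots after @
```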

Version two can extract the name from an email address that carries one:

<Tom Paris> tom@voyager.org => Tom Paris
bob@example.com => bob

import re


def is_valid_email(addr):
    # Remember: add ^ and $ only when the situation calls for them!
    # With them the match is anchored to the start, and findall and search lose their point.
    matchObj = re.compile(r'()([\w]+)(@\w+\.[\w]+)')
    print(matchObj.search(addr).groups())
    return matchObj.search(addr)


if __name__ == '__main__':
    assert is_valid_email('<Tom Paris> tom@voyager.org')
    print('ok')

After changing my mind:

    nameMatch = re.compile(r'<?([\w\s]+)>? ([\w]+)@([\w]+).([\w]+)$')

This still doesn't work: there is a literal space between ([\w\s]+) and ([\w]+), in front of the second bracket, so an address with no name in front of it cannot match.
So it needs to be changed:

import re


def name_of_email(addr):
    # \. matches a literal dot; an unescaped . would match any character
    nameMatch = re.compile(r'<?([\w\s]+)>?([\w\s]*)@([\w]+)\.([\w]+)$')
    return nameMatch.match(addr).group(1)

if __name__ == '__main__':
    assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
    assert name_of_email('tom@voyager.org') == 'tom'
    print('ok')

<Tom Paris> tom@voyager.org => Tom Paris
bob@example.com => bob

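Analysis:

  1. The < and > may or may not be present, so each is followed by ?
  2. Tom Paris contains letters and a space, so use the set [\w\s]+
  3. However, step 2 already matches letters and spaces. For a bare address such as bob@example.com, steps 1 and 2 consume everything before @ and nothing is left for the next group, whereas with a name in front there is still a tom left to match. So the next group can only be [\w\s]*, where the asterisk means any number, including zero.
  4. Don't put @ in a group; write it outside
  5. ([\w]+) matches the part of the domain before the dot
  6. \.([\w]+)$ matches the dot and the final part of the domain; the dot should be escaped, because an unescaped . matches any character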
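As an alternative to the analysis above, the whole <name> part can be made optional in one go. This is only a sketch of a different approach, not the version the exercise expects: it relies on (?:...), a non-capturing group that the notes above do not cover, and the constant name NAME_OR_LOCAL is my own.

```python
import re

# (?:<([\w\s]+)>\s*)? is an optional, non-capturing wrapper around the
# display name; group 1 is None when no <name> part is present.
NAME_OR_LOCAL = re.compile(r'^(?:<([\w\s]+)>\s*)?([\w.]+)@(\w+)\.(\w+)$')

def name_of_email(addr):
    """Return the display name if present, else the part before @, else None."""
    m = NAME_OR_LOCAL.match(addr)
    if m is None:
        return None
    display_name, local_part = m.group(1), m.group(2)
    return display_name if display_name is not None else local_part

print(name_of_email('<Tom Paris> tom@voyager.org'))  # Tom Paris
print(name_of_email('bob@example.com'))              # bob
print(name_of_email('not an email'))                 # None
```

Compared with calling .group(1) directly on the match, this version returns None for input that does not match instead of raising AttributeError.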

Origin blog.csdn.net/qq_44787943/article/details/112597076