Regular expressions in Python

These are my Python study notes, written up and shared in the hope that they will be useful to others.

Regular expression

Strings are the most frequently used data structure in programming, and the need to manipulate them comes up almost everywhere. Take checking whether a string is a valid email address: you could write code that extracts the substrings before and after the @ and then checks that they are a valid name and a valid domain, but that is tedious and the code is hard to reuse.

Regular expressions are a powerful tool for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered "matched", and any other string is rejected.

So the way we judge whether a string is a valid Email is:

  1. Create a regular expression that matches Email;

  2. Use the regular expression to match the user's input to determine whether it is legal.

Because regular expressions are also represented by strings, we must first understand how to use characters to describe characters.

In a regular expression, a character given directly is matched exactly. \d matches a digit, and \w matches a letter, digit, or underscore, so:

  • '00\d' can match '007', but cannot match '00A';
  • '\d\d\d' can match '010';
  • '\w\w\d' can match 'py3';

. can match any single character (except a newline, by default), so:

  • 'py.' can match 'pyc', 'pyo', 'py!' and so on.
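
The rules above can be checked with Python's re module, which later sections introduce. This is a minimal sketch: the helper name matches_fully is made up for illustration, and it relies on re.fullmatch, which succeeds only when the whole string matches the pattern.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches_fully(r'00\d', '007'))    # True: \d matches the digit 7
print(matches_fully(r'00\d', '00A'))    # False: A is not a digit
print(matches_fully(r'\d\d\d', '010'))  # True
print(matches_fully(r'\w\w\d', 'py3'))  # True
print(matches_fully(r'py.', 'py!'))     # True: . matches any single character
print(matches_fully(r'py.', 'py'))      # False: . needs exactly one character
```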

To match a variable number of characters, put * after an item to allow any number of repetitions (including 0), + to require at least one, ? to allow 0 or 1, {n} to require exactly n, and {n,m} to allow n to m repetitions.

Let's look at a more complex example: \d{3}\s+\d{3,8}.

Let's read it from left to right:

  1. \d{3} matches 3 digits, such as '010';
  2. \s matches a whitespace character (a space, a Tab, and so on), so \s+ means at least one whitespace character, for example ' ' or '  ';
  3. \d{3,8} matches 3 to 8 digits, such as '1234567'.

Taken together, the above regular expressions can match phone numbers with area codes separated by any number of spaces.
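
As a rough check of this reading, the sketch below runs the pattern against a few sample strings. The sample strings and the variable names are chosen here for illustration only; re.fullmatch is used so that the whole string has to match.

```python
import re

phone = re.compile(r'\d{3}\s+\d{3,8}')

samples = ['010 12345', '010    1234567', '010\t555', '01012345', '010 12']
# Map each sample to True/False depending on whether the whole string matches.
results = {text: phone.fullmatch(text) is not None for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
# '01012345' fails because \s+ needs at least one whitespace character;
# '010 12' fails because \d{3,8} needs at least 3 digits.
```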

What if you want to match a number like '010-12345'? Outside of [], '-' is an ordinary character and can be written directly; the examples in this article nevertheless escape it with '\', which is harmless, so the regular expression becomes \d{3}\-\d{3,8}.

However, it still cannot match '010 - 12345', because that string contains spaces. So we need more complicated matching methods.
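
One possible way to tolerate such spaces, offered only as a sketch and not necessarily what the author had in mind, is to allow optional whitespace with \s* on either side of the hyphen. The names spaced_phone and is_phone are hypothetical.

```python
import re

# \s* allows zero or more whitespace characters on each side of the hyphen.
spaced_phone = re.compile(r'\d{3}\s*-\s*\d{3,8}')

def is_phone(text):
    """Return True if the whole of text looks like 'NNN-NNNNN' with optional spaces."""
    return spaced_phone.fullmatch(text) is not None

print(is_phone('010-12345'))    # True
print(is_phone('010 - 12345'))  # True
print(is_phone('010 12345'))    # False: the hyphen is still required
```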

Advanced

For more precise matching, you can use [] to indicate the range, for example:

  • [0-9a-zA-Z_] can match a digit, letter, or underscore;
  • [0-9a-zA-Z_]+ can match a string consisting of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', etc.;
  • [a-zA-Z_][0-9a-zA-Z_]* can match a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a legal Python variable name;
  • [a-zA-Z_][0-9a-zA-Z_]{0,19} more precisely limits the length of the variable name to 1-20 characters (1 character in front + up to 19 characters after it).
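
The last two patterns can be tried out as follows. This is a sketch with made-up variable names; it covers ASCII names only, and the quantifier is written as {0,19} with no space inside the braces, because Python's re module does not treat a form containing a space as a quantifier.

```python
import re

identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]*')
short_identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

print(identifier.fullmatch('_count1') is not None)       # True
print(identifier.fullmatch('1abc') is not None)          # False: starts with a digit
print(short_identifier.fullmatch('a' * 20) is not None)  # True: exactly 20 characters
print(short_identifier.fullmatch('a' * 21) is not None)  # False: one character too long
```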

A|B can match A or B, so (P|p)ython can match 'Python' or 'python'.

^ indicates the beginning of the line, so ^\d means the line must start with a digit.

$ indicates the end of the line, so \d$ means the line must end with a digit.

You may have noticed that py can also match 'python', but ^py$ matches the whole line, so it can only match 'py'.
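
The effect of ^, $ and | is easiest to see with re.search, which looks for a match anywhere in the string. The helper name found and the sample strings are assumptions made for this sketch.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r'py', 'python'))                # True: 'py' occurs in the string
print(found(r'^py$', 'python'))              # False: the whole line must be 'py'
print(found(r'^py$', 'py'))                  # True
print(found(r'^\d', '3 apples'))             # True: starts with a digit
print(found(r'\d$', 'route 66'))             # True: ends with a digit
print(found(r'(P|p)ython', 'I like Python')) # True
```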

re module

With this background, we can use regular expressions in Python. Python provides the re module, which contains all of the regular expression functions. Because Python strings themselves also use \ as an escape character, pay special attention to:

s = 'ABC\\-001' # a Python string
# the corresponding regular expression string becomes:
# 'ABC\-001'

Therefore, we strongly recommend using Python's r prefix, so you don't have to worry about escaping:

s = r'ABC\-001' # a Python string
# the corresponding regular expression string is unchanged:
# 'ABC\-001'
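
A quick way to convince yourself that the two spellings are the same string is to compare them directly. This is a small sketch; the variable names escaped and raw are chosen here for illustration.

```python
import re

escaped = 'ABC\\-001'  # backslash written with Python's own escaping
raw = r'ABC\-001'      # same characters, written as a raw string

print(escaped == raw)  # True: both contain a single backslash
print(len(raw))        # 8
print(raw)             # ABC\-001
print(re.match(raw, 'ABC-001') is not None)  # True: \- matches a literal hyphen
```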

First look at how to determine whether the regular expression matches:

import re
re.match(r'^\d{3}\-\d{3,8}$', '010-12345')  # returns a Match object
re.match(r'^\d{3}\-\d{3,8}$', '010 12345')  # returns None: a space is not a hyphen

The match() method tests whether the pattern matches at the start of the string. If the match succeeds, it returns a Match object; otherwise it returns None. A common way to test is:

test = 'the string entered by the user'
if re.match(r'the regular expression', test):
    print('ok')
else:
    print('failed')
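
Applied to the email problem from the introduction, the same test might look like the sketch below. The pattern is deliberately simplified and is an assumption of this example, not a complete email grammar: it accepts letters, digits, underscores and dots before the @, and a single-label domain with a suffix of two or more letters. The names EMAIL_RE and looks_like_email are hypothetical.

```python
import re

# Simplified, illustrative pattern; real email addresses allow much more.
EMAIL_RE = re.compile(r'^[0-9a-zA-Z_.]+@[0-9a-zA-Z]+\.[a-zA-Z]{2,}$')

def looks_like_email(text):
    """Return True if text fits the simplified pattern above."""
    return EMAIL_RE.match(text) is not None

for candidate in ['someone@gmail.com', 'bill.gates@microsoft.com', 'bob#example.com']:
    if looks_like_email(candidate):
        print(candidate, 'ok')
    else:
        print(candidate, 'failed')
```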

Split string

Splitting a string with regular expressions is more flexible than splitting on a fixed character. Here is the ordinary split:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

The ordinary split cannot treat consecutive spaces as a single separator. Try a regular expression instead:

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

No matter how many spaces there are, the string is split correctly. Now add , as a separator:

>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']

Add ; as a separator as well:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

If users enter a set of tags, remember that a regular expression can turn their irregular input into a clean list.
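
A small helper along these lines could handle the tag case. This is a sketch: the function name split_tags and the sample input are made up, and empty strings are filtered out because re.split returns them when the input starts or ends with a separator.

```python
import re

def split_tags(raw):
    """Split raw on runs of whitespace, commas, or semicolons."""
    parts = re.split(r'[\s,;]+', raw.strip())
    # Drop empty strings left by leading or trailing separators.
    return [part for part in parts if part]

print(split_tags(' python, regex;;  re ,tags '))  # ['python', 'regex', 're', 'tags']
print(split_tags(',a;b'))                          # ['a', 'b']
print(split_tags(''))                              # []
```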

Grouping

In addition to simply testing for a match, regular expressions can also extract substrings. Use () to mark a group to be extracted. For example:

^(\d{3})-(\d{3,8})$ defines two groups, which extract the area code and the local number directly from the matched string:

import re

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
print(m)
print(m.group(0))
print(m.group(1))
print(m.group(2))

Output (Python 3.7 or later):

<re.Match object; span=(0, 9), match='010-12345'>
010-12345
010
12345

If a group is defined in the regular expression, the substring can be extracted using the group() method on the Match object.

Note that group(0) is always the entire matched text, while group(1), group(2), ... are the 1st, 2nd, ... captured substrings.

Extracting substrings is very useful. Let's look at a more extreme example:
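
Because re.match returns None when nothing matches, calling group() on the result without checking raises an AttributeError. A minimal sketch of a safer wrapper follows; the function name parse_phone is hypothetical.

```python
import re

def parse_phone(text):
    """Return (area_code, local_number), or None if text does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_phone('010-12345'))  # ('010', '12345')
print(parse_phone('010 12345'))  # None
```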

t = '19:05:30'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
print(m.groups())

Output:

('19', '05', '30')

This regular expression recognizes valid times directly. In some cases, however, a regular expression alone cannot complete the validation, such as when recognizing dates:

'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

Invalid dates such as '2-30' and '4-31' still pass this regular expression, and writing a pattern that rejects them is very difficult. In such cases, the program has to perform the additional check itself.
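
One way the program might cooperate with the pattern is sketched below: the regular expression extracts the month and day, and calendar.monthrange from the standard library supplies the number of days in that month. The function name is_valid_month_day and the default year of 2001 are assumptions, since the pattern itself carries no year.

```python
import calendar
import re

MONTH_DAY = re.compile(r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$')

def is_valid_month_day(text, year=2001):
    """Check the shape with the regex, then the calendar with monthrange."""
    m = MONTH_DAY.match(text)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    # The single-digit alternatives in the pattern also admit 0, so reject it here.
    if month == 0 or day == 0:
        return False
    days_in_month = calendar.monthrange(year, month)[1]
    return day <= days_in_month

print(is_valid_month_day('12-25'))            # True
print(is_valid_month_day('2-30'))             # False: February never has 30 days
print(is_valid_month_day('4-31'))             # False: April has 30 days
print(is_valid_month_day('2-29', year=2000))  # True: 2000 was a leap year
```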

Greedy match

Finally, note that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, try to capture the trailing zeros of a number:

print(re.match(r'^(\d+)(0*)$', '102300').groups())

Output:

('102300', '')

Since \d+ adopts greedy matching, it directly matches all the following 0s. As a result, 0* can only match the empty string.

To let 0* capture the trailing zeros, \d+ must use non-greedy matching (that is, match as few characters as possible). Adding a ? after \d+ makes it non-greedy:

print(re.match(r'^(\d+?)(0*)$', '102300').groups())

Output:

('1023', '00')
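
The same difference shows up with other patterns. As an additional illustration, with a sample string chosen here and not taken from the original notes, compare a greedy and a non-greedy match against a short piece of markup:

```python
import re

text = '<b>bold</b>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.match(r'<.+>', text).group(0)
# Non-greedy: .+? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r'<.+?>', text).group(0)

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```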

Compile

When we use regular expressions in Python, the re module will do two things inside:

  1. Compile the regular expression, if the string of the regular expression itself is illegal, an error will be reported;
  2. Use the compiled regular expression to match the string.
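
The error in the first step can be observed by compiling an invalid pattern: re.compile raises re.error. A small sketch follows; the helper name try_compile and the sample patterns are made up for illustration.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print('invalid pattern %r: %s' % (pattern, exc))
        return None

good = try_compile(r'\d{3}')   # compiles without complaint
bad = try_compile(r'(\d{3}')   # unbalanced parenthesis, so re.error is raised
print(good is not None, bad is None)  # True True
```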

If a regular expression will be reused thousands of times, we can precompile it for efficiency. Subsequent uses then skip the compilation step and go straight to matching:

import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
print(re_telephone.match('010-12345').groups())
print(re_telephone.match('010-8086').groups())

Output:

('010', '12345')
('010', '8086')

After compilation, a Regular Expression object is generated. Since the object itself contains the regular expression, there is no need to pass the pattern string when calling its methods.
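
In practice the compiled object is typically applied to many inputs. The sketch below, with a made-up list of inputs, keeps only the strings that match and also shows that the original pattern text stays available through the object's pattern attribute.

```python
import re

re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')

numbers = ['010-12345', '010-8086', '10-123', 'abc']
# match() returns None for non-matching strings, which the filter skips.
parsed = [m.groups() for m in map(re_telephone.match, numbers) if m]

print(parsed)                # [('010', '12345'), ('010', '8086')]
print(re_telephone.pattern)  # ^(\d{3})-(\d{3,8})$
```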

You are welcome to follow the public account "Web Development" to receive Python test demos and learning resources, so that we can learn Python together and share methods that make development easier for all of us.

I hope this helps. If you have any questions, you can join the QQ technical exchange group: 668562416
If anything here is wrong or incomplete, comments and suggestions from readers are welcome.
If you would like to reprint this article, please contact me; reprinting is allowed with authorization. Thank you.



Origin blog.csdn.net/qq_36478920/article/details/102929044