Deep Dive into Regular Expressions in Python

@Author: Runsen

Regular expressions are used in many scenarios, such as search engines, matching crawler results, and extracting data from text, so being proficient with them is an essential hard skill.

regular expression

A regular expression is a special sequence of characters consisting of ordinary characters and metacharacters. Metacharacters help you easily check whether a string matches a pattern.

Python provides a powerful regular expression processing module, namely the re module, which is a built-in module of Python.

Let's start with an introductory demo. The code is as follows:

import re
reg_string = "hello9527python@wangcai.@!:xiaoqiang"
reg = "hello"
result = re.findall(reg,reg_string)
print(result)

Here reg_string is the text we search in and reg is the pattern. This pattern contains only ordinary characters, so it matches the literal text hello.

We use the findall function in the re module to match, and the returned result is a list data type.

We use regular expressions to find the string segment we need in a very long string.

metacharacter

Common metacharacters in Python and their meanings are as follows:

metacharacter   meaning
.               matches any character except a newline
\w              matches a letter, digit, underscore, or Chinese character
\s              matches any whitespace character
\d              matches any digit
\b              matches the beginning or end of a word
^               matches the beginning of the string
$               matches the end of the string

Below, we try out these common metacharacters in Python.

We still use the previous example. This time we want to match the digits in reg_string, so we just replace reg with \d.
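As a minimal sketch of this step, reusing the reg_string from the demo above (the r prefix is an addition here; it keeps the backslash intact for the re module):

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# \d matches one digit at a time, so each digit comes back as its own item
digits = re.findall(r"\d", reg_string)
print(digits)  # ['9', '5', '2', '7']
```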

If we add a ^ in front of the hello in the previous reg, the pattern only matches hello at the beginning of the string, so the result contains a single item: the hello at the start.
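A small sketch of the anchor in action. The second pattern, ^python, is an extra illustration that is not in the original text; it shows that a word in the middle of the string is not matched:

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# ^ anchors the pattern to the start of the string
at_start = re.findall(r"^hello", reg_string)
# "python" occurs in the string, but not at the start, so nothing matches
not_at_start = re.findall(r"^python", reg_string)

print(at_start)      # ['hello']
print(not_at_start)  # []
```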

If we replace reg with \w, every letter, digit, and underscore is matched, and Chinese characters are matched as well.
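The following sketch shows \w on the same reg_string. The short sample containing Chinese characters is a made-up string for illustration only:

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# \w returns one word character per item; punctuation such as @ . ! : is skipped
word_chars = re.findall(r"\w", reg_string)
print("".join(word_chars))  # hello9527pythonwangcaixiaoqiang

# Hypothetical sample: for str patterns in Python 3, \w also matches Chinese characters
mixed = re.findall(r"\w", "py_3 正则")
print(mixed)  # ['p', 'y', '_', '3', '正', '则']
```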

negated metacharacters

Common negated metacharacters and their meanings in Python are as follows:

negated metacharacter   meaning
\W                      matches any character that is not a letter, digit, underscore, or Chinese character
\S                      matches any character that is not whitespace
\D                      matches any character that is not a digit
\B                      matches a position that is not the beginning or end of a word
[^a]                    matches any character except a
[^abcd]                 matches any character except a, b, c, and d

They are easy to remember: \d matches digits, so its uppercase form \D matches non-digits. Likewise, [a] matches the character a, so [^a] matches any character except a.

The following are specific examples

>>> import re
>>> reg_string = "hello9527python@wangcai.@!:xiaoqiang"
>>> reg = r"\D"
>>> re.findall(reg,reg_string)
['h', 'e', 'l', 'l', 'o', 'p', 'y', 't', 'h', 'o', 'n', '@', 'w', 'a', 'n', 'g', 'c', 'a', 'i', '.', '@', '!', ':', 'x', 'i', 'a', 'o', 'q', 'i', 'a', 'n', 'g']
>>> reg = "[^a-p]"
>>> re.findall(reg,reg_string)
['9', '5', '2', '7', 'y', 't', '@', 'w', '.', '@', '!', ':', 'x', 'q']

quantifier

What is a quantifier? It limits how many times the preceding element may repeat in a match.

Common quantifiers and their meanings in Python are as follows:

quantifier   meaning
*            repeat zero or more times
+            repeat one or more times
?            repeat zero or one time
{n}          repeat exactly n times
{n,}         repeat n or more times
{n,m}        repeat n to m times, for example {1,3}

We still use our previous reg_string. This time the pattern is \d{4}, which matches exactly four consecutive digits.
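A minimal sketch of the {n} quantifier on the same reg_string:

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# \d{4} needs four digits in a row; the only such run in the string is 9527
four_digits = re.findall(r"\d{4}", reg_string)
print(four_digits)  # ['9527']
```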

Next, let's increase the difficulty: match letters and digits, four characters at a time.

We can use [0-9a-z]{4} as our pattern. [0-9a-z] represents the ten digits from 0 to 9 and the 26 lowercase English letters from a to z, and {4} limits each match to four such characters.

Let's print the result.

Matching proceeds from left to right in chunks of four. A character outside [0-9a-z], or a leftover run shorter than four characters, is skipped until the next four consecutive characters in the range are found.
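The chunking behaviour can be sketched as follows, again on the article's reg_string. The leftover runs hon, cai, and g are shorter than four characters, so they do not appear in the result:

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# Matches never overlap: each one consumes four characters, then the scan continues
chunks = re.findall(r"[0-9a-z]{4}", reg_string)
print(chunks)  # ['hell', 'o952', '7pyt', 'wang', 'xiao', 'qian']
```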

match ip address

On a network, each host is identified by an IP address. The IP address marks the address of each computer in the TCP/IP protocol suite, and an IPv4 address is usually written in dotted decimal, such as 192.168.1.100.

On Windows, we can view our IP address with ipconfig. On Linux, we can view it with ifconfig.

Our ip string looks like this: ip = "this is ip:192.168.1.123 :172.138.2.15"

Next, we use regular expressions to match the IP addresses.

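Most of the work is writing the pattern. For example: reg = r"\d{3}\.\d+\.\d+\.\d+". Both addresses in this string start with a three-digit number, so we can fix the first part as \d{3}. Each dot is escaped as \. because a bare . matches any character, not only a literal dot.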
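A short sketch of this pattern with findall. The dots are escaped so that they only match a literal dot. The string 123456789 is a made-up input that shows what an unescaped dot would let through:

```python
import re

ip = "this is ip:192.168.1.123 :172.138.2.15"

# Escaped dots: only real dotted addresses match
addresses = re.findall(r"\d{3}\.\d+\.\d+\.\d+", ip)
print(addresses)  # ['192.168.1.123', '172.138.2.15']

# Hypothetical input: with unescaped dots, a plain run of digits also matches
loose = re.findall(r"\d{3}.\d+.\d+.\d+", "123456789")
strict = re.findall(r"\d{3}\.\d+\.\d+\.\d+", "123456789")
print(loose)   # ['123456789']
print(strict)  # []
```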

In addition to findall, we can also use search, here with the pattern reg = r"(\d{1,3}\.){3}\d{1,3}".

In this pattern, (\d{1,3}\.) matches one to three digits followed by a dot, and {3} repeats that group three times, which covers the first three numbers of the IP address. The final \d{1,3} matches the last number.

But there is a difference between search and findall: search stops at the first match and returns a match object, whose text we retrieve with group(), while findall returns all matches as a list.
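The difference can be sketched as follows, with the dots escaped. One detail worth knowing: when the pattern contains a capturing group, findall returns the group's text rather than the whole match, so the last line uses a non-capturing group (?:...), which is an addition to the article's pattern:

```python
import re

ip = "this is ip:192.168.1.123 :172.138.2.15"
reg = r"(\d{1,3}\.){3}\d{1,3}"

# search returns a match object for the first match only (or None)
first = re.search(reg, ip)
print(first.group())  # 192.168.1.123

# With a capturing group, findall returns the group's last repetition
captured = re.findall(reg, ip)
print(captured)  # ['1.', '2.']

# A non-capturing group makes findall return the full addresses
all_ips = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}", ip)
print(all_ips)  # ['192.168.1.123', '172.138.2.15']
```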

group match

What is group matching? For example, I have a string s = "this is phone:13888888888 and this is my postcode:012345", and I need to extract both the mobile phone number and the postcode.

Because we want to extract two values, each with its own pattern, we need group matching.

Parentheses in a regular expression define a group, and the text matched by the pattern inside the parentheses can be retrieved separately.

So our pattern becomes: reg = r"this is phone:(\d{11}) and this is my postcode:(\d{6})"

We generally use search for group matching. As with the IP example, the result of search is retrieved with the group() method. group(1) returns the mobile phone number, group(2) returns the postcode, and group(0) returns the entire matched text.
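A minimal sketch of the group lookup, using the string and pattern from the text:

```python
import re

s = "this is phone:13888888888 and this is my postcode:012345"
reg = r"this is phone:(\d{11}) and this is my postcode:(\d{6})"

m = re.search(reg, s)
print(m.group(1))  # 13888888888
print(m.group(2))  # 012345
# group(0) is the whole matched text, which here is the entire string
print(m.group(0) == s)  # True
# groups() returns all captured groups as a tuple
print(m.groups())  # ('13888888888', '012345')
```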

Besides findall and search, the re module also provides match.

match only matches at the beginning of the string, and its result also has to be retrieved with group().

The flag re.I makes the match ignore case.
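The original example is not available, so the following is only a sketch under assumed inputs: it reuses the earlier reg_string and an uppercase pattern HELLO chosen for illustration.

```python
import re

reg_string = "hello9527python@wangcai.@!:xiaoqiang"

# With re.I the uppercase pattern matches the lowercase text at the start
m = re.match(r"HELLO", reg_string, re.I)
print(m.group())  # hello

# Without re.I the case differs, so match returns None
case_sensitive = re.match(r"HELLO", reg_string)
print(case_sensitive)  # None

# "python" is in the string, but not at the beginning, so match returns None
not_at_start = re.match(r"python", reg_string)
print(not_at_start)  # None
```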

Greedy and non-greedy

Greedy and non-greedy modes affect the matching behavior of subexpressions modified by quantifiers. Greedy mode matches as much as possible while still letting the entire expression match, whereas non-greedy mode matches as little as possible while still letting the entire expression match.

The following quantifiers are the important ones for greedy and non-greedy matching. Each is greedy by default and becomes non-greedy when followed by ?.

operator   meaning
*          repeat zero or more times
+          repeat one or more times
?          repeat zero or one time

For example, here I have a string reg_string = "pythonnnnnnnnnpythonHelloPytho". We first use a pattern in greedy mode: reg = "python*"

In greedy mode, reg = "python*" means pytho followed by n repeated zero or more times, as many times as possible. So the first result is pythonnnnnnnnn, the longest possible match.

The following patterns use non-greedy mode: reg = "python*?", reg = "python+?", reg = "python??".

In non-greedy mode, reg = "python*?" matches n as few times as possible, which here is zero times, so the result is pytho rather than pythonnnnnnnnn.
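The four patterns can be compared side by side in a short sketch. The trailing Pytho in the string never matches, because the patterns are case-sensitive:

```python
import re

reg_string = "pythonnnnnnnnnpythonHelloPytho"

# Greedy: n* takes the whole run of n
greedy = re.findall(r"python*", reg_string)
# Non-greedy: each quantifier takes as few n as it is allowed to
lazy_star = re.findall(r"python*?", reg_string)      # zero n
lazy_plus = re.findall(r"python+?", reg_string)      # exactly one n
lazy_question = re.findall(r"python??", reg_string)  # zero n

print(greedy)         # first item is pytho plus the whole run of n, second is 'python'
print(lazy_star)      # ['pytho', 'pytho']
print(lazy_plus)      # ['python', 'python']
print(lazy_question)  # ['pytho', 'pytho']
```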

Mobile number verification

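First of all, we need to know which prefixes a mobile number can start with.

China Mobile numbers start with one of 16 prefixes: 134, 135, 136, 137, 138, 139, 147, 150, 151, 152, 157, 158, 159, 182, 187, 188.

China Unicom numbers start with one of 7 prefixes: 130, 131, 132, 155, 156, 185, 186.

China Telecom numbers start with one of 4 prefixes: 133, 153, 180, 189.

So we can work on the beginning of the number: first check whether it starts with one of the prefixes above, then require 8 more digits. Our pattern is regex = r"^((13[0-9])|(14[57])|(15([0-3]|[5-9]))|(18[025-9]))\d{8}$" (it also accepts the prefix 145), and the code is as follows:

import re

def checkCellphone(cellphone):
    # Raw string so that \d reaches the re module unchanged.
    # [57] and [025-9] are character classes; no | or , is needed inside [].
    regex = r"^((13[0-9])|(14[57])|(15([0-3]|[5-9]))|(18[025-9]))\d{8}$"
    result = re.findall(regex, cellphone)
    if result:
        print("match succeeded")
        return True
    else:
        print("match failed")
        return False

cellphone = '13717378202'
checkCellphone(cellphone)


match succeeded
True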

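Email address validation

Next, as an exercise, let's validate email addresses.

The answer to the exercise is as follows:


import re

def checkEmail(email):
    # Dots are escaped so that they match a literal "." only.
    regex_1 = r'^(\w+)@sina\.com$'
    regex_2 = r'^(\w+)@sina\.com\.cn$'
    regex_3 = r'^(\w+)@163\.com$'
    regex_4 = r'^(\w+)@126\.com$'
    # The address part of this pattern is obscured in the source text.
    regex_5 = r'^[1-9][0-9]{4,}[email protected]$'
    regex = [regex_1, regex_2, regex_3, regex_4, regex_5]
    for i in regex:
        result = re.findall(i, email)
        if result:
            print("match succeeded")
            return True
    # Report failure only after every pattern has been tried.
    print("match failed")
    return False

email = '[email protected]'
checkEmail(email)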

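Regular expression testing tool

Open the regular expression testing tool provided by OSChina at tool.oschina.net/regex/ and enter the text to be matched…

For example, enter the following text to be matched:

Hello, my phone number is 123455678 and email is [email protected], and my website is https://blog.csdn.net/weixin_44510615.

This string contains a phone number and an email address. Next, we try to extract them with regular expressions, as shown in the figure.

On the right side of the page, select "匹配 Email 地址" (match email address), and the email address in the text appears below. If you select "匹配网址 URL" (match URL), the URL in the text appears below. Isn't that neat?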
