Web Crawler Preparation 5: Regular Expressions

     ※ Downloading a web page with a Python crawler is easy. Finding the piece you need inside that page is the hard part: locating text in a string is not as simple as we imagine, and calling the string's find method to get the position of a fixed substring is often not enough.
     ※ For example, suppose you want to write a script that automatically collects the latest proxy IP addresses. You will certainly run into difficulties.
     (Analysis: when you first write a crawler, you go to the site and use Inspect Element. You pick an arbitrary IP, inspect it, and look at the tags around it. You find it is wrapped in a td tag, but td tags elsewhere on the page hold other information rather than IPs. You may then spend a lot of time working down from the table tag through tbody, tr and td before you finally pin down something unique to the IP addresses. This approach is not only tedious, it also has a bigger problem: it is not general. What works on this site will not work on another.
     It is better to search by the features of the content itself. I am looking for an IP address, and an IP address has four groups of digits, each in the range 0 to 255, separated by three periods. You could search the page for anything with those features, but Python's built-in string methods cannot express such a rule. Fortunately, earlier generations of programmers hit the same problem long ago and designed a very good solution for us: regular expressions.)

     ※ Regular expressions are hard to learn but very useful. When writing programs that process strings or web pages, you often need to find strings that follow some complex rule, such as the IP address rules just described. Doing that with Python's built-in string methods alone is frustrating; once you know regular expressions, you will find them a real cure-all. A regular expression is a tool for describing exactly these kinds of complex rules. Different programming languages all support regular expressions, though the details differ; Python provides them through the re module.

     ※ search(): finds the first position in a string where the regular expression pattern matches
     (The first argument is the regular expression pattern, that is, the rule you want to search for. Write it as a raw string with the r prefix to avoid a lot of unnecessary trouble. The second argument is the string to search. In the example the match spans (7, 12); if nothing is found, search() returns None.)
     (The string's find method can do this too, but it reports only the starting position.)
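
To make the difference between search() and find concrete, here is a small sketch. The sample string is my own assumption, not the one from the original example, so the positions differ from the (7, 12) mentioned above.

```python
import re

text = "I love Python regex"  # hypothetical sample string

# re.search returns a Match object for the first occurrence, or None
match = re.search(r"Python", text)
print(match.span())    # (start, end) positions of the match
print(match.group())   # the text that was matched

# str.find returns only the starting index, or -1 when nothing is found
start = text.find("Python")
print(start)

missing_re = re.search(r"Java", text)
missing_find = text.find("Java")
print(missing_re, missing_find)
```

The Match object carries both ends of the match and the matched text, which is what makes it more convenient than the bare index from find.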
     ※ The wildcard . (dot) matches any character except a newline.
     (In other contexts the wildcards * and ? stand for arbitrary characters. Regular expressions have a wildcard too: the dot.)

     (Analysis: in the first statement the pattern is a lone dot, so it matches the very first character, I, because the dot stands for any character except a newline. In the second statement the trailing C is left out of the pattern and a dot is written in its place, and the dot still matches it.)
     (Putting a backslash \ in front of the dot means it no longer stands for other characters; it stands only for itself. In a regular expression, a backslash strips a metacharacter of its special ability. Metacharacters are characters that carry a special meaning, such as the dot.)
     (Analysis: a backslash can also give an ordinary character a special ability. For example, to match a digit you can write \d, which matches any single digit.)
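
The three behaviours just described (the dot as a wildcard, the escaped dot, and \d) can be tried side by side. This is only a sketch; the sample string is an assumption of mine.

```python
import re

text = "I love example.com 123"  # hypothetical sample string

# A lone dot matches the first character, whatever it is
any_char = re.search(r".", text)

# A dot inside a longer pattern fills in for one unknown character
wild = re.search(r"lov.", text)

# An escaped dot matches only a real period
literal_dot = re.search(r"\.", text)

# \d matches a single digit 0-9
digit = re.search(r"\d", text)

# The dot does not match a newline by default
newline = re.search(r".", "\n")

print(any_char.group(), wild.group(), literal_dot.span(), digit.group(), newline)
```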

     ※ A first attempt at matching an IP address
     (Analysis: the match succeeds, but the pattern has problems. First, \d matches any digit 0-9, so \d\d\d matches numbers up to 999, while each part of an IP address may be at most 255. Second, the pattern requires every group to have exactly three digits, but in real IP addresses some groups have only one or two digits, and those addresses will not match.)
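
Both problems are easy to reproduce. The pattern below is my reconstruction of the naive attempt from the description (four groups of \d\d\d joined by escaped periods), and the test addresses are made up.

```python
import re

# Naive pattern: exactly three digits per group, any digit allowed
naive = r"\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d"

ok = re.search(naive, "192.168.111.123")        # a well-formed address matches
too_big = re.search(naive, "999.999.999.999")   # out-of-range groups match as well
too_short = re.search(naive, "192.168.1.1")     # one-digit groups do not match

print(ok.group())
print(too_big.group())
print(too_short)
```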

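     ※ To express a range of allowed characters, we can create a character class using square brackets []. A character class matches if any one of the characters inside it matches.
     (As you can see, regular expressions are case-sensitive by default.)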
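
Here is a short sketch of a character class and of the default case sensitivity. The sample string is an assumption. The re.IGNORECASE flag at the end is not part of the original walkthrough; it is a standard flag of the re module, shown only as one way to switch case sensitivity off.

```python
import re

text = "I love Python"  # hypothetical sample string

# Only lowercase vowels are in the class, so the capital I is skipped
lower_only = re.search(r"[aeiou]", text)

# Listing both cases lets the capital I match first
both_cases = re.search(r"[aeiouAEIOU]", text)

# Same lowercase class, but with case sensitivity turned off
ignore_case = re.search(r"[aeiou]", text, re.IGNORECASE)

print(lower_only.group(), lower_only.span())
print(both_cases.group(), both_cases.span())
print(ignore_case.group())
```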

     ※ Inside a character class, a hyphen '-' expresses a range
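
A few ranges in action, as a sketch on a sample string of my own choosing:

```python
import re

text = "I love Python 3"  # hypothetical sample string

first_lower = re.search(r"[a-z]", text)   # first lowercase letter
first_upper = re.search(r"[A-Z]", text)   # first uppercase letter
first_digit = re.search(r"[0-9]", text)   # first digit, same set as \d for ASCII digits

print(first_lower.group(), first_upper.group(), first_digit.group())
```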

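     ※ Use curly braces to limit how many times a match repeats
     (Analysis: the 3 inside the braces is the repetition count for the character just before it, here b. a and c have no count, so each appears once. The pattern therefore matches one a, three b's and one c. The second statement finds no match because b repeats more than 3 times.)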
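
The description suggests a pattern along the lines of ab{3}c. Assuming that, a minimal sketch looks like this (the test strings are mine):

```python
import re

pattern = r"ab{3}c"  # one a, exactly three b's, one c

exact = re.search(pattern, "abbbc")      # three b's: matches
too_many = re.search(pattern, "abbbbc")  # four b's: no match

print(exact.group())
print(too_many)
```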

     ※ The braces can also give a range for the repetition count
     (Analysis: {3,10} means the preceding character b may appear anywhere from 3 to 10 times and still match; a and c still appear once each.)
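
To check the bounds of {3,10}, this sketch builds strings with different numbers of b's and records which ones match. The pattern ab{3,10}c is assumed from the description.

```python
import re

pattern = r"ab{3,10}c"

# Map each count of b's to whether the pattern matches
results = {
    n: re.search(pattern, "a" + "b" * n + "c") is not None
    for n in (2, 3, 8, 10, 11)
}
print(results)
```

Counts 3 through 10 match, while 2 and 11 fall outside the range.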

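     ※ Using a regular expression to match 0~255
     (Analysis: both of these attempts are wrong and fail to match the number. A regular expression matches characters, and as characters the only digits are 0-9; 123, for example, is made of the three characters '1', '2' and '3'. The character class [0-255] therefore means 0-2 plus two 5s, so it matches any single one of the four digits 0, 1, 2, 5, which is why the result was just 1.)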
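
The sketch below shows what [0-255] really accepts. The second pattern, [0-2][0-5][0-5], is my guess at another tempting but wrong attempt; the original does not spell out what its second attempt was.

```python
import re

# [0-255] is a class of single characters: 0, 1, 2 and 5
accepted = [c for c in "0123456789" if re.fullmatch(r"[0-255]", c)]
print(accepted)

# Against "188" it matches just the leading '1'
one_char = re.search(r"[0-255]", "188")
print(one_char.group())

# Restricting each position to its maximum digit rejects valid numbers like 188
per_digit = re.search(r"[0-2][0-5][0-5]", "188")
print(per_digit)
```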
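     (Analysis: [01]\d\d|2[0-4]\d|25[0-5] should be read in three parts. The first part, [01]\d\d, matches a hundreds digit of 0 or 1, then any digit 0-9 in the tens place and likewise in the ones place, so it covers 000-199. The second part is joined to the first by the alternation operator |. In it the hundreds digit must be 2, the tens digit may be 0-4, and the ones digit is free, so it covers 200-249. The third part, also joined by |, fixes the hundreds digit at 2 and the tens digit at 5 and allows 0-5 in the ones place, so it covers 250-255. Together the three parts cover 0-255. There is still a problem, though: the number must be written with exactly 3 digits.)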
     (Analysis: 88 is not matched; only 088 is, because each position must appear exactly once by default. Allowing the hundreds and tens positions to repeat 0 to 1 times fixes this.)
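
Here is the three-digit pattern next to the relaxed one. The variable names strict and relaxed are only labels I chose for this sketch.

```python
import re

# Every number must have exactly three digits
strict = r"[01]\d\d|2[0-4]\d|25[0-5]"

# The hundreds and tens positions may appear 0 or 1 times
relaxed = r"[01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5]"

strict_188 = re.search(strict, "188")   # three digits: matches
strict_88 = re.search(strict, "88")     # two digits: no match
strict_088 = re.search(strict, "088")   # padded with a zero: matches

relaxed_88 = re.search(relaxed, "88")   # two digits now match
relaxed_5 = re.search(relaxed, "5")     # so does a single digit

print(strict_188.group(), strict_88, strict_088.group())
print(relaxed_88.group(), relaxed_5.group())
```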

     ※ Finally, we can match a complete IP address

>>> import re
>>> re.search(r"(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])","192.188.68.3")
<re.Match object; span=(0, 12), match='192.188.68.3'>

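     (Analysis: an IP address has four parts, as in 192.188.68.3, each in the range 0-255. In principle you would write the same sub-pattern four times, separated by \. and each wrapped in parentheses. But the first three parts share one shape, a number followed by a period: 192. then 188. then 68. So you can write that group once with {3} to repeat it three times, then write the number pattern once more for the fourth part, which gives the long expression above.)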
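
Coming back to the proxy-list idea from the start, here is a hedged sketch of pulling every IP address out of a chunk of page text. The helper find_ips, the names OCTET and IP_PATTERN, and the sample HTML are all hypothetical. The sketch also changes the pattern in two ways that are my own variation, not the original author's: the alternatives are reordered so that 25[0-5] is tried first, and \b word boundaries are added at both ends. The reason is that with re.search the article's pattern tries [01]{0,1}\d{0,1}\d first in the last group, so an address ending in 255 is cut short to 25.

```python
import re

# The article's pattern, kept for comparison
ORIGINAL = (r"(([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])\.){3}"
            r"([01]{0,1}\d{0,1}\d|2[0-4]\d|25[0-5])")

# Variation (an assumption of this sketch): longest alternatives first,
# non-capturing groups, and word boundaries around the whole address
OCTET = r"(?:25[0-5]|2[0-4]\d|[01]?\d?\d)"
IP_PATTERN = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def find_ips(text):
    """Return every substring of text that looks like an IPv4 address."""
    return [m.group() for m in IP_PATTERN.finditer(text)]


# Hypothetical table row, similar in shape to a proxy listing
sample = "<td>192.188.68.3</td><td>8080</td><td>10.0.0.255</td>"

truncated = re.search(ORIGINAL, "10.0.0.255").group()
found = find_ips(sample)
rejected = find_ips("256.1.1.1")

print(truncated)
print(found)
print(rejected)
```

Because the search is driven by what an IP address looks like rather than by where it sits in the page, the same function should work on pages with a different tag layout.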


Origin blog.csdn.net/w15977858408/article/details/104121258