Using Python's re module

Regular Expressions:

    A regular expression is itself a small, highly specialized programming language. In Python it is built in through the re module, so callers can use it directly to perform regular expression matching. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

    Regular expressions are used for matching strings. To use them in Python, import the re module.

# Validation in pure Python
while True:
    phone_number = input('please input your phone number : ')
    if len(phone_number) == 11 \
            and phone_number.isdigit()\
            and (phone_number.startswith('13') \
            or phone_number.startswith('14') \
            or phone_number.startswith('15') \
            or phone_number.startswith('18')):
        print('legitimate phone number')
    else:
        print('is not a valid phone number')
      
      
# Validation with a regular expression
import re
phone_number = input('please input your phone number : ')
if re.match(r'^(13|14|15|18)[0-9]{9}$', phone_number):
    print('legitimate phone number')
else:
    print('is not a valid phone number')
# Regular expressions are not unique to Python; they are available in many languages
# They are used to match particular characters within a large body of text
The difference between validating with and without a regular expression
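
The regular expression check above can also be wrapped in a small function, which makes it reusable and testable without calling input(). This is only a minimal sketch: the name is_valid_phone is made up for illustration, and re.fullmatch is used here in place of re.match with ^ and $.

```python
import re

# Same rule as above: 11 digits, starting with 13, 14, 15 or 18.
PHONE_PATTERN = re.compile(r'(13|14|15|18)[0-9]{9}')


def is_valid_phone(phone_number):
    """Return True if phone_number satisfies the rule above (hypothetical helper)."""
    # fullmatch succeeds only when the whole string matches the pattern
    return PHONE_PATTERN.fullmatch(phone_number) is not None


print(is_valid_phone('13812345678'))  # True
print(is_valid_phone('12812345678'))  # False: prefix 12 is not allowed
print(is_valid_phone('1381234567'))   # False: only 10 digits
```

Compiling the pattern once at module level avoids repeating the pattern string at every call site.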

 

 

    Online regular expression tester: http://tool.chinaz.com/regex/ (this tool has nothing to do with the re module; it is only used to test regular expressions)

    Use cases: web crawlers, data analysis ......

    Character sets:

        The characters that may appear at one position form a character set, written in a regular expression with []. (A character set matches exactly one character.)

        E.g.:

            The digits 0 through 9 can be abbreviated as [0-9] (to match a literal hyphen, escape it with a backslash)

            The lowercase letters a through z can be abbreviated as [a-z] (uppercase letters are written the same way, as [A-Z])

            ps: a range must run from small to large, because the underlying ASCII codes it is based on run from small to large.
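
The character set rules above can be tried out directly. The following is a small sketch with made-up input strings; it only uses re.findall, re.compile and re.error from the standard library.

```python
import re

# [0-9] matches exactly one digit at a time
print(re.findall(r'[0-9]', 'a1b22'))       # ['1', '2', '2']

# [a-z] and [A-Z] each match exactly one letter of that case
print(re.findall(r'[a-z]', 'Ab3cD'))       # ['b', 'c']
print(re.findall(r'[A-Z]', 'Ab3cD'))       # ['A', 'D']

# Escape the hyphen with a backslash to match it literally
print(re.findall(r'[0-9\-]', '010-12'))    # ['0', '1', '0', '-', '1', '2']

# A range written from large to small is rejected when the pattern is compiled
try:
    re.compile(r'[9-0]')
except re.error as exc:
    print('invalid range:', exc)
```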

 

     

 

      quantifier: 

              

    

    Greedy matching:

        When a pattern can be satisfied in more than one way, it matches the longest possible string. Matching is greedy by default.
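
A short sketch of the default greedy behavior, using made-up input strings:

```python
import re

text = '<a>link</a>'

# Greedy: .* takes as much as it can, so the match runs to the last '>'
greedy = re.search(r'<.*>', text).group()
print(greedy)   # <a>link</a>

# \d+ is greedy too: it consumes the whole run of digits
digits = re.search(r'\d+', 'id=12345;').group()
print(digits)   # 12345
```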

    

    Commonly used non-greedy quantifiers:

        *? : repeat any number of times, but as few times as possible

        +? : repeat one or more times, but as few times as possible

        ?? : repeat 0 or 1 time, but as few times as possible

        {n,m}? : repeat n to m times, but as few times as possible

        {n,}? : repeat n or more times, but as few times as possible
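
Each non-greedy quantifier in the list above can be compared against the same input. This is a minimal sketch; the string '12345' is just sample data.

```python
import re

s = '12345'

greedy_star = re.search(r'\d*', s).group()       # greedy: takes everything
lazy_star = re.search(r'\d*?', s).group()        # zero repetitions are enough
lazy_plus = re.search(r'\d+?', s).group()        # one repetition is the minimum
lazy_opt = re.search(r'\d??', s).group()         # prefers zero over one
lazy_range = re.search(r'\d{2,4}?', s).group()   # stops at the lower bound 2
lazy_open = re.search(r'\d{2,}?', s).group()     # stops at the lower bound 2

print(repr(greedy_star))  # '12345'
print(repr(lazy_star))    # ''
print(repr(lazy_plus))    # '1'
print(repr(lazy_opt))     # ''
print(repr(lazy_range))   # '12'
print(repr(lazy_open))    # '12'
```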

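    .*? usage:

        . matches any character

        * repeats it from 0 to infinitely many times

        ? makes the match non-greedy

        Together they mean: take any characters, but as few as possible. For that reason .*? is generally not written on its own; it is usually followed by the text at which the match should stop.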
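
A small sketch of .*? followed by a stopping point, next to its greedy counterpart. The HTML-like string is made up for illustration.

```python
import re

html = '<b>one</b><b>two</b>'

# Non-greedy: each match stops at the first '</b>' that lets the pattern succeed
lazy = re.findall(r'<b>(.*?)</b>', html)
print(lazy)     # ['one', 'two']

# Greedy, for comparison: the match runs to the last '</b>'
greedy = re.findall(r'<b>(.*)</b>', html)
print(greedy)   # ['one</b><b>two']
```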

 

 

 

Commonly used functions of the re module:

    

import re

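ret = re.findall('a', 'william john lisa')  # Return every match of the pattern, collected in a list
print(ret)  # Result: ['a', 'a']

ret = re.search('a', 'william john lisa').group()
print(ret)  # Result: 'a'
# search scans the string for the pattern, stops at the first match, and returns a match object.
# Call group() on that object to get the matched string. If nothing matches, search returns None.

ret = re.match('a', 'abc').group()  # Same as search, but matches only at the start of the string
print(ret)
# Result: 'a'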


# ---------------------------------------------------------------------------------


ret = re.split('[ab]', 'abcd')  # Split on 'a' to get '' and 'bcd', then split each of those on 'b'
print(ret)  # ['', '', 'cd']

ret = re.sub(r'\d', 'H', 'william1john2lisa3', count=1)  # Replace digits with 'H'; count=1 replaces only the first one
print(ret)  # williamHjohn2lisa3

ret = re.subn(r'\d', 'H', 'william1john2lisa3')  # Replace digits with 'H'; returns a tuple (new string, number of replacements)
print(ret)  # ('williamHjohnHlisaH', 3)

obj = re.compile(r'\d{3}')  # Compile the pattern into a regular expression object; it matches three digits
ret = obj.search('abc123eeee')  # Call search on the compiled object, passing the string to search
print(ret.group())  # Result: 123

import re
ret = re.finditer(r'\d', 'ds3sy4784a')  # finditer returns an iterator of match objects
print(ret)  # <callable_iterator object at 0x10195f940>
print(next(ret).group())  # First result
print(next(ret).group())  # Second result
print([i.group() for i in ret])  # The remaining results
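
Because search and match return None when nothing matches, calling .group() directly on their result raises AttributeError for non-matching input. One way to guard against that is sketched below; the helper name first_number is hypothetical.

```python
import re


def first_number(text):
    """Return the first run of digits in text, or None (hypothetical helper)."""
    m = re.search(r'\d+', text)
    if m is None:          # nothing matched, so there is no group() to call
        return None
    return m.group()


print(first_number('abc123def45'))  # 123
print(first_number('no digits'))    # None

# match only anchors at the start; fullmatch must cover the whole string
print(re.match(r'\d+', '123abc') is not None)      # True
print(re.fullmatch(r'\d+', '123abc') is not None)  # False
```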

 

 

    Note:

        1. Group priority in findall:

import re

ret = re.findall(r'www\.(baidu|taobao)\.com', 'www.taobao.com')
print(ret)  # ['taobao']
# findall returns the contents of the group in preference to the whole match.
# To get the whole match, make the group non-capturing with ?:

ret = re.findall(r'www\.(?:baidu|taobao)\.com', 'www.taobao.com')
print(ret)  # ['www.taobao.com']
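
When a pattern has more than one capturing group, findall returns a list of tuples, one tuple per match. If both the whole match and the group are needed, finditer is an alternative, since each match object exposes group(0) as well as the numbered groups. A minimal sketch with a made-up input string:

```python
import re

text = 'www.baidu.com www.taobao.com'

# Two capturing groups: findall returns one tuple per match
pairs = re.findall(r'(www)\.(baidu|taobao)\.com', text)
print(pairs)   # [('www', 'baidu'), ('www', 'taobao')]

# finditer yields match objects: group(0) is the whole match, group(1) the group
found = [(m.group(0), m.group(1))
         for m in re.finditer(r'www\.(baidu|taobao)\.com', text)]
print(found)   # [('www.baidu.com', 'baidu'), ('www.taobao.com', 'taobao')]
```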

 

    

        2. Group priority in split:

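ret = re.split(r"\d+", "william1john2lisa3")
print(ret)  # Result: ['william', 'john', 'lisa', '']

ret = re.split(r"(\d+)", "william1john2lisa3")
print(ret)  # Result: ['william', '1', 'john', '2', 'lisa', '3', '']

# Wrapping the pattern in () changes what split returns:
# without (), the matched separators are dropped; with (), they are kept in the result.
# This matters whenever the matched parts need to be preserved.
# The trailing '' appears because the string ends with a separator.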
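
One use of the kept separators is to pair each piece with the separator that followed it. The sketch below also shows the maxsplit argument of re.split; the variable names are chosen for illustration only.

```python
import re

s = 'william1john2lisa3'

parts = re.split(r'(\d+)', s)
print(parts)   # ['william', '1', 'john', '2', 'lisa', '3', '']

# Even indexes hold the pieces, odd indexes hold the kept separators;
# zip stops at the shorter sequence, which drops the trailing ''
pairs = list(zip(parts[0::2], parts[1::2]))
print(pairs)   # [('william', '1'), ('john', '2'), ('lisa', '3')]

# maxsplit limits how many splits are performed
head_tail = re.split(r'\d+', s, maxsplit=1)
print(head_tail)   # ['william', 'john2lisa3']
```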

 

       

 

 

Crawler exercise:

import requests
import re


# Fetch the page source
def get_html_content(url):
    return requests.get(url).text


# Parse the fetched source and extract the useful content
def parse_html(html_con):
    # Parse with a regular expression
    r = re.compile(r'<p class="name"><.*?>(?P<title>.*?)</a></p>' +
                   '.*?<p.*?>(?P<actor>.*?)</p>' +
                   '.*?<a href="(?P<url>.*?)" title=".*?>', re.S)
    obj = r.finditer(html_con)
    for i in obj:
        info = {
            'title': i.group('title'),
            'actor': i.group('actor').strip(),
            'movie_url': 'https://maoyan.com' + i.group('url')

        }
        yield info


def main(nums):
    url = 'https://maoyan.com/board/4?offset=%s' % nums

    get_html = get_html_content(url)
    for i in parse_html(get_html):
        print(i)


if __name__ == '__main__':
    for i in range(0, 101, 10):
        main(i)
This example crawls a movie ranking board.
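
The named groups in the pattern can be tried without any network access by running the same pattern over a hand-written string. The HTML below is invented purely for illustration and is not the real page markup; it only has the shape the pattern expects, so this sketch says nothing about whether the pattern fits the live site.

```python
import re

# Same pattern as in the exercise above
pattern = re.compile(r'<p class="name"><.*?>(?P<title>.*?)</a></p>'
                     r'.*?<p.*?>(?P<actor>.*?)</p>'
                     r'.*?<a href="(?P<url>.*?)" title=".*?>', re.S)

# Invented sample HTML (an assumption, NOT copied from the site)
sample_html = '''
<p class="name"><a>Movie One</a></p>
<p class="star">
    Actor A, Actor B
</p>
<a href="/films/1" title="Movie One">details</a>
<p class="name"><a>Movie Two</a></p>
<p class="star">
    Actor C
</p>
<a href="/films/2" title="Movie Two">details</a>
'''

results = [
    {
        'title': m.group('title'),
        'actor': m.group('actor').strip(),
        'movie_url': 'https://maoyan.com' + m.group('url'),
    }
    for m in pattern.finditer(sample_html)
]

for item in results:
    print(item)
```

The re.S flag lets . match newlines, which is what allows .*? to span the line breaks between tags.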

Origin www.cnblogs.com/tulintao/p/11203170.html