The re module

  

  8. Regular expressions

  1.1 The relationship between regular expressions and the re module

Relationship between regular expressions and the re module:
    Regular expressions are not unique to Python. They are an independent technology that any programming language can use, but to use them in Python you must go through the re module.
 Role: a regular expression is used to pick out specific content from a string.
 Syntax: [] denotes a character set. The characters inside it are alternatives (an "or" relationship), and the set matches exactly one of them.
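
As a minimal sketch of the character-set syntax, the snippet below uses made-up sample text and patterns (they are illustrative assumptions, not taken from the original post) to show that [...] consumes exactly one character from the set:

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "cat cot cut c9t"

# [aou] matches exactly one character: 'a', 'o', or 'u'.
vowel_hits = re.findall(r"c[aou]t", text)
print(vowel_hits)  # ['cat', 'cot', 'cut']

# A range such as [0-9] is also a single-character choice.
digit_hits = re.findall(r"c[0-9]t", text)
print(digit_hits)  # ['c9t']
```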

 

Example of a website that validates phone numbers: https://reg.jd.com/reg/person?ReturnUrl=https%3A//www.jd.com/

An online regular expression tester (unrelated to the re module): http://tool.chinaz.com/regex

  1.2 Metacharacters and quantifiers

  

                                                                                 
# Metacharacters: . matches any character except a newline; ^ matches at the start of the string and $ at the end. ^jason$ is an exact match: one character more or fewer and it fails.

# a|b alternation, e.g. abc|ab. Note: put the longer alternative first. With ab|abc, ab matches first and abc is never tried.

# Greedy and non-greedy matching: quantifiers are greedy by default (they match as much as possible). Adding ? after a quantifier makes it non-greedy (it matches as little as possible). On its own, ? is a quantifier meaning zero or one.

# Quantifiers must be placed directly after a metacharacter or character, and each one applies only to the element immediately before it.
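
The rules above can be sketched as follows. The sample strings (such as "<a><b>") are illustrative assumptions rather than examples from the original post:

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
exact = bool(re.search(r"^jason$", "jason"))
too_long = bool(re.search(r"^jason$", "jasons"))
print(exact, too_long)  # True False

# Alternation tries branches left to right, so the longer branch goes first.
short_first = re.search(r"ab|abc", "abc").group()
long_first = re.search(r"abc|ab", "abc").group()
print(short_first, long_first)  # ab abc

# Quantifiers are greedy by default; a trailing ? makes them non-greedy.
greedy = re.search(r"<.*>", "<a><b>").group()
lazy = re.search(r"<.*?>", "<a><b>").group()
print(greedy, lazy)  # <a><b> <a>
```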
                                                                                 

 

  9. The re module

  

  # Three methods you must master
      # 1. res = re.findall('expression', 'string to be matched')

# res = re.findall('a', 'eva egon jason')
# print(res)  # ['a', 'a']  findall returns every match, collected in a list
# res = re.findall('[a-z]*', 'eva egon jason')  # * means zero or more
# print(res)  # ['eva', '', 'egon', '', 'jason', '']
# res = re.findall('[a-z]+', 'eva egon jason')  # + means one or more
# print(res)  # ['eva', 'egon', 'jason']
                                                         

  1.2  re.search()

  # 2. res1 = re.search('expression', 'string to be matched')

res1 = re.search('a', 'eva egon jason tank')
res2 = re.search('l', 'eva egon jason tank')
print(res1, type(res1))  # <_sre.SRE_Match object; span=(2, 3), match='a'> <class '_sre.SRE_Match'>  (shown as re.Match on Python 3.7 and later)
print(res2)  # None
print(res1.group())  # a
 Summary: search scans the string from left to right and returns as soon as it finds the first match.
    search returns a match object; call group() on it to get the matched string. If the result is None, calling group() raises an error. Solution: check the result first and print only when there is a match:
if res1: print(res1.group())

 

  1.3 re.match()

  # 3. res3 = re.match('expression', 'string to be matched')
# res3 = re.match('a', 'eva egon jason tank')
# res4 = re.match('e', 'eva egon jason tank')
# print(res3.group())  # AttributeError: 'NoneType' object has no attribute 'group'
# if res4:
#     print(res4.group())  # e
# Summary:
# 1. match only tries the pattern at the very beginning of the string (a single character or a whole pattern)
# 2. It returns a match object; call group() to get the matched string
                                                               
   Summary: similarities and differences between search and match

  1. Similarities:
    # 1. The call syntax is the same;
    # 2. Both return a match object; call group() to get the matched string;
    # 3. If the result is None, calling group() raises an error.

  2. Differences:
    # 1. search: scans from left to right and returns as soon as the first match is found, without looking further.
    # 2. match: only tries at the beginning of the string, so the string must start with the pattern (a single character or a whole pattern); otherwise it returns None.
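
A short runnable comparison of the two functions, reusing the sample string from this post. The variable names are only illustrative:

```python
import re

text = "eva egon jason tank"

# search scans the whole string and stops at the first hit.
found = re.search(r"egon", text)
# match only tries at position 0, so the same pattern fails here.
at_start = re.match(r"egon", text)

print(found.group() if found else None)        # egon
print(at_start.group() if at_start else None)  # None

# For a single-line string, match behaves like search with ^ in front.
anchored = re.search(r"^eva", text)
same = re.match(r"eva", text).group() == anchored.group()
print(same)  # True
```

Guarding with a conditional before calling group() avoids the AttributeError raised when the result is None.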
                                                                                       

  9.2 Other methods of the re module

  # 1 re.split()
# res2 = re.split('[ab]', 'abcd')  # splits on 'a' first, then splits the pieces on 'b'
# print(res2)  # ['', '', 'cd']

  

 # 2 re.sub()
# res2 = re.sub(r'\d', 'HA', 'eav3egon4jason56')
# print(res2)  # eavHAegonHAjasonHAHA
# Arguments: the first is the pattern, the second is the replacement, the third is the string to search
 # 3 re.subn()
# res3 = re.subn(r'\d', 'HA', 'eav3egon4jason56')
# print(res3)  # ('eavHAegonHAjasonHAHA', 4)  returns a tuple of the new string and the number of replacements
# res3 = re.subn(r'\d', 'HA', 'eav3egon4jason56', 3)
# print(res3)  # ('eavHAegonHAjasonHA6', 3)  the optional fourth argument limits how many matches are replaced
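
For reference, a runnable version of the sub and subn calls. Passing the limit as the keyword count=3 instead of positionally is a choice made here, not something the original post does:

```python
import re

text = "eav3egon4jason56"

# sub returns only the new string; every digit is replaced here.
replaced_all = re.sub(r"\d", "HA", text)

# subn returns (new_string, number_of_replacements).
# count=3 caps the number of replacements at three.
replaced_some, n = re.subn(r"\d", "HA", text, count=3)

print(replaced_all)      # eavHAegonHAjasonHAHA
print(replaced_some, n)  # eavHAegonHAjasonHA6 3
```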
                                                                               

 

 # 4 re.compile()

# obj = re.compile(r'\d{3}')  # compile the pattern into a regular expression object; this rule matches 3 digits
# print(obj)
# res4 = obj.findall('eee6a6a6aw7ww123q456q9p')
# print(res4)  # ['123', '456']  findall collects every run of three digits found by obj's rule

# res5 = obj.search('ee666p999ewrewf3p218')
# print(res5.group())  # 666  stops at the first match and does not look further
#
# res6 = obj.match('123dsf5')
# print(res6.group())  # 123  if the string did not start with 3 digits, match would return None and group() would raise an error
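
A compiled pattern can be reused across calls, as in this small sketch. The name three_digits and the input "ab123" are assumptions added for illustration:

```python
import re

# Compile once, then reuse the pattern object's methods.
three_digits = re.compile(r"\d{3}")

all_hits = three_digits.findall("eee6a6a6aw7ww123q456q9p")
first_hit = three_digits.search("ee666p999ewrewf3p218")
# match returns None when the string does not start with three digits.
no_hit = three_digits.match("ab123")

print(all_hits)           # ['123', '456']
print(first_hit.group())  # 666
print(no_hit)             # None
```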
#                                                  
                                                   
  9.3 Extension
# res7 = re.search(r'(^[1-9])(\d{16})([0-9x])$', '452402199312233318')
# print(res7.group())   # 452402199312233318  the whole match
# print(res7.group(1))  # 4
# print(res7.group(2))  # 5240219931223331
# print(res7.group(3))  # 8
 # Supplement: named groups

# res8 = re.match(r'(^[1-9])(\d{14})(\d{2}[0-9x])?$', '452402199311163243')
# print(res8.group())
# print(res8.group(1))
# print(res8.group(2))
# print(res8.group(3))
 # Named group syntax: (?P<username>...)
ret4 = re.search(r'(^[1-9])(?P<username>\d{14})(?P<user_pwd>\d{2}[0-9x])?$', '452402199311163243')
print(ret4.group('username'))

print(ret4.group('user_pwd'))
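
As a sketch, the same named-group pattern can be compiled once and all named groups read together with groupdict(). The names id_pattern and m are illustrative:

```python
import re

# Same pattern as above, written as a raw string.
id_pattern = re.compile(r"(^[1-9])(?P<username>\d{14})(?P<user_pwd>\d{2}[0-9x])?$")

m = id_pattern.search("452402199311163243")
if m:  # guard against None before touching the match object
    print(m.group("username"))  # 52402199311163
    print(m.group("user_pwd"))  # 243
    print(m.groupdict())        # {'username': '52402199311163', 'user_pwd': '243'}
```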
# Grouping
# ret = re.search('www.(baidu|oldboy).com', 'www.oldboy.com')
# ret1 = re.findall('www.(baidu|oldboy).com', 'www.oldboy.com')
# print(ret.group())  # www.oldboy.com  search ignores the grouping and returns the whole match
# print(ret1)  # ['oldboy']
# findall has no group() method, so by default it gives priority to the group and returns the group's content

# Ungrouping: put ?: at the start of the parentheses, as in (?:...)
# ret2 = re.findall('www.(?:baidu|oldboy).com', 'www.oldboy.com')
# print(ret2)  # ['www.oldboy.com']  with the group cancelled, findall returns the whole match
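
A runnable side-by-side of capturing and non-capturing groups. One deliberate difference from the snippets in this post: the dots are escaped as \. so they match only a literal dot.

```python
import re

url = "www.oldboy.com"

# With a capturing group, findall returns only the group's text.
captured = re.findall(r"www\.(baidu|oldboy)\.com", url)
# With (?:...), the group is non-capturing and findall returns the whole match.
whole = re.findall(r"www\.(?:baidu|oldboy)\.com", url)
# search().group() always returns the whole match, regardless of groups.
via_search = re.search(r"www\.(baidu|oldboy)\.com", url)

print(captured)             # ['oldboy']
print(whole)                # ['www.oldboy.com']
print(via_search.group())   # www.oldboy.com
print(via_search.group(1))  # oldboy
```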

   # Web crawler (homework)

import re
import json
from urllib.request import urlopen


"""
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
https://movie.douban.com/top250?start=75&filter=


<li>
            <div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title"> / The Shawshank Redemption</span>
                                <span class="other"> / 月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            Director: Frank Darabont&nbsp;&nbsp;&nbsp;Starring: Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama
                        </p>
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.6</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1489907人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>
"""


def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


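def main(num):
    # Build the URL for the page that starts at offset num.
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    print(ret)  # ret is a generator; items are produced in the loop below
    # Append one record per line; the with block closes the file afterwards.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = str(obj)
            f.write(data + "\n")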

count = 0
for i in range(10):
    main(count)
    count += 25
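
Before running the crawler against the live site, the pattern can be checked offline. The sketch below is an assumption-laden test harness rather than part of the original homework: sample_html is a trimmed copy of the markup quoted in the docstring, and item_pattern and rows are illustrative names.

```python
import re

# Trimmed copy of the sample markup quoted in the docstring above.
sample_html = '''
<div class="item">
    <div class="pic">
        <em class="">1</em>
    </div>
    <div class="info">
        <span class="title">肖申克的救赎</span>
        <span class="rating_num" property="v:average">9.6</span>
        <span>1489907人评价</span>
    </div>
</div>
'''

item_pattern = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
    r'<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,  # let . match newlines, since each item spans many lines
)

# One dict per matched item, keyed by the named groups.
rows = [m.groupdict() for m in item_pattern.finditer(sample_html)]
print(rows)
```

Note that comment_num captures everything before 评价, so the value is '1489907人' rather than a bare number.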

 



 



 


Origin www.cnblogs.com/mofujin/p/11203519.html