Example 25-2: a regex crawler

First, import the libraries:

import re
from urllib.request import urlopen     # built-in module, used to fetch a page's source as a string

Use urlopen to get the page's source code as a string:

res = urlopen('https://www.cnblogs.com/zhuangdd/p/12644081.html')
print(res.read().decode('utf-8'))

Output (truncated):
<!DOCTYPE html>
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="referrer" content="origin"/> 
    <meta property="og:description" content="Contents: a helpful learning tool http://tool.chinaz.com/regex/ Character sets: [] lists what may appear at one character position; [1bc] is a range; [0-9] [A-Z] [a-z] matches three characters; [abc0-9]" />
    <meta http-equiv="Cache-Control" content="no-transform" />
    <meta http-equiv="Cache-Control" content="no-siteapp" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <title>25-1 Regex re module (find
... (remaining output omitted)
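
The fetch-and-decode step shown at the start of this example can be wrapped in a small helper. This is only a sketch: urlopen also accepts a Request object (which lets you set headers such as User-Agent) and a timeout argument. The helper names build_request and fetch, the header value, and the timeout are assumptions for illustration and are not part of the original example.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request with a custom User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch url and return the page source as a str (makes a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (needs network access):
# html = fetch("https://www.cnblogs.com/zhuangdd/p/12644081.html")
# print(html[:200])
```

Some sites reject requests that carry the default urllib User-Agent, which is one reason to build the Request explicitly.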

 

The flags argument accepts several optional values (the full name is given in parentheses):

re.I (IGNORECASE): ignore case when matching
re.M (MULTILINE): multi-line mode; changes the behavior of ^ and $
re.S (DOTALL): the dot matches any character, including a newline
re.L (LOCALE): makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale; not recommended
re.U (UNICODE): makes \w, \W, \s, \S, \d, \D follow the Unicode character properties; this is the default for str patterns in Python 3
re.X (VERBOSE): verbose mode; the pattern string can span several lines, whitespace is ignored, and comments can be added
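
A minimal sketch of what some of these flags do in practice. The sample strings and variable names are made up for illustration.

```python
import re

text = "Hello\nhello world"

# re.I: match regardless of case
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
first_line_only = re.findall(r"^hello", text, re.I)
every_line = re.findall(r"^hello", text, re.I | re.M)

# re.S: the dot also matches the newline between the two lines
without_dotall = re.search(r"Hello.hello", text)
with_dotall = re.search(r"Hello.hello", text, re.S)

# re.X: whitespace in the pattern is ignored and comments are allowed
date_pattern = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
""", re.X)
date_match = date_pattern.search("posted 2020-04")

print(ignore_case)                   # ['Hello', 'hello']
print(first_line_only, every_line)   # ['Hello'] ['Hello', 'hello']
print(without_dotall, with_dotall is not None)   # None True
print(date_match.group("year"), date_match.group("month"))   # 2020 04
```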
import re
from urllib.request import urlopen

def getPage(url):
    # fetch the page and return its source as a str
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):   # s: page source string
    ret = com.finditer(s)
    for i in ret:
        # yield one dict per matched movie entry
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num")
        }

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num   # num is the start offset, e.g. 0
    response_html = getPage(url)    # response_html is the page source as a str
    ret = parsePage(response_html)  # generator
    print(ret)
    f = open("move_info7", "a", encoding="utf8")
    for obj in ret:
        print(obj)
        data = str(obj)
        f.write(data + "\n")
    f.close()

# raw strings keep \d from being read as a string escape
com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
count = 0
for i in range(10):   # 10 pages of 25 entries each
    main(count)  # start offsets 0, 25, ..., 225
    count += 25
Douban Top 250 code
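
To see how the named groups and finditer work together without touching the network, the same pattern can be run against a small inline snippet. This is a sketch under stated assumptions: the HTML below is a made-up sample that only imitates the structure the pattern expects, not real page content, and the names sample_html and records are hypothetical.

```python
import re

# Same pattern as in the crawler, written with raw strings.
com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

# Made-up sample imitating the expected markup (assumption, not real data).
sample_html = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Sample Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Sample Movie B</span>
  <span class="rating_num" property="v:average">9.5</span>
  <span>678人评价</span>
</div>
"""

# groupdict() returns all named groups of one match as a dict.
records = [m.groupdict() for m in com.finditer(sample_html)]
for record in records:
    print(record)
```

Note that comment_num captures everything between <span> and 评价, so in this sample it includes the trailing 人 character.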

 


Origin www.cnblogs.com/zhuangdd/p/12644200.html