First, import the libraries:

import re
from urllib.request import urlopen  # built-in package for fetching a page's source as a string

Use urlopen to get the page's source code as a string:

res = urlopen('https://www.cnblogs.com/zhuangdd/p/12644081.html')
print(res.read().decode('utf-8'))
------------------------------
<!DOCTYPE html>
<html lang="zh-cn">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="referrer" content="origin"/>
<meta property="og:description" content="Learning tool: http://tool.chinaz.com/regex/ Character set []: the characters that may appear at one position [1bc] is a range [0-9] [A-Z] [a-z] matches three characters [abc0-9]" />
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>25-1 Regular expressions: re module (find ,,,,,,,, etc.
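
The call above relies on urlopen's defaults. Below is a minimal sketch of a slightly more defensive fetch, assuming that a site may reject Python's default User-Agent and that the page may declare its own charset. The helper names build_request and fetch_text and the header value are hypothetical, not part of the original post.

```python
from urllib.request import Request, urlopen

# Hypothetical header value; any browser-like string would do.
USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def build_request(url):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Download a page and decode it to str.

    Uses the charset declared in the Content-Type response header when
    present, otherwise falls back to default_charset.
    """
    with urlopen(build_request(url), timeout=timeout) as res:
        charset = res.headers.get_content_charset() or default_charset
        return res.read().decode(charset)
```

Calling fetch_text('https://www.cnblogs.com/zhuangdd/p/12644081.html') would then return the decoded page source, and the with block closes the connection once the body has been read.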

The flags argument accepts many optional values (the full name is given in parentheses):

re.I (IGNORECASE): ignore case.
re.M (MULTILINE): multi-line mode; changes the behavior of ^ and $ so that they also match at the start and end of each line.
re.S (DOTALL): the dot matches any character, including a newline.
re.L (LOCALE): makes \w, \W, \b, \B, \s and \S depend on the current locale; not recommended.
re.U (UNICODE): makes \w, \W, \s, \S, \d and \D follow Unicode character properties; this is the default in Python 3.
re.X (VERBOSE): verbose mode; the pattern string can span multiple lines, whitespace in it is ignored, and comments can be added.
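
To make the flags concrete, here is a small sketch that applies four of them to a two-line string. The sample text and the variable names are made up for illustration.

```python
import re

text = "Hello\nworld"

# re.I (IGNORECASE): "hello" matches "Hello"
ignore_case = re.findall(r"hello", text, re.I)

# re.M (MULTILINE): ^ also matches right after each newline
line_starts = re.findall(r"^\w+", text, re.M)

# Without re.S the dot stops at the newline; with re.S (DOTALL) it crosses it
without_dotall = re.search(r"Hello.world", text)
with_dotall = re.search(r"Hello.world", text, re.S)

# re.X (VERBOSE): whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
""", re.X)
date_match = date_pattern.search("posted 2020-04")

# Flags are combined with |
combined = re.findall(r"^hello.world$", text, re.I | re.S)

print(ignore_case, line_starts, without_dotall, with_dotall.group())
print(date_match.groupdict(), combined)
```

Here without_dotall is None because the plain dot does not match the newline between the two words, while the re.S version matches the whole string.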

# Relies on re and urlopen, imported at the top.

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):  # s: page source
    for i in com.finditer(s):
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num  # num starts at 0
    response_html = getPage(url)  # response_html is this page's source, a str
    ret = parsePage(response_html)  # a generator
    print(ret)
    # Append one line per movie; the with block closes the file afterwards
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = str(obj)
            f.write(data + "\n")

# Raw strings keep \d from being treated as a string escape
com = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S)

count = 0
for i in range(10):
    main(count)  # count = 0 on the first call
    count += 25
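
The same finditer plus named-group technique can be tried without any network access. The sketch below runs a simplified pattern over made-up HTML; SAMPLE_HTML, ITEM_RE and parse_items are hypothetical names, and the markup is only shaped like a list page, not copied from Douban.

```python
import re

# Made-up HTML shaped like a simple list page; not real Douban markup or data.
SAMPLE_HTML = """
<div class="item">
  <em class="">1</em>
  <span class="title">Example Movie A</span>
  <span class="rating_num">9.0</span>
</div>
<div class="item">
  <em class="">2</em>
  <span class="title">Example Movie B</span>
  <span class="rating_num">8.5</span>
</div>
"""

ITEM_RE = re.compile(
    r'<div class="item">.*?<em .*?>(?P<id>\d+)</em>'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num">(?P<rating_num>.*?)</span>',
    re.S,  # let .*? run across line breaks
)


def parse_items(html):
    """Yield one dict per match, built lazily from the named groups."""
    for match in ITEM_RE.finditer(html):
        yield match.groupdict()


items = list(parse_items(SAMPLE_HTML))
for item in items:
    print(item)
```

match.groupdict() returns all named groups at once, which saves writing one group() call per field. Because parse_items is a generator, nothing is matched until the caller starts iterating.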