Matching newlines - Python crawler

While learning to write crawlers, I once ran into content where the text I wanted to match contained a line break, so the regular expression did not match. I found a solution online at the time but did not bother to write it down. Today I ran into the same situation again, so I am recording it here.

 

The problem came up while crawling the blog posts listed on the CSDN home page.

Reading the page source, I found that matching on <a href="...." returns many other URLs, not only the blog posts I need. Matching from <div class="title"> instead runs into line breaks in the HTML, and I did not know how to match across a line break.

The re.compile() function accepts a flags argument. Passing re.DOTALL makes the dot (.) in a regular expression match any character, including a newline.

pat = ' <div class="title">.*?<h2>.*?<a href="(.*?)" target="_blank"'  # with re.DOTALL, "." matches any character, including a newline
rst1 = re.compile(pat, re.DOTALL).findall(data)
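
To see the effect of the flag in isolation, here is a minimal sketch that runs the same kind of pattern against a small piece of markup, once without re.DOTALL and once with it. The sample HTML and its example.com URL are made up for illustration, and the pattern is the one above without the leading space, because the sample starts directly at the div tag.

```python
import re

# Hypothetical sample markup: the tags sit on separate lines,
# like the page source described above.
sample = '''<div class="title">
    <h2>
        <a href="https://example.com/post/1" target="_blank">First post</a>
    </h2>
</div>'''

pat = '<div class="title">.*?<h2>.*?<a href="(.*?)" target="_blank"'

# Without the flag, "." stops at every newline, so nothing matches.
without_flag = re.compile(pat).findall(sample)

# With re.DOTALL, "." also matches the newlines between the tags.
with_flag = re.compile(pat, re.DOTALL).findall(sample)

print(without_flag)  # []
print(with_flag)     # ['https://example.com/post/1']
```

The same behavior can be switched on inside the pattern itself by prefixing it with the inline flag (?s).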

 

 

 

import urllib.request
import re
url = "http://www.csdn.net/"
data = urllib.request.urlopen(url).read().decode("utf-8")
print(len(data))
pat = ' <div class="title">.*?<h2>.*?<a href="(.*?)" target="_blank"'
rst1 = re.compile(pat, re.DOTALL).findall(data)  # re.DOTALL lets "." match across line breaks
print(len(rst1))
for i in range(0, len(rst1)):
    print(rst1[i])
    data = urllib.request.urlopen(rst1[i]).read().decode("utf-8", "ignore")
    # Save each post as urllib1.html, urllib2.html, ...; the target folder must already exist
    urllib.request.urlretrieve(rst1[i], "D:\\studyPython\\Python crawler learning\\learning blog\\urllib" + str(i + 1) + ".html")
    print("Crawled blog post", i + 1, "successfully")
print("Finished crawling all home page blog posts")
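
One possible variation on the script above, as a sketch rather than the original method: the extraction and the download loop are split into two functions, the output folder is created if it is missing, and a link that fails to download is skipped instead of stopping the whole run. The names extract_links and download_all and the retrieve parameter are assumptions introduced here for illustration.

```python
import re
import urllib.request
from pathlib import Path

# Same idea as the pattern above, compiled once with re.DOTALL.
PAT = re.compile('<div class="title">.*?<h2>.*?<a href="(.*?)" target="_blank"',
                 re.DOTALL)


def extract_links(html):
    """Return the post URLs captured by the DOTALL pattern."""
    return PAT.findall(html)


def download_all(links, out_dir, retrieve=urllib.request.urlretrieve):
    """Save each link as urllib<N>.html in out_dir and return the number saved.

    `retrieve` is a hypothetical hook so the loop can be tried offline;
    by default it is urllib.request.urlretrieve.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved = 0
    for i, link in enumerate(links, start=1):
        target = out_dir / f"urllib{i}.html"
        try:
            retrieve(link, str(target))
        except (OSError, ValueError) as err:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs such as relative links.
            print("skipping", link, "-", err)
            continue
        saved += 1
    return saved
```

With network access, something like download_all(extract_links(data), "blogs") would save the matched posts into a folder named blogs, where the folder name is only an example.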

 

At this point the crawl completes successfully.

 


Origin www.cnblogs.com/dong973711/p/11923953.html