Prerequisites: basic Python syntax and regular expressions
Development environment: Windows, Eclipse + PyDev
Target URL: www.doupoxs.com/doupocangqiong/
import re
import time

import requests

# Send a browser-like User-Agent header to make the crawler more stable
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

# Create the txt file if needed and open it for appending
# (raw string keeps the backslashes literal; UTF-8 matches the decoded page text)
f = open(r'D:\Pyproject\doupo\doupo.txt', 'a+', encoding='utf-8')


def get_info(url):
    # Crawl the text of a single page
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # Status code 200 means the request succeeded; any other code is skipped
        # Decode the response as UTF-8, then capture the text inside each <p> tag
        contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')  # write each matched paragraph to the txt file
    else:
        pass


if __name__ == '__main__':
    # All pages to crawl
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(str(i)) for i in range(2, 1665)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause one second between requests
    f.close()  # close the file
The results show:
For how to obtain the request headers, see my other post; I won't repeat it here: https://www.cnblogs.com/junecode/p/11306266.html
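The regular-expression step of the crawler can be tried offline, without sending any requests. The following is only a minimal sketch: the sample HTML string and the helper name extract_paragraphs are made up for illustration and are not taken from the site or from the script above. It shows why the pattern uses a non-greedy group and the re.S flag.

```python
import re

# Hypothetical sample HTML, standing in for a downloaded chapter page.
SAMPLE_HTML = """<div>
<p>First paragraph.</p>
<p>Second paragraph,
spanning two lines.</p>
</div>"""

# (.*?) is non-greedy, so each match stops at the nearest </p>.
# re.S lets "." match newlines, so multi-line paragraphs are captured too.
PARAGRAPH_RE = re.compile(r'<p>(.*?)</p>', re.S)


def extract_paragraphs(html):
    """Return the text found inside each <p>...</p> pair (illustrative helper)."""
    return PARAGRAPH_RE.findall(html)


paragraphs = extract_paragraphs(SAMPLE_HTML)
for text in paragraphs:
    print(text)
```

Without re.S the second paragraph would not match, because "." stops at a line break by default. With a greedy (.*) the pattern would swallow everything from the first <p> to the last </p> as a single match.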