Crawling data from the "Ancient Poetry and Prose" website (gushiwen.org) with requests and re (regular expressions).
The full code is as follows:
#!/usr/bin/env python
# Author: Simple-Sir
# Time: 2019/7/31 22:01
# Crawl page data from the ancient poetry and prose site
import re
import requests


def getHtml(page):
    '''
    Fetch web page data.
    :param page: page number
    :return: html page data (text format)
    '''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    url = 'https://www.gushiwen.org/default_{}.aspx'.format(page)
    respons = requests.get(url, headers=headers)
    html = respons.text
    return html


def getText(html):
    # re.DOTALL makes . match every character, including \n (by default . does not match \n)
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)  # get the titles
    caodai = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', html, re.DOTALL)  # get the dynasties
    author = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', html, re.DOTALL)  # get the authors
    contents = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.DOTALL)  # get the content, tags included
    con_texts = []  # content with the tags stripped out
    for i in contents:
        rsub = re.sub('<.*?>', '', i)
        con_texts.append(rsub.strip())  # strip surrounding whitespace
    si = []
    for v in zip(titles, caodai, author, con_texts):
        bt, cd, zz, nr = v
        s = {
            'title': bt,
            'dynasty': cd,
            'author': zz,
            'content': nr
        }
        si.append(s)
    return si


def main():
    p = int(input('How many pages of data do you want to fetch?\n'))
    for page in range(1, 1 + p):
        print('Page {} data:'.format(page))
        html = getHtml(page)
        text = getText(html)
        for i in text:
            print(i)


if __name__ == '__main__':
    main()
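The extraction step above hinges on re.DOTALL, which lets `.` match newlines so the non-greedy patterns can span multi-line HTML, and on `re.sub('<.*?>', '', ...)` to strip tags from the captured content. Here is a minimal offline sketch of those same patterns; the HTML fragment below is a hypothetical sample mimicking the page structure, not real data fetched from gushiwen.org:

```python
import re

# Hypothetical HTML fragment shaped like the structure the crawler parses
sample = '''
<div class="cont">
  <b>Example Title</b>
  <p class="source"><a href="#">Tang</a> <a href="#">Li Bai</a></p>
  <div class="contson" id="c1">
    Line one,<br/>line two.
  </div>
</div>
'''

# Same patterns as the crawler; re.DOTALL lets .*? cross line breaks
titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', sample, re.DOTALL)
dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', sample, re.DOTALL)
authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', sample, re.DOTALL)
# Capture the content div, then strip any remaining tags such as <br/>
contents = [re.sub(r'<.*?>', '', c).strip()
            for c in re.findall(r'<div class="contson".*?>(.*?)</div>', sample, re.DOTALL)]

print(titles)     # ['Example Title']
print(dynasties)  # ['Tang']
print(authors)    # ['Li Bai']
print(contents)   # ['Line one,line two.']
```

Note that this regex approach is brittle: a parser such as BeautifulSoup would survive markup changes better, but for a fixed page layout the patterns above are enough.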
The result of running it: