Getting started with Python web crawlers: scraping the "Ancient Poetry" website with regular expressions

This post scrapes data from the "Ancient Poetry and Prose" website (gushiwen.org) using requests and re (regular expressions).

The full code is as follows:

#!/usr/bin/env python
# Author: Simple-Sir
# Time: 2019/7/31 22:01
# Crawl data from the Ancient Poetry and Prose website
import re
import requests

def getHtml(page):
    '''
    Fetch the web page data.
    :param page: page number
    :return: HTML of the page (as text)
    '''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    url = 'https://www.gushiwen.org/default_{}.aspx'.format(page)  # URL of the page to fetch
    respons = requests.get(url, headers=headers)
    html = respons.text
    return html

def getText(html):
    # Titles. re.DOTALL makes "." match every character, including \n (by default "." does not match \n)
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)
    caodai = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', html, re.DOTALL)  # dynasty (first link)
    author = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', html, re.DOTALL)  # author (second link)
    contents = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.DOTALL)  # content, still containing tags
    con_texts = []  # content with the tags removed
    for i in contents:
        rsub = re.sub('<.*?>', '', i)
        con_texts.append(rsub.strip())  # strip removes surrounding whitespace
    si = []
    for v in zip(titles, caodai, author, con_texts):
        bt, cd, zz, nr = v
        s = {
            'title': bt,
            'dynasty': cd,
            'author': zz,
            'content': nr
        }
        si.append(s)
    return si

def main():
    p = int(input('How many pages of data do you want to fetch?\n'))
    for page in range(1, 1 + p):
        print('Data on page {}:'.format(page))
        html = getHtml(page)
        text = getText(html)
        for i in text:
            print(i)

if __name__ == '__main__':
    main()
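
The note about re.DOTALL in getText is easier to see on a small string. The following is a minimal sketch, assuming a made-up HTML fragment (the `sample` string is invented for illustration and is not real output from the site). It compares the same title pattern without re.DOTALL, with re.DOTALL, and with a greedy `.*` in place of the lazy `.*?`.

```python
import re

# Hypothetical HTML fragment, invented for illustration (not real site output).
sample = (
    '<div class="cont">\n<p><b>Title A</b></p>\n</div>\n'
    '<div class="cont">\n<p><b>Title B</b></p>\n</div>'
)

pattern = r'<div class="cont">.*?<b>(.*?)</b>'

# Without re.DOTALL, "." does not match "\n", so the pattern cannot cross lines.
without_dotall = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n"; the lazy ".*?" keeps each match short.
with_dotall = re.findall(pattern, sample, re.DOTALL)

# A greedy ".*" with re.DOTALL runs on to the last <b>, swallowing earlier titles.
greedy = re.findall(r'<div class="cont">.*<b>(.*)</b>', sample, re.DOTALL)

print(without_dotall)
print(with_dotall)
print(greedy)
```

On this fragment only the lazy pattern with re.DOTALL returns both titles, which is why every pattern in getText combines `.*?` with re.DOTALL.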

Result:

[Figure: crawling "ancient poetry" web data]

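Because the output depends on the live site, the parsing step can also be tried offline. The following is a minimal sketch, assuming a made-up HTML fragment (`SAMPLE_HTML`) and a hypothetical helper `parse_poems` that mirrors the regular expressions of getText. The real markup on gushiwen.org may differ and may have changed since the post was written.

```python
import re

# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = '''
<div class="cont">
<p><a href="#"><b>Poem One</b></a></p>
<p class="source"><a href="#">Tang</a><span>:</span><a href="#">Author One</a></p>
<div class="contson" id="c1">
First line,<br />second line.
</div>
</div>
<div class="cont">
<p><a href="#"><b>Poem Two</b></a></p>
<p class="source"><a href="#">Song</a><span>:</span><a href="#">Author Two</a></p>
<div class="contson" id="c2">
<p>Only line.</p>
</div>
</div>
'''


def parse_poems(html):
    """Hypothetical helper: the same regex approach as getText, on any HTML string."""
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', html, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', html, re.DOTALL)
    contents = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.DOTALL)
    # Remove leftover tags such as <br /> and <p>, then trim surrounding whitespace.
    texts = [re.sub(r'<.*?>', '', c).strip() for c in contents]
    keys = ('title', 'dynasty', 'author', 'content')
    # zip stops at the shortest list, so a missing field silently drops records.
    return [dict(zip(keys, row)) for row in zip(titles, dynasties, authors, texts)]


poems = parse_poems(SAMPLE_HTML)
for poem in poems:
    print(poem)
```

Each printed record is a dictionary with the four keys title, dynasty, author and content. Since zip truncates to the shortest list, a page where one field fails to match would yield fewer records instead of raising an error, which is worth checking when the site layout changes.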

Origin www.cnblogs.com/simple-li/p/11284096.html