Study notes (web crawlers): crawling every poem on a classical poetry website and saving it to a local file

1. Target site

Target site: https://so.gushiwen.org/shiwen/default.aspx?

2. Purpose of the crawler

Crawl the text on the target site (each poem's content, author, and dynasty) and save it to a local file.

 

3. Crawler code

# -*- coding: utf-8 -*-
# Crawl a classical Chinese poetry site
import requests
import re

# Append data to the output file
def write_data(data):
    with open('poetry.txt', 'a') as f:
        f.write(data)

for i in range(1, 10):
    # Target URL
    url = "https://so.gushiwen.org/shiwen/default.aspx?page={}".format(i)
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    html = requests.get(url, headers=headers).content.decode('utf-8')
    # print(html)
    # Extract the title
    p_title = '<p><a style="font-size:18px; line-height:22px; height:22px;" href=".*?" target="_blank"><b>(.*?)</b></a></p>'
    title = re.findall(p_title, html)
    # Extract the content
    p_context = '<div class="contson" id=".*?">(.*?)</div>'
    context = re.findall(p_context, html, re.S)
    # Extract the dynasty
    p_years = '<p class="source"><a href=".*?">(.*?)</a>'
    years = re.findall(p_years, html, re.S)
    # Extract the author
    p_author = '<p class="source"><a href=".*?">.*?</a><span>:</span><.*?>(.*?)</a>'
    author = re.findall(p_author, html)
    # print(context)
    # print(title)
    # print(years)
    # print(author)
    for j in range(len(title)):
        # Strip the HTML tags left inside the poem body
        context[j] = re.sub('<.*?>', '', context[j])
        # Without this line, writing fails with:
        # 'gbk' codec can't encode character '\u4729'
        context[j] = re.sub(r'\u4729', '', context[j])
        # print(title[j])
        # print(years[j])
        # print(author[j])
        # print(context[j])
        # Write the data
        write_data(title[j])
        write_data('\n-' + years[j])
        write_data(':' + author[j])
        write_data(context[j])
    print('Successfully downloaded page {}'.format(str(i)))
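
The four patterns in the script can also be checked offline, without requesting the site. The following is only a sketch under stated assumptions: parse_poems and SAMPLE are hypothetical names that are not part of the original script, and the HTML fragment is made up to fit the patterns rather than copied from the site.

```python
import re

# The four patterns from the crawler, unchanged.
P_TITLE = ('<p><a style="font-size:18px; line-height:22px; height:22px;" '
           'href=".*?" target="_blank"><b>(.*?)</b></a></p>')
P_CONTEXT = '<div class="contson" id=".*?">(.*?)</div>'
P_YEARS = '<p class="source"><a href=".*?">(.*?)</a>'
P_AUTHOR = '<p class="source"><a href=".*?">.*?</a><span>:</span><.*?>(.*?)</a>'


def parse_poems(html):
    """Hypothetical helper: return one dict per poem found in html."""
    titles = re.findall(P_TITLE, html)
    contexts = re.findall(P_CONTEXT, html, re.S)
    years = re.findall(P_YEARS, html, re.S)
    authors = re.findall(P_AUTHOR, html)
    poems = []
    # zip stops at the shortest list, so a missing field cannot raise IndexError
    for title, dynasty, author, context in zip(titles, years, authors, contexts):
        text = re.sub('<.*?>', '', context).strip()  # drop tags such as <br />
        poems.append({'title': title, 'dynasty': dynasty,
                      'author': author, 'text': text})
    return poems


# Made-up fragment, shaped only to fit the patterns above (an assumption).
SAMPLE = '''
<p><a style="font-size:18px; line-height:22px; height:22px;" href="/a.aspx" target="_blank"><b>Quiet Night Thought</b></a></p>
<p class="source"><a href="/tang">Tang</a><span>:</span><a href="/libai">Li Bai</a></p>
<div class="contson" id="contson1">
Moonlight before my bed,<br />like frost upon the ground.
</div>
'''

for poem in parse_poems(SAMPLE):
    print(poem)
```

Pairing the lists with zip avoids an IndexError when one pattern finds fewer matches than the others, but it can still misalign fields if a match is missing in the middle of a page, so comparing the list lengths is a reasonable extra check.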

4. Difficulties and approach

The hard part of writing this crawler is the regular expressions: they have to match the body of each poem, its author, and its title. To write such a pattern, find the text that comes immediately before and immediately after the content you want, and use those two anchors to pin down exactly what to capture. For example, to match the body of a poem:

    # Extract the content
    p_context = '<div class="contson" id=".*?">(.*?)</div>'
    context = re.findall(p_context, html, re.S)

The content to capture is the part in parentheses. The text before it is <div class="contson" id=".*?"> and the text after it is </div>. Two points need care here. First, the .*? in id=".*?" must be non-greedy, that is, it needs the trailing ?. Without the ?, the pattern keeps matching into the content that follows, so it no longer matches what we need. Second, (.*?) is also non-greedy, so the group ends at the first </div> that follows. The title, dynasty, and author are matched in the same way, so they are not covered one by one here.
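
The two caveats can be seen on a small example. This is a minimal sketch, assuming a made-up two-poem fragment (not real site markup); it also shows why the script passes re.S, since a poem body usually spans several lines.

```python
import re

# Hypothetical fragment with two poems; the second body spans two lines.
html = (
    '<div class="contson" id="contson1">first poem</div>\n'
    '<div class="contson" id="contson2">second\npoem</div>'
)

# Pattern from the script: both quantifiers are non-greedy.
non_greedy = re.findall('<div class="contson" id=".*?">(.*?)</div>', html, re.S)

# Greedy id: .* runs on to the last "> in the text and swallows the first poem.
greedy_id = re.findall('<div class="contson" id=".*">(.*?)</div>', html, re.S)

# Greedy group: (.*) runs on to the last </div>, merging both poems into one match.
greedy_group = re.findall('<div class="contson" id=".*?">(.*)</div>', html, re.S)

# Without re.S, "." does not match a newline, so the two-line body is missed.
no_dotall = re.findall('<div class="contson" id=".*?">(.*?)</div>', html)

print(non_greedy)    # ['first poem', 'second\npoem']
print(greedy_id)     # ['second\npoem']
print(greedy_group)  # one string that still contains '</div>'
print(no_dotall)     # ['first poem']
```

Only the fully non-greedy pattern with re.S returns one entry per poem, which is what the later indexing by position relies on.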

 

Origin www.cnblogs.com/maxxu11/p/12669005.html