1. Target site
Target site: https://so.gushiwen.org/shiwen/default.aspx?
2. Purpose of the crawler
Crawl the text on the target site, such as each poem's content, author, and dynasty, and save it to a local file.
3. Crawler code
# -*- coding: utf-8 -*-
# Crawl the ancient poetry site
import requests
import re


# Append downloaded data to the output file
def write_data(data):
    with open('poetry.txt', 'a') as f:
        f.write(data)


for i in range(1, 10):
    # Target URL for page i
    url = "https://so.gushiwen.org/shiwen/default.aspx?page={}".format(i)
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'}
    html = requests.get(url, headers=headers).content.decode('utf-8')
    # print(html)
    # Extract the titles
    p_title = '<p><a style="font-size:18px; line-height:22px; height:22px;" href=".*?" target="_blank"><b>(.*?)</b></a></p>'
    title = re.findall(p_title, html)
    # Extract the poem text
    p_context = '<div class="contson" id=".*?">(.*?)</div>'
    context = re.findall(p_context, html, re.S)
    # Extract the dynasty
    p_years = '<p class="source"><a href=".*?">(.*?)</a>'
    years = re.findall(p_years, html, re.S)
    # Extract the author
    p_author = '<p class="source"><a href=".*?">.*?</a><span>:</span><.*?>(.*?)</a>'
    author = re.findall(p_author, html)
    # print(context)
    # print(title)
    # print(years)
    # print(author)
    for j in range(len(title)):
        # Strip any HTML tags left inside the poem text
        context[j] = re.sub('<.*?>', '', context[j])
        # The 'gbk' codec cannot encode the character '\u4729';
        # removing it here avoids that error when writing the file
        context[j] = re.sub(r'\u4729', '', context[j])
        # print(title[j])
        # print(years[j])
        # print(author[j])
        # print(context[j])
        # Write the data
        write_data(title[j])
        write_data('\n-' + years[j])
        write_data(':' + author[j])
        write_data(context[j])
    print('Page {} downloaded successfully'.format(i))
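The 'gbk' comment in the code suggests the script was run on a system whose default file encoding is GBK, although the original does not say so. Under that assumption, one possible alternative to stripping the character is to open the output file with an explicit UTF-8 encoding. The sketch below is only an illustration: write_data_utf8 and its path parameter are hypothetical names that are not part of the original script.

```python
import os
import tempfile


def write_data_utf8(data, path='poetry.txt'):
    """Append text to path, always encoding it as UTF-8 (hypothetical helper)."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(data)


# Demo on a temporary file so the real poetry.txt is not touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'poetry_demo.txt')
write_data_utf8('title\n', demo_path)
write_data_utf8('\u4729\n', demo_path)  # the character the 'gbk' codec rejected

with open(demo_path, encoding='utf-8') as f:
    saved = f.read()

# Both writes survive, including the character that GBK cannot encode.
print(len(saved))
```

With the encoding fixed this way, the re.sub call that removes '\u4729' would no longer be needed just to avoid the write error, but whether to keep that character in the output is a separate choice.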
4. Difficulties and approach
The difficulty of this crawler lies in the regular expressions, for example those that match a poem's body, author, and title. To write such an expression, you need to identify the text that comes immediately before and immediately after the content you want (the prefix and the suffix), so that the pattern pins down exactly the content to capture. For example, matching the poem text:
# Extract the poem text
p_context = '<div class="contson" id=".*?">(.*?)</div>'
context = re.findall(p_context, html, re.S)
The content to capture is whatever falls inside the parentheses. The prefix is <div class="contson" id=".*?"> and the suffix is </div>. There are two caveats. First, the .*? in id=".*?" must be non-greedy, that is, it must include the ?. Without the ?, the pattern keeps matching into the content that follows, which makes it impossible to capture what we need. Second, the ? in (.*?) also selects non-greedy mode, so the group stops at the first </div> and captures only one poem. Matching the title, dynasty, and author works in a similar way, so those patterns are not covered one by one here.
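The two caveats can be checked on a small example. The sketch below uses a made-up HTML fragment (an assumption, not real output from the site) with two poem blocks, and compares the non-greedy pattern from the crawler with a greedy variant. It also shows why re.S is passed: the first block contains a line break, which . only matches when re.S is set.

```python
import re

# Made-up fragment imitating two poem blocks on one page (an assumption).
sample = (
    '<div class="contson" id="contson1">first poem<br />\nline two</div>'
    '<p>other markup</p>'
    '<div class="contson" id="contson2">second poem</div>'
)

non_greedy = '<div class="contson" id=".*?">(.*?)</div>'  # pattern from the crawler
greedy = '<div class="contson" id=".*">(.*)</div>'        # same pattern without the ?

lazy_hits = re.findall(non_greedy, sample, re.S)
greedy_hits = re.findall(greedy, sample, re.S)
no_dotall_hits = re.findall(non_greedy, sample)

print(lazy_hits)       # ['first poem<br />\nline two', 'second poem']
print(greedy_hits)     # ['second poem'] (the first poem is swallowed by id=".*")
print(no_dotall_hits)  # ['second poem'] (the first block spans a line break)

# The crawler's clean-up step removes the tags left inside a captured body.
print(repr(re.sub('<.*?>', '', lazy_hits[0])))  # 'first poem\nline two'
```

Only the non-greedy pattern combined with re.S returns one entry per poem block, which is what the loop over title, years, author, and context relies on.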