Scraping the first 20 pages of a joke site with a regex (requests library)

First, use a packet-capture tool or Chrome's developer tools to analyze the site's data. The findings are as follows:

  • The site address is: http://www.budejie.com/text
  • The data is rendered directly in the HTML page. The URL defaults to the first page; http://www.budejie.com/text/2 is the second page, and so on.
  • Inspecting where each joke sits in the page shows that the joke text is inside an <a> tag (a quick check of this follows below).

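Before writing any regex, it is worth confirming that the jokes really are in the server-rendered HTML rather than loaded later by JavaScript. A minimal sketch of that check, assuming only the /detail- link prefix that the regexes below rely on:

import requests

# Fetch the first page with a browser-like User-Agent so the site serves the normal page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
response = requests.get('http://www.budejie.com/text/1', headers=headers)
html_str = response.content.decode()

# If the joke links appear in the raw HTML, the data is server-rendered
print('/detail-' in html_str)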
  • There was still a pitfall here. This was the first regex I wrote:
content_list = re.findall(r'<a href="/detail-.*">(.+?)</a>', html_str) 
  • I then found that it also matched some of the recommended content, so I changed the regex to the following and the problem went away. I won't explain the regex itself in much detail here (a small comparison sketch follows this list):
content_list = re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str) 
  • Now write a script that crawls the first 20 pages and saves them locally. Since the pagination pattern and the matching regex are already known, the code can be written directly.
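To see why anchoring on the container class is needed, here is a minimal, self-contained sketch. The HTML fragment is a hand-written assumption modelled on the page structure (a joke inside a j-r-list-c-desc div plus a recommendation link that also points at /detail-), not markup copied from the site:

import re

# Hypothetical fragment: a real joke inside the j-r-list-c-desc container,
# plus a recommendation block whose link also starts with /detail-
html_str = '''
<div class="j-r-list-c-desc">
    <a href="/detail-100.html">This is a joke</a>
</div>
<div class="j-r-recommend">
    <a href="/detail-200.html">Recommended: some other post</a>
</div>
'''

# The first attempt matches every /detail- link, recommendations included
print(re.findall(r'<a href="/detail-.*">(.+?)</a>', html_str))
# -> ['This is a joke', 'Recommended: some other post']

# Anchoring on the description container keeps only the joke text
print(re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str))
# -> ['This is a joke']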

The full code follows. The overall idea is the same as in the previous two crawler posts on this blog, written in an object-oriented style:

import requests
import re
import json


class NeihanSpider(object):
    """Neihan jokes from budejie.com ('Baisi Budejie'), pages scraped with a regex"""

    def __init__(self):
        self.temp_url = 'http://www.budejie.com/text/{}'  # site address; {} is filled in with the page number
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }

    def pass_url(self, url):  # send the request and get the response
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_first_page_content_list(self, html_str):  # extract the data from one page
        content_list = re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str)  # non-greedy match
        return content_list

    def save_content_list(self, content_list):
        with open('neihan.txt', 'a', encoding='utf-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')  # newline
            print('Saved one page!')

    def run(self):  # main logic
        for i in range(20):  # only crawl the first 20 pages
            # 1. build the URL
            # 2. send the request and get the response
            html_str = self.pass_url(self.temp_url.format(i + 1))
            # 3. extract the data
            content_list = self.get_first_page_content_list(html_str)
            # 4. save the data
            self.save_content_list(content_list)


if __name__ == '__main__':
    neihan = NeihanSpider()
    neihan.run()
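Since save_content_list writes one JSON-encoded string per line (one joke per line), reading the results back is straightforward. A small sketch, assuming the neihan.txt file produced by the run above:

import json

# Each line of neihan.txt holds one JSON-encoded joke string
with open('neihan.txt', encoding='utf-8') as f:
    jokes = [json.loads(line) for line in f]

print(len(jokes))  # total number of jokes saved
print(jokes[0])    # the first joke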

Source: www.cnblogs.com/springionic/p/11110314.html