First, use a packet-capture tool or the Chrome browser's developer tools to analyze the site. The findings are as follows:
- The website address is: http://www.budejie.com/text
- The data is rendered directly in the HTML page. The base URL defaults to the first page; http://www.budejie.com/text/2 is the second page, and so on.
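The pagination rule above can be sketched as a small helper. This is a minimal illustration; the URL template matches the one in the script later in the post, but the function name is my own:

```python
def page_urls(n_pages):
    """Build the URLs for the first n_pages pages of budejie.com/text."""
    temp_url = 'http://www.budejie.com/text/{}'
    # Page numbers start at 1, so shift the 0-based loop index by one.
    return [temp_url.format(i + 1) for i in range(n_pages)]

print(page_urls(3))
# ['http://www.budejie.com/text/1', 'http://www.budejie.com/text/2', 'http://www.budejie.com/text/3']
```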
- Inspect the page to locate the jokes: the text of each joke sits inside an a tag.
- There is a pitfall here. This is the regular expression I wrote first:
content_list = re.findall(r'<a href="/detail-.*">(.+?)</a>', html_str)
- After running it I found that it also matched some of the "recommended" content. I changed the regular expression to the one below, and the problem went away. I won't explain regular expressions in depth here:
content_list = re.findall(r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>', html_str)
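To see why anchoring on the wrapping div helps, here is a small demo on a toy HTML fragment. The class names follow the post; the sample joke text is made up:

```python
import re

# Toy fragment mimicking the page: one real list item and one
# "recommended" item that also uses a /detail- link.
html_str = '''
<div class="j-r-list-c-desc">
    <a href="/detail-100.htm">joke one</a>
</div>
<div class="j-r-recommend">
    <a href="/detail-200.htm">recommended item</a>
</div>
'''

# First attempt: matches every /detail- link, recommendations included.
loose = re.findall(r'<a href="/detail-.*">(.+?)</a>', html_str)

# Refined pattern: requiring the j-r-list-c-desc wrapper right before
# the link keeps only the actual list items.
strict = re.findall(
    r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>',
    html_str)

print(loose)   # ['joke one', 'recommended item']
print(strict)  # ['joke one']
```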
- Now write a script that crawls the first 20 pages and saves them locally. Since the pagination rule and the matching regular expression are both known, the code can be written directly.
The code is below. The overall approach is the same as in the first two crawler posts on this blog, written in an object-oriented style:
import requests
import re
import json


class NeihanSpider(object):
    """Crawl joke text from budejie.com (百思不得姐) page by page with regular expressions."""

    def __init__(self):
        # {} is a placeholder for the page number
        self.temp_url = 'http://www.budejie.com/text/{}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }

    def pass_url(self, url):  # send the request and get the response
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_first_page_content_list(self, html_str):  # extract the data from one page
        content_list = re.findall(
            r'<div class="j-r-list-c-desc">\s*<a href="/detail-.*">(.+?)</a>',
            html_str)  # non-greedy match
        return content_list

    def save_content_list(self, content_list):
        with open('neihan.txt', 'a', encoding='utf-8') as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')  # newline between items
            print('Saved one page!')

    def run(self):  # main logic
        for i in range(20):  # crawl only the first 20 pages
            # 1. build the URL
            # 2. send the request, get the response
            html_str = self.pass_url(self.temp_url.format(i + 1))
            # 3. extract the data
            content_list = self.get_first_page_content_list(html_str)
            # 4. save it
            self.save_content_list(content_list)


if __name__ == '__main__':
    neihan = NeihanSpider()
    neihan.run()
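Because save_content_list writes one JSON-encoded string per line, the output file is effectively JSON Lines and can be read back line by line. The load_contents helper below is my own addition, not part of the original script; the demo round-trips some made-up content using the same write pattern as the spider:

```python
import json

def load_contents(path='neihan.txt'):
    """Read the saved file back: each non-empty line is one JSON string."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip demo with made-up content, written the same way
# save_content_list writes it.
with open('neihan.txt', 'w', encoding='utf-8') as f:
    for content in ['joke one', 'joke two']:
        f.write(json.dumps(content, ensure_ascii=False))
        f.write('\n')

print(load_contents())  # ['joke one', 'joke two']
```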