Crawling movie information from Douban's high-score movie list, sorted by popularity
Analysis
Press F12 to open the developer tools and click the XHR tab, because the page loads additional movie information through Ajax. The response is JSON data that includes a link to each movie's detail page, and this is the information we want to collect.
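To make the shape of that response concrete, here is a minimal sketch of pulling the title and detail link out of one JSON body. The sample payload and the helper name extract_links are made up for illustration, and the field names "title", "rate" and "url" are assumptions about the response; only the top-level "subjects" list is taken from the spider code in this post.

```python
import json

# Hypothetical sample body: the field names and values are assumptions for
# illustration, not a captured response from the site.
SAMPLE_BODY = (
    '{"subjects": [{"title": "Example Movie", "rate": "9.0", '
    '"url": "https://movie.douban.com/subject/0000000/"}]}'
)


def extract_links(json_str):
    """Return (title, url) pairs from a JSON string shaped like SAMPLE_BODY."""
    data = json.loads(json_str)
    # .get() keeps the sketch from crashing if a key is missing.
    return [(m.get("title"), m.get("url")) for m in data.get("subjects", [])]


if __name__ == "__main__":
    for title, url in extract_links(SAMPLE_BODY):
        print(title, url)
```

Checking the real field names in the XHR preview pane before relying on them is the safer route.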
The last parameter, page_start, is the offset of the current page. Adding 20 to it (the value of page_limit) moves to the next page.
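The paging rule can be sketched without sending any request. The URL template is the one used by the spider in this post; the names PAGE_SIZE and page_urls are hypothetical helpers introduced only for this example.

```python
# URL template copied from the spider; {} is filled with the page_start offset.
URL_TEMPLATE = (
    "https://movie.douban.com/j/search_subjects?type=movie"
    "&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86"
    "&sort=recommend&page_limit=20&page_start={}"
)
PAGE_SIZE = 20  # matches page_limit in the URL


def page_urls(pages):
    """Build the URLs of the first `pages` pages by stepping page_start by 20."""
    return [URL_TEMPLATE.format(i * PAGE_SIZE) for i in range(pages)]


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
```

The three URLs printed end in page_start=0, page_start=20 and page_start=40.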
The full code is as follows:
import json

import requests


class DoubanSpider:
    def __init__(self):
        self.url_temp = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start={}"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}

    def parse_url(self, url):  # send the request and get the response
        print(url)
        # headers must be passed by keyword; the second positional argument is params
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, json_str):  # extract the data
        dict_ret = json.loads(json_str)
        print(dict_ret)
        content_list = dict_ret["subjects"]  # all movie data
        return content_list

    def save_content_list(self, content_list):  # save the data
        with open("douban.txt", 'a', encoding="utf-8") as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")  # write a newline
        print("Save success")

    def run(self):  # main logic
        num = 0
        while True:
            # 1. build the start URL
            url = self.url_temp.format(num)
            # 2. send the request and get the response
            json_str = self.parse_url(url)
            # 3. extract the data
            content_list = self.get_content_list(json_str)
            # 4. save
            self.save_content_list(content_list)
            # 5. check whether there is a next page
            if len(content_list) < 20:
                break
            # 6. build the URL of the next page
            num += 20


if __name__ == '__main__':
    douban = DoubanSpider()
    douban.run()