Scraping Douban high-score movie information with a Python spider

The goal: crawl Douban's high-score movie list, sorted by popularity, and save each movie's information.

Analysis

Press F12 to open the developer tools and switch to the XHR tab: the page loads additional movie entries via Ajax. The response is JSON data containing each movie's details and its link, which is exactly the information we want to extract.
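To see what we will be parsing, here is a minimal sketch of the JSON shape. The field names (`subjects`, `title`, `rate`, `url`) are the ones the spider code below relies on; the sample values are made up for illustration.

```python
import json

# A trimmed, hand-written sample of the endpoint's JSON response
# (structure assumed from the fields used in the spider; values invented).
sample = '''
{"subjects": [
    {"title": "肖申克的救赎", "rate": "9.7",
     "url": "https://movie.douban.com/subject/1292052/"}
]}
'''

data = json.loads(sample)
for movie in data["subjects"]:
    # each entry carries the movie's title, score, and detail-page link
    print(movie["title"], movie["rate"], movie["url"])
```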


Each request returns one page of results; adding 20 to the page_start parameter moves to the next page.
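The pagination scheme can be sketched as: format the URL template with successive multiples of 20 (the template string here is the same one used in the spider below).

```python
# page_start advances in steps of page_limit (20), so incrementing the
# counter by 20 yields the next page of results.
url_temp = ("https://movie.douban.com/j/search_subjects?type=movie"
            "&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend"
            "&page_limit=20&page_start={}")

for num in range(0, 60, 20):  # first three pages: 0, 20, 40
    print(url_temp.format(num))
```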


The full code is below.

import json

import requests

class DoubanSpider:
    def __init__(self):
        self.url_temp = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start={}"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}

    def parse_url(self, url):  # send a request and get the response
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, json_str):  # extract the data
        dict_ret = json.loads(json_str)
        print(dict_ret)
        content_list = dict_ret["subjects"]  # all movie data
        return content_list

    def save_content_list(self, content_list):  # save
        with open("douban.txt", 'a', encoding="utf-8") as f:
            for content in content_list:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write("\n")  # write a newline
        print("Save success")

    def run(self):  # main logic
        num = 0
        while True:
            # 1. build the start URL
            url = self.url_temp.format(num)
            # 2. send the request and get the response
            json_str = self.parse_url(url)
            # 3. extract the data
            content_list = self.get_content_list(json_str)
            # 4. save
            self.save_content_list(content_list)
            # 5. check whether there is a next page
            if len(content_list) < 20:
                break
            # 6. build the next page's URL
            num += 20


if __name__ == '__main__':
    douban = DoubanSpider()
    douban.run()
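Because save_content_list writes one JSON object per line (JSON Lines format), the saved file can be read back line by line later. A minimal sketch, using an in-memory list of lines instead of the real douban.txt so it runs without the spider's output:

```python
import json

# Each line is one movie serialized by json.dumps (values invented here);
# parse the lines back into dicts, skipping any blank lines.
lines = [
    '{"title": "A", "rate": "9.0"}',
    '{"title": "B", "rate": "8.5"}',
]
movies = [json.loads(line) for line in lines if line.strip()]
print(movies[0]["title"])  # → A
```

To read the actual output file, replace the list with `open("douban.txt", encoding="utf-8")`.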


Origin www.cnblogs.com/zq8421/p/11037666.html