Crawler script (scraping the Douban Movie Top 250)

Operating environment: Windows 7, Python 2.7.13

Requirement analysis: crawl the titles of the Top 250 movies as ranked on the Douban Movie site.

Steps :

  1. Fetch the source code of the website.

  2. Use regular expressions to extract the movie titles.

  3. Save the movie titles to a text file.

  4. Repeat the above three steps until all of the Top 250 titles are saved.

Step 1: Fetch the website's source code.

  Analyze the site's URLs to find the pattern.

  URL of the first page: https://movie.douban.com/top250?start=0&filter=

  URL of the second page: https://movie.douban.com/top250?start=25&filter=

  URL of the third page: https://movie.douban.com/top250?start=50&filter=

  URL of the last page: https://movie.douban.com/top250?start=225&filter=
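  A quick sanity check of this pattern: the ten page URLs can be generated from the start parameter alone (the loop bound of 10 and the step of 25 come straight from the URLs above):

```python
# Generate the ten page URLs from the pattern observed above:
# start goes 0, 25, 50, ..., 225 in steps of 25.
base = 'https://movie.douban.com/top250?start=%s&filter='
urls = [base % (25 * i) for i in range(10)]
# urls[0] ends with start=0, urls[-1] with start=225
```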

  First, use the requests library to download the page source:

import requests

response = requests.get('https://movie.douban.com/top250?start=0&filter=')
page = response.content  # raw bytes of the page's HTML source
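  One caveat that is an assumption on my part, not something the original text relies on: Douban may reject requests that carry no browser-like User-Agent header. A sketch of building such a request with requests (the request is only prepared here, not sent, and the header value is just an example string):

```python
import requests

# Build, but do not send, a GET request carrying a browser-like
# User-Agent header; some sites refuse the default requests header.
req = requests.Request('GET',
                       'https://movie.douban.com/top250?start=0&filter=',
                       headers={'User-Agent': 'Mozilla/5.0'})
prepared = req.prepare()
# To actually send it: requests.Session().send(prepared)
```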

Step 2: Use regular expressions to extract the title of the movie.

  Inspect the HTML source and write a regular expression that extracts the content you need.

import re

pattern = re.compile(r'<img width="100" alt=".*?"')  # the movie name appears in an <img> tag's alt attribute
movie_list = re.findall(pattern, page)
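  To see what the pattern actually returns, here it is applied to a hand-written stand-in for the page source (the sample markup below is simplified, not the real Douban HTML; Python 3 string handling is assumed):

```python
import re

# A simplified stand-in for the page source: two <img> tags with
# movie names in their alt attributes.
sample = ('<img width="100" alt="肖申克的救赎" src="a.jpg">'
          '<img width="100" alt="霸王别姬" src="b.jpg">')
pattern = re.compile(r'<img width="100" alt=".*?"')
matches = re.findall(pattern, sample)
# Each match still carries the surrounding markup, e.g.
# '<img width="100" alt="肖申克的救赎"' -- it gets cleaned up in step 3.
```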

Step 3: Save the movie title to text.

file = open(u'豆瓣电影Top250.txt', 'w')
for i in movie_list:
    file.write(i)
file.close()

  Opening 豆瓣电影Top250.txt at this point shows that each saved entry still carries the surrounding markup, such as <img width="100" alt="... . The data therefore needs to be filtered, so the code for step 3 becomes:

file = open(u'豆瓣电影Top250.txt', 'w')
for k in movie_list:
    k = k.replace('<img width="100" alt="', '')  # strip the leading markup
    k = k.replace('"', '')  # strip the trailing quote
    file.write(k)
    file.write('\n')  # one movie name per line
file.close()
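  The effect of the two replace() calls on one raw match, shown in isolation (the sample input mirrors what the regex returns; Python 3 string handling is assumed):

```python
# Strip the leading markup and the trailing quote from one raw match.
raw = '<img width="100" alt="肖申克的救赎"'
title = raw.replace('<img width="100" alt="', '').replace('"', '')
# title is now just the movie name
```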

Step 4: Repeat the above three steps until all of the Top 250 titles are saved.

  Here a for loop fetches the data page by page, essentially combining the code from steps 1, 2, and 3 and wrapping it in a loop.

import re
import requests

n = 0
file = open('aa.txt', 'w')
for i in range(10):  # there are 10 pages, so loop 10 times
    response = requests.get('https://movie.douban.com/top250?start=%s&filter=' % n)
    page = response.content
    pattern = re.compile(r'<img width="100" alt=".*?"')
    movie_list = re.findall(pattern, page)
    for k in movie_list:
        k = k.replace('<img width="100" alt="', '')
        k = k.replace('"', '')
        file.write(k)
        file.write('\n')  # one movie name per line
    n += 25  # the URL pattern from step 1: start increases by 25 per page
file.close()
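  A small robustness tweak worth noting (my suggestion, not part of the original code): a with block closes the file even if a request fails mid-loop. Sketched here with stub page data in place of the network calls so the structure is self-contained:

```python
import os
import re
import tempfile

# Stand-in for the downloaded pages; real code would call requests.get here.
pages = ['<img width="100" alt="Movie %d"' % i for i in range(3)]
pattern = re.compile(r'<img width="100" alt=".*?"')

path = os.path.join(tempfile.gettempdir(), 'aa.txt')
with open(path, 'w') as f:  # closed automatically, even on an exception
    for page in pages:
        for k in re.findall(pattern, page):
            k = k.replace('<img width="100" alt="', '').replace('"', '')
            f.write(k + '\n')
```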

 Summary: the above code already completes the task. Below is the complete code, organized and packaged into a class:

# -*- coding: utf-8 -*-
import re
import requests

class MovieTop250Spider:
    def __init__(self):
        self.n = 0
        self.url = 'https://movie.douban.com/top250?start=%s&filter=' % self.n

    def getPage(self, url):  # download the page's HTML source
        response = requests.get(url = url)
        page = response.content
        return page

    def spider(self):
        pattern = re.compile(r'<img width="100" alt=".*?"')  # pattern that captures the movie names
        file = open(u'豆瓣电影Top250.txt', 'w')
        for i in range(10):
            page = self.getPage(self.url)
            movie_list = re.findall(pattern, page)
            for k in movie_list:
                k = k.replace('<img width="100" alt="', '')
                k = k.replace('"', '')
                file.write(k)
                file.write('\n')
            self.n += 25
            self.url = 'https://movie.douban.com/top250?start=%s&filter=' % self.n
        file.close()

movie = MovieTop250Spider()
movie.spider()
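  One design tweak worth considering (my refactor, not the author's code): a capture group in the regex makes re.findall return the bare titles, so the two replace() calls disappear. A parsing helper sketched that way (Python 3 assumed):

```python
import re

def extract_titles(page):
    # With a capture group, findall returns only the alt text,
    # so no post-hoc string cleanup is needed.
    return re.findall(r'<img width="100" alt="(.*?)"', page)

sample = '<img width="100" alt="阿甘正传" src="x.jpg">'
titles = extract_titles(sample)
```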

   Running it produces 豆瓣电影Top250.txt with the 250 movie titles, one per line.
