Operating environment: Windows 7, Python 2.7.13
Requirement: crawl the titles of the Top 250 movies ranked on Douban Movies.
Steps :
1. Crawl the source code of the website.
2. Use regular expressions to extract the movie titles.
3. Save the movie titles to a text file.
4. Repeat the above three steps until all 250 titles are saved.
Step 1: Crawl the website source code.
Analyze the urls of the site to find their pattern.
First page: https://movie.douban.com/top250?start=0&filter=
Second page: https://movie.douban.com/top250?start=25&filter=
Third page: https://movie.douban.com/top250?start=50&filter=
Last page: https://movie.douban.com/top250?start=225&filter=
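The start parameter therefore advances by 25 per page (25 movies per page, 10 pages). As a quick sketch, all ten page urls can be generated from that pattern:

```python
# Each page shows 25 movies, so start = 25 * page_index for pages 0..9.
urls = ['https://movie.douban.com/top250?start=%s&filter=' % (25 * i)
        for i in range(10)]

print(urls[0])   # first page, start=0
print(urls[-1])  # last page, start=225
```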
First, use the API of the requests library to grab the source code of the website:
import requests

response = requests.get('https://movie.douban.com/top250?start=0&filter=')
page = response.content
Step 2: Use regular expressions to extract the title of the movie.
Observe the html source code and use regular expressions to retrieve the content you need.
import re

pattern = re.compile(r'<img width="100" alt=".*?"')
movie_list = re.findall(pattern, page)
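To see what this pattern actually captures, here is the same regex run against a made-up html fragment shaped like the list markup it targets (an `<img>` tag whose alt attribute holds the title); the titles below are placeholders, not real scraped data:

```python
import re

# Hypothetical fragment in the shape the pattern expects; titles are placeholders.
page = ('<img width="100" alt="Movie A" src="a.jpg">'
        '<img width="100" alt="Movie B" src="b.jpg">')

pattern = re.compile(r'<img width="100" alt=".*?"')
movie_list = re.findall(pattern, page)
# Each match still carries the tag prefix and a trailing quote:
# ['<img width="100" alt="Movie A"', '<img width="100" alt="Movie B"']
```

The non-greedy `.*?` stops at the first `"` after `alt="`, so exactly one title is captured per tag.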
Step 3: Save the movie title to text.
file = open(u'豆瓣电影Top250.txt', 'w')
for i in movie_list:
    file.write(i)
file.close()
Opening 豆瓣电影Top250.txt at this point shows that each entry still carries the surrounding html: the leading <img width="100" alt=" and a trailing quote. The data needs to be filtered, so the code for step 3 becomes:
file = open(u'豆瓣电影Top250.txt', 'w')
for k in movie_list:
    k = k.replace('<img width="100" alt="', '')  # filter out the useless leading characters
    k = k.replace('"', '')  # filter out the trailing quote
    file.write(k)
    file.write('\n')  # add a newline so each movie name occupies one line
file.close()
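The two replace() calls can be checked in isolation on one match (the title here is a placeholder): stripping the tag prefix and then the trailing quote leaves just the name.

```python
# A single match as returned by the regex; "Movie A" is a placeholder title.
match = '<img width="100" alt="Movie A"'

# Strip the tag prefix, then the leftover trailing quote.
title = match.replace('<img width="100" alt="', '').replace('"', '')
# title is now just 'Movie A'
```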
Step 4: Repeat the above 3 steps until all the titles of Top250 are saved.
A for loop is used here to fetch the data page by page, essentially combining the code of steps 1, 2, and 3 and wrapping it in a loop.
n = 0
file = open('aa.txt', 'w')
for i in range(10):  # there are only 10 pages, so loop 10 times
    response = requests.get('https://movie.douban.com/top250?start=%s&filter=' % n)
    page = response.content
    pattern = re.compile(r'<img width="100" alt=".*?"')
    movie_list = re.findall(pattern, page)
    for k in movie_list:
        k = k.replace('<img width="100" alt="', '')
        k = k.replace('"', '')
        file.write(k)
        file.write('\n')
    # one page of movie names has now been crawled
    n += 25  # from the url pattern found in step 1, start increases by 25 per page
file.close()
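The pagination logic of this loop can be exercised without touching the network by stubbing out the download step; fetch below is a hypothetical stand-in for requests.get, used only to record which urls the loop would request:

```python
requested = []  # records the urls the loop would fetch

def fetch(url):
    # Hypothetical stand-in for requests.get(url).content.
    requested.append(url)
    return ''  # no real html is needed to check the pagination

n = 0
for i in range(10):
    fetch('https://movie.douban.com/top250?start=%s&filter=' % n)
    n += 25

# requested now covers start=0, 25, ..., 225 -- all ten pages
```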
Summary: the code above already completes the task. Below is the complete code, organized and packaged into a class:
# -*- coding: utf-8 -*-
import re
import requests

class MovieTop250Spider:
    def __init__(self):
        self.n = 0
        self.url = 'https://movie.douban.com/top250?start=%s&filter=' % self.n

    def getPage(self, url):  # download a page's html source
        response = requests.get(url=url)
        page = response.content
        return page

    def spider(self):
        pattern = re.compile(r'<img width="100" alt=".*?"')  # pattern for matching movie names
        file = open(u'豆瓣电影Top250.txt', 'w')
        for i in range(10):
            page = self.getPage(self.url)
            movie_list = re.findall(pattern, page)
            for k in movie_list:
                k = k.replace('<img width="100" alt="', '')
                k = k.replace('"', '')
                file.write(k)
                file.write('\n')
            self.n += 25
            self.url = 'https://movie.douban.com/top250?start=%s&filter=' % self.n
        file.close()

movie = MovieTop250Spider()
movie.spider()
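One portability note: this code targets Python 2, where response.content is a byte string that a text regex can search directly. Under Python 3, content is bytes and must be decoded first. A minimal sketch of the difference, using an inline byte string in place of a real response:

```python
import re

# Stand-in for response.content under Python 3: raw bytes, not text.
# The title is a placeholder, not real scraped data.
raw = b'<img width="100" alt="Some Title" src="poster.jpg">'

pattern = re.compile(r'<img width="100" alt=".*?"')

# Decode to str first; then the same text pattern works as in Python 2.
page = raw.decode('utf-8')
matches = re.findall(pattern, page)
```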
The result is as follows: