The project is roughly divided into the following steps:
- Fetch the pages with the requests library
- Parse the pages with the lxml library and XPath
- Download the movie poster images
- Store the movie information as a CSV file with the pandas library
- Add a loop to save all the images and information
First, we build a small framework to fetch the HTML pages of the Douban movie site:

```python
import requests

# Fetch the HTML page
def get_html(url):
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
    try:
        html = requests.get(url, headers=headers)
        html.encoding = html.apparent_encoding
        if html.status_code == 200:
            print("Fetched the HTML page successfully!")
        return html.text
    except Exception as e:
        print("Failed to fetch the HTML page: %s" % e)
        return None

if __name__ == '__main__':
    url = "https://movie.douban.com/top250"
    html = get_html(url)
```
Next, we analyze the Douban movie web page:
Opening the developer tools (F12), we can see that each movie on a page corresponds to one of these li tags:
We use XPath Helper to locate these li tags:
Continuing under the li tag we just found, we locate the movie details:
First is the movie name:
Next, the director and cast information:
Then the year, country, and genre:
Next, the movie rating:
Then the number of raters:
Next, the one-line quote:
(The movie ranked 247 has no quote, so we need to handle that case later.)
Finally, the movie poster image:
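Before writing the full parser, the XPath calls can be tried on a tiny stand-in snippet. The markup below is simplified for illustration and is not Douban's real HTML; only the class names match the selectors we will use:

```python
from lxml import etree

# Simplified stand-in markup (not the real Douban page)
snippet = """
<ol class="grid_view">
  <li>
    <a><span class="title">Movie A</span></a>
    <span class="rating_num">9.7</span>
  </li>
</ol>
"""

tree = etree.HTML(snippet)
lis = tree.xpath("//ol[@class='grid_view']/li")  # xpath() returns a list of elements
# The leading . makes the path relative to the current li node
title = lis[0].xpath(".//span[@class='title']/text()")[0]
score = lis[0].xpath(".//span[@class='rating_num']/text()")[0]
print(title, score)  # → Movie A 9.7
```

The same pattern (absolute XPath to collect the li nodes, then relative XPath inside each node) is what the real parser uses below.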
After analyzing the page, we write a function to parse it:

```python
from lxml import etree

# Parse the HTML page
def parse_html(html):
    movies = []   # movie information
    imgurls = []  # poster image URLs
    html = etree.HTML(html)
    lis = html.xpath("//ol[@class='grid_view']/li")  # xpath() returns a list
    # Extract the information for each movie
    for li in lis:
        # The leading . in each XPath below makes it relative to the current li node
        name = li.xpath(".//a/span[@class='title'][1]/text()")[0]  # element 0 of the list is the movie name
        # Strip the extra characters from the strings
        director_actor = li.xpath(".//div[@class='bd']/p/text()[1]")[0].replace(' ', '').replace('\n', '').replace('/', '').replace('\xa0', '')
        info = li.xpath(".//div[@class='bd']/p/text()[2]")[0].replace(' ', '').replace('\n', '').replace('\xa0', '')
        rating_score = li.xpath(".//span[@class='rating_num']/text()")[0]
        rating_num = li.xpath(".//div[@class='star']/span[4]/text()")[0]
        introduce = li.xpath(".//p[@class='quote']/span/text()")
        # Store the information in a movie dict, handling the Top 247 movie that has no quote
        movie = {'name': name, 'director_actor': director_actor, 'info': info,
                 'rating_score': rating_score, 'rating_num': rating_num,
                 'introduce': introduce[0] if introduce else None}
        movies.append(movie)
        imgurl = li.xpath(".//img/@src")[0]  # extract the poster image URL
        imgurls.append(imgurl)
    return movies, imgurls

if __name__ == '__main__':
    url = 'https://movie.douban.com/top250'
    html = get_html(url)
    movies, imgurls = parse_html(html)  # parse once and unpack both results
```
During testing, we found that director_actor and info contain \xa0 non-breaking spaces, which we remove with .replace('\xa0', '').
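As a quick illustration of the cleanup chain (the sample string below is made up, mimicking the year/country/genre info field):

```python
# \xa0 is a non-breaking space; it looks like a normal space but is a different character
raw = " 1994\xa0/\xa0USA\xa0/\xa0Drama \n"
cleaned = raw.replace(' ', '').replace('\n', '').replace('\xa0', '')
print(cleaned)  # → 1994/USA/Drama
```

Note that replacing ' ' removes all ordinary spaces too, which is why the parser applies it only to fields where internal spaces are unwanted.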
Next, we write a function to save the movie poster images:

```python
import os

# Save a poster image
def download_img(url, movie):
    if 'movieposter' not in os.listdir(r'S:\大一寒假学习'):
        os.mkdir(r'S:\大一寒假学习\movieposter')
    os.chdir(r'S:\大一寒假学习\movieposter')
    img = requests.get(url).content  # .content is bytes, i.e. binary data
    with open(movie['name'] + '.jpg', 'wb') as f:
        f.write(img)
```
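As an aside, the listdir-then-mkdir check can be replaced by os.makedirs with exist_ok=True, which is safe to call whether or not the directory already exists. The temporary path below is just for demonstration:

```python
import os
import tempfile

# Demonstration in a throwaway temp directory (not the real save path)
base = tempfile.mkdtemp()
poster_dir = os.path.join(base, 'movieposter')
os.makedirs(poster_dir, exist_ok=True)  # creates the directory
os.makedirs(poster_dir, exist_ok=True)  # no error when it already exists
print(os.path.isdir(poster_dir))  # → True
```

This also avoids depending on the current working directory at the time of the first call.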
Finally, we add a loop to crawl the poster images and related information of all the movies.
There are 25 movies on each page, for a total of ten pages; each page's URL is determined by the start parameter:
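The ten page URLs can be generated up front to verify the start parameter pattern (0, 25, ..., 225):

```python
# Each page shows 25 movies, so page i starts at offset i * 25
urls = ["https://movie.douban.com/top250?start=%d&filter=" % (i * 25) for i in range(10)]
print(urls[0])   # → https://movie.douban.com/top250?start=0&filter=
print(urls[-1])  # → https://movie.douban.com/top250?start=225&filter=
```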
```python
import pandas as pd

if __name__ == '__main__':
    MOVIES = []
    IMGURLS = []
    for i in range(10):
        url = "https://movie.douban.com/top250?start=" + str(i * 25) + "&filter="
        html = get_html(url)
        movies, imgurls = parse_html(html)  # parse each page once
        MOVIES.extend(movies)
        IMGURLS.extend(imgurls)
    for i in range(250):
        download_img(IMGURLS[i], MOVIES[i])
        print("Downloading image %d..." % (i + 1))
    os.chdir(r'S:\大一寒假学习')  # remember to change the working directory back
    moviedata = pd.DataFrame(MOVIES)  # convert the movie information to a DataFrame
    moviedata.to_csv('movie.csv')
    print("Movie information saved successfully!")
```
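When the CSV is opened in Excel, the default row index column and encoding can be awkward; a variant (the sample rows below are made up, not real scraped data) passes index=False to drop the index and utf-8-sig so Excel detects the encoding of Chinese text:

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the scraped MOVIES list
movies = [
    {'name': 'Movie A', 'rating_score': '9.7'},
    {'name': 'Movie B', 'rating_score': '9.6'},
]
df = pd.DataFrame(movies)
path = os.path.join(tempfile.mkdtemp(), 'movie_sample.csv')
# index=False drops the 0,1,2,... index column; utf-8-sig adds a BOM for Excel
df.to_csv(path, index=False, encoding='utf-8-sig')
```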
Run the code:
At this point, the project is complete.
Complete source code: https://github.com/Giyn/Spider