I'm not sure exactly why, but every web-scraping course seems to start beginners off by crawling Douban. Probably because it is a simple static site, it lets newcomers build some confidence early on. So I wrote a simple crawler myself and added a few comments. The code is below (it requires the requests and lxml libraries):
# Crawl Douban movie information:
# * movie title
# * rating
# * link
# Step 1: import the third-party libraries
import csv
import lxml.html
import requests

# Step 2: fetch the page source
url = 'https://movie.douban.com/top250?start={}&filter='
# Douban tends to reject the default requests User-Agent, so send a browser-like one
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def getUrlCode(url):
    source = requests.get(url, headers=HEADERS)
    source.encoding = 'utf-8'
    return source.text  # return the decoded page source
# Step 3: parse the page source
def getEveryItem(source):
    movieList = []
    selector = lxml.html.document_fromstring(source)
    movieItemClassList = selector.xpath('//div[@class="info"]')
    for eachMovie in movieItemClassList:
        movieDict = {}  # a dict to hold the fields for one movie
        title = eachMovie.xpath('div[@class="hd"]/a/span[@class="title"]/text()')[0]
        link = eachMovie.xpath('div[@class="hd"]/a/@href')[0]
        # take the first match; without [0], star would be a one-element list
        star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        movieDict['title'] = title
        movieDict['url'] = link
        movieDict['star'] = star
        movieList.append(movieDict)
    return movieList  # return the data for one page
# Fetch all 10 pages (25 movies per page, 250 in total)
def getEveryPage(url):
    allMovieList = []
    for i in range(10):
        pageLink = url.format(i * 25)  # start=0, 25, 50, ...
        source = getUrlCode(pageLink)
        allMovieList += getEveryItem(source)
    return allMovieList  # return the data from all 10 pages
# Write the collected data to a CSV file
def writeData(allMovieList):
    # newline='' prevents the csv module from emitting blank rows on Windows
    with open('./DoubanTest.csv', 'w', encoding='UTF-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'star', 'url'])
        writer.writeheader()
        for each in allMovieList:
            writer.writerow(each)

if __name__ == "__main__":
    allMovieList = getEveryPage(url)
    print(allMovieList)
    writeData(allMovieList)
The scraping approach is fairly basic, using XPath to extract the fields, but that is exactly what keeps the logic easy to follow.
I still ran into some errors while writing the code, and after repeated checking they all turned out to be embarrassingly silly mistakes, such as a missing "]" in [@class="star"]. I have to admit that when the surrounding syntax is otherwise fine, spotting this kind of typo is genuinely hard, so I'd advise everyone to check carefully for these mistakes when debugging.
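One way to catch such XPath typos early is to test each expression against a small inline HTML sample before running the full crawl. Below is a minimal sketch of that idea; the HTML snippet is a made-up stand-in that mimics the structure the crawler expects, not the real Douban page:

```python
import lxml.html

# A made-up snippet mimicking the markup structure the crawler relies on
sample = '''
<div class="info">
  <div class="hd"><a href="https://example.com/movie/1">
    <span class="title">Sample Movie</span></a></div>
  <div class="bd"><div class="star">
    <span class="rating_num">9.7</span></div></div>
</div>
'''

doc = lxml.html.document_fromstring(sample)
info = doc.xpath('//div[@class="info"]')[0]

# Each expression should return exactly one match; an empty list usually
# means a typo in a class name or a missing bracket in the expression
for expr in ['div[@class="hd"]/a/span[@class="title"]/text()',
             'div[@class="hd"]/a/@href',
             'div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()']:
    result = info.xpath(expr)
    print(expr, '->', result)
```

If an expression prints an empty list here, the same expression will silently return nothing against the live page, so this kind of quick check pinpoints the typo before it turns into an IndexError deep inside the crawler.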