Python爬虫入门案例（二）电影票房数据库爬取（request+XPath+csv）

大家学完第一个案例爬取豆瓣电影数据之后，对爬虫的基本概念以及流程有了大体的了解。其实我个人认为，爬虫的流程都是一样的，只不过方法不同而已。
今天我们就来学习第二个案例，爬取电影票房数据库中的电影数据信息。
网站地址：http://58921.com/
在这里插入图片描述
下面就开始爬取。大概分为三步；
一：获取网页响应
二：获取网页所需内容
三：保存数据
1.获取相应。获取相应的方式与第一个案例一致，直接上代码。

def get_response(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url,headers=headers)
    response.encoding = 'UTF-8'
    return response.text

2.分析源码，通过xpath获取我们需要的内容
在这里插入图片描述
通过分析源码，我们可以看到，每一条电影数据，都存在一个tr标签中，且存入的格式都非常规范。所以我们可以通过锁定这里的tr标签来获取数据。单数需要注意的一点是，tr标签的属性是有不同的，有even有odd。所以在这里我们需要加一个and判断。

    nodes = text.xpath('//div[@class="clearfix front_block"]//tr[@class="odd"] | //div[@class="clearfix front_block"]//tr[@class="even"]')

爬取内容代码：

def get_nodes(html):
    text = etree.HTML(html)
    nodes = text.xpath('//div[@class="clearfix front_block"]//tr[@class="odd"] | //div[@class="clearfix front_block"]//tr[@class="even"]')
    infos = []
    for node in nodes:
        key = {}
        key['movieName'] = str(node.xpath('.//td/a/@title')).strip("[']")
        key['moviePaiPian'] = str(node.xpath('.//td[2]/text()')).strip("[']")
        key['moviePeople'] = str(node.xpath('.//td[3]/text()')).strip("[']")
        key['movieAdvance'] = str(node.xpath('.//td[last()-1]/text()')).strip("[']")
        key['movieAccumulate'] = str(node.xpath('.//td[last()]/text()')).strip("[']")
        infos.append(key)
    return infos

3.保存数据。在这里我们通过csv方法将数据存入csv文件中。csv文件可以通过excel表格查看，非常方便。所以一般格式化的数据我们都喜欢存入csv文件，这样看起来数据非常规范整齐。

def save_file(infos):
    headers = ['电影名称', '当日排片', '当日人次','当日预售票房','累计票房']
    with open('movieDataBase.csv','a+',encoding='UTF-8',newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'],key['moviePaiPian'],key['moviePeople'],key['movieAdvance'],key['movieAccumulate']])

这样我们的爬虫就写好了。运行一下看一下结果吧。（后附加文件效果以及完整代码）
在这里插入图片描述
完整代码

import requests
from lxml import etree
import csv

def get_response(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url,headers=headers)
    response.encoding = 'UTF-8'
    return response.text
def get_nodes(html):
    text = etree.HTML(html)
    nodes = text.xpath('//div[@class="clearfix front_block"]//tr[@class="odd"] | //div[@class="clearfix front_block"]//tr[@class="even"]')
    infos = []
    for node in nodes:
        key = {}
        key['movieName'] = str(node.xpath('.//td/a/@title')).strip("[']")
        key['moviePaiPian'] = str(node.xpath('.//td[2]/text()')).strip("[']")
        key['moviePeople'] = str(node.xpath('.//td[3]/text()')).strip("[']")
        key['movieAdvance'] = str(node.xpath('.//td[last()-1]/text()')).strip("[']")
        key['movieAccumulate'] = str(node.xpath('.//td[last()]/text()')).strip("[']")
        infos.append(key)
    return infos
def save_file(infos):
    headers = ['电影名称', '当日排片', '当日人次','当日预售票房','累计票房']
    with open('movieDataBase.csv','a+',encoding='UTF-8',newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'],key['moviePaiPian'],key['moviePeople'],key['movieAdvance'],key['movieAccumulate']])
if __name__ == '__main__':
    url = 'http://58921.com/'
    html = get_response(url)
    infos = get_nodes(html)
    save_file(infos)

希望看完之后会对大家学习爬虫有帮助，如果有问题可以留言，评论，如果能点个赞就是最大的支持了。
每天进步一点点，Keep Going！

Python爬虫入门案例（二）电影票房数据库爬取（request+XPath+csv）

猜你喜欢