Scraping a Website's Top 250 Movies with Python and Saving Them to an Excel File

Introduction

This article shows how to use a Python scraper and the openpyxl library to collect a website's Top 250 movie list and save the data to an Excel file. It is divided into the following parts:

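1. Scraping the Top 250 movie data

First, we need a Python scraper to fetch the site's Top 250 movie list. To avoid being blocked by the site's anti-scraping checks, we send a browser-like User-Agent request header. We use the requests and BeautifulSoup libraries for this task.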

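import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent header, sent to avoid being blocked by anti-scraping checks.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}

# The first entry is the header row: number, title, tagline, rating, link.
lst = [['编号', '名称', '推荐语', '评分', '链接地址']]
for i in range(10):  # 10 pages x 25 movies = 250
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # stop early on an HTTP error response
    bs = BeautifulSoup(resp.text, 'html.parser')
    grid_view = bs.find('ol', class_='grid_view')
    all_li = grid_view.find_all('li')
    for item in all_li:
        no = item.find('em').text                       # rank in the list
        title = item.find('span', class_='title').text  # first title span
        inq = item.find('span', class_='inq')           # tagline; may be missing
        rat = item.find('span', class_='rating_num').text
        url_films = item.find('a')['href']
        lst.append([no, title, inq.text if inq is not None else '', rat, url_films])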

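In the code above, a loop walks through the ten pages of the Top 250 list. Each page lists 25 movies, so the start parameter in the URL increases by 25 on each iteration. For each page, requests fetches the HTML, and BeautifulSoup parses it and locates each movie's fields: number, title, tagline, rating, and link. An entry without a tagline gets an empty string. Finally, each movie's fields are appended to the lst list as one row.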
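
The paging and parsing steps can be tried offline, without sending any request. The sketch below is only an illustration: the markup in SAMPLE_HTML is made up and much simpler than the real page, and the helper names page_starts and parse_items are hypothetical, not part of the script above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mirrors the selectors used in the
# article's code; the real page contains more elements per entry.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <a href="https://example.com/film/1"><span class="title">Film A</span></a>
    <span class="rating_num">9.7</span>
    <span class="inq">A short tagline</span>
  </li>
  <li>
    <em>2</em>
    <a href="https://example.com/film/2"><span class="title">Film B</span></a>
    <span class="rating_num">9.6</span>
  </li>
</ol>
"""


def page_starts(pages=10, page_size=25):
    """Return the start offsets used in the page URLs: 0, 25, ..., 225."""
    return [i * page_size for i in range(pages)]


def parse_items(html):
    """Extract [number, title, tagline, rating, link] rows from one page."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find('ol', class_='grid_view').find_all('li'):
        inq = item.find('span', class_='inq')  # None when there is no tagline
        rows.append([
            item.find('em').text,
            item.find('span', class_='title').text,
            inq.text if inq is not None else '',
            item.find('span', class_='rating_num').text,
            item.find('a')['href'],
        ])
    return rows


rows = parse_items(SAMPLE_HTML)
print(page_starts())
print(rows)
```

The second sample entry has no tagline span, so its third field comes back as an empty string instead of raising an AttributeError on None.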

2. Saving the data to an Excel file

We now have the site's Top 250 movie data. Next, we use the openpyxl library to save it to an Excel file.

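import openpyxl

# Create a new workbook and rename its default worksheet.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'

# Write each entry of lst (header row first) as one worksheet row.
for item in lst:
    sheet.append(item)
# openpyxl writes the xlsx format, so the file needs an .xlsx extension.
wb.save('films.xlsx')

In the code above, we first create a workbook with openpyxl and rename its active worksheet to "我的电影" ("My Movies"). A loop then appends each entry of the lst list to the worksheet as one row. Finally, the workbook is saved as films.xlsx. The extension is .xlsx rather than .xls because openpyxl only writes the newer xlsx format.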
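
The row-by-row append can be tried on its own, without scraping anything. The sketch below is a minimal illustration with made-up sample rows in the same shape as lst; it inspects the worksheet in memory and saves nothing.

```python
import openpyxl

# Made-up sample rows in the same shape as lst: a header row, then data rows.
sample_rows = [
    ['编号', '名称', '推荐语', '评分', '链接地址'],
    ['1', 'Film A', 'A short tagline', '9.7', 'https://example.com/film/1'],
    ['2', 'Film B', '', '9.6', 'https://example.com/film/2'],
]

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'

for row in sample_rows:
    sheet.append(row)  # each call writes one list as the next worksheet row

# Inspect the worksheet in memory, before anything is saved.
print(sheet.max_row, sheet.max_column)
print(sheet['B2'].value)
```

Note that the scraped values are all strings, so Excel stores the number and rating columns as text. Converting them with int() and float() before appending would give numeric cells, if sorting or averaging in Excel matters.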

3. Complete code and result

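import requests
from bs4 import BeautifulSoup
import openpyxl

# Browser-like User-Agent header, sent to avoid being blocked by anti-scraping checks.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}

# The first entry is the header row: number, title, tagline, rating, link.
lst = [['编号', '名称', '推荐语', '评分', '链接地址']]
for i in range(10):  # 10 pages x 25 movies = 250
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # stop early on an HTTP error response
    bs = BeautifulSoup(resp.text, 'html.parser')
    grid_view = bs.find('ol', class_='grid_view')
    all_li = grid_view.find_all('li')
    for item in all_li:
        no = item.find('em').text                       # rank in the list
        title = item.find('span', class_='title').text  # first title span
        inq = item.find('span', class_='inq')           # tagline; may be missing
        rat = item.find('span', class_='rating_num').text
        url_films = item.find('a')['href']
        lst.append([no, title, inq.text if inq is not None else '', rat, url_films])

# Create a new workbook and rename its default worksheet.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'

# Write each entry of lst (header row first) as one worksheet row.
for item in lst:
    sheet.append(item)
# openpyxl writes the xlsx format, so the file needs an .xlsx extension.
wb.save('films.xlsx')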

Result:

We successfully fetched the site's Top 250 movie data and saved it to an Excel file.
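
One way to confirm the result is to reload the saved file and count its rows. The sketch below is a hedged illustration using made-up rows and a temporary file named check.xlsx (a hypothetical name chosen for this check). After a full run of the real script, and assuming every page returned 25 entries, the same check should report 251 rows: one header row plus 250 movies.

```python
import os
import tempfile

import openpyxl

# Made-up rows standing in for the scraped lst.
sample_rows = [
    ['编号', '名称', '推荐语', '评分', '链接地址'],
    ['1', 'Film A', 'A short tagline', '9.7', 'https://example.com/film/1'],
]

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'
for row in sample_rows:
    sheet.append(row)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'check.xlsx')  # hypothetical file for this check
    wb.save(path)

    # Reload the file and read every row back as a tuple of cell values.
    loaded = openpyxl.load_workbook(path)
    loaded_rows = list(loaded['我的电影'].iter_rows(values_only=True))
    loaded.close()

print(len(loaded_rows))
print(loaded_rows[1])
```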