Use a Python crawler to scrape the Top 250 movies from a website and save them to an Excel file

Introduction

This article shows how to use a Python crawler together with the data processing library openpyxl to obtain information about the Top 250 movies on a website and save the data to an Excel file. The article is divided into the following parts:

1. Crawl the top 250 movie information from a website

First, we use a Python crawler to obtain information about the Top 250 movies. To avoid being blocked by the site's anti-crawler mechanism, we set a User-Agent request header. We use the requests and BeautifulSoup libraries to accomplish this task.

import requests
from bs4 import BeautifulSoup

# Pretend to be a normal browser so the site does not block the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}

# Header row: No., Title, Tagline, Rating, Link
lst = [['编号', '名称', '推荐语', '评分', '链接地址']]
for i in range(10):  # 10 pages, 25 movies per page
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    resp = requests.get(url, headers=headers)
    bs = BeautifulSoup(resp.text, 'html.parser')
    grid_view = bs.find('ol', class_='grid_view')
    all_li = grid_view.find_all('li')
    for item in all_li:
        no = item.find('em').text
        title = item.find('span', class_='title').text
        inq = item.find('span', class_='inq')  # the tagline may be missing
        rat = item.find('span', class_='rating_num').text
        url_films = item.find('a')['href']
        lst.append([no, title, inq.text if inq is not None else '', rat, url_films])

In the above code, we loop over the ten pages of the Top 250 list. Each page shows 25 movies, so the start parameter advances by 25 on each iteration. For each page, we fetch the HTML with the requests library, parse it with BeautifulSoup, extract each movie's information, and append it as a row to the lst list.
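To see how the extraction step works in isolation, the parsing logic can be exercised against a small handcrafted HTML fragment that mimics the page structure (the sample data below is made up for illustration, not fetched from the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same tags and classes the crawler looks for
sample_html = '''
<ol class="grid_view">
  <li>
    <em>1</em>
    <a href="https://movie.douban.com/subject/1292052/"></a>
    <span class="title">肖申克的救赎</span>
    <span class="rating_num">9.7</span>
    <span class="inq">希望让人自由。</span>
  </li>
</ol>
'''

bs = BeautifulSoup(sample_html, 'html.parser')
item = bs.find('ol', class_='grid_view').find('li')
row = [
    item.find('em').text,                     # ranking number
    item.find('span', class_='title').text,   # movie title
    item.find('span', class_='inq').text,     # tagline
    item.find('span', class_='rating_num').text,  # rating
    item.find('a')['href'],                   # detail-page link
]
print(row)
```

Running this prints the same five-column row the crawler appends to lst, which makes it easy to check the selectors before hitting the real pages.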

2. Save data to Excel file

Now that we have obtained the information for the Top 250 movies, we will use the openpyxl library to save the data to an Excel file.

import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'  # "My Movies"

for item in lst:
    sheet.append(item)
# openpyxl writes the .xlsx format, so the file must use that extension
wb.save('films.xlsx')

In the above code, we first create a workbook with the openpyxl library and rename its active worksheet to "我的电影" ("My Movies"). Next, we use a loop to append each movie's row from the lst list to the worksheet. Finally, we save the file as "films.xlsx". Note that openpyxl only writes the modern .xlsx format; it cannot produce legacy .xls files, so saving under a .xls extension would create a file Excel refuses to open.
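The save step can be verified by reading the file back with openpyxl's load_workbook. This is a minimal sketch using made-up sample rows and the throwaway filename demo.xlsx rather than the crawler's real data:

```python
import openpyxl

# Write a tiny workbook the same way the article does
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'Demo'
rows = [['no', 'title'], ['1', 'Example']]
for r in rows:
    sheet.append(r)
wb.save('demo.xlsx')

# Read the file back to confirm the contents round-trip correctly
wb2 = openpyxl.load_workbook('demo.xlsx')
sheet2 = wb2.active
data = [[cell.value for cell in row] for row in sheet2.iter_rows()]
print(data)  # -> [['no', 'title'], ['1', 'Example']]
```

Because every value appended here is a string, the values come back unchanged; numeric cells would come back as int or float.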

3. Complete code and running results

import requests
from bs4 import BeautifulSoup
import openpyxl

# Pretend to be a normal browser so the site does not block the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}

# Header row: No., Title, Tagline, Rating, Link
lst = [['编号', '名称', '推荐语', '评分', '链接地址']]
for i in range(10):  # 10 pages, 25 movies per page
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    resp = requests.get(url, headers=headers)
    bs = BeautifulSoup(resp.text, 'html.parser')
    grid_view = bs.find('ol', class_='grid_view')
    all_li = grid_view.find_all('li')
    for item in all_li:
        no = item.find('em').text
        title = item.find('span', class_='title').text
        inq = item.find('span', class_='inq')  # the tagline may be missing
        rat = item.find('span', class_='rating_num').text
        url_films = item.find('a')['href']
        lst.append([no, title, inq.text if inq is not None else '', rat, url_films])

wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = '我的电影'  # "My Movies"

for item in lst:
    sheet.append(item)
# openpyxl writes the .xlsx format, so the file must use that extension
wb.save('films.xlsx')

Running result:

We successfully obtained the information for the Top 250 movies and saved the data to an Excel file.


Origin blog.csdn.net/weixin_59276073/article/details/130513788