Crawling IMDb, the world's largest movie database, with Python

The text and images in this article are taken from the Internet and are for learning and exchange purposes only; they are not for any commercial use. If you have any questions, please contact us.

By Dark Horse

Preface

When developing crawlers in Python, requests and BeautifulSoup4 (bs4 for short) are widely used. requests simulates a browser client's request and receives the server's response, such as a page's HTML source; bs4 then wraps and parses that response to extract the target data.

In today's case we will use a new library, MechanicalSoup. It is a further wrapper around requests and bs4 that simplifies both the requesting and the parsing work. If you are already familiar with the basics of requests and bs4, the code below should not be hard to follow.


Preparation

Installing MechanicalSoup

Install it with pip from a terminal; its dependencies (including requests and BeautifulSoup) are installed automatically:

pip install mechanicalsoup


Page analysis

The site we will request today is IMDb, the world's largest movie database. Its official address is http://www.imdb.com, and the homepage looks like this:

(Screenshot: the IMDb homepage)

The page we want to crawl can be reached through the "Top Rated Movies" item under the "Menu" navigation, or directly at https://www.imdb.com/chart/top/

(Screenshot: the Top Rated Movies page)

The target data we want to collect is the list on the left of the page. Right-clicking in the browser and choosing "Inspect" shows that each entry is contained in a table row. This makes clear that the data to collect spans three columns: the ranking, the movie title, and the release year. The HTML elements to analyze look like this:

(Screenshot: the HTML of one table row in the browser inspector)

The target data is entirely contained in td cells with class="titleColumn". All we need to do is collect every cell with this attribute in one batch, then extract and clean the target data.
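Before writing the crawler, the extraction can be previewed on a small HTML fragment. This is a sketch assuming BeautifulSoup (bs4) is installed; the fragment and its whitespace are illustrative, not the site's exact markup:

```python
from bs4 import BeautifulSoup

# A simplified fragment mimicking one row of the IMDb Top 250 table
# (the real page has 250 such rows)
html = """
<table><tr>
  <td class="titleColumn">
    1.
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# The same selector the crawler will use: every td with class="titleColumn"
cells = soup.find_all("td", class_="titleColumn")
print(len(cells))       # number of matching cells
print(cells[0].a.text)  # the movie title inside the link
```

The same `find_all` call works unchanged on the full page, because MechanicalSoup exposes the response as a BeautifulSoup object.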

Coding implementation

Capture and print

Preliminary code structure:

imdb.py

import mechanicalsoup

# Data container
data = []

def fetch_data():
    # Crawl the target data from the page here
    pass

def main():
    fetch_data()

if __name__ == "__main__":
    main()

The key code logic is contained in the fetch_data() function. The specific code is as follows (with comments):

def fetch_data():
    url = "https://www.imdb.com/chart/top/"
    # Build the browser object
    b = mechanicalsoup.StatefulBrowser()
    # Request the target URL
    b.open(url)
    # b.page is the source of the current response page,
    # already wrapped as a BeautifulSoup object.
    # Find all td cells in the page with class="titleColumn"
    items = b.page.find_all("td", class_="titleColumn")

    # Iterate over all items
    for item in items:
        # Take all the text in the current cell and split it on "\n"
        # into three elements
        row = item.text.strip().split("\n")
        # Further strip whitespace from each element value;
        # the three list elements are the ranking, title and year
        row = [x.strip() for x in row]
        # Append the row to the data list for further processing
        data.append(row)
        # Print it
        print(row)
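The cleanup inside the loop can be checked in isolation with a sample of raw cell text (the whitespace here is illustrative of what item.text returns):

```python
# Raw text as it might come out of a td.titleColumn cell
# (illustrative whitespace around the three fields)
raw = "      1.\n      The Shawshank Redemption\n      (1994)\n    "

# The same cleanup as in fetch_data(): strip the whole string,
# split on newlines, then strip each field
row = [x.strip() for x in raw.strip().split("\n")]
print(row)  # ['1.', 'The Shawshank Redemption', '(1994)']
```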


The complete code at this point is as follows:


import mechanicalsoup


data = []


def fetch_data():
    url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
    b = mechanicalsoup.StatefulBrowser()
    b.open(url)
    items = b.page.find_all("td", class_="titleColumn")
    for item in items:
        row = item.text.strip().split("\n")
        row = [x.strip() for x in row]
        data.append(row)
        print(row)


def main():
    fetch_data()


if __name__ == "__main__":
    main()


Running the script with python imdb.py produces output like this:

(Screenshot: rows printed by python imdb.py)

You can see each row printed as it is extracted. At this point the data list also holds all 250 rows of movie information.


Writing the data to an Excel file

To write the collected movie data (250 rows) to an Excel file in one go, you can install and use an Excel library such as openpyxl. After the steps above, fetch_data() can be followed by creating a workbook and writing the data into it.
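As a minimal sketch of that final step, assuming openpyxl is installed (pip install openpyxl) and using two sample rows in the same shape fetch_data() produces:

```python
from openpyxl import Workbook

# Sample rows in the shape produced by fetch_data(): [rank, title, year]
data = [
    ["1.", "The Shawshank Redemption", "(1994)"],
    ["2.", "The Godfather", "(1972)"],
]

wb = Workbook()
ws = wb.active
ws.title = "IMDb Top 250"
# Header row, then one worksheet row per movie
ws.append(["Rank", "Title", "Year"])
for row in data:
    ws.append(row)
wb.save("imdb_top250.xlsx")
```

In the real script, the loop would run over the full 250-row data list filled by fetch_data() instead of the sample above.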


Origin blog.csdn.net/m0_48405781/article/details/114884957