How to use Selenium Python to crawl dynamic tables with multiple pages and perform data integration and analysis

Introduction

In the field of web crawlers, dynamic tables are a common form of data display, which can display a large amount of structured data and provide functions such as paging, sorting, and filtering. The data of dynamic tables is usually loaded dynamically through JavaScript or Ajax, which brings certain challenges to crawlers. This article will introduce how to use Selenium Python, a powerful automated testing tool, to crawl dynamic tables with multiple pages, and perform data integration and analysis.

Introduction to Selenium Python

Selenium is an open-source automated testing framework that can simulate user actions in the browser, such as clicking, typing, and scrolling, which makes it useful both for automated testing and for crawling web pages. Selenium supports several programming languages, including Java, Python, and Ruby; Python is the most popular choice because it is concise, easy to use, and flexible. Selenium Python provides the WebDriver API, which lets us control different browser drivers, such as Chrome, Firefox, and Edge, from Python code and crawl different websites and platforms.
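As a quick, minimal sketch of what the WebDriver API looks like in practice (this assumes chromedriver is available on your PATH and uses the same Selenium 3 style find_element_by_* calls as the rest of this article):

from selenium import webdriver

# Launch a Chrome browser (assumes chromedriver is on your PATH)
driver = webdriver.Chrome()

# Open a page the way a user would, then read something from it
driver.get('https://demo.seleniumeasy.com/table-pagination-demo.html')
print(driver.title)

# Locate an element and read its visible text
body = driver.find_element_by_tag_name('body')
print(body.text[:100])

# Always close the browser when done
driver.quit()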

Dynamic table crawling steps

To crawl dynamic tables with multiple pages, we need to follow these steps:

  1. Find the target site and target table. We need to identify the URLs of the website and table we want to crawl and open them with Selenium Python.
  2. Locate the table and pagination elements. We need to use the locating methods provided by Selenium Python, such as find_element_by_id and find_element_by_xpath, to find the table and pagination elements and obtain their attributes and text.
  3. Crawl the table data and turn pages. We need to use the interaction methods provided by Selenium Python, such as click and send_keys, to simulate a user paging through the table, and use a library such as BeautifulSoup to parse the table data and store it in lists or dictionaries.
  4. Integrate and analyze the data. We need to use a library such as Pandas to integrate and analyze the crawled data, and a library such as Matplotlib to visualize it (a compressed sketch of steps 2-4 follows this list).
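Before the full case study below, here is a compressed, hedged sketch of this locate-click-parse-analyze pattern. The URL is a placeholder, the table is located by its tag name, and the loop assumes the "Next" link disappears on the last page; real sites will need their own locators and stopping condition:

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/some-paged-table')      # placeholder URL

records = []
while True:
    # Step 2: locate the table on the current page
    table = driver.find_element_by_tag_name('table')
    # Step 3: parse the current page's rows with BeautifulSoup
    soup = BeautifulSoup(table.get_attribute('innerHTML'), 'html.parser')
    for row in soup.find_all('tr'):
        cols = [td.text for td in row.find_all('td')]
        if cols:                                         # skip the header row
            records.append(cols)
    # Step 3 (continued): turn the page; assumes the Next link disappears on the last page
    next_links = driver.find_elements_by_link_text('Next')
    if not next_links:
        break
    next_links[0].click()
    time.sleep(1)                                        # crude wait; an explicit wait is better

driver.quit()

# Step 4: integrate and analyze with Pandas
df = pd.DataFrame(records)
print(df.head())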

Dynamic table crawling features

Crawling dynamic tables with multiple pages has the following characteristics:

  • Dynamic loading and asynchronous requests must be handled. The data in a dynamic table is usually loaded by JavaScript or Ajax, which means we need to wait until the page has finished loading before reading the data, or use the explicit-wait or implicit-wait methods provided by Selenium Python to set a timeout.
  • Pagination logic and page-turning rules must be handled. Dynamic tables usually have multiple pages, and each page may hold a different amount of data. We need to determine the current page from the pagination elements and select the next page according to the page-turning rules. Some websites use numbered buttons for paging, some use Previous and Next buttons, and some use an ellipsis or a More button, so we need to choose the page-turning method that fits each case.
  • Exceptions and errors must be handled. During crawling we may run into network interruptions, page redirects, missing elements, and other problems, so we need to use the exception-handling facilities provided by Selenium Python to catch and handle them, and set up a retry mechanism and logging (a short sketch of explicit waits and a simple retry loop follows this list).
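The following is a minimal sketch of an explicit wait combined with a simple retry loop. The locator, timeout, and retry count are illustrative choices, not values taken from the case study below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException

driver = webdriver.Chrome()
driver.get('https://demo.seleniumeasy.com/table-pagination-demo.html')

# Explicit wait: block up to 10 seconds until the table is present in the DOM
wait = WebDriverWait(driver, 10)

for attempt in range(3):                     # simple retry mechanism
    try:
        table = wait.until(EC.presence_of_element_located((By.ID, 'myTable')))
        print(table.text[:80])               # do something with the element
        break
    except (TimeoutException, StaleElementReferenceException) as exc:
        print(f'Attempt {attempt + 1} failed: {exc!r}')   # logging would go here
else:
    print('Giving up after 3 attempts')

driver.quit()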

Case study

To show concretely how to use Selenium Python to crawl a dynamic table with multiple pages and then integrate and analyze the data, we use a practical case: we crawl the table example on the Selenium Easy website and run simple statistics and charts on the crawled data.

Introducing the site and table

Selenium Easy is a website that provides Selenium tutorials and examples. It has a table demo page that shows a dynamic table with pagination. The table has 15 records, 5 records per page, for a total of 3 pages. Each record contains a person's name, position, office, age, start date, and monthly salary. Our goal is to crawl all the data in this table and then compute statistics and draw charts of the number of employees and the monthly salary in each office.

Code

In order to achieve this goal, we need to use the following libraries:

  • Selenium: used to control browser drivers and simulate user operations
  • requests: used to send HTTP requests and get responses
  • BeautifulSoup: for parsing HTML documents and extracting data
  • pandas: for working with data structures and analysis
  • matplotlib: for drawing data charts

First, we need to import these libraries and set some global variables, such as the browser driver path, the target website URL, and the proxy server information:

# Import libraries
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt

# Set the browser driver path
driver_path = r'/Users/weaabduljamac/Downloads/chromedriver'

# Set the target website URL
url = 'https://demo.seleniumeasy.com/table-pagination-demo.html'

# Set the proxy server information (Yiniuyun crawler proxy, enhanced edition)
proxyHost = "www.16yun.cn"
proxyPort = "3111"
proxyUser = "16YUN"
proxyPass = "16IP"

Next, we need to set the proxy server parameters, create a browser driver object with those options, and open the target website:

# Set the proxy server parameters
options = webdriver.ChromeOptions()
options.add_argument(f'--proxy-server=http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}')
# Note: Chrome may ignore credentials embedded in --proxy-server; proxy
# authentication typically requires a browser extension or IP whitelisting.

# Create the browser driver object with the proxy options
driver = webdriver.Chrome(driver_path, options=options)

# Open the target website
driver.get(url)

Then, we need to locate the table and pagination elements and get their attributes and text:

# Locate the table element
table = driver.find_element_by_xpath('//*[@id="myTable"]')

# Locate the pagination element
pagination = driver.find_element_by_xpath('//*[@id="myPager"]')

# Get the text of the pagination element
pagination_text = pagination.text

# Get the list of links inside the pagination element
pagination_links = pagination.find_elements_by_tag_name('a')

Next, we need to create an empty list to store the crawled data, then loop over the pagination links, visiting each page and crawling its table data:

# Create an empty list to store the crawled data
data = []

# Loop over each pagination link
for i in range(len(pagination_links)):
    # Get the text of the current pagination link
    current_page_text = pagination_links[i].text

    # Check whether the current link is a number button or a "more" button (ellipsis)
    if current_page_text.isdigit() or current_page_text == '...':
        # Click the current pagination link
        pagination_links[i].click()

        # Wait for the page to finish loading (explicit or implicit waits could be used to optimize this)
        driver.implicitly_wait(10)

        # Relocate the table element (the old reference may be stale after the page refreshes)
        table = driver.find_element_by_xpath('//*[@id="myTable"]')
        # Parse the table element's HTML
        soup = BeautifulSoup(table.get_attribute('innerHTML'), 'html.parser')

        # Extract every row of the table
        rows = soup.find_all('tr')

        # Iterate over the rows
        for row in rows:
            # Extract every cell in the row
            cols = row.find_all('td')

            # Skip rows with no data cells (the header row has none)
            if len(cols) > 0:
                # Get the text of each cell
                name = cols[0].text
                position = cols[1].text
                office = cols[2].text
                age = cols[3].text
                start_date = cols[4].text
                salary = cols[5].text

                # Combine the cells into a dictionary
                record = {
                    'name': name,
                    'position': position,
                    'office': office,
                    'age': age,
                    'start_date': start_date,
                    'salary': salary
                }

                # Append the dictionary to the list
                data.append(record)

    # Check whether the current link is a Prev or Next button
    elif current_page_text == 'Prev' or current_page_text == 'Next':
        # Click the current pagination link
        pagination_links[i].click()

        # Wait for the page to finish loading (explicit or implicit waits could be used to optimize this)
        driver.implicitly_wait(10)

        # Relocate the pagination element (the old reference may be stale after the page refreshes)
        pagination = driver.find_element_by_xpath('//*[@id="myPager"]')

        # Re-fetch the pagination links (they may have changed after the page refreshed)
        pagination_links = pagination.find_elements_by_tag_name('a')

Finally, we need to use libraries such as Pandas to integrate and analyze the crawled data, and use libraries such as Matplotlib to visualize and display data:

# Close the browser driver object
driver.quit()

# Convert the list into a Pandas DataFrame
df = pd.DataFrame(data)

# Look at basic information about the DataFrame
print(df.info())

# Look at the first five rows of the DataFrame
print(df.head())

# Count the number of employees per office
office_count = df.groupby('office')['name'].count()

# Sum the monthly salary per office (strip the currency symbol and commas, then convert to numbers;
# regex=False makes '$' be treated as a literal character)
office_salary = df.groupby('office')['salary'].apply(lambda x: x.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float).sum())

# Draw bar charts of employee count and total salary per office
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
office_count.plot.bar(ax=ax[0], title='Number of Employees by Office')
office_salary.plot.bar(ax=ax[1], title='Total Salary by Office')
plt.show()

Conclusion

This article introduces how to use Selenium Python to crawl dynamic tables with multiple pages, and perform data integration and analysis. Through this case, we can learn the basic usage and characteristics of Selenium Python, and how to deal with dynamic loading and asynchronous requests, paging logic and page turning rules, exceptions and error handling, etc. Selenium Python is a powerful and flexible automated testing tool that can help us crawl various websites and platforms to obtain valuable data and information. I hope this article can help and inspire you. You are welcome to continue to explore more functions and applications of Selenium Python.
