A Guide to Scraping Dynamic Content: Scrolling with Scrapy-Selenium and Proxies

Yiniu Cloud Proxy

Introduction

When crawling web data, it is sometimes necessary to handle content that is loaded dynamically through JavaScript. This article introduces how to use the Scrapy-Selenium library to scroll a web page multiple times and scrape the data it loads, meeting the needs of dynamic content crawling.

Overview

Traditional crawlers handle static web content easily, but content loaded dynamically through JavaScript usually requires simulated browser access. Scrapy-Selenium is a library that combines the capabilities of Scrapy and Selenium, driving a real browser to crawl dynamic content.

Main Content

In this article, we will walk through using the Scrapy-Selenium library to scroll a web page multiple times and scrape the data it loads. First, make sure the Scrapy, Selenium, and Scrapy-Selenium libraries are installed. If they are not, install them with the following command:

pip install scrapy selenium scrapy-selenium
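Besides installing the packages, scrapy-selenium must be enabled in the project's settings.py before SeleniumRequest (used later in this article) will work. A minimal configuration might look like the following; the chromedriver path is an assumption and should be adjusted to your environment:

```python
# settings.py — minimal scrapy-selenium configuration
SELENIUM_DRIVER_NAME = 'chrome'
# Path to your chromedriver binary (assumption; adjust as needed)
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'
# Run headless so no browser window is opened
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

# Route requests through scrapy-selenium's downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```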

Next, we configure Selenium to access pages through a proxy server, which improves crawling reliability. The sample code below uses the Yiniu cloud crawler proxy:

from selenium import webdriver

# Proxy server configuration (Yiniu cloud proxy)
proxyHost = "www.16yun.cn"
proxyPort = "31111"
proxyUser = "16YUN"
proxyPass = "16IP"

# Build the proxy URL
proxy_url = f"{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"

# Create a browser instance with the proxy set.
# Note: Chrome ignores credentials embedded in --proxy-server, so an
# authenticated proxy like this one needs extra handling (for example,
# the selenium-wire package or a Chrome extension that answers the
# proxy authentication prompt).
options = webdriver.ChromeOptions()
options.add_argument(f'--proxy-server=http://{proxy_url}')
browser = webdriver.Chrome(options=options)

# Use the browser to visit and interact with pages

In the code above, we configured a proxy server so that Selenium accesses web pages through it. Next, we show sample code that scrolls multiple times and scrapes data in Scrapy-Selenium.

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class ScrollSpider(scrapy.Spider):
    name = 'scroll_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        browser = response.meta['driver']
        # Simulate several scrolls to the bottom of the page
        for _ in range(5):
            browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
            # Wait for the dynamically loaded content to appear
            self.wait_for_content_to_load(browser)

        # Extract data
        # ...

    def wait_for_content_to_load(self, browser):
        # Custom wait condition to ensure new content has finished loading
        pass
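The wait_for_content_to_load stub above can be filled in by polling the page until its height stops changing between scrolls. Here is a generic sketch of that polling logic; the function name and timing values are assumptions, and with Selenium the getter would typically be `lambda: browser.execute_script("return document.body.scrollHeight")`:

```python
import time

def wait_until_stable(get_value, timeout=10.0, interval=0.5):
    """Poll get_value() until it returns the same value twice in a row,
    or until the timeout elapses. Returns the last observed value.

    With Selenium, get_value could read the page height:
        lambda: browser.execute_script("return document.body.scrollHeight")
    """
    deadline = time.monotonic() + timeout
    last = get_value()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_value()
        if current == last:
            # Value unchanged between two polls: treat it as stable
            return current
        last = current
    return last
```

Inside the spider, `wait_for_content_to_load` could simply call this helper and ignore the returned height.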

Example

Suppose we want to grab news headlines from a web page that loads data dynamically. In the parse method, we can extract the title elements and add them to the scraped results.

def parse(self, response):
    browser = response.meta['driver']
    titles = []

    # Scroll several times so more headlines are loaded
    for _ in range(5):
        browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
        self.wait_for_content_to_load(browser)

    # Collect the headline text
    title_elements = browser.find_elements(By.CSS_SELECTOR, '.news-title')
    for title_element in title_elements:
        titles.append(title_element.text)

    yield {'titles': titles}
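Because each pass re-reads the whole page after scrolling, the same headline can be collected more than once if elements are gathered inside the loop. A small order-preserving deduplication step (a generic sketch; the function name is an assumption) keeps the results clean:

```python
def dedup_preserve_order(items):
    """Drop duplicate items while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

In the example above, this could be applied as `yield {'titles': dedup_preserve_order(titles)}`.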

Conclusion

Using the Scrapy-Selenium library, we can easily scroll a web page multiple times and scrape dynamically loaded data. Combined with the Yiniu cloud crawler proxy, we can also improve crawling efficiency and better meet the challenges of data collection.

Through the sample code and steps in this article, you can apply these techniques in your own projects to achieve efficient crawling and processing of dynamic content. This will be very helpful for extracting valuable information from modern dynamic web pages.

Origin blog.csdn.net/ip16yun/article/details/132320810