Integrating Selenium into the Scrapy crawler framework to parse dynamic web pages

1. Why the Scrapy framework alone is not enough

Most websites today render their pages dynamically with JavaScript, especially since Vue and React became popular, and locating elements of such dynamic pages with the Scrapy framework alone is very difficult. Selenium, the most popular browser automation tool, can drive a real browser to open pages, locate elements and perform actions, so it handles dynamic pages well. However, using Selenium on its own to crawl a large website is slow and resource-hungry. Can we integrate Selenium into the Scrapy framework and get the advantages of both?

The key to integrating Selenium with Scrapy is to plug it in as a downloader middleware. As the Scrapy architecture diagram below shows, the downloader middleware methods can modify request and response objects before returning them to Scrapy.
[Figure: Scrapy architecture diagram, with requests and responses passing through the downloader middlewares]

You can write your own downloader middleware class to integrate Selenium, as sketched below, but reproducing all of Selenium's features yourself is a fair amount of work, so using the third-party scrapy-selenium package for the integration is recommended.
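For illustration only, a bare-bones custom middleware might look like the sketch below. This is an assumption-level sketch (the class name SimpleSeleniumMiddleware is made up here), not the scrapy-selenium implementation, and it assumes headless Chrome with chromedriver reachable on the PATH:

## middlewares.py (illustrative sketch only)
from scrapy.http import HtmlResponse
from selenium import webdriver

class SimpleSeleniumMiddleware:
    """Render every request in headless Chrome and return the rendered HTML."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        # Assumes chromedriver can be found on the PATH.
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Let the browser load and render the page, then hand the rendered
        # HTML back to Scrapy as an ordinary HtmlResponse, short-circuiting
        # Scrapy's own downloader.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source.encode('utf-8'),
            encoding='utf-8',
            request=request,
        )

A real middleware would also have to be registered in DOWNLOADER_MIDDLEWARES, quit the driver when the spider closes, and expose waits, screenshots and script execution, which is exactly the boilerplate that scrapy-selenium already packages up.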

2. Set up the scrapy-selenium development environment

2.1 Install the scrapy-selenium library

pip install scrapy-selenium
Python 3.6 or later is required.

2.2 Install browser driver

A browser supported by Selenium should be installed on the machine, such as Chrome, Firefox or Edge.
Then install the WebDriver matching that browser and its version.
For Chrome, download chromedriver.exe and place it in the project root directory, or add its location to the system PATH.
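As an optional sanity check (a throwaway snippet, not part of the tutorial project), you can confirm the driver is discoverable before configuring Scrapy:

## check_driver.py (optional)
from shutil import which

# Prints the full path to chromedriver if it is on the PATH, otherwise None.
print(which('chromedriver'))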

2.3 Integrate selenium into scrapy project

The project structure is as follows


├── scrapy.cfg
├── chromedriver.exe ## <-- Here
└── myproject
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Enter the project folder and update settings.py

## settings.py

# for Chrome driver
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
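If you would rather drive Firefox, the equivalent settings follow the same pattern (this variant assumes geckodriver is installed; note that Firefox's headless flag uses a single dash):

## settings.py

# for Firefox driver
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}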

3. Use Selenium in the spider to parse web pages

In the spider, use the SeleniumRequest class provided by scrapy-selenium in place of Scrapy's built-in Request class.

## spider.py
import scrapy
from myproject.items import QuoteItem
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item for each quote on the page.
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
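The QuoteItem used above is not shown in the original project; a minimal definition in items.py, assumed here to match the fields the spider fills in, could be:

## items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()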

Scrapy now routes these requests through the Selenium middleware, which renders the page in headless Chrome and returns the rendered HTML as the response, so the usual Scrapy selectors can locate the dynamically generated elements.
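The middleware also places the underlying WebDriver into the request's meta, so a callback can talk to the live browser when needed. A small sketch, assuming the driver is exposed under the 'driver' meta key as in the scrapy-selenium source:

    def parse(self, response):
        # The Selenium WebDriver instance that rendered this response.
        driver = response.request.meta['driver']
        self.logger.info('Rendered page title: %s', driver.title)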

4. Use Selenium's features to crawl data

You can use Selenium's features, such as
• waiting for web page elements
• simulating operations such as clicks
• screenshots
and so on.

(1) Waits

Elements of a dynamic page often cannot be located right away, usually because of component loading order, asynchronous AJAX updates and so on. SeleniumRequest therefore accepts wait_time and wait_until parameters, built on Selenium's explicit waits, so that the response is not returned until the target elements are available.
Set a maximum wait of 10 seconds on the request:

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(url=url, callback=self.parse, wait_time=10)

Use a wait_until condition together with Selenium's expected_conditions:

## spider.py
import scrapy
from myproject.items import QuoteItem

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            wait_time=10,
            wait_until=EC.element_to_be_clickable((By.CLASS_NAME, 'quote'))
        )

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item for each quote on the page.
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

(2) Click the button

For example, a SeleniumRequest can pass a JavaScript snippet via the script parameter, which Selenium executes in the page; here it clicks the 'Next' pagination link:

## spider.py
import scrapy
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            # JavaScript executed in the rendered page before the
            # response is returned to Scrapy.
            script="document.querySelector('.pager .next>a').click()",
        )
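Because the script runs in the rendered page before the middleware captures the page source, the response handed to parse should already reflect the DOM after the click, i.e. the second page of quotes in this example; if the click triggers navigation or an asynchronous update, pairing it with wait_time / wait_until is still advisable.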

(3) Screenshot of the page

## spider.py
import scrapy
from scrapy_selenium import SeleniumRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            screenshot=True
        )

    def parse(self, response):
        # Write the screenshot captured by the middleware to disk.
        with open('image.png', 'wb') as image_file:
            image_file.write(response.meta['screenshot'])
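With screenshot=True, the middleware asks the driver for a PNG screenshot of the rendered page and attaches the raw bytes under response.meta['screenshot'], which the callback above simply writes to disk.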
