Scrapy notes 3 (Selenium)

Preface

Continuing from the previous notes.

Example: crawling Jianshu (简书)

Capturing the data with plain Selenium

First open the site, and you find that you have to click "expand more" to load the information you want, which can only be done through Selenium.
Inspecting the element, you can see that the class attribute of the target element is compressed and obfuscated. This is an anti-crawling measure: every time the site's structure is redeployed, the class name changes.
So locate the element through the page structure instead. Because the site's elements change frequently, a reliable positioning strategy is needed so the crawler survives longer.
The code is as follows:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http.response.html import HtmlResponse

class JianshuDownloaderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Fetch the page with Selenium instead of Scrapy's own downloader
        self.driver.get(request.url)

        # Locate the "expand more" button by its position in the page
        # structure, since the class names are obfuscated and change often
        next_btn_xpath = "//div[@role='main']/div[position()=1]/section[last()]/div[position()=1]/div"
        WebDriverWait(self.driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, next_btn_xpath))
        )

        while True:
            try:
                # find_element_by_xpath is removed in Selenium 4;
                # use find_element(By.XPATH, ...) instead
                next_btn = self.driver.find_element(By.XPATH, next_btn_xpath)
                self.driver.execute_script("arguments[0].click();", next_btn)
            except Exception:
                # the button is gone: everything has been expanded
                break

        # Wrap the page source rendered by Selenium in a Response
        # object and return it to the spider
        return HtmlResponse(request.url, body=self.driver.page_source,
                            request=request, encoding='utf-8')
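One debugging note on the loop above: a bare `except Exception: break` exits silently, so a broken locator looks exactly like "all content loaded". Here is a pure-Python sketch of a safer pattern, with the Selenium calls replaced by stand-in callables (hypothetical names, not part of the original code) so the idea stands on its own:

```python
import logging

def expand_all(find_button, click, max_clicks=1000):
    """Click an 'expand more' button until it disappears.

    find_button and click are stand-ins for the Selenium calls
    (driver.find_element / execute_script) in the middleware above.
    Returns how many times the button was clicked.
    """
    clicks = 0
    while clicks < max_clicks:  # guard against an accidental infinite loop
        try:
            btn = find_button()
        except LookupError as e:
            # Log why the loop stopped instead of swallowing the exception,
            # so a locator failure is distinguishable from normal exhaustion
            logging.info("expand loop finished after %d clicks: %r", clicks, e)
            break
        click(btn)
        clicks += 1
    return clicks
```

In the real middleware, Selenium's `NoSuchElementException` would play the role of `LookupError` here.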

Some details here are not essential yet; for now, the main thing to understand is that a class is built in the downloader middleware that drives Selenium and returns the rendered response to the spider.
Some parts are genuinely tricky, but that also makes it a stimulating challenge.

Then enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'jianshu.middlewares.JianshuDownloaderMiddleware': 543,
}

Summary

1. The difficulty in this case of integrating Selenium into Scrapy lies in element positioning: the site's anti-crawling measures make locators fragile. Errors are fine; the point is to find and fix their cause. The common timeout error here means the element was not found.
2. When the crawler just stops with no obvious error message, the cause is hard to judge, and that kind of failure hurts confidence the most. To help yourself debug, wrap risky calls in try/except when writing the code, and step through with a debugger. When automating a page like this one, debugging revealed that an element locator had failed somewhere and the code had fallen into an infinite loop.
3. The advantage of the Scrapy framework shows here: once the middleware is written, it fetches the page and returns the response automatically, so you can concentrate on the parsing and extraction code in the spider. The division of labor is clear and pleasant.
4. Subsequent storage tasks can be handled with an Item and a pipeline. Practice actively; skills you do not practice will fade.

Follow-up and next steps

My later attempt to print the parsed content failed (the console showed nothing), and I have not tried storage yet.
Forcing the learning may not be efficient, so Scrapy + Selenium stops here for now.
The preliminary plan for the next step is to use Scrapy on WeChat Official Accounts while building the thesis website project: collecting material, pictures, and music, and getting familiar with the framework's file downloading.


Source: blog.csdn.net/qq_51598376/article/details/113788045