Scrapy pipeline does not write data when a download middleware is enabled

Problem Description

While crawling Baidu with Scrapy, I added Selenium to the download middleware so that it returns the fully loaded page for parsing. However, the crawled data could not be written to a file through the pipeline.

Troubleshooting Process

  • The pipelines.py file is already written.
  • The pipeline is already enabled in settings.py.
  • The parse() function in the spider file has a return statement, and the console prints the crawled data normally.
  • At this point I suspected a problem with the project framework. I created a new Scrapy project with the simplest possible spider and pipeline files, and after running it the data was written.
  • So the framework itself is fine. Comparing the two projects, the new one has no middleware, so I suspected the middleware. I commented out the download middleware in the original project, tried again, and the data was written.
  • I copied the middleware into the new project and ran it, and the file was still written, even though the two middlewares are identical. Comparing the remaining files, I found that the start_url fields in the two spiders differ: the original project uses 'https://www.baidu.com/' and the new project uses 'https://baidu.com/'.

Cause of the Problem

In Scrapy's download middleware, any request can be intercepted and handled by the middleware itself. In this project I wanted Selenium to take over the Baidu page opened by the first request and return the fully loaded page content, so I added a check in the middleware that hands a request over to Selenium when its URL equals click_page_url.
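
The exact statement is not reproduced in the text, so the following is only a sketch of what such a check might look like. It assumes the spider exposes the target URL as click_page_url and an already-started Selenium WebDriver as driver; the class name and both attribute names are hypothetical.

```python
class SeleniumTakeoverMiddleware:
    """Let Selenium fetch the one page whose URL equals spider.click_page_url."""

    def process_request(self, request, spider):
        # Assumption: the spider provides `click_page_url` and `driver`.
        if request.url != getattr(spider, "click_page_url", None):
            # Exact string comparison: any other URL is left to Scrapy's downloader.
            return None

        # Imported here because it is only needed when Selenium takes over.
        from scrapy.http import HtmlResponse

        spider.driver.get(request.url)
        # Returning a Response from process_request skips the normal download
        # and passes this response on towards the spider callback.
        return HtmlResponse(
            url=spider.driver.current_url,
            body=spider.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```

A middleware like this is enabled through DOWNLOADER_MIDDLEWARES in settings.py. Note that with this sketch a request for 'https://baidu.com/' does not equal 'https://www.baidu.com/', so process_request() returns None and that request is downloaded by Scrapy in the usual way.
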
When the URL I request is 'https://www.baidu.com/', the request is handed over to Selenium. To keep it consistent with click_page_url, I also wrote 'https://www.baidu.com/' in the start_url field of the spider file. I did not expect that this would make the pipeline unusable. After changing start_url to 'https://baidu.com/', the problem was solved and the pipeline could write. I checked the URL of the response returned by the middleware, and it is still 'https://www.baidu.com/', so it is not clear to me why adding www to the domain name affects writing through the pipeline.
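
One way to narrow this down is to log, inside the middleware, how each request URL relates to click_page_url. The helper below is a hypothetical debugging aid, not part of the original project, and it does not explain the pipeline behaviour by itself. It only shows what an exact comparison sees for the two start URLs.

```python
from urllib.parse import urlsplit


def _strip_www(host):
    """Drop a leading 'www.' from a hostname, if present."""
    if host and host.startswith("www."):
        return host[4:]
    return host


def compare_urls(request_url, click_page_url):
    """Report how a request URL relates to the URL reserved for Selenium."""
    req_host = urlsplit(request_url).hostname
    target_host = urlsplit(click_page_url).hostname
    return {
        "exact_match": request_url == click_page_url,
        "same_host": req_host == target_host,
        "same_host_ignoring_www": _strip_www(req_host) == _strip_www(target_host),
    }


print(compare_urls("https://baidu.com/", "https://www.baidu.com/"))
# {'exact_match': False, 'same_host': False, 'same_host_ignoring_www': True}
```

Logging this result for every request that passes through process_request() would show which of the two start URLs is actually being handed to Selenium in each project.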

Origin blog.csdn.net/qq_41983842/article/details/107866628