Scrapy Practice 2 (Scraping Tencent Comics with Selenium) [Failed]

Preface

A month ago I studied some sample code that scraped the entire Tencent Comics site; now I'm trying it again with Scrapy.

Process

Creating the spider

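The project and spider were presumably created with the usual Scrapy commands. A sketch, assuming the project name manhua (which matches the middleware path 'manhua.middlewares.ManhuaDownloaderMiddleware' used later) and the crawl-template spider tengxun shown at the end:

scrapy startproject manhua
cd manhua
scrapy genspider -t crawl tengxun ac.qq.com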

Writing a plain Selenium scraper for the comics

Inspecting the page and picking a target

The target this time is the comics under the popularity ranking.
Right-clicking and inspecting shows that the comic titles and links are easy to extract (provided Selenium is used).

Scraping the content

The main goal is to practice using Selenium with Scrapy, so for convenience only a small amount of information about the first comic is scraped.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
url = "https://ac.qq.com/"
driver.get(url)
# fixed wait so the page finishes rendering
time.sleep(2)
# all links in the popularity ranking; the find_elements_by_xpath form was
# removed in Selenium 4, so use find_elements(By.XPATH, ...) instead
subjects = driver.find_elements(By.XPATH, "//a[contains(@class,'in-rank-name')]")
subject = subjects[0]
title = subject.text
print(title)
href = subject.get_attribute("href")
print(href)

The scrape yields:

无限破狱者
https://ac.qq.com/Comic/comicInfo/id/649370
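As an aside, WebDriverWait, By and EC are imported above but only fixed time.sleep calls are used. A minimal sketch of a real explicit wait for the same ranking links, which returns as soon as the elements exist instead of always sleeping two seconds:

# explicit wait: poll up to 10 s for the ranking links to appear
subjects = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[contains(@class,'in-rank-name')]"))
)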

Then use Selenium to open that comic's URL and click through to the reader:

manhua_url = href  # the detail-page URL obtained above
driver.get(manhua_url)
time.sleep(2)
# click the button that opens the reader
driver.find_element(By.XPATH, "/html/body/div[3]/div[3]/div/div/div[2]/em/div/div[1]/div/a").click()
time.sleep(2)
# drive the scrollbar of the reader container; figuring out that the
# "mainView" element is the one to scroll is the tricky part
js = 'var q=document.getElementById("mainView").scrollTop = ' + str(100 * 1000)
# execute the scroll
driver.execute_script(js)
# wait so the images can finish loading
time.sleep(2)
# collect every image that has finished loading
img_list = driver.find_elements(By.XPATH, "//img[contains(@class,'loaded')]")
for i in img_list:
    print(i.get_attribute("src"))

This yields:

https://manhua.acimg.cn/manhua_detail/0/03_16_24_a7ac725928a655f2f8538a2ee1e46b8f5_115731672.jpg/0
https://manhua.acimg.cn/manhua_detail/0/03_16_24_a62281dd035262f3a21ef4f03fc92e411_115731695.jpg/0
https://manhua.acimg.cn/manhua_detail/0/03_16_24_ac9393b3d58918890d808c3139a91f840_115731692.jpg/0
https://manhua.acimg.cn/manhua_detail/0/03_16_24_a668de4c3ad19e491aa91f561b890e395_115731723.jpg/0
https://manhua.acimg.cn/manhua_detail/0/03_16_24_a3b706d347a1317b9a9a3ecea4688afc5_115731699.png/0
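A single jump to scrollTop = 100000 can move past images before their lazy loading triggers, so only some of them get the loaded class. A hedged sketch of scrolling the same mainView container in steps (the 800 px step and 0.3 s pause are assumptions to tune):

# scroll gradually so each lazy-loaded image has a chance to load
for offset in range(0, 100 * 1000, 800):
    driver.execute_script('document.getElementById("mainView").scrollTop = arguments[0];', offset)
    time.sleep(0.3)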

At this point the main idea of the Selenium scrape is complete; next, the same approach is implemented with Scrapy.

Using Scrapy

Workflow and key points

  1. The key part is rewriting the downloader middleware.
  2. First enable the downloader middleware in settings.py.
  3. Then rewrite middlewares.py.
  4. Next write the spider in tengxun.py; defining the LinkExtractor rules is the key point.
  5. Bugs along the way can be hard to find and eliminate.

Steps

Rewriting middlewares.py

from scrapy import signals
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http.response.html import HtmlResponse


class ManhuaDownloaderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # fetch the page with Selenium, then hand the rendered HTML back to
        # Scrapy; returning an HtmlResponse skips the default downloader
        self.driver.get(request.url)
        response = HtmlResponse(request.url, body=self.driver.page_source, request=request, encoding='utf-8')
        return response
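One thing this middleware never does is quit the browser when the crawl ends. A sketch of the usual fix, hooking Scrapy's spider_closed signal through the standard from_crawler classmethod (merging it into the class above is left as an assumption):

class ManhuaDownloaderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        # standard Scrapy hook: build the middleware and subscribe to spider_closed
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        # quit Chrome once the spider finishes
        self.driver.quit()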

Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'manhua.middlewares.ManhuaDownloaderMiddleware': 543,
}

Writing the spider

2021-02-21
The Tencent Comics pages seem to have changed.
The image URLs can no longer be obtained directly from the page, and even Selenium doesn't help; perhaps they can only be obtained by analyzing the site's API.
Still, this exercise built fluency with Scrapy plus Selenium. Here is the now-defunct code anyway; change allowed_domains, start_urls and the Rules and it can scrape other things, since one Scrapy skeleton covers many jobs.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TengxunSpider(CrawlSpider):
    name = 'tengxun'
    allowed_domains = ['ac.qq.com']
    start_urls = [r'https://ac.qq.com/']

    rules = [
        # follow links into the target comic's detail page
        Rule(LinkExtractor(allow=r'https://ac.qq.com/Comic/comicInfo/id/648421'), follow=True),
        # send the first-chapter reader page to parse_detail
        Rule(LinkExtractor(allow=r'https://ac.qq.com/ComicView/index/id/648421/cid/1'), callback="parse_detail", follow=False),
    ]

    def parse_detail(self, response):
        # grab the first text node just to confirm the callback fired
        title = response.xpath("//text()").get()
        print(title)
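When a Rule's callback never fires (see the summary below), one quick diagnostic is to try the extractor in scrapy shell against the real response. A sketch; the looser allow pattern is my assumption, not from the original post. Note that plain scrapy shell fetches the raw HTML without Selenium, which is informative in itself: if the links only appear after rendering, the extractor will find nothing here.

# run: scrapy shell https://ac.qq.com/
# inside the shell, `response` is already defined
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'Comic/comicInfo/id/\d+')
for link in le.extract_links(response):
    print(link.url)
# an empty result means the allow pattern matches no href on the page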

Summary

1. Scrapy + Selenium beats plain Selenium by a wide margin: it is much faster, simplifies the work, and with no waiting or element locating in the spider itself you can focus on parsing.
2. Sometimes the parse_detail function is never called; this question remains open (one plausible cause is an allow pattern that does not match the hrefs actually present in the page_source the middleware returns).


Reposted from blog.csdn.net/qq_51598376/article/details/113843159