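Scraping Book Information with Selenium

Preface

"Selenium is arguably one of the must-have skills for UI automation, and I have already written quite a few Selenium tutorials. This chapter offers something with a different flavor."

WeChat official account: 测个der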

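Scraping data with Selenium

What? Selenium can be used for web scraping too?

That's right, and scraping with Selenium is arguably one of the simpler approaches.

Once you know the Selenium basics, a simple scraper looks easy when you read it and falls apart the moment you try it yourself!

Why??

Because your eyes have learned it, but your hands have not.

Now to the main topic: this chapter shows you how to scrape book information elegantly.

Install the package: pip install selenium

Target page: https://www.jd.com/

First, import the modules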

from selenium import webdriver
from time import sleep
from selenium.webdriver.common.by import By

Next, configure headless mode. It is said to make things faster; if you doubt it, measure it yourself!

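from selenium.webdriver.firefox.service import Service

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument("--disable-gpu")
s = r'D:\pytest_\Case\geckodriver.exe'  # path to your local geckodriver
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Firefox(service=Service(s), options=options)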

This also sets the driver path. If you do not yet know how to download the driver, go back to the earlier Selenium tutorials.
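Whether headless mode is actually faster is easy to check. The sketch below times one page load in each mode. The helper names timed, open_home and compare_modes are made up for illustration, and the sketch assumes geckodriver can be found without an explicit path (on PATH, or resolved by a recent Selenium release); otherwise pass the driver path as in the configuration above.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def open_home(headless):
    """Open the JD home page once and return its title (hypothetical helper)."""
    # Imported here so that timed() can be reused without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument('--headless')
    # Assumption: geckodriver is discoverable without an explicit path.
    driver = webdriver.Firefox(options=options)
    try:
        driver.get('https://www.jd.com/')
        return driver.title
    finally:
        driver.quit()  # always close the browser, even on errors


def compare_modes():
    """Print the load time for headless and for normal mode."""
    for headless in (True, False):
        _, seconds = timed(open_home, headless)
        print('headless={}: {:.2f}s'.format(headless, seconds))
```

Call compare_modes() to run the comparison. A single run is noisy (network, caching), so repeat it a few times before drawing conclusions.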

OK, next we move this into a class so it is easier to manage

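from selenium.webdriver.firefox.service import Service


class JD:
    def __init__(self):
        self.options = webdriver.FirefoxOptions()
        self.options.add_argument('--headless')
        self.options.add_argument("--disable-gpu")
        s = r'D:\pytest_\Case\geckodriver.exe'  # path to your local geckodriver
        # Selenium 4 takes the driver path through a Service object
        self.driver = webdriver.Firefox(service=Service(s), options=self.options)
        self.i = 0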

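To keep things simple, the class name is just the capitalized initials of Jingdong (京东): JD. What is self.i? No rush, read on!

If you have read this far and still do not know what self is for, that is a problem. Go back and review: http://mp.weixin.qq.com/mp/homepage?__biz=MzkwODI1OTYwMg==&hid=5&sn=5bc874ab1aa9ed83f949f77a6c5f3d24&scene=18#wechat_redirect

Next, open the target page and look at the information we want to collect. I will not go over how to get there (search for 京东 on Baidu or enter the URL directly); press F12 to open the developer tools and inspect the page.

How do you locate the target element? What is an element? I recommend going back over the Selenium basics.

With the steps above done, how do we write the code?

We put it in the class as well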

    def get_html(self):
        self.driver.get('https://www.jd.com/')
        # type the keyword into the search box, then click the search button
        self.driver.find_element(By.XPATH, '//*[@id="key"]').send_keys("python")
        self.driver.find_element(By.XPATH, "//*[@class='form']/button").click()
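The search results may not be in the DOM the instant click() returns, so it helps to wait before reading them. Selenium provides WebDriverWait for this; the sketch below shows the same polling idea in plain Python so the logic is visible. The function name wait_until and its default timeout and poll interval are illustrative choices, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or timeout expires.

    Returns the truthy value; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {}s'.format(timeout))
        time.sleep(poll)
```

Inside the class one might then write items = wait_until(lambda: self.driver.find_elements(By.CLASS_NAME, 'gl-item')). find_elements returns an empty list while nothing matches, so the loop keeps polling until the result items appear.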

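Having got this far, the next job is to extract the values. Let's take a look.

Look, here they are: locate the elements, read their text, and you are done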

# value is the element for one product in the result list
# keys: 价钱 = price, 描述 = description, 评论数 = number of comments, 出版社 = publisher
dict_ = {}
dict_["价钱"] = value.find_element(By.XPATH, ".//div[@class='p-price']/strong/i").text + '元'
dict_["描述"] = value.find_element(By.XPATH, ".//div[@class='p-name p-name-type-2']/a/em").text
dict_["评论数"] = value.find_element(By.XPATH, ".//div[@class='p-commit']/strong/a|.//div[@class='p-commit']/strong").text
dict_["出版社"] = value.find_element(By.XPATH, ".//div[@class='p-shop']").text

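At this point a problem shows up: this only retrieves a single record.

Look again

Every item carries the same class, so the problem becomes simple: just fetch the whole group of elements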

    def get_data(self):
        # every product in the result list carries the class gl-item
        list_ = self.driver.find_elements(By.CLASS_NAME, 'gl-item')
        for value in list_:
            # keys: 价钱 = price, 描述 = description, 评论数 = number of comments, 出版社 = publisher
            dict_ = {}
            dict_["价钱"] = value.find_element(By.XPATH, ".//div[@class='p-price']/strong/i").text + '元'
            dict_["描述"] = value.find_element(By.XPATH, ".//div[@class='p-name p-name-type-2']/a/em").text
            dict_["评论数"] = value.find_element(By.XPATH, ".//div[@class='p-commit']/strong/a|.//div[@class='p-commit']/strong").text
            dict_["出版社"] = value.find_element(By.XPATH, ".//div[@class='p-shop']").text
            self.i += 1  # count the items processed
            print(dict_)

This way you get all of the elements.
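One thing to watch for in this loop: find_element raises an exception when a child node is missing, so a single item without, say, a shop name would stop the whole run. A minimal sketch of a more tolerant lookup follows. safe_text is a hypothetical helper, and it relies on find_elements returning an empty list when nothing matches.

```python
def safe_text(parent, xpath, default=''):
    """Return the text of the first node matching xpath under parent.

    parent is a WebElement (or the driver itself). find_elements returns an
    empty list instead of raising, so a missing node yields default.
    """
    # 'xpath' is the string value of By.XPATH.
    matches = parent.find_elements('xpath', xpath)
    return matches[0].text if matches else default
```

In the loop this would read, for example, dict_["出版社"] = safe_text(value, ".//div[@class='p-shop']"), leaving the field empty instead of aborting when the node is absent.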

Put it all in a file and run it:

    def run(self):
        self.get_html()
        self.get_data()
        print("次数：{}".format(self.i))  # 次数 = number of items scraped


jd = JD()
jd.run()
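As written, nothing closes the browser after run() finishes, and a leftover headless Firefox process is easy to miss. A hedged sketch of one way to guarantee cleanup is shown below. closing_driver is a name made up here, and it only assumes the object has a quit() method, as Selenium drivers do.

```python
from contextlib import contextmanager


@contextmanager
def closing_driver(driver):
    """Yield driver and call driver.quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()
```

With this in place, the bare jd.run() call can be wrapped as "with closing_driver(jd.driver): jd.run()", so the browser is shut down whether or not the scrape succeeds.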

Then another problem appears: the data is incomplete, and you will find that only part of the items are retrieved.

What now?

self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll to the bottom of the page

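Scroll the page before locating the elements and that is all it takes. The items that were missing only show up once the page has been scrolled.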
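A single jump to the bottom may not be enough if the page keeps appending items as it scrolls. The sketch below repeats the scroll until the page height stops growing. The function name, the pause length and the round limit are all assumptions for illustration; tune them against the real page.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down repeatedly until the document height stops changing.

    Returns the number of scrolls performed.
    """
    height_js = 'return document.body.scrollHeight'
    last_height = driver.execute_script(height_js)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        rounds += 1
        time.sleep(pause)  # give lazily loaded items time to appear
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

Calling scroll_to_bottom(self.driver) right before the find_elements call would then take the place of the single execute_script line.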

The source code is on Gitee; if you need it, you can get it directly from: https://gitee.com/qinganan_admin/reptile-case.git
