Scraping Word-document text from Baidu Wenku with Python + Selenium
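
The script below drives headless Chrome through Selenium: it opens a Baidu Wenku document URL, reads the total page count, scrolls to the banner and clicks the expand button (the "moreBtn goBtn" span) to unfold the whole document, then scrolls each page into view so its lazy-loaded text layer renders and prints the text page by page. If Baidu Wenku reports that the free preview has ended, the script prints a notice instead of the full text.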

# -*- coding:utf-8 -*-

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

# Headless Chrome with a desktop user agent so Baidu Wenku serves the normal desktop page
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36")

driver = webdriver.Chrome(chrome_options=chrome_options)  # Selenium 3 style argument
driver.maximize_window()

url = input("Enter the document URL: ")
driver.get(url)

error_str = ""

try:
    # Total page count, displayed as "/N" in the reader toolbar
    page_num = driver.find_element_by_xpath("//span[@class='page-count']").text

    # Scroll down to the banner and click the expand button to unfold the full document
    find_button = driver.find_element_by_xpath("//div[@class='doc-banner-text']")
    driver.execute_script("arguments[0].scrollIntoView();", find_button)
    button = driver.find_element_by_xpath("//span[@class='moreBtn goBtn']")
    button.click()

    # Scroll every page into view so its lazy-loaded text layer renders, then print the text
    for i in range(1, int(page_num.strip('/')) + 1):
        page = driver.find_element_by_xpath("//div[@data-page-no='{}']".format(i))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        time.sleep(0.3)
        print(driver.find_elements_by_xpath("//div[@data-page-no='{}']//div[@class='reader-txt-layer']".format(i))[-1].text)

except NoSuchElementException:
    # The comparison string is the literal text Baidu Wenku shows when the free preview ends;
    # find_elements avoids raising a second exception if the hint element is absent
    hints = driver.find_elements_by_xpath("//div[@class='doc-bottom-text']")
    if hints and hints[0].text == "试读已结束,如需继续阅读或下载":
        error_str = "\n------------------------------------------------------------------\n\n" \
                    "---------- Baidu Wenku says the free preview has ended; the full text cannot be scraped. Try again later. ----------\n\n" \
                    "------------------------------------------------------------------"

finally:
    print(error_str)
    driver.quit()  # always release the browser process
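
The listing above targets the older Selenium 3 API; in Selenium 4 the find_element_by_* helpers and the chrome_options= argument were removed. Below is a minimal sketch of the same flow using the By locator API, collecting the pages into a list and writing them to a text file instead of printing. The output file name is illustrative, and the XPaths assume Baidu Wenku still uses the page structure from the original script.

# -*- coding:utf-8 -*-

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)  # Selenium 4 style argument
driver.get(input("Enter the document URL: "))

try:
    # Same locators as above, expressed with By.XPATH
    page_count = driver.find_element(By.XPATH, "//span[@class='page-count']").text

    banner = driver.find_element(By.XPATH, "//div[@class='doc-banner-text']")
    driver.execute_script("arguments[0].scrollIntoView();", banner)
    driver.find_element(By.XPATH, "//span[@class='moreBtn goBtn']").click()

    pages = []
    for i in range(1, int(page_count.strip('/')) + 1):
        page = driver.find_element(By.XPATH, "//div[@data-page-no='{}']".format(i))
        driver.execute_script("arguments[0].scrollIntoView();", page)
        time.sleep(0.3)  # give the lazy-loaded text layer a moment to render
        layer = driver.find_elements(
            By.XPATH,
            "//div[@data-page-no='{}']//div[@class='reader-txt-layer']".format(i),
        )[-1]
        pages.append(layer.text)

    # Write the collected pages to a file (file name is illustrative)
    with open("baidu_wenku.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(pages))
finally:
    driver.quit()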


Reposted from www.cnblogs.com/shuai-bi/p/10086664.html