Python3 web scraping: how to crawl pages that use pull-down (load-more) loading instead of numbered pagination

The target site is: https://www.americamakes.us/about/news/

How the page loads:
[screenshot: the request the page sends as you scroll, captured in the browser's developer tools]
The Form Data is:
[screenshot: the Form Data sent with that request]
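
Since the page has no page-number links, the usual trick is to replay that request yourself: each time you scroll, the browser POSTs the Form Data shown above (with a field that advances to the next batch) and receives an HTML fragment containing more posts. Below is a minimal sketch of that idea; the endpoint URL and the payload field names are assumptions standing in for whatever your Network tab actually shows, so copy the real values from the captured request.

```python
import requests

# Assumed endpoint and field names -- replace them with the real values
# from the captured XHR request (the screenshots above).
AJAX_URL = 'https://www.americamakes.us/wp-admin/admin-ajax.php'  # hypothetical

def fetchMorePosts(page):
    """Replay the POST the browser sends when you scroll down."""
    payload = {
        'action': 'load_more_posts',  # hypothetical: copy from the Form Data
        'page': page,                 # hypothetical: the field that picks the batch
    }
    try:
        r = requests.post(AJAX_URL, data=payload, timeout=30)
        r.raise_for_status()
        return r.text  # typically an HTML fragment holding the next posts
    except requests.RequestException:
        return ''
```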

Here is the code I wrote:

```python
import os

import requests
from bs4 import BeautifulSoup

HTML_DIR = 'html'
TXT_PATH = 'thg_news.txt'

def getHTMLText(url):
    """Fetch a URL and return its text, or '' on any request error."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''

def printUnivList(html):
    soup = BeautifulSoup(html, 'html.parser')
    div_w_blog_list = soup.find('div', class_='w-blog-list')
    if div_w_blog_list is None:  # nothing to parse
        return
    # Walk the <article> elements so each post keeps its own categories
    for article in div_w_blog_list.find_all('article'):
        TYPE = article.get('data-categories')
        body = article.find('div', class_='w-blog-post-body')
        a_page = body.find('a')
        date = body.find('time').get_text()
        title_url = a_page.get('href')
        title = a_page.get_text()
        filename = title_url.split('/')[-2]
        a_dict = {
            'TYPE': TYPE,
            'Title': title,
            'Title_Url': title_url,
            'Date': date,
            'File_Name': filename,
        }
        add_name_to_text(a_dict)
        a_html = getHTMLText(title_url)
        write_html(a_html, filename)

def add_name_to_text(newsdic):
    """Append one news record to the text file, one field per line."""
    with open(TXT_PATH, 'a', encoding='utf-8') as f:
        for k in ['TYPE', 'Title', 'Title_Url', 'Date', 'File_Name']:
            f.write('[%s]: %s\n' % (k, newsdic[k]))
        f.write('\n')

def write_html(html, name):
    """Save a fetched article page under HTML_DIR/<name>.html."""
    os.makedirs(HTML_DIR, exist_ok=True)  # create the folder if it is missing
    html_path = os.path.join(HTML_DIR, name + '.html')
    with open(html_path, 'w', encoding='utf-8') as f:
        f.write(html)

def main():
    url = 'https://www.americamakes.us/about/news/'
    html = getHTMLText(url)
    printUnivList(html)

if __name__ == '__main__':
    main()
```
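
As written, main() only parses the first batch of posts delivered with the page itself. To crawl the rest, you can call the hypothetical fetchMorePosts() sketched above in a loop, bumping the page number until the server returns an empty fragment:

```python
def main():
    # First batch arrives with the normal page load...
    printUnivList(getHTMLText('https://www.americamakes.us/about/news/'))
    # ...the rest by replaying the pull-down request until nothing comes back.
    page = 2
    while True:
        fragment = fetchMorePosts(page)  # hypothetical helper from the sketch above
        if not fragment.strip():
            break
        printUnivList(fragment)
        page += 1
```

Note that the fragment the endpoint returns may not include the w-blog-list wrapper div; if printUnivList finds nothing in it, loosen the selector to look for the article tags directly.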


Reposted from blog.csdn.net/qq_43182687/article/details/82629589