[Crawler] 002: Scraping a static page with python3 + beautifulsoup4 + requests

Other 2018-08-07 18:23:27

Environment: Windows 7, Python 3.5, bs4 0.0.1, requests 2.19

Date of experiment: 2018-08-07

Target site: http://www.xhsd.cn/

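Most websites today rely on complex client-side interaction, while local government sites are too simple to show how bs4 parsing works. http://www.xhsd.cn/ is reasonably modern and, usefully, its HTML is returned directly by the server with no client-side rendering.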
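One quick way to check that a page is rendered on the server is to see whether text visible in the browser already appears in the raw HTML response. The sketch below illustrates that idea only; `missing_markers` and both sample HTML strings are hypothetical and are not part of the original script.

```python
def missing_markers(html, markers):
    """Return the markers that do NOT occur in the raw HTML text."""
    return [m for m in markers if m not in html]


# Hypothetical server-rendered response: the visible text is already in the HTML.
static_html = "<table><tr><td>推荐图书</td></tr><tr><td>新书速递</td></tr></table>"
# Hypothetical client-rendered response: only an empty mount point and a script tag.
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'

labels = ["推荐图书", "新书速递"]
print(missing_markers(static_html, labels))   # no label is missing
print(missing_markers(dynamic_html, labels))  # both labels are missing
```

With a real page, `html` would be the text returned by `requests.get(url)`. An empty result suggests that requests plus bs4 is enough and no browser automation is needed.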

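On 2018-08-07 the home page looked like this (before scraping, inspect the page elements in Chrome to understand how the HTML is structured):

[Screenshot of the home page]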

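The goal is to scrape sections such as Recommended Books (推荐图书), New Arrivals (新书速递), and Exam Books (考试用书).

Python scraping code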

  
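import requests
import bs4
import pandas as pd

url = "http://www.xhsd.cn/"
r = requests.get(url)
html = r.text

soup = bs4.BeautifulSoup(html, 'lxml')

# The book sections are matched as <table bgcolor="#ffffff"> elements.
tables = soup.find_all('table', bgcolor="#ffffff")

def etr(tb):
    """Extract the section label and the book entries from one table."""
    content = {}
    # Drop whitespace-only children so that only the real rows remain.
    arr = list(filter(lambda x: len(str(x)) > 2, tb.children))
    tr1 = arr[0]  # first row: section label
    tr2 = arr[1]  # second row: book links
    label = next(tr1.stripped_strings)
    content['label'] = label
    print(label)

    a_s = tr2.find_all('a', title=True)
    cs = []
    for a in a_s:
        try:
            cts = list(a.stripped_strings)
            # Expect exactly four strings: title, author, current price, previous price.
            book, auth, price_now, price_before = cts
            img = a.find('img')['src']
            tmp = {"book": book, "auth": auth, "price_now": price_now,
                   "price_before": price_before, "image": img}
            cs.append(tmp)
        except (ValueError, TypeError, KeyError):
            # Skip links that lack the four fields or an <img src>.
            continue

    content["contents"] = cs
    return content

dfs = []
for tb in tables:
    content = etr(tb)

    df_tmp = pd.DataFrame(data=content['contents'])
    df_tmp['label'] = content['label']
    dfs.append(df_tmp)

# Stack all sections into one DataFrame.
df = pd.concat(dfs, ignore_index=True)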
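To see what the extraction step does without hitting the network, the same logic can be run against a small hand-written snippet. This is only a sketch: the `SAMPLE` markup is a hypothetical stand-in shaped the way the script expects (the real page may differ), `parse_section` is an illustrative name, and `html.parser` is used instead of `lxml` so the example runs without extra dependencies. It also swaps the try/except for explicit checks, which makes the skip conditions visible.

```python
import bs4

# Hypothetical markup: one label row, one row of book links.
SAMPLE = """
<table bgcolor="#ffffff">
<tr><td>新书速递</td></tr>
<tr><td>
<a href="#" title="Book A"><img src="/upload/a.jpg"><span>Book A</span><span>Author A</span><span>10.00</span><span>20.00</span></a>
<a href="#" title="Broken"><img src="/upload/b.jpg"><span>Only a title</span></a>
</td></tr>
</table>
"""


def parse_section(tb):
    """Return the section label and the well-formed book entries of one table."""
    # Whitespace-only children ("\n") are shorter than 3 characters and are dropped.
    rows = [c for c in tb.children if len(str(c)) > 2]
    label = next(rows[0].stripped_strings)
    books = []
    for a in rows[1].find_all('a', title=True):
        fields = list(a.stripped_strings)
        img = a.find('img')
        # Skip links without exactly four text fields or without an <img src>.
        if len(fields) != 4 or img is None or not img.has_attr('src'):
            continue
        book, auth, price_now, price_before = fields
        books.append({"book": book, "auth": auth, "price_now": price_now,
                      "price_before": price_before, "image": img['src']})
    return {"label": label, "contents": books}


soup = bs4.BeautifulSoup(SAMPLE, 'html.parser')
section = parse_section(soup.find('table', bgcolor="#ffffff"))
print(section["label"], len(section["contents"]))
```

The second link in the sample has only one text field, so it is skipped, which mirrors what the `continue` branch in the scraping code does for malformed entries.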
 
 

Handling images

The scraped data includes a df['image'] column. One example value is http://www.xhsd.cn//upload/2017/7/1500881045493.jpg; a few sample values:

['http://www.xhsd.cn//upload/2017/7/1500881045493.jpg', 'http://www.xhsd.cn//upload/20160701\\9787201077642.JPG', 'http://www.xhsd.cn//upload/20160621\\9787201088945.JPG', 'http://www.xhsd.cn//upload/2017/6/1498807359861.jpg']
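Some of these paths contain a doubled slash or a backslash. A possible cleanup before downloading is sketched below using only the standard library. `normalize_image_url` is a hypothetical helper, and whether the server actually serves the normalized paths is an assumption that should be checked against the site.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_image_url(url):
    """Replace backslashes with slashes and collapse repeated slashes in the path."""
    parts = urlsplit(url)
    path = parts.path.replace("\\", "/")
    path = re.sub(r"/{2,}", "/", path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# The sample values from df['image'] shown above.
urls = ['http://www.xhsd.cn//upload/2017/7/1500881045493.jpg',
        'http://www.xhsd.cn//upload/20160701\\9787201077642.JPG',
        'http://www.xhsd.cn//upload/20160621\\9787201088945.JPG',
        'http://www.xhsd.cn//upload/2017/6/1498807359861.jpg']

cleaned = [normalize_image_url(u) for u in urls]
for u in cleaned:
    print(u)
```

On a DataFrame the same helper could be applied with `df['image'].map(normalize_image_url)`.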

The next post covers downloading the images and some simple processing.


Reposted from www.cnblogs.com/mathf/p/9437190.html
