Python Crawler Development, Part 1: The BeautifulSoup4 Parser

Posted 2018-08-11 20:10:05

CSS selectors: BeautifulSoup4

Beautiful Soup is another HTML/XML parser; its main purpose is likewise to parse HTML/XML documents and extract data from them.

Install with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
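
The parse-and-extract workflow described above can be sketched in a few lines. The HTML snippet and the extract_links helper are made up for illustration, and the standard-library 'html.parser' backend is assumed so that nothing beyond beautifulsoup4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, not taken from any real page.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Job list</h1>
  <a class="job" href="/a.html">Engineer</a>
  <a class="job" href="/b.html">Designer</a>
</body></html>
"""


def extract_links(markup):
    """Return (text, href) pairs for every <a class="job"> element."""
    soup = BeautifulSoup(markup, 'html.parser')
    # select() takes a CSS selector and returns a list of matching tags.
    return [(a.get_text(), a['href']) for a in soup.select('a.job')]


if __name__ == '__main__':
    print(extract_links(SAMPLE_HTML))
```

Each tag returned by select() behaves like a small document of its own: get_text() gives its text and indexing by attribute name gives the attribute value.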

Tool	Speed	Ease of use	Installation
Regex	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate
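
To make the ease-of-use column more concrete, here is a rough sketch that pulls the same link texts out of one snippet, first with a regular expression and then with BeautifulSoup. The snippet, the pattern, and both function names are illustrative assumptions, and the example says nothing about relative speed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup used only for this comparison.
SAMPLE_HTML = (
    '<ul><li><a href="/a.html">Engineer</a></li>'
    '<li><a href="/b.html">Designer</a></li></ul>'
)


def titles_with_regex(markup):
    # Fragile: the pattern only matches this exact attribute layout.
    return re.findall(r'<a href="[^"]*">([^<]*)</a>', markup)


def titles_with_bs4(markup):
    # The parser handles attribute order, quoting and nesting for us.
    soup = BeautifulSoup(markup, 'html.parser')
    return [a.get_text() for a in soup.find_all('a')]


if __name__ == '__main__':
    print(titles_with_regex(SAMPLE_HTML))
    print(titles_with_bs4(SAMPLE_HTML))
```

Both functions agree on this input, but the regex version would stop matching as soon as a link gained an extra attribute, which is the kind of brittleness the "Hard" rating for regex refers to.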

Scraping Tencent's job listings page with BeautifulSoup4

URL: http://hr.tencent.com/position.php?&start=10#a

# bs4_tencent.py

import json    # results are stored as JSON
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def tencent():
    url = 'http://hr.tencent.com/'
    request = Request(url + 'position.php?&start=10#a')
    response = urlopen(request)
    resHtml = response.read()

    html = BeautifulSoup(resHtml, 'lxml')

    # Select the even and odd table rows with CSS selectors
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        # The first cell holds the job title and the link to its detail page
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII escaping so non-ASCII text is written as UTF-8
    line = json.dumps(items, ensure_ascii=False)

    with open('tencent.json', 'w', encoding='utf-8') as output:
        output.write(line)


if __name__ == "__main__":
    tencent()
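
Since a live page can change or become unavailable, the selector logic can also be tried on a small inline table. The markup below is a made-up stand-in that imitates the row layout the script assumes (tr rows with class "even" or "odd", each holding five td cells); SAMPLE_TABLE, FIELDS and parse_rows are hypothetical names, and 'html.parser' is used in place of lxml to avoid the extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Made-up stand-in for the listing table; the real page's markup may differ.
SAMPLE_TABLE = """
<table>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td>
      <td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-08-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Product Designer</a></td>
      <td>Design</td><td>1</td><td>Beijing</td><td>2018-08-02</td></tr>
</table>
"""

# Names for cells 1..4, in the same order the script reads them.
FIELDS = ['catalog', 'recruitNumber', 'workLocation', 'publishTime']


def parse_rows(markup, base_url):
    """Turn each even/odd row into a dict, without touching the network."""
    soup = BeautifulSoup(markup, 'html.parser')
    items = []
    for row in soup.select('tr.even, tr.odd'):
        link = row.select('td a')[0]
        cells = row.select('td')
        item = {
            'name': link.get_text(),
            'detailLink': base_url + link.attrs['href'],
        }
        for offset, field in enumerate(FIELDS, start=1):
            item[field] = cells[offset].get_text()
        items.append(item)
    return items


if __name__ == '__main__':
    rows = parse_rows(SAMPLE_TABLE, 'http://hr.tencent.com/')
    print(json.dumps(rows, ensure_ascii=False, indent=2))
```

One design difference is worth noting: a comma-separated selector such as 'tr.even, tr.odd' returns rows in document order, whereas concatenating two separate select() results lists every even row before any odd row.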

Reprinted from www.cnblogs.com/loser1949/p/9460821.html
