Python Data Collection: Creating a Crawler

I. Fetching a Web Page

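from urllib.request import urlopen

# Request the page; urlopen returns a file-like response object
html = urlopen("http://pythonscraping.com/pages/page1.html")
# read() returns the response body as bytes
print(html.read())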

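Explanation: urllib is part of Python's standard library. It handles network requests, cookies, request headers, and so on. Fetching a page this way is similar to typing a URL into a browser, except that only the HTML document at that URL is retrieved, not the images, scripts, and stylesheets it links to.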

Installing urllib: nothing to install. urllib ships with Python 3, so no pip command is needed.

Result: the page's HTML is printed (as a bytes object, because read() returns bytes).
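A request can fail for reasons outside the script's control, so it may help to wrap the call. The following is a minimal sketch, not part of the original listing: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration.

```python
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the raw response body, or None if the request fails."""
    try:
        # Using the response as a context manager closes the
        # connection when the block ends.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The resource could not be reached (bad host name, no network, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

Calling fetch_html("http://pythonscraping.com/pages/page1.html") would return the same bytes as the listing above, or None when the page cannot be fetched.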

II. Processing Web Page Data

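from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.hankcs.com/program/python/python-ide.html")
# Name the parser explicitly so bs4 does not have to guess one
bsObj = BeautifulSoup(html.read(), "html.parser")
# Print the first <h1> tag in the document
print(bsObj.h1)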

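Explanation: BeautifulSoup (bs4) is a Python library that cleans up badly formatted HTML and presents the document as a tree of tag objects, much like an XML tree, which makes page data easy to parse.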

Installing BeautifulSoup: in cmd, run pip install beautifulsoup4

Result: the page's first <h1> tag is printed.

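Further reading: making good use of BeautifulSoup requires understanding how HTML is structured. The main tags include head, body, script, and link; for details see https://www.w3cschool.cn/html/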

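Which tags and attributes you target depends on which data on the page interests you. Inspect the page's HTML source to find where that data lives (sites with anti-scraping measures often change their HTML attributes from time to time), then extract it. For a simple HTML page, that is all there is to it.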

III. Parsing Complex HTML

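For example, suppose we want agriculture data from the National Data site (data.stats.gov.cn). The URL is http://data.stats.gov.cn/easyquery.htm?cn=C01&zb=A0D0G&sj=2016

We want the data in the table.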

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://data.stats.gov.cn/easyquery.htm?cn=C01&zb=A0D0G&sj=2016")
# Decode the bytes so Chinese text is readable
html_str = html.read().decode('utf-8')
print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Take the first <table> found in the static HTML
obj = bsObj.findAll("table")[0]
print(obj)

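Running this does not give the result we want. Inspecting the HTML source shows that the table data is loaded by JavaScript at runtime, so the static HTML we fetched contains no data. When a page is purely static HTML, BeautifulSoup's parsing is powerful enough to extract the data however deeply it is nested. We will come back to JavaScript-loaded pages later.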

For BeautifulSoup's parsing methods, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Garbled Chinese text: decode the response bytes explicitly:

 html.read().decode('utf-8')
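Not every page is UTF-8, so hard-coding the charset can still produce garbled text. The sketch below is one possible approach rather than the author's method: the helper name decode_body and its fallback behavior are assumptions made for illustration.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode response bytes, preferring the charset the server declared."""
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: fall back to UTF-8 and replace
        # undecodable bytes instead of raising.
        return raw.decode("utf-8", errors="replace")
```

With a urlopen response, the declared charset is available as html.headers.get_content_charset(), which returns None when the server did not send one.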

1. Getting data by CSS class

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
html_str = html.read().decode('utf-8')
print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Find all <span> tags, filtered by class="green"
obj = bsObj.findAll("span", {"class": "green"})
for item in obj:
    print(item.get_text())

Result: the page HTML is printed, followed by the text of every span with class green, one per line.

Parameters of the filtering methods findAll and find (find_all is the current spelling of findAll in bs4):

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

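The tag and attributes arguments can express combined conditions: tag accepts a list of tag names, and attributes accepts a JSON-like dict that maps attribute names to the values to match.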
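To make the parameters concrete, here is a small sketch that runs against an inline document instead of a live page. The SAMPLE markup is made up for this illustration, and the calls use find_all, the current bs4 name for findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this illustration.
SAMPLE = """
<html><body>
  <h1>Title</h1>
  <h2>Subtitle</h2>
  <p><span class="green">Anna</span> met <span class="red">a warning</span>
     and <span class="green">Pierre</span>.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# tag: a single name or a list of names.
headings = [tag.name for tag in soup.find_all(["h1", "h2"])]

# attributes: a dict; a list of values matches any one of them.
colored = [s.get_text() for s in soup.find_all("span", {"class": ["green", "red"]})]

# limit: stop after the first N matches.
first_green = [s.get_text() for s in soup.find_all("span", {"class": "green"}, limit=1)]

# keywords: attribute filters as keyword arguments; class needs a
# trailing underscore because "class" is a reserved word in Python.
greens = [s.get_text() for s in soup.find_all("span", class_="green")]

print(headings, colored, first_green, greens)
```

find takes the same filters but returns only the first match (or None), which is why it has no limit parameter.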

2. Handling child and descendant tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
html_str = html.read().decode('utf-8')
#print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# .children yields only the direct children of the table
obj = bsObj.find("table", {"id": "giftList"}).children

for item in obj:
    print(item)
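The heading mentions descendants, but the listing above only uses .children. The following sketch contrasts the two on an inline table; the SAMPLE markup is a made-up stand-in loosely modelled on the giftList table, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table, loosely modelled on the giftList table.
SAMPLE = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)

table = BeautifulSoup(SAMPLE, "html.parser").find("table", {"id": "giftList"})

# .children yields only direct children: here, the two <tr> rows.
children = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree in document order: each row
# followed by its cells (text nodes are skipped by the isinstance check).
descendants = [node.name for node in table.descendants if isinstance(node, Tag)]

print(children)
print(descendants)
```

The isinstance check matters on real pages, where whitespace between tags shows up as text nodes among the children.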

3. Handling sibling tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
html_str = html.read().decode('utf-8')
#print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Start from the first row and take every sibling that follows it
obj = bsObj.find("table", {"id": "giftList"}).tr.next_siblings

for item in obj:
    print(item)
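A typical use of next_siblings is to skip a table's header row. This is a minimal sketch on made-up inline markup (the SAMPLE table is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row followed by two data rows.
SAMPLE = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vegetable Basket</td></tr>"
    "<tr><td>Russian Nesting Dolls</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings starts after header_row, so the header itself is excluded.
items = [row.get_text() for row in header_row.next_siblings]
print(items)
```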

4. Handling parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
html_str = html.read().decode('utf-8')
#print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# .parent is the tag that directly contains the first <img>
obj = bsObj.find("img").parent

# Iterating over a tag yields its direct children
for item in obj:
    print(item)
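Parent navigation is most useful when combined with sibling navigation, for example to reach the cell next to an image. The sketch below assumes a made-up row layout (name, price, image); the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, image cell.
SAMPLE = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# img.parent is the <td> that wraps the image;
# its previous sibling is the <td> holding the price.
price = img.parent.previous_sibling.get_text()
print(price)
```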

IV. Regular Expressions

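The page data we have fetched so far is not fully filtered; we still need to narrow it down to the data we actually care about. We could write string-filtering functions by hand, but that is inefficient and produces a lot of redundant code. Regular expressions let us filter and validate data quickly.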

For example, [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net) checks whether a piece of data is formatted like an email address.
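As a rough sketch of how that pattern might be used with Python's re module (the function names looks_like_email and find_emails are assumptions chosen for this example):

```python
import re
from typing import List

# The pattern quoted above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")


def looks_like_email(text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None


def find_emails(text: str) -> List[str]:
    """Return every substring of text that matches the pattern."""
    # finditer + group(0) gives the full match; findall would return
    # only the captured top-level domain because of the (...) group.
    return [m.group(0) for m in EMAIL_PATTERN.finditer(text)]
```

Note that the pattern is deliberately simple: it accepts only four top-level domains and no dots in the domain name, so it rejects many valid addresses.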

1. Getting the URLs of images on a page

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
html_str = html.read().decode('utf-8')
#print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Keep only <img> tags whose src matches the regex (raw string avoids escape warnings)
images = bsObj.findAll("img", {"src": re.compile(r".*\.jpg")})

for item in images:
    print(item["src"])

Result: the src of every matching .jpg image on the page is printed.

V. Lambda Expressions

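Suppose you have already fetched some data that contains many values in the same format. Filtering it further with a regular expression is a poor fit, because a regex only tests whether text matches a pattern; it cannot express a query on the tag itself, such as how many attributes it has. In that case a lambda expression solves the problem quickly and efficiently.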

For example, soup.findAll(lambda tag: len(tag.attrs) == 2) finds every tag that has exactly two attributes.
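A small sketch of lambda filters on inline markup may make this clearer. The SAMPLE markup is made up for illustration, and find_all is the current bs4 name for findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with tags carrying 0, 1, and 2 attributes.
SAMPLE = (
    "<div>"
    '<img src="a.jpg" alt="first">'
    '<img src="b.png">'
    '<a href="/next" id="more">next</a>'
    "</div>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# The function receives each Tag and returns True to keep it.
two_attrs = [tag.name for tag in soup.find_all(lambda tag: len(tag.attrs) == 2)]

# Conditions can be combined freely inside the function.
jpg_images = [
    tag["src"]
    for tag in soup.find_all(
        lambda tag: tag.name == "img" and tag.get("src", "").endswith(".jpg")
    )
]
print(two_attrs, jpg_images)
```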

1. Getting the URLs of images on a page

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
html_str = html.read().decode('utf-8')
#print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Every tag that has exactly one attribute
print(bsObj.findAll(lambda tag: len(tag.attrs) == 1))
# Every <img> tag, selected with a lambda instead of a tag name
imgs = bsObj.findAll(lambda tag: tag.name == "img")
for item in imgs:
    print(item["src"])

VI. Other Python HTML Parsing Libraries

BeautifulSoup is only one of the libraries that can parse the HTML we fetch with urllib. Python's rich library ecosystem offers other choices, such as lxml and HTMLParser.

1. Installing and using lxml

Windows cmd: pip install lxml

Reference: https://lxml.de/tutorial.html

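2. Installing and using HTMLParser

No installation is needed: in Python 3, HTMLParser is part of the standard library, in the html.parser module.

Reference: https://docs.python.org/3/library/html.parser.html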
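For comparison with the BeautifulSoup examples, here is a minimal sketch of collecting image URLs with the standard library's HTMLParser. The class name ImageSrcParser and the helper extract_image_sources are assumptions made for this illustration.

```python
from html.parser import HTMLParser
from typing import List


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html_text: str) -> List[str]:
    """Feed the HTML to the parser and return the collected src values."""
    parser = ImageSrcParser()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

Unlike BeautifulSoup, HTMLParser is event-driven: it calls handler methods as it reads the markup and builds no tree, so any state you need has to be collected in the subclass.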

We will study lxml and HTMLParser in more detail later, but the overall workflow is similar to BeautifulSoup's: fetch the HTML, parse and filter it, and end up with the data you want.
