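Python Security Development, Chapter 3, Section 3: Crawler Tools

Note: This article targets a Python 2.7 development environment. Some of the crawler libraries used here may no longer exist, but the crawling approach itself is still sound.

Crawlers

I originally planned to introduce crawlers only briefly here and cover them in detail in the advanced part, but decided to include a bit more content after all. There is an excellent third-party crawling framework, Scrapy, and a crawler can also be implemented entirely by hand. Here, instead of the more complex Scrapy, we use bs4, a module for parsing HTML structure. The Python advanced part will go into more depth.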

$ sudo pip2.7 install bs4
$ sudo apt-get install python-lxml

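Introduction to the bs4 module

bs4 is a module for parsing HTML text. The principle is not complicated: it walks through all the tags, stores them in variables, and lets us process them as needed. We could write this code ourselves, or even solve it with regular expressions, but since a ready-made module exists there is no need to do the work by hand. For more detail, see one of the bs4 tutorial blogs available online and the official documentation.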
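To make the "walk through all the tags and store them" idea concrete, here is a minimal sketch that uses only the standard-library HTML parser rather than bs4. The TagCollector class and its attribute names are hypothetical, invented for illustration; this is not how bs4 is implemented internally.

```python
# Python 2/3 compatible import of the standard-library HTML parser
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag and text node it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []    # (tag name, attribute dict) in document order
        self.texts = []   # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


collector = TagCollector()
collector.feed('<p class="title"><b>The Dormouse\'s story</b></p>'
               '<a href="http://example.com/lacie" id="link2">Lacie</a>')
collector.close()

# Once the tags are stored, filtering them is ordinary list processing
links = [attrs["href"] for tag, attrs in collector.tags if tag == "a"]
print(links)
print(collector.texts)
```

bs4 builds a full tree on top of this kind of event stream, which is why it can answer questions such as "the first p tag" or "all children of body" that a flat list cannot.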

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly (lxml was installed above)
print soup.prettify()  # print the document as an indented tree

Basic extraction operations

1. Tag: quickly locates the first element that matches the tag name; the drawback is limited flexibility

print soup.title
#<title>The Dormouse's story</title>
print soup.head
#<head><title>The Dormouse's story</title></head>
print soup.a
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print soup.p
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
print soup.p['class']
#['title']

2. NavigableString: quickly gets the string inside the first matching tag

print soup.p.string

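Working with nodes

contents returns a list of a tag's direct children, children returns an iterator over the same direct children, and descendants returns all descendant nodes. The descendants attribute recursively iterates over every nested node under a tag; as with children, we have to loop over it to get the contents.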

print soup.head.contents 
print soup.head.contents[0]
for child in soup.body.children:
	print child
for string in soup.strings:
	print(repr(string))
for string in soup.stripped_strings:
	print(repr(string))
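The difference between direct children and descendants is easier to see on a tiny fragment. The sketch below uses a made-up HTML string and the 'html.parser' backend bundled with Python, so it does not depend on lxml; it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup, Tag

# Small made-up fragment; 'html.parser' is the parser bundled with Python
soup = BeautifulSoup('<div><p>Hi <b>there</b></p></div>', 'html.parser')
div = soup.div

direct = div.contents                 # list of direct children only
all_nodes = list(div.descendants)     # every nested tag and string, depth-first

# Separate the nested nodes into tags and plain strings
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]
strings = [str(n) for n in all_nodes if not isinstance(n, Tag)]

print(len(direct))      # 1
print(tag_names)        # ['p', 'b']
print(strings)          # ['Hi ', 'there']
```

The div has a single direct child (the p tag), but four descendants: two tags and two strings.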

Searching the document tree

find_all( name , attrs , recursive , text , **kwargs )

soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all('a',id='link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
print soup.select('title') 
print soup.select('#link1')
print soup.select('a[href="http://example.com/elsie"]')

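Regular expressions

We do not strictly need regular expressions here, but they are often required to extract content from web pages, and some of our earlier programs used them as well, so here is a brief introduction.

Function	Purpose
match()	Determines whether the RE matches at the beginning of the string
search()	Scans the string and finds the first position where the RE matches
findall()	Finds all substrings the RE matches and returns them as a list
finditer()	Finds all substrings the RE matches and returns them as an iterator

re.match lookup

match() only finds a match object when the beginning of the searched string matches the pattern. The r prefix marks a raw string, so backslashes in the pattern are not treated as escape characters.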

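>>> import re
>>> m = re.match(r'dog', 'dog cat dog')
>>> m.group(0)
'dog'
>>> re.match(r'cat', 'dog cat dog')  # returns None: 'cat' is not at the start
# The same thing can also be written like this
>>> pattern = re.compile(r'dog')
>>> m = re.match(pattern, 'dog cat dog')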
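Because match() returns None when the pattern does not match at the start, calling group() on the result without checking raises an AttributeError. A minimal sketch of the usual guard follows; the helper name starts_with_word is hypothetical.

```python
import re

def starts_with_word(pattern, text):
    """Hypothetical helper: return the matched prefix, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

print(starts_with_word(r'dog', 'dog cat dog'))   # dog
print(starts_with_word(r'cat', 'dog cat dog'))   # None
```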

re.search lookup

search() is similar to match(), except that it is not restricted to matching at the beginning of the string. It does, however, stop searching as soon as it finds the first match.

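>>> m = re.search(r'cat', 'dog cat dog')
>>> m.group(0)
>>> m = re.search(r'dog', 'dog cat dog')
>>> m.group(0)
>>> m.start()  # start position of the match in the string
>>> m.end()  # end position of the match in the string
# The same thing can also be written like this
>>> pattern = re.compile(r'cat')  # the one benefit of this form: the pattern is precompiled and can be reused
>>> m = re.search(pattern, 'dog cat dog')
>>> m.group()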

re.findall: all matches

Searches the string and returns every matching substring as a list

>>> pattern = re.compile(r'\d+')
>>> print re.findall(pattern,'one1two2three3four4')
>>> # ['1', '2', '3', '4']

re.finditer: iterate over all matches

>>> pattern = re.compile(r'\d+')
>>> for m in re.finditer(pattern, 'one1two2three3four4'):
...     print m.group()

Searches the string and returns an iterator that yields each match result (a Match object) in order

import re
pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print m.group()
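Because each item yielded by finditer is a Match object, the position of every match is available along with its text, which findall does not provide. A small sketch with a made-up input string:

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two22three333'

# Collect (matched text, start index, end index) for every run of digits
positions = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(positions)   # [('1', 3, 4), ('22', 7, 9), ('333', 14, 17)]
```

This is handy when the surrounding context of a match matters, for example when slicing out the text between two matches.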

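Scraping Qiushibaike

We all enjoy reading Qiushibaike, so we start by analyzing its HTML source and looking for patterns. (Note: this app is probably gone by now; try applying the same approach to crawling a novel site instead.)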

# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup


# Fetch one page of jokes and print the text of each one
def GetPage(page):
    myUrl = "http://m.qiushibaike.com/hot/page/" + str(page)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(myUrl, headers=headers)
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
    soup = BeautifulSoup(myPage, 'lxml')
    # Get every div tag whose class is "content"
    QBlist = soup.find_all('div', class_='content')  # class_ avoids clashing with the class keyword
    # Crudely strip the surrounding div markup with fixed offsets to leave the joke text
    for item in QBlist:
        a = str(item)[28:-14]
        print a


GetPage(1)

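As you may have noticed, this crawler is far from perfect, and what it extracts may not be exactly what we want. As an exercise, try replacing my fixed-offset filtering with the re module and keep exploring bs4. Readers with energy to spare can also learn Scrapy from its official documentation.

Crawlers are everywhere on the web. The servers behind Google's and Baidu's crawlers work around the clock, pulling data into their own databases so that we can search it. That is the most basic form, but however complex a crawler gets, the underlying principle is the same. I will cover more in the advanced part if there is a chance.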
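As a starting point for that exercise, here is a minimal sketch that extracts the joke text with the re module instead of fixed slice offsets. The sample HTML is made up, and the assumption that each joke sits in a flat div with class "content" may not hold for a real page; the name extract_jokes is hypothetical.

```python
import re

# Made-up sample; the real page layout is an assumption and may differ
sample = '''
<div class="content">
First joke text<br/>second line
</div>
<div class="author">someone</div>
<div class="content">
Another joke
</div>
'''

# Non-greedy match of everything inside each <div class="content"> ... </div>
block_re = re.compile(r'<div class="content">(.*?)</div>', re.S)
tag_re = re.compile(r'<[^>]+>')   # any remaining inline tag such as <br/>

def extract_jokes(page):
    """Return the text of every content block, with inline tags turned into newlines."""
    jokes = []
    for body in block_re.findall(page):
        text = tag_re.sub('\n', body).strip()
        jokes.append(text)
    return jokes

for joke in extract_jokes(sample):
    print(joke)
```

Note that the non-greedy pattern stops at the first closing div, so it breaks if content blocks contain nested div tags; in that case bs4's own text extraction is the more robust choice.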