Python web scraping

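This article covers the following modules:

webbrowser: part of the Python standard library; opens a browser to a specific URL.
Requests: downloads files and web pages from the Internet.
Beautiful Soup: parses HTML.
Selenium: launches and controls a web browser. Selenium can fill in forms and simulate mouse clicks in that browser.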

Project: mapIt.py with the webbrowser Module

The open() function of the webbrowser module launches a browser and points it at a specific URL.

>>> import webbrowser
>>> webbrowser.open('http://inventwithpython.com/')
True

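These few lines launch the browser and open the URL. That is the only thing the webbrowser module can do. Even so, the open() function makes some interesting things possible. For example, copying a street address to the clipboard, opening Google Maps, and pasting the address in is tedious. You can write a script that automates this: copy the address to the clipboard, run the script, and the map opens automatically.
The program does the following:

Gets a street address from the command line arguments or the clipboard
Opens the web browser to the Google Maps page for that address
This means your code needs to do the following:
Read the command line arguments from sys.argv
Read the clipboard contents
Call webbrowser.open() to open the web browser: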

#! /usr/bin/python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)
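
An address may contain spaces, commas, or other characters that are not valid in a URL. The sketch below, assuming the standard library's urllib.parse.quote() is acceptable here, percent-encodes the address before appending it. The names build_maps_url and main are hypothetical and not part of the original script, and the clipboard branch is left out to keep the example free of third-party modules.

```python
import sys
import webbrowser
from urllib.parse import quote

MAPS_PREFIX = 'https://www.google.com/maps/place/'


def build_maps_url(address):
    """Return a Google Maps URL for the address (hypothetical helper)."""
    # quote() percent-encodes spaces, commas and other unsafe characters.
    return MAPS_PREFIX + quote(address.strip())


def main():
    if len(sys.argv) > 1:
        # Join the command line arguments back into one address string.
        webbrowser.open(build_maps_url(' '.join(sys.argv[1:])))
    else:
        print('Usage: mapIt.py <street address>')


if __name__ == '__main__':
    main()
```

Keeping the URL construction in its own function makes it possible to check the result without opening a browser.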

Downloading files from the web with the requests module

The requests module must be installed first: sudo pip install requests.

requests.get()

#! /usr/bin/python3
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250])

# Output
<class 'requests.models.Response'>
True
178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec

Checking for Errors

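A simple way to check whether the download succeeded is to call the raise_for_status() method of the Response object. It raises an exception if something went wrong during the download and does nothing if the download succeeded.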

>>> import requests
>>> res = requests.get('http://inventwithpython.com/page_not_exist')
>>> res.raise_for_status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/requests/models.py", line 840, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://inventwithpython.com/page_not_exist
>>>

If a failed download should not terminate your program, wrap the raise_for_status() line in try and except statements to handle the error instead of letting the program crash.

import requests

res = requests.get('http://localhost:8080/index.html')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
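
The example above only guards raise_for_status(); a failure inside requests.get() itself, such as a refused connection, would still stop the program. The sketch below, assuming it is acceptable to return None on failure, moves the request inside the try block and catches requests.exceptions.RequestException, the base class of the exceptions that requests raises. The function name fetch and the 10-second timeout are illustrative assumptions.

```python
import requests


def fetch(url):
    """Return the Response for url, or None if the request failed."""
    try:
        # Both the request and the status check are inside the try block.
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print('There was a problem: %s' % exc)
        return None
    return res
```

Catching RequestException rather than Exception lets unrelated bugs, such as a misspelled variable name, still surface as normal tracebacks.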

Saving downloaded files to disk

#!/usr/bin/python3
import requests 
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)

playFile.close()

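In the for loop, the iter_content() method returns one "chunk" of the content on each iteration. Each chunk is of the bytes data type, and you specify how many bytes each chunk may contain. One hundred kilobytes is generally a good size, so pass 100000 as the argument to iter_content(). After the code runs, RomeoAndJuliet.txt exists in the current directory. The write() method returns the number of bytes written by that call. In summary, downloading and saving a file takes the following steps: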

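Call requests.get() to download the file
Call open() in write-binary mode ('wb')
Loop over the Response object's iter_content() method
On each iteration, call write() to write the chunk to the file
Finally, call close() to close the file
Writing chunk by chunk keeps each write small even for large files. Note that requests.get() still reads the whole response body into memory unless it is called with stream=True.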
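
The steps above can be sketched as a small helper that accepts any iterable of bytes chunks, such as the one returned by res.iter_content(100000). It uses a with statement so that the file is closed even if a write fails, and it sums the return values of write() to get the total size. The name save_chunks is a hypothetical helper, not part of requests.

```python
def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return total bytes written."""
    total = 0
    with open(path, 'wb') as out_file:
        for chunk in chunks:
            # write() returns the number of bytes written by this call.
            total += out_file.write(chunk)
    return total
```

With requests, this could be called as save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt') after res.raise_for_status().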
Detailed introduction to the requests module

HTML

HTML primer
 Introduction to HTML

Parsing HTML with the BeautifulSoup Module

The Beautiful Soup module extracts information from HTML pages. Its module name is bs4 (for Beautiful Soup, version 4). Install it with: sudo pip3 install beautifulsoup4 . The package is installed under the name beautifulsoup4, but it is imported with: import bs4

#! /usr/bin/python3
import bs4
with open('example.html') as exampleFile:
    # Name the parser explicitly so bs4 does not have to guess one.
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')
print(type(elems))
print(str(len(elems)))
print(type(elems[0]))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)

# Output
<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'}
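
The contents of example.html are not shown here, so the sketch below parses a small inline HTML string instead; its markup is an assumption chosen to be consistent with the output above. It illustrates a few other CSS selectors that select() accepts: by tag name, by class, and by nesting.

```python
import bs4

# Hypothetical markup standing in for example.html.
html = '''
<html><body>
<p class="slogan">Learn Python the easy way!</p>
<p>Visit <a href="http://inventwithpython.com">my website</a>.</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')

paragraphs = soup.select('p')        # every <p> element
slogan = soup.select('.slogan')[0]   # first element with class="slogan"
link = soup.select('p a')[0]         # first <a> nested inside a <p>
author = soup.select('#author')[0]   # the element with id="author"

print(len(paragraphs))
print(slogan.getText())
print(link.get('href'))
print(author.getText())
```

select() always returns a list, so indexing with [0] raises IndexError when nothing matches; checking the list length first avoids that.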

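Scraping and downloading web images

Download the comic image from http://xkcd.com, then locate the Prev button and follow it to keep downloading earlier comics.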

#! /usr/bin/python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests, os, bs4

url = 'http://xkcd.com'  # starting url
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd

while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        print('save the image '+os.path.basename(comicUrl)+' into xkcd')
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done.')
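
The script builds the image URL by prefixing 'http:' to the src attribute, which assumes the attribute is protocol-relative (starts with //). The sketch below, assuming the standard library's urllib.parse.urljoin() is acceptable, resolves the src against the page URL so that protocol-relative, root-relative, and absolute values are all handled. The helper names and the sample page and image paths are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of its page."""
    return urljoin(page_url, src)


def image_filename(image_url):
    """Return the last path component of image_url, for use as a file name."""
    return posixpath.basename(urlparse(image_url).path)


# Sample values for illustration only.
comic_url = resolve_image_url('https://xkcd.com/353/',
                              '//imgs.xkcd.com/comics/python.png')
print(comic_url)
print(image_filename(comic_url))
```

The loop in the script stops when the Prev link is '#', so this sketch is meant for the image URL only; the Prev link handling is left as written.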

BeautifulSoup reference documentation