Python Learning Prequel: A Python Web Crawler

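I have always worked in Linux C development, and now I am starting to learn another language: Python. Like C, it lets you write procedural code, but it is also object-oriented.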

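When learning a language, what matters most is its way of thinking, so I will approach Python from the point of view of a Linux C developer. We will skip the Python basics for now and go straight to a piece of web crawler code to see what the language is like. So what is a web crawler? A web crawler, also called a web spider (WebSpider), has a very vivid name. If you picture the whole Internet as something built like a spider's web, then the crawler is the creature that moves from strand to strand across it, catching the resources we need.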
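The spider-web picture can be made concrete with a small sketch. This is not the crawler from this post, only an illustration under stated assumptions: `TOY_WEB` is a made-up in-memory "web" of four pages, and `crawl` is a hypothetical helper that follows links breadth-first and visits each page once.

```python
from collections import deque

# Hypothetical toy "web": each page name maps to the pages it links to.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}


def crawl(start, web):
    """Visit every page reachable from start, breadth-first, once each."""
    visited = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            # Skip pages we have already seen, otherwise A -> C -> A loops forever.
            if link not in visited:
                visited.append(link)
                queue.append(link)
    return visited


order = crawl("A", TOY_WEB)
print(order)
```

A real crawler replaces the dictionary lookup with an HTTP request and link extraction, but the "follow links, remember what you have seen" loop stays the same.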

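Now let's look at the code. It implements a crawler that searches Baidu Baike for a keyword entered by the user (for example "孙悟空", Sun Wukong). For every link on the result page whose href contains "view", it opens the linked entry and checks whether that entry has a subtitle. It then prints one formatted line per link: the link text, the subtitle if there is one, and the URL.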

import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BASE_URL = "http://baike.baidu.com"


def fetch_soup(url):
    """Download a page and parse it with BeautifulSoup."""
    with urllib.request.urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


def main():
    keyword = input("Enter a keyword: ")
    # URL-encode the keyword so non-ASCII text is safe in the query string.
    query = urllib.parse.urlencode({"word": keyword})
    soup = fetch_soup("%s/search/word?%s" % (BASE_URL, query))

    # Visit every tag whose href contains "view".
    for each in soup.find_all(href=re.compile("view")):
        content = each.text
        # urljoin also copes with hrefs that are already absolute.
        url2 = urllib.parse.urljoin(BASE_URL, each["href"])
        soup2 = fetch_soup(url2)
        # Append the subtitle if the entry page has an <h2>.
        if soup2.h2:
            content += soup2.h2.text
        print(content + " -> " + url2)


if __name__ == "__main__":
    main()
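To see what the individual steps do without touching the network, here is a minimal offline sketch. It makes several assumptions: `SAMPLE_HTML` is a made-up page fragment, `LinkCollector` is a hypothetical helper, and the standard library's `html.parser` stands in for BeautifulSoup so that nothing has to be installed. It is meant to mirror the keyword encoding, the "view" filter, and the URL joining, not to reproduce BeautifulSoup's behavior exactly.

```python
import re
import urllib.parse
from html.parser import HTMLParser

# Step 1: build the query string the same way main() does.
query = urllib.parse.urlencode({"word": "孙悟空"})
print(query)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values that match a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # search() matches anywhere in the string, so "subview" counts too.
            if name == "href" and value and self.pattern.search(value):
                self.links.append(value)


# Made-up page fragment, for illustration only.
SAMPLE_HTML = """
<a href="/view/1001.htm">entry one</a>
<a href="/help/about.htm">about</a>
<a href="/subview/2002/3003.htm">entry two</a>
"""

# Step 2: keep only the links whose href contains "view".
collector = LinkCollector(re.compile("view"))
collector.feed(SAMPLE_HTML)

# Step 3: turn each relative link into an absolute URL.
urls = [urllib.parse.urljoin("http://baike.baidu.com", link)
        for link in collector.links]
for url in urls:
    print(url)
```

Note that the pattern is a plain substring match, so any href containing "view" anywhere is kept; a stricter pattern such as `^/view/` would be one way to narrow it if that turns out to be too broad.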

We can run the program in IDLE by pressing the F5 shortcut and take a look at the result: