Python Web Scraping for Beginners (2): Scraping a Popular Zhihu Topic

Please include a link to the original when reposting.

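If you have read Part 1, you are ready for a simple hands-on exercise. The plan is:
Pick a popular Zhihu topic: https://www.zhihu.com/topic/19606591/hot (a horror-film topic, so browse it with care late at night).
Extract every film title mentioned in the posts under that topic. Most titles are wrapped in 《》, so match on those brackets, remove duplicates, and write the result to a file.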
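
The matching and deduplication step can be tried on its own before any network code is involved. The following is a minimal sketch: the sample text and the helper name `extract_titles` are made up for illustration and are not part of the spider.

```python
import re

# Made-up sample text standing in for the body of a post.
sample = "Try 《The Shining》 and 《Ringu》; 《The Shining》 is worth a rewatch."

def extract_titles(text):
    """Return the unique titles found inside 《》, in order of first appearance."""
    seen = []
    # Non-greedy group, so each 《...》 pair yields its own match.
    for title in re.findall(r'《(.*?)》', text, re.S):
        if title not in seen:
            seen.append(title)
    return seen

titles = extract_titles(sample)
print(titles)  # ['The Shining', 'Ringu']
```

A plain list is used instead of a set so that titles keep the order in which they first appear.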

Matching relies on regular expressions; a detailed tutorial is available here:
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
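
One regex detail matters a great deal for this task: the difference between greedy `.*` and non-greedy `.*?`. The sketch below uses a shortened, hand-written HTML fragment (an assumption, not real Zhihu markup) to show what each variant captures.

```python
import re

# Hand-written fragment with two links, loosely modelled on the topic page.
html = ('<a href="//zhuanlan.zhihu.com/p/40456756">A</a>'
        '<a href="//zhuanlan.zhihu.com/p/40459605">B</a>')

# Greedy: .* runs to the last double quote, swallowing both links in one match.
greedy = re.findall(r'href="//(.*)"', html)

# Non-greedy: .*? stops at the first closing quote, giving one match per link.
lazy = re.findall(r'href="//(.*?)"', html)

print(len(greedy))  # 1
print(lazy)         # ['zhuanlan.zhihu.com/p/40456756', 'zhuanlan.zhihu.com/p/40459605']
```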

Here is the program. It defines a new class in zhihuspider.py:

import requests
import re
class ZhihuSpider:

    def __init__(self):
        self.enable = False
        # Proxy URLs take the form http://user:password@host:port/; substitute your own values.
        self.proxy = {'http': 'http://username:1qaz%[email protected]:8080/',
                      'https': 'http://username:1qaz%[email protected]:8080/'}
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'
        }
        self.url = 'https://www.zhihu.com/topic/19606591/hot'
        self.filmNames = []

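    def savaToFile(self, data):
        # Write the collected titles to <path>/filmlist, overwriting any earlier run.
        path = "/home/user/pythonscrapy/"
        # The with-block closes the file even if the write fails; UTF-8 keeps Chinese titles intact.
        with open(path + 'filmlist', 'w', encoding='utf-8') as f:
            f.write(data)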

    # Fetch the topic page and match the URL of each post on it. An example:
    #<h2 class="ContentItem-title"><a href="//zhuanlan.zhihu.com/p/40456756" target="_blank" data-za-detail-view-element_name="Title">那些记忆中挥之不去的童年阴影合集01</a></h2>
    # zhuanlan.zhihu.com/p/40456756 is the address we need.
    def getPageItems(self):
        response = requests.get(self.url, headers=self.headers, proxies=self.proxy)
        # The pattern must use .*? (non-greedy matching); a greedy .* would span several posts in one match.
        titleItems = re.findall(r'<h2.*?class="ContentItem-title">.*?<a href="//(.*?)".*?</a>.*?</h2>', response.text, re.S)
        # Example titleItems:
        #['zhuanlan.zhihu.com/p/40502384', 'zhuanlan.zhihu.com/p/40456756', 'zhuanlan.zhihu.com/p/40459605']
        for item in titleItems:
            print(item)
            self.getItemFilmNames(item)
        # Convert the list to a str before saving it
        self.savaToFile(str(self.filmNames))

    # Collect the film titles mentioned in one post
    def getItemFilmNames(self,item):
        response = requests.get('https://'+item, headers=self.headers, proxies=self.proxy)
        titles = re.findall('《(.*?)》', response.text, re.S)
        # Example titles, from zhuanlan.zhihu.com/p/40459605:
        #['鬼故事', '通灵作弊', '鬼故事', '鬼故事', '通灵作弊', '鬼故事']
        print(titles)
        for title in titles:
            # Deduplicate: skip titles that have already been collected
            if title not in self.filmNames:
                self.filmNames.append(title)

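This only scrapes the first page. The topic page loads more posts when the browser's scroll bar reaches the bottom, so the remaining pages have to be loaded dynamically. Fiddler's support for decrypting HTTPS traffic on Ubuntu is currently poor, so loading the other pages is not implemented yet. I will look into a way of handling the scroll-to-load pages when I have time.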
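
Whatever request the page turns out to send when scrolling, the loop that drives it will probably look similar: ask for one batch after another until nothing comes back. The sketch below shows only that loop. `fetch_page`, `page_size`, and `max_pages` are hypothetical names, the real Zhihu request is not covered here, and a stand-in fetcher backed by a local list with made-up URLs lets the loop be tried offline.

```python
def collect_all_posts(fetch_page, page_size=10, max_pages=50):
    """Call fetch_page(offset, limit) repeatedly until it returns no items."""
    posts = []
    for page in range(max_pages):  # hard cap so a misbehaving source cannot loop forever
        batch = fetch_page(page * page_size, page_size)
        if not batch:
            break
        posts.extend(batch)
    return posts

# Stand-in data: 23 made-up post URLs, not real Zhihu addresses.
fake_posts = ['zhuanlan.zhihu.com/p/%d' % n for n in range(1, 24)]

def fake_fetch(offset, limit):
    # Hypothetical fetcher; a real one would issue an HTTP request instead.
    return fake_posts[offset:offset + limit]

all_posts = collect_all_posts(fake_fetch)
print(len(all_posts))  # 23
```

Passing the fetcher in as a parameter keeps the loop independent of how the pages are actually requested, so it can be reused once the real request is known.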

It is called from another file as follows:

import zhihuspider
if __name__ == '__main__':
    zhihuSpider = zhihuspider.ZhihuSpider()
    zhihuSpider.getPageItems()
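
Because the spider stores the list by converting it with `str()`, the file holds a Python list literal. One way to load it back, sketched here under the assumption that the file is UTF-8 and contains exactly that literal, is `ast.literal_eval`. The temporary directory is used only so the example does not touch the spider's real output path.

```python
import ast
import os
import tempfile

# Two titles taken from the sample output shown in the spider's comments.
film_names = ['鬼故事', '通灵作弊']

# Write the list the same way the spider does: as str(list).
path = os.path.join(tempfile.mkdtemp(), 'filmlist')
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(film_names))

# literal_eval parses Python literals only, so it is safer than eval().
with open(path, encoding='utf-8') as f:
    restored = ast.literal_eval(f.read())

print(restored == film_names)  # True
```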
