Scrapy: Crawling Zhihu


The main goals are:

·       Start from the "How do you evaluate X" topic, crawl its questions, then follow the related questions in a loop

·       For each question, scrape the title, the number of followers, the number of answers, and other data

1    Creating the project

$ scrapy startproject zhihu

New Scrapy project 'zhihu', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-1.3.2-py2.7.egg/scrapy/templates/project', created in:

    /Users/huilinwang/zhihu

 

You can start your first spider with:

    cd zhihu

    scrapy genspider example example.com

2    Writing the spider

Create a file named zhihuspider.py in the /zhihu/zhihu/spiders directory; its full contents are listed later in this document.

2.1    Function def start_requests(self)

    This method must return an iterable with the first Requests to crawl for this spider. It is called only once.

    It is used when no URLs are specified explicitly; if URLs are given (via the start_urls attribute), the default implementation builds the initial requests from them with make_requests_from_url().
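For comparison, here is a minimal sketch of that default behaviour when start_urls is defined and start_requests() is not overridden (the spider name and URL are placeholders, not part of this project):

import scrapy

class ExampleSpider(scrapy.Spider):
    # hypothetical spider: because start_urls is set, the inherited
    # start_requests() generates the initial Requests automatically
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # parse() is the default callback when no callback is given
        self.logger.info("fetched %s", response.url)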

    If you instead want to start by logging in with a POST request, you can do something like this:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

In this spider, start_requests looks like this:

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

Here zhihu_url and headers_dict are defined as shared class attributes.

request_captcha is the callback, which is invoked once the response comes back.

2.1.1 The scrapy.Request class

It is defined in scrapy/http/request/__init__.py; the constructor is:

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,

                 cookies=None, meta=None, encoding='utf-8', priority=0,

                 dont_filter=False, errback=None):

    This module implements the Request class, which is used to represent HTTP requests. See the official documentation (docs/topics/request-response.rst).

    Scrapy uses Request and Response objects for crawling web sites. Request objects are generated in spiders and passed across the system until they reach the downloader, which executes the request and returns a Response object.

    Both Request and Response have subclasses that add functionality not required by the base classes.

    The Request class looks like this:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

    A Request object represents an HTTP request: it is usually generated in a spider, executed by the downloader, and thus produces a Response.

2.1.2    Request parameters

    url (string) – the URL of this request. This attribute is read-only; to change the URL, use replace().

    callback (callable) – the function that will be called with the response of this request (once it has been downloaded). If no callback is specified, the spider's parse() method is used by default. Note that if an exception is raised while processing, errback is called instead.

    method (string) – the HTTP method of this request. Defaults to 'GET'.

    meta (dict) – the initial values for the request.meta attribute. If given, the dict is shallow-copied.

    body (str or unicode) – the request body. If a unicode string is passed, it is encoded to utf-8. If no body is given, an empty string is stored. Regardless of the type of this argument, the final value stored is a str.

    headers (dict) – the headers of this request. Values can be single-valued or multi-valued headers. If None is passed as a value, that HTTP header is not sent at all.

    cookies (dict or list) – the request cookies, which can be sent in two ways.

    Using a dict:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'})

    Using a list of dicts:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies=[{'name': 'currency',
                                             'value': 'USD',
                                             'domain': 'example.com',
                                             'path': '/currency'}])

    Cookies stored this way are only used (sent again) in subsequent requests.

    Some sites return cookies in their responses; these are stored and sent back in later requests, which is the typical behaviour of a regular web browser. If, for some reason, you want to avoid merging with existing cookies, set dont_merge_cookies to True in request.meta.

    Example of a request that does not merge cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

    encoding (string) – the encoding of this request (defaults to 'utf-8'). It is used when converting the URL and body strings to the given encoding.

    priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to define the order in which requests are processed; requests with a higher priority execute earlier, and negative values indicate relatively low priority.

    dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is useful when the same request needs to be scheduled several times and the duplicates filter must be ignored. Use it with care, or you will get into crawling loops. Defaults to False.

    errback (callable) – a function called if any exception is raised while processing the request, including pages that fail with 404 HTTP errors.
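A minimal sketch tying several of these parameters together (the URL, meta key and errback are hypothetical placeholders, not part of this project):

import scrapy

def handle_error(failure):
    # hypothetical errback: failure.request is the Request that failed
    print("request failed: %s" % failure.request.url)

req = scrapy.Request(
    url="http://www.example.com/page",
    method="GET",
    headers={"User-Agent": "my-crawler"},   # sent with the request
    meta={"retries": 0},                    # copied into response.meta
    priority=10,                            # scheduled before priority-0 requests
    dont_filter=True,                       # bypass the duplicates filter
    errback=handle_error,                   # called when processing the request fails
)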

2.2    Function def request_captcha(self, response)

    This is the callback of start_requests: once the downloader has processed the request it returns a response, and this function handles it.

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            # pass _xsrf along so the next callback can read response.meta["_xsrf"]
            meta={"_xsrf": _xsrf}
        )

Here time.time() * 1000 fetches the current time: time.time() returns the current timestamp as floating-point seconds since the Unix epoch (1970), and multiplying by 1000 converts it to milliseconds, which is appended to the captcha URL as a query parameter.
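For example (the resulting value is illustrative only):

import time
print("http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000))
# -> http://www.zhihu.com/captcha.gif?r=1489475123456.78  (illustrative)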

The line

response.css('input[name="_xsrf"]::attr(value)').extract()[0]

extracts the value of the _xsrf field from the HTML source, i.e. from an element like:

<input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>
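The same CSS selector can be tried on a standalone snippet with scrapy.Selector; a minimal sketch (the value is just the example above):

from scrapy.selector import Selector

html = '<input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>'
sel = Selector(text=html)
# ::attr(value) pulls the value attribute of the matching <input>
print(sel.css('input[name="_xsrf"]::attr(value)').extract_first())
# -> fb57ee37dc9bd70821e6ed878bdfe24f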

The function finally hands control to download_captcha as the next callback.

2.3    Function def download_captcha(self, response)

The function is as follows:

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        # 'open' is the macOS command that opens a file with its default viewer
        os.system('open captcha.gif')
        print "Please enter the captcha:\n"
        captcha = raw_input()
        # email, password and self.login_url are assumed to be defined elsewhere in the spider
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

This function is the callback of request_captcha.

Its main addition is the use of scrapy.FormRequest to submit the login form.

2.3.1 FormRequest

    The FormRequest class extends the base Request class. It uses lxml.html forms to pre-populate form fields with data from a Response object.

    The class adds one new argument to the constructor; the remaining arguments are the same as for the Request class.

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

    Returns a new FormRequest object whose form field values are pre-populated from the form found in the given response.

The relevant parameter is:

·      formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.

    In other words, formdata overrides the data found in the form. It is mainly used to simulate an HTML form POST and send key/value pairs.
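For illustration, a minimal sketch of from_response based on the login example in the Scrapy docs (the site, form fields and spider name are hypothetical and not part of this project):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # from_response pre-fills the fields of the form found in the response;
        # formdata overrides (or adds) the username/password fields
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return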

In our spider, the FormRequest's callback then proceeds to request_zhihu.

2.4    Function def request_zhihu(self, response)

The code is as follows:

    def request_zhihu(self, response):
        # self.topic (the topic base URL) is assumed to be defined as a class attribute
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

Crawling starts from https://www.zhihu.com/topic/19760570/hot.

dont_filter=True is set because this request has to be issued repeatedly in the crawl loop and must not be dropped by the duplicates filter.

The callback is get_topic_question.

2.5    Function def get_topic_question(self, response)

The code is as follows:

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        # keep every third link (indices 2, 5, 8, ...)
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

The selector collects the relative links (href attributes) of the questions listed under the topic and stores them in question_urls. Every third URL fragment is then kept, and for each one a new Request is yielded with the fragment joined onto zhihu_url; the callback is parse_question_data.
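The interval extraction in the loop above is equivalent to a simple slice; a minimal sketch with made-up URLs:

question_urls = ["/question/100", "/question/100", "/question/100",
                 "/question/200", "/question/200", "/question/200"]
# k starts at -1 and grows by 3, so indices 2, 5, 8, ... are appended,
# which is the same as taking every third element starting at index 2:
temp = question_urls[2::3]
print(temp)   # ['/question/100', '/question/200']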

2.6    Function def parse_question_data(self, response)

This is the last function of the spider; it looks like this:

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
            print item
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

The crawl thus keeps looping over related questions until there is nothing new left.

3    Editing pipelines.py

The pipeline writes the scraped items into a database.

3.1    Function def open_spider(self, spider)

This method is called when the spider is opened.

3.2    Function def process_item(self, item, spider)

   This method is called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any further pipeline components.
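For illustration, a minimal sketch of a hypothetical pipeline (not part of this project) that drops questions scraped without a title:

from scrapy.exceptions import DropItem

class ValidateQuestionPipeline(object):
    # hypothetical pipeline: discard items that have no title
    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("missing title in %s" % item)
        return item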

3.3    Editing settings.py

Set ROBOTSTXT_OBEY = False.
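The pipeline from pipelines.py also has to be enabled in settings.py before it will run; a minimal sketch (the priority value 300 is an arbitrary choice, and the import path assumes the default project layout zhihu/pipelines.py):

# settings.py
ROBOTSTXT_OBEY = False

# enable the MySQL pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}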

 

4    items.py contents

import scrapy

class ZhihuItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    qid = scrapy.Field()

    title = scrapy.Field()

    followers_num = scrapy.Field()

    answers_num = scrapy.Field()

    visitsCount = scrapy.Field()

    topic_views = scrapy.Field()

    topic_tag0 = scrapy.Field()

    topic_tag1 = scrapy.Field()

    topic_tag2 = scrapy.Field()
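An Item behaves like a dict restricted to the declared fields; a quick illustrative sketch (run inside the project, e.g. in scrapy shell, after importing ZhihuItem; the values are made up):

item = ZhihuItem()
item["title"] = u"How do you evaluate X?"
item["answers_num"] = "42"
print(dict(item))   # {'title': u'How do you evaluate X?', 'answers_num': '42'}
# assigning to an undeclared field raises KeyError, e.g. item["unknown"] = 1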

 

5    Spider contents

#coding=utf-8

import scrapy

import os

import time

import re

import json

 

from ..items import ZhihuItem

 

class zhihutopicSpider(scrapy.Spider):
    # "zhihu" matches the name used by "scrapy crawl zhihu" below
    name = "zhihu"
    zhihu_url = "https://www.zhihu.com"
    # login_url, topic, email and password are referenced below and are
    # assumed to be defined as additional class attributes
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            # pass _xsrf along so download_captcha can read response.meta["_xsrf"]
            meta={"_xsrf": _xsrf}
        )

 

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

 

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

 

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

 

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

6    Pipeline contents

import MySQLdb

class ZhihuPipeline(object):
    print "\n\n\n\n\n\n\n\n"
    sql_questions = (
            "INSERT INTO questions("
            "qid, title, answers_num, followers_num, visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
            "VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')")
    count = 0

 

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')
        print "\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n"
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
                "qid VARCHAR (100) NOT NULL,"
                "title varchar(100),"
                "answers_num INT(11),"
                "followers_num INT(11) NOT NULL,"
                "visitsCount INT(11),"
                "topic_views INT(11),"
                "topic_tag0 VARCHAR (600),"
                "topic_tag1 VARCHAR (600),"
                "topic_tag2 VARCHAR (600),"
                "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print "\n\nTABLES ARE READY!\n\n"

 

    def process_item(self, item, spider):
        sql = self.sql_questions % (item["qid"], item["title"], item["answers_num"], item["followers_num"],
                                    item["visitsCount"], item["topic_views"], item["topic_tag0"], item["topic_tag1"], item["topic_tag2"])
        self.cursor.execute(sql)
        if self.count % 10 == 0:
            self.conn.commit()
        self.count += 1
        print item["qid"] + " DATA COLLECTED!"
        # process_item must return the item so later pipeline components can use it
        return item
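Because rows are committed only every ten items, anything inserted after the last commit could be lost when the crawl stops. A minimal sketch (not part of the original pipeline) of a close_spider hook that flushes and closes the connection:

    def close_spider(self, spider):
        # commit any rows still pending and release the MySQL connection
        self.conn.commit()
        self.cursor.close()
        self.conn.close()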

 

7    Running

scrapy crawl zhihu

 

8    About anti-crawling and robots.txt

    robots.txt (always lowercase) is an ASCII text file placed in the root directory of a web site. It tells crawlers which parts of the site should not be fetched by search-engine spiders and which parts may be fetched. robots.txt is a gentlemen's agreement rather than a standard or specification; some crawlers honor it while others do not. Scrapy obeys the robots.txt protocol by default, which is why ROBOTSTXT_OBEY must be set to False in settings.py before you can crawl a site whose robots.txt excludes your spider.

 

 

 

           

