Scrapy: Crawling Zhihu


The main goals are:

·       Start from the "How do you evaluate X" topic, crawl its questions, then follow the related questions in a loop

·       For each question, scrape the title, the number of followers, the number of answers, and other data

1    Creating the project

$ scrapy startproject zhihu

New Scrapy project 'zhihu', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-1.3.2-py2.7.egg/scrapy/templates/project', created in:

    /Users/huilinwang/zhihu

 

You can start your first spider with:

    cd zhihu

    scrapy genspider example example.com

2    Writing the spider

Create a file named zhihuspider.py in the /zhihu/zhihu/spiders directory; its full contents are listed later in this document.

2.1    Function def start_requests(self)

    This method must return an iterable with the first Requests to crawl for this spider. It is called only once.

    It is used when no URLs are specified explicitly; if URLs are given (via the start_urls attribute), the default implementation builds the initial requests from them with make_requests_from_url().
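For comparison, here is a minimal sketch of that default behaviour when start_urls is defined and start_requests() is not overridden (the spider name and URL are placeholders, not part of this project):

import scrapy

class ExampleSpider(scrapy.Spider):
    # hypothetical spider: because start_urls is set, the inherited
    # start_requests() generates the initial Requests automatically
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # parse() is the default callback when no callback is given
        self.logger.info("fetched %s", response.url)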

    If you instead want to start by logging in with a POST request, you can do something like this:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

In this spider, start_requests looks like this:

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

Here zhihu_url and headers_dict are defined as shared class attributes.

request_captcha is the callback, which is invoked once the response comes back.

2.1.1 The scrapy.Request class

It is defined in scrapy/http/request/__init__.py; the constructor is:

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,

                 cookies=None, meta=None, encoding='utf-8', priority=0,

                 dont_filter=False, errback=None):

    This module implements the Request class, which is used to represent HTTP requests. See the official documentation (docs/topics/request-response.rst).

    Scrapy uses Request and Response objects for crawling web sites. Request objects are generated in spiders and passed across the system until they reach the downloader, which executes the request and returns a Response object.

    Both Request and Response have subclasses that add functionality not required by the base classes.

    The Request class looks like this:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

    A Request object represents an HTTP request: it is usually generated in a spider, executed by the downloader, and thus produces a Response.

2.1.2    Request parameters

    url (string) – the URL of this request. This attribute is read-only; to change the URL, use replace().

    callback (callable) – the function that will be called with the response of this request (once it has been downloaded). If no callback is specified, the spider's parse() method is used by default. Note that if an exception is raised while processing, errback is called instead.

    method (string) – the HTTP method of this request. Defaults to 'GET'.

    meta (dict) – the initial values for the request.meta attribute. If given, the dict is shallow-copied.

    body (str or unicode) – the request body. If a unicode string is passed, it is encoded to utf-8. If no body is given, an empty string is stored. Regardless of the type of this argument, the final value stored is a str.

    headers (dict) – the headers of this request. Values can be single-valued or multi-valued headers. If None is passed as a value, that HTTP header is not sent at all.

    cookies (dict or list) – the request cookies, which can be sent in two ways.

    Using a dict:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'})

    Using a list of dicts:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies=[{'name': 'currency',
                                             'value': 'USD',
                                             'domain': 'example.com',
                                             'path': '/currency'}])

    Cookies stored this way are only used (sent again) in subsequent requests.

    Some sites return cookies in their responses; these are stored and sent back in later requests, which is the typical behaviour of a regular web browser. If, for some reason, you want to avoid merging with existing cookies, set dont_merge_cookies to True in request.meta.

    Example of a request that does not merge cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

    encoding (string) – the encoding of this request (defaults to 'utf-8'). It is used when converting the URL and body strings to the given encoding.

    priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to define the order in which requests are processed; requests with a higher priority execute earlier, and negative values indicate relatively low priority.

    dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is useful when the same request needs to be scheduled several times and the duplicates filter must be ignored. Use it with care, or you will get into crawling loops. Defaults to False.

    errback (callable) – a function called if any exception is raised while processing the request, including pages that fail with 404 HTTP errors.
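A minimal sketch tying several of these parameters together (the URL, meta key and errback are hypothetical placeholders, not part of this project):

import scrapy

def handle_error(failure):
    # hypothetical errback: failure.request is the Request that failed
    print("request failed: %s" % failure.request.url)

req = scrapy.Request(
    url="http://www.example.com/page",
    method="GET",
    headers={"User-Agent": "my-crawler"},   # sent with the request
    meta={"retries": 0},                    # copied into response.meta
    priority=10,                            # scheduled before priority-0 requests
    dont_filter=True,                       # bypass the duplicates filter
    errback=handle_error,                   # called when processing the request fails
)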

2.2    Function def request_captcha(self, response)

    This is the callback of start_requests: once the downloader has processed the request it returns a response, and this function handles it.

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            # pass _xsrf along so the next callback can read response.meta["_xsrf"]
            meta={"_xsrf": _xsrf}
        )

Here time.time() * 1000 fetches the current time: time.time() returns the current timestamp as floating-point seconds since the Unix epoch (1970), and multiplying by 1000 converts it to milliseconds, which is appended to the captcha URL as a query parameter.
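For example (the resulting value is illustrative only):

import time
print("http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000))
# -> http://www.zhihu.com/captcha.gif?r=1489475123456.78  (illustrative)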

The line

response.css('input[name="_xsrf"]::attr(value)').extract()[0]

extracts the value of the _xsrf field from the HTML source, i.e. from an element like:

<input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>
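The same CSS selector can be tried on a standalone snippet with scrapy.Selector; a minimal sketch (the value is just the example above):

from scrapy.selector import Selector

html = '<input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>'
sel = Selector(text=html)
# ::attr(value) pulls the value attribute of the matching <input>
print(sel.css('input[name="_xsrf"]::attr(value)').extract_first())
# -> fb57ee37dc9bd70821e6ed878bdfe24f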

The function finally hands control to download_captcha as the next callback.

2.3    Function def download_captcha(self, response)

The function is as follows:

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        # 'open' is the macOS command that opens a file with its default viewer
        os.system('open captcha.gif')
        print "Please enter the captcha:\n"
        captcha = raw_input()
        # email, password and self.login_url are assumed to be defined elsewhere in the spider
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

This function is the callback of request_captcha.

Its main addition is the use of scrapy.FormRequest to submit the login form.

2.3.1 FormRequest

    The FormRequest class extends the base Request class. It uses lxml.html forms to pre-populate form fields with data from a Response object.

    The class adds one new argument to the constructor; the remaining arguments are the same as for the Request class.

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

    Returns a new FormRequest object whose form field values are pre-populated from the form found in the given response.

The relevant parameter is:

·      formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.

    In other words, formdata overrides the data found in the form. It is mainly used to simulate an HTML form POST and send key/value pairs.
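For illustration, a minimal sketch of from_response based on the login example in the Scrapy docs (the site, form fields and spider name are hypothetical and not part of this project):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # from_response pre-fills the fields of the form found in the response;
        # formdata overrides (or adds) the username/password fields
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return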

In our spider, the FormRequest's callback then proceeds to request_zhihu.

2.4    Function def request_zhihu(self, response)

The code is as follows:

    def request_zhihu(self, response):
        # self.topic (the topic base URL) is assumed to be defined as a class attribute
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

Crawling starts from https://www.zhihu.com/topic/19760570/hot.

dont_filter=True is set because this request has to be issued repeatedly in the crawl loop and must not be dropped by the duplicates filter.

The callback is get_topic_question.

2.5    Function def get_topic_question(self, response)

The code is as follows:

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        # keep every third link (indices 2, 5, 8, ...)
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

The selector collects the relative links (href attributes) of the questions listed under the topic and stores them in question_urls. Every third URL fragment is then kept, and for each one a new Request is yielded with the fragment joined onto zhihu_url; the callback is parse_question_data.
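The interval extraction in the loop above is equivalent to a simple slice; a minimal sketch with made-up URLs:

question_urls = ["/question/100", "/question/100", "/question/100",
                 "/question/200", "/question/200", "/question/200"]
# k starts at -1 and grows by 3, so indices 2, 5, 8, ... are appended,
# which is the same as taking every third element starting at index 2:
temp = question_urls[2::3]
print(temp)   # ['/question/100', '/question/200']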

2.6    Function def parse_question_data(self, response)

This is the last function of the spider; it looks like this:

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
            print item
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

The crawl thus keeps looping over related questions until there is nothing new left.

3    Editing pipelines.py

The pipeline writes the scraped items into a database.

3.1    Function def open_spider(self, spider)

This method is called when the spider is opened.

3.2    Function def process_item(self, item, spider)

   This method is called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any further pipeline components.
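For illustration, a minimal sketch of a hypothetical pipeline (not part of this project) that drops questions scraped without a title:

from scrapy.exceptions import DropItem

class ValidateQuestionPipeline(object):
    # hypothetical pipeline: discard items that have no title
    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("missing title in %s" % item)
        return item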

3.3    Editing settings.py

Set ROBOTSTXT_OBEY = False.
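The pipeline from pipelines.py also has to be enabled in settings.py before it will run; a minimal sketch (the priority value 300 is an arbitrary choice, and the import path assumes the default project layout zhihu/pipelines.py):

# settings.py
ROBOTSTXT_OBEY = False

# enable the MySQL pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}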

 

4    items.py contents

import scrapy

class ZhihuItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    qid = scrapy.Field()

    title = scrapy.Field()

    followers_num = scrapy.Field()

    answers_num = scrapy.Field()

    visitsCount = scrapy.Field()

    topic_views = scrapy.Field()

    topic_tag0 = scrapy.Field()

    topic_tag1 = scrapy.Field()

    topic_tag2 = scrapy.Field()
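An Item behaves like a dict restricted to the declared fields; a quick illustrative sketch (run inside the project, e.g. in scrapy shell, after importing ZhihuItem; the values are made up):

item = ZhihuItem()
item["title"] = u"How do you evaluate X?"
item["answers_num"] = "42"
print(dict(item))   # {'title': u'How do you evaluate X?', 'answers_num': '42'}
# assigning to an undeclared field raises KeyError, e.g. item["unknown"] = 1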

 

5    Spider contents

#coding=utf-8

import scrapy

import os

import time

import re

import json

 

from ..items import ZhihuItem

 

class zhihutopicSpider(scrapy.Spider):
    # "zhihu" matches the name used by "scrapy crawl zhihu" below
    name = "zhihu"
    zhihu_url = "https://www.zhihu.com"
    # login_url, topic, email and password are referenced below and are
    # assumed to be defined as additional class attributes
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            # pass _xsrf along so download_captcha can read response.meta["_xsrf"]
            meta={"_xsrf": _xsrf}
        )

 

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

 

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

 

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

 

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                     headers=self.headers_dict,
                     callback=self.parse_question_data)

6    Pipeline contents

import MySQLdb

class ZhihuPipeline(object):
    print "\n\n\n\n\n\n\n\n"
    sql_questions = (
            "INSERT INTO questions("
            "qid, title, answers_num, followers_num, visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
            "VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')")
    count = 0

 

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')
        print "\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n"
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
                "qid VARCHAR (100) NOT NULL,"
                "title varchar(100),"
                "answers_num INT(11),"
                "followers_num INT(11) NOT NULL,"
                "visitsCount INT(11),"
                "topic_views INT(11),"
                "topic_tag0 VARCHAR (600),"
                "topic_tag1 VARCHAR (600),"
                "topic_tag2 VARCHAR (600),"
                "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print "\n\nTABLES ARE READY!\n\n"

 

    def process_item(self, item, spider):
        sql = self.sql_questions % (item["qid"], item["title"], item["answers_num"], item["followers_num"],
                                    item["visitsCount"], item["topic_views"], item["topic_tag0"], item["topic_tag1"], item["topic_tag2"])
        self.cursor.execute(sql)
        if self.count % 10 == 0:
            self.conn.commit()
        self.count += 1
        print item["qid"] + " DATA COLLECTED!"
        # process_item must return the item so later pipeline components can use it
        return item
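Because rows are committed only every ten items, anything inserted after the last commit could be lost when the crawl stops. A minimal sketch (not part of the original pipeline) of a close_spider hook that flushes and closes the connection:

    def close_spider(self, spider):
        # commit any rows still pending and release the MySQL connection
        self.conn.commit()
        self.cursor.close()
        self.conn.close()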

 

7    Running

scrapy crawl zhihu

 

8    About anti-crawling and robots.txt

    robots.txt (always lowercase) is an ASCII text file placed in the root directory of a web site. It tells crawlers which parts of the site should not be fetched by search-engine spiders and which parts may be fetched. robots.txt is a gentlemen's agreement rather than a standard or specification; some crawlers honor it while others do not. Scrapy obeys the robots.txt protocol by default, which is why ROBOTSTXT_OBEY must be set to False in settings.py before you can crawl a site whose robots.txt excludes your spider.

 

 

 

           

