A Detailed Guide to the Scrapy Crawler Framework

Table of Contents

I. Understanding the Scrapy framework

II. Creating a Scrapy project

III. The crawling workflow

IV. Related knowledge points

1. Two ways Scrapy saves data to files

2. Debugging code with the Scrapy shell

3. The settings.py file explained

4. XPath introduction

5. URL joining

6. Proxy IPs with username/password authentication

7. Asynchronous requests in Scrapy

8. Getting past anti-crawler restrictions with Scrapy

9. Simulating a user login with FormRequest.from_response()

10. Request and Response: parameters and methods

11. Distributed crawling with scrapy_redis


Scrapy is a crawler framework that handles requests asynchronously, which makes crawling fast. Of course, you can also scrape data directly with libraries such as requests, BeautifulSoup, and urllib.

But if you need to crawl the same kind of data from many different websites, using those libraries quickly becomes cumbersome and messy; with a framework the code stays clear and concise.

I spent some time learning Scrapy without digging very deep into its internals, which is generally enough for everyday use. Here is a summary:

I. Understanding the Scrapy framework

First, get to know the framework. It consists mainly of the following five components:

Scheduler: receives requests sent from the engine and puts them into a queue.

Downloader: takes the requests that the engine pulled from the scheduler and downloads them from the Internet.

Spider: parses the Response objects that the engine received from the downloader and returns either extracted data or new URL requests back to the engine.

Item Pipeline: receives the data that the engine got from the Spider and saves it. The same data can be stored by several different pipelines.

Scrapy Engine (the brain): all of the operations above are scheduled and dispatched by the engine. The other four components interact only with the engine, never with each other, which is what keeps them decoupled.

If you want to learn Scrapy, keep these five components clearly in mind.

II. Creating a Scrapy project

(1) Create a project

scrapy startproject <project name>

eg:  scrapy startproject  book

(2) Create a spider (first cd into the project you just created)

cd <project name>

scrapy genspider <spider name> <allowed domain>    # note: the spider name must be unique; it cannot be the same as any other spider's

eg:   cd book

        scrapy genspider dangdang book.dangdang.com

(3) Run the spider

scrapy crawl <spider name>

eg:   scrapy crawl  dangdang 

 

Once the project has been created it is ready to use; for how to configure it, see Part IV.
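For orientation, `scrapy startproject book` generates a layout roughly like the following (the exact files can vary a little between Scrapy versions):

book/
    scrapy.cfg            # deployment configuration
    book/
        __init__.py
        items.py          # Item definitions (data containers)
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines (saving the data)
        settings.py       # project settings
        spiders/          # spiders created with `scrapy genspider`
            __init__.py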

 

III. The crawling workflow

(1) The program first runs the spider file you just created under the spiders/ directory; by default it is generated as:

import scrapy

class DangdangSpider(scrapy.Spider):
    name = 'dangdang'       # spider name
    allowed_domains = ['book.dangdang.com', 'dangdang.com']     # allowed domains
    start_urls = ['http://book.dangdang.com/']      # URLs to start crawling from

    def parse(self, response):
        print('Start spider....')

This is the first step of the spider: the URLs in start_urls are requested first. The requests go (via the engine) to the scheduler and then to the downloader, which downloads the pages; the responses come back to the spider's parse callback. (Strictly speaking the components never talk to each other directly, but this short version is easier to follow.)

The parse method parses the response: from it you either return more URL requests, which go back to the scheduler, or extracted data, which goes to the item pipeline. The engine decides where things go depending on whether you returned data or a request.
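For illustration, a parse callback that yields both kinds of objects might look roughly like this (the selectors and the pagination link are made up; .get() needs a reasonably recent Scrapy, older versions use .extract_first()):

# inside the spider class
def parse(self, response):
    # yielding data sends it to the item pipeline
    for book in response.xpath('//li[@class="book"]'):
        yield {'name': book.xpath('./a/@title').get()}

    # yielding a Request sends it back to the scheduler
    next_page = response.xpath('//a[@class="next"]/@href').get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)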

(2) If the return value of parse in step one is a URL request, the engine hands the request to the scheduler, which collects the requests it receives into a queue. The engine then dispatches them, in order, to the downloader.

(3) If the return value of parse is data, the engine passes the data to the item pipeline, and the pipeline saves it to a file.

#  -- item pipeline --
class BookPipeline(object):
    def open_spider(self, spider):  # open the file handle
        # self.txt_channel = open('txt_save', 'a+', encoding='utf-8')
        self.txt_channel = open('txt_save1', 'a+', encoding='utf-8')

    def process_item(self, item, spider):   # save the data
        self.txt_channel.write(str(item) + '\n')
        self.txt_channel.flush()    # flush the buffer so the data actually reaches the file
        return item

    def close_spider(self, spider):     # close the file handle
        self.txt_channel.close()

For the pipeline to take effect, it must be registered in settings.py:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'book.pipelines.BookPipeline': 300,    # the number is the priority: the smaller, the higher
}

(4) Middleware. There are two kinds:

spider middleware (intercepts traffic between the engine and the Spider)

downloader middleware (intercepts traffic between the engine and the Downloader)

(5) The items.py file is really just a constraint on the data you extract: by defining an Item class you make the data handling more explicit. (Honestly it is not strictly necessary; storing the data in a plain dictionary works just as well.)

import scrapy

class BookItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    pass

Of course, this class can be used anywhere; import it with `from <project name>.items import BookItem`. Most of the time, though, it is used inside the spider file's parse callback, because that is where the data is produced and packed up.
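A small sketch of using the item inside the spider's callback (the selector is made up; `book` is the example project name used above):

from book.items import BookItem

# inside the spider class
def parse(self, response):
    item = BookItem()
    item['name'] = response.xpath('//h1/text()').get()   # illustrative selector only
    yield item        # handed to the item pipeline by the engine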

IV. Related knowledge points

1. Two ways Scrapy saves data to files

Method 1: the plain way (save to a file directly from the command line)
    Format:
        scrapy crawl itcast -o teachers.csv -s FEED_EXPORT_ENCODING=UTF-8
    Details:
        FEED_EXPORT_ENCODING: the encoding used when saving

Method 2: use pipelines (see the item pipeline example in Part III)
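The -o feed export is not limited to CSV; the file extension selects the exporter. A couple of standard variants, shown here as examples:

scrapy crawl itcast -o teachers.json     # JSON
scrapy crawl itcast -o teachers.jl       # JSON lines, one item per line
scrapy crawl itcast -o teachers.xml      # XML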

2. Debugging code with the Scrapy shell

Entering the shell:
    scrapy shell "http://www.itcast.cn/channel/teacher.shtml" 
-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"

Once inside, work with the objects listed in the startup banner. If you are unsure what an object offers, inspect it with help(<object>) or the shell's own shelp().
Anything you would normally do during a Scrapy crawl can also be done here, which is why the shell is mostly used for debugging.
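A few things commonly typed inside the shell (shelp(), fetch() and view() are standard shell helpers; the XPath expression is just an example):

>>> shelp()                                  # list the available objects and shortcuts
>>> response.xpath('//title/text()').get()   # try out a selector against the fetched page
>>> fetch('http://www.itcast.cn/')           # download another page into `response`
>>> view(response)                           # open the current response in a browser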

 

 Reference: https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/shell.html#scrapy

3. The settings.py file explained

# 1. The crawler (bot) name — not the name in the spider's `name` attribute,
# but the name of the whole crawler project.
# Many websites run crawlers of their own (Baidu, Google, etc. all do).
BOT_NAME = 'scrapy_learn'

# 2. Where the spider modules live
SPIDER_MODULES = ['scrapy_learn.spiders']
NEWSPIDER_MODULE = 'scrapy_learn.spiders'

# 3. Client User-Agent request header, usually faked to look like a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'

# 4. Whether to obey robots.txt. Well-behaved crawlers obey it; ours usually does not.
ROBOTSTXT_OBEY = False

# 5. Number of concurrent requests, default 16
CONCURRENT_REQUESTS = 32

# 6. Download delay in seconds, default 0
DOWNLOAD_DELAY = 3

# 7. Concurrent requests per domain; the download delay is also applied per domain.
# A finer-grained limit than CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Concurrent requests per IP. If set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
# and the download delay is applied per IP instead.
CONCURRENT_REQUESTS_PER_IP = 16

# 8. Whether cookies are supported (handled with a cookiejar); enabled by default
COOKIES_ENABLED = True
# Debug mode: every cookie received is logged
COOKIES_DEBUG = True

# 9. Telnet console, for inspecting the running crawler (how much has been crawled,
# how much is left, ...) and controlling it (pausing, ...).
# From cmd: telnet 127.0.0.1 6023 (6023 is the port reserved for the crawler)
# telnet commands:
#   est()           show engine status
#   engine.pause()  pause the engine; many more commands can be found online
TELNETCONSOLE_ENABLED = True

# 10. Default request headers
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}

# Middlewares need a detailed explanation of their own; that will be a separate write-up
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_learn.middlewares.ScrapyLearnSpiderMiddleware': 543,
}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_learn.middlewares.ScrapyLearnDownloaderMiddleware': 543,
}

# 11. Pipelines that process the scraped items
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'scrapy_learn.pipelines.ScrapyLearnPipeline': 300,
}

# 12. Custom extensions, invoked via signals
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

# AutoThrottle (adaptive request throttling)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# Initial download delay, in seconds
AUTOTHROTTLE_START_DELAY = 5
# Maximum download delay
AUTOTHROTTLE_MAX_DELAY = 60
# Target concurrency; usually fine to leave at the default
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

# HTTP caching — a topic for another time
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# 13. Maximum crawl depth allowed; the current depth can be read from meta. 0 means no depth limit.
DEPTH_LIMIT = 4

# DEPTH_PRIORITY can only be set to 0 or 1:
#   0 — depth-first: follow one branch all the way down, then move on to the others
#   1 — breadth-first: crawl level by level
# Internally both work off the depth value stored in response.meta.
DEPTH_PRIORITY = 0

# Default downloader middlewares, listed here for reference
# (the scrapy.contrib paths below come from older Scrapy versions):
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }
# How to read variables from the settings file

from scrapy.utils.project import get_project_settings
settings = get_project_settings()
 
user_agents = settings['USER_AGENTS']   # USER_AGENTS here is a custom setting defined by the project, not built into Scrapy

4. XPath introduction

In Scrapy, precise element lookups are done with the xpath method of scrapy.Selector (usually via response.xpath()); you can also query with CSS selectors through response.css(). A comparison table and a short usage sketch follow.

 

 

The XPath and CSS equivalents compare as follows:

| Description | XPath | CSS path |
| --- | --- | --- |
| Direct child elements | //div/a | div > a |
| Child or descendant elements | //div//a | div a |
| Locate by id | //div[@id='idValue']//a | div#idValue a |
| Locate by class | //div[@class='classValue']//a | div.classValue a |
| Next sibling element | //ul/li[@class='first']/following-sibling::li | ul>li.first + li |
| Attribute | //form/input[@name='username'] | form input[name='username'] |
| Multiple attributes | //input[@name='continue' and @type='button'] | input[name='continue'][type='button'] |
| The 4th child element | //ul[@id='list']/li[4] | ul#list li:nth-child(4) |
| The first child element | //ul[@id='list']/li[1] | ul#list li:first-child |
| The last child element | //ul[@id='list']/li[last()] | ul#list li:last-child |
| Attribute contains a substring | //div[contains(@title,'Title')] | div[title*="Title"] |
| Attribute starts with a substring | //input[starts-with(@name,'user')] | input[name^="user"] |
| Attribute ends with a substring | //input[ends-with(@name,'name')] | input[name$="name"] |
| Text contains a substring | //div[contains(text(), 'text')] | no CSS equivalent |
| Element has a given attribute | //div[@title] | div[title] |
| Parent element | //div/.. | no CSS equivalent |
| Preceding sibling node | //li/preceding-sibling::div[1] | no CSS equivalent |
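A small sketch of how these selectors are typically used inside a Scrapy callback (the expressions themselves are just examples):

# inside the spider class
def parse(self, response):
    # XPath: href of every <a> under the div with id="list"
    links = response.xpath('//div[@id="list"]//a/@href').getall()
    # CSS: the same query written as a CSS selector
    links_css = response.css('div#list a::attr(href)').getall()
    # .get()/.getall() return str / list of str; older Scrapy versions use .extract_first()/.extract()
    title = response.xpath('//title/text()').get()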

5. URL joining

import urllib.parse

next_url = urllib.parse.urljoin(response.url, next_url)   # response.url carries the domain; next_url is the relative URL without it
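Scrapy responses offer the same thing as a shortcut, and response.follow (available since Scrapy 1.4) builds the Request directly from a relative URL; a quick sketch:

# inside a spider callback
next_url = response.urljoin(next_url)                  # equivalent to urllib.parse.urljoin(response.url, next_url)
yield response.follow(next_url, callback=self.parse)   # resolves the relative URL and creates the Request in one step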

 

6. Proxy IPs with username/password authentication

(1) Define the middleware in middlewares.py:

# Example of a proxy IP that needs username/password authentication
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # pick a proxy (normally chosen at random from a pool)
        proxy = "xxx.xxx.xxx.xxx:port"
        # set the proxy authentication header
        auth = base64.b64encode(bytes("USERNAME:PASSWORD", 'utf-8'))
        request.headers['Proxy-Authorization'] = b'Basic ' + auth
        # set the proxy (http/https)
        request.meta['proxy'] = 'http://' + proxy


(2) Enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'MySpider.middlewares.ProxyMiddleware': 200,
}


(3) Run scrapy crawl and check the result.

7. Asynchronous requests in Scrapy

Reference: https://blog.csdn.net/topleeyap/article/details/79209816

8. Getting past anti-crawler restrictions with Scrapy

(1) Random User-Agent rotation

Replace the User-Agent on every request.

Install the helper library:

pip install fake-useragent

settings:

DOWNLOADER_MIDDLEWARES = {
   # 'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
   'ArticleSpider.middlewares.RandomUserAgentMiddleware': 400,
}

middlewares:

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):

    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        # Read RANDOM_UA_TYPE from settings; if it is not set, default to 'random'.
        # Possible values: random, ie, chrome, firefox, safari, opera, msie
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            """Pick a User-Agent according to the RANDOM_UA_TYPE setting."""
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
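RANDOM_UA_TYPE is a custom setting used by this middleware, not something built into Scrapy; if you want to control the browser family, add it to settings.py:

RANDOM_UA_TYPE = 'random'   # or 'chrome', 'firefox', 'ie', ...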

 

 

(2) Proxy IPs

Option one: free proxies

Write a custom function that fetches some free proxy IPs from the web.

settings:

DOWNLOADER_MIDDLEWARES = {
      'ArticleSpider.middlewares.RandomProxyMiddleware': 400,
}

middlewares:

class RandomProxyMiddleware(object):
    # set a proxy IP dynamically for each request
    def process_request(self, request, spider):
        request.meta["proxy"] = get_random_ip()   # a custom function that returns a random proxy ip:port
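get_random_ip() is not part of Scrapy; a minimal sketch of such a helper, assuming you maintain your own proxy pool (the addresses are placeholders):

import random

# hypothetical proxy pool; in practice filled from a free-proxy site or a database
PROXY_POOL = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:3128',
]

def get_random_ip():
    return random.choice(PROXY_POOL)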

Option two: paid services

See scrapy-proxies and similar projects on GitHub.

(3) Automatic throttling (AutoThrottle)

AUTOTHROTTLE_ENABLED
Default: False

Enables the AutoThrottle extension.

AUTOTHROTTLE_START_DELAY
Default: 5.0

The initial download delay, in seconds.

AUTOTHROTTLE_MAX_DELAY
Default: 60.0

The maximum download delay to use under high latency, in seconds.

AUTOTHROTTLE_DEBUG
Default: False

Enables AutoThrottle debug mode, which shows the stats of every response received so you can watch the throttling parameters being adjusted in real time.

(4) Logging in and fetching pages through Selenium

For the basics of Selenium, see this other post: https://blog.csdn.net/feifeiyechuan/article/details/84755216 (quite thorough).

So how do you plug Selenium into Scrapy?

settings:

DOWNLOADER_MIDDLEWARES = {
      'ArticleSpider.middlewares.JSPageMiddleware': 1,
}

middlewares:

from selenium import webdriver
from scrapy.http import HtmlResponse
import time


class JSPageMiddleware(object):

    def __init__(self):
        # keep the browser on self so only one browser is opened and shared by all spiders
        self.browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
        super(JSPageMiddleware, self).__init__()

    # fetch dynamic pages through Chrome
    def process_request(self, request, spider):
        if spider.name == "jobbole":
            # self.browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
            self.browser.get(request.url)
            time.sleep(1)
            print("Visiting: {0}".format(request.url))
            # browser.quit()
            # returning an HtmlResponse here skips the downloader: Scrapy uses this response directly
            return HtmlResponse(url=self.browser.current_url, body=self.browser.page_source,
                                encoding="utf-8", request=request)

Advantage: much harder for anti-crawler measures to detect and block.

Disadvantage: Selenium runs synchronously, so it is inefficient.
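One practical detail the snippet above leaves out is that the browser is never closed. A common pattern (sketched here as an assumption, not taken from the original code) is to hook the spider_closed signal; the relevant additions to JSPageMiddleware would look roughly like this:

from scrapy import signals

class JSPageMiddleware(object):
    # ... __init__ and process_request as above ...

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # quit the browser when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_closed(self, spider):
        self.browser.quit()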

9. Simulating a user login with FormRequest.from_response():

classmethod from_response(response[, formname=None, formid=None, formnumber=0, 
formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object whose form field values are pre-populated from the HTML <form> element found in the given response.

By default the method simulates a click on any form control that looks clickable, such as <input type="submit">. Even though this is convenient and often the desired behaviour, it sometimes causes problems that are hard to debug; for example, for forms that are filled in and/or submitted with JavaScript, the default from_response() behaviour may not be appropriate. To disable it, set the dont_click argument to True. If you instead want to change which control is clicked (rather than disabling the click), use the clickdata argument.

-- in the spider file -- 
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},   # the pre-filled username and password
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

10. Request and Response: parameters and methods

The Request object

A Request object represents an HTTP request. It is usually created in the spider and executed by the downloader, which then returns a Response.

Basic parameters:

url — the URL to request

callback — the function that handles the response that comes back, also known as the callback function

meta — used to pass data between "pages"

  • meta is a dict, used mainly to pass values between parsing callbacks
  • For example: parse() extracts values for some of an item's fields and also finds a new URL; the remaining fields have to be extracted from that new URL's response, so you define a parse_item() callback to handle it. When sending the request for the new URL, use parse_item() as the callback and pass the already-extracted item fields along through meta, so parse_item() can read them from its response (see the sketch just after this parameter list)
  • Request accepts a meta argument (a dict), and Response has a meta attribute from which you can read the meta passed in by the corresponding request
  • Once this parameter is set, the dict passed in is shallow-copied

headers — the request headers

cookies — the cookies to set for the page
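A minimal sketch of the meta hand-off described above (the field names and selectors are made up):

# inside the spider class
def parse(self, response):
    item = {'name': response.xpath('//h1/text()').get()}
    detail_url = response.urljoin(response.xpath('//a[@class="detail"]/@href').get())
    # pass the partially-filled item to the next callback through meta
    yield scrapy.Request(detail_url, callback=self.parse_item, meta={'item': item})

def parse_item(self, response):
    item = response.meta['item']          # read back what parse() passed along
    item['price'] = response.xpath('//span[@class="price"]/text()').get()
    yield item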

More advanced parameters:

encoding — the encoding to use for the request

priority — the priority of the request

  • Higher-priority requests are crawled first, but the priority cannot be serialized
  • Serialization: converting an object's state into a form that can be stored or transmitted. During serialization the object writes its current state to temporary or persistent storage; the object can later be recreated by reading (deserializing) that state back

dont_filter — force the request not to be filtered 
Scrapy deduplicates request URLs; setting dont_filter tells it to exempt this URL from deduplication.

errback — the error callback 
errback is well suited to inspecting and logging errors raised by a request, but not to retrying the request.

Request object methods

copy(): returns an identical copy of the object 
replace(): returns a copy with some of the object's parameters replaced

Some special keys in Request.meta

  • dont_redirect: if Request.meta contains the dont_redirect key, the request is ignored by RedirectMiddleware
  • dont_retry: if Request.meta contains the dont_retry key, the request is ignored by RetryMiddleware
  • handle_httpstatus_list: the handle_httpstatus_list key in Request.meta specifies which response codes are allowed for this request
  • handle_httpstatus_all: set handle_httpstatus_all to True to allow any response code for the request
  • dont_merge_cookies: set dont_merge_cookies to True in Request.meta to avoid merging with existing cookies
  • cookiejar: Scrapy supports keeping multiple cookie sessions per spider via the cookiejar key in Request.meta. By default a single cookie jar (session) is used, but you can pass an identifier to use several
  • dont_cache: set the dont_cache meta key to True to avoid caching the response under the active cache policy
  • redirect_urls: the URLs a request was redirected through can be found under the redirect_urls key of Request.meta
  • bindaddress: the outgoing IP address to bind to when performing the request
  • dont_obey_robotstxt: if Request.meta sets dont_obey_robotstxt to True, RobotsTxtMiddleware ignores the request even when ROBOTSTXT_OBEY is enabled
  • download_timeout: how long (in seconds) the downloader waits before timing out
  • download_maxsize: the maximum response size allowed for this request
  • download_latency: the time spent fetching the response since the request started, i.e. since the HTTP message was sent over the network. 
    This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy's behaviour, this one is meant to be read-only
  • download_fail_on_dataloss: whether to fail the response on data loss
  • proxy: set a per-request proxy, with a value like http://some_proxy_server:port
  • ftp_user: the username for FTP connections
  • ftp_password: the password for FTP connections
  • referrer_policy: sets the referrer policy per request
  • max_retry_times: the number of retries per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting

 

The Response object

Basic parameters:

url — the URL that was requested 
body — the HTML that came back 
meta — used to pass data between "pages" 
headers — the response headers 
cookies — the page's cookies 
request — the Request object that produced this response

Response object methods

copy(): same as for Request 
replace(): same as for Request 
urljoin(): turns a relative path on the page into an absolute URL 
follow(): builds a request from a URL, completing a relative path automatically

11. Distributed crawling with scrapy_redis

I have written a fairly detailed walk-through of the scrapy_redis distributed workflow: https://blog.csdn.net/feifeiyechuan/article/details/90166405
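For orientation, the core of a scrapy_redis setup is a handful of settings that swap in its Redis-backed scheduler and duplicate filter. A hedged sketch (see the linked post and the scrapy-redis documentation for the authoritative list):

# settings.py — minimal scrapy_redis configuration (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # schedule requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # share request deduplication via Redis
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = 'redis://127.0.0.1:6379'                         # where the shared Redis instance lives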
