Scrapy advanced for beginners (distributed crawling and a cookie pool based on Scrapy-Redis)

 

First, let's update Scrapy. The latest version is 1.3.

As before, Windows users who can't get Scrapy installed with pip should use Anaconda; otherwise, honestly, just use Linux.

conda install scrapy==1.3
or
pip install scrapy==1.3

 Install Scrapy-Redis

conda install scrapy-redis

or

pip install scrapy-redis

Scrapy-Redis supports Python 2.7, 3.4, or 3.5 (I use 3.6 myself with no problems). Pay attention to the version requirements:

Redis>=2.8

Scrapy>=1.0

redis-py>=2.1

redis-py is pulled in automatically as a dependency when you install scrapy-redis; if you don't have it, install it with pip (pip install redis).

 

Before we start, we should know the configuration options scrapy-redis provides. PS: these all go in the settings.py of your Scrapy project!

#Enable the Redis scheduler, which stores the request queue in Redis

SCHEDULER = "scrapy_redis.scheduler.Scheduler"



#Ensure all spiders share the same dupe filter through Redis

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"



#Requests are serialized with pickle by default, but you can swap in any similar serializer. PS: this option works on Python 2.X; it cannot be used on 3.X

#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"



#Do not clear the Redis queue, so you can pause/resume crawling

#SCHEDULER_PERSIST = True



#Use priority scheduling request queue (default)

#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

#Optional other queues

#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'

#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'



#Maximum idle time before the distributed crawler closes while waiting

#This only works when the queue class set above is SpiderQueue or SpiderStack,

#and it may also delay the very first start of your spider (because the queue is empty)

#SCHEDULER_IDLE_BEFORE_CLOSE = 10



#Store scraped items in Redis for post-processing

ITEM_PIPELINES = {

    'scrapy_redis.pipelines.RedisPipeline': 300

}



#The Redis key used to store serialized items

#REDIS_ITEMS_KEY = '%(spider)s:items'



#Use ScrapyJSONEncoder by default for item serialization

#You can use any importable path to a callable object.

#REDIS_ITEMS_SERIALIZER = 'json.dumps'



#Specify the port and address to use when connecting to redis (optional)

#REDIS_HOST = 'localhost'

#REDIS_PORT = 6379



#Specify the URL used to connect to redis (optional)

#If set, this takes precedence over REDIS_HOST and REDIS_PORT

#REDIS_URL = 'redis://user:pass@hostname:9001'



#Custom redis parameters (connection timeout etc.)

#REDIS_PARAMS  = {}



#Custom redis client class

#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'



#If True, use Redis' SPOP (the start URLs are kept in a set).

#Useful if you need to avoid duplicates in the start URL list. When enabled, URLs must be added via SADD, otherwise you'll get a type error.

#REDIS_START_URLS_AS_SET = False



#Default start_urls key for RedisSpider and RedisCrawlSpider

#REDIS_START_URLS_KEY = '%(name)s:start_urls'



#Set redis to use encoding other than utf-8

#REDIS_ENCODING = 'latin1'

For the impatient, the original reference is here: http://scrapy-redis.readthedocs.io/en/stable/readme.html. Pick the options you need and write them into your project's settings.py file.

 

Continuing the crawler modification from our last blog post:

First, write the Redis configuration we need into settings.py.

If your Redis database is configured as in the previous blog post, you need at least the following three items:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"



DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"



REDIS_URL = 'redis://root:password@hostIP:port'

Now that the configuration is written, let's add some basic defenses against anti-crawler measures. Adjust the third item (REDIS_URL) to match your actual setup.

The most basic trick: rotating the User-Agent!

First, create a new useragent.py in the project directory to hold a pile of User-Agent strings (you can find more online, or just use the ready-made list below):

agents = [

    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",

    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",

    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",

    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",

    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",

    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",

    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",

    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",

    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",

    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",

    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",

    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",

    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",

    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",

    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",

    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",

    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",

    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",

    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",

    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",

    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",

    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",

    "Mozilla/2.02E (Win95; U)",

    "Mozilla/3.01Gold (Win95; I)",

    "Mozilla/4.8 [en] (Windows NT 5.1; U)",

    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",

    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",

    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",

    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",

    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",

    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",

    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",

    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",

    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",

    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",

    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",

    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",

Now let's override Scrapy's downloader middleware (Wow! Overriding middleware sounds so high-end!! Won't it be hard?!! Relax!!! It's So Easy!! Follow along with me; after all, you can't punch me through the network cable):

 

For details on rewriting middleware, please refer to the official documentation:

http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/downloader-middleware.html#scrapy.contrib.downloadermiddleware.DownloaderMiddleware

 

Create a new middlewares.py file in the project (newer versions of Scrapy generate this file when you create the project; just use it directly).

First, import UserAgentMiddleware; after all, it's the one we are overriding!

import json  # for handling JSON
import redis  # Python client for Redis
import random  # for random selection
from .useragent import agents  # the User-Agent list we wrote above
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware  # the User-Agent middleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware  # the retry middleware

Now write the class:

class UserAgentmiddleware(UserAgentMiddleware):



    def process_request(self, request, spider):

        agent = random.choice(agents)

        request.headers["User-Agent"] = agent

The first line defines a class UserAgentmiddleware that inherits from UserAgentMiddleware.

The second line defines the function process_request(request, spider). We override this function because Scrapy calls it for every request that passes through the middleware.


The third line randomly picks a User-Agent from the agents list.

The fourth line sets the request's User-Agent header to the one we picked.

^_^Y(^o^)Y One middleware done! Haha, isn't it So Easy!
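One thing not shown yet: Scrapy only runs a downloader middleware once it's registered in settings.py. A minimal sketch, assuming the project is named haoduofuli (swap in your own project name):

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in User-Agent middleware so ours takes over
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'haoduofuli.middlewares.UserAgentmiddleware': 400,
}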

 

Next we need to log in. This time we won't use FormRequest from the previous post to log in; we'll log in with cookies instead. That means we have to override the cookie middleware! This is a distributed crawler: you can't hand-write a cookie for every spider, and you won't know when a cookie has expired. So we need to maintain a cookie pool (Redis serves as the cookie pool).

Good! So, what are the most basic operations a cookie pool needs?

  1. Get cookies
  2. Update cookies
  3. Delete cookies
  4. Check whether a cookie is still usable and act accordingly (e.g. retry)

Well, let's implement the first three operations first.

First, we create a new cookies.py file in the project to write the operations we need to do with cookies.

haoduofuli/haoduofuli/cookies.py:

First, import the usual modules we need:

import requests

import json

import redis

import logging

from .settings import REDIS_URL  # get REDIS_URL from settings.py

First, store the login accounts and passwords in Redis as key:value pairs (key = account, value = password). Don't use db 0 for this (scrapy-redis uses it by default); keep the credentials in a separate db.

(The original post had a screenshot of the Redis keys here: each key in db 2 is an account, and its value is the password.)
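If you'd rather seed the accounts from Python than from a GUI client, here's a minimal sketch (the URL and credentials are placeholders):

import redis

REDIS_URL = 'redis://root:password@hostIP:port'  # same URL as in settings.py (placeholder)

# accounts live in db 2: key = account, value = password
reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True)
reds.set('account1@example.com', 'password1')
reds.set('account2@example.com', 'password2')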

Solve the first problem: get the cookie:

import requests

import json

import redis

import logging

from .settings import REDIS_URL



logger = logging.getLogger(__name__)

##Connect to Redis via REDIS_URL. The parameter decode_responses=True is required; without it everything comes back as bytes, which is useless to us

reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True)

login_url = 'http://haoduofuli.pw/wp-login.php'



##Get cookies

def get_cookie(account, password):

    s = requests.Session()

    payload = {

        'log': account,

        'pwd': password,

        'rememberme': "forever",

        'wp-submit': "Login",

        'redirect_to': "http://www.haoduofuli.pw/wp-admin/",

        'testcookie': "1"

    }

    response = s.post(login_url, data=payload)

    cookies = response.cookies.get_dict()

    logger.warning("Success to get Cookie! (Account is: %s)" % account)

    return json.dumps(cookies)

This part is easy to follow.

We use the requests module to submit the login form and obtain cookies, then return them serialized as JSON (if you don't serialize, the dict is stored in Redis as plain text and the cookie is useless when you read it back later).
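A quick sanity check of get_cookie outside Scrapy (the import path follows this project's layout; account and password are placeholders):

import json
from haoduofuli.cookies import get_cookie

cookie_json = get_cookie('your_account', 'your_password')
print(json.loads(cookie_json))  # a dict of session cookies if the login succeeded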

The second problem: writing the cookies into Redis (it's distributed, so other spiders can use these cookies too).

def init_cookie(red, spidername):
    redkeys = reds.keys()  # all the accounts stored in db 2
    for user in redkeys:
        password = reds.get(user)
        # only log in if we don't already have a cookie for this spider + account
        if red.get("%s:Cookies:%s--%s" % (spidername, user, password)) is None:
            cookie = get_cookie(user, password)
            red.set("%s:Cookies:%s--%s" % (spidername, user, password), cookie)

Using the reds connection established above, we fetch all the keys in Redis db 2 (which we set to be the accounts!) and read each corresponding value (the passwords!). For each account we check whether a cookie for this spider and account already exists; if not, we call get_cookie with the account and password fetched from Redis.

The result is saved back into Redis: the key is the spider name plus account and password, and the value is the cookie.

Note that the Redis operations here go through red, not the reds connection established above! red gets passed in later (I have to work with two different dbs and didn't find a db-switching method in the docs, so I just use two connections; if anyone knows a better way, leave a comment).
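For what it's worth, redis-py deliberately has no db-switching method on a live client (SELECT doesn't mix well with connection pooling), so one client per db is the normal pattern. A minimal sketch, assuming the cookies go into db 1 (any db other than the ones scrapy-redis and the account store occupy):

import redis
from .settings import REDIS_URL

# one client per database: accounts/passwords in db 2, cookies in db 1
reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True)
red = redis.Redis.from_url(REDIS_URL, db=1, decode_responses=True)

# init_cookie(red, spidername) then reads accounts via reds and writes cookies via red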

How spidername is obtained will also be explained later.

That still leaves updating cookies, deleting accounts that no longer work, and so on. Try writing those yourself (if you can't, no worries; it doesn't affect normal use). A head-start sketch follows below.
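Here's a minimal sketch of those two remaining operations for cookies.py, following the same key scheme as init_cookie (the function names are my own, not from the original post):

def update_cookie(red, spidername, user, password):
    # log in again and overwrite the stored cookie for this account
    cookie = get_cookie(user, password)
    red.set("%s:Cookies:%s--%s" % (spidername, user, password), cookie)
    logger.warning("Updated Cookie! (Account is: %s)" % user)


def remove_cookie(red, spidername, user, password):
    # delete the cookie of an account that can no longer log in
    red.delete("%s:Cookies:%s--%s" % (spidername, user, password))
    logger.warning("Removed Cookie! (Account is: %s)" % user)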

Okay! Get it! Simply So Easy!!!!
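As a preview of how the pool gets consumed, here's a minimal sketch of a cookie middleware for middlewares.py that attaches a random pooled cookie to each request (this anticipates the full version with retry handling, i.e. item 4 in the list above; the class name and details are assumptions):

import json
import random
import redis


class CookieMiddleware(object):

    def __init__(self, redis_url):
        # db 1 is where init_cookie wrote the cookies in the sketch above
        self.rconn = redis.Redis.from_url(redis_url, db=1, decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings['REDIS_URL'])

    def process_request(self, request, spider):
        # pick a random account's cookie from the pool for this request
        keys = [k for k in self.rconn.keys() if ("%s:Cookies" % spider.name) in k]
        if keys:
            request.cookies = json.loads(self.rconn.get(random.choice(keys)))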
