Python - Crawlers - Scrapy

Getting started:

Install: pip install scrapy

Create a project: scrapy startproject project_name

Create a spider: scrapy genspider spider_name url (--nolog is optional and suppresses log output when running the spider)

 

Summary:

 

Persistent storage:

1: Terminal storage: scrapy crawl spider_name -o aaa.csv (the extension picks the export format: json, csv, xml, ...)

2: Pipeline storage: pass item objects (dict-like {} containers) to the pipeline, then store them there

3: open_spider() ---> connect to the database, close_spider() ---> close the database connection, process_item() ---> do the actual storage
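A minimal sketch of point 3, assuming a MySQL backend via pymysql and an item with 'author' and 'content' fields (the connection parameters, table name, and field names are illustrative assumptions):

# pipelines.py -- sketch of a database pipeline (all names below are assumptions)
import pymysql

class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # called once when the spider starts: open the database connection
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='spider', charset='utf8')

    def process_item(self, item, spider):
        # called for every item: write it into the database
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qiubai values (%s, %s)',
                                (item['author'], item['content']))
            self.conn.commit()
        except Exception:
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # called once when the spider closes: release the connection
        self.cursor.close()
        self.conn.close()

The pipeline still has to be enabled in settings.py via ITEM_PIPELINES for scrapy to call it.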

Proxy IP:

1 Custom downloader middleware

middlewares.py --->

class MyProxy(object):

    def process_request(self, request, spider):
        # replace the proxy IP used for this request
        request.meta['proxy'] = "http://202.112.51.51:8082"

2 Enable the downloader middleware in settings.py

DOWNLOADER_MIDDLEWARES = {
    'firstBlood.middlewares.MyProxy': 543,
}

Log level:

1 Levels

ERROR: errors

WARNING: warnings

INFO: general information

DEBUG: debug information (default)

Specify the log level:

settings: LOG_LEVEL = 'ERROR'

Write the log output to a specified file:

settings: LOG_FILE = 'log.txt'

2 Passing parameters between callbacks

yield scrapy.Request(url=url, callback=self.secondParse, meta={'item': item})

Retrieve it in the callback: item = response.meta['item']
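A small two-level sketch of this meta pattern (the item class, field names, URLs, and the secondParse callback are illustrative assumptions):

import scrapy
from myproject.items import MovieItem  # hypothetical item with 'name' and 'detail' fields

class MetaDemoSpider(scrapy.Spider):
    name = 'metaDemo'
    start_urls = ['http://www.example.com/list']  # placeholder URL

    def parse(self, response):
        for li in response.xpath('//ul/li'):
            item = MovieItem()
            item['name'] = li.xpath('./a/text()').extract_first()
            detail_url = response.urljoin(li.xpath('./a/@href').extract_first())
            # carry the half-filled item to the next callback via meta
            yield scrapy.Request(url=detail_url, callback=self.secondParse,
                                 meta={'item': item})

    def secondParse(self, response):
        # pick the item back up and finish filling it in
        item = response.meta['item']
        item['detail'] = response.xpath('//div[@class="detail"]//text()').extract_first()
        yield item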

Passing POST request parameters:

Way 1: pass method='POST' to scrapy.Request (see the sketch after the code below)

Way 2: override the start_requests(self) method (recommended)

class FanyiSpider(scrapy.Spider):

    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # FormRequest sends the data as a POST form
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)
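A sketch of way 1, sending the same form data with a plain scrapy.Request and method='POST' (the URL is a placeholder and 'kw' is the same illustrative field as above):

import scrapy
import urllib.parse

class FanyiPostSpider(scrapy.Spider):
    name = 'fanyiPost'
    start_urls = ['http://www.example.com/post']  # placeholder endpoint

    def start_requests(self):
        body = urllib.parse.urlencode({'kw': 'dog'})
        for url in self.start_urls:
            # way 1: plain Request with method='POST' and a manually encoded body
            yield scrapy.Request(url=url, method='POST', body=body,
                                 headers={'Content-Type': 'application/x-www-form-urlencoded'},
                                 callback=self.parse)

    def parse(self, response):
        self.logger.info(response.text)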

 

CrawlSpider:

Multi-level crawling normally means issuing follow-up requests layer by layer (or recursively) with yield scrapy.Request(url, callback, meta).

CrawlSpider covers several of these follow-up request patterns:

A: the initial request is expanded into a request queue (extract a url list from the page, keep requesting, then extract a new url list from each new page)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlspiderSpider(CrawlSpider):
    name = 'crawlSpider'
    start_urls = ['https://www.qiushibaike.com/text']

    rules = (Rule(LinkExtractor(allow=r'/text/page/\d+'), callback='parse_item', follow=True),)

    '''
    LinkExtractor: sets the rule (a regex) for extracting links
        allow=(),           : urls matching this pattern are extracted
        restrict_xpaths=(), : use xpath syntax to locate the tags whose links should be extracted
        restrict_css=(),    : use css selectors to locate the tags whose links should be extracted
        deny=(),            : urls matching this pattern are not extracted (higher priority than allow)
        allow_domains=(),   : domains whose urls are allowed to be extracted
        deny_domains=(),    : domains whose urls are not extracted (higher priority than allow_domains)
        unique=True,        : if the same url is extracted several times, keep only one
        strip=True          : default True, strip leading/trailing whitespace from urls
    '''

    '''
    Rule:
        link_extractor,          : a LinkExtractor object
        callback=None,           : the callback function
        follow=None,             : whether to keep following links from the pages it extracts
        process_links=None,      : optional callback that can intercept all extracted urls
        process_request=identity : optional callback that can intercept the request objects
    '''

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            item = PhpmasterItem()
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            item['author'] = str(author).strip()
            # print(author)
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            item['content'] = str(content).strip()
            yield item

 

 

II: Downloading images: when the image url reaches the pipeline, let the pipeline download it (ImagesPipeline issues the download request)

Spider: yield item['img_url']

Settings: IMAGES_STORE = './images/'

Pipeline:

import scrapy
from qiubaipic.settings import IMAGES_STORE as images_store
from scrapy.pipelines.images import ImagesPipeline

class QiubaipicPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        img_link = "http:" + item['img_link']
        yield scrapy.Request(img_link)
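If the item should keep flowing to later pipelines, ImagesPipeline's item_completed() hook can also be overridden; a minimal sketch (the 'img_path' field name is an assumption):

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class QiubaipicPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        yield scrapy.Request("http:" + item['img_link'])

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) pairs, one per request above
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('image download failed')
        item['img_path'] = image_paths[0]  # 'img_path' is an assumed item field
        return item

As with any pipeline, it only runs once it is registered in ITEM_PIPELINES in settings.py.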

 

Grouping pictures into folders:

    def file_path(self, request, response=None, info=None):
        '''build the storage path for the picture (requires import os)'''
        img_name = request.url.split('/')[-1]      # image file name
        file_name = request.meta['file_name']      # folder name passed along via meta
        image_guid = file_name + '/' + img_name    # e.g. <folder>/2560580770.jpg
        img_path = images_store + file_name + '/'  # ./images/<folder>/ must exist
        if not os.path.exists(img_path):
            os.makedirs(img_path)
        print(request.url)
        return '%s' % (image_guid)

 

Distributed crawlers:

 

Proxy IP pool and UA pool

Proxy IP middleware:

import random

http_list = []   # pool of plain-http proxies, e.g. '1.2.3.4:8080'
https_list = []  # pool of https proxies

def process_request(self, request, spider):
    # pick a proxy pool that matches the scheme of the intercepted request
    h = request.url.split(':')[0]
    if h == 'http':
        http = 'http://' + random.choice(http_list)
    if h == 'https':
        http = 'https://' + random.choice(https_list)
    request.meta['proxy'] = http

 

UA middleware:

user_agent_list = []  # pool of User-Agent strings

def process_request(self, request, spider):
    # randomly pick a UA value from the list
    ua = random.choice(user_agent_list)
    # write the chosen UA into the headers of the intercepted request
    request.headers.setdefault('User-Agent', ua)
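Both middlewares only take effect after they are enabled in settings.py; a sketch, assuming the two process_request methods above live in classes named RandomProxyMiddleware and RandomUserAgentMiddleware inside myproject/middlewares.py (the module path and class names are assumptions):

# settings.py -- lower priority numbers run earlier
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 543,
    'myproject.middlewares.RandomUserAgentMiddleware': 544,
}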

 

Running a scrapy project from a script:

Create a new xxx.py under the project root;

from scrapy import cmdline

# lets us execute a scrapy command directly from Python
cmdline.execute('scrapy crawl logrule --nolog'.split())

 
