Getting Started:
Install: pip install scrapy
Create a project: scrapy startproject <project_name>
Create a spider: scrapy genspider <spider_name> <url> (add --nolog to any scrapy command to suppress log output)
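For reference, a freshly generated spider looks roughly like this (a minimal sketch; the exact template varies with the Scrapy version, and the names come from the genspider arguments):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # name used with "scrapy crawl <spider_name>"
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        def parse(self, response):
            # default callback for the responses to start_urls
            pass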
Summary:
Persistent storage:
1: Terminal export: scrapy crawl <spider_name> -o items.json (the -o feed export needs a supported extension such as .json, .csv, or .xml)
2: Pipeline storage: wrap the scraped data in an item object (used like a {} dictionary) and yield it to the pipeline, which stores it
3: In the pipeline: open_spider() ---> connect to the database, close_spider() ---> close the database, process_item() ---> store the item
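A minimal pipeline sketch following the three hooks above, using sqlite3 just for illustration (the data.db file and the 'author'/'content' item fields are assumptions):

    import sqlite3

    class StoragePipeline(object):
        def open_spider(self, spider):
            # connect to the database when the spider starts
            self.conn = sqlite3.connect('data.db')
            self.conn.execute('CREATE TABLE IF NOT EXISTS items (author TEXT, content TEXT)')

        def process_item(self, item, spider):
            # store each item as it arrives from the spider
            self.conn.execute('INSERT INTO items VALUES (?, ?)', (item['author'], item['content']))
            self.conn.commit()
            return item

        def close_spider(self, spider):
            # close the database when the spider finishes
            self.conn.close()

The pipeline only runs once it is registered under ITEM_PIPELINES in settings.py.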
Proxy IP:
1: Custom downloader middleware
middlewares.py --->
class MyProxy(object):
    def process_request(self, request, spider):
        # replace the IP used for the request with the proxy
        request.meta['proxy'] = "http://202.112.51.51:8082"
2: Enable the downloader middleware in settings.py
DOWNLOADER_MIDDLEWARES = {
    'firstBlood.middlewares.MyProxy': 543,
}
Log level:
1 Log levels:
ERROR: errors
WARNING: warnings
INFO: general information
DEBUG: debug information (default)
Specify the log level:
settings: LOG_LEVEL = 'ERROR'
Write the log output to a specified file:
settings: LOG_FILE = 'log.txt'
2 Passing parameters between requests (meta):
yield scrapy.Request(url=url, callback=self.secondParse, meta={'item': item})
Retrieve in the callback: item = response.meta['item']
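A minimal two-level sketch of this meta passing (the spider name, urls, xpaths, and field names are all assumptions for illustration):

    import scrapy

    class MetaDemoSpider(scrapy.Spider):
        name = 'metaDemo'
        start_urls = ['http://example.com/list']

        def parse(self, response):
            for li in response.xpath('//li'):
                # first level: collect part of the data
                item = {'title': li.xpath('./a/text()').extract_first()}
                detail_url = li.xpath('./a/@href').extract_first()
                # carry the half-filled item to the second-level callback via meta
                yield scrapy.Request(url=detail_url, callback=self.secondParse, meta={'item': item})

        def secondParse(self, response):
            # second level: take the item back out of meta and finish filling it
            item = response.meta['item']
            item['content'] = response.xpath('//div[@class="content"]/text()').extract_first()
            yield item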
Sending POST request parameters:
Way 1: scrapy.Request(url, method='POST', ...)
Way 2: override the start_requests(self) method (recommended)
class FanyiSpider(scrapy.Spider):
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # FormRequest sends a POST request carrying the form data
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)
CrawlSpider:
For crawls that go many layers deep, the usual approach is chained / recursive requests ---> yield scrapy.Request(url, callback, meta)
CrawlSpider covers some special request patterns:
1: Turning the initial request into a request queue (extract a url list from the page, keep requesting, and extract a new url list from each new page)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CrawlspiderSpider(CrawlSpider):
    name = 'crawlSpider'
    start_urls = ['https://www.qiushibaike.com/text']
    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+'), callback='parse_item', follow=True),
    )
    '''
    LinkExtractor: sets the link-extraction rule (regex)
        allow=():            urls allowed to be extracted
        restrict_xpaths=():  xpath expression locating the tags whose links are extracted
        restrict_css=():     css selector locating the tags whose links are extracted
        deny=():             urls not allowed to be extracted (higher priority than allow)
        allow_domains=():    domains whose urls are allowed to be extracted
        deny_domains=():     domains whose urls are not allowed to be extracted (higher priority than allow_domains)
        unique=True:         if the same url appears several times, keep only one
        strip=True:          default True, strip leading/trailing whitespace from the url
    '''
    '''
    Rule:
        link_extractor:            a LinkExtractor object
        callback=None:             sets the callback function
        follow=None:               sets whether to keep following links from matched responses
        process_links=None:        optional callback to filter all extracted urls
        process_request=identity:  optional callback to filter the request objects
    '''
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = PhpmasterItem()   # item class defined in the project's items.py
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            item['author'] = str(author).strip()
            # print(author)
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            item['content'] = str(content).strip()
            yield item
2: Downloading images: when the image url reaches the pipeline, the pipeline itself issues the download request
Spider:: yield the item carrying the image link (item['img_url'])
Settings:: IMAGES_STORE = './images/'
Pipeline::
from qiubaipic.settings import IMAGES_STORE as images_store
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class QiubaipicPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # issue a download request for every image link
        img_link = "http:" + item['img_link']
        yield scrapy.Request(img_link)
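The pipeline also has to be registered in settings.py for the images to be downloaded; a short sketch, with the module path assumed from the project name above:

    ITEM_PIPELINES = {
        'qiubaipic.pipelines.QiubaipicPipeline': 300,
    }
    IMAGES_STORE = './images/'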
Grouping images into folders:
def file_path(self, request, response=None, info=None):
    '''build the image storage path'''
    img_name = request.url.split('/')[-1]        # image file name, e.g. 2560580770.jpg
    file_name = request.meta['file_name']        # folder name carried in the request meta
    image_guid = file_name + '/' + img_name      # e.g. <folder_name>/2560580770.jpg
    img_path = images_store + file_name + '/'    # e.g. ./images/<folder_name>/, must exist
    if not os.path.exists(img_path):             # requires: import os
        os.makedirs(img_path)
    print(request.url)
    return '%s' % image_guid
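For request.meta['file_name'] to exist, get_media_requests has to put it there; a minimal sketch, assuming the item carries a 'file_name' field (a hypothetical name):

    def get_media_requests(self, item, info):
        img_link = "http:" + item['img_link']
        # pass the folder name on to file_path() through the request meta
        yield scrapy.Request(img_link, meta={'file_name': item['file_name']})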
Distributed crawlers:
Proxy IP pool and UA pool
Proxy IP middleware:
http_list = []    # pool of http proxies, e.g. 'ip:port' strings
https_list = []   # pool of https proxies

def process_request(self, request, spider):
    # requires: import random
    # pick the proxy pool that matches the request's scheme
    h = request.url.split(':')[0]
    if h == 'http':
        http = 'http://' + random.choice(http_list)
    if h == 'https':
        http = 'https://' + random.choice(https_list)
    request.meta['proxy'] = http
UA middleware:
user_agent_list = []   # pool of User-Agent strings

def process_request(self, request, spider):
    # pick a random ua value from the list
    ua = random.choice(user_agent_list)
    # write the chosen ua into the intercepted request's headers
    request.headers.setdefault('User-Agent', ua)
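Both middlewares still need to be enabled in settings.py, just like MyProxy earlier; a sketch, where the module path and class names are assumptions:

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.ProxyMiddleware': 543,
        'myproject.middlewares.UAMiddleware': 544,
    }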
Running a Scrapy project from a script:
Create a new xxx.py under the project root;
from scrapy import cmdline   # lets us execute scrapy commands directly from code
cmdline.execute('scrapy crawl logrule --nolog'.split())