04 Scrapy framework: log levels and request parameter passing

I. Scrapy log levels

  - When a spider is run with scrapy crawl spiderFileName, Scrapy prints its log information to the terminal.

  - Log message categories:

    ERROR: General error

    WARNING: Warnings

    INFO: general information

    DEBUG: debug information

  - Setting the log output:

  In the settings.py configuration file, add

  LOG_LEVEL = 'specified log level' to control which messages are printed.

  LOG_FILE = 'log.txt' writes the log information to the specified file instead of the terminal.
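
  For example, a minimal settings.py fragment that keeps only error messages and writes them to a file might look like this:

# settings.py
LOG_LEVEL = 'ERROR'   # only output ERROR-level log messages
LOG_FILE = 'log.txt'  # write the log to log.txt instead of the terminal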

II. Passing parameters between requests

  - In some cases the data we want is not all on the same page. For example, when crawling a movie site, the movie name and score sit on one page while the remaining details live on a second-level subpage; in that case we need to pass parameters along with the request.

  - Example: crawl the movie site www.id97.com, taking the movie name, type, and score from the first-level page, and the director, release time, and length from the second-level page.

  Spider file:

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')

        for div in div_list:
        for div in div_list:
            item = MovieproItem()
            item['name'] = div.xpath('.//h1/a/text()').extract_first()
            item['score'] = div.xpath('.//h1/em/text()').extract_first()
            # xpath('string(.)') extracts the text of all child nodes under the current node; '.' denotes the current node
            item['kind'] = div.xpath('.//div[@class="otherinfo"]').xpath('string(.)').extract_first()
            item['detail_url'] = div.xpath('./div/a/@href').extract_first()
            # request the second-level detail page and pass the item to its callback through the Request's meta parameter
            yield scrapy.Request(url=item['detail_url'], callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # retrieve the item passed in through meta
        item = response.meta['item']
        item['actor'] = response.xpath('//div[@class="row"]//table/tr[1]/a/text()').extract_first()
        item['time'] = response.xpath('//div[@class="row"]//table/tr[7]/td[2]/text()').extract_first()
        item['long'] = response.xpath('//div[@class="row"]//table/tr[8]/td[2]/text()').extract_first()
        # submit the item to the pipeline
        yield item
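
Note that meta attaches the dict to that particular Request, so each detail-page callback receives its own partially filled item rather than a shared one; response.meta in parse_detail is simply the dict that was passed in when the Request was created.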

Items file:

 
 
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    score = scrapy.Field()
    time = scrapy.Field()
    long = scrapy.Field()
    actor = scrapy.Field()
    kind = scrapy.Field()
    detail_url = scrapy.Field()

Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
class MovieproPipeline(object):
    def __init__(self):
        self.fp = open('data.txt','w')
    def process_item(self, item, spider):
        dic = dict(item)
        print(dic)
        json.dump(dic,self.fp,ensure_ascii=False)
        return item
    def close_spider(self,spider):
        self.fp.close()
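
For the pipeline to actually receive items it also has to be enabled in settings.py; a minimal sketch, assuming the default module layout of the moviePro project:

# settings.py
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,  # lower number = higher priority
}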

III. How to improve Scrapy's crawling efficiency

  Increase concurrency:

    By default Scrapy runs 16 concurrent requests (the CONCURRENT_REQUESTS setting); this can be raised as appropriate. In the settings.py configuration file set:

CONCURRENT_REQUESTS = 100, which raises the concurrency to 100.

Reduce the log level:

  Running Scrapy produces a large amount of log output. To reduce CPU usage, restrict the log output to INFO or ERROR by writing in the configuration file: LOG_LEVEL = 'INFO'

Disable cookies:

  If cookies are not actually needed, disable them during the crawl to reduce CPU usage and improve efficiency. Write in the settings file: COOKIES_ENABLED = False

Disable retries:

  Retrying failed HTTP requests slows the crawl down, so retries can be disabled. Write in the configuration file:

  RETRY_ENABLED = False

Reduce the download timeout:

  If a link is very slow to crawl, reducing the download timeout lets stuck links be abandoned quickly and improves efficiency. Write in the configuration file: DOWNLOAD_TIMEOUT = 10, i.e. a 10-second timeout (the tips above are combined into one settings sketch below).
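
Putting the tips above together, the relevant lines of settings.py might look like this (the values are the example values used above):

# settings.py - efficiency-related settings (example values)
CONCURRENT_REQUESTS = 100    # raise concurrency from the default of 16
LOG_LEVEL = 'INFO'           # or 'ERROR'; cut down log output
COOKIES_ENABLED = False      # skip cookie handling when it is not needed
RETRY_ENABLED = False        # do not retry failed requests
DOWNLOAD_TIMEOUT = 10        # give up on slow links after 10 seconds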

Test case: crawl campus-beauty pictures from the Xiaohua Wang site: www.521609.com

Spider file:

# -*- coding: utf-8 -*-
import scrapy
from xiaohua.items import XiaohuaItem

class XiahuaSpider(scrapy.Spider):

    name = 'xiaohua'
    allowed_domains = ['www.521609.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    pageNum = 1
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            school = li.xpath('./a/img/@alt').extract_first()
            img_url = li.xpath('./a/img/@src').extract_first()

            item = XiaohuaItem()
            item['school'] = school
            item['img_url'] = 'http://www.521609.com' + img_url

            yield item

        if self.pageNum < 10:
            self.pageNum += 1
            url = format(self.url % self.pageNum)
            #print(url)
            yield scrapy.Request(url=url,callback=self.parse)

Items file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class XiaohuaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    school=scrapy.Field()
    img_url=scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import os
import urllib.request
class XiaohuaPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self,spider):
        print('spider started')
        self.fp = open('./xiaohua.txt','w')

    def download_img(self,item):
        url = item['img_url']
        fileName = item['school']+'.jpg'
        if not os.path.exists('./xiaohualib'):
            os.mkdir('./xiaohualib')
        filepath = os.path.join('./xiaohualib',fileName)
        urllib.request.urlretrieve(url,filepath)
        print(fileName + ' download success')

    def process_item(self, item, spider):
        obj = dict(item)
        json_str = json.dumps(obj, ensure_ascii=False)
        self.fp.write(json_str + '\n')

        # download the picture
        self.download_img(item)
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.fp.close()

settings.py configuration file:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
LOG_LEVEL = 'ERROR'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 3
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOAD_DELAY = 3
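
Note that for XiaohuaPipeline to run, this settings file also needs ITEM_PIPELINES registered (pointing at xiaohua.pipelines.XiaohuaPipeline), in the same way as shown for the movie project above.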

 
