Scrapy + Selenium: a few problems encountered while crawling and downloading pictures

Using Scrapy + Selenium + scrapy_redis to grab detail-page content and pictures. Below is the code, the problems encountered, and the parts that still need improvement, with partial parsing and commentary.

# -*- coding: utf-8 -*-
import time

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider  # RedisCrawlSpider lets scrapy_redis persist the task queue (so the crawler can be paused or restarted) and de-duplicate requests
from selenium import webdriver  # selenium's webdriver drives a headless browser to render dynamic content; Chrome is used here, but many other browsers are supported, take a look at the selenium source
from selenium.webdriver.chrome.options import Options  # startup options for the Chrome browser

from scrapyYF.items import ScrapyyfItem


class YaofangSpider(RedisCrawlSpider):
    name = 'yaofang'
    allowed_domains = ['www.jian.com']
    start_urls = ['https://www.jian.com/']
    redis_key = 'JK:YP'
    # rules must be a list
    rules = [
        # follow=False (do not follow): only extract the URLs matching the rule from the start page, then crawl those pages and parse them with the callback
        # follow=True (follow links): keep looking for matching URLs on the secondary pages as well, and so on, until the whole site has been crawled
        # Rule(LinkExtractor(allow=(r'\/c\/category\?cat_id=\d*$')), follow=True),  # to grab more, enable this rule as well: matching content can also be reached from the category pages
        Rule(LinkExtractor(allow=(r'\/product\/\d*\.html$'), unique=True), callback='parse_druginfo', follow=True),
        # Rule(LinkExtractor(allow=(r'\/product\/11929\.html'), unique=True), callback='parse_druginfo', follow=False),
        # Rule(LinkExtractor(allow=(r'\/article\/\d*$'), unique=True), callback='parse_item', follow=True),
    ]

    def __init__(self, *args, **kwargs):
        super(YaofangSpider, self).__init__(*args, **kwargs)  # if you do not call the parent's __init__, you get: AttributeError: 'xxxSpider' object has no attribute '_rules'
        chrome_opt = Options()  # create an options object for the browser startup parameters
        chrome_opt.add_argument('--headless')  # headless mode, no browser window
        chrome_opt.add_argument('--disable-gpu')  # used together with --headless
        # chrome_opt.add_argument('--disable-infobars')  # also used together with --headless
        chrome_opt.add_argument('--window-size=1366,768')  # set the window size; the window size affects what the page renders
        chrome_opt.add_argument('blink-settings=imagesEnabled=false')  # forbid loading pictures in the browser
        # self.bro = webdriver.Chrome(executable_path=r'D:\Python27\Scripts\chromedriver.exe')
        self.bro = webdriver.Chrome(chrome_options=chrome_opt)

    def parse_druginfo(self, response):
        item = ScrapyyfItem()
        item['from_url'] = response.url
        item['addtime'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        item['updated'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        # category, e.g. "Chinese and Western medicine_male drugs"; extract_first() extracts the first match.
        # In the Python 3 versions of Scrapy, getall() replaces extract() and get() replaces extract_first().
        item['class_name'] = response.xpath(
            'normalize-space(//div[@class="crumb"]/div/a[2]/text())').extract_first() + '_' + response.xpath(
            'normalize-space(//div[@class="crumb"]/div/a[3]/text())').extract_first()
        item['goods_id'] = "jianke_" + response.xpath(
            'normalize-space(//dl[@class="assort"][1]/dd/text())').extract_first()  # source-unique identifier, e.g. jianke_B13003000675
        item['drug_name'] = response.xpath(
            'normalize-space(//dl[@class="assort tongyong"]/dd/a/text())').extract_first()  # drug name, e.g. Liuwei Dihuang Wan
        item['goods_name'] = response.xpath(
            'normalize-space(//div[@class="det_title"]//h1/text())').extract_first()  # product name, e.g. Tongrentang Liuwei Dihuang Wan (pills) 120s
        item['grant_number'] = response.xpath(
            'normalize-space(//dl[@class="assort"][2]/dd/span/text())').extract_first()  # approval number
        item['ingredient'] = response.xpath(
            u'normalize-space(//*[@id="b_1_1"]/table//tr[contains(td, "主要原料")]/td[2]/text())').extract_first()  # ingredients, e.g. rehmannia, wine-processed cornus, tree peony bark, Chinese yam, poria, alisma; the xpath extracts the table row whose first cell contains "主要原料" (main ingredients)
        item['indiction'] = response.xpath(
            u'normalize-space(//*[@id="b_1_1"]/table//tr[contains(td, "主要作用")]/td[2]/text())').extract_first()  # main effect, e.g. nourishes yin and the kidney; for kidney-yin deficiency, dizziness, tinnitus, weak knees, hot flashes, night sweats, nocturnal emission
        item['standard'] = response.xpath(
            u'normalize-space(//*[@id="b_1_1"]/table//tr[td="产品规格"]/td[2]/text())').extract_first()  # product specification, e.g. 120 pills
        item['usages'] = response.xpath(
            u'normalize-space(//*[@id="b_1_1"]/table//tr[td="用法用量"]/td[2]/text())').extract_first()  # usage and dosage, e.g. oral, 8 pills at a time, 3 times a day
        item['manual'] = " ".join(response.xpath(u'//div[@id="b_2_2"]/div/child::p').extract())  # grab the full instructions text
        item['imgsrc'] = response.xpath(u'//div[@id="Tb21"]/div//child::img/@src').extract()
        item['manufacturer'] = response.xpath(
            u'normalize-space(//*[@id="b_1_1"]/table//tr[td="生产企业"]/td[2]/text())').extract_first()  # manufacturer
        yield item
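The post does not show scrapyYF/items.py, but from the fields filled in above, the item class would look roughly like this (a sketch inferred from parse_druginfo, not the original file):

# scrapyYF/items.py, sketch; the field set is inferred from the spider code above
import scrapy

class ScrapyyfItem(scrapy.Item):
    from_url = scrapy.Field()
    addtime = scrapy.Field()
    updated = scrapy.Field()
    class_name = scrapy.Field()
    goods_id = scrapy.Field()
    drug_name = scrapy.Field()
    goods_name = scrapy.Field()
    grant_number = scrapy.Field()
    ingredient = scrapy.Field()
    indiction = scrapy.Field()
    standard = scrapy.Field()
    usages = scrapy.Field()
    manual = scrapy.Field()
    imgsrc = scrapy.Field()
    manufacturer = scrapy.Field()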




  Code Summary:

  •  Use RedisCrawlSpider, imported from scrapy_redis, to persist the task queue in Redis (so the crawler can be paused or restarted) and to de-duplicate requests (a settings sketch follows this list)
  •  When using CrawlSpider or RedisCrawlSpider, if you override __init__, be sure to call the parent's __init__, or you will get: AttributeError: 'xxxSpider' object has no attribute '_rules'
  •  Use selenium's webdriver to drive a headless browser; many headless browsers can be driven, e.g. Chrome, Firefox, Safari, etc., and each can be started with its own specific parameters; see https://www.cnblogs.com/jessicor/p/12072255.html
  •  extract_first() extracts the first match; in the Python 3 versions of Scrapy, getall() replaces extract() and get() replaces extract_first()
  •  xpath for extracting the table row titled "主要原料" (main ingredients): //*[@id='b_1_1']/table//tr[contains(td, '主要原料')]/td[2]/text()
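
As mentioned in the first bullet, scrapy_redis needs a few lines in settings.py to take over scheduling and de-duplication. A minimal sketch, assuming a local default Redis instance:

# settings.py, minimal scrapy_redis wiring (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # store the request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # de-duplicate requests through Redis
SCHEDULER_PERSIST = True  # keep the queue and dupefilter between runs, so the crawl can pause and resume
REDIS_URL = 'redis://127.0.0.1:6379'  # assumed local Redis instance

The crawl is then kicked off by pushing a start URL onto the spider's redis_key, e.g.: redis-cli lpush JK:YP https://www.jian.com/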
The custom downloader middlewares (middlewares.py; the imports shown here are implied by the code):

import user_agent  # pip package user_agent, provides generate_user_agent()
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class safetyChainMiddleware(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = user_agent.generate_user_agent()  # generate a random User-Agent header, to avoid being blocked
        referer = request.url
        if referer:
            request.headers['Referer'] = referer  # set the Referer header, also to avoid being blocked


class seleniumMiddleware(object):
    def isFindElement(self, spider):
        try:
            # spider.bro.find_element_by_id('b_2').click()
            # poll for up to 3 seconds (every 0.5s by default) until the tab with id="b_2" contains
            # the text "说明书" (instructions); if it appears, simulate a click on it
            if WebDriverWait(spider.bro, 3).until(
                    EC.text_to_be_present_in_element((By.XPATH, "//ul/li[@id='b_2']"), u'说明书')):
                spider.bro.find_element_by_xpath(u"//ul/li[@id='b_2' and contains(text(), '说明书')]").click()
        except:
            # spider.bro.quit()
            pass

    def process_response(self, request, response, spider):
        url = request.url
        if url.find('.jpg') != -1:  # pictures must be returned as the original Response; an HtmlResponse cannot be used here, otherwise the images cannot be returned or downloaded properly
            return response
        else:
            spider.bro.get(request.url)
            self.isFindElement(spider)  # click open the instructions tab so the dynamic content gets rendered
            page_text = spider.bro.page_source  # the page as rendered by the headless browser
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf8', request=request)  # return the headless browser's rendered page as the response
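
None of this runs until the middlewares are enabled in settings.py. A sketch, assuming both classes live in scrapyYF/middlewares.py (the order values are typical choices, not from the original post):

# settings.py, enable the custom downloader middlewares (sketch)
DOWNLOADER_MIDDLEWARES = {
    'scrapyYF.middlewares.safetyChainMiddleware': 543,
    'scrapyYF.middlewares.seleniumMiddleware': 544,
}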

  • Add custom middleware to avoid being blocked; generally you want to add user_agent, proxy, referer and cookie handling. Note that these must be set in process_request, and pay attention to the function's return value: different return values make Scrapy perform different operations:

    

This method is called for each request that goes through the downloader middleware.

process_request() should return one of the following: None, a Response object, a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares, until finally the appropriate downloader handler is called and the request is performed (its response downloaded).

If it returns a Response object, Scrapy will not call any other process_request() or process_exception() methods, or the appropriate download function; it will return that response. The process_response() methods of the installed middleware are still called for every response.

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is downloaded, the appropriate middleware chain will be called on its response.

If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middleware will be called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
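
To make the return-value contract concrete, here is a sketch of the proxy middleware mentioned in the bullet above; the proxy address is a hypothetical placeholder, not something from the original post. Falling off the end returns None, which lets the request continue down the chain:

class ProxyMiddleware(object):
    PROXY_ADDR = 'http://127.0.0.1:8888'  # hypothetical placeholder proxy

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']
        request.meta['proxy'] = self.PROXY_ADDR
        # implicitly returning None tells Scrapy to keep processing the request
        # through the remaining middlewares and finally download it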

 

  • process_response handles the response. For pictures, be sure to return the original response by itself; do not wrap it in an HtmlResponse, which here carries the content rendered by the headless browser. That is why the code above checks the URL first; without that check the pictures cannot be downloaded. I was stuck on this for several days, and only after a careful read of the documentation and stepping through the procedure several times did I find the reason. Scrapy's execution flow is very important to understand.
    The return value of process_response also deserves attention:

 

      

process_response() should return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

If it returns a Response (it could be the same response that was passed in, or a brand-new one), that response will continue to be processed by the process_response() methods of the next middlewares in the chain.

If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded. This behaves the same as returning a request from process_request().

If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
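
And a sketch of the Request-returning branch of process_response in practice: when a response looks like a ban, returning a request re-schedules it (the status codes here are an assumption for illustration):

class RetryBannedMiddleware(object):
    BANNED_CODES = {403, 429}  # assumed ban-like status codes, adjust for the target site

    def process_response(self, request, response, spider):
        if response.status in self.BANNED_CODES:
            # returning a Request halts the middleware chain and re-queues the request;
            # dont_filter=True bypasses the dupefilter so the retry is not dropped
            return request.replace(dont_filter=True)
        return response  # everything else continues through the remaining process_response methods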

 


Origin www.cnblogs.com/jessicor/p/12109089.html