Scrapy framework (internally, content is downloaded with Twisted, an asynchronous non-blocking module)
1. Depends on Twisted
Internally, the crawler's concurrency is implemented with an event loop.
Non-blocking: no waiting. A request does not wait for its connection to complete before the next connection is started, and as soon as one request has been sent the next one is sent.
Asynchronous: expressed through callbacks. As soon as a result comes back, the registered callback is notified automatically.
Event loop: loops over the socket tasks, checking whether each socket has connected and whether a result has been returned.
In plain words: a single thread can issue HTTP requests to multiple targets at the same time.
Officially: an event-loop-based, asynchronous, non-blocking module.
    # Note: getPage is deprecated in newer Twisted releases.
    from twisted.web.client import getPage
    from twisted.internet import defer, reactor


    # Part 1: the agent starts accepting tasks
    def callback(contents):
        print(contents)


    deferred_list = []  # task list
    url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
    for url in url_list:
        deferred = getPage(bytes(url, encoding='utf8'))  # create the request task
        deferred.addCallback(callback)  # when the response arrives, callback is run
        deferred_list.append(deferred)

    # Part 2: once the agent has finished every task, stop
    dlist = defer.DeferredList(deferred_list)


    def all_done(arg):
        reactor.stop()


    dlist.addBoth(all_done)  # fires after all three tasks end, whether they succeeded or failed

    # Part 3: the agent starts processing the tasks
    reactor.run()
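The three ideas in this section (non-blocking sends, asynchronous callbacks, an event loop that polls sockets) can also be sketched with the standard library alone. This is only a minimal illustration, not how Twisted is implemented. The function run_event_loop, the callback on_connected and the local listening socket that stands in for the remote web servers are all assumptions made for the example.

```python
import selectors
import socket


def run_event_loop(address, count):
    """Open `count` non-blocking connections to `address` and return the
    connection ids in the order their callbacks fired."""
    selector = selectors.DefaultSelector()
    connected = []

    def on_connected(conn_id):
        # Asynchronous: the loop calls this once the socket reports success.
        connected.append(conn_id)

    for conn_id in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        # Non-blocking: connect_ex returns at once instead of waiting.
        sock.connect_ex(address)
        selector.register(sock, selectors.EVENT_WRITE, (conn_id, on_connected))

    # Event loop: keep polling until every socket has reported its state.
    while selector.get_map():
        events = selector.select(timeout=1)
        if not events:
            break  # nothing became ready in time; give up
        for key, _mask in events:
            conn_id, callback = key.data
            selector.unregister(key.fileobj)
            error = key.fileobj.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if error == 0:
                callback(conn_id)
            key.fileobj.close()
    selector.close()
    return connected


# A local listening socket stands in for the remote servers (assumption).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(5)
result = run_event_loop(server.getsockname(), 3)
server.close()
print(sorted(result))
```

All three connections are started before any of them is checked, and a single thread then learns about each one through the loop. That is the same division of work the Twisted example relies on.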
2. Writing parse
    from scrapy import Request


    def parse(self, response):
        # 1. The response
        response.text
        response.encoding
        response.body
        response.request  # the request that produced this response; a request wraps the URL to visit and the function to run once the download completes
        # 2. Parsing
        response.xpath('//div[@href="x1"]/a').extract_first()  # first match
        response.xpath('//div[@href="x1"]/a').extract()  # all matches
        response.xpath('//div[@href="x1"]/a/text()').extract()
        tag_list = response.xpath('//div[@href="x1"]/a')  # keep selectors (no extract()) so xpath() can be called on them
        for tag in tag_list:
            tag.xpath('.//*')  # a leading './/' searches the descendants of the current tag
        # 3. Issue another request (only wrapped here, not sent yet)
        # page: URL of the next page, assumed to have been obtained above
        yield Request(url=page, callback=self.parse)  # only wraps the request; the download itself is started later by the framework
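The last point, that yield only wraps a request and does not send it, can be shown with a toy engine written in plain Python. This is a simplified sketch rather than Scrapy's actual engine. FakeRequest, PAGES and run_engine are hypothetical names, and the callback receives a URL instead of a real response object.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: a URL plus a callback."""
    url: str
    callback: Callable


# Hypothetical in-memory "site": each page links to the next one.
PAGES = {"page1": "page2", "page2": "page3", "page3": None}
visited = []


def parse(url):
    visited.append(url)
    next_page = PAGES[url]
    if next_page is not None:
        # Yielding only hands the wrapped request to the engine.
        yield FakeRequest(url=next_page, callback=parse)


def run_engine(start_url):
    """Toy engine: takes queued requests one by one and runs their callbacks."""
    queue = deque([FakeRequest(url=start_url, callback=parse)])
    while queue:
        request = queue.popleft()
        # A real engine would download request.url here, then call back.
        for new_request in request.callback(request.url):
            queue.append(new_request)


run_engine("page1")
print(visited)
```

Because parse is a generator, nothing happens until the engine iterates over it. Deciding when a request is actually sent stays with the engine, not with the callback.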
Differences:
1. What is the difference between Twisted and requests?
   1. requests is a Python module that sends HTTP requests and can impersonate a browser
      - wraps a socket to send the request
   2. Twisted is an event-loop-based, asynchronous, non-blocking framework
      - wraps a socket to send the request
      - achieves concurrency in a single thread: it does not wait, it simply sends one request after another regardless of whether the earlier ones have succeeded
      - non-blocking: no waiting
      - asynchronous: callbacks
      - event loop: keeps looping to check the status
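The practical effect of this difference can be sketched with a timing comparison. The example below does not use requests or Twisted. As an assumption for illustration, time.sleep stands in for a blocking request, and asyncio from the standard library stands in for a single-threaded event loop. The URL names and the DELAY value are placeholders and nothing is fetched.

```python
import asyncio
import time

DELAY = 0.1  # pretend every request takes this long (hypothetical value)
URLS = ["url-a", "url-b", "url-c"]  # placeholder names, nothing is fetched


def fetch_blocking(url):
    time.sleep(DELAY)  # the thread waits here until the "response" arrives
    return url


async def fetch_non_blocking(url):
    await asyncio.sleep(DELAY)  # the event loop serves other tasks meanwhile
    return url


async def fetch_all(urls):
    # Start every task first, then collect the results as they complete.
    return await asyncio.gather(*(fetch_non_blocking(u) for u in urls))


start = time.perf_counter()
blocking_results = [fetch_blocking(u) for u in URLS]
blocking_elapsed = time.perf_counter() - start

start = time.perf_counter()
concurrent_results = asyncio.run(fetch_all(URLS))
concurrent_elapsed = time.perf_counter() - start

print(f"one after another: {blocking_elapsed:.2f}s")
print(f"single-threaded event loop: {concurrent_elapsed:.2f}s")
```

Both variants run in one thread. The blocking version takes roughly the sum of the delays, while the event loop version takes roughly the longest single delay.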