A Standard Web Crawler: A Feast from the Father of Python!

  First of all, I have to admit the title is a bit of clickbait. The point of this article is to analyze the crawl project from 500 Lines or Less. The project lives at https://github.com/aosabook/500lines; interested readers should take a look, as it is a very high-quality collection of open source projects that is said to be on its way to becoming a book, although judging from the commit history, that book will not be available any time soon. This article is written rather roughly, so please do point out any mistakes...

 

  A web crawler starts from one or more initial URLs. While crawling the pages behind those URLs it keeps extracting new URLs from the current page and putting them into a queue, until some stop condition is met. A simple web crawler can therefore be understood as a while loop with a termination condition: as long as the condition is not triggered, the crawler keeps taking a URL from the queue, sending a request for it, fetching the data, parsing further URLs out of the current page, and iterating onward. In the crawl project this process is carried out by the Crawler class. It is not strictly breadth-first or depth-first: when the current request fails, the task is suspended and re-scheduled later, which can loosely be understood as an A*-like search driven by network connectivity. Its mode of operation is illustrated below.
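  As a rough illustration of that loop, here is a minimal synchronous sketch; it is not the project's code, and the names are purely illustrative:

import re
import urllib.parse
import urllib.request

# A minimal, synchronous sketch of the while-loop idea described above.
def simple_crawl(root, max_pages=10):
    todo, done = {root}, set()
    while todo and len(done) < max_pages:        # termination condition
        url = todo.pop()                         # take a url from the queue
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read().decode('utf-8', 'replace')
        except Exception:
            continue                             # failed request: skip this url
        done.add(url)
        # extract new urls from the current page and queue the unseen ones
        for href in re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', body):
            link, _ = urllib.parse.urldefrag(urllib.parse.urljoin(url, href))
            if link not in done:
                todo.add(link)
    return done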

  

  After a Crawler object is initialized, it holds three collections of URLs: a todo collection for URLs that still have to be crawled; a busy collection for URLs that are currently being fetched; and a done collection for URLs whose pages have already been crawled. The core of the crawler is an endless loop: it takes a URL out of the todo collection, initializes a Fetcher object that will fetch the page at that URL, and finally schedules the request as a task. This process is shown in the following code.

@asyncio.coroutine
def crawl(self):
    """Run the crawler until all finished."""
    with (yield from self.termination):
        while self.todo or self.busy:
            if self.todo:
                url, max_redirect = self.todo.popitem()
                fetcher = Fetcher(url,
                                  crawler=self,
                                  max_redirect=max_redirect,
                                  max_tries=self.max_tries,
                                  )
                self.busy[url] = fetcher
                fetcher.task = asyncio.Task(self.fetch(fetcher))
            else:
                yield from self.termination.wait()
    self.t1 = time.time()
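  To make the wait()/notify cycle concrete, here is a hedged sketch (not the project's exact code) of what the companion fetch wrapper has to do when a fetcher finishes, so that the loop above wakes up and re-checks todo and busy. Modern async/await is used here; the project spells the same idea with @asyncio.coroutine and yield from:

import asyncio

async def finish(crawler, fetcher):
    """Hedged sketch: move a finished url from busy to done and wake crawl()."""
    await fetcher.fetch()                      # run the actual page fetch
    async with crawler.termination:            # the Condition crawl() waits on
        crawler.done[fetcher.url] = fetcher    # record the finished work
        del crawler.busy[fetcher.url]          # no longer in flight
        crawler.termination.notify()           # let crawl() re-check todo/busy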

   

  Obviously a crawler does not consist solely of an endless loop; around it, other modules have to support the crawl operation, including network connection management, URL fetching, and task scheduling. The overall scheduling framework of the crawl project is as follows:

 

  When a Crawler is created, a ConnectionPool is built during initialization:

 

  self.pool = ConnectionPool(max_pool, max_tasks)

  The pool keeps two attributes, connections and queue, which respectively hold the pooled connections and the queue of connections for later scheduling; each connection stores its host, port and whether it uses SSL, and is obtained via asyncio.open_connection().

  self.connections = {} # {(host, port, ssl): [Connection, ...], ...}
  self.queue = [] # [Connection, ...]
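  As a hedged sketch (not the project's code) of how such a pool might hand out and reclaim connections keyed by (host, port, ssl):

import asyncio

# The pool maps a (host, port, ssl) key to a list of idle (reader, writer)
# pairs; an idle pair is reused when possible, otherwise a fresh connection
# is opened with asyncio.open_connection().
async def get_connection(connections, host, port, ssl=False):
    key = (host, port, ssl)
    idle = connections.setdefault(key, [])
    if idle:
        return idle.pop()                      # reuse an idle connection
    return await asyncio.open_connection(host, port, ssl=ssl)

def release_connection(connections, host, port, ssl, reader, writer):
    # hand the pair back so a later request to the same host can reuse it
    connections.setdefault((host, port, ssl), []).append((reader, writer))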

  The crawl task is executed by first loading it into the event loop via loop.run_until_complete(crawler.crawl()). The ConnectionPool built by the statement above keeps the connection objects, which are then used by the fetch method of the Fetcher objects to retrieve data. Each URL request task is processed by a Fetcher, and scheduling is handled by the asyncio.Task mechanism. The fetch method is a generator that can be suspended, and it is handed to asyncio.Task for execution.

  Through the @asyncio.coroutine decorator and the yield from statement, this method becomes a generator during execution; if fetcher.fetch() is suspended partway through, the scheduler takes over until it can resume.
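  The same pattern in miniature, written with modern async/await (the project spells it with @asyncio.coroutine, yield from and asyncio.Task()):

import asyncio

async def fetch_one(url):
    await asyncio.sleep(0.1)       # stands in for a suspended network read
    return url

async def crawl_all(urls):
    # wrapping each coroutine in a Task lets the event loop interleave them
    tasks = [asyncio.ensure_future(fetch_one(u)) for u in urls]
    return await asyncio.gather(*tasks)

loop = asyncio.new_event_loop()
print(loop.run_until_complete(crawl_all(['a', 'b', 'c'])))
loop.close()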

  fetcher.fetch() is the core method of the web crawler. It is responsible for fetching page data from the network and loading the URLs found in it into the todo collection. It retries fetching the page data until the retry limit is reached, at which point it stops; on success, the HTML data together with its external links and redirect links is stored. When the number of redirects for a URL reaches the upper limit, processing of that URL stops and an error is logged. After that, different page statuses are handled in different ways.
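  Before looking at that code, here is a hedged sketch of the retry and redirect limits just described; it is not the project's fetch() implementation, and download() is a hypothetical stand-in for the real HTTP round trip over a pooled connection:

import asyncio
import logging
import urllib.parse

async def download(url):
    await asyncio.sleep(0)            # placeholder for the network round trip
    return 200, {}, '<html></html>'

async def fetch_with_limits(url, add_url, max_tries=4, max_redirect=10):
    for _ in range(max_tries):                        # stop at the retry limit
        try:
            status, headers, body = await download(url)
            break
        except OSError:
            continue                                  # transient failure: try again
    else:
        logging.error('%r failed after %d tries', url, max_tries)
        return None
    if status in (301, 302, 303, 307, 308):           # redirect: spend one unit of budget
        if max_redirect > 0:
            add_url(urllib.parse.urljoin(url, headers.get('location', '')),
                    max_redirect - 1)
        else:
            logging.error('redirect limit reached for %r', url)
        return None
    return body                                       # success: hand the html to the link parser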

  The following code is the region of crawling.py that starts at line 333 of the file and runs to the end of the corresponding method; it chooses a different handling strategy depending on the detected page status. A regular expression is used to extract URL information from the page, here selecting strings that begin with href; the core URL-extraction code is below:

# Replace href with (?:href|src) to follow image links.
self.urls = set(re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', body))
if self.urls:
    logger.warn('got %r distinct urls from %r', len(self.urls), self.url)
    self.new_urls = set()
    for url in self.urls:
        url = unescape(url)
        url = urllib.parse.urljoin(self.url, url)
        url, frag = urllib.parse.urldefrag(url)
        if self.crawler.add_url(url):
            self.new_urls.add(url)

  From this code it is clear that the set of results matched by the regular expression is stored in the urls set, processed one by one in a for loop, and added through crawler.add_url() to the todo collection of the crawler object that the current fetcher belongs to.
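  For completeness, here is a hedged sketch of the bookkeeping add_url() has to perform; the project's real implementation may also filter hosts and exclusion patterns, so this is not its exact code:

import urllib.parse

def add_url(crawler, url, max_redirect=10):
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        return False                       # ignore mailto:, javascript:, etc.
    if url in crawler.todo or url in crawler.busy or url in crawler.done:
        return False                       # already queued, in flight, or finished
    crawler.todo[url] = max_redirect       # new work for the crawl() loop
    return True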

 

  Building on the analysis so far, a further look at the master file crawl.py gives the overall architecture of the crawler:

  The master file first parses the command line with argparse.ArgumentParser and sets up reading from and controlling the console; under Windows, the IOCP-based event loop is selected. In the main method, the data returned by parse_args is first stored in a dictionary; if the roots attribute is missing, a usage prompt is shown. Then the logging level is configured, which sets the minimum level of log output: anything below that level is not emitted.

  When the program is entered through the main entry point, a Crawler is first initialized from the command-line parameters; then the asyncio event loop object is obtained and its run_until_complete method is executed, which keeps the program running until the crawl is finished.
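  Putting the last two paragraphs together, the start-up flow can be sketched roughly as follows. This is not crawl.py itself; the argument names and the Crawler constructor call are assumptions:

import argparse
import asyncio
import logging
import sys

import crawling   # the project's crawling.py, which defines Crawler

def main():
    parser = argparse.ArgumentParser(description='Web crawler')
    parser.add_argument('roots', nargs='*', help='root URLs to start from')
    parser.add_argument('-v', '--verbose', action='count', default=1)
    args = parser.parse_args()
    if not args.roots:
        parser.print_usage()                   # no roots given: show help and stop
        return
    # map the -v count to a log level; messages below it are suppressed
    levels = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]
    logging.basicConfig(level=levels[min(args.verbose, len(levels) - 1)])
    if sys.platform == 'win32':
        loop = asyncio.ProactorEventLoop()     # IOCP-based loop on Windows
        asyncio.set_event_loop(loop)
    else:
        loop = asyncio.new_event_loop()
    crawler = crawling.Crawler(args.roots)     # assumed constructor call
    try:
        loop.run_until_complete(crawler.crawl())   # run until the crawl finishes
    finally:
        loop.close()

if __name__ == '__main__':
    main()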

  In addition, reporting.py implements the printing of the current job status. fetcher_report(fetcher, stats, file=None) prints the job status of a single URL, where the URL is the fetcher's url attribute; report(crawler, file=None) prints the job status of all completed URLs in the whole project.
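  Based only on the signatures quoted above, a usage sketch might look like this; the exact output format depends on the project's implementation:

import sys
import reporting          # the project's reporting.py

def print_summary(crawler):
    # one summary entry per completed url for the whole crawl
    reporting.report(crawler, file=sys.stdout)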

  With that, the basic framework of crawl has been laid out. As for some of the Python language features in this program that are not easy to understand, and some of the core modules it uses, they will be covered in the next post, "Standard Crawler Analysis: Streamlined but Not Simple!".

 

Reposted from: https://www.cnblogs.com/wangbiaoneu/p/crawl-python-500lines.html
