Crawler technology selection

First, the fetching and parsing approach

Because this project must crawl a large number of pages imprecisely, requests cannot be tailored to the features of each individual site; as a result, the HTML of the many pages that are built entirely by JS cannot be fetched successfully with plain requests.

  There are two options.

    1. Execute the page's JS with a tool such as htmlunit, and take the result of that execution.

    2. Load the page with a real browser engine, truly simulating a browser.

  Option 1: in general, tools like htmlunit can interpret simple JS statements, but they cannot reliably execute large amounts of JS (especially pages written entirely in JS).

  Option 2 also has drawbacks that cannot be ignored.

    1. Even if the engine is configured not to load CSS, waiting for all the JS to finish still takes time, so it is less efficient.

    2. When using PhantomJS without careful control, running it from multiple threads is error-prone, and PhantomJS processes that are never shut down waste a lot of resources.


  This project puts accuracy first and tries to avoid missing required data, so the second option was chosen.

  Workaround for the drawbacks: have the worker threads feed tasks to a fixed set of PhantomJS instances, instead of each thread opening its own PhantomJS process.
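The workaround above can be sketched as a fixed-size pool: crawler threads borrow a renderer, run the task, and always return it, so at most N PhantomJS processes ever exist and none is leaked. This is a minimal, hypothetical illustration using a stub `Renderer` in place of a real PhantomJS-backed driver; it is not WebCollector API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Fixed pool: at most `size` renderers ever exist; threads block on
// take() instead of spawning a new PhantomJS process per task.
class RendererPool {
    // Stub standing in for a PhantomJS-backed driver (hypothetical).
    static class Renderer {
        String render(String url) { return "<html>" + url + "</html>"; }
    }

    private final BlockingQueue<Renderer> pool;

    RendererPool(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) pool.add(new Renderer());
    }

    // Borrow a renderer, run the task, and always return it to the pool.
    String fetch(String url) {
        try {
            Renderer r = pool.take();
            try {
                return r.render(url);
            } finally {
                pool.put(r); // never leak an instance
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

With a real driver, `Renderer` would wrap process startup and a `close()` call on shutdown; the pooling logic stays the same.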

Second, the choice of framework

Since later development may involve Hadoop, but there is currently no Hadoop environment, the WebCollector framework was chosen.

The framework supports several fetch drivers, such as httpclient, htmlunit, selenium, and phantomjs.

For parsing, it supports the mainstream CSS selectors as well as the usual regex-based filtering.

It can run tasks in multiple threads (the thread count must be limited, or PhantomJS wrapped appropriately, to keep PhantomJS running normally) and automatically deduplicates urls. Version 2.x is used here; it is worth noting that WebCollector 2.x no longer supports redis (only 1.x does). 2.x and later keep only BerkeleyDB and RamDB (the in-memory database, which is actually a Map).

Of these, BerkeleyDB supports resuming an interrupted crawl, while RamDB (obviously) does not.
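The reason for that difference is simply where the crawl frontier lives: it can only survive a restart if it is written to disk. A minimal, hypothetical sketch of the contrast (plain java.nio, nothing from WebCollector or BerkeleyDB):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// In-memory frontier (RamDB-style): gone when the JVM exits.
// Disk-backed frontier (BerkeleyDB-style): reloaded on start, so the
// crawl can resume where it left off.
class Frontier {
    private final Set<String> visited = new HashSet<>();
    private final Path store; // null = memory-only, no resume

    Frontier(Path store) throws IOException {
        this.store = store;
        if (store != null && Files.exists(store)) {
            visited.addAll(Files.readAllLines(store)); // resume previous state
        }
    }

    // Returns true if the url had not been seen before.
    boolean markVisited(String url) throws IOException {
        if (!visited.add(url)) return false;
        if (store != null) {
            Files.write(store, (url + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        return true;
    }
}
```

A real implementation would batch writes and store per-url status, but the resume property comes entirely from the persistent store.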

  One doubt while reading the source: RamDB uses neither Hashtable nor ConcurrentHashMap, so how does it guarantee correctness when multiple threads add tasks? Note: the merge method does not appear to run multi-threaded.
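If one were adding that thread safety, the usual idiom is a concurrent set built on ConcurrentHashMap, whose atomic add() answers "was this url new?" without external locking. A generic sketch (my own illustration, not WebCollector's RamDB code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Thread-safe "seen urls" set: ConcurrentHashMap.newKeySet() gives
// concurrent adds, and add() atomically reports whether the url was new.
class SeenUrls {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // true only for the first thread to add this url
    boolean addIfNew(String url) { return seen.add(url); }

    int size() { return seen.size(); }
}
```

Under this design, several crawler threads can add overlapping batches of urls and each url is still counted exactly once.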

In addition, seeds and custom regex filters can be added when the crawler is created. It should be noted that the regexes added at this point are not applied to tasks added inside the visit method. In other words, if a subsequent link is generated by JS code, WebCollector does not add that page's url automatically; it has to be added explicitly.

@@@ Correction: the regexes set in the constructor do affect urls added in visit.
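Since the constructor regexes also filter urls added in visit, a JS-generated link added by hand must still satisfy them. The "+include / -exclude" convention can be sketched as below; this is a simplified, hypothetical class modeled loosely on that convention, not WebCollector's actual RegexRule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified "+include / -exclude" url filter (hypothetical).
// A url passes if it matches at least one "+" rule and no "-" rule.
class UrlRegexRule {
    private final List<Pattern> positives = new ArrayList<>();
    private final List<Pattern> negatives = new ArrayList<>();

    void addRule(String rule) {
        if (rule.startsWith("+")) positives.add(Pattern.compile(rule.substring(1)));
        else if (rule.startsWith("-")) negatives.add(Pattern.compile(rule.substring(1)));
        else positives.add(Pattern.compile(rule)); // bare rule = include
    }

    boolean satisfy(String url) {
        for (Pattern p : negatives) if (p.matcher(url).matches()) return false;
        for (Pattern p : positives) if (p.matcher(url).matches()) return true;
        return false;
    }
}
```

So a url added manually in visit (e.g. one extracted from rendered JS) is only crawled if satisfy() returns true for it.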




Other options:

nutch could meet the requirements, but first, it is unclear how nutch would crawl dynamic pages, and second, the company has no Hadoop environment, so it was not considered.

webmagic: I have not read its source, but from its introduction it seems better suited to targeted crawling of specific sites rather than large-scale imprecise crawling, so it is set aside for now.

Failure handling
