Web crawlers from entry to exit

In the browser: JavaScript and the Ajax API; the source code of a web page is parsed into a DOM tree.

The common workflow of a web crawler is a loop: take a URL from the to-crawl library, fetch it, parse the structured data in the fetched page, detect the URLs in the fetched page, filter out the URLs that have already been crawled, and put the remaining uncrawled URLs back into the to-crawl library.
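As a rough sketch of this loop (the seed URL, the page limit, and the regex-based link detection are illustrative assumptions, not part of the article), the whole cycle fits in a few lines of Java:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlLoop {
    // Naive link detection; a real crawler would also resolve relative URLs.
    private static final Pattern LINK = Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    public static void main(String[] args) throws Exception {
        Deque<String> toCrawl = new ArrayDeque<>(List.of("https://example.com/")); // to-crawl library
        Set<String> crawled = new HashSet<>();                                     // historical URL library
        HttpClient client = HttpClient.newHttpClient();                            // HTTP request component

        while (!toCrawl.isEmpty() && crawled.size() < 10) {    // small limit keeps the sketch bounded
            String url = toCrawl.poll();                       // take a URL from the to-crawl library
            if (!crawled.add(url)) continue;                   // skip URLs that were already crawled
            HttpResponse<String> resp = client.send(           // fetch the page
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // ... parse structured data out of resp.body() here ...
            Matcher m = LINK.matcher(resp.body());             // detect URLs in the fetched page
            while (m.find()) {
                String found = m.group(1);
                if (!crawled.contains(found)) toCrawl.add(found); // keep only uncrawled URLs
            }
        }
    }
}
```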
Main components:
historical URL library; HTTP request component; web page structured-data extraction component; new-URL detection component; URL deduplication component
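One way to picture these components is as a set of small interfaces that the crawl loop above wires together; the names below are illustrative, not taken from any particular framework:

```java
import java.util.List;

interface HistoricalUrlStore {            // historical URL library
    boolean markCrawled(String url);      // false if the URL was already recorded
}

interface HttpFetcher {                   // HTTP request component
    String fetch(String url) throws Exception;
}

interface DataExtractor<T> {              // web page structured-data extraction component
    T extract(String html);
}

interface UrlDetector {                   // new-URL detection component
    List<String> detectUrls(String html, String baseUrl);
}

interface UrlDeduplicator {               // URL deduplication component
    boolean isNew(String url);
}
```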
Data extraction methods: regular expressions; CSS selectors or XPath (element locators), which browser JavaScript provides for getting nodes from the DOM tree.
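Both techniques can pull the same field out of a page. In the sketch below the HTML snippet is invented for illustration, and the CSS-selector half assumes the third-party jsoup library for HTML parsing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExtractDemo {
    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"title\">Hello Crawler</h1></body></html>";

        // Regular expression: match the raw text directly.
        Matcher m = Pattern.compile("<h1[^>]*>([^<]+)</h1>").matcher(html);
        if (m.find()) System.out.println("regex:        " + m.group(1));

        // CSS selector: parse into a DOM-like tree first, then locate the node.
        Document doc = Jsoup.parse(html);
        System.out.println("css selector: " + doc.select("h1.title").text());
    }
}
```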
Problems solved by a crawler framework: parallel crawling, URL deduplication, saving crawl history, detecting new URLs, providing developers with page-parsing interfaces (CSS selectors, XPath, etc.), and providing HTTP request customization interfaces (simulated login, form POSTs, etc.).
Common frameworks:
General crawler frameworks: serve search engines, download web pages at large scale, extract coarse-grained content from pages, and submit it to the index. Examples: Nutch, Heritrix.
Fine-grained data collection crawler frameworks: collect specific structured data. Examples: Java: WebCollector, WebMagic; Python: Scrapy.
Selection criteria for general crawler frameworks: scale, a solid URL maintenance mechanism, automatic URL detection, re-crawling of pages to detect new URLs, and ease of customization and extension.
Selection criteria for fine-grained data collection frameworks: whether page-extraction support is good, whether HTTP requests can be deeply customized, whether the URL detection mechanism can be deeply customized, whether the deduplication component becomes an efficiency bottleneck, whether crawling can resume from a breakpoint, and whether data loaded by JavaScript can be handled.
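The deduplication criterion above is where memory often becomes the bottleneck: keeping every crawled URL in an exact set grows without bound, while a Bloom filter keeps a fixed-size bit array at the cost of a small false-positive rate. A minimal sketch (the size, hash count, and hash mixing are illustrative assumptions):

```java
import java.util.BitSet;

public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public UrlBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Adds the URL and reports whether it was new; may occasionally
    // report a genuinely new URL as already seen (false positive).
    public boolean addIfNew(String url) {
        int h1 = url.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;   // second hash derived from the first
        boolean isNew = false;
        for (int i = 0; i < hashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(idx)) { isNew = true; bits.set(idx); }
        }
        return isNew;
    }

    public static void main(String[] args) {
        UrlBloomFilter filter = new UrlBloomFilter(1 << 20, 4);
        System.out.println(filter.addIfNew("https://example.com/a")); // true: first time seen
        System.out.println(filter.addIfNew("https://example.com/a")); // false: duplicate
    }
}
```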
Distributed crawlers use a cluster to solve the crawler's computation, storage, and bandwidth resource problems. Two common designs: distributed crawlers based on Map-Reduce, and distributed crawlers based on a distributed message queue.
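In the message-queue variant, every worker in the cluster pulls URLs from a shared queue and pushes newly detected URLs back, so the queue coordinates the work. The sketch below uses an in-process BlockingQueue and stubbed fetch/extract methods as stand-ins for a real broker and real components, purely to keep the example self-contained:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class DistributedWorker implements Runnable {
    private final BlockingQueue<String> urlQueue;       // stands in for the shared message queue
    private final Set<String> crawled;                  // stands in for a shared deduplication store

    public DistributedWorker(BlockingQueue<String> urlQueue, Set<String> crawled) {
        this.urlQueue = urlQueue;
        this.crawled = crawled;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String url = urlQueue.take();           // pull the next URL from the queue
                if (!crawled.add(url)) continue;        // cluster-wide deduplication
                String html = fetch(url);               // HTTP request component (stub)
                for (String found : detectUrls(html)) { // new-URL detection component (stub)
                    if (!crawled.contains(found)) urlQueue.put(found); // push back to the queue
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private String fetch(String url) { return ""; }                    // placeholder
    private List<String> detectUrls(String html) { return List.of(); } // placeholder

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(List.of("https://example.com/"));
        Set<String> crawled = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < 4; i++) new Thread(new DistributedWorker(queue, crawled)).start();
    }
}
```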
