Basics - Crawler

To pull data from websites you need a crawler. Any content that can be viewed through a browser can be crawled. The terms crawler, spider, and scraper refer to roughly the same thing, with slight differences in emphasis. Building a crawler requires a good grasp of threads, queues, and distributed processing.

1) Classification
Universal crawlers (crawl every page the links lead to) and vertical crawlers (crawl specific data from a particular type of website)
*** Generally speaking, "crawler" refers to a vertical crawler.

2) What to crawl
Baidu (songs, movies, books), Baidu cloud disk, BT torrents, NetEase, Sina, contact information (email, phone), and any other data you want to get.
*** See the discussion on Zhihu: What cool, interesting and useful things can be done with crawler technology?

3) Working principle
a. Give the crawler a seed URL
b. Fetch the page content, store it, and extract the URLs it contains
c. Add the newly found URLs to the URL queue, waiting to be processed
d. Take a URL from the URL queue and repeat from step b
(A minimal code sketch of this loop follows section 5.)

4) Crawl strategy
Depth-first strategy: visit child pages first, then pages at the same level.
Breadth-first strategy: visit pages at the same level first, then child pages.
Best-first strategy: prioritize pages that contain the targeted information.

5) Crawl data sources
RSS, API, AJAX, Web (Pagination)
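
The working principle in section 3 and the breadth-first strategy in section 4 can be sketched in a few lines of Python. This is only a rough sketch: the seed URL, the simple href regex, and the use of the standard-library urllib are illustrative assumptions, not something prescribed by these notes.

# Minimal breadth-first crawler loop (sections 3 and 4); a sketch only.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"#]+)"')

def crawl(seed, max_pages=50):
    queue = deque([seed])      # URL queue (step c)
    seen = {seed}              # never enqueue the same URL twice
    pages = {}                 # URL -> stored page content (step b)
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # take one URL from the queue (step d)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        pages[url] = html                          # store the page content
        for href in LINK_RE.findall(html):         # extract the URLs in the page
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)             # add new URLs to the queue
    return pages

if __name__ == "__main__":
    print(len(crawl("https://example.com")), "pages fetched")

Using the deque as a stack instead (pop from the same end you append to) turns the same loop into a depth-first crawl.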

6) Data collection
Regular-expression matching, DOM tree parsing (CSS selectors, XPath)
*** A crawler usually targets multiple websites, so each site's parsing rules should be managed in one place.
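
As a sketch of keeping per-site parsing rules in one place, the snippet below uses BeautifulSoup (one of the parsers listed in section 12c); the site keys and CSS selectors are made-up examples.

# Per-site parsing rules managed in a single dictionary (section 6).
# Site names and selectors are illustrative assumptions.
from bs4 import BeautifulSoup

RULES = {
    "news.example.com": {"title": "h1.headline", "body": "div.article"},
    "blog.example.org": {"title": "h1.post-title", "body": "div.post-content"},
}

def parse(site, html):
    rule = RULES[site]
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in rule.items():
        node = soup.select_one(selector)
        result[field] = node.get_text(strip=True) if node else None
    return result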

7) Data deduplication
Do not crawl the same URL repeatedly
Shingling algorithm
SimHash algorithm
MD5 checksum
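
The simplest of these, URL de-duplication, can be a set of MD5 fingerprints; a minimal sketch follows (shingling and simHash target near-duplicate page content and need more machinery than fits here).

# URL de-duplication with MD5 fingerprints (section 7).
import hashlib

seen_fingerprints = set()

def is_new_url(url):
    # Light normalization only; a real crawler would also normalize
    # query parameters, fragments, trailing slashes, etc.
    digest = hashlib.md5(url.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen_fingerprints:
        return False
    seen_fingerprints.add(digest)
    return True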

8) Crawling steps
Crawling: find useful data sources
Downloading: download the content of the data sources
Scraping: extract the useful data (data checking)
Extracting: pull out the data elements (de-duplication, classification)
Formatting: organize the data elements into the format other systems require
Exporting: export the data to other systems (store in a database or build an index)
*** Special-site handling: simulated login, CAPTCHA recognition, multi-site crawling, JS rendering
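
The six steps map naturally onto a small pipeline of functions. The skeleton below only shows the hand-offs between steps; every body is a placeholder, not a real implementation.

# Skeleton of the crawling -> downloading -> scraping -> extracting ->
# formatting -> exporting pipeline from section 8; bodies are placeholders.
def crawling():            # find useful data source URLs
    return ["https://example.com/list?page=1"]

def downloading(urls):     # download raw content for each source
    return {url: "<html>...</html>" for url in urls}

def scraping(pages):       # pull out useful fragments and check them
    return [{"title": "t", "price": "9.90"} for _ in pages]

def extracting(records):   # keep the data elements, de-duplicate them
    return list({tuple(sorted(r.items())) for r in records})

def formatting(elements):  # reshape into what the downstream system expects
    return [dict(e) for e in elements]

def exporting(rows):       # store in a database or build an index
    print("exporting", len(rows), "rows")

exporting(formatting(extracting(scraping(downloading(crawling())))))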

9) Crawler detection
User-Agent checks
Frequent access from the same IP within a short period
The same workflow executed many times
Honeypot detection: links placed where a human visitor could never reach them
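
On the website side, the "same IP in a short period" signal can be as simple as a sliding-window counter per IP; a toy sketch, with arbitrary threshold values:

# Toy rate-based crawler detection (section 9): flag an IP that makes
# more than LIMIT requests within WINDOW seconds. Thresholds are arbitrary.
import time
from collections import defaultdict, deque

WINDOW = 10   # seconds
LIMIT = 20    # requests allowed per window

recent = defaultdict(deque)

def looks_like_crawler(ip, now=None):
    now = now if now is not None else time.time()
    hits = recent[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW:   # drop requests outside the window
        hits.popleft()
    return len(hits) > LIMIT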

10) Crawler strategy
Follow robots.txt as much as possible
Limit the crawl speed, depth, and number of pages; reduce concurrent requests
Disguise the UA (dynamic User-Agent)
Use proxy IPs (IP rotation, proxies, blacklisting)
Do not follow links marked nofollow or hidden with display:none
Randomize the crawl pattern; do not run just one fixed task
*** When a website identifies you as a crawler: CAPTCHA prompts, useless data, HTTP errors
*** Re-dial ADSL to obtain a new IP
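
A sketch combining three of the points above: honouring robots.txt via urllib.robotparser, rotating the User-Agent, and throttling request speed. The UA strings and the one-second delay are arbitrary example values.

# Polite fetching (section 10): robots.txt check, rotating User-Agent,
# fixed delay between requests.
import random
import time
from urllib import robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

_robots = {}   # one cached parser per site

def allowed(url, ua="*"):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if root not in _robots:
        parser = robotparser.RobotFileParser(root + "/robots.txt")
        try:
            parser.read()   # fetch and parse robots.txt
        except Exception:
            pass            # unread robots.txt -> can_fetch stays conservative (False)
        _robots[root] = parser
    return _robots[root].can_fetch(ua, url)

def polite_get(url, delay=1.0):
    if not allowed(url):
        return None
    time.sleep(delay)   # limit crawl speed
    req = Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    return urlopen(req, timeout=10).read()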

11) Monitoring
Crawler server monitoring: bandwidth, CPU, memory, disk
Crawler monitoring: whether the crawler is running normally
Target-website monitoring: whether the data source is available, whether its structure has changed
Scraped-data monitoring: whether the data is garbled
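
For the target-website checks, a tiny probe might confirm that the data source still responds and that an expected element is still present; the URL and selector passed in are placeholders.

# Target-website monitoring sketch (section 11): availability plus a
# structure check. The inputs are placeholder assumptions.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def check_source(url, required_selector):
    try:
        resp = urlopen(url, timeout=10)
    except Exception:
        return {"available": False, "structure_ok": False}
    html = resp.read().decode("utf-8", "replace")
    soup = BeautifulSoup(html, "html.parser")
    return {
        "available": resp.status == 200,
        "structure_ok": soup.select_one(required_selector) is not None,
    }

# Example: check_source("https://example.com/list", "div.article")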

12) Open source framework
a. Framework
Python
  Scrapy https://github.com/scrapy/scrapy
  pyspider https://github.com/binux/pyspider
Java
  crawler4j https://github.com/yasserg/crawler4j
  WebMagic https://github.com/code4craft/webmagic
b. HTTP request
  urllib (Python); HttpURLConnection, HttpClient (Java)
c. HTML parsing
  BeautifulSoup https://www.crummy.com/software/BeautifulSoup/
  jsoup https://jsoup.org/
d. JS rendering
  Selenium http://www.seleniumhq.org/
  PhantomJS http://phantomjs.org/
e. Other
  Elasticsearch, Redis
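
For reference, a minimal Scrapy spider ties sections 3-6 together in a few lines; the quotes.toscrape.com start URL and the CSS selectors follow the standard Scrapy tutorial example rather than anything in these notes.

# Minimal Scrapy spider (section 12a). Run with: scrapy runspider quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract the data elements on this page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination, i.e. add new URLs to the crawl queue
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)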

