Web Crawler Basics

Web Crawler

  • Also called a web spider or web robot: a program that crawls data from the network
  • A Python program that imitates a person visiting a site; the more realistic the imitation, the better
  • Purpose: analyzing large amounts of data lets a company follow market trends and make effective decisions
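
One common way to make a crawler's requests look more like a person's visit is to send browser-like HTTP headers. The sketch below only builds the request object and does not contact any server. The header values, the `BROWSER_HEADERS` name, the `build_request` helper, and the example.com URL are all illustrative assumptions, not values from the original post.

```python
import urllib.request

# Hypothetical header values; a real browser sends longer, different strings.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url):
    """Return a Request carrying browser-like headers instead of urllib's defaults."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


# Placeholder URL; nothing is sent until the request is passed to urlopen().
req = build_request("https://example.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would then send the request with these headers.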

How enterprises obtain data

  • The company's own data
  • Data purchased from third-party data platforms
  • Data collected by crawlers

Advantages of Python for writing crawlers

  • Python: rich, mature request and parsing modules, plus the robust Scrapy framework
  • PHP: support for multithreading and asynchronous operation is not very good
  • Java: code is verbose, and a large amount of it is needed
  • C / C++: very efficient at run time, but the code is slow to develop

Crawler classification

General-purpose web crawler

Used by search engines; must comply with the robots protocol (robots.txt)
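
As a sketch of what complying with the robots protocol can look like in Python, the standard library's `urllib.robotparser` can evaluate robots.txt rules. The robots.txt content, the `example-crawler` user agent name, the `allowed` helper, and the URLs below are made up for illustration; a real crawler would load the site's actual file, for example with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/index.html"))         # True
print(allowed("https://example.com/private/data.html"))  # False
```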

Focused web crawler

A crawler you write yourself: topic-oriented and demand-driven

Steps for crawling data

  • Determine the URL to crawl
  • Fetch the corresponding HTML page over HTTP / HTTPS
  • Extract useful data from the HTML page
    • Save the data that is needed
    • If the page contains other URLs, continue from step 2
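
The steps above can be sketched as a small loop using only the standard library. This is a minimal illustration under assumptions, not a production crawler: the names `LinkAndTitleParser`, `fetch_html`, and `crawl` are hypothetical, "useful data" is taken to mean the page title, and the `max_pages` limit is added so the loop always ends.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Step 2: download the page over HTTP / HTTPS."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=10):
    """Return {url: title} for up to max_pages pages reachable from start_url."""
    queue = deque([start_url])  # step 1: URLs waiting to be crawled
    seen = {start_url}
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)                   # step 2: get the HTML page
        parser = LinkAndTitleParser()
        parser.feed(html)                   # step 3: extract data
        saved[url] = parser.title.strip()   # save the required data
        for href in parser.links:           # other URLs: go back to step 2
            absolute = urljoin(url, href)
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved
```

The `fetch` parameter makes it possible to try the loop without network access, for example by passing a function that returns canned HTML strings. Calling `crawl("https://example.com/")` with the default would perform real requests.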

Reposted from: https://www.jianshu.com/p/6c65c88611cf
