Web Crawler
- Also called a web spider or web robot: a program that crawls data from the network
- A Python program imitates a person visiting a website; the more realistic the imitation, the better
- Large amounts of data let a company analyze market trends and make effective decisions
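One common way to make a request look more like a real visitor is to send browser-like headers. The following is a minimal sketch using only the standard library; the helper name `build_request` and the header values are assumptions chosen for illustration, not something the original notes specify.

```python
from urllib.request import Request

# Hypothetical header values, chosen only for illustration.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url):
    """Return a Request carrying browser-like headers instead of urllib's defaults."""
    return Request(url, headers=BROWSER_HEADERS)
```

Passing the returned object to `urllib.request.urlopen` would send the request with these headers.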
How companies obtain data
- The company's own data
- Data purchased from third-party data platforms
- Data collected by crawlers
Advantages of Python for writing crawlers
- Python: rich, mature request and parsing modules, plus the robust Scrapy framework
- PHP: support for multithreading and asynchronous processing is not very good
- Java: verbose; requires a large amount of code
- C/C++: highly efficient at run time, but the code is slow to develop
Types of crawlers
General-purpose web crawler
Used by search engines; must comply with the robots protocol (robots.txt)
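As a rough illustration of complying with the robots protocol, the standard library's `urllib.robotparser` can check whether a URL may be fetched. This sketch parses an in-memory robots.txt so that it runs without network access; the sample rules, the helper `build_robots_checker`, and the user-agent name `my-crawler` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = build_robots_checker(SAMPLE_ROBOTS)
blocked = checker.can_fetch("my-crawler", "https://example.com/private/data.html")
allowed = checker.can_fetch("my-crawler", "https://example.com/index.html")
```

With these sample rules, `blocked` is `False` and `allowed` is `True`. Against a live site, `set_url()` followed by `read()` would load the real robots.txt instead.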
Focused web crawler
A crawler you write yourself: topic-oriented and demand-driven
Steps for crawling data
- Determine the URL to crawl
- Fetch the corresponding HTML page over HTTP/HTTPS
- Extract the useful data from the HTML page
- Save the required data
- If the page contains other URLs to crawl, repeat from step 2
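The steps above can be sketched as a small breadth-first loop. This is a minimal illustration using only the standard library, not a production crawler: the names `crawl`, `fetch_html`, and `LinkAndTitleParser` are hypothetical, the "useful data" is assumed to be the page title, and `fetch` is injectable so the loop can be tried on in-memory pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Step 2: download the page over HTTP/HTTPS."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=10):
    """Crawl from start_url and return a dict mapping each URL to its title."""
    queue = deque([start_url])      # step 1: URLs waiting to be crawled
    seen = {start_url}
    results = {}                    # step 4: saved data (URL -> title)
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)           # step 2: get the HTML page
        parser = LinkAndTitleParser()
        parser.feed(html)           # step 3: extract the useful data
        results[url] = parser.title.strip()
        for href in parser.links:   # step 5: follow other URLs on the page
            next_url = urljoin(url, href)
            is_web_url = next_url.startswith(("http://", "https://"))
            if is_web_url and next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return results
```

The `max_pages` limit and the `seen` set keep the loop from running forever or revisiting pages; a real crawler would also add delays, error handling, and a robots.txt check.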
Reproduced from: https://www.jianshu.com/p/6c65c88611cf