Web Crawler Overview
One: Definition
A web crawler (also called a web spider or web robot) is a program that automatically crawls data from the network.
In essence, a Python crawler program imitates a person clicking through a browser and visiting websites; the more realistic the imitation, the better.
Two: Purposes of crawling data
1. Obtaining large amounts of data for data analysis
2. Gathering test data for a company's projects, or data required by the company's business
Three: How companies acquire data
1. The company's own data
2. Purchases from third-party data platforms (e.g., Datatang, Guiyang Big Data Exchange)
3. Data crawled by a web crawler
Four: Advantages of Python for web crawlers
1. Python: rich, mature request and parsing modules, plus the powerful web crawler framework Scrapy
2. PHP: weak support for multithreading and asynchronous operations
3. Java: verbose, with a large amount of code
4. C/C++: high runtime efficiency, but slower to develop
Five: Crawler classification
1. General-purpose web crawlers (search engines; they comply with the robots protocol)
   robots protocol: through robots.txt, a website tells search engines which pages may be crawled and which may not;
   general-purpose web crawlers need to comply with the robots protocol (a gentlemen's agreement)
   https://www.taobao.com/robots.txt
2. Focused web crawlers: crawlers you write yourself
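Python's standard library can check the robots protocol for you. The sketch below parses a sample robots.txt body (illustrative, not Taobao's actual file; user-agent names and paths are made up) and asks whether a given crawler may fetch a given page:

```python
from urllib import robotparser

# A sample robots.txt body (illustrative, not a real site's file).
robots_txt = """\
User-agent: Baiduspider
Disallow: /product/

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Baiduspider may not crawl /product/, but may crawl other pages;
# every other crawler is disallowed everywhere.
print(rp.can_fetch("Baiduspider", "https://example.com/product/item.html"))
print(rp.can_fetch("Baiduspider", "https://example.com/index.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))
```

For a live site you would call `rp.set_url("https://www.taobao.com/robots.txt")` followed by `rp.read()` instead of parsing an inline string.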
Six: Steps of crawling data
1. Determine the URL address that needs to be crawled
2. Send a request to the URL with a request module and get the site's response
3. Extract the required data from the response content
   1. Save the required data
   2. Follow up on other URL addresses in the page that need crawling, return to step 2 to send requests, and repeat the cycle
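The steps above can be sketched as a simple crawl loop. To keep the example self-contained and runnable offline, pages come from an in-memory dict instead of real HTTP; the URLs, page contents, and the `fetch`/`crawl` names are all illustrative. In a real crawler, `fetch()` would use `urllib.request.urlopen(url)` or `requests.get(url)`:

```python
import re
from collections import deque

# In-memory "website" standing in for real HTTP responses (illustrative).
PAGES = {
    "https://example.com/": '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>',
    "https://example.com/a": '<p>data A</p> <a href="https://example.com/b">B</a>',
    "https://example.com/b": '<p>data B</p>',
}

def fetch(url):
    # Step 2: send a request to the URL and get the response.
    # Real version: urllib.request.urlopen(url).read() or requests.get(url).text
    return PAGES.get(url, "")

def crawl(start_url):
    results = {}                  # step 3.1: saved data
    seen = {start_url}            # avoid re-crawling the same URL
    queue = deque([start_url])    # step 1: URL addresses to crawl
    while queue:
        url = queue.popleft()
        html = fetch(url)
        # Step 3: extract the required data from the response content
        results[url] = re.findall(r"<p>(.*?)</p>", html)
        # Step 3.2: follow other URLs on the page and repeat from step 2
        for link in re.findall(r'href="(.*?)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results

data = crawl("https://example.com/")
print(data)
```

Regular expressions are used here only to keep the example dependency-free; a real focused crawler would typically parse HTML with a library such as lxml or BeautifulSoup.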