Web Crawler Overview

1. Definition

A web crawler (also called a web spider or web robot) is a program that automatically crawls data from the network.

In practice, it is a Python program that imitates a person clicking through a browser and visiting websites; the more realistic the imitation, the better.
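Imitating a browser usually starts with sending the request headers a real browser would send. A minimal sketch using Python's standard library (the URL and User-Agent string below are placeholders, not from the original text):

```python
import urllib.request

# Build a request that imitates a normal browser by supplying a
# browser-like User-Agent header (this exact string is just an example).
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

# urllib normalizes header names, so the header is stored as "User-agent".
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

Third-party modules such as `requests` make this shorter, but the idea is the same: the closer the request looks to a real browser, the less likely the site is to reject it.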

2. Purposes of crawled data

1. Obtaining large amounts of data for data analysis

2. Providing test data for a company's projects, or data required by the company's business

3. How companies acquire data

1. The company's own data

2. Purchase from third-party data platforms (e.g. Datatang, Guiyang Big Data Exchange)

3. Crawl the data with a web crawler

4. Why use Python for crawlers

1. Python: rich, mature request and parsing modules, plus the powerful Scrapy crawler framework

2. PHP: weak support for multithreading and asynchronous processing

3. Java: verbose, with a large amount of code

4. C/C++: efficient at runtime, but slow to develop

5. Crawler classification

1. General-purpose web crawlers (search engines; they comply with the robots protocol)

The robots protocol: via robots.txt, a website tells search engines which pages may be crawled and which may not.

General-purpose web crawlers are expected to comply with the robots protocol (a gentlemen's agreement), for example:

https://www.taobao.com/robots.txt

2. Focused web crawlers: crawlers you write yourself for a specific purpose
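The robots protocol mentioned above can be checked from Python with the standard-library `urllib.robotparser`. A small sketch using a made-up robots.txt (the rules below are illustrative, not Taobao's actual file):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything under /private/ is off limits.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

For a live site, `rp.set_url(...)` followed by `rp.read()` fetches and parses the real robots.txt (such as the Taobao one linked above) instead of a hard-coded string.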

6. Steps for crawling data

1. Determine the URL address to crawl

2. Send a request to that URL address with a request module and receive the site's response

3. Extract the required data from the response content

  1. Save the required data

  2. For other URL addresses in the page that still need to be followed, return to step 2 and send a request again, repeating the cycle
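The steps above form a loop that can be sketched as one generic function. The `fetch` and `extract` callables here are placeholders for a real request module and a real parser, and the in-memory "site" exists only so the sketch runs without network access:

```python
from collections import deque

def crawl(start_url, fetch, extract, max_pages=10):
    """Generic crawl loop: fetch a page, save its data,
    queue newly discovered URLs, and repeat (steps 2-3 above)."""
    seen = {start_url}
    queue = deque([start_url])
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)             # step 2: request the URL address
        data, links = extract(html)   # step 3: extract data and follow-up URLs
        results.append(data)          # step 3.1: save the required data
        for link in links:            # step 3.2: loop back to step 2
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results

# Tiny in-memory "site" to demonstrate the loop: page -> (data, outgoing links).
PAGES = {
    "/a": ("data-a", ["/b", "/c"]),
    "/b": ("data-b", []),
    "/c": ("data-c", ["/a"]),
}
results = crawl("/a",
                fetch=lambda url: url,
                extract=lambda html: PAGES[html])
print(results)  # ['data-a', 'data-b', 'data-c']
```

The `seen` set keeps the cycle `/c -> /a` from being crawled twice, and `max_pages` bounds the loop; a real crawler would also need request throttling and error handling.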

 


Origin www.cnblogs.com/maplethefox/p/11319858.html