Python web crawlers (1): What is a web crawler?

What is a web crawler

The Baidu entry on crawlers defines a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) as a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less frequently used names include ants, automatic indexers, emulators, and worms.
Put simply, a crawler is a program that simulates a client (a browser) sending network requests, receives the responses, and then extracts and stores data according to certain rules.

The main workflow of a crawler

Constructing URLs

A crawler usually needs more than a single web page; sometimes we want the data for an entire site. Collecting the page URLs one at a time by hand would be far too inefficient. So before writing a crawler, we need to work out the pattern the site's URLs follow, so that we can construct a list of URLs and then crawl each URL in that list for the data we need.
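
As a minimal sketch of building such a URL list, assume a site whose listing pages differ only by a page number in the query string. The host `example.com`, the `page` parameter, and the helper `build_url_list` are all hypothetical and chosen only for illustration; a real site's pattern has to be found by inspecting its URLs.

```python
# Hypothetical URL pattern: only the page number changes between pages.
URL_TEMPLATE = "https://example.com/list?page={page}"


def build_url_list(template, first_page, last_page):
    """Return one URL per page number, including both end pages."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


url_list = build_url_list(URL_TEMPLATE, 1, 5)
for url in url_list:
    print(url)
```

The crawler can then loop over `url_list` and request each address in turn.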

Sending a request and getting a response

An HTTP library sends a Request to the target site and then waits for the server to respond. If the server responds normally, we get a Response whose content is the page content we want to acquire. That content may be HTML, a JSON string, binary data (such as pictures or video), or some other type.
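
The request and response step might look like the sketch below. The text does not name an HTTP library, so the standard library's `urllib.request` is used here as an assumption; the `User-Agent` value and the helper names `build_request` and `fetch` are also illustrative only.

```python
import urllib.request

# Assumed header value: a browser-like User-Agent, since the crawler
# is simulating a client. Replace it with whatever the target site expects.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)"}


def build_request(url, headers=None):
    """Create the Request object that will be sent to the target site."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch(url, timeout=10):
    """Send the request and return (status, content_type, body_bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        return response.status, content_type, response.read()
```

Calling `fetch` on a page URL returns the raw body as bytes, and the `Content-Type` value hints at whether that body is HTML, JSON, or binary data. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a real crawler would wrap the call in error handling.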

Extract data

When the returned data is HTML, we can extract data from it with regular expressions, or with XPath through the lxml module. When a JSON string is returned, we can parse it with the json module. When binary data is returned, we can save it or process it further.
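
The two text cases can be sketched with the standard library alone. The HTML fragment, the JSON string, and the helper names below are made up for illustration. The regular expression only fits this exact markup; for real pages, XPath through lxml is usually the more robust choice, and it is left out here only to keep the sketch free of third-party dependencies.

```python
import json
import re

# Made-up sample responses standing in for what a server might return.
html_text = (
    '<ul><li><a href="/item/1">First</a></li>'
    '<li><a href="/item/2">Second</a></li></ul>'
)
json_text = '{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]}'

# Captures the href value and the link text of each simple <a> tag.
LINK_PATTERN = re.compile(r'<a\s+href="([^"]+)">([^<]*)</a>')


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML."""
    return LINK_PATTERN.findall(html)


def extract_titles(text):
    """Parse a JSON string and return the title of every item."""
    data = json.loads(text)
    return [item["title"] for item in data["items"]]


links = extract_links(html_text)
titles = extract_titles(json_text)
print(links)
print(titles)
```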

Save data

Data can be saved in many forms: as plain text, in a database, or as a file in a specific format.
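
As a rough sketch of two of these options, the code below writes the same records to a JSON file and to a SQLite database. The record fields, file names, table name, and helper functions are assumptions for illustration; SQLite is used only because it ships with Python, not because the text prescribes it.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Made-up records standing in for data extracted by the crawler.
records = [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]


def save_as_json(rows, path):
    """Write the rows to a UTF-8 JSON file."""
    text = json.dumps(rows, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def save_to_sqlite(rows, db_path):
    """Insert the rows into an 'items' table, creating it if needed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, title TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO items (id, title) VALUES (:id, :title)", rows
            )
    finally:
        conn.close()


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "items.json"
    db_path = Path(tmp) / "items.db"
    save_as_json(records, json_path)
    save_to_sqlite(records, db_path)

    # Read both back to confirm what was stored.
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    conn = sqlite3.connect(db_path)
    stored = conn.execute("SELECT id, title FROM items ORDER BY id").fetchall()
    conn.close()

print(loaded)
print(stored)
```

Binary responses such as images can be saved in a similar way by writing the response bytes to a file opened in binary mode.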

Reposted from: https://www.jianshu.com/p/cd6977510dc8

Origin blog.csdn.net/weixin_33692284/article/details/91077758