A web crawler (also known as a web spider or web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.
Put simply, a crawler is a program that automatically fetches the data you want from web pages.
The basic workflow of a crawler
Initiate a request
Send a request to the target site through an HTTP library, that is, send a Request. The request may carry additional header information; then wait for the server to respond.
Acquire the response content
If the server responds normally, you get a Response whose body is the page content to be fetched. It may be HTML, a JSON string, binary data (such as images or videos), or other types.
Parse the content
If the content is HTML, it can be parsed with regular expressions or a page-parsing library; if it is JSON, it can be converted directly into a JSON object; if it is binary data, it can be saved as-is or processed further.
Save the data
The data can be stored in many forms: saved as plain text, written to a database, or stored as files in a specific format.
References:
1. Writing a crawler with Golang (Part 1)
3. The way of the crawler, explained simply: comparing Python and Golang with GraphQuery