Crawler 1

1. What is a crawler?

Definition: A crawler is an automated program that requests a website and extracts data.

2. Basic process

Initiate a request ----> get the response content ----> parse the content ----> save the data

Initiate a request to the target site through an HTTP library, that is, send a Request. The request can carry additional information such as headers. Then wait for the server to respond.

If the server responds normally, the crawler receives a Response. The content of the Response is the page content to be obtained; its type may be HTML, a JSON string, binary data, etc.

If the content is HTML, it can be parsed with regular expressions or a web page parsing library. If it is JSON, it can be converted directly into a JSON object and parsed. If it is binary data, it can be saved or processed further.

The data can be saved in various forms: as plain text, in a database, or as a file in a specific format.
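The parse and save steps can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function names `parse_body` and `save` are hypothetical, and the content-type check is deliberately simplified.

```python
import json
import re
from pathlib import Path


def parse_body(content_type: str, body: bytes):
    """Parse a response body according to its (simplified) content type."""
    if "json" in content_type:
        # JSON text converts directly into Python objects.
        return json.loads(body.decode("utf-8"))
    if "html" in content_type:
        # A regular expression is enough for this toy case; a real page
        # usually calls for a proper HTML parsing library.
        html = body.decode("utf-8", errors="replace")
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return {"title": match.group(1).strip() if match else None}
    # Anything else is treated as binary data and kept unchanged.
    return body


def save(data, path: Path) -> None:
    """Save parsed data: raw bytes as-is, everything else as JSON text."""
    if isinstance(data, bytes):
        path.write_bytes(data)
    else:
        path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
```

For example, `save(parse_body("application/json", b'{"a": 1}'), Path("out.json"))` would write the parsed object to a text file. Saving to a database would follow the same shape with a different `save` function.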

3.1 Request

Browser ----> visit a website (www.baidu.com) ----> the server hosting the site receives a Request [that is, the browser sends a message to the server where the website is hosted]

The server hosting the site ----> returns data as a Response ----> the browser gets the requested page [after the server receives the message sent by the browser, it processes the message and sends a reply back to the browser; this reply is called the Response]

For example, every record in the Network tab of the browser developer tools (opened with F12) is one request and its response.

3.2 URL

Definition: Uniform Resource Locator. Every resource on a website, such as a picture or a video, can be uniquely identified by a URL.
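As a small illustration, the parts of a URL can be inspected with `urllib.parse.urlsplit` from the Python standard library. The image path and query string below are made up for the example and are not real resources on the site.

```python
from urllib.parse import urlsplit

# Hypothetical URL of a picture; only the host name comes from the text above.
parts = urlsplit("https://www.baidu.com/img/logo.png?size=large")

print(parts.scheme)  # protocol used to fetch the resource
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the resource on that host
print(parts.query)   # optional parameters
```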

3.3 Request headers

Usually include fields such as User-Agent and Cookie.
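A minimal sketch of attaching such headers with `urllib.request.Request`. The User-Agent string and cookie value are placeholders, and the request is only built here, not sent.

```python
from urllib.request import Request

# Placeholder header values, chosen only for illustration.
req = Request(
    "https://www.baidu.com",
    headers={
        "User-Agent": "Mozilla/5.0 (example crawler)",
        "Cookie": "session_id=abc123",
    },
)

# urllib stores header names in capitalized form, e.g. "User-agent".
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would send it with these headers.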

3.4 Request body

Additional data carried with the request, such as the form data sent when a form is submitted.
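A sketch of a request that carries form data in its body. The login URL and field names are hypothetical; `urlencode` produces the same encoding a browser uses for a simple form submission.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint.
form = {"username": "alice", "password": "secret"}
body = urlencode(form).encode("utf-8")

# Supplying data makes urllib treat the request as a POST.
req = Request("https://example.com/login", data=body)

print(req.get_method())
print(req.data)
```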

4 Content in Response

4.1 Status code

200: success; 3xx (such as 301 or 302): redirect; 404: page not found; 500 and above: server processing error
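A crawler usually branches on these ranges rather than on single values. The helper below is a hypothetical sketch; the "client error" and "informational" branches cover the standard 4xx and 1xx ranges, which are not listed above.

```python
def describe_status(code: int) -> str:
    """Map an HTTP status code to a coarse category."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if code == 404:
        return "page not found"
    if 400 <= code < 500:
        return "client error"
    if code >= 500:
        return "server error"
    return "informational"


for code in (200, 301, 404, 500):
    print(code, describe_status(code))
```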

4.2 Response headers

Such as the content type, content length, server information, cookies to set (Set-Cookie), etc.
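To make this concrete, the sketch below reads a few response headers with standard-library helpers. The header values are invented for the example; responses returned by `urllib.request.urlopen` expose their headers through a subclass of the same `Message` class, so the same method calls apply there.

```python
from email.message import Message
from http.cookies import SimpleCookie

# Made-up header values standing in for a real response.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
headers["Content-Length"] = "2381"
headers["Set-Cookie"] = "session_id=abc123; Path=/"

content_type = headers.get_content_type()    # media type without parameters
charset = headers.get_content_charset()      # encoding to decode the body with
length = int(headers["Content-Length"])      # body size in bytes

# Set-Cookie carries a cookie the client should send back on later requests.
cookie = SimpleCookie(headers["Set-Cookie"])
session = cookie["session_id"].value

print(content_type, charset, length, session)
```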

4.3 Response body

The most important part. It contains the content of the requested resource, such as the HTML of a web page, the binary data of an image, etc.

To sum up, the crawler first sends the request, then checks the status code of the response, then obtains the body, and finally parses the content of the body.
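The whole sequence might look like the sketch below, written with `urllib` from the standard library. The names `fetch` and `crawl`, the User-Agent string, and the choice to return `None` for any status other than 200 are assumptions made for illustration. Network failures (`urllib.error.URLError`) are not handled here.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch(url: str):
    """Send a GET request and return (status, content_type, body)."""
    req = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status, resp.headers.get_content_type(), resp.read()
    except HTTPError as err:
        # urllib raises for 4xx/5xx; report the status instead of crashing.
        return err.code, err.headers.get_content_type(), err.read()


def crawl(url: str, fetcher=fetch):
    """Send the request, check the status code, then parse the body."""
    status, content_type, body = fetcher(url)
    if status != 200:
        return None
    if content_type == "application/json":
        return json.loads(body.decode("utf-8"))
    if content_type.startswith("text/"):
        return body.decode("utf-8", errors="replace")
    # Binary data (images, video, ...) is returned unchanged.
    return body
```

Calling `crawl("https://www.baidu.com")` would perform a real request. The `fetcher` parameter exists so that the parsing logic can be tried out with canned responses instead of the network.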


