Python crawler - the impact of GET and POST on the crawled web page status and common web page status codes (200, 401, 402, 404, etc.)

     Main content: The impact of GET and POST on the crawled web page status and common web page status codes

Table of contents

Distinguish web page request GET or POST

Features of the get method

Features of the post method

Web page return status code

200

non 200


Distinguish web page request GET or POST

GET requests the specified page and returns its entity body. Any parameters travel in the URL.

POST submits data to the specified resource for processing (for example, submitting a form or uploading a file). The data is carried in the request body.
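To make the difference concrete, here is a minimal sketch using only the standard library's urllib. Nothing is actually sent over the network; the URL and the parameter names are made-up placeholders, not taken from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical parameters, used only for illustration.
params = {"q": "python", "page": "1"}

# GET: the parameters are appended to the URL as a query string; no body.
get_req = Request("https://example.com/search?" + urlencode(params))

# POST: the same parameters are encoded into the request body instead.
post_req = Request(
    "https://example.com/search",
    data=urlencode(params).encode("utf-8"),
)

# urllib picks the method from whether a body (data) is present.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With GET the parameters are visible in the URL, while with POST the URL stays clean and the parameters sit in the body.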

Features of the get method

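GET responses can be cached by browsers and proxies. GET is also safe (it should not change anything on the server) and idempotent (repeating the request has the same effect as sending it once).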

Features of the post method

For our crawlers, though, these concepts matter less than one practical point:

Depending on the request type, information is delivered to the server in a different way, cookies in particular.

Web page return status code

200

A web page that loads normally returns 200.

non 200

The server may also return a status code other than 200, so pay attention to it. A few non-200 statuses still deliver a usable page, but most of them signal an error or mean that your request was rejected.
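A crawler usually wants to check the status before parsing anything. The following is a rough sketch with the standard library; the helper names describe_status and fetch_status are invented for this example, and the grouping by hundreds follows the general HTTP convention rather than anything specific to this article.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_status(code: int) -> str:
    """Group a status code into its broad HTTP class."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the final status code for url (urlopen follows redirects)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still available.
        return err.code


# Classification alone needs no network access.
for code in (200, 302, 403, 503):
    print(code, describe_status(code))
```

Calling fetch_status on a real URL and passing the result to describe_status gives a quick way to decide whether the response body is worth parsing.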

Common web page status codes

Status code | Name | Accessible? | Meaning
200 | OK (success) | Yes | The server has successfully processed the request. Usually this means the server served the requested web page.
202 | Accepted | Yes | The server has accepted the request but has not yet processed it.
203 | Non-Authoritative Information | Yes | The server successfully processed the request, but the information returned may have come from another source.
300 | Multiple Choices | No | The server can respond to the request in several ways. It may choose one according to the requester (user agent) or return a list for the requester to choose from.
301 | Moved Permanently | No | The requested web page has been permanently moved to a new location. The response points to the new URL, and most clients follow it automatically for GET or HEAD requests.
302 | Found (temporarily moved) | No | The server is temporarily serving the page from a different location, but the requester should keep using the original URL for future requests.
400 | Bad Request | No | The server did not understand the syntax of the request.
401 | Unauthorized | No | The request requires authentication. The server might return this response for web pages that require a login.
403 | Forbidden | No | The server understood the request but refused to fulfil it.
404 | Not Found | No | The server could not find the requested web page.
406 | Not Acceptable | No | The server cannot respond with content that matches the characteristics the request asked for.
500 | Internal Server Error | No | The server encountered an error and could not complete the request.
502 | Bad Gateway | No | The server, acting as a gateway or proxy, received an invalid response from the upstream server.
503 | Service Unavailable | No | The server is currently unavailable (overloaded or down for maintenance). This is usually a temporary state.
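The official names of these codes do not have to be memorized, because Python's standard library ships them in http.HTTPStatus. A small sketch follows; the helper name status_phrase is made up for this example.

```python
from http import HTTPStatus


def status_phrase(code: int) -> str:
    """Return the standard reason phrase for a status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Codes that HTTPStatus does not define end up here.
        return "Unknown"


# The codes discussed in this article.
for code in (200, 202, 203, 300, 301, 302, 400, 401, 403, 404, 406, 500, 502, 503):
    print(code, status_phrase(code))
```

This is handy for log messages, where "403 Forbidden" is easier to read than a bare number.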

When you run into a status such as 401 or 403, go and check your headers or cookies.
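As a starting point for that check, the sketch below builds a request that carries a User-Agent and a Cookie header, again with urllib and without sending anything. The helper build_request, the URL, and the cookie values are all placeholders invented for illustration; real values would come from your own browser session.

```python
from urllib.request import Request


def build_request(url: str, cookies: dict, user_agent: str) -> Request:
    """Attach a User-Agent and a Cookie header to a request."""
    # Cookies are sent as a single header: "name1=value1; name2=value2".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"User-Agent": user_agent, "Cookie": cookie_header})


req = build_request(
    "https://example.com/profile",                 # placeholder URL
    {"sessionid": "abc123", "theme": "dark"},      # placeholder cookies
    "Mozilla/5.0 (X11; Linux x86_64)",
)

# urllib stores header names capitalized, so "User-Agent" becomes "User-agent".
print(req.get_header("Cookie"))
print(req.get_header("User-agent"))
```

If a page answers 401 or 403 without these headers and 200 with them, the missing header or cookie was the likely cause.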

For the role of headers and cookies in a request, how to set them, when cookies can be left out, and the difference between GET and POST requests, see the previous article:

Python crawler - how request headers and cookies work and how to use them (昊昊该干饭了's blog, CSDN): https://blog.csdn.net/qq_52213943/article/details/125148992

Original work takes effort; please credit the source when reposting.


Origin blog.csdn.net/qq_52213943/article/details/125571882