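A request to a web page can be divided into the following parts:
- Request URL: the address of the resource being requested
- Request headers: extra information sent along with the request
- Request body: the data carried by the request
- Request method: the action to perform, such as GET or POST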
The components of the request in more detail:
- Request URL: a URL, or uniform resource locator, identifies a specific resource on the server. In other words, it tells the server which piece of information you want to retrieve.
- Request headers: headers give the server additional information to use when handling the request. Some of the more important request headers are:
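  - Cookie: used to maintain login state. When you open a site such as Youku and find you are logged in without typing your account and password, that is the cookie at work.
  - User-Agent: identifies the client. Attaching it to your crawler lets the crawler pass itself off as a browser.
  - Content-Type: states the media type of the data in the request. Common values are text and JSON.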
The request headers are an important part of the request. Most crawlers need to attach them, although a few can get by without.
- Request method: only the two most practical methods are introduced here:
  - POST: mostly used to submit forms, which often contain sensitive, usually encrypted, information. It can also handle file uploads. You could call it a rather low-key big shot.
  - GET: in contrast to POST, everything a GET request sends shows up in the URL.
- Request body: generally speaking, this exists for POST requests. It holds the form data sent with the request. Only that low-key big shot gets this kind of treatment, haha.
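To make the four parts concrete, here is a minimal sketch using Python's standard `urllib`. It only builds the request objects and prints them, and nothing is sent over the network. The `example.com` URLs, the cookie value, and the form fields are made-up placeholders, not anything taken from a real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and values, used only for illustration.
BASE_URL = "https://example.com/search"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # lets the crawler look like a browser
    "Cookie": "sessionid=abc123",                     # made-up session cookie
}

# GET: the parameters are encoded into the URL itself, so there is no body.
get_request = Request(
    BASE_URL + "?" + urlencode({"q": "python", "page": 1}),
    headers=headers,
    method="GET",
)

# POST: the form data travels in the request body, not in the URL.
form_body = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")
post_request = Request(
    "https://example.com/login",
    data=form_body,
    headers={**headers, "Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# Show the four parts of each request: method, URL, headers, body.
for req in (get_request, post_request):
    print(req.get_method(), req.full_url)
    print("  headers:", dict(req.header_items()))
    print("  body:", req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it. Many crawlers use the third-party `requests` library instead, but the same four parts apply there.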
Server response
The server's response can be divided into three parts:
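- Response status code: some commonly used status codes are:
  - 100: Continue. The server has received the request and is waiting for the next wave of attack.
  - 200: the server has processed the request successfully.
  - 202: the server has accepted the request but has not processed it yet.
  - 204: the server has processed the request successfully but returned no content.
  - 301: the page has moved permanently.
  - 400: bad request. The server could not parse the request.
  - 401: unauthorized.
  - 403: access forbidden.
  - 404: page not found.
- Response headers: a few common ones are:
  - Content-Type: states the format of the returned content. With application/json the content is JSON; with text/html it is an HTML document.
  - Content-Encoding: states how the response content is encoded, for example gzip compression.
- Response body: this is the big brother. A crawler's job is to parse the response body, which is the data we get back after sending a request to the URL.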
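The three parts of a response can be pulled apart with a short sketch like the one below. To avoid touching the network, it parses a hand-written raw response with the standard library's `http.client.HTTPResponse`, fed through a small stand-in socket. `FakeSocket`, `parse_response`, `decode_body`, and the sample JSON are all illustrative names and data, not part of any real site or library API.

```python
import io
import json
from http import HTTPStatus
from http.client import HTTPResponse

# A hand-written response standing in for what a server might send back.
BODY = b'{"title": "hello", "views": 42}'
RAW_RESPONSE = b"\r\n".join([
    b"HTTP/1.1 200 OK",                                # status line
    b"Content-Type: application/json; charset=utf-8",   # response headers
    b"Content-Length: %d" % len(BODY),
    b"",                                               # blank line ends the headers
    BODY,                                              # response body
])


class FakeSocket:
    """Illustrative stand-in for a socket so HTTPResponse can read from bytes."""

    def __init__(self, data: bytes):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_response(raw: bytes):
    """Split a raw HTTP response into status code, headers, and body."""
    response = HTTPResponse(FakeSocket(raw))
    response.begin()  # reads the status line and the headers
    return response.status, dict(response.getheaders()), response.read()


def decode_body(content_type: str, body: bytes):
    """Decode the body according to its Content-Type (UTF-8 assumed)."""
    media_type = content_type.split(";")[0].strip().lower()
    text = body.decode("utf-8")
    return json.loads(text) if media_type == "application/json" else text


status, response_headers, body = parse_response(RAW_RESPONSE)
data = decode_body(response_headers["Content-Type"], body)

print(status, HTTPStatus(status).phrase)
print(response_headers)
print(data)
```

In a real crawler the same three values come straight from the object returned by `urllib.request.urlopen`, so the stand-in socket is only needed for offline experiments like this one.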