2 HTTP and HTTPS

HTTP stands for Hypertext Transfer Protocol. It is the protocol used to transfer hypertext data from the network to the local browser, and it ensures that hypertext documents are transferred efficiently and accurately. The HTTP specifications were developed jointly by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF); the version in wide use is HTTP 1.1.

HTTPS stands for Hypertext Transfer Protocol over Secure Socket Layer. It is an HTTP channel designed for security. Put simply, it is the secure version of HTTP: HTTP with an SSL layer added is called HTTPS.

HTTPS is built on SSL, so everything it transmits is encrypted by SSL. It serves two main purposes:

  1. It establishes a secure channel to protect data in transit.

  2. It confirms the authenticity of the website. For any site that uses HTTPS, you can click the lock icon in the browser address bar to view the site's verified identity, or look it up through the security seal issued by the CA.

More and more websites and apps are moving to HTTPS, for example:

  • Apple required all iOS apps to switch to HTTPS encryption by January 1, 2017; otherwise an app could not be listed in the App Store;

  • Starting with Chrome 56, released in January 2017, Google shows a risk warning for URLs that are not encrypted with HTTPS, telling the user in a prominent position in the address bar that the page is not secure;

  • Tencent's official documentation for WeChat Mini Programs requires the backend to use HTTPS for network requests; domain names and protocols that do not meet the requirements cannot be requested.

Some sites use HTTPS but are still flagged as unsafe by the browser. For example, opening 12306 (https://www.12306.cn/) in Chrome produces the warning "Your connection is not private". This is because the 12306 CA certificate was issued by the Chinese Ministry of Railways itself, and that certificate is not trusted by CA authorities, so certificate verification fails. The data is nevertheless still transmitted with SSL encryption. If you want to crawl such a site, you need to set the option to ignore certificate verification; otherwise you will get an SSL connection error.
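As a rough illustration of "ignoring the certificate", the sketch below uses only the Python standard library. The helper names `make_unverified_context` and `fetch_ignoring_certificate` are made up for this example, and skipping verification is only reasonable for sites you already know about, since it removes the protection against impersonation.

```python
import ssl
import urllib.request


def make_unverified_context() -> ssl.SSLContext:
    """Return an SSL context that skips certificate verification."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_ignoring_certificate(url: str, timeout: float = 10.0) -> bytes:
    """Fetch url over HTTPS without validating the server certificate."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

The traffic is still encrypted with such a context; only the identity check on the server is skipped.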

 

HTTP request process

 

 

Request

1. Request method

There are two common request methods: GET and POST. Entering a URL directly in the browser and pressing Enter initiates a GET request, and the request parameters are included directly in the URL. For example, searching for Python on Baidu is a GET request with the link https://www.baidu.com/s?wd=Python. The URL contains the request parameters; here the parameter wd is the keyword to search for.

POST requests are mostly initiated when a form is submitted. For example, on a login form, entering a user name and password and clicking the "Login" button usually initiates a POST request. The data is typically transmitted as form data and does not appear in the URL.

GET and POST differ in the following ways:

  • The parameters of a GET request are part of the URL, so the data is visible in the URL. The URL of a POST request does not contain the data; the data is transmitted as form data in the request body.

  • The amount of data a GET request can carry is bounded by the URL length limits that browsers and servers impose (1024 bytes is a commonly quoted figure), whereas POST has no such limit.

In general, logging in requires submitting a user name and password, which are sensitive. With a GET request the password would be exposed in the URL and could leak, so POST is the better choice here. File uploads also use POST, because file content is relatively large.
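The difference in where the data travels can be sketched with the standard library's `urllib`. The Baidu URL comes from the text above; the login URL and the form field names are hypothetical, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL query string.
get_url = "https://www.baidu.com/s?" + urlencode({"wd": "Python"})
get_request = Request(get_url)

# POST: the key/value data goes into the request body instead of the URL.
# The login URL and field names below are hypothetical.
form_data = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")
post_request = Request("https://example.com/login", data=form_data)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that `urllib` infers the method from whether `data` is present, so the second request becomes a POST without being told explicitly.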

 

Other request methods

GET  Requests a page and returns the page content
HEAD  Similar to GET, but the response has no body; used to obtain the headers
POST  Mostly used to submit forms or upload files; the data is contained in the request body
PUT  Sends data from the client to the server to replace the content of the specified document
DELETE  Asks the server to delete the specified page
CONNECT  Uses the server as a relay, letting the server access other pages on behalf of the client
OPTIONS  Lets the client query what the server supports
TRACE  Echoes back the request received by the server; mainly used for testing or diagnosis

 

2. Common request headers

Accept  A request header field that specifies which types of content the client can accept.

Accept-Language  Specifies the languages the client can accept.

Accept-Encoding  Specifies the content encodings the client can accept.

Host  Specifies the host IP and port number of the requested resource; its value is the location of the origin server or gateway in the request URL. Starting with HTTP 1.1, every request must include this header.

Cookie  Also commonly used in the plural, Cookies. This is data that a site stores locally on the user's machine to identify the user and track the session. Its main function is to maintain the current session. For example, after we log in to a site with a user name and password, the server saves the login state in a session. Every time we refresh or request another page of the site afterwards, we are still logged in, and that is thanks to Cookies. Cookies carry information that identifies our session on the server. Each time the browser requests a page of the site, it adds the Cookies to the request headers and sends them to the server. The server uses the Cookies to identify us, finds that the current state is logged in, and returns the page content that is only visible after login.

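Referer  Identifies the page from which the request was sent. The server can read this information and act on it, for example for traffic source statistics or hotlink protection.

User-Agent  UA for short. It is a special string header that lets the server identify the client's operating system and version, browser and version, and so on. Adding it when writing a crawler lets the crawler pass as a browser; without it, the request is likely to be identified as coming from a crawler.

Content-Type  Also called Internet Media Type or MIME type. In HTTP message headers it indicates the media type of the request. For example, text/html means HTML, image/gif means a GIF image, and application/json means JSON. More mappings can be found in this reference table:
http://tool.oschina.net/commons.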
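To make the headers above concrete, here is a minimal sketch that attaches them to a request object with `urllib`. The function name `build_browser_like_request` and all header values (the User-Agent string, the Referer, the session cookie) are placeholders chosen for illustration, not values any particular site expects.

```python
from urllib.request import Request


def build_browser_like_request(url: str) -> Request:
    """Build a request whose headers resemble those a browser would send."""
    headers = {
        # Placeholder UA string; a real crawler would copy one from a browser.
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",
        # Hypothetical session cookie obtained from an earlier login.
        "Cookie": "sessionid=abc123",
    }
    return Request(url, headers=headers)


request = build_browser_like_request("https://example.com/page")
# urllib stores header names via str.capitalize(), e.g. "User-agent".
print(request.header_items())
```

The request is only constructed here, not sent; passing it to `urllib.request.urlopen` would transmit these headers to the server.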

 

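3. Request body

The request body generally carries the form data of a POST request; for a GET request, the request body is empty.

The relationship between Content-Type and the way POST data is submitted:

application/x-www-form-urlencoded  Form data
multipart/form-data  Form file upload
application/json  Serialized JSON data
text/xml  XML data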
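The first and third rows of the table can be sketched as follows: the same payload is serialized once as form data and once as JSON, and the Content-Type header tells the server which one it is receiving. The endpoint URL and the payload fields are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

url = "https://example.com/api/users"  # hypothetical endpoint
payload = {"name": "alice", "age": 30}  # hypothetical data

# application/x-www-form-urlencoded: key=value pairs joined with "&".
form_body = urlencode(payload).encode("utf-8")
form_request = Request(
    url,
    data=form_body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# application/json: the payload serialized as a JSON document.
json_body = json.dumps(payload).encode("utf-8")
json_request = Request(
    url,
    data=json_body,
    headers={"Content-Type": "application/json"},
)

print(form_body)
print(json_body)
```

If the Content-Type does not match the body format, the server may fail to parse the submitted data, which is a common cause of POST requests that appear to be ignored.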

 

Response

The response is returned by the server to the client and has three parts: the response status code, the response headers, and the response body.

 

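1. Response status code

Status code classes:

1**  Informational: the server has received the request and the requester should continue

2**  Success: the request was successfully received and processed

3**  Redirection: further action is needed to complete the request

4**  Client error: the request contains a syntax error or cannot be fulfilled

5**  Server error: the server encountered an error while processing the request

Common response status codes:

200  The request succeeded

301  The resource (web page, etc.) has been permanently moved to another URL

404  The requested resource (web page, etc.) does not exist

500  Internal server error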
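The classification above follows the first digit of the code, which a crawler can use to decide how to react. A small sketch, using the standard library's `http.HTTPStatus` for the reason phrases; the function name `describe_status` and the class labels are choices made for this example.

```python
from http import HTTPStatus

# First digit of the status code mapped to its class.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code: int) -> str:
    """Return a readable description such as '404 Not Found (Client error)'."""
    category = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one that HTTPStatus defines.
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```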

 

 

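2. Common response headers

 

Date  The time at which the response was generated.
Last-Modified  The time the resource was last modified.
Content-Encoding  The encoding of the response content.
Server  Information about the server, such as its name and version number.
Content-Type  The document type, specifying what kind of data is returned. For example, text/html means an HTML document, application/x-javascript means a JavaScript file, and image/jpeg means an image.
Set-Cookie  Sets Cookies. The Set-Cookie response header tells the browser to store this content in its Cookies and send the Cookies with subsequent requests.
Expires  The expiration time of the response. It allows a proxy server or browser to cache the loaded content, so that a later visit can load it directly from the cache, reducing server load and shortening load time.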
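As an illustration of reading these headers, the sketch below parses a hand-written header block with the standard library. HTTP headers share the "Name: value" layout of email headers, so `email.message_from_string` is used here as a convenient parser; the header values themselves are invented sample data.

```python
from email import message_from_string
from http.cookies import SimpleCookie

# Invented sample response headers, terminated by a blank line.
RAW_HEADERS = "\n".join([
    "Date: Mon, 01 Jan 2024 00:00:00 GMT",
    "Server: nginx",
    "Content-Type: text/html; charset=utf-8",
    "Set-Cookie: sessionid=abc123; Path=/",
    "",
    "",
])

headers = message_from_string(RAW_HEADERS)

# Content-Type carries both the media type and, optionally, the charset.
content_type = headers.get_content_type()
charset = headers.get_content_charset()

# Set-Cookie holds the cookie the browser should send back on later requests.
cookie = SimpleCookie()
cookie.load(headers["Set-Cookie"])
session_id = cookie["sessionid"].value

print(headers["Server"], content_type, charset, session_id)
```

In a real crawler the HTTP client library usually exposes the parsed headers directly, so manual parsing like this is mainly useful for understanding what the fields contain.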

 

 

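3. Response body

The response body is the most important part. All of the response's payload is in the response body: when a web page is requested, the response body is the page's HTML code; when an image is requested, the response body is the image's binary data. When a crawler requests a web page, the response body is what it needs to parse.

Clicking Preview in the browser developer tools shows the source code of the page, that is, the content of the response body, which is the target of parsing.

When writing a crawler, we mainly use the response body to obtain the page source code, JSON data, and so on, and then extract the relevant content from it.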
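Extracting content from a response body might look like the following sketch, which uses only the standard library. The HTML and JSON strings stand in for bodies a crawler would have downloaded, and `TitleParser` and `extract_title` are names made up for this example; real crawlers often use a dedicated parsing library instead.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_body: str) -> str:
    """Return the page title found in an HTML response body."""
    parser = TitleParser()
    parser.feed(html_body)
    parser.close()
    return parser.title.strip()


# Stand-ins for response bodies returned by a server.
html_body = "<html><head><title>Demo page</title></head><body></body></html>"
json_body = '{"status": "ok", "items": [1, 2, 3]}'

print(extract_title(html_body))
print(json.loads(json_body)["items"])
```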


Origin www.cnblogs.com/shibojie/p/11403946.html