Crawler basics 2.1: HTTP fundamentals

Why do you need to understand how HTTP works to write a crawler?

A basic understanding of the HTTP request and response process makes it easier to follow what a crawler does at each step.

2.11 URI and URL

    URI: Uniform Resource Identifier; URLs and URNs are both kinds of URI

    URN: Uniform Resource Name, which identifies a resource by name

    URL: Uniform Resource Locator, which specifies the location used to access a resource, such as a web link

 

    Resource: a collective term for any content available on the web
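To make the parts of a URL concrete, here is a minimal sketch that splits one with Python's standard `urllib.parse` module. The URL itself is a made-up example, not one from this article.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL used only for illustration.
url = "https://www.example.com:8443/search/results?q=http&page=2#top"

parts = urlsplit(url)
print(parts.scheme)    # protocol used to reach the resource
print(parts.hostname)  # server that hosts the resource
print(parts.port)      # port number, if one is given explicitly
print(parts.path)      # location of the resource on that server
print(parts.query)     # request parameters carried in the URL
print(parts.fragment)  # position inside the page, handled by the browser

# The query string can be turned into a dict of lists of values.
query_params = parse_qs(parts.query)
print(query_params)
```

A crawler typically uses this kind of splitting to normalize links and to decide whether a link stays on the same site.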

2.12 Hypertext

    The HTML source of a web page can be regarded as hypertext.

Hypertext in a page can contain pictures, links, and even music, programs, and other non-text elements.

Hypertext is a networked technique for collecting, storing, and browsing discrete pieces of information and for establishing and presenting the relationships between them.

 

2.13 HTML

Why learn HTML?

Most of the information a crawler collects comes from the HTML content of pages, so a working knowledge of HTML is necessary.

HTML stands for HyperText Markup Language.

HTML is a language used to describe web pages.

HTML is not a programming language but a markup language.

A markup language is a set of markup tags.

HTML uses markup tags to describe web pages.

That is, each part of the page to be displayed is marked with tags. A page file is itself a text file; adding tags to it tells the browser how to display its contents (for example, how to handle text, how to arrange the layout, and how to display images).

The browser reads the file in order and interprets and displays the content according to the tags. It does not report markup errors and does not stop rendering when it encounters one.

What you need to know here is the structure of an HTML document and the characteristics of the common tags:

    These are covered in the chapter on web page basics
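As a rough illustration of how tags give structure to a text file, the sketch below feeds a small made-up page to Python's standard `html.parser` module and pulls out the title and the links. The class name `LinkCollector` and the sample page are assumptions for this example only; real crawlers often use a third-party parser instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag and the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up page. The <p> tag is never closed, and the parser,
# like a browser, carries on without reporting an error.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>Some text
    <a href="/page/1">First</a>
    <a href="https://www.example.com/page/2">Second</a>
  </body>
</html>
"""

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.title)
print(collector.links)
```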

 

2.14 HTTP and HTTPS

    HTTP: Hypertext Transfer Protocol, the protocol used to transfer pages from web servers over the network. The transfer is not encrypted, so it is not secure

    HTTPS: the secure version of HTTP, which encrypts the transfer (with SSL/TLS) so the page content cannot be read in clear text in transit

 

2.15 HTTP request process

    To get the content of a page, a user has to send a request to the server. A request carries many parameters and settings and must follow the syntax prescribed by the Hypertext Transfer Protocol, so writing one by hand is tedious. That is why we have browsers.

    The browser sits between us and the remote server: it sends requests to the server on our behalf and parses the data the server returns.

    How the browser sends a request and receives a response:

Each exchange consists of a request and a response. The browser parses the request and response contents in turn and stores them in its cache.

Each request and response can include headers, a body, parameters, cookies, and so on, but not every request or response contains all of them.

When configuring a crawler's requests, the headers matter most, that is, the request headers and the response headers.
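To show what a browser actually assembles, here is a minimal sketch that builds the raw text of an HTTP/1.1 request by hand: a request line, the headers, a blank line, and an optional body. The helper `build_request`, the host, and the header values are hypothetical, and nothing is sent over the network.

```python
def build_request(method, path, headers, body=b""):
    """Assemble the raw bytes of an HTTP/1.1 request message."""
    lines = [f"{method} {path} HTTP/1.1"]                  # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"                  # blank line ends the headers
    return head.encode("ascii") + body


raw = build_request(
    "GET",
    "/index.html",
    {
        "Host": "www.example.com",         # placeholder host
        "User-Agent": "demo-crawler/0.1",  # placeholder client name
        "Accept": "text/html",
    },
)
print(raw.decode("ascii"))
```

HTTP libraries and browsers do this assembly for you; writing it out once makes it clearer which parts a crawler is configuring.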

Request process

Request method:

    The request is constructed by the browser or another client and consists of request headers and a request body

    Request headers: carry the important parameters of the request; a crawler needs to construct them

    Request body: usually the form data of a POST request; for a GET request the body is empty

 

    Common methods:

            GET: the request parameters are included directly in the URL of the requested page, and the server returns the response

            POST: the request parameters are sent in the request body as form data; mostly used to submit forms or upload files

    Other methods:

            HEAD: like GET, but only the response headers are returned

            PUT: replaces the content of the specified document with data sent from the client

            DELETE: asks the server to delete the specified page

            CONNECT: uses the server as a proxy (tunnel) that accesses other pages on behalf of the client

            OPTIONS: lets the client query what the server supports

            TRACE: echoes back the request the server received; used for testing or diagnostics
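The difference between GET and POST can be seen without sending anything, by building request objects with the standard `urllib.request` module. This is only a sketch; the URL and parameters are made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://www.example.com/search"  # hypothetical endpoint
params = {"q": "http", "page": 1}

# GET: the parameters travel in the URL and there is no body.
get_req = Request(base_url + "?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_req = Request(base_url, data=urlencode(params).encode("ascii"))

# Any other method, such as HEAD, has to be named explicitly.
head_req = Request(base_url, method="HEAD")

for req in (get_req, post_req, head_req):
    print(req.get_method(), req.full_url, req.data)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it; that step is left out here so the example runs offline.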

 

    Request URL:

            The URL of the page being requested

 

 

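    Request headers:

The request headers are the parameters the browser constructs when it sends a request to the server. The main ones are Cookie, Referer, and User-Agent. A crawler needs to set these important request parameters; configuring them well keeps the crawler working over the long term without being identified and blocked by the server, which guards against crawlers because large numbers of frequent requests consume server resources.

        Commonly used request headers:

                Accept: a request header field that specifies which types of content the client can accept

                Accept-Language: specifies the languages the client can accept

                Accept-Encoding: specifies the content encodings the client can accept

                Host: specifies the host (IP) and port number of the requested resource, that is, the location of the origin server or gateway in the request URL. Since HTTP/1.1, every request must include this header

                Cookie: often written in the plural, Cookies. This is data a website stores on the user's machine to identify the user and track the session. Its main job is to maintain the current session. For example, after we log in to a site with a username and password, the server keeps the login state in a session, and every time we refresh or request another page on the site we are still logged in. That is the work of cookies: they carry information that identifies our session on the server. Each time the browser requests a page from the site it adds the cookies to the request headers, and the server uses them to recognize us and to determine that we are logged in, so it returns the content that is only visible after login

                Referer: identifies the page the request was sent from. The server can use this information for things like traffic-source statistics and hotlink protection

                User-Agent: UA for short. A special string header that lets the server identify the client's operating system and version, browser and version, and so on. Adding it to a crawler's requests lets the crawler pass as a browser; without it, the requests are likely to be identified as coming from a crawler

                Content-Type: also called the Internet media type or MIME type. In HTTP message headers it indicates the media type of the content. For example, text/html means HTML and image/gif means a GIF image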
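As a sketch of how a crawler sets these headers, the example below attaches them to a `urllib.request.Request` object and prints what would be sent. All values, including the User-Agent string and the cookie, are placeholders; no request is actually made.

```python
from urllib.request import Request

# Placeholder values; a real crawler would copy these from a browser session.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) demo-crawler/0.1",
    "Referer": "https://www.example.com/",
    "Accept-Language": "en-US,en;q=0.9",
    "Cookie": "sessionid=abc123",
}
req = Request("https://www.example.com/profile", headers=headers)

# urllib stores header names with only the first letter capitalized,
# so "User-Agent" is looked up as "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Third-party HTTP libraries accept a similar dictionary of headers, so the same idea carries over if you use one of them.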


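        Request body:

The request body usually carries the form data of a POST request; for a GET request the body is empty. It is typically used when submitting a login form.

Before logging in to a site we fill in a username and password, and on submission this content is sent to the server as form data. Note that the request headers must then set Content-Type to application/x-www-form-urlencoded; only with that Content-Type is the body submitted as form data. Alternatively, set Content-Type to application/json to submit JSON data, or to multipart/form-data to upload files.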
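The two most common body encodings can be compared side by side. This is a minimal sketch using only the standard library; the field names and credentials are invented for the example.

```python
import json
from urllib.parse import urlencode

form = {"username": "alice", "password": "s3cret"}  # made-up login data

# Form submission: key=value pairs joined with "&".
form_body = urlencode(form).encode("utf-8")
form_headers = {"Content-Type": "application/x-www-form-urlencoded"}

# JSON submission: the same data serialized as a JSON document.
json_body = json.dumps(form).encode("utf-8")
json_headers = {"Content-Type": "application/json"}

print(form_headers["Content-Type"], form_body)
print(json_headers["Content-Type"], json_body)
```

The Content-Type header has to match the encoding of the body, because the server uses it to decide how to parse what it receives.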


 

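Response:

The response is what the server sends back to the browser or other client after receiving a request. It consists of a status code, response headers, and a response body.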

 

Status code:

        The status code indicates the status of the server's response.
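The notes above do not list individual codes, so as a hedged illustration here are a few well-known ones looked up through Python's standard `http.HTTPStatus` enum. The grouping by first digit follows the HTTP specification; the helper name `describe` is just for this example.

```python
from http import HTTPStatus


def describe(code):
    """Return 'code phrase (class)' for a numeric HTTP status code."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase} ({classes[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe(code))
```

A crawler usually checks the status code first and only parses the body when the code signals success.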

            

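        Response headers:

The response headers carry the server's information about its reply: who the server is, what software it runs, how the returned content is encoded, how the connection is kept alive, the format of the returned content, when the response was generated, what cookies it wants set, and so on. The main ones include Content-Type, Server, and Set-Cookie.

 

Date: the time the response was generated

Last-Modified: the time the resource was last modified

Content-Encoding: the encoding of the response content

Server: information about the server, such as its name and version number

Content-Type: the document type, that is, the type of the returned data

    text/html means an HTML document

    application/x-javascript means a JavaScript file

    image/jpeg means an image

Set-Cookie: sets cookies. Set-Cookie in the response headers tells the browser to store this content in its cookies and to send the cookies with subsequent requests

Expires: the expiry time of the response. It lets a proxy server or browser store the loaded content in its cache, so that a repeat visit can be served directly from the cache, reducing server load and shortening load time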
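To show how these headers are read in practice, the sketch below parses a made-up block of response headers. HTTP headers share their `Name: value` format with email headers, so the standard `email` module can parse them, and `http.cookies.SimpleCookie` handles the Set-Cookie value. The header values are invented for the example.

```python
from email import message_from_string
from http.cookies import SimpleCookie

# Made-up response headers, as they might appear after the status line.
raw_headers = (
    "Date: Mon, 10 Jun 2019 08:00:00 GMT\n"
    "Server: nginx\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Encoding: gzip\n"
    "Set-Cookie: sessionid=abc123; Path=/; HttpOnly\n"
    "\n"
)

headers = message_from_string(raw_headers)
print(headers["Server"])              # header lookup is case-insensitive
print(headers.get_content_type())     # media type without parameters
print(headers.get_content_charset())  # charset parameter of Content-Type

# Parse the cookie the server asks the client to store.
cookie = SimpleCookie()
cookie.load(headers["Set-Cookie"])
print(cookie["sessionid"].value, cookie["sessionid"]["path"])
```

HTTP client libraries normally expose the parsed headers directly, so this manual parsing is only meant to show what the fields contain.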

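        Response body:

The response body is the most important part for a crawler: most of what a crawler collects is parsed out of the response body, which holds the actual payload of the response.

 

Requesting a web page: the response body is HTML

Requesting an image: the response body is the image's binary data

Requesting music or video: the response body is the binary stream of the audio or video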
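The way a crawler treats the body depends on the Content-Type and Content-Encoding headers. Here is a minimal sketch, assuming gzip is the only compression in use and that text responses declare their charset; the function name `decode_body` and the sample data are hypothetical.

```python
import gzip


def decode_body(body, content_type, content_encoding=None, charset="utf-8"):
    """Return text for text/* responses and raw bytes for everything else."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)   # undo the transfer compression first
    if content_type.startswith("text/"):
        return body.decode(charset)    # HTML and other text become str
    return body                        # images, audio, video stay as bytes


# A compressed HTML page, built locally instead of downloaded.
html_bytes = gzip.compress("<html><body>Hello</body></html>".encode("utf-8"))
page = decode_body(html_bytes, "text/html", content_encoding="gzip")

# The 8-byte PNG signature stands in for real image data.
image = decode_body(b"\x89PNG\r\n\x1a\n", "image/png")

print(page)
print(type(image), len(image))
```

Text bodies go on to an HTML parser, while binary bodies are usually written straight to a file.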

 



Origin www.cnblogs.com/binyang/p/10990589.html