[Series] 2. Python Web Crawler Development: Network Fundamentals

Section II: Network Fundamentals for Web Crawler Development
2.1 HTTP and HTTPS
1) HTTP
HTTP is a request/response standard between a client and a server, and it generally runs over TCP. Using a web browser, a web crawler, or another tool, the client sends an HTTP request to a specified port on the server (80 by default). We call this client the user agent. The responding server stores resources such as HTML files and images. We call this responding server the origin server. Between the user agent and the origin server there may be several intermediate layers, such as proxy servers, gateways, or tunnels.
The original seven-layer network model: OSI

Simplified into four layers, this becomes the TCP/IP model: application layer, transport layer, network layer, and network interface layer.

Application Layer
Applications are the software programs people use to communicate over the network.
Some network-aware applications are end-user programs, such as the apps you run,
i.e. programs that implement an application layer protocol and communicate directly with the lower layers of the protocol stack.
E-mail clients and web browsers belong to this type of application.
Network layer (IP): run ipconfig in cmd to see your own IP address
At the network layer, the data is encapsulated in an IP packet. A header is added in front of your data stating the source IP (your computer's local address, for example 192.168.0.1) and the destination IP (for example one of Baidu's IP addresses). The packet, with your request inside it (the GET request is wrapped in this packet), is then sent toward the target server.
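Tunnel: for example a VPN.
Gateway: re-encodes and decodes traffic, acting as a general translator
(analogous to translating an English website into Chinese before passing it back).
Proxy: a proxy server that forwards requests between the client and the origin server.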
TCP is a connection-oriented, stream-based protocol. It is like a phone call (full duplex): both sides can talk to each other,
and I know that you picked up the phone and can hear me.

UDP is connectionless and does not care whether you received my message.
It is similar to e-mail:
you know your mail was sent, but you cannot be sure the other party received it.
By comparison, UDP is simpler than TCP. (HTTP uses TCP.)
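
The phone-call versus e-mail difference can be tried on your own machine. The following is only a minimal sketch using Python's standard socket module on the loopback address 127.0.0.1; the function names tcp_round_trip and udp_one_way are made up for this illustration.

```python
import socket


def tcp_round_trip(message: bytes) -> bytes:
    """Send a message over a loopback TCP connection and return the reply."""
    # Port 0 asks the operating system for any free port.
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
            # A connection is established before any data flows.
            conn, _addr = server.accept()
            with conn:
                client.sendall(message)
                received = conn.recv(1024)
                # Full duplex: the answer travels back on the same connection.
                conn.sendall(received.upper())
                return client.recv(1024)


def udp_one_way(message: bytes) -> bytes:
    """Send one datagram over loopback UDP and return what the receiver got."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(2)
        # No handshake: the sender fires the datagram and is never told
        # whether it arrived.
        sender.sendto(message, receiver.getsockname())
        data, _addr = receiver.recvfrom(1024)
        return data


if __name__ == "__main__":
    print(tcp_round_trip(b"hello"))
    print(udp_one_way(b"ping"))
```

On loopback the datagram practically always arrives, but nothing in the UDP code would notice if it did not; the TCP code, by contrast, cannot send anything until the connection exists.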

2) HTTP/1.1 request methods (key point)
Request type: meaning
GET: sends a "display" request for the specified resource.
The GET method should only be used to read data (for example, to request the HTML source of a page).
When you type a URL in the browser and press Enter, you are sending a GET request;
you see a page because the browser renders the response it gets back.
HEAD: a simplified version of GET. Like GET, it requests the specified resource from the server,
but the server does not return the body of the resource, only the response headers.
Because it returns little data, it is mostly used in server development and testing,
especially when the front end and back end interact heavily through a RESTful interface, the convention by which the front end and back end talk to each other.
(What we usually call front-end/back-end separation is built on RESTful interfaces.)
POST: submits data to the specified resource and asks the server to process it (such as submitting a form or uploading a file).
For instance, logging in is a POST: your login request submits data.
PUT: note the difference from POST. PUT uploads new content to the specified resource location and is used to update it.
Not all programmers follow REST. Some take the easy way and send every request with GET and every form with POST. That works, but try to follow the convention.
DELETE: asks the server to delete the resource identified by the request URL.
TRACE: works somewhat like ping; the server echoes back the request it received.
Mainly used for testing or diagnostics, for example to check that the server can be reached.
OPTIONS: asks the server to return all HTTP request methods that the resource supports.
It typically returns GET, HEAD, POST, DELETE, PUT,
that is, the methods you can use to access the site.
Note: some older sites may not answer this correctly.
CONNECT: reserved in HTTP/1.1 for proxy servers that can switch the connection into a tunnel.
PATCH: lighter than PUT; applies a partial modification to a resource, such as changing the watermark on a picture.
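
To see the difference between GET and HEAD without touching a real website, here is a small sketch that starts a throwaway server on 127.0.0.1 with Python's standard http.server and queries it with http.client. The handler class DemoHandler, the helper fetch, and the page content are assumptions made up for this demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny handler that answers GET and HEAD for any path."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        # GET: headers plus the resource body.
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        # HEAD: same headers, but no body is written.
        self._send_headers()

    def log_message(self, format, *args):
        # Keep the demo quiet instead of logging every request to stderr.
        pass


def fetch(method: str):
    """Return (status, Content-Length header, body) for one request."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request(method, "/")
        response = conn.getresponse()
        result = (response.status, response.getheader("Content-Length"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("GET ", fetch("GET"))
    print("HEAD", fetch("HEAD"))
```

Both requests report the same status and Content-Length, but only GET comes back with a non-empty body.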


General knowledge about how a site is reached:
Open cmd, type nslookup, and then enter Baidu's URL.
The IP addresses listed in the output all belong to Baidu; entering one of them in the browser can also take you to Baidu.

Generally, when your browser sends a GET request, your computer first does the equivalent of an nslookup on the address: it asks DNS for the IP address behind the domain name and then visits that IP address.

Servers are reached by IP address, not by domain name. DNS is therefore used to resolve the domain name to its IP address before the request is made. In essence, a domain name after DNS resolution is an IP address plus a port (80 for HTTP).

Non-authoritative answer: this means the answer came from your DNS resolver's cache rather than directly from the authoritative name server for the domain. It does not by itself indicate tampering, although plain DNS answers are not authenticated, so an attacker on the path could in principle forge them.
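
The same lookup can be done from Python. This is a minimal sketch built on socket.gethostbyname from the standard library; the helper names resolve and http_endpoint are invented here, and the real lookup at the bottom needs network access and a working DNS resolver.

```python
import socket


def resolve(hostname: str) -> str:
    """Return one IPv4 address for the host name, like a DNS lookup would."""
    # If an IPv4 address is passed in, it is returned unchanged.
    return socket.gethostbyname(hostname)


def http_endpoint(hostname: str, port: int = 80) -> tuple:
    """Return the (IP address, port) pair that a plain HTTP request targets."""
    return (resolve(hostname), port)


if __name__ == "__main__":
    # Requires network access; the printed address varies by location and time.
    print(http_endpoint("www.baidu.com"))
```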

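3) HTTPS

Hypertext Transfer Protocol Secure (HTTPS) is a protocol for secure communication over a computer network. HTTPS communicates via HTTP but uses SSL/TLS to encrypt the packets. The main purpose of HTTPS is to authenticate the website's server and to protect the privacy and integrity of the exchanged data. HTTPS was first proposed by Netscape in 1994 and then spread across the Internet.
HTTP is used to transfer information between web browsers and web servers. It sends content in plaintext and provides no data encryption. If an attacker intercepts the messages between the browser and the server, the information can be read directly. HTTP is therefore not suitable for sensitive information such as credit card numbers, passwords, and other payment details.
To address this weakness, another protocol is needed: HTTPS, that is, HTTP over the Secure Sockets Layer. For secure data transfer, HTTPS adds the SSL protocol on top of HTTP. SSL relies on certificates to verify the server's identity and encrypts the communication between the browser and the server.

A quick way to tell them apart: if the browser marks the page as "Not secure", it was transferred over HTTP; otherwise it was transferred over HTTPS.

The main differences between HTTPS and HTTP are:
1) Cost: HTTPS requires a certificate issued by a CA. Commercial certificates cost money, although free certificates (for example from Let's Encrypt) are now available.
2) Security: HTTP is the Hypertext Transfer Protocol and sends information in plaintext, while HTTPS is a secure protocol encrypted with SSL. An HTTP connection is simple and stateless. HTTPS is built from SSL plus HTTP, supports encrypted transfer and identity authentication, and is more secure than HTTP.
3) Ports: HTTP and HTTPS use completely different connection methods and different ports: 80 for the former and 443 for the latter (:80 / :443).
4) Resource consumption: HTTPS requires extra computation for encryption and therefore uses more CPU than HTTP.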
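
The port rule from point 3 is easy to check in code. The sketch below uses urllib.parse.urlsplit from the standard library; the helper names effective_port and is_encrypted and the DEFAULT_PORTS table are assumptions made up for this illustration.

```python
from urllib.parse import urlsplit

# Default ports named in the comparison above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> int:
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port  # an explicit port in the URL wins
    return DEFAULT_PORTS[parts.scheme]


def is_encrypted(url: str) -> bool:
    """Return True when the URL uses HTTPS."""
    return urlsplit(url).scheme == "https"


if __name__ == "__main__":
    for url in ("http://www.baidu.com/", "https://www.2345.com/"):
        print(url, effective_port(url), is_encrypted(url))
```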

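2.2 Request headers (Headers)
HTTP header fields are the header section of request and response messages in the Hypertext Transfer Protocol (HTTP). They define the operating parameters of an HTTP transaction.
HTTP header fields can be defined freely as needed, so non-standard header fields may be found on web servers and in browsers.
How to view request headers:

Right-click in the browser and choose Inspect (Inspect Element). The developer tools panel opens (this is also one of the most common operations in later crawler work).
Select Network in the menu bar, click a request, and select Headers.

Find Request Headers in that panel.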

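Request headers:
Accept: the content types the client will accept
Accept-Encoding: the content encodings (compression methods) the client can accept
Accept-Language: the languages the client can accept
Cache-Control: whether the client uses a cache; using the cache saves time on a second load
Connection: whether the client wants the connection kept open; keep-alive requests a persistent connection, so later requests reuse the same TCP connection instead of opening a new one each time
Cookie: data stored locally on the client that the server can use to identify it
Host: the host the client is requesting
User-Agent: what kind of client (browser, device) is making the request
DNT: Do Not Track; tells the site whether the user allows their visits to be tracked and analyzed. 1 means the user does not want to be tracked
Upgrade-Insecure-Requests: the client prefers an encrypted (HTTPS) response
Pragma: an HTTP/1.0 header kept for backward compatibility with caches that only support HTTP/1.0
The following fields are not request headers; the browser's developer tools show them in the General block:
Request URL: the URL of the page we requested
Request Method: the request method used for the page
Status Code: the response status code
Remote Address: the IP address and port of the server we connected to
Referrer Policy: controls what is sent in the Referer header; the value here (no-referrer-when-downgrade) means the Referer header is not sent when the request is downgraded from HTTPS to HTTP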
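
A crawler usually sets some of these request headers itself so that it looks like a normal browser. The following is a rough sketch using urllib.request.Request from the standard library; the helper name build_request and the header values are placeholders chosen for this example, not values any particular site requires.

```python
from urllib.request import Request


def build_request(url: str) -> Request:
    """Build (but do not send) a GET request carrying browser-like headers."""
    headers = {
        # Example values only; copy real ones from your own browser if needed.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    request = build_request("https://www.2345.com/")
    # urllib stores header names capitalized, e.g. "User-agent".
    for name, value in request.header_items():
        print(f"{name}: {value}")
    # urllib.request.urlopen(request, timeout=10) would actually send it.
```

With the requests module introduced later in this section, the same dictionary can be passed as requests.get(url, headers=headers).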

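Response headers:
Cache-Control: the caching policy set by the server; the value here means proxy servers may not cache the response, only the user's browser may
Connection: whether the connection is closed after the current transaction
Content-Encoding: how the content is encoded (compressed)
Content-Type: the type of the returned data
Expires: the response is considered stale after this date
Server: information about the software the server uses to handle requests
Set-Cookie: cookies the server sets on the client
Strict-Transport-Security: for the stated period, the browser should make all requests to this site over HTTPS
Transfer-Encoding: the data is sent in chunks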

Extended content:


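  1. HTTP request flow
    The 7 steps of an HTTP request
    1. Establish a TCP connection (after the URL is entered; three-way handshake)
    2. The browser sends the request line
    3. The browser sends the request headers
    4. The server sends the response status line
    5. The server sends the response headers
    6. The server sends the data (HTML, etc.)
    7. The server closes the TCP connection (four-way teardown)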
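
Steps 2 to 6 above are plain text on the wire. The sketch below builds a request by hand and parses a canned response, without opening any connection; the function names build_get_request and parse_response and the sample response bytes are made up for this illustration.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # step 2: request line
        f"Host: {host}",         # step 3: request headers
        "Connection: close",
        "",                      # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    parts = status_line.split(" ", 2)  # step 4: e.g. "HTTP/1.1 200 OK"
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in header_lines:          # step 5: response headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(parts[1]), reason, headers, body  # step 6: body is the data


if __name__ == "__main__":
    print(build_get_request("www.baidu.com"))
    sample = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
              b"Content-Length: 5\r\n\r\nhello")
    print(parse_response(sample))
```

Libraries such as requests do exactly this bookkeeping for you, which is why crawler code rarely handles raw bytes.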

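Among the Request Headers we captured above,
there is a line like this:

Accept: image/webp,image/apng,image/*,*/*;q=0.8

If you add such a line to your own request headers, the server will try to return only the kinds of response data you asked for.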
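
The q=0.8 at the end of that line is a preference weight between 0 and 1; entries without q count as 1. As a small sketch (the function name parse_accept is invented for this example), the header can be split into types and weights like this:

```python
def parse_accept(header: str) -> list:
    """Return (media type, q) pairs, most preferred first."""
    entries = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        quality = 1.0  # the default weight when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        entries.append((media_type, quality))
    # sorted() is stable, so equal weights keep their original order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    for media_type, quality in parse_accept("image/webp,image/apng,image/*,*/*;q=0.8"):
        print(media_type, quality)
```

For the line above, the three image types come first with weight 1.0, and */* (anything else) comes last with weight 0.8.
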
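2) Essential status codes

Status code: meaning
2xx: codes starting with 2 basically mean everything is fine and the data can be returned normally.
201: Created; the request succeeded and a new resource was created.
202: Accepted; the server has accepted the request but has not finished processing it.
204: No Content; the request succeeded but there is nothing to return.
206: Partial Content; only part of the resource is returned (commonly seen with images).
Sometimes when we visit a site, some images look blurry at first.
The page then sends a follow-up request in the background that is answered with 206,
fetching the remaining data and returning the sharp image.
3xx: generally a redirect (the address you visited sends you on to another address).
301: Moved Permanently
302: Moved Temporarily (Found)
4xx: something went wrong on the client side.
401: Unauthorized.
The page requires you to log in; if you access it directly, you get a 401 page.
403: Forbidden; the server refuses access, for example because your IP has been blocked.
404: Not Found; you may have written the route (URL) wrong.
405: Method Not Allowed; for example, the page only accepts GET
but you requested it with POST, which it does not support.
It is a hint to check whether you wrote the request wrong.
408: Request Timeout (it may be the server's problem or your own).
Generally speaking, the problem is more often on your side.
5xx: server error, for example the site has crashed, a gateway failed, or the service is not deployed.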
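
Python's standard library already knows the official names of these codes. The following sketch uses http.HTTPStatus to turn a number into a readable description; the helper name describe and the CLASSES table are assumptions made for this example.

```python
from http import HTTPStatus

# First digit of the status code -> general category.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return a description such as '404 Not Found (client error)'."""
    category = CLASSES.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered standard codes.
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"


if __name__ == "__main__":
    for code in (200, 206, 301, 403, 404, 405, 500):
        print(describe(code))
```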

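The status code can also be seen in Headers, under Status Code.

Of course, you can also check the status code in code. Here we first introduce the commonly used requests module.

Installation:

Windows: pip install requests
Mac: pip3 install requests

If the installation fails or is too slow, use:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests

Code to check the status code: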

import requests

# Send a GET request and print the HTTP status code of the response
response = requests.get("https://www.2345.com/", timeout=10)
print(response.status_code)

This likewise prints: 200
In everyday use, of course, we usually apply it like this:

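import requests

response = requests.get("https://www.2345.com/", timeout=10)
if response.status_code == 200:
    # Success: continue processing the page here
    print("1")
else:
    # Any other status code: handle the failure here
    print("2")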

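2.3 Cookies
Cookie can be translated as "small dessert" or "little biscuit". Cookies are pieces of data stored on your computer. Because HTTP is stateless, every request and response is separate from the others, but sometimes we need to keep the user's state. For example, to stay logged in on a site, a cookie records your information, and on the next request the site reads your local cookie to verify your identity.
Cookies exist as key-value pairs, that is, key=value.
Login pages often have an option such as "Remember me"; ticking it means enabling cookies.
Drawbacks:
1. Cookies are attached to every HTTP request, so they quietly increase traffic.
2. Cookies in HTTP requests are sent in plaintext, so security is a concern unless HTTPS is used.
3. Cookie size is limited to about 4 KB, which is not enough for complex needs.
For these reasons, websites today mostly keep state in a Session instead of relying on cookies alone; this will be covered later.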
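
Since cookies are just key=value pairs joined into one header line, they are easy to take apart and put back together. Here is a minimal sketch with http.cookies.SimpleCookie from the standard library; the helper names and the cookie values (sessionid, theme) are invented for this example.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header: str) -> dict:
    """Turn a Cookie header value into a plain dict of key -> value."""
    cookie = SimpleCookie()
    cookie.load(header)
    return {name: morsel.value for name, morsel in cookie.items()}


def build_cookie_header(cookies: dict) -> str:
    """Turn a dict back into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


if __name__ == "__main__":
    cookies = parse_cookie_header("sessionid=abc123; theme=dark")
    print(cookies)
    print(build_cookie_header(cookies))
```

A crawler that copies the Cookie line from the browser's Request Headers can send it back in the same way to stay logged in.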

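2.4 HTML, CSS, JavaScript
The web page trio
For crawlers, CSS hardly matters, and JS is only needed when cracking difficult front-end encryption. If you want to become a very advanced crawler engineer, however, you do need to master JS.
HTML: HyperText Markup Language, the language that describes web pages. It is not a programming language.
CSS: Cascading Style Sheets, used to style the page.
JavaScript: a web scripting language, used to interact with the user.
Together: HTML is like the skeleton, and CSS is like the clothes put on the HTML.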
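
Because a crawler mostly cares about the HTML skeleton, a first exercise is to pull data out of tags while ignoring the styling. This is only a sketch using html.parser.HTMLParser from the standard library; the class LinkCollector, the helper extract_links, and the sample page are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    """Return all link targets found in the HTML text, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = (
        "<html><head><style>a { color: red; }</style></head>"
        '<body><a href="/page1">One</a><p>text</p>'
        '<a href="/page2">Two</a></body></html>'
    )
    print(extract_links(page))
```

The CSS rule in the style block changes how the links look in a browser but has no effect on what the crawler extracts.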

2.5 JSON
JSON is a lightweight data interchange format, commonly used to build website APIs.
JSON syntax:
data is written as key-value pairs
data items are separated by commas
curly braces hold objects
square brackets hold arrays
{ "name": "python" } is a JSON object
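
In Python, JSON text maps directly onto dicts and lists through the standard json module. A short sketch follows; the helper names parse_record and to_json and the sample record are assumptions for this example.

```python
import json


def parse_record(text: str) -> dict:
    """Parse JSON text into Python objects (object -> dict, array -> list)."""
    return json.loads(text)


def to_json(record: dict) -> str:
    """Serialize a Python dict back to JSON text, keeping non-ASCII readable."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    record = parse_record('{ "name": "python", "tags": ["crawler", "http"] }')
    print(record["name"])     # values are reached by key
    print(record["tags"][0])  # arrays become lists
    print(to_json(record))
```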

2.6 Ajax
Ajax is not a separate language but a way of using JS to make asynchronous requests. We all know asynchronous programming: Python has it, and JavaScript certainly has it too. In JavaScript, asynchronous network programming is done with Ajax.

For example, open the Baidu Images page: even though you never refresh the page, pictures keep loading as you scroll, with no apparent end.
Ajax means asynchronous network requests made from JS. Normally, when we submit a form, the browser reloads the page once the user clicks submit. With Ajax the page is not reloaded: the user stays on the current page while a new request is sent in the background, and after the data arrives JS updates the page. The user feels they never left the page, yet the data keeps refreshing, just as the pictures on Baidu Images keep appearing.
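
For a crawler, this means the data often comes from a background endpoint that returns JSON in batches, not from the page's HTML. The sketch below imitates that pattern entirely in memory: fake_ajax_endpoint is a hypothetical stand-in for a real endpoint, and the field names data and has_more are assumptions, since every site defines its own.

```python
import json

# Pretend server-side data; a real site would hold this behind an Ajax URL.
FAKE_IMAGES = [f"img_{i}.jpg" for i in range(7)]


def fake_ajax_endpoint(offset: int, limit: int) -> str:
    """Hypothetical endpoint: return one batch of results as JSON text."""
    batch = FAKE_IMAGES[offset:offset + limit]
    has_more = offset + limit < len(FAKE_IMAGES)
    return json.dumps({"data": batch, "has_more": has_more})


def crawl_all(limit: int = 3) -> list:
    """Request batch after batch, like scrolling, until nothing is left."""
    results, offset = [], 0
    while True:
        payload = json.loads(fake_ajax_endpoint(offset, limit))
        results.extend(payload["data"])
        if not payload["has_more"]:
            return results
        offset += limit


if __name__ == "__main__":
    print(crawl_all())
```

In a real crawler the call to fake_ajax_endpoint would be replaced by an HTTP request to the URL found under Network in the developer tools.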


Origin blog.csdn.net/AI_LINNGLONG/article/details/104451825