1. Introduction to web crawlers
What is a crawler?

Put simply: we often want to save some useful data we come across on the Internet for our own use. For example:

- When browsing some excellent pictures that get my blood pumping, I want to save them as future desktop wallpapers
- When browsing some important data (from all walks of life), I want to keep it to add luster to my future sales work
- When browsing some strange and exciting videos, I want to save them to my hard disk to savor later
- When browsing some excellent songs, I want to save them to brighten up a boring life

Note that what a crawler fetches must be visible, publicly available content.
The crawler's spear and shield

- Anti-crawling mechanisms: portal websites can prevent crawlers from scraping their data by deploying corresponding strategies or technical means (e.g. encryption).
- Anti-anti-crawling strategies (JS reversing): the crawler program can crack the portal website's anti-crawling mechanisms through its own strategies or technical means, and thereby obtain the relevant data.
robots.txt protocol

Commonly known as the "gentleman's agreement" (Baidu Spider, for example, obeys it). It specifies which data on a website may be crawled and which may not. Just append robots.txt to a website's root URL to view it.
This is a gentleman's agreement on Taobao
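The standard library can parse these rules for you. A minimal offline sketch, feeding in hypothetical robots.txt rules instead of fetching a live file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch
# e.g. https://www.taobao.com/robots.txt and feed it in the same way
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("my-crawler", "https://example.com/index.html"))  # allowed
print(rp.can_fetch("my-crawler", "https://example.com/private/x"))   # forbidden
```

Checking `can_fetch` before requesting a URL is how a well-behaved crawler keeps the gentleman's agreement.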
2. Web request process (Baidu as an example)
When you visit Baidu, the browser sends a request to Baidu's server (one of Baidu's computers). The server receives the request, loads some data, and returns it to the browser, which then displays it. Note that what Baidu's server returns to the browser is not the rendered page itself but the page source code (composed of HTML, CSS, and JS). The browser executes the page source code and then displays the result to the user. That is why we can view Baidu's source code (that pile of unintelligible gibberish). The specific process is shown in the figure.
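The round trip above can be sketched with a tiny local server standing in for Baidu's machine (the page content and address are stand-ins for illustration; a real visit goes over the Internet):

```python
import http.server
import threading
import urllib.request

# What the stand-in server will return: page *source code*, not a rendered page
PAGE = b"<html><body><h1>hello</h1></body></html>"

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # the server receives the request, loads some data...
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # ...and returns the source code to the browser
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# the "browser" (our script) sends a request and receives the source code back
url = "http://127.0.0.1:%d/" % server.server_address[1]
with urllib.request.urlopen(url) as resp:
    source = resp.read()

print(source.decode())
server.shutdown()
```

It is then the browser's job to execute that source and paint the result on screen.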
2.1 Page rendering
Note that not all data lives in the page source code; some of it is loaded dynamically through JS and similar tools. This is the page rendering process. There are two common kinds of page rendering: one is called server-side rendering, the other front-end JS rendering.
1. Server rendering -> the data can be found directly in the page source code
When we send a request, the server writes all the data directly into the HTML, so our browser receives HTML that already carries the data.
Since the data is written directly into the HTML, everything we can see can be found in the page source code. Pages like this are usually relatively easy to crawl.
2. Front-end JS rendering -> data cannot be found in the page source code
This one is a little more troublesome. Under this mechanism, the first request to the server generally returns only a bare HTML skeleton. The page then makes further requests to the server that actually stores the data; that server returns the data, and finally the data is loaded into the page in the browser. Like this:
When is the data loaded? In fact, as we scroll down the page, jd.com is quietly loading data. To watch the whole loading process, we need the browser's debugging tool (F12).
So when you cannot find the data shown on the page in the page source code, look for the relevant data packets in the debugging tool (F12); the data is hidden inside them. (Remember to open the F12 debugging tool first and then refresh the page; otherwise the data will already have been loaded and the packets will appear empty.)
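Once you have found the data packet in F12, its body is usually JSON that can be parsed directly. A sketch with a hypothetical packet body (a real crawler would first request the packet's URL and read the response):

```python
import json

# Hypothetical JSON body of a data packet found under F12 -> Network
packet_body = '''
{
  "page": 1,
  "products": [
    {"name": "laptop", "price": 4999},
    {"name": "mouse",  "price": 99}
  ]
}
'''

# structured data is far easier to work with than scraped HTML
data = json.loads(packet_body)
names = [p["name"] for p in data["products"]]
print(names)
```

This is why front-end JS rendering, once you find the real request, is often *more* convenient to crawl than HTML.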
3. The use of browser tools (emphasis)
The browser is the most intuitive place to see the state of a webpage and the content it loads. Press F12 to open a set of tools that ordinary users rarely touch. Among them, the four most important panels are Elements, Console, Sources, and Network.
1. Elements
Elements shows us the real-time state of the webpage content, which is sometimes not the page source code: content rendered by front-end JS is also presented here. So Elements is the real-time content of the page.
Note: the page source code is the most original content the server returned to us, before any JS scripts ran or the user did anything. What you see in Elements is the page as displayed after JS scripts and user actions.
You can think of one as the exam paper before the teacher grades it and the other as the paper after grading. They are the same paper, but the content differs. And what we can actually fetch is the page source code, i.e. what the paper looked like before grading. Pay special attention to this point.
In Elements we can use the small arrow in the top-left corner to see intuitively where each block sits in the browser and the current HTML that corresponds to it. Quite considerate.
2. Console
The second panel, Console, is used to view printed output and log messages left by the programmers. We can also type some JS code here and have it executed.
3. Sources
The third panel, Sources, shows everything that was loaded when the webpage opened, including the page source code, scripts, styles, images, and so on.
4. Network
The fourth panel, Network, is what we usually call the packet-capture tool. The front-end JS rendering just mentioned is observed here: we can see every network request the current page makes, along with the details of each request. This is extremely important for our crawlers.
4. Hypertext Transfer Protocol
A protocol is a "gentleman's agreement" established between two computers so that they can communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, SMTP, and so on. Different protocols transmit data in different formats.
The HTTP protocol, short for Hyper Text Transfer Protocol, is the transfer protocol used to carry hypertext from World Wide Web (WWW) servers to local browsers. To put it bluntly, the data exchange between the browser and the server follows the HTTP protocol.
The most common use of the HTTP protocol is loading webpages.
The HTTP protocol divides a message into three major blocks of content, whether it is a request or a response.
Request:
Request line    -> request method (GET/POST), request URL, protocol
Request headers -> extra information the server will use (cookie validation, tokens, all kinds of anti-crawling information)
Request body    -> usually the request parameters
Response:
Status line      -> protocol, status code
Response headers -> extra information the client will use (cookie validation, tokens, all kinds of anti-crawling information)
Response body    -> the content the client actually uses, returned by the server (HTML, JSON, etc.), i.e. the page code
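The three blocks are easy to see if you split a raw message by hand. A sketch using a hand-written response (the header values are illustrative):

```python
# A hand-written raw HTTP response, to be split into its three blocks
raw = (
    "HTTP/1.1 200 OK\r\n"            # status line: protocol + status code
    "Content-Type: application/json\r\n"
    "Set-Cookie: session=abc123\r\n"
    "\r\n"                           # a blank line separates headers from body
    '{"ok": true}'                   # response body
)

head, _, body = raw.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)

print(status_line)
print(headers)
print(body)
```

A request has the same shape, with a request line ("GET /path HTTP/1.1") in place of the status line.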
When writing crawlers, pay special attention to the request headers and response headers: these two places usually hide some fairly important content.
Note that the browser actually reorganizes the content of HTTP requests and responses and displays them in a form that is easier for us to read.
The most common important fields in the request headers (the ones crawlers need):

- User-Agent: the identity of the carrier of the request (what was used to send it)
- Referer: anti-hotlinking (which page this request came from; used by anti-crawling checks)
- Cookie: local string data (user login information, anti-crawling tokens)
- Content-Type: the type of the data being transferred (str, JSON, etc.)
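Setting these fields from a crawler looks like this; the URL and header values below are hypothetical placeholders:

```python
import urllib.request

# Hypothetical target URL and header values, for illustration only
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/list",
    "Cookie": "session=abc123",
}
req = urllib.request.Request("https://example.com/detail", headers=headers)

# note: urllib stores header names with only the first letter capitalized,
# so "User-Agent" is looked up as "User-agent"
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
print(req.get_header("Cookie"))
```

Passing the request to `urllib.request.urlopen(req)` would then send it with these headers attached.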
Some important fields in the response headers:

- Cookie: local string data (user login information, anti-crawling tokens)
- All kinds of mysterious strings (recognizing these takes experience; they usually involve the word "token" and guard against various attacks and crawlers)
The https protocol

Almost the same as the http protocol; the "s" stands for secure, so it is the secure Hypertext Transfer Protocol (data is encrypted).
Encryption methods (three kinds)
1. Symmetric key encryption
The client first encrypts the data it is about to send to the server; the encryption method is chosen by the client itself. Once encryption is done, the ciphertext is sent to the server together with the decryption method (the key). After the server receives the key and the encrypted ciphertext, it uses the key to decrypt the ciphertext, and so the server obtains the original data.

Disadvantage:
- While the key and the ciphertext are in transit, a third party may well intercept them, so there is a risk of data exposure.
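A toy sketch of the idea using a simple XOR cipher (not a real cipher; it only illustrates that one shared key both encrypts and decrypts):

```python
def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: the very same key encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"secret"                          # chosen by the client
ciphertext = xor_crypt(b"hello server", key)
plaintext = xor_crypt(ciphertext, key)   # server applies the same key
print(plaintext)
```

The weakness described above is visible here: the key has to travel to the server alongside the ciphertext, and anyone who intercepts both can decrypt.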
2. Asymmetric key encryption
To address the security risk of symmetric encryption, an improved scheme is used. It involves two keys: one is called the private key and the other the public key. With asymmetric encryption, the server first gives the client its public key; the client encrypts its data with that public key; the server receives the ciphertext and decrypts it with its own private key. The advantage is that the decryption key is never transmitted at all, which avoids the risk of interception. Even if an eavesdropper obtains the public key, decryption is still hard, because it amounts to computing a discrete logarithm, which is not easy to do.

Shortcomings:
- 1. Efficiency is relatively low: the processing is more complex, and the extra work during communication affects transmission speed.
- 2. As long as a key is being sent, there is still a risk of hijacking: if an intermediary tampers with the public key sent to the client, the client cannot be sure that the public key it received was really created by the server.
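A textbook RSA sketch with deliberately tiny primes (illustration only; real keys are enormous): the client encrypts with the public key, and only the server's never-transmitted private key can decrypt:

```python
# Textbook RSA with toy primes (never use numbers this small in practice)
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent; (n, e) is the public key
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent, kept on the server

message = 65                         # a message encoded as a number < n
ciphertext = pow(message, e, n)      # client encrypts with the public key
recovered = pow(ciphertext, d, n)    # server decrypts with its private key
print(message, ciphertext, recovered)
```

Note that `d` is computed from `p` and `q`, which never leave the server; an eavesdropper who sees only `(n, e)` and the ciphertext cannot easily recover the message.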
3. Certificate key encryption
Asymmetric key encryption has the defect that we cannot guarantee the public key the client obtains was really created by the server, so certificate key encryption is introduced.
The developer of the server takes the public key and applies to a digital certificate authority (CA) for certification of that key. The CA verifies the identity of the applicant and, once the review passes, puts a digital signature on the public key the developer applied with. It then issues the signed public key by placing it inside a certificate, binding the two together.
The server sends this digital certificate to the client. Because the client also trusts the certificate authority, it uses the digital signature inside the certificate to verify that the public key is genuine, which guarantees that the public key handed over by the server is authentic. In general, the digital signature on a certificate is very hard to forge; this rests on the credibility of the certification body. Once the information is confirmed to be correct, the client encrypts its messages with the public key and sends them, and the server decrypts them with its own private key after receiving them.
5. Summary

- If what you want is in the page source code, just extract the data straight from the source
- If what you want is not in the page source code, you need to find the real request that loads the data, then extract the data from it
- Observe the gentleman's agreement (robots.txt)
- Common request header fields: User-Agent, Referer, Cookie, Content-Type