Crawler Learning (01): Understanding Crawlers & Hypertext Transfer Protocol

1. Introduction to crawlers

What is a crawler?

Put simply, we often want to save important data we find on the Internet for our own use.
For example:

  • When I see some stunning pictures while browsing, I want to save them to use as desktop wallpapers later
  • When I come across some important data (from all walks of life), I want to keep it to support my future sales work
  • When I find some strange and exciting videos, I want to save them to my hard drive to enjoy later
  • When I hear some excellent songs, I want to save them to add a little color to a dull life

It should be noted that what you crawl must be content you can actually see, i.e. things the website has made public.

The spear and shield of crawling

Anti-crawling mechanisms:

  • Portal websites can prevent crawlers from fetching their data by putting corresponding policies or technical measures (such as encryption) in place.

Anti-anti-crawling strategies (JS reverse engineering):

  • A crawler program can still obtain the relevant data from a portal website by devising policies or technical measures that break through that website's anti-crawling mechanisms.
robots.txt protocol:

Commonly known as the "gentleman's agreement". It specifies which data on a website may be crawled by spiders (such as Baiduspider) and which may not. To view it, just append robots.txt to the site's address.

Taobao's robots.txt is one example of such a gentleman's agreement.
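
As a rough illustration, Python's standard library can read a site's robots.txt and tell you whether a given path is allowed for a crawler. A minimal sketch (what it answers depends entirely on the site's current rules):

```python
from urllib.robotparser import RobotFileParser

# Every site serves its gentleman's agreement at /robots.txt by convention
rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch a given URL;
# the answer depends on the rules the site currently publishes
print(rp.can_fetch("*", "https://www.baidu.com/"))
```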

2. Web request process (Baidu as an example)

When you visit Baidu, the browser sends the request to Baidu's server (one of Baidu's computers); the server receives the request, loads some data, and returns it to the browser, which then displays it. Note that what Baidu's server returns to the browser is not the finished page, but the page source code (made up of HTML, CSS, and JS). The browser executes that page source code and then shows the result to the user. That is why we can view Baidu's source code (that pile of seemingly unintelligible stuff).
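
In a crawler we play the browser's role ourselves. A minimal sketch using the third-party requests library (assuming it is installed) fetches exactly that page source code:

```python
import requests

# Send the same kind of GET request a browser would send
resp = requests.get("https://www.baidu.com")
resp.encoding = "utf-8"   # make sure the Chinese text is decoded correctly

# resp.text is the page source code (HTML/CSS/JS), not the rendered page
print(resp.text[:200])
```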

2.1 Page rendering

It should be noted here that not all data lives in the page source code; some of it is loaded dynamically through JS and similar tools. This is the page rendering process. There are two common kinds of page rendering: one is called server-side rendering, the other front-end JS rendering.

1. Server-side rendering -> the data can be found directly in the page source code

When we request the server, the server writes all the data into the HTML before sending it, so our browser receives HTML that already carries the data.
Since the data is written directly into the HTML, everything we can see can be found in the page source code.
Web pages of this kind are usually relatively easy to crawl.
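
For a server-rendered page, the crawl is simply "fetch the source, then extract". A minimal sketch (the URL and the regular expression are only placeholders for illustration):

```python
import re
import requests

resp = requests.get("https://example.com/some-server-rendered-page")  # placeholder URL
html = resp.text

# Because the data is already in the HTML, a parser or even a regex can pull it out.
# Here we grab the <title> text purely as a demonstration.
titles = re.findall(r"<title>(.*?)</title>", html, re.S)
print(titles)
```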

2. Front-end JS rendering -> data cannot be found in the page source code

This one is a bit more troublesome. With this mechanism, the first request to the server generally returns only an HTML skeleton. The page then sends further requests to the server that actually stores the data; that server returns the data, and finally the browser loads the data into the page.
So when is the data loaded? In fact, as we scroll down the page, JD.com is quietly loading data in the background. If we want to watch the whole process of the page loading, we need the browser's debugging tools (F12).
So when you cannot find the data shown on the page in the page source code, open the debugging tools (F12) and look through the data packets listed on the left; the data is hidden inside one of them. (Remember to open the F12 tools first and then refresh the page; otherwise the data will already have been loaded and the packets will appear empty.)
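
Once the Network panel reveals which request actually carries the data, the crawler can call that request directly instead of the HTML page. A rough sketch (the URL and parameter names below are made up for illustration; the real ones come from what you see in F12):

```python
import requests

# Hypothetical data interface discovered in the F12 Network panel
api_url = "https://example.com/api/products"   # placeholder, not a real endpoint
params = {"page": 1, "pageSize": 20}           # placeholder query parameters

resp = requests.get(api_url, params=params)

# Interfaces like this usually return JSON rather than HTML
data = resp.json()
print(data)
```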

3. Using the browser's developer tools (important)

The browser is the most intuitive place to see the state of a web page and the content it loads. Pressing F12 opens a set of tools that ordinary users rarely use. Among them, the first four panels are the most important: Elements, Console, Sources, Network.

1. Elements

Elements shows the real-time state of the page content, and it is not necessarily the page source code: content rendered by front-end JS also appears here. So Elements is the real-time content of the web page.

Note:

  1. The page source code is the most original content returned to us by the server, before any JS scripts run and before the user does anything.
  2. What you see in Elements is how the page looks after the JS scripts have run and after the user's interactions.

You can think of one as the exam paper before the teacher marks it and the other as the paper after marking. They are the same paper, but the content differs. And what our crawler can get directly is the page source code, i.e. the paper before marking. Pay special attention to this.

In Elements, we can use the small arrow in the top-left corner to see intuitively which part of the current HTML each block in the browser corresponds to. Quite considerate.

2. Console

The second panel, Console, is used to view printed output and log messages left behind by the programmers. We can also type some JS code here and have it executed.

3. Sources

The third panel, Sources, shows everything that was loaded when the page was opened, including the page source code, scripts, styles, images, and so on.

4. Network

The fourth panel, Network, is what we usually call the packet-capture tool. The front-end JS rendering just mentioned is observed here: we can see every network request the current page makes and the details of each request. This is very important for our crawlers.

4. Hypertext Transfer Protocol

Protocol: a gentleman's agreement set up between two computers so that communication goes smoothly. Common protocols include TCP/IP, SOAP, HTTP, SMTP, and so on.

Different protocols transmit data in different formats.

The HTTP protocol, short for Hyper Text Transfer Protocol, is the transfer protocol used to deliver hypertext from World Wide Web (WWW) servers to local browsers. Put bluntly, the data exchanged between browser and server follows the HTTP protocol.

The most common use of the HTTP protocol is loading web pages.

Whether it is a request or a response, the HTTP protocol divides a message into three main blocks of content.

Request:

Request line    -> request method (GET/POST), request URL, protocol version
Request headers -> additional information the server will use (cookie verification, tokens, all kinds of anti-crawling information)

Request body    -> usually the request parameters
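
As a rough mapping onto Python's requests library (a sketch, not the only way to send a request): the method and URL make up the request line, the headers dict becomes the request headers, and data/json becomes the request body. The URL and field names below are placeholders.

```python
import requests

resp = requests.post(
    "https://example.com/login",                       # request line: POST + URL (placeholder)
    headers={"User-Agent": "Mozilla/5.0",
             "Referer": "https://example.com/"},        # request headers
    data={"username": "demo", "password": "demo"},      # request body: the request parameters
)
```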

Response:

Status line      -> protocol, status code
Response headers -> additional information the client will use (cookie verification, tokens, all kinds of anti-crawling information)

Response body    -> the content the client actually needs (HTML, JSON, etc.), such as the page code
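
Continuing the sketch, the same three blocks can be read straight off a response object:

```python
import requests

resp = requests.get("https://www.baidu.com")

print(resp.status_code)    # from the status line, e.g. 200
print(resp.headers)        # the response headers (Content-Type, Set-Cookie, ...)
print(resp.text[:200])     # the response body: the content the client actually uses
```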

When writing crawlers, pay special attention to the request headers and response headers; these two places usually hide some rather important content.

Note that the browser actually reorganizes the content of HTTP requests and responses and displays it in a form that is easier for us to read.


Some of the most common and important fields in the request headers (needed by crawlers):

  1. User-Agent: the identity of whatever is sending the request (what was the request sent with)
  2. Referer: anti-hotlinking (which page did this request come from? often used for anti-crawling)
  3. Cookie: local string data (user login information, anti-crawling tokens)
  4. Content-Type: the data type of the message body being transmitted (str, json, etc.)
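
In practice these fields are just entries in the headers dictionary sent with each request. A minimal sketch (the values and URL are placeholders; real values are copied from the F12 Network panel):

```python
import requests

headers = {
    # Pretend to be a normal browser instead of the default python-requests identity
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # Claim the request came from a page on the same site (anti-hotlinking checks)
    "Referer": "https://example.com/list",
    # Login state / anti-crawling token copied from the browser
    "Cookie": "session=xxxxxx",
}

resp = requests.get("https://example.com/data", headers=headers)  # placeholder URL
print(resp.status_code)
```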

Some important content in the response headers:

  1. Cookie: local string data (user login information, anti-crawling tokens)
  2. All kinds of mysterious strings (recognizing them takes experience; they usually involve the word token and exist to prevent attacks and crawling)

https protocol

It is almost the same as the http protocol; the s stands for secure, so it is the secure hypertext transfer protocol. (It adds data encryption.)

Encryption methods (three kinds)

1. Symmetric key encryption

The client first encrypts the data it is about to send to the server; the encryption method is chosen by the client itself. After encryption, the ciphertext is sent to the server together with the decryption method (the key). Once the server receives the key and the encrypted ciphertext, it uses the key to decrypt the ciphertext and finally obtains the original data.

Disadvantages:
  • While the key and the ciphertext are in transit, a third party may well intercept them, so there is a risk of the data being exposed.
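
As a toy illustration of the idea (not of how HTTPS actually negotiates keys), the third-party cryptography package offers a simple symmetric scheme: the same key both encrypts and decrypts, so whoever holds the key can read the data.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()        # the shared secret; if intercepted, all data is exposed
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"secret request data")
print(cipher.decrypt(ciphertext))  # anyone holding `key` can do this, which is the weakness
```
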
2. Asymmetric key encryption

To address the security risk of symmetric encryption, an improved encryption method is used. There are two keys: one is called the private key and the other the public key. With asymmetric encryption, the server first sends the public key to the client; the client encrypts its data with that public key and sends the ciphertext back; the server then decrypts it with its own private key. The advantage is that the decryption key is never transmitted at all, which avoids the risk of it being hijacked. Even if an eavesdropper obtains the public key, decryption is still hard, because it amounts to computing a discrete logarithm, which is not feasible.

Shortcomings:
  • 1. It is relatively inefficient: the processing is more complex, which affects communication speed to some extent.
  • 2. As long as a key has to be sent, there is still a risk of hijacking. If a man in the middle tampers with the public key and sends its own to the client, the client cannot be sure that the public key it received was really created by the server.

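
A minimal sketch of the same idea with RSA from the cryptography package (again an illustration of the principle, not of the real TLS handshake): data encrypted with the public key can only be decrypted with the private key, which never leaves the server.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The server generates the key pair and only ever shares the public half
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# The client encrypts with the public key ...
ciphertext = public_key.encrypt(b"secret request data", oaep)
# ... and only the holder of the private key can decrypt
print(private_key.decrypt(ciphertext, oaep))
```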

3. Certificate key encryption

To address the defect of asymmetric key encryption, namely that the client cannot be sure the public key it received was really created by the server, certificate key encryption is introduced.
The server's developer takes the public key and applies to a digital certificate authority for certification. The authority verifies the applicant's identity and, once the review passes, digitally signs the public key the developer applied with, then places the signed public key in a certificate, binding the two together.
The server sends this digital certificate to the client. Because the client also trusts the certificate authority, it uses the digital signature in the certificate to verify that the public key is genuine, which guarantees that the public key passed by the server is authentic. In general, the certificate's digital signature is very hard to forge; this rests on the credibility of the certification body. Once the information is confirmed to be correct, the client encrypts its message with the public key and sends it, and the server decrypts it with its own private key after receiving it.

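From a crawler's point of view you rarely implement any of this yourself: requests verifies the server's certificate chain against trusted certificate authorities by default, and the standard ssl module can fetch the certificate a server presents if you want to look at it. A small sketch:

```python
import ssl
import requests

# requests checks the server certificate against trusted CAs by default (verify=True)
resp = requests.get("https://www.baidu.com", verify=True)
print(resp.status_code)

# Fetch the PEM-encoded certificate the server presents, just to inspect it
pem_cert = ssl.get_server_certificate(("www.baidu.com", 443))
print(pem_cert[:100])
```
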
5. Summary

  1. If what you want is in the page source code, just extract the data from the source code directly.
  2. If what you want is not in the page source code, you need to find the real request that loads the data, then extract the data from it.
  3. Respect the gentleman's agreement (robots.txt).
  4. Common request header fields: User-Agent, Referer, Cookie, Content-Type.
