Sesame HTTP: The Fundamentals of Proxying

We often run into the following situation when writing crawlers. At first the crawler runs normally and fetches data without trouble. Everything looks fine, but within the time it takes to drink a cup of tea, errors such as 403 Forbidden may start to appear. If you open the web page at that point, you may see a message like "Your IP access frequency is too high". The cause is that the website has taken anti-crawling measures. For example, the server counts the number of requests an IP makes per unit of time; if the count exceeds a threshold, the server refuses service and returns an error. This is known as IP blocking.
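To make the server-side check concrete, here is a minimal sketch of per-IP request counting over a time window. The class name, the threshold of 3 requests per 60 seconds, and the IP addresses are all hypothetical; real sites use their own rules, which are usually not published.

```python
from collections import defaultdict, deque


class IPRateLimiter:
    """Reject an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def status_for(self, ip, now):
        """Return the HTTP status the server would send for this request."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 403  # threshold exceeded: refuse service
        hits.append(now)
        return 200


# Hypothetical limit: 3 requests per 60 seconds for each IP.
limiter = IPRateLimiter(max_requests=3, window_seconds=60)
statuses = [limiter.status_for("203.0.113.7", now=t) for t in (0, 1, 2, 3)]
print(statuses)  # [200, 200, 200, 403]
```

Because the counter is kept per IP, a request arriving from a different address starts with a clean record, which is exactly what a proxy exploits.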

Since the server counts the requests each IP makes per unit of time, can we avoid being blocked by disguising our IP, so that the server cannot tell that the requests come from our local machine?

An effective way is to use a proxy, whose usage will be explained in detail later. Before that, you need to understand the basic principle of a proxy: how does it implement IP masquerading?

1. Basic principles

"Proxy" here means a proxy server. Its function is to obtain network information on behalf of network users; figuratively speaking, it is a transfer station for network information. When we request a website normally, we send the request to the web server, and the web server sends the response back to us. When a proxy server is configured, it acts as a bridge between the local machine and the web server. The local machine no longer sends its request directly to the web server. Instead, it sends the request to the proxy server, which forwards it to the web server and then relays the web server's response back to the local machine. We can still access the web page normally, but the IP that the web server sees is no longer the IP of our local machine. IP masquerading is thus achieved, and this is the basic principle of the proxy.
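On the client side, this usually amounts to telling the HTTP library which proxy to send requests to. The following is a minimal sketch using Python's standard urllib; the proxy address and the helper name `build_proxied_opener` are hypothetical, and the real request is left commented out because it needs a running proxy.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you actually control.
PROXY = "http://127.0.0.1:8080"


def build_proxied_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener(PROXY)

# Uncomment to send a real request through the proxy (needs a running proxy).
# The URL is only an example of a page that reports the caller's IP.
# with opener.open("https://httpbin.org/get", timeout=10) as response:
#     print(response.read().decode())
```

If the proxy works, a page that echoes the caller's address should report the proxy's IP rather than the local machine's.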

2. The role of the proxy

So, what can a proxy be used for? Its main uses are listed below.

  • Bypass access restrictions on your own IP: visit sites that you cannot normally reach.
  • Access the internal resources of certain organizations or groups: for example, a free proxy server located in the address range of an education network can be used to reach the FTP upload and download services, as well as the data query and sharing services, that are open only to that network.
  • Improve access speed: a proxy server usually maintains a large disk buffer. When information passes through from the outside, the proxy also saves a copy in the buffer. When other users request the same information, it is served directly from the buffer, which improves access speed.
  • Hide the real IP: Internet users can hide their IP in this way to avoid attacks. For crawlers, we use proxies to hide our own IP and prevent it from being blocked.
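The caching behaviour mentioned in the list above can be illustrated with a toy model. This is only a sketch: the class `CachingProxy` and the stand-in `fake_origin` function are hypothetical, and a real caching proxy also has to handle expiry and cache-control rules, which are omitted here.

```python
class CachingProxy:
    """Toy model of a proxy that answers repeat requests from its buffer."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable that retrieves a URL from the origin
        self._cache = {}           # url -> stored response body
        self.origin_requests = 0   # how often the origin server was contacted

    def get(self, url):
        if url not in self._cache:
            self.origin_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_origin(url):
    # Stand-in for a slow request to the real web server.
    return f"<html>content of {url}</html>"


proxy = CachingProxy(fake_origin)
first = proxy.get("http://example.com/")
second = proxy.get("http://example.com/")  # served from the buffer
print(proxy.origin_requests)  # 1
```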

3. Proxies for crawlers

A crawler often requests pages so quickly that the same IP ends up accessing the site too frequently. The website will then ask us to enter a verification code to log in, or block the IP outright, which makes crawling very inconvenient.

A proxy hides the real IP, so the server believes that the proxy server itself is making the requests. By constantly switching proxies during crawling, we are far less likely to be blocked and can crawl effectively.
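One simple way to switch proxies is to cycle through a list in round-robin order. The sketch below only plans which proxy each request would use and sends nothing; the proxy addresses, the URLs, and the helper name `proxy_rotation` are hypothetical.

```python
import itertools

# Hypothetical proxy addresses (documentation-only IP range).
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:3128",
]


def proxy_rotation(proxies):
    """Yield a proxies mapping for each request, cycling through the list."""
    for proxy in itertools.cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXIES)
urls = [f"http://example.com/page/{i}" for i in range(5)]

# Pair every URL with the proxy that would be used to fetch it.
plan = [(url, next(rotation)["http"]) for url in urls]
for url, proxy in plan:
    print(f"{url} via {proxy}")
```

Each yielded mapping has the shape that `urllib.request.ProxyHandler` accepts, so it can be passed straight to an opener. Random selection or removing proxies that fail are common variations on the same idea.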

4. Proxy classification

Proxies can be classified either by protocol or by degree of anonymity.

(1) By protocol

By the protocol they handle, proxies can be divided into the following categories.

  • FTP proxy server: mainly used to access FTP servers. It generally provides upload, download, and caching functions, and the port is generally 21 or 2121.
  • HTTP proxy server: mainly used to access web pages. It generally provides content filtering and caching functions, and the port is generally 80, 8080, or 3128.
  • SSL/TLS proxy: mainly used to access encrypted websites. It generally provides SSL or TLS encryption (128-bit or stronger), and the port is generally 443.
  • RTSP proxy: mainly used to access Real streaming media servers. It generally provides a caching function, and the port is generally 554.
  • Telnet proxy: mainly used for Telnet remote control (often used by attackers to hide their identity when breaking into a computer). The port is generally 23.
  • POP3/SMTP proxy: mainly used for receiving and sending email over POP3/SMTP. It generally provides a caching function, and the ports are generally 110 and 25 respectively.
  • SOCKS proxy: simply relays data packets without regard to the specific application protocol or usage, so it typically adds less overhead and is faster. The port is generally 1080. The SOCKS protocol comes in two versions, SOCKS4 and SOCKS5. The former supports only TCP, while the latter supports both TCP and UDP, as well as various authentication mechanisms and server-side domain name resolution. Simply put, SOCKS5 can do everything that SOCKS4 can do, but not the other way around.
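In client code, the protocol of a proxy is commonly expressed as the scheme of a proxy URL, for example `http://` or `socks5://`. The sketch below builds such URLs using the typical ports from the list above; the helper `proxy_url`, the host address, and the choice of 8080 as the default HTTP proxy port are assumptions for illustration, and whether a given client library accepts SOCKS schemes depends on that library.

```python
from urllib.parse import urlsplit

# Typical ports taken from the list above; 8080 is one of several HTTP options.
DEFAULT_PORTS = {"http": 8080, "socks4": 1080, "socks5": 1080}


def proxy_url(scheme, host, port=None):
    """Build a proxy URL such as 'socks5://192.0.2.10:1080'."""
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    return f"{scheme}://{host}:{port or DEFAULT_PORTS[scheme]}"


# Hypothetical proxy host (documentation-only IP range).
http_proxy = proxy_url("http", "192.0.2.10")
socks_proxy = proxy_url("socks5", "192.0.2.10")
print(http_proxy)                  # http://192.0.2.10:8080
print(socks_proxy)                 # socks5://192.0.2.10:1080
print(urlsplit(socks_proxy).port)  # 1080
```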

(2) By degree of anonymity

By degree of anonymity, proxies can be divided into the following categories.

  • Highly anonymous proxy: forwards the data packets unchanged, so to the server it looks as if an ordinary client is accessing it directly, and the IP it records is that of the proxy server.
  • Ordinary anonymous proxy: makes some changes to the data packets. The server may discover that a proxy is in use, and there is some chance that the client's real IP can be traced. The HTTP headers that the proxy server usually adds are HTTP_VIA and HTTP_X_FORWARDED_FOR.
  • Transparent proxy: not only changes the data packets but also tells the server the client's real IP. Apart from using caching to speed up browsing and content filtering to improve security, this kind of proxy has no other significant effect. The most common example is a hardware firewall on an intranet.
  • Espionage proxy: a proxy server set up by an organization or individual to record the data that users transmit, for purposes such as research and monitoring.
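The first three categories above can be told apart, roughly, by inspecting the headers that reach the web server. The following is a heuristic sketch rather than a definitive test: the function `classify_anonymity` and the sample addresses are hypothetical, it looks only at the `Via` and `X-Forwarded-For` headers (which appear as HTTP_VIA and HTTP_X_FORWARDED_FOR in CGI-style server variables), and it assumes we know our own real IP because we run the check ourselves.

```python
def classify_anonymity(headers, real_ip):
    """Guess a proxy's anonymity level from the headers the server received.

    headers: request headers as seen by the web server.
    real_ip: the client's own IP, known to us because we run the test.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    via = normalized.get("via", "")
    forwarded_for = normalized.get("x-forwarded-for", "")

    if real_ip in [ip.strip() for ip in forwarded_for.split(",")]:
        return "transparent"       # the real client IP is disclosed
    if via or forwarded_for:
        return "anonymous"         # proxy use is visible, real IP is not
    return "highly anonymous"      # looks like an ordinary direct client


REAL_IP = "198.51.100.23"  # hypothetical client address
print(classify_anonymity({"X-Forwarded-For": REAL_IP, "Via": "1.1 proxy"}, REAL_IP))
print(classify_anonymity({"Via": "1.1 proxy"}, REAL_IP))
print(classify_anonymity({"User-Agent": "demo"}, REAL_IP))
```

In practice the headers would come from a page that echoes back what it received when requested through the proxy under test.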

5. Common proxy settings

  • Use free proxies from the Internet: it is best to use highly anonymous proxies. Few free proxies actually work, so you need to filter out the usable ones before using them, or go further and maintain a proxy pool.
  • Use paid proxy services: many proxies on the Internet are available for a fee, and their quality is much better than that of free proxies.
  • ADSL dialing: each redial yields a new IP. This approach is highly stable and is also an effective solution.
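Filtering a list of free proxies usually means sending a test request through each one and keeping those that respond. The sketch below does this concurrently with the standard library; the test URL, the candidate addresses, and the function names `is_alive` and `filter_proxies` are hypothetical. The demonstration at the end uses a stand-in check so that it runs without network access.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://example.com/"  # hypothetical page used for the health check


def is_alive(proxy, timeout=5):
    """Return True if a request through the proxy succeeds within the timeout."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as response:
            return response.status == 200
    except Exception:  # any failure means the proxy is treated as unusable
        return False


def filter_proxies(candidates, check=is_alive, workers=10):
    """Check candidates concurrently and keep the ones that pass."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, candidates))
    return [proxy for proxy, ok in zip(candidates, results) if ok]


# Offline demonstration with a stand-in check; drop `check=` to test for real.
candidates = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
usable = filter_proxies(candidates, check=lambda p: p.endswith("10:8080"))
print(usable)  # ['http://192.0.2.10:8080']
```

Running such a filter periodically and storing the survivors is the core of a simple proxy pool.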

