How should you understand HTTP proxy IPs?


Most people know that a crawler which requests the same website over and over will usually be blocked by the site's IP-based anti-crawler mechanism, so most crawler authors use proxy IPs to work around IP bans.

A small number of people, however, misunderstand HTTP proxy IPs and think that using one solves every problem. A proxy IP is not a panacea. It is just a tool, and if you use it carelessly, the proxy IP will be blocked as well.

There are three types of proxy IP: transparent proxy, ordinary anonymous proxy, and high-anonymity (advanced anonymous) proxy.

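The main difference between high-anonymity, anonymous, and transparent proxies lies in three values that the target server sees: REMOTE_ADDR, HTTP_X_FORWARDED_FOR, and HTTP_VIA. The last two are the CGI-style names for the X-Forwarded-For and Via request headers.

REMOTE_ADDR cannot be forged by setting a request header, because the server takes it from the network connection itself. It is always the address of whatever machine connected directly to the server, which is the proxy when you use one.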

With a transparent proxy (Transparent), the target server knows that you are using a proxy and also knows your real IP. REMOTE_ADDR=ProxyIP, HTTP_VIA=ProxyIP, HTTP_X_FORWARDED_FOR=YourIP

With an anonymous proxy (Anonymous), the target server knows that you are using a proxy, but does not know your real IP. REMOTE_ADDR=ProxyIP, HTTP_VIA=ProxyIP, HTTP_X_FORWARDED_FOR=ProxyIP

With a high-anonymity proxy (High), the target server neither knows that you are using a proxy nor knows your real IP. REMOTE_ADDR=ProxyIP, HTTP_VIA=NULL, HTTP_X_FORWARDED_FOR=NULL
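The three cases above can be sketched as a small check, for example on a test server you control where you already know your own real IP. This is only an illustration: the function name `classify_proxy`, its parameters, and the example addresses are assumptions and do not come from any particular library.

```python
from typing import Optional


def classify_proxy(real_ip: str,
                   remote_addr: str,
                   http_via: Optional[str],
                   http_x_forwarded_for: Optional[str]) -> str:
    """Classify a proxy from the three values the target server sees.

    real_ip is your own address, known out of band (hypothetical input);
    the other three mirror REMOTE_ADDR, HTTP_VIA and HTTP_X_FORWARDED_FOR.
    """
    # The server sees our own address directly: no proxy is in the path.
    if remote_addr == real_ip:
        return "no proxy"

    # Neither header is present: the server cannot tell a proxy is in use.
    if not http_via and not http_x_forwarded_for:
        return "high anonymity"

    # X-Forwarded-For may carry a comma-separated list of addresses.
    forwarded = [part.strip()
                 for part in (http_x_forwarded_for or "").split(",")
                 if part.strip()]

    # Our real address leaked through the header: transparent proxy.
    if real_ip in forwarded:
        return "transparent"

    # Proxy headers are present but do not reveal the real address.
    return "anonymous"


if __name__ == "__main__":
    me, proxy = "203.0.113.7", "198.51.100.9"  # documentation-range addresses
    print(classify_proxy(me, proxy, proxy, me))
    print(classify_proxy(me, proxy, proxy, proxy))
    print(classify_proxy(me, proxy, None, None))
```

Real proxies vary in which headers they add, so a check like this is only a first approximation of how a target site might see you.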

With a transparent proxy or an ordinary anonymous proxy, the target site can tell that a proxy IP is in use and may restrict it, while a high-anonymity proxy does not reveal this, so keep the difference in mind when choosing a proxy IP.

Even when you crawl the target site through a proxy IP, several factors can still get the IP blocked, such as cookies and the User-Agent: once the site's threshold is exceeded, the IP is blocked. Requesting the target site too frequently also gets the IP blocked, because a normal visitor browses far less often than that, and the site's anti-crawler strategy will readily recognize the traffic as automated.

The more closely the crawler imitates how a real user browses under normal circumstances, the less likely the IP is to be blocked.
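One way to approximate that behaviour is to pause for a random interval between requests and to vary the User-Agent. The sketch below is a minimal illustration under stated assumptions: `crawl_politely`, the `fetch` callback, the delay bounds, and the User-Agent strings are all hypothetical placeholders, and the actual HTTP call (through your proxy) is left to whatever `fetch` function you supply.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Optional

# Illustrative User-Agent strings only; substitute values that suit your case.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def crawl_politely(urls: Iterable[str],
                   fetch: Callable[[str, Dict[str, str]], str],
                   min_delay: float = 2.0,
                   max_delay: float = 6.0,
                   sleep: Callable[[float], None] = time.sleep,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Fetch each URL with a random pause and a randomly chosen User-Agent.

    fetch(url, headers) is supplied by the caller and performs the request.
    """
    rng = rng or random.Random()
    results: List[str] = []
    for index, url in enumerate(urls):
        # Wait a random time before every request except the first one,
        # so requests do not arrive at a fixed, machine-like rhythm.
        if index > 0:
            sleep(rng.uniform(min_delay, max_delay))
        headers = {"User-Agent": rng.choice(USER_AGENTS)}
        results.append(fetch(url, headers))
    return results
```

Suitable delay bounds depend on the target site; the defaults here are arbitrary and not recommended values.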


Origin blog.csdn.net/ipiohiuhn/article/details/114085466