How crawlers get around IP blocking

When crawling a site, if your request frequency exceeds the threshold the website has set, access will be blocked. Most anti-crawling mechanisms identify crawlers by their IP address.

1. Use proxy IPs, and switch to a new IP shortly before or after the current one gets blocked. This method mainly requires a large pool of stable proxies. Free proxy IPs exist, but they are unstable. The trick is to recycle: replace an IP before it is blocked, then rotate back to it after a while. This lets you make many requests with relatively few IPs. The free proxy list on the news agency homepage, updated every 10 minutes, is still quite good: first scrape the free proxies from that page (the data is cached for 10 minutes, after which the cache expires), then cycle through those scraped proxies to crawl other sites.

2. Use a VPN. This works much like a proxy; the underlying technology differs slightly, but the essence is the same.

3. Use collection software backed by a large-scale cloud collection cluster, such as Octopus.

4. Control the request rate while crawling: sleep for a while after each batch of requests before continuing.

5. If your router obtains its IP dynamically, restarting it will assign a new IP automatically.

Of the five methods above, using proxy IPs is the first recommendation. The Octopus collector also supports proxy IPs, or you can use a cloud collection platform.
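The two cheapest tactics above, cycling through a small proxy pool and sleeping between requests, can be combined in one loop. A minimal sketch in Python for illustration (the proxy addresses and the `fetch` step are placeholders, not real services):

```python
import itertools
import random
import time

# Hypothetical proxy list; in practice, scrape it from a free-proxy page
# and refresh it every 10 minutes when the cached list expires.
PROXIES = ["111.222.33.44:8080", "55.66.77.88:3128", "99.88.77.66:8000"]

def rotating_proxies(proxies):
    """Cycle through the pool endlessly, so each IP gets to rest between uses."""
    return itertools.cycle(proxies)

def crawl(urls, proxies, min_delay=1.0, max_delay=3.0):
    """Fetch each URL through the next proxy, sleeping between requests."""
    pool = rotating_proxies(proxies)
    for url in urls:
        proxy = next(pool)
        # A real fetch(url, proxy) HTTP request would go here.
        print(f"GET {url} via {proxy}")
        # Randomized delay keeps the request rate under the site's threshold.
        time.sleep(random.uniform(min_delay, max_delay))
```

Because `itertools.cycle` returns to the first proxy after the last one, a blocked IP naturally gets a cool-down period proportional to the pool size, which is exactly the recycling trick described above.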

 

1. The simplest approach is to forge X-FORWARDED-FOR and CLIENT-IP in the headers and forge the Referer. The code is as follows:

curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-FORWARDED-FOR: 111.222.33.44', 'CLIENT-IP: 111.222.33.44'));
curl_setopt($ch, CURLOPT_REFERER, "http://www.test.com");
2. The method above fools most sites, but some still detect the real IP. In that case, use a proxy IP. The trouble is obtaining valid proxy IPs and port numbers; some proxies also require a username and password. From these you can build a database of working proxies. The code is as follows:
// Specify the proxy address
// $ips is an array of "host:port" proxy strings collected beforehand
$ip = $ips[array_rand($ips)]; // Pick a random proxy IP from the pool
curl_setopt($ch, CURLOPT_PROXY, $ip);
// Provide username and password if the proxy requires authentication
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:pass');
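Building the "effective proxy database" mentioned above amounts to periodically testing each candidate and keeping only the ones that respond. A hedged sketch in Python, where `check` is any callable you supply that returns True for a working proxy (a real implementation would make an HTTP request through the proxy with a short timeout):

```python
def build_proxy_pool(candidates, check):
    """Return the subset of candidate proxies that pass the health check.

    `candidates` is a list of "host:port" strings (or
    "user:pass@host:port" for authenticated proxies); `check` is a
    function that attempts a request through the proxy and returns
    True on success.
    """
    working = []
    for proxy in candidates:
        try:
            if check(proxy):
                working.append(proxy)
        except Exception:
            # Treat timeouts and connection errors as a dead proxy.
            pass
    return working
```

Run this on a schedule (for example, whenever the 10-minute free-proxy cache refreshes) so dead proxies are pruned before the crawler ever tries them.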
There is another situation: the page can be opened in a browser but not fetched with curl. It turns out the server checks the User-Agent header; requests without one are treated as illegitimate traffic such as a crawler. So we add a User-Agent to the headers ourselves. The code is as follows:
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11");
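Beyond hard-coding a single User-Agent, a common refinement is to rotate through several realistic browser strings so repeated requests look less uniform. A small illustrative sketch in Python (the strings below are just examples, not a vetted list):

```python
import random

# Example browser User-Agent strings; swap in whatever set you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_user_agent():
    """Pick a User-Agent at random to send with the next request."""
    return random.choice(USER_AGENTS)
```

Each outgoing request then sets its User-Agent header to `random_user_agent()` instead of a fixed value.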