Strategies to keep a web crawler's IP from being banned

Background

For the past two days I have been working on a Java web crawler for my Java course project. The goal is to crawl the reviews of the Douban Movie Top 250; sentiment analysis may be needed afterwards, but that is outside the crawler itself. At first, crawling page by page caused no real problems. Then, during the full test run last night, my IP got banned. After roughly tens of thousands of reviews had been crawled, the next test run failed with an exception. A web search suggested the IP might have been blocked, and when I reopened the Douban site it told me I had to log in to get access, which confirmed the ban.

The same thing happened again today, and three IPs were banned. Fortunately the lab has plenty of IPs, but I can't just burn through them. After a day-long battle of wits with Douban, here are some strategies for keeping your IP from being banned.

Main reference: https://www.cnblogs.com/mooba/p/6484340.html

User-Agent spoofing and rotation

Different browsers, and different versions of the same browser, send different User-Agent strings. The User-Agent is an HTTP request header that the browser submits with every request, and it carries detailed information such as the browser type. Sending a different User-Agent on each request helps get past anti-crawler mechanisms that inspect the client. For example, put a number of User-Agent strings in a list and pick one at random for each request.

Here are some User-Agent strings, grouped by browser.

Opera
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60
  • Opera/8.0 (Windows NT 5.1; U; en)
  • Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50

Firefox
  • Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0
  • Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10

Safari
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2

Chrome
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
  • Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16

360
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
  • Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko

Taobao Browser
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11

Cheetah Browser (LBBROWSER)
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER
  • Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)

QQ Browser
  • Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)

Sogou Browser
  • Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0
  • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)

Maxthon
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36

UC Browser
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36

Mobile

iPhone
  • Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5

iPod
  • Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5

iPad
  • Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5
  • Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5

Android
  • Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
  • Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

QQ Browser for Android
  • MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

Android Opera Mobile
  • Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10

Android Pad Moto Xoom
  • Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13

BlackBerry
  • Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+

WebOS HP Touchpad
  • Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0

Nokia N97
  • Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124

Windows Phone Mango
  • Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)

UC Browser
  • UCWEB7.0.2.37/28/999
  • NOKIA5700/ UCWEB7.0.2.37/28/999

UCOpenwave
  • Openwave/ UCWEB7.0.2.37/28/999

UC Opera
  • Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999
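
Picking a random User-Agent for each request, as described above, might be sketched like this in Python using only the standard library. The helper name build_request and the shortened USER_AGENTS list are assumptions made for illustration; the strings themselves are taken from the list above.

```python
import random
import urllib.request

# A few strings copied from the list above; extend as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) "
    "Version/5.1.7 Safari/534.57.2",
]


def build_request(url):
    """Return a Request whose User-Agent header is chosen at random.

    build_request is a hypothetical helper, not part of any library.
    """
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)
```

A request built this way can then be sent with urllib.request.urlopen(build_request(url), timeout=10). Very old User-Agent strings may themselves look suspicious to some sites, so the list probably deserves occasional refreshing.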

There is also an open-source library dedicated to spoofing browser identity, and its name is very straightforward:

fake-useragent

Proxy IPs and rotation

Checking per-IP access is the anti-crawler mechanism that sites like to use most. When this happens you can switch to a different IP address to keep crawling. If you have several hosts with public IP addresses, or VPSes, that is the better option. If not, consider using a proxy: the proxy server fetches the page on your behalf and forwards it back to your machine. By level of anonymity, proxies fall into transparent, anonymous, and highly anonymous proxies:

  • Transparent proxy: the target site knows you are using a proxy and also knows your source IP address. This kind of proxy clearly defeats the purpose of using one here.
  • Anonymous proxy: a relatively low level of anonymity. The site knows you are using a proxy but does not know your source IP address.
  • Highly anonymous proxy: the safest option. The target site neither knows you are using a proxy nor knows your source IP.

Proxies can be bought, or you can collect free ones yourself: some sites publish free proxy lists that can be scraped, but free proxies are usually not stable enough.
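
Rotating through a pool of proxies could be sketched as follows, again with only the Python standard library. The addresses in PROXIES are placeholders from the 203.0.113.0/24 documentation range, not working proxies, and next_opener is a hypothetical helper name.

```python
import itertools
import urllib.request

# Placeholder addresses (documentation range); replace with proxies you
# have bought or collected and verified.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener) using the next proxy in round-robin order."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)
```

Each call hands back an opener bound to the next proxy, so opener.open(url, timeout=10) sends that request through it. Since free proxies tend to be unstable, it is probably worth wrapping the call in a try/except and dropping proxies that fail repeatedly.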

Set an access interval

Many sites' anti-crawler mechanisms limit the access rate: if an IP exceeds a given number of requests within a short period, it is put into a "cooldown". So besides rotating IPs and User-Agents, you can make the interval between requests a bit longer, for example by sleeping for a random time after each page:

  import random
  import time

  time.sleep(random.random() * 3)  # pause for a random 0 to 3 seconds

For a crawler, this is also the more responsible approach. A crawler can put real load on the site it visits, so this precaution both reduces the chance of being banned and eases the pressure on the other side.
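
If the crawl loop does other work between requests, a small throttle that enforces a minimum gap plus random jitter may be more convenient than a bare sleep. The class below is a sketch under that assumption; Throttle, min_delay and jitter are illustrative names, and the default values are guesses rather than limits published by any site.

```python
import random
import time


class Throttle:
    """Keep at least min_delay seconds (plus random jitter) between requests."""

    def __init__(self, min_delay=1.0, jitter=2.0):
        self.min_delay = min_delay
        self.jitter = jitter
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            target_gap = self.min_delay + random.random() * self.jitter
            remaining = target_gap - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Calling throttle.wait() right before each request means time already spent parsing the previous page counts toward the gap, so the crawler does not wait longer than it has to.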

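For Java, HttpClient is the most effective place to configure this. Note that the RequestConfig below sets timeouts (how long a request may wait) rather than the interval between requests; the pause between requests still has to come from a sleep in the crawl loop.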

import org.apache.http.client.config.RequestConfig; // Apache HttpClient 4.x

private static RequestConfig getConfig() {
    return RequestConfig.custom()
            .setConnectTimeout(10000)           // max time to establish the connection (ms)
            .setConnectionRequestTimeout(10000) // max time to wait for a connection from the pool (ms)
            .setSocketTimeout(10000)            // max time to wait for data between packets (ms)
            .build();
}

Finally, here is a link to an introduction to the User-Agent header:

https://developer.mozilla.org/zh-CN/docs/Web/HTTP/Headers/User-Agent

Origin www.cnblogs.com/wkfvawl/p/11831526.html