Solutions for crawlers that run into IP access frequency limits

Background:

In most cases, the anti-crawling measure we run into is an access frequency limit: if you visit too fast, the website decides you are not a human. In that case you need to set a sensible frequency threshold, or you may get blocked. If you have ever registered for the TOEFL or bought a train ticket on 12306, you have probably had this experience: sometimes, even though you really are operating the page by hand, clicking the mouse too quickly makes the site prompt you that "the operating frequency is too fast...".

For pages like these, the most direct approach is to limit the access interval, for example visiting a page only once every 5 seconds. But a slightly smarter website will inspect your timing: this visitor viewed dozens of pages, and every visit was exactly 5 seconds apart. How could a person keep such a precise interval? It must be a crawler, and banning it is only natural. So set the access interval to a random value instead, such as a random number of seconds between 0 and 10.

Of course, when a site enforces a frequency limit, using Selenium for access becomes more attractive, because Selenium itself takes some time to open each page. This is a blessing in disguise: its low efficiency happens to get us past the anti-crawler frequency check. Selenium can also render the page's JavaScript for us, sparing us the trouble of analysing the JavaScript source by hand.
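As a rough illustration only, a minimal Selenium sketch could look like this (it assumes Chrome with a locally installed chromedriver; the URL is a placeholder):

# A minimal sketch: let Selenium load and render the page, JavaScript included.
from selenium import webdriver

driver = webdriver.Chrome()                    # assumes chromedriver is on PATH
driver.get("https://example.com/some/page")    # placeholder URL; the page load itself adds a delay
rendered_html = driver.page_source             # HTML after JavaScript has run
driver.quit()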

Below are several scenarios in which I often adjust the access frequency, for your reference:

1. Stand-alone crawlers using Requests:

Interval code of this kind is placed right after each Requests request, as sketched below.
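This is only a minimal sketch of the idea; the URL list and headers are placeholders:

import random
import time

import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 11)]   # placeholder URL list
for url in urls:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # ... parse resp.text here ...
    time.sleep(random.uniform(0, 10))   # pause a random 0-10 seconds before the next request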

 

2. Stand-alone Scrapy crawlers, or distributed crawling with scrapy_redis
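A minimal sketch of the relevant settings.py entries, which apply to a plain Scrapy project and a scrapy_redis project alike (the concrete values are placeholders):

# settings.py
DOWNLOAD_DELAY = 5                  # base delay between requests to the same site, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True     # actual delay varies between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep requests to one domain sequential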

 

I will explain the other parameters of the request-interval settings here in a new post, so stay tuned.

In addition, friends who are still unclear about the difference between scrapy and scrapy_redis can head over here:

https://

 

3. A case that can be ignored

Some sites, such as one I dealt with before (hwt), limit your access frequency on the server side but do not ban your IP: the page mostly returns 403 (the server refuses access) and occasionally returns 200 (the request succeeds). This shows (provided you have already set the request headers as described above) that the anti-crawling mechanism only limits the request frequency and does not otherwise block collection. Such cases are rare, of course, which is why we have to learn to write targeted crawlers.
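A sketch of how such a site can still be collected, assuming the request headers are already prepared; the URL, retry count, and wait time are placeholders:

import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}   # the request headers mentioned above

def fetch_until_200(url, max_tries=10, wait=3):
    """Keep retrying while the rate limiter answers 403; return the first 200 response."""
    for _ in range(max_tries):
        resp = requests.get(url, headers=HEADERS)
        if resp.status_code == 200:
            return resp
        time.sleep(wait)                  # 403 from the rate limiter: back off and try again
    return None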

 

4. Some servers respond slowly for performance reasons (a response timeout terminates the request)

This usually happens with small sites, such as (DYW). Even after the request parameters were all configured, the collection program kept reporting 404 pages because of the server's poor performance. In that situation all we can do is lengthen the response timeout, as sketched below:
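For example (the URL and the 60-second figure are placeholders; requests also accepts a (connect, read) tuple for the timeout):

import requests

# Give the slow server more time before requests gives up on the response.
resp = requests.get("https://example.com/slow-page", timeout=60)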

 

5. Proxy IPs or distributed crawlers:

If crawling efficiency matters, however, we cannot bypass the frequency check by inserting access intervals.

Proxy IP access can solve this problem. If you fetch 100 pages through 100 proxy IPs, the site sees what looks like 100 different people each visiting a single page, so naturally it will not limit your access.

Proxy IPs are frequently unstable. Search for "free proxies" and you will find plenty of sites, each handing out lots of proxy IPs, but very few of them actually work. You need to maintain a pool of usable proxy IPs, yet a free proxy that works when you test it may be dead a few minutes later. Relying on free proxy IPs is time-consuming, and a test of your luck as well.
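A minimal sketch of such a pool check, with made-up proxy addresses:

import requests

CANDIDATE_PROXIES = [               # made-up addresses; fill in the proxies you collected
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

def usable_proxies(candidates, timeout=5):
    """Keep only the proxies that can actually fetch a page within the timeout."""
    good = []
    for proxy in candidates:
        try:
            resp = requests.get("http://icanhazip.com/",
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
            if resp.status_code == 200:
                good.append(proxy)
        except requests.RequestException:
            pass                    # dead or too-slow proxy, drop it
    return good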

You can use http://icanhazip.com/ to check whether your proxy IP has been set up successfully. When you visit this site directly in a browser, it returns your IP address.

With requests we can access the site through a proxy: the get method of requests has a proxies parameter that accepts a dictionary, and in that dictionary we set the proxy.

You can find more information about setting proxies for requests in the official Chinese documentation: http://docs.python-requests.org/zh_CN/latest/user/advanced.html#proxies
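A minimal sketch of such a call; the proxy addresses are made up and should be replaced with your own:

import requests

proxies = {
    "http": "http://10.10.1.10:3128",    # made-up proxy addresses; substitute real ones
    "https": "http://10.10.1.10:1080",
}
resp = requests.get("http://icanhazip.com/", proxies=proxies, timeout=10)
print(resp.text)                         # with a working proxy this prints the proxy's IP, not yours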

I chose the first kind, a plain HTTP proxy, for the test; the run results are shown below:


 

As the result shows, we successfully accessed the site through the proxy IP.

We can also use a distributed crawler. A distributed crawler is deployed across multiple servers, and the crawler on each server pulls its URLs from one shared place. Averaged out, each server therefore hits the target site less frequently. And because the servers are under our own control, the resulting crawler is more stable and efficient. This is also the final goal of this course.
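A minimal sketch of how the shared URL queue is usually wired up with scrapy_redis; the Redis address and the spider names are placeholders:

# settings.py -- route scheduling and de-duplication through a shared Redis instance
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                    # keep the request queue across restarts
REDIS_URL = "redis://localhost:6379"

# spiders/my_spider.py -- every server runs this same spider and pulls its
# start URLs from the shared Redis key instead of a local start_urls list
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "my_spider"
    redis_key = "my_spider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}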

For the distributed crawler, the dynamic proxy IPs are configured in settings.py, referenced from there, and finally take effect in middlewares.py; a sketch of this arrangement follows.
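This is only a sketch of one common way to wire it up, not the project's actual code: the PROXY_LIST setting, the RandomProxyMiddleware class name, and the proxy addresses are all hypothetical.

# settings.py -- a hypothetical PROXY_LIST plus registration of the middleware
PROXY_LIST = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 543,
}

# middlewares.py -- attach a random proxy to every outgoing request
import random

class RandomProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxy_list)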

 

Some websites use a different source format for every page of the same type, so we have to write a separate XPath or regular expression for each page, which is quite a headache. If all we need is the text, it is not so bad: we can simply strip out all the HTML tags. But if we also need the links and other content inside, there is nothing for it but to grind through the pages one by one.
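A minimal sketch of the tag-stripping approach, using lxml, with a cruder regex fallback:

import re

from lxml import html

def extract_text(page_source):
    """Drop every HTML tag and keep only the visible text."""
    return html.fromstring(page_source).text_content()

def extract_text_regex(page_source):
    """Cruder fallback: strip anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", page_source)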

 
