Common anti-collection and anti-crawler strategies and their solutions

 

1. Limiting the number of visits per IP per unit of time (request frequency)

 

Background: No ordinary user accesses the same website N times in one second (whether or not it is the same page).

 

Solution: When this happens we generally slow down the collection frequency, either by adding a sleep in the code you write or by setting a wait interval in Octopus.

 

Evolution 1: Some anti-collection policies go a step further and monitor the timing of every request. If you keep requesting at the same fixed frequency, such as once every second, you will be blocked as well.

 

Solution: In this situation we generally add a random number to the collection interval, so that the time between visits is different each time.
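
For readers who write their own code, the random interval described here could look roughly like the following Python sketch. The function names and the 1 to 3 second range are illustrative assumptions, not Octopus settings.

```python
import random
import time

def jittered_delays(count, low=1.0, high=3.0, rng=None):
    """Return `count` wait times drawn uniformly from [low, high] seconds."""
    if rng is None:
        rng = random.Random()
    return [rng.uniform(low, high) for _ in range(count)]

def crawl(urls, fetch, low=1.0, high=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval after each request."""
    results = []
    delays = jittered_delays(len(urls), low, high)
    for url, delay in zip(urls, delays):
        results.append(fetch(url))
        sleep(delay)  # a different pause each time, so no fixed rhythm appears
    return results
```

Here `fetch` is whatever function actually downloads a page; it is passed in so the timing logic stays independent of the HTTP library used.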

 

Evolution 2: Some harsher anti-collection strategies even count how many pages each IP requests per day or per time period. Through data analysis the site knows roughly how many pages a real user visits at most, and if you exceed that number you are blocked just the same.

 

 

Solution: This situation can only be solved with multiple IPs or multiple servers: many different virtual terminals access the site, and the requests are shared evenly among them. For this, Octopus provides a proxy IP pool and, in the Ultimate package, a cloud server cluster.
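
The idea of sharing requests evenly across several IPs can be sketched in Python as a simple round-robin assignment. The proxy addresses and URLs below are hypothetical placeholders, and this is not how Octopus's own proxy pool is implemented.

```python
from collections import Counter
from itertools import cycle

def assign_proxies(urls, proxies):
    """Pair each URL with a proxy in round-robin order so load is shared evenly."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]

# Hypothetical proxy addresses and URLs, for illustration only.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
URLS = [f"https://example.com/page/{i}" for i in range(9)]

plan = assign_proxies(URLS, PROXIES)
per_proxy = Counter(proxy for _, proxy in plan)
print(per_proxy)  # each proxy receives 3 of the 9 requests
```

With more proxies than the site's per-IP daily limit requires, each individual IP stays under the threshold a real user would plausibly reach.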

 

2. Verification codes (CAPTCHAs)

 

Background: The CAPTCHA was born as a tool to tell whether you are a person or a machine.

 

Solution:

 

This trick has been used to death. Ordinary verification codes, even with added distortion, can now be broken by image recognition, so many stranger codes have appeared on the market; the most classic is probably the 12306 verification code. But however complex the code, including Chinese idioms, Chinese characters, arithmetic problems and the like, all of them can be broken, because there are captcha-solving platforms: human-powered services where people enter the verification code for you.

 

 

 

Octopus has built-in support for solving the vast majority of verification codes. Apart from one or two extremely rare special types for which nobody in the world has yet found a solution, all others are supported.

 

 

3. Requiring user login (cookies) to access web content

 

Background: The site limits your access permissions by account.

 

Solution:

In general we only need to configure the login steps in Octopus. As long as you can provide the corresponding account and password, Octopus can simulate the login on the website and then fetch the data. If you do not have an account, nothing can be done. On ITjuzi (IT Oranges), for example, without an account you can only see the first 1000 records; only by paying for its SaaS account can you see more data.

 

Evolution 1: Even having an account does not help

 

Solution:

Take Jingdong (JD.com) comments: you can only see the latest 1000. In this case Octopus's scheduled collection is needed. We monitor the page at a certain frequency, collect new data as soon as it appears, and keep accumulating it over time.
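
The accumulate-over-time approach amounts to merging each new snapshot into a local store and keeping only the records not seen before. A minimal Python sketch, assuming each record carries a unique `id` field (a hypothetical field name) and using made-up data:

```python
def merge_snapshot(store, snapshot, key="id"):
    """Add records from the latest snapshot that are not yet in the store.

    `store` maps record id -> record; returns the list of newly added records.
    """
    added = []
    for record in snapshot:
        if record[key] not in store:
            store[record[key]] = record
            added.append(record)
    return added

# Hypothetical data: the site only ever shows the latest 3 comments.
store = {}
first_run = [{"id": 3, "text": "c"}, {"id": 2, "text": "b"}, {"id": 1, "text": "a"}]
second_run = [{"id": 5, "text": "e"}, {"id": 4, "text": "d"}, {"id": 3, "text": "c"}]

merge_snapshot(store, first_run)
new = merge_snapshot(store, second_run)
print([r["id"] for r in new])   # ids seen for the first time in the second run
print(sorted(store))            # everything accumulated so far
```

The collection interval has to be short enough that fewer new records appear between two runs than the site shows at once; otherwise some records scroll out of view before they are captured.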

 

4. Web content encrypted with JavaScript

 

Background: The web content is the result of JavaScript computation in the browser.

 

Solution:

For collectors that send plain HTTP POST requests, this trick adds considerable complexity and difficulty. But Octopus was born for exactly this confrontation. It accesses pages with a built-in browser: when it opens a page it executes the JS code that fetches the data, and only then parses the page. Because it runs the JS itself, it gets around this measure easily.

 

Ordinary crawlers that work by sending HTTP requests from code cannot get around this at all; anyone who collects by writing code must first break the JS encryption.

 

5. Link randomization

 

Background: Page links are randomized: the same page has multiple links, or different links are generated under different circumstances.

 

Solution:

In this case we generally start from the source and visit the way a person would: the home page first, then the list page, then the content page. The site may randomize the links of its inner pages, but the home page address cannot be random. Holding on to that one fixed point is enough to crack the scheme.
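
The home, list, content walk can be illustrated with a small Python sketch that never hard-codes an inner URL and instead discovers each one from the page before it. The fake site, its CSS class names, and its URL scheme are all hypothetical, built in memory so the example runs without network access.

```python
import random
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect (css_class, href) for every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self.links.append((attrs.get("class"), attrs.get("href")))

def links_with_class(html, css_class):
    parser = LinkParser()
    parser.feed(html)
    return [href for cls, href in parser.links if cls == css_class]

def build_fake_site(rng):
    """Hypothetical site whose inner URLs change on every build; only "/" is fixed."""
    list_url = f"/list-{rng.randrange(10**6)}"
    detail_urls = [f"/item-{i}-{rng.randrange(10**6)}" for i in range(3)]
    pages = {"/": f'<a class="nav" href="{list_url}">All items</a>'}
    pages[list_url] = "".join(f'<a class="item" href="{u}">item</a>' for u in detail_urls)
    for i, u in enumerate(detail_urls):
        pages[u] = f"<h1>Item {i}</h1>"
    return pages

def crawl_from_home(pages):
    """Walk home -> list -> detail, discovering each URL from the page before it."""
    details = []
    for list_url in links_with_class(pages["/"], "nav"):
        for detail_url in links_with_class(pages[list_url], "item"):
            details.append(pages[detail_url])
    return details

print(crawl_from_home(build_fake_site(random.Random())))
```

Because the crawler keys on where a link sits in the page rather than on its address, it collects the same content no matter what random URLs the site generates.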

 

Evolution 1: Page addresses generated by script

 

In front of Octopus's browser, this has no effect at all either. Because Octopus simulates human actions, the page can still be collected unless the generated address is one that people are not meant to visit.

 

6. Adding invisible, confusing elements to pages

 

Background: Conventional page parsing treats the page as structured data and locates content by matching strings with regular expressions. Adding obfuscating code or text therefore makes the page harder to crack and causes you more trouble. I once saw a page whose parsed output read: "Do not pick, do not pick, then they have to be taken I Diao"

 

Solution: Because Octopus locates content mainly by XPath, this small trick is easily bypassed. At worst we fall back on string replacement, replacing the confusing character segments according to certain rules. After all, web developers insert their obfuscating code following a certain pattern.
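
To show why structural location beats plain string matching here, the following Python sketch uses the limited XPath support in the standard library's `xml.etree.ElementTree`. It assumes the page is well-formed XML and that the noise is hidden with an inline `display:none` style; real pages usually need a proper HTML parser and may hide elements in other ways. The markup is a made-up example.

```python
import xml.etree.ElementTree as ET

def is_hidden(elem):
    """True if the element is hidden with an inline display:none style."""
    style = elem.get("style", "").replace(" ", "").lower()
    return "display:none" in style

def visible_text(elem):
    """Concatenate text under elem, skipping hidden children but keeping their tails."""
    parts = [elem.text or ""]
    for child in elem:
        if not is_hidden(child):
            parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

PAGE = """<html><body>
<div class="title">Sample <span style="display:none">do-not-scrape</span>product</div>
<div class="price">12<span style="display: none">99</span>.50</div>
</body></html>"""

root = ET.fromstring(PAGE)
title = root.find(".//div[@class='title']")
price = root.find(".//div[@class='price']")

naive = "".join(price.itertext())   # includes the hidden noise
print(naive, visible_text(price), visible_text(title))
```

The naive extraction picks up the hidden digits, while the structure-aware one returns only what a visitor would actually see.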

 

7. Random page templates

 

Background: To make collection harder, pages of the same type are displayed with several different templates.

 

Solution: This mainly takes patience. I have seen paginated list pages where odd-numbered pages use one template and even-numbered pages another, or where the template switches on a regular pattern, such as every 10th page. This has to be observed carefully before we start collecting. It is easy to spot, though: generally, when the template changes, we stop getting data.

 

 

 

When the templates are not the same, collection works well on the earlier pages and then fails on a later one. Most such failures are caused by template inconsistencies. With Octopus's built-in conditional logic, you can distinguish pages by their different features and direct Octopus to parse each kind differently.
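
In hand-written code, the same judge-then-parse logic could be sketched as a list of (feature, parser) pairs. The markers and markup below are hypothetical examples, and this is only an illustration of the idea, not Octopus's internal mechanism.

```python
import re

def parse_template_a(html):
    match = re.search(r'<h1 class="title">(.*?)</h1>', html)
    return match.group(1) if match else None

def parse_template_b(html):
    match = re.search(r'<div id="headline">(.*?)</div>', html)
    return match.group(1) if match else None

# Each entry: (feature that identifies the template, parser to use for it).
# The markers and markup here are hypothetical examples.
TEMPLATES = [
    ('class="title"', parse_template_a),
    ('id="headline"', parse_template_b),
]

def parse_page(html):
    """Pick a parser by the page's distinguishing feature; fail loudly if none fits."""
    for marker, parser in TEMPLATES:
        if marker in html:
            return parser(html)
    raise ValueError("unknown template: inspect this page and add a rule")

pages = [
    '<h1 class="title">Page one</h1>',
    '<div id="headline">Page two</div>',
]
print([parse_page(p) for p in pages])
```

Raising an error on an unknown template, rather than silently returning nothing, makes a newly introduced template visible as soon as it appears.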

 

8. AI-based anti-collection

 

Background: An estimated 99.9% of the anti-collection measures on the Internet come down to the moves above, but the remaining 0.1% are the truly hard ones. Some large companies have dedicated AI anti-collection teams.

 

They can recognize your web requests. Whether you go through a browser or send raw requests, as long as your access trajectory on their site differs from that of a typical user, or of the vast majority of users, they apply some anti-collection strategy, such as adding verification codes or serving false data.

 

Solution: Here we need to collect in a way that looks more like a human operating. For instance, a person normally visits the home page, clicks in a few places, scrolls a little, goes to the list page, looks around, and then enters a detail page. Octopus can perform all of these human-like operations, including how many screens to scroll down automatically, dwell time, hover position, and so on.
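
As a rough illustration of such a human-like visit, the Python sketch below only generates a randomized action plan (open, scroll, wait, hover); actually executing the steps would be left to a browser automation tool. The action names, value ranges, and viewport size are assumptions chosen for the example.

```python
import random

def human_like_plan(rng=None):
    """Build a home -> list -> detail visit with randomized scrolls and dwell times."""
    if rng is None:
        rng = random.Random()
    plan = []
    for page in ("home", "list", "detail"):
        plan.append({"action": "open", "page": page})
        for _ in range(rng.randint(1, 4)):  # scroll a few screens, pausing in between
            plan.append({"action": "scroll", "screens": 1})
            plan.append({"action": "wait", "seconds": round(rng.uniform(0.5, 2.0), 2)})
        plan.append({"action": "hover", "x": rng.randint(0, 1200), "y": rng.randint(0, 800)})
        plan.append({"action": "wait", "seconds": round(rng.uniform(2.0, 8.0), 2)})  # dwell
    return plan

for step in human_like_plan():
    print(step)
```

Each run produces a different number of scrolls and different pauses, so repeated visits do not share one identical trajectory.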

 

Evolution 1: Building an IP blacklist pool

 

Some large companies build an IP blacklist pool; once an IP in the pool visits, it is rejected immediately. This typically happens to foreign IPs or to certain data-center IPs. Put plainly, these IPs have already been burned by earlier abuse. At this point, high-quality proxy IP resources are especially precious.

 

 

Figure: Octopus high-quality proxy IP settings interface

 

In my opinion, collection and anti-collection will always be in contradiction; nobody can say that a site absolutely cannot be collected, or that it absolutely can. In this line of business, the real question is the cost of collection weighed against its benefit. Verification codes, IPs and so on are all necessary overhead, and with large volumes of data this cost is sometimes very high.

 

The site owner, in turn, uses these measures to raise the difficulty and cost of collection, so that things stay within a relatively balanced and controllable range. Among the projects I have handled, some need to spend millions annually on IPs or verification codes to collect the desired data. What Octopus can do is help you get the data you want at the best cost, not at zero cost.

 

Related collection tutorials:

Toutiao (Today's Headlines) data collection:

http://www.bazhuayu.com/tutorialdetail-1/jrtt-7.html

Collecting Zhihu topic information (using Zhihu Discover as an example):

http://www.bazhuayu.com/tutorialdetail-1/zh-ht.html

Taobao product information collection:

http://www.bazhuayu.com/tutorialdetail-1/cjtbsp-7.html

Meituan merchant information collection:

http://www.bazhuayu.com/tutorialdetail-1/mtsj_7.html

Lottery data collection:

http://www.bazhuayu.com/tutorialdetail-1/cpkjdatacj.html

Qidian novel collection method and detailed steps:

http://www.bazhuayu.com/tutorialdetail-1/qidianstorycj.html

Amazon Reviews collection:

http://www.bazhuayu.com/tutorialdetail-1/ymxspplcj.html

 

Octopus: the web data collector chosen by 90 million users.

1. Simple to operate, anyone can use it: no technical background is needed; if you can browse the web, you can collect. The whole process is visual, operations are completed with mouse clicks, and you can get started in 2 minutes.

2. Powerful, any website can be collected: pages that require clicking, login, paging, or verification code recognition, waterfall layouts, and data loaded asynchronously by Ajax scripts can all be collected with simple settings.

3. Cloud collection, so you can shut down your machine. Once a collection task is configured you can power off, and the task runs in the cloud. A huge cloud collection cluster runs 24*7 without interruption, so there is no need to worry about blocked IPs or network outages.

4. Free features plus value-added services, chosen on demand. The free version has all the features needed to meet users' basic collection needs. A number of value-added services (such as private cloud) are also offered to meet the needs of high-end paying business users.

 


Origin www.cnblogs.com/haibo123/p/11294318.html