Anti-crawler mechanisms of domestic and international e-commerce platforms

E-commerce platforms are broadly built around two core engines, search and product recommendation, each with its own characteristics. Today's topic, however, is the anti-crawler mechanism: how an e-commerce platform can protect its data without hurting the normal user experience. This is the protracted offensive-and-defensive game the industry talks about.

First-order crawlers (technical chapter)

Scenario 1: static results page, no frequency limit, no blacklist

Attack: crawl directly with Scrapy

Defense: write a Lua script at the nginx layer that puts the crawler's IP on a blacklist and blocks it for a period of time (without telling the client how long)
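The timed-blacklist logic is simple enough to sketch. The following is an illustrative sketch in Python standing in for the nginx + Lua logic described above (a real deployment would live in an access-phase Lua handler backed by a shared dict or Redis); the block duration and function names are assumptions, not the platform's actual values.

```python
import time

BLOCK_SECONDS = 3600  # assumed block window; the real value is deliberately not disclosed

_blacklist = {}  # ip -> unix timestamp when the block expires

def ban(ip, now=None):
    """Put an IP on the blacklist for BLOCK_SECONDS."""
    now = time.time() if now is None else now
    _blacklist[ip] = now + BLOCK_SECONDS

def is_blocked(ip, now=None):
    """True while the IP's block window has not expired."""
    now = time.time() if now is None else now
    expiry = _blacklist.get(ip)
    if expiry is None:
        return False
    if now >= expiry:          # window over: quietly lift the ban
        del _blacklist[ip]
        return False
    return True
```

Note that the expiry is never sent to the client; the crawler only ever sees a generic rejection, which is exactly the "do not prompt the time" point above.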

Scenario 2: static results page, no frequency limit, blacklist

Attack: use proxies (HTTP proxy, VPN) and a random User-Agent
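The attack side here amounts to randomizing the request's identity. A minimal sketch, assuming a hypothetical pool of proxies and User-Agent strings (the addresses and UA values below are placeholders, not real endpoints):

```python
import random

# Placeholder pools; a real crawler would load these from a proxy provider
# and a maintained User-Agent list.
PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080", "http://proxy-c:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def disguise():
    """Pick a random proxy and a random User-Agent for one request."""
    return random.choice(PROXIES), {"User-Agent": random.choice(USER_AGENTS)}
```

Each request then goes out through a different proxy with a different UA, which defeats a pure per-IP blacklist.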

Defense: add a frequency window; when an IP exceeds a certain number of requests per hour or per day, block it for a period (without telling the client how long)

Scenario 3: static results page, frequency limit, blacklist

Attack: use proxies; crawl with a random 1-3 second delay, or crawl for 10 seconds and rest for 10 seconds, or crawl only within certain time windows; add more machines
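The pacing described above can be sketched as a small delay scheduler. This is only an illustration of the 1-3 s jitter plus 10 s burst/rest rhythm from the text; the function name and burst bookkeeping are my own.

```python
import random

def next_delay(elapsed_in_burst):
    """Seconds to sleep before the next request.

    elapsed_in_burst: seconds spent crawling since the last rest.
    """
    if elapsed_in_burst >= 10:       # crawled ~10 s: take the 10 s rest
        return 10.0
    return random.uniform(1, 3)      # normal inter-request jitter
```

The caller would `time.sleep(next_delay(...))` between requests and reset its burst clock after each rest, so traffic looks bursty and irregular rather than machine-regular.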

Defense: when an IP makes more than 60 requests in 5 minutes, pop up a verification code; passing it lifts the restriction, and failing to pass within 5 minutes blocks the IP for another hour (duration at your discretion)
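The "more than 60 requests in 5 minutes" rule is a sliding-window counter. A minimal sketch, using the thresholds from the text (the captcha flow itself is out of scope here, and the data structure choice is mine):

```python
from collections import defaultdict, deque

WINDOW = 300   # seconds (5 minutes)
LIMIT = 60     # requests allowed per window

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def needs_captcha(ip, now):
    """Record one request; return True if the IP should be challenged."""
    q = _hits[ip]
    q.append(now)
    while q and q[0] <= now - WINDOW:   # drop requests outside the window
        q.popleft()
    return len(q) > LIMIT
```

When this returns True the page serves a verification code instead of data; a failed or ignored challenge then escalates to the hour-long IP block described above.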

Scenario 4 (Amazon): static results page, frequency limit, blacklist, verification code

Attack: Python plus the tesseract library, trained to recognize the verification codes; or middleware based on Tor or the paid Crawlera service (breadth-first traversal of IPs)

Defense: load the front end asynchronously with JS and use dynamically encrypted tokens

Scenario 5 (AliExpress): dynamic results page, frequency limit, blacklist, verification code

Attack: Python + Selenium, using the Chrome engine to load the dynamic results page; a node + hex + IE-engine crawling client is recommended instead. Java programs can refer to "Cracking the simple Java browser component jxbrowser"

Defense: see second-order crawlers

First-order crawlers are a purely technical game; the real interactive game starts below

Second-order crawlers (advanced chapter)

Scenario 6 (PC Tmall search page): HTTPS, dynamic results page, frequency limit, no blacklist, verification code

Defense: personalization-led. First, encourage users to log in actively for a better experience. Then recommend promotional merchandise matching normal purchasing habits, such as 9.9-yuan shampoo, shower gel, or tea (Walch does this often), along with some high-quality featured goods. This not only distinguishes humans from machines, but also collects user access preferences for targeted big-data personalization, and even helps against DDoS; one mechanism serves three purposes.

Attack: collect order-brushing accounts and distribute the crawl across them as tasks

Scenario 7 (Business Advisor): HTTPS, React single-page application, verification code, LocalStorage, machine-learning middleware

Defense: Business Advisor is itself an official paid service for merchants. It has moved from HTTP to HTTPS, and has recently cracked down on scraping behavior by going straight to account warnings. The point is to raise the attacker's account-acquisition cost and thereby constrain the attacker.

Access in a single-page application follows a normal trajectory. For example, the requests:

1. Fetch user info

2. Data list 1

3. Data list 2

4. Data detail 1

For data-visualization applications, most of the analytics are produced by computation and do not change often (sometimes not at all). So the results are stored in LocalStorage, which not only saves network requests and speeds up the page (equivalent to a cache), but also differentiates user-behavior trajectories.

In detail: whether a crawler works from raw URL requests or from an embedded WebKit browser (e.g. jxbrowser), the crawler's objects are temporary, so the data is never persisted in LocalStorage. As a result, every visit to a data page produces the same request trajectory:

1. Fetch user info

2. Data list 1 (should actually come from LocalStorage)

3. Data list 2 (should actually come from LocalStorage)

4. Data detail 1

A normal user (who has already visited the repeated pages in a browser) produces:

1. Fetch user info

2. Data detail 1

...in short, no requests for anything already stored in LocalStorage
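The trajectory difference above is easy to check server-side. A minimal sketch, assuming hypothetical endpoint names: flag a session that keeps re-requesting list endpoints a real browser would have served from LocalStorage.

```python
# Endpoints whose responses a browser client caches in LocalStorage,
# so repeat fetches within a session are suspicious. Names are illustrative.
CACHEABLE = {"/api/list1", "/api/list2"}

def looks_like_crawler(requests_seen):
    """Flag a session whose trajectory re-fetches cacheable endpoints."""
    seen = set()
    for path in requests_seen:
        if path in CACHEABLE and path in seen:
            return True        # re-requested data a browser would cache
        seen.add(path)
    return False
```

A crawler with no persistent storage re-fetches the lists on every page view; a browser user fetches them once and then only requests details.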


The JS code for encrypting/decrypting (serializing) the stored data:


setItem: function(e, t) {
    return void 0 === t ? this.removeItem(e) : (localStorage.setItem(e, this._serialize(t)), t)
},
getItem: function(e) {
    return this.deserialize(localStorage.getItem(e))
},

In addition, a single-page application loads its data asynchronously. A page may host three categories A, B and C, with a verification-code dialog shown only when category A is needed, while categories B and C display normally. Crawler developers rarely account for this, because the verification code is not demanded on every request (after a refresh the visit proceeds as usual).

You can also analyze a user's daily request counts, access habits, and so on.

There are basically three ways to analyze user-behavior trajectories: nginx traffic middleware, interceptors at the web controller layer, and log collection (flume + hadoop + spark). The actual analysis can be based on naive Bayes or decision trees (depending on what the developers know).
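To make the naive Bayes idea concrete, here is a toy classifier over binary behavior features. The features and training counts below are invented for illustration; a real pipeline would estimate them from the flume/spark log output.

```python
import math

# Feature order: (no_localstorage_hits, constant_request_interval, night_traffic)
# Counts are made-up training data: n sessions per class, and how many of
# those sessions showed each feature.
TRAINING = {
    "crawler": {"n": 100, "feature_counts": [95, 90, 70]},
    "human":   {"n": 300, "feature_counts": [15, 30, 60]},
}

def classify(features):
    """Return the more likely label for a tuple of 0/1 features."""
    best, best_lp = None, -math.inf
    total = sum(c["n"] for c in TRAINING.values())
    for label, stats in TRAINING.items():
        lp = math.log(stats["n"] / total)             # class prior
        for f, cnt in zip(features, stats["feature_counts"]):
            p = (cnt + 1) / (stats["n"] + 2)          # Laplace smoothing
            lp += math.log(p if f else 1 - p)         # feature likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

A session with no LocalStorage cache hits, machine-regular timing and night-time traffic scores as a crawler; the opposite pattern scores as human.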

Blocking has been observed to happen not in real time but only the next day, so it is presumably the result of offline log computation.


Attack: a Chrome plugin (HTTPS traffic is available there). In addition, record visited page links in a database, because some links can be reused by just changing parameters such as a date or ID, and some links may be anchor points in the trajectory computation. PS: Business Advisor has already issued warnings about this approach; all such actions are the reader's own responsibility and have nothing to do with the author.


Third-order crawlers (counterattack chapter)

The reason attackers crawl an e-commerce platform's data comes down to one goal: reverse-engineering the platform's weighting calculations so that all kinds of derived metrics stay within a reasonable range (for order brushing and traffic brushing). Technically, this is always a game between the two sides; if someone with capital combines semi-manual labor with stacks of machines for brute-force scraping, it is hard to prevent. And as everyone knows, the payoff of such techniques in e-commerce is very high: machine and labor costs are a drop in the bucket, and if the business model piggybacks on some internal auxiliary channels, monetization is very fast, whether as a merchant or as a service provider.

So the final core point of anti-crawling is this: do not let the attacker know they have been identified as a crawler. The attacker will then crawl the data at leisure and happily start modeling. From the platform side, our ultimate goal is to protect our data and our model, and here is the key: make the data the attacker obtains unrepresentative, so that no feasible model can be built from it. Using traffic bucketing to locate the attacker, we discretize some of the raw data and mix in noise, steering the attacker's model in the wrong direction. In the end the attacker cannot tell which data is usable and which is noise.

At this point you may ask: if the system misjudges a normal user and shows them some outrageously wrong data, what then? The degree is actually easy to control: we only add noise to singular, dynamic dimensions such as rankings, transaction counts and hit counts, and leave price, shipping and product details untouched, so that even a user wrongly flagged by the program still has a normal experience.
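The selective poisoning just described can be sketched as follows. This is an illustration only: the field names, jitter range and flagging mechanism are assumptions, not the platform's real implementation.

```python
import random

# Only singular, dynamic fields get noise; price/shipping/details stay exact,
# so a misjudged human sees nothing obviously wrong.
NOISY_FIELDS = {"rank", "sales", "hits"}

def poison(item, flagged, jitter=0.2):
    """Return a copy of the item, perturbed only if the session is flagged."""
    if not flagged:
        return dict(item)
    out = {}
    for key, value in item.items():
        if key in NOISY_FIELDS:
            # multiply by a random factor in [1-jitter, 1+jitter]
            out[key] = round(value * random.uniform(1 - jitter, 1 + jitter))
        else:
            out[key] = value        # price, shipping, details untouched
    return out
```

The attacker's scraped rankings and hit counts are then quietly off by up to 20% in either direction, which is enough to push any weight model fitted on them in the wrong direction.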


Source: www.cnblogs.com/wxcclub/p/11184462.html