Python Crawler Practice - Getting to Know Anti-Crawling Mechanisms (1)

51zxw released a new course in February of this year. Now that I have finally resigned and have some free time, I figured I would learn web crawling; learning it can't hurt. Crawlers are often regarded as the simplest of things. They sit underneath data mining, data analysis, and machine learning, and they look far less lofty than big data or AI; any programmer can easily write a small crawler. However, digging out hidden data, dealing with ever-deeper anti-crawling mechanisms, designing a distributed crawler architecture, maintaining an effective pool of high-anonymity proxies, avoiding blocks and bans, cleaning and storing useful data, optimizing the crawling strategy, and combining all this with big data technology to get quality data more efficiently is not as simple as it seems. In an era where everything is data, small crawlers have to some extent become a source of usable information, and that is the reason crawlers exist.

To do a good job, one must first sharpen one's tools. The 51zxw instructor uses Sublime in the examples, but towards the end Package Control had suspended its service for some (censorship-related) reason (or maybe too many people were freeloading, no way to tell). I had already been using PyCharm when writing selenium scripts, so I simply kept using it. I configured the Anaconda interpreter in PyCharm, ran pip install urllib (in Python 3, urllib is part of the standard library, so no separate install is actually needed), and got to work.

 

The essence of a crawler is to do the work of a simulated browser. From simulating the browser's HTTP requests at the beginning, to sending WebSocket requests, to later simulating how the browser executes JS, it is really all the same thing.

 

Simple anti-crawling mechanisms

1. Request header verification: User-Agent, Cookie, Referer

Checking the request headers and the referring page is the first layer of anti-crawling protection. To get past it, set the client's HTTP request headers to imitate the User-Agent of different browsers, and add a Referer header to imitate the page the request came from.
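
As a minimal sketch of the header approach above, assuming the standard-library urllib.request: the User-Agent strings, the helper name build_request, and the example.com URLs are illustrative assumptions, not something from the course.

```python
import random
import urllib.request

# Example User-Agent strings (illustrative values, not from the original post).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
]


def build_request(url, referer, cookie=None):
    """Build a Request that carries browser-like headers (hypothetical helper)."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),  # pick a different browser each time
        "Referer": referer,                        # pretend we arrived from this page
    }
    if cookie:
        headers["Cookie"] = cookie                 # reuse a cookie copied from a real session
    return urllib.request.Request(url, headers=headers)
```

To actually fetch a page, pass the result to urllib.request.urlopen, for example urlopen(build_request("https://example.com/list?page=2", referer="https://example.com/list?page=1"), timeout=10).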

2. IP limits: high-anonymity IP proxies, a self-built IP pool (ADSL dial-up assigns a different IP on each dial), per-IP access frequency settings

The site checks whether a machine or a person is operating. An IP that hits the site many times at high frequency can be banned, even permanently.
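
A rough sketch of the two countermeasures, rotating proxies and slowing down, using only the standard library. The proxy addresses are placeholders from a documentation-only IP range, and the class names ProxyPool and Throttle are made up for illustration.

```python
import itertools
import time
import urllib.request

# Hypothetical proxy addresses (documentation-only IP range); replace with a real pool.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]


class ProxyPool:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_opener(self):
        """Return (proxy, opener); requests made with the opener go through that proxy."""
        proxy = next(self._cycle)
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)  # sleep off whatever is left of the interval
        self._last = time.monotonic()
```

In a crawl loop, call throttle.wait() before each request and then opener.open(url, timeout=10) with the opener returned by pool.next_opener(). A real pool would also need to drop proxies that stop responding, which this sketch leaves out.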

3. Login and CAPTCHA restrictions

CAPTCHA types include selecting text, marking points, and dragging an image, sometimes combined with semantic recognition. Possible ways around them: reusing cookies, OCR with pytesseract, simulating the operation with selenium, typing the code in by hand? ... Finding people online to solve them?
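
For the simplest case, a plain text CAPTCHA, the OCR route might look like the sketch below. It assumes Pillow, pytesseract, and the Tesseract binary are installed; the threshold value, the 4-character length, and both function names are assumptions for illustration. Slider and click-based CAPTCHAs need a different approach entirely.

```python
import re


def clean_captcha_text(raw, length=4):
    """Keep only letters and digits from OCR output; return None if the length is off."""
    text = re.sub(r"[^0-9A-Za-z]", "", raw)
    return text if len(text) == length else None


def read_captcha(path, threshold=140):
    """OCR a simple text CAPTCHA image; returns None when the result looks wrong."""
    # Third-party imports are kept local: this needs Pillow, pytesseract,
    # and the Tesseract binary installed on the system.
    from PIL import Image
    import pytesseract

    image = Image.open(path).convert("L")                        # grayscale
    image = image.point(lambda p: 255 if p > threshold else 0)   # binarize to drop faint noise
    raw = pytesseract.image_to_string(image, config="--psm 7")   # treat the image as one text line
    return clean_captcha_text(raw)
```

When read_captcha returns None, the usual fallback is to request a fresh CAPTCHA and try again, or to fall back to manual input.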

4. Non-static pages: JS obfuscation and encryption, Ajax asynchronous loading

Ugh, JS anti-crawling. There are usually two approaches: rewrite the JS logic in Python, or use a third-party library such as execjs to run the JS. Ugh, it feels like being back in school: text replacement, JS obfuscation, encryption algorithms, there is too much to know. The good news is that it can be learned bit by bit, and a headless browser + selenium works wonders, bingo >__<
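
To make the two approaches concrete, here is a small sketch. The JS function encode is a made-up stand-in for whatever obfuscation routine a site might use, and the Python function names are hypothetical. The execjs variant assumes the PyExecJS package plus a JS runtime such as Node.js.

```python
# Hypothetical obfuscation routine, standing in for what a site's JS might do.
JS_SOURCE = """
function encode(s, key) {
    var out = [];
    for (var i = 0; i < s.length; i++) {
        out.push((s.charCodeAt(i) ^ key).toString(16));
    }
    return out.join("-");
}
"""


def encode_py(s, key):
    """Approach 1: the same logic rewritten in Python (ASCII input assumed)."""
    return "-".join(format(ord(ch) ^ key, "x") for ch in s)


def encode_js(s, key):
    """Approach 2: run the original JS as-is through execjs."""
    import execjs  # third-party (PyExecJS), imported lazily so approach 1 works without it

    return execjs.compile(JS_SOURCE).call("encode", s, key)


print(encode_py("ab", 1))  # 60-63
```

Rewriting in Python avoids the extra runtime but has to be redone whenever the site changes its script; running the JS directly is less work when the code is heavily obfuscated.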

Origin www.cnblogs.com/liuchaodada/p/12037637.html