Verification code
The love-hate relationship between verification codes and crawlers
Anti-crawling mechanism: verification code (CAPTCHA). The crawler has to recognize the characters in the verification code image so that it can simulate the login operation.
Ways to recognize the verification code:
- Manual recognition by eye
- tesserocr library/tesseract library
- Third-party automatic recognition services
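As a rough illustration of the second option, the sketch below binarizes a CAPTCHA image and passes it to tesserocr. It assumes the third-party packages tesserocr and Pillow, plus the Tesseract engine itself, are installed. The threshold of 127 and the helper names `clean_ocr_text` and `recognize_captcha` are hypothetical, and noisy or distorted verification codes will usually need more preprocessing than this.

```python
import re


def clean_ocr_text(raw):
    """Strip whitespace and any non-alphanumeric noise from OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def recognize_captcha(image_path, threshold=127):
    """Return the text Tesseract reads from a simple CAPTCHA image."""
    # Imported lazily so clean_ocr_text works without these packages.
    import tesserocr        # third-party, needs the Tesseract engine
    from PIL import Image   # third-party (Pillow)

    image = Image.open(image_path).convert("L")  # convert to grayscale
    # Binarize: pixels brighter than the threshold become white, the rest black.
    image = image.point(lambda p: 255 if p > threshold else 0)
    return clean_ocr_text(tesserocr.image_to_text(image))
```

A call such as `recognize_captcha("captcha.png")` (with a hypothetical file name) would return the cleaned string, which can then be submitted with the login form.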
Cookie
HTTP/HTTPS protocol characteristic: stateless
Cookie: lets the server record the client's state across requests.
Cookie source: set by the server in its response to the simulated login POST request.
If a cookie is generated during a request made through a session object, the cookie is automatically stored in the session object and carried along on later requests.
Sending a POST request with a session in Python:
session.post(url=url, headers=headers, data=data)
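To put that call in context, here is a minimal sketch of a login flow with the requests library (third-party, assumed installed). The URLs, the form field names, and the `login_and_fetch` helper are placeholders invented for illustration, not a real site's API.

```python
def login_and_fetch(session, login_url, page_url, data, headers=None):
    """Log in with a POST, then fetch a page that requires the login cookie."""
    # Any cookie set by the login response is stored on the session object.
    login_response = session.post(url=login_url, headers=headers, data=data)
    login_response.raise_for_status()
    # The same session sends the stored cookie automatically.
    page_response = session.get(url=page_url, headers=headers)
    page_response.raise_for_status()
    return page_response.text


def main():
    import requests  # third-party: pip install requests

    headers = {"User-Agent": "Mozilla/5.0"}
    data = {"username": "alice", "password": "secret"}  # hypothetical fields
    with requests.Session() as session:
        html = login_and_fetch(
            session,
            "https://example.com/login",    # placeholder URL
            "https://example.com/profile",  # placeholder URL
            data,
            headers,
        )
        print(session.cookies.get_dict())  # cookies collected so far
        print(len(html))
```

If the second request were sent with a plain `requests.get` instead of the session, the login cookie would not be attached, and a stateless server would treat the request as coming from a client that never logged in.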
For more on cookies, see: Cookie, Session, AJAX, JSON
Proxy
Proxy: a way to get around the IP-blocking anti-crawling mechanism
Purpose:
- Bypass access restrictions placed on your own IP
- Hide your real IP
See also: crawler IP proxy pool code record
Types of proxy IP:
- http
- https
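With the requests library, these two types correspond to the keys of the `proxies` dictionary: the `http` entry is used for http:// URLs and the `https` entry for https:// URLs. The sketch below assumes requests is installed. The helper names are hypothetical, and any address passed in (for example the documentation-range IP 203.0.113.10) is a placeholder rather than a working proxy.

```python
def build_proxies(host, port):
    """Build a requests-style proxies mapping for one proxy server."""
    proxy_url = f"http://{host}:{port}"
    # Key = scheme of the target URL; value = proxy to route it through.
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, host, port, timeout=10):
    """Fetch a URL through the given proxy and return the response body."""
    import requests  # third-party: pip install requests

    response = requests.get(
        url, proxies=build_proxies(host, port), timeout=timeout
    )
    response.raise_for_status()
    return response.text
```

If the scheme of the target URL has no matching key in the dictionary, requests sends that request directly, without the proxy, so it is worth supplying both keys.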
Anonymity levels of proxy IP:
- Transparent: the server knows that the request uses a proxy, and also knows the real IP behind the request
- Anonymous: the server knows that a proxy is used, but does not know the real IP
- High anonymity: the server does not know that a proxy is used, let alone the real IP
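From the server's side, these levels are commonly told apart by the forwarding headers a proxy adds. The sketch below is a heuristic under that assumption: it treats `Via` and `X-Forwarded-For` as the only signals, which real proxies and servers do not always follow, and the function name `classify_proxy_anonymity` is hypothetical.

```python
def classify_proxy_anonymity(headers, real_ip):
    """Guess a proxy's anonymity level from the headers the server received."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    forwarded_for = normalized.get("x-forwarded-for", "")
    reveals_proxy = "via" in normalized or bool(forwarded_for)

    forwarded_ips = [ip.strip() for ip in forwarded_for.split(",")]
    if real_ip in forwarded_ips:
        return "transparent"     # proxy visible, real IP leaked
    if reveals_proxy:
        return "anonymous"       # proxy visible, real IP hidden
    return "high anonymity"      # no sign of a proxy at all
```

One way to use this is to send a request through a candidate proxy to an endpoint you control that echoes back the headers it received, then pass those headers and your own IP to the function.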