Python crawler series: from getting started to the grave (3)

Verification code (CAPTCHA)

The love-hate relationship between verification codes and crawlers

Anti-crawling mechanism: verification code. To simulate a login, the crawler has to recognize the characters in the verification code image.

Ways to recognize a verification code:

  • Manual recognition by eye
  • The tesserocr library / Tesseract OCR engine
  • Third-party automatic recognition services
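
The OCR and manual options above can share one small entry point. The sketch below is only an illustration: `recognize_captcha`, `clean_ocr_text`, and the `ocr_func` parameter are hypothetical names, and the default path assumes the tesserocr package and the Tesseract engine are installed. Verification codes with noise or distortion usually need image preprocessing or a third-party service before OCR gives usable text.

```python
import re


def clean_ocr_text(raw):
    """Strip everything except ASCII letters and digits from OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def recognize_captcha(image_path, ocr_func=None):
    """Return the cleaned text read from a verification code image file.

    ocr_func is a hypothetical hook: any callable that takes a file path
    and returns the raw recognized text.
    """
    if ocr_func is None:
        # Imported lazily: tesserocr is third-party and needs Tesseract.
        import tesserocr
        ocr_func = tesserocr.file_to_text
    return clean_ocr_text(ocr_func(image_path))
```

For the manual route, the same function can be called with `ocr_func=lambda path: input("Code shown in " + path + ": ")`, so a person types the code after looking at the saved image.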

Cookie

A feature of the HTTP/HTTPS protocols: they are stateless.
Cookie: lets the server record state related to the client.
Cookie source: set by the server in its response to the simulated login POST request.
If a cookie is set during a request made through a session object, it is stored in the session automatically and sent with later requests.
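
To make the mechanism concrete, here is a minimal standard-library sketch of what a session does with cookies: it reads the `Set-Cookie` values from a response and sends the name=value pairs back in the `Cookie` header of the next request. The function name and the cookie values are made up for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build the Cookie request header from Set-Cookie response values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses "name=value; Path=/; HttpOnly" and so on
    # Only name=value pairs go back to the server; attributes are dropped.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Example values a server might send after a successful login (made up).
login_response_cookies = ["sessionid=abc123; Path=/; HttpOnly", "theme=dark"]
next_request_cookie = cookie_header_from_set_cookie(login_response_cookies)
```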

Sending a POST request with a session in Python

session = requests.Session()  # requires `import requests`
response = session.post(url=url, headers=headers, data=data)
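
A slightly fuller sketch of the same idea: log in with a POST, then reuse the same session for a page that needs the login cookie. `login_and_fetch` and `profile_url` are hypothetical names, and `session` is assumed to be a `requests.Session()` (or any object with compatible `post` and `get` methods).

```python
def login_and_fetch(session, login_url, profile_url, headers, data):
    """POST the login form, then GET a page that requires the login cookie."""
    login_response = session.post(url=login_url, headers=headers, data=data)
    # Cookies set by the login response are now stored on `session`
    # and are sent automatically with the next request.
    page_response = session.get(url=profile_url, headers=headers)
    return login_response, page_response
```

The important point is that both requests go through the same session object; calling `requests.post` and then `requests.get` directly would not carry the cookie over.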

For more on cookies, see: Cookie, Session, AJAX, JSON

Proxy

Proxy: a way around the anti-crawling mechanism of IP blocking.

What a proxy does:

  • Gets around access restrictions placed on your own IP
  • Hides your real IP

See also: crawler IP proxy pool code record

Types of proxy IP:

  • http
  • https
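
With the requests library, the proxy is passed as a mapping whose keys are the scheme of the target URL (`http` or `https`). The sketch below picks one address from a small pool and builds that mapping; `build_proxies` is a hypothetical helper, the addresses are placeholders from a documentation-only IP range, and it assumes the proxy itself is reached over plain HTTP.

```python
import random


def build_proxies(proxy_pool, rng=random):
    """Pick one proxy address and return it in the mapping requests expects."""
    address = rng.choice(proxy_pool)  # e.g. "203.0.113.10:8080"
    proxy_url = f"http://{address}"
    # Keys are the scheme of the target URL; values are the proxy to use.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder addresses, not real proxies.
pool = ["203.0.113.10:8080", "203.0.113.11:3128"]
proxies = build_proxies(pool)
```

The result can then be used as `requests.get(url, headers=headers, proxies=proxies)`.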

Anonymity of proxy IP:

  • Transparent: the server knows the request came through a proxy and also knows the real IP behind it
  • Anonymous: the server knows a proxy is used but does not know the real IP
  • High anonymity (elite): the server does not know a proxy is used, let alone the real IP
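
One rough way to see the difference between the three levels is to look at the headers the target server receives. The heuristic below is an assumption, not a standard: it treats `X-Forwarded-For` and `Via` as the headers a proxy may add, and `classify_anonymity` is a hypothetical name. Real proxies can use other headers, so treat this as a sketch only.

```python
def classify_anonymity(received_headers, real_ip):
    """Guess a proxy's anonymity level from the headers the server saw."""
    headers = {name.lower(): value for name, value in received_headers.items()}
    forwarded = headers.get("x-forwarded-for", "")
    forwarded_ips = [ip.strip() for ip in forwarded.split(",") if ip.strip()]
    if real_ip in forwarded_ips:
        # The proxy passed the client's own address along.
        return "transparent"
    if "via" in headers or "x-forwarded-for" in headers:
        # The proxy announces itself but does not leak the real IP.
        return "anonymous"
    # Nothing in the headers hints at a proxy.
    return "high anonymity"
```

To try it, send a request through the proxy to an endpoint that echoes request headers and pass the echoed headers to this function together with your own public IP.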

Origin blog.csdn.net/qq_36171287/article/details/113803087