Python - Introduction to Web Crawlers

What is a web crawler?
  1. It simulates a browser sending a request to a web server
  2. It parses the response data returned by the server and saves the data
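A minimal sketch of these two steps, assuming the requests module is installed and using httpbin.org as a stand-in target:

    import requests

    # Step 1: simulate a browser sending a request to a web server
    response = requests.get("https://httpbin.org/html")

    # Step 2: take the response data returned by the server and save it
    # (a real crawler would parse out the useful fields first)
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(response.text)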
What data can crawlers obtain?
  1. In principle, any data that can be viewed through a browser can be crawled
  2. Conversely, a crawler can only obtain data that a browser could normally access
Application scenarios of crawlers?
  1. Data analysis (such as movie box office, stock information, commodity sales, etc.)
  2. Public opinion monitoring (such as Weibo, forums, etc.)
  3. Search engines, and boosting view/play counts (e.g., for various self-media accounts)
  4. Ticket grabbing and vote stuffing (sending requests directly to ticket-purchase and voting interfaces)
  5. Network security testing (e.g., SMS bombing, where many websites are made to send verification codes to one phone number)
Why do websites implement anti-crawling measures?
  1. Prevent valuable data from being obtained maliciously
  2. Block junk traffic, reduce server pressure and operating costs
How do crawlers and anti-crawling measures contend?
  1. Some data requires login to obtain
  2. CAPTCHAs are used to distinguish real users from crawlers
  3. The number of requests per unit time from the same IP address is monitored
  4. Requests must carry specific data (e.g., tokens or signatures)
  5. Response data is encrypted and requires a specific algorithm to decrypt
Learning Content
  • How to crawl web page data? (how to send a request to the server and get the page source)
    • The requests module (sends requests to the server and fetches data)
    • Simulating a real browser's request state (headers such as User-Agent)
    • Setting proxy IPs, so that a single IP does not send requests so fast that the server flags it as a crawler (see the sketch below)
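    A minimal sketch of sending a request with a browser-like header and a proxy; the User-Agent string and proxy address are placeholder assumptions:

        import requests

        # Pretend to be a real browser via the User-Agent header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

        # Route traffic through a proxy (hypothetical local endpoint)
        proxies = {"http": "http://127.0.0.1:7890",
                   "https": "http://127.0.0.1:7890"}

        response = requests.get("https://httpbin.org/get",
                                headers=headers, proxies=proxies, timeout=5)
        print(response.status_code)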
  • How to extract the key data? (how to pull the useful data out of the page source)
    • Regular expressions
    • XPath expressions
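    A small sketch of both extraction styles on a toy HTML string, assuming the lxml package for XPath:

        import re
        from lxml import etree

        html = "<html><body><h1>Title</h1><p>Price: 42</p></body></html>"

        # Regular expression: capture the number after "Price:"
        price = re.search(r"Price:\s*(\d+)", html).group(1)

        # XPath: select the <h1> text via lxml
        title = etree.HTML(html).xpath("//h1/text()")[0]

        print(title, price)  # Title 42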
  • How to store the extracted data?
    • MongoDB database
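    Saving crawled items with pymongo might look like this, assuming a MongoDB instance on localhost and illustrative database/collection names:

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")
        collection = client["crawler_db"]["movies"]

        # Each crawled record is stored as one document
        collection.insert_one({"title": "Example Movie", "box_office": 1000000})
        print(collection.count_documents({}))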
  • Crawling data at scale with the Scrapy framework
    • Integrates request sending, data parsing, and data saving
    • Combining Scrapy with MongoDB to store the data
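    A minimal Scrapy spider sketch; the spider name, start URL (a public practice site), and XPath expressions are illustrative:

        import scrapy

        class QuotesSpider(scrapy.Spider):
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # Yielded dicts flow through item pipelines, where a
                # MongoDB pipeline can persist them
                for quote in response.xpath("//div[@class='quote']"):
                    yield {
                        "text": quote.xpath("./span[@class='text']/text()").get(),
                        "author": quote.xpath(".//small/text()").get(),
                    }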
  • Scrapy-Redis distributed crawling (multiple machines share one crawl task)
    • Redis database
    • Scrapy-Redis framework
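    Enabling scrapy-redis is mostly a matter of configuration; a sketch of the relevant settings.py lines, assuming Redis runs on localhost:

        # Share one request queue and duplicate filter via Redis so
        # several machines can work on the same crawl
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

        # Keep the queue in Redis between runs (pause/resume)
        SCHEDULER_PERSIST = True

        # Location of the shared Redis instance
        REDIS_URL = "redis://localhost:6379"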
  • Simulated login
    • Login principles: cookies and sessions
    • Selenium browser automation
    • Crawling data that requires login to obtain
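    A sketch of cookie/session-based login with requests.Session; the URL and form field names are placeholder assumptions:

        import requests

        session = requests.Session()

        # Posting credentials makes the server set a session cookie,
        # which the Session object stores automatically
        session.post("https://example.com/login",
                     data={"username": "user", "password": "pass"})

        # Later requests carry that cookie, so login-protected pages work
        page = session.get("https://example.com/profile")
        print(page.status_code)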
  • CAPTCHA recognition
    • OpenCV computer vision
    • OCR text recognition engines
    • The EasyDL machine learning cloud service
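    A sketch of basic OCR on a CAPTCHA image, assuming the Tesseract engine plus the pillow and pytesseract packages are installed and a "captcha.png" file exists; real CAPTCHAs usually need OpenCV preprocessing first:

        from PIL import Image
        import pytesseract

        # Convert to grayscale before recognition
        image = Image.open("captcha.png").convert("L")
        print(pytesseract.image_to_string(image).strip())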
  • Anti-crawling and counter-anti-crawling
    • Breaking text-encryption anti-crawling measures
    • Common encryption algorithms: MD5, SHA256, AES, RSA
    • JS reverse engineering: reconstructing a website's encryption process
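    Hash digests from Python's standard library, for reference; websites often sign request parameters this way, and JS reverse engineering means reproducing such a signing step (the parameter string below is made up):

        import hashlib

        data = "page=1&keyword=python".encode("utf-8")

        print(hashlib.md5(data).hexdigest())     # 32-character hex digest
        print(hashlib.sha256(data).hexdigest())  # 64-character hex digest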
  • Extended content
    • Data analysis with the pandas module
    • Common interview questions
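    A small taste of analysing crawled records with pandas; the column names and values below are made up for illustration:

        import pandas as pd

        df = pd.DataFrame([
            {"title": "Movie A", "box_office": 120},
            {"title": "Movie B", "box_office": 300},
        ])

        # Sort by box office and compute the total
        print(df.sort_values("box_office", ascending=False))
        print("total:", df["box_office"].sum())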
Are crawlers legal?
  • The technology itself is not prohibited by law
  • The data a crawler can obtain is public data that is normally accessible through a browser
  • A crawler simply obtains large amounts of data faster

Which situations carry legal risk?

  • Using crawlers to attack website servers
  • Profiting financially from the acquired data
  • Engaging in improper commercial competition through crawlers
  • Crawling data that violates others' copyright or privacy

How to avoid legal risks?

  • Do not flood the target website's server with massive numbers of requests
  • Do not publicly disseminate or sell the crawled data
  • Do not crawl data involving intellectual property rights and user privacy
