Python - Introduction to Web Crawlers
2023-08-11 17:20:48
What is a web crawler?
- Simulate a browser sending a request to a web server
- Parse the response data returned by the server and save the data
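These two steps can be sketched with Python's standard library alone; the User-Agent string and the `<h2>` extraction pattern below are illustrative placeholders, not something prescribed by the course:

```python
import re
import urllib.request

def fetch(url):
    # Step 1: simulate a browser by sending an HTTP request that
    # carries a browser-like User-Agent (many servers reject the
    # default Python-urllib agent string).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_titles(html):
    # Step 2: parse the response and keep only the data of interest,
    # here the text of every <h2> heading.
    return re.findall(r"<h2>(.*?)</h2>", html)

# The parsing step works on any HTML string, no network needed:
sample = "<h2>First</h2><p>body</p><h2>Second</h2>"
print(parse_titles(sample))  # ['First', 'Second']
```

In a real crawler the result of `parse_titles` would then be saved to a file or database, which is the "save the data" half of the definition.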
What data can crawlers obtain?
- In principle, all data that can be obtained through the browser can be crawled
- A crawler can only obtain data that a browser could normally access
Application scenarios of crawlers?
- Data analysis (such as movie box office, stock information, commodity sales, etc.)
- Public opinion monitoring (such as Weibo, forums, etc.)
- Search engines
- Boosting view and play counts (e.g., for various self-media accounts)
- Ticket grabbing and vote brushing (sending requests directly to ticketing and voting endpoints)
- Network security abuse (SMS bombing: triggering verification-code SMS from many websites at once)
Why do websites deploy anti-crawler measures?
- Prevent valuable data from being obtained maliciously
- Block junk traffic, reduce server pressure and operating costs
How do crawlers and anti-crawler measures contend?
- Some data requires login to obtain
- CAPTCHAs to distinguish real users from crawlers
- Monitoring the number of requests per unit time from the same IP address
- Requiring requests to carry specific data
- Encrypting response data so that a specific algorithm is needed to decrypt it
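On the crawler side, the simplest counters to rate monitoring and header checks are browser-like request headers and a throttled request rate. A minimal sketch, in which the header values and delay bounds are illustrative assumptions:

```python
import random
import time

# Fields a real browser sends; sites that require requests to
# "carry specific data" typically inspect headers like these.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}

def polite_delay(low=1.0, high=3.0):
    # Sleep a randomized interval between requests so the number of
    # requests per unit time from this IP stays below what simple
    # rate-based detection flags as a crawler.
    pause = random.uniform(low, high)
    time.sleep(pause)
    return pause

# Usage sketch (the actual request call is omitted):
# for url in urls:
#     html = fetch(url, headers=BROWSER_HEADERS)
#     polite_delay()
```

Randomizing the pause, rather than sleeping a fixed interval, also avoids the perfectly regular timing pattern that no human browsing session produces.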
Learning Content
- How to crawl web page data? (how to send requests to the server and get the page source)
  - The requests module (sends requests to the server, receives the data)
  - Simulating a real browser's request state
  - Setting proxy IPs (so a single IP does not send requests so fast that the server flags it as a crawler)
- How to extract the key data? (how to pull useful data out of the page source)
  - Regular expressions
  - XPath expressions
- How to store the extracted data?
- The Scrapy framework for crawling data at scale
  - Integrates request sending, data parsing, and data saving
  - Scrapy with MongoDB for data storage
  - Scrapy-Redis distributed crawling (multiple machines share one crawl task)
    - The Redis database
    - The Scrapy-Redis framework
- Simulated login
  - Login principles: cookies and sessions
  - Selenium browser automation
  - Crawling data that requires login
- CAPTCHA recognition
  - OpenCV computer vision
  - OCR text-recognition engines
  - The EasyDL machine-learning cloud service
- Anti-crawling and counter-anti-crawling
  - Breaking text-encryption anti-crawling measures
  - Common encryption algorithms: MD5, SHA-256, AES, RSA
  - JS reverse engineering: reconstructing a site's encryption logic
- Extended topics
  - Data analysis: the pandas module
  - Frequently asked interview questions
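As a small taste of the encryption topics in the outline: MD5 and SHA-256 are both available in Python's standard hashlib, and "JS reverse engineering" usually means reproducing a site's parameter-signing routine in Python. The signing scheme and salt below are invented for illustration only:

```python
import hashlib

def sign(params, salt="demo_salt"):
    # Hypothetical signing scheme: join the sorted key=value pairs,
    # append a salt, and take the MD5 hex digest. This is the kind of
    # routine a site might implement in JavaScript and a crawler must
    # mirror exactly to produce accepted requests.
    payload = "&".join(f"{k}={v}" for k, v in sorted(params.items())) + salt
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

token = sign({"page": 1, "q": "movies"})
print(token)  # a 32-character hex digest, stable for the same input
```

Because the digest is deterministic, recovering the exact concatenation order and salt from the site's JavaScript is what makes the reproduced signature match.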
Are crawlers legal?
- The technology itself is not prohibited by law
- The data a crawler can obtain is public data that is normally accessible through a browser
- A crawler just obtains large amounts of data faster
Which situations carry legal risk?
- Using crawlers to attack website servers
- Profiting financially from the acquired data
- Engaging in unfair commercial competition via crawlers
- Crawling data that infringes others' copyright or privacy
How to avoid legal risks?
- Do not bombard the target website's server with massive numbers of requests
- Do not publicly disseminate or sell the crawled data
- Do not crawl data involving intellectual property rights and user privacy
Source: blog.csdn.net/violetta521/article/details/132199039