Writing original content takes effort. Plagiarism and reprinting of this article are prohibited. It summarizes years of hands-on crawler development experience. Infringement will be pursued!
1. Crawler task
Task background: Crawl Python internship listings from the Shixiseng website
Task objective: Use the Beautiful Soup parsing library to parse the web pages and extract the required data
2. Analysis
First, open the Shixiseng homepage: https://www.shixiseng.com
The goal is to crawl the Python internship listings under Shixiseng's IT/Internet category, as shown in the figure below:
Scroll to the bottom of the page, click through to the next page, and observe how the URL changes, as shown in the figure below:
As the URL above shows, only the page=? parameter changes between pages.
Then click into a listing to view its detail page: https://www.shixiseng.com/intern/inn_1k3vhcwwguaf?pcm=pc_SearchList
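Since only the page parameter changes between listing pages, the listing URLs can be generated rather than clicked through. The following is a minimal sketch using only the standard library. The LISTING_URL value is an assumption for illustration (the article shows the real URL only in a figure), so copy the actual address from your browser; the helper names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed search URL for illustration only; copy the real one from the browser.
LISTING_URL = "https://www.shixiseng.com/interns?keyword=Python&page=1"


def listing_page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep every query parameter except any existing `page`, then append the new one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def listing_page_urls(base_url, first, last):
    """Return the listing URLs for pages first..last (inclusive)."""
    return [listing_page_url(base_url, p) for p in range(first, last + 1)]


if __name__ == "__main__":
    for url in listing_page_urls(LISTING_URL, 1, 3):
        print(url)
```

Rewriting the query string with urllib.parse, instead of concatenating strings, leaves any other parameters in the copied URL untouched.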
Then inspect the corresponding elements in the page source, as shown in the figure below:
As the figure above shows, the data in these fields is not readable. Presumably the site considers this data valuable and does not want it to be scraped easily, so it has enabled an anti-crawling measure.
If you run the crawler directly, this data cannot be extracted, as shown in the figure below:
Anti-crawling technique: In fact, this is an encoding problem. We only need to pick an encoding, such as "utf-8", to represent the data, and then replace the obfuscated parts based on that encoded form, as shown in the figure below:
As the figure above shows, the relevant data is now presented in its "utf-8" encoded form.
Create the function hack_number() and use it to decode the numbers:
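The article's own hack_number() appears only in a figure, so what follows is a sketch of the idea described above, not the author's code: encode the scraped text as UTF-8 bytes, replace each obfuscated byte sequence with the digit it stands for, and decode the result. The mapping table here is a placeholder assumption (private-use code points U+E000 to U+E009 standing for 0 to 9). The real byte sequences have to be read from the page source and may change whenever the site updates its obfuscation.

```python
# Placeholder mapping, NOT the site's real values: assume U+E000..U+E009
# stand for the digits 0..9. Replace this table with the byte sequences
# actually observed in the page source.
PLACEHOLDER_DIGIT_MAP = {
    chr(0xE000 + d).encode("utf-8"): str(d).encode("utf-8") for d in range(10)
}


def hack_number(text, digit_map=PLACEHOLDER_DIGIT_MAP):
    """Decode obfuscated digits in `text` by replacing their UTF-8 byte sequences."""
    raw = text.encode("utf-8")
    for obfuscated, digit in digit_map.items():
        raw = raw.replace(obfuscated, digit)
    return raw.decode("utf-8")


if __name__ == "__main__":
    scrambled = "\ue001\ue005\ue000-\ue002\ue000\ue000/day"
    print(hack_number(scrambled))
```

Working on the byte level is safe here because a complete UTF-8 sequence for one character can only match at a character boundary, so ordinary text around the obfuscated digits is left unchanged.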
Then look at the URL of the listing you clicked on:
Here we crawl breadth first and then depth: first across the listing pages, then into each detail page.
After writing the code, check the results of the run:
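The breadth-then-depth order can be sketched roughly as follows with Beautiful Soup. This is an illustration under stated assumptions, not the downloadable source: the class names job-title and job-salary are hypothetical, the "/intern/" path filter is inferred from the detail URL shown earlier, and fetch is any callable you supply that maps a URL to HTML text (for example a thin wrapper around an HTTP client).

```python
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup

BASE_URL = "https://www.shixiseng.com"


def extract_detail_urls(listing_html, base_url=BASE_URL):
    """Breadth step: collect detail-page links from one listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        href = urljoin(base_url, a["href"])
        # Assumption: detail pages live under /intern/, as in the example URL.
        if urlsplit(href).path.startswith("/intern/") and href not in urls:
            urls.append(href)
    return urls


def parse_detail(detail_html):
    """Depth step: extract fields from one detail page (class names are hypothetical)."""
    soup = BeautifulSoup(detail_html, "html.parser")
    title = soup.find(class_="job-title")
    salary = soup.find(class_="job-salary")
    return {
        "title": title.get_text(strip=True) if title else None,
        "salary": salary.get_text(strip=True) if salary else None,
    }


def crawl(listing_urls, fetch):
    """Visit every listing page first, then every detail page found."""
    detail_urls = []
    for url in listing_urls:  # breadth: all listing pages
        for detail_url in extract_detail_urls(fetch(url)):
            if detail_url not in detail_urls:
                detail_urls.append(detail_url)
    # depth: one record per detail page, tagged with its URL
    return [dict(parse_detail(fetch(u)), url=u) for u in detail_urls]
```

Passing fetch in as a parameter keeps the parsing logic testable against saved HTML. On the real site, the obfuscated numeric fields would still need to go through the decoding step (hack_number() in the article) before they are usable.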
3. Source code download
CSDN source code download link: Download source code
Writing original content takes effort. If you found this useful, please give it a thumbs up. Thank you!
4. Author Info
Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!
Focus on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, and more. Looking forward to your follow; let us grow and code together!