Web Crawler | Getting Started with Shixiseng (Encoding-Based Anti-Crawling)

Original work takes effort. This article summarizes years of hands-on crawler development experience. Plagiarism and reprinting are prohibited, and infringement will be pursued!

1. Crawler task

Task background: Crawl Python internship data from the Shixiseng (Internship Monk) website
Task objective: Use the Beautiful Soup parsing library to parse the pages and extract the required data

2. Analysis

First, open the Shixiseng homepage, https://www.shixiseng.com, and find the Python internship listings in the IT/Internet category that we want to crawl, as shown in the figure below:
[Figure: Python internship listings on Shixiseng]
Scroll to the bottom of the page, click through to the next page, and observe how the URL changes, as shown in the figure below:
[Figure: URL of the next results page]
As the URL above shows, only the page=? parameter changes from one page to the next.
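
The pagination rule can be sketched in code. This is only an illustration, not the author's code: the listing path `/interns` and the `keyword` parameter are assumptions, because the real URL appears only in the screenshot. The one detail taken from the article is that `page` is the only value that changes.

```python
from urllib.parse import urlencode

# Assumption: the listing path and the `keyword` parameter are hypothetical;
# only the changing `page` parameter is taken from the article.
BASE_URL = "https://www.shixiseng.com/interns"


def build_page_urls(keyword, first_page, last_page):
    """Return one listing URL per page number, varying only `page`."""
    urls = []
    for page in range(first_page, last_page + 1):
        query = urlencode({"keyword": keyword, "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_page_urls("Python", 1, 3):
        print(url)
```

Generating the URLs up front keeps the page range explicit and makes it easy to check the pattern against what the browser shows before sending any requests.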

Next, click a listing to open its detail page: https://www.shixiseng.com/intern/inn_1k3vhcwwguaf?pcm=pc_SearchList

Then inspect the corresponding element in the page source, as shown in the figure below:
[Figure: page source of the detail page]
As the figure above shows, the data in this field is not readable in the source. The site presumably considers this data valuable and does not want it to be scraped easily, so it has enabled anti-crawling measures.

If you run the crawler directly, this data cannot be extracted, as shown in the figure below:
[Figure: crawl results with the protected fields unreadable]
Anti-crawling technique: This is really an encoding problem. We only need to pick an encoding, such as "utf-8", to represent the data, and then replace the affected parts of the data using their encoded form, as shown in the figure below:
[Figure: the data displayed in "utf-8" encoded form]
As the figure above shows, the relevant data is now presented in "utf-8" encoded form.
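
To see why viewing the data as "utf-8" bytes helps, here is a small sketch. It assumes the hidden digits arrive as unusual Unicode characters; the code points used below are made up for illustration and are not the site's real ones. Encoding the text turns each such character into a fixed byte sequence that can be searched for and replaced.

```python
def show_utf8(text):
    """Return the UTF-8 bytes of `text` so hidden characters become visible."""
    return text.encode("utf-8")


if __name__ == "__main__":
    # Assumption: "\ue001" and "\ue002" are placeholder characters standing in
    # for obfuscated digits; the real characters are not given in the article.
    sample = "salary: \ue001\ue002\ue002/day"
    # Ordinary ASCII stays readable; each placeholder shows up as a
    # three-byte escape sequence in the printed bytes.
    print(show_utf8(sample))
```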

Create the function hack_number(), which is used to decode the numbers:
[Figure: source code of hack_number()]
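
The screenshot of hack_number() is not reproduced here, so the following is only a sketch of what such a function could look like, assuming it replaces encoded byte sequences with plain digits. The mapping table is entirely hypothetical: the real byte sequences have to be read from the page itself and may change over time.

```python
# Hypothetical mapping from UTF-8 byte sequences to ASCII digits.
# These code points are placeholders for illustration, NOT the site's real ones.
HYPOTHETICAL_DIGIT_MAP = {
    chr(0xE000 + digit).encode("utf-8"): str(digit).encode("ascii")
    for digit in range(10)
}


def hack_number(text, digit_map=HYPOTHETICAL_DIGIT_MAP):
    """Replace obfuscated digit characters in `text` with real digits."""
    data = text.encode("utf-8")
    for encoded, digit in digit_map.items():
        data = data.replace(encoded, digit)
    return data.decode("utf-8")


if __name__ == "__main__":
    # "\ue001\ue005\ue000" stands in for an obfuscated "150".
    print(hack_number("\ue001\ue005\ue000/day"))
```

Passing the table as a parameter means it can be swapped out when the site changes its encoding, without touching the function itself.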
Then look at the URL of the listing you clicked:
[Figure: URL of the clicked listing]
Here we crawl the data breadth first and then depth: the listing pages come first, followed by the detail pages they link to.
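
"Breadth first, then depth" can be read as: collect the detail links from every listing page first, then visit each detail page. A minimal sketch of that control flow follows. The three callables are hypothetical stand-ins: in a real crawler `fetch` would issue the HTTP request and the two parse functions would use Beautiful Soup.

```python
def crawl(page_urls, fetch, parse_links, parse_detail):
    """Collect detail links from all listing pages, then parse each detail page."""
    # Breadth: walk every listing page and gather detail-page links.
    detail_urls = []
    for page_url in page_urls:
        for link in parse_links(fetch(page_url)):
            if link not in detail_urls:  # skip duplicates, keep order
                detail_urls.append(link)

    # Depth: follow each collected link and extract one record per page.
    return [parse_detail(fetch(url)) for url in detail_urls]


if __name__ == "__main__":
    # Tiny in-memory "site" so the sketch runs without network access.
    fake_site = {
        "list?page=1": "a b",
        "list?page=2": "b c",
        "a": "job a",
        "b": "job b",
        "c": "job c",
    }
    results = crawl(
        ["list?page=1", "list?page=2"],
        fetch=fake_site.get,
        parse_links=str.split,
        parse_detail=str.upper,
    )
    print(results)
```

Injecting `fetch` and the parsers keeps the traversal order separate from the site-specific parsing, so the flow can be checked offline before it is pointed at the real pages.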

After writing the relevant code, check the results of the run:
[Figure: output of the crawler run]

3. Source code download

CSDN source code download link: Download source code

Original work takes effort. If you found this useful, please give it a thumbs up. Thank you!

4. Author Info

Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!

Focus areas: algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, and more. Follow along, and let's grow and code together!



Origin blog.csdn.net/qq_44000141/article/details/121480796