A 3-Minute Python Crawler Crash Course

This is an introductory guide to Python, aimed at readers who have no programming experience and are learning Python from scratch. Whether you are learning out of interest, to broaden your thinking, for work, or to change careers, you can use this article as a reference.

In this era of information explosion, a search for "Python introduction" returns tens of thousands of results. Many beginners end up jumping from one article to the next, reading a great deal but never getting past the beginner threshold.

As long as you are willing to learn and put in the effort, you will gain something!

First, you need a clear understanding of crawlers. To borrow a saying from Chairman Mao:

Despise it strategically

  • "All websites can be crawled": the content of the Internet is written by people, and people write it lazily (page one will not be "a" and the next page "8"), so there are bound to be patterns. Those patterns are what make crawling possible, and you could say there is no website in the world that cannot be crawled.

  • "The framework stays the same": websites differ, but the principles are similar. Most crawlers follow the same process (send a request, get the page, parse the page, download the content, store the content); only the tools differ.
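The process described above can be sketched with nothing but the standard library. This is a minimal illustration, not a complete crawler: the function names (`fetch`, `parse`, `store`, `crawl`) are made up for this example, and the "content" is simply the list of links found on a page.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Send the request and get the page back as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    """Parse the page and pull out the parts we care about (here: links)."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(items, path):
    """Store the extracted content on disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


def crawl(url, path):
    """Run the whole pipeline for one page."""
    store(parse(fetch(url)), path)
```

In a real project each step is usually handled by a dedicated tool, but the shape of the pipeline stays the same.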

Take it seriously tactically

  • Persevere, and guard against arrogance and impatience: if you are a beginner, don’t get complacent and assume you can crawl everything just because you have crawled a little content. Crawling is a relatively simple technology, but there is no limit to how deep you can go (search engines, for example)! Constant practice and hard study are the only way! (Why does this read like a primary-school essay?)

Then, you need an ambitious goal to keep you motivated (without a practical project, it is really hard to stay motivated):

I want to crawl all of Douban!

I want to crawl the entire xx community!

I want to know the contact information of all kinds of girls *&#%$#

1. Basic knowledge of web pages:

Basic HTML (knowing href and other material at the level of a university computer course is enough)

Understand how a website sends and receives data (POST, GET)

A little JavaScript, to understand dynamic web pages (of course, if you already know it, even better)
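As a rough illustration of the difference between GET and POST, the sketch below builds one request of each kind without sending it. The URL and parameter names are hypothetical and exist only for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field names, for illustration only.
BASE_URL = "https://example.com/search"
params = {"q": "python", "page": 2}

# GET: the parameters travel in the URL's query string.
get_request = Request(BASE_URL + "?" + urlencode(params))

# POST: the parameters travel in the request body, as bytes.
post_request = Request(BASE_URL, data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either object to `urllib.request.urlopen` would actually send it; here they are only constructed so the difference is visible.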

2. Some parsing languages, in preparation for parsing web page content later

3. Then, you need some efficient tools to help

(As before, just get an overview here; you will become familiar with them when you use them in specific projects)

3.1 F12 Developer Tools:

  • View the source code: quickly locate elements

  • Analyze XPath: Google Chrome is recommended here; you can right-click directly in the source view to see it

3.2 Packet capture tool:

  • HttpFox is recommended. It is a Firefox plug-in that I find easier to use than the built-in F12 tools in Chrome and Firefox, and it makes it easy to inspect the requests a website receives and the responses it sends

3.3 XPath Checker (Firefox plug-in):

A very good XPath testing tool, but it has a few pitfalls. I have fallen into all of them myself, so here is a warning:

1. XPath Checker generates absolute paths. When the page contains dynamically generated icons (commonly list page-turn buttons and the like), an absolute path that shifts around is likely to cause errors, so use it only as a reference when analyzing a page.

2. Remember to remove the "x:" in the XPath box in the figure below. This syntax is currently incompatible with some modules (such as Scrapy), so delete it to avoid errors.
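Both pitfalls can be shown with a small sketch. It uses the limited XPath support in the standard library's `xml.etree.ElementTree` on a hypothetical, well-formed page fragment; the page, the paths, and the helper names are assumptions made for this example, and real HTML usually needs a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment, for illustration only.
PAGE = """
<html><body>
  <div class="banner">ad</div>
  <div class="pager">
    <a class="prev" href="/list?page=1">prev</a>
    <a class="next" href="/list?page=3">next</a>
  </div>
</body></html>
"""

# The same page after a dynamically generated block appears at the top.
PAGE_CHANGED = PAGE.replace("<body>", "<body><div class='notice'>new</div>")

POSITIONAL = "./body/div[2]/a[2]"     # position-based, like a generated absolute path
BY_ATTRIBUTE = ".//a[@class='next']"  # anchored on an attribute instead of a position


def next_href(page, path):
    """Return the href of the node matched by path, or None if nothing matches."""
    node = ET.fromstring(page.strip()).find(path)
    return None if node is None else node.get("href")


def strip_x_prefix(xpath):
    """Drop the "x:" prefix in front of each element name in a generated path."""
    return re.sub(r"(?<![\w-])x:", "", xpath)


print(next_href(PAGE, POSITIONAL), next_href(PAGE_CHANGED, POSITIONAL))
print(next_href(PAGE, BY_ATTRIBUTE), next_href(PAGE_CHANGED, BY_ATTRIBUTE))
```

The position-based path stops matching as soon as one extra block is inserted, while the attribute-based path keeps working. That is why a generated path is better treated as a starting point than as the final selector.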

3.4 Regular expression testing tool: an online regular expression tester, good for extra practice and analysis! There are also many ready-made regular expressions that you can use directly or learn from.
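Once a pattern works in an online tester, it can be moved into Python's `re` module more or less unchanged. The snippet below is a small sketch; the HTML string and the pattern are made up for illustration.

```python
import re

# Hypothetical snippet of page source, for illustration only.
HTML = '<a href="/movie/1">First</a> <a href="/movie/2">Second</a>'

# Non-greedy groups capture the link target and the link text.
LINK_PATTERN = re.compile(r'<a href="(.*?)">(.*?)</a>')

matches = LINK_PATTERN.findall(HTML)
for href, text in matches:
    print(href, text)
```

Regular expressions are handy for small, regular fragments like this one; for whole pages, a structural approach such as XPath tends to be less fragile.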




Origin blog.csdn.net/BlueSocks152/article/details/131211880