An introductory Python crawler tutorial that even beginners can follow

1 Introduction

This article explains what a crawler is and the basic principles behind it. It is simple enough that even a novice can follow along. If you still don't understand, just lie down and forget it. (Hahaha)

Crawling, in essence, means using programs to obtain data on the Internet that is valuable to us.

2 Basic principles

2.1 How the browser works

(1) Parse data: When the server sends its response to the browser, the browser does not hand the raw data straight to us. Because the data is written in computer languages, the browser first has to translate it into content we can understand;
(2) Extract data: From the data we have obtained, we pick out the data that is useful to us;
(3) Store data: We save the selected useful data in a file or database.

2.2 How crawling works


(1) Obtain data: The crawler sends a request to the server based on the URL we provide, and the server returns the data;
(2) Parse data: The crawler parses the data returned by the server into a format we can understand;
(3) Extract data: The crawler then extracts the data we need;
(4) Store data: The crawler saves this useful data for later use and analysis.
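
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a complete crawler: the "obtain" step is replaced by a fixed HTML string (the hypothetical SAMPLE_HTML) so that it runs without a network connection, and the names LinkExtractor, extract_links and store_links are assumptions made up for this example.

```python
import csv
from html.parser import HTMLParser

# Step 1 (obtain data): a real crawler would request this from a server.
# Here a fixed string stands in for the response so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Book list</h1>
  <a href="/book/1">Python Basics</a>
  <a href="/book/2">Web Crawling 101</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Steps 2 and 3: parse the HTML and extract (href, text) for each link."""

    def __init__(self):
        super().__init__()
        self.links = []    # extracted (href, text) pairs
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep text that appears inside an <a> tag with an href.
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def extract_links(html):
    """Run the parser over an HTML string and return the extracted links."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def store_links(links, path):
    """Step 4: store the extracted data in a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["href", "text"])
        writer.writerows(links)


links = extract_links(SAMPLE_HTML)
```

Calling store_links(links, "links.csv") would then write the two extracted links to a CSV file; the file name is just a placeholder.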

3 Crawler steps

Insert image description here

Below we explain the steps in the figure above in detail.

3.1 requests.get()

3.1.1 Install requests library

(1) On a Mac, open the Terminal application, type pip3 install requests, and press Enter;
(2) On Windows, open the Command Prompt (cmd), type pip install requests, and press Enter.

Tip: Other libraries are installed in the same way: pip install followed by the module name.

3.1.2 Function of requests library

The requests library can help us download web page source code, text, images, and even audio. In fact, "downloading" essentially means sending a request to the server and getting a response.

3.1.3 Usage of requests library

import requests
res = requests.get('URL')

requests.get calls the get() method of the requests library, which sends a request to the server. The parameter in parentheses is the URL where the data you need is located. The server responds to the request, and we assign the result of that response to the variable res.
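
As a slightly fuller sketch of the call above, the snippet below wraps requests.get() in a small helper. The helper name fetch and the 10-second timeout are assumptions chosen for illustration; the tutorial itself only uses the bare call. It assumes the requests library has been installed as described in 3.1.1.

```python
import requests


def fetch(url, timeout=10):
    """Send a GET request to url and return the Response object."""
    # timeout stops the program from waiting forever if the server
    # never answers; 10 seconds is an arbitrary choice for this sketch.
    res = requests.get(url, timeout=timeout)
    return res
```

A call such as res = fetch('URL'), with a real address in place of 'URL', then gives the same kind of Response object as the one-line version.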


3.2 Common properties of Response objects


3.2.1 response.status_code

Holds the status code of the response. Print it to check whether the request was successful; for example, 200 means the request succeeded.
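
To make the status code easier to read, a small helper can turn the number into a short description. The sketch below uses only Python's standard http module; the function name describe_status and the wording of the summaries are assumptions made for this example.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown status"
    if 200 <= status_code < 300:
        outcome = "request succeeded"
    elif 300 <= status_code < 400:
        outcome = "redirected"
    elif 400 <= status_code < 500:
        outcome = "client error"
    elif 500 <= status_code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{status_code} {phrase}: {outcome}"


print(describe_status(200))  # 200 OK: request succeeded
print(describe_status(404))  # 404 Not Found: client error
```

With a real response, the call would be describe_status(res.status_code).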


3.2.2 response.content

Returns the content of the Response object as binary data, which is suitable for downloading images, audio, and video.
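
Binary content is normally written to a file opened in binary mode. The following is a minimal sketch of that idea; the helper name save_binary is an assumption, and fake_content is a stand-in for res.content so that the example runs without a network request.

```python
def save_binary(content, path):
    """Write raw bytes (such as res.content) to a file and return the byte count."""
    # "wb" opens the file for writing in binary mode, so the bytes are
    # stored exactly as received, with no text decoding.
    with open(path, "wb") as f:
        f.write(content)
    return len(content)


# Stand-in for res.content: the 8-byte signature that starts every PNG file.
fake_content = b"\x89PNG\r\n\x1a\n"
```

With a real response for an image URL, the call would look like save_binary(res.content, 'photo.png'), where the file name is a placeholder.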

3.2.3 response.text

Returns the content of the Response object as a string, which is suitable for downloading text and web page source code.

3.2.4 response.encoding

Lets us set the encoding of the Response object. (Consider using res.encoding only when you run into garbled text.)
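
Garbled text usually means that the bytes were decoded with the wrong encoding. The sketch below reproduces the effect with plain Python bytes, without any network request; the choice of GBK and ISO-8859-1 is only an assumed example of a mismatch.

```python
# Bytes as a server might send them: Chinese text encoded as GBK.
raw = "你好".encode("gbk")

# Decoding with the wrong encoding produces garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding with the right encoding restores the original text.
fixed = raw.decode("gbk")

# With requests, the equivalent fix is to set the encoding before
# reading the text, for example:
#     res.encoding = "gbk"
#     print(res.text)
```

Here garbled holds four unreadable Latin characters, while fixed holds the original two Chinese characters.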


4 Crawler ethics

4.1 Robots Protocol

The Robots protocol is a widely recognized code of ethics for Internet crawling. Its full name is the "Robots exclusion protocol". It tells crawlers which pages may be crawled and which may not.

4.2 Viewing the protocol

(1) To view a website's robots protocol, just add /robots.txt after the website's domain name. For example, Taobao's robots protocol is at http://www.taobao.com/robots.txt;

(2) The English words that appear most frequently in the protocol are Allow and Disallow. Allow means that access is allowed, and Disallow means that access is prohibited.
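
Python's standard library can read these Allow and Disallow rules for us. The sketch below feeds a small, made-up robots.txt (SAMPLE_ROBOTS_TXT) to urllib.robotparser so that it runs offline; the rules, the crawler name "my-crawler" and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed = parser.can_fetch("my-crawler", "https://example.com/public/page.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

print(allowed)  # True
print(blocked)  # False
```

To check a real site instead, call parser.set_url() with the address of its robots.txt and then parser.read() before using can_fetch().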




Origin blog.csdn.net/maiya_yaya/article/details/131780130