1 Introduction
This article explains what a web crawler is and the basic principles behind crawling. It is simple enough that even a complete beginner can follow along; if anything is unclear, just relax and come back to it later.
Crawling, in essence, means using a program to fetch data of value to us from the Internet.
2 Basic principles
2.1 How the browser works
(1) Parse data: when the server responds, the browser does not hand the raw data straight to us. The data is written in computer languages, so the browser first translates it into content we can understand;
(2) Extract data: we can then select the data relevant to us from what was obtained;
(3) Store data: the selected useful data is saved to a file or database.
2.2 How crawling works
(1) Obtain data: the crawler sends a request to the server based on the URL we provide, and the server returns the data;
(2) Parse data: the crawler parses the data returned by the server into a format we can work with;
(3) Extract data: the crawler then extracts the data we need;
(4) Store data: the crawler saves the useful data for later use and analysis.
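The four steps above can be sketched in a few lines of Python using only the standard library. This is a minimal offline sketch: the HTML string and the output filename are made up for illustration, and in real crawling step 1 would be a live request to a URL.

```python
from html.parser import HTMLParser

# Step 1 (obtain): normally this would be res = requests.get(url);
# here we use a canned HTML string so the sketch runs offline.
html = '<html><body><h1>Hello</h1><p>price: 9.99</p></body></html>'

# Steps 2-3 (parse + extract): collect the text inside the tags.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(html)
print(parser.chunks)  # ['Hello', 'price: 9.99']

# Step 4 (store): save the extracted data to a file.
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(parser.chunks))
```

In practice, step 2 is usually done with a dedicated parsing library such as BeautifulSoup rather than a hand-rolled HTMLParser subclass.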
3 Crawler steps
Below we walk through each of these steps in detail.
3.1 requests.get()
3.1.1 Install requests library
(1) On a Mac, open the Terminal app, type pip3 install requests, and press Enter;
(2) On Windows, open the command prompt (cmd) and enter pip install requests.
Tip: installing other libraries later works the same way: pip install <module name>.
3.1.2 Function of requests library
The requests library can download web page source code, text, images, and even audio for us. In fact, "downloading" essentially means sending a request to the server and receiving its response.
3.1.3 Usage of requests library
res = requests.get('URL')
requests.get() calls the get() method of the requests library, which sends a request to the server. The argument in parentheses is the URL where the data you need lives. The server responds to the request, and we assign the response to the variable res.
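To see what requests.get() actually sends, you can build and prepare the same request by hand without touching the network. The URL and query parameter below are made-up examples:

```python
import requests

# requests.get('https://example.com/search', params={'q': 'python'})
# would send exactly this prepared request over the network.
req = requests.Request('GET', 'https://example.com/search',
                       params={'q': 'python'})
prepared = req.prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://example.com/search?q=python
```

Preparing a request this way is also a handy debugging trick when a crawl misbehaves: you can inspect the exact URL and headers before anything is sent.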
3.2 Common properties of Response objects
3.2.1 response.status_code
Print the response's status code to check whether the request succeeded.
3.2.2 response.content
Returns the content of the Response object as binary data; suitable for downloading images, audio, and video.
3.2.3 response.text
Returns the content of the Response object as a string; suitable for downloading text and web page source code.
3.2.4 response.encoding
Lets us set the encoding used to decode the Response object. (Only set res.encoding when you run into garbled text.)
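To illustrate these attributes without making a network call, the sketch below constructs a Response object by hand; in real code, res would come from requests.get(), and the body text 'hello' here is made up:

```python
import requests

# Build a Response manually (no network) to demonstrate its attributes.
res = requests.models.Response()
res.status_code = 200
res._content = 'hello'.encode('utf-8')  # _content backs res.content
res.encoding = 'utf-8'

print(res.status_code)  # 200 -> the request "succeeded"
print(res.content)      # b'hello' (binary; good for images/audio/video)
print(res.text)         # 'hello' (decoded string; good for HTML/text)
```

Note how res.text is simply res.content decoded with res.encoding, which is why setting the encoding fixes garbled text.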
4 Crawler ethics
4.1 Robots Protocol
The Robots protocol is a widely recognized ethical code for Internet crawling. Its full name is the "Robots exclusion protocol". It tells crawlers which pages may be crawled and which may not.
4.2 Viewing a site's robots protocol
(1) Simply append /robots.txt to the site's domain. For example, Taobao's robots protocol is at http://www.taobao.com/robots.txt ;
(2) The two keywords that appear most often in the protocol are Allow and Disallow: Allow means access is permitted, Disallow means access is forbidden.
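Python's standard library can check these rules for you via urllib.robotparser. The rules and URLs below are a made-up example; for a live site you would call rp.set_url('https://.../robots.txt') followed by rp.read():

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse a small robots.txt inline so the sketch runs offline.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /',
])

print(rp.can_fetch('*', 'https://example.com/'))           # True
print(rp.can_fetch('*', 'https://example.com/private/x'))  # False
```

A polite crawler calls can_fetch() before requesting each URL and skips any page the protocol disallows.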