[2022] Python3 Crawler Tutorial - What is a crawler?

In short, crawlers help us quickly extract and save information from websites.

We can compare the Internet to a large web, and crawlers (i.e., web crawlers) to spiders crawling on it. If each node of the web is a page, then when the spider crawls to a node it is, in effect, visiting that page and extracting its information. The connections between nodes are the links between pages: after passing through one node, the spider can follow its connections to reach the next, obtaining further pages from each page it visits. In this way the spider can reach every node of the web, and the site's data can be collected.

1. What are crawlers good for?

The analogy above should give you a preliminary idea of what a crawler does, but before we learn anything we generally want to know what it is for, right?

In fact, crawlers are useful in many ways.

  • For example, if we want to study the recent headlines of major websites, we can use a crawler to collect their popular news, then analyze the titles and content to find the trending keywords.
  • For example, suppose we want to organize and analyze information about weather, finance, sports, companies, and so on, but the content is scattered across many websites. We can use a crawler to fetch the data from those sites, organize it into the shape we want, and save it for analysis.
  • For example, we often see attractive pictures online, such as landscapes or food, or find articles we want to keep on our computer. Saving them one by one with right-click or copy and paste is obviously tedious; a crawler can download these pictures or resources quickly, saving a great deal of time and effort.

In addition, many other applications, such as ticket-scalping bots, automated course registration, and search-engine ranking, are inseparable from crawlers. Crawlers are extremely useful; arguably everyone should know at least a little about them.

Learning crawlers also helps us learn Python along the way. For crawling, my first recommendation is the Python language. If you are not yet familiar with Python, that is fine: crawling is a great way to get started with Python, and you can pick up the language while learning to crawl.

Moreover, crawler techniques intersect with almost every other field: front-end and back-end web development, databases, data analysis, artificial intelligence, operations, security, and more are all related to crawling. Learning crawlers well therefore lays a stepping stone toward these fields and makes it easier to move into them later. Python crawling is one of the good entry points for learning about computing.

2. The crawler process

Simply put, a crawler is an automated program that fetches web pages and extracts and saves information, as outlined below.

(1) Get the webpage

The first job of a crawler is to fetch the web page, that is, the page's source code. The source code contains the page's useful information, so once we have it we can extract what we want.

When we browse with a browser, the browser performs this process for us: it sends requests to the server, and the response bodies it receives are the pages' source code, which it then parses and renders. A crawler does essentially the same thing as the browser, obtaining the source code and parsing out the content, except that we use Python instead of a browser.

As just described, the key step is to construct a request, send it to the server, then receive and parse the response. How do we implement this process in Python?

Python provides many libraries for this, such as urllib and requests, which implement HTTP request operations. Both the request and the response can be represented by the data structures these libraries provide. After receiving the response, we only need to parse its body, which is the page's source code, and the page-fetching step is done.
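As a minimal sketch of the request-building step, here is the standard library's urllib in action. The URL and User-Agent string are placeholders chosen for illustration; actually sending the request requires network access, so that part is left as a comment.

```python
# Build an HTTP request object with the standard library's urllib.
from urllib.request import Request

# A User-Agent header makes the request look more like an ordinary
# browser visit.
req = Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (tutorial crawler)"},
)

print(req.full_url)                   # the URL the request targets
print(req.get_header("User-agent"))   # the header we attached

# Actually sending it (requires network access):
#   from urllib.request import urlopen
#   html = urlopen(req).read().decode("utf-8")
```

The requests library offers a more convenient interface (e.g. `requests.get(url)`), but urllib needs no installation, which keeps the sketch self-contained.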

(2) Extract information

After obtaining the page source, the next step is to analyze it and extract the data we want. The most basic method is regular-expression extraction, which is versatile but can be complicated and error-prone when constructing the expressions.
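A small regex-extraction sketch: pull link targets and anchor text out of an HTML fragment. The HTML here is made up for illustration.

```python
# Extract (href, text) pairs from HTML using a regular expression.
import re

html = '<a href="/news/1">Hot story</a> <a href="/news/2">Another story</a>'

# Non-greedy groups capture the href value and the anchor text.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)  # [('/news/1', 'Hot story'), ('/news/2', 'Another story')]
```

This works for simple, regular markup, but as the text notes, regexes become fragile on messier HTML, which is where the node-based libraries below come in.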

In addition, because web pages have a regular structure, there are libraries that extract page information based on node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries we can efficiently and quickly extract page information such as node attributes and text values.
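The libraries named above offer convenient selectors; as a dependency-free sketch of the same node-based idea, the standard library's `html.parser` can walk the document's tags and collect, say, the text of every `<h2>` node. The HTML fragment is invented for illustration.

```python
# Collect the text of every <h2> node using the stdlib HTML parser.
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

parser = TitleCollector()
parser.feed("<h2>First headline</h2><p>body</p><h2>Second headline</h2>")
print(parser.titles)  # ['First headline', 'Second headline']
```

Beautiful Soup or lxml would reduce this to one selector call, at the cost of an extra dependency.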

Extracting information is a crucial part of crawling: it turns messy data into something organized and clear, ready for later processing and analysis.

(3) Save the data

After extracting the information, we generally save it somewhere for later use. There are many forms this can take: a simple TXT or JSON file, a database such as MySQL or MongoDB, or even a remote server, for example via SFTP.
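A sketch of the simplest saving option mentioned above, writing extracted records to a JSON file. The records and file name are invented; a temp directory is used so the example runs anywhere.

```python
# Save extracted records as a JSON file, then read them back.
import json
import os
import tempfile

records = [{"title": "Hot story", "url": "/news/1"}]

path = os.path.join(tempfile.gettempdir(), "crawl_output.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
    json.dump(records, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)  # True: the data round-trips cleanly
```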

(4) Automation

By automation we mean that a crawler can perform these operations in place of a human. We could of course extract the information by hand, but when the volume is very large, or we want a lot of data quickly, we have to rely on a program. A crawler is the automated program that does this fetching for us, handling exceptions and retrying on errors along the way so that crawling keeps running efficiently.
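The error-retry idea above can be sketched as a small wrapper. `fetch()` here is a stand-in that fails twice before succeeding, imitating a flaky network call; the retry counts and delay are arbitrary example values.

```python
# Retry a flaky fetch a bounded number of times before giving up.
import time

calls = {"count": 0}

def fetch():
    # Stand-in for a real network request: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>page source</html>"

def fetch_with_retry(max_retries=5, delay=0.01):
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            time.sleep(delay)  # brief back-off before the next attempt

result = fetch_with_retry()
print(result)  # <html>page source</html>
```

A production crawler would add logging and exponential back-off, but the control flow is the same.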

3. What kinds of data can be crawled?

On the web we can see all kinds of information. The most common is the regular web page, which corresponds to HTML code, and HTML source code is what is most often crawled.

Some pages return not HTML but a JSON string (most API endpoints take this form). Data in this format is easy to transmit and parse; it can be crawled just the same, and extracting from it is even more convenient.
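To illustrate why JSON responses are convenient: extraction becomes a single `json.loads` call plus dictionary indexing, with no HTML parsing at all. The response body below is invented for the example.

```python
# Parse a JSON API response and pull out a field directly.
import json

response_body = '{"articles": [{"title": "Hot story", "views": 1024}]}'

data = json.loads(response_body)
print(data["articles"][0]["title"])  # Hot story
```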

We also encounter all kinds of binary data, such as images, video, and audio. A crawler can fetch this binary data and save it under an appropriate file name.
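A sketch of saving binary content: derive a file name from the URL and write the raw bytes. The URL is a placeholder, and `image_bytes` is fake data standing in for a downloaded response body.

```python
# Derive a file name from a URL and write binary content to disk.
import os
import tempfile
from urllib.parse import urlparse

url = "https://example.com/images/cat.png"
image_bytes = b"\x89PNG\r\n\x1a\n..."  # placeholder, not a real image

# The last path component of the URL serves as the file name.
filename = os.path.basename(urlparse(url).path)
path = os.path.join(tempfile.gettempdir(), filename)

with open(path, "wb") as f:  # "wb": write raw bytes, no text decoding
    f.write(image_bytes)

print(filename)  # cat.png
```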

There are also files with various extensions, such as CSS, JavaScript, and configuration files. These are ordinary files too: as long as they can be accessed in a browser, they can be crawled.

All of the above content has its own URL and is served over HTTP or HTTPS; any data of this kind can be fetched by a crawler.
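Since everything crawlable here is addressed by an HTTP or HTTPS URL, a crawler can check the scheme before fetching. A small sketch with the standard library's `urlparse`, using made-up URLs:

```python
# Filter URLs by scheme: only http/https targets are fetchable here.
from urllib.parse import urlparse

for url in ["https://example.com/page", "ftp://example.com/file"]:
    scheme = urlparse(url).scheme
    status = "crawlable" if scheme in ("http", "https") else "skip"
    print(url, "->", status)
```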

4. Summary

That is the end of this section. We now have a basic understanding of crawlers; next, let's step into the world of crawler learning together!

For more great content, follow my WeChat official accounts「进击的 Coder」and「崔庆才丨静觅」.


Origin juejin.im/post/7080517915937603621