Getting Started with Python Crawlers: A Basic Understanding of Crawlers

A follower sent me a private message asking for a more basic version, so I copied this over from my previous platform. Take a rough look for now, and I will publish the rest gradually.

1. What is a crawler

A crawler, that is, a web crawler, can be thought of as a spider crawling across the Internet. The Internet is like a giant web, and the crawler is a spider moving around on it. Whenever it encounters a resource, it grabs it. What it grabs is up to you to control.

For example, suppose it is crawling a web page. On that page it finds a road, which is really a hyperlink to another page, and it can follow that link to the next page and collect more data. In this way, the whole connected web is within the spider's reach, and traversing it is only a matter of time.
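The "roads" the spider follows are just the `href` values of `<a>` tags. A minimal sketch of collecting them with the standard library's `html.parser` (the page content here is a made-up example):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag -- the 'roads' a spider follows."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A tiny page standing in for one node of the web (hypothetical HTML).
page = '<html><body><a href="/page2.html">next</a> <a href="http://example.com/">elsewhere</a></body></html>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # the URLs a crawler would visit next
```

A real crawler would push each collected link onto a queue, fetch it, and repeat, keeping a set of already-visited URLs so it does not loop.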

2. The process of browsing the web

While browsing the web, users see many beautiful pages, such as http://zhimaruanjian.com/. The process works like this: the browser looks up the server host through a DNS server and sends it a request; after the server parses the request, it returns HTML, JS, CSS and other files to the user's browser; the browser parses these files, and the user sees the rendered page, pictures and all.
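The request half of that cycle can be sketched with the standard library's `urllib.request`. This only builds the request object and inspects it; calling `urlopen` on it would perform the actual DNS lookup and fetch (the URL is the example from the text, and the User-Agent header is just a common browser-mimicking value):

```python
import urllib.request

# The page from the example above.
url = "http://zhimaruanjian.com/"

# Build the request a browser would send; the header mimics a real browser.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

print(req.full_url)                  # where the request will go
print(req.has_header("User-agent"))  # True: the header we attached
# urllib.request.urlopen(req) would resolve the host via DNS, send the
# request, and return a response whose body is the HTML the browser parses.
```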

Therefore, the web pages users see are essentially HTML code, and that is what a crawler fetches. By analyzing and filtering this HTML, the crawler extracts resources such as images and text.
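That filtering step can also be sketched with `html.parser`: walk the HTML and keep only the image URLs and visible text (the snippet of HTML below is a made-up example):

```python
from html.parser import HTMLParser

class ResourceFilter(HTMLParser):
    """Pulls image URLs and visible text out of raw HTML."""
    def __init__(self):
        super().__init__()
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Keep the src of every <img> tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        # Keep non-empty stretches of text between tags.
        if data.strip():
            self.text.append(data.strip())

html_doc = '<p>A caption</p><img src="/pics/cat.jpg" alt="cat">'
f = ResourceFilter()
f.feed(html_doc)
print(f.images)  # image URLs found in the page
print(f.text)    # visible text found in the page
```

In practice, libraries such as Beautiful Soup do this job with far less boilerplate, but the principle is the same: the crawler sees only HTML, and everything else is extracted from it.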

3. Meaning of URL

URL stands for Uniform Resource Locator, which is what we commonly call a web address. A URL is a concise representation of the location of a resource available on the Internet and of how to access it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating both where the file is and what the browser should do with it.

The format of a URL consists of three parts: ① the first part is the protocol (or service mode); ② the second part is the IP address or host name of the machine where the resource is stored (sometimes including the port number); ③ the third part is the specific address of the resource on that host, such as a directory and file name.
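The three parts above can be pulled apart with the standard library's `urllib.parse.urlparse` (the URL here is a hypothetical example built on the site mentioned earlier):

```python
from urllib.parse import urlparse

# Split an example URL into the three parts described above.
parts = urlparse("http://zhimaruanjian.com:80/docs/index.html")

print(parts.scheme)  # part 1, the protocol:      'http'
print(parts.netloc)  # part 2, the host and port: 'zhimaruanjian.com:80'
print(parts.path)    # part 3, the resource path: '/docs/index.html'
```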


When a crawler collects data, it must have a target URL from which to fetch it. The URL is therefore the basic starting point for a crawler, and an accurate understanding of what it means is very helpful when learning to write crawlers.

4. Configuration of the environment

To learn Python, configuring an environment is of course indispensable. At first I used Notepad++, but I found its completion hints too weak, so I switched to PyCharm on Windows and Eclipse for Python on Linux. There are several excellent IDEs; you can look up articles on recommended Python IDEs. A good development tool is a propellant of progress, and I hope you can find an IDE that suits you.
