Crawler basics for getting started with Python web crawling

1. What is a crawler

A crawler, or web crawler, can be thought of as a spider crawling across the Internet. The Internet is like a big web, and the crawler is a spider moving around on it. Whenever it encounters a resource, it grabs it. What it grabs is up to you to control.

For example, while crawling one web page, it finds a path, which is actually a hyperlink pointing to another page. Following that link, it can crawl to the next page and collect more data. In this way, the whole connected web is within the spider's reach, and crawling through it is no trouble at all.
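To make the "following hyperlinks" idea concrete, here is a minimal sketch of pulling link targets out of a page's HTML. It uses a simple regular expression, which is enough for illustration (a real crawler would use a proper HTML parser); the `page` string is an invented example:

```python
import re

def extract_links(html):
    """Find hyperlink targets (href attribute values) in a page's HTML."""
    return re.findall(r'href="([^"]+)"', html)

# A made-up page fragment with two links the spider could follow next.
page = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'
print(extract_links(page))  # prints ['http://example.com/a', 'http://example.com/b']
```

A crawler would fetch each extracted URL in turn, extract that page's links, and repeat, which is exactly the spider-on-a-web picture above.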

2. The process of browsing the web

While browsing the web, users see many attractive pages, such as  http://image.baidu.com/ , which shows a set of pictures and the Baidu search box. What actually happens is this: the user enters a URL; a DNS server resolves it to locate the server host; the browser sends a request to that server; the server processes the request and sends back HTML, JS, CSS and other files; and after the browser parses them, the user sees all the pictures.
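The request the browser sends in that exchange is just plain text. As a sketch (not a full HTTP client), the function below composes the GET request a browser would send for a given URL, so you can see what travels to the server before the HTML comes back:

```python
from urllib.parse import urlparse

def build_http_request(url):
    """Compose the plain-text HTTP GET request a browser would send for a URL."""
    parts = urlparse(url)
    path = parts.path or "/"          # an empty path means "the root page"
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {parts.netloc}\r\n"
            "Connection: close\r\n"
            "\r\n")                   # blank line ends the request headers

print(build_http_request("http://image.baidu.com/"))
```

Sending this text over a socket to the resolved host is what triggers the server's HTML/JS/CSS response described above.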

Therefore, the web pages users see are essentially built from HTML code, and that is what a crawler fetches. By parsing and filtering this HTML, the crawler extracts resources such as pictures and text.
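As a small illustration of "parsing and filtering HTML", the sketch below uses Python's standard-library `html.parser` to collect every image address from a page; the `html` snippet is an invented example:

```python
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.images.append(value)

# A made-up HTML fragment standing in for a fetched page.
html = '<html><body><img src="/logo.png"><p>text</p><img src="/pic.jpg"></body></html>'
parser = ImageExtractor()
parser.feed(html)
print(parser.images)  # prints ['/logo.png', '/pic.jpg']
```

The same pattern works for text or links: override the handler for the tags you care about and ignore the rest.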

3. Meaning of URL

URL stands for Uniform Resource Locator, which is what we commonly call a web address. A URL is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the file's location and what the browser should do with it.

The format of the URL consists of three parts:
① The first part is the protocol (or service mode).
② The second part is the IP address of the host where the resource is stored (sometimes including a port number).
③ The third part is the specific address of the resource on the host, such as a directory and file name.
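These three parts can be pulled apart with the standard-library `urllib.parse.urlparse`; the URL below is an invented example used only to label each part:

```python
from urllib.parse import urlparse

url = "http://www.example.com:8080/images/photo.jpg"
parts = urlparse(url)

print(parts.scheme)    # part ①: the protocol ("http")
print(parts.hostname)  # part ②: the host ("www.example.com")
print(parts.port)      # part ②: the optional port number (8080)
print(parts.path)      # part ③: the resource's address on the host ("/images/photo.jpg")
```

A crawler typically works with URLs in this parsed form, for example to stay on one host or to turn relative links into absolute ones.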

When a crawler collects data, it must have a target URL from which to obtain that data. The URL is therefore the crawler's basic starting point, and understanding it accurately is very helpful when learning to write crawlers.

4. Configuration of the environment

To learn Python, configuring a development environment is of course indispensable. At first I used Notepad++, but I found its completion hints too weak, so I switched to PyCharm on Windows and Eclipse for Python on Linux. There are several excellent IDEs; you can refer to this article on recommended Python IDEs. A good development tool is a propellant of progress, and I hope you can find an IDE that suits you.
