Python crawlers from entry to abandonment (1) - Understanding crawlers

What is a crawler

The explanation in Wikipedia is as follows:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less commonly used names are ant, automatic indexer, emulator, or worm.

In other words, a crawler is a program or script that can automatically grab information according to certain rules.

Put even more simply: it is a tool for automatically obtaining information from web pages.

What can crawlers do

" Everything Can Climb "

Text, audio, video, pictures, and so on.

How crawlers work

When we browse the web, the basic process is as follows:

The user enters a URL, the DNS server resolves it to locate the host server, and the browser sends a request to that server. The server processes the request and returns HTML, JS, CSS and other files to the user's browser, which then parses these files, assembles the information they contain, and displays the result to the user.
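To make this concrete, here is a minimal sketch of that request/response step using only Python's standard library; the URL is just a placeholder for whatever page you want to fetch.

```python
# Minimal sketch: send a request to a server and receive the HTML it returns.
# Standard library only; http://example.com is a placeholder URL.
from urllib.request import urlopen

response = urlopen("http://example.com")   # send the request to the server
html = response.read().decode("utf-8")     # the server answers with HTML (as bytes)

print(response.status)   # e.g. 200 when the request succeeded
print(html[:200])        # the beginning of the HTML a browser would parse and render
```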

PS: This information can be divided into useful information and useless information. If you want to crawl the comments for a movie on Zhihu, then the text of the comments is useful information to you, while the styling of the comment box and similar details are useless information.

OK, now that you understand the basic process of browsing web pages, you can draw a conclusion: the web pages that users see are essentially made up of HTML code.

So crawling web page information is really the process of finding the useful information in that HTML code and extracting it.

A crawler obtains the useful information we want (text, audio, video, pictures, and so on) by analyzing and filtering the content of the HTML code.
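As a small illustration of this analyze-and-filter step, here is a sketch that pulls just the page title out of a piece of HTML using the standard library's html.parser; in real crawlers a dedicated library such as BeautifulSoup is usually used instead.

```python
# Minimal sketch: treat the text inside <title> as the "useful information"
# and ignore everything else in the HTML. Standard library only.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

parser = TitleParser()
parser.feed("<html><head><title>Example Page</title></head>"
            "<body><p>comment box styles, scripts, ...</p></body></html>")
print(parser.titles)   # ['Example Page'] -- the useful part; the rest is discarded
```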

Meaning of URLs

A URL, that is, a Uniform Resource Locator, is what we usually call a web address. It is a concise representation of the location of a resource available on the Internet and the way to access it, and it serves as the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the file's location and what the browser should do with it.

The format of a URL consists of three parts:
① The first part is the protocol (or service mode).
② The second part is the IP address of the host where the resource is stored (sometimes also including the port number).
③ The third part is the specific address of the resource on the host, such as the directory and file name.

When a crawler fetches data, it must have a target URL. The URL is therefore the basic starting point for a crawler to obtain data, and accurately understanding what it means is very helpful when learning to write crawlers.
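To see the three parts described above in code, here is a small sketch using urllib.parse from the standard library; the URL itself is a made-up example.

```python
# Minimal sketch: split a URL into the three parts described above.
from urllib.parse import urlparse

url = "https://www.example.com:8080/articles/page1.html"
parts = urlparse(url)

print(parts.scheme)   # 'https'                 -> part 1: the protocol
print(parts.netloc)   # 'www.example.com:8080'  -> part 2: host (and port)
print(parts.path)     # '/articles/page1.html'  -> part 3: path to the resource on the host
```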


That's it for the basics of crawlers, and this is just the beginning~

From entry to abandonment



