Python crawler notes - an introduction to web crawlers

1. What is a web crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.

2. The meaning of a URL

URL stands for Uniform Resource Locator, which is what we usually call a web address. A URL is a concise representation of where a resource on the Internet is located and how it can be accessed; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the file's location and what the browser should do with it.

A URL consists of three parts:
① The first part is the protocol (or scheme).
② The second part is the IP address (or domain name) of the host where the resource is stored, sometimes followed by a port number.
③ The third part is the specific address of the resource on the host, such as the directory and file name.

When a crawler fetches data, it must have a target URL: the URL is the basic means by which a crawler locates its data. Understanding URLs accurately is therefore very helpful when learning to write crawlers.
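To make the three parts concrete, here is a minimal sketch using Python's standard urllib.parse module; the example URL is made up for illustration:

```python
from urllib.parse import urlparse

# A made-up example URL, split into the three parts described above.
url = "http://www.example.com:8080/docs/tutorial/index.html"
parsed = urlparse(url)

print(parsed.scheme)   # 'http'                       -> part 1: the protocol
print(parsed.netloc)   # 'www.example.com:8080'       -> part 2: host (and port)
print(parsed.path)     # '/docs/tutorial/index.html'  -> part 3: path to the resource
```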

3. Why use Python for crawling

  • Interfaces for fetching web pages

Compared with statically typed languages such as Java, C#, and C++, Python's interface for fetching web documents is more concise; compared with other dynamic scripting languages such as Perl and shell, Python's urllib2 package provides a more complete API for accessing web documents. (Of course, Ruby is also a good choice.)
In addition, crawling sometimes requires simulating browser behavior, because many websites block requests that look like blunt crawlers. In those cases we need to simulate a user agent and construct appropriate requests, for example by simulating user login or simulating session/cookie storage and handling. Python has very good third-party packages for this, such as Requests and mechanize.
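As a minimal sketch of that idea, assuming the Requests package is installed and using a placeholder URL and User-Agent string:

```python
import requests

# A Session keeps cookies across requests, which helps when a site expects
# you to stay "logged in" between pages.
session = requests.Session()
session.headers.update({
    # Pretend to be an ordinary browser instead of a script.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})

response = session.get("http://www.example.com/")  # placeholder target URL
print(response.status_code)   # e.g. 200 on success
print(response.text[:200])    # first 200 characters of the returned HTML
```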

  • Post-crawl processing

The fetched web pages usually need further processing, such as filtering out HTML tags and extracting text. Python's BeautifulSoup provides concise document-processing functions that can handle most of this work in very little code.
In fact, many other languages and tools can do all of the above, but Python does it fastest and most cleanly.
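For illustration, here is a short sketch with BeautifulSoup (the bs4 package), using a made-up HTML snippet in place of a fetched page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page a crawler has just downloaded.
html = """
<html><body>
  <h1>Example Title</h1>
  <p class="content">First paragraph of text.</p>
  <p class="content">Second paragraph of text.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Strip all HTML tags and keep only the text.
print(soup.get_text(" ", strip=True))

# Or pick out just the paragraphs with a particular class.
for p in soup.find_all("p", class_="content"):
    print(p.get_text())
```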
