This article walks you through the basic execution process of a Python web crawler program

Web crawlers are programs that automatically fetch content from websites on the Internet; they are also called web spiders or web robots. Large-scale crawlers are widely used in search engines, data mining, and other fields, and individual users or companies can also use crawlers to collect data that is valuable to them.

The basic execution process of a web crawler program can be summarized in three steps: request data, parse data, and save data.

Request data

In addition to ordinary HTML pages, the requested data can also include JSON data, plain text, images, video, audio, and so on.
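As a minimal sketch of the request step, the standard library's `urllib.request` is enough: build a request with a browser-like `User-Agent` header (many sites reject the default Python one) and read the response body. The URL and header value here are placeholders, not from the original article.

```python
import urllib.request

# A browser-like User-Agent; many sites reject the default Python UA.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler/0.1)"}

def build_request(url: str) -> urllib.request.Request:
    """Construct an HTTP request with custom headers (not yet sent)."""
    return urllib.request.Request(url, headers=HEADERS)

def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()

# Example (performs a real network request):
# html = fetch("https://example.com")
```

In practice many crawlers use the third-party `requests` library instead, but the structure is the same: set headers, send the request, read the body.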

Parse data

Once a piece of data has been downloaded, parse its content and extract the data you need. The extracted data can be saved in various forms; common formats include CSV, JSON, and pickle.
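For HTML, parsing typically means extracting specific tags or attributes. A minimal sketch using the standard library's `html.parser` (real crawlers often use BeautifulSoup or lxml instead); the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    """Feed an HTML string through the parser and return all hrefs found."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

sample = '<p>See <a href="/page2">next</a> and <a href="/page3">more</a>.</p>'
print(extract_links(sample))  # -> ['/page2', '/page3']
```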

Save data

Finally, the data is written to a file in a certain format (CSV, JSON) or stored in a database (MySQL, MongoDB); it can be saved in one form or in several forms at the same time.
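A sketch of the file-based options using the standard library's `csv` and `json` modules; the `rows` data and field names are invented for illustration:

```python
import csv
import json

# Example scraped records (invented data).
rows = [
    {"title": "Page one", "url": "https://example.com/1"},
    {"title": "Page two", "url": "https://example.com/2"},
]

def save_csv(path: str, rows: list) -> None:
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(rows)

def save_json(path: str, rows: list) -> None:
    """Write the same records as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

Saving to a database follows the same pattern: the parse step yields dicts, and the save step maps them onto `INSERT` statements or document writes.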

Usually, the data we want is not on a single page but distributed across multiple pages. These pages link to one another: one page may contain one or more links to other pages. After extracting the data from the current page, we also extract some of the links it contains, and then crawl the linked pages.
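The page-to-page process above is a graph traversal. A sketch of a breadth-first crawl loop with a `seen` set for URL deduplication; to keep it self-contained, the "site" is a hard-coded dict mapping each URL to the links found on that page, standing in for a real fetch-and-parse step:

```python
from collections import deque

# A mock "site": each URL maps to the links found on that page.
# In a real crawler these lists would come from fetching and parsing each page.
SITE = {
    "/index": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/index"],
    "/c": [],
}

def crawl(start: str) -> list:
    """Breadth-first traversal: visit each reachable page exactly once."""
    seen = {start}          # URL deduplication
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)   # here you would fetch and parse `url`
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/index"))  # -> ['/index', '/a', '/b', '/c']
```

Note that without the `seen` set, the `/b -> /index` back-link would send the crawler around in a loop forever.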

When designing a crawler program, we must also consider a series of issues such as avoiding repeated crawling of the same page (URL deduplication), the page traversal strategy (depth-first, breadth-first, etc.), and limiting the crawler's access boundaries.
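One common way to enforce an access boundary is a scope check before a link is queued: resolve relative links against the page they were found on, then accept only http(s) URLs whose host is in an allowed set. A sketch using the standard library's `urllib.parse`; the allowed domain and example URLs are assumptions for illustration:

```python
from urllib.parse import urlparse, urljoin

ALLOWED_DOMAINS = {"example.com"}  # crawl boundary (assumed for this sketch)

def in_scope(url: str) -> bool:
    """Accept only http(s) URLs whose host is in the allowed set."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc in ALLOWED_DOMAINS

def normalize(base: str, href: str) -> str:
    """Resolve a relative link against the page it was found on."""
    return urljoin(base, href)

base = "https://example.com/articles/"
print(in_scope(normalize(base, "page2.html")))             # True: same domain
print(in_scope(normalize(base, "https://other.org/")))     # False: off-domain
print(in_scope(normalize(base, "mailto:me@example.com")))  # False: not http(s)
```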

Developing a crawler program from scratch is tedious. To avoid spending a lot of time reinventing the wheel, in practice we can choose one of the excellent crawler frameworks available. Using a framework reduces development cost, improves program quality, and lets us focus on the business logic (crawling the data we care about).


Origin blog.csdn.net/m0_48405781/article/details/114369484