Python Web Crawler Framework and Operation Process

1 Introduction

The basic process by which a Python web crawler acquires web page data is as follows:

Initiate a request

Send a request to the server for the target URL; the request may carry additional header information.
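As a minimal sketch of this step, the snippet below builds a request with the standard `urllib.request` module. The function names `build_request` and `fetch`, the `User-Agent` value, and the `Accept` header are illustrative assumptions, not part of the original article.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-crawler/0.1"):
    """Create a GET request that carries extra header information."""
    headers = {
        "User-Agent": user_agent,  # identifies the crawler to the server (placeholder value)
        "Accept": "text/html,application/json",
    }
    return Request(url, headers=headers)


def fetch(url, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling `fetch("https://example.com/")` would perform the actual network request; `build_request` alone only prepares it.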

Acquire the response content

If the server responds normally, you receive a response whose body is the requested web content; it may contain HTML, a JSON string, binary data (video, pictures), and so on.
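One way to tell these kinds of content apart is to look at the `Content-Type` header of the response. The helper below is a simplified sketch; the name `classify_content` and the three labels it returns are assumptions made for illustration, and anything that is not recognised as HTML or JSON is simply treated as binary.

```python
def classify_content(content_type):
    """Map a Content-Type header value to 'html', 'json', or 'binary'."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"
    # Simplification: every other media type is handled as binary data.
    return "binary"
```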

Parse the content

If the content is HTML, you can parse it with a page parser; if it is JSON, you can convert it into a JSON object and parse that; if it is binary data, you can save it to a file for further processing.
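The three cases could be handled roughly as follows, using only the standard library (`html.parser` and `json`). The class `LinkParser`, the function `parse_content`, the `kind` labels, and the default file name `download.bin` are hypothetical names chosen for this sketch; a real project might use a third-party HTML parser instead.

```python
import json
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse_content(kind, body, binary_path="download.bin"):
    """Parse a response body (bytes) according to its kind."""
    if kind == "html":
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        return parser.links
    if kind == "json":
        return json.loads(body.decode("utf-8"))
    # Binary data: save it to a file for further processing.
    with open(binary_path, "wb") as f:
        f.write(body)
    return binary_path
```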

Save the data

The data can be saved to a local file or to a database (MySQL, Redis, MongoDB, etc.).
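Both options can be sketched with the standard library. Note the substitution: the article names MySQL, Redis, and MongoDB, but this example uses SQLite (via `sqlite3`) as a stand-in because it needs no server. The function names, the `pages` table, and its `url` and `title` columns are assumptions made for illustration.

```python
import json
import sqlite3


def save_to_file(records, path):
    """Append records to a local file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def save_to_database(records, db_path):
    """Insert records into a SQLite table, creating the table if needed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO pages (url, title) VALUES (:url, :title)",
                records,
            )
    finally:
        conn.close()
```

Each record here is expected to be a dict with `url` and `title` keys, for example `{"url": "https://example.com/", "title": "Example"}`.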

2 Crawler Framework and Operation Process

The web crawler framework includes the following five modules:

  • Crawler scheduler
  • URL manager
  • HTML downloader
  • HTML parser
  • Data storage

The functions of the five modules are as follows:

  • Crawler scheduler: mainly responsible for coordinating the work of the other four modules.
  • URL manager: manages URL links, maintaining the set of URLs already crawled and the set of URLs not yet crawled, and provides an interface for obtaining new URLs.
  • HTML downloader: obtains an uncrawled URL from the URL manager and downloads the corresponding HTML page.
  • HTML parser: takes the downloaded HTML page from the HTML downloader, parses out new URL links and passes them to the URL manager, and parses out valid data and passes it to data storage.
  • Data storage: stores the data parsed out by the HTML parser in a file or a database.
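To make the URL manager's role concrete, here is a minimal in-memory sketch built on two sets, one for URLs not yet crawled and one for URLs already crawled. The class and method names are assumptions for illustration; the article does not prescribe an interface.

```python
class UrlManager:
    """Track which URLs are still to be crawled and which are done."""

    def __init__(self):
        self.new_urls = set()  # URLs not yet crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs, skipping duplicates."""
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        """Return True while at least one uncrawled URL remains."""
        return bool(self.new_urls)

    def get_new_url(self):
        """Hand out one uncrawled URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Keeping both sets lets the manager reject links that were already visited, which prevents the crawler from looping between pages that link to each other.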

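The dynamic running process of the web crawler framework is as follows: the crawler scheduler asks the URL manager for an uncrawled URL, the HTML downloader downloads that page, the HTML parser extracts new URLs and valid data from it, the new URLs go back to the URL manager and the data goes to data storage, and the cycle repeats while uncrawled URLs remain.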
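The cooperation between the modules can be sketched as a single scheduling loop. This is an illustrative reading of the module descriptions above, not the author's implementation: the function name `run_crawler`, the `max_pages` limit, and the convention that `download`, `parse`, and `store` are passed in as callables are all assumptions.

```python
def run_crawler(root_url, download, parse, store, max_pages=10):
    """Coordinate the modules until no URL is left or max_pages is reached.

    download(url) returns the page content, or None on failure.
    parse(url, html) returns a (links, data) pair.
    store(data) persists the parsed data.
    """
    new_urls, old_urls = {root_url}, set()  # state kept by the URL manager
    crawled = 0
    while new_urls and crawled < max_pages:
        url = new_urls.pop()                    # 1. take an uncrawled URL
        old_urls.add(url)                       #    and mark it as crawled
        html = download(url)                    # 2. download the page
        if html is None:                        #    skip pages that failed
            continue
        links, data = parse(url, html)          # 3. extract links and data
        new_urls.update(set(links) - old_urls)  # 4. queue links not seen before
        store(data)                             # 5. hand the data to storage
        crawled += 1
    return old_urls
```

Passing the downloader, parser, and storage in as functions keeps the scheduler independent of how each module is implemented, so any of them can be replaced without touching the loop.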

3 Summary

This article describes a web crawler framework developed in Python. The crawler's running process is divided into modules according to their specific functions, so that each module performs its own duty and the modules operate in coordination. Once such a framework has been built, it can improve the efficiency of web crawler project development and avoid duplicated work, that is, reinventing the wheel.

Origin www.cnblogs.com/yangmi511/p/12448067.html