Python Web Crawler Frameworks: Commonly Used Frameworks for Web Crawlers


1. Preface

  • Personal homepage: ζ Xiaocaiji
  • Hello everyone, I'm Xiaocaiji. Let's get to know Python's web crawler frameworks, the frameworks commonly used for web crawling.
  • If this article helps you, feel free to follow, like, and bookmark (one-click triple support).

2. Introduction

   A crawler framework is a semi-finished crawler project: it implements commonly used functionality up front and leaves a set of interfaces open. In each new crawler project, you call the interfaces that suit that project and write only a small amount of code to get the functionality you need. Because the common crawler functions are already implemented in the framework, it saves developers a great deal of time and energy.
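   To make this idea concrete, here is a tiny, purely illustrative sketch (not code from any real framework; all names in it are made up) of the pattern just described: the framework owns the common fetch-and-schedule loop, and leaves one interface, a `parse` callback, for the small amount of project-specific code.

```python
from urllib.request import urlopen  # standard library only

class MiniFramework:
    """Toy 'framework': it owns the common crawl loop and delegates parsing."""

    def __init__(self, start_urls, parse):
        self.queue = list(start_urls)
        self.parse = parse  # the interface left open for project code

    def run(self):
        while self.queue:
            url = self.queue.pop(0)
            html = urlopen(url).read().decode("utf-8", errors="ignore")
            # Project-specific code decides what to extract from each page.
            for item in self.parse(url, html):
                print(item)

# A "project" only has to supply the small parse function:
def parse(url, html):
    yield {"url": url, "length": len(html)}

MiniFramework(["https://example.com/"], parse).run()
```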


3. Scrapy crawler framework

   The Scrapy framework is a relatively mature Python crawler framework: simple, lightweight, and very convenient to use. It can efficiently crawl web pages and extract structured data from them. Scrapy is open source, so there are no licensing fees to worry about when using it. Its official website is https://scrapy.org , and the official homepage is shown in the figure below.

[Figure: the Scrapy official homepage]

The Scrapy open source framework provides developers with very thorough documentation, which covers the framework's installation in detail along with a tutorial on using Scrapy.
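To give a feel for what Scrapy code looks like, here is a minimal self-contained spider. This is an illustrative sketch, not code from the original article; quotes.toscrape.com is a public scraping practice site chosen here as an example target.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors extract structured data from each quote block.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run without creating a full project: `scrapy runspider quotes_spider.py -o quotes.json`.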


4. Crawley crawler framework

  Crawley is another crawler framework developed in Python, and it is dedicated to changing the way people extract data from the Internet. Crawley's specific features are as follows:

  • A high-speed web crawler framework based on Eventlet.
  • Data can be stored in relational databases, e.g. PostgreSQL, MySQL, Oracle, or SQLite.
  • Crawled data can be exported to JSON and XML formats.
  • Supports non-relational databases such as MongoDB and CouchDB.
  • Provides command-line tools.
  • You can use your favorite tool for data extraction, for example XPath or PyQuery (see the sketch after this list).
  • Supports using cookies to log in or to access pages that are only reachable when logged in.
  • Easy to learn.
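As the list above mentions, Crawley lets you plug in extraction tools such as XPath. The sketch below shows that extraction style using the standalone lxml library; note that this is generic XPath usage for illustration, not Crawley's own API, and the HTML snippet and field names are invented.

```python
from lxml import html  # pip install lxml

page = html.fromstring("""
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
</body></html>
""")

# XPath expressions pick out the fields a scraper would store.
for product in page.xpath('//div[@class="product"]'):
    name = product.xpath('./span[@class="name"]/text()')[0]
    price = product.xpath('./span[@class="price"]/text()')[0]
    print({"name": name, "price": float(price)})
```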

5. PySpider crawler framework

Compared with the Scrapy framework, the PySpider framework is a newcomer. It is written in Python, has a distributed architecture, supports multiple database backends, and its powerful WebUI provides a script editor, a task monitor, a project manager, and a result viewer. PySpider's script features are as follows:

  • Scripts are controlled in Python, and you can use any HTML parsing package you like (PyQuery is built in); see the sketch after this list.
  • A web interface is used to write and debug scripts, start and stop scripts, monitor execution status, view crawl history, and retrieve output results.
  • Supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (with SQLAlchemy) as database backends.
  • Supports RabbitMQ, Beanstalk, Redis, and Kombu as message queues.
  • Supports crawling pages rendered with JavaScript.
  • Powerful scheduling control, with support for re-crawling after timeouts and for setting priorities.
  • Components are replaceable; standalone and distributed deployment are supported, as is Docker deployment.
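For a sense of what such a script looks like, the sketch below follows the handler pattern from PySpider's quick-start documentation; the start URL is an arbitrary example, and the details should be treated as an approximation rather than a definitive reference.

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)          # re-run the entry point once a day
    def on_start(self):
        self.crawl('https://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # treat a fetched page as fresh for 10 days
    def index_page(self, response):
        # response.doc is the built-in PyQuery object mentioned above.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

Scripts like this are written and debugged directly in the WebUI's script editor, which is where PySpider differs most visibly from Scrapy's command-line workflow.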

  This concludes the introduction to Python's web crawler frameworks, the frameworks commonly used for web crawling. Thank you for reading; if the article helped you, feel free to follow, like, and bookmark (one-click triple support).

