A Summary of Common Python Web Crawler Libraries

Web crawlers can be written in many programming languages, but Python is certainly one of the mainstream choices. Below is an introduction to the libraries most often used when writing web crawlers in Python.

Request libraries: performing HTTP requests

  • urllib: a set of standard-library modules for working with URLs.
  • requests: a blocking HTTP request library built on top of urllib3. Once a request is sent, the program waits for the server's response before continuing (see the sketch after this list).
  • selenium: an automated-testing tool that drives a real browser. Through this library you can control a browser directly to perform actions such as entering a CAPTCHA.
  • aiohttp: an asyncio-based HTTP framework. Using the async/await keywords for asynchronous crawling can greatly improve efficiency.
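To make the blocking-versus-asynchronous distinction concrete, here is a minimal sketch of both styles. The test URL (httpbin.org) is a placeholder chosen for illustration, not something from the original article.

```python
import asyncio

import aiohttp
import requests

URL = "https://httpbin.org/get"  # placeholder test endpoint

# Blocking style with requests: execution pauses until the server replies.
resp = requests.get(URL, timeout=10)
print(resp.status_code)

# Asynchronous style with aiohttp: async/await lets many requests
# run concurrently instead of waiting on each one in turn.
async def fetch(url: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as r:
            return await r.text()

print(asyncio.run(fetch(URL))[:80])
```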

Parsing libraries: extracting information from web pages

  • BeautifulSoup: parses HTML and XML and extracts information from web pages; it offers a powerful API and a variety of parsing backends (see the sketch after this list).
  • pyquery: a Python implementation of jQuery that lets you parse HTML documents with jQuery-style syntax; it scores well on both ease of use and parsing speed.
  • lxml: supports HTML and XML parsing as well as XPath queries, with very high parsing efficiency.
  • tesserocr: an OCR library; when faced with an image-based CAPTCHA, OCR can sometimes recognize it directly.
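Here is a minimal sketch of the two most common parsing approaches, using a tiny inline document as a stand-in for a fetched page:

```python
from bs4 import BeautifulSoup
from lxml import etree

# A tiny inline document standing in for a fetched page.
html = "<html><body><h1>Demo</h1><a href='/page/2'>next</a></body></html>"

# BeautifulSoup: navigate by tag names and attributes.
soup = BeautifulSoup(html, "lxml")
print(soup.h1.text)      # Demo
print(soup.a["href"])    # /page/2

# lxml: the same extraction expressed as XPath queries.
tree = etree.HTML(html)
print(tree.xpath("//h1/text()"))  # ['Demo']
print(tree.xpath("//a/@href"))    # ['/page/2']
```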

Storage libraries: interacting with databases from Python

  • pymysql: a MySQL client library implemented in pure Python (see the sketch after this list).
  • pymongo: a library for connecting directly to a MongoDB database and running queries against it.
  • redis-dump: a tool for importing and exporting Redis data. It is implemented in Ruby, so Ruby must be installed to use it.
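As an illustration, here is a minimal pymysql sketch that stores a crawled page. The connection settings and the pages table are hypothetical; adjust them to your own environment.

```python
import pymysql

# Hypothetical connection settings and table -- adjust to your environment.
conn = pymysql.connect(host="localhost", user="root", password="secret",
                       database="spider", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # Insert one crawled record; %s placeholders guard against SQL injection.
        cur.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            ("https://example.com", "Example Domain"),
        )
    conn.commit()
finally:
    conn.close()
```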

Crawler frameworks

  • Scrapy: a very powerful crawler framework, well suited to straightforward page crawling (for example, sites whose URL patterns are known in advance). With this framework you can easily scrape data such as product listings from sites like Amazon. For somewhat more complex pages, however, such as Weibo pages, it will not meet the need on its own (a minimal spider sketch follows this list).
  • Crawley: crawls the content of a site at high speed; supports both relational and non-relational databases, and can export data as JSON, XML, and so on.
  • Portia: visual crawling of web content.
  • newspaper: extracts and analyzes news and article content.
  • GOOSE-Python: a Python port of Goose, the Java article-extraction tool.
  • cola: a distributed crawler framework. The overall project design is somewhat poor, with a high degree of coupling between modules.
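To show what a Scrapy spider looks like in practice, here is a minimal sketch. It targets quotes.toscrape.com, a public demo site used here purely as an example; the selectors match that site's markup.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o quotes.json` to write the scraped items to a JSON file.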

Web framework libraries

  • flask: a lightweight web framework; simple, easy to use, and flexible, it is mainly used to build small API services. You may need it, for example, when exposing a proxy pool to your crawler (see the sketch after this list).
  • django: a full web server framework that provides a complete admin backend, template engine, and interfaces; you can use it to build a complete website.
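Here is a minimal Flask sketch of the kind of small API service mentioned above. The /api/proxies route and its payload are illustrative only, not part of any real project:

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/api/proxies")
def proxies():
    # Illustrative payload: a crawler could fetch fresh proxies here.
    return jsonify(["127.0.0.1:8080", "127.0.0.1:8081"])


if __name__ == "__main__":
    app.run(port=5000)
```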
