An Introduction to Common Python Crawler Libraries


   The whole crawling process consists of three parts: fetching the page, parsing the page, and storing the data. Fetching pages requires request libraries, parsing pages requires parsing libraries, and storing data requires packages that connect to a database.

1. Common request libraries

  1. requests library: Python 3 has a built-in request library, urllib, but it is cumbersome to use and its method names are not very intuitive. So we use the requests library instead; it is a third-party library, so it has to be installed separately.
  2. Selenium library: Selenium is a library for automated crawling. It is very powerful: it drives a real browser to load pages for us. We can write automation scripts, and the program will then crawl pages on its own.
  3. ChromeDriver: ChromeDriver is a driver. Selenium alone is not enough to automate crawling; you also need a driver, and ChromeDriver is the one that drives Google Chrome.
  4. There is a similar driver for Firefox (geckodriver), but I found Firefox less smooth to use than Chrome, so I stopped using it. For crawling simple pages, the combination requests + Selenium + ChromeDriver is usually sufficient; both are sketched right after this list.
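
A minimal sketch of fetching a page with requests; `https://example.com` is a placeholder URL, not something from the original post.

```python
# Fetch a page with requests and print a bit of the HTML.
import requests

response = requests.get('https://example.com', timeout=10)
response.raise_for_status()        # raise an error for non-2xx status codes
print(response.status_code)        # e.g. 200
print(response.text[:200])         # first 200 characters of the HTML
```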
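
And a minimal Selenium + ChromeDriver sketch, written against the Selenium 4 API; it assumes Chrome and a matching chromedriver are installed and on the PATH.

```python
# Drive a headless Chrome to load and render a page.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # run Chrome without a window

driver = webdriver.Chrome(options=options)  # picks up chromedriver from PATH
try:
    driver.get('https://example.com')       # placeholder URL
    print(driver.title)                     # title of the rendered page
    print(driver.page_source[:200])         # rendered HTML, first 200 chars
finally:
    driver.quit()                           # always close the browser
```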

2. Common parsing libraries

  1. lxml library: lxml supports parsing both HTML and XML, supports XPath queries, and parses very efficiently.
  2. Beautiful Soup library: this library also supports parsing HTML and XML. Its advantage is a powerful API that is much easier to use than lxml's, and its features are richer.
  3. pyquery library: equally powerful. Its API is very similar to jQuery (a JavaScript framework), which makes it especially convenient for anyone familiar with front-end development. I personally prefer this library.
  4. tesserocr library: tesserocr is a Python OCR library, mainly used for recognizing captchas and the like. Sketches of all of these follow the list.
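
A short lxml + XPath sketch; the HTML snippet is invented for the example.

```python
# Parse an HTML string and extract links and text with XPath.
from lxml import etree

html = '<html><body><a href="/a">first</a><a href="/b">second</a></body></html>'
tree = etree.HTML(html)

print(tree.xpath('//a/@href'))    # ['/a', '/b']
print(tree.xpath('//a/text()'))   # ['first', 'second']
```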
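
The same kind of extraction done with Beautiful Soup and with pyquery, to compare their styles; the snippet is again made up.

```python
# Extract list items once with Beautiful Soup and once with pyquery.
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq

html = '<ul><li class="item">one</li><li class="item">two</li></ul>'

soup = BeautifulSoup(html, 'lxml')
print([li.get_text() for li in soup.select('li.item')])  # ['one', 'two']

doc = pq(html)
print([pq(li).text() for li in doc('li.item')])          # ['one', 'two']
```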
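
Finally, a minimal tesserocr sketch; `captcha.png` is a hypothetical local file, and the Tesseract engine itself must be installed on the system.

```python
# Run OCR on a captcha image and print the recognized text.
from PIL import Image
import tesserocr

image = Image.open('captcha.png')       # hypothetical captcha image
print(tesserocr.image_to_text(image))   # recognized text
```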

3. Common databases

  1. Databases fall into relational and non-relational (NoSQL) databases. The ones I use are MySQL, Redis, and MongoDB.
  2. PyMySQL, PyMongo, redis-py libraries: these three libraries connect Python to the corresponding databases, much like database drivers (JDBC) in Java. A PyMySQL sketch is shown below.
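
A minimal PyMySQL sketch; the host, credentials, database, and table are example values I made up, not something the original post specifies.

```python
# Store one crawled record in MySQL via PyMySQL.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='spider', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        cursor.execute(
            'INSERT INTO pages (url, title) VALUES (%s, %s)',
            ('https://example.com', 'Example Title'))
    conn.commit()    # persist the insert
finally:
    conn.close()
```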

4. Libraries for crawling apps

  1. Charles: a packet-capture and editing tool. It is easy to use and makes it simple to inspect and modify captured request data.
  2. mitmproxy: a packet-capture program that supports HTTP and HTTPS. It can intercept requests and responses, resend requests, and so on; a minimal addon sketch follows this list.
  3. Appium: similar to Selenium, but it is an automated testing tool for the app (mobile) side.
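
A minimal mitmproxy addon sketch; save it as, say, `log_flows.py` (a hypothetical name) and run it with `mitmdump -s log_flows.py`. It only logs the URLs and status codes passing through the proxy.

```python
# mitmproxy addon: log every request URL and response status code.
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    print('request: ', flow.request.pretty_url)

def response(flow: http.HTTPFlow) -> None:
    print('response:', flow.response.status_code, flow.request.pretty_url)
```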

5. Frameworks

  1. Frameworks: if the crawl volume is small and there are no strict speed requirements, libraries such as requests + Selenium fully meet the need. But once the crawl volume grows, a lot of the code becomes repetitive, and that is when a framework should be adopted.
  2. pyspider: pyspider comes with a WebUI, a script editor, task monitoring, project management, and more; it is quite powerful.
  3. Scrapy: an application framework written for crawling websites and extracting structured data. A minimal spider is sketched below.
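
A minimal Scrapy spider sketch; the spider name, URL, and CSS selector are placeholders. It can be run with `scrapy runspider example_spider.py`.

```python
# A tiny Scrapy spider that yields one structured item per page.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']   # placeholder start page

    def parse(self, response):
        # extract the page title as a structured item
        yield {'title': response.css('title::text').get()}
```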

Please point out anything I got wrong! If you found this helpful, please give it a like! Comments and private messages are welcome!


Source: blog.csdn.net/Orange_minger/article/details/104731724