Recommended Python crawler libraries (worth bookmarking)

Many people start learning Python with web crawlers. After all, the Internet is full of tutorials on the topic, and there are plenty of open source projects to learn from.

Learning to write web crawlers in Python mainly involves three areas: crawling, parsing, and storage.

When we enter a URL in the browser and press Enter, what happens behind the scenes?

In simple terms, this process takes place in the following four steps:

• Find the IP address corresponding to the domain name.
• Send a request to the server at that IP address.
• The server responds to the request by sending back the content of the web page.
• The browser parses the web page content.
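The four steps above can be sketched with nothing but the standard library. To keep the example runnable offline, a throwaway local server stands in for the remote site; the handler class and the page body are made up for illustration:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 3: the server responds with the page content
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Steps 1-2: "resolve" the host (here it is already an IP) and send the request;
# step 4: read (and, in a real crawler, parse) the returned page.
url = f"http://127.0.0.1:{server.server_port}/"
html = urllib.request.urlopen(url).read().decode()
print(html)
server.shutdown()
```

In a real crawler, step 1 (DNS resolution) and step 2 happen inside `urlopen`, and step 4 is handled by one of the parsing libraries listed later.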

So what libraries do you need to master to write crawlers?

General:

1. urllib – network library (stdlib).
2. requests – network library.
3. grab – network library (based on pycurl).
4. pycurl – network library (binding to libcurl).
5. urllib3 – Python HTTP library with thread-safe connection pooling, file post support, and high availability.
6. httplib2 – network library.
7. RoboBrowser – a simple, very Pythonic library for browsing the web without a separate browser.
8. MechanicalSoup – a Python library for automating interaction with websites.
9. mechanize – a stateful, programmable web-browsing library.
10. socket – the low-level network interface (stdlib).
11. Unirest for Python – Unirest is a set of lightweight HTTP libraries available in multiple languages.
12. hyper – HTTP/2 client for Python.
13. PySocks – an updated and actively maintained fork of SocksiPy, with bug fixes and extra features; works as a drop-in replacement for the socket module.
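As a minimal sketch of the stdlib option, urllib.request lets you build a request with custom headers (crawlers typically set their own User-Agent). The URL below is a placeholder, and nothing is actually sent until `urlopen(req)` is called, so this runs without network access:

```python
import urllib.request

# Build a GET request with a custom User-Agent (placeholder URL; not sent).
req = urllib.request.Request(
    "http://example.com/page?id=1",
    headers={"User-Agent": "my-crawler/0.1"},
)
print(req.get_method())  # GET
print(req.host)          # example.com
# urllib stores header keys capitalized, hence "User-agent":
print(req.get_header("User-agent"))
```

With `requests` the equivalent would be `requests.get(url, headers=...)`, which also handles connection pooling and decoding for you.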

Web crawler framework

1. Full-featured crawlers

• grab – web crawler framework (based on pycurl/multicurl).
• scrapy – web crawler framework (based on Twisted); early versions did not support Python 3, but modern Scrapy does.
• pyspider – a powerful crawler system.
• cola – a distributed crawler framework.

2. Other

• portia – visual crawling based on Scrapy.
• restkit – an HTTP resource kit for Python. It makes it easy to access HTTP resources and build objects around them.
• demiurge – a crawler micro-framework based on PyQuery.

HTML/XML parser

1. General

• lxml – efficient HTML/XML processing library written in C; supports XPath.
• cssselect – parses DOM trees with CSS selectors.
• pyquery – parses DOM trees with jQuery-like selectors.
• BeautifulSoup – slower HTML/XML processing library, implemented in pure Python.
• html5lib – builds the DOM of HTML/XML documents according to the WHATWG specification, the one used by all modern browsers.
• feedparser – parses RSS/ATOM feeds.
• MarkupSafe – provides safely escaped strings for XML/HTML/XHTML.
• xmltodict – a Python module that lets you work with XML as if it were JSON.
• xhtml2pdf – converts HTML/CSS to PDF.
• untangle – easily converts XML files to Python objects.
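The libraries above are richer, but even Python's built-in html.parser can do basic extraction. A small sketch that collects the links from a page (the `LinkCollector` class and the sample HTML are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
p = LinkCollector()
p.feed(html)
print(p.links)  # ['/a', '/b']
```

With BeautifulSoup the same task is a one-liner (`soup.find_all("a")`), and lxml lets you express it as an XPath query.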

2. Clean up

• Bleach – sanitizes HTML (requires html5lib).
• sanitize – brings sanity to messy real-world data.

Text processing

Libraries for parsing and manipulating simple text.


1. General

• difflib – (Python standard library) helps compute differences between sequences.
• Levenshtein – quickly computes Levenshtein distance and string similarity.
• fuzzywuzzy – fuzzy string matching.
• esmre – regular expression accelerator.
• ftfy – automatically cleans up Unicode text, fixing mojibake.
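A quick taste of the stdlib entry in this list: difflib's `SequenceMatcher` gives a similarity ratio, and `get_close_matches` does fuzzy lookup, a rough stand-in for what fuzzywuzzy offers (the sample strings are arbitrary):

```python
import difflib

# Similarity of two strings as a ratio in [0, 1]:
ratio = difflib.SequenceMatcher(None, "crawler", "crawling").ratio()
print(round(ratio, 2))  # 0.67

# Fuzzy lookup of a misspelled word against a list of candidates:
print(difflib.get_close_matches("pythn", ["python", "perl", "ruby"]))
```

For large-scale matching, the C-backed Levenshtein library is considerably faster than this pure-Python approach.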

Natural language processing

Libraries for dealing with human language problems.

• NLTK – the best-known platform for writing Python programs to work with human language data.
• Pattern – a web mining module for Python, with tools for natural language processing, machine learning, and more.
• TextBlob – provides a consistent API for common natural language processing tasks; it stands on the shoulders of giants, building on NLTK and Pattern.
• jieba – Chinese word segmentation tool.
• SnowNLP – Chinese text processing library.
• loso – another Chinese word segmentation library.

Browser Automation and Emulation

• selenium – automates real browsers (Chrome, Firefox, Opera, Internet Explorer).
• Ghost.py – wrapper for PyQt's WebKit (requires PyQt).
• Spynner – wrapper for PyQt's WebKit (requires PyQt).
• Splinter – a generic browser-emulation API (Selenium WebDriver, Django test client, Zope).

Multiprocessing

• threading – the Python standard library's threading module. Works well for I/O-bound tasks, but is useless for CPU-bound tasks because of the Python GIL.
• multiprocessing – the standard Python library for running multiple processes.
• celery – asynchronous task queue/job queue based on distributed message passing.
• concurrent.futures – provides a high-level interface for asynchronously executing callables.
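Since crawling is I/O-bound, a thread pool is the usual first step. A minimal concurrent.futures sketch; the `fetch` function here just simulates a download (the URLs are placeholders) so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for urllib.request.urlopen(url).read()
    return f"fetched {url}"

urls = [f"http://example.com/{i}" for i in range(3)]

# map() preserves input order even though the fetches run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))
print(results)
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` is all it takes to move CPU-bound work (say, heavy parsing) out from under the GIL.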

Asynchronous

Libraries for asynchronous network programming.

• asyncio – (Python standard library in 3.4+) asynchronous I/O, event loop, coroutines and tasks.
• Twisted – an event-driven networking engine.
• Tornado – a web framework and asynchronous networking library.
• pulsar – event-driven concurrency framework for Python.
• diesel – greenlet-based event I/O framework for Python.
• gevent – a coroutine-based networking library for Python, using greenlets.
• eventlet – asynchronous framework with WSGI support.
• Tomorrow – magic decorator syntax for writing asynchronous code.
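The stdlib entry, asyncio, handles many concurrent fetches with coroutines instead of threads. A small sketch in which `asyncio.sleep` stands in for real network I/O (the URLs are placeholders):

```python
import asyncio

async def fetch(url):
    await asyncio.sleep(0.01)  # stands in for awaiting a real network read
    return f"done {url}"

async def main():
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fetch(f"/page/{i}") for i in range(3)))

results = asyncio.run(main())
print(results)
```

In a real crawler you would pair this with an async HTTP client (e.g. aiohttp), since `urllib` and `requests` block the event loop.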

Queues

• celery – asynchronous task queue/job queue based on distributed message passing.
• huey – a small multi-threaded task queue.
• mrq – Mr. Queue, a distributed job queue for Python using Redis and gevent.
• RQ – a lightweight task queue manager based on Redis.
• simpleq – a simple, infinitely scalable queue based on Amazon SQS.
• python-gearman – Python API for Gearman.
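The libraries above distribute work across processes or machines; the underlying producer/consumer idea can be sketched with the stdlib queue module (the worker function and URLs are made up for illustration):

```python
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    while True:
        url = tasks.get()
        if url is None:  # sentinel value: tells the worker to stop
            break
        results.append(f"crawled {url}")
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    tasks.put(f"http://example.com/{i}")
tasks.put(None)
t.join()
print(sorted(results))
```

Celery and RQ follow the same pattern, but the queue lives in a broker (RabbitMQ, Redis) so producers and workers can run on different machines.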

Cloud computing

• picloud – executes Python code in the cloud.
• dominoup.com – executes R, Python and MATLAB code in the cloud.

Web Content Extraction

Libraries for extracting web content.

Text and metadata from HTML pages:
• newspaper – news extraction, article extraction and content curation in Python.
• html2text – converts HTML to Markdown-formatted text.
• python-goose – HTML content/article extractor.
• lassie – a user-friendly web content retrieval tool.
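A rough stand-in for what these tools do: pulling the visible text out of a page. Sketched here with only the stdlib parser (the `TextExtractor` class and the sample page are made up for illustration); the real libraries additionally handle boilerplate removal, encodings and metadata:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of a document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

p = TextExtractor()
p.feed("<html><head><title>News</title></head>"
       "<body><p>Big story.</p></body></html>")
print(" ".join(p.chunks))  # News Big story.
```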

WebSocket

Libraries for WebSockets.

• Crossbar – open-source application messaging router (implements WebSocket and WAMP, built on Autobahn).
• AutobahnPython – open-source Python implementation of the WebSocket protocol and the WAMP protocol.
• WebSocket-for-Python – WebSocket client and server library for Python 2 and 3 and PyPy.

DNS resolution

• dnsyo – checks your DNS against more than 1500 DNS servers worldwide.
• pycares – interface to c-ares, a C library for asynchronous DNS requests and name resolution.
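Before reaching for these libraries, note that the stdlib socket module already covers basic, blocking lookups (this is step 1 of the four-step process described at the top). "localhost" is used here so the example runs without external network access:

```python
import socket

# Resolve a host name to an IPv4 address (blocking lookup).
ip = socket.gethostbyname("localhost")
print(ip)
```

pycares becomes worthwhile when a crawler needs thousands of lookups without blocking, since c-ares resolves names asynchronously.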

Computer vision

• OpenCV – open-source computer vision library.
• SimpleCV – a friendly layer over cameras, image processing, feature extraction and format conversion, with a readable interface (based on OpenCV).
• mahotas – fast computer vision algorithms (implemented entirely in C++), operating on numpy arrays as the data type.

Some frameworks for web development

1.Django

Django is an open source web application framework written in Python. It supports many database engines, makes web development fast and scalable, and is constantly updated to track the latest versions of Python. If you are a novice programmer, this framework is a good place to start.

2.Flask

Flask is a lightweight web application framework written in Python, based on the Werkzeug WSGI toolkit and the Jinja2 template engine. It uses the BSD license.

Flask is also known as a "microframework" because it keeps a simple core and adds functionality through extensions. By default, Flask includes no database layer or form validation tools. However, it remains easy to extend: features such as an ORM, form validation, file uploads and various open authentication technologies can all be added through Flask extensions.
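Flask sits on top of WSGI (via Werkzeug). A bare WSGI application, sketched here with only the stdlib, shows what the framework abstracts away; the helper names (`app`, `start_response`, the synthetic environ) are made up for illustration, and `wsgiref.util.setup_testing_defaults` lets us exercise the app without running a server:

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    # A WSGI app is just a callable: it receives the request environment,
    # reports the status/headers, and returns an iterable of bytes.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from WSGI"]

# Call the app directly with a synthetic request (no server needed):
environ = {}
setup_testing_defaults(environ)
status_holder = []
def start_response(status, headers):
    status_holder.append(status)

body = b"".join(app(environ, start_response))
print(status_holder[0], body.decode())
```

In Flask, the routing, request parsing and response building around this callable are what `@app.route(...)` and the `request`/`Response` objects provide.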

3. Web2py

Web2py is a free, open source web framework written in Python and released under the LGPLv3 license. It is designed for agile, rapid development of fast, scalable, secure and portable database-driven web applications.

Web2py provides a one-stop solution: the entire development cycle can be carried out in the browser. It offers online development of web pages, HTML template editing, static file uploads and database administration. Other features include logging and an automatic admin interface.

4.Tornado

Tornado is a web server (not covered in detail here) as well as a micro-framework like web.py. As a framework, Tornado's design largely follows web.py; on the web.py homepage you can even find a quote from Bret Taylor, who led Tornado's development (the FriendFeed framework he mentions there can be regarded as the same thing as Tornado):

Because of this relationship, Tornado will not be discussed separately later.

5.CherryPy

CherryPy is a simple and very useful web framework for Python. Its main job is to connect a web server to your Python code with as little ceremony as possible. Its features include a built-in profiler, a flexible plug-in system, and the ability to run multiple HTTP servers at once; it runs on the latest versions of Python, as well as on Jython and Android.

Misunderstandings about the choice of framework

When it comes to choosing a framework, many people fall into the following two traps without realizing it. Which framework is the best? There is no best framework in the world, only the framework that best suits you and your team. The same goes for the choice of programming language: if your team knows Python best, use Python; if Ruby, use Ruby. Programming languages and frameworks are just tools; whatever gets the job done is a good one.

Paying too much attention to performance: in fact, most people do not need to worry much about a framework's performance, because the site you are building will most likely be a small one. Few websites reach 10,000 daily IPs, and fewer still reach 100,000. Talking about performance without a certain amount of traffic is pointless, because your CPU and memory will mostly sit idle.


Origin blog.csdn.net/veratata/article/details/128624694