A summary of commonly used Python machine learning libraries


1. Python web crawler toolset

A real project must start with acquiring data. Whether the task is text processing, machine learning, or data mining, data is required. Besides professional datasets that can be purchased or downloaded through various channels, people often need to crawl the data themselves, and this is where web crawlers become particularly important. Fortunately, Python provides a batch of very good web crawler frameworks that can not only fetch data but also clean and extract it, so let's start here:

1.1 Scrapy

Scrapy, a fast high-level screen scraping and web crawling framework for Python.

Many students have probably heard of the famous Scrapy; many of the courses in our course map were crawled with it. There are plenty of introductory articles on Scrapy, and I recommend an early piece by the expert pluskid, "Scrapy: Easily Customize a Web Crawler", which has stood the test of time.
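As a rough sketch of what a Scrapy spider looks like (the spider name, URL, and CSS selectors below are illustrative assumptions, not taken from any article mentioned here):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Crawl a page and yield one item per quote block."""
    name = "quotes"                                # illustrative spider name
    start_urls = ["https://quotes.toscrape.com/"]  # illustrative demo site

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json`, and Scrapy handles request scheduling, retries, and output serialization for you.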

1.2 Beautiful Soup

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

Objectively speaking, Beautiful Soup is not a complete crawler toolkit on its own; it needs to be paired with something like urllib to fetch pages. Rather, it is a set of tools for parsing, cleaning, and extracting HTML/XML data.
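A minimal sketch of the usual urllib + Beautiful Soup pairing (the URL is a placeholder):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urlopen("https://example.com").read()  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)           # the page title
for link in soup.find_all("a"):    # every hyperlink in the document
    print(link.get("href"))
```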

1.3 Python-Goose

HTML Content / Article Extractor, web scraping lib in Python

Goose was originally written in Java and later rewritten in Scala, so it is a Scala project; Python-Goose is a Python rewrite that depends on Beautiful Soup. I used it a while ago and it feels very good: given the URL of an article, it conveniently extracts the article's title and body text.
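The usage pattern is roughly the following (a sketch; the article URL is a placeholder, and note that the original Python-Goose targets Python 2, with goose3 as a maintained fork for Python 3):

```python
from goose import Goose  # pip install goose-extractor; on Python 3, "from goose3 import Goose"

g = Goose()
article = g.extract(url="https://example.com/some-article")  # placeholder URL
print(article.title)               # extracted headline
print(article.cleaned_text[:200])  # extracted body text
```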

2. Python text processing toolset

After obtaining text data from the web, basic text processing is required depending on the task. For English, basic tokenization is required; for Chinese, word segmentation is the common first step. Beyond that, for both English and Chinese, there is part-of-speech tagging, syntactic parsing, keyword extraction, text classification, sentiment analysis, and more. In this area, especially for English, there are many excellent toolkits; let's go through them one by one.

2.1 NLTK — Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.

There should be few students working in natural language processing who do not know NLTK, so I won't say much here. However, two books are recommended for those who have just encountered NLTK or want to learn more about it. One is the official "Natural Language Processing with Python", which mainly introduces the functionality of NLTK along with some Python knowledge; a Chinese translation is available (see "The Chinese translation of 'Natural Language Processing with Python' is recommended - an NLTK companion book"). The other is "Python Text Processing with NLTK 2.0 Cookbook", which goes deeper, touches on NLTK's code structure, and also covers how to customize your own corpora and models; it is quite good.
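The bread-and-butter NLTK calls look like this (a small sketch; the model downloads are one-time setup):

```python
import nltk

# one-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK is a leading platform for building Python programs."
tokens = nltk.word_tokenize(text)   # ['NLTK', 'is', 'a', ...]
print(nltk.pos_tag(tokens))         # [('NLTK', 'NNP'), ('is', 'VBZ'), ...]
```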

2.2 Pattern

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

Pattern is produced by the CLiPS laboratory at the University of Antwerp in Belgium. Objectively speaking, Pattern is not just a text processing toolkit but a whole web data mining suite: it includes a data acquisition module (Google, Twitter, and Wikipedia APIs, plus a crawler and an HTML DOM parser), a text processing module (part-of-speech tagging, sentiment analysis, etc.), a machine learning module (vector space model, clustering, SVM), and a visualization module. In a sense, Pattern's overall layout mirrors the organization of this very article, but for now we place it in the text processing section. Personally, I mainly use its English processing module, pattern.en, which offers many very good text processing functions, including basic tokenization, part-of-speech tagging, sentence segmentation, grammar checking, spelling correction, sentiment analysis, and syntactic parsing. Pretty good.
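A glimpse of pattern.en (a sketch; Pattern is a Python 2-era library with community forks for Python 3, and the sentence is just an example):

```python
from pattern.en import tag, sentiment, parse

s = "The movie attempts to be surreal by incorporating various paradoxes."
print(tag(s))        # part-of-speech tags: [('The', 'DT'), ('movie', 'NN'), ...]
print(sentiment(s))  # (polarity, subjectivity) scores
print(parse(s))      # shallow syntactic parse with chunk labels
```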

2.3 TextBlob: Simplified Text Processing

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob is an interesting Python text processing toolkit. It is actually built on the two toolkits above, NLTK and Pattern ("TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both"), and provides interfaces to many text processing functions, including part-of-speech tagging, noun phrase extraction, sentiment analysis, text classification, and spell checking, and even translation and language detection, though the latter two rely on Google's API and are rate-limited. TextBlob is relatively young, and interested students may want to keep an eye on it.
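TextBlob's API is pleasantly compact; a sketch (run `python -m textblob.download_corpora` once to fetch the required NLTK corpora):

```python
from textblob import TextBlob

blob = TextBlob("TextBlob stands on the shoulders of NLTK and Pattern. "
                "It makes common NLP tasks pleasantly simple.")

print(blob.tags)          # part-of-speech tagging
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
print(blob.correct())     # naive spelling correction
```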

 

2.4 MBSP for Python

MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.

MBSP shares its origin with Pattern, coming from the same CLiPS laboratory at the University of Antwerp in Belgium. It provides basic text processing functions such as tokenization, sentence splitting, part-of-speech tagging, chunking, lemmatization, and syntactic parsing. Interested students can take a look.

 

2.5 Gensim: Topic modeling for humans

Gensim is a fairly professional Python toolkit for topic modeling, excellent in both code and documentation. We previously covered Gensim's installation and usage in "How to Calculate the Similarity of Two Documents", so I won't repeat that here.
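For flavor, here is the classic dictionary / bag-of-words / TF-IDF / similarity pipeline in Gensim (a sketch with a toy three-document corpus):

```python
from gensim import corpora, models, similarities

texts = [["human", "computer", "interaction"],
         ["graph", "minors", "survey"],
         ["human", "system", "computer"]]

dictionary = corpora.Dictionary(texts)           # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                # TF-IDF weighting
index = similarities.MatrixSimilarity(tfidf[corpus],
                                      num_features=len(dictionary))

query = tfidf[dictionary.doc2bow(["human", "computer"])]
print(list(index[query]))  # cosine similarity of the query to each document
```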

 

2.6 langid.py: Stand-alone language identification system

Language detection is a very interesting topic, but a relatively mature one; there are many solutions and many good open source toolkits in this area. For Python, I have used the langid toolkit and am very willing to recommend it. langid currently supports detection of 97 languages and provides many easy-to-use features, including starting a simple server, calling its API via JSON, and training your own customized language detection model. Small as it is, it has everything it needs.
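Basic usage is a single call (a sketch; `classify` returns a language code plus a confidence score):

```python
import langid

print(langid.classify("This is an English sentence."))    # ('en', score)
print(langid.classify("Ceci est une phrase française."))  # ('fr', score)

# optionally restrict the candidate set when the possibilities are known
langid.set_languages(["en", "fr", "zh"])
```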

2.7 Jieba: Chinese word segmentation

"Jieba" Chinese word segmentation: Do the best Python Chinese word segmentation component "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Well, we can finally mention a home-grown Python text processing toolkit: Jieba ("stuttering") word segmentation. Its features include three segmentation modes (accurate mode, full mode, and search engine mode), support for traditional Chinese text, support for custom dictionaries, and more. It is currently a very good Python solution for Chinese word segmentation.
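The three modes in code (a sketch; the user dictionary path is hypothetical):

```python
# -*- coding: utf-8 -*-
import jieba

text = u"我来到北京清华大学"

print("/".join(jieba.cut(text)))                # accurate mode (default)
print("/".join(jieba.cut(text, cut_all=True)))  # full mode
print("/".join(jieba.cut_for_search(text)))     # search engine mode

jieba.load_userdict("user_dict.txt")  # hypothetical custom dictionary file
```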

 

3. Python Scientific Computing Toolkit

Speaking of scientific computing, the first thing that comes to mind is MATLAB, which integrates numerical computing, visualization, and interaction but is unfortunately a commercial product. On the open source side, besides GNU Octave's attempt at a MATLAB-like toolkit, the Python stack of NumPy + SciPy + Matplotlib + IPython can also stand in for the corresponding MATLAB functionality. At the same time, these toolkits, especially NumPy and SciPy, are the foundation of many Python text processing, machine learning, and data mining toolkits, so they are very important. Finally, I recommend the series "Scientific Computing with Python", which covers NumPy, SciPy, and Matplotlib, for reference.

3.1 NumPy

 

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

1) a powerful N-dimensional array object

2) sophisticated (broadcasting) functions

3) tools for integrating C/C++ and Fortran code

4) useful linear algebra, Fourier transform, and random number capabilities

 

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy is an almost unavoidable scientific computing toolkit. Its most commonly used feature is probably the N-dimensional array object; it also offers mature function libraries, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, and random number generation capabilities. NumPy provides two basic objects: ndarray (the N-dimensional array object) and ufunc (the universal function object). An ndarray is a multidimensional array storing elements of a single data type, while a ufunc is a function that can operate on whole arrays.
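Both basic objects in action (a small sketch):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # ndarray: one dtype, N dimensions
print(a.shape, a.dtype)                            # (3, 4) float64

b = np.sin(a) + a.mean(axis=0)  # ufuncs work elementwise, with broadcasting
print(np.dot(a, a.T))           # linear algebra: 3x3 matrix product
print(np.fft.fft(np.ones(8)))   # Fourier transform
print(np.random.rand(2, 2))     # random number generation
```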

3.2 SciPy: Scientific Computing Tools for Python

SciPy refers to several related but distinct entities:

 

1) The SciPy Stack, a collection of open source software for scientific computing in Python, and particularly a specified set of core packages.

2) The community of people who use and develop this stack.

3) Several conferences dedicated to scientific computing in Python – SciPy, EuroSciPy and SciPy.in.

4) The SciPy library, one component of the SciPy stack, providing many numerical routines.

"SciPy is an open source Python algorithm library and mathematical toolkit. SciPy includes modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal processing and image processing, ordinary differential equation solving and other Computation commonly used in science and engineering. Its functionality is similar to the software MATLAB, Scilab, and GNU Octave. Numpy and Scipy are often used in combination, and most machine learning libraries in Python rely on these two modules." - Quoted from "Python Machine "Learning Library"

3.3 Matplotlib

matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

matplotlib is the most famous plotting library in Python. It provides a set of command-style APIs similar to MATLAB's, which is very suitable for interactive plotting, and it can also be conveniently embedded as a plotting control in GUI applications. Used with the IPython shell, Matplotlib offers a plotting experience no less capable than MATLAB's. In short, it is a pleasure to use.
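A MATLAB-flavored sketch of the pyplot API:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), "--", label="cos(x)")
plt.xlabel("x")
plt.legend()
plt.savefig("trig.png")  # or plt.show() in an interactive session
```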

 

4. Python Machine Learning & Data Mining Toolkit

The concepts of machine learning and data mining are not that easy to distinguish, so they are treated together here. There are many open source Python toolkits in this area. Let's start with the familiar ones and then add material from other sources; further additions are welcome.

4.1 scikit-learn: Machine Learning in Python

scikit-learn (formerly scikits.learn) is an open source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

First up is the famous scikit-learn, an open source machine learning toolkit built on NumPy, SciPy, and Matplotlib. It mainly covers classification, regression, and clustering algorithms, such as SVM, logistic regression, naive Bayes, random forests, and k-means. Its code and documentation are both excellent, and it is used in many Python projects. For example, in the familiar NLTK, the classifier module has a dedicated interface to scikit-learn that can call scikit-learn's classification algorithms and train classifier models on your data.
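The canonical fit/predict workflow (a sketch using the bundled iris dataset; the API shown matches recent scikit-learn versions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```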

4.2 Pandas: Python Data Analysis Library

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Pandas is also built on NumPy and Matplotlib and is mainly used for data analysis and visualization. Its DataFrame structure is very similar to the data.frame in R, and it has its own analysis machinery, especially for time series data, which is very good. Here I recommend the book "Python for Data Analysis", whose author is the main developer of Pandas. It introduces, in turn, related functionality in IPython, NumPy, and Pandas, data visualization, data cleaning and processing, and time series handling; the case studies include financial stock data mining and more. Pretty good.
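A taste of the DataFrame and its time series conveniences (a sketch with synthetic data):

```python
import numpy as np
import pandas as pd

# a DataFrame over a daily DatetimeIndex, much like R's data.frame
dates = pd.date_range("2014-01-01", periods=6, freq="D")
df = pd.DataFrame({"price": np.random.rand(6) * 100,
                   "volume": np.random.randint(1, 10, 6)}, index=dates)

print(df.describe())             # summary statistics per column
print(df["price"].pct_change())  # day-over-day percentage returns
print(df.resample("2D").mean())  # downsample to 2-day averages
```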

4.3 mlpy – Machine Learning Python

mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is aimed at finding a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, it works with Python 2 and 3 and it is Open Source, distributed under the GNU General Public License version 3.

4.4 PyBrain

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

"PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network) is a machine learning module for Python. Its goal is to provide flexible, adaptable, and powerful machine learning algorithms for machine learning tasks. (The name is very domineering)

PyBrain, as the name suggests, includes neural networks, reinforcement learning (and a combination of the two), unsupervised learning, and evolutionary algorithms. Because many current problems deal with continuum and behavioral spaces, function approximations (such as neural networks) must be used to deal with high-dimensional data. PyBrain takes neural network as the core, and all training methods use neural network as an instance. "
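A minimal supervised sketch, training a small network on XOR (the layer sizes and epoch count are illustrative):

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

net = buildNetwork(2, 3, 1)  # 2 inputs, 3 hidden units, 1 output
ds = SupervisedDataSet(2, 1)
for sample, target in [((0, 0), (0,)), ((0, 1), (1,)),
                       ((1, 0), (1,)), ((1, 1), (0,))]:
    ds.addSample(sample, target)

trainer = BackpropTrainer(net, ds)
for _ in range(1000):        # one epoch per train() call
    trainer.train()
print(net.activate((0, 1)))  # should move toward 1.0
```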

4.5 Theano

 

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:

1) tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.

2) transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with CPU (float32 only).

3) efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.

4) speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.

5) dynamic C code generation – Evaluate expressions faster.

6) extensive unit-testing and self-verification – Detect and diagnose many types of mistakes.

Theano has been powering large-scale computationally intensive scientific investigations since 2007. But it is also approachable enough to be used in the classroom (IFT6266 at the University of Montreal).

"Theano is a Python library for defining, optimizing and simulating the computation of mathematical expressions for efficiently solving computational problems with multidimensional arrays. Theano features: tight integration with Numpy; efficient data-intensive GPU computing; efficient symbolic differentiation Computing; high speed and stable optimization; dynamic c code generation; extensive unit testing and self-verification. Theano has been widely used in scientific computing since 2007. Theano makes it easier to build deep learning models and can quickly implement a variety of models .PS: Theano, a Greek beauty, daughter of Croton's most powerful Milo, later became the wife of Pythagoras."

 

4.6 Pylearn2

Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU).

"Pylearn2 is built on theano and partly relies on scikit-learn. Currently, Pylearn2 is under development. It will be able to process data such as vectors, images, and videos, and provide deep learning models such as MLP, RBM, and SDA."
