Commonly Used Python Extension Packages for Machine Learning

On Ubuntu there are generally three ways to install Python modules: 1) apt-get; 2) the pip command (recommended); 3) easy_install.

For installation methods see: [Repost] Installing a Python development environment and Python packages on Linux and Windows, [Part 2: Installation]

See also: [Install Python packages on Ubuntu 14.04]

Installing the packages below with pip may fail because some base libraries are missing, so first check that the following system libraries are present:

Ubuntu dependencies

A variety of Ubuntu-specific packages are needed by Python packages. These are libraries, compilers, fonts, etc. I’ll detail these here along with install commands. Depending on what you want to install you might not need all of these.

General development/build:

$ sudo apt-get install build-essential python-dev

Compilers/code integration:

$ sudo apt-get install gfortran
$ sudo apt-get install swig

Numerical/algebra packages:

$ sudo apt-get install libatlas-dev
$ sudo apt-get install liblapack-dev

Fonts (for matplotlib)

$ sudo apt-get install libfreetype6 libfreetype6-dev

More fonts (for matplotlib on Ubuntu Server 14.04; see comment at end of post), added 2015/03/06

$ sudo apt-get install libxft-dev

Graphviz for pygraphviz, networkx, etc.

$ sudo apt-get install graphviz libgraphviz-dev

IPython requires pandoc for document conversions, printing, etc.

$ sudo apt-get install pandoc

Tinkerer dependencies

$ sudo apt-get install libxml2-dev libxslt-dev zlib1g-dev

That's it; now we start installing the Python packages.

[Installation list]

1. numpy, scipy

2. pandas: powerful data structures for data analysis, time series, and statistics

3. statsmodels

4. matplotlib, pyplot, pylab

5. libsvm

6. jieba (Chinese word segmentation)

7. scikit-learn toolkit

8. Theano (deep learning)

9. wikipedia: Wikipedia API for Python

10. gensim

11. Pattern

12. NLTK: Natural Language Toolkit

1. numpy: a Python extension that defines numeric arrays and matrices. It provides a multidimensional array of a single data type (ndarray) and a matrix type (matrix).
scipy: builds on numpy and adds many modules commonly used in mathematics, science, and engineering, such as linear algebra, numerical solution of ordinary differential equations, signal processing, image processing, sparse matrices, and so on.
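A minimal sketch of the two packages working together (the array values are just illustrative; the environment is assumed to be the Python 2.7 setup used throughout this post):

import numpy as np
from scipy import linalg

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2x2 ndarray holding a single dtype
print a.shape, a.dtype                   # (2, 2) float64
print linalg.det(a)                      # determinant via scipy.linalg -> -2.0
print linalg.inv(a)                      # matrix inverse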

2. pandas: the main package for directly handling and manipulating data; it provides data structures such as the DataFrame that make working with tabular data convenient.
Install (pip method): pip install pandas. A quick check that it works:

import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
print df

3. statsmodels: a package for statistics and econometrics; it contains utilities for parameter estimation and statistical tests.
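A minimal sketch of what statsmodels looks like in practice, assuming it was installed with pip install statsmodels (the data here are synthetic):

import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 10, 50)
X = sm.add_constant(x)                     # add an intercept column
y = 1.0 + 2.0 * x + np.random.randn(50)    # y = 1 + 2x + noise
model = sm.OLS(y, X).fit()                 # ordinary least squares fit
print model.params                         # estimated [intercept, slope]
print model.summary()                      # full statistical report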

4. matplotlib, pyplot, pylab: used to generate statistical plots. pyplot and pylab are submodules of matplotlib, so installing matplotlib is enough to get both of them. (The difference between pylab and pyplot is that the former imports numpy into its namespace. This was to make it behave more similarly to MATLAB. Using pyplot instead of pylab is preferred now because it is cleaner.)
  Python generally uses Matplotlib to produce statistical graphics; in its own words, it "makes easy things easy and hard things possible". (How do foreign developers come up with such tidy phrases? I could never write a sentence that neat!) With it you can make line charts, histograms, bar charts, scatter plots, pie charts, spectrograms, and just about any other statistical figure you can or cannot think of, and the figures can be exported in many publication-quality formats. It also works very nicely together with IPython; anyone who has used the combination knows.

The parts of Matplotlib used most often are pylab and pyplot. The difference between them is that pyplot is a wrapped interface to matplotlib's low-level plotting library, so users do not have to care about the underlying implementation, while pylab, for the user's convenience, pulls the functionality of numpy and pyplot into a single namespace. If that still sounds unclear, here is an example:

import pylab
import numpy as np
import matplotlib.pyplot as plt

pylab.randn(2, 3)        # pylab re-exports numpy's random functions
array([[ 1.22356117, -0.62786477, -0.02927331],
[ 1.11739661, -1.64112491, 2.24982297]])

np.random.randn(2, 3)    # the same function, called through numpy
array([[-1.41691502, -1.43500335, -0.68452086],
[-0.53925581, -0.18478012, -0.0126745 ]])

pylab.hist([1, 1, 1, 2, 3, 3])   # plotting works from pylab...
plt.hist([1, 1, 1, 2, 3, 3])     # ...and from pyplot
As the example shows, some numpy functions can be used directly from pylab, whereas pyplot does not expose numpy's functions; both pylab and pyplot can be used to draw statistical plots.

5. libsvm: a library for SVM models. Installation instructions:

First download the LibSVM package from the website (http://www.csie.ntu.edu.tw/cjlin/cgi-bin/libsvm.cgi?+http://www.csie.ntu.edu.tw/cjlin/libsvm+tar.gz), then extract it.

From a terminal, cd into the extracted directory and run make. For example, I downloaded libsvm-3.20.tar.gz:

cd /home/eple/Downloads/libsvm-3.20
make
  Then go into the python directory and run make again (this step generates libsvm.so.2):

cd python/
make

Done! To test whether it succeeded, start python in a terminal and try the official quick-start example:

Quick Start

There are two levels of usage. The high-level one uses utility functions
in svmutil.py and the usage is the same as the LIBSVM MATLAB interface.

from svmutil import *

Read data in LIBSVM format

y, x = svm_read_problem('../heart_scale')
m = svm_train(y[:200], x[:200], '-c 4')
p_label, p_acc, p_val = svm_predict(y[200:], x[200:], m)

Construct problem in python format

Dense data

y, x = [1,-1], [[1,0,1], [-1,0,-1]]

Sparse data

y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
prob = svm_problem(y, x)
param = svm_parameter('-t 0 -c 4 -b 1')
m = svm_train(prob, param)

However, to use it under PyCharm this is still not enough. You also need to copy the libsvm-3.20/python/*.py files into /usr/lib/python2.7/dist-packages, and libsvm.so.2 into /usr/lib/python2.7/:

sudo cp *.py /usr/lib/python2.7/dist-packages
sudo cp /home/eple/Downloads/libsvm-3.20/libsvm.so.2 /usr/lib/python2.7
  OK!

6. jieba: a Chinese word segmentation tool. Installation instructions:
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Features
Supports three segmentation modes (see the usage sketch after the installation notes):

Accurate mode: tries to segment the sentence as precisely as possible; suitable for text analysis.
Full mode: scans out every word in the sentence that could possibly be a word; very fast, but cannot resolve ambiguity.
Search-engine mode: based on accurate mode, long words are split again to improve recall; suitable for search-engine indexing.
Supports traditional Chinese segmentation

Supports custom dictionaries
MIT license

Online demo
http://jiebademo.ap01.aws.af.cm/

Demo site code: https://github.com/fxsjy/jiebademo

Installation notes
The code is compatible with both Python 2 and Python 3.

Fully automatic: easy_install jieba, or pip install jieba / pip3 install jieba
Semi-automatic: download http://pypi.python.org/pypi/jieba/ , extract, then run python setup.py install
Manual: place the jieba directory in the current directory or in site-packages
Use it via import jieba
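A minimal sketch of the three modes (the sentence is only an illustrative example; the Python 2.7 environment of this post is assumed):

# -*- coding: utf-8 -*-
import jieba

text = u"我来到北京清华大学"
print "/".join(jieba.cut(text))                 # accurate mode (default)
print "/".join(jieba.cut(text, cut_all=True))   # full mode
print "/".join(jieba.cut_for_search(text))      # search-engine mode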

7. scikit-learn toolkit: an open-source machine learning module built on SciPy and NumPy, covering families of classification, regression, and clustering algorithms; the main algorithms include SVM, logistic regression, naive Bayes, k-means, DBSCAN, and so on; it also ships some sample datasets.

[English description] Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

Project pages:

https://pypi.python.org/pypi/scikit-learn/

http://scikit-learn.org/

https://github.com/scikit-learn/scikit-learn

Installation method 1:

The toolkit can be found directly in the Ubuntu repositories:

Install it directly:
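Assuming the Ubuntu 14.04 repositories, where the scikit-learn package is named python-sklearn, the command would be roughly:

$ sudo apt-get install python-sklearn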

Installation method 2: pip
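A sketch of the pip route, followed by a quick sanity check on the built-in iris dataset (the SVC snippet is only illustrative):

$ pip install -U scikit-learn

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()                 # built-in sample dataset
clf = SVC().fit(iris.data, iris.target)     # train an SVM classifier
print clf.predict(iris.data[:3])            # predict a few samples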

8. Theano (deep learning)

Theano is a machine learning library that lets you define, optimize, and evaluate mathematical expressions involving multidimensional arrays, which can be a stumbling block for developers of other libraries. Like scikit-learn, Theano integrates well with NumPy. Its transparent use of the GPU means Theano can be set up quickly and without errors, which matters a lot for beginners. Some people, however, describe it more as a research tool than as something to use in production, so use it according to your needs.

One of Theano's best features is its excellent reference documentation and large number of tutorials. In fact, thanks to the library's popularity, you will not have much trouble finding resources on things like how to build your models and get them running.

Installation:
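A sketch assuming the pip route; the symbolic-expression snippet below is just a minimal smoke test:

$ pip install Theano

import theano
import theano.tensor as T

x = T.dscalar('x')            # a symbolic double-precision scalar
y = x ** 2                    # a symbolic expression
f = theano.function([x], y)   # compile it into a callable function
print f(3.0)                  # 9.0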

9、wikipedia :Wikipedia is a Python library that makes it easy to access and parse data from Wikipedia

Search Wikipedia, get article summaries, get data like links and images from a page, and more. Wikipedia wraps the MediaWiki API so you can focus on using Wikipedia data, not getting it.

Installation on Ubuntu:
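A sketch assuming pip is used here as well; the summary() call is the library's basic entry point, and the page title is just an example (network access is required):

$ pip install wikipedia

import wikipedia
print wikipedia.summary("Python (programming language)", sentences=1)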

10. gensim: depends on NumPy and SciPy, the two major Python scientific-computing packages; the simplest installation method is pip install. gensim's official installation page lists in detail the compatible versions of Python, NumPy, and SciPy as well as the installation steps; interested readers can refer to it directly. Installation on Ubuntu:
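A sketch assuming the pip route; the tiny corpus below is purely illustrative and only exercises the dictionary / bag-of-words / TF-IDF pipeline:

$ pip install --upgrade gensim

from gensim import corpora, models

texts = [["human", "computer", "interaction"],
         ["survey", "computer", "system"]]
dictionary = corpora.Dictionary(texts)            # map each word to an id
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                 # fit a TF-IDF model
print tfidf[corpus[0]]                            # re-weighted first document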

11. Pattern (GitHub: http://github.com/clips/pattern)

This library is more of a "full suite", because it provides not only some machine learning algorithms but also tools to help you collect and analyze data. The data-mining part helps you gather data from web services such as Google, Twitter, and Wikipedia. It also includes a web crawler and an HTML DOM parser. The advantage of bundling these tools is that collecting data and training on it in the same program becomes much easier.

The documentation has a nice example that trains a classifier on a batch of tweets to decide whether a tweet is a "win" or a "fail".

from pattern.web import Twitter
from pattern.en import tag
from pattern.vector import KNN, count

twitter, knn = Twitter(), KNN()

for i in range(1, 3):
    for tweet in twitter.search('#win OR #fail', start=i, count=100):
        s = tweet.text.lower()
        p = '#win' in s and 'WIN' or 'FAIL'
        v = tag(s)
        v = [word for word, pos in v if pos == 'JJ']  # JJ = adjective
        v = count(v)  # {'sweet': 1}
        if v:
            knn.train(v, type=p)

print knn.classify('sweet potato burger')
print knn.classify('stupid autocorrect')

First, twitter.search() collects tweet data via the hashtags '#win' and '#fail'. A k-nearest-neighbors (KNN) model is then trained on the adjectives extracted from the tweets. With enough training you end up with a classifier. Only about 15 lines of code; not bad.

Strengths: natural language processing (NLP) and classification.

12. NLTK: Natural Language Toolkit

References:

Installing NLTK
Installing NLTK Data
FAQ
Wiki
API
  NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

Getting started: Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The book is being updated for Python 3 and NLTK 3. (The original Python 2 version is still available at http://nltk.org/book_1ed.)

Installing NLTK:
Install NLTK: run sudo pip install -U nltk
Install Numpy (optional): run sudo pip install -U numpy
Test installation: run python then type import nltk
For older versions of Python it might be necessary to install setuptools (see http://pypi.python.org/pypi/setuptools) and to install pip (sudo easy_install pip).

Installing the NLTK corpora:
Features such as tagging need to access data, so the NLTK data packages must be downloaded.

For central installation on a multi-user machine, do the following from an administrator account.

Run the Python interpreter and type the commands:

import nltk
nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix). Next, select the packages or collections you want to download.

Typing nltk.download() pops up a window for selection; choosing "book" installs all of the corpora and packages:

Some simple things you can do with NLTK:
Tokenize and tag some text:

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

tokens = nltk.word_tokenize(sentence)
tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

tagged = nltk.pos_tag(tokens)
tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
Identify named entities:

entities = nltk.chunk.ne_chunk(tagged)
entities
Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'),
('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'),
Tree('PERSON', [('Arthur', 'NNP')]),
('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
('very', 'RB'), ('good', 'JJ'), ('.', '.')])
Display a parse tree:

from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()


Reposted from blog.csdn.net/pengshengli/article/details/84951837