A Summary of What Each Package in NLTK, the Python Natural Language Processing Toolkit, Does

[Reposted] http://www.myexception.cn/perl-python/464414.html

[Original] Python NLP in Practice, Part 1: Environment Setup
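I have recently been learning Python and have read several introductory Python books and books on natural language processing with Python, such as 《Python编程实践》, 《Python基础教程》 (2nd edition), and 《Python自然语言处理》 (reprint edition). I come from a Java background and am used to thinking in an object-oriented way, so when I first looked at Python's syntax it struck me as too casual, awkward, and not quite rigorous. Besides, Python itself is moving toward object orientation (OO). But once I saw how powerful Python's libraries are, and in particular the strong support NLTK provides for natural language processing, I gradually changed my mind. I have to admit that Python is very concise and clear and easy to pick up; anyone with programming experience can quickly write a program to implement a given application. Below are some notes from my own learning, shared with everyone.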

Python NLP in Practice, Part 1: Environment Setup

The software and resources to download and install are:

Python
PyYAML
NLTK
NLTK-Data
NumPy
Matplotlib

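(1) Download addresses and versions:

Python: http://www.python.org/getit/releases/2.7.2/ Version: Python 2.7.2 (Note: the current release is 2.7.3. Python 3.3 has already been released; I chose 2.7 because the 2.x line is more stable and compatible with more third-party software. The Python website advises: if you don't know which version to use, start with 2.7!)
PyYAML: http://pypi.python.org/pypi/PyYAML/ Version: PyYAML 3.10 Purpose: YAML parser
NLTK: http://www.nltk.org Version: nltk-2.0.1 Purpose: natural language toolkit
NumPy: http://pypi.python.org/pypi/numpy Version: numpy 1.6.1 Purpose: multidimensional arrays and linear algebra
Matplotlib: http://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.1.0/ Version: matplotlib-1.1.0 Purpose: 2D plotting library for data visualization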

Installation is straightforward in every case; I installed everything on Windows.

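(2) Running Python IDLE
After Python is installed, run IDLE, Python's integrated development environment: Start -> All Programs -> Python 2.7 -> IDLE (Python GUI). A new window opens and displays the following information, which indicates that the installation succeeded.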

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>>
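
The same version information can also be read from inside the interpreter. The following is a minimal sketch that uses only the standard `sys` module; the helper name `interpreter_version` is made up for illustration and is not part of the original article.

```python
import sys


def interpreter_version():
    """Return the running interpreter's version as a 'major.minor.micro' string."""
    major, minor, micro = sys.version_info[:3]
    return "%d.%d.%d" % (major, minor, micro)


# Prints the version of whichever interpreter runs this script.
print("Running Python " + interpreter_version())
```

On the setup described here, the printed version should match the one in the IDLE banner.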

(3) Downloading the NLTK data packages
Next, import the NLTK toolkit and then download the NLTK data.

>>> import nltk
>>> nltk.download()

Note: if importing the NLTK toolkit produces the following message, PyYAML is not installed.

>>> import nltk
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import nltk
  File "C:\Python27\lib\site-packages\nltk\__init__.py", line 107, in <module>
    from yamltags import *
  File "C:\Python27\lib\site-packages\nltk\yamltags.py", line 10, in <module>
    import yaml
ImportError: No module named yaml
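
A missing dependency like the one in the traceback above can be spotted before importing NLTK. The sketch below tries to import each package and reports its version; the helper name `check_packages` is hypothetical, and the code assumes the packages expose a `__version__` attribute (those that do not are reported as "unknown").

```python
import importlib


def check_packages(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    status = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            status[name] = getattr(module, "__version__", "unknown")
    return status


if __name__ == "__main__":
    # Import names of the packages listed in this article ("yaml" is PyYAML's import name).
    report = check_packages(["yaml", "nltk", "numpy", "matplotlib"])
    for name, version in sorted(report.items()):
        print("%-12s %s" % (name, version if version is not None else "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED needs to be installed before continuing.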

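After downloading and installing PyYAML from the address listed in (1), reopen Python IDLE, import NLTK, and run nltk.download(). On my machine a text prompt appeared; the book and some people online say a graphical interface appears. Either should work.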

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------
Downloader>

Choose d) Download: type d, then type l, then press Enter a few times as prompted. The list shows the data packages available for download.

Downloader> d
Download which package (l=list; x=cancel)?
Identifier> l
Packages:
[ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[ ] abc................. Australian Broadcasting Commission 2006
[ ] alpino.............. Alpino Dutch Treebank
[ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
Extraction Systems in Biology)
[ ] brown_tei........... Brown Corpus (TEI XML Version)
[ ] cess_esp............ CESS-ESP Treebank
[ ] chat80.............. Chat-80 Data Files
[ ] brown............... Brown Corpus
[ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[ ] city_database....... City Database
[ ] cess_cat............ CESS-CAT Treebank
[ ] comtrans............ ComTrans Corpus Sample
[ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[ ] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
[ ] dependency_treebank. Dependency Parsed Treebank
[ ] conll2000........... CONLL 2000 Chunking Corpus
Hit Enter to continue:
[ ] floresta............ Portuguese Treebank
[ ] names............... Names Corpus, Version 1.3 (1994-03-29)
[ ] gazetteers.......... Gazeteer Lists
[ ] genesis............. Genesis Corpus
[ ] gutenberg........... Project Gutenberg Selections
[ ] inaugural........... C-Span Inaugural Address Corpus
[ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
ChaSen format)
[ ] movie_reviews....... Sentiment Polarity Dataset Version 2.0
[ ] ieer................ NIST IE-ER DATA SAMPLE
[ ] nombank.1.0......... NomBank Corpus 1.0
[ ] indian.............. Indian Language POS-Tagged Corpus
[ ] paradigms........... Paradigm Corpus
[ ] kimmo............... PC-KIMMO Data Files
[ ] knbc................ KNB Corpus (Annotated blog corpus)
[ ] langid.............. Language Id Corpus
[ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
part-of-speech tags
[ ] machado............. Machado de Assis -- Obra Completa
[ ] pe08................ Cross-Framework and Cross-Domain Parser
Evaluation Shared Task
Hit Enter to continue:
[ ] pl196x.............. Polish language of the XX century sixties
[ ] pil................. The Patient Information Leaflet (PIL) Corpus
[ ] nps_chat............ NPS Chat
[ ] reuters............. The Reuters-21578 benchmark corpus, ApteMod
version
[ ] qc.................. Experimental Data for Question Classification
[ ] rte................. PASCAL RTE Challenges 1, 2, and 3
[ ] ppattach............ Prepositional Phrase Attachment Corpus
[ ] propbank............ Proposition Bank Corpus 1.0
[ ] problem_reports..... Problem Report Corpus
[ ] sinica_treebank..... Sinica Treebank Corpus Sample
[ ] verbnet............. VerbNet Lexicon, Version 2.1
[ ] state_union......... C-Span State of the Union Address Corpus
[ ] semcor.............. SemCor 3.0
[ ] senseval............ SENSEVAL 2 Corpus: Sense Tagged Text
[ ] smultron............ SMULTRON Corpus Sample
[ ] shakespeare......... Shakespeare XML Corpus Sample
[ ] stopwords........... Stopwords Corpus
[ ] swadesh............. Swadesh Wordlists
[ ] switchboard......... Switchboard Corpus Sample
[ ] toolbox............. Toolbox Sample Files
Hit Enter to continue:
[ ] unicode_samples..... Unicode Samples
[ ] webtext............. Web Text Corpus
[ ] timit............... TIMIT Corpus Sample
[ ] ycoe................ York-Toronto-Helsinki Parsed Corpus of Old
English Prose
[ ] treebank............ Penn Treebank Sample
[ ] udhr................ Universal Declaration of Human Rights Corpus
[ ] sample_grammars..... Sample Grammars
[ ] book_grammars....... Grammars from NLTK Book
[ ] spanish_grammars.... Grammars for Spanish
[ ] wordnet............. WordNet
[ ] wordnet_ic.......... WordNet-InfoContent
[ ] words............... Word Lists
[ ] tagsets............. Help on Tagsets
[ ] basque_grammars..... Grammars for Basque
[ ] large_grammars...... Large context-free and feature-based grammars
for parser comparison
[ ] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
[ ] rslp................ RSLP Stemmer (Removedor de Sufixos da Lingua
Portuguesa)
[ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
Hit Enter to continue:
[ ] punkt............... Punkt Tokenizer Models
Collections:
[ ] all-corpora......... All the corpora
[ ] all................. All packages
[ ] book................ Everything used in the NLTK Book
([*] marks installed packages)
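
The identifiers in the listing above can also be passed straight to `nltk.download()`, which skips the interactive prompt. This is a minimal sketch and not part of the original article: it assumes `nltk.download` accepts an identifier as its first argument and returns a truthy value on success, as it does in recent NLTK releases. The helper name `fetch_nltk_packages` and its `download` parameter are hypothetical; the parameter exists only so the function can be exercised without NLTK or a network connection.

```python
def fetch_nltk_packages(identifiers, download=None):
    """Download each package or collection; return a dict of identifier -> success flag."""
    if download is None:
        # Imported lazily so the helper can be defined even when NLTK is missing.
        import nltk
        download = nltk.download
    results = {}
    for identifier in identifiers:
        # Assumption: the download call returns a truthy value when it succeeds.
        results[identifier] = bool(download(identifier))
    return results


# Example call (requires NLTK and a network connection):
# fetch_nltk_packages(["book"])
```

Passing "all" instead of "book" would fetch every package, which takes considerably longer.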

You can type all-corpora, all, or book; I chose all. Keep your network connection up, as the download may take a while. The following information is displayed:
