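[Repost] http://www.myexception.cn/perl-python/464414.html
[Original] Python NLP in Practice, Part 1: Environment Setup
I have recently been learning Python and have read several introductory Python books and books on natural language processing with Python, including 《Python编程实践》 (Python Programming in Practice), 《Python基础教程》 (Beginning Python, 2nd edition), and 《Python自然语言处理》 (Natural Language Processing with Python, reprint edition). Because my background is in Java and I am used to thinking in object-oriented terms, Python's syntax at first struck me as too loose, awkward, and somewhat informal. Python itself is also moving closer to object orientation (OO). Later, however, I saw how powerful Python's libraries are, in particular the strong support NLTK provides for natural language processing, and I gradually changed my mind. I have to admit that Python is concise and clear and easy to pick up; anyone with programming experience can quickly write a program for a given application. Below are some notes from my own learning, shared here.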
The software and resources to download and install are:
- Python
- PyYAML
- NLTK
- NLTK-Data
- NumPy
- Matplotlib
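(1) Download locations and versions:
- Python: http://www.python.org/getit/releases/2.7.2/ Version: Python 2.7.2 (Note: the current release is 2.7.3. Python 3.3 has already been released; I chose 2.7 because 2.x is more stable and compatible with more third-party software. The Python website advises: if you don't know which version to use, start with 2.7!)
- PyYAML: http://pypi.python.org/pypi/PyYAML/ Version: PyYAML 3.10 Purpose: YAML parser
- NLTK: http://www.nltk.org Version: nltk-2.0.1 Purpose: natural language toolkit
- NumPy: http://pypi.python.org/pypi/numpy Version: numpy 1.6.1 Purpose: multidimensional arrays and linear algebra
- Matplotlib: http://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.1.0/ Version: matplotlib-1.1.0 Purpose: 2D plotting library for data visualization
Installation is straightforward for all of them; I installed everything on Windows.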
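Once everything is installed, a quick way to confirm that each package can be imported is a short script like the one below. This is only a minimal sketch: the helper name check_modules and the list REQUIRED_MODULES are my own illustrative names, not part of any of these packages, and it assumes each package exposes a __version__ attribute (it reports "unknown" when one does not).

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import importlib

# Import names of the third-party packages listed above.
# Note that PyYAML is imported under the name "yaml".
REQUIRED_MODULES = ["yaml", "nltk", "numpy", "matplotlib"]


def check_modules(module_names):
    """Map each module name to its version string, to "unknown" if it
    imports but has no __version__, or to None if the import fails."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in sorted(check_modules(REQUIRED_MODULES).items()):
        status = version if version is not None else "NOT INSTALLED"
        print("%-12s %s" % (name, status))
```

A package shown as NOT INSTALLED either is missing or failed to import because one of its own dependencies is missing, so it is worth installing the missing ones and running the check again.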
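(2) Running Python IDLE
Once Python is installed, launch IDLE, Python's integrated development environment: Start -> All Programs -> Python 2.7 -> IDLE (Python GUI). A new window opens; if it shows the following, the installation succeeded.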
- Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
- Type "copyright", "credits" or "license()" for more information.
- >>>
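The version and build information in this banner can also be read from inside Python, which is handy when several Python versions are installed side by side. The following is a small sketch using only the standard library; describe_interpreter is a hypothetical helper name of my own.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import platform
import sys


def describe_interpreter():
    """Return ((major, minor, micro), bits) for the running interpreter."""
    version = tuple(sys.version_info[:3])
    bits = platform.architecture()[0]  # e.g. "32bit" or "64bit"
    return version, bits


version, bits = describe_interpreter()
print("Python %d.%d.%d (%s)" % (version + (bits,)))
if version[:2] != (2, 7):
    # This walkthrough was written against Python 2.7.
    print("Note: this interpreter is not Python 2.7.")
```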
(3) Downloading the NLTK data packages
Next, import the NLTK toolkit and then download the NLTK data.
- >>> import nltk
- >>> nltk.download()
Note: if importing NLTK produces the following output, PyYAML is not installed.
- >>> import nltk
- Traceback (most recent call last):
- File "<pyshell#0>", line 1, in <module>
- import nltk
- File "C:\Python27\lib\site-packages\nltk\__init__.py", line 107, in <module>
- from yamltags import *
- File "C:\Python27\lib\site-packages\nltk\yamltags.py", line 10, in <module>
- import yaml
- ImportError: No module named yaml
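After downloading and installing PyYAML from the address listed in the first section, reopen Python IDLE, import NLTK, and run nltk.download(). On my machine the downloader appeared as a text prompt; the book and some people online describe a graphical interface. Either one works.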
- Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
- Type "copyright", "credits" or "license()" for more information.
- >>> import nltk
- >>> nltk.download()
- NLTK Downloader
- ---------------------------------------
- d) Download l) List u) Update c) Config h) Help q) Quit
- ---------------------------------------
- Downloader>
Choose d) Download: type d, then type l, and press Enter a few times as prompted. This lists the data packages available for download.
- Downloader> d
- Download which package (l=list; x=cancel)?
- Identifier> l
- Packages:
- [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
- [ ] abc................. Australian Broadcasting Commission 2006
- [ ] alpino.............. Alpino Dutch Treebank
- [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
- Extraction Systems in Biology)
- [ ] brown_tei........... Brown Corpus (TEI XML Version)
- [ ] cess_esp............ CESS-ESP Treebank
- [ ] chat80.............. Chat-80 Data Files
- [ ] brown............... Brown Corpus
- [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
- [ ] city_database....... City Database
- [ ] cess_cat............ CESS-CAT Treebank
- [ ] comtrans............ ComTrans Corpus Sample
- [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
- [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
- and Basque Subset)
- [ ] europarl_raw........ Sample European Parliament Proceedings Parallel
- Corpus
- [ ] dependency_treebank. Dependency Parsed Treebank
- [ ] conll2000........... CONLL 2000 Chunking Corpus
- Hit Enter to continue:
- [ ] floresta............ Portuguese Treebank
- [ ] names............... Names Corpus, Version 1.3 (1994-03-29)
- [ ] gazetteers.......... Gazeteer Lists
- [ ] genesis............. Genesis Corpus
- [ ] gutenberg........... Project Gutenberg Selections
- [ ] inaugural........... C-Span Inaugural Address Corpus
- [ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
- ChaSen format)
- [ ] movie_reviews....... Sentiment Polarity Dataset Version 2.0
- [ ] ieer................ NIST IE-ER DATA SAMPLE
- [ ] nombank.1.0......... NomBank Corpus 1.0
- [ ] indian.............. Indian Language POS-Tagged Corpus
- [ ] paradigms........... Paradigm Corpus
- [ ] kimmo............... PC-KIMMO Data Files
- [ ] knbc................ KNB Corpus (Annotated blog corpus)
- [ ] langid.............. Language Id Corpus
- [ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
- part-of-speech tags
- [ ] machado............. Machado de Assis -- Obra Completa
- [ ] pe08................ Cross-Framework and Cross-Domain Parser
- Evaluation Shared Task
- Hit Enter to continue:
- [ ] pl196x.............. Polish language of the XX century sixties
- [ ] pil................. The Patient Information Leaflet (PIL) Corpus
- [ ] nps_chat............ NPS Chat
- [ ] reuters............. The Reuters-21578 benchmark corpus, ApteMod
- version
- [ ] qc.................. Experimental Data for Question Classification
- [ ] rte................. PASCAL RTE Challenges 1, 2, and 3
- [ ] ppattach............ Prepositional Phrase Attachment Corpus
- [ ] propbank............ Proposition Bank Corpus 1.0
- [ ] problem_reports..... Problem Report Corpus
- [ ] sinica_treebank..... Sinica Treebank Corpus Sample
- [ ] verbnet............. VerbNet Lexicon, Version 2.1
- [ ] state_union......... C-Span State of the Union Address Corpus
- [ ] semcor.............. SemCor 3.0
- [ ] senseval............ SENSEVAL 2 Corpus: Sense Tagged Text
- [ ] smultron............ SMULTRON Corpus Sample
- [ ] shakespeare......... Shakespeare XML Corpus Sample
- [ ] stopwords........... Stopwords Corpus
- [ ] swadesh............. Swadesh Wordlists
- [ ] switchboard......... Switchboard Corpus Sample
- [ ] toolbox............. Toolbox Sample Files
- Hit Enter to continue:
- [ ] unicode_samples..... Unicode Samples
- [ ] webtext............. Web Text Corpus
- [ ] timit............... TIMIT Corpus Sample
- [ ] ycoe................ York-Toronto-Helsinki Parsed Corpus of Old
- English Prose
- [ ] treebank............ Penn Treebank Sample
- [ ] udhr................ Universal Declaration of Human Rights Corpus
- [ ] sample_grammars..... Sample Grammars
- [ ] book_grammars....... Grammars from NLTK Book
- [ ] spanish_grammars.... Grammars for Spanish
- [ ] wordnet............. WordNet
- [ ] wordnet_ic.......... WordNet-InfoContent
- [ ] words............... Word Lists
- [ ] tagsets............. Help on Tagsets
- [ ] basque_grammars..... Grammars for Basque
- [ ] large_grammars...... Large context-free and feature-based grammars
- for parser comparison
- [ ] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
- [ ] rslp................ RSLP Stemmer (Removedor de Sufixos da Lingua
- Portuguesa)
- [ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
- Hit Enter to continue:
- [ ] punkt............... Punkt Tokenizer Models
- Collections:
- [ ] all-corpora......... All the corpora
- [ ] all................. All packages
- [ ] book................ Everything used in the NLTK Book
- ([*] marks installed packages)
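The identifiers in this listing can also be passed straight to nltk.download(), which skips the interactive prompt. Below is a minimal sketch of that idea. The function download_packages and its download parameter are hypothetical names of my own, added so the helper can be tried without a network connection; the sketch also assumes that nltk.download returns a true value on success, which is how recent NLTK releases behave and which I have not verified against nltk-2.0.1.

```python
from __future__ import print_function  # keeps print() working on Python 2.7


def download_packages(identifiers, download=None):
    """Download each NLTK data package or collection by identifier.

    `download` is a hypothetical injection point; when it is None the
    real nltk.download is used. Returns the identifiers whose download
    call returned a false value.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be defined without NLTK
        download = nltk.download
    failed = []
    for identifier in identifiers:
        print("Downloading %s ..." % identifier)
        if not download(identifier):
            failed.append(identifier)
    return failed
```

For example, download_packages(["book"]) would fetch the collection used by the NLTK Book, and download_packages(["punkt", "stopwords"]) would fetch just those two packages.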
You can enter all-corpora, all, or book; I chose all. Keep the network connection up, since the download may take a while. The output looks like this: