Natural Language Processing with Python, Chapter 3

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/hangzuxi8764/article/details/72903079

Goal: finish "Natural Language Processing with Python" before the summer break.

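Goals of this chapter:

1. Get files from the local disk and from the web

2. Split a document into individual words and punctuation marks

3. Save the results of text analysis to a file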

from __future__ import division
import nltk, re, pprint

1. Getting an e-book from the web

from urllib import urlopen
url="http://www.gutenberg.org/files/2554/2554.txt"
raw=urlopen(url).read()

Inspect raw: its type, its length, and its first 75 characters

print type(raw)
print len(raw)
print raw[:75]
<type 'str'>
1176896
The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
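
The snippet above is Python 2. On Python 3, `urlopen` lives in `urllib.request` and `read()` returns bytes that have to be decoded, so the download step could be sketched roughly as below. The helper name `fetch_text` and the UTF-8 default are assumptions made for illustration, not something from the book or this post.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as a str.

    fetch_text is a hypothetical helper; the encoding default is an assumption.
    """
    with urlopen(url) as response:
        data = response.read()  # bytes in Python 3
    return data.decode(encoding)


# Usage (needs network access):
# raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
# print(type(raw), len(raw), raw[:75])
```

Decoding explicitly makes the str/bytes distinction visible, which matters later when the text is tokenized.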

2. Tokenization

Split raw into words and punctuation marks and store them in a list

tokens=nltk.word_tokenize(raw)
print type(tokens)
print len(tokens)
print tokens[:12]
<type 'list'>
254352
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky']

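3. Reading a local file

# raw string, so the backslashes in the Windows path are never treated as escapes
f=open(r'E:\python2\Harry Potter1.txt')
raw=f.read()
f.close()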
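
A slightly more defensive way to read a local file is a `with` block, which closes the file even if reading fails. The sketch below is written for Python 3 and creates a small temporary file so that it runs anywhere; the file name and the sample text are placeholders, not the author's actual file.

```python
import os
import tempfile

# Create a small stand-in file (placeholder for a real local text file).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Mr. and Mrs. Dursley, of number four, Privet Drive\n")

# Read it back; the file is closed automatically when the block exits.
with open(path, encoding="utf-8") as f:
    raw_local = f.read()

print(len(raw_local))
print(raw_local[:12])
```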

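II. Strings

Use the in operator to test whether a string contains a given substring. It returns a Boolean and is case-sensitive.

Use find() to get the position of a substring within a string. Spaces count as characters, and find() returns -1 when the substring is not present.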

phrase='Harry Potter and the Sorcerer\'s Stone'
print 'Potter' in phrase
print 'potter' in phrase
print phrase.find('Potter')
True
False
6
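
To make the case-sensitivity and the "not found" behaviour concrete, here is a small Python 3 sketch using the same phrase. The variable names are made up for this example.

```python
phrase = "Harry Potter and the Sorcerer's Stone"

# `in` is case-sensitive, so lower-case the text for a case-insensitive test.
has_exact = "potter" in phrase
has_any_case = "potter" in phrase.lower()

# find() returns the index of the first match, or -1 when there is none.
pos_found = phrase.find("Potter")
pos_missing = phrase.find("potter")

print(has_exact, has_any_case, pos_found, pos_missing)
# prints: False True 6 -1
```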

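Unicode

Python's codecs module provides functions for reading encoded data into Unicode strings and for writing Unicode strings out in encoded form.
import codecs
f=codecs.open(path,encoding='utf-8')  # path: the file to read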
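
As a rough illustration of both directions (writing a Unicode string out in encoded form and reading it back), the following Python 3 sketch does a UTF-8 round trip through a temporary file. The sample text and the file name are invented for this example.

```python
import codecs
import os
import tempfile

text = "café, naïve, 自然语言处理"  # sample Unicode text (made up for this sketch)
path = os.path.join(tempfile.mkdtemp(), "unicode_sample.txt")  # placeholder path

# Write the Unicode string out as UTF-8 encoded bytes.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back: codecs.open decodes the bytes into a Unicode string.
with codecs.open(path, encoding="utf-8") as f:
    decoded = f.read()

# The same file opened in binary mode shows the raw encoded bytes.
with open(path, "rb") as f:
    encoded = f.read()

print(decoded == text)
print(len(decoded), len(encoded))
```

The byte count is larger than the character count because each non-ASCII character takes more than one byte in UTF-8.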

Regular expressions

re.findall() finds all substrings that match the specified regular expression

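import re
# Note: there is no space before the first line-continuation backslash, so
# "perfectly" and "normal," are joined into 'perfectlynormal,' in the output.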
sentence="Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly\
normal, thank you very much. They were the last people you'd expect to be involved in anything \
strange or mysterious, because they just didn't hold with such nonsense."
sentence_raw=sentence.split()
print sentence_raw
print len(sentence_raw)
['Mr.', 'and', 'Mrs.', 'Dursley,', 'of', 'number', 'four,', 'Privet', 'Drive,', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectlynormal,', 'thank', 'you', 'very', 'much.', 'They', 'were', 'the', 'last', 'people', "you'd", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious,', 'because', 'they', 'just', "didn't", 'hold', 'with', 'such', 'nonsense.']
44

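Tokenizing text with regular expressions

\w matches a word character, equivalent to [a-zA-Z0-9_]

\W matches any character other than a letter, digit, or underscore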
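
A quick way to see the two classes side by side is to run both over the same string. This is a Python 3 sketch with an invented sample sentence. Note that in Python 3, `\w` also matches non-ASCII letters unless `re.ASCII` is passed, so the `[a-zA-Z0-9_]` equivalence is exact only for ASCII text like this one.

```python
import re

sample = "they just didn't hold_with 2 such things."

# \w+ grabs runs of letters, digits and underscores.
word_runs = re.findall(r"\w+", sample)

# \W+ grabs the runs in between: spaces, apostrophes, punctuation.
non_word_runs = re.findall(r"\W+", sample)

print(word_runs)
print(non_word_runs)
```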

# splitting on runs of non-word characters also drops the punctuation
sentence_re=re.split(r'\W+',sentence)
print sentence_re
print len(sentence_re)
['Mr', 'and', 'Mrs', 'Dursley', 'of', 'number', 'four', 'Privet', 'Drive', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectlynormal', 'thank', 'you', 'very', 'much', 'They', 'were', 'the', 'last', 'people', 'you', 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', 'didn', 't', 'hold', 'with', 'such', 'nonsense', '']
47
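
The trailing empty string in the result comes from the final period: `re.split` splits on it and leaves nothing after it. One possible way around this, sketched here in Python 3, is either to filter out empty strings or to use `re.findall` with a pattern that also keeps punctuation as tokens. The pattern below is an assumption chosen for illustration; it is not the tokenizer NLTK uses.

```python
import re

sentence = "They just didn't hold with such nonsense."

# re.split leaves an empty string when the text ends with a non-word character.
split_tokens = re.split(r"\W+", sentence)

# Filtering out empty strings removes that artifact.
clean_tokens = [t for t in split_tokens if t]

# re.findall with an alternation keeps punctuation as separate tokens:
# \w+ matches a word, [^\w\s] matches one punctuation character.
findall_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(clean_tokens)
print(findall_tokens)
```

Both variants still break "didn't" apart at the apostrophe, which is one reason a dedicated tokenizer such as `nltk.word_tokenize` is usually preferable.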

Reposted from blog.csdn.net/hangzuxi8764/article/details/72903079