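《NLTK基础教程》 Reading Notes, Issue 001

(Starting a new series)
Chapter 1 is mostly an introduction plus getting familiar with the environment, so it should not be much trouble. One thing to keep in mind: the book's code is written for Python 2, not Python 3, so there are a few pitfalls you have to work through yourself. The most basic differences between the two versions, such as whether print takes parentheses, are not covered here.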

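Pitfall 1: urllib2
On page 12 of the book, the Python 2 code imports this library. In Python 3 it no longer exists as a separate module; it has been merged into urllib (as urllib.request and urllib.error).
To use the same urlopen call, see this blog post: https://blog.csdn.net/qq_32623363/article/details/78768636

All that is needed is from urllib.request import urlopen.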

Result so far:

48799
Total no of tokens: 2917
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"'...]

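Compared with the book, every token here carries a b prefix. That is because what we read in is actually of type bytes. To get output without the b, as in the book, add a decode call: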

response = urlopen('http://python.org')
html = response.read().decode('utf-8')

With this change, the later regular expression step no longer raises TypeError: cannot use a string pattern on a bytes-like object. Here is the result of the regex split:

6213
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js'...]
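
To make the bytes versus str difference concrete without touching the network, here is a minimal sketch. The raw value is a short made-up stand-in for response.read(), and the r'\W+' pattern is an assumption about the regex used in the book's split step.

```python
import re

# Hypothetical stand-in for response.read(); the real page is much longer.
raw = b'<!doctype html>\n<html class="no-js" lang="en">'

# Splitting bytes gives bytes tokens, hence the b'...' prefix in the output.
byte_tokens = raw.split()

# A str pattern cannot be applied to a bytes object.
try:
    re.split(r'\W+', raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False               # this branch is taken

# After decoding, both the plain split and the regex split work on str.
text = raw.decode('utf-8')
str_tokens = text.split()
word_tokens = re.split(r'\W+', text)

print(byte_tokens[:2])
print(str_tokens[:2])
print(word_tokens[:3])
```

The leading empty string in the regex result appears because the text starts with non-word characters, which matches the '' at the head of the list above.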

Pitfall 2: nltk.clean_html()
Calling this function directly in NLTK 3.3 raises the following error:

NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

According to the blog post https://blog.csdn.net/qq_33394807/article/details/50836226, the fix is to install the BeautifulSoup library and use it instead; see https://cuiqingcai.com/1319.html
Running it as is may produce a warning:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Following the warning's advice, name the parser explicitly by changing the code to

clean = BeautifulSoup(html, 'lxml').get_text()

The result is as follows:

['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||'...]
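
As a small offline illustration of stripping markup with BeautifulSoup, the sketch below uses a made-up HTML snippet rather than the real python.org page. It also passes 'html.parser', the parser bundled with Python, instead of 'lxml', so it does not depend on lxml being installed; that substitution is my own choice, not the one used in these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the downloaded HTML.
html = ('<html><head><title>Welcome to Python.org</title></head>\n'
        '<body><p>Hello <b>world</b></p></body></html>')

# Naming the parser explicitly avoids the "No parser was explicitly
# specified" warning.
clean = BeautifulSoup(html, 'html.parser').get_text()

# Whitespace tokenization of the tag-free text.
tokens = clean.split()
print(tokens)
```

Note that get_text() keeps the text inside script elements too, which is why JSON and JavaScript fragments show up among the tokens in the real output above.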

Note: the tokens used from here on are the ones produced by the last NLTK processing step.
Printing the FreqDist directly should give: <FreqDist with 613 samples and 1119 outcomes>

When printing again, the following error may occur:

UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 0: illegal multibyte sequence

See this blog post: https://blog.csdn.net/jim7424994/article/details/22675759
Adding the following at the top of the code avoids the error:

import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')

After this change, Chinese text may be displayed as garbled characters; the same blog post describes how to adjust the encoding to fix that.
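
The mechanism behind the error can be reproduced without a GBK console. The sketch below wraps in-memory byte buffers (my own stand-ins for sys.stdout.buffer) in io.TextIOWrapper with two different encodings and writes the copyright sign '\xa9' to each.

```python
import io

# In-memory stand-ins for sys.stdout.buffer (an assumption for illustration).
gbk_buffer = io.BytesIO()
utf8_buffer = io.BytesIO()

# A GBK-encoded stream cannot represent '\xa9', so writing it fails.
gbk_out = io.TextIOWrapper(gbk_buffer, encoding='gbk')
try:
    gbk_out.write('\xa9')
    gbk_out.flush()
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True              # this branch is taken

# The same write succeeds once the stream encodes as UTF-8.
utf8_out = io.TextIOWrapper(utf8_buffer, encoding='utf8')
utf8_out.write('\xa9')
utf8_out.flush()
encoded = utf8_buffer.getvalue()

print(gbk_failed, encoded)
```

The bytes written by the UTF-8 stream are only displayed correctly if the console itself interprets them as UTF-8, which is likely why Chinese text turns into garbled characters on a GBK console after the change.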

For stop words, see: http://www.codeforge.cn/read/197290/English_stopwords.txt__html

That is everything for Chapter 1.
