Getting Started with NLTK

I needed to process English text, so I turned to Python's nltk package.

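import nltk

# Read the whole file into a single string
with open(r"D:\Postgraduate\Python\Python爬取美国商标局专利\s_exp.txt") as f:
    text = f.read()
sentences = nltk.sent_tokenize(text)  # split the text into sentences
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # split each sentence into tokens
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]  # tag each token with its part of speech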

The steps, in order, are:

1. sentence splitting; 2. tokenization; 3. part-of-speech (POS) tagging

Then comes step 4, named entity recognition:

for sent in tagged_sentences:
    print(nltk.ne_chunk(sent))
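
nltk.ne_chunk returns an nltk Tree in which each recognized entity is a subtree labeled with its type (for example PERSON or ORGANIZATION), while ordinary tokens stay as plain (word, tag) tuples. Below is a minimal sketch of pulling the entities out of such a tree. The helper name extract_entities and the hand-built sample tree are only illustrative assumptions; the sample is not real ne_chunk output.

```python
from nltk.tree import Tree

def extract_entities(chunked):
    """Collect (entity text, entity type) pairs from a chunked sentence."""
    entities = []
    for node in chunked:
        # Entity chunks are subtrees; non-entity tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built stand-in for the result of nltk.ne_chunk(sent) (hypothetical).
sample = Tree("S", [
    Tree("PERSON", [("Rami", "NNP")]),
    ("is", "VBZ"),
    ("studying", "VBG"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Stony", "NNP"), ("Brook", "NNP"), ("University", "NNP")]),
])

entities = extract_entities(sample)
print(entities)
```

In the loop above, extract_entities(nltk.ne_chunk(sent)) could be used in place of the print call.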

Of course, the POS tagging and named entity recognition steps can also be done with Stanford's POS tagger and NER libraries:

>>> from nltk.tag.stanford import StanfordPOSTagger
>>> stan_tagger = StanfordPOSTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\models\english-bidirectional-distsim.tagger', r'D:\Postgraduate\Python\Python自然语言处理\stanford-postagger-full-2018-02-27\stanford-postagger-full-2018-02-27\stanford-postagger.jar')

Warning (from warnings module):
  File "C:\Program Files\Python36\lib\site-packages\nltk\tag\stanford.py", line 149
    super(StanfordPOSTagger, self).__init__(*args, **kwargs)
DeprecationWarning: 
The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
>>> s = "I was watching TV"
>>> tokens = nltk.word_tokenize(s)
>>> stan_tagger.tag(tokens)
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
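
The tagger returns a plain list of (word, tag) pairs, so ordinary Python is enough to work with it. As a small sketch, the snippet below reuses the output shown above to pick out the verbs (Penn Treebank verb tags all start with "VB") and to count the tags. The variable names are only illustrative.

```python
from collections import Counter

# Tagger output copied from the session above: a list of (word, tag) pairs.
tagged = [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]

# Penn Treebank verb tags all begin with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith('VB')]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

print(verbs)
print(tag_counts)
```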

Next comes named entity recognition:

from nltk.tag.stanford import StanfordNERTagger
# https://nlp.stanford.edu/software/stanford-ner-2018-02-27.zip
st = StanfordNERTagger(r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\classifiers\english.all.3class.distsim.crf.ser.gz', r'D:\Postgraduate\Python\Python自然语言处理\stanford-ner-2017-06-09\stanford-ner-2017-06-09\stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
>>[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
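
The tagger labels each token separately, so a multi-word name such as "Stony Brook University" comes back as three tagged tokens. One simple way to merge them, sketched here with a hypothetical helper group_entities, is to join runs of adjacent tokens that share the same non-O tag. This is only a heuristic: two different entities of the same type standing next to each other would be merged into one, because this output carries no begin/inside markers.

```python
from itertools import groupby

# Tagger output copied from the example above.
tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share a non-'O' tag into one entity."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != 'O':  # 'O' marks tokens outside any entity
            entities.append((' '.join(word for word, _ in run), label))
    return entities

entities = group_entities(tagged)
print(entities)
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```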

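The results do not seem very good, though: in the tagger output above, for example, NY is labeled O rather than LOCATION.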
