A look at list and generator usage through a short NLP program

Copyright notice: This is an original post by the author; reposting without permission is prohibited. https://blog.csdn.net/u012505432/article/details/51679220

While studying natural language processing with Python, I ran into this exercise about hyponyms:

What percentage of noun synsets have no hyponyms? You can get all noun synsets using wn.all_synsets('n').

The code I wrote at the time looked like this:

import nltk
from nltk.corpus import wordnet as wn

all_noun = wn.all_synsets('n')
print(all_noun)
print(wn.all_synsets('n'))
all_num = len(set(all_noun))
noun_have_hypon = [word for word in wn.all_synsets('n') if len(word.hyponyms()) >= 1]
noun_have_num = len(noun_have_hypon)
print('There are %d noun synsets, %d of them have hyponyms, and the percentage without hyponyms is %f' % (all_num, noun_have_num, (all_num-noun_have_num)/all_num*100))

Running it produced:

<generator object all_synsets at 0x10927b1b0>
<generator object all_synsets at 0x10e6f0bd0>
There are 82115 noun synsets, 16693 of them have hyponyms, and the percentage without hyponyms is 79.671193

I then replaced the line noun_have_hypon = [word for word in wn.all_synsets('n') if len(word.hyponyms()) >= 1] with noun_have_hypon = [word for word in all_noun if len(word.hyponyms()) >= 1], and the output became:

<generator object all_synsets at 0x10917b1b0>
<generator object all_synsets at 0x10e46aab0>
There are 82115 noun synsets, 0 of them have hyponyms, and the percentage without hyponyms is 100.000000

As a Python beginner I had no idea what was going on, so I asked on Stack Overflow and received a very detailed answer.
It turned out my problem had little to do with NLP; it was really about how generators behave in Python.
Before getting into generators, let's first look at when to use a generator rather than a list.

Use list comprehensions when the result needs to be iterated over multiple times, or where speed is paramount. Use generator expressions where the range is large or infinite.

In other words: when the data needs to be iterated more than once, build a list; when the data set is very large or even infinite, use a generator.
Lists and generators are also written differently when created:
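The quoted advice can be checked directly: a list occupies memory proportional to its length, while a generator object's size stays constant, and a generator can even stand for an infinite sequence. A quick sketch using sys.getsizeof (exact byte counts vary by Python version):

```python
import sys
from itertools import count, islice

# A list comprehension materializes every element up front,
# while a generator expression only stores its iteration state.
squares_list = [x * x for x in range(1_000_000)]
squares_gen = (x * x for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a couple hundred bytes, regardless of range size

# A generator can represent an infinite sequence, which no list could hold:
first_five = list(islice((x * x for x in count()), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```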

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...    print(i)
0
1
4
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...    print(i)
0
1
4

Notice that the list comprehension is built with [], while the generator expression uses ().
Another key point: a generator can only be consumed once.
Try running the for i in mygenerator loop above a second time: no values come out. The generator computed 0 and discarded it, computed 1 and discarded it, and finally produced 4; after that, mygenerator is exhausted.
In addition, a generator supports neither indexing nor slicing, and a generator cannot be concatenated with a list. Here is another example from Stack Overflow:
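The one-shot behaviour is easy to demonstrate in a couple of lines:

```python
mygenerator = (x * x for x in range(3))
first_pass = list(mygenerator)
second_pass = list(mygenerator)   # already exhausted by the first pass
print(first_pass)    # [0, 1, 4]
print(second_pass)   # []

# A list, by contrast, can be traversed any number of times:
mylist = [x * x for x in range(3)]
print(list(mylist) == list(mylist))  # True
```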

def gen():
    return (something for something in get_some_stuff())

print(gen()[:2])      # TypeError: 'generator' object is not subscriptable
print([5,6] + gen())  # TypeError: can only concatenate list (not "generator") to list
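Both lines raise a TypeError. If you do need indexing or slicing, convert the generator to a list first, or take a lazy slice with itertools.islice. A sketch of both options (get_some_stuff is not defined in the snippet above, so a concrete range stands in for it here):

```python
from itertools import islice

def gen():
    return (x * x for x in range(10))

# Option 1: materialize the generator into a list, which supports everything.
as_list = list(gen())
print(as_list[:2])           # [0, 1]
print([5, 6] + as_list[:2])  # [5, 6, 0, 1]

# Option 2: islice takes a slice lazily, without building the whole list.
print(list(islice(gen(), 2)))  # [0, 1]
```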

So how do you return a generator from a function? This is where the yield statement comes in. yield works much like return:

>>> def createGenerator():
...    mylist = range(3)
...    for i in mylist:
...        yield i*i
...
>>> mygenerator = createGenerator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)
0
1
4

If instead you wanted the function above to build and return a list, the code would be:

def createList():
    myList = []
    for i in range(3):
        myList.append(i * i)
    return myList

Keep in mind:

To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object.
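A small experiment makes this concrete: a side effect placed in the function body does not happen at call time, only when the generator is first advanced.

```python
ran = []

def createGenerator():
    ran.append('body ran')   # marks the moment the body actually executes
    for i in range(3):
        yield i * i

g = createGenerator()  # calling the function does NOT run the body
print(ran)             # [] -- nothing has executed yet
print(next(g))         # 0 -- next() runs the body up to the first yield
print(ran)             # ['body ran']
```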

Finally, the program from the beginning can also be written with the filter() function:

import nltk
from nltk.corpus import wordnet as wn

all_noun = wn.all_synsets('n')
all_noun_num = len(set(all_noun))
# keep only the synsets that have no hyponyms
noun_no_hypon = filter(lambda ss: len(ss.hyponyms()) == 0, wn.all_synsets('n'))
noun_no_hypon_num = len(list(noun_no_hypon))
print('There are %d noun synsets, and %d have no hyponyms; the percentage is %f' %
      (all_noun_num, noun_no_hypon_num, noun_no_hypon_num / all_noun_num * 100))

Note that the result of filter() is an iterator, so you cannot call len() on it directly; it has to be converted to a list first.
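The same rules from earlier apply to filter objects: no len(), and one pass only. For example:

```python
nums = filter(lambda n: n % 2 == 0, range(10))
# len(nums) would raise TypeError: object of type 'filter' has no len()
evens = list(nums)
print(len(evens), evens)  # 5 [0, 2, 4, 6, 8]
print(list(nums))         # [] -- the filter object is an iterator, now exhausted
```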
