gensim

I. Building a Dictionary

When building a dictionary, low-frequency words are usually filtered out. You can use a defaultdict to count word frequencies, and drop stopwords and low-frequency words while tokenizing.
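A minimal sketch of that counting step, assuming corpus is a list of tokenized documents like the one built below (the "appears more than once" threshold is illustrative; the example in this section keeps all tokens):

from collections import defaultdict

frequency = defaultdict(int)
for line in corpus:
    for word in line:
        frequency[word] += 1

# drop words that occur only once across the whole corpus
corpus = [[word for word in line if frequency[word] > 1] for line in corpus]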

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system"]

corpus = [text.lower().split() for text in documents]

stopwords = "for a of the and to in".split()

corpus = [[word for word in line if word not in stopwords] for line in corpus]

corpus
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system']]

dictionary = corpora.Dictionary(corpus)

print(dictionary.token2id)
{'abc': 0, 'applications': 1, 'computer': 2, 'human': 3, 'interface': 4, 'lab': 5, 'machine': 6, 'opinion': 7, 'response': 8, 'survey': 9, 'system': 10, 'time': 11, 'user': 12, 'eps': 13, 'management': 14}
# returns a mapping that assigns a unique id to each distinct word

test = "human computer system mmm".split()

dictionary.doc2bow(test)
[(2, 1), (3, 1), (10, 1)]
# bag-of-words: counts how often the word behind each id occurs. This is a sparse
# vector; every id not listed is implicitly 0, and words missing from the
# dictionary (here 'mmm') are simply dropped.

dictionary.doc2idx(test)
[3, 2, 10, -1]
# maps each word, in order, to its id; words not in the dictionary get -1

dictionary.dfs
{0: 1, 1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 1, 12: 2, 13: 1, 14: 1}
# document frequencies: for each token id, the number of documents it appears in
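To read those counts by token instead of id, you can map back through the dictionary (dictionary[token_id] returns the token):

{dictionary[token_id]: df for token_id, df in dictionary.dfs.items()}
# e.g. {'abc': 1, 'computer': 2, 'interface': 2, 'system': 2, 'user': 2, ...}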

After building the dictionary, several filtering operations are available.

filter_tokens(self, bad_ids=None, good_ids=None)
Removes the tokens whose ids are in bad_ids from the dictionary; if good_ids is given, keeps only those ids.

filter_extremes(self, no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)
1. drop tokens that appear in fewer than no_below documents (an absolute count)
2. drop tokens that appear in more than no_above of all documents (a fraction, e.g. 0.5 = 50%)
3. after (1) and (2), keep only the keep_n most frequent of the remaining tokens

dictionary.filter_n_most_frequent(N)
Removes the N most frequent tokens from the dictionary.

After using the filter methods, it is common to call dictionary.compactify() to remove the gaps left in the id sequence.
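A short sketch of these filters on the dictionary built above (the thresholds are illustrative; with only 3 documents, the defaults of filter_extremes would remove almost everything):

import copy

d = copy.deepcopy(dictionary)
# keep tokens that occur in at least 2 documents and in at most 80% of them
d.filter_extremes(no_below=2, no_above=0.8)
d.compactify()    # re-assign ids 0..len(d)-1 so there are no gaps
print(d.token2id)
# e.g. {'computer': 0, 'interface': 1, 'system': 2, 'user': 3}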



II. Saving and Loading

1. Saving and loading a corpus
First convert the token lists into bag-of-words vectors, then serialize:
bow_corpus = [dictionary.doc2bow(text) for text in corpus]
corpora.MmCorpus.serialize('mycorpus.mm', bow_corpus)

read = corpora.MmCorpus("mycorpus.mm")
MmCorpus(3 documents, 15 features, 19 non-zero entries)

print(list(read))
[[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0)], [(2, 1.0), (7, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0), (12, 1.0)], [(4, 1.0), (10, 1.0), (12, 1.0), (13, 1.0), (14, 1.0)]]
2. Saving and loading a dictionary
dictionary.save("my.dict")
dictionary = corpora.Dictionary.load("my.dict")

A corpus and a dictionary are different things: the corpus holds the BOW of each document and can be converted directly into document vectors, somewhat like sklearn's CountVectorizer(); the dictionary only maps tokens to ids.
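If you actually want the dense document-term matrix that CountVectorizer would produce, gensim.matutils can build it from the BOW corpus (a sketch; corpus2dense returns a terms x documents array, so it is transposed here):

from gensim import matutils

dense = matutils.corpus2dense(bow_corpus, num_terms=len(dictionary)).T
print(dense.shape)
# (3, 15) -- 3 documents, 15 tokens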



III. Building Text Models

1. TF-IDF
from gensim import models

tfidf = models.TfidfModel(bow_corpus)
# applying the model to a corpus yields a lazily evaluated stream of TF-IDF vectors
for i in tfidf[bow_corpus]:
    print(i)
    
[(0, 0.4355066251613605), (1, 0.4355066251613605), (2, 0.16073253746956623), (3, 0.4355066251613605), (4, 0.16073253746956623), (5, 0.4355066251613605), (6, 0.4355066251613605)]
[(2, 0.17577487118585033), (7, 0.47626399821390897), (8, 0.47626399821390897), (9, 0.47626399821390897), (10, 0.17577487118585033), (11, 0.47626399821390897), (12, 0.17577487118585033)]
[(4, 0.23780622519852498), (10, 0.23780622519852498), (12, 0.23780622519852498), (13, 0.6443386523290703), (14, 0.6443386523290703)]
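The fitted model can also score a new document's BOW vector; words outside the dictionary are silently dropped (an illustrative query):

tfidf[dictionary.doc2bow("human computer system".split())]
# roughly [(2, 0.33), (3, 0.89), (10, 0.33)] -- the rare word 'human' gets the highest weight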
2. LSI
model = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)
model.print_topics(2)

[(0,
  '0.402*"eps" + 0.402*"management" + 0.287*"opinion" + 0.287*"response" + 0.287*"time" + 0.287*"survey" + 0.254*"system" + 0.254*"user" + 0.211*"interface" + 0.170*"lab"'),
 (1,
  '-0.399*"machine" + -0.399*"human" + -0.399*"lab" + -0.399*"abc" + -0.399*"applications" + 0.174*"response" + 0.174*"time" + 0.174*"survey" + 0.174*"opinion" + 0.142*"eps"')]

https://radimrehurek.com/gensim/tut2.html

Reposted from blog.csdn.net/weixin_42231070/article/details/84994717