Gensim documentation - Similarity Queries

Original link

http://cloga.info/python/2014/01/28/Gensim_Similarity_Queries/

28 January 2014

Don't forget to set up logging if you want to see logging events.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Similarity interface

In the previous tutorials on corpora and vector spaces and topics and transformations, we covered how to create a corpus in a vector space and how to transform it between different vector spaces. The reason for going through all of this is that we often want to determine the similarity between pairs of documents, or the similarity between a particular document and a set of other documents (such as user queries vs. indexed documents).

To show how gensim does this, let's look at the corpus from the previous example (originally from the seminal 1990 article "Indexing by Latent Semantic Analysis" by Deerwester et al.):

from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm') # from the "From Strings to Vectors" tutorial
print(corpus)

MmCorpus(9 documents, 12 features, 28 non-zero entries)

Following Deerwester's example, we first define a two-dimensional LSI space using this small sample corpus:

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
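
If you are curious what the two latent dimensions look like, the model can print them out (a quick sanity check; the exact weights vary slightly between runs):

lsi.print_topics(2) # logs the two topics; recent gensim versions also return them as (topic_id, string) pairs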

Now suppose a user typed in the query "Human computer interaction". We want to sort our nine indexed documents in decreasing order of similarity to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarity: the apparent semantic relatedness of the documents' texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over boolean keyword matching:

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, -0.461821), (1, 0.070028)]

In addition, we will be using cosine similarity to determine the similarity of two vectors. Cosine similarity is the standard measure in vector space modelling; however, where the vectors represent probability distributions, a different similarity measure may be more appropriate.
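
To get a concrete feel for what the cosine measure computes, gensim also exposes it directly for sparse vectors as gensim.matutils.cossim. A minimal sketch, with two vectors made up purely for illustration:

from gensim import matutils

vec_a = [(0, 1.0), (1, 2.0)] # sparse (feature_id, weight) vectors, chosen arbitrarily
vec_b = [(0, 3.0), (2, 1.0)]
print(matutils.cossim(vec_a, vec_b)) # dot(a, b) / (|a|*|b|) = 3 / (sqrt(5)*sqrt(10)) ≈ 0.4243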

Initialize query structure

To prepare for similarity queries, we first need to enter all documents that we want to compare against subsequent queries. In our case, these are the same nine documents used for training the LSI model, converted to 2-D LSI space. But that's only incidental; we might just as well be indexing a different corpus altogether.

index = similarities.MatrixSimilarity(lsi[corpus]) # transform the corpus to LSI space and index it

Warning

The class similarities.MatrixSimilarity is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2 GB of RAM in a 256-dimensional LSI space when used with this class.

Without 2 GB of free RAM, you would need to use the similarities.Similarity class. This class operates in fixed memory by splitting the index across multiple files on disk, called shards. It uses similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity internally, so it is still fast, although slightly more complex.
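
A minimal sketch of this memory-independent alternative; the prefix '/tmp/index' is an arbitrary path of my choosing, next to which the shard files are created:

index_on_disk = similarities.Similarity('/tmp/index', lsi[corpus], num_features=2) # num_features must match the vector space: 2 LSI topics here
print(index_on_disk[vec_lsi]) # queried exactly like MatrixSimilarity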

Index persistency is handled via the standard save() and load() functions:

index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

This is true for all similarity indexing classes (similarities.Similarity, similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity). Also in what follows, index can be an object of any of these. When in doubt, use similarities.Similarity, as it is the most scalable version, and it also supports adding more documents to the index later.
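
Appending documents might look like this, assuming index is a similarities.Similarity object (the other two index classes do not support incremental additions); the extra document is made up for illustration:

new_doc = "human system interface" # hypothetical new document
new_vec = lsi[dictionary.doc2bow(new_doc.lower().split())] # convert it to the same 2-D LSI space
index.add_documents([new_vec]) # the new document becomes queryable immediately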

Performing queries

To obtain similarities of our query document against the nine indexed documents:

sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
(5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]

The cosine measure returns similarities in the range <-1, 1> (the greater, the more similar), so the first document scores 0.99809301.

With some standard Python magic, we sort these similarities in descending order and obtain the final answer to the query "Human computer interaction":

sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees

(I added the original documents in their "string form" to the output as comments, for clarity.)
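
As an aside: if you only ever need the top few hits, the index classes also accept a num_best parameter at construction time, in which case a query returns the sorted (document number, similarity) pairs directly. A sketch under that assumption:

index_top3 = similarities.MatrixSimilarity(lsi[corpus], num_best=3) # keep only the 3 best hits per query
print(index_top3[vec_lsi]) # e.g. [(2, 0.998...), (0, 0.998...), (3, 0.986...)]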

The thing to note here is that documents no. 2 ("The EPS user interface management system") and no. 4 ("Relation of user perceived response time to error measurement") would never be returned by a standard boolean fulltext search, because they share no common words with "Human computer interaction". However, after applying LSI, we can see that they received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition that they share the "computer-human" topic with the query. In fact, this semantic generalization is the reason we apply transformations and do topic modelling in the first place.

What next?

Congratulations, you have finished the tutorials: now you know how gensim works :-) To delve into more details, have a look at the API documentation, see the Wikipedia experiments, or check out distributed computing in gensim.

Gensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production. That doesn't mean it's perfect, though:

  • there are parts that could be implemented more efficiently (in C, say), or make better use of parallelism (multiple machine cores)

  • new algorithms are published all the time; help gensim keep up by discussing them and by contributing code

  • your feedback is most welcome and appreciated (and it's not just the code!): contribute ideas, report bugs, or consider sharing user stories and general questions

Gensim has no ambition to become an all-encompassing framework across all subfields of NLP (or even machine learning). Its mission is to help NLP practitioners try out popular topic modelling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers.

