NLP (V): keyword extraction supplement (corpus and vector space)

1. Converting the corpus into vectors (gensim)

After basic corpus preprocessing (word segmentation, stop-word removal), it is often necessary to vectorize the text to make the subsequent work easier.

from gensim import corpora, similarities, models
import jieba

# Step 1: define the corpus and the sentence to compare against it.
# wordlist is the corpus: three sentences, equivalent to three documents.
# (The original post used Chinese sentences; jieba is a Chinese segmenter.)
wordlist = ['I like programming', 'I want to become beautiful', 'today for lunch yet']
sentenses = 'what I like'

# Step 2: build a dictionary from the corpus. It assigns a serial number to
# every word in the corpus, like this: {'I': 1, 'like': 2, 'programming': 3, ...}.
# Each sentence is segmented into words with jieba first.
text = [[word for word in jieba.cut(words)] for words in wordlist]
dictionary = corpora.Dictionary(text)
print(dictionary)

# Step 3: count word frequencies per document. doc2bow does the counting for
# one document and expects a list of tokens.
# corpus is a two-dimensional list such as
# [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1)],
#  [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]],
# meaning the word with id 0 appears once, the word with id 1 appears once, etc.
corpus = [dictionary.doc2bow(tokens) for tokens in text]
print(corpus)  # a 2-D list whose smallest elements are (word id, word frequency)

 

Code results:

We use the gensim.corpora.dictionary.Dictionary class to assign a unique integer id to each word that appears in the corpus. Along the way it collects word counts and other statistics. At the end we see that the corpus contains 10 distinct words, which means each document can be represented as a 10-dimensional vector.
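To see which id was assigned to which word, you can print the dictionary's token2id mapping (a standard gensim attribute); the exact ids and tokens shown here are illustrative, since they depend on how jieba segments the sentences:

print(dictionary.token2id)  # e.g. {'I': 0, 'like': 1, 'programming': 2, ...}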

The doc2bow function turns a document into a bag-of-words (BoW) vector: it counts the occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector. In the code, corpus is the bag-of-words model of the whole corpus, in which each sub-list represents one document.
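As a minimal illustration of the sparse format (the tokens here are hypothetical, not from the post): doc2bow simply drops any word that is not in the dictionary.

new_doc = ['I', 'like', 'like', 'unicorns']  # 'unicorns' is not in the dictionary
print(dictionary.doc2bow(new_doc))           # e.g. [(0, 1), (1, 2)] -- unknown words are ignored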

As noted earlier, TF-IDF can be used for keyword extraction, because it assumes that the larger a word's tf-idf value, the better it reflects the word's importance to its document. TF-IDF can also be used to find similar articles, extract article summaries, and do feature selection (pick out important features). Next, to compare the similarity of sentenses against the three documents in corpus, the code is as follows:

# Step 4: train a tf-idf model on the corpus.
model = models.TfidfModel(corpus)
# To look at the tf-idf values, transform the corpus through the model:
tfidf = model[corpus]
for doc in tfidf:
    print(doc)
'''
Printed output: the tf-idf value of every word in every document:
[(0, 0.5773502691896258), (1, 0.5773502691896258), (2, 0.5773502691896258)]
[(3, 0.7071067811865475), (4, 0.7071067811865475)]
[(5, 0.4472135954999579), (6, 0.4472135954999579), (7, 0.4472135954999579), (8, 0.4472135954999579), (9, 0.4472135954999579)]
'''
# Step 5: index every document's tf-idf vector so that similarity queries are
# fast; pass in the corpus transformed to tf-idf values.
similarity = similarities.MatrixSimilarity(tfidf)
# Step 6: process the sentence to compare: segment it first, then get its word
# frequencies. jieba accepts only a string.
sen = [word for word in jieba.cut(sentenses)]
sen2 = dictionary.doc2bow(sen)
# Then compute its tf-idf values.
sen_tfidf = model[sen2]
# Similarity between the query and every document; sim comes out as an array.
sim = similarity[sen_tfidf]
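The round numbers above are no accident. As a quick sanity check, assuming gensim's defaults (idf = log2(N/df) and L2 normalization of each document vector): here every word occurs in exactly one of the three documents, so within a document all weights are equal, and a document with n distinct words gets 1/sqrt(n) per word:

import math
# 3-, 2- and 5-word documents under equal weights and L2 normalization:
print(1 / math.sqrt(3))  # 0.5773... (first document)
print(1 / math.sqrt(2))  # 0.7071... (second document)
print(1 / math.sqrt(5))  # 0.4472... (third document)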

https://blog.csdn.net/Lau_Sen/article/details/80436819

In the code above, tfidf and sen_tfidf are the tf-idf vectorizations of the corpus and of the new sentence, respectively. Many models are built on top of tf-idf, such as LSI, LDA, and so on.
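For example, here is a minimal sketch of stacking an LSI model on top of the tf-idf corpus, continuing from the code above (num_topics=2 is an arbitrary choice for this tiny corpus, not something from the post):

lsi = models.LsiModel(tfidf, id2word=dictionary, num_topics=2)
print(lsi[sen_tfidf])  # the query expressed in the 2-dimensional LSI space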

Now every sentence has become a sparse representation like [(word id, tf-idf value), (word id, tf-idf value), ...].
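If a downstream step needs dense vectors instead, gensim's matutils helper can expand the sparse representation (one row per dictionary entry, one column per document):

from gensim import matutils
dense = matutils.corpus2dense(tfidf, num_terms=len(dictionary))
print(dense.shape)  # (10, 3) for this corpus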
