Principles behind text classification problems

1 Bag-of-words model

from sklearn import feature_extraction
f = feature_extraction.text.CountVectorizer()
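CountVectorizer converts text into a word-frequency matrix. Each entry is the number of times a keyword occurs in a document: 0 if the keyword is absent, 1 if it occurs once, 2 if it occurs twice, and so on.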
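To make the idea concrete, the counting step can be sketched without scikit-learn. This is only a minimal illustration, not how CountVectorizer is implemented: the helper name `bag_of_words`, the two sample sentences, and the tokenizing regex (lowercase tokens of two or more word characters, chosen to resemble the library's default behaviour) are all assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) of raw word counts for a list of strings.

    Hypothetical helper for illustration only.
    """
    # Lowercase each document and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    # The vocabulary is the sorted set of all tokens; a word's index is its column
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per document, one column per vocabulary word
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Every document becomes a row of the same length, and a word that occurs twice ("the" in the second sentence) gets the value 2 rather than 1.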
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

'''CountVectorizer uses fit_transform to convert the words of the text into a word-frequency matrix
get_feature_names_out() lists all keywords in the text (older scikit-learn versions: get_feature_names())
vocabulary_ shows every keyword together with its column position
toarray() shows the resulting word-frequency matrix'''
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())
print(vectorizer.vocabulary_)
print(count.toarray())
print(count.toarray().shape)
# Word-frequency matrix: each row is a vector whose length is the number of keywords in the vocabulary (call it m). Each value is the number of times the keyword appears in that document (0 means it does not appear). The number of rows is the number of documents.
# Note that every keyword has its own fixed position, so each document becomes a vector of length m, and the positions of the keywords that occur in the sentence hold their counts (1 for a single occurrence).

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
(4, 9)
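The printed matrix can be read back with the help of the vocabulary. The following sketch copies the values from the output above, inverts the word-to-column mapping, and also collapses the counts into a 0/1 presence matrix. The names `column_to_word`, `words_in_row`, and `presence` are hypothetical and exist only for this example.

```python
# Values copied from the printed output above
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

# Invert the word -> column mapping so that column j gives back its word
column_to_word = {col: word for word, col in vocabulary.items()}

def words_in_row(row):
    """Map one matrix row back to {word: count} for the words that occur."""
    return {column_to_word[j]: n for j, n in enumerate(row) if n > 0}

# Collapse counts to presence/absence (0/1)
presence = [[1 if n > 0 else 0 for n in row] for row in counts]

print(words_in_row(counts[1]))
# {'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}
print(presence[1])
# [0, 1, 0, 1, 0, 1, 1, 0, 1]
```

The second row shows why the matrix holds counts rather than plain 0/1 flags: "second" occurs twice. If only presence is wanted, CountVectorizer also accepts a `binary=True` argument, which should produce the 0/1 form directly.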

Origin www.cnblogs.com/DHuifang004/p/11224763.html