1 Bag-of-words model
from sklearn import feature_extraction
f = feature_extraction.text.CountVectorizer()
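CountVectorizer converts the text into a word-frequency matrix. Each column corresponds to one keyword, and each value is the number of times that keyword occurs in a document (0 if it does not appear).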
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

'''
CountVectorizer uses fit_transform() to convert the words in the text into a word-frequency matrix
get_feature_names_out() lists all keywords in the text
vocabulary_ shows every keyword in the text and its position (column index)
toarray() shows the word-frequency matrix as a dense array
'''
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
# scikit-learn releases before 1.0 use get_feature_names() instead
print(list(vectorizer.get_feature_names_out()))
print(vectorizer.vocabulary_)
print(count.toarray())
print(count.toarray().shape)

# Word-frequency matrix:
#   row length (horizontal direction): the number of all keywords (call it m)
#   values: how many times the keyword appears; 0 means it does not appear
#   column length (vertical direction): the number of documents
# Note that each keyword has its own fixed position, so every document becomes
# a vector of length m, and the position of each keyword that occurs in the
# sentence holds its count.

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
(4, 9)
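To make clearer what the word-frequency matrix contains, here is a minimal pure-Python sketch of the same idea. It is not how scikit-learn is implemented. The function name bag_of_words is hypothetical, and the regular expression is only assumed to mimic CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, matrix) for a list of documents (illustrative helper)."""
    # Lowercase each document and keep tokens of two or more word characters
    # (assumed to approximate CountVectorizer's default tokenization).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    # Sort the distinct words so every keyword gets a fixed column index.
    words = sorted({word for tokens in tokenized for word in tokens})
    vocabulary = {word: index for index, word in enumerate(words)}
    # One row per document, one column per keyword, values are occurrence counts.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, matrix


corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

On this corpus the sketch should produce the same column order and counts as the CountVectorizer output, including the 2 in the 'second' column of the second document.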