[Python3 machine learning] CountVectorizer and TfidfTransformer in sklearn

Original link: https://blog.csdn.net/qq_36134437/article/details/103057909

CountVectorizer converts the words in a text into a term frequency matrix; the number of times each word appears is computed by the fit_transform function.

CountVectorizer(input='content', encoding='utf-8',  decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, 
token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)


The many parameters of the CountVectorizer class fall into three processing steps: preprocessing, tokenizing, and n-gram generation.

The parameters that typically need setting are ngram_range, max_df, min_df, max_features and the like; the details of each parameter are as follows:

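Parameter        Effect
input            The default is usually fine; can also be set to 'filename' or 'file'
encoding         The default utf-8 is usually fine; the analyzer decodes the raw document as utf-8
decode_error     Default is strict: a character that cannot be decoded raises UnicodeDecodeError. ignore skips decoding errors; replace substitutes a replacement character for the undecodable bytes
strip_accents    Default is None; can be set to ascii or unicode, which removes accents from the raw document during preprocessing using the ascii or unicode method
analyzer         The default is usually fine; can be a string such as 'word', 'char', 'char_wb', or a callable (for example, a function)
preprocessor     None or a callable
tokenizer        None or a callable
ngram_range      Range (min_n, max_n) of n-gram lengths to extract; (1, 1) means single words only, (1, 2) means single words and two-word phrases
stop_words       Stop words: 'english' uses the built-in English stop word list, a list supplies custom stop words, None uses no stop word list. With None, setting max_df in [0.7, 1.0) filters corpus-specific stop words automatically based on document frequency
lowercase        Convert all characters to lowercase before tokenizing
token_pattern    Regular expression that defines a token; only used when analyzer == 'word'. The default selects tokens of 2 or more alphanumeric characters; punctuation is treated as a token separator and never becomes a token
max_df           A float in [0.0, 1.0] or an int with no range limit; default 1.0. It acts as a threshold when building the vocabulary: a term whose document frequency is greater than max_df is not kept. A float is a proportion of the documents in the corpus that contain the term; an int is an absolute number of documents. Ignored if vocabulary is given
min_df           Like max_df, except that a term whose document frequency is less than min_df is not kept
max_features     Default None; can be an int. Terms are sorted by corpus-wide term frequency in descending order and only the top max_features are kept as the vocabulary
vocabulary       Default None: the vocabulary is built automatically from the input documents. Can also be a dict (term -> column index) or an iterable of terms
binary           Default False. A term may occur n times in a document; with binary=True every non-zero n is set to 1, which is useful for discrete probabilistic models that need boolean input
dtype            fit_transform() or transform() of CountVectorizer returns a document-term frequency matrix; dtype sets the numeric type of that matrix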
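
To make the commonly tuned parameters concrete, here is a minimal sketch that fits CountVectorizer several times on a tiny corpus. The corpus and the helper function vocab are made up for illustration and are not part of the original article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: apple is in 4 documents, banana and cherry in 2, date in 1.
docs = ["apple banana", "apple banana banana cherry", "apple cherry", "apple date"]

def vocab(**params):
    """Fit a CountVectorizer with the given parameters; return its terms in column order."""
    cv = CountVectorizer(**params).fit(docs)
    return sorted(cv.vocabulary_, key=cv.vocabulary_.get)

unigrams_and_bigrams = vocab(ngram_range=(1, 2))  # single words plus two-word phrases
frequent_terms = vocab(min_df=2)      # keep terms found in at least 2 documents
rare_terms = vocab(max_df=0.5)        # drop terms found in more than 50% of documents
top_two = vocab(max_features=2)       # keep the 2 terms with the highest total count

print(unigrams_and_bigrams)
print(frequent_terms)   # ['apple', 'banana', 'cherry']
print(rare_terms)       # ['banana', 'cherry', 'date']
print(top_two)          # ['apple', 'banana']
```

The feature order is recovered from vocabulary_ rather than from a feature-name method, so the sketch does not depend on which scikit-learn version is installed.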


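Attribute                Effect

vocabulary_              Vocabulary; a dict mapping each term to its column index
get_feature_names()      Vocabulary terms of all texts in column order; a list (a method; named get_feature_names_out() in scikit-learn 1.0 and later)
stop_words_              Returns the stop words (terms dropped because of max_df, min_df or max_features)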
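
As a small illustration of the vocabulary_ attribute, the sketch below shows that its values are column indices, and how those indices can be used to read the count matrix. The variable names are chosen here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
counts = cv.fit_transform(texts)

# vocabulary_ maps each term to its column index (not to a frequency).
term_to_column = {term: int(col) for term, col in cv.vocabulary_.items()}

# Sorting the terms by column index recovers the feature order of the matrix.
feature_names = sorted(term_to_column, key=term_to_column.get)

# Total count of "cat" across all documents, read from its column.
cat_total = int(counts[:, term_to_column["cat"]].sum())

print(term_to_column)   # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0}
print(feature_names)    # ['bird', 'cat', 'dog', 'fish']
print(cat_total)        # 3
```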


Method                               Effect

fit(raw_documents[, y])              Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])    Learn the vocabulary dictionary and return the document-term matrix.
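
The difference between fit and transform matters when new documents arrive after the vocabulary has been learned. A minimal sketch, assuming a made-up second list new_texts: words that were not seen during fit (here "horse") are simply ignored by transform.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
new_texts = ["cat cat horse", "bird dog"]   # hypothetical unseen documents

cv = CountVectorizer()
cv.fit(train_texts)                    # learn the vocabulary only
new_counts = cv.transform(new_texts)   # count using the already learned vocabulary

# Columns are bird, cat, dog, fish; "horse" has no column and is dropped.
new_rows = new_counts.toarray().tolist()
print(new_rows)   # [[0, 2, 0, 0], [1, 0, 1, 0]]
```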


The input data is a list. Each element of the list is a string that represents one article, and that string has already been split into words. CountVectorizer applies to Chinese text in the same way.
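
For Chinese input, a minimal sketch under the assumption that each article has already been segmented into words joined by spaces (the two sentences below are made up for illustration). One thing to watch: the default token_pattern keeps only tokens of 2 or more characters, so single-character words are dropped unless the pattern is relaxed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented Chinese articles, words separated by spaces.
texts = ["我 喜欢 机器 学习", "我 喜欢 猫"]

# Default token_pattern: single-character words such as "我" and "猫" are dropped.
default_vocab = set(CountVectorizer().fit(texts).vocabulary_)

# Relaxed pattern (one or more word characters) keeps single-character words.
relaxed_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
relaxed_vocab = set(relaxed_cv.fit(texts).vocabulary_)

print(default_vocab)
print(relaxed_vocab)
```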

CountVectorizer uses the fit_transform function to convert the words of the texts into a word frequency matrix, where the element A[i][j] is the frequency of word j in the i-th text, that is, the number of times the word appears. get_feature_names() shows the keywords of all the texts, and toarray() shows the resulting word frequency matrix. Example:

from sklearn.feature_extraction.text import CountVectorizer

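texts=["dog cat fish","dog cat cat","fish bird", 'bird'] # each list element, e.g. "dog cat fish", is a string representing one article
cv = CountVectorizer() # create the bag-of-words model
cv_fit=cv.fit_transform(texts)
# the line above is equivalent to the following two lines
#cv.fit(texts)
#cv_fit=cv.transform(texts)

print(cv.get_feature_names())    #['bird', 'cat', 'dog', 'fish'] the vocabulary built from the articles, as a list (scikit-learn 1.2+ removed this method; use get_feature_names_out() there)

print(cv.vocabulary_    )              # {'dog':2,'cat':1,'fish':3,'bird':0} as a dict, key: term, value: column index of the term (not its frequency)

print(cv_fit)
# (0,3) 1   row 0 (the 0th list element), column 3 (the term at index 3 in the vocabulary), count 1
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1

print(cv_fit.toarray()) # .toarray() converts the sparse matrix into a dense array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents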
#[2 3 2 2]
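
To show what the word frequency matrix A[i][j] contains, the counting can be imitated with the standard library alone. This is only a sketch of the idea (lowercase, tokenize with the default token_pattern, sort the vocabulary, count), not scikit-learn's actual implementation.

```python
import re
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Same regular expression as the default token_pattern: tokens of 2+ word characters.
token_re = re.compile(r"(?u)\b\w\w+\b")

# Lowercase and tokenize every text.
tokenized = [token_re.findall(text.lower()) for text in texts]

# The vocabulary is the sorted set of all tokens; position in the list = column index.
vocabulary = sorted({token for doc in tokenized for token in doc})

# matrix[i][j] is the number of times vocabulary[j] appears in texts[i].
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)   # ['bird', 'cat', 'dog', 'fish']
print(matrix)       # [[0, 1, 1, 1], [0, 2, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0]]
```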



Calculating TF-IDF
1. CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

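# corpus
corpus = [
    'This is the first document.',
    'This is the this second second document.',
    'And the third one.',
    'Is this the first document?'
]

# convert the words in the texts into a word frequency matrix
vectorizer = CountVectorizer()
print(vectorizer)
# count how many times each word appears
X = vectorizer.fit_transform(corpus)
print(type(X),X)
# get all keywords in the bag of words (scikit-learn 1.2+ removed this method; use get_feature_names_out() there)
word = vectorizer.get_feature_names()
print(word)
# view the word frequency result
print(X.toarray())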




result

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
<class 'scipy.sparse.csr.csr_matrix'>   (0, 1)  1
  (0, 2)    1
  (0, 6)    1
  (0, 3)    1
  (0, 8)    1
  (1, 5)    2
  (1, 1)    1
  (1, 6)    1
  (1, 3)    1
  (1, 8)    2
  (2, 4)    1
  (2, 7)    1
  (2, 0)    1
  (2, 6)    1
  (3, 1)    1
  (3, 2)    1
  (3, 6)    1
  (3, 3)    1
  (3, 8)    1
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 2]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]


The results show 9 feature words in total:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
The output also gives, for each sentence, the count of every feature word. For example, the first sentence "This is the first document." corresponds to the word frequency vector [0, 1, 1, 1, 0, 0, 1, 0, 1]. Counting positions from 0, position 1 ("document"), position 2 ("first"), position 3 ("is"), position 6 ("the") and position 8 ("this") each have a count of 1. In this way every sentence gets a word frequency vector.
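
Reading a word frequency vector by hand is error-prone, so pairing it with the feature list can help. A small sketch using the values printed in the result above (the variable names are chosen here for illustration):

```python
feature_names = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
first_row = [0, 1, 1, 1, 0, 0, 1, 0, 1]    # "This is the first document."
second_row = [0, 1, 0, 1, 0, 2, 1, 0, 2]   # "This is the this second second document."

def present_words(row):
    """Return {word: count} for the words that occur in the sentence."""
    return {word: count for word, count in zip(feature_names, row) if count > 0}

first_counts = present_words(first_row)
second_counts = present_words(second_row)
print(first_counts)    # {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}
print(second_counts)   # {'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 2}
```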

2. TfidfTransformer
TfidfTransformer computes the TF-IDF value of each word from the counts produced by the vectorizer. Usage is as follows:

from sklearn.feature_extraction.text import TfidfTransformer
# create the transformer
transformer = TfidfTransformer()
print(transformer)
# convert the word frequency matrix into TF-IDF values
tfidf = transformer.fit_transform(X)
# inspect the result: tfidf[i][j] is the tf-idf weight of word j in text i
print(tfidf.toarray())


result

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.24628357  0.          0.24628357  0.          0.77170162
   0.20135296  0.          0.49256715]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]
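
The learned inverse document frequencies can be inspected through the transformer's idf_ attribute. A minimal sketch on the same corpus (the dictionary idf_by_term is a name introduced here): a word that appears in every document, such as "the", gets the smallest idf, while words that appear in one document get the largest.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the this second second document.',
    'And the third one.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
transformer = TfidfTransformer().fit(X)   # default smooth_idf=True

# idf_ holds one idf value per column; look it up through the vocabulary.
idf_by_term = {term: float(transformer.idf_[col])
               for term, col in vectorizer.vocabulary_.items()}

for term in sorted(idf_by_term, key=idf_by_term.get):
    print(term, round(idf_by_term[term], 4))
```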


smooth_idf = True:
idf(t) = log((1 + n) / (1 + df(t))) + 1
smooth_idf = False:
idf(t) = log(n / df(t)) + 1
where n is the total number of documents, df(t) is the number of documents that contain term t, and log is the natural logarithm.
tf-idf(t, d) = tf(t, d) * idf(t)
Each document's vector is then normalized by its Euclidean (L2) norm to give the final tf-idf weights.
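
The formulas above can be checked by hand with NumPy. This sketch hardcodes the word frequency matrix printed earlier and applies the smooth_idf = True formula followed by L2 normalization; it is a re-derivation for illustration, not scikit-learn's internal code.

```python
import numpy as np

# Word frequency matrix from the CountVectorizer result (4 documents, 9 words).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 2],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smooth_idf = True
raw = counts * idf                          # tf * idf
tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)   # L2-normalize each row

print(np.round(tfidf, 8))
```

The first row comes out as 0.43877674 for "document", 0.54197657 for "first" and 0.35872874 for "the", matching the TfidfTransformer output shown above.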
 

Origin blog.csdn.net/YYIverson/article/details/104281104