【Python3机器学习】sklearn中的CountVectorizer和TfidfTransformer

原文链接：https://blog.csdn.net/qq_36134437/article/details/103057909

CountVectorizer会将文本中的词语转换为词频矩阵，它通过fit_transform函数计算各个词语出现的次数。

CountVectorizer(input='content', encoding='utf-8',  decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, 
token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

CountVectorizer类的参数很多，分为三个处理步骤：preprocessing、tokenizing、n-grams generation.

一般要设置的参数是:ngram_range,max_df，min_df，max_features等，具体情况具体分析

参数表            作用
input            一般使用默认即可，可以设置为"filename’或’file’
encodeing        使用默认的utf-8即可，分析器将会以utf-8解码raw document
decode_error    默认为strict，遇到不能解码的字符将报UnicodeDecodeError错误，设为ignore将会忽略解        码错误，还可以设为replace，作用尚不明确
strip_accents    默认为None，可设为ascii或unicode，将使用ascii或unicode编码在预处理步骤去除raw document中的重音符号
analyzer        一般使用默认，可设置为string类型，如’word’, ‘char’, ‘char_wb’，还可设置为callable类型，比如函数是一个callable类型
preprocessor    设为None或callable类型
tokenizer        设为None或callable类型
ngram_range        词组切分的长度范围，待详解
stop_words        设置停用词，设为english将使用内置的英语停用词，设为一个list可自定义停用词，设为None不使用停用词，设为None且max_df∈[0.7, 1.0)将自动根据当前的语料库建立停用词表
lowercase        将所有字符变成小写
token_pattern    过滤规则，表示token的正则表达式，需要设置analyzer == ‘word’，默认的正则表达式选择2个及以上的字母或数字作为token，标点符号默认当作token分隔符，而不会被当作token
max_df            可以设置为范围在[0.0 1.0]的float，也可以设置为没有范围限制的int，默认为1.0。这个参数的作用是作为一个阈值，当构造语料库的关键词集的时候，如果某个词的document frequence大于max_df，这个词不会被当作关键词。如果这个参数是float，则表示词出现的次数与语料库文档数的百分比，如果是int，则表示词出现的次数。如果参数中已经给定了vocabulary，则这个参数无效
min_df            类似于max_df，不同之处在于如果某个词的document frequence小于min_df，则这个词不会被当作关键词
max_features    默认为None，可设为int，对所有关键词的term frequency进行降序排序，只取前max_features个作为关键词集
vocabulary        默认为None，自动从输入文档中构建关键词集，也可以是一个字典或可迭代对象？
binary            默认为False，一个关键词在一篇文档中可能出现n次，如果binary=True，非零的n将全部置为1，这对需要布尔值输入的离散概率模型的有用的
dtype            使用CountVectorizer类的fit_transform()或transform()将得到一个文档词频矩阵，dtype可以设置这个矩阵的数值类型

属性表作用

vocabulary_              词汇表；字典型
get_feature_names()      所有文本的词汇；列表型
stop_words_              返回停用词表

方法表作用

fit_transform(X)      拟合模型，并返回文本矩阵
fit(raw_documents[, y])     Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])     Learn the vocabulary dictionary and return term-document matrix.

用数据输入形式为列表，列表元素为代表文章的字符串，一个字符串代表一篇文章，字符串是已经分割好的。CountVectorizer同样适用于中文;

CountVectorizer是通过fit_transform函数将文本中的词语转换为词频矩阵，矩阵元素a[i][j] 表示j词在第i个文本下的词频。即各个词语出现的次数，通过get_feature_names()可看到所有文本的关键字，通过toarray()可看到词频矩阵的结果。示例

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird'] # “dog cat fish” 为输入列表元素,即代表一个文章的字符串
cv = CountVectorizer()#创建词袋数据结构
cv_fit=cv.fit_transform(texts)
#上述代码等价于下面两行
#cv.fit(texts)
#cv_fit=cv.transform(texts)

print(cv.get_feature_names())    #['bird', 'cat', 'dog', 'fish'] 列表形式呈现文章生成的词典

print(cv.vocabulary_    )              # {‘dog’:2,'cat':1,'fish':3,'bird':0} 字典形式呈现，key：词，value:词频

print(cv_fit)
# （0,3） 1   第0个列表元素，**词典中索引为3的元素**， 词频
#（0,1）1
#（0,2）1
#（1,1）2
#（1,2）1
#（2,0）1
#（2,3）1
#（3,0）1

print(cv_fit.toarray()) #.toarray() 是将结果转化为稀疏矩阵矩阵的表示方式；
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  #每个词在所有文档中的词频
#[2 3 2 2]

计算TF-IDF
1.CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

#语料
corpus = [
    'This is the first document.',
    'This is the this second second document.',
    'And the third one.',
    'Is this the first document?'
]

#将文本中的词转换成词频矩阵
vectorizer = CountVectorizer()
print(vectorizer)
#计算某个词出现的次数
X = vectorizer.fit_transform(corpus)
print(type(X),X)
#获取词袋中所有文本关键词
word = vectorizer.get_feature_names()
print(word)
#查看词频结果
print(X.toarray())

结果

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
<class 'scipy.sparse.csr.csr_matrix'>   (0, 1)  1
  (0, 2)    1
  (0, 6)    1
  (0, 3)    1
  (0, 8)    1
  (1, 5)    2
  (1, 1)    1
  (1, 6)    1
  (1, 3)    1
  (1, 8)    2
  (2, 4)    1
  (2, 7)    1
  (2, 0)    1
  (2, 6)    1
  (3, 1)    1
  (3, 2)    1
  (3, 6)    1
  (3, 3)    1
  (3, 8)    1
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 2]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

从结果中可以看到，总共包括9个特征词，即：
[‘and’, ‘document’, ‘first’, ‘is’, ‘one’, ‘second’, ‘the’, ‘third’, ‘this’]
同时在输出每个句子中包含特征词的个数。例如，第一句“This is the first document.”，它对应的词频为[0, 1, 1, 1, 0, 0, 1, 0, 1]，初始序号从0开始计数，则该词频表示存在第1个位置的单词“document”共1次、第2个位置的单词“first”共1次、第3个位置的单词“is”共1次、第8个位置的单词“this”共1词。所以，每个句子都会得到一个词频向量。

2.TfidfTransformer
TfidfTransformer用于统计vectorizer中每个词语的TF-IDF值。具体用法如下：

from sklearn.feature_extraction.text import TfidfTransformer
#类调用
transformer = TfidfTransformer()
print(transformer)
#将词频矩阵统计成TF-IDF值
tfidf = transformer.fit_transform(X)
#查看数据结构tfidf[i][j]表示i类文本中tf-idf权重
print(tfidf.toarray())

结果

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.24628357  0.          0.24628357  0.          0.77170162
   0.20135296  0.          0.49256715]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]

smooth_idf=True：
idf = log((1+总文档数)/(1+包含ti的文档数)) +1
smooth_idf=False:
idf = log log((总文档数)/(包含ti的文档数)) +1
tf-idf =tf*idf
之后需进行 the Euclidean (L2) norm，得到最终的tf-idf权重

YYIverson

发布了44 篇原创文章 · 获赞 16 · 访问量 1万+

私信关注

【Python3机器学习】sklearn中的CountVectorizer和TfidfTransformer

原文链接：https://blog.csdn.net/qq_36134437/article/details/103057909 CountVectorizer会将文本中的词语转换为词频矩阵，它通过fit_transform函数计算各个词语出现的次数。

猜你喜欢

原文链接：https://blog.csdn.net/qq_36134437/article/details/103057909

CountVectorizer会将文本中的词语转换为词频矩阵，它通过fit_transform函数计算各个词语出现的次数。