sklearn Text Vectorization Tools

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/Yellow_python/article/details/97677183

Count (Term-Frequency) Vectorization

Basic Usage

from sklearn.feature_extraction.text import CountVectorizer
# Toy corpus: three short documents
corpus = ['air air ball call', 'air ball ball', 'air air air']
vectorizer = CountVectorizer()
vectorizer.fit(corpus)            # learn the vocabulary
X = vectorizer.transform(corpus)  # map documents to count vectors
print(X)                          # sparse (row, col) -> count triples
print(type(X[0]))                 # a scipy sparse matrix row
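
The fitted vocabulary and the dense form of the result are easy to inspect. A minimal sketch, continuing the example above (get_feature_names_out requires sklearn >= 1.0; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())  # ['air' 'ball' 'call'] - columns sorted alphabetically
print(vectorizer.vocabulary_)              # term -> column index mapping
print(X.toarray())                         # dense counts: [[2 1 1], [1 2 0], [3 0 0]]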

Chinese Tokenization

The default analyzer tokenizes with a regular expression, controlled by the token_pattern parameter; a custom tokenizer (e.g. jieba for Chinese) can be supplied via the tokenizer parameter.
stop_words filters out unwanted tokens.
max_features caps the vocabulary at the most frequent terms, discarding low-frequency words.

import jieba
from sklearn.feature_extraction.text import CountVectorizer
# Corpus
texts = ['小米、小米、苹果、华为', '小米和苹果、1+和苹果', '华为和小米']
# Tokenizer: register '1+' as a word so jieba keeps it intact
jieba.add_word('1+', 2, 'nz')
tokenizer = lambda s: jieba.cut(s, HMM=False)
# Stop words
stop_words = ['、', '和']
# Vectorize
vectorizer = CountVectorizer(tokenizer=tokenizer, stop_words=stop_words, max_features=3)
X = vectorizer.fit_transform(texts)
print(X)
print(list(tokenizer('小米华为苹果1+')))
print(vectorizer.transform(['小米、华为、苹果和1+']))
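
When regex splitting is sufficient, you can also stay with the default analyzer and just loosen token_pattern instead of passing a tokenizer. A small sketch: the default pattern r'(?u)\b\w\w+\b' silently drops single-character tokens, which a looser pattern keeps.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # also keep 1-character tokens
v.fit(['a bb ccc'])
print(v.get_feature_names_out())  # ['a' 'bb' 'ccc'] instead of ['bb' 'ccc']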

TF-IDF Vectorization

TfidfVectorizer inherits from CountVectorizer, so it accepts the same parameters (tokenizer, stop_words, max_features, etc.) and applies TF-IDF weighting on top of the raw counts.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
# Corpus
texts = ['小米苹果华为小米', '苹果小米苹果', '小米小米']
# Vectorize with jieba as the tokenizer
vectorizer = TfidfVectorizer(tokenizer=jieba.cut)
X = vectorizer.fit_transform(texts)
print(X)            # sparse representation
print(X.toarray())  # dense tf-idf matrix
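
With default settings (smooth_idf=True, norm='l2'), sklearn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalizes each row. A sketch reproducing row 0 of X by hand, assuming the vocabulary columns sort to [华为, 小米, 苹果] (which CountVectorizer's alphabetical ordering gives here):

import numpy as np
counts = np.array([1.0, 2.0, 1.0])  # raw counts of [华为, 小米, 苹果] in document 0
df = np.array([1.0, 3.0, 2.0])      # document frequency of each term over the 3 docs
n = 3                               # number of documents
idf = np.log((1 + n) / (1 + df)) + 1
tfidf = counts * idf
print(tfidf / np.linalg.norm(tfidf))  # ~[0.580 0.685 0.441], matching X.toarray()[0]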

Compressed Sparse Matrices

scipy provides two common compressed sparse formats (the vectorizers above return the first):
Compressed Sparse Row (CSR) matrix, efficient for row operations
Compressed Sparse Column (CSC) matrix, efficient for column operations

from numpy import array
from scipy.sparse import csr_matrix, csc_matrix
# A mostly-zero matrix
a = array([[0, 0, -2, 0],
           [-1, 0, 3, 0],
           [0, 0, 0, -9]])
csr = csr_matrix(a)  # compressed sparse row
csc = csc_matrix(a)  # compressed sparse column
print(a, csr, type(csr), csc, type(csc), csc.toarray(), sep='\n\n')
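
Why CSR is compact becomes clear from its three underlying arrays, printed here for the example above:

print(csr.data)     # [-2 -1  3 -9]  nonzero values, scanned row by row
print(csr.indices)  # [2 0 2 3]      column index of each value
print(csr.indptr)   # [0 1 3 4]      row i occupies data[indptr[i]:indptr[i+1]]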
