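Copyright notice: This is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. When reposting, please include a link to the original source and this notice.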
Theory (text feature extraction)
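Vector space model (also called the "term vector model"): represents each text document as a numeric vector with one dimension per term. The weight of each dimension can be the term's frequency in the document, its average frequency of occurrence, or its TF-IDF weight.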
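As a small illustration of the weighting schemes just mentioned, the sketch below builds count vectors and TF-IDF vectors by hand. The toy corpus is hypothetical, and the TF-IDF variant used here (raw count times log2(N / document frequency), without normalization) is only one common choice, picked as an assumption for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and lemmatized.
docs = [
    ["love", "apple"],
    ["like", "pig"],
    ["do", "not", "like", "anything", "apple"],
]

# Vocabulary: one vector dimension per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc})

def bow_vector(doc):
    """Bag-of-words vector: raw count of each vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

def tfidf_vector(doc):
    """TF-IDF vector: count * log2(N / document frequency) per vocabulary word."""
    n_docs = len(docs)
    counts = Counter(doc)
    weights = []
    for word in vocab:
        df = sum(1 for d in docs if word in d)  # number of documents containing the word
        weights.append(counts[word] * math.log2(n_docs / df))
    return weights

bow = [bow_vector(d) for d in docs]
tfidf = [tfidf_vector(d) for d in docs]
print(vocab)
print(bow[0])
print(tfidf[0])
```

In the first document, "love" appears in only one document of the corpus while "apple" appears in two, so TF-IDF gives "love" the larger weight even though both words have the same count.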
● Bag-of-words model
● TF-IDF model
● Advanced word vector models
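Mainstream approach: Google's word2vec algorithm, a neural-network-based method that learns distributed vector representations of words using two architectures, CBOW (Continuous Bag of Words) and skip-gram. It can also be run through the Gensim library.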
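To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one is built from. It only generates (input, target) pairs and trains nothing; the function names and the example sentence are hypothetical and chosen for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words (input) predict the center word (target)."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word (input) predicts each context word (target)."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["she", "do", "not", "like", "apple"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
print(cbow[1])       # context words around "do", paired with "do"
print(skipgram[:3])  # first few (center, context) pairs
```

In Gensim's `Word2Vec` class, the `sg` parameter selects between the two: `sg=0` uses CBOW and `sg=1` uses skip-gram.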
Code excerpts
Gensim_doc2bow+LDA
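# Assumes data_lemmatized (a list of token lists) was produced by earlier preprocessing
import gensim
import gensim.corpora as corpora
from pprint import pprint

# Create Dictionary (maps each unique token to an integer id)
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency: each document becomes a list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:1])  # preview the first document
# Build the topic model
# Still based on gensim
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=2,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
# View the topics in the LDA model
# Print the keywords of each topic (num_topics=2 here)
pprint(lda_model.print_topics())
# Apply the trained model to the corpus to get per-document topics
doc_lda = lda_model[corpus]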
Gensim_tfidf+LDA
# Assumes data_lemmatized (a list of token lists) was produced by earlier preprocessing
import gensim
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from pprint import pprint

dct = Dictionary(data_lemmatized)  # fit dictionary
corpus = [dct.doc2bow(line) for line in data_lemmatized]  # convert corpus to BoW format
model = TfidfModel(corpus)  # fit model
vector = model[corpus]  # apply model to the whole corpus
# Build the topic model
# Still based on gensim
lda_model = gensim.models.ldamodel.LdaModel(corpus=vector,
                                            id2word=dct,
                                            num_topics=2,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
# View the topics in the LDA model
# Print the keywords of each topic (num_topics=2 here)
pprint(lda_model.print_topics())
# Use the TF-IDF corpus here as well, so inference matches the training input
doc_lda = lda_model[vector]
Comparison of results
- Input: data=("I LOVE apples# & 3241","he likes PIG3s","she do not like anything,except apples.\."), number of topics = 2
- Bag-of-words model (doc2bow) result: [(0, '0.271*"pig" + 0.255*"like" + 0.096*"love" + 0.096*"apple" + 0.095*"anything" + 0.094*"do" + 0.094*"not"'), (1, '0.235*"apple" + 0.153*"like" + 0.141*"not" + 0.141*"do" + 0.141*"anything" + 0.140*"love" + 0.050*"pig"')]
- TF-IDF model result: [(0, '0.254*"pig" + 0.160*"like" + 0.124*"love" + 0.116*"apple" + 0.116*"anything" + 0.115*"do" + 0.115*"not"'), (1, '0.201*"love" + 0.148*"not" + 0.148*"do" + 0.148*"anything" + 0.147*"apple" + 0.112*"like" + 0.096*"pig"')]
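The topic strings are easier to compare once parsed into word-to-weight mappings. The sketch below does this for topic 0 of both results; `parse_topic` is a hypothetical helper written for this comparison, not part of Gensim. It also assumes topic 0 means the same thing in both models, which is not guaranteed, since topic indices from separately trained models need not correspond.

```python
import re

def parse_topic(topic_string):
    """Turn '0.271*"pig" + 0.255*"like"' into {'pig': 0.271, 'like': 0.255}."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return {word: float(weight) for weight, word in pairs}

# Topic 0 strings copied from the two results above.
bow_topic0 = parse_topic(
    '0.271*"pig" + 0.255*"like" + 0.096*"love" + 0.096*"apple" + '
    '0.095*"anything" + 0.094*"do" + 0.094*"not"'
)
tfidf_topic0 = parse_topic(
    '0.254*"pig" + 0.160*"like" + 0.124*"love" + 0.116*"apple" + '
    '0.116*"anything" + 0.115*"do" + 0.115*"not"'
)

# Per-word change in weight when moving from bag-of-words to TF-IDF input.
diff = {w: round(tfidf_topic0[w] - bow_topic0[w], 3) for w in bow_topic0}
for word, delta in sorted(diff.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {delta:+.3f}")
```

For this topic, the largest change is the drop in the weight of "like", which appears in two of the three input sentences.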