Python learning (112): KMeans clustering for natural language processing with sklearn (clear version)

Foreword:

KMeans is one of the most commonly used clustering methods: it is easy to understand and efficient to run, so it is very widely applied. Today we implement KMeans through the powerful sklearn package and apply it to a natural language processing task, text clustering. Only the code is shown here; the theory and derivation of the algorithm are not repeated.

Dataset address:

https://download.csdn.net/download/u013521274/11080094

For those who are new to Python, it is a real headache to find code that looks understandable but cannot be run because the data is missing. So the dataset is provided here, so that more friends can learn hands-on and make progress together.

Code:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfTransformer,CountVectorizer,HashingVectorizer

#----------------------------------------- Step 1: compute TF-IDF ---------------------------------------
corpus = []
with open('data/Result.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f.readlines():
        corpus.append(line.strip())
print('Lines in the corpus:', len(corpus))  # one line per document (or per class, depending on the data)

vectorizer = CountVectorizer()  # converts the texts into a term-frequency matrix; element a[i][j] is the frequency of word j in document i

transformer = TfidfTransformer()  # computes the tf-idf weight of each word

# the inner fit_transform converts the texts into a term-frequency matrix; the outer one computes tf-idf
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))  # the tf-idf values

word = vectorizer.get_feature_names_out()  # all words in the bag-of-words model -> the vocabulary
                                           # (on scikit-learn < 1.0 this method is called get_feature_names())

weight = tfidf.toarray()  # densify the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in document i
num = np.array(weight)    # this weight matrix is in fact a large, mostly zero matrix
print('Number of columns:', num.shape[1])     # number of columns = vocabulary size
print('Vocabulary size: ' + str(len(word)))   # should match the column count
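
# Note: TfidfVectorizer is documented as equivalent to CountVectorizer followed by
# TfidfTransformer, so the two steps above can be collapsed into one.
# A minimal sketch (commented out so the original pipeline stays untouched):
# from sklearn.feature_extraction.text import TfidfVectorizer
# tfidf = TfidfVectorizer().fit_transform(corpus)
# weight = tfidf.toarray()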


#----------------------------------------- Step 2: KMeans clustering ------------------------------------
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=4)  # cluster into 4 groups; other parameters can also be set
s = clf.fit(weight)         # fit() performs the clustering; input: the dense tf-idf matrix
print(s)                    # shows the KMeans parameters that were used
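
# An aside: KMeans also has fit_predict(), which returns the label of each
# sample directly (equivalent to calling fit() and then reading labels_):
# labels = KMeans(n_clusters=4).fit_predict(weight)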

# cluster centers
print(len(clf.cluster_centers_))  # len of a matrix is its number of rows, so this prints 4
# print(clf.cluster_centers_)     # the center point of each of the 4 clusters

# inertia_ can be used to judge whether the number of clusters is appropriate: the smaller
# the distance, the better the clusters; pick the cluster count at the elbow point
# print(clf.inertia_)  # e.g. 958.1511490717049
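
# A minimal elbow-method sketch (an assumption: k from 2 to 9 is a reasonable range
# to scan for this corpus). Plot inertia_ against k and pick the k where the curve bends:
# inertias = [KMeans(n_clusters=k).fit(weight).inertia_ for k in range(2, 10)]
# plt.plot(range(2, 10), inertias, 'o-')
# plt.show()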

#--------------------------------------- Step 3: extract the index (label) of each document's cluster ----------------

# the cluster each sample belongs to ------- the essential output: the whole point of
# clustering is to know which cluster each document ends up in
# print(clf.labels_)
# extract the information in clf.labels_ so it can be plotted
listlabel = []
for i, label in enumerate(clf.labels_):
    listlabel.append([i, label])

frame = pd.DataFrame(listlabel, columns=['index', 'class'])  # pandas makes the filtering below convenient

'''the indices of the documents in cluster 0 are stored in list0'''
list0 = []                                     # clusters are numbered from 0: cluster 0, cluster 1, ...
data0 = frame[frame['class'] == 0].iloc[:, 0]  # .ix was removed from pandas; .iloc works the same here
for m in data0:
    list0.append(m)

list1 = []
data1 = frame[frame['class'] == 1].iloc[:, 0]
for m in data1:
    list1.append(m)

list2 = []
data2 = frame[frame['class'] == 2].iloc[:, 0]
for m in data2:
    list2.append(m)

list3 = []
data3 = frame[frame['class'] == 3].iloc[:, 0]
for m in data3:
    list3.append(m)
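
# A more compact equivalent (a sketch, same result as the four blocks above):
# list0, list1, list2, list3 = (
#     frame.loc[frame['class'] == c, 'index'].tolist() for c in range(4)
# )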

#--------------------------------------- Step 4: plotting after PCA dimensionality reduction -------------------------------#
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # reduce to 2 dimensions for plotting
newData = pca.fit_transform(weight)
print('Rows after dimensionality reduction:', len(newData))
# print(newData)
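
# To check how much of the original variance the 2-D projection keeps (a sketch):
# print(pca.explained_variance_ratio_)  # two ratios; their sum is the retained variance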

# 5A scenic spots
x1 = []        # x coordinates
y1 = []        # y coordinates
for j in list0:
    x1.append(newData[j][0])
    y1.append(newData[j][1])

# animals
x2 = []
y2 = []
for j in list1:
    x2.append(newData[j][0])
    y2.append(newData[j][1])

# people
x3 = []
y3 = []
for j in list2:
    x3.append(newData[j][0])
    y3.append(newData[j][1])

# countries
x4 = []
y4 = []
for j in list3:
    x4.append(newData[j][0])
    y4.append(newData[j][1])

# four colors: red, green, blue, black
plt.plot(x1, y1, 'or')
plt.plot(x2, y2, 'og')
plt.plot(x3, y3, 'ob')
plt.plot(x4, y4, 'ok')
plt.show()
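
# An equivalent one-liner (a sketch): matplotlib's scatter colors every point by its
# cluster label in a single call, with no need to split the points into four lists:
# plt.scatter(newData[:, 0], newData[:, 1], c=clf.labels_)
# plt.show()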

Clustering results:

[Scatter plot: the documents fall into four clusters, drawn in red, green, blue and black in the 2-D PCA projection.]
