Machine Learning (Zhou Zhihua), the "Watermelon Book": Chapter 9, Exercise 9.10, implemented in Python

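Copyright notice: original work by the author; do not repost without permission. To repost or quote, contact the author by private message. Zetrue_Li https://blog.csdn.net/weixin_37922777/article/details/90138139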

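  • Problem statement

Design an improved k-means algorithm that determines the number of clusters automatically, implement it, and run it on watermelon dataset 4.0.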

  • Principle

How k-means works

The k-means algorithm
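For reference, the standard k-means objective is the squared error of a partition C = {C_1, ..., C_k}, where each μ_i is the mean vector of cluster C_i. This is the textbook formulation, written out here as a reminder rather than taken from the post itself:

```latex
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

The algorithm alternates between assigning each sample to its nearest mean vector and recomputing the mean vectors, until the means stop changing.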

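Metric for determining k automatically: choose the k that minimizes E.

A smaller E means that samples within a cluster are more similar to each other and samples in different clusters are less similar. The metric also penalizes large k, so the clusters stay as large as possible (the samples here are assumed to fall into only two classes, so k should come out close to 2).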
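Reading the metric off the `calc_E` function in the program listing gives the form below. This is a reconstruction from the code, not the author's original typesetting; m denotes the number of samples, μ_i the mean vector of cluster C_i, and the middle sum runs over ordered pairs of distinct clusters:

```latex
E = \frac{1}{m} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2
  \;-\; \frac{1}{k} \sum_{a \neq b} \lVert \mu_a - \mu_b \rVert_2^2
  \;+\; \frac{2k}{m}
```

The first term rewards tight clusters, the second rewards well-separated cluster means, and the third grows with k, which discourages splitting the data into many small clusters.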

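  • Procedure

Obtaining the dataset

The listing below is watermelon dataset 3.0, saved as data_3.txt. Its columns are, in order: ID (编号), color (色泽), root (根蒂), knock sound (敲声), texture (纹理), navel (脐部), touch (触感), density (密度), sugar content (含糖率), good melon (好瓜). Note that the exercise and the program listing use dataset 4.0 (data_4.txt). The program computes means and Euclidean distances over every remaining column, so it needs purely numeric features; from this table only density and sugar content qualify.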

编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
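The table above mixes categorical and numeric columns, while the distance used in this post is Euclidean. One possible way to feed such a file to the program is to keep only the two numeric columns. The helper name `load_numeric` and the inlined two-row sample are assumptions for illustration and are not part of the original program:

```python
import io
import pandas as pd

# Header and first two rows copied from the listing above, inlined so the
# example runs without a file on disk.
SAMPLE = """编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
"""

def load_numeric(source):
    """Hypothetical helper: read the CSV and keep only density and sugar content."""
    data = pd.read_csv(source)
    return data[['密度', '含糖率']]

features = load_numeric(io.StringIO(SAMPLE))
print(features)
```

Passing a file name such as 'data_3.txt' instead of the `StringIO` object would work the same way, since `pd.read_csv` accepts both.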

Algorithm implementation

Reading the data

Computing the Euclidean distance between two sample vectors
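As a small illustration of this step, the distance can also be written with NumPy array arithmetic instead of an explicit loop. The function name `euclidean` is an assumption made for this sketch:

```python
import numpy as np

def euclidean(x, mu):
    """Euclidean distance between two equal-length numeric vectors."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Square the coordinate differences, sum them, then take the square root.
    return float(np.sqrt(np.sum((x - mu) ** 2)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```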

Computing the mean vector of a given cluster
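A minimal sketch of this step on made-up data: the cluster is given as a list of row labels, and its mean vector is the column-wise average of those rows. The column names and values here are hypothetical:

```python
import pandas as pd

# Toy data with hypothetical column names; values chosen so the means are exact.
data = pd.DataFrame({'density': [1.0, 3.0, 9.0],
                     'sugar_content': [2.0, 4.0, 8.0]})

def cluster_mean(data, indexes):
    """Mean vector of the rows whose index labels are listed in `indexes`."""
    return data.loc[indexes].mean().to_numpy()

mu = cluster_mean(data, [0, 1])
print(mu)
```

For an empty cluster the mean is undefined (pandas returns NaN), so the caller should handle that case separately.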

Static k-means: partitions the samples into k clusters for a fixed k
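One round of k-means consists of an assignment step and an update step. The sketch below shows both on a plain NumPy array, using broadcasting instead of the row-by-row loop; the function names and the toy data are assumptions for illustration, not the author's code:

```python
import numpy as np

def assign_clusters(X, centers):
    """Return, for each row of X, the index of its nearest center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcasting: (m, 1, d) - (1, k, d) -> (m, k, d); summing over d
    # gives the squared distance from every sample to every center.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def update_centers(X, labels, centers):
    """Recompute each center as the mean of its members; keep it if the cluster is empty."""
    X = np.asarray(X, dtype=float)
    new_centers = np.array(centers, dtype=float)
    for i in range(len(new_centers)):
        members = X[labels == i]
        if len(members) > 0:
            new_centers[i] = members.mean(axis=0)
    return new_centers

X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]
labels = assign_clusters(X, centers)
print(labels)  # [0 0 1 1]
print(update_centers(X, labels, centers))
```

Repeating these two steps until the centers stop changing (or an iteration cap is reached) gives the static algorithm for a fixed k.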

Computing the error of a partition, using the metric for determining k
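To make the metric concrete, here is a small worked example that mirrors the three terms of `calc_E` from the program listing (within-cluster term, between-cluster term, penalty on k). The function name `score_partition` and the four sample points are assumptions chosen so the arithmetic can be checked by hand:

```python
import numpy as np

def score_partition(X, labels, centers):
    """Within-cluster term minus between-cluster term plus a penalty on k."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    m, k = len(X), len(centers)
    # Squared distance from each sample to its own center, averaged over m.
    inside = ((X - centers[labels]) ** 2).sum() / m
    # Squared distances over all ordered pairs of centers (pairs with a == b
    # contribute zero), divided by k.
    diff = centers[:, None, :] - centers[None, :, :]
    outside = (diff ** 2).sum() / k
    return inside - outside + 2 * k / m

X = [[0, 0], [0, 2], [4, 0], [4, 2]]
labels = [0, 0, 1, 1]
centers = [[0, 1], [4, 1]]
# inside = 4/4 = 1, outside = (16 + 16)/2 = 16, penalty = 2*2/4 = 1
print(score_partition(X, labels, centers))  # -14.0
```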

Dynamic k-means: returns the best k
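The selection rule can be isolated from the clustering itself: walk through the scores in order of increasing k and stop at the first k whose score is no better than the previous one. The sketch below assumes the scores are already computed; the function name `pick_k` and the score values are hypothetical. Unlike the program listing, it returns the last k tried when the score never stops decreasing:

```python
def pick_k(scores_by_k):
    """Given (k, E) pairs in increasing k, return the k just before E first stops decreasing."""
    best_k, best_e = scores_by_k[0]
    for k, e in scores_by_k[1:]:
        if e >= best_e:
            # E did not improve, so the previous k is taken as the answer.
            return best_k
        best_k, best_e = k, e
    return best_k

# Hypothetical averaged scores for k = 2..5.
scores = [(2, -1.0), (3, -2.5), (4, -2.0), (5, -3.0)]
print(pick_k(scores))  # 3
```

Because the search stops at the first increase, a lower score at a larger k (k = 5 in this example) is never examined; this keeps k small, which matches the intent stated for the metric.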

The main function calls the functions above and prints the partition obtained with the automatically determined k

  • Results

  • Program listing:

import random as rd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def loadData(filename):
    dataSet = pd.read_csv(filename)
    dataSet.drop(columns=['编号'], inplace=True)
    return dataSet

# Euclidean distance between a sample vector x and a mean vector mu
def calc_distance(x, mu):
    distance = 0
    for xi, yi in zip(x, mu):
        distance += (xi - yi) ** 2

    return distance ** 0.5

# Mean vector of the cluster whose row labels are listed in indexs
def calc_mu(dataSet, indexs):
    Ci = dataSet.loc[indexs]
    return np.array(Ci.mean())

# Static k-means: partition dataSet into k clusters (at most `iterate` rounds)
def k_means(dataSet, k, iterate=100):
    # Pick k distinct samples at random as the initial mean vectors
    Mu_indexs = rd.sample(range(dataSet.shape[0]), k)
    Mu = [np.array(dataSet.loc[index]) for index in Mu_indexs]
    now, flag = 0, True

    while flag and now < iterate:
        # Assignment step: put each sample into the cluster of its nearest mean
        C = [[] for _ in range(k)]
        for index, row in dataSet.iterrows():
            distances = [calc_distance(row, mu) for mu in Mu]
            label = np.argmin(distances)
            C[label].append(index)

        # Update step: recompute the means; stop once none of them changes
        flag = False
        for i in range(len(Mu)):
            if not C[i]:
                # Empty cluster: keep the old mean instead of producing NaN
                continue
            new_mu = calc_mu(dataSet, C[i])
            if (new_mu != Mu[i]).any():
                flag = True
            Mu[i] = new_mu

        now += 1

    return C, Mu

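# Score a partition with the metric used to determine k (smaller is better)
def calc_E(dataSet, C, Mu, k):
    E_inside, E_outside, size = 0, 0, dataSet.shape[0]
    # Within-cluster term: squared distance from each sample to its cluster mean
    for Ci, mu in zip(C, Mu):
        for index in Ci:
            distance = calc_distance(dataSet.loc[index], mu)
            E_inside += distance ** 2
    # Normalize by the sample count so the terms stay on comparable scales
    E_inside /= size

    # Between-cluster term: squared distance between every ordered pair of means
    for a in range(k):
        for b in range(k):
            if a == b:
                continue
            distance = calc_distance(Mu[a], Mu[b])
            E_outside += distance ** 2

    E_outside /= k
    # The 2*k/size term penalizes large k
    return E_inside - E_outside + 2 * k / size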

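# Dynamic k-means: increase k from 2 and stop as soon as the averaged E stops decreasing
def Dynamic_K_means(dataSet):
    size, before_E = len(dataSet), float('inf')
    for k in range(2, size):
        Es = []
        # Run k-means several times and average E to smooth out the random initialization
        for _ in range(10):
            C, Mu = k_means(dataSet, k)
            E = calc_E(dataSet, C, Mu, k)
            Es.append(E)
        mean_E = sum(Es) / len(Es)

        if before_E <= mean_E:
            return k - 1
        before_E = mean_E

    # Fallback when E kept decreasing for every k tried
    return 1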

if __name__ == '__main__':
    filename = 'data_4.txt'
    dataSet = loadData(filename)
    k = Dynamic_K_means(dataSet)

    Best_E, Best_C = float('inf'), None
    # Run k-means several times with the chosen k and keep the best partition
    for _ in range(10):
        C, Mu = k_means(dataSet, k)
        E = calc_E(dataSet, C, Mu, k)
        if E < Best_E:
            Best_E = E
            Best_C = C

    print('k =', k)
    for Ci in Best_C:
        print(Ci)

