K-means Clustering: Algorithm Principle and C++ Implementation

Clustering groups data according to the characteristics of the data itself, with no manual labeling, so it is a form of unsupervised learning. k-means is one of the simplest clustering algorithms.

The k-means algorithm partitions n data objects into k clusters so that objects within the same cluster are highly similar, while objects in different clusters are dissimilar. Similarity is measured against a "central object" (centroid) of each cluster, computed as the mean of the objects in that cluster.

Based on this, we can state the objective function that k-means optimizes: given N data points to be divided into K clusters, k-means minimizes

J=\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\left \| x_{n}-\mu_{k} \right \|^{2}

where \mu_{k} is the center of the k-th cluster, and r_{nk} is 1 when the n-th data point belongs to cluster k and 0 otherwise.
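
Both steps of the algorithm follow from this objective. With the centers fixed, J is minimized by assigning each point to its nearest center; with the assignments fixed, setting the derivative of J with respect to \mu_{k} to zero yields the mean update:

r_{nk}=\begin{cases}1, & k=\arg\min_{j}\left \| x_{n}-\mu_{j} \right \|^{2}\\0, & \text{otherwise}\end{cases}

\frac{\partial J}{\partial \mu_{k}}=-2\sum_{n=1}^{N}r_{nk}\left ( x_{n}-\mu_{k} \right )=0\quad\Rightarrow\quad\mu_{k}=\frac{\sum_{n}r_{nk}x_{n}}{\sum_{n}r_{nk}}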

The process is as follows:

1. Arbitrarily select k of the n data objects as the initial cluster centers. Assign each of the remaining objects to the cluster whose center it is most similar to (i.e., nearest by distance).

2. Recompute each cluster's center as the mean of all objects in the cluster, then repeat the assignment step. Iterate until the criterion function converges.

The squared-error (variance) criterion is generally used as the criterion function. The resulting k clusters then have the following characteristics: each cluster is as compact as possible, and the clusters are separated as far as possible.

Every center update decreases the objective function, so J converges to a minimum over the iterations, but a global minimum is not guaranteed. k-means is also very sensitive to noise.

C++ implementation:

#include <cfloat>    // DBL_MAX
#include <cmath>     // pow
#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <iostream>
#include <vector>
using std::cout;
using std::endl;
using std::vector;

class ClusterMethod
{
private:
	double **mpSample;      //input samples
	double **mpCenters;     //cluster centers
	double **pDistances;    //distance matrix (sample x cluster)
	int mSampleNum;         //number of samples
	int mClusterNum;        //number of clusters
	int mFeatureNum;        //number of features per sample
	int *ClusterResult;     //cluster index assigned to each sample
	int MaxIterationTimes;  //maximum number of iterations

public:
	void GetClusterd(vector<std::vector<std::vector<double> > > &v, double** feateres, int ClusterNum, int SampleNum, int FeatureNum);//public interface


private:
	void Initialize(double** feateres, int ClusterNum, int SampleNum, int FeatureNum);//set up internal state
	void k_means(vector<vector<vector<double> > > &v);//algorithm entry point
	void k_means_Initialize();//initialize the cluster centers
	void k_means_Calculate(vector<vector<vector<double> > > &v);//clustering computation
};

The member functions are implemented as follows:

//param@v          receives the clustering result; v[i][j][k] is feature k of the j-th sample in cluster i (indices from 0)
//param@feateres   input data; feateres[i][j] is feature j of sample i (i, j from 0)
//param@ClusterNum number of clusters
//param@SampleNum  number of samples
//param@FeatureNum number of features per sample
void ClusterMethod::GetClusterd(vector<std::vector<std::vector<double> > >&v, double** feateres, int ClusterNum, int SampleNum, int FeatureNum)
{
	Initialize(feateres, ClusterNum, SampleNum, FeatureNum);
	k_means(v);
}


//initialize the class's internal data
void ClusterMethod::Initialize(double** feateres, int ClusterNum, int SampleNum, int FeatureNum)
{
	mpSample = feateres;
	mFeatureNum = FeatureNum;
	mSampleNum = SampleNum;
	mClusterNum = ClusterNum;
	MaxIterationTimes = 50;

	mpCenters = new double*[mClusterNum];
	for (int i = 0; i < mClusterNum; ++i)
	{
		mpCenters[i] = new double[mFeatureNum];
	}

	pDistances = new double*[mSampleNum];
	for (int i = 0; i < mSampleNum; ++i)
	{
		pDistances[i] = new double[mClusterNum];
	}

	ClusterResult = new int[mSampleNum];
}
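
One thing the class never does is release the memory allocated in Initialize. A destructor along these lines would plug the leak; this is a sketch under the ownership assumptions above (mpCenters, pDistances, and ClusterResult are owned by the class, while mpSample just points at the caller's data and is not freed here), and it would also need to be declared in the class definition:

ClusterMethod::~ClusterMethod()
{
	for (int i = 0; i < mClusterNum; ++i)
	{
		delete[] mpCenters[i];//per-center feature arrays
	}
	delete[] mpCenters;

	for (int i = 0; i < mSampleNum; ++i)
	{
		delete[] pDistances[i];//per-sample distance rows
	}
	delete[] pDistances;

	delete[] ClusterResult;
}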


//algorithm entry point
void ClusterMethod::k_means(vector<vector<vector<double> > >&v)
{
	k_means_Initialize();
	k_means_Calculate(v);
}


//initialize the cluster centers
void ClusterMethod::k_means_Initialize()
{
	for (int i = 0; i < mClusterNum; ++i)
	{
		//mpCenters[i] = mpSample[i];//wrong: this copies the pointer, not the data

		for (int k = 0; k < mFeatureNum; ++k)
		{
			mpCenters[i][k] = mpSample[i][k];
		}
	}
}

The initialization above takes the first k data points as the initial cluster centers (where k is the number of clusters). Note that you must not initialize with mpCenters[i] = mpSample[i]: these are pointers, so that would copy the pointer rather than the data.
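
To make the pointer pitfall concrete:

	mpCenters[i] = mpSample[i];//wrong: center i now aliases sample i, so updating the center later would overwrite the input data
	for (int k = 0; k < mFeatureNum; ++k)
		mpCenters[i][k] = mpSample[i][k];//correct: copy the feature values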

You could also choose random data points as the initial centers, in which case running the algorithm on the same data several times may produce different results. Because the k-means result does not necessarily reach the global minimum, the simplest remedy is to run it several times (rerunning the whole function, which is different from the iterations within a single clustering run) and keep the clustering with the smallest objective function, as sketched below. If, as above, the first k data points are always used to initialize the centers, multiple runs will not help escape a local minimum.
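
A minimal restart wrapper could look like this. Here run_kmeans is a hypothetical helper, not part of the class above: it is assumed to perform one complete clustering from a fresh random initialization, fill v with the result, and return the final objective J (the class would need small changes, random center initialization and access to J, before it could provide this):

#include <cfloat>
#include <vector>

//hypothetical helper: one full k-means run from a random initialization;
//fills v with the clustering and returns the final objective J
double run_kmeans(std::vector<std::vector<std::vector<double> > > &v);

std::vector<std::vector<std::vector<double> > > best_of(int restarts)
{
	double bestJ = DBL_MAX;
	std::vector<std::vector<std::vector<double> > > best, v;
	for (int r = 0; r < restarts; ++r)
	{
		double J = run_kmeans(v);//one complete clustering
		if (J < bestJ)//keep the run with the smallest objective
		{
			bestJ = J;
			best = v;
		}
	}
	return best;
}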

The clustering step and the cluster-center update are implemented as follows:

//clustering procedure
void ClusterMethod::k_means_Calculate(vector<vector<vector<double> > > &v)
{
	double J = DBL_MAX;//objective function
	int time = MaxIterationTimes;

	while (time)
	{
		double now_J = 0;//objective function for the current centers
		--time;

		//reset the distances
		for (int i = 0; i < mSampleNum; ++i)
		{
			for (int j = 0; j < mClusterNum; ++j)
			{
				pDistances[i][j] = 0;
			}
		}

		//squared Euclidean distance from each sample to each center
		for (int i = 0; i < mSampleNum; ++i)
		{
			double min = DBL_MAX;
			for (int j = 0; j < mClusterNum; ++j)
			{
				for (int k = 0; k < mFeatureNum; ++k)
				{
					pDistances[i][j] += pow(mpSample[i][k] - mpCenters[j][k], 2);
				}
				if (pDistances[i][j] < min)
				{
					min = pDistances[i][j];
				}
			}
			now_J += min;//only the distance to the nearest center enters J
		}

		if (J - now_J < 0.01)//stop when the objective no longer decreases
		{
			break;
		}
		J = now_J;

		//a holds this iteration's clustering result
		vector<vector<vector<double> > > a(mClusterNum);
		for (int i = 0; i < mSampleNum; ++i)
		{
			double min = DBL_MAX;
			for (int j = 0; j < mClusterNum; ++j)
			{
				if (pDistances[i][j] < min)
				{
					min = pDistances[i][j];
					ClusterResult[i] = j;//assign sample i to its nearest center
				}
			}

			vector<double> vec(mFeatureNum);
			for (int k = 0; k < mFeatureNum; ++k)
			{
				vec[k] = mpSample[i][k];
			}
			a[ClusterResult[i]].push_back(vec);
			//v[ClusterResult[i]].push_back(vec); would fail here: v was never sized
		}
		v = a;

		//compute the new cluster centers
		for (int j = 0; j < mClusterNum; ++j)
		{
			for (int k = 0; k < mFeatureNum; ++k)
			{
				mpCenters[j][k] = 0;
			}
		}

		for (int j = 0; j < mClusterNum; ++j)
		{
			for (int k = 0; k < mFeatureNum; ++k)
			{
				for (size_t s = 0; s < v[j].size(); ++s)
				{
					mpCenters[j][k] += v[j][s][k];
				}
				if (v[j].size() != 0)
				{
					mpCenters[j][k] /= v[j].size();//mean of the cluster's members
				}
			}
		}
	}

	//print the cluster centers
	for (int j = 0; j < mClusterNum; ++j)
	{
		for (int k = 0; k < mFeatureNum; ++k)
		{
			cout << mpCenters[j][k] << " ";
		}
		cout << endl;
	}
}

The function that generates random data:

//param@datanum    number of data points
//param@featurenum number of features per data point
double** createdata(int datanum, int featurenum)
{
	srand((int)time(0));
	double** data = new double*[datanum];
	for (int i = 0; i < datanum; ++i)
	{
		data[i] = new double[featurenum];
	}
	cout << "Input data:" << endl;
	for (int i = 0; i < datanum; ++i)
	{
		for (int j = 0; j < featurenum; ++j)
		{
			data[i][j] = (rand() % 30) / 10.0;//random values in {0.0, 0.1, ..., 2.9}
			cout << data[i][j] << " ";
		}
		cout << endl;
	}

	return data;
}
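
rand() is enough for a quick demo. If better-distributed or reproducible random data is wanted, a C++11 <random> variant of the value generation might look like this sketch (not part of the original program; seeding std::mt19937 with a fixed constant makes runs repeatable, which is handy when comparing clusterings):

#include <random>

//draw one feature value from {0.0, 0.1, ..., 2.9}, matching the demo above
double random_feature(std::mt19937 &gen)
{
	std::uniform_int_distribution<int> dist(0, 29);//30 steps of 0.1
	return dist(gen) / 10.0;
}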

The main function:

int main()
{
	vector<std::vector<std::vector<double> > > v;
	double** data = createdata(10, 2);
	ClusterMethod a;
	a.GetClusterd(v, data, 3, 10, 2);
	for (size_t i = 0; i < v.size(); ++i)
	{
		cout << "Cluster " << i + 1 << endl;
		for (size_t j = 0; j < v[i].size(); ++j)
		{
			for (size_t k = 0; k < v[i][j].size(); ++k)
			{
				cout << v[i][j][k] << " ";
			}
			cout << endl;
		}
	}

	//release the input data allocated in createdata (10 samples)
	for (int i = 0; i < 10; ++i)
	{
		delete[] data[i];
	}
	delete[] data;
}

(Sample output omitted: the program prints the cluster centers, then the members of each cluster.)

Algorithm Improvement

k-means updates each cluster center with the mean of the cluster's members, which makes it quite sensitive to noise.

k-medoids replaces the mean above with the median, so individual points that are too large or too small cannot drag the cluster center away.

Speaking of the median: it can be found with a selection algorithm (which I have written about before) that does not need to sort all the data, and is therefore faster than a full sort.

But this does not mean k-medoids is always better than k-means. With large amounts of data, running time matters too, and computing a mean takes less time than finding a median. Which one is better depends on the specific requirements.
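
As an illustration, here is a sketch of a median-based center update using std::nth_element, the standard library's selection algorithm. Strictly speaking, this coordinate-wise median is the k-medians update; classic k-medoids instead chooses an actual data point (the medoid) as the center:

#include <algorithm>
#include <vector>

//update one center coordinate with the median of the cluster members' values
//for that feature (a sketch; values is passed by copy so the cluster data
//itself is left unsorted; for even sizes this returns the upper median)
double median_update(std::vector<double> values)
{
	if (values.empty()) return 0.0;//caller should keep the old center instead
	size_t mid = values.size() / 2;
	std::nth_element(values.begin(), values.begin() + mid, values.end());
	return values[mid];//median in O(n) on average, no full sort
}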

 
