Clustering groups data according to the characteristics of the data itself, with no manual labeling, so it is a form of unsupervised learning. The k-means clustering algorithm is one of the simplest such algorithms.
The k-means algorithm partitions n data objects into k clusters such that objects within the same cluster are highly similar, while objects in different clusters are dissimilar. Similarity is measured relative to the mean of the objects in each cluster, which serves as a "central object" (centroid) and is recomputed at each step.
Based on this idea, we can derive the objective function that k-means optimizes. Suppose N data points must be divided into K clusters; what k-means does is minimize an objective function J that sums, over all points, the squared distance from each point to the center of the cluster it is assigned to.
Here the k-th term uses the k-th cluster center, weighted by an indicator that is 1 when the n-th data point belongs to cluster k and 0 otherwise.
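Written out explicitly (the notation here is introduced for clarity and is not in the original text: x_n is the n-th data point, mu_k the k-th cluster center, and r_nk the 0/1 assignment indicator just described), the objective is:

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2,
\qquad
r_{nk} =
\begin{cases}
1 & \text{if } \mathbf{x}_n \text{ is assigned to cluster } k,\\
0 & \text{otherwise.}
\end{cases}
```

Minimizing J over the assignments r_nk with the centers fixed gives the nearest-center assignment step; minimizing over the centers mu_k with the assignments fixed gives the mean-update step.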
The process is as follows:
1. Arbitrarily select k of the n data objects as the initial cluster centers. Assign each remaining object to the cluster whose center it is most similar to (i.e., closest in distance).
2. Recompute the center of each resulting cluster (the mean of all objects in that cluster), then repeat the assignment step. The process repeats until the criterion function converges.
The criterion function is usually the sum of squared errors (a variance-like measure). The resulting k clusters then have the following properties: each cluster is as compact as possible, and the clusters are as well separated as possible.
Each update of the cluster centers decreases the objective function, so the iteration drives J down to a minimum; however, only a local minimum is guaranteed, not a global one. k-means is also very sensitive to noise.
C++ implementation (the headers are added here so the code compiles as shown: vector, DBL_MAX, pow, rand/time, and I/O):
#include <cfloat>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;

class ClusterMethod
{
private:
double **mpSample;//input samples
double **mpCenters;//cluster centers
double **pDistances;//distance matrix (sample-to-center)
int mSampleNum;//number of samples
int mClusterNum;//number of clusters
int mFeatureNum;//number of features per sample
int *ClusterResult;//cluster assignment for each sample
int MaxIterationTimes;//maximum number of iterations
public:
void GetClusterd(vector<std::vector<std::vector<double> > >&v, double** feateres, int ClusterNum, int SampleNum, int FeatureNum);//external interface
private:
void Initialize(double** feateres, int ClusterNum, int SampleNum, int FeatureNum);//initialize class members
void k_means(vector<vector<vector<double> > >&v);//algorithm entry point
void k_means_Initialize();//initialize the cluster centers
void k_means_Calculate(vector<vector<vector<double> > >&v);//clustering computation
};
The member functions are implemented as follows:
//param@v receives the clustering result; v[i][j][k] is the k-th feature of the j-th data point in the i-th cluster (all indices 0-based)
//param@feateres input data; feateres[i][j] is the j-th feature of the i-th data point (i, j 0-based)
//param@ClusterNum number of clusters
//param@SampleNum number of data points
//param@FeatureNum number of features per data point
void ClusterMethod::GetClusterd(vector<std::vector<std::vector<double> > >&v, double** feateres, int ClusterNum, int SampleNum, int FeatureNum)
{
Initialize(feateres, ClusterNum, SampleNum, FeatureNum);
k_means(v);
}
//initialize class members
void ClusterMethod::Initialize(double** feateres, int ClusterNum, int SampleNum, int FeatureNum)
{
mpSample = feateres;
mFeatureNum = FeatureNum;
mSampleNum = SampleNum;
mClusterNum = ClusterNum;
MaxIterationTimes = 50;
mpCenters = new double*[mClusterNum];
for (int i = 0; i < mClusterNum; ++i)
{
mpCenters[i] = new double[mFeatureNum];
}
pDistances = new double*[mSampleNum];
for (int i = 0; i < mSampleNum; ++i)
{
pDistances[i] = new double[mClusterNum];
}
ClusterResult = new int[mSampleNum];
}
//algorithm entry point
void ClusterMethod::k_means(vector<vector<vector<double> > >&v)
{
k_means_Initialize();
k_means_Calculate(v);
}
//initialize the cluster centers
void ClusterMethod::k_means_Initialize()
{
for (int i = 0; i < mClusterNum; ++i)
{
//mpCenters[i] = mpSample[i];
for (int k = 0; k < mFeatureNum; ++k)
{
mpCenters[i][k] = mpSample[i][k];
}
}
}
The initialization above makes the i-th data point the i-th cluster center, i.e. the first k samples become the initial centers. (Note that you must not initialize with mpCenters[i] = mpSample[i]; that copies a pointer, so the center would share storage with the sample.)
The cluster centers could also be chosen at random from the data, in which case running the same data several times may give different results. Because k-means does not necessarily reach the global minimum, the simplest remedy is to run the algorithm several times (meaning re-running the entire function, which is different from the iteration count within one clustering run) and keep the result with the smallest objective function. With the deterministic choice above, where the first k data points always become the initial centers, multiple runs cannot mitigate the local-minimum problem.
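A random initialization could look like the sketch below (a hypothetical helper, not part of the class above): pick k distinct sample indices at random; each chosen sample then becomes one initial center.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Pick `k` distinct indices from 0..n-1 uniformly at random (Forgy-style
// initialization). Each selected sample becomes one initial cluster center.
std::vector<int> pick_initial_centers(int n, int k, unsigned seed)
{
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);       // fill with 0, 1, ..., n-1
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);  // random permutation
    idx.resize(k);                              // keep the first k indices
    return idx;
}
```

Inside k_means_Initialize one would then copy mpSample[idx[i]][f] into mpCenters[i][f] feature by feature, exactly as the sequential version does, and rerun the whole clustering with a different seed each time.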
The clustering step and the cluster-center update are implemented as follows:
//clustering process
void ClusterMethod::k_means_Calculate(vector<vector<vector<double> > >&v)
{
double J = DBL_MAX;//objective function
int time = MaxIterationTimes;
while (time)
{
double now_J = 0;//objective function after the most recent center update
--time;
//reset the distance matrix
for (int i = 0; i < mSampleNum; ++i)
{
for (int j = 0; j < mClusterNum; ++j)
{
pDistances[i][j] = 0;
}
}
//compute the squared Euclidean distance from each sample to each center
for (int i = 0; i < mSampleNum; ++i)
{
for (int j = 0; j < mClusterNum; ++j)
{
for (int k = 0; k < mFeatureNum; ++k)
{
pDistances[i][j] += pow(mpSample[i][k] - mpCenters[j][k], 2);
}
now_J += pDistances[i][j];
}
}
if (J - now_J < 0.01)//stop once the objective function no longer decreases
{
break;
}
J = now_J;
//a holds this iteration's clustering result
vector<vector<vector<double> > > a(mClusterNum);
for (int i = 0; i < mSampleNum; ++i)
{
double min = DBL_MAX;
for (int j = 0; j < mClusterNum; ++j)
{
if (pDistances[i][j] < min)
{
min = pDistances[i][j];
ClusterResult[i] = j;
}
}
vector<double> vec(mFeatureNum);
for (int k = 0; k < mFeatureNum; ++k)
{
vec[k] = mpSample[i][k];
}
a[ClusterResult[i]].push_back(vec);
// v[ClusterResult[i]].push_back(vec); cannot write into v directly here, because v has not been given a size
}
v = a;
//compute the new cluster centers
for (int j = 0; j < mClusterNum; ++j)
{
for (int k = 0; k < mFeatureNum; ++k)
{
mpCenters[j][k] = 0;
}
}
for (int j = 0; j < mClusterNum; ++j)
{
for (int k = 0; k < mFeatureNum; ++k)
{
for (int s = 0; s < v[j].size(); ++s)
{
mpCenters[j][k] += v[j][s][k];
}
if (v[j].size() != 0)
{
mpCenters[j][k] /= v[j].size();
}
}
}
}
//print the cluster centers
for (int j = 0; j < mClusterNum; ++j)
{
for (int k = 0; k < mFeatureNum; ++k)
{
cout << mpCenters[j][k] << " ";
}
cout << endl;
}
}
The function that generates random test data:
//param@datanum number of data points
//param@featurenum number of features per data point
double** createdata(int datanum, int featurenum)
{
srand((int)time(0));
double** data = new double*[datanum];
for (int i = 0; i < datanum; ++i)
{
data[i] = new double[featurenum];
}
cout << "Input data:" << endl;
for (int i = 0; i < datanum ; ++i)
{
for (int j = 0; j < featurenum; ++j)
{
data[i][j] = ((int)rand() % 30) / 10.0;
cout << data[i][j] << " ";
}
cout << endl;
}
return data;
}
The main function:
int main()
{
vector<std::vector<std::vector<double> > >v;
double** data = createdata(10, 2);
ClusterMethod a;
a.GetClusterd(v, data, 3, 10, 2);
for (int i = 0; i < v.size(); ++i)
{
cout << "Cluster " << i+1 << ":" << endl;
for (int j = 0; j < v[i].size(); ++j)
{
for (int k = 0; k < v[i][j].size(); ++k)
{
cout << v[i][j][k] << " ";
}
cout << endl;
}
}
}
Sample output is omitted here; the results differ between runs because the input data are random.
Algorithm Improvement
k-means updates each cluster center with the mean of the cluster's members, which is what makes it sensitive to noise.
k-medoids uses the medoid (a median-like representative) in place of the mean, so that individual values that are too large or too small cannot distort the cluster center.
Speaking of medians, this brings to mind the selection algorithm written about earlier, which finds a median without sorting all the data and is therefore faster than a full sort.
But this does not mean k-medoids is always better than k-means. With large amounts of data, running time matters as well, and computing a mean takes less time than finding a median. Which one is better depends on the specific requirements.
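To make the difference concrete, here is a minimal illustrative sketch (not from the original post) of the k-medoids center update for a single cluster: the new center is the member that minimizes the summed distance to all other members, so it is always an actual data point and cannot be dragged away by an outlier the way a mean can.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two feature vectors of equal length.
double sq_dist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
        d += (a[k] - b[k]) * (a[k] - b[k]);
    return d;
}

// k-medoids center update for one cluster: return the index of the member
// whose total distance to every other member is smallest.
int find_medoid(const std::vector<std::vector<double> >& cluster)
{
    int best = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < cluster.size(); ++i)
    {
        double cost = 0.0;
        for (std::size_t j = 0; j < cluster.size(); ++j)
            cost += sq_dist(cluster[i], cluster[j]);
        if (cost < bestCost)
        {
            bestCost = cost;
            best = (int)i;
        }
    }
    return best;
}
```

This update is O(m^2) per cluster of size m, versus O(m) for the mean, which is the time cost referred to above.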