Personal Interpretation of K-means algorithm

Disclaimer: This is an original blog post, distributed under the CC 4.0 BY-SA license. Please attach the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/Meteoraki/article/details/100624986


I. Introduction

K-means is a very typical distance-based clustering algorithm. What is clustering? We all know the saying "birds of a feather flock together": people with similar traits attract one another and can be grouped into the same class, and data objects that share certain characteristics can be grouped in the same way. K-means is used to find the different categories hidden in a batch of data and to separate the data accordingly. The algorithm abstracts the data as points and works with the distances between them, so clustering can be carried out on one-, two-, or even three-dimensional point clouds. Distance serves as the similarity measure: the closer two objects are, the more similar they are considered to be. K-means is simple and effective, and when the data are densely distributed in certain regions it produces good clustering results.
① Advantages of K-means
The algorithm is fast and simple, and it scales well to large data sets. It is most practical on data with a roughly spherical distribution, and the clustering result can be further improved by optimizing the distance calculation.
② Disadvantages of K-means
Although simple and efficient, the algorithm has some weaknesses. The most prominent one is that a suitable value of K is very hard to estimate, and if the initial random "pseudo center points" happen to be badly placed, the clustering result suffers. In addition, as the amount of data grows very large, the time cost of the algorithm becomes excessive.

Figure: example clustering result (from Baidu Wenku)

II. Algorithm Idea

The defining characteristic (and limitation) of K-means is that the number of clusters K must be specified in advance, along with K initial "pseudo center points"; the algorithm then iterates, updating the centers until they no longer change and become the "real center points".
The algorithm steps are as follows:
1. Input the value of K, i.e., the number of clusters we want the data grouped into.
2. Randomly select K data points from the data set as the initial "pseudo center points".
3. For each of the other data points, compute its distance to every "pseudo center point", compare them, and assign the point to the cluster of the nearest one.
4. Each "pseudo center point" now owns a group of similar data. Recompute each center from its cluster, so the newly generated "pseudo center point" moves toward the center of the class; iterating this way eventually determines the "real center points".
5. If the distance between a new "pseudo center point" and the previous one is smaller than some minimum error, i.e., the centers have stabilized between iterations, the algorithm is considered to have converged and terminates.
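The five steps above can be sketched compactly in C. The sketch below is only an illustration of the loop structure, not the full test code later in this article: the data values, the array names data/center/label, and the use of one-dimensional points are all assumptions made to keep it short, and convergence is checked by how far the centers move (step 5).

#include <stdio.h>
#include <math.h>

#define N 6
#define K 2

int main(void)
{
    double data[N] = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9};
    double center[K] = {1.0, 8.0};           /* step 2: initial "pseudo center points" */
    int label[N];
    double moved = 1.0;

    while (moved > 1e-6)                     /* step 5: stop once the centers stabilize */
    {
        /* step 3: assign each point to its nearest center */
        for (int i = 0; i < N; ++i)
        {
            label[i] = 0;
            for (int k = 1; k < K; ++k)
                if (fabs(data[i] - center[k]) < fabs(data[i] - center[label[i]]))
                    label[i] = k;
        }
        /* step 4: move each center to the mean of its cluster */
        moved = 0.0;
        for (int k = 0; k < K; ++k)
        {
            double sum = 0.0;
            int cnt = 0;
            for (int i = 0; i < N; ++i)
                if (label[i] == k) { sum += data[i]; ++cnt; }
            if (cnt > 0)
            {
                double next = sum / cnt;
                moved += fabs(next - center[k]);
                center[k] = next;
            }
        }
    }
    printf("centers: %f %f\n", center[0], center[1]);
    return 0;
}

On this toy data the two centers settle at the means of the two obvious groups after a couple of rounds.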

III. The Usual Distance Formula (the Most Straightforward): Euclidean Distance

① In two-dimensional space, the Euclidean distance between two points is
D = sqrt((x1 - x2)^2 + (y1 - y2)^2)

② In three-dimensional space the formula is
D = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

③ Extended to n-dimensional space, the Euclidean distance formula is
D = sqrt(Σ (xi1 - xi2)^2), where i = 1, 2, ..., n

Here xi1 is the i-th coordinate of the first point and xi2 is the i-th coordinate of the second point.
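The n-dimensional formula above is just a loop over the squared coordinate differences. As a small sketch (the helper name euclid and the sample points are illustrative assumptions, separate from the test code below, which hard-codes the two-dimensional case):

#include <stdio.h>
#include <math.h>

/* Euclidean distance between two n-dimensional points a and b */
double euclid(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);   /* accumulate (xi1 - xi2)^2 */
    return sqrt(sum);
}

int main(void)
{
    double p[3] = {1.0, 2.0, 3.0};
    double q[3] = {4.0, 6.0, 3.0};
    printf("%f\n", euclid(p, q, 3));   /* sqrt(9 + 16 + 0) = 5 */
    return 0;
}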

IV. Test Code

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define N 12
#define K 2

typedef struct
{
    double x;
    double y;
} Point;

int center[N];        /* records which cluster each point belongs to */
Point point[N] = {
    {2.0, 3.0},
    {2.5, 2.2},
    {8.0, 9.0},
    {8.9, 9.0},
    {1.0, 3.0},
    {7.0, 10.0},
    {1.0, 2.0},
    {9.0, 9.5},
    {7.0, 3.0},
    {1.0, 2.1},
    {8.9, 9.8},
    {2.1, 1.5}
};
Point mean[K];        /* stores the K center points */

/* Euclidean distance between two points */
double getDistance(Point p1, Point p2)
{
    double d;
    d = sqrt((p1.x - p2.x) * (p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y));
    return d;
}

/* compute the center point of each cluster */
void getMean(int center[N])
{
    Point tep;
    int i, j, count = 0;
    for (i = 0; i < K; ++i)
    {
        count = 0;
        tep.x = 0.0;  /* clear the accumulators before computing each cluster's center */
        tep.y = 0.0;
        for (j = 0; j < N; ++j)
        {
            if (i == center[j])
            {
                count++;
                tep.x += point[j].x;
                tep.y += point[j].y;
            }
        }
        tep.x /= count;
        tep.y /= count;
        mean[i] = tep;
    }
    for (i = 0; i < K; ++i)
    {
        printf("The new center point of %d is : \t( %f, %f )\n", i + 1, mean[i].x, mean[i].y);
    }
}

/* compute the sum of squared errors */
float getE()
{
    int i, j;
    float cnt = 0.0, sum = 0.0;
    for (i = 0; i < K; ++i)
    {
        for (j = 0; j < N; ++j)
        {
            if (i == center[j])
            {
                cnt = (point[j].x - mean[i].x) * (point[j].x - mean[i].x) + (point[j].y - mean[i].y) * (point[j].y - mean[i].y);
                sum += cnt;
            }
        }
    }
    return sum;
}

/* assign each of the N points to a cluster */
void cluster()
{
    int i, j, q;
    double min;
    double dis[N][K];
    for (i = 0; i < N; ++i)
    {
        min = 999999.0;
        for (j = 0; j < K; ++j)
        {
            dis[i][j] = getDistance(point[i], mean[j]);
            /* printf("%f\n", dis[i][j]); */  /* can be used to print the distance from each point to each center */
        }
        for (q = 0; q < K; ++q)
        {
            if (dis[i][q] < min)
            {
                min = dis[i][q];
                center[i] = q;
            }
        }
        printf("( %.0f, %.0f )\t in cluster-%d\n", point[i].x, point[i].y, center[i] + 1);
    }
    printf("-----------------------------\n");
}

/* main function */
int main()
{
    int i, j, n = 0;
    float temp1;
    float temp2;
    printf("----------Data sets----------\n");
    for (i = 0; i < N; ++i)
    {
        printf("\t( %.0f, %.0f )\n", point[i].x, point[i].y);
    }
    printf("-----------------------------\n");
    /*
    The initial centers can also be chosen at random, seeded with the current time:
    srand((unsigned int)time(NULL));
    for (i = 0; i < K; ++i)
    {
        j = rand() % N;
        mean[i].x = point[j].x;
        mean[i].y = point[j].y;
    }
    */
    mean[0].x = point[0].x;  /* initialize the K center points */
    mean[0].y = point[0].y;
    mean[1].x = point[2].x;
    mean[1].y = point[2].y;

    cluster();       /* first clustering round with the preset centers */
    temp1 = getE();  /* first squared error */
    n++;             /* n counts how many rounds the final clustering takes */
    printf("The E1 is: %f\n\n", temp1);

    getMean(center);
    cluster();
    temp2 = getE();  /* form new clusters from the updated centers, then compute the squared error */
    n++;
    printf("The E2 is: %f\n\n", temp2);

    while (fabs(temp2 - temp1) != 0)  /* compare the two squared errors; keep iterating until they are equal */
    {
        temp1 = temp2;
        getMean(center);
        cluster();
        temp2 = getE();
        n++;
        printf("The E%d is: %f\n", n, temp2);
    }
    printf("The total number of cluster is : %d\n\n", n);  /* count of iterations */
    system("pause");
    return 0;
}
Code reference blog: http://blog.csdn.net/triumph92/article/details/41128049
