A preliminary visualization of the data showed that the labeled data contain some isolated points (noise points), which affect the k-means clustering.
The processing steps are as follows:
Run k-means for 10 iterations to get the cluster centers
Compute the Euclidean distance from each data point to its own cluster center, then the mean and variance of those distances
Fit a normal distribution to the distances and exclude from the training set every point whose distance to its cluster center is greater than mean + 1.96 * standard deviation (the square root of the variance; 1.96 is the value at which the area under the normal distribution is 0.95)
Obtain the new training set and the new anchors
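
The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the original code: the function names (`nearest`, `kmeans`, `filter_outliers`, `remove_noise_and_recluster`) are made up for the example, points are assumed to be tuples of numbers such as (width, height), and the initial centers are picked at random. The sketch multiplies 1.96 by the standard deviation of the distances, which is the usual scale for that constant; that is an assumption about the intent.

```python
import math
import random
import statistics


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means with random initial centers; returns the centers."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers


def filter_outliers(points, centers, z=1.96):
    """Drop points whose distance to their own center exceeds mean + z * std."""
    dists = [math.dist(p, centers[nearest(p, centers)]) for p in points]
    mean = statistics.fmean(dists)
    std = statistics.pstdev(dists)  # assumption: standard deviation, not variance
    threshold = mean + z * std
    kept = [p for p, d in zip(points, dists) if d <= threshold]
    return kept, threshold


def remove_noise_and_recluster(points, k, iterations=10):
    """Cluster, remove noise points, then cluster again on the cleaned set."""
    centers = kmeans(points, k, iterations)
    kept, _ = filter_outliers(points, centers)
    return kept, kmeans(kept, k, iterations)
```

Note that a noise point which happens to be chosen as an initial center can end up with a cluster of its own and a distance of zero, so it would survive the filter; trying several seeds or a more careful initialization may help in that case.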