The code is taken from: https://github.com/lawlite19/MachineLearning_Python/blob/master/K-Means/K-Menas.py
1. Initialize the cluster centers: randomly select K sample points as the initial cluster centers
    import numpy as np
    import scipy.io as spio            # used by the main program to load data.mat
    import matplotlib.pyplot as plt    # used by the plotting function

    def kMeansInitCentroids(X, K):
        m = X.shape[0]
        m_arr = np.arange(0, m)                  # generate 0 .. m-1
        centroids = np.zeros((K, X.shape[1]))
        np.random.shuffle(m_arr)                 # shuffle the order of m_arr
        rand_indices = m_arr[:K]                 # take the first K indices
        centroids = X[rand_indices, :]
        return centroids
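Shuffling all indices and keeping the first K amounts to drawing K distinct row indices without replacement. A minimal sketch of that alternative, assuming NumPy's Generator API is available; the function name `init_centroids_choice` and its `seed` parameter are hypothetical and not part of the referenced code:

```python
import numpy as np

def init_centroids_choice(X, K, seed=None):
    """Pick K distinct rows of X as initial centroids (illustrative helper)."""
    rng = np.random.default_rng(seed)      # seed is only for reproducibility
    rand_indices = rng.choice(X.shape[0], size=K, replace=False)
    return X[rand_indices, :]

X_demo = np.arange(20, dtype=float).reshape(10, 2)   # toy data: 10 samples in 2-D
C_demo = init_centroids_choice(X_demo, K=3, seed=0)
print(C_demo.shape)
```

Passing a fixed seed makes a run repeatable, which can help when comparing clustering results across different initializations.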
2. Find the nearest cluster center for each sample and return the assignments
    def findClosestCentroids(X, initial_centroids):
        m = X.shape[0]                     # number of samples
        K = initial_centroids.shape[0]     # number of classes
        dis = np.zeros((m, K))             # distance from each point to each of the K centers

        # Compute the squared distance from each point to each cluster center
        for i in range(m):
            for j in range(K):
                diff = X[i, :] - initial_centroids[j, :]
                dis[i, j] = np.dot(diff, diff)

        # Return the column index of the minimum value in each row of dis,
        # which is the class of that sample.
        # - np.argmin(dis, axis=1) gives exactly one index per row.
        # - Note: several centers may share the minimum distance. Matching with
        #   np.where(dis == np.min(dis, axis=1).reshape(-1, 1)) would then return
        #   more than m entries, and keeping only the first m would misalign them
        #   with the samples. np.argmin picks the first minimum, which is fine
        #   because any of the tied classes is acceptable.
        idx = np.argmin(dis, axis=1)
        return idx
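The double loop above computes one squared Euclidean distance per sample and center pair. The same assignment can be sketched with NumPy broadcasting, assuming the full (m, K, n) difference array fits in memory; `find_closest_vectorized` is a hypothetical name used only for this illustration:

```python
import numpy as np

def find_closest_vectorized(X, centroids):
    """Return, for each row of X, the index of the nearest centroid."""
    # diff has shape (m, K, n): every sample minus every centroid
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    dis = np.sum(diff ** 2, axis=2)     # squared Euclidean distances, shape (m, K)
    return np.argmin(dis, axis=1)       # ties resolve to the lowest index

X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
C_demo = np.array([[0.0, 0.0], [5.0, 5.0]])
idx_demo = find_closest_vectorized(X_demo, C_demo)
print(idx_demo)  # [0 0 1 1]
```

The square root is skipped because it does not change which center is nearest.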
3. Update the cluster centers
    def computerCentroids(X, idx, K):
        n = X.shape[1]                   # dimension of each sample
        centroids = np.zeros((K, n))     # one center per class, with the same dimension as the samples
        for i in range(K):
            # idx == i selects the samples of class i; axis=0 takes the mean of each column
            centroids[i, :] = np.mean(X[np.ravel(idx == i), :], axis=0)
        return centroids
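One edge case in the update step: if no sample is assigned to a class, the mean of an empty selection is NaN. A minimal sketch of one common workaround, keeping the previous center for an empty class; this is an assumption about how one might handle it, not something the referenced code does, and `update_centroids_safe` is a hypothetical name:

```python
import numpy as np

def update_centroids_safe(X, idx, old_centroids):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    idx = np.ravel(idx)
    new_centroids = old_centroids.astype(float).copy()
    for i in range(old_centroids.shape[0]):
        members = X[idx == i]
        if len(members) > 0:             # skip empty clusters instead of producing NaN
            new_centroids[i, :] = members.mean(axis=0)
    return new_centroids

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
idx_demo = np.array([0, 0, 1])
old = np.array([[1.0, 1.0], [9.0, 9.0], [50.0, 50.0]])   # class 2 has no members
new = update_centroids_safe(X_demo, idx_demo, old)
print(new)
```

Other strategies, such as re-seeding the empty class from a random sample, are equally plausible; the choice depends on the data.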
4. K-Means algorithm
    def runKMeans(X, initial_centroids, max_iters, plot_process):
        m, n = X.shape                          # number of samples and their dimension
        K = initial_centroids.shape[0]          # number of clusters
        centroids = initial_centroids           # current cluster centers
        previous_centroids = centroids          # cluster centers from the previous iteration
        idx = np.zeros((m, 1))                  # class each sample belongs to

        for i in range(max_iters):
            print("Number of iterations: %d" % (i + 1))
            idx = findClosestCentroids(X, centroids)
            if plot_process:                    # if plotting is enabled
                plt = plotProcessKMeans(X, centroids, previous_centroids, idx)   # draw the movement of the cluster centers
                previous_centroids = centroids  # reset
                plt.show()
            centroids = computerCentroids(X, idx, K)    # recompute the cluster centers
        return centroids, idx                   # return the cluster centers and the class of each sample
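The loop above always runs for max_iters iterations. A self-contained sketch of a variant that stops once the centers no longer move, assuming a tolerance `tol` on the largest coordinate change; `run_kmeans_until_stable` and `tol` are hypothetical names, and the early stop is an addition, not part of the referenced code:

```python
import numpy as np

def run_kmeans_until_stable(X, initial_centroids, max_iters=100, tol=1e-8):
    """K-Means loop that stops once the centers move less than tol."""
    centroids = initial_centroids.astype(float)
    K = centroids.shape[0]
    idx = np.zeros(X.shape[0], dtype=int)
    n_iters = 0
    for _ in range(max_iters):
        n_iters += 1
        # assignment step: nearest center by squared Euclidean distance
        dis = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = dis.argmin(axis=1)
        # update step: mean of each cluster (an empty cluster keeps its old center)
        new_centroids = centroids.copy()
        for k in range(K):
            if np.any(idx == k):
                new_centroids[k] = X[idx == k].mean(axis=0)
        moved = np.max(np.abs(new_centroids - centroids))
        centroids = new_centroids
        if moved < tol:                  # centers are stable, stop early
            break
    return centroids, idx, n_iters

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = X_demo[[0, 2], :]
centroids, idx, n_iters = run_kmeans_until_stable(X_demo, init)
print(centroids, idx, n_iters)
```

On this toy input the first pass moves the centers to the two cluster means and the second pass confirms they are stable, so the loop ends after two iterations.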
5. Plot the movement of the cluster centers
    def plotProcessKMeans(X, centroids, previous_centroids, idx):
        for i in range(len(idx)):
            if idx[i] == 0:
                plt.scatter(X[i, 0], X[i, 1], c="r")    # 2-D scatter plot of the original data
            elif idx[i] == 1:
                plt.scatter(X[i, 0], X[i, 1], c="b")
            else:
                plt.scatter(X[i, 0], X[i, 1], c="g")

        plt.plot(previous_centroids[:, 0], previous_centroids[:, 1], 'rx', markersize=10, linewidth=5.0)   # previous cluster centers
        plt.plot(centroids[:, 0], centroids[:, 1], 'rx', markersize=10, linewidth=5.0)                     # current cluster centers
        for j in range(centroids.shape[0]):     # for each class, draw a line showing how its center moved
            p1 = centroids[j, :]
            p2 = previous_centroids[j, :]
            plt.plot([p1[0], p2[0]], [p1[1], p2[1]], "->", linewidth=2.0)
        return plt
6. Main program
    if __name__ == "__main__":
        print("Clustering process demo....\n")
        data = spio.loadmat("./data/data.mat")
        X = data['X']
        K = 3
        initial_centroids = kMeansInitCentroids(X, K)
        max_iters = 10
        runKMeans(X, initial_centroids, max_iters, True)
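The main program expects a file ./data/data.mat holding a matrix under the key 'X'; the plotting code reads columns 0 and 1, so X is presumably an (m, 2) array. If the file is not at hand, a synthetic stand-in can be sketched as follows. The blob centers, the spread, and the name `make_demo_data` are arbitrary assumptions for illustration and say nothing about the contents of the real data file:

```python
import numpy as np

def make_demo_data(points_per_cluster=50, seed=0):
    """Generate three 2-D Gaussian blobs as stand-in input for K-Means."""
    rng = np.random.default_rng(seed)
    centers = np.array([[2.0, 2.0], [6.0, 3.0], [4.0, 7.0]])   # arbitrary, for illustration only
    blobs = [c + 0.5 * rng.standard_normal((points_per_cluster, 2)) for c in centers]
    return np.vstack(blobs)

X = make_demo_data()
print(X.shape)  # (150, 2)
```

The resulting X can be passed to kMeansInitCentroids and runKMeans in place of data['X'].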
7. Results
Clustering process demo....

Number of iterations: 1
Number of iterations: 2
Number of iterations: 3
Number of iterations: 4
Number of iterations: 5
Number of iterations: 6
Number of iterations: 7
Number of iterations: 8
Number of iterations: 9
Number of iterations: 10