K-means algorithm notes (Python 3)

The basic idea of clustering

As the saying goes, "Birds of a feather flock together, and people divide into groups."

Clustering is a form of unsupervised learning. Simply put, it groups similar objects into the same cluster: the more similar the objects within each cluster, the better the clustering.

Definition: given a set of data objects, clustering divides the data into clusters so that the partition satisfies two conditions: (1) each cluster contains at least one object; (2) each object belongs to exactly one cluster.

The basic idea: given k, the algorithm first produces an initial partition, then improves it by iteratively reassigning objects, so that each new partition is better than the previous one.

K-Means algorithm

K-Means is the most classic partition-based clustering method and one of the ten classic data-mining algorithms. Simply put, K-Means is the process of dividing the data into K groups without any supervisory signal.

Clustering is the most common form of unsupervised learning: given a set of data, we expect the clustering algorithm to dig out the information hidden in the data. Clustering is applied very widely, e.g. clustering customer behavior or clustering Google News stories.

The value K is the number of clusters in the result; simply put, it is the number of groups we want to divide the data into.

Algorithm

The specific steps of the algorithm are as follows:

  1. Randomly select K center points.
  2. Assign each data point to its nearest center point.
  3. Recalculate each center point as the mean of the points assigned to its class.
  4. Reassign each data point to its nearest center point.
  5. Repeat steps 3 and 4 until the assignments no longer change or the maximum number of iterations is reached (R's kmeans, for example, defaults to 10 iterations).

The following code randomly generates a data set and runs the clustering on it:
    import numpy as np
    import matplotlib.pyplot as plt

    # Euclidean distance between two points
    def distance(e1, e2):
        return np.sqrt((e1[0] - e2[0]) ** 2 + (e1[1] - e2[1]) ** 2)

    # center (mean) of a collection of points
    def means(arr):
        return np.array([np.mean([e[0] for e in arr]), np.mean([e[1] for e in arr])])

    # element of arr farthest from the current centers, used to initialize the cluster centers
    def farthest(k_arr, arr):
        f = [0, 0]
        max_d = 0
        for e in arr:
            d = 0
            for i in range(len(k_arr)):
                d = d + np.sqrt(distance(k_arr[i], e))  # summed (root) distance to all chosen centers
            if d > max_d:
                max_d = d
                f = e
        return f

    # element of arr nearest to a (not used in the main routine below)
    def closest(a, arr):
        c = arr[0]
        min_d = distance(a, arr[0])
        for e in arr[1:]:
            d = distance(a, e)
            if d < min_d:
                min_d = d
                c = e
        return c


    if __name__ == "__main__":
        ## generate random two-dimensional coordinates (a real data set would be better)
        arr = np.random.randint(100, size=(100, 1, 2))[:, 0, :]

        ## initialize the cluster centers and the per-cluster containers;
        ## use a float array so the mean updates are not truncated to integers
        m = 5
        r = np.random.randint(len(arr) - 1)
        k_arr = np.array([arr[r]], dtype=float)
        cla_arr = [[]]
        for i in range(m - 1):
            k = farthest(k_arr, arr)
            k_arr = np.concatenate([k_arr, np.array([k], dtype=float)])
            cla_arr.append([])

        ## iterative clustering
        n = 20
        cla_temp = cla_arr
        for i in range(n):    # iterate n times
            for e in arr:     # assign each element to the nearest class
                ki = 0        # assume the first center is the nearest
                min_d = distance(e, k_arr[ki])
                for j in range(1, len(k_arr)):
                    if distance(e, k_arr[j]) < min_d:   # found a nearer cluster center
                        min_d = distance(e, k_arr[j])
                        ki = j
                cla_temp[ki].append(e)
            # update the cluster centers (skipped on the last pass so the
            # final assignments are kept for plotting)
            for k in range(len(k_arr)):
                if n - 1 == i:
                    break
                k_arr[k] = means(cla_temp[k])
                cla_temp[k] = []

        ## visualization
        col = ['hotpink', 'aqua', 'chartreuse', 'yellow', 'lightsalmon']
        for i in range(m):
            plt.scatter(k_arr[i][0], k_arr[i][1], linewidth=10, color=col[i])
            plt.scatter([e[0] for e in cla_temp[i]], [e[1] for e in cla_temp[i]], color=col[i])
        plt.show()
Training results: a scatter plot of the five clusters, each drawn in its own color together with its center (figure omitted).
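
For comparison, the same kind of data can be clustered with scikit-learn's KMeans. This is a minimal sketch assuming scikit-learn is installed; the library is not used in the post's own code:

    import numpy as np
    from sklearn.cluster import KMeans

    arr = np.random.randint(100, size=(100, 2))
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(arr)
    print(km.cluster_centers_)   # the five learned centers
    print(km.labels_[:10])       # cluster index of the first ten points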
Details of K-Means

1. How should the value of K be chosen? How do we know how many categories there are?
   A: There is really no established rule; the number of categories depends on experience and judgment. The usual practice is to try several values of K and see which division of the data is easier to interpret and better fits the purpose of the analysis. Alternatively, the SSE (sum of squared errors) can be computed for several K values and compared; since the SSE keeps shrinking as K grows, the usual choice is the K at the "elbow" where the decrease levels off, rather than the literal minimum.
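
   As a minimal sketch of that comparison (assuming scikit-learn is available; its inertia_ attribute is exactly this SSE):

    import numpy as np
    from sklearn.cluster import KMeans

    arr = np.random.randint(100, size=(100, 2))

    # SSE (inertia) for a range of candidate K values
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(arr)
        print(k, km.inertia_)   # look for the "elbow" where the drop flattens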

2. How are the initial K centroids chosen?
   A: The most common method is to pick them at random. The choice of the initial centroids affects the final clustering result, so the algorithm should be run several times and the most reasonable result kept. There are also optimized initializations. The first is to pick points as far away from each other as possible: choose the first point, choose the second point as the one farthest from the first, choose the third point so that its distance to the first two is as large as possible, and so on (a sketch follows below). The second is to take the result of another clustering algorithm (such as hierarchical clustering) and choose one point from each of its categories.
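
   A minimal sketch of that farthest-point rule; farthest_first_centers is a hypothetical helper using the standard max-min criterion, whereas the farthest() in the code above sums the distances instead:

    import numpy as np

    def farthest_first_centers(points, k, seed=0):
        # pick the first center at random, then repeatedly pick the point
        # whose distance to its nearest already-chosen center is largest
        rng = np.random.default_rng(seed)
        centers = [points[rng.integers(len(points))]]
        while len(centers) < k:
            d = [min(np.linalg.norm(p - c) for c in centers) for p in points]
            centers.append(points[int(np.argmax(d))])
        return np.array(centers)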

3. Can the centroid-selection process of K-Means get stuck and never stop?
   A: No; there is a mathematical proof that K-Means converges. The general idea uses the SSE (the sum of squared errors), i.e. the sum over all points of the squared distance to the centroid of the cluster the point belongs to. Both the assignment step and the centroid-update step can only decrease this quantity, and since it is bounded below by zero, the algorithm must eventually converge.
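
   As a sketch, an SSE function like the hypothetical one below can be evaluated after the assignment step of the loop above (before cla_temp is cleared), e.g. print(sse(cla_temp, k_arr)); its value should never increase from one iteration to the next:

    def sse(clusters, centers):
        # sum of squared distances from each point to its cluster's center;
        # clusters[i] is the list of points assigned to centers[i]
        total = 0.0
        for pts, c in zip(clusters, centers):
            for p in pts:
                total += (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
        return total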
