[Clustering model①] k-means clustering algorithm

When studying a large dataset, we often want to know which subsets of the data are relatively close to one another (for example, which cities have similar consumption habits). This kind of multi-class grouping problem can be handled with a clustering algorithm. After watching Qingfeng's mathematical modeling tutorial, the author summarizes the following points:

The k-means procedure

  1. Choose the number of clusters k and set the maximum number of iterations.
  2. Choose the initial k cluster centers.
  3. Assign every data point to its nearest cluster center.
  4. Move each cluster center to the mean of the points assigned to it.
  5. Repeat steps 3-4 until the centers stop moving or the iteration limit is reached.
    In an actual modeling paper, it is recommended to describe the algorithm flow with a flowchart: it condenses the repetitive steps and helps avoid being flagged by duplicate-content checks.
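The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code; it assumes the data are reasonably well separated so that no cluster ever becomes empty (production code would re-seed an empty cluster).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k initial centers at random from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        # (assumes every cluster is non-empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs this converges in a handful of iterations regardless of which points are drawn as the initial centers.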

The advantages and disadvantages of k-means

Advantages

  1. Simple and fast
  2. Handles large data sets efficiently

Disadvantages

  1. The number of clusters k must be specified by the user in advance; the choice is subjective and lacks a reliable criterion
  2. Sensitive to the choice of initial centers
  3. Sensitive to outliers
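The outlier sensitivity is easy to see numerically: each k-means center is the *mean* of its cluster, and a mean is easily dragged by a single extreme value. A toy sketch (hypothetical numbers):

```python
import numpy as np

# Five points near 0 plus one outlier at 100; the cluster mean
# (which is where k-means would place the center) is pulled far
# away from the bulk of the data.
points = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 100.0])

print(points.mean())      # 17.33... — nowhere near the five typical points
print(np.median(points))  # 1.0 — the robust statistic k-medians uses instead
```

This is why variants such as k-medians or k-medoids are preferred when the data contain outliers.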

k-means++: an improved version of k-means

To avoid these shortcomings as much as possible, the k-means++ algorithm was proposed.

The basic principle

When selecting the initial cluster centers, make them as far apart from one another as possible.

Implementing the basic principle

The improvement lies only in how the cluster centers are chosen; the selection works as follows:

  1. Randomly choose one data point as the first cluster center
  2. Compute each data point's distance to its nearest already-chosen center, and use that distance as a sampling weight to draw the next cluster center [roulette-wheel method]
  3. Repeat step 2 until k cluster centers have been chosen
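The selection steps can be sketched as follows (a minimal illustration of D²-weighted roulette-wheel sampling, not the author's code):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: draw centers with probability proportional
    to the squared distance to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance to the nearest chosen center
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # Roulette wheel: sampling probability proportional to d2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Points far from every chosen center get a large share of the wheel, so each new center tends to land in a region not yet covered.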

Why does this work?

When the next center is drawn in step 2, the farther a data point lies from the already-chosen centers, the larger its weight, and the more likely a point in its vicinity is picked as the next cluster center. In other words: each new cluster center tends to be as far as possible from the existing ones!

Two discussions on the k-means clustering algorithm

  1. We want to divide the data into k classes. How do we choose k?

    In general, judge from the problem itself: choose the number of classes that gives the clearest description of the results.

    For example, for the question "Which cities have similar consumption habits?", k=2 or k=3 is appropriate. With k=2 the description can be: the first class of cities has a higher consumption level and the second a lower one. With k=3, the cities' consumption levels split into three tiers: high, medium, and low.

  2. What if the data dimensions are on inconsistent scales?

    For example, suppose one dimension of our data is a length in metres (m) and another is a weight in tonnes (t). The two scales differ so much that distances computed directly on the raw data are meaningless. What then?

    Standardize each dimension with $\frac{X_i - \bar{X}}{\sigma_X}$, i.e. subtract that dimension's mean and divide by its standard deviation, then cluster the standardized data.
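The z-score standardization takes one line in NumPy. A minimal sketch with hypothetical measurements (metres in the first column, tonnes in the second):

```python
import numpy as np

# Hypothetical objects: column 0 is length (m), column 1 is weight (t).
# The raw scales differ wildly, so Euclidean distances would be
# dominated by the weight column.
X = np.array([[2.0, 1500.0],
              [3.0, 1800.0],
              [2.5, 1600.0]])

# z-score each column: subtract its mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # ~[0, 0] — each column now has zero mean
print(Z.std(axis=0))   # [1, 1]  — and unit standard deviation
```

After this transformation every dimension contributes on an equal footing to the distance calculation.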


Origin blog.csdn.net/weixin_44559752/article/details/107847818