K-means: a machine learning clustering algorithm

1. Introduction

         Clustering is a typical unsupervised machine learning task. Simply put, it groups similar samples into the same pile. The focus here is not the relationship between features and labels, but the relationships among the samples themselves.

2. K-means clustering

         K-means is the most commonly used of all clustering algorithms, because it is simple and works well. Did the word "simple" make you a little excited? If learning is a food chain, then such easy-to-capture prey is pure fuel for a beginner's confidence. So let's look down on it together and tear into it!

         K is the number of clusters we ultimately want, that is, how many piles the data should be divided into. The process works as follows: given a pile of data, first randomly initialize K cluster centers, and assign every point to the cluster whose center is nearest. Then move each center to the most central position (the mean) of its cluster, and reassign the points to their nearest centers again; repeat until no point changes the cluster it belongs to.
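The loop described above can be sketched in a few lines of plain Python (a minimal illustration rather than a production implementation; the function and variable names here are my own):

```python
import random

def squared_dist(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(data, k, max_iter=100):
    """Minimal K-means: returns (centers, cluster index per point)."""
    centers = random.sample(data, k)   # step 1: random initial centers
    assignments = None
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center
        new_assignments = [
            min(range(k), key=lambda j: squared_dist(p, centers[j]))
            for p in data
        ]
        if new_assignments == assignments:
            break                      # step 4: no point moved, converged
        assignments = new_assignments
        # step 3: move each center to the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(data, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, assignments
```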

 3. Key issues

     1. Choosing K

                  The description above raises an obvious question: how should the K in "K-means" be chosen? Faced with a pile of unfamiliar data, how would you know how many piles it should be divided into? Sorry, there is no perfect answer to this problem. The practical approach is to experiment: try several values of K, and whichever one satisfies your boss is fine.
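The "try several values" advice has a common formalization known as the elbow method: run K-means for a range of K values and watch the within-cluster sum of squared distances (inertia) fall, then pick the K at the "elbow" where further increases stop paying off. A sketch, assuming scikit-learn is installed (the toy data is my own):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs; the "right" K here is clearly 2.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])

inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

# Inertia always falls as K grows; look for the K after which it stops
# falling sharply (here, the big drop is from K=1 to K=2).
for k in sorted(inertias):
    print(k, round(inertias[k], 2))
```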

      2. Initializing the cluster centers

                    There are generally several methods for initializing the cluster centers:

                     1. Random initialization: directly and arbitrarily pick K points at random. When the amount of data is not very large, this causes no visible problems.

                     2. Roulette-wheel selection (the idea behind k-means++). As the name suggests, the roulette method draws the cluster centers from the data itself, and points farther from the existing cluster centers are more likely to be chosen. The code below implements it: first compute the sum of the distances from every point to its nearest existing cluster center, then multiply that sum by a random number in [0, 1] to shrink it. Walk through the points, subtracting each point's distance from this shrunken total; the point at which the total drops to zero or below becomes the next cluster center. Because of the random shrinking, and because the distribution of distances in the data is not fixed, points farther from the existing centers are selected with higher probability.


import random

def get_closest_dist(point, centers):
    """Distance from a point to its nearest existing cluster center."""
    return min(sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers)

def kpp_centers(data_set, k):
    """
    Return k objects from the data set to serve as initial centroids.
    """
    cluster_centers = [random.choice(data_set)]
    d = [0.0] * len(data_set)
    for _ in range(1, k):
        total = 0.0
        for i, point in enumerate(data_set):
            d[i] = get_closest_dist(point, cluster_centers)  # distance to the nearest chosen center
            total += d[i]
        total *= random.random()  # shrink the total by a random factor in [0, 1)
        for i, di in enumerate(d):  # roulette wheel: pick the next center
            total -= di
            if total > 0:
                continue
            cluster_centers.append(data_set[i])
            break
    return cluster_centers

 

      3. Distance calculation

                                In general, we use Euclidean distance to measure the distance between two points, but cosine distance can also be used. Euclidean distance cares about the actual straight-line distance between the points, while cosine distance cares about the difference in direction between them.
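The difference is easy to see in code (a small self-contained sketch; the function names are my own):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """1 - cosine similarity: sensitive to direction only."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1 - dot / (norm_p * norm_q)

# (1, 1) and (10, 10) point the same way but sit far apart:
p, q = (1, 1), (10, 10)
print(euclidean_distance(p, q))  # large: about 12.73
print(cosine_distance(p, q))     # about 0: identical direction
```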

 

4. Summary

             In this article, we introduced the K-means clustering algorithm. It works by repeatedly reassigning points to their nearest cluster center and recomputing the centers, until the clusters no longer change; in this way, distances within clusters are minimized and distances between clusters are maximized. We also introduced two methods of choosing the initial cluster centers, random initialization and the roulette method, and two ways of measuring distance, along with their differences.

              Luffy said that as long as you stay alive, something beautiful will happen. Believe that everything that happens to you was meant to happen, not one thing more. Having thought it through, I no longer worry about past mistakes, nor take pride in short-lived glory. Don't be discouraged by the world, and don't lecture it either! Yesterday something drove me crazy, but I told myself: go ahead and be crazy for a few dozen minutes, alone. When the time is up, get back to work and restore reason. That way, when I feel terribly uncomfortable and lonely, I won't take it to heart.

        

Origin blog.csdn.net/gaobing1993/article/details/108451444