Clustering in Machine Learning: The k-Means Clustering Algorithm, a Plain-Language Explanation

Table of contents

Machine learning includes supervised learning and unsupervised learning. What is unsupervised learning?

How to do unsupervised learning on the sample set?

What is clustering?

How to measure the similarity between samples?

What is K-means?

A worked example


Introduction: K-means is an unsupervised learning algorithm. This post introduces K-means through the ideas behind unsupervised learning.

Machine learning includes supervised learning and unsupervised learning. What is unsupervised learning?

Unsupervised learning analyzes a sample set without knowing the samples' category labels, learning from the data itself to uncover latent relationships and internal structural features.

How to do unsupervised learning on the sample set?

Clustering.

What is clustering?

As the saying goes, 'birds of a feather flock together': similar things are grouped together and dissimilar things are kept apart. Clustering measures the similarity between samples and divides the sample set into several disjoint sub-sample sets whose union is the original sample set. Each sub-sample set is called a cluster, and the specific category or meaning of each cluster is left for us to define.

How to measure the similarity between samples?

Generally speaking, similarity can be measured by computing the distance between samples. Each sample x consists of several attributes, x = {x1, x2, ..., xn}. Given two samples, we can compute a distance from their attribute values. The most common choice is the Euclidean distance, d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2), but it is not the only one; the Minkowski distance generalizes it by raising each coordinate difference to a power p instead of 2. This way of measuring distance leads directly to the K-means algorithm.
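As a minimal sketch, both distances can be computed in a few lines of Python (the function name and the sample values below are just for illustration):

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two samples; p=2 gives Euclidean, p=1 Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski_distance((0, 1), (1, 5)))       # Euclidean: sqrt(1 + 16) ≈ 4.123
print(minkowski_distance((0, 1), (1, 5), p=1))  # Manhattan: 1 + 4 = 5
```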

What is K-means?

The K is the number of clusters we want to divide the sample set into, that is, how many subsets the sample set is split into. The 'means' refers to the centroid update rule: each cluster's centroid is updated to the mean (average) of all the points in that cluster.

  The algorithm steps:

(1) Randomly select k points as the initial centroids.

(2) Calculate the distance from each sample point in the sample set to each of the k centroids.

(3) Assign each sample point to its nearest centroid. Once every sample point has been assigned, this forms k clusters (subsets).

(4) Update each cluster's centroid to the mean of all the sample points in that cluster.

(5) Repeat steps (2)-(4). If no sample point changes cluster between two consecutive iterations, the centroids have stabilized and the updates stop; alternatively, stop after a specified maximum number of iterations. As long as some sample point still changes cluster, keep iterating. A minimal implementation of these steps is sketched below.
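Here is a sketch of the steps above in Python. Squared Euclidean distances are used for the nearest-centroid comparison, which picks the same centroid as the true Euclidean distance; the function name and the empty-cluster handling are my own choices, not part of the original write-up:

```python
import random

def kmeans(points, k, max_iters=100):
    """Plain k-means on a list of numeric tuples; a teaching sketch, not production code."""
    centroids = random.sample(points, k)  # step (1): random initial centroids
    clusters = []
    for _ in range(max_iters):
        # steps (2)-(3): assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # step (4): recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid)
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step (5): stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```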

A worked example:

Suppose we have five sample points, each with two attributes: x1(0, 1), x2(1, 5), x3(2, 4), x4(3, 0), x5(4, 4).

Plotting these five points on a plane makes the layout easy to see.

First, set k = 2, that is, randomly select two points as the initial centroids: z1(2, 1) and z2(3, 1).

Then calculate the distance from each point to the two centroids and assign each point to the nearest one. This puts x1, x2, and x3 in the cluster of z1, and x4 and x5 in the cluster of z2; you can verify the calculation yourself.

Then update the centroids; the rule is that each new centroid is the mean of all the sample points in its cluster. The two updated centroids are z1(1, 10/3) and z2(3.5, 2).

Then calculate the distance between each sample point and the two new centroids and assign each point to its nearest centroid again. After this round, the cluster to which every sample point belongs is unchanged.

Since no sample point changed cluster, the centroid updates stop; each sample point is labeled with its cluster, and the result is returned.
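This iteration can be reproduced in a few lines, reusing the assignment and update logic from the sketch above but with the random initialization replaced by the fixed centroids of this example (the variable names are just for illustration):

```python
points = [(0, 1), (1, 5), (2, 4), (3, 0), (4, 4)]  # x1..x5
centroids = [(2, 1), (3, 1)]                       # the fixed initial centroids z1, z2

for step in range(1, 11):
    clusters = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        clusters[nearest].append(p)
    new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) for c in clusters]
    print(f"iteration {step}: centroids={new_centroids}")
    if new_centroids == centroids:                 # no point changed cluster, so means are stable
        break
    centroids = new_centroids
```

Iteration 1 prints centroids (1.0, 3.333...) and (3.5, 2.0), matching z1(1, 10/3) and z2(3.5, 2); iteration 2 leaves them unchanged and the loop stops.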

Summary: The above is the idea behind the K-means algorithm. The algorithm needs no labeled training data: given a set of samples, it measures the similarity among them and groups similar samples into the same cluster as far as possible.

In practice, a sample may have many more than two attributes, and the sample set may contain tens of thousands of samples or more. Computational cost then matters, because every iteration computes the distance from every sample point to every centroid.

Another thing to note is that the value of k is set by us, so choosing a suitable k requires an evaluation metric for the quality of the resulting partition. The usual metric is SSE (sum of squared errors): the sum of the squared distances from each sample point to its cluster's centroid. Increasing k generally reduces SSE and improves the fit; once increasing k no longer reduces SSE appreciably, there is no need to increase it further.

Beyond tuning k, there is also the bisecting K-means algorithm, which tends to cluster better than plain K-means. 'Bisecting' means repeatedly splitting one cluster into two, each time choosing the split that reduces SSE the most.
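As a sketch, SSE can be computed from the clusters and centroids returned by the kmeans function sketched earlier (the elbow-style loop below assumes that function and the points list from the example are in scope):

```python
def sse(clusters, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
               for i, cluster in enumerate(clusters)
               for p in cluster)

# Elbow-style check: run k-means for several k and watch where SSE stops dropping sharply.
for k in range(1, 5):
    centroids, clusters = kmeans(points, k)
    print(k, sse(clusters, centroids))
```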



Origin: blog.csdn.net/BaoITcore/article/details/125312201