Andrew Ng Machine Learning (13): The K-Means Clustering Algorithm

1. The idea of clustering

A clustering algorithm automatically divides unlabeled data into groups (classes); it belongs to the family of unsupervised learning methods. The goal is to ensure that data in the same group share similar features, as shown below:
[Figure: unlabeled data points separated into clusters of similar samples]

Clustering works from the distance or similarity (affinity) between samples: the more similar two samples are, i.e., the smaller their difference, the more likely they are grouped into the same cluster. The result is a set of clusters with high similarity inside each cluster and large differences between clusters.

2. The K-Means clustering algorithm

Related concepts:

The number of clusters to be obtained: the value K

Centroid: the mean vector of each cluster, i.e., the average of the cluster's points taken in each dimension

Distance measure: usually the Euclidean distance or the cosine similarity (standardize the data first).
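For reference, for two points $x, y \in \mathbb{R}^n$ the two measures are given by the standard formulas (presumably what the original figure showed):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad \cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}}$$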

Algorithm steps:

1. First decide on the value of k, i.e., the number of clusters we want to group the data set into.

2. Randomly select k data points from the data set as the initial centroids.

3. For each point in the data set, compute its distance to every centroid (e.g., the Euclidean distance) and assign it to the set belonging to the nearest centroid.

4. After all the data have been assigned, there are k sets in total. Then recompute the centroid of each set.

5. If the distance between each newly computed centroid and the corresponding original centroid is less than some threshold (meaning the positions of the recomputed centroids have essentially stopped changing, i.e., the algorithm has stabilized, or converged), we can consider that the clustering has reached the desired result, and the algorithm terminates.

6. If the new centroids have moved a large distance from the original ones, iterate steps 3-5. A minimal implementation of this loop is sketched below.
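To make the steps concrete, here is a minimal NumPy sketch of the loop above (an illustration, not the course's code; the tolerance `tol` and the empty-cluster handling are my own choices):

```python
import numpy as np

def k_means(X, k, tol=1e-4, max_iter=100, seed=0):
    """Minimal k-means. X: (n, d) array of samples, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 5-6: stop once the centroids barely move; otherwise iterate.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```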

3. Mathematical principles
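In symbols, K-Means looks for clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:

$$J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$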
The heuristic that K-Means uses is very simple; the following set of figures illustrates it:
[Figure: panels (a)-(f) showing two iterations of K-Means with k = 2]
Figure (a) shows the initial data set; assume k = 2. In figure (b) we randomly choose the two centroids corresponding to the two classes, the red centroid and the blue centroid. We then compute the distance from every point to each of the two centroids and mark each sample with the class of the centroid it is closest to; as shown in figure (c), after computing the distances to the red and blue centroids we obtain the class of every sample point after the first iteration. We then recompute the new centroids of the points currently marked red and blue, as shown in figure (d): the positions of the new red and blue centroids have changed. Figures (e) and (f) repeat the process of figures (c) and (d), i.e., mark all points with the class of the nearest centroid and recompute the centroids. In the end we obtain the two clusters shown in figure (f).

4. A worked example

There are six points in the coordinate system:
[Figure: the six sample points P1-P6 plotted in the plane]

1. Since we want two groups, K equals 2, and we randomly select two points as the initial centroids: P1 and P2.

2. For each remaining point, compute its distance to these two points using the Pythagorean theorem (the Euclidean distance):
[Table: the distance from each remaining point to P1 and to P2]

3. Result of the first grouping:

    Group A: P1

    Group B: P2, P3, P4, P5, P6

4. Compute the centroids of Group A and Group B:

    The centroid of Group A is still P1 = (0, 0)

    The new centroid of Group B is ((1+3+8+9+10)/5, (2+1+8+10+7)/5) = (6.2, 5.6)

5. Compute the distance from every point to each centroid again:
[Table: the distance from each point to the centroids (0, 0) and (6.2, 5.6)]

6. Result of the second grouping:

    Group A: P1, P2, P3

    Group B: P4, P5, P6

7. Recompute the centroids:

    New centroid of Group A: (1.33, 1)

    New centroid of Group B: (9, 8.33)

8. Compute the distance from each point to the new centroids once more:
[Table: the distance from each point to the centroids (1.33, 1) and (9, 8.33)]

9. Result of the third grouping:

    Group A: P1, P2, P3

    Group B: P4, P5, P6

The result of the third grouping is identical to that of the second, which means the algorithm has converged, and the clustering ends.
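The walkthrough is easy to reproduce with scikit-learn. The exact coordinates were only given in the original figures; the values below are an assumption that is consistent with the centroid sums computed above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed coordinates for the six points (consistent with the arithmetic above).
X = np.array([[0, 0], [1, 2], [3, 1], [8, 8], [9, 10], [10, 7]], dtype=float)

# Start from P1 and P2 as the initial centroids, matching step 1 of the example.
km = KMeans(n_clusters=2, init=X[:2], n_init=1).fit(X)
print(km.labels_)           # expected: [0 0 0 1 1 1], i.e. {P1,P2,P3} and {P4,P5,P6}
print(km.cluster_centers_)  # expected: approx. [[1.33, 1.0], [9.0, 8.33]]
```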

5. Advantages and disadvantages of K-Means

Advantages:

1. The principle is simple, it is easy to implement, and it converges quickly.

2. When the clusters are dense and clearly separated from one another, the results are good.

3. The only main parameter that needs tuning is the number of clusters, k.

Disadvantages:

1. The value of K must be given in advance, and in many cases estimating K is very difficult.

2. K-Means is sensitive to the choice of the initial centroids: clustering results obtained with different random seeds can be completely different, which greatly affects the outcome.

3. It is rather sensitive to noise and outliers (this sensitivity can, however, be used to detect outliers).

4. Being an iterative method, it may only reach a locally optimal solution rather than the global optimum.

6. Details
1. How should the value of K be chosen?

A: How many categories there should be depends mainly on personal experience and judgment. The usual practice is to try several values of K and see which grouping is easier to interpret and better matches the purpose of the analysis. Alternatively, the cost E can be computed for several values of K, and the K that minimizes E is chosen.

You can also refer to the elbow method that Andrew Ng describes in the course videos: plot the cost obtained for each value of K as a curve; the point where the curve starts to flatten out gives a suitable K:
[Figure: cost function J plotted against K; the "elbow" where the curve flattens marks a good K]
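A sketch of the elbow method with scikit-learn (the data set `X` here is random placeholder data; `inertia_` is scikit-learn's name for the cost, the sum of squared distances to the nearest centroid):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data set

ks = range(1, 10)
costs = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)  # cost: sum of squared distances to nearest centroid

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K")
plt.ylabel("cost J")
plt.show()  # choose the K where the curve bends and starts to flatten
```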

2. How should the initial K centroids be chosen?

A: The most common method is random selection. Since the choice of initial centroids affects the final clustering result, the algorithm should be executed several times, and whichever result is more reasonable is the one to use. There are also optimization methods. The first is to choose points as far away from each other as possible (see the sketch below): select the first point, then select as the second point the one farthest from the first, then select as the third point the one whose minimum distance to the first two points is largest, and so on. The second is to obtain a clustering from another algorithm (such as hierarchical clustering) and choose one point from each of its clusters.
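A minimal sketch of the farthest-point heuristic just described (my own illustration; the randomized variant of this idea is the well-known k-means++ initialization, which scikit-learn uses by default):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centroids that are spread as far apart as possible."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first centroid: a random data point
    while len(centers) < k:
        # For every point, the distance to its nearest already-chosen centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])  # next centroid: the farthest such point
    return np.array(centers)
```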

3. What about outliers?

A: Outliers are data points that lie far from the rest of the data set as a whole, very unusual "extremely large" or "extremely small" values. These outliers should be removed before clustering, since otherwise they have a strong influence on the clustering result. However, outliers are very often valuable for analysis in their own right, and they can be analyzed separately as a class of their own.
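One simple way to screen out such points before clustering is a z-score filter (a sketch; the threshold of 3 standard deviations is a common rule of thumb, not something from the course):

```python
import numpy as np

def drop_outliers(X, z_thresh=3.0):
    """Keep rows whose every feature lies within z_thresh std devs of the mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_thresh).all(axis=1)]
```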

4. Units must be consistent!

A: For example, if X is in meters and Y is also in meters, then the computed distance is in meters as well, which makes sense. But if X is in meters and Y is in tons, the distance formula ends up adding "square meters" to "squared tons" and then taking a square root; the final number has no mathematical meaning, which is a problem.

5. Standardization

A: If the values of X are all relatively small, say between 1 and 10, while the values of Y are large, say over 1000, then Y dominates the distance computation and the influence of X on the distance is almost negligible, which is also a problem. Therefore, if the Euclidean distance is chosen for K-Means clustering and the data set looks like the above, the data must be normalized first, i.e., scaled so that every feature falls within the same small interval.
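For example, with scikit-learn (a sketch: `StandardScaler` rescales each feature to zero mean and unit variance, so X and Y contribute comparably to the distance):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1000.0], [5, 3000.0], [9, 2000.0]])  # X small, Y large
X_scaled = StandardScaler().fit_transform(X)  # both columns now comparable
```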
References:
K-Means clustering algorithm
Andrew Ng's machine learning course (videos with English subtitles)
