Machine Learning Algorithms Derived in Detail: Principles and Implementation (6): The k-means Algorithm

The previous chapters introduced supervised learning; this section introduces unsupervised learning, specifically the k-means clustering algorithm.

Clustering Algorithm

When discussing supervised learning, we usually draw a picture like this:

We then use logistic regression or an SVM to separate the data into positive and negative classes. This process is called supervised learning, because each training sample is given its correct class label.

Unsupervised learning often faces a different kind of problem. Suppose we are given a data set consisting of a number of points:

None of the points is given a class label, so there are no labeled training samples as in supervised learning; the algorithm itself has to discover the structure of the data. In the figure above we can clearly see that the data falls into two clusters, so a suitable unsupervised learning algorithm here is a clustering algorithm, which groups the data into several different classes.

Clustering has many application scenarios; to name a few of the most common:

  1. In biology, one often needs to cluster different things. Suppose there is a large amount of gene data; clustering it helps us understand which genes correspond to which biological functions
  2. In market research, suppose there is a database holding the behavior of different customers. Clustering this data divides the market into several segments, so that an appropriate marketing strategy can be designed for each segment
  3. In image applications, a picture can be partitioned into several coherent subsets of pixels, as a step toward understanding the content of the photo
  4. and many more...

The basic idea of clustering is: given a data set, group it into several classes whose members share similar attributes.

k-means clustering

The k-means clustering algorithm is used to find the classes in a data set. Its input is a set of unlabeled data \(x^{(1)}, x^{(2)}, ..., x^{(m)}\); because this is an unsupervised learning algorithm, the data set contains only \(x\), with no class label \(y\). k-means clusters the samples into \(k\) clusters. The concrete steps of the algorithm are as follows:

step 1: randomly select \(k\) cluster centroids (cluster centroids), which correspond to \(k\) clusters \(c^{(1)}, ..., c^{(k)}\):

\[ \mu_1, \mu_2, ..., \mu_k \in \mathbb{R}^n \]

step 2: for each \(x^{(i)}\), compute its distance to every centroid \(\mu_j\); \(x^{(i)}\) belongs to the cluster \(c^{(j)}\) of the centroid \(\mu_j\) nearest to it:

\[ c^{(i)}:=\arg\min_j||x^{(i)}-\mu_j||^2,\quad j \in 1,2,...,k \]
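As a concrete illustration, here is a minimal NumPy sketch of this assignment step; the toy data and the current centroids are made-up values for the example:

import numpy as np

# Made-up toy data: m = 4 points in R^2 and k = 2 current centroids.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
mu = np.array([[1.0, 1.0], [9.0, 9.0]])

# Squared distance of every x^(i) to every centroid mu_j: shape (m, k).
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
# c[i] = argmin_j ||x^(i) - mu_j||^2, the index of the nearest centroid.
c = dists.argmin(axis=1)
print(c)  # [0 0 1 1]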

step 3: for each cluster \(c^{(j)}\), recompute the value of its centroid:

\[ \mu_j:=\frac{\sum^m_{i=1}1\{c^{(i)}=j\}x^{(i)}}{\sum^m_{i=1}1\{c^{(i)}=j\}} \]
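Continuing the same made-up toy example, this update is just a per-cluster mean, matching the indicator-function formula above:

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
c = np.array([0, 0, 1, 1])  # assignments produced by step 2
k = 2

# mu_j = mean of the points with c^(i) = j, i.e. the indicator-weighted
# average above (empty clusters are not handled in this sketch).
mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
print(mu)  # [[1.25 1.5 ] [8.5  8.5 ]]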

Repeat step 2 and step 3 until the algorithm converges. The figures below walk through these steps. Suppose the data points are as follows:

Suppose we take \(k = 2\). Two points are then randomly selected from the data set as centroids, namely the red point \(\mu_1\) and the blue point \(\mu_2\) in the figure:

For each \(x^{(i)}\), compute its distance to the centroids \(\mu_1\) and \(\mu_2\); \(x^{(i)}\) is assigned to the cluster of whichever \(\mu_j\) is closer. That is, points closer to the red \(\mu_1\) belong to \(c^{(1)}\), and points closer to the blue \(\mu_2\) belong to \(c^{(2)}\). After the first assignment of every \(x^{(i)}\), the result looks like this:

The next step is to update the centroid of each cluster \(c^{(j)}\): compute the mean of all red points to obtain the new centroid \(\mu_{1\_new}\), and the mean of all blue points to obtain the new centroid \(\mu_{2\_new}\), as shown below:

Then repeat: recompute the distance from each \(x^{(i)}\) to the centroids and update the centroid values. After several iterations the algorithm converges; even with more iterations, the cluster of each \(x^{(i)}\) and the centroid values no longer change:
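The whole procedure just illustrated can be written as a short NumPy sketch. This is a minimal illustration following the conventions of the snippets above, not a production implementation (empty clusters, for instance, are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: alternate steps 2 and 3 until
    the assignments stop changing."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as the initial centroids
    mu = X[rng.choice(len(X), size=k, replace=False)]
    c = np.full(len(X), -1)
    for _ in range(n_iters):
        # step 2: assign each point to its nearest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_c = dists.argmin(axis=1)
        if np.array_equal(new_c, c):  # converged: assignments unchanged
            break
        c = new_c
        # step 3: move each centroid to the mean of its cluster
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    return c, mu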

This raises the question: how can we guarantee that k-means converges? The algorithm above was stated to run until convergence; to show that the algorithm really does converge, we introduce the distortion function:

\[ J(c,\mu)=\sum^m_{i=1}||x^{(i)}-\mu_{c^{(i)}}||^2 \]

\(J(c,\mu)\) is the sum of the squared distances from each sample point \(x^{(i)}\) to its assigned centroid. While \(J(c,\mu)\) has not yet reached its minimum, we can fix the cluster assignments \(c^{(j)}\) and update the value of each cluster centroid \(\mu_j\); once the centroids have changed, we fix the centroids \(\mu_j\) and reassign the clusters \(c^{(j)}\), and keep iterating. When \(J(c,\mu)\) reaches its minimum, \(\mu_j\) and \(c^{(j)}\) have also converged. (This procedure is very similar to the SMO optimization process in Machine Learning Algorithms Derived in Detail: Principles and Implementation (5): Support Vector Machines (Part 2): fix one group of values and update the other group, thereby driving the function toward an extremum. This technique is called coordinate descent (coordinate ascent when maximizing) and is not derived here.) In fact there may be several different pairs of \(\mu\) and \(c\) that give \(J(c,\mu)\) the same minimum value, but this is rare.
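For reference, the distortion is a one-liner in NumPy; the toy values below reuse the made-up update-step example from earlier:

import numpy as np

def distortion(X, c, mu):
    """J(c, mu): sum of squared distances from each x^(i) to its assigned centroid."""
    return ((X - mu[c]) ** 2).sum()

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
c = np.array([0, 0, 1, 1])
mu = np.array([[1.25, 1.5], [8.5, 8.5]])
print(distortion(X, c, mu))  # 1.625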

Since the distortion function \(J(c,\mu)\) is non-convex, there is no guarantee that the minimum reached is the global minimum; in other words, k-means is sensitive to the random initial positions of the centroids. A local optimum is usually good enough for classification, but if this is a concern, you can run the k-means algorithm several times and keep the \(\mu\) and \(c\) with the smallest \(J(c,\mu)\).
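scikit-learn's KMeans (used in the example later in this article) does this automatically: its n_init parameter controls how many random initializations are run, and the distortion of the best run is exposed as inertia_. A quick sketch with made-up data:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])

# n_init random initializations are run and the fit with the smallest
# distortion J(c, mu) is kept; that value is available as inertia_.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.inertia_)  # the minimal J(c, mu) over the 10 runs
print(model.labels_)   # the corresponding cluster assignments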

Determining the value of k

Because this is an unsupervised learning algorithm, many people do not know how to decide the number of classes (clusters) the data needs; the value of k has to be set by ourselves. Two methods for determining the value of k are given here.

The first method:

Observation. In the example of this article, looking at the plot of the data set, one can clearly see that it divides into 2 classes (clusters):

Not all data is like the above, where one can see at a glance that it splits into 2 classes (clusters), so the second method, introduced next, is more commonly used.

The second method:

The silhouette coefficient (Silhouette Coefficient) is a method for evaluating how good a clustering is. It was first proposed by Peter J. Rousseeuw in 1986. It combines the two factors of cohesion and separation, and can be used to evaluate, on the same raw data, the impact that different algorithms, or different ways of running the same algorithm, have on the clustering results.

After the k-means algorithm steps above have been carried out, suppose \(x^{(i)}\) belongs to cluster \(c^{(i)}\). Two values then need to be computed:

  1. \(a(i)\): the average distance from \(x^{(i)}\) to the other points in the same cluster, representing the degree of dissimilarity between \(x^{(i)}\) and the other points of its own cluster
  2. \(b(i)\): take the cluster \(C\) nearest to \(x^{(i)}\) among the other clusters, and compute the average distance from \(x^{(i)}\) to all points within that cluster \(C\), representing the degree of dissimilarity between \(x^{(i)}\) and its nearest neighboring cluster \(C\)

The silhouette coefficient is then computed as:

\[ S(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))} \]

The silhouette coefficient ranges over \([-1, 1]\); the closer it is to 1, the better both the cohesion and the separation.
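To make the definitions of \(a(i)\) and \(b(i)\) concrete, here is a direct, unoptimized sketch of the per-sample silhouette (each cluster is assumed to contain at least two points; sklearn's metrics.silhouette_score, used later, averages this quantity over all samples):

import numpy as np

def silhouette_point(X, c, i):
    """s(i) computed directly from the definitions of a(i) and b(i) above."""
    # a(i): mean distance from x^(i) to the other points of its own cluster
    own = [j for j in range(len(X)) if c[j] == c[i] and j != i]
    a = np.mean([np.linalg.norm(X[i] - X[j]) for j in own])
    # b(i): smallest mean distance from x^(i) to the points of any other cluster
    b = min(
        np.mean([np.linalg.norm(X[i] - X[j]) for j in range(len(X)) if c[j] == other])
        for other in set(c) - {c[i]}
    )
    return (b - a) / max(a, b)

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
c = np.array([0, 0, 1, 1])
print(silhouette_point(X, c, 0))  # ≈ 0.89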

So at the start of the k-means algorithm, one can first set a range of k values, \(k \in [2, n]\), compute the silhouette coefficient for each value of k, and take the k with the largest silhouette coefficient as the optimal number of clusters.

Examples

Suppose there is a data set like the following:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

# figsize sets the width and height of the figure (in inches)
plt.figure(figsize=(8, 10))
x1 = np.array([1, 2, 3, 1, 5, 6, 5, 5, 6, 7, 8, 9, 7, 9])
x2 = np.array([1, 3, 2, 2, 8, 6, 7, 6, 7, 1, 2, 1, 1, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
# print(X)

# plotting range of the x and y axes
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('sample')
plt.scatter(x1, x2)
# color of the points in each cluster
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'b']
# marker shape of the points in each cluster
markers = ['o', 's', 'D', 'v', '^', 'p', '*', '+']
plt.show()

Although observation alone tells us that \(k = 3\) would do for this data set, we still use the silhouette coefficient to search for the best value of k. Suppose we do not know k, and try \(k \in [2, 3, 4, 5, 8]\):

# k values to test
tests = [2, 3, 4, 5, 8]
plt.figure(figsize=(8, 10))
for subplot_counter, t in enumerate(tests, start=1):
    plt.subplot(3, 2, subplot_counter)
    kmeans_model = KMeans(n_clusters=t).fit(X)
    # kmeans_model.labels_ holds the cluster index assigned to each point
    for i, l in enumerate(kmeans_model.labels_):
        plt.plot(x1[i], x2[i], color=colors[l], marker=markers[l], ls='None')
    plt.xlim([0, 10])
    plt.ylim([0, 10])
    plt.title('K = %s, Silhouette Coefficient = %.03f'
              % (t, metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean')))
plt.show()

The results are as follows:

As can be seen, the silhouette coefficient reaches its maximum, 0.722, at \(k = 3\).

Origin www.cnblogs.com/TTyb/p/12283208.html