Unsupervised learning - K-means clustering algorithm study notes

Unsupervised learning: the class labels are unknown (the training data is unlabeled).

1. K-means algorithm:

1. K-means is a classic clustering algorithm and one of the top ten classic algorithms in data mining.

2. Parameter k

The parameter k is given in advance; the n input data objects are then divided into k clusters such that objects within the same cluster have high similarity, while objects in different clusters have low similarity.
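Formally (this objective is implied rather than stated in these notes): k-means looks for clusters C_1, …, C_k with centers c_1, …, c_k that minimize the within-cluster sum of squared distances

    J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2

so "high similarity within a cluster" corresponds to a small contribution to J.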

3. Algorithm idea: take k points in the space as initial cluster centers and assign each object to the nearest center. Then, by an iterative method, update the value of each cluster center in turn until a good clustering result is obtained.

4. Algorithm description:

(1) Arbitrarily and appropriately select initial centers for the k classes;
(2) In each iteration, for every sample, compute its distance to each of the k centers and assign the sample to the class whose center is nearest;
(3) Update each class center, e.g. as the mean of the samples assigned to it;
(4) If none of the k cluster centers changes after an update via steps (2) and (3), the iteration ends; otherwise, continue iterating.

5. Algorithm process

    Input: the number of classes k, data data[n];
          (1) Select k initial center points, e.g. c[0]=data[0], …, c[k-1]=data[k-1];
          (2) For data[0]…data[n-1], compare each point with c[0]…c[k-1]; if the distance to c[i] is smallest, mark the point with label i;
          (3) For all points marked i, recalculate c[i] = (sum of all data[j] marked i) / (number of points marked i);
          (4) Repeat (2) and (3) until the change in every c[i] is less than a given threshold.
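The process above can be condensed into a few lines of NumPy (a minimal sketch for illustration; the names kmeans_sketch, tol, and maxIt are chosen here, not from the notes, and it assumes no cluster ever becomes empty; a fuller, step-by-step implementation follows in section 3):

import numpy as np

def kmeans_sketch(data, k, tol=1e-4, maxIt=100):
    # (1) take the first k points as the initial centers: c[0]=data[0], ..., c[k-1]=data[k-1]
    c = data[:k].astype(float).copy()
    for _ in range(maxIt):
        # (2) label each point with the index of its nearest center
        labels = np.linalg.norm(data[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)
        # (3) recompute each center as the mean of the points labeled i
        newC = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        # (4) stop once no center moves more than the threshold
        if np.linalg.norm(newC - c, axis=1).max() < tol:
            c = newC
            break
        c = newC
    return labels, c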

6. Advantages: fast and simple.
    Disadvantages: the final result depends on the choice of the initial points, the algorithm easily falls into a local optimum, and the value of k must be given in advance (a common mitigation is sketched below).
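Because the result depends on the initial points, a common remedy (not covered in these notes) is to restart the algorithm several times from different random initializations and keep the run with the smallest within-cluster sum of squared distances. A sketch reusing kmeans_sketch from above, where data is any (n, d) array of points:

rng = np.random.default_rng(0)
bestSse, bestCenters = np.inf, None
for _ in range(10): # 10 restarts
    shuffled = data[rng.permutation(len(data))] # shuffling changes the "first k points" initialization
    labels, centers = kmeans_sketch(shuffled, k=2)
    sse = ((shuffled - centers[labels]) ** 2).sum() # within-cluster sum of squared distances
    if sse < bestSse:
        bestSse, bestCenters = sse, centers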

2. Examples


[Figure: the four data points (1, 1), (2, 1), (4, 3), (5, 4), with two initial centers marked as five-pointed stars]

Group the above four points into two classes:

In the figure, the blue dots are the data points and the five-pointed stars are the randomly selected center points. The distances from the four points to c1 = (1, 1) are 0, 1, 3.61, and 5; the distances to c2 = (2, 1) are 1, 0, 2.83, and 4.24.


So the first point forms one class and points 2, 3, and 4 form the second class; then the center points are recomputed.

New center points: c1 → (1, 1); c2 → (11/3, 8/3). The updated figure:

[Figure: the points with the recomputed centers after the first iteration]

With the new centers, point (2, 1) is closer to c1 (distance 1) than to c2 (distance ≈ 2.36), so it moves into the first class; recomputing once more gives c1 = (1.5, 1) and c2 = (4.5, 3.5), after which the assignment no longer changes.

The classification is complete and the iteration stops. (Stopping conditions: the classification no longer changes, or the change is smaller than a given value, or the specified number of iterations is reached.)
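These distances and the new center are easy to check (the four points and the two initial centers are taken straight from the example):

import numpy as np

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
c1, c2 = np.array([1.0, 1.0]), np.array([2.0, 1.0])
print(np.linalg.norm(points - c1, axis=1)) # distances to c1: 0, 1, 3.61, 5 (rounded)
print(np.linalg.norm(points - c2, axis=1)) # distances to c2: 1, 0, 2.83, 4.24 (rounded)
print(points[1:].mean(axis=0))             # new c2 = (11/3, 8/3) ≈ (3.67, 2.67)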

3. Python implementation

import numpy as np


def kmeans(x, k, maxIt): # maxIt is the maximum number of iterations

    numPoints, numDim = x.shape # number of points and number of dimensions

    dataSet = np.zeros((numPoints, numDim + 1)) # add one extra column to hold the cluster label
    dataSet[:, :-1] = x # same as x except for the last (label) column

    # select k distinct rows as the initial center points
    # (replace=False avoids picking the same point twice, which would leave a cluster empty)
    centroids = dataSet[np.random.choice(numPoints, size=k, replace=False), :]
    centroids[:, -1] = range(1, k + 1) # give the centers the labels 1..k

    iterations = 0 # iteration counter
    oldCentroids = None # centroids from the previous iteration

    while not shouldStop(oldCentroids, centroids, iterations, maxIt):
        print("iteration: ", iterations)
        print("dataSet: \n", dataSet)
        print("centroids: \n", centroids)
        oldCentroids = np.copy(centroids)
        iterations += 1

        updateLabels(dataSet, centroids) # reassign each point's label to its nearest center

        centroids = getCentroids(dataSet, k) # update the center point

    return dataSet


def shouldStop(oldCentroids, centroids, iterations, maxIt):
    if iterations >= maxIt: # stop once the preset maximum number of iterations is reached
        return True
    return np.array_equal(oldCentroids, centroids) # stop when the center points no longer change


def updateLabels(dataSet, centroids):

    numPoints, numDim = dataSet.shape
    for i in range(0, numPoints):       # for each point
        dataSet[i, -1] = getLabelFromClosestCentroid(dataSet[i,:-1], centroids) # Compare the distance and return the label of the nearest center point


def getLabelFromClosestCentroid(dataSetRow, centroids):

    label = centroids[0, -1]
    minDist = np.linalg.norm(dataSetRow - centroids[0,:-1]) # Returns the distance between two vectors
    for i in range(1, centroids.shape[0]):
        dist = np.linalg.norm(dataSetRow - centroids[i,:-1])
        if dist < minDist:
            minDist = dist
            label = centroids[i, -1]
        print("-"*10)
    print("minDistance: ", minDist)
    return label


def getCentroids(dataSet, k):
    result = np.zeros((k, dataSet.shape[1])) # one row per center; dataSet.shape[1] is the number of columns
    for i in range(1, k+1): # Find all points with the same label and find the mean
        oneCluster = dataSet[dataSet[:, -1] == i, :-1] # rows whose label (last column) equals i, without the label column
        result[i - 1, :-1] = np.mean(oneCluster, axis=0) # column-wise mean (axis=0) of the cluster's points
        result[i - 1, -1] = i # assign labels

    return result


x1 = np.array([1, 1])
x2 = np.array([2, 1])
x3 = np.array([4, 3])
x4 = np.array([5, 4])

testX = np.vstack((x1, x2, x3, x4)) # stack four points into a matrix

result = kmeans(testX, 2, 10)

print("final result:\n", result)

