Classic machine learning clustering algorithms: k-means (with Python implementation code and data set)

Working principle

Clustering is a form of unsupervised learning that groups similar objects into the same cluster. It resembles automatic classification, where "automatic" means that even the categories themselves are constructed automatically. The k-means algorithm finds k distinct clusters, and the center of each cluster is computed as the mean of the values it contains. Its workflow in pseudocode is as follows:

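Create k points as the starting centroids
While the cluster assignment of any point has changed
    For each data point in the data set
        For each centroid
            Compute the distance between the centroid and the data point
        Assign the data point to the cluster with the nearest centroid
    For each cluster, compute the mean of all its points and use that mean as the centroid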
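
To make the pseudocode concrete, here is a minimal sketch that follows it line by line on four made-up points. The data and the hand-picked starting centroids are assumptions chosen for illustration; they are not taken from the article's data set.

```python
import numpy as np

# Toy data (made up for illustration): two tight groups on a line.
X = np.array([[1.0, 0.0], [2.0, 0.0], [9.0, 0.0], [10.0, 0.0]])
# Deliberately poor starting centroids, picked by hand rather than at random.
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])

labels = np.full(len(X), -1)   # -1 means "not assigned yet"
iterations = 0
changed = True
while changed:
    changed = False
    # Assignment step: give each point to its nearest centroid.
    for i, point in enumerate(X):
        dists = [np.linalg.norm(point - c) for c in centroids]
        nearest = int(np.argmin(dists))
        if labels[i] != nearest:
            labels[i] = nearest
            changed = True
    # Update step: move each centroid to the mean of its points.
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:
            centroids[j] = members.mean(axis=0)
    iterations += 1

print(labels)      # [0 0 1 1]
print(centroids)   # rows are [1.5, 0.0] and [9.5, 0.0]
print(iterations)  # 3
```

The first pass puts only the point at x=1 in cluster 0, the second pass moves x=2 over as well, and the third pass changes nothing, so the loop stops.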
    

Python implementation

First, two distance functions; the Euclidean distance is the one generally used.

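import numpy as np
# Euclidean distance between two vectors (a method of the kMeans class below)
def distEclud(self, vecA, vecB):
    return np.linalg.norm(vecA - vecB)
# Manhattan distance between two vectors
def distManh(self, vecA, vecB):
    return np.linalg.norm(vecA - vecB, ord=1)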
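
As a quick check of what the two metrics return, the sketch below evaluates both on a 3-4-5 triangle. The two vectors are made up for illustration.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = np.linalg.norm(a - b)          # sqrt(3**2 + 4**2)
manhattan = np.linalg.norm(a - b, ord=1)   # |3| + |4|

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

Note that both calls assume 1-D vectors. For a 2-D array, np.linalg.norm with ord=1 computes a matrix norm (the maximum absolute column sum) instead of a Manhattan distance.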

Next is randCent(), which builds a set of k random centroids for the given data set.

def randCent(self, X, k):
    n = X.shape[1]  # number of features, i.e. the number of columns in the data set
    centroids = np.empty((k, n))  # k x n matrix holding the centroid of each cluster
    for j in range(n):  # generate the centroids one dimension at a time
        minJ = min(X[:, j])
        rangeJ = float(max(X[:, j]) - minJ)
        centroids[:, j] = (minJ + rangeJ * np.random.rand(k, 1)).flatten()
    return centroids
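
The same idea can be written without the per-column loop. The following is a sketch, not the article's code: rand_centroids is a hypothetical standalone function, the data is synthetic, and it uses NumPy's Generator API so the run can be seeded.

```python
import numpy as np

def rand_centroids(X, k, rng):
    """Draw k centroids uniformly inside the per-feature bounds of X."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return lo + (hi - lo) * rng.random((k, X.shape[1]))

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
X = rng.normal(size=(100, 2))       # synthetic data, for illustration only
cents = rand_centroids(X, 3, rng)

inside = bool(np.all(cents >= X.min(axis=0)) and np.all(cents <= X.max(axis=0)))
print(cents.shape)  # (3, 2)
print(inside)       # True
```

Every centroid lies inside the bounding box of the data, but nothing guarantees that it lies near any actual point, so a randomly placed centroid can end up owning no points at all.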

The kMeans and biKMeans implementations are modeled on the KMeans implementation in scikit-learn and are wrapped in classes. The kMeans constructor takes the following parameters:

  • n_clusters - the number of clusters, i.e. k
  • initCent - how the initial centroids are generated: 'random' means they are generated randomly; an array of centroids can also be passed in
  • max_iter - the maximum number of iterations

class kMeans(object):
    def __init__(self, n_clusters=10, initCent='random', max_iter=300):
        if hasattr(initCent, '__array__'):
            n_clusters = initCent.shape[0]
            # np.float was removed in NumPy 1.24; the builtin float is equivalent
            self.centroids = np.asarray(initCent, dtype=float)
        else:
            self.centroids = None
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.initCent = initCent
        self.clusterAssment = None
        self.labels = None
        self.sse = None
    # Euclidean distance between two vectors
    def distEclud(self, vecA, vecB):
        return np.linalg.norm(vecA - vecB)

    # Manhattan distance between two points
    def distManh(self, vecA, vecB):
        return np.linalg.norm(vecA - vecB, ord=1)

    # Build a set of k random centroids for the given data set
    def randCent(self, X, k):
        n = X.shape[1]  # number of features, i.e. the number of columns in the data set
        centroids = np.empty((k, n))  # k x n matrix holding the centroid of each cluster
        for j in range(n):  # generate the centroids one dimension at a time
            minJ = min(X[:, j])
            rangeJ = float(max(X[:, j]) - minJ)
            centroids[:, j] = (minJ + rangeJ * np.random.rand(k, 1)).flatten()
        return centroids

    def fit(self, X):
        # Clustering routine. After it runs, the centroids are in
        # self.centroids and the cluster assignments in self.clusterAssment.
        if not isinstance(X, np.ndarray):
            try:
                X = np.asarray(X)
            except Exception:
                raise TypeError("numpy.ndarray required for X")
        m = X.shape[0]  # number of samples
        # m x 2 matrix: column 0 is the cluster index of each sample,
        # column 1 its squared error (SE) with respect to its centroid.
        # Start at -1 so that every sample counts as changed on the first pass.
        self.clusterAssment = np.full((m, 2), -1.0)
        # Centroids are either supplied by the user or generated randomly
        if isinstance(self.initCent, str) and self.initCent == 'random':
            self.centroids = self.randCent(X, self.n_clusters)
        for _ in range(self.max_iter):  # cap on the number of iterations
            clusterChanged = False
            for i in range(m):  # assign each sample to the cluster of its nearest centroid
                minDist = np.inf
                minIndex = -1
                for j in range(self.n_clusters):  # scan all centroids for the nearest one
                    distJI = self.distEclud(self.centroids[j, :], X[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if self.clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                # Always refresh the SE: centroids move even when labels do not
                self.clusterAssment[i, :] = minIndex, minDist ** 2
            if not clusterChanged:  # no sample changed cluster: converged, stop early
                break
            for i in range(self.n_clusters):  # move each centroid to the mean of its cluster
                ptsInClust = X[np.nonzero(self.clusterAssment[:, 0] == i)[0]]  # all points in cluster i
                if len(ptsInClust) != 0:
                    self.centroids[i, :] = np.mean(ptsInClust, axis=0)

        self.labels = self.clusterAssment[:, 0]
        self.sse = sum(self.clusterAssment[:, 1])  # Sum of Squared Errors (SSE)
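
The two nested loops over samples and centroids in fit() are easy to read but slow in pure Python. As a sketch of one alternative (not part of the article's code), the assignment step can be computed with a single broadcasted distance matrix. The function names and the random test data below are assumptions for illustration.

```python
import numpy as np

def assign_loop(X, centroids):
    """Nearest-centroid labels using explicit loops, in the style of fit()."""
    labels = np.empty(len(X), dtype=int)
    for i in range(len(X)):
        dists = [np.linalg.norm(c - X[i]) for c in centroids]
        labels[i] = int(np.argmin(dists))
    return labels

def assign_vectorized(X, centroids):
    """Labels and squared errors from one (m, k) distance matrix."""
    # diff has shape (m, k, n): every point minus every centroid.
    diff = X[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diff, axis=2)   # shape (m, k)
    return dists.argmin(axis=1), dists.min(axis=1) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # synthetic samples
centroids = rng.normal(size=(4, 2))    # synthetic centroids

labels, sq_err = assign_vectorized(X, centroids)
same = bool(np.array_equal(labels, assign_loop(X, centroids)))
print(same)  # True
```

The two return values correspond to the two columns of clusterAssment. The trade-off is memory: the intermediate array holds m x k x n floats.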

A drawback of kMeans is that it may converge to a local minimum. SSE (Sum of Squared Errors) is used to measure clustering quality: the smaller the SSE, the closer the data points are to their centroids and the better the clustering.
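
A small hand-made example shows what a local minimum looks like. The sketch below takes the four corners of a 4 x 1 rectangle (made-up data) and compares two partitions into two clusters. The helper names sse and is_fixed_point are hypothetical.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def is_fixed_point(X, labels, centroids):
    """True if every point is already assigned to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return bool(np.array_equal(d.argmin(axis=1), labels))

# Corners of a 4 x 1 rectangle.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

partitions = {
    "A": np.array([0, 0, 1, 1]),   # left pair vs right pair
    "B": np.array([0, 1, 0, 1]),   # bottom pair vs top pair
}

results = {}
for name, labels in partitions.items():
    # Centroids are the cluster means, as in the k-means update step.
    cents = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    results[name] = (sse(X, labels, cents), is_fixed_point(X, labels, cents))
    print(name, *results[name])
# A 1.0 True
# B 16.0 True
```

Both partitions are stable under k-means, since no point would switch cluster and no centroid would move, yet B has sixteen times the error of A. Which one a run ends in depends only on the starting centroids.
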
To overcome the tendency of kMeans to converge to a local minimum, an algorithm called bisecting k-means was proposed. Its pseudocode is as follows:

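Treat all points as a single cluster
While the number of clusters is less than k
    For each cluster
        Compute the total error
        Run k-means clustering (k=2) on the given cluster
        Compute the total error after splitting that cluster in two
    Choose the cluster whose split gives the lowest error and perform the split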

The Python code is as follows:

class biKMeans(object):
    def __init__(self, n_clusters=5):
        self.n_clusters = n_clusters
        self.centroids = None
        self.clusterAssment = None
        self.labels = None
        self.sse = None
    # Euclidean distance between two points
    def distEclud(self, vecA, vecB):
        return np.linalg.norm(vecA - vecB)

    # Manhattan distance between two points
    def distManh(self, vecA, vecB):
        return np.linalg.norm(vecA - vecB, ord=1)

    def fit(self, X):
        m = X.shape[0]
        self.clusterAssment = np.zeros((m, 2))
        if(len(X) != 0):
            centroid0 = np.mean(X, axis=0).tolist()
        centList = [centroid0]
        for j in range(m):  # initial SE between each sample and the single starting centroid
            self.clusterAssment[j, 1] = self.distEclud(np.asarray(centroid0), X[j, :]) ** 2

        while (len(centList) < self.n_clusters):
            lowestSSE = np.inf
            for i in range(len(centList)):  # try splitting every cluster and keep the split with the lowest total error
                ptsInCurrCluster = X[np.nonzero(self.clusterAssment[:, 0] == i)[0], :]
                clf = kMeans(n_clusters=2)
                clf.fit(ptsInCurrCluster)
                centroidMat, splitClustAss = clf.centroids, clf.clusterAssment  # centroids and assignment/error matrix after the split
                sseSplit = sum(splitClustAss[:, 1])
                sseNotSplit = sum(self.clusterAssment[np.nonzero(self.clusterAssment[:, 0] != i)[0], 1])
                if (sseSplit + sseNotSplit) < lowestSSE:
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            # After the split, one sub-cluster keeps the index of the original cluster
            # and the other gets index len(centList); both centroids go into centList
            bestClustAss[np.nonzero(bestClustAss[:, 0] == 1)[0], 0] = len(centList)
            bestClustAss[np.nonzero(bestClustAss[:, 0] == 0)[0], 0] = bestCentToSplit
            centList[bestCentToSplit] = bestNewCents[0, :].tolist()
            centList.append(bestNewCents[1, :].tolist())
            self.clusterAssment[np.nonzero(self.clusterAssment[:, 0] == bestCentToSplit)[0], :] = bestClustAss
        self.labels = self.clusterAssment[:, 0]
        self.sse = sum(self.clusterAssment[:, 1])
        self.centroids = np.asarray(centList)
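
One detail in fit() that is easy to overlook is the order of the two relabeling statements: sub-cluster 1 is renamed before sub-cluster 0. The sketch below isolates that step on a made-up label array to show why the order matters when the labels are rewritten in place. Both function names are hypothetical.

```python
import numpy as np

def relabel(labels, cluster_to_split, n_existing):
    """Order used in biKMeans.fit(): rename sub-cluster 1 first, then 0."""
    labels[labels == 1] = n_existing        # new cluster gets a fresh index
    labels[labels == 0] = cluster_to_split  # other half keeps the old index
    return labels

def relabel_wrong_order(labels, cluster_to_split, n_existing):
    """Same two statements swapped, for comparison."""
    labels[labels == 0] = cluster_to_split
    labels[labels == 1] = n_existing
    return labels

# 0/1 labels returned by a two-way split of cluster 1, with 3 clusters so far.
split = np.array([0, 1, 1, 0])

good = relabel(split.copy(), cluster_to_split=1, n_existing=3)
bad = relabel_wrong_order(split.copy(), cluster_to_split=1, n_existing=3)
print(good)  # [1 3 3 1]
print(bad)   # [3 3 3 3]
```

In the swapped version the points renamed to 1 are immediately caught by the second mask, so the two halves merge again. Renaming 1 first is safe because the fresh index len(centList) never equals an existing cluster index.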

Run several times, biKMeans reaches a low-SSE clustering much more consistently than the original kMeans() function, which occasionally gets stuck in a local minimum. Convergence to the global minimum is still not strictly guaranteed.

Algorithm in practice

Clustering the MNIST data set

The data set data.pkl was found online. It consists of 1000 images selected from MNIST, reduced to two dimensions with t-SNE.

The code that reads the file is as follows:

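import pickle

# encoding='latin1' lets Python 3 read NumPy arrays that were pickled under Python 2
with open('data.pkl', 'rb') as f:
    dataSet, dataLabel = pickle.load(f, encoding='latin1')
print(type(dataSet))
print(dataSet.shape)
print(dataSet)
print(type(dataLabel))
print(dataLabel.shape)
print(dataLabel)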

The printed output is as follows:

<class 'numpy.ndarray'>
(1000, 2)
[[ -0.48183008 -22.66856528]
 [ 11.5207274   10.62315075]
 [  4.76092787   5.20842437]
 ...
 [ -8.43837464   2.63939773]
 [ 20.28416829   1.93584107]
 [-21.19202119  -4.47293397]]
<class 'numpy.ndarray'>
(1000,)
[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0  9 5 5 6 5 0
 9 8 9 8 4 1 7 7 3 5 1 0 0 2 2 7 8 2 0 1 2 6 3 3 7 3 3 4 6 6 6 ...
 3 7 3 3 4 6 6 6 4 9 1 5 0 9 5 2 8 2 0 0 1 7 6 3 2 1 4 6 3 1 3 9 1 7 6 8 4 3]

Now apply the clustering algorithm written above, run it several times, and save the plot from the run with the lowest SSE.

import pickle
import numpy as np

def main():
    with open('data.pkl', 'rb') as f:
        dataSet, dataLabel = pickle.load(f, encoding='latin1')
    k = 10
    clf = biKMeans(k)
    lowestsse = np.inf
    for i in range(10):  # run 10 times and track the lowest SSE seen so far
        print(i)
        clf.fit(dataSet)
        cents = clf.centroids
        labels = clf.labels
        sse = clf.sse
        # visualization() is a plotting helper that is not shown in this article
        visualization(k, dataSet, dataLabel, cents, labels, sse, lowestsse)
        if sse < lowestsse:
            lowestsse = sse

if __name__ == '__main__':
    main()
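
The "run it several times and keep the best" pattern in main() can be tried without data.pkl. The following is a self-contained sketch under stated assumptions: kmeans_once is a hypothetical compact k-means that starts from k randomly chosen data points (not the article's randCent), and the three synthetic blobs stand in for the real data.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    """One k-means run; returns (centroids, labels, sse)."""
    # Assumption: start from k distinct data points instead of randCent().
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):    # leave the centroid of an empty cluster alone
                centroids[j] = X[labels == j].mean(axis=0)
    sse = float(np.sum((X - centroids[labels]) ** 2))
    return centroids, labels, sse

rng = np.random.default_rng(42)
# Synthetic stand-in for data.pkl: three blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

runs = [kmeans_once(X, 3, rng) for _ in range(10)]
best_cents, best_labels, best_sse = min(runs, key=lambda r: r[2])
print(best_sse)
```

Collecting all runs and taking the minimum at the end keeps the best centroids and labels together with the best SSE, whereas tracking only the lowest SSE value loses the clustering that produced it.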

Summary

Clustering is an unsupervised learning method. Unsupervised learning means that we do not know in advance what we are looking for; that is, there is no target variable. Clustering groups data points into clusters so that similar data points fall in the same cluster and dissimilar data points fall in different clusters. Many different measures of similarity can be used for clustering (this article uses distance metrics).

k-means is a widely used clustering algorithm, where k is the number of clusters to create, specified by the user. The algorithm starts from k random centroids. It computes the distance from each point to every centroid and assigns each point to the cluster of its nearest centroid, then recomputes each cluster's centroid from the points newly assigned to it. This process repeats until the cluster assignments no longer change. The method is easy to implement, but it is sensitive to the initial centroids and may converge to a local optimum rather than the global optimum.

A variant called bisecting k-means can give better clustering results. It starts with all points in a single cluster and splits it using k-means with k = 2. In each following iteration it picks one cluster to split; the implementation in this article picks the cluster whose split gives the lowest total SSE. This process is repeated until k clusters have been created.

Appendix

Code and data set for this article: https://github.com/Professorchen/Machine-Learning/tree/master/kMeans


Origin www.cnblogs.com/multhree/p/11279140.html