Machine Learning: K-means Clustering

The core concept in clustering is similarity, or distance. There are many similarity and distance measures, such as Euclidean distance, Mahalanobis distance, the correlation coefficient, and cosine similarity; common clustering methods include hierarchical clustering and K-means clustering.

1. The idea of K-means clustering

The basic idea of K-means clustering is to find, through an iterative scheme, a partition into K clusters that minimizes a cost function. Specifically, the cost function can be defined as the sum of squared errors between each sample and the center of the cluster it belongs to: \[J(c, \sigma) = \sum_{i=1}^{M} ||x_i - \sigma_{c_i}||^2\]

where \(x_i\) is the i-th sample point, \(c_i\) is the cluster to which \(x_i\) belongs, \(\sigma_{c_i}\) is the center of that cluster, and M is the total number of samples.
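For concreteness, a minimal NumPy sketch of this cost function (the names X, labels, and centers are illustrative and not taken from the code below) could look like this:

import numpy as np

def kmeans_cost(X, labels, centers):
    # X: (M, d) samples; labels: integer array with labels[i] = c_i; centers: (K, d) cluster centers
    diffs = X - centers[labels]         # x_i - sigma_{c_i} for every sample
    return float(np.sum(diffs ** 2))    # sum of squared distances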

Steps:

(1) Randomly choose K initial center points.

(2) Assign each point to the nearest of the K centers according to Euclidean distance, forming K clusters.

(3) Recompute the center of each of the K clusters from the points currently assigned to it.

(4) Repeat steps (2) and (3) until the loss no longer decreases.

2. K-means clustering algorithm

Test the K-means clustering algorithm on the '/datasets/kmeansTestSet.txt' data set.

def loadDataset(file):
    dataset = []
    with open(file,'r') as pf:
        for line in pf:
            dataset.append([float(x) for x in line.strip().split('\t')])
    return dataset
# load the data
dataset = loadDataset('./datasets/kmeansTestSet.txt')
print(len(dataset))
print(dataset[:5])
80
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086]]
import numpy as np
# rescale the data: subtract the global mean and divide by the global variance
def normalize(dataset):
    dataMat = np.mat(dataset)
    mean = np.mean(dataMat)   # mean over all entries
    var = np.var(dataMat)     # variance over all entries
    meanDataMat = (dataMat - mean) / var  # shift and rescale the data
    return meanDataMat
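Note that normalize subtracts the global mean and divides by the global variance of all entries. A more conventional per-feature standardization (shown here only as an alternative sketch; it is not what the results below were produced with) would divide each column by its own standard deviation:

def standardize(dataset):
    dataArr = np.asarray(dataset)
    mean = np.mean(dataArr, axis=0)  # per-column mean
    std = np.std(dataArr, axis=0)    # per-column standard deviation
    return (dataArr - mean) / std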
import matplotlib.pyplot as plt
import pandas as pd

# visualize the data (and, optionally, the cluster centers)
def plotDataset(meanDataMat,pointCenter = False,centerPointerMat=None):
    dataFrame = pd.DataFrame(meanDataMat)  # convert to a DataFrame to inspect summary statistics
    print(dataFrame.describe())
    #plt.axis([0,1,0,1])
    plt.plot(meanDataMat[:,0],meanDataMat[:,1],'r*')  # data points in red
    if pointCenter:
        plt.plot(centerPointerMat[:,0],centerPointerMat[:,1],'b*')  # cluster centers in blue
    plt.show()
meanDataMat = normalize(dataset)
plotDataset(meanDataMat)
               0          1
count  80.000000  80.000000
mean   -0.008614   0.008614
std     0.331392   0.333145
min    -0.584224  -0.459075
25%    -0.306803  -0.325977
50%     0.005867   0.019362
75%     0.290189   0.323504
max     0.530519   0.568950

# Euclidean distance between two row vectors
def distance(vectA,vectB):
    Power = np.power((vectA - vectB),2)  # element-wise squared differences
    Sum = np.sum(Power,axis = 1)         # sum over the feature axis
    return float(np.sqrt(Sum))
import random
def kcluster(meanDataMat,k=4):
    row,col = meanDataMat.shape
    print(row,col)
    featureRange = []
    for i in range(col):
        Min = np.min(meanDataMat[:,i])
        Max = np.max(meanDataMat[:,i])
        featureRange.append((Min,Max))
    centerPoints = []  # cluster center points
    classPoints = []   # list of points per cluster (kept from the original structure, not used below)
    classLabels = np.mat(np.zeros((row,2)))  # per sample: [assigned cluster index, distance to its center]
    for i in range(k):
        centerPoints.append([random.uniform(r[0],r[1]) for r in featureRange])
        classPoints.append([])
    centerPointsMat = np.mat(centerPoints)
    clusterChanged = True
    while(clusterChanged):
        clusterChanged = False
        for i in range(row):
            minDis = np.inf
            bestK = -1
            for j in range(k):
                dis = distance(meanDataMat[i,:],centerPointsMat[j,:])
                if dis < minDis:
                    minDis = dis
                    bestK = j
            if classLabels[i,0] != bestK:
                clusterChanged = True
            classLabels[i,:] = bestK,minDis
        for center in range(k):
            ptsInClust = meanDataMat[np.nonzero(classLabels[:,0] == center)[0]]  # select all points currently assigned to this cluster
            centerPointsMat[center,:] = np.mean(ptsInClust,axis=0)
    print('Clustering finished')
    return centerPointsMat,classLabels
centerPointsMat,classLabels = kcluster(meanDataMat)
80 2
Clustering finished
plotDataset(meanDataMat,pointCenter=True,centerPointerMat = centerPointsMat)
               0          1
count  80.000000  80.000000
mean   -0.008614   0.008614
std     0.331392   0.333145
min    -0.584224  -0.459075
25%    -0.306803  -0.325977
50%     0.005867   0.019362
75%     0.290189   0.323504
max     0.530519   0.568950

3. Shortcomings of the K-means algorithm

(1) The number of clusters K must be chosen manually in advance, and the chosen value does not necessarily match the actual data distribution.

(2) K-means converges only to a local optimum, so the result is affected by the choice of initial centers.

(3) It is susceptible to noise and outliers.

(4) Each sample point can only be assigned to a single cluster (hard assignment).

4. Improved variants of the K-means algorithm

4.1 K-means++ algorithm

Improving the selection of the initial values is an important class of K-means improvements, and the most influential algorithm of this type is undoubtedly K-means++. The original K-means algorithm chooses the K initial centers at random, whereas K-means++ selects the K cluster centers according to the following idea:

Assume that n initial cluster centers (0 < n < K) have already been chosen. When selecting the (n+1)-th cluster center, points that are farther from the current n cluster centers have a higher probability of being chosen as the (n+1)-th cluster center; in other words, each new cluster center should be as far as possible from the centers already chosen. After the initial centers are selected, K-means++ proceeds in exactly the same way as the classical K-means algorithm.
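A minimal sketch of K-means++ initialization along these lines might look as follows (this is illustrative code, not part of the implementation above; X is assumed to be an (M, d) NumPy array):

import numpy as np

def kmeanspp_init(X, k, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    centers = [X[rng.integers(len(X))]]  # the first center is chosen uniformly at random
    for _ in range(1, k):
        # squared distance from every point to its nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()            # farther points get a higher selection probability
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)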

4.2 ISODATA algorithm

In the K-means algorithm the number of clusters K must be fixed in advance and cannot change during the run, but with high-dimensional, massive data sets it is often difficult to estimate K accurately. ISODATA, the Iterative Self-Organizing Data Analysis Technique, was developed to address this problem. Its idea is intuitive: it splits and merges clusters through an iterative procedure, which requires specifying three thresholds (a simplified sketch of these checks follows the list below):

(1) The minimum number of samples required in each cluster, \(n_{min}\): if splitting a cluster would produce a sub-cluster containing fewer samples than this threshold, the split is not performed.

(2) The maximum variance \(\Sigma\), which controls the dispersion of the samples within a cluster: when the dispersion exceeds this threshold, the cluster is split.

(3) The minimum distance allowed between two cluster centers, \(D_{min}\): if two clusters are very close (i.e., the distance between their centers is smaller than this threshold), the two clusters are merged.
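As a rough illustration of how these thresholds could drive the split and merge decisions, here is a simplified sketch (the function and parameter names are illustrative, and this is far from a complete ISODATA implementation):

import numpy as np

def should_split(cluster_points, n_min, sigma_max):
    # Split a cluster only if it is dispersed enough and both halves could still
    # satisfy the minimum-size requirement n_min.
    per_feature_var = np.var(cluster_points, axis=0)
    return per_feature_var.max() > sigma_max and len(cluster_points) >= 2 * n_min

def should_merge(center_a, center_b, d_min):
    # Merge two clusters whose centers are closer than the minimum allowed distance d_min.
    return np.linalg.norm(center_a - center_b) < d_min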
