[Clustering] K-modes and K-prototypes - clustering methods suitable for discrete data

Application scenarios:

Suppose we have a batch of data. Each sample has attributes such as a unique identifier (id), a category (cate_id), and an audience (general users, children, elderly, middle-aged, etc.). We want to select some samples such that they cover the widest possible range of categories and audiences.

Analysis:

The idea is to use clustering and select one sample from each cluster. The observed data are all categorical features. The commonly used k-means method uses the Euclidean distance between two samples to decide which cluster a sample belongs to. For categorical features, however, even if the values are encoded as 0, 1, 2, these numbers have no numerical meaning; they merely label a particular attribute value. Therefore we cannot use Euclidean distance to divide the samples into clusters.
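A minimal sketch of why Euclidean distance is misleading here, using a hypothetical integer encoding of a categorical attribute:

```python
import math

# Hypothetical encoding of an "audience" attribute:
# 0 = children, 1 = middle-aged, 2 = elderly
a, b, c = 0, 1, 2

# Euclidean distance treats the codes as magnitudes:
d_ab = math.dist([a], [b])  # 1.0
d_ac = math.dist([a], [c])  # 2.0

# It claims "children" is twice as far from "elderly" as from
# "middle-aged", but the codes are only labels: the only meaningful
# comparison between categorical values is equal vs. not equal.
print(d_ab, d_ac)  # 1.0 2.0
```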

After some research, two clustering methods turned out to fit this case: K-modes and K-prototypes. Both are introduced below.

K-modes

Suitable for discrete data; uses Hamming distance.

The K-modes algorithm modifies the core of the k-means algorithm in two main ways:

1. Measurement method. The distance D between two samples counts 0 for each attribute that matches and 1 for each attribute that differs, summed over all attributes. Therefore, the larger D is, the more dissimilar the two samples are (the same interpretation as Euclidean distance);

Hamming distance: the Hamming distance can also be used to measure the similarity of two vectors. Compare the vectors position by position; whenever a position differs, add 1 to the distance. The more similar the vectors, the smaller the Hamming distance. For example, 10001001 and 10110001 differ in 3 positions.
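The distance described above can be sketched in a few lines of plain Python:

```python
def hamming(u, v):
    """Number of positions at which the attribute values differ."""
    assert len(u) == len(v)
    return sum(x != y for x, y in zip(u, v))

print(hamming("10001001", "10110001"))  # 3  (the example from the text)
print(hamming(["a", "b"], ["a", "c"]))  # 1  (works on any attribute values)
```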

2. Updating the modes. Use the most frequent value of each attribute within a cluster as the representative value of that cluster for that attribute. For example, for the cluster {[a,b], [a,c], [c,b], [b,c]}, the representative mode is [a,b] or [a,c] (the first attribute's mode is a; the second attribute is tied between b and c);
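The mode-update step can be sketched as follows (ties between equally frequent values are broken by first occurrence here; a real implementation may break them differently):

```python
from collections import Counter

def cluster_mode(samples):
    """Per-attribute most frequent value across the cluster's samples."""
    mode = []
    for attr_values in zip(*samples):  # iterate attribute columns
        # most_common(1) returns the top (value, count) pair;
        # for ties, Counter keeps first-insertion order
        mode.append(Counter(attr_values).most_common(1)[0][0])
    return mode

cluster = [["a", "b"], ["a", "c"], ["c", "b"], ["b", "c"]]
print(cluster_mode(cluster))  # ['a', 'b'] -- 'b' wins the tie with 'c'
```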

from kmodes.kmodes import KModes

# n_clusters is the number of clusters k;
# init='Huang' is one of the supported initialization methods
km = KModes(n_clusters=k, init='Huang')
labels = km.fit_predict(X)

K-prototypes

Suitable for mixed data (both discrete and continuous attributes).

The K-prototypes algorithm combines the K-means and K-modes algorithms, solving two core problems for mixed attributes as follows:

1. Measuring distance over mixed attributes: use the K-means measure on the numerical attributes to get P1, and the K-modes measure (mismatch count) on the categorical attributes to get P2; then D = P1 + a*P2, where a is a weight. If you consider the categorical attributes more important, increase a; otherwise decrease a. When a = 0, only the numerical attributes contribute.
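A sketch of this mixed measure, assuming squared Euclidean distance for the numeric part (the choice used by k-means-style objectives) and mismatch counting for the categorical part:

```python
def mixed_distance(x_num, y_num, x_cat, y_cat, a=1.0):
    """D = P1 + a * P2: squared Euclidean distance on the numeric
    attributes plus a-weighted mismatch count on the categorical ones."""
    p1 = sum((xi - yi) ** 2 for xi, yi in zip(x_num, y_num))
    p2 = sum(xi != yi for xi, yi in zip(x_cat, y_cat))
    return p1 + a * p2

# numeric attrs (1.0, 2.0) vs (2.0, 4.0): P1 = 1 + 4 = 5
# categorical attrs ("a","b") vs ("a","c"): P2 = 1
print(mixed_distance([1.0, 2.0], [2.0, 4.0], ["a", "b"], ["a", "c"], a=0.5))
# 5 + 0.5 * 1 = 5.5
```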

2. Updating the cluster center: combine the update rules of K-means and K-modes, i.e. take the mean of the numerical attributes and the mode of the categorical attributes.
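The combined center update can be sketched like this (mean over numeric columns, most frequent value over categorical columns; ties broken by first occurrence in this sketch):

```python
from collections import Counter

def update_prototype(num_parts, cat_parts):
    """Cluster prototype: mean of each numeric column (K-means rule),
    mode of each categorical column (K-modes rule)."""
    means = [sum(col) / len(col) for col in zip(*num_parts)]
    modes = [Counter(col).most_common(1)[0][0] for col in zip(*cat_parts)]
    return means, modes

nums = [[1.0, 2.0], [3.0, 4.0]]
cats = [["a", "b"], ["a", "c"]]
print(update_prototype(nums, cats))  # ([2.0, 3.0], ['a', 'b'])
```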


from kmodes.kprototypes import KPrototypes

# n_clusters is the number of clusters k; init='Cao' is an initialization method;
# categorical takes the column indices of the categorical attributes in X
kp = KPrototypes(n_clusters=k, init='Cao')
labels = kp.fit_predict(X, categorical=categorical_columns)


Origin blog.csdn.net/pearl8899/article/details/134818856