Fuzzy C-means algorithm (FCM)

The code and data for this article have been uploaded to GitHub: https://github.com/helloWorldchn/MachineLearning

1. Introduction to FCM Algorithm

1. Fuzzy set theory

L. A. Zadeh first proposed fuzzy set theory in 1965. Whereas traditional hard clustering algorithms enforce a strict membership relation in which a sample's membership degree is either 0 or 1, fuzzy set theory relaxes membership to any value between 0 and 1, so a sample can belong to different clusters with different degrees of membership. This greatly improves the ability of clustering algorithms to handle real data sets, and fuzzy clustering thus came into view. The FCM algorithm is widely used in data mining, machine learning, computer vision, and image processing.

2. FCM algorithm

The fuzzy C-means clustering algorithm, abbreviated FCM, is a kind of soft clustering method. The FCM algorithm was first proposed by Dunn in 1974 and later generalized by Bezdek.

A hard clustering algorithm applies a hard criterion when classifying: according to that criterion, each sample belongs to exactly one class.
A soft clustering algorithm focuses instead on degrees of membership. Each membership degree lies in [0,1], every object has a membership degree for each class, and these degrees sum to 1. The higher the degree, the stronger the similarity.
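As a minimal illustration of the difference (the membership values below are made up for the example), a hard assignment is a one-hot row, while a soft assignment is a row of membership degrees in [0,1] that sums to 1:

import numpy as np

# Hard clustering: each of the two samples belongs to exactly one of 3 clusters.
hard_U = np.array([[1, 0, 0],
                   [0, 1, 0]])
# Soft (fuzzy) clustering: each row holds membership degrees in [0,1] summing to 1.
soft_U = np.array([[0.80, 0.15, 0.05],
                   [0.10, 0.70, 0.20]])
print(soft_U.sum(axis=1))  # [1. 1.] -- the memberships of each sample sum to 1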

3. Algorithm thinking

Fuzzy C-means clustering (FCM) is a soft clustering method. Its idea is to use membership degrees to describe the relationship between each data point and each cluster, and thereby determine which cluster each data point belongs to. At the same time, FCM is an objective-function-based algorithm. Given a data set containing $n$ samples $X=\{x_1,x_2,\dots,x_i,\dots,x_n\}$, where $x_i$ is the $i$-th feature vector, $x_{ij}$ is the $j$-th attribute of $x_i$, and each sample has $d$ attributes, the FCM algorithm partitions this data set into $k$ classes, where $k$ is a positive integer greater than 1 and the cluster centers of the $k$ classes are $[v_1,v_2,\dots,v_k]$.
The objective function and constraints of FCM are as follows:

$$J(U,V)=\displaystyle\sum_{i=1}^{n}\displaystyle\sum_{j=1}^{k} u_{ij}^m d_{ij}^2$$

$$\displaystyle\sum_{j=1}^{k} u_{ij}=1,\quad u_{ij}\in[0,1]$$

Among them, $u_{ij}$ is the membership degree of sample point $x_i$ to cluster center $v_j$, $m$ is the fuzzy index ($m>1$), and $d_{ij}$ is the distance between sample point $x_i$ and cluster center $v_j$, usually the Euclidean distance. Clustering amounts to finding the minimum of the objective function under the constraint; the FCM algorithm obtains a fuzzy partition of the sample set by iteratively optimizing the objective function.
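As a small sketch of this objective function (assuming numpy arrays X of shape (n, d) for the samples, V of shape (k, d) for the centers, and U of shape (n, k) for the membership matrix; these names are illustrative and not part of the implementation later in this article):

import numpy as np

def fcm_objective(X, V, U, m=2.0):
    # J(U, V) = sum_i sum_j u_ij^m * d_ij^2, with d_ij the Euclidean distance
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    return np.sum((U ** m) * dist ** 2)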

To obtain the minimum value of the objective function J under the constraint, the Lagrange multiplier method is applied to the objective function, which yields the membership matrix U and the cluster centers $v_j$:

$$u_{ij}=\frac{1}{\displaystyle\sum_{c=1}^{k}\left(\frac{d_{ij}}{d_{ic}}\right)^{\frac{2}{m-1}}}$$

$$v_j=\frac{\displaystyle\sum_{i=1}^{n} u_{ij}^m x_i}{\displaystyle\sum_{i=1}^{n} u_{ij}^m}$$
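A minimal numpy sketch of these two update equations, under the same shape assumptions as the objective sketch above (a small eps guards against division by zero when a sample coincides with a center):

import numpy as np

def update_membership(X, V, m=2.0, eps=1e-10):
    # u_ij = 1 / sum_c (d_ij / d_ic)^(2/(m-1))
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + eps
    ratios = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)

def update_centers(X, U, m=2.0):
    # v_j = sum_i u_ij^m * x_i / sum_i u_ij^m
    Um = U ** m
    return (Um.T @ X) / Um.sum(axis=0)[:, None]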

4. Algorithm steps

The specific description of the algorithm is as follows:

Input: number of clusters $k$, data set $X=\{x_1,x_2,\dots,x_i,\dots,x_n\}$, fuzzy index $m$, termination error $\varepsilon$
Output: cluster centers $[v_1,v_2,\dots,v_k]$, membership degree matrix $U=[u_{ij}]$
Step 1: Initialize the parameter values $k$, $m$ and the iteration tolerance $\varepsilon$;
Step 2: Initialize the iteration counter $l=0$ and the membership degree matrix $U^{(0)}$;
Step 3: Update the membership degree matrix and the cluster centers according to the formulas in the previous section;
Step 4: Compare $J^{l}$ and $J^{(l-1)}$: if $||J^{l}-J^{(l-1)}|| \le \varepsilon$, the stopping condition is satisfied and the iteration stops; otherwise set $l=l+1$ and return to Step 3 to continue iterating (see the sketch below).
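A minimal sketch of this iteration loop and stopping rule, reusing the update_membership, update_centers, and fcm_objective helpers sketched above:

import numpy as np

def fcm_iterate(X, k, m=2.0, eps=1e-5, max_iter=100):
    U = np.random.dirichlet(np.ones(k), size=len(X))  # random membership rows summing to 1
    J_prev = None
    for l in range(max_iter):
        V = update_centers(X, U, m)
        U = update_membership(X, V, m)
        J = fcm_objective(X, V, U, m)
        if J_prev is not None and abs(J - J_prev) <= eps:
            break  # |J^l - J^(l-1)| <= epsilon: stopping condition satisfied
        J_prev = J
    return U, V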

The pseudo code is as follows:

Input: data set X, number of clusters k, iteration threshold T, iteration counter t;
Output: cluster centers V, membership degree matrix U;
	init U;        // initialize the membership degree matrix U
	calculate v;   // compute the cluster centers by the formula
	calculate u;   // compute the membership degrees by the formula
	update U;      // update and assemble the membership degree matrix U
	calculate J;   // compute the objective function J
	t += 1;
	if t > T
		return V, U
	else
		go back to "calculate v";
	end if

The FCM flow chart is as follows:
FCM flow chart

2. Code implementation (Python3)

The data sets used in this article are from the UCI repository: the iris data set Iris, the wine data set Wine, and the wheat seed data set Seeds are used for testing. These three data sets were downloaded from the UCI website and placed in the same folder as the Python file. In addition, because of the program's requirements, the column order of the data sets was changed slightly so that the label is the last column. The specific information of the data sets is as follows:

| Data set | Number of samples | Attribute dimension | Number of classes |
|----------|-------------------|---------------------|-------------------|
| Iris     | 150               | 4                   | 3                 |
| Wine     | 178               | 13                  | 3                 |
| Seeds    | 210               | 7                   | 3                 |

The data sets are available in my homepage resources as a free download; if you cannot download them, you can send me a private message.
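For reference, the code below expects the label to be the last column of each CSV file. A small sketch of moving a label column to the end with pandas (the column name "class" is an assumption; substitute the actual label column of your file):

import pandas as pd

df = pd.read_csv("dataset/iris.csv", header=0)
label_col = "class"  # hypothetical name of the label column
df = df[[c for c in df.columns if c != label_col] + [label_col]]  # move label to the end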

1. Python3 code implementation

from pylab import *
import pandas as pd
import numpy as np
import operator
import math
import time
import matplotlib.pyplot as plt
import random
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import normalized_mutual_info_score  # NMI
from sklearn.metrics import rand_score  # RI
from sklearn.metrics import accuracy_score  # ACC
from sklearn.metrics import f1_score  # F-measure

# The data are stored in .csv files
iris = pd.read_csv("dataset/iris.csv", header=0)  # Iris data set, class=3
wine = pd.read_csv("dataset/wine.csv")  # Wine data set, class=3
seeds = pd.read_csv("dataset/seeds.csv")  # wheat seed data set Seeds, class=3
wdbc = pd.read_csv("dataset/wdbc.csv")  # Breast Cancer Wisconsin (Diagnostic) data set, class=2
glass = pd.read_csv("dataset/glass.csv")  # Glass Identification data set, class=6
df = iris  # choose the data set to use
# print(df)
columns = list(df.columns)  # the first row usually holds the feature names, so take it out first
features = columns[:len(columns) - 1]  # feature names (the last column holds labels, not data)
dataset = df[features]  # preprocessed data with the label column removed
class_labels = list(df[columns[-1]])  # ground-truth labels
attributes = len(df.columns) - 1  # number of attributes (data set dimensionality)
k = 3  # number of clusters
MAX_ITER = 20  # maximum number of iterations
n = len(dataset)  # number of samples
m = 2.00  # fuzzy index


# Initialize the fuzzy membership matrix U
def initializeMembershipMatrix():
    membership_mat = list()
    for i in range(n):
        random_num_list = [random.random() for i in range(k)]
        summation = sum(random_num_list)
        temp_list = [x / summation for x in random_num_list]  # normalize so each row sums to 1
        membership_mat.append(temp_list)
    return membership_mat


# Compute the cluster centers
def calculateClusterCenter(membership_mat):
    cluster_mem_val = zip(*membership_mat)
    cluster_centers = list()
    cluster_mem_val_list = list(cluster_mem_val)
    for j in range(k):
        x = cluster_mem_val_list[j]
        x_raised = [e ** m for e in x]
        denominator = sum(x_raised)
        temp_num = list()
        for i in range(n):
            data_point = list(dataset.iloc[i])
            prod = [x_raised[i] * val for val in data_point]
            temp_num.append(prod)
        numerator = map(sum, zip(*temp_num))
        center = [z / denominator for z in numerator]  # computed for every dimension
        cluster_centers.append(center)
    return cluster_centers


# Update the membership degrees
def updateMembershipValue(membership_mat, cluster_centers):
    p = float(2 / (m - 1))  # exponent 2/(m-1) from the membership update formula
    data = []
    for i in range(n):
        x = list(dataset.iloc[i])  # take each row of the data
        data.append(x)
        distances = [np.linalg.norm(list(map(operator.sub, x, cluster_centers[j]))) for j in range(k)]
        for j in range(k):
            den = sum([math.pow(float(distances[j] / distances[c]), p) for c in range(k)])
            membership_mat[i][j] = float(1 / den)
    return membership_mat, data


# Extract the clustering result (argmax over each sample's membership degrees)
def getClusters(membership_mat):
    cluster_labels = list()
    for i in range(n):
        max_val, idx = max((val, idx) for (idx, val) in enumerate(membership_mat[i]))
        cluster_labels.append(idx)
    return cluster_labels


def fuzzyCMeansClustering():
    # main routine
    membership_mat = initializeMembershipMatrix()
    curr = 0
    start = time.time()  # start timing
    while curr <= MAX_ITER:  # maximum number of iterations
        cluster_centers = calculateClusterCenter(membership_mat)
        membership_mat, data = updateMembershipValue(membership_mat, cluster_centers)
        cluster_labels = getClusters(membership_mat)
        curr += 1

    print("Elapsed time: {0}".format(time.time() - start))
    # print(membership_mat)
    return cluster_labels, cluster_centers, data, membership_mat


labels, centers, data, membership = fuzzyCMeansClustering()


def clustering_indicators(labels_true, labels_pred):
    if type(labels_true[0]) != int:
        labels_true = LabelEncoder().fit_transform(df[columns[len(columns) - 1]])  # if the labels are text, encode them as integers
    f_measure = f1_score(labels_true, labels_pred, average='macro')  # F-measure
    accuracy = accuracy_score(labels_true, labels_pred)  # ACC
    normalized_mutual_information = normalized_mutual_info_score(labels_true, labels_pred)  # NMI
    rand_index = rand_score(labels_true, labels_pred)  # RI
    return f_measure, accuracy, normalized_mutual_information, rand_index


F_measure, ACC, NMI, RI = clustering_indicators(class_labels, labels)
print("F_measure:", F_measure, "ACC:", ACC, "NMI", NMI, "RI", RI)
# print(centers)
center_array = array(centers)
label = array(labels)
datas = array(data)
if attributes > 2:
    dataset = PCA(n_components=2).fit_transform(dataset)  # if there are more than 2 attributes, reduce to 2D
# scatter plot
plt.scatter(dataset[:, 0], dataset[:, 1], marker='o', c='black', s=7)  # original data
plt.show()
colors = np.array(["red", "blue", "green", "orange", "purple", "cyan", "magenta", "beige", "hotpink", "#88c999"])
# plot the k clusters in turn, each in a different color
for i in range(k):
    plt.scatter(dataset[nonzero(label == i), 0], dataset[nonzero(label == i), 1], c=colors[i], s=7)
# plt.scatter(center_array[:, 0], center_array[:, 1], marker='x', color='m', s=30)  # cluster centers
plt.show()

2. Analysis of clustering results

In this paper, the F-measure (FM), accuracy (ACC), normalized mutual information (NMI), and Rand index (RI) are selected as evaluation indicators. Their value ranges are all [0,1], and the larger the value, the better the clustering result matches expectations.

The F-measure combines the precision and recall indicators; its value is the harmonic mean of precision and recall. It is calculated as follows:

$$Precision=\frac{TP}{TP+FP}$$

$$Recall=\frac{TP}{TP+FN}$$

$$F\text{-}measure=\frac{2\times Recall\times Precision}{Recall+Precision}$$

ACC is the ratio of the number of correctly classified samples to the total number of samples in the data set, and the calculation formula is as follows:

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$

Among them, TP (True Positive) is the number of positive samples predicted as positive, TN (True Negative) is the number of negative samples predicted as negative, FP (False Positive) is the number of negative samples predicted as positive, and FN (False Negative) is the number of positive samples predicted as negative.
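As a small illustration of these counts and formulas (the label vectors below are made up for the example), they can be computed with sklearn:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)  # ACC from the formula above
print(tp, tn, fp, fn, acc)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))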

NMI is used to quantify the degree of agreement between the clustering result and the known class labels. Compared with ACC, the value of NMI is not affected by the permutation of the category labels. It is calculated as follows:

$$NMI=\frac{I(U,V)}{\sqrt{H(U)H(V)}}$$

Among them, $I(U,V)$ is the mutual information between the two partitions, $H(U)$ is the entropy of the correct classification, and $H(V)$ is the entropy of the result obtained by the algorithm.
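A minimal sketch of this definition (using scipy's entropy and sklearn's mutual_info_score; in practice the normalized_mutual_info_score call used in the main code is the convenient choice):

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]  # hypothetical partitions
labels_pred = [0, 0, 1, 2, 2, 2]

def nmi(u, v):
    i_uv = mutual_info_score(u, v)  # I(U, V)
    h_u = entropy(np.bincount(u))   # H(U)
    h_v = entropy(np.bincount(v))   # H(V)
    return i_uv / np.sqrt(h_u * h_v)

print(nmi(labels_true, labels_pred))
print(normalized_mutual_info_score(labels_true, labels_pred, average_method='geometric'))  # should match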

The specific implementation code is as follows. Since the ground-truth labels in a data set may be text rather than numbers, the code first checks whether the labels are numeric and, if not, converts them to numeric labels:

def clustering_indicators(labels_true, labels_pred):
    if type(labels_true[0]) != int:
        labels_true = LabelEncoder().fit_transform(df[columns[len(columns) - 1]])  # if the labels are text, encode them as integers
    f_measure = f1_score(labels_true, labels_pred, average='macro')  # F-measure
    accuracy = accuracy_score(labels_true, labels_pred)  # ACC
    normalized_mutual_information = normalized_mutual_info_score(labels_true, labels_pred)  # NMI
    rand_index = rand_score(labels_true, labels_pred)  # RI
    return f_measure, accuracy, normalized_mutual_information, rand_index


F_measure, ACC, NMI, RI = clustering_indicators(class_labels, labels)
print("F_measure:", F_measure, "ACC:", ACC, "NMI", NMI, "RI", RI)

If you need to calculate these clustering indicators, just insert the above code into the implementation code.

3. Clustering results

  1. Iris data set Iris

The original plot of the Iris data set

FCM clustering result on the Iris data set

  2. Wine data set Wine

The original plot of the Wine data set

FCM clustering result on the Wine data set

  3. Wheat seed data set Seeds

The original plot of the Seeds data set

FCM clustering result on the Seeds data set

4. Insufficiency of FCM algorithm

The core step of the FCM algorithm is to update the cluster centers through repeated iteration so as to minimize the within-cluster distances. The time complexity of the algorithm is low, so it has been widely used, but it also has a number of shortcomings, mainly the following:

  1. The number of clusters must be specified by the user. The FCM algorithm first requires the user to specify the number of clusters K, and the choice of K directly affects the clustering result. Usually the user must specify K based on experience and understanding of the data set, so the specified value may not be ideal and the clustering result cannot be guaranteed.
  2. The initial centers of the FCM algorithm are selected randomly. The algorithm depends heavily on this initialization: a poor choice of initial centers strongly affects the subsequent clustering process, is likely to prevent the algorithm from reaching the optimal clustering result, and may also increase the number of iterations required. The randomness of the initialization introduces great uncertainty, which directly affects the clustering quality (a mitigation sketch follows this list).
  3. FCM uses the Euclidean distance as its similarity measure, so it is difficult to achieve a good clustering effect on non-convex data sets.
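A minimal sketch of the mitigation mentioned in point 2 (the fcm function and its return values here are hypothetical; the implementation above would need to be wrapped to also return the final objective value J):

import random

def best_of_n_runs(fcm, X, k, n_runs=10):
    # Run FCM several times with different seeds for the random initialization
    # and keep the run whose final objective value J is smallest.
    best = None
    for seed in range(n_runs):
        random.seed(seed)  # controls the random initialization of U
        labels, centers, J = fcm(X, k)  # hypothetical signature
        if best is None or J < best[2]:
            best = (labels, centers, J)
    return best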
