C-means clustering algorithm practice - surface vegetation classification/digital clustering


1. Introduction to C-means Algorithm

Clustering, also known as "unsupervised classification", aims to divide data into meaningful or useful groups (clusters). The division can be driven by business or modeling needs, or it can simply help us explore the natural structure and distribution of the data. Clustering can also be used for dimensionality reduction and vector quantization, compressing high-dimensional features into a single column; this is often applied to unstructured data such as images, audio, and video, and can greatly reduce the amount of data.

The C-means algorithm is a very commonly used clustering algorithm. Its basic idea is to iteratively search for a partition into c clusters such that, when each cluster is represented by the mean of its samples, the overall error is minimized. The C-means method is also commonly called the k-means method.

C-means algorithm steps

In the C-means algorithm, the number of clusters C is a hyperparameter that must be specified by hand. The core task of C-means is to find C optimal centroids for the C we set, and to assign each sample to the cluster represented by its nearest centroid. The process can be summarized as follows (a minimal code sketch follows the list):

  • Randomly select C samples as the initial centroids

  • Start the loop:

  • Assign each sample point to its nearest centroid, generating C clusters

  • For each cluster, compute the mean of all sample points assigned to it as the new centroid

  • When the positions of the centroids no longer change, the iteration stops and the clustering is complete
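To make these steps concrete, here is a minimal NumPy sketch of a single assignment-and-update iteration (X, C, centers and the other names are illustrative, not part of the original; a full loop-based implementation appears in Section 4):

import numpy as np

X = np.random.rand(300, 2)                                # 300 samples with 2 features
C = 4
centers = X[np.random.choice(len(X), C, replace=False)]   # step 1: pick C random samples as centroids

# step 2: assign every sample to its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (300, C)
labels = distances.argmin(axis=1)

# step 3: recompute each centroid as the mean of the samples assigned to it
new_centers = np.array([X[labels == j].mean(axis=0) for j in range(C)])

# step 4: stop when the centroids no longer move (only one iteration is shown here)
converged = np.allclose(centers, new_centers)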


▣ Evaluation index 1: Calinski-Harabasz Index (CHI for short, also known as the variance ratio criterion)

The formula is as follows, where Bk is the between-group dispersion matrix, Wk is the within-group dispersion matrix, N is the number of samples and k is the number of clusters:

CHI = [ tr(Bk) / tr(Wk) ] × [ (N − k) / (k − 1) ]

The greater the dispersion between groups, the larger tr(Bk); the smaller the dispersion within a group, the smaller tr(Wk). Therefore, the larger this value, the better the goal of clustering is met: small differences within clusters and large differences between clusters. The Calinski-Harabasz index has no upper bound, and it can come out misleadingly high on convex data. Compared with the silhouette coefficient, however, it has one huge advantage: it is very fast to compute, since it only involves matrix calculations.
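As a quick illustration (not part of the original experiment), the index can be computed directly with scikit-learn's calinski_harabasz_score on a synthetic data set:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
print(calinski_harabasz_score(X, labels))  # larger is better; the index has no upper bound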

▣ Evaluation index 2: SSE + elbow method

The core idea is that as the number of clusters c increases, the samples are partitioned more finely, the cohesion of each cluster gradually improves, and the sum of squared errors (SSE) naturally decreases. When c is below the optimal number of clusters, increasing c greatly improves the cohesion of each cluster, so SSE drops sharply; once c reaches the optimal number of clusters, the gain in cohesion from increasing c further diminishes quickly, so the drop in SSE slows down sharply and then levels off as c keeps growing. In other words, the SSE-versus-c curve has the shape of an elbow, and the c value at the elbow is the optimal number of clusters for the data, which is why the method is called the elbow method. Put simply, we track how SSE changes as c varies and look for the c value beyond which further decreases in SSE become small; that c value is a reasonable choice.
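A minimal sketch of the elbow method with scikit-learn, assuming X is a data set such as the one generated above; KMeans exposes the SSE as the inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sse = []
c_values = range(1, 10)
for c in c_values:
    km = KMeans(n_clusters=c, n_init=10).fit(X)
    sse.append(km.inertia_)            # inertia_ is the within-cluster sum of squared errors (SSE)

plt.plot(c_values, sse, 'o-')
plt.xlabel('number of clusters c')
plt.ylabel('SSE')
plt.show()                             # look for the "elbow" of the curve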

▣ Evaluation index 3: silhouette coefficient

For each sample:

  1. The within-cluster dissimilarity a of a sample is the average distance between the sample and all other points in the same cluster
  2. The dissimilarity b of a sample to other clusters is the average distance between the sample and all points in the nearest neighbouring cluster

According to the requirement of clustering "the difference within the cluster is small, and the difference outside the cluster is large", we hope that b will always be greater than a, and the bigger the better. The silhouette coefficient for a single sample is calculated as: s = (b − a) / max(a, b)

It is easy to see that the silhouette coefficient ranges over (-1, 1). The closer the value is to 1, the more similar the sample is to the other samples in its own cluster and the less similar it is to samples in other clusters. When a sample is more similar to samples outside its cluster than to those inside, the silhouette coefficient is negative. When the silhouette coefficient is 0, the sample is equally similar to the two clusters, which suggests that they should be merged into one cluster.
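A minimal sketch of computing the silhouette coefficient with scikit-learn, assuming X and labels come from a clustering run such as the one sketched earlier:

from sklearn.metrics import silhouette_score, silhouette_samples

print(silhouette_score(X, labels))         # mean silhouette coefficient over all samples
print(silhouette_samples(X, labels)[:5])   # silhouette coefficients of the first five samples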

A simple summary is the proverb "birds of a feather flock together, and people fall into groups of their own kind".

Reference article: Clustering in Machine Learning

2. Introduction to the usage of make_blobs in sklearn

The make_blobs function in sklearn is mainly used to generate synthetic data sets, as follows:

1. Call make_blobs

from sklearn.datasets import make_blobs

2. Usage of make_blobs

data, label = make_blobs(
    n_features=2,  # number of features per sample
    n_samples=100,  # number of samples
    centers=3,  # number of cluster centers, i.e. the number of label classes
    random_state=3,  # random seed, fixes the generated data
    cluster_std=[0.8, 2, 5]  # standard deviation of each class
)
3. Example
"""创建训练的数据集"""
from sklearn.datasets import make_blobs
data, label = make_blobs(n_features=2, n_samples=100, centers=2, random_state=2019, cluster_std=[0.6,0.7] )

data has 2 features (n_features=2) and 100 samples (n_samples=100); label takes only the values 0 or 1 (centers=2) and also has length 100. Once random_state is given a fixed value, the same data set is generated every time, which makes later reproduction convenient; by default the data are generated randomly each time, so pay attention!

At this point, we can set different parameters to generate whatever data set we want, and then move on to the follow-up work!
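As a quick sanity check (not in the original), the generated samples can be plotted and colored by their labels, using the data and label variables created above:

import matplotlib.pyplot as plt

plt.scatter(data[:, 0], data[:, 1], c=label, s=50, cmap='viridis')
plt.title('make_blobs sample data')
plt.show()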


3. Experiment code and results of surface vegetation classification

Dataset generation

# Generate the data set
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Use sklearn's make_blobs function to generate samples. Each sample has 2 attributes:
# the east-west and north-south relative positions of a patch of surface vegetation.
# Generate 300 vegetation samples in 4 classes; vegetation at nearby positions belongs
# to the same class, and each class has a standard deviation of 0.6. Set random_state=0.
X, y_true = make_blobs(n_samples=300,
                       centers=4,
                       cluster_std=0.60,
                       random_state=0
                       )
# Use matplotlib's scatter function to plot the vegetation positions, with marker size 50
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()  # display the data set

(Figure: scatter plot of the generated vegetation positions)

Model building and clustering results:

from sklearn.cluster import KMeans

# Build the model: instantiate scikit-learn's KMeans class with 4 clusters
m_kmeans = KMeans(n_clusters=4, n_init=10)
from sklearn import metrics


def draw(m_kmeans, X, y_pred, n_clusters):
    """
    Evaluate the clustering result with the Calinski-Harabasz index and visualize it
    :param m_kmeans: fitted KMeans object
    :param X: sample attribute set
    :param y_pred: predicted labels of the samples
    :param n_clusters: number of clusters
    :return: None
    """
    # Get the cluster centers from the KMeans object's cluster_centers_ attribute
    centers = m_kmeans.cluster_centers_
    print(centers)
    # Plot the samples: color set by predicted class, point size 50, colormap 'viridis'
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='viridis')
    # Plot the cluster centers in red, size 200, transparency (alpha) 0.5
    plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
    # Use sklearn's metrics.calinski_harabasz_score to evaluate the prediction and print the score
    print("Calinski-Harabasz score:%lf" % metrics.calinski_harabasz_score(X, y_pred))
    # Title the figure "K-Means (clusters = %d)", where %d is the number of clusters, then show it
    plt.title("K-Means (clusters = %d)" % n_clusters, fontsize=20)
    plt.show()

    
if __name__ == '__main__':
    # Train the model: fit the KMeans object on the sample attribute set
    m_kmeans.fit(X)
    # Predict labels for the same sample attribute set with the KMeans object
    y_pred = m_kmeans.predict(X)
    # Evaluate with the Calinski-Harabasz index and plot
    draw(m_kmeans, X, y_pred, 4)

The result is as follows:

 [ 1.98258281  0.86771314]
 [-1.37324398  7.75368871]
 [ 0.94973532  4.41906906]]
 Calinski-Harabasz score:1210.089914

Here I use the Calinski-Harabasz index mentioned at the beginning as the evaluation measure.

(Figure: clustering result with 4 clusters; the red points mark the cluster centers)

The clustering result is displayed intuitively, and it can be seen that the code and its output meet the expected standard. At the same time, the Calinski-Harabasz score reaches about 1210, which shows that the clustering effect is quite good.

4. Expansion

1. Observe what happens to the classification results of the C-means (k-means) method when the pre-set number of clusters is too small.

Selecting the number of clusters has always been a difficult problem for clustering algorithms; see the reference article "Multiple Determination Methods and Theoretical Proofs of the Number of Clusters".

Still using the previous code, change the number of clusters to 3 or 2. A minimal sketch of the change is shown below, followed by the clustering results:
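The sketch reuses X and the draw function from Section 3 and varies only n_clusters:

from sklearn.cluster import KMeans

for c in (3, 2):
    m_kmeans = KMeans(n_clusters=c, n_init=10)
    y_pred = m_kmeans.fit_predict(X)
    draw(m_kmeans, X, y_pred, c)   # prints the Calinski-Harabasz score and plots the clusters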

(Figure: clustering result with 3 clusters)

Calinski-Harabasz score:615.093327

(Figure: clustering result with 2 clusters)

Calinski-Harabasz score:405.753290

It can be seen that a c value that is too small increases the differences within each cluster, the within-cluster sum of squares (cluster Sum of Squares) also grows, and the Calinski-Harabasz index keeps decreasing (simply put, the higher the Calinski-Harabasz index, the better the clustering effect).

2. Handwritten k_means algorithm

  1. Algorithm part: Euclidean distance is used as the distance measure; the default parameter values are chosen arbitrarily.
import numpy as np


def k_means(x, k=4, epochs=500, delta=1e-3):
    # Randomly select k distinct sample points as the initial centers
    indices = np.random.choice(len(x), size=k, replace=False)
    centers = x[indices].astype(float)  # float copy so the centers can be updated precisely
    # Store the classification results (one list of samples per cluster)
    results = []
    for i in range(k):
        results.append([])
    step = 1
    flag = True
    while flag:
        if step > epochs:
            return centers, results
        else:
            # Clear the previous assignment before reassigning
            for i in range(k):
                results[i] = []
        # Assign every sample to the cluster of its nearest center
        for i in range(len(x)):
            current = x[i]
            min_dis = np.inf
            tmp = 0
            for j in range(k):
                distance = dis(current, centers[j])
                if distance < min_dis:
                    min_dis = distance
                    tmp = j
            results[tmp].append(current)
        # Update the centers
        for i in range(k):
            old_center = centers[i]
            new_center = np.array(results[i]).mean(axis=0)
            # If the new center differs from the old one by more than delta, update it
            if dis(old_center, new_center) > delta:
                centers[i] = new_center
                flag = False
        if flag:
            # No center moved more than delta: converged, stop iterating
            break
        else:
            # At least one center moved: reset flag and keep iterating
            flag = True
        step += 1
    return centers, results


def dis(x, y):
    # Euclidean distance between two points
    return np.sqrt(np.sum(np.power(x - y, 2)))
  2. Verification: randomly pick some points in the plane and classify them.
x = np.random.randint(0, 50, size=100)
y = np.random.randint(0, 50, size=100)
z = np.array(list(zip(x, y)))

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(x, y, 'ro')

Before classification:

(Figure: scatter plot of the random points before classification)

Classified result:

centers, results = k_means(z)

color = ['ko', 'go', 'bo', 'yo']
for i in range(len(results)):
    result = results[i]
    plt.plot([res[0] for res in result], [res[1] for res in result], color[i])
plt.plot([res[0] for res in centers], [res[1] for res in centers], 'ro')
plt.show()

(Figure: clustering result with k=4; the red points mark the centers)

Choose different k values:

centers, results = k_means(z, k=5)

color = ['ko', 'go', 'bo', 'yo', 'co']
for i in range(len(results)):
    result = results[i]
    plt.plot([res[0] for res in result], [res[1] for res in result], color[i])
plt.plot([res[0] for res in centers], [res[1] for res in centers], 'ro')
plt.show()

(Figure: clustering result with k=5)

It can be seen that this algorithm is very sensitive to the initial value.
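One common remedy (sketched here as an assumption, not part of the original experiment) is to run the algorithm several times with different random initializations and keep the run with the smallest total within-cluster sum of squared distances, which is essentially what scikit-learn's n_init parameter does. The sketch reuses the handwritten k_means and dis functions above and introduces a hypothetical helper k_means_restarts:

def k_means_restarts(x, k=4, n_init=10):
    """Hypothetical helper: run k_means n_init times and keep the best run."""
    best_centers, best_results, best_sse = None, None, np.inf
    for _ in range(n_init):
        centers, results = k_means(x, k=k)
        # total within-cluster sum of squared distances for this run
        sse = sum(dis(p, centers[i]) ** 2
                  for i in range(k) for p in results[i])
        if sse < best_sse:
            best_centers, best_results, best_sse = centers, results, sse
    return best_centers, best_results


centers, results = k_means_restarts(z, k=4)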

3. Digit clustering with the C-means algorithm.

Experimental code:

# -*- coding: utf-8 -*-
# @Author : Xenon
# @Date : 2023/2/7 14:55 
# @IDE : PyCharm(2022.3.2) Python3.9.13

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def run():
    # Chinese characters do not display correctly in matplotlib by default; these two lines set a suitable font
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    digits = load_digits()  # load the digits data set from sklearn
    images = digits.images
    plt.figure(figsize=(10, 5))
    plt.suptitle('handwritten_Image')
    # the first ten images
    for i in range(10):
        plt.subplot(2, 5, i + 1)
        plt.title('number:%d' % (digits.target[i]))
        plt.imshow(images[i])
        plt.axis('off')
    plt.show()

    # Standardization and cluster centers
    data = scale(digits.data)
    n_digits = len(np.unique(digits.target))
    reduced_data = PCA(n_components=2).fit_transform(data)
    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    kmeans.fit(reduced_data)
    label_pred = kmeans.labels_

    plt.clf()
    plt.figure(figsize=(10, 7))
    centroids = kmeans.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    color_list = ['#000080', '#006400', '#00CED1', '#800000', '#800080',
                  '#CD5C5C', '#DAA520', '#E6E6FA', '#F08080', '#FFE4C4']
    for i in range(n_digits):
        x = reduced_data[label_pred == i]
        plt.scatter(x[:, 0], x[:, 1], c=color_list[i], marker='.', label='label%s' % i)
    plt.title('K-means聚类')
    plt.legend()
    plt.axis('on')
    plt.show()


if __name__ == '__main__':
    run()
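Since the digits data set comes with ground-truth labels, the clustering can also be scored numerically. A minimal sketch (an assumed extension, to be appended inside run() after kmeans.fit(reduced_data)) using the internal Calinski-Harabasz index discussed earlier plus an external index:

from sklearn import metrics

# internal index: uses only the data and the predicted labels; larger is better
print("Calinski-Harabasz score: %.2f"
      % metrics.calinski_harabasz_score(reduced_data, kmeans.labels_))
# external index: compares the clusters with the true digit labels; 1.0 means perfect agreement
print("Adjusted Rand index: %.2f"
      % metrics.adjusted_rand_score(digits.target, kmeans.labels_))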

Experimental results
(Figures: the first ten handwritten digit images, and the scatter plot of the K-means clustering result after PCA reduction to two dimensions)


Source: blog.csdn.net/yxn4065/article/details/128919648