1. Introduction of the authors
Zhang Yong, male, graduate student (class of 2022), School of Electronic Information, Xi'an Polytechnic University
Research direction: intelligent information processing and information systems
Email: [email protected]
Chen Mengdan, female, graduate student (class of 2022), School of Electronic Information, Xi'an Polytechnic University; member of Zhang Hongwei's artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]
2. K-Means clustering algorithm
2.1 Basic concepts
K-Means (the K-means algorithm) is an iterative cluster-analysis algorithm: it groups together data members that are similar in some respect. Given a set of data points and a user-specified number of clusters K, the algorithm repeatedly partitions the data into K clusters according to a distance function.
The K-means algorithm is fast, simple in principle, and easy to implement, but it also has shortcomings: (1) the number of clusters K must be chosen in advance; (2) different random initializations may produce different clustering results, so the results are not repeatable and lack consistency; (3) it often terminates at a local optimum; (4) it is sensitive to noise and outliers, and is unsuitable for finding clusters of non-convex shape.
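Shortcomings (2) and (3) can be seen directly: with a single random initialization per run (n_init=1), different seeds can converge to different local optima and hence different inertia values. A small sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs plus a single outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[20.0, 20.0]]])  # the outlier

# With n_init=1 each run keeps only one random initialization,
# so different seeds can end at different local optima.
inertias = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X).inertia_
            for s in range(5)]
print(inertias)
```

Comparing the printed inertia values across seeds shows whether the runs agreed; this is exactly why scikit-learn's default restarts K-Means several times and keeps the best result.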
2.2 Algorithm process
The core goal of K-Means is to partition a given data set {x_1, x_2, …, x_M} into K clusters (K is a hyperparameter) and to output the center assigned to each sample. The procedure breaks down into four simple steps:
- Data preprocessing, mainly standardization and outlier filtering.
- Randomly select K centers, denoted μ_1^(0), μ_2^(0), …, μ_K^(0).
- Define the loss function J(c, μ) = Σ_{i=1}^{M} ‖x_i − μ_{c_i}‖², where c_i is the cluster to which sample x_i is assigned.
- Let t = 0, 1, 2, … be the iteration index, and repeat the following until J converges:
(1) For each sample x_i, assign it to the nearest center: c_i^(t) = argmin_k ‖x_i − μ_k^(t)‖².
(2) For each center k, recompute it as the mean of the samples currently assigned to it.
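The iterative loop above can be sketched directly in NumPy. This is a minimal illustration, assuming Euclidean distance and that no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means loop following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Randomly pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step (1): assign each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (2): recompute each center as the mean of its cluster
        # (assumes every cluster is non-empty).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # J has converged
            break
        centers = new_centers
    return labels, centers
```

On two well-separated blobs the loop recovers the obvious partition after a few iterations.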
3. Implementation of K-Means clustering algorithm
3.1 Iris data set
The Iris data set contains 3 classes, Iris-setosa, Iris-versicolor, and Iris-virginica, with 150 records in total, 50 per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features are typically used to predict which species an iris flower belongs to.
The Iris dataset is a .csv file with the following data format:
The first row of the file is a header: 150 (the total number of records in the data set); 4 (the number of features), namely sepal length, sepal width, petal length, and petal width; and setosa, versicolor, virginica, the three iris species names.
From the second row on, each row is one record: the first column is the sepal length, the second column the sepal width, the third column the petal length, and the fourth column the petal width; the last column is the class label (0 means setosa, 1 means versicolor, 2 means virginica).
3.2 Preparations
1. First, install scikit-learn in your own Python environment (activate your personal virtual environment and run):
pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple
2. Load the dataset:
from sklearn.datasets import load_iris
iris = load_iris()
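As a quick check, the loaded object exposes the data matrix, the feature names, and the species names:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```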
3.3 Code implementation
When K is equal to 2, 3, and 4 respectively, the specific implementation code is as follows:
K=2:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:]
estimator = KMeans(n_clusters=2) # build the clusterer: cluster into 2 classes
estimator.fit(X) # run the clustering
label_pred = estimator.labels_ # get the cluster label of each sample
# plot the clusters
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
K=3:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:]
estimator = KMeans(n_clusters=3) # build the clusterer: cluster into 3 classes
estimator.fit(X) # run the clustering
label_pred = estimator.labels_ # get the cluster label of each sample
# plot the clusters
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c = "blue", marker='+', label='label2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
K=4:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:]
estimator = KMeans(n_clusters=4) # build the clusterer: cluster into 4 classes
estimator.fit(X) # run the clustering
label_pred = estimator.labels_ # get the cluster label of each sample
# plot the clusters
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
x3 = X[label_pred == 3]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c = "blue", marker='+', label='label2')
plt.scatter(x3[:, 0], x3[:, 1], c = "orange", marker='x', label='label3')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
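Shortcoming (1) noted earlier, that K must be chosen in advance, is commonly addressed with the elbow heuristic: run K-Means for several candidate values of K and look for the bend in the curve of inertia (within-cluster sum of squares). A minimal sketch on the Iris data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
# Inertia (within-cluster sum of squares) for each candidate K;
# the "elbow" of this curve is a common heuristic for choosing K.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 7)}
for k, j in inertias.items():
    print(k, round(j, 1))
```

On Iris the curve drops sharply up to K = 3 and flattens afterwards, which matches the 3 known species.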
3.4 Result display
4. Questions and Analysis
When constructing the clusterer, changing the value of K (that is, the number of classes/clusters the data is divided into) is not just a matter of editing the parameter n_clusters=2, 3, or 4: the plotting code must be updated to match.
For example, if n_clusters=4 but the plotting code from the K=3 case is reused, the program does not raise an error, yet only three clusters are drawn and the experimental result is wrong. The faulty code and its result are shown below:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:]
estimator = KMeans(n_clusters=4) # build the clusterer: cluster into 4 classes
estimator.fit(X) # run the clustering
label_pred = estimator.labels_ # get the cluster label of each sample
# plot the clusters (bug: only 3 of the 4 clusters are handled below)
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c = "blue", marker='+', label='label2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
Result of running the faulty code:
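A simple way to avoid this class of mistake is to loop over the K labels instead of hard-coding x0, x1, x2, and so on. The sketch below plots all clusters for any K; the output filename is a hypothetical choice, and the non-interactive "Agg" backend is selected so the script also runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; replace plt.show() with savefig
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k = 4
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# One scatter call per cluster, driven by k rather than hard-coded variables.
colors = ["red", "green", "blue", "orange", "purple", "brown"]
markers = ["o", "*", "+", "x", "s", "d"]
for j in range(k):
    cluster = X[labels == j]
    plt.scatter(cluster[:, 0], cluster[:, 1],
                c=colors[j % len(colors)], marker=markers[j % len(markers)],
                label=f"label{j}")
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.legend(loc=2)
plt.savefig("kmeans_k4.png")
```

Changing k is now a one-line edit, and the plot always shows every cluster.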
Reference links
https://zhuanlan.zhihu.com/p/184686598?utm_source=qq
https://blog.csdn.net/u010916338/article/details/86487890