Clustering the Iris Dataset with the K-Means Algorithm

1. Introduction of the author

Zhang Yong, male, graduate student of 2022, School of Electronic Information, Xi'an Polytechnic University
Research direction: Intelligent information processing and information system research
Email: [email protected]

Chen Mengdan, female, graduate student of 2022, School of Electronic Information, Xi'an Polytechnic University, member of Zhang Hongwei's artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]

2. K-Means clustering algorithm

2.1 Basic concepts

The K-Means (k-means) clustering algorithm is an iterative cluster-analysis algorithm: it groups together data members that are similar in some respect. Given a set of data points and a required number of clusters K, where K is specified by the user, the algorithm repeatedly partitions the data into K clusters according to a chosen distance function.

The advantages of the K-means algorithm are that it is fast, simple in principle, and easy to apply. It also has shortcomings: (1) the number of groups or classes K must be chosen in advance; (2) different runs may produce different clustering results, so the results are not repeatable and lack consistency; (3) the algorithm often terminates at a local optimum; (4) it is sensitive to noise and outliers, and it is not suited to finding clusters of non-convex shape.
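To make shortcoming (3) concrete: because the result depends on the random initialization of the centers, repeated runs can settle in different local optima. The short check below is an illustrative addition (not part of the original experiment); it runs scikit-learn's KMeans on the iris data with a single initialization per run (n_init=1) and different random_state values, and prints the final inertia (the K-means loss), which can differ from run to run.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
# One initialization per run, so each run may end in a different local optimum.
for seed in range(5):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(f"random_state={seed}: inertia={km.inertia_:.2f}")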

2.2 Algorithm process

The core goal of K-Means is to divide a given data set into K clusters (K is a hyperparameter) and to report the center associated with each sample. The procedure is simple and can be split into 4 steps (a minimal NumPy sketch of these steps follows the list):

  1. Data preprocessing, mainly standardization and outlier filtering.
  2. Randomly select K centers, denoted $\mu_1^{(0)}, \mu_2^{(0)}, \dots, \mu_K^{(0)}$.
  3. Define the loss function $J(c, \mu) = \sum_{i=1}^{M} \lVert x_i - \mu_{c_i} \rVert^2$, where $c_i$ is the cluster to which sample $x_i$ is currently assigned and $M$ is the number of samples.
  4. Let t = 0, 1, 2, ... be the iteration step, and repeat the following until J converges:
    (1) For each sample $x_i$, assign it to the nearest center: $c_i^{(t)} = \arg\min_k \lVert x_i - \mu_k^{(t)} \rVert^2$.
    (2) For each class center k, recalculate the center of that class as the mean of the samples currently assigned to it: $\mu_k^{(t+1)} = \arg\min_{\mu} \sum_{i:\, c_i^{(t)} = k} \lVert x_i - \mu \rVert^2$.
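
The following is a minimal NumPy sketch of steps 2-4 above, added here for illustration; the function and variable names are invented for this sketch, and the experiments later in this article use scikit-learn's KMeans instead.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct samples as the initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 4(1): assign every sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4(2): move each center to the mean of its assigned samples
        # (for simplicity, empty clusters are not handled here)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):   # J has converged
            break
        centers = new_centers
    return labels, centers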

3. Implementation of K-Means clustering algorithm

3.1 Iris data set

The Iris data set contains 3 categories (Iris-setosa, Iris-versicolor and Iris-virginica) and 150 records in total, 50 per category. Each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can usually be used to predict which species an iris flower belongs to.
The Iris dataset is distributed as a .csv file with the following format.

The first row contains: 150 (the total number of records in the data set); 4 (the number of feature columns, i.e. sepal length, sepal width, petal length, and petal width); and setosa, versicolor, virginica (the three iris species names).

From the second row onward, each row describes one sample: the first column is the sepal length, the second column is the sepal width, the third column is the petal length, the fourth column is the petal width, and the last column is the class label (0 means setosa, 1 means versicolor, 2 means virginica).
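As a quick check of this layout, the same information can be read directly from the copy of the dataset bundled with scikit-learn (an illustrative addition):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)        # (150, 4): 150 samples, 4 feature columns
print(iris.feature_names)     # sepal length/width, petal length/width (in cm)
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']
print(iris.data[:3])          # first three rows of feature values
print(iris.target[:3])        # their class labels (0 = setosa)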

3.2 Preparations

1. First, install scikit-learn in your Python environment (activate your personal virtual environment and run):

pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Load the dataset:

from sklearn.cluster import KMeans        # clustering estimator used later
from sklearn.datasets import load_iris    # built-in copy of the Iris data
iris = load_iris()

3.3 Code implementation

When K is equal to 2, 3, and 4 respectively, the specific implementation code is as follows:
K=2:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # all 150 samples, 4 features
estimator = KMeans(n_clusters=2)       # build the clusterer: 2 clusters
estimator.fit(X)                       # run the clustering
label_pred = estimator.labels_         # cluster label of each sample
# plot the first two features (sepal length vs. sepal width)
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()

K=3:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # all 150 samples, 4 features
estimator = KMeans(n_clusters=3)       # build the clusterer: 3 clusters
estimator.fit(X)                       # run the clustering
label_pred = estimator.labels_         # cluster label of each sample
# plot the first two features (sepal length vs. sepal width)
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='label2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
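
Since K = 3 matches the number of iris species, it is natural to check how well the three clusters line up with the true labels. The snippet below is an illustrative addition (not part of the original article); it uses the adjusted Rand index from sklearn.metrics, which does not depend on how the cluster numbers happen to be permuted.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
labels = KMeans(n_clusters=3).fit(iris.data).labels_
# 1.0 would mean the clusters reproduce the species exactly; 0 is chance level
print(adjusted_rand_score(iris.target, labels))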

K=4:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # all 150 samples, 4 features
estimator = KMeans(n_clusters=4)       # build the clusterer: 4 clusters
estimator.fit(X)                       # run the clustering
label_pred = estimator.labels_         # cluster label of each sample
# plot the first two features (sepal length vs. sepal width)
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
x3 = X[label_pred == 3]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='label2')
plt.scatter(x3[:, 0], x3[:, 1], c="orange", marker='+', label='label3')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()
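
Because K must be chosen by the user (shortcoming (1) in Section 2.1), it is common to compare several candidate values before settling on one. The sketch below is an illustrative addition: it prints the inertia and the silhouette score for K = 2 to 6; lower inertia and higher silhouette are better, and together they give a rough guide for picking K.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in range(2, 7):
    km = KMeans(n_clusters=k).fit(X)
    print(f"K={k}: inertia={km.inertia_:.2f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")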

3.4 Result display

(Figures: scatter plots of the clustering results for K = 2, 3 and 4, one figure per value of K.)

4. Questions and Analysis

When building the clusterer, changing K (that is, changing how many clusters the data should be divided into) is not simply a matter of setting n_clusters=2, 3 or 4; the plotting part of the program must be modified accordingly.

For example, with n_clusters=4, if the plotting code written for 3 clusters is reused, the program will not report an error, yet only three clusters are drawn and the experimental result is wrong. The erroneous code and its result are shown below:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # all 150 samples, 4 features
estimator = KMeans(n_clusters=4)       # build the clusterer: 4 clusters
estimator.fit(X)                       # run the clustering
label_pred = estimator.labels_         # cluster label of each sample
# plotting code mistakenly copied from the 3-cluster case:
# the fourth cluster (label_pred == 3) is never extracted or plotted
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='label2')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()

Result of the erroneous code: although the data were divided into 4 clusters, only three of them appear in the plot.
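
One way to avoid this kind of mistake (a suggestion added here, not from the original article) is to let the plotting code loop over the clusters instead of hard-coding one scatter call per cluster, so the plot always stays consistent with n_clusters:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

n_clusters = 4
X = load_iris().data
labels = KMeans(n_clusters=n_clusters).fit(X).labels_
# one scatter call per cluster, driven by n_clusters itself
for k in range(n_clusters):
    pts = X[labels == k]
    plt.scatter(pts[:, 0], pts[:, 1], label=f'label{k}')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend(loc=2)
plt.show()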

