[Machine Learning] A basic understanding of the K-means clustering algorithm: mobile phone classification, image segmentation, and semi-supervised learning in practice


Foreword

This article introduces the K-means algorithm: how it works, its basic operations in scikit-learn, and practical cases covering a mobile phone classification model, image segmentation, and a semi-supervised learning example.

1. Introduction to K-means

K-means clustering is a common unsupervised machine learning algorithm that can be used to divide a data set into multiple clusters. The basic idea is: partition the data set into k clusters, where the center of each cluster is the mean of all the points in that cluster. The cluster centers are refined iteratively so that points within the same cluster are as close together as possible while different clusters are as far apart as possible.
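
In symbols, K-means minimizes the within-cluster sum of squared distances (this is the standard objective, stated here only for reference):

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is its center.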

2. Algorithm Implementation

The K-means clustering algorithm is a basic clustering algorithm. The idea is to partition n samples into k clusters, optimizing the assignment of samples by minimizing the distance between each sample and the center of its cluster. The algorithm is iterative and consists of the following steps (a minimal sketch follows the list):

  • Initialization: randomly select k points as the initial center points.
  • Assignment: for each sample point, compute its distance to the k center points and assign it to the cluster of the nearest center.
  • Update: for each cluster, recompute its center point as the mean of the samples assigned to it.
  • Repeat the assignment and update steps until convergence or until a predetermined number of iterations is reached.
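
Below is a minimal NumPy sketch of these steps, written purely for illustration (the function name kmeans_sketch and its arguments are my own, not scikit-learn's implementation):

import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    # Initialization: pick k random samples as the initial centers
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: distance from every sample to every center, take the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each center becomes the mean of the samples assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Stop when the centers barely move between iterations
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels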

3. Parameters

Before starting the cases, let's first get to know a few parameters so that the later code is easier to follow.

from sklearn.cluster import KMeans

KMeans initialization parameters:

  • n_clusters: specifies the number of clusters, i.e., how many clusters the data will be divided into. It usually has to be chosen based on the actual problem and the characteristics of the data, for example through experience, parameter tuning, or visualization. The default value is 8.
  • init: specifies how the cluster centers are initialized. The options are k-means++ (the default), random, or a user-supplied array. k-means++ prefers points far away from the existing centers as new centers, which speeds up convergence. random picks the initial centers at random, which is fast but may give poorer results. A custom array lets the user specify the initial centers manually.
  • n_init: the number of times the algorithm is run with different initial centers; the best resulting clustering is kept. The default value is 10.
  • max_iter: the maximum number of iterations per run. The default value is 300.
  • tol: the convergence threshold. When the cluster centers move less than this threshold between two iterations, the algorithm is considered converged and iteration stops. The default is 1e-4.
  • random_state: the seed of the random number generator used to pick the initial centers. The default is None, which means the default random number generator is used.

Commonly used KMeans attributes and methods

  • cluster_centers_: the center point of each cluster. It is a two-dimensional array in which each row holds the coordinates of one cluster center; the number of rows equals the number of clusters k.
  • labels_: the cluster label of each sample point. It is a one-dimensional array whose length equals the number of input samples; labels_[i] is the cluster label of the i-th sample.
  • transform(X): transforms the data X into a new space in which each dimension is the distance from the data point to one cluster center. It returns a two-dimensional array in which each row holds the distances from one sample to every cluster center; the number of rows equals the number of input samples.
  • inertia_: the sum of squared errors (SSE) of the clustering, a scalar equal to the sum of squared distances from all sample points to the centers of the clusters they belong to. The smaller inertia_ is, the better the clustering.
  • n_iter_: the number of iterations the KMeans algorithm actually ran.
  • score(X): returns the negative of the clustering cost on X, i.e., the negative of the sum of squared distances from the samples to their nearest cluster centers. The closer it is to 0, the better the clustering.
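
As a quick illustration of these parameters and attributes, here is a small self-contained example on synthetic data (the exact numbers printed will vary):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# a small synthetic data set, just to exercise the API
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10,
                max_iter=300, tol=1e-4, random_state=42)
kmeans.fit(X)

print(kmeans.cluster_centers_)    # (4, 2) array: one row per cluster center
print(kmeans.labels_[:10])        # cluster label of the first 10 samples
print(kmeans.inertia_)            # sum of squared distances to the nearest centers
print(kmeans.n_iter_)             # iterations actually run
print(kmeans.transform(X).shape)  # (300, 4): distance from each sample to each center
print(kmeans.score(X))            # negative clustering cost (closer to 0 is better)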

4. Practical cases

1. Classification of mobile phones

The data consists of feature values such as performance, price, and capacity for different mobile phone brands. For simplicity, we cluster on only the two features price and performance.

  • Data extraction and processing
import pandas as pd
import numpy as np


dr=pd.read_csv('python.csv')
dr=pd.DataFrame(dr)
# convert co_perf and price to float
dr['co_perf']=dr['co_perf'].astype('float')
dr['price']=dr['price'].astype('float')

# drop duplicate rows first
dr=dr.drop_duplicates(keep='first')

# print(df)
# df=df.drop('model',axis=1)
# check whether there are any infinite values
# print(np.isfinite(df).any())
"""infinite values were detected; replace them with 0"""
dr.replace(np.inf,0,inplace=True)

dr
  • Data observation
import matplotlib.pyplot as plt
import matplotlib
# matplotlib.use('TkAgg')
from sklearn.datasets import make_blobs

from sklearn.cluster import KMeans

X=dr[['co_perf','price']].values.reshape(dr.shape[0],2)
X
plt.plot(X[:,0],X[:,1],'b.')
plt.show()

[Figure: scatter plot of the raw co_perf vs price data]
It can be seen that most of the data is concentrated in the lower-left cluster, and it could roughly be divided into 3 or 4 clusters; later we will experiment to see which n_clusters works better. Another problem is that the ranges of the two features differ considerably, so we standardize them:

from sklearn.preprocessing import StandardScaler

std=StandardScaler()
X=std.fit_transform(X)
  • Model training for different n_clusters
    We are not yet sure how many clusters to choose, so we compute the score for several values and see which number of clusters works better.
def plt_modal(modal,n_cluster):
    # plot the cluster centers
    centiods=modal.cluster_centers_
    plt.scatter(centiods[:,0],centiods[:,1],marker='X',c='red',linewidths=0.2)
    labels=modal.labels_
    plt.scatter(X[:,0],X[:,1],c=labels,s=4,alpha=0.3)
    plt.title(f'n_cluster:{n_cluster}')

n_clusterses=(2,3,4,5,6,7,8,9,10)
modal_sorces=[]

for i,n_clusters in enumerate(n_clusterses):
    modal=KMeans(n_clusters=n_clusters)
    modal.fit(X)
    plt.subplot(331+i)
    plt_modal(modal,n_clusters)
    modal_sorces.append(modal.score(X))

plt.show()

plt.plot(range(2,11),modal_sorces)
plt.show()

[Figure: clustering results for each number of clusters]
This figure shows the clustering obtained for different numbers of clusters.
[Figure: score versus number of clusters]

This figure shows the score for each number of clusters. The score method of the KMeans model returns the negative of the distortion of the data set X, where the distortion is the sum of squared distances from each sample point to the center of the cluster it belongs to. The smaller the distortion, the better the clustering, so the closer score is to 0, the better the clustering.
Generally, when using the K-means algorithm we try several numbers of clusters and compare their scores. As the number of clusters grows, the model fits the data more closely and the distortion keeps decreasing, but beyond a certain point the extra clusters bring only marginal improvement while making the model less simple and interpretable. We therefore look for a balance between fit and simplicity, typically the point where the score curve starts to flatten, and take that number of clusters for the final model.

If we only looked at the score, we might end up with more than 10 categories, but more is not always better. For classifying mobile phones, we can generally divide them into four categories:

high price, low configuration
low price, high configuration
high price, high configuration
low price, low configuration

With that in mind, let's look at the result with n_clusters=4:

modal=KMeans(n_clusters=4,random_state=42)
modal.fit(X)

#%%
dr['label']=modal.labels_
centrids=modal.cluster_centers_
for i,centrid in enumerate(centrids):
    plt.scatter(x=dr['co_perf'][dr.label==i],
                y=dr['price'][dr.label==i]
                ,s=8,label=i)
# #

plt.xlabel('co_perf')
plt.ylabel('price')
plt.legend()
plt.show()

[Figure: scatter plot of the four clusters (co_perf vs price)]

From the figure we can see that for the majority of phones, price and performance roughly match. Let's print the clusters and take a closer look.

Cluster 0 looks like low price and low configuration.
Cluster 1 looks like high price and low configuration.
Cluster 2 feels like either low price and low configuration or high price and high configuration.

  • Observation and evaluation
    We mainly want to look at the phones with relatively high cost-performance and at the high-priced, low-configuration ones. Let's first look at the cost-performance champions (cluster 3):
print(dr['price'].describe())
print(dr['co_perf'].describe())
modalist=dr[['co_perf','price','model']][dr.label==3]
modalist

[Figure: cluster-3 phones (co_perf, price, model)]
Since I haven't used these phones I can't say for sure, but judging from the average figures the result is fairly convincing.
Now let's look at the high-price, low-configuration ones:

modalist=dr[['co_perf','price','model']][dr.label==1]
modalist

[Figure: cluster-1 phones (co_perf, price, model)]
Finally, let's count how many phones of each brand fall into each category:


plt.rcParams['font.sans-serif']=['SimHei'] # display Chinese labels correctly

plt.rcParams['axes.unicode_minus'] = False # display minus signs correctly
def plt_show(X,y,i):
    print(X)
    print(y)
    plt.pie(X,labels=y)
    plt.title(f'category {i}')

plt.subplots_adjust(left=0.01) # adjust subplot spacing


for i in range(4):
    data=dr[dr['label']==i].brand.value_counts().to_dict()
    plt.subplot(221+i)

    plt_show(list(data.values()),list(data.keys()),i)

plt.show()

[Figure: pie charts of brand counts per cluster]

How reliable the above results are depends on the data source and on how the performance scores were computed, so treat this as a demonstration of the method rather than a verdict on the phones.

2. Image Segmentation

The K-means clustering algorithm can group the pixels of an image into clusters, which gives a simple form of image segmentation.

  • Data preparation
    A photo of the lake view I took in Daguanlou
import numpy as np
import os
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from matplotlib.image import  imread
plt.rcParams['axes.labelsize']=14
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12

import warnings
warnings.filterwarnings('ignore')
np.random.seed(24)
#%%

image=imread('2.jpg')
plt.imshow(image,cmap=None)
plt.show()

[Figure: the original lake photo]
View the shape:

image.shape
(3000, 4000, 3)

This represents (height, width, number of color channels).

To make the computation easier, we reshape the image into a two-dimensional NumPy array of pixels, which is the form the model will work on later.

X=image.reshape(-1,3)
X.shape

However, the picture is too large; we need to shrink it first, otherwise training will take far too long (speaking from experience).

from PIL import Image
image=Image.open('2.jpg')
width,height=image.size
rate=0.1  # shrink to 10% of the original size

image=image.resize((int(width*rate),int(height*rate)))

plt.imshow(image,cmap=None)
plt.show()
image.size
image.save('new.png')
  • Image processing
    Our image is made up of many colors; K-means clustering groups pixels with similar colors into the same region.

We choose several numbers of colors, which is also the number of clusters, for training and visual comparison: (256, 128, 64, 32, 16, 8, 4, 2).

from sklearn.cluster import KMeans
image=imread('./new.png')
X=image.reshape(-1,3)

n_colors=(256,128,64,32,16,8,4,2)
images_=[]

for colors in n_colors:
    kmeans=KMeans(n_clusters=colors,random_state=42,init='random')
    kmeans.fit(X)

    segemented_img=kmeans.cluster_centers_[kmeans.labels_]
    images_.append(segemented_img.reshape(image.shape))

plt.figure(figsize=(10,6))
plt.subplot(331)
plt.imshow(image)
plt.title('original image')

for idx,n_cluster in enumerate(images_):
    plt.subplot(332+idx)
    plt.imshow(images_[idx])
    plt.title('{} colors'.format(n_colors[idx]))

plt.show()

[Figure: the original image followed by the segmented results for each number of colors]

Going from 256 colors down to 2, you can clearly see that the segmentation works quite well.

5. Semi-supervised learning with K-means

Semi-supervised learning is a machine learning approach that uses both labeled and unlabeled data, and it is often used to improve model accuracy when the data set has few labels. K-means can be used in this semi-supervised fashion to improve the accuracy of a model trained from only a small number of labeled samples.

The general idea is:

  • Cluster the unlabeled data to get the cluster to which each data point belongs.
  • Select the most representative samples in each cluster and mark them manually.
  • Propagate these labels to the other samples in the same cluster, resulting in a partially labeled dataset.
  • Use the partially labeled dataset to train a classifier and make predictions.

Next, we use this K-means-based semi-supervised approach to improve a logistic regression model on the handwritten digits dataset.

  • The core idea
    Use K-means to cluster the data, label the sample closest to each cluster center, propagate that label to (part of) the samples in the same cluster, and then use those propagated samples as labeled data to train the logistic regression classifier, improving the model's accuracy. This is suited to situations where only a small number of labels is available.
  • Import modules and load datasets
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X,y=load_digits(return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
X_train.shape
  • First, train and evaluate a plain logistic regression classifier (without the K-means optimization) on the handwritten digits
n_labeled=100  # pretend we only have 100 labeled samples
log_reg=LogisticRegression(random_state=42,multi_class='multinomial',solver='lbfgs')
log_reg.fit(X_train[:n_labeled],y_train[:n_labeled])
score=log_reg.score(X_test,y_test)
score

This trains on only 100 labeled digit samples; the resulting model score is:
[Figure: baseline accuracy of logistic regression trained on 100 labeled samples]
Next, fit a KMeans model and use it to get the distance from each instance to every cluster center.

k=100
kmeans=KMeans(n_clusters=k,random_state=42)
X_dit=kmeans.fit_transform(X=X_train)

For each cluster, find the index of the instance closest to the cluster center.

import numpy as np
dit_index=np.argmin(X_dit,axis=0)
dit_index.shape

Then use these indices to pull the corresponding samples out of the training set and visualize them:

X_rep_dits=X_train[dit_index]
X_rep_dits
#%%

import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
for index,X_rep_dit in enumerate(X_rep_dits):
    plt.subplot(10,10,index+1)
    plt.imshow(X_rep_dit.reshape(8,8),cmap='binary')
    plt.axis('off')

plt.show()

[Figure: the 100 representative digit images, one per cluster]
These are the 100 instances closest to the 100 cluster centers.
Now we label them by hand (since this data set already comes with labels, I'll be lazy and simply take them from y_train):

y_pre_dits=np.array(y_train[dit_index])
y_pre_dits

Then train on just these representative labeled samples:

log_reg=LogisticRegression(multi_class='multinomial',solver='lbfgs',random_state=42)
log_reg.fit(X_rep_dits,y_pre_dits)
score=log_reg.score(X_test,y_test)
score

[Figure: accuracy of logistic regression trained on the 100 representative samples]
We can see that the model has improved noticeably.

Now we use the labels of these representative digits to perform label propagation, i.e. every data point receives the label of its cluster's representative. In this way a large amount of unlabeled data becomes labeled.

y_train_propagated = np.empty(len(X_train), dtype=np.int32)
for i in range(k):
    y_train_propagated[kmeans.labels_==i] = y_pre_dits[i]

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train_propagated)

We initialize an array of length len(X_train), then use the cluster assignment produced by K-means (kmeans.labels_) to give every data point the label of its cluster's representative digit.

Next we choose a cutoff, which matters a lot. Here we choose 26, meaning that in each cluster we keep only the 26% of samples closest to the cluster center:

closet=26  # keep the 26% of samples closest to each cluster center

X_clusrer_dist=X_dit[np.arange(len(X_train)),kmeans.labels_]
for i in range(k):
    in_cluter=(kmeans.labels_==i)
    clusetr_dist=X_clusrer_dist[in_cluter]

    cotoff_distance=np.percentile(clusetr_dist,closet)  # distance below which the closest 26% fall
    abover_citoff=(X_clusrer_dist>cotoff_distance)
    X_clusrer_dist[in_cluter & abover_citoff]=-1

  • The X_clusrer_dist array holds, for each training sample, the distance to the center of its own cluster; kmeans.labels_ tells which cluster each training sample belongs to.
  • For each cluster, compute the chosen percentile of these distances using np.percentile().
  • Samples whose distance exceeds that percentile are marked with the abover_citoff boolean array.
  • Finally, the distance of those marked samples is set to -1 so that they can be filtered out in the next step.

The purpose of the whole process is to keep only the samples closest to their cluster center, i.e. the samples that best represent each cluster; these are the samples whose propagated labels we trust most. Here closet is set to 26, meaning the 26% of samples in each cluster closest to the center are kept as representative samples.

partially_propagated = (X_clusrer_dist != -1)
x_train_partially_propaged=X_train[partially_propagated]
y_train_partially_propaged=y_train[partially_propagated]

In the code above, samples farther from their cluster center than the 26th-percentile distance were set to -1; the boolean mask then keeps only the closest 26% of each cluster as the propagated training set.

Final training:

log_reg = LogisticRegression(random_state=42)
log_reg.fit(x_train_partially_propaged,y_train_partially_propaged)
log_reg.score(X_test,y_test)

[Figure: accuracy of logistic regression trained on the partially propagated data]
We can see there is still some improvement. The cutoff percentage can be tuned by hand until the best value is found; a small sketch of such a search follows.
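
Here is a minimal sketch of that search, reusing the variables defined above (X_dit, kmeans, k, X_train, y_train, X_test, y_test, y_pre_dits); the candidate percentages and the resulting accuracies are illustrative only:

import numpy as np
from sklearn.linear_model import LogisticRegression

best_pct, best_score = None, -1.0
for pct in (10, 20, 26, 30, 40, 50, 75, 100):
    # keep only the pct% of samples closest to their own cluster center
    dist_to_own_center = X_dit[np.arange(len(X_train)), kmeans.labels_]
    keep = np.zeros(len(X_train), dtype=bool)
    for i in range(k):
        in_cluster = (kmeans.labels_ == i)
        cutoff = np.percentile(dist_to_own_center[in_cluster], pct)
        keep |= in_cluster & (dist_to_own_center <= cutoff)

    # propagate each cluster's representative label, then train on the kept samples
    y_prop = y_pre_dits[kmeans.labels_]
    log_reg = LogisticRegression(random_state=42, max_iter=1000)
    log_reg.fit(X_train[keep], y_prop[keep])
    acc = log_reg.score(X_test, y_test)
    if acc > best_score:
        best_pct, best_score = pct, acc

print(best_pct, best_score)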

Summary

The K-means clustering algorithm has a wide range of practical applications, for example:

Market segmentation: merchants can use K-means to divide customers into different clusters and formulate a different marketing strategy for each cluster.
Image segmentation: K-means can divide the pixels of an image into different clusters to achieve segmentation.
Natural language processing: K-means can divide texts into different clusters for text clustering and topic classification.
Semi-supervised learning: it lets us use unlabeled data to improve the accuracy of a classifier or model when there is not enough labeled data.

Besides K-means there are other clustering algorithms, such as hierarchical clustering, spectral clustering, and density-based clustering. In practice, the appropriate algorithm should be chosen according to the characteristics and requirements of the data set.

Although the K-means algorithm is simple and widely used, it has shortcomings and cannot handle clusters of arbitrary shape, so in the next article I will introduce DBSCAN, another unsupervised clustering algorithm.

Due to my limited experience, please correct me if there are any mistakes above.
I will keep learning and bring more interesting content.
