Customer segmentation in R: k-medoids clustering with the optimal number of clusters

Original link: http://tecdat.cn/?p=9997


Introduction to k-medoids clustering

k-medoids is another clustering algorithm that can be used to find groups in a data set. k-medoids clustering is very similar to k-means clustering, apart from a few differences; in particular, the optimization step of the k-medoids algorithm differs slightly from that of k-means. In this section, we will study k-medoids clustering.

k-medoids clustering algorithm

There are many different algorithms that can perform k-medoids clustering; the simplest and most effective of them is PAM (Partitioning Around Medoids). In PAM, we perform the following steps to find the cluster centers:

  1. Select k data points from the scatter plot as the starting cluster centers.

  2. Calculate their distance to all the points in the scatter plot.

  3. Classify each point to its nearest cluster center.

  4. Select a new point in each cluster that minimizes the sum of its distances to all the points in that cluster.

  5. Repeat Step 2 until the centers stop changing.

As you can see, apart from Step 1 and Step 4, the PAM algorithm is the same as the k-means clustering algorithm. For most practical purposes, k-medoids clustering gives almost the same results as k-means clustering. However, in some special cases where the data set contains outliers, k-medoids clustering is preferred, because it is more robust to outliers.

k-medoids clustering code

In this section, we will use the same iris data set that we used in the previous two exercises, and compare the results to see whether they differ significantly from those obtained last time.

Implementing k-medoids clustering

In this exercise, we will use R's pre-built libraries to perform k-medoids clustering:

  1. Store the first two columns of the data set in the  iris_data  variable:

     

    iris_data <- iris[, 1:2]
  2. Install the  cluster  package:

     

    install.packages("cluster")
  3. Import the package:

     

    library("cluster")
  4. Store the PAM clustering results in the  km  variable:

     

    km <- pam(iris_data, 3)
  5. Import the  factoextra  library:

     

    library("factoextra")
  6. Plot the PAM clustering results:

     

    fviz_cluster(km, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    Output is as follows:

    Figure: k-medoids clustering results

The results of k-medoids clustering are not very different from the results of the k-means clustering we performed in the previous section.

Thus, we can see that the preceding PAM algorithm divided our data set into three clusters that are similar to the three clusters we obtained through k-means clustering.

 

Figure: Results of k-medoids clustering and k-means clustering

In the preceding figures, observe how close the centers of the k-medoids and k-means clusters are to each other; however, the centers of the k-medoids clusters overlap existing data points directly, whereas the centers of the k-means clusters do not.

k-means clustering versus k-medoids clustering

Now that we have studied k-means and k-medoids clustering, and seen that they are almost identical, we will look at the differences between them and when to use each type of clustering:

  • Computational complexity: Of the two methods, k-medoids clustering is computationally more expensive. When our data set is too large (> 10,000 points) and we want to save computation time, we prefer k-means clustering over k-medoids clustering.

    Whether a data set counts as large depends entirely on the available computing power.

  • Presence of outliers: k-means clustering is more susceptible to outliers than k-medoids clustering.

  • Cluster centers: the k-means and k-medoids algorithms find cluster centers in different ways: a k-medoids center is always one of the data points, while a k-means center is the mean of the cluster's points and usually is not a data point itself.

Customer segmentation using k-medoids clustering

Perform k-means and k-medoids clustering on the customer data set, and compare the results.

Steps:

  1. Select only two columns, Grocery and Frozen, for easy two-dimensional visualization of the clusters.

  2. Use k-medoids clustering to plot a chart showing four clusters of the data.

  3. Use k-means clustering to plot a chart of four clusters.

  4. Compare the two charts to review how the results of the two methods differ (a code sketch follows this list).
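A minimal sketch of these steps, assuming the customer data has already been read into a data frame named  ws  with Grocery and Frozen columns (the  ws  name is a placeholder, not from the original text):

    library(cluster)     # pam()
    library(factoextra)  # fviz_cluster()

    # Keep only the two spending columns for two-dimensional visualization
    customer_data <- ws[, c("Grocery", "Frozen")]

    # k-medoids (PAM) with four clusters
    pam.res <- pam(customer_data, 4)
    fviz_cluster(pam.res, data = customer_data)

    # k-means with four clusters, for comparison
    km.res <- kmeans(customer_data, 4)
    fviz_cluster(km.res, data = customer_data)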

The resulting k-means clustering plot will look as follows:

 

Figure: Expected k-means clustering plot

Determining the optimal number of clusters

So far, we have been studying the iris data set, where we know how many types of flowers there are, and we chose to divide the data set into three clusters based on that knowledge. In unsupervised learning, however, our main task is to work with data about which we have no information, such as the number of natural clusters or categories in the data set. Clustering can then also serve as a form of exploratory data analysis.

Types of clustering metrics

There is more than one way to determine the optimal number of clusters in unsupervised learning. The following are the methods we will study in this chapter:

  • Silhouette score

  • Elbow method / WSS

  • Gap statistic

Silhouette score

The silhouette score, or average silhouette score, is calculated to quantify the quality of the clusters produced by a clustering algorithm.

The silhouette score lies between 1 and -1. If a cluster's silhouette score is low (between 0 and -1), it means the cluster is spread out, or that the distances between the points of the cluster are high. If a cluster's silhouette score is high (close to 1), it means the cluster is well defined: the distances between the points within the cluster are low, and their distances to points of other clusters are high. The ideal silhouette score is therefore close to 1.
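For reference, the standard definition of the silhouette score of a single point i (not spelled out in the original text) is:

    s(i) = ( b(i) − a(i) ) / max( a(i), b(i) )

where a(i) is the mean distance from point i to the other points of its own cluster, and b(i) is the mean distance from i to the points of the nearest neighboring cluster. The average silhouette score of a clustering is the mean of s(i) over all points.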

 

Computing the silhouette score

Let us learn how to compute the silhouette score of a data set with a fixed number of clusters:

  1. Put the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Perform k-means clustering.

  3. Store the k-means clusters in the  km.res  variable.

  4. Store the pairwise distance matrix of all the data points in the  pair_dis  variable.

  5. Calculate the silhouette score of each point in the data set.

  6. Plot the silhouette score chart (a sketch of all six steps follows this list).
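A minimal sketch of steps 1 to 6, assuming three clusters and the  silhouette()  function from the cluster package:

    library(cluster)

    iris_data <- iris[, 1:2]                     # step 1
    km.res <- kmeans(iris_data, 3)               # steps 2-3: k-means with 3 clusters
    pair_dis <- dist(iris_data)                  # step 4: pairwise distance matrix
    sil <- silhouette(km.res$cluster, pair_dis)  # step 5: per-point silhouette scores
    plot(sil)                                    # step 6: silhouette plot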

    The output is as follows:

    Figure: The silhouette score of each point in each cluster, represented by a single bar

The preceding plot shows that the average silhouette score of the data set is 0.45. It also shows the average silhouette score of each cluster and of each point.

We computed the silhouette score for three clusters. However, to decide how many clusters to use, we have to compute the silhouette score for several different numbers of clusters in the data set.

Determining the optimal number of clusters

Compute the silhouette score for various values of k to determine the optimal number of clusters:

  1. Put the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Plot a chart of silhouette score versus number of clusters, up to 20 (a code sketch follows the note below).

    Note

    In the second argument, you can change k-means to k-medoids or any other type of clustering.
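A minimal sketch of these steps, assuming the plotting helper is  fviz_nbclust()  from the factoextra package:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # Average silhouette score versus number of clusters (up to 20);
    # the second argument (kmeans) can be swapped for pam or another
    # clustering function
    fviz_nbclust(iris_data, kmeans, method = "silhouette", k.max = 20)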

    The output is as follows:

    Figure: Number of clusters versus average silhouette score

From the preceding plot, we select the value of k with the highest score; that is, 2. According to the silhouette score, the optimal number of clusters is 2.

WSS / elbow method

To identify clusters in a data set, we try to minimize the distance between the points within a cluster, and the within-sum-of-squares (WSS) method measures exactly that distance. The WSS score is the sum of the squares of the distances of all points in a cluster from the cluster center.
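In symbols (standard notation, not given in the original text), for k clusters C_1, ..., C_k with centers μ_1, ..., μ_k, the total WSS is:

    WSS = Σ_j Σ_{x ∈ C_j} ‖ x − μ_j ‖²

where the outer sum runs over the k clusters and the inner sum over the points of each cluster.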

Determining the number of clusters using WSS

In this exercise, we will see how to determine the number of clusters using WSS. Perform the following steps:

  1. Put the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Plot a chart of WSS versus number of clusters (a code sketch follows this list).
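Again a minimal sketch, assuming  fviz_nbclust()  from the factoextra package:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # Total within-cluster sum of squares (WSS) versus k; the elbow of
    # this curve suggests the number of clusters
    fviz_nbclust(iris_data, kmeans, method = "wss")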

    The output is as follows:

    Figure: WSS versus number of clusters

In the preceding plot, we can choose the elbow of the graph as k = 3, since the value of WSS starts to drop more slowly after k = 3. Choosing the elbow of the chart is always a subjective choice, and there may be times when you choose k = 4 or k = 2 instead of k = 3; for this chart, however, it is clear that values of k > 5 are unsuitable, because they are not the elbow of the graph, which is where the slope of the graph changes sharply.

Gap statistic

The gap statistic is one of the most effective methods for finding the optimal number of clusters in a data set, and it is applicable to any type of clustering method. The gap statistic is calculated by comparing the WSS value of the clusters produced from our observed data set with the WSS of clusters generated from a reference data set that has no apparent clustering.

In short, the gap statistic measures the WSS values of the observed data set and of a random data set, and finds how much the observed data set deviates from the random one. To find the ideal number of clusters, we choose the value of k that gives us the largest value of the gap statistic.
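Formally (following Tibshirani, Walther, and Hastie's standard definition, which the original text does not spell out), the gap statistic for k clusters is:

    Gap(k) = E*[ log(W_k) ] − log(W_k)

where W_k is the WSS of the observed data with k clusters and E* denotes the expectation over the reference (random) data sets.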

Calculating the ideal number of clusters with the gap statistic

In this exercise, we will calculate the ideal number of clusters using the gap statistic:

  1. Put the first two columns of the iris data set (sepal length and sepal width) in the  iris_data  variable.

  2. Import the  factoextra  library.

  3. Plot a chart of the gap statistic versus the number of clusters, up to 20 (a code sketch follows this list).
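A minimal sketch of steps 1 to 3, again assuming  fviz_nbclust()  from the factoextra package:

    library(factoextra)

    iris_data <- iris[, 1:2]

    # Gap statistic versus number of clusters (up to 20)
    fviz_nbclust(iris_data, kmeans, method = "gap_stat", k.max = 20)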

    Figure 1.35: Gap statistic versus number of clusters

As the preceding plot shows, the gap statistic takes its maximum value at k = 3. Hence, the ideal number of clusters in the data set is 3.

Finding the ideal number of market segments

Find the optimal number of clusters in the customer data set using all three of the methods above:

Load columns 5 to 6 of the wholesale customer data set into a variable.

  1. Calculate the optimal number of clusters for k-means clustering with the silhouette score.

  2. Calculate the optimal number of clusters for k-means clustering with the WSS score.

  3. Calculate the optimal number of clusters for k-means clustering with the gap statistic.

The result will be three charts representing the optimal number of clusters according to the silhouette score, the WSS score, and the gap statistic; a consolidated code sketch follows.
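A minimal sketch of this activity, assuming the wholesale customer data has been read into a data frame named  ws  (a placeholder name), so that columns 5 to 6 are the two spending columns used above:

    library(factoextra)

    customer_data <- ws[, 5:6]

    fviz_nbclust(customer_data, kmeans, method = "silhouette")  # silhouette score
    fviz_nbclust(customer_data, kmeans, method = "wss")         # WSS / elbow
    fviz_nbclust(customer_data, kmeans, method = "gap_stat")    # gap statistic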



Origin blog.csdn.net/qq_19600291/article/details/103730713