Machine Learning Chapter 9: Clustering


Please indicate the source when reposting. Reposting of this article is permitted; plagiarism is prohibited.

Foreword

       Clustering is the most important application in unsupervised learning.
       Unsupervised learning: the label information of the training samples is unknown

  •        Objective: by learning from unlabeled training samples, reveal the intrinsic properties and regularities of the data, and provide a basis for further data analysis.

(figure omitted)
       For example, as shown in the figure, a set of unlabeled balls (each described by several attributes x) is grouped into k classes.

1. Clustering task

       Clustering attempts to divide the samples in a data set into several usually disjoint subsets; each subset is called a "cluster".

(figure omitted)
       As shown in the figure, clustering divides the data into 5 clusters. Each cluster may correspond to some underlying concept (not necessarily an attribute in the original attribute set). For example, for balls such as ball 15 (radius 0.5 cm, smooth), the material or weight of the balls might be the concept that separates them into clusters; however, the material of the balls, or whatever other concept actually divides them into clusters, is unknown to the clustering algorithm in advance. The clustering process can only form the cluster structure automatically; the concept corresponding to each cluster has to be named by the user.
       Formally, assume the sample set D = {x_1, x_2, ..., x_m} contains m unlabeled samples, where each sample x_i = (x_{i1}; x_{i2}; ...; x_{in}) is an n-dimensional feature vector; the clustering algorithm divides D into k disjoint clusters. Clustering can be used on its own to discover the inherent distribution structure of the data, and it can also serve as a precursor process for other tasks such as classification.

2. Performance measurement

       Clustering performance measures are also referred to as clustering "validity indices". They are analogous to the performance measures used in supervised learning.
       On the one hand, for a clustering result we need some performance measure to evaluate how good it is; on the other hand, if the measure to be used is known in advance, it can be used directly as the optimization objective of the clustering process, so as to obtain clustering results that better meet the requirements.
       Clustering divides the sample set D into a number of mutually disjoint subsets, i.e. sample clusters.
       We want samples in the same cluster to be as similar to each other as possible and samples in different clusters to be as different as possible; that is, the "intra-cluster similarity" of the clustering result should be as high as possible and the "inter-cluster similarity" should be as low as possible.
       Clustering performance metrics broadly fall into two categories:

  • External indicators: compare the clustering result against some "reference model" (an external model)
  • Internal indicators: evaluate the clustering result directly, without using any reference model

1. External indicators

For a data set D = {x_1, x_2, ..., x_m} with m samples, assume that the cluster division given by the clustering algorithm is C = {C_1, C_2, ..., C_k} (k clusters), and the cluster division given by a reference model is C* = {C*_1, C*_2, ..., C*_s} (s clusters, which need not equal k).
Correspondingly, let λ and λ* denote the cluster label vectors corresponding to C and C*, respectively. Considering all sample pairs, define
                     a = |SS|, SS = {(x_i, x_j) | λ_i = λ_j, λ*_i = λ*_j, i < j},    (9.1)
                     b = |SD|, SD = {(x_i, x_j) | λ_i = λ_j, λ*_i ≠ λ*_j, i < j},    (9.2)
                     c = |DS|, DS = {(x_i, x_j) | λ_i ≠ λ_j, λ*_i = λ*_j, i < j},    (9.3)
                     d = |DD|, DD = {(x_i, x_j) | λ_i ≠ λ_j, λ*_i ≠ λ*_j, i < j},    (9.4)
The set SS contains sample pairs that belong to the same cluster in C and also to the same cluster in C*; SD contains pairs that belong to the same cluster in C but to different clusters in C*; DS contains pairs that belong to different clusters in C but to the same cluster in C*; DD contains pairs that belong to different clusters in both C and C*. Since every sample pair falls into exactly one of these sets, a + b + c + d = m(m-1)/2.





Based on formulas (9.1)~(9.4), the following commonly used clustering performance metrics can be derived:

  • Jaccard coefficient (JC)
                          JC = a / (a + b + c)
  • Fowlkes–Mallows Index (FMI)
                          FMI = √( (a/(a+b)) · (a/(a+c)) )
  • Rand Index (RI)
                          RI = 2(a + d) / (m(m-1))

All of the above performance measures take values in the interval [0, 1], and larger values indicate better agreement with the reference model.
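As a quick illustration (my own sketch, not from the original post), the NumPy snippet below counts the sample pairs a, b, c, d from a clustering label vector and a reference label vector, and then computes JC, FMI, and RI exactly as defined above; the function name and variable names are my own.

```python
import numpy as np

def external_indices(labels, ref_labels):
    """Compute JC, FMI and RI from a clustering `labels` and a reference `ref_labels`."""
    labels = np.asarray(labels)
    ref_labels = np.asarray(ref_labels)
    m = len(labels)
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):                      # consider each pair (i < j) once
            same_c = labels[i] == labels[j]            # same cluster in C?
            same_ref = ref_labels[i] == ref_labels[j]  # same cluster in C*?
            if same_c and same_ref:
                a += 1
            elif same_c:
                b += 1
            elif same_ref:
                c += 1
            else:
                d += 1
    jc = a / (a + b + c)
    fmi = np.sqrt((a / (a + b)) * (a / (a + c)))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri

# toy example: a clustering of 6 samples compared against a reference division
print(external_indices([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```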

2. Internal indicators

Now consider the cluster division C = {C_1, C_2, ..., C_k} produced by the clustering algorithm, and define (formulas (9.8)–(9.11) in the book, shown as a figure in the original post):
  avg(C): the average distance between samples within cluster C;
  diam(C): the largest distance between any two samples within cluster C (the cluster diameter);
  d_min(C_i, C_j): the distance between the two closest samples of clusters C_i and C_j;
  d_cen(C_i, C_j): the distance between the centers (mean vectors) of clusters C_i and C_j.

The above definition can be analyzed with reference to the following figure:
(figure omitted)

Based on formulas (9.8)–(9.11), two commonly used internal indices can be derived (figure omitted): the Davies–Bouldin Index (DBI) and the Dunn Index (DI).
The smaller the value of DBI, the better; the larger the value of DI, the better.
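For reference, here is a small sketch (mine, not from the original post) of the Davies–Bouldin Index computed directly from the book-style definitions described above, i.e. avg(C) as the average pairwise distance inside a cluster and d_cen as the distance between cluster mean vectors; it assumes at least two clusters.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def davies_bouldin(X, labels):
    """DBI using the book-style definitions:
    avg(C)  = average pairwise distance within a cluster,
    d_cen   = distance between cluster mean vectors."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    centers = np.array([c.mean(axis=0) for c in clusters])
    # avg(C_i); take 0 for singleton clusters by convention
    avg = [pdist(c).mean() if len(c) > 1 else 0.0 for c in clusters]
    k = len(clusters)
    cen_dist = cdist(centers, centers)          # d_cen(C_i, C_j)
    total = 0.0
    for i in range(k):
        total += max((avg[i] + avg[j]) / cen_dist[i, j] for j in range(k) if j != i)
    return total / k
```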

3. Distance calculation

       For a function dist(·,·), if it is a "distance measure", it must satisfy some basic properties:
        Non-negativity: dist(x_i, x_j) ≥ 0                (a distance cannot be negative)
        Identity: dist(x_i, x_j) = 0 if and only if x_i = x_j        (the distance is 0 exactly when the two points coincide)
        Symmetry: dist(x_i, x_j) = dist(x_j, x_i)        (the distance from point i to point j equals the distance from point j to point i)
        Triangle inequality: dist(x_i, x_j) ≤ dist(x_i, x_k) + dist(x_k, x_j)        (the sum of two sides is no less than the third side; equality holds when x_k lies on the line segment between x_i and x_j)

          Given samples x_i = (x_{i1}; x_{i2}; ...; x_{in}) and x_j = (x_{j1}; x_{j2}; ...; x_{jn}), the most commonly used distance is the Minkowski distance
        dist_mk(x_i, x_j) = ( Σ_{u=1}^{n} |x_{iu} - x_{ju}|^p )^{1/p}
For p ≥ 1, this formula clearly satisfies the basic properties of a distance measure.
          When p = 2, the Minkowski distance becomes the Euclidean distance
        dist_ed(x_i, x_j) = ||x_i - x_j||_2 = √( Σ_{u=1}^{n} |x_{iu} - x_{ju}|^2 )
          When p = 1, the Minkowski distance becomes the Manhattan distance
        dist_man(x_i, x_j) = ||x_i - x_j||_1 = Σ_{u=1}^{n} |x_{iu} - x_{ju}|

Minkowski distance can be used for ordered attributes.
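As a small illustration (my own sketch, not from the original post), the Minkowski distance and its two special cases in NumPy:

```python
import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance: (sum_u |x_iu - x_ju|^p)^(1/p), for p >= 1."""
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return (diff ** p).sum() ** (1.0 / p)

x_i, x_j = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x_i, x_j, p=2))   # p = 2: Euclidean distance
print(minkowski(x_i, x_j, p=1))   # p = 1: Manhattan distance
```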

Note:
   Attributes are often divided into: continuous attributes and discrete attributes.
   When discussing distance calculation, it is more important whether the "order" relationship is defined on the attribute. Here, discrete attributes are further divided into ordered attributes and unordered attributes.
Ordered attributes:
   attributes whose values can be used directly to compute distances, e.g. {1, 2, 3}
Unordered attributes:
   attributes whose values cannot be used directly to compute distances, e.g. {dog, kitten, mouse}

VDM (Value Difference Metric) can be used for unordered attributes. (Not explained in detail here; see the Watermelon Book, p. 200 for details.)
(figure omitted)

Combining the Minkowski distance and VDM makes it possible to handle mixed attributes. (figure omitted)

Weighted Minkowski distance: used when different attributes have different importance. (figure omitted)

Note:
  Similarity measures
    Usually, a "similarity measure" is defined based on some form of distance: the larger the distance, the smaller the similarity. However, the distance used for a similarity measure does not necessarily have to satisfy all the basic properties of a distance measure, in particular the triangle inequality.

4. Prototype Clustering

Prototype clustering is also called "prototype-based clustering". This class of algorithms assumes that the cluster structure can be described by a set of prototypes, and it is very commonly used in real-world clustering tasks. Typically, the algorithm first initializes the prototypes and then iteratively updates and refines them. Different prototype representations and different update procedures lead to different algorithms.

(1) k-means algorithm

  Given a sample set D = {x_1, x_2, ..., x_m}, the k-means algorithm seeks a cluster division C = {C_1, C_2, ..., C_k} (dividing D into k clusters: cluster 1, cluster 2, ..., cluster k) that minimizes the squared error
        E = Σ_{i=1}^{k} Σ_{x∈C_i} ||x - μ_i||_2^2
where μ_i = (1/|C_i|) Σ_{x∈C_i} x is the mean vector of cluster C_i. This expression describes, to some extent, how tightly the samples in a cluster gather around the cluster mean vector: the smaller E is, the higher the intra-cluster similarity.
  Minimizing E is not easy: finding its optimal solution requires examining all possible cluster divisions of the sample set D, which is an NP-hard problem (informally, the search space is far too large).
  Therefore, the k-means algorithm adopts a greedy strategy and approximates the solution through iterative optimization.

The algorithm flow is as follows:
(figure: k-means algorithm flow)
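To make the iterative optimization concrete, here is a minimal NumPy sketch of the k-means loop (my own illustration, not the original post's code): assign each sample to its nearest mean vector, recompute the mean vectors, and stop when they no longer move or an iteration limit is reached.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize the k mean vectors by picking k random samples
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each sample goes to the cluster of its nearest mean vector
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # shape (m, k)
        labels = dists.argmin(axis=1)
        # update step: recompute each cluster's mean vector (keep the old one if a cluster is empty)
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):   # stop when the mean vectors no longer move
            break
        means = new_means
    return labels, means

# toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, means = k_means(X, k=2)
```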

(2) Learning vector quantization

  Similar to the k-means algorithm, learning vector quantization (LVQ) also tries to find a set of prototype vectors to describe the cluster structure.
  However, unlike general clustering algorithms, LVQ assumes that the data samples carry class labels, and the learning process uses this supervised information to assist clustering.


   This algorithm is somewhat abstract, so its idea is described below in fairly plain language, with an example, which should make it easier to understand.

  • First, input the sample set D, the desired number q of prototype vectors, and the initial class label {t_1, t_2, ..., t_q} assumed for each prototype vector. Each prototype vector's class label must come from the class labels of the original sample set D. (For example: if the label set of the original samples is Y = {airplane, ship, train, high-speed rail}, then each t_i ∈ Y; a label such as "bicycle" cannot appear, but repetitions are allowed, e.g. 5 prototype vectors = {airplane, airplane, train, high-speed rail, airplane}.) Also input a learning rate η ∈ (0, 1).

  • Initialize a set of prototype vectors {p1, p2, p3, p4, p5} (for p5, assuming the 5th label is "airplane" as in the example above, randomly select one of the samples labeled "airplane" as the initial prototype vector of the 5th cluster).

  • do {
      randomly select a sample (x_j, y_j) from the sample set D;
      compute the distance between x_j and each prototype vector p_i: d_ji = ||x_j - p_i||_2;
      find the prototype vector closest to x_j, say p_i* (only this nearest prototype vector is updated);
      if the label y_j of the selected sample equals the label of this prototype vector, the new prototype vector is p' = p_i* + η · (x_j - p_i*);
      otherwise (the two labels differ), the new prototype vector is p' = p_i* - η · (x_j - p_i*);
      update: assign the new prototype vector p' to p_i*.

  • } while (stopping condition not met). Stopping condition: the maximum number of iterations has been reached, or the prototype vectors change only slightly or no longer change.

  • Output the prototype vectors {p1, p2, p3, p4, p5}.

  The key to the algorithm is how the prototype vectors are updated. Intuitively, for a sample, if the nearest prototype vector and the sample belong to the same class, the prototype vector is moved closer to the sample; otherwise it is moved away from it.
  After learning a set of prototype vectors, the cluster division of the sample space χ can be realized.
  For any sample x, it is assigned to the cluster represented by its nearest prototype vector; in other words, each prototype vector p_i defines an associated region R_i in which every sample's distance to p_i is no greater than its distance to any other prototype vector. This yields a partition {R_1, R_2, ..., R_q} of the sample space χ, usually called a Voronoi partition.
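Below is a minimal sketch of the LVQ update loop described above (my own illustration; names such as `prototypes` and `proto_labels` are mine, not the book's). After training, assigning each sample to its nearest prototype gives the Voronoi partition just mentioned.

```python
import numpy as np

def lvq(X, y, proto_labels, eta=0.1, max_iter=1000, seed=0):
    """Learning Vector Quantization; `proto_labels` are the preset class labels t_1..t_q."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # initialize each prototype as a randomly chosen sample of its own class
    prototypes = np.array([X[rng.choice(np.where(y == t)[0])] for t in proto_labels])
    for _ in range(max_iter):
        j = rng.integers(len(X))                         # pick a random sample (x_j, y_j)
        dists = np.linalg.norm(prototypes - X[j], axis=1)
        i_star = dists.argmin()                          # nearest prototype p_i*
        if y[j] == proto_labels[i_star]:
            prototypes[i_star] += eta * (X[j] - prototypes[i_star])  # move toward x_j
        else:
            prototypes[i_star] -= eta * (X[j] - prototypes[i_star])  # move away from x_j
    return prototypes

# toy usage: 2-D samples with labels 0/1, and 3 prototypes (two for class 0, one for class 1)
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
y = np.array([0] * 50 + [1] * 50)
P = lvq(X, y, proto_labels=[0, 0, 1])
```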

(3) Gaussian mixture clustering

1. Concept

  This part is rather abstract. It is recommended to read it together with the book, or to skip it for now and come back to it when needed.

  Unlike k-means and LVQ, which use prototype vectors to describe the cluster structure, Gaussian mixture clustering uses a probability model to express the cluster prototype.

  First, let's review (or preview) the definition of the (multivariate) Gaussian distribution. For a random vector x in an n-dimensional sample space χ, if x follows a Gaussian distribution, its probability density function is
        p(x) = 1 / ( (2π)^{n/2} |Σ|^{1/2} ) · exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )
where μ is the n-dimensional mean vector and Σ is the n×n covariance matrix.

2. Algorithm idea

The algorithm idea is again described in plain language as much as possible:

Input: the sample set D, and the number k of Gaussian mixture components (which can be viewed as the user-chosen number of clusters).
Process (curly braces are added here to make the structure easier to follow):
1:  initialize the model parameters of the Gaussian mixture distribution {(α_i, μ_i, Σ_i) | 1 ≤ i ≤ k}
2:  do {   (iterate the following until the stopping condition is satisfied)
3:    for j = 1, 2, ..., m {   (loop over every training sample)
4:      γ_ji = p_M(z_j = i | x_j)  (1 ≤ i ≤ k)   (use the corresponding formula (omitted) to compute the posterior probability that the j-th sample was generated by each mixture component; each sample gives k values, so the resulting γ values can be viewed as an m × k matrix)
5:    }
6:    for i = 1, 2, ..., k {   (loop over every mixture component / cluster)
7:      compute the new mean vector μ'_i, the new covariance matrix Σ'_i, and the new mixing coefficient α'_i (formulas omitted)
8:    }
9:    update the model parameters {(α_i, μ_i, Σ_i) | 1 ≤ i ≤ k} to {(α'_i, μ'_i, Σ'_i) | 1 ≤ i ≤ k}
10: }
11: C_i = ∅ (1 ≤ i ≤ k)   (reset all clusters to empty)
12: for j = 1, 2, ..., m {
13:   determine the cluster label λ_j of the j-th sample according to the corresponding formula (omitted)
14:   put x_j into the corresponding cluster: C_{λ_j} = C_{λ_j} ∪ {x_j}   (find the cluster given by the sample's cluster label and add the sample to it)
15: }
Output: the cluster division C = {C_1, C_2, ..., C_k}





  Overall, the algorithm can be understood as assuming that the samples were generated by a Gaussian mixture distribution. It repeatedly updates the posterior probability that each sample was generated by each mixture component, and then uses these new posteriors to update the mean vectors and the other parameters; after the iterations finish, each sample is assigned to its corresponding cluster.
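As a practical shortcut (my own illustration, not from the original post), scikit-learn's GaussianMixture fits a Gaussian mixture with an EM-style loop of exactly this kind, so the whole procedure can be sketched as:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy data: two Gaussian-looking blobs
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])

gmm = GaussianMixture(n_components=2, random_state=0)  # k mixture components
gmm.fit(X)                       # EM iterations: posterior (E) step and parameter-update (M) step
gamma = gmm.predict_proba(X)     # posterior probabilities γ_ji, an m × k matrix
labels = gamma.argmax(axis=1)    # assign each sample to the component with the largest posterior
```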

5. Density clustering

  Density clustering, also known as "density-based clustering", assumes that the cluster structure can be determined by how tightly the samples are distributed.
  Typically, a density clustering algorithm examines the connectivity between samples from the perspective of sample density, and keeps expanding clusters based on the connectable samples to obtain the final clustering result.

  DBSCAN is a well-known density clustering algorithm, which characterizes how closely samples are distributed based on a pair of neighborhood parameters (ε, MinPts).
Algorithm idea:
  choose a core object as a seed and grow the corresponding cluster from it.
  The algorithm first finds all core objects according to the given neighborhood parameters;
  it then takes core objects in turn as starting points, finds the samples density-reachable from each of them, and generates the clusters,
  until all core objects have been visited.
(figure omitted)
Algorithm iteration process (example):
(figure omitted)
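For a quick hands-on illustration (mine, not the original post's), scikit-learn's DBSCAN takes the two neighborhood parameters as `eps` and `min_samples`; samples labeled -1 are treated as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # ε-neighborhood radius and MinPts
labels = db.labels_                          # cluster labels; -1 marks noise samples
print(np.unique(labels))
```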

6. Hierarchical clustering

  Hierarchical clustering attempts to divide the data set at different levels to form a tree-like clustering structure.
  The division of data sets can adopt a bottom-up aggregation strategy, or a top-down split strategy.
  AGNES is a hierarchical clustering algorithm that uses a bottom-up aggregation strategy. It first treats each sample in the data set as an initial cluster, and then at each step finds the two closest clusters and merges them; this process is repeated until the preset number of clusters is reached.
  The key question here is how to compute the distance between clusters.
  In fact, each cluster is a set of samples, so all that is needed is a distance defined on sets, such as the minimum distance, the maximum distance, or the average distance (figure omitted).
  Obviously, the minimum distance is determined by the two nearest samples of the two clusters, the maximum distance by the two farthest samples, and the average distance by all samples of both clusters. When the cluster distance is computed with these three distances, the algorithm is correspondingly called single-linkage, complete-linkage, or average-linkage.

(figure omitted)
Algorithm idea:
  The algorithm first initializes the clusters, each containing a single sample, along with the corresponding distance matrix;
  it then repeatedly merges the two closest clusters and updates the distance matrix for the merged cluster,
  until the stopping condition (the preset number of clusters) is met.

  If the AGNES algorithm is run until all samples end up in the same cluster, i.e. k = 1, the following dendrogram can be obtained.
(figure: dendrogram produced by AGNES)
  By segmenting at a specific level of the dendrogram, the corresponding cluster division results can be obtained.
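A short sketch (my own, not from the post) using SciPy: `linkage` performs the bottom-up merging (here with `method='average'`, i.e. average-linkage), `dendrogram` draws the tree-like clustering structure, and `fcluster` cuts it into a chosen number of clusters:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# toy data: two blobs of 20 samples each
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])

Z = linkage(X, method='average')                 # bottom-up merging with average-linkage
dendrogram(Z)                                    # the tree-like clustering structure
plt.show()

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)
```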


Please indicate the source when reposting. Reposting of this article is permitted; plagiarism is prohibited.

Origin blog.csdn.net/G_Shengn/article/details/127341835