[Machine Learning] 8 Clustering

1 Unsupervised Learning

  • Unlabeled data
  • Clustering: a cluster is simply a small "tribe" of points that are grouped closely together

2 K-Means Algorithm

  • An iterative algorithm that alternates between two steps:
  1. Cluster assignment
  2. Move centroid
  • Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n$
    Repeat {
    (1) Cluster assignment step
        for $i = 1$ to $m$
           $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$, i.e.
           $c^{(i)} = \mathop{\arg\min}\limits_{k} \|x^{(i)} - \mu_k\|^2$
    (2) Move centroid step
        for $k = 1$ to $K$
           $\mu_k$ := average (mean) of the points assigned to cluster $k$
    }
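The two-step loop above can be sketched in NumPy. This is a minimal illustration, not code from the original notes; the function name `kmeans`, the fixed iteration count, and the seed parameter are all assumptions made here:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate cluster assignment and centroid moves."""
    rng = np.random.default_rng(seed)
    # Randomly pick K distinct training examples as the initial centroids
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # (1) Cluster assignment: c[i] = index of the centroid closest to x[i]
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # (2) Move centroid: mu[k] = mean of the points assigned to cluster k
        for k in range(K):
            pts = X[c == k]
            if len(pts):
                mu[k] = pts.mean(axis=0)
    return c, mu
```

On two well-separated blobs this converges in a few iterations; a production version would also stop early once the assignments stop changing.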

3 Optimization Objective

$c^{(i)}$ = index of the cluster ($1, 2, \cdots, K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

  • Distortion function:
    $$\min_{\substack{c^{(1)},\cdots,c^{(m)}\\ \mu_1,\cdots,\mu_K}} J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)=\frac{1}{m}\sum_{i=1}^m \big\|x^{(i)}-\mu_{c^{(i)}}\big\|^2$$
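Given assignments and centroids, the distortion $J$ is a one-liner in NumPy. A sketch only; the helper name `distortion` is made up here:

```python
import numpy as np

def distortion(X, c, mu):
    """J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2."""
    # mu[c] selects, for each example, the centroid of its assigned cluster
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))
```

For example, two points at distance 1 on either side of their single centroid give $J = 1$.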

4 Random Initialization

  1. Should have $K < m$: the number of cluster centroids should be smaller than the number of training examples
  2. Randomly pick $K$ training examples
  3. Set $\mu_1, \cdots, \mu_K$ equal to these $K$ examples
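Steps 1–3 amount to sampling $K$ distinct training examples as the initial centroids. An illustrative sketch, with `random_init` a hypothetical name:

```python
import numpy as np

def random_init(X, K, rng):
    """Set mu_1, ..., mu_K equal to K distinct randomly chosen examples."""
    assert K < len(X)  # should have K < m
    idx = rng.choice(len(X), size=K, replace=False)  # K distinct indices
    return X[idx].copy()
```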

To avoid bad local minima, especially when $K$ is small, run K-means many times, re-doing the random initialization each time, then compare the runs and keep the result with the lowest cost.

  • for i = 1 to 100 {
        Randomly initialize K-means.
        Run K-means. Get $c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K$
        Compute cost function (distortion) $J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$
    }
    Pick the clustering that gave the lowest cost $J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$
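The restart loop might look like the self-contained sketch below. The helper names `kmeans_once` and `kmeans_restarts`, the iteration counts, and the seeding scheme are all assumptions made for this example:

```python
import numpy as np

def kmeans_once(X, K, rng, n_iters=50):
    # One K-means run from a fresh random initialization
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    # Distortion of this run
    J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))
    return c, mu, J

def kmeans_restarts(X, K, n_restarts=100, seed=0):
    # Run K-means many times; keep the clustering with the lowest distortion J
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, K, rng) for _ in range(n_restarts)),
               key=lambda run: run[2])
```

With 100 restarts the chance that every run lands in the same bad local minimum becomes small, which is the whole point of the loop above.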

5 Choosing the Number of Clusters

  • Elbow method: plot the distortion $J$ against the number of clusters $K$; the point where the curve bends like an elbow, after which $J$ decreases only slowly, suggests a reasonable choice of $K$.
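One way to apply the elbow method is to compute $J$ for each candidate $K$ and inspect the resulting curve. A sketch under the same assumptions as above (the single random initialization per $K$ and the 50-iteration inner loop are simplifications; in practice you would use restarts for each $K$ too):

```python
import numpy as np

def distortions_by_k(X, ks, seed=0):
    """Distortion J for each candidate K; plot J vs K and look for the elbow."""
    rng = np.random.default_rng(seed)
    out = []
    for K in ks:
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(50):
            c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
            for k in range(K):
                if np.any(c == k):
                    mu[k] = X[c == k].mean(axis=0)
        out.append(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
    return out
```

$J$ shrinks as $K$ grows; the elbow is where the marginal drop becomes small.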

6 Reference

Andrew Ng, Machine Learning (Coursera)
Huang Haiguang (黄海广), Machine Learning notes

Reprinted from blog.csdn.net/qq_44714521/article/details/108527650