Machine Learning Primer 08 -- Clustering and the K-Means Clustering Algorithm

Time flies: this article is the last in the machine learning primer series. Eight weeks is a short time, and although there have not been many chances to actually apply machine learning, we have at least gained a bird's-eye view of its basic concepts, such as classification and regression, loss functions, and a few simple algorithms like kNN and decision trees.
So today, let us end this machine learning trip with the K-Means clustering algorithm.

1. Clustering

1.1 What is clustering

Clustering is the process of dividing a collection of physical or abstract objects into multiple classes of similar objects. A cluster produced by clustering is a set of data objects: objects in the same cluster are similar to one another, and dissimilar to objects in other clusters. As the saying goes, "birds of a feather flock together"; classification problems abound in both the natural and the social sciences. Cluster analysis, also known as group analysis, is a statistical method for studying classification problems (of samples or indicators). Cluster analysis originated in taxonomy, but clustering is not the same as classification: unlike classification, the classes into which clustering divides the data are not known in advance. Cluster analysis is a very rich field, including hierarchical clustering, ordered-sample clustering, dynamic clustering, fuzzy clustering, graph-theoretic clustering, and cluster forecasting methods.

- Excerpt from Baidu Encyclopedia "clustering" entry.

Cluster analysis (English: cluster analysis) is a technique for statistical data analysis that is widely used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering divides similar objects into groups or subsets by static classification, so that members of the same subset share similar properties, often expressed as shorter distances in a common coordinate space.

Clustering is generally regarded as a form of unsupervised learning.

- Excerpt from Wikipedia "cluster analysis" entry

In plain words: clustering tries to partition a data set into several usually disjoint subsets of samples, each of which is called a "cluster".
Through this partition, each cluster may correspond to some underlying concept (i.e. a class), such as "light-colored melon", "dark melon", "seeded melon", "seedless melon", or even "locally grown melon" and "non-local melon". Note that these concepts are unknown to the clustering algorithm in advance: the clustering process can only form the cluster structure automatically, and the semantics each cluster corresponds to must be grasped and named by the user. (Excerpt from https://www.jianshu.com/p/caef1926adf7 )

1.2 The difference between clustering and classification

Clustering is an unsupervised learning algorithm, while classification is a supervised learning algorithm. "Supervised" means the training set comes with known labels (that is, we know in advance which category each training sample belongs to); the learning algorithm fits suitable parameters on the training set, builds a model, and then applies it to the test set. A clustering algorithm, in contrast, has no labels: when clustering we do not care what each class "is", the goal is simply to bring similar things together.

An example with colored balls:

  • Prepare boxes labeled "blue", "red", "green" and "yellow", then drop balls of those four colors into the matching boxes: that is classification.
  • Prepare a number of unlabeled boxes with no differences between them, then put the balls into different boxes however seems natural (for example: group by color if the color differences are large; group by size if the colors are all similar; or weigh softness, size and color together): that is clustering.
    In short, classification throws each ball into a pre-defined box according to a specified standard, while clustering groups similar balls into the same box without any standard known in advance.
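The ball-and-box analogy can be sketched in code. The numeric "colors", the threshold, and the box names below are made-up illustrative values, not anything from the original text; the point is only that classification consults known labels while clustering groups by similarity alone:

```python
import numpy as np

# Hypothetical 1-D "ball colors" encoded as numbers, purely for illustration.
balls = np.array([0.1, 0.2, 0.9, 1.0])

# Classification: labels are known in advance; a new ball goes into the
# labeled box whose reference color it most resembles.
labeled_boxes = {"blue": 0.15, "red": 0.95}  # label -> reference color
def classify(ball):
    return min(labeled_boxes, key=lambda k: abs(ball - labeled_boxes[k]))

# Clustering: no labels; balls are simply grouped by mutual similarity,
# and the resulting group ids carry no meaning by themselves.
def cluster(values, threshold=0.5):
    return [0 if v < threshold else 1 for v in values]

print(classify(0.12))   # -> "blue"
print(cluster(balls))   # -> [0, 0, 1, 1]
```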

2. Clusters

Each group of objects obtained by the partition is called a cluster. For a cluster analysis to be meaningful, obtaining clusters of useful objects is the most critical step. A few common notions of cluster include well-separated clusters, center-based (prototype) clusters, graph-based (contiguity) clusters, and density-based clusters.

(excerpt from https://blog.csdn.net/taoyanqi8932/article/details/53727841 )
Among these, the center-based cluster is exactly the type used by the K-Means algorithm we are about to study.

Speaking of distance: the various distance metrics were already covered in a previous article of this series, so we will not repeat them here.

The most popular choice is still the Euclidean distance (i.e. the 2-norm of the difference vector).
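For example, the 2-norm of the difference vector can be computed directly with NumPy:

```python
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Euclidean distance = 2-norm of the difference vector
d = np.linalg.norm(p1 - p2)
print(d)  # 5.0, since sqrt(3^2 + 4^2) = 5
```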

3. K-Means algorithm

3.1 What is the K-Means algorithm

The k-means clustering algorithm is an iterative clustering algorithm. It first randomly selects K objects as the initial cluster centers, then computes the distance between every object and every cluster center, and assigns each object to its nearest center. The cluster centers, together with the objects assigned to them, represent the clusters. After every sample has been assigned, each cluster center is recomputed from the objects currently in the cluster. This process repeats until a termination condition is met, for example: no (or only a minimal number of) objects are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.

- Excerpt from Baidu Encyclopedia "K-means clustering algorithm" entry

The k-means algorithm (English: k-means clustering) originated as a vector quantization method in signal processing, and is now more popular as a cluster analysis method in data mining. The goal of k-means clustering is to partition n points (each of which may be one observation of a sample, or one instance) into k clusters, so that every point belongs to the cluster whose mean (i.e. the cluster center) is nearest to it, using that mean as the standard of the cluster. This problem reduces to partitioning the data space into Voronoi cells.

The problem is computationally NP-hard, but efficient heuristic algorithms exist. In practice, such efficient heuristics are generally used, and they converge quickly to a local optimum. These algorithms usually resemble the expectation-maximization (EM) algorithm for mixtures of Gaussian distributions, in that both refine the solution by iterative optimization. Moreover, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while EM allows clusters of different shapes.

k-means clustering has no relation whatsoever to k-nearest neighbors (the latter is another popular machine learning technique).

— Excerpt from the Wikipedia entry for "k-means clustering"

In short, what the K-Means algorithm does is this: given a sample set, partition it into K clusters according to the distances between samples, so that the points within each cluster are connected as tightly as possible, while the distance between clusters is as large as possible.
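In symbols (a standard formulation, stated here for completeness rather than taken from the excerpts above): k-means seeks a partition into clusters C_1, ..., C_K that minimizes the within-cluster sum of squared distances to the cluster means,

```latex
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x .
```

The assignment and update steps of the algorithm each monotonically decrease this objective, which is why the iteration converges, though only to a local optimum.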

3.2 Pros and cons of the K-Means algorithm

  • Pros:
    • Simple, easy to understand and implement; converges quickly, usually within 5-10 iterations; efficient
  • Cons:
    • The result depends heavily on how the value of K is chosen
    • Sensitive to the initial centers: different random starting points may yield completely different clusterings
    • Hard to converge on non-convex data sets
    • Overly sensitive to outliers, because the algorithm is based on means
    • The result is not necessarily a global optimum; only a local optimum is guaranteed
    • Groups spherical clusters well, but performs poorly on non-spherical clusters and on clusters of different sizes or densities

3.3 Description and code implementation of the K-Means algorithm

The flow of the K-Means algorithm is roughly as follows:

  1. Randomly pick K samples as the initial cluster centers.
  2. Assign every sample to the cluster whose center is nearest to it.
  3. Recompute each center as the mean of the samples assigned to it.
  4. Repeat steps 2-3 until the centers no longer change (or another termination condition is met).

A reference implementation in Python:

import numpy as np
import random
import sys

class KMeansClusterer:
    def __init__(self, ndarray, cluster_num):
        self.ndarray = ndarray
        self.cluster_num = cluster_num
        self.points = self.__pick_start_point(ndarray, cluster_num)

    def cluster(self):
        while True:
            # One empty list per cluster to collect the assigned samples
            result = [[] for _ in range(self.cluster_num)]
            for item in self.ndarray:
                # Assign each sample to its nearest cluster center
                distance_min = sys.maxsize
                index = -1
                for i in range(len(self.points)):
                    distance = self.__distance(item, self.points[i])
                    if distance < distance_min:
                        distance_min = distance
                        index = i
                result[index].append(item.tolist())
            # Recompute each center as the mean of its assigned samples
            new_center = np.array([self.__center(item) for item in result])
            # Centers unchanged: a steady state has been reached, stop iterating
            if (self.points == new_center).all():
                return result
            self.points = new_center

    def __center(self, coords):
        '''Compute the center of a group of coordinates
        (the mean of every column).'''
        return np.array(coords).mean(axis=0)

    def __distance(self, p1, p2):
        '''Euclidean distance between two points.'''
        tmp = 0
        for i in range(len(p1)):
            tmp += pow(p1[i] - p2[i], 2)
        return pow(tmp, 0.5)

    def __pick_start_point(self, ndarray, cluster_num):
        if cluster_num <= 0 or cluster_num > ndarray.shape[0]:
            raise ValueError("invalid number of clusters")
        # Indices of the randomly chosen initial centers
        indexes = random.sample(range(ndarray.shape[0]), cluster_num)
        return np.array([ndarray[index].tolist() for index in indexes])
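As a quick sanity check of the procedure, here is a self-contained sketch (independent of the class above, using made-up two-blob data and deliberately rough starting centers) that runs the assignment and update steps with NumPy:

```python
import numpy as np

# Two well-separated 2-D blobs; k-means should recover their means.
rng = np.random.default_rng(0)
a = rng.normal(loc=[0, 0], scale=0.1, size=(50, 2))
b = rng.normal(loc=[5, 5], scale=0.1, size=(50, 2))
data = np.vstack([a, b])

centers = np.array([[0.5, 0.5], [4.5, 4.5]])  # rough starting points
for _ in range(10):
    # Assignment step: index of the nearest center for every point
    labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers))  # close to [[0, 0], [5, 5]]
```

With such widely separated blobs the iteration settles after the first pass; with overlapping blobs or unlucky starting centers it may converge to a worse local optimum, which is exactly the initialization sensitivity listed among the cons above.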

4. Conclusion

Well, the eight-week machine learning primer ends here. I hope this learning experience will, one way or another, prove useful in the future.
And now the author can finally devote himself wholeheartedly to the course project and final review. Bye~

References:

https://www.jianshu.com/p/caef1926adf7
https://blog.csdn.net/taoyanqi8932/article/details/53727841
https://zh.wikipedia.org/wiki/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90
https://zh.wikipedia.org/wiki/K-%E5%B9%B3%E5%9D%87%E7%AE%97%E6%B3%95
https://baike.baidu.com/item/K%E5%9D%87%E5%80%BC%E8%81%9A%E7%B1%BB%E7%AE%97%E6%B3%95/15779627?fr=aladdin
https://baike.baidu.com/item/%E8%81%9A%E7%B1%BB/593695?fr=aladdin


Origin www.cnblogs.com/DrChuan/p/12082985.html