Single-pass clustering algorithm for weather clustering


Introduction to Clustering Algorithms

(1) Systematic (hierarchical) clustering method

The basic idea of the systematic (hierarchical) clustering method is: samples that are close together are merged into clusters first, and samples that are far apart are merged into clusters later.

Depending on how the distance between clusters is defined, the systematic clustering method can be divided into the shortest-distance (single linkage) method, the longest-distance (complete linkage) method, the middle-distance method, the center-of-gravity (centroid) method, the class-average method, the variable class-average method, the variable method, and the sum-of-squared-deviations (Ward) method.

Note that:

(1) This method does not require the number of clusters to be specified in advance; the number of clusters is determined from the final classification process.

(2) For an intuitive view, the classification process can be drawn as a pedigree diagram (dendrogram), so systematic clustering is also called pedigree analysis.
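
For readers who want to try this method quickly, here is a minimal sketch using SciPy's hierarchical clustering; the sample points, the single-linkage choice, and the cut distance are illustrative assumptions, not part of the original article.

# Minimal systematic (hierarchical) clustering sketch with SciPy; data and parameters are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1], [5.2, 5.9], [9.0, 0.5]])
Z = linkage(points, method='single')                # 'single' = shortest-distance (single linkage) method
labels = fcluster(Z, t=2.0, criterion='distance')   # the number of clusters follows from where the tree is cut
print(labels)                                       # cluster id of each point
dendrogram(Z)                                       # the pedigree diagram (dendrogram)
plt.show()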

(2) K-means clustering method

The K-means clustering method is a classic clustering algorithm with abundant reference material, so this article will not describe it in detail.

Note that:

(1) This method usually clusters by computing the Euclidean distance between samples to compare their similarity; I have also seen implementations that use the correlation coefficient instead.

(2) This method requires the number of clusters K to be specified, and choosing K is a difficult point; there are many methods for optimizing K.
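
As a point of comparison, a minimal K-means sketch with scikit-learn might look like the following; the toy data and the choice of K = 2 are assumptions made only for illustration.

# Minimal K-means sketch with scikit-learn; data and K are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[25.0, 15.0], [26.0, 16.0], [5.0, -3.0], [4.0, -4.0]])  # e.g. (max, min) temperature pairs
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)           # K must be specified in advance
print(km.labels_)                                                     # cluster assignment of each sample
print(km.cluster_centers_)                                            # the K centroids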

(3) Single-pass clustering method

Meaning and applications

Single-pass clustering is a concise and efficient text clustering algorithm. Compared with the commonly used K-means method, it is very fast to compute, and it does not require the number of clusters to be specified; instead, the clustering is controlled by a similarity threshold.

The single-pass clustering algorithm is also an incremental clustering algorithm: each document flows through the algorithm only once. It is commonly used for clustering text by topic and works well in social-media big-data applications such as topic detection and tracking and online event monitoring. It is especially suitable for streaming data, such as Weibo posts.

Streaming data: a sequence of data that arrives in order, in large volume, quickly, and continuously. In general, streaming data can be viewed as a dynamic data collection that grows without bound over time.
Source: Baidu Encyclopedia

Processing steps

The single-pass algorithm processes documents sequentially, using the first document as a seed to establish a new topic. For each subsequent document, its similarity to every existing topic is computed, and the document is added to the most similar topic, provided that similarity exceeds a threshold. If the similarity to all existing topics is below the threshold, the document becomes the seed of a new topic. The algorithm flow is as follows:

(1) Use the first document as a seed to create a topic;

(2) Vectorize the newly arriving document X;

(3) Calculate the similarity between document X and all existing topics, using a distance measure such as the Euclidean distance or cosine distance;

(4) Find the existing topic with the greatest similarity to document X;

(5) If that similarity is greater than the threshold θ, add document X to the most similar topic and jump to (7);

(6) If the similarity to every existing topic is less than the threshold θ, document X does not belong to any existing topic; create a new topic category and assign the document to it;

(7) Clustering of this document ends; wait for the next document to arrive.

Note: the steps above add a document to the topic with which it is most similar, provided that similarity exceeds a threshold. In the code below, similarity is replaced by distance, so the condition becomes: the smallest distance must be below a threshold.
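
Step (3) leaves the choice of measure open; the code in this article uses Euclidean distance, but a cosine-similarity measure could be substituted. A minimal sketch follows (cosine_similarity is a hypothetical helper, not part of the article's code):

# Hypothetical cosine-similarity helper (an assumption, not taken from the article's code).
import numpy as np

def cosine_similarity(vec_a, vec_b):
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Higher values mean more similar, so with a similarity measure the threshold test is "> θ" rather than "< θ".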

Sample description

Screenshot of the sample information
The sample information is saved to a TXT file, as shown in the screenshot above.

The number of rows is not known in advance, but it can be found with describe() in Python (via pandas); the number of columns is ten.

The two columns after the date are the daily maximum and minimum temperatures, and the last two columns give the longitude and latitude of each location.
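
As a quick check of the file, a small pandas snippet like the following can report the row count and summary statistics. This is a sketch: the file name data1.txt, the comma separator, and the absence of a header row are assumptions based on the loading code later in the article.

# Inspect the sample TXT file; the file name and format are assumptions taken from the loading code below.
import pandas as pd

df = pd.read_csv('data1.txt', sep=',', header=None, encoding='utf-8')
print(df.shape)       # (number of rows, number of columns) -- the column count should be 10
print(df.describe())  # summary statistics, including the row count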

Code

Define classes and functions

First, a cluster unit class ClusterUnit is defined, then a single-pass clusterer OnePassCluster, along with a function euclidian_distance that computes the Euclidean distance between vectors a and b.

# coding=utf-8
import numpy as np
from math import sqrt
import time
import matplotlib.pylab as pl


# Define a cluster unit
class ClusterUnit:
    def __init__(self):
        self.node_list = []  # list of nodes in this cluster
        self.node_num = 0  # number of nodes in this cluster
        self.centroid = None  # centroid of this cluster

    def add_node(self, node, node_vec):
        """
        Add the given node to this cluster and update the centroid.
         node_vec: feature vector of the node
         node: node identifier
         return: None
        """
        self.node_list.append(node)
        try:
            self.centroid = (self.node_num * self.centroid + node_vec) / (self.node_num + 1)  # update the centroid (running mean)
        except TypeError:
            self.centroid = np.array(node_vec) * 1  # initialize the centroid with the first node
        self.node_num += 1  # increment the node count

    def remove_node(self, node):
        # Remove the given node from this cluster
        try:
            self.node_list.remove(node)
            self.node_num -= 1
        except ValueError:
            raise ValueError("%s not in this cluster" % node)  # the node is not in this cluster, removal failed

    def move_node(self, node, node_vec, another_cluster):
        # Move a node from this cluster to another cluster
        # (node_vec is required so the target cluster can update its centroid)
        self.remove_node(node=node)
        another_cluster.add_node(node=node, node_vec=node_vec)


# cluster_unit = ClusterUnit()
# cluster_unit.add_node(1, [1, 1, 2])
# cluster_unit.add_node(5, [2, 1, 2])
# cluster_unit.add_node(3, [3, 1, 2])
# print(cluster_unit.centroid)
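# Worked example (assuming the test above is uncommented): the incremental update in add_node
# keeps the centroid equal to the mean of the vectors added so far, so after the three calls
# the centroid is ([1,1,2] + [2,1,2] + [3,1,2]) / 3 = [2, 1, 2].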


# Compute the Euclidean distance between vector a and vector b
def euclidian_distance(vec_a, vec_b):
    diff = vec_a - vec_b
    return sqrt(np.dot(diff, diff))             # np.dot computes the inner product of the difference with itself


class OnePassCluster:
    def __init__(self, t, vector_list):
        # t: threshold for single-pass clustering
        self.threshold = t                      # clustering threshold
        self.vectors = np.array(vector_list)    # data list (list of feature vectors)
        self.cluster_list = []                  # list of clusters after clustering

        t1 = time.time()
        self.clustering()
        t2 = time.time()
        self.cluster_num = len(self.cluster_list)       # number of clusters after clustering
        self.spend_time = t2 - t1                       # time spent on clustering (seconds)

    def clustering(self):
        self.cluster_list.append(ClusterUnit())                 # create the first cluster
        self.cluster_list[0].add_node(0, self.vectors[0])       # assign the first node to it
        for index in range(1, len(self.vectors)):
            min_distance = euclidian_distance(vec_a=self.vectors[index],
                                              vec_b=self.cluster_list[0].centroid)  # distance to the first cluster's centroid
            min_cluster_index = 0  # index of the closest cluster so far
            for cluster_index, cluster in enumerate(self.cluster_list[1:]):
                # enumerate yields (index, cluster) pairs
                # find the closest cluster and record its distance and index
                distance = euclidian_distance(vec_a=self.vectors[index],
                                              vec_b=cluster.centroid)
                if distance < min_distance:
                    min_distance = distance
                    min_cluster_index = cluster_index + 1
            if min_distance < self.threshold:                   # minimum distance below the threshold: join that cluster
                self.cluster_list[min_cluster_index].add_node(index, self.vectors[index])
            else:  # otherwise create a new cluster
                new_cluster = ClusterUnit()
                new_cluster.add_node(index, self.vectors[index])
                self.cluster_list.append(new_cluster)
                del new_cluster

    def print_result(self, label_dict=None):
        # Print the clustering result
        # label_dict: labels of the nodes, indexed by node id
        print("***********  single-pass clustering result  ***********")
        for index, cluster in enumerate(self.cluster_list):
            print("cluster:%s" % index)         # cluster index
            print(cluster.node_list)            # list of nodes in this cluster
            if label_dict is not None:
                print(" ".join([label_dict[n] for n in cluster.node_list]))     # if labels are provided, print the labels of this cluster
            print("node num: %s" % cluster.node_num)
            print("-------------")
        print("total number of nodes: %s" % len(self.vectors))
        print("number of clusters: %s" % self.cluster_num)
        print("time spent: %.9fs" % self.spend_time)

After running, the data is grouped into ten clusters, numbered 0 to 9.

The following figure shows the cities included in the eighth category and their corresponding indexes:
Contents in the eighth category
The overall running results are as follows:
Running results

Call and draw

Clustering is then performed by instantiating the class and calling its functions:

# Read the test set
temperature_all_city = np.loadtxt('data1.txt', delimiter=",", usecols=(3, 4), encoding='utf-8')  # read the clustering features (max/min temperature)
xy = np.loadtxt('data1.txt', delimiter=",", usecols=(8, 9), encoding='utf-8')  # read the longitude and latitude of each location
f = open('data1.txt', 'r', encoding='utf-8')
lines = f.readlines()
zone_dict = [i.split(',')[1] for i in lines]  # read the region names (a list indexed by row number)

f.close()

# Build the single-pass clusterer
clustering = OnePassCluster(vector_list=temperature_all_city, t=9)
clustering.print_result(label_dict=zone_dict)
print(temperature_all_city)


# Plot the clustering result on a longitude/latitude scatter map
fig, ax = pl.subplots()
c_map = pl.get_cmap('jet', clustering.cluster_num)
c = 0

for cluster in clustering.cluster_list:
    for node in cluster.node_list:
        ax.scatter(xy[node][0], xy[node][1], color=c_map(c), s=30)  # color each point by its cluster index
    c += 1

#pl.axis('off')  # hide the axes
pl.savefig('./map.jpg')
pl.show()

Using the latitude and longitude information in the sample, combined with the clustering results, the following map can be drawn:
Visualization of the clustering result

Summary

This article mainly gives an example of the single-pass clustering algorithm, which is easy to reproduce. I hope you find it helpful.

Reference: https://blog.csdn.net/maqian5/article/details/107333316

The data and code in this article are quoted from other sources; if a source is not credited, please contact me and I will add it.

If you need the data and code from this article, you can leave a message in the comment section.


Origin: blog.csdn.net/golden_knife/article/details/124434270