Euclidean distance clustering algorithm (for learning only)

1. Definition

The Euclidean distance clustering algorithm (Euclidean Distance Clustering Algorithm) is a clustering algorithm that groups samples into different classes based on their Euclidean distance. In the form discussed here it is a hierarchical (agglomerative) clustering algorithm, because the clustering result it produces can be expressed as a tree structure (called a cluster tree or dendrogram): each node on the tree represents a cluster, and the child nodes of a node represent that cluster's subclusters.

Euclidean distance refers to the distance between two points in n-dimensional space, namely:

d(x,y)=√((x1−y1)²+(x2−y2)²+...+(xn−yn)²)

Among them, x and y are two n-dimensional vectors, and x1, x2, ..., xn and y1, y2, ..., yn are their coordinate values in the corresponding dimensions.
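As a quick check of the formula, here is a minimal NumPy snippet (the two sample points are made up for illustration) that computes the distance both directly and with the built-in vector norm:

import numpy as np

# Two sample points in 3-dimensional space
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((x - y) ** 2))
print(d)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5

# Equivalently, via the vector 2-norm
print(np.linalg.norm(x - y))  # 5.0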

2. Steps

  1. Initialization: Treat each sample point as a cluster.

  2. Calculate the distance between each pair of clusters; the single-linkage or complete-linkage method is usually used to compute it.

  3. Merge the two closest clusters to form a new cluster.

  4. Repeat steps 2 and 3 until all sample points are merged into one cluster or the preset number of clusters is reached.
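The four steps above map almost directly onto a short from-scratch implementation. The sketch below (single linkage, pure NumPy, made-up sample points) is deliberately naive and cubic-time; it exists to make the loop structure concrete, not to be fast:

import numpy as np

def single_linkage_cluster(points, n_clusters):
    """Naive agglomerative clustering with single linkage. O(n^3); for learning only."""
    # Step 1: start with every point in its own cluster
    clusters = [[i] for i in range(len(points))]

    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        # Step 2: find the closest pair of clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Step 3: merge the closest pair into one cluster
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Step 4 is the loop condition: stop at the preset number of clusters
    return clusters

# Tiny made-up example: two well-separated groups of two points each
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(single_linkage_cluster(pts, 2))  # [[0, 1], [2, 3]]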

3. Advantages and disadvantages

The advantage of the Euclidean distance clustering algorithm is its simplicity: the procedure is easy to understand and easy to implement.

Its disadvantages are that it is sensitive to noise and outliers, and that, as a hierarchical algorithm, it must compute distances between all pairs of sample points, which makes it inefficient on large-scale data.

4. Steps for classifying point clouds of different tree species with the Euclidean distance clustering algorithm

  1. Data preprocessing: convert the point cloud data into n-dimensional vector form, where n is the feature dimension of each point. Consider using descriptors based on features such as shape and color to represent each point's feature vector.

  2. Initialization: Treat each point as a cluster.

  3. Calculate the distance between each pair of clusters, usually with the single-linkage or complete-linkage method. A distance matrix can be used to store inter-cluster distances and avoid recomputation.

  4. Merge the two closest clusters to form a new cluster. A union-find (disjoint-set) structure can be used to track which points belong to each merged cluster.

  5. Repeat steps 3 and 4 until all points are clustered into one class or the preset number of clusters is reached.

  6. Optional post-processing: clustering evaluation indicators (such as the silhouette coefficient, as sketched below) can be used to assess clustering quality, together with visualization and analysis of the clustering results.
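For step 6, scikit-learn's silhouette_score is one readily available implementation of the silhouette coefficient. In this sketch the random feature matrix is only a placeholder for real per-point tree descriptors:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Placeholder for real per-point feature vectors (200 points, 5 features)
features = np.random.rand(200, 5)

labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)

# Silhouette coefficient lies in [-1, 1]; higher means better-separated clusters
print(silhouette_score(features, labels))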

In a concrete implementation, appropriate feature descriptors and distance calculation methods must be chosen, and parameters such as the number of clusters and the cluster-merging threshold must be set reasonably. When processing point cloud data, issues such as point density and noise also need to be considered; filtering and downsampling are common ways to handle them.
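One way to do this filtering and downsampling, assuming the Open3D library is available (the file names here are hypothetical):

import open3d as o3d

# Load a point cloud (file name is illustrative)
pcd = o3d.io.read_point_cloud("trees.pcd")

# Voxel downsampling to even out point density
pcd = pcd.voxel_down_sample(voxel_size=0.05)

# Statistical outlier removal to suppress noise
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

o3d.io.write_point_cloud("trees_clean.pcd", pcd)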

5. Code implementation

1. Python

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Load the point cloud data (one feature vector per line)
data = np.loadtxt('data.txt')

# Build the agglomerative clustering model with Euclidean distance.
# Note: scikit-learn >= 1.2 renames the 'affinity' parameter to 'metric';
# with 'ward' linkage the distance must be Euclidean in any case.
model = AgglomerativeClustering(n_clusters=3, linkage='ward', affinity='euclidean')

# Fit the model and predict the cluster label of each point
labels = model.fit_predict(data)

# Print the clustering result
print(labels)

Among them, data.txt is a text file containing the point cloud data, with each line holding the feature vector of one point; it is loaded with numpy's loadtxt() function. The n_clusters parameter specifies the number of clusters, the linkage parameter specifies how clusters are merged, and the affinity parameter (renamed metric in newer scikit-learn releases) specifies how distances are computed. The fit_predict() method fits the model and returns the cluster label of each point, which is then printed for inspection. Note that this uses the clustering implementation in the sklearn library; other libraries or clustering algorithms could be used to complete this task instead.

2. C++

Note that this C++ example implements k-means rather than the agglomerative algorithm described above; k-means is a partitional clustering method that also assigns points by Euclidean distance.

#include <iostream>
#include <fstream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>
#include <cstdlib>

using namespace std;

typedef vector<double> Point;

// Euclidean distance between two points
double distance(const Point& p1, const Point& p2) {
    double dist = 0;
    for (size_t i = 0; i < p1.size(); i++) {
        double diff = p1[i] - p2[i];
        dist += diff * diff;
    }
    return sqrt(dist);
}

// Distances from every point in a set to a target point
vector<double> distances(const vector<Point>& points, const Point& target) {
    vector<double> dists(points.size());
    for (size_t i = 0; i < points.size(); i++) {
        dists[i] = distance(points[i], target);
    }
    return dists;
}

// Partition the point set into k clusters (k-means)
vector<int> kmeans(const vector<Point>& points, int k) {
    // Randomly pick k of the input points as initial cluster centers
    vector<Point> centers(k);
    for (int i = 0; i < k; i++) {
        centers[i] = points[rand() % points.size()];
    }

    // Iteratively reassign points and update centers until assignments stop changing
    vector<int> labels(points.size());
    bool converged = false;
    while (!converged) {
        // Assign each point to the nearest cluster center
        converged = true;
        for (size_t i = 0; i < points.size(); i++) {
            vector<double> dists = distances(centers, points[i]);
            // Index of the smallest distance = label of the nearest center
            int label = min_element(dists.begin(), dists.end()) - dists.begin();
            if (labels[i] != label) {
                converged = false;
                labels[i] = label;
            }
        }

        // Recompute each cluster's center as the mean of its member points
        for (int i = 0; i < k; i++) {
            vector<Point> cluster_points;
            for (size_t j = 0; j < points.size(); j++) {
                if (labels[j] == i) {
                    cluster_points.push_back(points[j]);
                }
            }
            if (!cluster_points.empty()) {
                Point center(cluster_points[0].size());
                for (size_t j = 0; j < cluster_points.size(); j++) {
                    for (size_t d = 0; d < cluster_points[j].size(); d++) {
                        center[d] += cluster_points[j][d];
                    }
                }
                for (size_t d = 0; d < center.size(); d++) {
                    center[d] /= cluster_points.size();
                }
                centers[i] = center;
            }
        }
    }

    return labels;
}

int main() {
    // Load point cloud data: three coordinates (x y z) per point
    ifstream fin("data.txt");
    vector<Point> points;
    double value;
    while (fin >> value) {
        Point point(3);
        point[0] = value;
        fin >> value;
        point[1] = value;
        fin >> value;
        point[2] = value;
        points.push_back(point);
    }
    fin.close();

    // Cluster the points into 3 clusters with k-means and print each label
    vector<int> labels = kmeans(points, 3);
    for (size_t i = 0; i < labels.size(); i++) {
        cout << labels[i] << endl;
    }

    return 0;
}
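Assuming the code is saved as, say, kmeans.cpp, it can be compiled with any C++11-capable compiler, e.g. g++ -std=c++11 kmeans.cpp -o kmeans. Note that rand() is not seeded here, so every run picks the same initial centers; calling srand(time(nullptr)) before kmeans() would randomize the initialization.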

6. Other tree species point cloud classification methods

  1. Support Vector Machine (SVM): SVM is a supervised learning algorithm that can be used to classify point clouds of different tree species. Its core idea is to separate different classes of point clouds by finding the maximum margin hyperplane. The advantage of SVM is that it can handle high-dimensional data, but the disadvantage is that it requires a lot of training data and computing resources.

  2. Random Forest: Random Forest is an ensemble learning method based on decision trees, which can be used to classify point clouds of different tree species. Its advantages are that it can handle high-dimensional data and nonlinear relationships, and that training can be parallelized. The disadvantage is that it requires a large amount of training data and computing resources.

  3. Neural Network: A neural network is a machine learning method based on an artificial neuron model, which can be used to classify point clouds of different tree species. Its advantages are that it can handle high-dimensional data and nonlinear relationships, and that training can be parallelized. The disadvantages are that it requires a large amount of training data and computing resources, and that different datasets call for different network architectures and hyperparameter tuning.

  4. Principal Component Analysis (PCA): PCA is a dimensionality-reduction algorithm that projects high-dimensional point cloud data into a low-dimensional space while preserving the data's main features. PCA is not itself a classifier, but it is often used as a preprocessing step before classifying tree species point clouds. Its advantages are that it removes redundancy and noise and reduces computational complexity; the disadvantage is that some important information may be lost.

These methods each have their own advantages and disadvantages; choosing an appropriate one requires weighing factors such as data characteristics, data size, algorithm complexity, and available computing resources.
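As a contrast with the unsupervised clustering above, the following sketch shows what the supervised route might look like with scikit-learn's RandomForestClassifier. The feature matrix and labels are random placeholders standing in for real per-tree descriptors and ground-truth species labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data: 300 trees, 10 features each, 3 species labels
X = np.random.rand(300, 10)
y = np.random.randint(0, 3, size=300)

# Hold out 30% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Fraction of held-out trees whose species is predicted correctly
print(accuracy_score(y_test, clf.predict(X_test)))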
