7 Clustering Algorithms in the MATLAB Statistics and Machine Learning Toolbox

1. Overview of clustering algorithms in MATLAB

This article provides a brief overview of the clustering methods available in the MATLAB Statistics and Machine Learning Toolbox and introduces their clustering functions. In practice, it is very convenient to call these functions directly, and they really show off the power of MATLAB.

Cluster analysis, also known as segmentation analysis or classification analysis, is a common unsupervised learning method. Unsupervised learning infers structure from unlabeled input data, assigning each observation a group label, which is in effect "labeling" the data. For example, cluster analysis can be used to find hidden patterns or groupings in unlabeled data.

Cluster analysis creates groups, or clusters, of data. Objects belonging to the same cluster are similar to each other, and objects belonging to different clusters are dissimilar. To quantify "similar" and "dissimilar", you can use a dissimilarity measure (or distance measure) appropriate to your application domain and dataset. Depending on your needs, you can also scale (or standardize) the variables in the data so that they carry equal weight in the clustering process.
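For example, here is a minimal sketch of standardization using the Toolbox's zscore function (X is an assumed n-by-p numeric data matrix, not from the original post):

% Standardize each column (variable) of X to zero mean and unit variance
Xs = zscore(X);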

Specifically, the Statistics and Machine Learning Toolbox provides these clustering algorithms:

  • Hierarchical clustering
  • k-means and k-medoids clustering
  • Density-based spatial clustering of applications with noise (DBSCAN)
  • Gaussian mixture models
  • k-nearest neighbor search and radius search
  • Spectral clustering

2. Introduction and application of the 7 clustering algorithms built into MATLAB

1) Hierarchical Clustering
Hierarchical clustering groups data over a variety of scales by creating a cluster tree, or dendrogram. The tree is not a single set of clusters, but a multilevel hierarchy in which clusters at one level combine to form clusters at the next level. This multilevel hierarchy allows you to choose the level or scale of clustering that best suits your application. Hierarchical clustering assigns every point in the data to a cluster.

Use the clusterdata function to perform hierarchical clustering on input data. The clusterdata function incorporates three functions, pdist, linkage, and cluster, which you can also call separately for more detailed analysis, as the sketch below shows. In addition, the dendrogram function plots the cluster tree, giving a visual result of the hierarchical clustering. A more detailed introduction to hierarchical clustering is not repeated here; I may post one later when I have time.

T = clusterdata(X,cutoff)
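As a hedged sketch of the step-by-step pipeline behind clusterdata (the demo data and linkage option here are assumed for illustration, not from the original post):

% Demo data: two well-separated groups
rng(1);                           % for reproducibility
X = [randn(20,2); randn(20,2)+4];
D = pdist(X);                     % pairwise distances (Euclidean by default)
Z = linkage(D,'average');         % build the hierarchical cluster tree
dendrogram(Z)                     % visualize the hierarchy as a dendrogram
T = cluster(Z,'MaxClust',2);      % cut the tree into 2 clusters

Running pdist, linkage, and cluster separately like this gives the same kind of result as clusterdata, but lets you inspect each intermediate object.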

I think this is enough to convey the basics; you can explore and apply these functions in MATLAB according to your own needs. You are also welcome to get in touch to discuss.

2) k-Means and k-Medoids Clustering
k-means clustering and k-medoids clustering partition the data into k mutually exclusive clusters. Both methods require the number of clusters k to be specified in advance, and both assign each point in the data to a cluster. However, unlike hierarchical clustering, they operate on the actual observations (rather than on a dissimilarity measure) and create a single level of clusters. For large amounts of data, therefore, k-means or k-medoids clustering is often more suitable than hierarchical clustering.

idx = kmeans(X,k)

idx = kmedoids(X,k)

In MATLAB, the kmeans and kmedoids functions implement k-means clustering and k-medoids clustering directly. These two algorithms are relatively simple and probably already familiar, so little more needs to be said; a short sketch follows.
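Here is a minimal sketch applying both functions to the same demo data (the dataset and the 'Replicates' option are assumed for illustration):

% Demo data with two groups
rng(2);
X = [randn(50,2); randn(50,2)+5];
idx1 = kmeans(X,2,'Replicates',5);   % k-means, 5 restarts, best solution kept
idx2 = kmedoids(X,2);                % k-medoids with the same k
gscatter(X(:,1),X(:,2),idx1)         % plot the k-means assignments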

3) Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based algorithm that can identify clusters of arbitrary shape, as well as outliers (noise), in the data. During clustering, DBSCAN identifies points that do not belong to any cluster, which makes the method useful for density-based outlier detection. Unlike k-means and k-medoids clustering, DBSCAN does not require the number of clusters to be known in advance, which is a very important advantage.

idx = dbscan(X,epsilon,minpts)

In MATLAB, the dbscan function performs clustering directly on an input data matrix or on pairwise distances between observations. For a more detailed introduction to DBSCAN, see my earlier post: Density-Based Clustering Algorithm (1) - Detailed Explanation of DBSCAN.
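A minimal sketch, with assumed demo data and assumed parameter values (epsilon and minpts must always be tuned to your own data):

% Two dense blobs plus some scattered noise points
rng(3);
X = [randn(100,2); randn(100,2)+6; 12*rand(15,2)-3];
idx = dbscan(X,1,5);          % epsilon = 1, minpts = 5 (assumed values)
gscatter(X(:,1),X(:,2),idx)   % points labeled -1 are noise/outliers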

4) Gaussian Mixture Models

Gaussian mixture models (GMMs) form clusters as a mixture of multivariate normal density components. For a given observation, a GMM assigns a posterior probability to each component density (that is, to each cluster); the posterior probability indicates how likely the observation is to belong to that cluster. A GMM can perform hard clustering by assigning each observation to the component with the largest posterior probability. You can also use a GMM to perform soft, or fuzzy, clustering by assigning each observation to multiple clusters according to its posterior probabilities. When clusters have different sizes and different correlation structures within them, a GMM may be a more suitable method than k-means clustering.

GMModel = fitgmdist(X,k)
idx = cluster(GMModel,X)

In MATLAB, use fitgmdist to fit a gmdistribution object to the data under analysis. You can also create a GMM object with gmdistribution by specifying the distribution parameters directly. Once a suitable GMM is available, the cluster function clusters the data.
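A minimal sketch of hard and soft clustering with a fitted GMM (the demo data are assumed, not from the original post):

% Demo data drawn from two multivariate normal components
rng(4);
X = [mvnrnd([0 0],eye(2),100); mvnrnd([4 4],[1 0.5; 0.5 1],100)];
GMModel = fitgmdist(X,2);     % fit a 2-component GMM
idx = cluster(GMModel,X);     % hard clustering: largest posterior probability
P = posterior(GMModel,X);     % soft clustering: posterior probability per component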

5) k-Nearest Neighbor Search and Radius Search

A k-nearest neighbor search finds the k closest points in the data to a query point or set of points. In contrast, a radius search finds all points in the data that are within a specified distance of a query point or set of query points. The results of these methods depend on the distance metric you specify.

In MATLAB, you can use the knnsearch function directly to find the k nearest neighbors, or the rangesearch function to find all neighbors within a specified distance of the input data. You can also create a searcher object from a training dataset and pass the object and a query dataset to the object functions (knnsearch and rangesearch).

Idx = knnsearch(X,Y)
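As a hedged sketch (the demo data, query points, k, and radius below are all assumed for illustration):

% Reference data and two query points
rng(5);
X = randn(100,2);
Y = [0 0; 1 1];
[Idx,D] = knnsearch(X,Y,'K',3);   % indices and distances of the 3 nearest neighbors
IdxR = rangesearch(X,Y,0.5);      % all neighbors within radius 0.5 (a cell array)
Mdl = createns(X);                % build a searcher object (e.g., a KDTreeSearcher)
Idx2 = knnsearch(Mdl,Y,'K',3);    % the same query through the object function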

6) Spectral Clustering

Spectral clustering is a graph-based clustering algorithm for finding k clusters of arbitrary shape in data. The technique involves representing the data in a lower dimension in which the clusters are more widely separated, so that an algorithm such as k-means or k-medoids clustering can be applied. This low-dimensional representation is based on the eigenvectors of a Laplacian matrix. A Laplacian matrix is one way of representing a similarity graph, which models the local neighborhood relationships between data points as an undirected graph.

In MATLAB, you can use the spectralcluster function directly to perform spectral clustering on an input data matrix or on the similarity matrix of a similarity graph. The spectralcluster function requires the number of clusters to be specified in advance. The algorithm also provides a way to estimate the number of clusters in the data; see the MATLAB documentation for the details.

idx = spectralcluster(X,k)
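A minimal sketch on two interleaved half-moon-shaped groups, a classic case where spectral clustering works well and plain k-means struggles (the demo data are assumed for illustration):

% Two noisy interleaved half-moons
rng(6);
theta = linspace(0,pi,100)';
X = [cos(theta) sin(theta); 1-cos(theta) 0.5-sin(theta)] + 0.05*randn(200,2);
idx = spectralcluster(X,2);    % k = 2 must be specified in advance
gscatter(X(:,1),X(:,2),idx)    % plot the resulting assignments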

3. Summary and comparison of 7 clustering algorithms

[Figure: summary comparison table of the 7 clustering algorithms]


Source: blog.csdn.net/weixin_50514171/article/details/128077044