Clustering algorithm learning


Introduction

Problems

1. Many algorithms require developers to provide certain parameters during the analysis process (such as the expected number of clusters K or the initial cluster centroids). This makes the clustering results very sensitive to those parameters, which not only increases the burden on developers but also greatly affects the accuracy of the clustering results.

Optimization goals

  1. Scalability:
    When the number of objects to be clustered grows from hundreds to millions, we want the accuracy of the final clustering results to remain consistent.

  2. Ability to handle different types of attributes:
    Some clustering algorithms can only handle numerical attributes, but in practical applications we often encounter other types of data, such as binary or categorical data. Although these other types can be converted into numerical data during preprocessing, the conversion often costs clustering efficiency or clustering accuracy.

  3. Discovering clusters of arbitrary shape:
    Many clustering algorithms quantify the similarity between instances based on distance (such as Euclidean or Manhattan distance). With this approach we often find only spherical or convex clusters of similar size and density, but in many scenarios the shape of a cluster may be arbitrary.

  4. Minimal knowledge requirements for initialization parameters:
    Many algorithms require developers to provide certain parameters during the analysis process (such as the expected number of clusters K or the initial cluster centroids), which makes the clustering results very sensitive to those parameters. This not only increases the burden on developers but also greatly affects the accuracy of the clustering results.

  5. Ability to deal with noisy data:
    Noisy data can be understood as interference data that affects the clustering results. Its presence can "distort" the clustering results and ultimately lead to low-quality clustering.

  6. Incremental clustering and insensitivity to input order:
    Some clustering algorithms cannot insert newly added data into existing clustering results. Input-order sensitivity means that, for a given set of data objects, presenting the objects in different orders can produce substantially different clustering results.

  7. High dimensionality:
    Some algorithms are only suitable for 2- or 3-dimensional data and handle high-dimensional data poorly, because data in a high-dimensional space may be very sparse and highly skewed.

  8. Constraint-based clustering:
    In practical applications it may be necessary to cluster under various conditions, because the same clustering algorithm can produce different results in different application scenarios. Finding a data grouping that both satisfies specific constraints and has good clustering properties is very challenging. The hardest problem here is identifying the "specific constraints" implicit in the problem we want to solve, and deciding which algorithm best "fits" those constraints.

  9. Interpretability and usability:
    We hope that the clustering results can be explained with concrete semantics and knowledge, and related to actual application scenarios.

Classification

Clustering algorithms are commonly categorized by the criterion they use to group points: distance, density, or interconnectivity.

Understanding Clustering Algorithms from the Perspective of Information Theory

Intuitively, we wish to achieve two contradictory goals.
1. On the one hand, we want the mutual information between the document variable and the cluster variable to be as small as possible, which reflects our desire for strong compression of the original data.
2. On the other hand, we want the mutual information between the cluster variable and the word variable to be as large as possible, which reflects the goal of preserving document information (represented by the occurrence of words in documents). This generalizes the minimal sufficient statistic of parametric statistics to arbitrary distributions.
It is usually very difficult to solve the optimization problem under the information-bottleneck criterion; the solution approach is similar in spirit to the EM algorithm.
3. The larger the mutual information, the stronger the correlation between the two variables. Markov chains can likewise be understood from an information-theoretic perspective.
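The trade-off described above is usually written as the information-bottleneck Lagrangian; as a sketch in standard IB notation (here $X$ is the document variable, $Y$ the word variable, $T$ the cluster variable, and $\beta$ the multiplier that trades compression against preservation; these symbols are not from the article itself):

$$\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)$$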

K-Means algorithm (K-means clustering) - clustering based on hard partitioning

Before studying the details of the K-means algorithm, we need to understand some problems inherent in K-means:

1. The objective function of the k-means algorithm is monotonically non-increasing over iterations (that is, each iteration at least does not make the result worse), but the k-means algorithm itself gives no theoretical guarantee on the number of iterations needed to reach convergence.
2. There is no nontrivial guarantee on the gap between the objective value returned by the algorithm and the minimum possible value of the objective; in fact, k-means may converge to a local minimum. To improve the results of k-means, it is common to run the algorithm multiple times with different random initial centers and select the best result (see the sketch after this list). In addition, some unsupervised algorithms can be used as a pre-processing step for k-means to select the initial centers.
3. The "best" clustering on the training set according to the sum-of-squares-distance criterion will inevitably select as many clusters as there are data points! Because the loss is 0 at this time, in order to suppress this tendency, it is necessary to apply the MDL criterion to punish the model structure complexity , and to seek a balance between the model complexity and the optimization of the loss target.
Process:
1. Choose the number of cluster centers k.
2. Select the initial cluster centers (commonly generated at random).
3. Assign the remaining sample points to the nearest center according to the chosen distance metric,
   for example the Lp (Minkowski) distance $d_p(x, y) = \left( \sum_{i} |x_i - y_i|^p \right)^{1/p}$.
4. Determine the condition for convergence (when to stop iterating); a minimal sketch of these steps is given below.
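A minimal sketch of this process in plain NumPy, assuming Euclidean distance, a made-up stopping tolerance, and ignoring the rare empty-cluster case:

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k initial centers at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers essentially no longer move.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```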
K-means, as an unsupervised learning algorithm, cannot guarantee that the resulting "clusters" have "practical significance": the groups it produces may only be sets of points that are close in Euclidean space and do not necessarily belong to the same real category. On the other hand, the result of K-means is strongly correlated with the value of K; if we pass in an "unreasonable" K, K-means may overfit and finally produce a "wrong" grouping.
Application:
After complex data is clustered, the grouped data can be used to achieve data reduction and dimensionality reduction (for example, 96615 pixels can be clustered down to 64 representative values). The k-means++ algorithm solves the initialization problem to a certain extent. The basic idea of k-means++ seed selection is that the initial cluster centers should be as far apart from each other as possible: the first center is still chosen at random, but each subsequent center is chosen preferentially among the points that are far from the centers already picked.
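A hedged sketch of the k-means++ seeding idea (each new center is drawn with probability proportional to its squared distance to the nearest already-chosen center); the function name and data layout are assumptions for the example:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Points far from existing centers are more likely to be picked.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```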

Linkage-based clustering model

Linkage-based clustering algorithms are agglomerative: at the start the data is completely fragmented (every point is its own cluster), and larger and larger clusters are then built up step by step. If we do not add a stopping rule, the result of the linkage algorithm can be described as a dendrogram of the clustering system, that is, a tree of subsets of the domain whose leaf nodes are singleton sets and whose root node is the entire domain.
Common stopping criteria include:
Fixed number of classes: fix a parameter k and stop clustering when the number of clusters reaches k. Using this stopping criterion requires strong domain knowledge about our scenario, that is, knowing in advance how many clusters are needed.
Distance upper bound: set a maximum upper bound on the distance between domain subsets. If in some iteration all pairwise distances between clusters exceed this threshold, stop clustering.
If there is no stopping criterion, only one class (the entire domain) is left at the end.
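A hedged sketch of both stopping criteria using scikit-learn's AgglomerativeClustering; the toy data, linkage choice, and threshold value are assumptions for the example:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (50, 2)),
               rng.normal((4, 4), 0.3, (50, 2))])

# Stopping criterion 1: a fixed number of clusters k.
fixed_k = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

# Stopping criterion 2: an upper bound on the merge distance
# (n_clusters must be None when distance_threshold is used).
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, linkage="single"
).fit(X)

print(fixed_k.n_clusters_, by_distance.n_clusters_)
```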

DBSCAN (Density-Based Spatial Clustering of Application with Noise) - Density-Based Clustering Algorithm

A fundamental difference between density-based clustering methods and other methods is that they are based on density rather than on various distance measures. This lets them overcome the disadvantage of distance-based algorithms, which can only find "circle-like" (convex) clusters.
The guiding idea of DBSCAN is:
use the number of neighbors within the ε-neighborhood of a point to measure the density of the space around that point; as long as the density of points in a region is greater than a certain threshold, the region is added to the cluster nearest to it.
DBSCAN can therefore find oddly-shaped clusters, and it is not necessary to know the number of clusters in advance. Note that
a core point lies inside a cluster and definitely belongs to a specific cluster; a noise point is interference data in the data set and belongs to no cluster; a border point is a special kind of point located on the edge of one or several clusters: it may belong to one cluster or to another, and its cluster membership is not clear-cut.
(Figure: determining cluster centers by depicting the minimum and maximum distances.)
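A hedged sketch with scikit-learn's DBSCAN; the eps and min_samples values and the toy data are assumptions for the example, and points labeled -1 correspond to the noise points described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers (made-up data).
X = np.vstack([rng.normal((0, 0), 0.2, (100, 2)),
               rng.normal((3, 3), 0.2, (100, 2)),
               rng.uniform(-2, 5, (10, 2))])

# eps is the neighborhood radius, min_samples the density threshold.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_            # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))
```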

SOM (Self-Organizing Maps) - model-based clustering

Model-based methods assume a (pre-specified) model for each cluster and then search for data that fit this model well. Such a model might be a density distribution function of the data points in space, or something else. One of its underlying assumptions is that the target data set is generated by a series of probability distributions.
There are usually two directions to try: statistical approaches and neural-network approaches.
Features:

  1. Order-preserving mapping: maps the sample pattern classes of the input space onto the output layer in an orderly manner.
  2. Data compression: the SOM network has a clear advantage in projecting samples from a high-dimensional space into a low-dimensional space while keeping the topological structure unchanged. No matter how many dimensions the input sample space has, its patterns can be mapped to some region of the SOM output layer. After the SOM network is trained, similar samples from the high-dimensional space are mapped to nearby positions in the output.
  3. Feature extraction: in the mapping from high-dimensional samples to the low-dimensional space, the output layer of the SOM network acts as a low-dimensional feature space.
  4. The training process is similar to k-means, but SOM maintains a weight model; this is how the high-dimensional to low-dimensional mapping is realized, and the weight model can also be used for later learning and for visualizing high-dimensional data (a minimal update-rule sketch follows this list).
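A minimal sketch of the SOM weight-update rule in plain NumPy; the grid size, learning rate, and neighborhood width are assumptions made for the example:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iters=1000, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = X.shape[1]
    # Weight model: one weight vector per output-layer node.
    weights = rng.random((h, w, dim))
    # Grid coordinates of every node, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for t in range(n_iters):
        x = X[rng.integers(len(X))]
        # Best matching unit: the node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighborhood around the BMU on the 2-D output grid.
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        hfunc = np.exp(-grid_d2 / (2 * sigma ** 2))
        # Decaying learning rate; pull neighboring weights toward x.
        alpha = lr * (1 - t / n_iters)
        weights += alpha * hfunc[..., None] * (x - weights)
    return weights
```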

EM algorithm
The EM algorithm is an iterative algorithm for maximum-likelihood estimation, or maximum a posteriori estimation, of the parameters of probability models that contain hidden (latent) variables.
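EM-based clustering is commonly illustrated with Gaussian mixture models; a hedged sketch using scikit-learn's GaussianMixture, where the component count and toy data are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.5, (150, 2)),
               rng.normal((4, 1), 0.8, (150, 2))])

# GaussianMixture fits the mixture parameters with the EM algorithm;
# predict() then gives a model-based cluster assignment per point.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
print(gm.means_)      # estimated component means
print(gm.weights_)    # estimated mixing proportions
```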
GNG: Growing Neural Gas Network

Summary:
Clustering:
A sample has many features, but the clustering target y generally has fewer dimensions than the features. For the compressed representation y', the mutual information between y' and the sample information we want to preserve should be as large as possible, while the mutual information with the remaining sample features should be small; y' can also be fused with other sample features using factor analysis. There are two main lines of thinking.
Clustering is unsupervised, so the main directions of thinking are learning (neural networks, model-based methods), iteration (k-means), hierarchy (trees), and growth (GNG), used to build a large framework whose purpose is continuous correction. However, overfitting can also occur (penalized via the MDL principle), and many algorithms rely on empirically chosen reference values to decide when to stop.
For features, the cluster centers or classification criteria are determined from three angles: distance (Manhattan distance, Euclidean distance, flexible distance/statistical functions such as a Gaussian kernel, etc.), density, and probability.


Source: blog.csdn.net/Carol_learning/article/details/104107647