08 Unsupervised Learning - Clustering

1. What is a clustering task?

Category: Unsupervised Learning
Purpose: To reveal the inherent structure and patterns of the data by learning from unlabeled training samples, and to provide a basis for further data analysis.

1.1 K-Means Clustering

Steps:

  • Randomly select k samples as the initial mean vectors (k is the number of clusters, chosen in advance)
  • Compute the distance from each sample point to every mean vector and assign the point to the cluster whose mean is closest
  • Recompute the mean vector (center) of each cluster, then repeat the second step until convergence
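A minimal NumPy sketch of these three steps (the function and parameter names here are my own, for illustration only):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means following the three steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k samples as the initial mean vectors
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each sample to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster's mean; stop when nothing moves
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.rand(200, 2), k=3)
```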

Distance calculation:
A similarity measure may or may not be a distance metric; if it is a distance metric, it must satisfy the following properties:

  • Non-negativity: dist(x,y)>=0
  • Identity: dist(x,y)=0 if and only if x=y
  • Symmetry: dist(x,y) = dist(y,x)
  • Triangle inequality: dist(x,z) <= dist(x,y)+dist(y,z)

1.1.1 Manhattan distance

$$\mathrm{dist}_{\mathrm{man}}(x_i, x_j) = \left\| x_i - x_j \right\|_1 = \sum_{u=1}^{n} \left| x_{iu} - x_{ju} \right|$$

1.1.2 Euclidean distance

$$\mathrm{dist}_{\mathrm{ed}}(x_i, x_j) = \left\| x_i - x_j \right\|_2 = \sqrt{\sum_{u=1}^{n} \left| x_{iu} - x_{ju} \right|^2}$$

1.1.3 Chebyshev distance

The Chebyshev distance is defined as the maximum difference between two vectors in any coordinate dimension. In other words, it is the maximum distance along an axis. The Chebyshev distance is often referred to as the chessboard distance because the minimum number of moves for a chess king to go from one square to another is equal to the Chebyshev distance.

$$D(x, y) = \max_i \left| x_i - y_i \right|$$

Chebyshev distance is often used for specific use cases, which makes it difficult to use as a general distance measure like Euclidean distance or cosine similarity. Therefore, only use it when you are sure it is suitable for your use case.

1.1.4 Minkowski distance

Given samples $x_i = (x_{i1}; x_{i2}; \ldots; x_{in})$ and $x_j = (x_{j1}; x_{j2}; \ldots; x_{jn})$, the most commonly used distance is the Minkowski distance.

$$\mathrm{dist}_{\mathrm{mk}}(x_i, x_j) = \left( \sum_{u=1}^{n} \left| x_{iu} - x_{ju} \right|^p \right)^{\frac{1}{p}}$$

  • When p = 1, the Minkowski distance reduces to the Manhattan distance
  • When p = 2, the Minkowski distance reduces to the Euclidean distance
  • When p → ∞, the Minkowski distance reduces to the Chebyshev distance
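As a quick numerical check of these three special cases, a small NumPy sketch (the example vectors are arbitrary):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two vectors for a finite order p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

print(minkowski(x, y, 1), np.sum(np.abs(x - y)))   # p = 1: Manhattan distance
print(minkowski(x, y, 2), np.linalg.norm(x - y))   # p = 2: Euclidean distance
print(np.max(np.abs(x - y)))                       # limit p -> infinity: Chebyshev distance
```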

1.1.5 Cosine Similarity

Cosine similarity is often used to counteract the problems of Euclidean distance in high dimensions. It is the cosine of the angle between two vectors.

Two vectors pointing in exactly the same direction have a cosine similarity of 1, while two vectors pointing in opposite directions have a cosine similarity of -1. Note that their magnitude does not matter, since this is purely a measure of direction.

$$D(x, y) = \cos(\theta) = \frac{x \cdot y}{\left\| x \right\| \left\| y \right\|}$$

Use case: Cosine similarity can be used when we do not care about the magnitude of high-dimensional data vectors. For example, in text analytics this measure is often used when the data is represented as word counts: a word occurring more frequently in one document than in another does not necessarily mean that document is more relevant to the word; the documents may simply differ in length, so the raw counts matter less. In such cases we are better off using cosine similarity, which ignores magnitude.
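A minimal sketch of the formula on toy word-count vectors (the vectors below are made up for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: x.y / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# The second "document" is just a ten-times longer copy of the first,
# so the Euclidean distance is large but the cosine similarity is 1.
doc_a = np.array([2, 0, 1, 3])
doc_b = np.array([20, 0, 10, 30])
print(cosine_similarity(doc_a, doc_b))   # 1.0 -- magnitude is ignored
```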

1.1.6 Hamming Distance

The Hamming distance is the number of positions at which the values of two vectors differ. It is usually used to compare two binary strings of equal length, and it can also be applied to strings to measure their similarity by counting the number of differing characters.

Disadvantage: the Hamming distance is awkward to use when the two vectors are not of equal length.

Use Cases: Typical use cases include error correction/detection when data is transmitted over a computer network. It can be used to determine the number of distortions in a binary word as a way of estimating the error. Additionally, you can also use the Hamming distance to measure the distance between categorical variables.
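A short plain-Python sketch (the example bit strings and category lists are illustrative):

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(u != v for u, v in zip(a, b))

print(hamming("1011101", "1001001"))                   # 2 differing bits
print(hamming(["red", "S", "A"], ["red", "M", "A"]))   # 1 differing categorical value
```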

1.2 Density clustering (Density-based Spatial Clustering of Applications with Noise)

Density-based clustering assumes that the cluster structure can be determined by how tightly the samples are distributed.

Density clustering algorithms examine the connectivity between samples from the perspective of sample density, and keep expanding clusters along density-connected samples to obtain the final clustering result.

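As a concrete density-based example, a minimal sketch using scikit-learn's DBSCAN (the eps and min_samples values are arbitrary and would need tuning on real data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that k-means tends to split badly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points
```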

1.3 Hierarchical clustering (hierarchical clustering)

Attempts to divide the data set at different levels to form a tree-like clustering structure. The data set can be divided using a "bottom-up" aggregation strategy or a "top-down" splitting strategy.

AGNES is a hierarchical clustering algorithm that uses a bottom-up aggregation strategy. It first treats each sample in the data set as an initial cluster; then, at each step, it finds the two closest clusters and merges them, repeating the process until the preset number of clusters is reached.

AGNES algorithm steps:
(1) Initialization: each sample is treated as its own cluster
(2) Compute the distance between every pair of clusters, find the two closest clusters, and merge them
(3) Repeat step 2 until the distance between the two closest clusters exceeds a threshold, or the number of clusters reaches the specified value; the algorithm then terminates
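This kind of bottom-up merging is what scikit-learn's AgglomerativeClustering performs; a minimal sketch (the toy data and the choice of average linkage are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(100, 2)   # toy data

# Start from 100 singleton clusters and keep merging the closest pair
# until only 3 clusters remain; "average" linkage measures the distance
# between two clusters as the mean distance between their members.
agnes = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agnes.fit_predict(X)
print(np.bincount(labels))   # sizes of the 3 clusters
```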

DIANA algorithm steps:
(1) Initialization: all samples are placed in a single cluster
(2) Within a cluster, compute the distance between every pair of samples, find the two farthest sample points a and b, and use a and b as the centers of two new clusters
(3) Assign each remaining sample of the original cluster to whichever of a and b is closer
(4) Repeat steps 2 and 3 until the distance between the two farthest samples within any cluster falls below a threshold, or the number of clusters reaches the specified value; the algorithm then terminates
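scikit-learn has no DIANA implementation, so the following is only a simplified NumPy sketch of the splitting steps above, stopping when a requested number of clusters is reached rather than on a distance threshold:

```python
import numpy as np

def pairwise(A):
    """All pairwise Euclidean distances between the rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

def diana(X, k):
    clusters = [np.arange(len(X))]   # step 1: every sample in a single cluster
    while len(clusters) < k:
        # Split the cluster whose two farthest members are farthest apart
        diams = [pairwise(X[idx]).max() if len(idx) > 1 else 0.0 for idx in clusters]
        idx = clusters.pop(int(np.argmax(diams)))
        d = pairwise(X[idx])
        a, b = np.unravel_index(np.argmax(d), d.shape)   # step 2: farthest points a and b
        closer_to_b = d[:, b] < d[:, a]                  # step 3: assign to the nearer of a, b
        clusters.append(idx[~closer_to_b])
        clusters.append(idx[closer_to_b])
    labels = np.empty(len(X), dtype=int)
    for lab, idx in enumerate(clusters):
        labels[idx] = lab
    return labels

labels = diana(np.random.rand(60, 2), k=3)
```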

Application of Hierarchical Clustering in Bioinformatics

Hierarchical clustering is a practical clustering method used in data analysis across many fields.
In biomedical informatics, hierarchical clustering is often applied to protein sequence data and to gene expression data. Proteins with similar structures tend to have similar functions, so clustering proteins with similar functions together helps in studying protein function. Gene expression data clustering groups genes with similar expression profiles, called co-expressed genes; the biological functions of these genes can then be inferred from their co-expression, which is of great value for annotating new gene functions and for biological function research.

1.4 Gaussian mixture clustering model

Given a sample set $D = \{x_1, x_2, \ldots, x_m\}$ to be clustered into $k$ classes, we assume the samples follow a Gaussian mixture distribution:

$$p_M(x) = \sum_{i=1}^{k} \alpha_i \cdot p(x \mid \mu_i, \Sigma_i)$$

The first step is to initialize the parameters of the Gaussian mixture distribution: the mixing coefficients $\alpha_i$, the mean vectors $\mu_i$, and the covariance matrices $\Sigma_i$.

The second step is to compute the posterior probability that each sample was generated by each mixture component, i.e. the probability $p(z_j = i \mid x_j)$ that the observation $x_j$ was generated by the $i$-th component, denoted $\gamma_{ji}$:

$$\gamma_{ji} = \frac{\alpha_i \cdot p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)}$$

The third step is to compute the new model parameters:

$$\begin{aligned} \mu_i' &= \frac{\sum_{j=1}^{m} \gamma_{ji}\, x_j}{\sum_{j=1}^{m} \gamma_{ji}} \\ \Sigma_i' &= \frac{\sum_{j=1}^{m} \gamma_{ji} \left(x_j - \mu_i'\right)\left(x_j - \mu_i'\right)^{\mathrm{T}}}{\sum_{j=1}^{m} \gamma_{ji}} \\ \alpha_i' &= \frac{\sum_{j=1}^{m} \gamma_{ji}}{m} \end{aligned}$$

The fourth step repeats steps 2 and 3 with the new model parameters until the stopping condition is met.

The fifth step assigns each sample to a cluster according to $\lambda_j = \arg\max_{i \in \{1, 2, \ldots, k\}} \gamma_{ji}$; that is, each sample goes to the cluster of the component most likely to have generated it, and finally $k$ clusters are obtained.
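In practice these EM iterations are usually run through a library; a minimal sketch with scikit-learn's GaussianMixture (the toy data and the choice of k = 3 components are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(300, 2)    # toy data

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                    # runs the EM iterations (steps 1-4)
gamma = gmm.predict_proba(X)  # posterior responsibilities, one gamma_ji row per sample
labels = gmm.predict(X)       # step 5: arg max over the posteriors
print(gmm.weights_)           # estimated mixing coefficients alpha_i
```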

Summary:

  • Hierarchical clustering is good at discovering structure embedded in the data.
  • Density-based methods excel at finding an unknown number of clusters of similar density.
  • K-means looks for a "consensus" across the data set: it considers every point and uses that information to refine the clusters over a series of iterations.
  • Gaussian mixture models can handle clusters that overlap.

2. Performance metrics

What is a good clustering?

Purpose: 1. To evaluate the quality of the clustering results; 2. To establish the goal of optimization

Conclusion: samples within the same cluster should be as similar to each other as possible, and samples in different clusters should be as different as possible.

External metrics: compare the clustering results with some "reference model", called "external metrics".

Let $\lambda$ denote the cluster labels produced by the clustering and $\lambda^*$ the labels given by the reference model. For every sample pair, count:

$$\begin{aligned}
a &= |SS|, \quad SS = \left\{ (x_i, x_j) \mid \lambda_i = \lambda_j, \lambda_i^* = \lambda_j^*, i < j \right\},\\
b &= |SD|, \quad SD = \left\{ (x_i, x_j) \mid \lambda_i = \lambda_j, \lambda_i^* \neq \lambda_j^*, i < j \right\},\\
c &= |DS|, \quad DS = \left\{ (x_i, x_j) \mid \lambda_i \neq \lambda_j, \lambda_i^* = \lambda_j^*, i < j \right\},\\
d &= |DD|, \quad DD = \left\{ (x_i, x_j) \mid \lambda_i \neq \lambda_j, \lambda_i^* \neq \lambda_j^*, i < j \right\}
\end{aligned}$$
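A small sketch that counts the four pair sets for a predicted labeling λ and a reference labeling λ* (the example labelings below are made up); from these counts, standard external indices such as the Jaccard coefficient JC = a/(a+b+c) and the Rand index RI = 2(a+d)/(m(m-1)) follow directly:

```python
from itertools import combinations

def pair_counts(pred, ref):
    """Count |SS|, |SD|, |DS|, |DD| for two labelings of the same samples."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):   # all pairs with i < j
        same_pred = pred[i] == pred[j]               # lambda_i == lambda_j ?
        same_ref = ref[i] == ref[j]                  # lambda_i* == lambda_j* ?
        if same_pred and same_ref:
            a += 1
        elif same_pred:
            b += 1
        elif same_ref:
            c += 1
        else:
            d += 1
    return a, b, c, d

pred = [0, 0, 1, 1, 2]
ref  = [0, 0, 1, 2, 2]
a, b, c, d = pair_counts(pred, ref)
m = len(pred)
print("Jaccard coefficient:", a / (a + b + c))
print("Rand index:", 2 * (a + d) / (m * (m - 1)))
```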


Origin blog.csdn.net/qq_45801179/article/details/132392930