Clustering using Gaussian mixture models

1. Description

        A Gaussian Mixture Model (GMM) is a cluster analysis technique based on probability density estimation. It assumes that the data points are generated from a mixture of several Gaussian distributions with different means and variances. It can provide effective clustering results in cases where simpler methods struggle.

2. Effectiveness of the K-means algorithm

        The K-means clustering algorithm effectively places a circular boundary around the center of each cluster. This works well when the clusters are roughly circular (isotropic), as in the following example.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(42)

def generate_circular(n_samples=500):
    # three isotropic (circular) Gaussian blobs
    X = np.concatenate((
        np.random.normal(0, 1, (n_samples, 2)),
        np.random.normal(5, 1, (n_samples, 2)),
        np.random.normal(10, 1, (n_samples, 2))
    ))
    return X

X = generate_circular()

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# boundaries of the cluster spheres
radii = [np.max(np.linalg.norm(X[kmeans_labels == i, :] - kmeans.cluster_centers_[i, :], axis=1))
         for i in range(3)]

# plot
fig, ax = plt.subplots(ncols=2, figsize=(10, 4))

ax[0].scatter(X[:, 0], X[:, 1])
ax[0].set_title("Data")

ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
ax[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
              marker='x', s=200, linewidth=3, color='r')
for i in range(3):
    ax[1].add_artist(plt.Circle(kmeans.cluster_centers_[i, :], radius=radii[i], color='r', fill=False, lw=2))
ax[1].set_title("K Means Clustering")

plt.show()
K-means clustering on circular (isotropic) clusters.

        However, this method may not work well when the clusters have other shapes, for example elongated or elliptical ones.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(42)

def generate_elliptic(n_samples=500):
    # three elongated (elliptical) Gaussian blobs
    X = np.concatenate((
        np.random.normal([0, 3], [0.3, 1], (n_samples, 2)),
        np.random.normal([2, 4], [0.3, 1], (n_samples, 2)),
        np.random.normal([4, 6], [0.4, 1], (n_samples, 2))
    ))
    return X

X = generate_elliptic()

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_cluster_centers = kmeans.cluster_centers_

# the radius of each cluster
kmeans_cluster_radii = [np.max(np.linalg.norm(X[kmeans_labels == i, :] - kmeans.cluster_centers_[i, :], axis=1))
         for i in range(3)]

# plot
fig, ax = plt.subplots(ncols=2, figsize=(10, 4))


ax[0].scatter(X[:, 0], X[:, 1])
ax[0].set_title("data")

ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
ax[1].scatter(kmeans_cluster_centers[:, 0], kmeans_cluster_centers[:, 1],
              marker='x', s=200, linewidth=3, color='r')
for i in range(3):
    circle = plt.Circle(kmeans_cluster_centers[i], kmeans_cluster_radii[i], color='r', fill=False)
    ax[1].add_artist(circle)
ax[1].set_title("k-means clustering")
plt.xlim(-4, 10) 
plt.ylim(-4, 10)
plt.show()
K-means clustering on elliptical clusters: the circular boundaries fit the elongated shapes poorly.

3. GMM, an extension of K-means

        GMM extends the K-means model by using Gaussian distributions to represent the clusters. Unlike K-means, GMM captures not only the mean but also the covariance of each cluster, which allows their ellipsoidal shapes to be modeled. To fit a GMM, we use the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed data. EM is similar to the K-means procedure, but it assigns data points to clusters with soft probabilities instead of hard assignments.

        At a high level, GMM combines multiple Gaussian distributions to model the data. Instead of assigning points to their closest centroid, a set of Gaussians is fit to the data, and parameters such as the mean, variance, and weight are estimated for each cluster. Once the parameters of each Gaussian are known, the probability that a given data point belongs to each cluster can be calculated.
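
        As a minimal sketch of these soft assignments (the toy data and variable names here are illustrative assumptions, not part of the original example), scikit-learn's GaussianMixture exposes them through predict_proba:

import numpy as np
from sklearn.mixture import GaussianMixture

# toy data: two well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(0)
X_toy = np.concatenate((rng.normal(0, 1, (50, 2)),
                        rng.normal(6, 1, (50, 2))))

gmm_toy = GaussianMixture(n_components=2, random_state=0).fit(X_toy)

print(gmm_toy.predict(X_toy[:3]))        # hard assignments, as K-means would give
print(gmm_toy.predict_proba(X_toy[:3]))  # soft probabilities; each row sums to 1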

        Each distribution is weighted by a weighting factor (π) to account for the different number of samples in each cluster. For example, if we have only 1,000 data points from the red cluster but 100,000 data points from the green cluster, the red cluster's distribution receives a correspondingly smaller weight, so that each component contributes to the overall distribution in proportion to its share of the data.
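
        To make the role of the weights concrete, here is a small sketch (the parameter values are assumed for illustration) that evaluates the weighted mixture density p(x) = Σk πk·N(x | μk, σk) with SciPy:

import numpy as np
from scipy.stats import multivariate_normal

# illustrative two-component mixture: weights reflect ~1,000 vs ~100,000 samples
weights = np.array([0.01, 0.99])
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]

def mixture_density(x):
    # weighted sum of the component Gaussian densities
    return sum(w * multivariate_normal(m, c).pdf(x)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([5.0, 5.0])))  # dominated by the heavily weighted component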

        The EM algorithm for fitting a GMM consists of two steps: expectation (E) and maximization (M).

        The first step, called the expectation step or E-step, consists of computing the expectation of the component assignment Ck for each data point xi ∈ X, given the model parameters πk, μk, and σk.

        The second step, called the maximization step or M-step, consists of maximizing the expectations calculated in the E-step with respect to the model parameters. This step involves updating the values πk, μk, and σk.

        The entire iterative process is repeated until the algorithm converges, giving a maximum likelihood estimate. Intuitively, the algorithm works because knowing the component assignments Ck for each xi makes it easy to solve for πk, μk, and σk, while knowing πk, μk, and σk makes it easy to infer p(Ck | xi).

        The expectation step corresponds to the latter case, while the maximization step corresponds to the former. Therefore, maximum likelihood estimates can be computed efficiently by alternating between which quantities are assumed fixed and which are re-estimated.
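
        Written out explicitly (these formulas do not appear in the original text, but are the standard form implied by the description above), the E-step computes the responsibilities

$$ p(C_k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j)}, $$

        and the M-step re-estimates the parameters from them:

$$ N_k = \sum_i p(C_k \mid x_i), \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i p(C_k \mid x_i)\, x_i, \qquad \sigma_k = \frac{1}{N_k} \sum_i p(C_k \mid x_i)\,(x_i - \mu_k)(x_i - \mu_k)^\top. $$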

The algorithm

  1. Initialize the means (μk), covariance matrices (σk), and mixing coefficients (πk) with random or predefined values.
  2. Compute the component assignments (Ck) for all data points and clusters (the E-step).
  3. Re-estimate all parameters using the current component assignments (Ck) (the M-step).
  4. Compute the log-likelihood of the data.
  5. Evaluate the convergence criterion.
  6. Stop the algorithm if the log-likelihood (or the set of parameters) has converged to within the chosen threshold; otherwise, return to step 2.

It should be noted that this algorithm is guaranteed to converge only to a local optimum, which is not necessarily the global optimum. Therefore, starting the algorithm from different initializations may produce different solutions.
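
        The following is a minimal from-scratch sketch of this EM loop in NumPy and SciPy, included only to make the steps above concrete (the function name em_gmm and the small regularization constant are assumptions; the example in the next section uses scikit-learn instead):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=3, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # step 1: initialize means, covariances, and mixing coefficients
    means = X[rng.choice(n, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    weights = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # step 2 (E-step): responsibilities p(Ck | xi)
        dens = np.column_stack([w * multivariate_normal(m, c).pdf(X)
                                for w, m, c in zip(weights, means, covs)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # step 3 (M-step): re-estimate parameters from the responsibilities
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # steps 4-6: log-likelihood and convergence check
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, covs, resp

        Run on the elliptical data from section 2, em_gmm(X, K=3) should return weights close to 1/3 each and elongated covariance matrices, analogous to the gmm.weights_ and gmm.covariances_ printed at the end of the next section.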

4. Python code

from sklearn.mixture import GaussianMixture

Parameters:

  • n_components: the number of mixture components (clusters).
  • covariance_type: determines the type of covariance matrix used by the GMM. It can take the following values: "full": each mixture component has its own general covariance matrix; "tied": all mixture components share the same general covariance matrix; "diag": each mixture component has its own diagonal covariance matrix; "spherical": each mixture component has a single variance value, resulting in a spherical covariance matrix.
  • tol: the convergence threshold of the EM algorithm; it stops when the improvement in the log-likelihood falls below this threshold.
  • reg_covar: a regularization term added to the diagonal of the covariance matrices to ensure numerical stability during the calculations; it helps prevent problems with poorly conditioned or singular covariance matrices.
  • max_iter: the maximum number of EM iterations.
  • n_init and init_params: n_init sets how many initializations are run (the best result is kept), while init_params controls how the model parameters are initialized: "kmeans" estimates the initial means with the K-means algorithm, and "random" starts from a random initialization from which the means, covariances, and mixing coefficients are derived.
  • weights_init: manually specify the initial weight (mixing coefficient) of each component.
  • means_init: manually specify the initial mean vector of each component.
  • precisions_init: manually specify the initial precision matrix (the inverse of the covariance matrix) of each component.
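
        As a brief sketch of how these parameters fit together (the specific values below are arbitrary choices for illustration, not recommendations from the original text):

from sklearn.mixture import GaussianMixture

gmm_example = GaussianMixture(
    n_components=3,          # number of clusters
    covariance_type='full',  # one general covariance matrix per component
    tol=1e-3,                # stop when the log-likelihood gain falls below this
    reg_covar=1e-6,          # regularization added to the covariance diagonals
    max_iter=100,            # maximum number of EM iterations
    n_init=5,                # try 5 initializations and keep the best
    init_params='kmeans',    # initialize the means with K-means
    random_state=0,
)

        The full example below simply relies on the defaults (covariance_type='full', n_init=1) and compares the result with K-means on the elliptical data from section 2.
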
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

np.random.seed(42)  # for reproducibility, matching the earlier examples

def generate_elliptic(n_samples=500):
    # same three elongated (elliptical) Gaussian blobs as in section 2
    X = np.concatenate((
        np.random.normal([0, 3], [0.3, 1], (n_samples, 2)),
        np.random.normal([2, 4], [0.3, 1], (n_samples, 2)),
        np.random.normal([4, 6], [0.4, 1], (n_samples, 2))
    ))
    return X

X = generate_elliptic()

# k-means clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans_labels = kmeans.labels_

# Gaussian mixture clustering
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)

# Plot the clustering results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
axs[0].set_title('K-means clustering')

axs[1].scatter(X[:, 0], X[:, 1], c=gmm_labels)
axs[1].set_title('Gaussian mixture clustering')

plt.show()

K-means vs. Gaussian mixture clustering on the elliptical data.

print("Weights: ", gmm.weights_)
print("Means: ", gmm.means_)
print("Covariances: ", gmm.covariances_)
print("Precisions: ", gmm.precisions_)

"""
Weights:  [0.33300331 0.33410451 0.33289218]
Means:  [[ 1.98104152e+00  3.95197560e+00]
 [ 3.98369464e+00  5.93920471e+00]
 [-4.67796574e-03  2.97097723e+00]]
Covariances:  [[[ 0.08521068 -0.00778594]
  [-0.00778594  1.01699345]]

 [[ 0.16066983 -0.01669341]
  [-0.01669341  1.0383678 ]]

 [[ 0.09482093  0.00709653]
  [ 0.00709653  1.03641711]]]
Precisions:  [[[11.74383346  0.08990895]
  [ 0.08990895  0.98397883]]

 [[ 6.23435734  0.10022716]
  [ 0.10022716  0.9646612 ]]

 [[10.55160153 -0.07224865]
  [-0.07224865  0.96535719]]]
"""

5. Conclusion

        GMMs are particularly useful when dealing with complex data distributions, heterogeneous datasets, or tasks involving density estimation. They provide flexibility in modeling and capturing the underlying structure of data, making them invaluable tools in a variety of machine learning and data analysis tasks.

Origin blog.csdn.net/gongdiwudu/article/details/132699051