1. Description
The Gaussian Mixture Model (GMM) is a cluster analysis technique based on probability density estimation. It assumes that the data points are generated from a mixture of several Gaussian distributions, each with its own mean and variance. It can provide effective clustering results in some cases.
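In symbols, this assumption can be written as a weighted sum of Gaussian densities. The notation below is the conventional one and is an assumption here, since the text has not defined it yet: K is the number of components, π_k the mixing weight, μ_k the mean, and Σ_k the covariance matrix of component k.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```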
2. Effectiveness of the K-means algorithm
The K-means clustering algorithm effectively places a circular boundary around the center of each cluster. This works well when the clusters are roughly circular.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
np.random.seed(42)
def generate_circular(n_samples=500):
    # three round blobs centered at (0, 0), (5, 5) and (10, 10)
    X = np.concatenate((
        np.random.normal(0, 1, (n_samples, 2)),
        np.random.normal(5, 1, (n_samples, 2)),
        np.random.normal(10, 1, (n_samples, 2))
    ))
    return X
X = generate_circular()
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
# boundaries of the cluster spheres
radii = [np.max(np.linalg.norm(X[kmeans_labels == i, :] - kmeans.cluster_centers_[i, :], axis=1))
for i in range(3)]
# plot
fig, ax = plt.subplots(ncols=2, figsize=(10, 4))
ax[0].scatter(X[:, 0], X[:, 1])
ax[0].set_title("Data")
ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
ax[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
              marker='x', s=200, linewidth=3, color='r')
for i in range(3):
    ax[1].add_artist(plt.Circle(kmeans.cluster_centers_[i, :], radius=radii[i], color='r', fill=False, lw=2))
ax[1].set_title("K Means Clustering")
plt.show()
However, this method may not work well when the clusters have other shapes, such as elongated or elliptical ones.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
np.random.seed(42)
def generate_elliptic(n_samples=500):
    # three blobs stretched along the y-axis (std 0.3-0.4 in x, 1 in y)
    X = np.concatenate((
        np.random.normal([0, 3], [0.3, 1], (n_samples, 2)),
        np.random.normal([2, 4], [0.3, 1], (n_samples, 2)),
        np.random.normal([4, 6], [0.4, 1], (n_samples, 2))
    ))
    return X
X = generate_elliptic()
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_cluster_centers = kmeans.cluster_centers_
# the radius of each cluster
kmeans_cluster_radii = [np.max(np.linalg.norm(X[kmeans_labels == i, :] - kmeans.cluster_centers_[i, :], axis=1))
for i in range(3)]
# plot
fig, ax = plt.subplots(ncols=2, figsize=(10, 4))
ax[0].scatter(X[:, 0], X[:, 1])
ax[0].set_title("data")
ax[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
ax[1].scatter(kmeans_cluster_centers[:, 0], kmeans_cluster_centers[:, 1],
              marker='x', s=200, linewidth=3, color='r')
for i in range(3):
    circle = plt.Circle(kmeans_cluster_centers[i], kmeans_cluster_radii[i], color='r', fill=False)
    ax[1].add_artist(circle)
ax[1].set_title("k-means clustering")
plt.xlim(-4, 10)
plt.ylim(-4, 10)
plt.show()
3. GMM, a more flexible model than K-means
GMM extends the K-means model by using Gaussian distributions to represent clusters. Unlike K-means, GMM captures not only the mean but also the covariance of each cluster, which allows ellipsoidal cluster shapes to be modeled. To fit the GMM, we use the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed data. EM is similar to K-means, but it assigns data points to clusters with soft probabilities instead of hard assignments.
At a high level, GMM combines multiple Gaussian distributions to model the data. Instead of identifying clusters based on their closest centroids, a set of k Gaussians is fit to the data, and parameters such as the mean, variance, and weight are estimated for each cluster. Once the parameters of each cluster are known, you can calculate, for any data point, the probability that it belongs to each cluster.
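As a minimal sketch of these soft assignments, the snippet below fits scikit-learn's `GaussianMixture` to two overlapping blobs and compares `predict_proba` (one probability per cluster) with `predict` (the most probable cluster). The synthetic data, the seed, and the chosen midpoint are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 2-D blobs (illustrative synthetic data)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(300, 2)),
    rng.normal([3, 0], 1.0, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per component, each row sums to 1
probs = gmm.predict_proba(X)
# Hard assignment: the index of the most probable component
labels = gmm.predict(X)

# A point between the two blobs receives a split probability
midpoint = np.array([[1.5, 0.0]])
print(gmm.predict_proba(midpoint))
```

K-means would only report the nearest centroid for the midpoint, whereas the GMM also reports how uncertain that choice is.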
Each distribution is weighted by a mixing coefficient (π) to account for the different number of samples in each cluster. For example, if we have only 1,000 data points from the red cluster but 100,000 data points from the green cluster, the red cluster's distribution receives a proportionally smaller weight, so that the overall mixture density reflects how much of the data each cluster accounts for.
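The effect of the weights can be checked on imbalanced synthetic data. This is a sketch under assumed cluster sizes (200 and 2,000 points, chosen only to keep the example fast); the fitted `weights_` should land close to the true proportions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Imbalanced data: 200 points in a small cluster, 2,000 in a large one
small = rng.normal(0.0, 1.0, size=(200, 2))
large = rng.normal(8.0, 1.0, size=(2000, 2))
X = np.vstack([small, large])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Sort so the smaller weight comes first, whatever the component order
weights = np.sort(gmm.weights_)
print(weights)  # compare with the true proportions 200/2200 and 2000/2200
```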
The GMM algorithm consists of two steps: expectation (E) and maximization (M).
The first step, called the expectation step or E-step, consists of computing the expectation of the component assignment Ck for each data point xi ∈ X, given the model parameters πk, μk, and σk.
The second step, called the maximization step or M-step, consists of maximizing the expectations calculated in the E-step with respect to the model parameters. This step updates the values of πk, μk, and σk.
The entire iterative process is repeated until the algorithm converges, giving a maximum likelihood estimate. Intuitively, this algorithm works because knowing the component assignments Ck for each xi makes it easy to solve for πk, μk, and σk, and knowing πk, μk, and σk makes it easy to infer p(Ck|xi).
The expectation step corresponds to the latter case (inferring p(Ck|xi) from fixed parameters), while the maximization step corresponds to the former (solving for the parameters given the assignments). By alternating between which quantities are held fixed, maximum likelihood estimates of the unknown values can be computed efficiently.
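The two steps can be written out explicitly. The formulas below are the standard EM updates for a Gaussian mixture; the symbols γ_ik (the probability p(Ck|xi)), N (the number of data points), and N_k are notation assumed here, and Σ_k stands for the covariance matrix that the text calls σk.

```latex
% E-step: probability that point x_i belongs to component k
\gamma_{ik} = p(C_k \mid x_i)
  = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

% M-step: re-estimate the parameters from the current \gamma_{ik}
N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \,
           (x_i - \mu_k)(x_i - \mu_k)^{\top}
```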
Algorithm
- Initialize the means (μk), covariance matrices (σk), and mixing coefficients (πk) with random or predefined values.
- E-step: compute the component assignment probabilities p(Ck|xi) for every data point and every cluster.
- M-step: re-estimate all parameters using the current component assignments.
- Compute the log-likelihood.
- Check the convergence criterion: stop if the change in the log-likelihood falls below a chosen threshold, or if the parameters stop changing. Otherwise, return to the E-step.
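The steps above can be sketched in a few lines of NumPy. This is an illustrative, unoptimized implementation and not scikit-learn's: the function name `fit_gmm_em`, its defaults, the initialization scheme, and the small `reg` term added to the covariances are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal


def fit_gmm_em(X, k, max_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Fit a k-component GMM with plain EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Initialization: random data points as means, data covariance for all
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X, rowvar=False) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities p(C_k | x_i), shape (n, k)
        dens = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)
        ])
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total

        # Log-likelihood under the parameters used in this E-step
        ll = np.log(total).sum()

        # M-step: re-estimate weights, means, and covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)

        # Convergence check on the log-likelihood improvement
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return weights, means, covs, ll


# Usage on synthetic elliptical clusters (assumed data, 300 points each)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([0, 3], [0.3, 1], size=(300, 2)),
    rng.normal([2, 4], [0.3, 1], size=(300, 2)),
    rng.normal([4, 6], [0.4, 1], size=(300, 2)),
])
weights, means, covs, ll = fit_gmm_em(X, k=3)
print(weights)
print(means)
```

The sketch evaluates densities directly rather than in log space, which is fine for small, well-scaled data but would underflow on harder problems; production code works with log-probabilities.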
Note that this algorithm is guaranteed to converge to a local optimum, but that local optimum is not necessarily the global optimum. As a result, different initializations may lead to different final configurations.
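One common way to reduce this sensitivity is to run EM from several starting points and keep the best fit, which scikit-learn exposes through `n_init`. The sketch below uses assumed synthetic data and `init_params="random"` to make the differences between runs more visible; `lower_bound_` is the log-likelihood bound of the retained fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 3], [0.3, 1], size=(300, 2)),
    rng.normal([2, 4], [0.3, 1], size=(300, 2)),
    rng.normal([4, 6], [0.4, 1], size=(300, 2)),
])

# One random initialization per fit: the result can vary with the seed
single_runs = [
    GaussianMixture(n_components=3, init_params="random", n_init=1,
                    random_state=seed).fit(X).lower_bound_
    for seed in range(5)
]
print(single_runs)

# n_init=10 runs EM from ten initializations and keeps the best fit
best = GaussianMixture(n_components=3, init_params="random", n_init=10,
                       random_state=0).fit(X)
print(best.lower_bound_)
```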
4. Python code
from sklearn.mixture import GaussianMixture
Parameters:
- n_components: the number of clusters (mixture components).
- covariance_type: determines the type of covariance matrix used by the GMM. It can take the following values:
  - full: each mixture component has its own general covariance matrix.
  - tied: all mixture components share the same general covariance matrix.
  - diag: each mixture component has its own diagonal covariance matrix.
  - spherical: each mixture component has its own single variance value, resulting in a spherical covariance matrix.
- tol: controls the convergence threshold of the EM algorithm. EM stops when the improvement in the log-likelihood falls below this threshold.
- reg_covar: a regularization term added to the diagonal of the covariance matrices to ensure numerical stability during calculations. It helps prevent problems with poorly conditioned or singular covariance matrices.
- max_iter: the maximum number of EM iterations.
- n_init: the number of initializations to perform; the best result is kept.
- init_params: controls the initialization of the model parameters. It can take values including:
  - kmeans: the initial parameters are estimated using the K-means algorithm.
  - random: the parameters are initialized randomly rather than from a K-means fit.
- weights_init: manually specify the initial weight (mixing coefficient) of each component.
- means_init: manually specify the initial mean vector for each component.
- precisions_init: manually specify the initial precision matrix (the inverse of the covariance matrix) for each component.
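To see how `covariance_type` changes the fitted model, the sketch below fits all four options to assumed synthetic data and records the shape of `covariances_` together with the BIC score from `GaussianMixture.bic` (lower is better). The dictionary names `shapes` and `bics` are introduced only for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 3], [0.3, 1], size=(300, 2)),
    rng.normal([2, 4], [0.3, 1], size=(300, 2)),
    rng.normal([4, 6], [0.4, 1], size=(300, 2)),
])

shapes, bics = {}, {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # covariances_ changes shape depending on covariance_type
    shapes[cov_type] = gmm.covariances_.shape
    bics[cov_type] = gmm.bic(X)
    print(cov_type, shapes[cov_type], round(bics[cov_type], 1))
```

With 3 components in 2 dimensions, `full` stores one 2x2 matrix per component, `tied` a single shared 2x2 matrix, `diag` two variances per component, and `spherical` one variance per component.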
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
def generate_elliptic(n_samples=500):
    # same elongated clusters as in section 2
    X = np.concatenate((
        np.random.normal([0, 3], [0.3, 1], (n_samples, 2)),
        np.random.normal([2, 4], [0.3, 1], (n_samples, 2)),
        np.random.normal([4, 6], [0.4, 1], (n_samples, 2))
    ))
    return X
X = generate_elliptic()
# k-means clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans_labels = kmeans.labels_
# Gaussian mixture clustering
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)
# Plot the clustering results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
axs[0].set_title('K-means clustering')
axs[1].scatter(X[:, 0], X[:, 1], c=gmm_labels)
axs[1].set_title('Gaussian mixture clustering')
plt.show()
K-means vs Gaussian. Image by the author.
print("Weights: ", gmm.weights_)
print("Means: ", gmm.means_)
print("Covariances: ", gmm.covariances_)
print("Precisions: ", gmm.precisions_)
"""
Weights: [0.33300331 0.33410451 0.33289218]
Means: [[ 1.98104152e+00 3.95197560e+00]
[ 3.98369464e+00 5.93920471e+00]
[-4.67796574e-03 2.97097723e+00]]
Covariances: [[[ 0.08521068 -0.00778594]
[-0.00778594 1.01699345]]
[[ 0.16066983 -0.01669341]
[-0.01669341 1.0383678 ]]
[[ 0.09482093 0.00709653]
[ 0.00709653 1.03641711]]]
Precisions: [[[11.74383346 0.08990895]
[ 0.08990895 0.98397883]]
[[ 6.23435734 0.10022716]
[ 0.10022716 0.9646612 ]]
[[10.55160153 -0.07224865]
[-0.07224865 0.96535719]]]
"""
5. Conclusion
GMMs are particularly useful when dealing with complex data distributions, heterogeneous datasets, or tasks involving density estimation. They provide flexibility in modeling and capturing the underlying structure of data, making them invaluable tools in a variety of machine learning and data analysis tasks.