A detailed explanation of 4 clustering algorithms and visualization (Python)

This article uses stock price time series data for 20 companies and looks at four different ways of clustering those companies based on the correlation between their stock prices.

The companies are: Apple (AAPL), Amazon (AMZN), Facebook (META), Tesla (TSLA), Alphabet (Google) (GOOGL), Shell (SHEL), Suncor Energy (SU), Exxon Mobil Corporation (XOM), Lululemon (LULU), Walmart (WMT), Carter's (CRI), Children's Place (PLCE), TJX Companies (TJX), Victoria's Secret & Co (VSCO), Macy's (M), Wayfair (W), Dollar Tree (DLTR), CVS Caremark (CVS), Walgreens (WBA), Curaleaf Holdings Inc. (CURLF)

Our DataFrame df_combined contains 413 days of stock prices for the above companies with no missing data.

Goal

Our goal is to group these companies according to how closely related they are and to check whether the groupings make sense. For example, Apple, Amazon, Google, and Facebook are often considered technology stocks, while Suncor and Exxon are considered oil and gas stocks. We will check whether we can recover these classifications using only the correlation between the stock prices of these companies.

We cluster on correlations rather than on the stock prices themselves. Clustering on prices would group together companies whose shares trade at similar price levels, whereas here we want to group companies whose stock prices behave similarly. A simple way to achieve this is to use the correlations between stock prices.
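To make the difference between price level and price behavior concrete, here is a minimal sketch on synthetic data. The two series, their price levels, and the noise scales are assumptions made up for illustration; they are not taken from df_combined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared movement driving both stocks (a simple random walk)
common = rng.normal(0, 1, 250).cumsum()

# Stock A trades around $20, stock B around $900, but both follow the same moves
stock_a = 20 + common + rng.normal(0, 0.2, 250)
stock_b = 900 + 5 * common + rng.normal(0, 1.0, 250)

# Pearson correlation between the two price series
corr_ab = np.corrcoef(stock_a, stock_b)[0, 1]

print(f"mean price A: {stock_a.mean():.1f}")
print(f"mean price B: {stock_b.mean():.1f}")
print(f"correlation A vs B: {corr_ab:.3f}")
```

Grouping by price level would separate these two series, while grouping by correlation would put them together, which is the behavior we are after.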

Optimal number of clusters

Finding the number of clusters is a problem of its own. Methods such as the elbow method can be used to find the optimal number of clusters. In this work, however, we simply try to divide these firms into 4 clusters. Ideally, the four groups would be technology stocks, oil and gas stocks, retail stocks, and other stocks.
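For reference, the elbow method mentioned above could look roughly like the sketch below. It runs on synthetic blobs generated with make_blobs, not on the stock data, and the range of k values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (an assumption for illustration only)
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means for several values of k and record the inertia
# (sum of squared distances from each point to its cluster center)
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of k after which the inertia stops dropping sharply. Plotting k_values against inertias usually makes it easier to spot than reading the printed numbers.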

First get the correlation matrix for the data frame we have.

correlation_mat=df_combined.corr()

Define a utility function to display each cluster and the firms that belong to it.

# Utility function to print each cluster and the companies assigned to it
def print_clusters(df_combined, cluster_labels):
    cluster_dict = {}
    for i, label in enumerate(cluster_labels):
        if label not in cluster_dict:
            cluster_dict[label] = []
        cluster_dict[label].append(df_combined.columns[i])

    # Print the companies in each cluster
    for cluster, companies in cluster_dict.items():
        print(f"Cluster {cluster}: {', '.join(companies)}")

Method 1: K-means clustering method

K-means clustering is a popular unsupervised machine learning algorithm used to group similar data points based on the similarity of their features. The algorithm iteratively assigns each data point to the nearest cluster center, and then updates each center based on the points newly assigned to it, until convergence. We can use this algorithm to cluster our data based on the correlation matrix. Here each company is represented by its row of the correlation matrix, so companies with similar correlation profiles end up close together.

from sklearn.cluster import KMeans

# Perform k-means clustering with four clusters
clustering = KMeans(n_clusters=4, random_state=0).fit(correlation_mat)

# Print the cluster labels
cluster_labels=clustering.labels_
print_clusters(df_combined,cluster_labels)

The result of k-means clustering

As expected, Amazon, Facebook, Tesla, and Alphabet were grouped together, as were the oil and gas companies. Walmart and Macy's were also grouped together. However, Apple, a technology stock, ended up in the same cluster as Walmart.
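The assign-then-update loop that k-means relies on can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name kmeans_sketch, the random initialization, and the toy data are all assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = centers.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[c] = members.mean(axis=0)
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

In the article's setting, X would be the correlation matrix, with one row per company.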

Method 2: Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering algorithm that iteratively merges similar clusters to form larger clusters. The algorithm starts with an individual cluster for each object and then merges the two most similar clusters at each step.

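from sklearn.cluster import AgglomerativeClustering

# Number of clusters to form, as chosen above
n_clusters = 4

# metric='precomputed' expects distances, so convert correlation
# (a similarity) into a distance: highly correlated pairs are close
distance_mat = 1 - correlation_mat

# Perform hierarchical clustering
# (scikit-learn versions before 1.2 name this parameter affinity instead of metric)
clustering = AgglomerativeClustering(n_clusters=n_clusters,
                                     metric='precomputed',
                                     linkage='complete'
                                    ).fit(distance_mat)

# Display the cluster labels
print_clusters(df_combined, clustering.labels_)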

Hierarchical Clustering Results

These results are slightly different from what we get from k-means clustering. We can see that some oil and gas companies are placed in different clusters.
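To illustrate the merge procedure behind agglomerative clustering, here is a small sketch of complete-linkage merging on a precomputed distance matrix. The function agglomerative_sketch and the 4x4 correlation matrix are hypothetical. The brute-force search is meant to show the idea, not to match the efficiency of scikit-learn.

```python
import numpy as np

def agglomerative_sketch(dist, n_clusters):
    """Complete-linkage merging on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]  # start: one cluster per object
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical correlations: series 0 and 1 move together, as do 2 and 3
corr = np.array([[1.0, 0.9, 0.1, 0.2],
                 [0.9, 1.0, 0.0, 0.1],
                 [0.1, 0.0, 1.0, 0.8],
                 [0.2, 0.1, 0.8, 1.0]])

# Convert similarity to distance before clustering
labels = agglomerative_sketch(1 - corr, n_clusters=2)
print(labels)  # [0 0 1 1]
```

Note that the correlation matrix is converted to 1 - corr first, because the merge step looks for the smallest distance, while high correlation means high similarity.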

Method 3: AffinityPropagation clustering method

Affinity propagation clustering is a clustering algorithm that does not require the number of clusters to be specified in advance. It works by sending messages between pairs of data points and letting the data points automatically determine the number of clusters and the best cluster assignment. Affinity propagation clustering can effectively identify complex patterns in data, but it is also computationally expensive for large datasets.

from sklearn.cluster import AffinityPropagation

# Perform affinity propagation clustering with default parameters
clustering = AffinityPropagation(affinity='precomputed').fit(correlation_mat)

# Display the cluster labels
print_clusters(df_combined,clustering.labels_)

Affinity Propagation Clustering Results

Interestingly, this method found four clusters to be the optimal number of clusters for our data. In addition, we can observe that oil and gas companies are grouped together, and some technology companies are also grouped together.
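The message passing behind affinity propagation can be sketched as follows. This is a simplified version of the standard responsibility and availability updates, without the convergence checks a production implementation would have. The function name, the damping value, the toy points, and the preference of -1.0 on the diagonal are all assumptions for illustration.

```python
import numpy as np

def affinity_propagation_sketch(S, damping=0.7, n_iter=300):
    """Simplified affinity propagation on a similarity matrix S.

    The diagonal of S holds each point's 'preference' for being an exemplar.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    ind = np.arange(n)
    for _ in range(n_iter):
        # Responsibility: how much better k suits i than i's best alternative
        AS = A + S
        idx = AS.argmax(axis=1)
        first_max = AS[ind, idx]
        AS[ind, idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[ind, idx] = S[ind, idx] - second_max
        R = damping * R + (1 - damping) * R_new

        # Availability: accumulated support for k from the other points
        Rp = np.maximum(R, 0)
        Rp[ind, ind] = R[ind, ind]       # r(k, k) enters the sum unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[ind, ind].copy()    # a(k, k) is not capped at zero
        A_new = np.minimum(A_new, 0)
        A_new[ind, ind] = diag
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if len(exemplars) == 0:
        return exemplars, np.full(n, -1)
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels

# Toy 1-D points in two tight groups; similarity = negative squared distance
x = np.array([0.0, 0.1, 0.25, 5.0, 5.2, 5.3])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, -1.0)  # hypothetical preference value

exemplars, labels = affinity_propagation_sketch(S)
print("exemplars:", exemplars)
print("labels:", labels)
```

The number of clusters is never passed in; it emerges from the similarities and the preference values. With affinity='precomputed', the correlation matrix itself plays the role of S, and its diagonal acts as the preference unless the preference parameter is set.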

Method 4: DBSCAN clustering method

DBSCAN is a density-based clustering algorithm that groups points that are closely packed together. It does not require the number of clusters to be specified in advance, and it can identify clusters of arbitrary shape. The algorithm is robust to outliers in the data and automatically marks them as noise points.

from sklearn.cluster import DBSCAN

# metric='precomputed' expects non-negative distances, so convert
# correlation into a distance in the range [0, 2]
correlation_mat_pro = (1 - correlation_mat).clip(lower=0)

# Perform DBSCAN clustering with eps=0.5 and min_samples=5
clustering = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit(correlation_mat_pro)

# Print the cluster labels (a label of -1 marks noise points)
print_clusters(df_combined, clustering.labels_)

The result of DBSCAN clustering

Here, unlike affinity propagation, DBSCAN produced 5 clusters. It can also be seen that some clusters contain only 1 or 2 companies.
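The density idea behind DBSCAN (core points, cluster expansion, noise) can be sketched on a precomputed distance matrix as follows. The function dbscan_sketch, the 5x5 correlation matrix, and the parameter values are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def dbscan_sketch(dist, eps, min_samples):
    """Simplified DBSCAN on a precomputed distance matrix."""
    n = len(dist)
    # Neighborhood of each point, including the point itself
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)  # -1 marks noise until a cluster claims the point
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from core point i
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join but do not expand the cluster
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# Hypothetical correlations: series 0, 1, 2 move together; 3 and 4 are loners
corr = np.array([[1.0, 0.9,  0.8, 0.1, 0.0],
                 [0.9, 1.0,  0.85, 0.2, 0.1],
                 [0.8, 0.85, 1.0, 0.1, 0.2],
                 [0.1, 0.2,  0.1, 1.0, 0.3],
                 [0.0, 0.1,  0.2, 0.3, 1.0]])

labels = dbscan_sketch(1 - corr, eps=0.3, min_samples=3)
print(labels)  # [ 0  0  0 -1 -1]
```

Points that no cluster reaches keep the label -1, which is how DBSCAN reports noise. When counting clusters from the labels, that -1 group should not be counted as a cluster.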

Visualization

It may be useful to simultaneously examine the results of the four clustering methods above to gain insight into their performance. The easiest way to do this is to use a heatmap, with companies on the x-axis and clusters on the y-axis.

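import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_cluster_heatmaps(cluster_results, companies):

    # Extract the method names (keys) and label arrays (values) from the dictionary
    methods = list(cluster_results.keys())
    labels = list(cluster_results.values())

    # Build one cluster-by-company indicator matrix per method
    heatmaps = []
    for i in range(len(methods)):
        # Map labels to row indices 0..k-1 (sorted), so that DBSCAN's
        # noise label -1 gets its own row instead of wrapping around
        cluster_ids, rows = np.unique(labels[i], return_inverse=True)
        heatmap = np.zeros((len(cluster_ids), len(companies)))
        for j in range(len(companies)):
            heatmap[rows[j], j] = 1
        heatmaps.append(heatmap)

    # Plot the heatmaps in a 2x2 grid
    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))

    for i in range(len(methods)):
        row = i // 2
        col = i % 2
        sns.heatmap(heatmaps[i], cmap="Blues", annot=True, fmt="g", xticklabels=companies, ax=axs[row, col])
        axs[row, col].set_title(methods[i])

    plt.tight_layout()
    plt.show()

# cluster_results maps each method name to the label array from that method's run above
companies = df_combined.columns
plot_cluster_heatmaps(cluster_results, companies)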

Clustering results for all four methods

However, the above visualization is not very helpful when trying to compare the results of multiple clustering algorithms. It would be helpful to find a better way to represent this graph.
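One possible complement to the heatmaps, offered here only as a suggestion, is a numeric agreement score between pairs of methods. The sketch below computes the Rand index, the fraction of company pairs on which two clusterings agree (both place the pair together, or both place it apart). The label lists are hypothetical and are not the article's results.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs that two clusterings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical label assignments for five items from three methods
results = {
    "method_1": [0, 0, 1, 1, 2],
    "method_2": [1, 1, 0, 0, 0],
    "method_3": [0, 1, 1, 2, 2],
}

# Compare every pair of methods
scores = {}
for (name_a, lab_a), (name_b, lab_b) in combinations(results.items(), 2):
    scores[(name_a, name_b)] = rand_index(lab_a, lab_b)
    print(f"{name_a} vs {name_b}: {scores[(name_a, name_b)]:.2f}")
```

A score of 1.0 means the two methods produce the same grouping up to relabeling. DBSCAN's noise label -1 is treated here like any other cluster, which may overstate agreement among noise points.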

Conclusion

In this post, we explored four different methods for clustering 20 companies based on the correlation between their stock prices. The goal was to cluster these companies in a way that reflects their behavior rather than their price levels. We tried K-means, agglomerative, affinity propagation, and DBSCAN clustering, each with its own advantages and disadvantages. The results show that all four methods can cluster companies in a manner broadly consistent with their industry or sector, although some methods are more computationally expensive than others. Correlation-based clustering provides a useful alternative to clustering on stock prices, because it groups companies by how their prices behave rather than by the prices themselves.