Four machine learning clustering methods based on correlation

In this article, we use stock price time series data for 20 companies and look at four different ways to group these companies based on the correlation between their stock prices.

Apple (AAPL), Amazon (AMZN), Facebook (META), Tesla (TSLA), Alphabet (Google) (GOOGL), Shell (SHEL), Suncor Energy (SU), Exxon Mobil Corporation (XOM), Lululemon (LULU), Walmart (WMT), Carter's (CRI), Children's Place (PLCE), TJX Companies (TJX), Victoria's Secret & Co (VSCO), Macy's (M), Wayfair (W), Dollar Tree (DLTR), CVS Caremark (CVS), Walgreens (WBA), Curaleaf Holdings Inc. (CURLF)

Our DataFrame df_combined contains the stock prices of the above companies for 413 days, with no missing data.

Goal

Our goal is to group these companies based on correlation and examine the validity of these groupings. For example, Apple, Amazon, Google, and Facebook are often considered technology stocks, while Suncor and Exxon are considered oil and gas stocks. We will check whether we can recover these classifications using only the correlation between the stock prices of these companies.

We use correlation to classify these companies rather than the stock prices themselves. Clustering on raw prices would group together companies whose share prices happen to be at similar levels, whereas here we want to group companies by how their stock prices behave. A simple way to achieve this is to use the correlations between stock prices.

Optimal number of clusters

Finding the number of clusters is a problem of its own. There are methods, such as the elbow method, that can be used to find the optimal number of clusters. In this article, however, we simply try to classify these companies into 4 clusters. Ideally, these four groups would be technology stocks, oil and gas stocks, retail stocks, and other stocks.
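
For reference, the elbow method mentioned above can be sketched as follows. This is only an illustration on synthetic data: the three-factor toy series and the column names s0 to s8 are assumptions made for the example, not the article's df_combined. The idea is to fit k-means for several values of k and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for df_combined: 9 series driven by 3 hidden factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 3))
toy_prices = pd.DataFrame(
    {f"s{i}": factors[:, i % 3] + 0.3 * rng.normal(size=300) for i in range(9)}
)
toy_corr = toy_prices.corr()

# Fit k-means for k = 1..6 and record the inertia of each fit
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_corr).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.3f}")
```

Plotting inertias against ks and looking for the bend in the curve gives a rough, visual suggestion for the number of clusters.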

First, compute the correlation matrix of our DataFrame.

correlation_mat=df_combined.corr()
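
To illustrate why correlation captures behavior rather than price level, here is a small sketch on made-up data. The tickers CHEAP and PRICEY and the random-walk price path are hypothetical and exist only for this example. Two series that move together have a correlation of 1 even when their price levels differ by a large factor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = np.cumsum(rng.normal(size=250))  # a random-walk "price path"

toy = pd.DataFrame({
    "CHEAP": 10 + base,          # low price level
    "PRICEY": 500 + 20 * base,   # same moves, much higher price level
})

# DataFrame.corr() computes pairwise Pearson correlation between columns
toy_corr = toy.corr()
print(toy_corr.round(3))
```

Clustering on the raw prices would place these two series far apart, while clustering on correlation treats them as behaving identically.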

Define a utility function to display clusters and companies belonging to that cluster.

# Utility function to print each cluster and the companies assigned to it
def print_clusters(df_combined, cluster_labels):
    cluster_dict = {}
    for i, label in enumerate(cluster_labels):
        if label not in cluster_dict:
            cluster_dict[label] = []
        cluster_dict[label].append(df_combined.columns[i])

    # Print the companies in each cluster
    for cluster, companies in cluster_dict.items():
        print(f"Cluster {cluster}: {', '.join(companies)}")

Method 1: K-means clustering method

K-means clustering is a popular unsupervised machine learning algorithm that groups data points based on the similarity of their features. The algorithm iteratively assigns each data point to the nearest cluster center (centroid) and then updates each centroid from the points assigned to it, repeating until convergence. We can use this algorithm to cluster our data based on the correlation matrix.

from sklearn.cluster import KMeans

# Perform k-means clustering with four clusters
clustering = KMeans(n_clusters=4, random_state=0).fit(correlation_mat)

# Print the cluster labels
cluster_labels=clustering.labels_
print_clusters(df_combined,cluster_labels)

Results of k-means clustering

As expected, Amazon, Facebook, Tesla, and Alphabet were grouped together, as were the oil and gas companies. Walmart and Macy's were also placed in the same cluster. However, some tech stocks end up alongside retailers: Apple, for example, is clustered with Walmart.
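
The assign-and-update loop described in this section can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name kmeans_sketch, the random initialization, and the toy points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance of every point to every center, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers

# Two well-separated groups of four points each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

When the correlation matrix is passed to k-means, each row (one company's correlations with all the others) plays the role of one point in X.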

Method 2: Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering algorithm that iteratively merges similar clusters to form larger clusters. The algorithm starts with a separate cluster for each object and then merges the two most similar clusters at each step.

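from sklearn.cluster import AgglomerativeClustering

# Perform hierarchical clustering into four clusters.
# scikit-learn 1.2 renamed `affinity` to `metric`; on older versions use affinity='precomputed'.
# With 'precomputed', the input is interpreted as a distance matrix.
n_clusters = 4
clustering = AgglomerativeClustering(n_clusters=n_clusters,
                                     metric='precomputed',
                                     linkage='complete'
                                    ).fit(correlation_mat)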

# Display the cluster labels
print_clusters(df_combined,clustering.labels_)

The results of hierarchical clustering

These results are slightly different from what we got from k-means clustering. We can see that some oil and gas companies are placed in different clusters.
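
A precomputed matrix given to a hierarchical clustering routine is normally read as distances, where small values mean "similar", whereas a correlation matrix is a similarity. One common conversion is distance = 1 - correlation. The sketch below shows that conversion on synthetic data, using SciPy's hierarchical clustering functions rather than the scikit-learn class used in this section. The toy series and their names are assumptions made for the example, and it is not meant to reproduce the article's results.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic data: two groups of three series, each group driven by one factor
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
toy = pd.DataFrame({
    "a1": f1 + 0.1 * rng.normal(size=200),
    "a2": f1 + 0.1 * rng.normal(size=200),
    "a3": f1 + 0.1 * rng.normal(size=200),
    "b1": f2 + 0.1 * rng.normal(size=200),
    "b2": f2 + 0.1 * rng.normal(size=200),
    "b3": f2 + 0.1 * rng.normal(size=200),
})

# Turn the similarity (correlation) into a distance: 0 = identical behavior
dist = 1.0 - toy.corr().to_numpy()
np.fill_diagonal(dist, 0.0)

# linkage() expects the condensed (upper-triangle) form of the distance matrix
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="complete")

# Cut the merge tree so that at most two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(toy.columns, labels)))
```

Complete linkage merges the two clusters whose farthest members are closest, which matches the linkage='complete' setting used above.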

Method 3: Affinity propagation clustering

Affinity propagation clustering is a clustering algorithm that does not require the number of clusters to be specified in advance. It works by sending messages between pairs of data points and letting the data points automatically determine the number of clusters and the optimal cluster assignment. Affinity propagation clustering can effectively identify complex patterns in data, but is also computationally expensive for large data sets.

from sklearn.cluster import AffinityPropagation

# Perform affinity propagation clustering with default parameters
clustering = AffinityPropagation(affinity='precomputed').fit(correlation_mat)

# Display the cluster labels
print_clusters(df_combined,clustering.labels_)

Results of affinity propagation clustering

Interestingly, this method found four clusters to be the optimal number of clusters for our data. In addition, we can observe that oil and gas companies are brought together, and some technology companies are also brought together.
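
As a small, self-contained illustration of how the number of clusters emerges from the data, the sketch below runs scikit-learn's AffinityPropagation on a synthetic correlation matrix. The toy series and their names are assumptions made for the example. In scikit-learn, the number of clusters is influenced by the preference parameter, which defaults to the median of the input similarities, so the count reported here depends on that default and should not be read as a general result.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation

# Synthetic data: two groups of three series, each group driven by one factor
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 200))
toy = pd.DataFrame({
    "a1": f1 + 0.1 * rng.normal(size=200),
    "a2": f1 + 0.1 * rng.normal(size=200),
    "a3": f1 + 0.1 * rng.normal(size=200),
    "b1": f2 + 0.1 * rng.normal(size=200),
    "b2": f2 + 0.1 * rng.normal(size=200),
    "b3": f2 + 0.1 * rng.normal(size=200),
})
toy_corr = toy.corr()

# affinity='precomputed' means the input is a similarity matrix,
# which is what a correlation matrix is
ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(toy_corr)

print("labels:", ap.labels_)
print("number of exemplars found:", len(ap.cluster_centers_indices_))
```

The exemplars are actual data points (here, series) that the algorithm selects as representatives of their clusters.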

Method 4: DBSCAN clustering method

DBSCAN is a density-based clustering algorithm that clusters points that are closely packed together. It does not require the number of clusters to be specified in advance and can identify clusters of arbitrary shapes. The algorithm is robust to outliers and noise in the data and can automatically label them as noise points.

from sklearn.cluster import DBSCAN

# Shift correlations from [-1, 1] to [0, 2] so the matrix has no negative values
correlation_mat_pro = 1 + correlation_mat

# Perform DBSCAN clustering with eps=0.5 and min_samples=5
clustering = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit(correlation_mat_pro)

# Print the cluster labels
print_clusters(df_combined,clustering.labels_)

DBSCAN clustering results

Here, unlike affinity propagation, the DBSCAN method identifies 5 clusters as the optimal number. It can also be seen that some clusters only have 1 or 2 companies.
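
To make the roles of eps, min_samples, and the noise label concrete, here is a sketch on synthetic data. With metric='precomputed', DBSCAN treats the matrix entries as distances and eps as a distance threshold, so this example uses 1 - correlation as the distance, which is one common choice and differs from the 1 + correlation shift used above. The toy series, their names, and the parameter values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Synthetic data: two tight groups of three series plus one unrelated series
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
toy = pd.DataFrame({
    "a1": f1 + 0.1 * rng.normal(size=200),
    "a2": f1 + 0.1 * rng.normal(size=200),
    "a3": f1 + 0.1 * rng.normal(size=200),
    "b1": f2 + 0.1 * rng.normal(size=200),
    "b2": f2 + 0.1 * rng.normal(size=200),
    "b3": f2 + 0.1 * rng.normal(size=200),
    "loner": rng.normal(size=200),
})

# Distance = 1 - correlation, clipped so rounding cannot produce negatives
dist = np.clip(1.0 - toy.corr().to_numpy(), 0.0, None)
np.fill_diagonal(dist, 0.0)

# A point is a core point if at least min_samples points (itself included)
# lie within distance eps of it
db = DBSCAN(eps=0.2, min_samples=2, metric="precomputed").fit(dist)
labels = db.labels_

# Points that belong to no cluster receive the label -1 (noise)
print(dict(zip(toy.columns, labels)))
```

The unrelated series is not within eps of any other series, so it is reported as noise rather than forced into a cluster.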

Visualization

It may be useful to examine the results of the above four clustering methods simultaneously to gain insight into their performance. The simplest way is to use a heat map, with companies on the X-axis and clusters on the Y-axis.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def plot_cluster_heatmaps(cluster_results, companies):
    """
    Plots the heatmaps of clustering for all companies
    for different methods side by side.

    Args:
    - cluster_results: a dictionary mapping each clustering method's
      name to its array of cluster labels
    - companies: a list of company names
    """
    # Extract the method names (keys) and label arrays (values)
    methods = list(cluster_results.keys())
    labels = list(cluster_results.values())

    # Build a one-hot (cluster x company) matrix for each method
    heatmaps = []
    for i in range(len(methods)):
        heatmap = np.zeros((len(np.unique(labels[i])), len(companies)))
        for j in range(len(companies)):
            # A DBSCAN noise label of -1 indexes the last row
            heatmap[labels[i][j], j] = 1
        heatmaps.append(heatmap)

    # Plot the heatmaps in a 2x2 grid
    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))

    for i in range(len(methods)):
        row = i // 2
        col = i % 2
        sns.heatmap(heatmaps[i], cmap="Blues", annot=True, fmt="g", xticklabels=companies, ax=axs[row, col])
        axs[row, col].set_title(methods[i])

    plt.tight_layout()
    plt.show()

# cluster_results maps each method name to the label array produced by that method above
companies = df_combined.columns
plot_cluster_heatmaps(cluster_results, companies)

Clustering results for all four methods

However, the above visualization is not very helpful when trying to compare the results of multiple clustering algorithms. It would be helpful to find a better way to represent this graph.
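
One possible alternative, sketched here under assumptions, is a table with one row per company and one column per method, together with a pairwise agreement score such as the adjusted Rand index. The company names and labels below are made-up placeholders, not the article's results, and the dictionary layout of cluster_results (method name mapped to a list of labels) is an assumption inferred from how plot_cluster_heatmaps uses it.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Placeholder companies and labels, for illustration only
companies = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]
cluster_results = {
    "method_1": [0, 0, 1, 1, 2, 2],
    "method_2": [1, 1, 0, 0, 2, 2],  # same grouping, different label numbers
    "method_3": [0, 0, 0, 1, 1, 1],
}

# One row per company, one column per method
table = pd.DataFrame(cluster_results, index=companies)
print(table)

# The adjusted Rand index ignores label numbering and scores only
# whether two methods put the same companies together
scores = {
    (a, b): adjusted_rand_score(cluster_results[a], cluster_results[b])
    for a, b in combinations(cluster_results, 2)
}
for (a, b), score in scores.items():
    print(f"{a} vs {b}: ARI = {score:.2f}")
```

Because cluster numbers are arbitrary, comparing raw labels across methods can be misleading; a score that is invariant to relabeling avoids that problem.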

Conclusion

In this article, we explored four different methods for clustering 20 companies based on the correlations between their stock prices. The aim was to cluster these companies in a way that reflects their behavior rather than their price levels. We tried K-means, agglomerative, affinity propagation, and DBSCAN clustering, each of which has its own advantages and disadvantages. The results show that all four methods can group companies in a way that is broadly consistent with their industry or sector, although some methods are more computationally expensive than others. Correlation-based clustering is a useful alternative to clustering on stock prices directly, because it groups companies by how their prices behave rather than by how high they are.

Origin blog.csdn.net/qq_33431368/article/details/132784245