Calculation of the Adjusted Rand index (ARI)

Rand index (RI)

The value range of RI is [0, 1]. The larger the value, the more consistent the clustering result is with the ground truth.

If class labels are available, a clustering result can be evaluated with pairwise counts analogous to the precision and recall used in classification.

Suppose U is the external evaluation reference (the true labels, true_label) and V is the clustering result. Over all pairs of samples, define four statistics:

| Symbol | Meaning | Plainer description | Decision |
|--------|---------|---------------------|----------|
| TP / a | Number of sample pairs in the same class in U and in the same cluster in V | Similar samples grouped into the same cluster (same-same) | correct |
| TN / d | Number of sample pairs in different classes in U and in different clusters in V | Dissimilar samples placed into different clusters (different-different) | correct |
| FP / c | Number of sample pairs in different classes in U but in the same cluster in V | Dissimilar samples grouped into the same cluster (different-same) | wrong |
| FN / b | Number of sample pairs in the same class in U but in different clusters in V | Similar samples split into different clusters (same-different) | wrong |
|                     | Same cluster (V) | Different cluster (V) | Sum     |
|---------------------|------------------|-----------------------|---------|
| Same class (U)      | TP / a           | FN / b                | a+b     |
| Different class (U) | FP / c           | TN / d                | c+d     |
| Sum                 | a+c              | b+d                   | a+b+c+d |

RI is the proportion of correct decisions:

$RI = \frac{TP+TN}{TP+FP+TN+FN} = \frac{TP+TN}{C_{n_{samples}}^{2}} = \frac{a+d}{C_{n_{samples}}^{2}}$

The denominator $C_{n_{samples}}^{2}$ is the number of ways to choose any two samples as a pair, i.e. the total number of sample pairs that can be formed from the data set.
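The pair-counting definition above can be sketched as a brute-force illustration (not the library implementation; the function name `rand_index` is chosen here for the example):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Compute RI by enumerating all C(n, 2) sample pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1  # TP: same class, same cluster
        elif same_true:
            b += 1  # FN: same class, different clusters
        elif same_pred:
            c += 1  # FP: different classes, same cluster
        else:
            d += 1  # TN: different classes, different clusters
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666
```

Here a=2, d=8 out of C(6, 2)=15 pairs, giving RI = 10/15.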

Adjusted Rand index (ARI)

Why introduce the Adjusted Rand index (ARI)? The problem with RI is that for two random partitions its expected value is not a constant close to 0. Hubert and Arabie therefore proposed the adjusted Rand index in 1985. ARI adopts a generalized hypergeometric null model: the partitions X and Y are assumed to be random, subject to the constraint that the number of data points in each class and in each cluster is fixed.

To calculate this value, first build the contingency table: each entry $n_{ij}$ counts the number of data points that are simultaneously in cluster $Y_i$ and class $X_j$. The ARI can then be computed from this table.
$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$
$ARI \in [-1, 1]$. A larger value means the clustering result is more consistent with the ground truth. Broadly speaking, ARI measures the degree of agreement between two partitions of the data.
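In contingency-table terms (writing $a_i$ for row sums and $b_j$ for column sums), the expected and maximum index have closed forms. A minimal sketch, checked against sklearn (the function name `ari_from_contingency` is ours):

```python
import numpy as np
from scipy.special import comb
from sklearn import metrics

def ari_from_contingency(labels_true, labels_pred):
    """Adjusted Rand index computed from the class-by-cluster contingency table."""
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    n = np.zeros((len(classes), len(clusters)), dtype=int)
    for ci, ki in zip(class_idx, cluster_idx):
        n[ci, ki] += 1                        # n_ij: in class i and cluster j
    sum_ij = comb(n, 2).sum()                 # sum of C(n_ij, 2) over all cells
    sum_a = comb(n.sum(axis=1), 2).sum()      # from row sums a_i
    sum_b = comb(n.sum(axis=0), 2).sum()      # from column sums b_j
    expected = sum_a * sum_b / comb(n.sum(), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(ari_from_contingency(labels_true, labels_pred))       # matches the line below
print(metrics.adjusted_rand_score(labels_true, labels_pred))
```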

  • Advantages and disadvantages
    • Advantages:
      1.) For any number of cluster centers and samples, the ARI of a random clustering is very close to 0;
      2.) The value lies in [-1, 1]; negative values indicate a poor result, and the closer to 1 the better;
      3.) Can be used to compare different clustering algorithms.
    • Cons:
      1.) ARI requires ground-truth labels.
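The claim that random clusterings score near 0 under ARI (but not under RI) can be checked empirically. A small simulation, assuming a recent scikit-learn version that provides `metrics.rand_score`:

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
n_samples, n_clusters = 1000, 10
ri_vals, ari_vals = [], []
for _ in range(20):
    # Two independent random labelings of the same samples
    a = rng.integers(0, n_clusters, n_samples)
    b = rng.integers(0, n_clusters, n_samples)
    ri_vals.append(metrics.rand_score(a, b))
    ari_vals.append(metrics.adjusted_rand_score(a, b))

print(f"mean RI  = {np.mean(ri_vals):.3f}")   # far from 0 (around 0.1*0.1 + 0.9*0.9)
print(f"mean ARI = {np.mean(ari_vals):.4f}")  # close to 0
```

RI stays high for random labelings because most pairs land in different clusters on both sides; ARI subtracts exactly that chance agreement.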

Python code:

  • adjusted_rand_score is provided by the sklearn.metrics module:
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Independent of the label names (permuting labels does not change the score)
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Symmetric in its arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# Closer to 1 is best; a perfect match scores 1.0
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0

# Independent labelings score negative or close to 0
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # -0.12903225806451613


Origin blog.csdn.net/qq_42887760/article/details/105728101