The Rand index

I’ve been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.

From the Wikipedia page you can see that the Rand index, R, is calculated by:

$R = \frac{a + b}{ {{n}\choose{2}} }$

Ignoring the numerator for now, notice that the denominator is a binomial coefficient. ${n}\choose{2}$ is the number of unordered pairs in a set of n elements. For example, if you have set of 4 elements {a, b, c, d}, there are 6 unordered pairs: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, and {c, d}. The $a$ in the formula refers to the number of times a pair of elements belongs to a same cluster across two different clustering results and the $b$ refers to the number of times a pair of elements are in different clusters across two different clustering results. It will be easier to understand the Rand index with a simple example.

Say we have a set of six elements: {a, b, c, d, e, f}. Clustering method 1 (CM1) forms three clusters; the first two items are in group 1, the third and fourth are in group 2, and the fifth and sixth are in group 3: {1, 1, 2, 2, 3, 3}. Clustering method 2 (CM2) forms two clusters; the first three items are in group 1 and the last three items are in group 2: {1, 1, 1, 2, 2, 2}. What’s the Rand index of these two clustering results?

To manually calculate the Rand index, we need to go through every unordered pair to work out $a$ and $b$ ; let’s work out $a$ first. There are 15 unordered pairs in a set of six elements: {a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {b, c}, {b, d}, {b, e}, {b, f}, {c, d}, {c, e}, {c, f}, {d, e}, {d, f}, and {e, f}. $a$ is every time a pair of elements is grouped together by the two clustering methods. {a, b} and {e, f} are clustered together by CM1 and CM2, so $a$ = 2. $b$ is every time a pair of elements is not grouped together by the two clustering methods. {a, d}, {a, e}, {a, f}, {b, d}, {b, e}, {b, f}, {c, e}, and {c, f} are not clustered together by CM1 and CM2, so $b$ = 8. $a$ and $b$ are the times that both clustering methods agree. Therefore, the Rand index is:

$R = \frac{2 + 8}{ {{6}\choose{2}}} = \frac{10}{15} = 0.667$

Using the fossil package and the rand.index() function, we can work out the Rand index in R.

# CM1

a <- rep(1:3, each = 2)

a

[1] 1 1 2 2 3 3

# CM2

b <- rep(1:2, each = 3)

b

[1] 1 1 1 2 2 2

# install package if you haven't already

install.packages("fossil")

# load package

library(fossil)

rand.index(a, b)

[1] 0.6666667

Let’s take a look at the rand.index() function.

rand.index

function (group1, group2)

{

x <- abs(sapply(group1, function(x) x - group1))

x[x > 1] <- 1

y <- abs(sapply(group2, function(x) x - group2))

y[y > 1] <- 1

sg <- sum(abs(x - y))/2

bc <- choose(dim(x)[1], 2)

ri <- 1 - sg/bc

return(ri)

}

<environment: namespace:fossil>

Let’s go through the code step by step to see what it is doing.

group1 <- rep(1:3, 2)

group1

[1] 1 2 3 1 2 3

group2 <- rep(1:3, each = 2)

group2

[1] 1 1 2 2 3 3

x <- abs(sapply(group1, function(x) x - group1))

x

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 0 1 2 0 1 2

[2,] 1 0 1 1 0 1

[3,] 2 1 0 2 1 0

[4,] 0 1 2 0 1 2

[5,] 1 0 1 1 0 1

[6,] 2 1 0 2 1 0

x[x > 1] <- 1

x

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 0 1 1 0 1 1

[2,] 1 0 1 1 0 1

[3,] 1 1 0 1 1 0

[4,] 0 1 1 0 1 1

[5,] 1 0 1 1 0 1

[6,] 1 1 0 1 1 0

y <- abs(sapply(group2, function(x) x - group2))

y

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 0 0 1 1 2 2

[2,] 0 0 1 1 2 2

[3,] 1 1 0 0 1 1

[4,] 1 1 0 0 1 1

[5,] 2 2 1 1 0 0

[6,] 2 2 1 1 0 0

y[y > 1] <- 1

y

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 0 0 1 1 1 1

[2,] 0 0 1 1 1 1

[3,] 1 1 0 0 1 1

[4,] 1 1 0 0 1 1

[5,] 1 1 1 1 0 0

[6,] 1 1 1 1 0 0

Each row and column number in x and y correspond to the element number in group1 and group2, respectively. A zero indicates a match, i.e. that a particular element is in the same cluster as another element, and a one indicates that they belong to different clusters. The x and y matrices therefore list all the pairs (twice) and whether or not they belong to the same cluster.

Now the function can compare the two different groups of clusters to get the number of disagreements. Finally to get the Rand Index, which reflects the number of agreements, we subtract the fraction of disagreements from 1.

# divide by two because the pairs are counted twice

# sum to get all the disagreements

sg <- sum(abs(x - y))/2

sg

[1] 6

# get the total number of pairs

bc <- choose(dim(x)[1], 2)

bc

[1] 15

ri <- 1 - sg/bc

ri

[1] 0.6

Assessing clustering results

We can use the Rand index to assess a clustering approach. I’ll illustrate using k-means clustering using the famous iris dataset.

# check out the data

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

# true labels

true_label <- as.numeric(iris$Species)

true_label

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2

[64] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[127] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

# perform k-means clustering

my_kmeans <- kmeans(x = iris[,-5], centers = 3)

# clustering results

my_kmeans$cluster

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3

[64] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2

[127] 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3

# Rand index

rand.index(true_label, my_kmeans$cluster)

[1] 0.8797315

From the Wikipedia article:

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

The Rand index suggests that the k-means clustering of the iris data using sepal and petal measurements is similar to the real “clustering” of the data.

Transfer from link: https://davetang.org/muse/2017/09/21/the-rand-index/

Assessing clustering results

猜你喜欢