Hierarchical Clustering on the Iris Dataset

Lukman Aliyu
5 min read · Jun 3, 2023


Introduction

Hierarchical clustering is an alternative to prototype-based approaches such as the K-means clustering algorithm. An important advantage of hierarchical clustering is that it allows the plotting of dendrograms, visualizations of the binary cluster hierarchy that can support the interpretation of the results by creating meaningful taxonomies. It also has the further advantage of not requiring the number of clusters to be specified in advance.

There are two main approaches to hierarchical clustering:

  1. Agglomerative
  2. Divisive

In divisive hierarchical clustering, the complete dataset starts as a single cluster, which is iteratively split until each cluster contains only one example. Agglomerative clustering takes the opposite approach: it starts with each example as its own singleton cluster and repeatedly merges the closest pair of clusters until only one cluster remains.

Algorithms for agglomerative hierarchical clustering

The two standard algorithms for agglomerative clustering are:

  1. Single linkage: The distances between the two most similar members for each pair of clusters are computed, and then the two clusters for which the distance between the most similar members is the smallest are merged.
  2. Complete linkage: The distances between the two most dissimilar members for each pair of clusters are computed, and the two clusters for which this distance is the smallest are merged.

There are other alternative approaches that include:

  1. Average linkage: Merging cluster pairs based on the minimum average distances between all group members in the two clusters
  2. Ward linkage: Merging of cluster pairs based on the minimum increase of the total within-cluster sum of squared errors.

This article will focus on applying the complete linkage approach to the iris dataset.

Hierarchical complete linkage clustering has the following iterative steps (a minimal code sketch follows the list):

  1. Compute a pairwise distance matrix of all examples
  2. Represent each data point as a singleton cluster
  3. Merge the two clusters with the smallest distance between their most dissimilar (distant) members
  4. Update the cluster linkage matrix
  5. Repeat steps 3–4 until a single cluster remains
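To make these steps concrete, here is a minimal NumPy sketch of the complete linkage loop. It is a naive illustration intended for small datasets, not the optimized routine SciPy uses, and the list-based cluster bookkeeping is my own simplification.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def complete_linkage_demo(X):
    """Naive complete-linkage clustering; returns the merge history."""
    dist = squareform(pdist(X, metric='euclidean'))   # step 1: pairwise distance matrix
    clusters = [[i] for i in range(len(X))]           # step 2: singleton clusters
    merges = []
    while len(clusters) > 1:                          # step 5: stop at a single cluster
        best_i, best_j, best_d = None, None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # step 3: distance between the most dissimilar members of the two clusters
                d = dist[np.ix_(clusters[i], clusters[j])].max()
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # step 4: record the merge and update the cluster list
        merges.append((clusters[best_i], clusters[best_j], best_d))
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return merges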

We will first use the SciPy package to build the clustering and visualize it as a dendrogram, and then use the scikit-learn library to fit the model and compute and interpret evaluation metrics.

First, we import the relevant packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn import metrics

Next, we load the iris dataset

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
X.shape
Output:
(150, 4)
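Optionally, we can take a quick look at the four features by wrapping the array in a DataFrame (the variable name df_iris below is just for illustration):

df_iris = pd.DataFrame(X, columns=iris.feature_names)  # sepal/petal lengths and widths in cm
print(df_iris.head())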

SciPy Implementation and Dendrogram

Computing the linkage matrix

We pass the condensed distance matrix from pdist to the linkage function, which returns the linkage matrix describing every merge:

# Linkage matrix built from the condensed distance matrix
row_clusters = linkage(pdist(X, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
Fig. 1: The linkage matrix returned by SciPy

As can be seen, the last row of the linkage matrix combines all 150 examples into one large cluster, as expected in agglomerative hierarchical clustering, which stops once a single cluster remains.
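Each row of the linkage matrix describes one merge: the first two columns hold the indices of the merged clusters (indices below 150 are original data points, 150 and above are clusters formed in earlier rows), the third column is the complete linkage distance, and the fourth is the number of examples in the new cluster. As a quick sanity check (a printout I added, not part of the original walkthrough):

last_merge = row_clusters[-1]   # the final merge that produces the single cluster
print('merged cluster indices:', int(last_merge[0]), int(last_merge[1]))
print('complete linkage distance:', last_merge[2])
print('examples in merged cluster:', int(last_merge[3]))   # 150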

Next, we draw the dendrogram, a pictorial view of the cluster hierarchy:

row_dendr = dendrogram(row_clusters)
plt.tight_layout()
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point index')
plt.ylabel('Euclidean distance')
plt.show()
Fig. 2: Hierarchical Clustering Dendrogram of the Iris Dataset
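With 150 leaves, the full dendrogram is crowded. SciPy's dendrogram function can also truncate the tree so that only the final merges are shown; this variant is optional, and p=10 is an arbitrary choice:

dendrogram(row_clusters, truncate_mode='lastp', p=10, show_contracted=True)  # show only the last 10 merges
plt.title('Truncated Dendrogram (last 10 merges)')
plt.xlabel('Cluster size (in parentheses) or data point index')
plt.ylabel('Euclidean distance')
plt.show()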

Scikit-Learn Implementation

The scikit-learn library has a good implementation of agglomerative clustering that accepts the desired number of clusters as input. Unlike K-means, where the number of clusters is a required parameter, hierarchical clustering does not strictly need it in advance.
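Because the number of clusters is optional, we could instead let scikit-learn cut the tree at a distance threshold (the threshold of 4.0 below is just an illustrative value, not a tuned one):

ac_thresh = AgglomerativeClustering(n_clusters=None,        # let the threshold decide
                                    distance_threshold=4.0,
                                    metric='euclidean',
                                    linkage='complete')
labels_thresh = ac_thresh.fit_predict(X)
print('Number of clusters found:', ac_thresh.n_clusters_)

For the comparison with the known species, however, we fix the number of clusters at three: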

ac = AgglomerativeClustering(n_clusters=3,  # 3 clusters, since the iris dataset has three species
                             metric='euclidean',
                             linkage='complete')
labels = ac.fit_predict(X)
print(f'Cluster labels: {labels}')
Output:
Cluster labels: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2 0 2 0 2 0 2 2 2 2 0 2 0 2 2 0 2 0 2 0 0
0 0 0 0 0 2 2 2 2 0 2 0 0 0 2 2 2 0 2 2 2 2 2 0 2 2 0 0 0 0 0 0 2 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
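The cluster numbers themselves are arbitrary, so it helps to cross-tabulate them against the true species to see how the clusters line up with setosa, versicolor, and virginica (a quick check I added):

print(pd.crosstab(iris.target, labels,
                  rownames=['true species'], colnames=['cluster']))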

Evaluation Metrics

# Calculate the evaluation metrics
ari = metrics.adjusted_rand_score(iris.target, labels)
silhouette = metrics.silhouette_score(X, labels)
completeness = metrics.completeness_score(iris.target, labels)

# Print the evaluation metrics
print("Adjusted Rand Index (ARI):", ari)
print("Silhouette Score:", silhouette)
print("Completeness Score:", completeness)
Output:

Adjusted Rand Index (ARI): 0.6422512518362898
Silhouette Score: 0.5135953221192214
Completeness Score: 0.7454382753016932

Understanding the metrics for our model

I calculated the adjusted Rand Index (ARI) using the adjusted_rand_score function.

The ARI measures the similarity between the cluster assignments and the ground truth labels; it is close to 0 for random assignments and equals 1 for a perfect match. Our score of about 0.64 indicates reasonably good, though far from perfect, agreement with the true species labels.

Next, I calculated the silhouette score using the silhouette_score function. The silhouette score measures the compactness and separation of the clusters, and it ranges from -1 to 1, with higher values indicating better-defined clusters; note that it uses only the data and the cluster assignments, not the true labels. Our score of about 0.51 is modest, suggesting the clusters are reasonably compact but overlap somewhat.

Finally, I calculated the completeness score using the completeness_score function. The completeness score measures the extent to which all members of a true class are assigned to the same cluster, and it ranges from 0 to 1, with higher values indicating better clustering performance. Our completeness score for the clustering model is good, at almost 75%.
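Since the linkage criterion strongly influences the result, one quick follow-up is to refit the model with the other criteria and compare their ARI scores against the true species. The loop below is a small experiment sketch; its scores are not reproduced here.

for link in ['single', 'complete', 'average', 'ward']:
    model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage=link)
    print(link, metrics.adjusted_rand_score(iris.target, model.fit_predict(X)))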

Conclusion

In this article, we started with an overview of hierarchical clustering, applied complete linkage agglomerative clustering to the iris dataset using both the SciPy and scikit-learn libraries, and finished by evaluating how well the model fit. The metrics are respectable, but they also suggest that another clustering method might suit this dataset better.

Follow my machine learning journey on Medium.

