Chapter 5: Unsupervised Learning. Part 1

Machine Learning

Author

F.San Segundo & N.Rodríguez

Published

April 2026

%cd 5_2_Unsupervised_Clustering

/wd/5_2_Unsupervised_Clustering

5.2 Clustering Methods

Clustering is a very common form of unsupervised learning. It aims to find a natural grouping in data so that the items in the same cluster are more similar to each other than to those from different clusters.

There are two main categories of clustering: * Prototype-based clustering: Each cluster is represented by a prototype. Usually, this prototype is: - The centroid (average) of similar points with continuous features - The medoid (the point that minimises the distance to the other points in a cluster)

Hierarchical: This technique allows us to plot dendrograms. There are two main approaches:
- Divisive: We start with one cluster (all the dataset) and we iteratively split this cluster until each cluster only contains one example.
- Agglomerative: The opposite to divisive. We start with clusters of just one example each, and we iteratively merge these clusters until only one cluster remains.

5.2.1 K Means Clustering

We assume there are \(K\) cluster centers \(μ_{k} \in \mathbb{R}^{D}\), so we can cluster the data by assigning each data point \(x_{n} \in \mathbb{R}^{D}\) to it closest center:

\[ z_{n}^{*} = \text{argmin}_{k} || \vec{x}_{n} - \vec{\mu}_{k} ||^{2}\]

Of course, we don’t know the cluster centers, but we can estimate them by computing the average value of all points assigned to them:

\[ \vec{\mu}_{k} = \frac{1}{N_{k}} \sum_{n:z_{n} = k} \vec{x}_{n}\]

We can then iterate these steps to convergence.

More formally, we can view this as finding a local minimum of the following cost function, known as the distortion:

\[ J(\mathbf{M}, \mathbf{Z}) = \sum_{n=1}^{N} || \vec{x}_{n} - \vec{\mu}_{z_{n}} ||^{2} = || \mathbf{X} - \mathbf{Z}\mathbf{M}^{T} || ^{2} \]

where \(\mathbf{X} \in \mathbb{R}^{N \times D}\), \(\mathbf{Z} \in [0, 1]^{N \times K}\), and \(\mathbf{M} \in \mathbb{R}^{D\times K}\) containes the clusters centers \(\vec{\mu}_{k}\) in its columns.

K-Means optimises this using alternating minimisation. From a theoretical perspective, finding the optimal solution for a clustering problem is particularly difficult (NP-hard).

Thus, all algorithms are only approximations of the optimal solution. The k-means algorithm, is an iterative process that refines the solution until it converges to a local optimum.

Some properties of K-means: * Clusters do not overlap * Clusters are not hierarchical * It is assumed that there is at least one item in each cluster

Four steps:

Specify the number \(k\) of clusters to assign.
Randomly initialise \(k\) centroids.
repeat
1. expectation: Assign each point to its closest centroid.
2. maximisation: Compute the new centroid (mean) of each cluster.
until The centroid positions do not change or a user-defined tolerance or maximum number of iterations is reached.

5.2.1.1 Example

Handmade implementation

We are goint to apply these steps through an example. The following code just creates a dataset to work with.

%run -i "example_kmeans.py"

Initialization: Often, there is no good prior knowledge about the location of the centroids. An effortless way to start is to define the centroids by randomly selecting data points from the dataset (Forgy method).

def pick_centroids(data, k):
    indexes = np.random.choice(len(data), size=k, replace=False)
    centroids = data[indexes]
    return centroids

Think about it

What do you think about randomly selecting data points as the initial centroids? Is there a better approach?

Assignment: For each data point, the distance to all centroids is computed. The data points belong to the cluster represented by the closest centroid. This is called a hard assignment because the data point belongs to one and only one cluster.

def assign_cluster(data, centroids):
    # Pairwise squared L2 distances. Shape [n, k]
    distances = ((data[:, np.newaxis] - centroids)**2).sum(axis=2)
    # find closest centroid index. Shape [n]
    clusters = np.argmin(distances, axis=1)
    return clusters

Question

If there is a hard assignment, intuition tells us that there must be something called “soft assignment”…What’s the difference?

Update: Given all the points assigned to a cluster, the mean position is computed and defines the new location of the centroid. All centroids are updated simultaneously.

def update_centroids(data, clusters, k):
    # Mean positions of data within clusters
    centroids = [np.mean(data[clusters == i], axis=0) for i in range(k)]
    return np.array(centroids)

Using the two functions defined above, we can create an object KMEANS with methods fit and predict

class KMEANS:
    def __init__(self, k):
        self.k = k
        
    def fit(self, data, steps=20):
        self.centroids = pick_centroids(data, self.k)
        for step in range(steps):
            clusters = assign_cluster(data, self.centroids)
            self.centroids = update_centroids(data, clusters, self.k)
            
    def predict(self, data):
        return assign_cluster(data, self.centroids)

Similarly to what we do with scikit-learn…

kmeans = KMEANS(k=3)
kmeans.fit(data)
clusters = kmeans.predict(data)

We can plot the results of K-Means applied to the data:

import matplotlib.pyplot as plt
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', alpha=0.5) 
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c="red", s=100, edgecolor='black', label='Centroids')
plt.title('Cluster Visualization with Centroids')  
plt.grid(False)
plt.show()

It is possible to visualize the decision boundaries of K-Means. For this, we can use a Voronoi diagram

Voronoi diagram

A Voronoi diagram is a way of dividing space into regions based on the distances to a specified set of points, often referred to as sites or seeds. In the case of K-Means, the seeds are the centroids of the clusters.

def plot_decision_boundaries(clusterer, X, resolution=1000):
    plt.figure()
    mins = X.min(axis=0) - 0.2
    maxs = X.max(axis=0) + 0.2
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution), np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.title('Cluster Visualization with Voronoi cells') 
    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), linewidths=1, colors="k")
    plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', alpha=0.5) 
    plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c="red", s=100, edgecolor='black', label='Centroids')
    plt.show()

plot_decision_boundaries(kmeans, data)

Sklearn Implementation

from sklearn.cluster import KMeans
import numpy as np

kmeans_sklearn = KMeans(n_clusters=2, init='random', n_init=10, random_state=29).fit(data)
kmeans_sklearn.predict(data)
kmeans_sklearn.cluster_centers_

array([[ 4.40993308,  2.25708681],
       [-1.59048593, -0.23386661]])

Question

Inspect the code above. What does the init parameter do? And n_init? (Tip: go to the description of KMeans in Scikit-Learn documentation)

K-Means++

Here we have seen the classic K-means algorithm. However, the default algorithm used by Sklearn when calling KMeans is K-Means++.

In an nutshell, K-Means++ places the initial centroids far away from each other, leading to better and more consistent results than the classic algorithm.

Interesting link: What is kmeans++

5.2.1.2 Selecting a good number of clusters

In this section, we discuss how to choose the number of clusters K in the K-means algorithm and other related methods. We will see two main methods: * The elbow method * Silhouette plot method

The elbow method

This method should already be familiar to you. We can compare the performance of different k-means clustering (different number of clusters) by using the within-cluster Squared Standard Error (SSE), also called distortion.

Since we are using Sklearn, the distortion is already accessible via the inertia_ attribute of the Kmeans model. Therefore, we can plot this distortion for different values of k

distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init="k-means++", n_init=10, max_iter=300, random_state=0)
    km.fit(data)
    distortions.append(km.inertia_)

plt.plot(range(1,11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
plt.show()

We can see that the elbow is located at \(k=3\), so it would be a good choice for this dataset

The Silhouette plot method

The Silhouette method can be used to plot a measure of how tightly grouped the examples in the clusters are. It is designed to work for spherical (not elongated) clusters.

To do this, we have to define the Silhouette coefficient as:

\[sc(i) = \frac{b_{i} - a_{i}}{\text{max}(a_{i}, b_{i})}\]

where \(a_{i}\) is the mean distance to the other instances in the same cluster, thus it is a measure of compactness of the \(i\)’s cluster, and \(b_{i}\) is a measure of distance between the clusters (average distance between the example and all the instances in the nearest cluster).

The silhouette coefficient varies from \(-1\) to \(+1\) where: - \(1\) means instance is close to all the members of its cluster and far from other clusters. - \(0\) means it is close to a cluster boundary. - \(-1\) means it may be in the wrong cluster.

Let’s see this through an example. The code in the next cell creates a dataset whith some elongated clusters for demonstration purposes:

%run -i "example_silhouette.py"

Before plotting the silhouette plots, let’s see first the number of clusters given by the elbow method

distortions = []
for i in range(2, 10):
    km = KMeans(n_clusters=i, n_init=10, random_state=42)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(2,10), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
plt.show()

Let’s now plot the silhouette coefficients for each number of clusters \(k\)

from sklearn.metrics import silhouette_score

Ks = range(2, 10)
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X) for k in Ks]
silhouette_scores = [silhouette_score(X, model.labels_) for model in kmeans_per_k]

plt.figure()
plt.plot(Ks, silhouette_scores, marker='o')
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.tight_layout()
plt.show()

from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter
import warnings

warnings.filterwarnings("ignore")


def plot_silhouette(model, X):
    mu = model.cluster_centers_
    K, D = mu.shape
    y_pred = model.labels_
    silhouette_coefficients = silhouette_samples(X, y_pred)
    silhouette_scores = silhouette_score(X, model.labels_)
    cmap = cm.get_cmap("Pastel2")
    colors = [cmap(i) for i in range(K)]
    padding = len(X) // 30
    pos = padding
    for i in range(K):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()
        color = mpl.cm.Spectral(i / K)
        # color = colors[i]
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs, facecolor=color, edgecolor=color, alpha=0.7)
        pos += len(coeffs) + padding
    score = silhouette_scores
    plt.axvline(x=score, color="red", linestyle="--")
    plt.title("$k={}, score={:0.2f}$".format(K, score), fontsize=16)


for model in kmeans_per_k:
    K, D = model.cluster_centers_.shape
    plt.figure()
    plot_silhouette(model, X)
    fname = f"kmeans_silhouette_diagram{K}.pdf"
    plt.tight_layout()

We can see in the resulting plots that the silhouettes have different lengths and widths, which indicates a suboptimal clustering.

## 5.2.2 Mini Batch K-Means for high N

::: {.callout-note icon=false}

#### Mini Batch K-Means

Source: Scikit-Learn

It is a variant of K-Means that uses mini-batches to reduce the computation time. Mini-batches are subsets of the input data, randomly sampled in each training iteration.

In general, the results of Mini Batch K-Means are only slightly worse than the standard algorithm

:::

::: {#cell-48 .cell}

The following code compares the computation times and clustering performance (inertia) of the KMeans and MiniBatchKMeans algorithms for different numbers of clusters, storing the results in the times and inertias arrays to plot them.

::: {#cell-50 .cell}

::: {.cell-output .cell-output-stdout}

::: {.cell-output .cell-output-display}

::: :::

## Evaluating the output of clustering methods

::: {.callout-note} #### The challenge of measuring the quality Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method.

If we have labeled data for some of the data, we can use the similarity (or equality) between the labels of two data points as a metric for determining if the two inputs “should” be assigned to the same cluster or not.

If we don’t have labels, but the method is based on a generative model of the data, we can use log likelihood as a metric.

For this section, we will use data generated with make_moons. You should already be familiar with this.

::: {#cell-54 .cell} ``` {.python .cell-code} %run -i “make_moons.py”

import seaborn as sns sns.scatterplot(dfTR, x = “X1”, y = “X2”, hue=“Y”); ```

::: {.cell-output .cell-output-display}

::: :::

We will apply a simple K-means with just two clusters for demonstration purposes:

::: {#cell-56 .cell}

### Purity

::: {.callout-note} #### Purity It is a common evaluation metric for clustering algorithms when we have labeled data. It measures the extent to which clusters contain a single class of data points. Higher purity indicates better clustering performance, as it means that the clusters are more homogeneous with respect to the ground truth labels.

The steps are the following: 1. First, we assign each cluster to the class which is most frequent in the cluster: This is done by finding the majority class in each cluster. 2. Count the number of correctly assigned data points: These are the data points that belong to the majority class in their respective clusters. 3. Calculate purity: Purity is the ratio of the number of correctly assigned data points to the total number of data points.

Mathematically:

Let \(N_{ij}\) be the number of objects in cluster \(i\) that belong to class \(j\), \(K\) is the total number of clusters, and \(N\) is the total number of data points in the dataset.The overall purity of a clustering is: \[p = \frac{1}{N} \sum_{i=1}^{K} \max_{j} \, N_{ij}\]

We can code a function to calculate this purity:

::: {#cell-59 .cell} ``` {.python .cell-code} import numpy as np from sklearn import metrics

def purity_score(y_true, y_pred): contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)

majority_class_count = np.sum(np.max(contingency_matrix, axis=1)) #Sums the counts of the majority classes (np.max) purity = majority_class_count / np.sum(contingency_matrix)

return purity ``` :::

::: {#cell-60 .cell}

::: {.cell-output .cell-output-display}

The purity ranges between 0 (bad) and 1 (good).

However, we can trivially achieve a purity of 1 by putting each object into its own cluster, so this measure does not penalize for the number of clusters.

### Rand Index

::: {.callout-note} #### Rand Index It is a measure of the similarity between two data clusterings (for example, kmeans with the real labeled data)

The Rand Index ranges from 0 to 1, where 1 indicates perfect agreement between the two clusterings.

Let U and V be two different partitions of the N data points.

For example, U might be the estimated clustering and V is reference clustering derived from the class labels. Now define a \(2\times2\) contingency table, containing the following numbers:

- TP (True positives): Count the number of pairs of data points that are in the same cluster in both U and V. - TN (True negatives): Count the number of pairs of data points that are in different clusters in both U and V. - FN (False negatives): Count the number of pairs of data points that are in different clusters in U but same in V. - FP (False positives): Count the number of pairs of data points that are in the same cluster in U but different in V.

The Rand index is calculated: \[R = \frac{TP+ TN}{TP + FP + TN + FN}\] This can be interpreted as the fraction of clustering decisions that are correct. Clearly \(0 \leq R \leq 1\). The Rand index weights false positives and false negatives equally.

In sklearn we can directly use the rand_score function to calculate it.

::: {#cell-65 .cell}

::: {.cell-output .cell-output-display}

::: {.callout-note} #### Adjusted Rand index The Adjusted Rand Index (ARI) is a corrected-for-chance version of the Rand Index. It adjusts the Rand Index to account for the fact that random clusterings can sometimes result in a high Rand Index. The ARI ranges from -1 to 1, where 1 indicates perfect agreement between the two clusterings, 0 indicates random labeling, and negative values indicate less agreement than expected by chance.

The formula for the Adjusted Rand Index is:

\[\text{ARI} = \frac{\text{RI} - \text{Expected RI}}{\text{Max RI} - \text{Expected RI}}\]

Where:

* \(\text{RI}\) is the Rand Index. * \(\text{Expected RI}\) is the expected value of the Rand Index for random clusterings. * \(\text{Max RI}\) is the maximum value of the Rand Index.

ARI is less sensitive to the absolute sizes of the clusters.

::: {#cell-67 .cell}

::: {.cell-output .cell-output-display}

### Mutual Information

::: {.callout-note} #### Mutual information Another way to measure cluster quality is to compute the mutual information between two candidate partitions U and V.

To do this, let \(p_{\text{UV}}(i, j) = \frac{|u_{i}\cap v_{j}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(u_{i}\) in U and \(v_{j}\) in V.

Also, let \(p_{U}(i) = \frac{|u_{i}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(u_{i}\) in U; define \(p_{V}(j) = \frac{|v_{J}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(v_{j}\) in V. Then we have:

\[\mathbb{I}(U,V) = \sum_{i=1}^{R}\sum_{j=1}^{C} p_{\text{UV}}(i, j) \text{log} \frac{p_{UV}(i, j)}{p_{U}(i)p_{V}(j)}\]

This lies between \(0\) and \(\text{min}\{\mathbb{H}(U), \mathbb{H}(V)\}\). Unfortunately, the maximum value can be achieved by using lots of small clusters, which have low entropy. To compensate for this, we can use the normalized mutual information which lies between \(0\) and \(1\).

\[N\mathbb{I}(U,V) = \frac{2\mathbb{I}(U,V)}{\mathbb{H}(U) + \mathbb{H}(V)}\]

- A score close to 1 indicates that the clusterings are very similar. - A score close to 0 indicates that the clusterings are dissimilar.

Therefore, a normalized mutual information score closer to 1 indicates that the clustering is good and there is a high degree of agreement between the true labels and the predicted labels. In other words, the clusters found by the algorithm closely match the true underlying clusters.

::: {#cell-70 .cell}

::: {.cell-output .cell-output-display}

::: {.callout-note} #### Summary * Purity presents bias towards many clusters, but it is easy to understand and compute. It is appropriate when you have clear true labels for your data. * ARI is useful to compare different clustering results on the same dataset, since it accounts for chance grouping. It does not matter if the compared methods have different number of clusters. * Mutual information to measure the amount of information shared between clustering and the true labels. It is not biased towards a specific number of clusters.

5.2.3 Hierarchical Agglomerative Clustering

HAC

A common form of clustering is known as hierarchical agglomerative clustering or HAC.

This is an alternative to prototype-based clustering (e.g., K-means) that allows us to plot dendrograms. Another advantages is that we don’t have to specify the number of clusters upfront.

In HAC, we start with each example as an individual cluster and merge the closest pairs of clusters until only one cluster remains.

For example, consider the set of \(5\) inputs points below in \(x_n \in \mathbb{R}^2\).

%run -i "example_hac_1.py"

the input to the algorithm is an \(N \times N\) dissimilarity matrix \(D_{ij}\geq0\), and the output is a tree structure in which groups i and j with small disimilarity are grouped together in a hierarchical fashion.

We will use city block distance between the points to define the dissimilarity, i.e.

\[d_{ij} = \sum_{k=1}^{2}|x_{ik}-x_{jk}|\]

We start with a tree with \(N\) leaves (in this example, \(N=5\)), each corresponding to a cluster with a single data point. Next we compute the pair of points that are closest, and merge them: * We see that (1,3) and (4,5) are both distance 1 apart, so they get merged first. * We then measure the dissimilarity between the sets {1,3}, {4,5} and {2} using some measure (details below), and group them, and repeat

The result is a binary tree known as a dendrogram. By cutting this tree at different heights, we can induce a different number of (nested) clusters.

linked = linkage(X, 'single')
labelList = range(1, 6)
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=labelList, distance_sort='descending', show_leaf_counts=True)
plt.show()

How to interpret a dendrogram

The dendrogram visually represents the hierarchical clustering of the data.
The vertical lines represent the distances or dissimilarities between clusters.
The height at which two clusters are merged indicates the distance at which they were merged.
The truncation (truncate_mode=‘lastp’) allows you to focus on the final stages of the clustering process, showing only the last p merged clusters.