%cd 5_2_Unsupervised_Clustering/wd/5_2_Unsupervised_Clustering
Machine Learning
%cd 5_2_Unsupervised_Clustering/wd/5_2_Unsupervised_Clustering
Clustering is a very common form of unsupervised learning. It aims to find a natural grouping in data so that the items in the same cluster are more similar to each other than to those from different clusters.
There are two main categories of clustering: * Prototype-based clustering: Each cluster is represented by a prototype. Usually, this prototype is: - The centroid (average) of similar points with continuous features - The medoid (the point that minimises the distance to the other points in a cluster)
We assume there are \(K\) cluster centers \(μ_{k} \in \mathbb{R}^{D}\), so we can cluster the data by assigning each data point \(x_{n} \in \mathbb{R}^{D}\) to it closest center:
\[ z_{n}^{*} = \text{argmin}_{k} || \vec{x}_{n} - \vec{\mu}_{k} ||^{2}\]
Of course, we don’t know the cluster centers, but we can estimate them by computing the average value of all points assigned to them:
\[ \vec{\mu}_{k} = \frac{1}{N_{k}} \sum_{n:z_{n} = k} \vec{x}_{n}\]
We can then iterate these steps to convergence.
More formally, we can view this as finding a local minimum of the following cost function, known as the distortion:
\[ J(\mathbf{M}, \mathbf{Z}) = \sum_{n=1}^{N} || \vec{x}_{n} - \vec{\mu}_{z_{n}} ||^{2} = || \mathbf{X} - \mathbf{Z}\mathbf{M}^{T} || ^{2} \]
where \(\mathbf{X} \in \mathbb{R}^{N \times D}\), \(\mathbf{Z} \in [0, 1]^{N \times K}\), and \(\mathbf{M} \in \mathbb{R}^{D\times K}\) containes the clusters centers \(\vec{\mu}_{k}\) in its columns.
K-Means optimises this using alternating minimisation. From a theoretical perspective, finding the optimal solution for a clustering problem is particularly difficult (NP-hard).
Thus, all algorithms are only approximations of the optimal solution. The k-means algorithm, is an iterative process that refines the solution until it converges to a local optimum.
Some properties of K-means: * Clusters do not overlap * Clusters are not hierarchical * It is assumed that there is at least one item in each cluster
Four steps:
We are goint to apply these steps through an example. The following code just creates a dataset to work with.
%run -i "example_kmeans.py"def pick_centroids(data, k):
indexes = np.random.choice(len(data), size=k, replace=False)
centroids = data[indexes]
return centroidsdef assign_cluster(data, centroids):
# Pairwise squared L2 distances. Shape [n, k]
distances = ((data[:, np.newaxis] - centroids)**2).sum(axis=2)
# find closest centroid index. Shape [n]
clusters = np.argmin(distances, axis=1)
return clustersdef update_centroids(data, clusters, k):
# Mean positions of data within clusters
centroids = [np.mean(data[clusters == i], axis=0) for i in range(k)]
return np.array(centroids)Using the two functions defined above, we can create an object KMEANS with methods fit and predict
class KMEANS:
def __init__(self, k):
self.k = k
def fit(self, data, steps=20):
self.centroids = pick_centroids(data, self.k)
for step in range(steps):
clusters = assign_cluster(data, self.centroids)
self.centroids = update_centroids(data, clusters, self.k)
def predict(self, data):
return assign_cluster(data, self.centroids)
Similarly to what we do with scikit-learn…
kmeans = KMEANS(k=3)
kmeans.fit(data)
clusters = kmeans.predict(data)We can plot the results of K-Means applied to the data:
import matplotlib.pyplot as plt
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', alpha=0.5)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c="red", s=100, edgecolor='black', label='Centroids')
plt.title('Cluster Visualization with Centroids')
plt.grid(False)
plt.show()It is possible to visualize the decision boundaries of K-Means. For this, we can use a Voronoi diagram
def plot_decision_boundaries(clusterer, X, resolution=1000):
plt.figure()
mins = X.min(axis=0) - 0.2
maxs = X.max(axis=0) + 0.2
xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution), np.linspace(mins[1], maxs[1], resolution))
Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title('Cluster Visualization with Voronoi cells')
plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), cmap="Pastel2")
plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), linewidths=1, colors="k")
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis', alpha=0.5)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c="red", s=100, edgecolor='black', label='Centroids')
plt.show()
plot_decision_boundaries(kmeans, data)from sklearn.cluster import KMeans
import numpy as np
kmeans_sklearn = KMeans(n_clusters=2, init='random', n_init=10, random_state=29).fit(data)
kmeans_sklearn.predict(data)
kmeans_sklearn.cluster_centers_array([[ 4.40993308, 2.25708681],
[-1.59048593, -0.23386661]])
In this section, we discuss how to choose the number of clusters K in the K-means algorithm and other related methods. We will see two main methods: * The elbow method * Silhouette plot method
This method should already be familiar to you. We can compare the performance of different k-means clustering (different number of clusters) by using the within-cluster Squared Standard Error (SSE), also called distortion.
Since we are using Sklearn, the distortion is already accessible via the inertia_ attribute of the Kmeans model. Therefore, we can plot this distortion for different values of k
distortions = []
for i in range(1, 11):
km = KMeans(n_clusters=i, init="k-means++", n_init=10, max_iter=300, random_state=0)
km.fit(data)
distortions.append(km.inertia_)
plt.plot(range(1,11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
plt.show()We can see that the elbow is located at \(k=3\), so it would be a good choice for this dataset
The Silhouette method can be used to plot a measure of how tightly grouped the examples in the clusters are. It is designed to work for spherical (not elongated) clusters.
To do this, we have to define the Silhouette coefficient as:
\[sc(i) = \frac{b_{i} - a_{i}}{\text{max}(a_{i}, b_{i})}\]
where \(a_{i}\) is the mean distance to the other instances in the same cluster, thus it is a measure of compactness of the \(i\)’s cluster, and \(b_{i}\) is a measure of distance between the clusters (average distance between the example and all the instances in the nearest cluster).
The silhouette coefficient varies from \(-1\) to \(+1\) where: - \(1\) means instance is close to all the members of its cluster and far from other clusters. - \(0\) means it is close to a cluster boundary. - \(-1\) means it may be in the wrong cluster.
Let’s see this through an example. The code in the next cell creates a dataset whith some elongated clusters for demonstration purposes:
%run -i "example_silhouette.py"Before plotting the silhouette plots, let’s see first the number of clusters given by the elbow method
distortions = []
for i in range(2, 10):
km = KMeans(n_clusters=i, n_init=10, random_state=42)
km.fit(X)
distortions.append(km.inertia_)
plt.plot(range(2,10), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
plt.show()Let’s now plot the silhouette coefficients for each number of clusters \(k\)
from sklearn.metrics import silhouette_score
Ks = range(2, 10)
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X) for k in Ks]
silhouette_scores = [silhouette_score(X, model.labels_) for model in kmeans_per_k]
plt.figure()
plt.plot(Ks, silhouette_scores, marker='o')
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.tight_layout()
plt.show()from sklearn.metrics import silhouette_samples
from matplotlib.ticker import FixedLocator, FixedFormatter
import warnings
warnings.filterwarnings("ignore")
def plot_silhouette(model, X):
mu = model.cluster_centers_
K, D = mu.shape
y_pred = model.labels_
silhouette_coefficients = silhouette_samples(X, y_pred)
silhouette_scores = silhouette_score(X, model.labels_)
cmap = cm.get_cmap("Pastel2")
colors = [cmap(i) for i in range(K)]
padding = len(X) // 30
pos = padding
for i in range(K):
coeffs = silhouette_coefficients[y_pred == i]
coeffs.sort()
color = mpl.cm.Spectral(i / K)
# color = colors[i]
plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs, facecolor=color, edgecolor=color, alpha=0.7)
pos += len(coeffs) + padding
score = silhouette_scores
plt.axvline(x=score, color="red", linestyle="--")
plt.title("$k={}, score={:0.2f}$".format(K, score), fontsize=16)
for model in kmeans_per_k:
K, D = model.cluster_centers_.shape
plt.figure()
plot_silhouette(model, X)
fname = f"kmeans_silhouette_diagram{K}.pdf"
plt.tight_layout()We can see in the resulting plots that the silhouettes have different lengths and widths, which indicates a suboptimal clustering.
| ## 5.2.2 Mini Batch K-Means for high N |
| ::: {.callout-note icon=false} |
| #### Mini Batch K-Means |
| Source: Scikit-Learn |
| It is a variant of K-Means that uses mini-batches to reduce the computation time. Mini-batches are subsets of the input data, randomly sampled in each training iteration. |
| In general, the results of Mini Batch K-Means are only slightly worse than the standard algorithm |
| ::: |
| ::: {#cell-48 .cell} |
The following code compares the computation times and clustering performance (inertia) of the KMeans and MiniBatchKMeans algorithms for different numbers of clusters, storing the results in the times and inertias arrays to plot them. |
| ::: {#cell-50 .cell} |
| ::: {.cell-output .cell-output-stdout} |
| ::: {.cell-output .cell-output-display} |
| ## Evaluating the output of clustering methods |
| ::: {.callout-note} #### The challenge of measuring the quality Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method. |
| If we have labeled data for some of the data, we can use the similarity (or equality) between the labels of two data points as a metric for determining if the two inputs “should” be assigned to the same cluster or not. |
| If we don’t have labels, but the method is based on a generative model of the data, we can use log likelihood as a metric. |
For this section, we will use data generated with make_moons. You should already be familiar with this. |
| ::: {#cell-54 .cell} ``` {.python .cell-code} %run -i “make_moons.py” |
| import seaborn as sns sns.scatterplot(dfTR, x = “X1”, y = “X2”, hue=“Y”); ``` |
| ::: {.cell-output .cell-output-display} |
| We will apply a simple K-means with just two clusters for demonstration purposes: |
| ::: {#cell-56 .cell} |
| ### Purity |
| ::: {.callout-note} #### Purity It is a common evaluation metric for clustering algorithms when we have labeled data. It measures the extent to which clusters contain a single class of data points. Higher purity indicates better clustering performance, as it means that the clusters are more homogeneous with respect to the ground truth labels. |
| The steps are the following: 1. First, we assign each cluster to the class which is most frequent in the cluster: This is done by finding the majority class in each cluster. 2. Count the number of correctly assigned data points: These are the data points that belong to the majority class in their respective clusters. 3. Calculate purity: Purity is the ratio of the number of correctly assigned data points to the total number of data points. |
| Mathematically: |
| Let \(N_{ij}\) be the number of objects in cluster \(i\) that belong to class \(j\), \(K\) is the total number of clusters, and \(N\) is the total number of data points in the dataset.The overall purity of a clustering is: \[p = \frac{1}{N} \sum_{i=1}^{K} \max_{j} \, N_{ij}\] |
| We can code a function to calculate this purity: |
| ::: {#cell-59 .cell} ``` {.python .cell-code} import numpy as np from sklearn import metrics |
| def purity_score(y_true, y_pred): contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred) |
| majority_class_count = np.sum(np.max(contingency_matrix, axis=1)) #Sums the counts of the majority classes (np.max) purity = majority_class_count / np.sum(contingency_matrix) |
| return purity ``` ::: |
| ::: {#cell-60 .cell} |
| ::: {.cell-output .cell-output-display} |
| The purity ranges between 0 (bad) and 1 (good). |
| However, we can trivially achieve a purity of 1 by putting each object into its own cluster, so this measure does not penalize for the number of clusters. |
| ### Rand Index |
| ::: {.callout-note} #### Rand Index It is a measure of the similarity between two data clusterings (for example, kmeans with the real labeled data) |
| The Rand Index ranges from 0 to 1, where 1 indicates perfect agreement between the two clusterings. |
| Let U and V be two different partitions of the N data points. |
| For example, U might be the estimated clustering and V is reference clustering derived from the class labels. Now define a \(2\times2\) contingency table, containing the following numbers: |
| - TP (True positives): Count the number of pairs of data points that are in the same cluster in both U and V. - TN (True negatives): Count the number of pairs of data points that are in different clusters in both U and V. - FN (False negatives): Count the number of pairs of data points that are in different clusters in U but same in V. - FP (False positives): Count the number of pairs of data points that are in the same cluster in U but different in V. |
| The Rand index is calculated: \[R = \frac{TP+ TN}{TP + FP + TN + FN}\] This can be interpreted as the fraction of clustering decisions that are correct. Clearly \(0 \leq R \leq 1\). The Rand index weights false positives and false negatives equally. |
In sklearn we can directly use the rand_score function to calculate it. |
| ::: {#cell-65 .cell} |
| ::: {.cell-output .cell-output-display} |
| ::: {.callout-note} #### Adjusted Rand index The Adjusted Rand Index (ARI) is a corrected-for-chance version of the Rand Index. It adjusts the Rand Index to account for the fact that random clusterings can sometimes result in a high Rand Index. The ARI ranges from -1 to 1, where 1 indicates perfect agreement between the two clusterings, 0 indicates random labeling, and negative values indicate less agreement than expected by chance. |
| The formula for the Adjusted Rand Index is: |
| \[\text{ARI} = \frac{\text{RI} - \text{Expected RI}}{\text{Max RI} - \text{Expected RI}}\] |
| Where: |
| * \(\text{RI}\) is the Rand Index. * \(\text{Expected RI}\) is the expected value of the Rand Index for random clusterings. * \(\text{Max RI}\) is the maximum value of the Rand Index. |
| ARI is less sensitive to the absolute sizes of the clusters. |
| ::: {#cell-67 .cell} |
| ::: {.cell-output .cell-output-display} |
| ### Mutual Information |
| ::: {.callout-note} #### Mutual information Another way to measure cluster quality is to compute the mutual information between two candidate partitions U and V. |
| To do this, let \(p_{\text{UV}}(i, j) = \frac{|u_{i}\cap v_{j}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(u_{i}\) in U and \(v_{j}\) in V. |
| Also, let \(p_{U}(i) = \frac{|u_{i}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(u_{i}\) in U; define \(p_{V}(j) = \frac{|v_{J}|}{N}\) be the probability that a randomly chosen object belongs to cluster \(v_{j}\) in V. Then we have: |
| \[\mathbb{I}(U,V) = \sum_{i=1}^{R}\sum_{j=1}^{C} p_{\text{UV}}(i, j) \text{log} \frac{p_{UV}(i, j)}{p_{U}(i)p_{V}(j)}\] |
| This lies between \(0\) and \(\text{min}\{\mathbb{H}(U), \mathbb{H}(V)\}\). Unfortunately, the maximum value can be achieved by using lots of small clusters, which have low entropy. To compensate for this, we can use the normalized mutual information which lies between \(0\) and \(1\). |
| \[N\mathbb{I}(U,V) = \frac{2\mathbb{I}(U,V)}{\mathbb{H}(U) + \mathbb{H}(V)}\] |
| - A score close to 1 indicates that the clusterings are very similar. - A score close to 0 indicates that the clusterings are dissimilar. |
| Therefore, a normalized mutual information score closer to 1 indicates that the clustering is good and there is a high degree of agreement between the true labels and the predicted labels. In other words, the clusters found by the algorithm closely match the true underlying clusters. |
| ::: {#cell-70 .cell} |
| ::: {.cell-output .cell-output-display} |
| ::: {.callout-note} #### Summary * Purity presents bias towards many clusters, but it is easy to understand and compute. It is appropriate when you have clear true labels for your data. * ARI is useful to compare different clustering results on the same dataset, since it accounts for chance grouping. It does not matter if the compared methods have different number of clusters. * Mutual information to measure the amount of information shared between clustering and the true labels. It is not biased towards a specific number of clusters. |
For example, consider the set of \(5\) inputs points below in \(x_n \in \mathbb{R}^2\).
%run -i "example_hac_1.py"the input to the algorithm is an \(N \times N\) dissimilarity matrix \(D_{ij}\geq0\), and the output is a tree structure in which groups i and j with small disimilarity are grouped together in a hierarchical fashion.
We will use city block distance between the points to define the dissimilarity, i.e.
\[d_{ij} = \sum_{k=1}^{2}|x_{ik}-x_{jk}|\]
We start with a tree with \(N\) leaves (in this example, \(N=5\)), each corresponding to a cluster with a single data point. Next we compute the pair of points that are closest, and merge them: * We see that (1,3) and (4,5) are both distance 1 apart, so they get merged first. * We then measure the dissimilarity between the sets {1,3}, {4,5} and {2} using some measure (details below), and group them, and repeat
The result is a binary tree known as a dendrogram. By cutting this tree at different heights, we can induce a different number of (nested) clusters.
linked = linkage(X, 'single')
labelList = range(1, 6)
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=labelList, distance_sort='descending', show_leaf_counts=True)
plt.show()