Clustering – MST0052

## MST0052 -- Lecture 13

### Clustering

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- What clustering **is** (and what it **isn't**)
- **k-means** -- algorithm, initialisation, scaling, choosing $k$
- **Hierarchical clustering** -- dendrograms and linkage
- Honest evaluation when there are **no labels**
- Using clustering inside a **predictive project**

---

## Grouping without labels

Clustering takes a feature matrix $X$ and produces **group labels** for each row.

- There is **no target** $y$
- The algorithm has no idea what the "right" grouping is -- it only knows about similarity in feature space

Same shift as L9: not "predict this number" but **"what structure is in this data?"**

---

## Two main families today

- **Partitioning methods** (k-means, k-medoids) -- split the data into a fixed number of disjoint clusters
- **Hierarchical methods** (agglomerative, divisive) -- build a tree of nested merges or splits

Plus, as backup: **density-based** methods (DBSCAN) that don't assume any particular number or shape.

---

## Where clustering fits in your project

Per the L1 rules: clustering alone is **not** a valid project.

Valid uses **inside** a predictive project:

- **Exploratory analysis:** "are there natural groups in my data?"
- **Feature engineering:** cluster labels (or distances to centroids) as input features for a downstream model
- **Segmentation for analysis:** train separate models per cluster, or report performance per cluster

---

## When clustering doesn't help

- The features are **not informative enough** to separate anything
- The "clusters" you find are an artefact of **how you chose the features** or **the scaling**
- The "natural" number of clusters depends entirely on the algorithm and a few hyperparameters

> A clustering result that disappears when you change the seed, the features, or the method is **not a discovery.**

---

## The k-means problem

**Given:** a feature matrix $X$ and an integer $k$.

**Find:** $k$ centroids $\mu_1, \dots, \mu_k$ and an assignment of each point to a centroid that minimises:

$$\sum_{i=1}^{n} \min_{j} \|x_i - \mu_j\|^2$$

That is the **within-cluster sum of squares** (WCSS), also called **inertia**.
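
To make the objective concrete, here is a minimal sketch that recomputes the WCSS by hand and compares it to scikit-learn's `inertia_`. The synthetic `make_blobs` data and the names `X_demo` and `km_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 300 points around 3 centres
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from every point to every centroid, shape (n, k)
diffs = X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# WCSS: for each point take the nearest centroid, then sum over points
wcss = sq_dists.min(axis=1).sum()

print(f"manual WCSS: {wcss:.3f}")
print(f"km.inertia_: {km_demo.inertia_:.3f}")
```

The two numbers should agree up to floating-point error, which is all that "inertia" means.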

---

## The k-means algorithm (Lloyd's)

A two-step loop:

1. **Assign** each point to its nearest centroid
2. **Update** each centroid to the mean of its assigned points

Repeat until assignments don't change (or a max iteration limit is hit).

Converges, but only to a **local minimum** -- different starts give different answers.
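
The two-step loop can be sketched in plain NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name `lloyd_kmeans`, the random-data-point initialisation, and the toy data are all assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, started from k distinct random data points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign: index of the nearest centroid for every point
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (assumed): two tight, well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

centroids, labels = lloyd_kmeans(X_demo, k=2)
print(np.round(centroids, 2))
```

On data this easy any start finds the two blobs. On harder data, rerunning with a different `seed` can end in a different local minimum.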

---

## k-means++ initialisation

Random initialisation often produces bad local minima.

**k-means++** picks each new centroid with probability proportional to the squared distance from the nearest existing centroid -- spreading them out.

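scikit-learn uses k-means++ by default. With `n_init=10`, it runs 10 fresh initialisations and keeps the one with the lowest inertia. That was the default before scikit-learn 1.4; newer versions default to `n_init='auto'`, which runs only **one** initialisation when k-means++ is used, so set `n_init=10` explicitly.

> **Never set `n_init=1` in a project.** Multiple initialisations protect you from unlucky inits.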
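
The seeding rule can be sketched in a few lines of NumPy. This is a simplified version under stated assumptions: the function name `kmeans_pp_init` and the toy data are hypothetical, and scikit-learn's own implementation differs in detail (as far as I know, it also evaluates several candidate points per step).

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k starting centroids from the rows of X, k-means++ style."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next centroid: sampled with probability proportional to d2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[idx])
    return np.array(centroids)

# Toy data (assumed): three loose groups in 2D
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])

init = kmeans_pp_init(X_demo, k=3)
print(np.round(init, 2))
```

A point that is already a centroid has squared distance 0, so it cannot be picked twice; far-away points are the most likely next picks.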

---

## Scaling is non-negotiable (again)

k-means uses **Euclidean distance**. Same lesson as ridge, k-NN, SVM, PCA: features on different scales distort the geometry.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipe.fit_predict(X_train)
```

---

## What k-means assumes about the data

- Clusters are **roughly spherical** (in scaled feature space)
- Clusters are **roughly equal in size**
- The number of clusters is **known and fixed** in advance

When these assumptions fail (elongated clusters, very unequal sizes, unknown $k$), k-means struggles.

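Hierarchical clustering and DBSCAN handle some of these cases better: neither is tied to spherical clusters, and neither needs $k$ fixed before fitting.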

---

## Choosing $k$ -- the elbow method

For a range of $k$ values, fit k-means and record WCSS (inertia). Plot WCSS vs $k$.

The "elbow" -- the point of diminishing returns -- is a candidate $k$.

![Elbow plot showing the within-cluster sum of squares vs k](/figures/kmeans-elbow.svg)

**Caveat:** the elbow is often ambiguous. More useful for **ruling out** very small or very large $k$ than for picking the exact value.

---

## Choosing $k$ -- the silhouette score

For each point: how close is it to its own cluster (**cohesion**) vs the nearest other cluster (**separation**)?

The **silhouette score** is the average over all points, in $[-1, 1]$:

- Close to **1**: well-clustered
- Close to **0**: borderline (on the edge of two clusters)
- **Negative**: probably assigned to the wrong cluster

```python
from sklearn.metrics import silhouette_score
silhouette_score(X_scaled, labels)
```

Pick $k$ to **maximise silhouette**, but only among visually defensible options.
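
The average can hide a lot, so it helps to look at the per-point values. For point $i$, with $a$ the mean distance to its own cluster and $b$ the mean distance to the nearest other cluster, the silhouette is $(b - a) / \max(a, b)$. The sketch below uses `silhouette_samples` on the wine data; the choice of dataset and $k = 3$ here is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_wine_scaled = StandardScaler().fit_transform(X_wine)

labels_3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_wine_scaled)

# One silhouette value per point, each in [-1, 1]
per_point = silhouette_samples(X_wine_scaled, labels_3)

print(f"mean (the silhouette score): {per_point.mean():.3f}")
print(f"points with negative silhouette: {(per_point < 0).sum()} of {len(per_point)}")

# Average per cluster: a weak cluster can hide behind a decent overall mean
for c in np.unique(labels_3):
    print(f"cluster {c}: mean silhouette {per_point[labels_3 == c].mean():.3f}")
```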

---

## The agglomerative idea

- Start with each observation as **its own cluster**
- At each step, **merge the two closest clusters**
- Repeat until one cluster remains

You now have a binary tree -- the **dendrogram** -- that records the merge order and distance.

---

## The dendrogram

![A small dendrogram with a horizontal cut line](/figures/dendrogram-example.svg)

Each horizontal cut of the dendrogram gives you a partition:

- Cut **high** → fewer, larger clusters
- Cut **low** → more, smaller clusters

> The dendrogram is the strongest pedagogical advantage of hierarchical clustering -- you see the structure across all values of $k$ at once.
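
Cutting the tree can be done in code with SciPy's `fcluster`. A minimal sketch, assuming synthetic data with three obvious groups (the data and variable names are for illustration only):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three tight, well-separated groups of 10 points each (synthetic)
X_demo = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0, 5, 10)])

# Full merge history: one row per merge
Z = linkage(X_demo, method="ward")

# Cut high: ask for at most 3 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")
# Cut lower: ask for at most 6 clusters
labels_6 = fcluster(Z, t=6, criterion="maxclust")

print("high cut:", len(set(labels_3)), "clusters")
print("low cut: ", len(set(labels_6)), "clusters")
```

The tree is built once; every value of $k$ is then just a different cut of the same `Z`.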

---

## Linkage rules

"Distance between clusters" can mean several things:

| Linkage | Distance between clusters | Effect |
|---------|---------------------------|--------|
| **Single** | Closest pair of points | Long, stringy chains |
| **Complete** | Farthest pair | Tight, compact clusters |
| **Average** | Average over all pairs | Balanced compromise |
| **Ward** | Increase in within-cluster variance | Tight, k-means-like |

![Four linkage rules on the same dataset](/figures/linkage-comparison.svg)

**Ward** is the most common default.
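
To see how much the linkage rule matters, one option is to run all four on the same non-spherical data and compare each result to the known shapes. A sketch, assuming scikit-learn's synthetic `make_moons` data (not part of the lecture's wine example):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly not spherical clusters
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

ari_by_linkage = {}
for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X_moons)
    # Agreement with the true moons (1.0 = perfect recovery)
    ari_by_linkage[link] = adjusted_rand_score(y_moons, labels)
    print(f"{link:>8}: ARI vs true moons = {ari_by_linkage[link]:.3f}")
```

On chain-like shapes such as these, single linkage tends to follow the curve while the compact-cluster rules tend to cut across it. On noisier data the ranking can flip, which is why no single linkage rule is always right.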

---

## Hierarchical clustering in scikit-learn

```python
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X_scaled)
```

For the **dendrogram itself**, use `scipy.cluster.hierarchy`:

```python
from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(X_scaled, method='ward')
dendrogram(Z)
```

Same scaling rule as k-means.

---

## k-means vs hierarchical

| Aspect | k-means | Hierarchical |
|--------|---------|--------------|
| Speed | Fast ($O(nkd)$ per iteration, $d$ features) | Slower ($O(n^2)$ to $O(n^3)$) |
| Specify $k$ upfront? | Yes | No -- cut the dendrogram |
| Cluster shape | Roughly spherical | More flexible |
| Deterministic? | No (random init) | Yes |
| Native visualisation | Scatter / silhouette | Dendrogram |
| Scales to large $n$? | Yes | Not really |

> **Practical default:** k-means with `n_init=10` first. Switch to hierarchical when $n$ is small and you want a dendrogram.

---

## The hard truth

Clustering has **no ground truth**. You cannot compute "accuracy."

Every evaluation metric you'll see (silhouette, Davies-Bouldin, Calinski-Harabasz) measures **internal structure** -- how well-separated and tight your clusters are *under the algorithm's own definition of similarity*.

That is **not** the same as "the clusters are real."

---

## Internal metrics

| Metric | What it measures | Higher = better? |
|--------|------------------|------------------|
| **Silhouette** | Cohesion vs separation, per point | Yes (closer to 1) |
| **Davies-Bouldin** | Similarity to nearest cluster | **No** (lower is better) |
| **Calinski-Harabasz** | Between- vs within-cluster variance | Yes |

These metrics agree on easy cases. On hard cases **they disagree** -- which is itself a signal that the structure is weak.
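
All three metrics are one function call each in scikit-learn. The sketch below tabulates them for several values of $k$; using the wine data and this range of $k$ is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_wine_scaled = StandardScaler().fit_transform(X_wine)

rows = []
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_wine_scaled)
    rows.append((
        k,
        silhouette_score(X_wine_scaled, labels_k),         # higher is better
        davies_bouldin_score(X_wine_scaled, labels_k),     # lower is better
        calinski_harabasz_score(X_wine_scaled, labels_k),  # higher is better
    ))

print(" k  silhouette  davies-bouldin  calinski-harabasz")
for k, sil, db, ch in rows:
    print(f"{k:2d}  {sil:10.3f}  {db:14.3f}  {ch:17.1f}")
```

If the three columns point to different values of $k$, treat that as evidence that the structure is weak rather than picking the metric you like best.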

---

## External metrics (when you secretly have labels)

Sometimes you know the "right" clusters and want to compare.

- **Adjusted Rand Index (ARI)** -- agreement between two partitions, adjusted for chance. Equals 1 for identical partitions, is close to 0 for random labelling, and can be negative.
- **Normalised Mutual Information (NMI)** -- information-theoretic agreement. In $[0, 1]$.

```python
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, cluster_labels)
```

Useful for **benchmarking** a clustering method against known structure -- not part of a predictive project's main analysis.
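
Both metrics compare **partitions**, not label values, so renaming the clusters changes nothing. A small sketch with hand-made labels (the three label lists are made up for illustration):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as y_true, but the cluster ids are permuted
same_partition = [2, 2, 2, 0, 0, 0, 1, 1, 1]
# One point from each true group placed in another cluster
noisy_partition = [0, 0, 1, 1, 1, 2, 2, 2, 0]

results = {}
for name, pred in [("same", same_partition), ("noisy", noisy_partition)]:
    ari = adjusted_rand_score(y_true, pred)
    nmi = normalized_mutual_info_score(y_true, pred)
    results[name] = (ari, nmi)
    print(f"{name:>5}: ARI {ari:.3f}, NMI {nmi:.3f}")
```

The permuted labelling scores a perfect 1 on both metrics; the noisy one scores lower. This is why plain accuracy is the wrong tool here: cluster ids carry no meaning.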

---

## Sensitivity checks

Honest clustering reports the answer to:

> Does the result hold up under perturbation?

Three cheap checks:

- **Reseeding:** does the clustering change if you change `random_state`?
- **Feature dropping:** drop one feature at a time. Do clusters survive?
- **Subsampling:** cluster on 80% of the data. Do the same groups emerge?

A clustering that doesn't survive these checks is more **artefact** than **finding**.
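
The three checks can be sketched with ARI as the "did the groups change?" measure. Assumptions here: the wine data, $k = 3$, a hypothetical helper named `cluster`, and scaling done once on the full data (acceptable for a stability check, not for model evaluation).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_s = StandardScaler().fit_transform(X_wine)

def cluster(X, seed=42):
    """Hypothetical helper: k-means labels for k=3."""
    return KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)

reference = cluster(X_s)

# Reseeding: same data, different random_state
reseed_ari = [adjusted_rand_score(reference, cluster(X_s, seed=s)) for s in (0, 1, 2)]

# Feature dropping: remove one column at a time
drop_ari = [
    adjusted_rand_score(reference, cluster(np.delete(X_s, j, axis=1)))
    for j in range(X_s.shape[1])
]

# Subsampling: cluster a random 80% and compare on those rows only
rng = np.random.default_rng(0)
sub_ari = []
for _ in range(5):
    idx = rng.choice(len(X_s), size=int(0.8 * len(X_s)), replace=False)
    sub_ari.append(adjusted_rand_score(reference[idx], cluster(X_s[idx])))

print(f"reseeding        min ARI: {min(reseed_ari):.3f}")
print(f"feature dropping min ARI: {min(drop_ari):.3f}")
print(f"subsampling      min ARI: {min(sub_ari):.3f}")
```

Report the minimum, not the mean: one perturbation that destroys the grouping is the interesting case.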

---

## Cluster labels as features

A common project pattern: fit k-means on training data, use the cluster label (or distance to each centroid) as an **input feature** for the downstream model.

Done **inside the pipeline** so it gets refit on each CV fold -- L3 leakage rule.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, n_init=10, random_state=42),
    LogisticRegression(),
)
```

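`KMeans.transform()` returns **distances to each centroid** -- often more useful than the integer label alone. Inside a pipeline, `KMeans` acts as a transformer, so the logistic regression above is trained on those distances **instead of** the original features, not in addition to them.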
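
The difference between the hard label and the distance features is easy to check directly. A short sketch, assuming the wine data and the same `n_clusters=5` as in the pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)
X_s = StandardScaler().fit_transform(X_wine)

km5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_s)

hard_labels = km5.predict(X_s)   # one integer per row
distances = km5.transform(X_s)   # one column per centroid

print(hard_labels.shape)  # (178,)
print(distances.shape)    # (178, 5)

# The hard label is the column holding the smallest distance
print((distances.argmin(axis=1) == hard_labels).all())
```

The distance matrix keeps information the label throws away, for example how close a point is to its second-nearest centroid.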

---

## Segmentation as an analysis tool

After fitting your predictive model:

- **Cluster the residuals** -- where does the model fail?
- **Cluster the inputs and report metrics per cluster** -- where does it succeed?

Surfaces **heterogeneity**: maybe your model is great for one segment and terrible for another.

A small but genuine contribution to a project's analysis section.
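
A per-cluster report takes only a few lines. The sketch below assumes the wine data, a logistic-regression classifier, and $k = 3$ segments; accuracy stands in for whatever metric the project uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Out-of-fold predictions, so every row is scored by a model that never saw it
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
y_pred = cross_val_predict(model, X_wine, y_wine, cv=5)

# Segments come from the inputs only; the target is not used here
segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X_wine)
)

per_segment = {}
for seg in np.unique(segments):
    mask = segments == seg
    acc = float((y_pred[mask] == y_wine[mask]).mean())
    per_segment[int(seg)] = (int(mask.sum()), acc)
    print(f"segment {seg}: n={mask.sum():3d}, accuracy={acc:.3f}")
```

Always print the segment size next to the metric: a poor score on a handful of rows is a much weaker signal than the same score on a large segment.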

---

## Things to avoid

- Don't **rename clusters with confident stories.** "Cluster 1 is the loyal customers" is a story you tell *after* you check it against domain knowledge.
- Don't **use clustering as the headline of the project.** L1 rule: you need a predictive target.
- Don't **trust silhouette scores in isolation.** A clustering with a high silhouette score and no domain meaning is suspicious.

---

## The setup

- **Dataset:** scikit-learn's `load_wine()` -- used in L7 (model selection). 178 wines, 13 chemical features, 3 classes.
- For **clustering**, we ignore the class labels during fitting. We'll use them at the end to **honestly evaluate** how well clustering recovered the true structure.

**Goals:**

1. Cluster the wines
2. Check sensitivity
3. Use the cluster label as a feature for logistic regression

```python
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True, as_frame=True)
```

---

## Step 1: scaled k-means with an elbow check

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertias vs k - look for the elbow
```

The elbow on this dataset sits around $k = 3$ -- consistent with the (hidden) class structure.

---

## Step 2: silhouette and sensitivity

```python
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, km.labels_):.3f}")

# Sensitivity to seed -- deliberately disable the n_init protection
for s in [0, 1, 42, 100, 999]:
    km_s = KMeans(n_clusters=3, n_init=1, random_state=s).fit(X_scaled)
    print(f"  seed {s:3d}: silhouette {silhouette_score(X_scaled, km_s.labels_):.3f}")
```

With `n_init=10`: results are stable across seeds. With `n_init=1`: each run depends on a single initialisation, so the scores can vary from seed to seed.

---

## Step 3: compare to the (hidden) ground truth

```python
from sklearn.metrics import adjusted_rand_score
print(f"ARI vs true classes: {adjusted_rand_score(y, km.labels_):.3f}")
```

On wine, k-means with the right $k$ recovers **most** of the class structure (ARI typically ~0.9). On other datasets, the gap is often wider.

> "The silhouette is high" is **not** the same as "the clustering matches reality."

---

## Step 4: cluster labels as features for prediction

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Baseline: logistic regression on scaled features
base = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=5000))
print(f"Baseline CV: {cross_val_score(base, X, y, cv=5).mean():.3f}")

# With cluster-distance features: wrap KMeans + LR in a pipeline
# so KMeans is refit on each fold (no leakage)
```

Adding cluster-distance features **rarely helps on wine** (the original features are already informative). On messier datasets, it can add a small boost.
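
One possible way to build the cluster-distance variant is a `FeatureUnion` that keeps the scaled features and appends the centroid distances. This is a sketch of one design, not the only one; the step names `original` and `centroid_dist` are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Keep the scaled features AND append the distance to each centroid
features = FeatureUnion([
    ("original", FunctionTransformer()),  # identity: passes features through
    ("centroid_dist", KMeans(n_clusters=3, n_init=10, random_state=42)),
])

augmented = make_pipeline(
    StandardScaler(),
    features,
    LogisticRegression(max_iter=5000),
)

# The whole pipeline, KMeans included, is refit on each training fold
scores = cross_val_score(augmented, X, y, cv=5)
print(f"With cluster distances CV: {scores.mean():.3f}")
```

Putting `KMeans` directly between the scaler and the classifier would instead replace the 13 features with 3 distances, which answers a different question.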

---

## Summary

- **k-means** partitions data by minimising within-cluster squared distance. Scale first; set `n_init >= 10`; pick $k$ via elbow + silhouette + domain knowledge.
- **Hierarchical clustering** builds a dendrogram of nested merges. Ward linkage is the safe default.
- **No ground truth.** Internal metrics describe *structure*, not truth. Always run sensitivity checks.
- **Use clustering inside a predictive project** as features or segmentation -- not as the headline.

---

## Before Lecture 14

- Try clustering on your project's training data. Does it surface meaningful groups? If so, consider adding cluster features to your downstream model.
- This was the **last new method** in the course. From here, the focus is on doing what you have well.
- Read ahead: **Lecture 14 is neural networks** -- a conceptual bridge, not a new tool for the project.

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## DBSCAN: density-based clustering

Finds clusters of **arbitrary shape**; doesn't require specifying $k$; flags outliers as **noise**.

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_   # -1 means noise
```

Two knobs:

- **`eps`** -- neighbourhood radius
- **`min_samples`** -- minimum points to form a dense region

The right tool when k-means' spherical assumption clearly fails (e.g., concentric rings).
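
The concentric-rings case is easy to reproduce. A sketch, assuming scikit-learn's synthetic `make_circles` data and hand-picked `eps=0.15`; on real data `eps` needs tuning, and the features should be scaled first.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two noisy concentric rings; both coordinates are already on the same scale
X_rings, y_rings = make_circles(n_samples=500, factor=0.4, noise=0.04, random_state=0)

db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X_rings)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_rings)

# Agreement with the true ring membership
db_ari = adjusted_rand_score(y_rings, db_labels)
km_ari = adjusted_rand_score(y_rings, km_labels)

print(f"DBSCAN : ARI {db_ari:.3f}, noise points: {(db_labels == -1).sum()}")
print(f"k-means: ARI {km_ari:.3f}")
```

k-means can only split the plane with a straight boundary, so it cuts both rings in half; DBSCAN follows the dense ring shapes instead.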

## Gaussian mixture models (GMM)

**Soft** clustering -- each point gets a **probability** of belonging to each cluster.

```python
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, random_state=42)
gm.fit(X_scaled)
probs = gm.predict_proba(X_scaled)
```

Useful when clusters genuinely **overlap**.

GMM models each cluster as a Gaussian -- including covariance, so clusters can be elliptical, not just spherical.

## MiniBatchKMeans for very large data

Stochastic k-means: updates centroids using small random batches instead of the full dataset.

```python
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10, batch_size=1024,
                     n_init=10, random_state=0)
km.fit(X_scaled)
```

- Same objective, **much** faster on large datasets
- Slightly noisier final centroids
- Use when $n$ is hundreds of thousands or millions

## Distance metrics beyond Euclidean

When Euclidean is wrong:

| Data type | Better metric |
|-----------|--------------|
| Binary features | Hamming, Jaccard |
| Text (TF-IDF) | Cosine |
| Mixed numeric + categorical | Gower |

```python
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(
    n_clusters=3, metric='cosine', linkage='average',
)
```

## Cluster visualisation via PCA

The standard "did the clustering find structure?" plot:

```python
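import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X_scaled)
# Colour each point by its cluster label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, cmap='tab10')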
```

Project clusters to 2D via PCA (L9) and colour by cluster label.

**Honest about lossiness:** clusters that look overlapping in 2D may be cleanly separated in higher dimensions -- same caveat as the L9 PCA visualisation.

---

## What's next

**Lecture 14:** Neural networks

- From linear models to neural networks
- The conceptual bridge from classical ML
- When to use deep learning vs classical methods