MST0052
## MST0052 -- Lecture 13

### Clustering

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- What clustering **is** (and what it **isn't**)
- **k-means** -- algorithm, initialisation, scaling, choosing $k$
- **Hierarchical clustering** -- dendrograms and linkage
- Honest evaluation when there are **no labels**
- Using clustering inside a **predictive project**

---

## Grouping without labels

Clustering takes a feature matrix $X$ and produces **group labels** for each row.

- There is **no target** $y$
- The algorithm has no idea what the "right" grouping is -- it only knows about similarity in feature space

Same shift as L9: not "predict this number" but **"what structure is in this data?"**

---

## Two main families today

- **Partitioning methods** (k-means, k-medoids) -- split the data into a fixed number of disjoint clusters
- **Hierarchical methods** (agglomerative, divisive) -- build a tree of nested merges or splits

Plus, as backup: **density-based** methods (DBSCAN) that don't assume any particular number or shape.

---

## Where clustering fits in your project

Per the L1 rules: clustering alone is **not** a valid project.

Valid uses **inside** a predictive project:

- **Exploratory analysis:** "are there natural groups in my data?"
- **Feature engineering:** cluster labels (or distances to centroids) as input features for a downstream model
- **Segmentation for analysis:** train separate models per cluster, or report performance per cluster

---

## When clustering doesn't help

- The features are **not informative enough** to separate anything
- The "clusters" you find are an artefact of **how you chose the features** or **the scaling**
- The "natural" number of clusters depends entirely on the algorithm and a few hyperparameters

> A clustering result that disappears when you change the seed, the features, or the method is **not a discovery.**

---

## The k-means problem

**Given:** a feature matrix $X$ and an integer $k$.

**Find:** $k$ centroids $\mu_1, \dots, \mu_k$ and an assignment of each point to a centroid that minimises:

$$\sum_{i=1}^{n} \min_{j} \|x_i - \mu_j\|^2$$

That is the **within-cluster sum of squares** (WCSS), also called **inertia**.

---

## The k-means algorithm (Lloyd's)

A two-step loop:

1. **Assign** each point to its nearest centroid
2. **Update** each centroid to the mean of its assigned points

Repeat until assignments don't change (or a max iteration limit is hit).

Converges, but only to a **local minimum** -- different starts give different answers.

---

## k-means++ initialisation

Random initialisation often produces bad local minima.

**k-means++** picks each new centroid with probability proportional to the squared distance from the nearest existing centroid -- spreading them out.

scikit-learn uses k-means++ by default. Older releases also defaulted to `n_init=10`; since scikit-learn 1.4 the default is `n_init='auto'`, which runs only **one** initialisation when k-means++ is used. Set `n_init=10` explicitly to run 10 fresh initialisations and keep the best.

> **Never set `n_init=1` in a project.** Multiple restarts protect you from unlucky inits.

---

## Scaling is non-negotiable (again)

k-means uses **Euclidean distance**. Same lesson as ridge, k-NN, SVM, PCA: features on different scales distort the geometry.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipe.fit_predict(X_train)
```

---

## What k-means assumes about the data

- Clusters are **roughly spherical** (in scaled feature space)
- Clusters are **roughly equal in size**
- The number of clusters is **known and fixed** in advance

When these assumptions fail (elongated clusters, very unequal sizes, unknown $k$), k-means struggles. Hierarchical and DBSCAN handle those cases better.

---

## Choosing $k$ -- the elbow method

For a range of $k$ values, fit k-means and record WCSS (inertia).

Plot WCSS vs $k$. The "elbow" -- the point of diminishing returns -- is a candidate $k$.

**Caveat:** the elbow is often ambiguous. More useful for **ruling out** very small or very large $k$ than for picking the exact value.

---

## Choosing $k$ -- the silhouette score

For each point: how close is it to its own cluster (**cohesion**) vs the nearest other cluster (**separation**)?

The **silhouette score** is the average over all points, in $[-1, 1]$:

- Close to **1**: well-clustered
- Close to **0**: borderline (on the edge of two clusters)
- **Negative**: probably assigned to the wrong cluster

```python
from sklearn.metrics import silhouette_score

silhouette_score(X_scaled, labels)
```

Pick $k$ to **maximise silhouette**, but only among visually defensible options.

---

## The agglomerative idea

- Start with each observation as **its own cluster**
- At each step, **merge the two closest clusters**
- Repeat until one cluster remains

You now have a binary tree -- the **dendrogram** -- that records the merge order and distance.

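
To make the merge loop concrete, here is a minimal from-scratch sketch on four toy points. It is deliberately naive (roughly cubic time) and is not how scikit-learn or SciPy implement it. The function name `naive_agglomerative`, the toy array `X_toy`, and the restriction to single or complete linkage are assumptions made for illustration.

```python
import numpy as np


def naive_agglomerative(X, linkage="single"):
    """Return the merge history as [(cluster_a, cluster_b, distance), ...]."""
    if linkage not in ("single", "complete"):
        raise ValueError("this sketch only supports 'single' or 'complete'")
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all observations
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start: every observation is its own cluster (a list of row indices)
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)  # hypothetical id scheme: new clusters get fresh ids
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        # Find the closest pair of clusters under the chosen linkage
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                block = D[np.ix_(clusters[a], clusters[b])]
                d = block.min() if linkage == "single" else block.max()
                if best is None or d < best[2]:
                    best = (a, b, float(d))
        a, b, d = best
        # Merge the two closest clusters and record the step
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Toy data: two tight pairs that sit far apart from each other
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 2.0]])
merges = naive_agglomerative(X_toy)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

On this toy data the two tight pairs merge first, and the two resulting clusters merge last at a much larger distance. That jump in merge distance is what a dendrogram draws as a tall vertical gap.
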

---

## The dendrogram

Each horizontal cut of the dendrogram gives you a partition:

- Cut **high** → fewer, larger clusters
- Cut **low** → more, smaller clusters

> The dendrogram is the strongest pedagogical advantage of hierarchical clustering -- you see the structure across all values of $k$ at once.

---

## Linkage rules

"Distance between clusters" can mean several things:

| Linkage | Distance between clusters | Effect |
|---------|---------------------------|--------|
| **Single** | Closest pair of points | Long, stringy chains |
| **Complete** | Farthest pair | Tight, compact clusters |
| **Average** | Average over all pairs | Balanced compromise |
| **Ward** | Increase in within-cluster variance | Tight, k-means-like |

**Ward** is the most common default.

---

## Hierarchical clustering in scikit-learn

```python
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X_scaled)
```

For the **dendrogram itself**, use `scipy.cluster.hierarchy`:

```python
from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(X_scaled, method='ward')
dendrogram(Z)
```

Same scaling rule as k-means.

---

## k-means vs hierarchical

| Aspect | k-means | Hierarchical |
|--------|---------|--------------|
| Speed | Fast ($O(nk)$ per iter) | Slower ($O(n^2)$ to $O(n^3)$) |
| Specify $k$ upfront? | Yes | No -- cut the dendrogram |
| Cluster shape | Roughly spherical | More flexible |
| Deterministic? | No (random init) | Yes |
| Native visualisation | Scatter / silhouette | Dendrogram |
| Scales to large $n$? | Yes | Not really |

> **Practical default:** k-means with `n_init=10` first. Switch to hierarchical when $n$ is small and you want a dendrogram.

---

## The hard truth

Clustering has **no ground truth**. You cannot compute "accuracy."
Every evaluation metric you'll see (silhouette, Davies-Bouldin, Calinski-Harabasz) measures **internal structure** -- how well-separated and tight your clusters are *under the algorithm's own definition of similarity*.

That is **not** the same as "the clusters are real."

---

## Internal metrics

| Metric | What it measures | Higher = better? |
|--------|------------------|------------------|
| **Silhouette** | Cohesion vs separation, per point | Yes (closer to 1) |
| **Davies-Bouldin** | Similarity to nearest cluster | **No** (lower is better) |
| **Calinski-Harabasz** | Between- vs within-cluster variance | Yes |

These metrics agree on easy cases. On hard cases **they disagree** -- which is itself a signal that the structure is weak.

---

## External metrics (when you secretly have labels)

Sometimes you know the "right" clusters and want to compare.

- **Adjusted Rand Index (ARI)** -- agreement between two partitions, adjusted for chance. At most 1; near 0 for random labellings; can be negative.
- **Normalised Mutual Information (NMI)** -- information-theoretic agreement. In $[0, 1]$.

```python
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(y_true, cluster_labels)
```

Useful for **benchmarking** a clustering method against known structure -- not part of a predictive project's main analysis.

---

## Sensitivity checks

Honest clustering reports the answer to:

> Does the result hold up under perturbation?

Three cheap checks:

- **Reseeding:** does the clustering change if you change `random_state`?
- **Feature dropping:** drop one feature at a time. Do clusters survive?
- **Subsampling:** cluster on 80% of the data. Do the same groups emerge?

A clustering that doesn't survive these is more **artefact** than **finding**.

---

## Cluster labels as features

A common project pattern: fit k-means on training data, use the cluster label (or distance to each centroid) as an **input feature** for the downstream model.

Do this **inside the pipeline** so k-means is refit on each CV fold (the L3 leakage rule).
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    StandardScaler(),
    # As a pipeline step, KMeans passes its transform() output
    # (distances to the 5 centroids) on to the classifier
    KMeans(n_clusters=5, n_init=10, random_state=42),
    LogisticRegression(),
)
```

`KMeans.transform()` returns **distances to each centroid** -- often more useful than the integer label alone. In this pipeline those distances replace the original features as the classifier's input.

---

## Segmentation as an analysis tool

After fitting your predictive model:

- **Cluster the residuals** -- where does the model fail?
- **Cluster the inputs and report metrics per cluster** -- where does it succeed?

Surfaces **heterogeneity**: maybe your model is great for one segment and terrible for another. A small but genuine contribution to a project's analysis section.

---

## Things to avoid

- Don't **rename clusters with confident stories.** "Cluster 1 is the loyal customers" is a story you tell *after* you check it against domain knowledge.
- Don't **use clustering as the headline of the project.** L1 rule: you need a predictive target.
- Don't **trust silhouette scores in isolation.** A cluster with high silhouette and no domain meaning is suspicious.

---

## The setup

- **Dataset:** scikit-learn's `load_wine()` -- used in L7 (model selection). 178 wines, 13 chemical features, 3 classes.
- For **clustering**, we ignore the class labels during fitting. We'll use them at the end to **honestly evaluate** how well clustering recovered the true structure.

**Goals:**

1. Cluster the wines
2. Check sensitivity
3. Use the cluster label as a feature for logistic regression

```python
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
```

---

## Step 1: scaled k-means with an elbow check

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertias vs k - look for the elbow
```

The elbow on this dataset sits around $k = 3$ -- consistent with the (hidden) class structure.

---

## Step 2: silhouette and sensitivity

```python
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, km.labels_):.3f}")

# Sensitivity to seed -- deliberately disable the n_init protection
for s in [0, 1, 42, 100, 999]:
    km_s = KMeans(n_clusters=3, n_init=1, random_state=s).fit(X_scaled)
    print(f"  seed {s:3d}: silhouette {silhouette_score(X_scaled, km_s.labels_):.3f}")
```

With `n_init=10`: results stable across seeds. With `n_init=1`: results can scatter from seed to seed.

---

## Step 3: compare to the (hidden) ground truth

```python
from sklearn.metrics import adjusted_rand_score

print(f"ARI vs true classes: {adjusted_rand_score(y, km.labels_):.3f}")
```

On wine, k-means with the right $k$ recovers **most** of the class structure (ARI typically ~0.9). On other datasets, the gap would be wider.

> "The silhouette is high" is **not** the same as "the clustering matches reality."

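
One way to see the gap between internal and external evaluation is to print both scores for several values of $k$. The sketch below reloads wine so it runs on its own. The `_demo` variable names and the range of $k$ are illustrative choices, and the class labels are used only for scoring, never for fitting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Reload and scale the data so the example is self-contained
X_demo, y_demo = load_wine(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

rows = []
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    sil = silhouette_score(X_demo_scaled, labels_k)  # internal: uses features only
    ari = adjusted_rand_score(y_demo, labels_k)      # external: uses the held-back classes
    rows.append((k, sil, ari))
    print(f"k={k}: silhouette {sil:.3f} | ARI vs classes {ari:.3f}")
```

If the $k$ with the best silhouette is not the $k$ with the best ARI, that illustrates the point of this step: internal tightness and agreement with known classes are different questions.
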

---

## Step 4: cluster labels as features for prediction

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Baseline: logistic regression on scaled features
base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
print(f"Baseline CV: {cross_val_score(base, X, y, cv=5).mean():.3f}")

# With cluster-distance features: wrap KMeans + LR in a pipeline
# so KMeans is refit on each fold (no leakage)
```

Adding cluster-distance features **rarely helps on wine** (the original features are already informative). On messier datasets, it can add a small boost.

---

## Summary

- **k-means** partitions data by minimising within-cluster squared distance. Scale first; set `n_init >= 10`; pick $k$ via elbow + silhouette + domain knowledge.
- **Hierarchical clustering** builds a dendrogram of nested merges. Ward linkage is the safe default.
- **No ground truth.** Internal metrics describe *structure*, not truth. Always run sensitivity checks.
- **Use clustering inside a predictive project** as features or segmentation -- not as the headline.

---

## Before Lecture 14

- Try clustering on your project's training data. Does it surface meaningful groups? If so, consider adding cluster features to your downstream model.
- This was the **last new method** in the course. From here, the focus is on doing what you have well.
- Read ahead: **Lecture 14 is neural networks** -- a conceptual bridge, not a new tool for the project.

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## DBSCAN: density-based clustering

Finds clusters of **arbitrary shape**; doesn't require specifying $k$; flags outliers as **noise**.
```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_  # -1 means noise
```

Two knobs:

- **`eps`** -- neighbourhood radius
- **`min_samples`** -- minimum points to form a dense region

The right tool when k-means' spherical assumption clearly fails (e.g., concentric rings).

--

## Gaussian mixture models (GMM)

**Soft** clustering -- each point gets a **probability** of belonging to each cluster.

```python
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, random_state=42)
gm.fit(X_scaled)
probs = gm.predict_proba(X_scaled)
```

Useful when clusters genuinely **overlap**. GMM models each cluster as a Gaussian -- including covariance, so clusters can be elliptical, not just spherical.

--

## MiniBatchKMeans for very large data

Stochastic k-means: updates centroids using small random batches instead of the full dataset.

```python
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=10, random_state=0)
km.fit(X_scaled)
```

- Same objective, **much** faster on large datasets
- Slightly noisier final centroids
- Use when $n$ is hundreds of thousands or millions

--

## Distance metrics beyond Euclidean

When Euclidean is wrong:

| Data type | Better metric |
|-----------|--------------|
| Binary features | Hamming, Jaccard |
| Text (TF-IDF) | Cosine |
| Mixed numeric + categorical | Gower |

```python
from sklearn.cluster import AgglomerativeClustering

# `metric` was called `affinity` before scikit-learn 1.2
hc = AgglomerativeClustering(
    n_clusters=3,
    metric='cosine',
    linkage='average',
)
```

--

## Cluster visualisation via PCA

The standard "did the clustering find structure?" plot:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, cmap='tab10')
```

Project clusters to 2D via PCA (L9) and colour by cluster label.
**Honest about lossiness:** clusters that look overlapping in 2D may be cleanly separated in higher dimensions -- same caveat as the L9 PCA visualisation.

---

## What's next

**Lecture 14:** Neural networks

- From linear models to neural networks
- The conceptual bridge from classical ML
- When to use deep learning vs classical methods