Unsupervised Learning and PCA

## MST0052 -- Lecture 9

### Unsupervised Learning and PCA

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- What changes when there is **no target**
- PCA: the **intuition** and the **objective**
- PCA in **scikit-learn** -- pipeline, scaling, `n_components`
- Reading PCA output without **over-interpreting**
- Worked example: PCA on the **digits** dataset

---

## The pivot: no labels

So far: $(X, y)$ -- features **and** a target. Training meant minimising prediction error against $y$.

Today: $X$ **only**. No $y$ to chase, no obvious loss function, no "test accuracy."

The question changes:

> from *"how wrong is my prediction?"* to *"what structure is in this data?"*

---

## Two main goals

- **Dimensionality reduction** -- compress many variables into a smaller set that retains most of the information
- **Clustering** -- group similar observations together without being told what "similar" means

Today: **dimensionality reduction (PCA)**. L13: clustering.

---

## What does "structure" mean without labels?

Three examples a project might encounter:

- **200 survey questions**, but maybe only 5 underlying "attitudes"
- **Customer behaviour logs** with no segments defined
- **High-dimensional images** where most pixels are correlated

Common thread: the data has **fewer effective degrees of freedom** than its raw dimension suggests.

---

## Where unsupervised methods fit in your project

- Per the L1 rules: unsupervised methods are allowed as **support**, not as the headline
- The **headline is still a predictive target**
- PCA can be a **preprocessing step** before your regression or classifier
- Clustering can produce **features** for a downstream model

A "pure clustering study" with no predictive target is **not** a valid project.

---

## The motivating problem

Suppose you have **30 features** and many are correlated (height and weight, income and rent, length and width).

You don't need 30 numbers to describe each observation -- most carry redundant information.

> **Question:** what is the smallest set of *new* features that retains "most" of what the original 30 carry?

---

## A 2D example: the geometric picture

![Original axes vs the first principal component direction](/figures/pca-2d-rotation.svg)

The data clearly *wants* to be one-dimensional -- most variation is along the diagonal, with a small amount of spread perpendicular to it.

**PCA finds that diagonal direction** and calls it the **first principal component**.

---

## The first principal component

**PC1** = the direction in feature space along which the data has **maximum variance**.

- It is a **unit vector** -- a combination of the original features, not one of them
- Project every observation onto this direction → one number per observation
- That number is the observation's **"PC1 score"**
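
A minimal sketch of the projection step, assuming a small synthetic 2D dataset (the data and variable names are invented for illustration). It checks that projecting the centred observations onto the PC1 unit vector gives the same scores scikit-learn returns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: two features that move together along a diagonal
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.2 * rng.normal(size=200)])

pca = PCA(n_components=1).fit(X)
pc1 = pca.components_[0]            # the PC1 direction (a unit vector)

# PC1 score = centred observation projected onto the PC1 direction
scores_manual = (X - X.mean(axis=0)) @ pc1
scores_sklearn = pca.transform(X)[:, 0]

print(np.linalg.norm(pc1))                          # length of the direction vector
print(np.allclose(scores_manual, scores_sklearn))   # do the two routes agree?
```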

---

## Subsequent components

- **PC2** = next-largest variance, **orthogonal** to PC1
- **PC3** = next, orthogonal to PC1 and PC2
- ...and so on

With $p$ features → at most $p$ principal components in total -- but typically keep only the first $k \ll p$.

---

## What PCA is optimising

For each component, PCA maximises:

$$\text{Var}(\text{projection of data onto the component})$$

subject to: the component is a **unit vector**, and **orthogonal** to all previous components.

Mathematically: PCA does an **eigendecomposition** of the covariance matrix of $X$. The components are eigenvectors; their variances are eigenvalues.
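
The eigendecomposition view can be checked numerically. The sketch below uses the iris data only because it ships with scikit-learn (an assumption for illustration, not part of the lecture's example) and compares a hand-rolled eigendecomposition of the covariance matrix with what `PCA` reports.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)                      # centre each feature

cov = np.cov(Xc, rowvar=False)               # (p, p) sample covariance, ddof=1
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)

# Eigenvalues are the component variances
print(np.allclose(eigvals, pca.explained_variance_))
# Eigenvectors are the components, up to an arbitrary sign flip
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_), atol=1e-6))
# Components are orthonormal
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(4)))
```

In practice scikit-learn computes the same quantities through an SVD rather than by forming the covariance matrix explicitly.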

---

## What PCA is, and isn't

| PCA is | PCA isn't |
|--------|-----------|
| A rotation of the feature space | A selection of "the best" original features |
| A way to compress correlated features | A way to discover causal structure |
| An unsupervised method (no $y$ used) | Optimised for any prediction task |
| A linear method | Capable of capturing curved structure |

---

## The PCA workflow

1. **Centre** each feature (mean = 0). PCA does this for you.
2. **Scale** if features are in different units. **You** do this with `StandardScaler`.
3. **Fit** PCA on the (centred, scaled) training data.
4. **Transform** to get component scores.
5. **Optionally** use the scores as features for a downstream model.
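
One way the workflow might look end to end, as a sketch on the digits data. The choice of 10 components and the names `Z_train` and `Z_test` are arbitrary assumptions. The point is that held-out rows are transformed with the means, scales, and components learned from the training rows only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

pipe = make_pipeline(StandardScaler(), PCA(n_components=10))

Z_train = pipe.fit_transform(X_train)   # centre, scale, fit, transform on training rows
Z_test = pipe.transform(X_test)         # reuse the fitted steps; nothing is refit here

print(Z_train.shape, Z_test.shape)
```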

---

## Scaling is non-negotiable

PCA maximises **variance**. Variance is **unit-sensitive**.

- A feature in **millimetres** has 1,000× the standard deviation, and therefore 1,000,000× the variance, of the same feature in **metres** -- same information
- Without scaling, **PC1 is dominated by whatever has the largest raw spread**, regardless of importance

> Always run `StandardScaler` before PCA -- unless every feature is genuinely on the same scale.
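
A small synthetic illustration of the unit problem (all numbers are made up). One feature is a length recorded in millimetres, the other is an unrelated feature with unit variance. Without scaling, PC1 is essentially the millimetre axis; after scaling, both features get equal weight.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
length_m = rng.normal(loc=1.7, scale=0.1, size=500)   # hypothetical length in metres
other = rng.normal(loc=0.0, scale=1.0, size=500)      # unrelated feature
length_mm = 1000 * length_m                           # same length, in millimetres

# Changing units multiplies the variance by the square of the conversion factor
ratio = np.var(length_mm) / np.var(length_m)
print(ratio)

X_raw = np.column_stack([length_mm, other])
pc1_raw = PCA(n_components=1).fit(X_raw).components_[0]

X_std = StandardScaler().fit_transform(X_raw)
pc1_scaled = PCA(n_components=1).fit(X_std).components_[0]

print(np.abs(pc1_raw))      # weight sits almost entirely on the millimetre feature
print(np.abs(pc1_scaled))   # equal weights once both features have unit variance
```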

---

## PCA in code, inside a pipeline

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe.fit_transform(X_train)
```

- `X_pca` has shape `(n_samples, 2)` -- every observation reduced to two numbers
- PCA inside a `Pipeline` → refit on each CV fold → **no leakage** (L3 rule)

---

## `explained_variance_ratio_` and `cumsum`

```python
import numpy as np
pca = pipe.named_steps['pca']
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```

- **`explained_variance_ratio_`** -- fraction of total variance per component
- **`cumsum`** -- "how much variance do the first $k$ components retain together?"

Common targets: 80%, 90%, 95% -- but the right answer depends on the downstream use.

---

## Choosing `n_components`: three rules

| Rule | When to use |
|------|-------------|
| Fixed integer (e.g., 2 or 3) | You're visualising; human eyes are the constraint |
| Cumulative variance target (e.g., 0.9) | PCA as preprocessing for a downstream model |
| **Cross-validate `n_components`** | The downstream model's CV score is what you care about |

Shortcut: `PCA(n_components=0.9)` keeps just enough components to reach 90%.
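
A quick check of the shortcut on the digits data, as a sketch: counting components by hand from the cumulative variance should agree with what `PCA(n_components=0.9)` keeps. The names `k_manual` and `pca_90` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# By hand: first k at which the cumulative ratio reaches 0.9
cum = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
k_manual = int(np.searchsorted(cum, 0.9) + 1)

# Shortcut: a float in (0, 1) is read as a variance target
pca_90 = PCA(n_components=0.9).fit(X_std)

print(k_manual, pca_90.n_components_)
print(pca_90.explained_variance_ratio_.sum())
```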

---

## The scree plot

![Scree plot for the digits dataset](/figures/scree-plot.svg)

- $x$: component number. $y$: explained variance ratio.
- Look for the **elbow** -- where adding a component buys little extra variance
- A clear elbow at $k = 3$ → most of the variance lives in about 3 dimensions
- A **flat** scree → no obvious low-dimensional structure; PCA may not be the right tool

---

## Cumulative variance

![Cumulative variance for the digits dataset](/figures/cumulative-variance.svg)

Same information as the scree plot, cumulative version.

A horizontal line at 0.9 makes "how many components for 90%?" a **visual question**.
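
Both plots come straight from `explained_variance_ratio_`. The following is one possible way to draw them, assuming matplotlib is installed; the figure size and styling are arbitrary choices, not the settings used for the slide figures.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

ratios = PCA().fit(X_std).explained_variance_ratio_
cum = np.cumsum(ratios)
ks = np.arange(1, len(ratios) + 1)

fig, (ax_scree, ax_cum) = plt.subplots(1, 2, figsize=(10, 4))

# Scree plot: variance ratio per component
ax_scree.plot(ks, ratios, marker="o")
ax_scree.set(xlabel="Component", ylabel="Explained variance ratio", title="Scree plot")

# Cumulative version with a reference line at 90%
ax_cum.plot(ks, cum, marker="o")
ax_cum.axhline(0.9, linestyle="--", color="grey")
ax_cum.set(xlabel="Number of components", ylabel="Cumulative variance", title="Cumulative variance")

fig.tight_layout()
```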

---

## Loadings: what do the components mean?

The **loading** of feature $j$ on component $k$ = the coefficient of $j$ in the linear combination defining PC$k$.

```python
print(pca.components_.shape)  # (n_components, n_features)
print(pca.components_[0])      # PC1's loadings
```

Reading PC1's largest loadings → which original features dominate the direction.

Example: large loadings on `bill_length`, `bill_depth`, `flipper_length` → **informally** call PC1 a "size" axis.
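
A sketch of how loadings could be read alongside feature names. The iris data stands in here only because it ships with scikit-learn and has named features; it is not the dataset behind the example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(data.data)
pc1 = pipe.named_steps["pca"].components_[0]   # one loading per original feature

# List features from largest to smallest absolute loading on PC1
order = np.argsort(np.abs(pc1))[::-1]
for j in order:
    print(f"{data.feature_names[j]:<20} {pc1[j]:+.3f}")
```

The sign of a whole component is arbitrary, so compare magnitudes and relative signs rather than reading meaning into a positive or negative loading on its own.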

---

## PCA as visualisation

![Digits projected onto the first two principal components](/figures/digits-pca-2d.svg)

The most common everyday use: project high-dimensional data to 2 (or 3) PCs and **plot it**.

Useful -- *if* you remember the projection is **lossy**. Clusters that overlap in 2D may still be separable in higher PCs.

---

## Reconstructing data from components

PCA can be **inverted**: `pipe.inverse_transform(X_pca)` maps the scores back to the original feature space.

- With **all** components → reconstruction is exact (modulo float precision)
- With **fewer** components → reconstruction is **lossy**; the dropped components carry the lost detail

Useful diagnostic: reconstruction error per row → flag rows that are hardest to reconstruct (outliers, unusual cases).
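
A sketch of the reconstruction diagnostic on the digits data. Keeping 10 components and using mean squared error per row are assumptions chosen for illustration; other error measures would work too.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)

# All components: the round trip recovers the original data
full = make_pipeline(StandardScaler(), PCA()).fit(X)
X_back_full = full.inverse_transform(full.transform(X))
print(np.allclose(X_back_full, X))

# Fewer components: the round trip loses detail
reduced = make_pipeline(StandardScaler(), PCA(n_components=10)).fit(X)
X_back = reduced.inverse_transform(reduced.transform(X))

# Per-row reconstruction error, largest first
row_error = np.mean((X - X_back) ** 2, axis=1)
hardest = np.argsort(row_error)[::-1][:5]
print(hardest)   # indices of the five rows that are hardest to reconstruct
```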

---

## PCA is linear

PCA finds **linear** combinations.

If the structure is nonlinear (curves, manifolds, concentric rings) PCA will **distort or hide it**.

For nonlinear structure:

- **Kernel PCA** -- PCA in a transformed feature space
- **t-SNE** or **UMAP** -- visualisation only; don't use as features for a headline model

---

## Components are not interpretable by default

A component is a **weighted sum of all features**.

Unless loadings are sparse, naming PC1 "size" or "wealth" is a *story you impose*, not something PCA discovered.

- **Strong loadings on a few features** → naming is more defensible
- **Loadings spread thinly** → naming is wishful thinking. Just call it PC1.

---

## Variance ≠ predictive signal

PCA maximises **variance**, not **prediction**.

It is entirely possible for the **largest-variance direction** to be **uninformative** about the target -- and for **low-variance directions** to carry the predictive signal.

> Use PCA when you suspect **redundancy in $X$**. Don't use it as a free-lunch upgrade for prediction.

---

## When PCA is the wrong choice

- **Few features.** With $p = 5$, you don't have a dimensionality problem.
- **Tree-based downstream model.** Trees don't suffer from correlated inputs.
- **The goal is interpretability of original features.** PCA scrambles them.
- **Sparse data** (e.g., TF-IDF) → use `TruncatedSVD` instead.

---

## The digits dataset

- scikit-learn's `load_digits()` -- 1,797 hand-written digits
- Each digit: **8×8 = 64 pixel features**, labelled 0-9
- 10-class classification (today we use labels only for **colouring**, not fitting)
- 64 dimensions is enough to motivate PCA, small enough to iterate fast

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape)  # (1437, 64)
```

---

## Step 1: fit PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

pca_pipe = make_pipeline(StandardScaler(), PCA())
pca_pipe.fit(X_train)

pca = pca_pipe.named_steps['pca']
cum = np.cumsum(pca.explained_variance_ratio_)
print(f"PC1+PC2 capture {cum[1]:.2%} of variance")
print(f"First {np.searchsorted(cum, 0.9) + 1} PCs capture 90%")
```

For digits, about 30 PCs retain 90% of the variance -- roughly a 2× reduction from 64.

---

## Step 2: 2D projection

```python
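import matplotlib.pyplot as plt

# Scale, then keep only the first two principal components
pca_2d = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca_2d.fit_transform(X_train)

# Colour by the true digit label; the labels were not used to fit PCA
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='tab10', s=15)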
```

![Digits projected to PC1-PC2](/figures/digits-pca-2d.svg)

- Some digits clearly cluster (0, 6); others overlap heavily (1, 7, 9)
- Labels were **not used** by PCA -- yet structure shows up

---

## Step 3: how many components to keep?

![Scree plot for the digits dataset](/figures/scree-plot.svg)

- No sharp elbow, but a gentle bend around PC 8-10
- Cumulative variance hits 80% around PC 21, 90% around PC 30
- For a **downstream classifier**: the right answer is whichever `n_components` gives the best **CV score** -- not what the plot suggests

---

## Step 4: PCA as preprocessing

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(),
                     PCA(),
                     LogisticRegression(max_iter=5000))

grid = {'pca__n_components': [2, 5, 10, 20, 30, 50, 64]}

search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best n_components: {search.best_params_}")
print(f"Best CV accuracy:  {search.best_score_:.3f}")
print(f"Test accuracy:     {search.score(X_test, y_test):.3f}")
```

CV picks `n_components`, exactly as we'd pick any other hyperparameter (L7).

---

## What this experiment shows

- For digits, PCA reduces the input from 64 to a smaller set with **little loss in accuracy**
- The 2D projection shows real structure, but the **classifier needs more than two PCs**
- The selection rule was: **let CV pick `n_components`** -- same rule as every other hyperparameter

---

## Summary

- Unsupervised learning answers a different question: **what is the structure of $X$?**
- PCA finds the **linear** directions of maximum variance
- **Always scale** before PCA. **Always** put it inside a pipeline
- **Choose `n_components` deliberately** -- by CV, by a variance target, or by visualisation needs
- **Don't over-interpret components.** Variance is not predictive signal.

---

## Before Lecture 10

- **Run** the digits PCA workflow on your own machine
- For your project: if your dataset has many correlated features, **try PCA as preprocessing** and **CV the choice of `n_components`**
- Read ahead: **Lecture 10 is ensembles** -- decision trees, bagging, random forests. The bias-variance tradeoff returns.

---

## Questions

Open the floor.

---

## Backup slides

Use if questions arise or time allows.

---

## t-SNE and UMAP

Nonlinear visualisation techniques.

- **t-SNE** -- preserves local neighbourhood structure
- **UMAP** -- faster, also preserves more global structure

Both are **visualisation-only** in this course. Do not use the output as features for a headline model.

```python
from sklearn.manifold import TSNE
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
```

Useful for showing class separability in nonlinear datasets.

---

## Kernel PCA

The kernel trick (from L11 SVMs) applied to PCA -- captures **nonlinear** structure.

```python
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X_train)
```

Cost: compute and memory (the kernel matrix is $n \times n$), interpretability (no `explained_variance_ratio_`), tuning (`gamma`).

Worth it when ordinary PCA can't separate your structure.

---

## PCA on sparse data with `TruncatedSVD`

Plain PCA centres each feature -- breaks sparsity.

For sparse matrices (TF-IDF, count vectors):

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50)
X_svd = svd.fit_transform(X_sparse)
```

Same underlying math, no centring. Mandatory for text-classification projects with TF-IDF features.
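
A self-contained sketch of where a sparse matrix might come from. The six-document corpus is invented for illustration, and two components are used only because the corpus is tiny.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "the market rallied after the report",
]

X_sparse = TfidfVectorizer().fit_transform(docs)   # SciPy sparse matrix
print(sparse.issparse(X_sparse), X_sparse.shape)

# TruncatedSVD works on the sparse matrix directly, without centring it
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_sparse)
print(X_svd.shape)                                 # dense, one row per document
```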

---

## PCA whitening

```python
PCA(n_components=10, whiten=True)
```

Scales each component to **unit variance** after projection.

- Useful when the downstream model assumes isotropic input
- Equivalent to a rotation + per-component scaling
- Costs information about which component is "most important"
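
A sketch of the effect on the digits data, with 10 components as an arbitrary choice: without whitening the score variances decrease from PC1 onwards, and with whitening they are all rescaled to one.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

Z_plain = PCA(n_components=10, random_state=0).fit_transform(X_std)
Z_white = PCA(n_components=10, whiten=True, random_state=0).fit_transform(X_std)

var_plain = Z_plain.var(axis=0, ddof=1)   # sample variance of each score column
var_white = Z_white.var(axis=0, ddof=1)

print(var_plain.round(2))   # largest for PC1, shrinking afterwards
print(var_white.round(2))   # all equal to one after whitening
```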

Mention only if a student asks.

---

## When low-variance PCs carry the signal

A constructed example: 30 features, target depends on PC$_{28}$.

- PC1-PC10 capture 95% of the variance in $X$
- All the predictive signal lives in PC$_{28}$
- Dropping low-variance PCs **destroys the signal**

This is the strongest argument for **CV-driven `n_components`** -- variance is not predictive signal, and you shouldn't have to guess.
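
A scaled-down version of this idea, as a sketch with 5 synthetic features instead of 30 (all numbers are invented). The features are independent, so the principal components roughly line up with the original axes, and the target depends only on the lowest-variance one. Scaling is skipped on purpose because the features share a unit by construction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 2000
stds = np.array([10.0, 8.0, 6.0, 4.0, 0.5])   # the last feature has by far the least variance
X = rng.normal(size=(n, 5)) * stds
y = (X[:, 4] > 0).astype(int)                 # the target depends only on that last feature

def cv_accuracy(k):
    """Mean 5-fold CV accuracy of PCA(k) followed by logistic regression."""
    model = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

acc_top4 = cv_accuracy(4)   # drops the lowest-variance direction
acc_all5 = cv_accuracy(5)   # keeps every direction

print(f"top 4 PCs: {acc_top4:.3f}")
print(f"all 5 PCs: {acc_all5:.3f}")
```

With four components the classifier should sit near chance, while keeping the fifth restores the signal, which a variance-based cut-off would never reveal.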

---

## What's next

**Lecture 10:** Ensemble methods

- Decision trees
- Bagging and random forests
- Why combining models works