MST0052
## MST0052 -- Lecture 9

### Unsupervised Learning and PCA

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- What changes when there is **no target**
- PCA: the **intuition** and the **objective**
- PCA in **scikit-learn** -- pipeline, scaling, `n_components`
- Reading PCA output without **over-interpreting**
- Worked example: PCA on the **digits** dataset

---

## The pivot: no labels

So far: $(X, y)$ -- features **and** a target. Training meant minimising prediction error against $y$.

Today: $X$ **only**. No $y$ to chase, no obvious loss function, no "test accuracy."

The question changes:

> from *"how wrong is my prediction?"* to *"what structure is in this data?"*

---

## Two main goals

- **Dimensionality reduction** -- compress many variables into a smaller set that captures most of the same information
- **Clustering** -- group similar observations together without being told what "similar" means

Today: **dimensionality reduction (PCA)**. L13: clustering.

---

## What does "structure" mean without labels?

Three examples a project might encounter:

- **200 survey questions**, but maybe only 5 underlying "attitudes"
- **Customer behaviour logs** with no segments defined
- **High-dimensional images** where most pixels are correlated

Common thread: the data has **fewer effective degrees of freedom** than its raw dimension suggests.

---

## Where unsupervised methods fit in your project

- Per the L1 rules: unsupervised methods are allowed as **support**, not as the headline
- The **headline is still a predictive target**
- PCA can be a **preprocessing step** before your regression or classifier
- Clustering can produce **features** for a downstream model

A "pure clustering study" with no predictive target is **not** a valid project.
---

## The motivating problem

Suppose you have **30 features** and many are correlated (height and weight, income and rent, length and width).

You don't need 30 numbers to describe each observation -- most carry redundant information.

> **Question:** what is the smallest set of *new* features that retains "most" of what the original 30 carry?

---

## A 2D example: the geometric picture

The data clearly *wants* to be one-dimensional -- most variation is along the diagonal, with a small amount of spread perpendicular to it.

**PCA finds that diagonal direction** and calls it the **first principal component**.

---

## The first principal component

**PC1** = the direction in feature space along which the data has **maximum variance**.

- It is a **unit vector** -- a combination of the original features, not one of them
- Project every observation onto this direction → one number per observation
- That number is the observation's **"PC1 score"**

---

## Subsequent components

- **PC2** = next-largest variance, **orthogonal** to PC1
- **PC3** = next, orthogonal to PC1 and PC2
- ...and so on

With $p$ features → $p$ principal components in total -- but typically keep only the first $k \ll p$.

---

## What PCA is optimising

For each component, PCA maximises:

$$\text{Var}(\text{projection of data onto the component})$$

subject to: the component is a **unit vector**, and **orthogonal** to all previous components.

Mathematically: PCA does an **eigendecomposition** of the covariance matrix of $X$. The components are eigenvectors; their variances are eigenvalues.

---

## What PCA is, and isn't

| PCA is | PCA isn't |
|--------|-----------|
| A rotation of the feature space | A selection of "the best" original features |
| A way to compress correlated features | A way to discover causal structure |
| An unsupervised method (no $y$ used) | Optimised for any prediction task |
| A linear method | Capable of capturing curved structure |

---

## The PCA workflow

1. **Centre** each feature (mean = 0). PCA does this for you.
2. **Scale** if features are on different units. **You** do this with `StandardScaler`.
3. **Fit** PCA on the (centred, scaled) training data.
4. **Transform** to get component scores.
5. **Optionally** use the scores as features for a downstream model.

---

## Scaling is non-negotiable

PCA maximises **variance**. Variance is **unit-sensitive**.

- A feature in **millimetres** has 1,000,000× the variance (1,000× the standard deviation) of the same feature in **metres** -- same information
- Without scaling, **PC1 is dominated by whatever has the largest raw spread**, regardless of importance

> Always run `StandardScaler` before PCA -- unless every feature is genuinely on the same scale.

---

## PCA in code, inside a pipeline

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe.fit_transform(X_train)
```

- `X_pca` has shape `(n_samples, 2)` -- every observation reduced to two numbers
- PCA inside a `Pipeline` → refit on each CV fold → **no leakage** (L3 rule)

---

## `explained_variance_ratio_` and `cumsum`

```python
import numpy as np

pca = pipe.named_steps['pca']
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```

- **`explained_variance_ratio_`** -- fraction of total variance per component
- **`cumsum`** -- "how much variance do the first $k$ components retain together?"

Common targets: 80%, 90%, 95% -- but the right answer depends on the downstream use.
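
---

## Sketch: what scaling changes

A minimal sketch of the scaling point, assuming synthetic height/weight data invented for illustration (the variable names, distributions, and the helper `scaled_ratio` are hypothetical, not from a course dataset). It compares PC1 with height recorded in metres versus millimetres, then checks whether `StandardScaler` makes `explained_variance_ratio_` independent of the unit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 500
height_m = rng.normal(1.70, 0.10, size=n)                        # metres
weight_kg = 70 + 50 * (height_m - 1.70) + rng.normal(0, 5, size=n)

X_m = np.column_stack([height_m, weight_kg])                     # height in m
X_mm = np.column_stack([height_m * 1000, weight_kg])             # height in mm

# Same quantity, different unit: the variance grows by a factor of 1000**2.
var_ratio = X_mm[:, 0].var() / X_m[:, 0].var()
print(f"variance ratio mm/m: {var_ratio:.0f}")

# Unscaled PCA: PC1 follows whichever column has the larger raw variance.
pc1_m = PCA(n_components=1).fit(X_m).components_[0]
pc1_mm = PCA(n_components=1).fit(X_mm).components_[0]
print("PC1 loadings, height in m: ", np.abs(pc1_m).round(3))     # weight dominates
print("PC1 loadings, height in mm:", np.abs(pc1_mm).round(3))    # height dominates


def scaled_ratio(X):
    """Explained-variance ratios after StandardScaler + PCA."""
    pipe = make_pipeline(StandardScaler(), PCA())
    pipe.fit(X)
    return pipe.named_steps['pca'].explained_variance_ratio_


# Scaled PCA: the unit choice no longer matters.
print("scaled, m: ", scaled_ratio(X_m).round(3))
print("scaled, mm:", scaled_ratio(X_mm).round(3))
```

Without scaling, PC1 simply tracks whichever column has the larger raw spread; with scaling, both unit choices give the same explained-variance ratios.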
---

## Choosing `n_components`: three rules

| Rule | When to use |
|------|-------------|
| Fixed integer (e.g., 2 or 3) | You're visualising; human eyes are the constraint |
| Cumulative variance target (e.g., 0.9) | PCA as preprocessing for a downstream model |
| **Cross-validate `n_components`** | The downstream model's CV score is what you care about |

Shortcut: `PCA(n_components=0.9)` keeps just enough components to reach 90%.

---

## The scree plot

- $x$: component number. $y$: explained variance ratio.
- Look for the **elbow** -- where adding a component buys little extra variance
- A clear elbow at $k = 3$ → "the data really does live in 3 dimensions"
- A **flat** scree → no obvious low-dimensional structure; PCA may not be the right tool

---

## Cumulative variance

Same information as the scree plot, cumulative version.

A horizontal line at 0.9 makes "how many components for 90%?" a **visual question**.

---

## Loadings: what do the components mean?

The **loading** of feature $j$ on component $k$ = the coefficient of $j$ in the linear combination defining PC$k$.

```python
print(pca.components_.shape)   # (n_components, n_features)
print(pca.components_[0])      # PC1's loadings
```

Reading PC1's largest loadings → which original features dominate the direction.

Example: large loadings on `bill_length`, `bill_depth`, `flipper_length` → **informally** call PC1 a "size" axis.

---

## PCA as visualisation

The most common everyday use: project high-dimensional data to 2 (or 3) PCs and **plot it**.

Useful -- *if* you remember the projection is **lossy**. Clusters that overlap in 2D may still be separable in higher PCs.

---

## Reconstructing data from components

PCA is **invertible**: `pipe.inverse_transform(X_pca)` puts you back in original feature space.
- With **all** components → reconstruction is exact (modulo float precision)
- With **fewer** components → reconstruction is **lossy**; the dropped components carry the lost detail

Useful diagnostic: reconstruction error per row → flag rows that are hardest to reconstruct (outliers, unusual cases).

---

## PCA is linear

PCA finds **linear** combinations. If the structure is nonlinear (curves, manifolds, concentric rings) PCA will **distort or hide it**.

For nonlinear structure:

- **Kernel PCA** -- PCA in a transformed feature space
- **t-SNE** or **UMAP** -- visualisation only; don't use as features for a headline model

---

## Components are not interpretable by default

A component is a **weighted sum of all features**. Unless loadings are sparse, naming PC1 "size" or "wealth" is a *story you impose*, not something PCA discovered.

- **Strong loadings on a few features** → naming is more defensible
- **Loadings spread thinly** → naming is wishful thinking. Just call it PC1.

---

## Variance ≠ predictive signal

PCA maximises **variance**, not **prediction**.

It is entirely possible for the **largest-variance direction** to be **uninformative** about the target -- and for **low-variance directions** to carry the predictive signal.

> Use PCA when you suspect **redundancy in $X$**. Don't use it as a free-lunch upgrade for prediction.

---

## When PCA is the wrong choice

- **Few features.** With $p = 5$, you don't have a dimensionality problem.
- **Tree-based downstream model.** Trees don't suffer from correlated inputs.
- **The goal is interpretability of original features.** PCA scrambles them.
- **Sparse data** (e.g., TF-IDF) → use `TruncatedSVD` instead.
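
---

## Sketch: reconstruction error per row

A rough sketch of the reconstruction-error diagnostic, run on scikit-learn's `load_digits()` (the dataset used in the worked example). The choice of 20 components and the names `pipe_k`, `row_error`, and `worst` are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # labels are not needed here

# k = 20 is an arbitrary illustrative choice.
pipe_k = make_pipeline(StandardScaler(), PCA(n_components=20))
X_scores = pipe_k.fit_transform(X)
X_back = pipe_k.inverse_transform(X_scores)   # back in original pixel units

# Mean squared reconstruction error for each row.
row_error = np.mean((X - X_back) ** 2, axis=1)

# Indices of the five rows that the 20-component model reconstructs worst.
worst = np.argsort(row_error)[::-1][:5]
print("hardest rows:", worst)
print("their errors:", row_error[worst].round(2))

# Sanity check: keeping all components should reconstruct almost exactly.
pipe_full = make_pipeline(StandardScaler(), PCA())
X_full = pipe_full.inverse_transform(pipe_full.fit_transform(X))
max_abs_diff = np.abs(X - X_full).max()
print("max abs difference with all components:", max_abs_diff)
```

Rows with unusually large error are candidates for a closer look; whether they are genuine outliers or just rare but valid cases is a judgement call the diagnostic cannot make for you.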
---

## The digits dataset

- scikit-learn's `load_digits()` -- 1,797 hand-written digits
- Each digit: **8×8 = 64 pixel features**, labelled 0-9
- 10-class classification (today we use labels only for **colouring**, not fitting)
- 64 dimensions is enough to motivate PCA, small enough to iterate fast

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape)  # (1437, 64)
```

---

## Step 1: fit PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

pca_pipe = make_pipeline(StandardScaler(), PCA())
pca_pipe.fit(X_train)

pca = pca_pipe.named_steps['pca']
cum = np.cumsum(pca.explained_variance_ratio_)
print(f"PC1+PC2 capture {cum[1]:.2%} of variance")
print(f"First {np.searchsorted(cum, 0.9) + 1} PCs capture 90%")
```

For digits, ~30 PCs give 90% variance -- already a 2× reduction from 64.

---

## Step 2: 2D projection

```python
import matplotlib.pyplot as plt

pca_2d = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca_2d.fit_transform(X_train)

# Labels are used only to colour the points, not to fit PCA.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='tab10', s=15)
```

- Some digits clearly cluster (0, 6); others overlap heavily (1, 7, 9)
- Labels were **not used** by PCA -- yet structure shows up

---

## Step 3: how many components to keep?
- No sharp elbow, but a gentle bend around PC 8-10
- Cumulative variance hits 80% around PC 21, 90% around PC 30
- For a **downstream classifier**: the right answer is whichever `n_components` gives the best **CV score** -- not what the plot suggests

---

## Step 4: PCA as preprocessing

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), PCA(), LogisticRegression(max_iter=5000))
grid = {'pca__n_components': [2, 5, 10, 20, 30, 50, 64]}
search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best n_components: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

CV picks `n_components`, exactly as we'd pick any other hyperparameter (L7).

---

## What this experiment shows

- For digits, PCA reduces the input from 64 to a smaller set with **little loss in accuracy**
- The 2D projection shows real structure, but the **classifier needs more than two PCs**
- The selection rule was: **let CV pick `n_components`** -- same rule as every other hyperparameter

---

## Summary

- Unsupervised learning answers a different question: **what is the structure of $X$?**
- PCA finds the **linear** directions of maximum variance
- **Always scale** before PCA. **Always** put it inside a pipeline
- **Choose `n_components` deliberately** -- by CV, by a variance target, or by visualisation needs
- **Don't over-interpret components.** Variance is not predictive signal.

---

## Before Lecture 10

- **Run** the digits PCA workflow on your own machine
- For your project: if your dataset has many correlated features, **try PCA as preprocessing** and **CV the choice of `n_components`**
- Read ahead: **Lecture 10 is ensembles** -- decision trees, bagging, random forests. The bias-variance tradeoff returns.

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## t-SNE and UMAP

Nonlinear visualisation techniques.

- **t-SNE** -- preserves local neighbourhood structure
- **UMAP** -- faster, also preserves more global structure

Both are **visualisation-only** in this course. Do not use the output as features for a headline model.

```python
from sklearn.manifold import TSNE

X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
```

Useful for showing class separability in nonlinear datasets.

--

## Kernel PCA

The kernel trick (from L11 SVMs) applied to PCA -- captures **nonlinear** structure.

```python
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X_train)
```

Cost: compute (grows with $n^2$), interpretability (no `explained_variance_ratio_`), tuning (`gamma`). Worth it when ordinary PCA can't separate your structure.

--

## PCA on sparse data with `TruncatedSVD`

Plain PCA centres each feature -- breaks sparsity. For sparse matrices (TF-IDF, count vectors):

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50)
X_svd = svd.fit_transform(X_sparse)
```

Same underlying math, no centring. Mandatory for text-classification projects with TF-IDF features.

--

## PCA whitening

```python
PCA(n_components=10, whiten=True)
```

Scales each component to **unit variance** after projection.

- Useful when the downstream model assumes isotropic input
- Equivalent to a rotation + per-component scaling
- Costs information about which component is "most important"

Mention only if a student asks.

--

## When low-variance PCs carry the signal

A constructed example: 30 features, target depends on PC$_{28}$.
- PC1-PC10 capture 95% of the variance in $X$
- All the predictive signal lives in PC$_{28}$
- Dropping low-variance PCs **destroys the signal**

This is the strongest argument for **CV-driven `n_components`** -- variance is not predictive signal, and you shouldn't have to guess.

---

## What's next

**Lecture 10:** Ensemble methods

- Decision trees
- Bagging and random forests
- Why combining models works
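
---

## Backup sketch: signal in a low-variance direction

One way to build the kind of example described on the "low-variance PCs" backup slide. The specifics here (1,000 rows, geometrically shrinking spreads, the helper `cv_accuracy`) are assumptions chosen for illustration and will not reproduce the slide's figures exactly. Features are left unscaled on purpose: the toy assumes they already share a unit, and standardising would erase the variance ordering the construction relies on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 1000, 30

# Independent features with steadily shrinking spread, so the principal
# components line up (approximately) with the original axes.
scales = np.geomspace(10.0, 0.1, p)          # hypothetical spreads
X_toy = rng.normal(size=(n, p)) * scales

# The label depends only on the feature with the 28th-largest variance.
y_toy = (X_toy[:, 27] > 0).astype(int)

# How much variance do the first 10 components retain?
cum = np.cumsum(PCA().fit(X_toy).explained_variance_ratio_)
print(f"First 10 PCs retain {cum[9]:.1%} of the variance")


def cv_accuracy(k):
    """5-fold CV accuracy of PCA(k) followed by logistic regression."""
    model = make_pipeline(PCA(n_components=k),
                          LogisticRegression(max_iter=5000))
    return cross_val_score(model, X_toy, y_toy, cv=5,
                           scoring='accuracy').mean()


acc_top10 = cv_accuracy(10)    # signal direction has been dropped
acc_all = cv_accuracy(30)      # signal direction is kept
print(f"CV accuracy with 10 PCs: {acc_top10:.3f}")
print(f"CV accuracy with 30 PCs: {acc_all:.3f}")
```

The first 10 components retain most of the variance yet leave the classifier near chance, while keeping all 30 recovers the signal. A variance target alone would not have revealed this; cross-validating `n_components` would.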