MST0052
## MST0052 -- Lecture 11

### Support Vector Machines

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- The **maximum-margin** idea -- a geometric principle
- **Soft margin** and the `C` parameter
- The **kernel trick** -- nonlinear boundaries without nonlinear features
- SVMs in scikit-learn -- scaling, tuning, cost
- Worked example: linear vs RBF vs polynomial SVM, compared to L10's random forest

---

## A different design philosophy

- Random forests **average** many weak boundaries until something stable emerges
- SVMs **optimise** a single boundary by a clear geometric principle

Neither is universally better. Two genuinely different ways to think about classification.

---

## The decision boundary in 2D

All of these candidate boundaries perfectly separate the training data. Training accuracy alone says they are equally good.

> Which one is best?

---

## The maximum margin

The **margin** = the distance from the boundary to the nearest training points on either side.

The **maximum-margin** boundary is as far as possible from both classes.

$$\max \frac{2}{\|w\|} \quad \text{or equivalently} \quad \min \tfrac{1}{2}\|w\|^2$$

subject to $y_i(w^\top x_i + b) \geq 1$ for all $i$.

Intuition: a wide margin is harder to cross with noisy data → better generalisation.

---

## Support vectors

The boundary depends only on the training points that sit on the edge of the margin (or inside it, in the soft case).

Those points are the **support vectors**. Everything else could be deleted and the boundary wouldn't move.

A small set of "critical" training examples defines the entire model.

---

## Real data isn't perfectly separable

The hard-margin idea requires a straight line that separates the classes perfectly. Real datasets rarely cooperate.
If we insist on hard separation:

- The model becomes brittle
- Or has no solution at all

We need to allow **some** misclassifications without abandoning the maximum-margin idea.

---

## Slack variables

Add a slack $\xi_i \geq 0$ for each training point -- how badly does this point violate its margin?

$$\min \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$$

subject to $y_i(w^\top x_i + b) \geq 1 - \xi_i$, with $\xi_i \geq 0$.

- First term wants a **wide margin**
- Second term **penalises violations**

The tradeoff is governed by `C`.

---

## `C`: the bias-variance dial in SVMs

| `C` | Effect on margin | Effect on fit |
|-----|------------------|---------------|
| **Large** | Narrow margin, few violations | Fits training data closely → high variance, low bias |
| **Small** | Wide margin, many violations | Smoother → low variance, high bias |

This is L6's dial again, in SVM clothing. Tune with CV (L7).

---

## What `C` does to the boundary

- **Small `C`:** wide margin, several violations, smooth boundary
- **Large `C`:** narrow margin, few violations, boundary contorts to honour individual points

---

## When no linear boundary works

Some datasets cannot be separated by a straight line at any cost. Three options:

1. **Give up** and use a different model family
2. **Build nonlinear features** by hand and hope a linear boundary works there
3. **Let the SVM build them for you, implicitly** -- the kernel trick

---

## Explicit feature maps

For $x \in \mathbb{R}^2$, transform via:

$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$$

In this 3D space, concentric rings become linearly separable.

The cost:

- More dimensions, more compute
- You had to **guess** the right feature map

---

## The kernel trick in one sentence

> A **kernel** is a function $k(x, x') = \phi(x)^\top \phi(x')$ that gives you the inner product in some feature space *without* computing $\phi$ explicitly.

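
A quick numerical check of this identity, as a sketch: for the feature map $\phi$ from the previous slide, the matching kernel is the homogeneous degree-2 polynomial kernel $(x^\top x')^2$. The function names `phi` and `poly2_kernel` are illustrative, not part of any library.

```python
import numpy as np


def phi(x):
    """Explicit feature map from the previous slide: R^2 -> R^3."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])


def poly2_kernel(x, x_prime):
    """Homogeneous degree-2 polynomial kernel: (x . x')^2."""
    return float(np.dot(x, x_prime)) ** 2


rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=2), rng.normal(size=2)

explicit = float(phi(x) @ phi(x_prime))  # inner product after lifting to 3D
implicit = poly2_kernel(x, x_prime)      # same quantity, computed in 2D only

print(np.isclose(explicit, implicit))
```

The two numbers agree because expanding $(x_1 x_1' + x_2 x_2')^2$ gives exactly the three products that $\phi(x)^\top \phi(x')$ sums.
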
Why this matters: the SVM optimisation only ever needs inner products of training points -- never the points themselves in the lifted space.

> We can work in fantastic, even infinite-dimensional, feature spaces *for the cost of one kernel evaluation per pair of points.*

---

## Common kernels

| Kernel | $k(x, x')$ | What it buys |
|--------|------------|--------------|
| **Linear** | $x^\top x'$ | Straight boundary (like a regularised logistic regression) |
| **Polynomial** | $(\gamma\, x^\top x' + r)^d$ | Curved boundaries of degree $d$ |
| **RBF (Gaussian)** | $\exp(-\gamma \lVert x - x' \rVert^2)$ | Flexible, local boundaries |
| **Sigmoid** | $\tanh(\gamma\, x^\top x' + r)$ | Rarely used |

For your project: **linear** when $p \gg n$ or features are already meaningful; **RBF** for almost everything else; **polynomial** only with specific structure in mind.

---

## RBF and the `gamma` parameter

$$k(x, x') = \exp(-\gamma \|x - x'\|^2)$$

$\gamma$ controls **how local** the kernel is:

- **Small `gamma`** → far-away points still affect the kernel → smoother, more global boundaries
- **Large `gamma`** → only nearby points matter → wiggly, local boundaries (one bump per training point)

---

## `C` and `gamma` interact strongly

Tuning one without the other is a trap.

Standard practice: a **2D grid** over `C` and `gamma`, both on a **log scale**.

```python
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
}
```

A **heatmap of CV scores** over the grid is the most informative diagnostic you can produce.

---

## The SVM workflow

1. **Split** off the test set
2. **Scale** features (mandatory -- see next slide)
3. **Build** a pipeline: `StandardScaler` + `SVC`
4. **Tune** `C` and `gamma` with `GridSearchCV` on a log-spaced grid
5. **Evaluate** once on the test set

Same L7 pattern, new model.

---

## Why scaling is mandatory for SVMs

All non-linear kernels (and the regularised linear SVM) depend on **distances** or **inner products** of features.
A feature on the wrong scale **dominates** the kernel and the optimisation -- same problem we saw with k-NN, ridge, and PCA.

> `StandardScaler` before `SVC` is non-negotiable.

---

## `SVC` in code

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma='scale'),
)
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

- `gamma='scale'` (the default) sets $\gamma = 1 / (n_\text{features} \cdot \text{Var}(X))$ -- a sensible starting point

---

## Probabilities and `predict_proba`

By default, `SVC` does **not** return probabilities -- it returns a **decision function** (signed distance from the boundary).

```python
pipe.decision_function(X_test)  # signed distance
pipe.predict_proba(X_test)      # only with probability=True
```

`SVC(probability=True)` runs a separate Platt-scaling fit via CV -- **slow** and often poorly calibrated.

If you only need rankings → `decision_function`. If you need probabilities → `CalibratedClassifierCV` after the fact (L5 backup).

---

## Multi-class SVMs

`SVC` handles multi-class automatically using **one-vs-one**:

- One binary classifier per pair of classes → predict by majority vote
- For $K$ classes: $K(K-1)/2$ binary SVMs
- Manageable for small $K$, slow for large $K$

Alternative: `LinearSVC` uses **one-vs-rest** -- faster but only for the linear kernel.

---

## The cost picture

Training a kernel SVM scales between $O(n^2)$ and $O(n^3)$ in the number of training rows.
- Memory grows with the number of **support vectors** (often a large fraction of $n$ in noisy data)
- For projects in this course (hundreds to tens of thousands of rows): **practical**
- For millions of rows: **not practical**

---

## `LinearSVC` for large or high-dimensional data

If a linear kernel is enough (text, very high-dimensional sparse features), `LinearSVC` uses a different solver that scales to large datasets.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# with_mean=False keeps sparse inputs sparse (no centring)
pipe = make_pipeline(StandardScaler(with_mean=False),
                     LinearSVC(C=1.0, max_iter=5000))
```

- **No `gamma`, no kernel choice** -- just `C`. Fast.
- For **TF-IDF + classification**, start with `LinearSVC`.

---

## When SVMs are the right choice

- **Moderate-sized** datasets (hundreds to low tens of thousands of rows)
- **High-dimensional** feature spaces with relatively few samples
- When you want a **principled geometric** model and a small set of support vectors
- **Text classification** with sparse features (use `LinearSVC`)

When *not* to reach for SVMs:

- Very **large** datasets
- When you need **easy probability calibration**
- When **interpretability of individual features** is the priority

---

## The setup

- **Dataset:** `load_breast_cancer()` -- binary, 569 rows, 30 numeric features (re-used from L5 and L10 -- stable reference)
- **Protocol:**
  - 80/20 train/test split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1`
- **Four workflows:** linear SVM, RBF SVM, polynomial SVM, random forest

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Workflow 1: linear SVM

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe_lin = make_pipeline(StandardScaler(), SVC(kernel='linear'))
grid_lin = {'svc__C': [0.01, 0.1, 1, 10, 100]}

search_lin = GridSearchCV(pipe_lin, grid_lin, cv=5, scoring='f1')
search_lin.fit(X_train, y_train)

print(f"Linear best C: {search_lin.best_params_}")
print(f"Linear CV F1: {search_lin.best_score_:.3f}")
```

One-dimensional grid: only `C` to tune.

---

## Workflow 2: RBF SVM with a `C` × `gamma` grid

```python
pipe_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid_rbf = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
}

search_rbf = GridSearchCV(pipe_rbf, grid_rbf, cv=5, scoring='f1')
search_rbf.fit(X_train, y_train)

print(f"RBF best: {search_rbf.best_params_}")
print(f"RBF CV F1: {search_rbf.best_score_:.3f}")
```

Inspect the full heatmap, not just `.best_params_`. The plateau shape tells you whether you are robust or just lucky.

---

## Workflow 3: polynomial SVM

```python
pipe_poly = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3))
grid_poly = {
    'svc__C': [0.1, 1, 10],
    'svc__coef0': [0, 1],
}

search_poly = GridSearchCV(pipe_poly, grid_poly, cv=5, scoring='f1')
search_poly.fit(X_train, y_train)

print(f"Poly CV F1: {search_poly.best_score_:.3f}")
```

Degree 3 is the default starting point. Higher degrees rarely help on tabular data and are slow.

---

## Honest comparison, including L10's random forest

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

for name, model in [
    ('Linear SVM', search_lin),
    ('RBF SVM', search_rbf),
    ('Poly SVM', search_poly),
    ('Random forest', rf),
]:
    # f1_score for every model: rf.score() would report accuracy, not F1
    test_f1 = f1_score(y_test, model.predict(X_test))
    print(f"{name:14s} Test F1 = {test_f1:.3f}")
```

Two L7 questions:

- Are the gaps **bigger than the CV std**?
- Is the chosen winner **stable**?
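
One way to approach the first question is to look at the fold-to-fold spread directly. The sketch below is self-contained and makes some illustrative assumptions that are not the lecture's tuned settings: untuned hyperparameters (`C=1.0`, `gamma='scale'`, 100 trees) and a shuffled `StratifiedKFold` shared by both models.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fixed folds so both models are scored on identical splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Assumed, untuned settings: swap in your own tuned estimators
models = {
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:14s} CV F1 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Compare the gap between mean scores with the fold-to-fold noise
gap = abs(summary["RBF SVM"][0] - summary["Random forest"][0])
noise = max(std for _, std in summary.values())
print(f"Gap = {gap:.3f}, largest fold std = {noise:.3f}")
```

If the gap is smaller than the fold std, the ranking between the two models is probably not something to lean on.
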
---

## What this experiment shows

- RBF SVM and random forest are in a near-tie under a fair protocol
- **Kernel choice** matters more than `C` alone -- linear vs RBF is a bigger gap than tuning `C` within RBF
- L6 selection rule: when two models tie, pick the **simpler / easier to defend**
- For breast cancer that means RBF SVM (smaller, fewer "moving parts") -- but the choice is defensible either way

---

## Summary

- SVMs find a **maximum-margin** boundary -- a single optimised hyperplane
- **Soft margin** with `C` handles non-separable data. `C` is the L6 dial.
- **Kernels** let you draw nonlinear boundaries by computing inner products in higher-dimensional spaces *implicitly*
- **Scaling is mandatory.** Tune `C` and `gamma` on a **log-spaced 2D grid**.
- SVMs are strong on **moderate-sized**, **high-dimensional** problems; ensembles often win on **large** tabular data

---

## Before Lecture 12

- **Run** today's SVM comparison on your own machine
- For your project: add an SVM under the same CV protocol. If high-dimensional (text, embeddings), start with `LinearSVC`.
- Read ahead: **Lecture 12 is gradient boosting** -- back to trees, but fitted sequentially

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## One-class SVM for novelty detection

Unsupervised variant -- fit a boundary around the "normal" class, flag points outside it.

```python
from sklearn.svm import OneClassSVM

oc = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc.fit(X_normal)
labels = oc.predict(X_new)  # +1 = normal, -1 = anomaly
```

`nu` is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors.

Useful in project contexts with an anomaly framing.
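
To see what `nu` does in practice, here is a small sketch on synthetic data. The arrays `X_normal` and `X_far` are made up for illustration (a 2D Gaussian blob and a distant cluster); they are not course data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # hypothetical "normal" class
X_far = rng.normal(loc=10.0, scale=0.5, size=(20, 2))     # hypothetical obvious anomalies

oc = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc.fit(X_normal)

train_flagged = np.mean(oc.predict(X_normal) == -1)  # fraction of training rows flagged
sv_fraction = len(oc.support_) / len(X_normal)       # fraction kept as support vectors
far_labels = oc.predict(X_far)                       # +1 = normal, -1 = anomaly

print(f"Flagged in training: {train_flagged:.3f}")
print(f"Support vectors:     {sv_fraction:.3f}")
print(f"Far cluster labels:  {np.unique(far_labels)}")
```

The flagged fraction should land roughly near `nu`, and the support-vector fraction at or above it, although the solver's numerical tolerance can shift both slightly.
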
--

## Nu-SVM

Alternative parameterisation:

```python
from sklearn.svm import NuSVC

clf = NuSVC(nu=0.3, kernel='rbf', gamma='scale')
```

`nu ∈ (0, 1]` bounds:

- The **fraction of margin violations** from above
- The **fraction of support vectors** from below

Same underlying model as `SVC`, different knob. Sometimes easier to reason about for imbalanced data.

--

## Support vector regression (SVR)

The same machinery applied to regression with an $\varepsilon$-insensitive loss:

```python
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.1))
```

- Errors within $\varepsilon$ are not penalised
- Errors outside $\varepsilon$ contribute linearly

Useful for regression projects that want a kernel method alongside ridge.

--

## Kernel approximation: `Nystroem` and `RBFSampler`

Explicit feature maps that **approximate** the RBF kernel:

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipe = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=500, random_state=0),
    LinearSVC(C=1.0, max_iter=5000),
)
```

Lets you use a linear model on the approximated features → SVM-like behaviour at near-linear cost.

The escape hatch when your data is too big for `SVC`.

--

## Why SVM probabilities are weird

`SVC` has no native probability -- the decision function is a signed distance.

`probability=True` runs **Platt scaling** under the hood:

1. Cross-validated decision-function scores
2. Logistic regression mapping scores → probabilities
3. Refit on the full data with the scaling locked in

Results are often miscalibrated for imbalanced data. Prefer `CalibratedClassifierCV(method='isotonic')` after the SVM is chosen.

---

## What's next

**Lecture 12:** Gradient boosting

- Boosting: learn sequentially from mistakes
- Gradient boosting machines
- XGBoost and LightGBM