Support Vector Machines

## MST0052 -- Lecture 11

### Support Vector Machines

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- The **maximum-margin** idea -- a geometric principle
- **Soft margin** and the `C` parameter
- The **kernel trick** -- nonlinear boundaries without nonlinear features
- SVMs in scikit-learn -- scaling, tuning, cost
- Worked example: linear vs RBF vs polynomial SVM, compared to L10's random forest

---

## A different design philosophy

- Random forests **average** many weak boundaries until something stable emerges
- SVMs **optimise** a single boundary by a clear geometric principle

Neither is universally better. Two genuinely different ways to think about classification.

---

## The decision boundary in 2D

![Many candidate boundaries -- which is best?](/figures/svm-margin.svg)

All of these perfectly separate the training data. Training accuracy alone says they are equally good.

> Which one is best?

---

## The maximum margin

The **margin** = the distance from the boundary to the nearest training points on either side.

The **maximum-margin** boundary is as far as possible from both classes.

$$\max \frac{2}{\|w\|} \quad \text{or equivalently} \quad \min \tfrac{1}{2}\|w\|^2$$

subject to $y_i(w^\top x_i + b) \geq 1$ for all $i$.

Intuition: a wide margin is harder to cross with noisy data → better generalisation.
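
To make the formula concrete, here is a minimal sketch that fits a linear `SVC` and reads the margin width $2/\|w\|$ off the fitted coefficients. The toy clusters are hypothetical, and a very large `C` is used only to approximate the hard-margin problem.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, size=(20, 2)),
               rng.normal(+3, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin problem on separable data.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)

# Constraint check: y_i (w^T x_i + b) should be >= 1 up to solver tolerance.
functional_margins = y * (X @ w + b)
print(f"margin width 2/||w||      = {margin_width:.3f}")
print(f"smallest y_i(w^T x_i + b) = {functional_margins.min():.3f}")
```

The smallest constraint value should sit close to 1: those are the points on the edge of the margin.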

---

## Support vectors

The boundary depends only on the training points that sit on the edge of the margin (or inside it, in the soft case).

Those points are the **support vectors**. Everything else could be deleted and the boundary wouldn't move.

A small set of "critical" training examples defines the entire model.
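
A quick way to check the "delete everything else" claim, as a sketch on hypothetical toy data: refit using only the support vectors and compare the weight vectors. Agreement is expected only up to solver tolerance.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two Gaussian clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1.0, size=(50, 2)),
               rng.normal(+2, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

full = SVC(kernel='linear', C=1.0).fit(X, y)
sv_idx = full.support_          # indices of the support vectors

# Refit on the support vectors only; all other points are dropped.
reduced = SVC(kernel='linear', C=1.0).fit(X[sv_idx], y[sv_idx])

print(f"{len(sv_idx)} of {len(X)} points are support vectors")
print("w (full):   ", full.coef_[0])
print("w (reduced):", reduced.coef_[0])
```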

---

## Real data isn't perfectly separable

The hard-margin idea requires a straight line that separates the classes perfectly. Real datasets rarely cooperate.

If we insist on hard separation:

- The model becomes brittle
- Or has no solution at all

We need to allow **some** misclassifications without abandoning the maximum-margin idea.

---

## Slack variables

Add a slack $\xi_i \geq 0$ for each training point -- how badly does this point violate its margin?

$$\min \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$$

subject to $y_i(w^\top x_i + b) \geq 1 - \xi_i$, with $\xi_i \geq 0$.

- First term wants a **wide margin**
- Second term **penalises violations**

The tradeoff is governed by `C`.
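
The slacks can be recovered from a fitted model: at the optimum, $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$. The sketch below computes them for a linear `SVC` on hypothetical overlapping clusters and evaluates the objective above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping clusters, labelled -1 / +1 to match the formula.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, size=(50, 2)),
               rng.normal(+1, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
clf = SVC(kernel='linear', C=C).fit(X, y)

f = clf.decision_function(X)           # w^T x_i + b
xi = np.maximum(0.0, 1.0 - y * f)      # slack of each training point

w = clf.coef_[0]
objective = 0.5 * w @ w + C * xi.sum()
print(f"points with xi > 0 (inside the margin or worse): {(xi > 0).sum()}")
print(f"points with xi > 1 (misclassified):              {(xi > 1).sum()}")
print(f"objective value: {objective:.3f}")
```

Points exactly on the margin can show a tiny positive slack from solver tolerance, so treat the first count as approximate.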

---

## `C`: the bias-variance dial in SVMs

| `C` | Effect on margin | Effect on fit |
|-----|------------------|---------------|
| **Large** | Narrow margin, few violations | Fits training data closely → high variance, low bias |
| **Small** | Wide margin, many violations | Smoother → low variance, high bias |

This is L6's dial again, in SVM clothing. Tune with CV (L7).

---

## What `C` does to the boundary

![Three SVM boundaries for C ∈ {0.1, 1, 100}](/figures/svm-soft-margin.svg)

- **Small `C`:** wide margin, several violations, smooth boundary
- **Large `C`:** narrow margin, few violations, boundary contorts to honour individual points
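
The same pattern in numbers, as a sketch on hypothetical overlapping clusters with a linear kernel: margin width and support-vector count for the three values of `C` in the figure.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping 2D clusters, so some violations are unavoidable.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.0, size=(100, 2)),
               rng.normal(+1, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

widths = {}
for C in [0.1, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    widths[C] = 2 / np.linalg.norm(clf.coef_[0])   # margin width 2/||w||
    print(f"C={C:<5}  margin width={widths[C]:.2f}  "
          f"support vectors={clf.n_support_.sum()}")
```

The margin width should not grow as `C` increases; how fast it shrinks depends on the data.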

---

## When no linear boundary works

Some datasets cannot be separated by a straight line at any cost.

Three options:

1. **Give up** and use a different model family
2. **Build nonlinear features** by hand and hope a linear boundary works there
3. **Let the SVM build them for you, implicitly** -- the kernel trick

---

## Explicit feature maps

For $x \in \mathbb{R}^2$, transform via:

$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$$

In this 3D space, concentric rings become linearly separable.

The cost:

- More dimensions, more compute
- You had to **guess** the right feature map

---

## The kernel trick in one sentence

> A **kernel** is a function $k(x, x') = \phi(x)^\top \phi(x')$ that gives you the inner product in some feature space *without* computing $\phi$ explicitly.

Why this matters: the SVM optimisation only ever needs inner products of training points -- never the points themselves in the lifted space.

> We can work in fantastic, even infinite-dimensional, feature spaces *for the cost of one kernel evaluation per pair of points.*
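
A numerical check of the identity, using the degree-2 feature map from the previous slide. The two example points are arbitrary; the kernel here is the homogeneous polynomial $k(x, x') = (x^\top x')^2$.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly2(x, z):
    """Homogeneous degree-2 polynomial kernel: (x^T z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lifted = phi(x) @ phi(z)   # inner product computed in the 3D feature space
direct = k_poly2(x, z)     # same quantity computed in the original 2D space
print(lifted, direct)
```

Both numbers agree, but `direct` never builds the 3D vectors. That saving is what grows with the dimension of the feature space.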

---

## Common kernels

| Kernel | $k(x, x')$ | What it buys |
|--------|------------|--------------|
| **Linear** | $x^\top x'$ | Straight boundary (similar to regularised logistic regression) |
| **Polynomial** | $(\gamma\, x^\top x' + r)^d$ | Curved boundaries of degree $d$ |
| **RBF (Gaussian)** | $\exp(-\gamma \|x - x'\|^2)$ | Flexible, local boundaries |
| **Sigmoid** | $\tanh(\gamma\, x^\top x' + r)$ | Rarely used |

![Linear, RBF, and polynomial SVM boundaries on the same 2D data](/figures/svm-kernel-boundaries.svg)

For your project: **linear** when $p \gg n$ or features are already meaningful; **RBF** for almost everything else; **polynomial** only with specific structure in mind.

---

## RBF and the `gamma` parameter

$$k(x, x') = \exp(-\gamma \|x - x'\|^2)$$

$\gamma$ controls **how local** the kernel is:

- **Small `gamma`** → far-away points still affect the kernel → smoother, more global boundaries
- **Large `gamma`** → only nearby points matter → wiggly, local boundaries (one bump per training point)

![RBF SVM boundaries for gamma ∈ {0.01, 1, 100}](/figures/svm-gamma-effect.svg)
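
To see "how local" in numbers, the sketch below evaluates the RBF kernel between a reference point and neighbours at distances 0.5, 1 and 3, for the three `gamma` values in the figure. The points are arbitrary; the manual formula is checked against `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
neighbours = np.array([[0.5, 0.0], [1.0, 0.0], [3.0, 0.0]])   # distances 0.5, 1, 3

values = {}
for gamma in [0.01, 1, 100]:
    # k(x, x') = exp(-gamma * ||x - x'||^2), written out by hand
    manual = np.exp(-gamma * np.sum((neighbours - x) ** 2, axis=1))
    library = rbf_kernel(x, neighbours, gamma=gamma)[0]
    assert np.allclose(manual, library)
    values[gamma] = manual
    print(f"gamma={gamma:<5} k at distance 0.5, 1, 3: {manual}")
```

With small `gamma` even the far point keeps a kernel value near 1; with large `gamma` the nearest point is already close to 0.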

---

## `C` and `gamma` interact strongly

Tuning one without the other is a trap.

Standard practice: a **2D grid** over `C` and `gamma`, both on a **log scale**.

```python
param_grid = {
    'svc__C':     [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
}
```

A **heatmap of CV scores** over the grid is the most informative diagnostic you can produce.

---

## The SVM workflow

1. **Split** off the test set
2. **Scale** features (mandatory -- see next slide)
3. **Build** a pipeline: `StandardScaler` + `SVC`
4. **Tune** `C` and `gamma` with `GridSearchCV` on a log-spaced grid
5. **Evaluate** once on the test set

Same L7 pattern, new model.

---

## Why scaling is mandatory for SVMs

All non-linear kernels (and the regularised linear SVM) depend on **distances** or **inner products** of features.

A feature on the wrong scale **dominates** the kernel and the optimisation -- same problem we saw with k-NN, ridge, and PCA.

> `StandardScaler` before `SVC` is non-negotiable.
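
A sketch of what goes wrong without scaling, assuming the breast-cancer data used later in this lecture and a fixed, illustrative `gamma=0.01` so both runs use the same kernel width:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Same kernel and gamma in both runs; only the scaling step differs.
unscaled = cross_val_score(SVC(kernel='rbf', gamma=0.01),
                           X, y, cv=5, scoring='f1')
scaled = cross_val_score(make_pipeline(StandardScaler(),
                                       SVC(kernel='rbf', gamma=0.01)),
                         X, y, cv=5, scoring='f1')

print(f"unscaled CV F1: {unscaled.mean():.3f}")
print(f"scaled   CV F1: {scaled.mean():.3f}")
```

On the raw data the large-valued features dominate the squared distances, so the kernel is close to zero for almost every pair of points.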

---

## `SVC` in code

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma='scale'),
)
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

- `gamma='scale'` (the default) sets $\gamma = 1 / (n_\text{features} \cdot \text{Var}(X))$ -- a sensible starting point
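
The formula can be checked by hand. In this sketch (breast-cancer data assumed), the features are standardised first, so $\text{Var}(X) \approx 1$ and `gamma` comes out near $1/30$. A model fitted with that explicit value should match `gamma='scale'`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 1 / (n_features * Var(X)), with the variance taken over all entries of X
gamma_scale = 1.0 / (X_std.shape[1] * X_std.var())
print(f"gamma from the formula: {gamma_scale:.4f}")

auto = SVC(kernel='rbf', gamma='scale').fit(X_std, y)
manual = SVC(kernel='rbf', gamma=gamma_scale).fit(X_std, y)
same = np.allclose(auto.decision_function(X_std),
                   manual.decision_function(X_std))
print(f"same decision function: {same}")
```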

---

## Probabilities and `predict_proba`

By default, `SVC` does **not** return probabilities -- it exposes a **decision function** (a signed score, proportional to the distance from the boundary).

```python
pipe.decision_function(X_test)   # signed distance
pipe.predict_proba(X_test)       # only with probability=True
```

`SVC(probability=True)` runs a separate Platt-scaling fit via CV -- **slow** and often poorly calibrated.

If you only need rankings → `decision_function`. If you need probabilities → `CalibratedClassifierCV` after the fact (L5 backup).
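
A minimal sketch of the "after the fact" route, assuming the breast-cancer split used later in this lecture. `method='sigmoid'` and `cv=5` are illustrative choices, not recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No probability=True: the calibrator works from decision_function scores.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
calibrated = CalibratedClassifierCV(svm, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)   # one column per class
print(proba[:3].round(3))
```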

---

## Multi-class SVMs

`SVC` handles multi-class automatically using **one-vs-one**:

- One binary classifier per pair of classes → predict by majority vote
- For $K$ classes: $K(K-1)/2$ binary SVMs
- Manageable for small $K$, slow for large $K$

Alternative: `LinearSVC` uses **one-vs-rest** -- faster but only for the linear kernel.
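
The pair count is visible in the raw one-vs-one scores. This sketch uses scikit-learn's digits data ($K = 10$) as a stand-in multi-class problem. With `decision_function_shape='ovo'` there is one score column per pair of classes.

```python
from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)     # stand-in dataset with 10 classes
K = len(set(y))

clf = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', decision_function_shape='ovo'),
)
clf.fit(X, y)

scores = clf.decision_function(X[:5])   # one column per pair of classes
print(f"K = {K}, pairs = {K * (K - 1) // 2}, score columns = {scores.shape[1]}")
```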

---

## The cost picture

Training a kernel SVM scales between $O(n^2)$ and $O(n^3)$ in the number of training rows.

- Memory grows with the number of **support vectors** (often a large fraction of $n$ in noisy data)
- For projects in this course (hundreds to tens of thousands of rows): **practical**
- For millions of rows: **not practical**

---

## `LinearSVC` for large or high-dimensional data

If a linear kernel is enough (text, very high-dimensional sparse features), `LinearSVC` uses a different solver that scales to large datasets.

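```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# with_mean=False keeps sparse inputs (e.g. TF-IDF) sparse while scaling
pipe = make_pipeline(StandardScaler(with_mean=False),
                     LinearSVC(C=1.0, max_iter=5000))
```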

- **No `gamma`, no kernel choice** -- just `C`. Fast.
- For **TF-IDF + classification**, start with `LinearSVC`.

---

## When SVMs are the right choice

- **Moderate-sized** datasets (hundreds to low tens of thousands of rows)
- **High-dimensional** feature spaces with relatively few samples
- When you want a **principled geometric** model and a small set of support vectors
- **Text classification** with sparse features (use `LinearSVC`)

When *not* to reach for SVMs:

- Very **large** datasets
- When you need **easy probability calibration**
- When **interpretability of individual features** is the priority

---

## The setup

- **Dataset:** `load_breast_cancer()` -- binary, 569 rows, 30 numeric features
  (re-used from L5 and L10 -- stable reference)
- **Protocol:**
  - 80/20 train/test split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1`
- **Four workflows:** linear SVM, RBF SVM, polynomial SVM, random forest

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Workflow 1: linear SVM

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe_lin = make_pipeline(StandardScaler(), SVC(kernel='linear'))
grid_lin = {'svc__C': [0.01, 0.1, 1, 10, 100]}

search_lin = GridSearchCV(pipe_lin, grid_lin, cv=5, scoring='f1')
search_lin.fit(X_train, y_train)
print(f"Linear  best C: {search_lin.best_params_}")
print(f"Linear  CV F1:  {search_lin.best_score_:.3f}")
```

One-dimensional grid: only `C` to tune.

---

## Workflow 2: RBF SVM with a `C` × `gamma` grid

```python
pipe_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid_rbf = {
    'svc__C':     [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
}

search_rbf = GridSearchCV(pipe_rbf, grid_rbf, cv=5, scoring='f1')
search_rbf.fit(X_train, y_train)
print(f"RBF best: {search_rbf.best_params_}")
print(f"RBF CV F1: {search_rbf.best_score_:.3f}")
```

![CV F1 across the C × gamma grid](/figures/svm-cv-heatmap.svg)

Inspect the full heatmap, not just `.best_params_`. The plateau shape tells you whether you are robust or just lucky.
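
One way to get the grid as a table, sketched here as a self-contained rerun of the same search: reshape `cv_results_` into a `gamma` × `C` frame with pandas. The names `table` and `heat` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1, 1]},
    cv=5, scoring='f1',
).fit(X_train, y_train)

# One row per grid point, then pivot into a gamma x C table of mean CV F1.
cv = search.cv_results_
table = pd.DataFrame(cv['params']).assign(
    mean_f1=cv['mean_test_score'], std_f1=cv['std_test_score']
)
heat = table.pivot(index='svc__gamma', columns='svc__C', values='mean_f1')
print(heat.round(3))
```

`heat` can be passed straight to a plotting function such as `seaborn.heatmap`. A broad region of similar scores around the best cell suggests a robust choice.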

---

## Workflow 3: polynomial SVM

```python
pipe_poly = make_pipeline(StandardScaler(),
                          SVC(kernel='poly', degree=3))
grid_poly = {
    'svc__C': [0.1, 1, 10],
    'svc__coef0': [0, 1],
}

search_poly = GridSearchCV(pipe_poly, grid_poly, cv=5, scoring='f1')
search_poly.fit(X_train, y_train)
print(f"Poly CV F1: {search_poly.best_score_:.3f}")
```

Degree 3 is the default starting point. Higher degrees rarely help on tabular data and are slow.

---

## Honest comparison, including L10's random forest

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

for name, model in [
    ('Linear SVM',    search_lin),
    ('RBF SVM',       search_rbf),
    ('Poly SVM',      search_poly),
    ('Random forest', rf),
]:
    # Compute F1 explicitly: rf.score() would return accuracy, not F1.
    test_f1 = f1_score(y_test, model.predict(X_test))
    print(f"{name:14s}  Test F1 = {test_f1:.3f}")
```

Two L7 questions:

- Are the gaps **bigger than the CV std**?
- Is the chosen winner **stable**?
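
A sketch of how the first question could be checked: score both model families on the same stratified folds and compare the gap in mean F1 with the fold-to-fold spread. The SVM here uses default hyperparameters rather than the tuned ones, and the shuffled `StratifiedKFold` is an assumption made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Identical folds for both models, so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    'RBF SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    'Random forest': RandomForestClassifier(n_estimators=300, random_state=42),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:14s}  CV F1 = {scores.mean():.3f} ± {scores.std():.3f}")

gap = abs(summary['RBF SVM'][0] - summary['Random forest'][0])
print(f"gap between means: {gap:.3f}")
```

If the gap is smaller than either standard deviation, the ranking could easily flip on a different split.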

---

## What this experiment shows

- RBF SVM and random forest are in a near-tie under a fair protocol
- **Kernel choice** matters more than `C` alone -- linear vs RBF is a bigger gap than tuning `C` within RBF
- L6 selection rule: when two models tie, pick the **simpler / easier to defend**
- For breast cancer that means RBF SVM (smaller, fewer "moving parts") -- but the choice is defensible either way

---

## Summary

- SVMs find a **maximum-margin** boundary -- a single optimised hyperplane
- **Soft margin** with `C` handles non-separable data. `C` is the L6 dial.
- **Kernels** let you draw nonlinear boundaries by computing inner products in higher-dimensional spaces *implicitly*
- **Scaling is mandatory.** Tune `C` and `gamma` on a **log-spaced 2D grid**.
- SVMs are strong on **moderate-sized**, **high-dimensional** problems; ensembles often win on **large** tabular data

---

## Before Lecture 12

- **Run** today's SVM comparison on your own machine
- For your project: add an SVM under the same CV protocol. If high-dimensional (text, embeddings), start with `LinearSVC`.
- Read ahead: **Lecture 12 is gradient boosting** -- back to trees, but fitted sequentially

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## One-class SVM for novelty detection

Unsupervised variant -- fit a boundary around the "normal" class, flag points outside it.

```python
from sklearn.svm import OneClassSVM

oc = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc.fit(X_normal)
labels = oc.predict(X_new)   # +1 = normal, -1 = anomaly
```

`nu` is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors.

Useful in project contexts with an anomaly framing.

## Nu-SVM

Alternative parameterisation:

```python
from sklearn.svm import NuSVC
clf = NuSVC(nu=0.3, kernel='rbf', gamma='scale')
```

`nu ∈ (0, 1]` bounds:

- The **fraction of margin violations** from above
- The **fraction of support vectors** from below

Same underlying model as `SVC`, different knob. Sometimes easier to reason about for imbalanced data.

## Support vector regression (SVR)

The same machinery applied to regression with an $\varepsilon$-insensitive loss:

```python
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     SVR(kernel='rbf', C=1.0, epsilon=0.1))
```

- Errors within $\varepsilon$ are not penalised
- Errors outside $\varepsilon$ contribute linearly

Useful for regression projects that want a kernel method alongside ridge.
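
The two bullets above amount to a one-line loss function. This sketch writes it out in NumPy. The function name and the example residuals are illustrative, not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the excess outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Residuals of 0, 0.05 and 0.1 fall inside the tube; 0.3 and -0.5 do not.
y_true = np.array([0.0, 0.05, 0.1, 0.3, -0.5])
y_pred = np.zeros_like(y_true)

loss = epsilon_insensitive(y_true, y_pred, epsilon=0.1)
print(loss)
```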

## Kernel approximation: `Nystroem` and `RBFSampler`

Explicit feature maps that **approximate** the RBF kernel:

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipe = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=500, random_state=0),
    LinearSVC(C=1.0, max_iter=5000),
)
```

Lets you use a linear model on the approximated features → SVM-like behaviour at near-linear cost.

The escape hatch when your data is too big for `SVC`.
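
The same pattern with `Nystroem`, which builds its features from a random subset of the training rows rather than from random Fourier features. This is a sketch on the breast-cancer data. `gamma=0.1` and `n_components=100` are illustrative values, not tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),
    # Approximate RBF feature map built from 100 sampled training rows
    Nystroem(kernel='rbf', gamma=0.1, n_components=100, random_state=0),
    LinearSVC(C=1.0, max_iter=5000),
)

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(f"Nystroem + LinearSVC CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```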

## Why SVM probabilities are weird

`SVC` has no native probability -- the decision function is a signed distance.

`probability=True` runs **Platt scaling** under the hood:

1. Cross-validated decision-function scores
2. Logistic regression mapping scores → probabilities
3. Refit on the full data with the scaling locked in

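Results are often miscalibrated for imbalanced data. Prefer wrapping the chosen SVM in `CalibratedClassifierCV` once it has been selected: `method='isotonic'` is the more flexible option but needs a reasonably large calibration set; with little data, `method='sigmoid'` is the safer choice.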

---

## What's next

**Lecture 12:** Gradient boosting

- Boosting: learn sequentially from mistakes
- Gradient boosting machines
- XGBoost and LightGBM