Model Selection and Cross-Validation

## MST0052 -- Lecture 7

### Model Selection and Cross-Validation

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| **4--7** | **Core methods -- you are here** |
| 9--14 | Going further |
| 15--16 | Wrapping up |

---

## Today's plan

- **Cross-validation** in depth -- k-fold, stratified, time-series
- **Hyperparameter tuning** with `GridSearchCV`
- Choosing a **metric** that matches the problem
- The most common selection **mistakes**
- Worked example: **comparing three model families** on one dataset

---

## L6 in one sentence

> Out-of-sample error has a sweet spot in model complexity. The job is to find it without cheating.

Today: *how* you find it, in code, on real data.

---

## Model selection is *workflow* selection

A "model" in your project is not just an algorithm. It is:

> **preprocessing + features + algorithm + hyperparameters**

When you compare ridge vs random forest, you are comparing two *workflows*. The winner is a workflow, not just an algorithm.

This matters: mixing pieces from different workflows is how leakage and unfair comparisons creep in.

---

## The three sets

| Set | Used for | Touched how often |
|-----|----------|-------------------|
| **Training** | Fit the model | Every time |
| **Validation** | Tune hyperparameters, compare workflows | Many times -- but never the final score |
| **Test** | One final, honest evaluation | **Exactly once** |

The test set is sacred. Tune against it and it stops being a test set -- you have no honest evaluation left.

---

## Why a single train/validation split isn't enough

A single 80/20 split gives you *one* number. That number is **noisy**:

- The validation set is small
- Which 20% of rows you happened to draw matters
- Run it twice with different seeds and the "best model" can change

We need an estimate that **averages over the luck of the split.**
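
A rough way to see how much luck is involved: the sketch below simulates a model whose true accuracy is assumed to be 0.80 and scores it on many random validation sets of two sizes. It uses only the standard library; `simulated_validation_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
import random
import statistics

def simulated_validation_accuracy(n_val, true_acc, seed):
    """Score a model with a known true accuracy on n_val random rows."""
    rng = random.Random(seed)
    hits = sum(rng.random() < true_acc for _ in range(n_val))
    return hits / n_val

# Assumed setup: true accuracy 0.80, 200 different random draws per size.
small = [simulated_validation_accuracy(40, 0.80, s) for s in range(200)]
large = [simulated_validation_accuracy(4000, 0.80, s) for s in range(200)]

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"n_val=40:   min={min(small):.2f} max={max(small):.2f} std={spread_small:.3f}")
print(f"n_val=4000: min={min(large):.2f} max={max(large):.2f} std={spread_large:.3f}")
```

The model never changes here, only the rows it is scored on. With 40 validation rows the score should swing far more than with 4000.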

---

## k-fold cross-validation

![K-fold cross-validation](/figures/kfold-cv.svg)

1. Split the training data into $k$ folds of (roughly) equal size
2. For each fold: fit on the other $k-1$, score on the held-out one
3. Average the $k$ scores (and report the standard deviation)

Typical choices: $k = 5$ (default) or $k = 10$ (lower bias, more compute).
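
The three steps can be sketched by hand. This is a minimal, unshuffled version written for illustration; `kfold_indices` is a hypothetical helper, and in practice you would use scikit-learn's `KFold`.

```python
def kfold_indices(n_samples, k):
    """Return k (train, val) index pairs over range(n_samples), no shuffling."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra row, so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

folds = kfold_indices(n_samples=10, k=3)
for i, (train, val) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train} val={val}")
```

Every row lands in exactly one validation fold, and no row is ever in both the training and validation part of the same fold.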

---

## What CV is estimating

CV gives you an estimate of the **out-of-sample error of the workflow** -- not of any particular fitted model.

- Each of the $k$ folds produces a *different* fitted model
- The **mean** of their scores is the generalisation estimate
- After CV picks the workflow, you **refit on the full training set** to get the model you ship
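
A small sketch of the distinction, assuming the full wine dataset stands in for the training set. `cross_validate(..., return_estimator=True)` keeps the model fitted in each fold, which makes it visible that CV produces several models and that the shipped model is a separate fit.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # assumption: treated as the training set
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# return_estimator=True keeps the (cloned) model fitted in each fold.
out = cross_validate(pipe, X, y, cv=5, return_estimator=True)
fold_models = out["estimator"]
print(f"{len(fold_models)} fold models, "
      f"mean CV accuracy {out['test_score'].mean():.3f}")

# The model you ship is a fresh fit of the same workflow on all training rows.
final_model = pipe.fit(X, y)
```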

---

## Stratified k-fold for classification

Plain k-fold randomises rows into folds. For **imbalanced** classification, that can produce folds with very different class ratios.

**Stratified k-fold** preserves the class ratio in every fold.

In scikit-learn:

- Classifiers passed to `cross_val_score(..., cv=5)` use stratified k-fold **by default**
- Regressors do not
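
To see the difference, the sketch below counts positives per validation fold on toy imbalanced labels (10 positives in 100 rows, an assumption for illustration). `positives_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 10 positives in 100 rows.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))  # feature values do not affect how rows are split

def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", strat)
```

The stratified splitter puts two positives in every fold; the plain splitter's counts depend on the shuffle.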

---

## `cross_val_score` and `cross_validate`

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

- Pass a splitter object for full control -- `shuffle=True` matters when data is ordered
- `cross_validate` returns more: train scores, fit times, multiple metrics
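
A short sketch of what `cross_validate` hands back, using `load_breast_cancer` as a stand-in binary dataset (an assumption, since `X_train` and `y_train` are not defined on this slide).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in binary dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# A list of scoring strings yields one test_* and one train_* array per metric.
out = cross_validate(pipe, X, y, cv=cv, scoring=["f1", "roc_auc"],
                     return_train_score=True)
for key in sorted(out):
    print(f"{key:15s} mean = {out[key].mean():.3f}")
```

A large gap between `train_*` and `test_*` for the same metric is a quick overfitting check.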

---

## When k-fold is wrong: time series

If your rows have a **temporal order**, plain shuffled k-fold trains on the future to predict the past. That is a leak -- CV scores will be unrealistically good.

Use `TimeSeriesSplit` instead:

```
Fold 1:  [train: 0..n1]      [val: n1..n2]
Fold 2:  [train: 0..n2]      [val: n2..n3]
Fold 3:  [train: 0..n3]      [val: n3..n4]
```

Each fold trains on data **up to** a cut point and validates on the next block.
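
The expanding-window idea can be sketched in a few lines. This is a hand-rolled illustration, not scikit-learn's implementation; `expanding_window_splits` is a hypothetical helper, and real projects should use `TimeSeriesSplit`.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, val) index lists where training always precedes validation."""
    block = n_samples // (n_splits + 1)       # size of each validation block
    first_cut = n_samples - n_splits * block  # the earliest rows are train-only
    for i in range(n_splits):
        cut = first_cut + i * block
        yield list(range(cut)), list(range(cut, cut + block))

splits = list(expanding_window_splits(n_samples=8, n_splits=3))
for i, (train, val) in enumerate(splits, start=1):
    print(f"Fold {i}: train={train} val={val}")
```

In every fold the largest training index is smaller than the smallest validation index, so the model never sees the future.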

---

## Other resampling variants

| Variant | When to reach for it |
|---------|---------------------|
| `KFold` | Default for regression on i.i.d. data |
| `StratifiedKFold` | Classification -- preserves class ratios |
| `ShuffleSplit` | Many random train/val splits |
| `GroupKFold` | Rows have a grouping (patients, users) that must not cross folds |
| `TimeSeriesSplit` | Temporal data |

Pick the splitter that matches your data's structure. **Defaults are safe only if your data is unstructured i.i.d.**
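
As one example of structure-aware splitting, here is a sketch with `GroupKFold` on toy data: 12 rows from 4 hypothetical patients (made-up IDs 1 to 4). It checks that no patient appears on both sides of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 4 hypothetical patients, 3 rows each.
groups = np.repeat([1, 2, 3, 4], 3)
X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1, 0], 4)

leaks = []
splitter = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(splitter.split(X, y, groups), start=1):
    train_groups = set(groups[train_idx].tolist())
    val_groups = set(groups[val_idx].tolist())
    leaks.append(train_groups & val_groups)  # empty set means no leakage
    print(f"Fold {fold}: validate on patients {sorted(val_groups)}")
```

To use it in a search, pass `cv=GroupKFold(...)` to `GridSearchCV` and supply `groups=` when calling `fit`.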

---

## The search problem

Every model family has knobs ($\lambda$ in ridge, $k$ in k-NN, depth in trees). Different knob values produce **different workflows**.

Each workflow has its own CV score. We want the knob value that maximises CV performance.

Hand-tuning ("I tried 0.1 and it looked OK") is not reproducible. **We need a search.**

---

## `GridSearchCV`

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(random_state=42))

param_grid = {
    'randomforestclassifier__n_estimators':    [100, 300],
    'randomforestclassifier__max_depth':       [None, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 5],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1',
                      n_jobs=-1, return_train_score=True)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV F1:  {search.best_score_:.3f}")
print(f"Test F1:     {search.score(X_test, y_test):.3f}")
```

---

## Read the full results, not just `.best_params_`

```python
import pandas as pd
results = pd.DataFrame(search.cv_results_)
cols = ['mean_train_score', 'mean_test_score', 'std_test_score',
        'param_randomforestclassifier__max_depth',
        'param_randomforestclassifier__n_estimators']
print(results[cols].sort_values('mean_test_score', ascending=False).head())
```

- The mean is one number; the **std** tells you whether the result is stable
- Many configurations score similarly → not very sensitive → pick the **simplest** (per L6)
- Best config sits at the **edge of your grid** → grid is too small, extend it

---

## Grid vs random vs halving

| Strategy | When it wins |
|----------|-------------|
| **Grid** | Small parameter space, want full coverage |
| **Randomized** (`RandomizedSearchCV`) | Large or continuous space -- sample $n$ random configs |
| **Halving** (`HalvingGridSearchCV`) | Many configs; eliminate poor ones on small data first |

For your project, **Grid** is fine for 1--2 hyperparameters; reach for **Randomized** beyond 3--4.
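
A minimal randomized-search sketch, assuming the wine data and a logistic-regression pipeline as stand-ins. The range for `C` and `n_iter=10` are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Sample C on a log scale instead of enumerating a fixed grid.
param_distributions = {"logisticregression__C": loguniform(1e-3, 1e2)}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=5,
                            scoring="f1_macro", random_state=0)
search.fit(X, y)
print(f"Tried {len(search.cv_results_['params'])} configs, "
      f"best: {search.best_params_}")
```

The compute budget is set by `n_iter`, not by the size of the space, which is why this scales better than a grid as you add hyperparameters.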

---

## The pipeline-CV pattern, end to end

1. **Split** off the test set. Lock it.
2. **Build** a `Pipeline` (preprocessing + model).
3. **Define** a parameter grid using `stepname__param` keys.
4. **GridSearchCV** on the training set with the right splitter and metric.
5. **Inspect** `cv_results_` -- not just `best_params_`.
6. **Score** `best_estimator_` on the test set, once.

This pattern does not change per model family -- only the pipeline and the grid do.

---

## The metric is a modelling choice

Whatever you pass to `scoring=` is the thing `GridSearchCV` optimises. **Pick wisely.**

The wrong metric is silent: the code runs, the search "finishes," the chosen workflow is wrong for your problem.

---

## Common `scoring=` strings

| Task | Useful strings |
|------|----------------|
| Regression | `'r2'`, `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`, `'neg_mean_absolute_error'` |
| Binary classification | `'accuracy'`, `'f1'`, `'roc_auc'`, `'average_precision'` |
| Multi-class classification | `'f1_macro'`, `'f1_weighted'`, `'balanced_accuracy'` |
| Custom | Build with `make_scorer` |

The `neg_` prefix is a convention: scikit-learn **maximises by default**, so loss-style metrics are negated.
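
For the "Custom" row, a sketch of `make_scorer` wrapping `fbeta_score` with `beta=2`, a metric that weights recall more heavily than precision. The dataset and the choice of F2 are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in binary dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Extra keyword arguments (here beta=2) are forwarded to the metric function.
f2_scorer = make_scorer(fbeta_score, beta=2)
scores = cross_val_score(pipe, X, y, cv=5, scoring=f2_scorer)
print(f"CV F2: {scores.mean():.3f} ± {scores.std():.3f}")
```

For a loss-style custom metric, pass `greater_is_better=False` so the scorer negates it, matching the `neg_` convention.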

---

## Multi-metric evaluation

```python
search = GridSearchCV(
    pipe, param_grid, cv=5,
    scoring={'f1': 'f1', 'roc_auc': 'roc_auc'},
    refit='f1',
)
```

- Optimise on F1, but **also report** ROC-AUC for each configuration
- Useful when you want to *report* multiple numbers without changing the selection rule

---

## Pitfall 1: leakage inside CV

The L3 leakage rule applies *within every fold of CV*.

- **Wrong:** scale the whole training set, then pass to `cross_val_score`
- The scaler has seen the validation folds → CV scores are optimistic

**Fix:** put the scaler **inside the pipeline**. `cross_val_score` and `GridSearchCV` then refit it on each fold's training data.
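
The effect of a scaler leak is usually small, so this sketch swaps in a step where the same mistake is easy to see: feature selection on pure noise. The data is random by construction (an assumption for illustration), so an honest estimate should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: 50 rows, 5000 features, labels unrelated to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))
y = np.array([0, 1] * 25)

# Leaky: pick features using ALL rows, then cross-validate on the reduced matrix.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=5000), X_leaky, y, cv=5)

# Honest: selection lives in the pipeline, so it is refit on each fold's training rows.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=5000))
honest = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky CV accuracy:  {leaky.mean():.3f}")
print(f"Honest CV accuracy: {honest.mean():.3f}")
```

Same data, same model, same folds. The only difference is whether the preprocessing step saw the validation rows.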

---

## Pitfall 2: optimism from the best CV score

`search.best_score_` is the CV score of the **best** configuration you tried.

That number is biased **upward** -- by the act of picking the maximum over many tries, you partly fit to validation noise.

> The honest generalisation estimate is `search.score(X_test, y_test)` -- the **test score** of the refit best model.
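
A standard-library simulation of the effect, under assumed numbers: 50 configurations that are all equally good (true score 0.70), each seen through a noisy CV estimate.

```python
import random

rng = random.Random(0)
TRUE_SCORE = 0.70  # assumed: every configuration is equally good
CV_NOISE = 0.03    # assumed standard deviation of a single CV estimate

# 50 configurations, each observed through a noisy CV estimate.
cv_estimates = [rng.gauss(TRUE_SCORE, CV_NOISE) for _ in range(50)]
best_observed = max(cv_estimates)
average_observed = sum(cv_estimates) / len(cv_estimates)

print(f"True score of every config: {TRUE_SCORE:.3f}")
print(f"Average CV estimate:        {average_observed:.3f}")
print(f"Best CV estimate:           {best_observed:.3f}")
```

The average of the estimates stays close to the truth, but the maximum does not. That maximum is what `best_score_` reports.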

---

## Pitfall 3: too many comparisons

Test 50 workflows and one will look great by chance, even if all are equally weak.

**Symptoms:**

- "Winning" hyperparameter is at one end of the grid
- Gaps between the top configs are much smaller than the std across folds
- You can't reproduce the winner with a different random seed

**Defence:** keep the grid focused, prefer the simpler model when scores tie, run with a couple of different `random_state` values to check stability.

---

## Nested cross-validation: the concept

```
Outer loop  (estimates generalisation)
  └ Inner loop  (tunes hyperparameters with GridSearchCV)
```

- **Outer loop:** honest estimate of the *whole pipeline including the hyperparameter tuning*
- **Inner loop:** picks hyperparameters for each outer fold

This is what you need when you want a CV score that includes the tuning cost.

---

## Nested CV in scikit-learn

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner, scoring='f1')
nested = cross_val_score(search, X_train, y_train, cv=outer, scoring='f1')

print(f"Nested CV F1: {nested.mean():.3f} ± {nested.std():.3f}")
```

Costs $k_\text{outer} \times k_\text{inner} \times |\text{grid}|$ fits, plus one refit of the winning configuration per outer fold (with the default `refit=True`).
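
As a worked count, assuming `param_grid` is the random-forest grid from the `GridSearchCV` slide (2 × 3 × 2 values) and the 5-outer, 3-inner folds above:

```python
from math import prod

# Grid sizes: n_estimators (2) x max_depth (3) x min_samples_leaf (2).
grid_size = prod([2, 3, 2])
k_outer, k_inner = 5, 3

search_fits = k_outer * k_inner * grid_size
refits = k_outer  # GridSearchCV refits the winner once per outer fold (refit=True)
total_fits = search_fits + refits
print(f"grid size = {grid_size}, search fits = {search_fits}, total = {total_fits}")
# grid size = 12, search fits = 180, total = 185
```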

> For most coursework, **GridSearchCV + a held-out test set is enough.** Nested CV is the rigorous upgrade.

---

## The task and the protocol

**Dataset:** scikit-learn's `load_wine()` -- 3-class classification, 178 samples, 13 chemical features.

**Protocol:**

- 80/20 train/test split, `random_state=42`
- Stratified 5-fold CV on the training set
- Metric: `f1_macro` (multi-class, treats classes equally)
- Three model families: **logistic regression, k-NN, random forest**

All three families compete under the **same rule**.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Family 1: logistic regression

```python
from sklearn.linear_model import LogisticRegression

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=5000))
grid_lr = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}

search_lr = GridSearchCV(pipe_lr, grid_lr, cv=5, scoring='f1_macro')
search_lr.fit(X_train, y_train)

print(f"LR  best C: {search_lr.best_params_}")
print(f"LR  CV f1_macro: {search_lr.best_score_:.3f}")
```

Read the trajectory of CV scores across `C`. Which side of the U-curve are we on?

---

## Family 2: k-NN

```python
from sklearn.neighbors import KNeighborsClassifier

pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid_knn = {'kneighborsclassifier__n_neighbors': [3, 5, 11, 21, 41]}

search_knn = GridSearchCV(pipe_knn, grid_knn, cv=5, scoring='f1_macro')
search_knn.fit(X_train, y_train)

print(f"k-NN best k: {search_knn.best_params_}")
print(f"k-NN CV f1_macro: {search_knn.best_score_:.3f}")
```

The L5 "scale before k-NN" lesson is enforced here by the pipeline.

---

## Family 3: random forest

```python
from sklearn.ensemble import RandomForestClassifier

pipe_rf = make_pipeline(RandomForestClassifier(random_state=42))
grid_rf = {
    'randomforestclassifier__n_estimators': [100, 300],
    'randomforestclassifier__max_depth':    [None, 5, 10],
}

search_rf = GridSearchCV(pipe_rf, grid_rf, cv=5,
                         scoring='f1_macro', n_jobs=-1)
search_rf.fit(X_train, y_train)

print(f"RF  best params: {search_rf.best_params_}")
print(f"RF  CV f1_macro: {search_rf.best_score_:.3f}")
```

Trees don't need scaling -- the pipeline is shorter. **Same selection protocol.**

---

## Honest comparison

```python
for name, s in [('Logistic', search_lr),
                ('k-NN',     search_knn),
                ('RF',       search_rf)]:
    cv = s.best_score_
    cv_std = s.cv_results_['std_test_score'][s.best_index_]
    test = s.score(X_test, y_test)
    print(f"{name:9s}  CV f1_macro = {cv:.3f} ± {cv_std:.3f}   "
          f"Test f1_macro = {test:.3f}")
```

Two questions to ask of the output:

- Do the **CV winners** agree with the **test winners**?
- Are the gaps between families **larger than the std** across folds? If not, the comparison is less decisive than a single number suggests.

---

## Which one would you ship?

- Two families tie within their CV std → **ship the simpler one** (per L6)
- Test score for the chosen winner is much worse than its CV score → suspect an **unlucky test split** or **hyperparameter overfitting** (pitfall 2)
- Winning hyperparameter sits at the **edge of its grid** → extend the grid and re-run

---

## What to report in your project

For each modelling family you tried:

1. The **pipeline** (preprocessing + model)
2. The **CV protocol** (splitter, $k$, random state, stratification)
3. The **metric** and why it matches the problem
4. The **grid** searched
5. The **chosen configuration** and its CV score (mean ± std)
6. The **test score** of the chosen model, reported **once**

---

## A reproducible selection rule beats a "best model" claim

> "Among configurations within one CV std of the best, we picked the simplest."

That sentence is more defensible than:

> "We got 0.91 on the test set."

The selection rule **survives** if your data is updated. The "best model" claim does not.
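
The rule is easy to write down as code. The sketch below applies it to made-up CV results; the names, scores, and complexity ranks are all hypothetical.

```python
# Hypothetical CV results: (name, mean score, std across folds, complexity rank).
# Lower complexity rank = simpler model. All numbers are made up for illustration.
results = [
    ("logistic C=0.1", 0.962, 0.021, 1),
    ("logistic C=10",  0.968, 0.025, 2),
    ("random forest",  0.975, 0.018, 3),
    ("k-NN k=3",       0.930, 0.030, 2),
]

best_name, best_mean, best_std, _ = max(results, key=lambda r: r[1])
threshold = best_mean - best_std

# Keep everything within one std of the best, then take the simplest.
candidates = [r for r in results if r[1] >= threshold]
chosen = min(candidates, key=lambda r: r[3])
print(f"Best by mean: {best_name}; threshold = {threshold:.3f}; chosen: {chosen[0]}")
# Best by mean: random forest; threshold = 0.957; chosen: logistic C=0.1
```

How you rank "complexity" is itself a judgment call, so state the ranking in your report alongside the rule.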

---

## Summary

- Cross-validation is the **operational** answer to L6's tradeoff
- **Stratify for classification**, `TimeSeriesSplit` for temporal data
- **GridSearchCV** does the search; **read `cv_results_`**, not just `best_params_`
- The **metric** is a modelling choice -- pick it deliberately
- The **test set** is touched exactly once
- A reproducible **selection rule** beats a "best model" claim

---

## Before Lecture 8

- **Run** today's wine workflow on your own machine
- **Apply** the same protocol to your own dataset and at least two model families
- **L8 is the optional project showcase** -- sign up for a 5-minute slot if you want feedback in front of the class. The sign-up has a **hard deadline** -- link on the course site.

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## Leave-one-out CV (LOOCV)

$k = n$: each fold leaves out a single observation.

- Lowest bias (you train on $n-1$ points each time)
- Highest variance across folds
- Slowest -- $n$ model fits

Useful only for **small samples** or as a theoretical limit. For everything else, $k = 5$ or $k = 10$ is the usual choice.

## Repeated k-fold

```python
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
```

Averages over multiple **shuffles** of the folds.

Useful when the std across folds is uncomfortably large -- you get a tighter estimate at $n_\text{repeats}$× the cost.

## Calibration in model selection

When picking a model whose **probabilities** matter (not just labels):

- Evaluate **calibration** (Brier score, reliability diagram) alongside discrimination (AUC, F1)
- A well-discriminating but poorly calibrated model can still mislead downstream decisions

```python
from sklearn.metrics import brier_score_loss
# Binary case: column 1 holds the predicted probability of the positive class
y_prob = search.best_estimator_.predict_proba(X_test)[:, 1]
print(f"Brier: {brier_score_loss(y_test, y_prob):.3f}")
```
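
For intuition, the Brier score is just the mean squared gap between the predicted probability and the 0/1 outcome. A hand computation on made-up numbers:

```python
# Made-up labels and predicted probabilities for the positive class.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]

# Brier score = mean of (predicted probability - outcome)^2; lower is better.
brier = sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)
print(f"Brier: {brier:.3f}")
# Brier: 0.116
```

Most of the penalty here comes from the fourth row, a positive case predicted at only 0.4.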

Connect to the L5 calibration backup (`CalibratedClassifierCV`).

## Why `neg_mean_squared_error`?

scikit-learn's selection logic **maximises** the scoring function.

Loss-style metrics (MSE, MAE) get **negated** so larger = better:

```python
search = GridSearchCV(pipe, grid, cv=5, scoring='neg_mean_squared_error')
# Report it back with the sign flipped:
print(f"Best CV MSE: {-search.best_score_:.3f}")
```

Annoying convention, but consistent across all `neg_*` metrics.

## Permutation importance after model selection

Once a workflow is chosen, you may want to **interpret** it:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    search.best_estimator_, X_test, y_test,
    n_repeats=20, random_state=0, scoring='f1_macro'
)
```

- Model-agnostic -- works on any fitted estimator
- Measures the drop in score when each feature is randomly shuffled
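- Often a more reliable importance signal than the impurity-based `.feature_importances_` of tree models, which is computed on training data and tends to favour high-cardinality features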

Skip in lecture; useful in office hours.

---

## What's next

**Lecture 8:** Project showcase (optional)

- Volunteers present their project to the class
- Feedback from me, ideas from your classmates
- Not presenting? Come anyway -- half the value is in the audience